M ANSARI wrote: ↑Sun Jul 28, 2019 8:05 am
You have to realize that GPU graphics cards with the ability to do AI are in their first generation. So if you look at CPU power, this would be sort of like running SF on a single-core 386 CPU. Of course a GPU card like the 2080 Ti is expensive today, but so was a 386 when it first came out. My guess is that GPUs that can do AI will quickly get much more powerful and much cheaper. There is no doubt that AI will transform everything in our lives, and maybe it will be a transformation similar to when humanity discovered electricity. Lc0 is only competitive once you have reasonably good hardware. I don't think you need a 2080 Ti card for that, and most likely the new 2070 Super cards are very competitive with SF. The pricing of cards that can do AI will probably change exponentially, with much more powerful cards coming out at a fraction of today's prices. Also, Lc0 will probably patch many of its weaknesses (tactical and endgame weakness) via software … remember Lc0 is only a little over a year old.
The "Tensor Cores" of GPUs are an interesting trick, but the fundamental SIMD-architecture of GPUs is well-researched and well-discussed by the graphics community for the last 20 years. Modern GPUs represent the sum of decades of research and development.
Case in point: a 2080 Ti has roughly 11 trillion operations/second of compute on 616 GB/s of main-memory bandwidth. True, the tensor ops allow 100+ trillion 16-bit floating-point multiplications per second (i.e., neural-network operations), but that is only useful when your algorithm is dominated by neural nets. And frankly, I'm not convinced that the whole tensor-op methodology is working out too well.
Take a nice CPU, like the AMD Ryzen 3950X: 16 cores (with 256-bit AVX2) at 4.7 GHz. AVX2 gives 8x 32-bit operations per core x 16 cores x 4.7 GHz == 0.6 trillion operations/second, on only ~50 GB/s of main-memory bandwidth. Actually, most chess engines avoid AVX2 and instead stick with 64-bit operations. If you're using 64-bit bitboards with traditional 64-bit operations (roughly one per core per cycle), your CPU algorithm only has access to 0.075 trillion operations/second.
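To make that arithmetic concrete, here is a minimal C++ sketch. The clock speed, core count, and one-instruction-per-core-per-cycle assumption are illustrative estimates, not benchmarks; the intrinsics just show what "8x 32-bit operations" per AVX2 instruction means.

```cpp
// Back-of-the-envelope throughput arithmetic for the figures above.
// Assumes one AVX2 instruction (or one 64-bit scalar op) retired per
// core per cycle -- a rough upper bound, not a measurement.
#include <cstdio>
#include <immintrin.h>

int main() {
    const double cores      = 16.0;    // Ryzen 3950X core count
    const double clock_hz   = 4.7e9;   // boost clock, optimistic
    const double avx2_lanes = 8.0;     // 256-bit register / 32-bit ints

    double avx2_ops   = cores * clock_hz * avx2_lanes; // ~0.6e12 ops/s
    double scalar_ops = cores * clock_hz;              // ~0.075e12 ops/s
    printf("AVX2:   %.2f trillion ops/s\n", avx2_ops / 1e12);
    printf("Scalar: %.3f trillion ops/s\n", scalar_ops / 1e12);

    // One AVX2 instruction operating on 8 lanes of 32-bit integers:
    __m256i a = _mm256_set1_epi32(1);
    __m256i b = _mm256_set1_epi32(2);
    __m256i c = _mm256_add_epi32(a, b); // eight 32-bit adds in one instruction
    int out[8];
    _mm256_storeu_si256(reinterpret_cast<__m256i*>(out), c);
    printf("lane 0 = %d\n", out[0]);
    return 0;
}
```

Compile with -mavx2; the point is simply that even the best case for the CPU is an order of magnitude or two below the GPU's raw numbers.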
In any case, a CPU is operating with roughly 10% of the main-memory bandwidth and about 1% of the raw compute power. The real questions people need to be asking themselves are:
1: Why are CPUs able to play chess so well, despite the hugely deficient compute and memory resources?
2: Why do you need so many trillions of operations before neural nets become useful? Leela Zero runs on a machine capable of 100 trillion operations/second. Shouldn't we expect it to perform better?
3: Are there other algorithms yet to be discovered that take advantage of the CPU algorithms and port them to the massively improved compute and memory available on GPUs?
-------
GPU algorithms must take advantage of parallel compute resources. It's hard to think in parallel, especially if you've been doing sequential programming for years. But when I analyze the parallel algorithms from the CPU world (YBWC, ABDADA, etc.), they are simply insufficient for GPU translation. All of them were designed for low-core-count machines (maybe 20 or 50 cores) and will fall apart on the 4000+ cores of a 2080 Ti. Every thread visits every node in YBWC or ABDADA; in most cases a "visit" means pinging the transposition table to share work done by other threads, and you will run out of main-memory bandwidth very quickly when 4000+ cores hammer the TT that hard. A back-of-the-envelope sketch of the bandwidth problem follows below.
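Here is a rough estimate of shared-TT traffic under illustrative assumptions: the per-thread probe rate and one 64-byte cache-line read per probe are guesses chosen to show the shape of the problem, not measurements of any engine.

```cpp
// Rough estimate of transposition-table traffic if thousands of GPU
// threads probe a shared TT. Probe rate and entry size are assumed
// for illustration only.
#include <cstdio>

int main() {
    const double threads         = 4000.0;  // ~one thread per GPU core
    const double probes_per_sec  = 1.0e6;   // assumed TT probes per thread/s
    const double bytes_per_probe = 64.0;    // one cache-line read per probe

    double traffic = threads * probes_per_sec * bytes_per_probe; // bytes/s
    printf("TT read traffic: %.0f GB/s (vs. ~616 GB/s on a 2080 Ti)\n",
           traffic / 1e9);
    return 0;
}
```

At those assumed rates, probes alone consume a large fraction of the card's 616 GB/s before any TT writes, board updates, or neural-net evaluations happen at all.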
Other GPU programmers have proven that elements of a chess engine can be ported to a GPU. With over 20 billion nodes/second of perft on an ancient 780 Ti GPU, part of the GPU-programming problem has already been solved. The only remaining part is the search algorithm.
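For reference, perft is just a full-width node count to a fixed depth. The sketch below shows the CPU-side definition; Position, Move, generate_moves, make_move and unmake_move are hypothetical placeholders standing in for a real move generator, and the GPU work cited above essentially parallelizes this traversal across thousands of threads.

```cpp
// Minimal recursive perft sketch. The types and helper functions below
// are hypothetical placeholders for a real board representation and
// move generator.
#include <cstdint>
#include <vector>

struct Position;  // hypothetical board state
struct Move;      // hypothetical move encoding

std::vector<Move> generate_moves(const Position& pos); // hypothetical
void make_move(Position& pos, const Move& m);          // hypothetical
void unmake_move(Position& pos, const Move& m);        // hypothetical

// Count every node reachable in exactly `depth` plies from `pos`.
uint64_t perft(Position& pos, int depth) {
    if (depth == 0) return 1;
    uint64_t nodes = 0;
    for (const Move& m : generate_moves(pos)) {
        make_move(pos, m);
        nodes += perft(pos, depth - 1);
        unmake_move(pos, m);
    }
    return nodes;
}
```

There is no alpha-beta pruning and no evaluation here, which is exactly why perft parallelizes so cleanly; the open question is whether a pruning search can be mapped onto thousands of cores equally well.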