Re: LCZero update
Posted: Thu Mar 29, 2018 5:01 pm
Daniel Shawul wrote: Fixed number of playouts is the same as using a fixed number of nodes or depth, so there shouldn't be any difference in strength on CPU and GPU with that setup.
True. But there was doubt whether the GPU version was working correctly.
CMCanavessi wrote: New official version released:
https://github.com/glinscott/leela-ches ... s/tag/v0.4
Finally includes a Windows build with all the DLLs, and a working Windows CPU-only build as well
Laskos wrote: I have a weak video card, but I didn't expect that:
http://www.talkchess.com/forum/viewtopi ... 45&start=5
CPU version is performing much better. Is LCZero using the GPU card properly?
Guenther wrote: I had a very different experience with the (finally) working CPU version. Here it was around 4 times slower on one thread despite having a cheap GPU card. Maybe I will produce exact numbers again. I have already deleted the CPU version after my measurement.
Joost Buijs wrote: Over here the CPU-only version does about 400 n/s on a single core (Broadwell, 3.8 GHz); when I use my cheap GT-720 GPU with 192 CUDA cores, this figure drops to 250 n/s. On my GTX-1080Ti it runs at ~3500 n/s (when running 2 instances of the client).
Still intrigues me that using full CPU (4 cores), I can get speeds (NPS) achievable only with the best GPUs. Shouldn't top GPU be an order of magnitude faster than full CPU? On 4 cores, CPU version seems to be 1800+ CCRL Elo level. Gauntlet of games at 1s/move:
Games Completed = 40 of 40 (Avg game length = 72.851 sec)
Settings = Gauntlet/64MB/1000ms per move/M 500cp for 3 moves, D 140 moves/EPD:C:\LittleBlitzer\2moves_v1.epd(32000)
Time = 3721 sec elapsed, 0 sec remaining
1. LCZero CPU 4 threads 22.0/40 19-15-6 (L: m=15 t=0 i=0 a=0) (D: r=6 i=0 f=0 s=0 a=0) (tpm=960.5 d=17.49 nps=3767)
2. Predateur 2.2.1 (1786) 10.0/20 10-10-0 (L: m=10 t=0 i=0 a=0) (D: r=0 i=0 f=0 s=0 a=0) (tpm=887.6 d=55.72 nps=3133895)
3. Zurichess Appenzeller (1821) 8.0/20 5-9-6 (L: m=9 t=0 i=0 a=0) (D: r=6 i=0 f=0 s=0 a=0) (tpm=23.0 d=4.49 nps=959911)
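As a rough cross-check of the "1800+" estimate, the gauntlet score above can be converted into an Elo difference with the usual logistic formula. This is only a sketch: it assumes the two opponents' listed ratings (1786 and 1821) can simply be averaged, since each played 20 games, and the helper name `elo_difference` is made up for illustration.

```python
import math


def elo_difference(score_fraction):
    """Elo difference implied by a score fraction under the logistic model."""
    return 400.0 * math.log10(score_fraction / (1.0 - score_fraction))


# Figures taken from the gauntlet table: 22.0 points out of 40 games.
score = 22.0 / 40.0

# Opponent ratings as listed in the table; each opponent played 20 games,
# so a plain average is used here (an assumption, not a full rating calculation).
opponent_ratings = [1786, 1821]
average_opponent = sum(opponent_ratings) / len(opponent_ratings)

performance = average_opponent + elo_difference(score)
print(f"Elo difference: {elo_difference(score):+.1f}")
print(f"Estimated performance: {performance:.0f}")
```

With only 40 games the uncertainty on such an estimate is large, so it supports "around 1800" rather than any precise figure.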
I have the feeling that the new v4 client has problems uploading the games. This morning I let v4 run for some time; it produced about 30 games, but only a few of them appear in the server statistics. Running the client with -debug doesn't give any extra information at all, so I really don't know what is going on.
Laskos wrote: Still intrigues me that using full CPU (4 cores), I can get speeds (NPS) achievable only with the best GPUs. Shouldn't top GPU be an order of magnitude faster than full CPU?
My expectation was that LCZero on GPU would run a lot faster than on CPU. On my i7-6950x (using 10 cores) the CPU version does ~2500 nps, my GTX-1080Ti does ~3500 nps, so not much difference at all.
Laskos wrote: Still intrigues me that using full CPU (4 cores), I can get speeds (NPS) achievable only with the best GPUs. Shouldn't top GPU be an order of magnitude faster than full CPU?
Joost Buijs wrote: My expectation was that LCZero on GPU would run a lot faster than on CPU. On my i7-6950x (using 10 cores) the CPU version does ~2500 nps, my GTX-1080Ti does ~3500 nps, so not much difference at all.
It is probably because matrix-matrix multiplication is memory-bound, not compute-bound. If you don't do much computation per byte loaded, your speedup over the CPU (using all cores) is probably not going to go above 5-6X. Moreover, DGEMM etc. have been optimized for years for vector CPU machines, so they are hard to beat.
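The "computation per byte loaded" argument above can be made concrete with a back-of-the-envelope arithmetic-intensity estimate. This is a sketch under an idealized model in which each matrix is read or written exactly once. The layer size of 256 and the batch sizes are hypothetical and not taken from the LCZero network.

```python
def gemm_arithmetic_intensity(m, n, k, bytes_per_element=4):
    """FLOPs per byte for C[m x n] = A[m x k] @ B[k x n].

    Idealized model: A and B are each read once and C is written once.
    """
    flops = 2 * m * n * k  # one multiply and one add per inner-product term
    bytes_moved = bytes_per_element * (m * k + k * n + m * n)
    return flops / bytes_moved


# Hypothetical 256x256 weight matrix applied to a batch of n input vectors.
for batch in (1, 8, 256):
    intensity = gemm_arithmetic_intensity(256, batch, 256)
    print(f"batch={batch:4d}  flops/byte={intensity:6.2f}")
```

In this model a batch of one position does well under one floating-point operation per byte moved, while a large batch does tens of operations per byte, so how memory-bound the multiplication is depends heavily on how many positions are evaluated together.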
I don't have experience with matrix multiplication on a GPU, but when I use the 1080Ti for 'public key encryption' it runs an order of magnitude faster than the 6950x, and somehow I expected LCZero to perform in the same way. Maybe the OpenCL code is not optimal yet, and probably there are other things that can be optimized as well. The project is very new, and my guess is that the code will mature over time.
Daniel Shawul wrote: It is probably because matrix-matrix multiplication is memory-bound, not compute-bound. If you don't do much computation per byte loaded, your speedup over the CPU (using all cores) is probably not going to go above 5-6X. Moreover, DGEMM etc. have been optimized for years for vector CPU machines, so they are hard to beat.
Daniel
You are right, but the performance seems to be lower than it can be.
CMCanavessi wrote: New official version released:
https://github.com/glinscott/leela-ches ... s/tag/v0.4
Finally includes a Windows build with all the DLLs, and a working Windows CPU-only build as well
Is the graph perhaps starting to show a point of diminishing returns?
CMCanavessi wrote: It IS working, you just don't know how to use it. You need to specify the network file with -w <file>
jpqy wrote: It's indeed working when explained well. To use it in Cutechess you need to make a play.bat file, then the engine gets loaded. Thanks for the help from Aloril and the other guys on the LCZero chat!
Hi, can you give some complete instructions (1, 2, 3, 4, etc.) about that?
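The thread does not show the contents of play.bat, so the following is only a guess at what such a wrapper does, written in Python: start the engine with the network file passed via `-w` and let the GUI talk to it through standard input and output. The engine and weights file names are hypothetical placeholders, and `-w` is the only option taken from the posts above.

```python
import os
import subprocess
import sys

# Hypothetical paths: replace with the real engine binary and network file.
ENGINE_PATH = "lczero.exe"
WEIGHTS_PATH = "weights.txt"


def build_engine_command(engine_path, weights_path):
    """Return the argument list that starts the engine with a network file."""
    # "-w <file>" is the only flag taken from the thread.
    return [engine_path, "-w", weights_path]


def launch(engine_path, weights_path):
    """Run the engine and return its exit code."""
    # subprocess.call lets the child inherit this process's standard streams,
    # so a GUI talking to this script is effectively talking to the engine.
    return subprocess.call(build_engine_command(engine_path, weights_path))


if __name__ == "__main__":
    if os.path.isfile(ENGINE_PATH) and os.path.isfile(WEIGHTS_PATH):
        sys.exit(launch(ENGINE_PATH, WEIGHTS_PATH))
    print("Engine or weights file not found; edit the paths above.", file=sys.stderr)
```

How the wrapper is then registered as an engine in Cutechess is not covered in the thread, so that step is left out here.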