Lc0 CPU DNNL version with high thread count

Joerg Oster · Post by **Joerg Oster** » Sun Dec 06, 2020 1:03 pm

I'm interested in the scaling abilities of the CPU version of Lc0.
If anyone wants to run it with a higher thread count, say 16, 32, 48, or even 64 threads, here's what to do.

Download the cpu_dnnl version from here: https://github.com/LeelaChessZero/lc0/r ... ag/v0.26.3
Unpack and run Lc0 in a Windows console.

Here are the exact commands:

Code:

Lc0
uci
setoption name Threads value 32
setoption name NNCacheSize value 4000000
ucinewgame
isready
go movetime 90000

Change the "Threads" number accordingly.
Please post the last info line of the output and your CPU. Thanks.
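For anyone who would rather script the run than type the commands by hand, here is a minimal Python sketch. It assumes the engine reads UCI commands on stdin and writes its info lines to stdout; the function names and the executable path passed to run_benchmark are hypothetical and not part of Lc0.

```python
import subprocess


def build_commands(threads, movetime_ms=90000, nn_cache=4000000):
    """Return the UCI commands from the post above for a given thread count."""
    return [
        "uci",
        f"setoption name Threads value {threads}",
        f"setoption name NNCacheSize value {nn_cache}",
        "ucinewgame",
        "isready",
        f"go movetime {movetime_ms}",
    ]


def last_info_line(lines):
    """Pick the last search info line (skips 'info string' lines without nps)."""
    info = [ln.strip() for ln in lines if ln.startswith("info ") and " nps " in ln]
    return info[-1] if info else None


def run_benchmark(exe_path, threads):
    """Start the engine (exe_path is an assumption), run one search, return the last info line."""
    proc = subprocess.Popen(
        [exe_path], stdin=subprocess.PIPE, stdout=subprocess.PIPE, text=True
    )
    proc.stdin.write("\n".join(build_commands(threads)) + "\n")
    proc.stdin.flush()
    seen = []
    for line in proc.stdout:
        seen.append(line)
        if line.startswith("bestmove"):
            break
    try:
        proc.stdin.write("quit\n")
        proc.stdin.flush()
    except OSError:
        pass  # engine already exited
    proc.wait()
    return last_info_line(seen)
```

Calling run_benchmark with the path to the unpacked executable and a thread count (for example 32) should return the line that is asked for above, provided the engine starts up the same way it does in the console.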

Here is what I get with 1 and 4 threads on my i5-4570@3.20GHz:

1 Thread
info depth 7 seldepth 26 time 70442 nodes 15880 score cp 13 nps 255 tbhits 0 pv e2e4 c7c5 g1f3 e7e6 b1c3 b8c6 d2d4 c5d4 f3d4 g8f6 d4b5 d7d6 c1f4 e6e5 f4g5 a7a6 b5a3 b7b5 g5f6 g7f6 c3d5 f6f5
bestmove e2e4 ponder c7c5

4 Threads
info depth 8 seldepth 30 time 72631 nodes 67409 score cp 12 nps 1123 tbhits 0 pv e2e4 c7c5 g1f3 e7e6 b1c3 b8c6 d2d4 c5d4 f3d4 g8f6 d4b5 d7d6 c1f4 e6e5 f4g5 a7a6 b5a3 b7b5 g5f6 g7f6 c3d5 f6f5 g2g3 f5e4 f1g2 f8g7 g2e4 a8b8
bestmove e2e4 ponder c7c5
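To compare such lines without reading them by eye, the numeric fields can be pulled out with a few lines of Python. This is only a sketch: parse_info is a hypothetical helper, and the pv part of the two lines is shortened here.

```python
def parse_info(line):
    """Extract the integer fields of a UCI info line into a dict."""
    tokens = line.split()
    fields = {}
    for key in ("depth", "seldepth", "time", "nodes", "nps"):
        if key in tokens:
            fields[key] = int(tokens[tokens.index(key) + 1])
    return fields


# The two i5-4570 results from above (pv shortened).
one = parse_info("info depth 7 seldepth 26 time 70442 nodes 15880 score cp 13 nps 255 tbhits 0 pv e2e4 c7c5")
four = parse_info("info depth 8 seldepth 30 time 72631 nodes 67409 score cp 12 nps 1123 tbhits 0 pv e2e4 c7c5")

speedup = four["nps"] / one["nps"]
print(f"1 thread: {one['nps']} nps, 4 threads: {four['nps']} nps, speedup {speedup:.2f}x")
```

For these two lines the nps ratio comes out at about 4.4.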

BogStandard · Post by **BogStandard** » Sun Dec 06, 2020 11:38 pm

Computer 1 ....
i7-10510U CPU @ 1.80GHz (DELL Inspiron 7391 2n1 )
lc0 0.26.3
BLAS functions from DNNL version 1.5.0
BLAS max batch size is 256.

1 thread
info depth 8 seldepth 29 time 67792 nodes 26603 score cp 13 nps 405 tbhits 0 pv e2e4 c7c5 g1f3 e7e6 b1c3 b8c6 d2d4 c5d4 f3d4 g8f6 d4b5 d7d6 c1f4 e6e5 f4g5 a7a6 b5a3 b7b5 g5f6 g7f6 c3d5 f6f5 g2g3 f5e4 f1g2 f8g7
bestmove e2e4 ponder c7c5

4 threads
info depth 9 seldepth 30 time 70367 nodes 76175 score cp 12 nps 1104 tbhits 0 pv e2e4 c7c5 g1f3 e7e6 b1c3 b8c6 d2d4 c5d4 f3d4 g8f6 d4b5 d7d6 c1f4 e6e5 f4g5 a7a6 b5a3 b7b5 g5f6 g7f6 c3d5 f6f5 g2g3 f5e4 f1g2 f8g7 g2e4 a8b8
bestmove e2e4 ponder c7c5

Regards...

6 threads
info depth 9 seldepth 30 time 70303 nodes 76746 score cp 12 nps 1119 tbhits 0 pv e2e4 c7c5 g1f3 e7e6 b1c3 b8c6 d2d4 c5d4 f3d4 g8f6 d4b5 d7d6 c1f4 e6e5 f4g5 a7a6 b5a3 b7b5 g5f6 g7f6 c3d5 f6f5 g2g3 f5e4 f1g2 f8g7 g2e4 a8b8
bestmove e2e4 ponder c7c5

7 threads
info depth 9 seldepth 30 time 72219 nodes 80914 score cp 12 nps 1170 tbhits 0 pv e2e4 c7c5 g1f3 e7e6 b1c3 b8c6 d2d4 c5d4 f3d4 g8f6 d4b5 d7d6 c1f4 e6e5 f4g5 a7a6 b5a3 b7b5 g5f6 g7f6 c3d5 f6f5 g2g3 f5e4 f1g2 f8g7 g2e4 a8b8
bestmove e2e4 ponder c7c5

8 threads
info depth 9 seldepth 30 time 72528 nodes 90516 score cp 12 nps 1273 tbhits 0 pv e2e4 c7c5 g1f3 e7e6 b1c3 b8c6 d2d4 c5d4 f3d4 g8f6 d4b5 d7d6 c1f4 e6e5 f4g5 a7a6 b5a3 b7b5 g5f6 g7f6 c3d5 f6f5 g2g3 f5e4 f1g2 f8g7 g2e4 a8b8
bestmove e2e4 ponder c7c5

BogStandard · Post by **BogStandard** » Sun Dec 06, 2020 11:42 pm

(Dual 8 core HT) Xeon CPU E5-2690 @ 2.90GHz (HP Z620 Workstation, 2nd hand)
lc0 0.26.3
BLAS functions from DNNL version 1.5.0
BLAS max batch size is 256.

1 thread
info depth 7 seldepth 23 time 69588 nodes 10150 score cp 13 nps 168 tbhits 0 pv e2e4 c7c5 g1f3 e7e6 b1c3 b8c6 d2d4 c5d4 f3d4 g8f6 d4b5 d7d6 c1f4 e6e5 f4g5 a7a6 b5a3 b7b5
bestmove e2e4 ponder c7c5

4 threads
info depth 8 seldepth 29 time 68754 nodes 42821 score cp 12 nps 669 tbhits 0 pv e2e4 c7c5 g1f3 e7e6 b1c3 b8c6 d2d4 c5d4 f3d4 g8f6 d4b5 d7d6 c1f4 e6e5 f4g5 a7a6 b5a3 b7b5 g5f6 g7f6 c3d5 f6f5 g2g3 f5e4 f1g2 f8g7 g2e4
bestmove e2e4 ponder c7c5

8 threads
info depth 9 seldepth 30 time 72889 nodes 75376 score cp 12 nps 1249 tbhits 0 pv e2e4 c7c5 g1f3 e7e6 b1c3 b8c6 d2d4 c5d4 f3d4 g8f6 d4b5 d7d6 c1f4 e6e5 f4g5 a7a6 b5a3 b7b5 g5f6 g7f6 c3d5 f6f5 g2g3 f5e4 f1g2 f8g7 g2e4 a8b8
bestmove e2e4 ponder c7c5

12 threads
info depth 9 seldepth 30 time 73030 nodes 96805 score cp 12 nps 1362 tbhits 0 pv e2e4 c7c5 g1f3 e7e6 b1c3 b8c6 d2d4 c5d4 f3d4 g8f6 d4b5 d7d6 c1f4 e6e5 f4g5 a7a6 b5a3 b7b5 g5f6 g7f6 c3d5 f6f5 g2g3 f5e4 f1g2 f8g7 g2e4 a8b8
bestmove e2e4 ponder c7c5

16 threads
info depth 9 seldepth 30 time 73833 nodes 106124 score cp 12 nps 1489 tbhits 0 pv e2e4 c7c5 g1f3 e7e6 b1c3 b8c6 d2d4 c5d4 f3d4 g8f6 d4b5 d7d6 c1f4 e6e5 f4g5 a7a6 b5a3 b7b5 g5f6 g7f6 c3d5 f6f5 g2g3 f5e4 f1g2 f8g7 g2e4 a8b8
bestmove e2e4 ponder c7c5

24 threads
info depth 9 seldepth 33 time 69492 nodes 153823 score cp 12 nps 2596 tbhits 0 pv e2e4 c7c5 g1f3 e7e6 b1c3 b8c6 d2d4 c5d4 f3d4 g8f6 d4b5 d7d6 c1f4 e6e5 f4g5 a7a6 b5a3 b7b5 g5f6 g7f6 c3d5 f6f5 g2g3 f5e4 f1g2 f8g7 g2e4 a8b8 d1h5 c6e7 c2c3 e7d5 e4d5
bestmove e2e4 ponder c7c5

30 threads
info depth 9 seldepth 33 time 67559 nodes 161679 score cp 12 nps 2621 tbhits 0 pv e2e4 c7c5 g1f3 e7e6 b1c3 b8c6 d2d4 c5d4 f3d4 g8f6 d4b5 d7d6 c1f4 e6e5 f4g5 a7a6 b5a3 b7b5 g5f6 g7f6 c3d5 f6f5 g2g3 f5e4 f1g2 f8g7 g2e4 a8b8 d1h5 c6e7 c2c3 e7d5 e4d5
bestmove e2e4 ponder c7c5

32 threads
info depth 9 seldepth 33 time 66256 nodes 169711 score cp 12 nps 2663 tbhits 0 pv e2e4 c7c5 g1f3 e7e6 b1c3 b8c6 d2d4 c5d4 f3d4 g8f6 d4b5 d7d6 c1f4 e6e5 f4g5 a7a6 b5a3 b7b5 g5f6 g7f6 c3d5 f6f5 g2g3 f5e4 f1g2 f8g7 g2e4 a8b8 d1h5 c6e7 c2c3 e7d5 e4d5
bestmove e2e4 ponder c7c5

Regards ...
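The Xeon figures above are easier to judge as a table of speedup and per-thread efficiency. A small sketch, using only the nps values posted above; scaling_table is a hypothetical helper, and taking the 1-thread result as the baseline is an assumption of this example.

```python
# nps by thread count, copied from the Xeon E5-2690 results above.
xeon_nps = {1: 168, 4: 669, 8: 1249, 12: 1362, 16: 1489, 24: 2596, 30: 2621, 32: 2663}


def scaling_table(nps_by_threads):
    """Return (threads, nps, speedup, efficiency) rows relative to the 1-thread run."""
    base = nps_by_threads[1]
    rows = []
    for threads in sorted(nps_by_threads):
        nps = nps_by_threads[threads]
        speedup = nps / base
        rows.append((threads, nps, speedup, speedup / threads))
    return rows


rows = scaling_table(xeon_nps)
print("threads    nps  speedup  efficiency")
for threads, nps, speedup, efficiency in rows:
    print(f"{threads:7d} {nps:6d} {speedup:8.2f} {efficiency:11.2f}")
```

Efficiency here is simply speedup divided by thread count, so 1.00 would mean perfectly linear scaling.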

MMarco · Post by **MMarco** » Mon Dec 07, 2020 3:55 am

Joerg Oster wrote: ↑Sun Dec 06, 2020 1:03 pm I'm interested in the scaling abilities of the CPU version of Lc0.

I think the scaling will depend on the minibatch size. I did some data mining with Lc0 0.26.3 dnnl about a month ago and found that I get the highest NPS with Threads = number of physical cores and MinibatchSize = 3. I would be curious whether this holds at higher core/thread counts, or whether someone can find something even better. It holds on my three machines (4-core, 8-core, and 10-core systems).
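A sweep over both settings can be laid out programmatically. The sketch below only generates the UCI command lists; it assumes MinibatchSize is set with the same setoption syntax as Threads, and that the NNCacheSize and movetime values from the first post are reused. sweep_commands is a hypothetical name.

```python
from itertools import product


def sweep_commands(thread_counts, minibatch_sizes, movetime_ms=90000, nn_cache=4000000):
    """Build one UCI command list per (Threads, MinibatchSize) combination."""
    plans = {}
    for threads, minibatch in product(thread_counts, minibatch_sizes):
        plans[(threads, minibatch)] = [
            "uci",
            f"setoption name Threads value {threads}",
            f"setoption name MinibatchSize value {minibatch}",  # assumed syntax
            f"setoption name NNCacheSize value {nn_cache}",
            "ucinewgame",
            "isready",
            f"go movetime {movetime_ms}",
        ]
    return plans


plans = sweep_commands([1, 4, 8], [1, 3, 7])
for (threads, minibatch), commands in plans.items():
    print(f"# Threads = {threads} & MinibatchSize = {minibatch}")
    print("\n".join(commands))
```

Each command list is meant for a fresh engine process, so that one run cannot influence the next through the cache.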

Here are the results for my Ryzen 7 3750H (4 cores, 8 threads).

Note that Threads = 4 & MinibatchSize = 3 has the highest nps and that Threads = 8 & MinibatchSize = 3 loses nps. With MinibatchSize = 7, NPS always seems to increase with Threads (but doesn't get as high).

Code:

Threads = 1 & MinibatchSize = 1
info depth 7 seldepth 25 time 69682 nodes 13467 score cp 13 nps 221 tbhits 0 pv e2e4 c7c5 g1f3 e7e6 b1c3 b8c6 d2d4 c5d4 f3d4 g8f6 d4b5 d7d6 c1f4 e6e5 f4g5 a7a6 b5a3 b7b5 g5f6 g7f6 c3d5
bestmove e2e4 ponder c7c5

Threads = 1 & MinibatchSize = 3
info depth 7 seldepth 27 time 69381 nodes 17608 score cp 13 nps 276 tbhits 0 pv e2e4 c7c5 g1f3 e7e6 b1c3 b8c6 d2d4 c5d4 f3d4 g8f6 d4b5 d7d6 c1f4 e6e5 f4g5 a7a6 b5a3 b7b5 g5f6 g7f6 c3d5 f6f5
bestmove e2e4 ponder c7c5

Threads = 1 & MinibatchSize = 7
info depth 7 seldepth 27 time 69372 nodes 17837 score cp 13 nps 285 tbhits 0 pv e2e4 c7c5 g1f3 e7e6 b1c3 b8c6 d2d4 c5d4 f3d4 g8f6 d4b5 d7d6 c1f4 e6e5 f4g5 a7a6 b5a3 b7b5 g5f6 g7f6 c3d5 f6f5
bestmove e2e4 ponder c7c5

Code:

Threads = 4 & MinibatchSize = 1
info depth 8 seldepth 29 time 69643 nodes 38400 score cp 12 nps 638 tbhits 0 pv e2e4 c7c5 g1f3 e7e6 b1c3 b8c6 d2d4 c5d4 f3d4 g8f6 d4b5 d7d6 c1f4 e6e5 f4g5 a7a6 b5a3 b7b5 g5f6 g7f6 c3d5 f6f5 g2g3 f5e4 f1g2 f8g7 g2e4
bestmove e2e4 ponder c7c5

Threads = 4 & MinibatchSize = 3
info depth 8 seldepth 29 time 68987 nodes 45232 score cp 12 nps 738 tbhits 0 pv e2e4 c7c5 g1f3 e7e6 b1c3 b8c6 d2d4 c5d4 f3d4 g8f6 d4b5 d7d6 c1f4 e6e5 f4g5 a7a6 b5a3 b7b5 g5f6 g7f6 c3d5 f6f5 g2g3 f5e4 f1g2 f8g7 g2e4
bestmove e2e4 ponder c7c5

Threads = 4 & MinibatchSize = 7
info depth 8 seldepth 29 time 69213 nodes 43016 score cp 12 nps 685 tbhits 0 pv e2e4 c7c5 g1f3 e7e6 b1c3 b8c6 d2d4 c5d4 f3d4 g8f6 d4b5 d7d6 c1f4 e6e5 f4g5 a7a6 b5a3 b7b5 g5f6 g7f6 c3d5 f6f5 g2g3 f5e4 f1g2 f8g7 g2e4
bestmove e2e4 ponder c7c5

Code:

Threads = 8 & MinibatchSize = 1
info depth 8 seldepth 29 time 68820 nodes 39667 score cp 12 nps 636 tbhits 0 pv e2e4 c7c5 g1f3 e7e6 b1c3 b8c6 d2d4 c5d4 f3d4 g8f6 d4b5 d7d6 c1f4 e6e5 f4g5 a7a6 b5a3 b7b5 g5f6 g7f6 c3d5 f6f5 g2g3 f5e4 f1g2 f8g7 g2e4
bestmove e2e4 ponder c7c5

Threads = 8 & MinibatchSize = 3
info depth 8 seldepth 29 time 68893 nodes 41198 score cp 12 nps 671 tbhits 0 pv e2e4 c7c5 g1f3 e7e6 b1c3 b8c6 d2d4 c5d4 f3d4 g8f6 d4b5 d7d6 c1f4 e6e5 f4g5 a7a6 b5a3 b7b5 g5f6 g7f6 c3d5 f6f5 g2g3 f5e4 f1g2 f8g7 g2e4
bestmove e2e4 ponder c7c5

Threads = 8 & MinibatchSize = 7
info depth 8 seldepth 29 time 69374 nodes 43565 score cp 12 nps 712 tbhits 0 pv e2e4 c7c5 g1f3 e7e6 b1c3 b8c6 d2d4 c5d4 f3d4 g8f6 d4b5 d7d6 c1f4 e6e5 f4g5 a7a6 b5a3 b7b5 g5f6 g7f6 c3d5 f6f5 g2g3 f5e4 f1g2 f8g7 g2e4
bestmove e2e4 ponder c7c5
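The nine results above can be condensed into a small grid to see the best setting overall and per thread count. A sketch using only the nps values posted above; the variable names are made up for this example.

```python
# (Threads, MinibatchSize) -> nps, copied from the Ryzen 7 3750H results above.
grid = {
    (1, 1): 221, (1, 3): 276, (1, 7): 285,
    (4, 1): 638, (4, 3): 738, (4, 7): 685,
    (8, 1): 636, (8, 3): 671, (8, 7): 712,
}

# Best combination overall.
best_overall = max(grid, key=grid.get)

# Best MinibatchSize for each thread count.
best_per_threads = {}
for threads in sorted({t for t, _ in grid}):
    candidates = {mb: nps for (t, mb), nps in grid.items() if t == threads}
    best_per_threads[threads] = max(candidates, key=candidates.get)

print("best overall:", best_overall, grid[best_overall], "nps")
for threads, minibatch in best_per_threads.items():
    print(f"Threads = {threads}: best MinibatchSize = {minibatch} ({grid[(threads, minibatch)]} nps)")
```

On this data the best overall setting is Threads = 4 with MinibatchSize = 3, while at 1 and 8 threads MinibatchSize = 7 comes out ahead.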

Joerg Oster · Post by **Joerg Oster** » Mon Dec 07, 2020 11:08 am

BogStandard wrote: ↑Sun Dec 06, 2020 11:38 pm Computer 1 ....

Thank you!

So on this machine with 4 threads you get almost the same result as I do (1104 vs. 1123 nps).
However, going from 4 to 8 threads, the gain is modest (1104 to 1273 nps, roughly 15%). Interesting.

Joerg Oster · Post by **Joerg Oster** » Mon Dec 07, 2020 11:12 am

BogStandard wrote: ↑Sun Dec 06, 2020 11:42 pm (Dual 8 core HT) Xeon CPU E5-2690 @ 2.90GHz (HP Z620 Workstation, 2nd hand)

Thanks again.

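Here, going from 4 to 8 threads gives a nice gain in terms of nps (669 to 1249).
From 8 to 16 threads, the gain is small (1249 to 1489).
Yet going from 16 to 32 threads yields another boost (1489 to 2663), nearly all of which is already reached at 24 threads (2596).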

Not sure how to interpret these results.
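One way to look at these numbers is the nps gained per additional thread between neighbouring measurements. The sketch below only rearranges the posted Xeon values and does not explain them; marginal_gains is a hypothetical helper.

```python
# nps by thread count, copied from the Xeon E5-2690 results quoted above.
xeon_nps = {1: 168, 4: 669, 8: 1249, 12: 1362, 16: 1489, 24: 2596, 30: 2621, 32: 2663}


def marginal_gains(nps_by_threads):
    """nps gained per extra thread between each pair of neighbouring measurements."""
    points = sorted(nps_by_threads.items())
    gains = {}
    for (t0, n0), (t1, n1) in zip(points, points[1:]):
        gains[(t0, t1)] = (n1 - n0) / (t1 - t0)
    return gains


gains = marginal_gains(xeon_nps)
for (t0, t1), gain in gains.items():
    print(f"{t0:2d} -> {t1:2d} threads: {gain:7.2f} nps per added thread")

# Share of the 16 -> 32 gain that is already present at 24 threads.
share_at_24 = (xeon_nps[24] - xeon_nps[16]) / (xeon_nps[32] - xeon_nps[16])
print(f"share of the 16 -> 32 gain reached at 24 threads: {share_at_24:.0%}")
```

Seen this way, the second boost sits almost entirely between 16 and 24 threads, and the steps from 8 to 16 and from 24 to 32 add comparatively little.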

Joerg Oster · Post by **Joerg Oster** » Mon Dec 07, 2020 11:28 am

MMarco wrote: ↑Mon Dec 07, 2020 3:55 am
I think the scaling will depend on the minibatch size. I did some data mining with Lc0 0.26.3 dnnl about a month ago and found that I get the highest NPS with Threads = number of physical cores and MinibatchSize = 3. I would be curious whether this holds at higher core/thread counts, or whether someone can find something even better. It holds on my three machines (4-core, 8-core, and 10-core systems).

Here are the results for my Ryzen 7 3750H (4 cores, 8 threads).

Note that Threads = 4 & MinibatchSize = 3 has the highest nps and that Threads = 8 & MinibatchSize = 3 loses nps. With MinibatchSize = 7, NPS always seems to increase with Threads (but doesn't get as high).

Thank you, too!

Interestingly, I also get higher nps with a MinibatchSize of 8 when using only 1 core.
However, as soon as I use 4 threads, I get the best results with MinibatchSize=1,
which seems the most logical to me.

This little experiment offers some interesting insights.

1. Lc0 doesn't follow the go movetime command very accurately: with go movetime 90000, the reported times in this thread range from about 66 to 74 seconds.
There is definitely room for improvement.
(Maybe go nodes would have been the better comparison.)

2. Lc0 doesn't scale as well as I expected. However, I haven't looked into the code,
so I don't know how multithreading is implemented.
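The movetime point can be put into numbers with the two i5-4570 results from the first post. This is only a sketch: the helper names are hypothetical, and the node limit in the go nodes example is an arbitrary placeholder, not a recommended value.

```python
REQUESTED_MS = 90000  # from "go movetime 90000"

# (threads, reported time in ms, nodes, reported nps) from the i5-4570 results.
samples = [
    (1, 70442, 15880, 255),
    (4, 72631, 67409, 1123),
]


def movetime_shortfall(requested_ms, reported_ms):
    """Fraction of the requested time missing from the reported time."""
    return (requested_ms - reported_ms) / requested_ms


def nodes_per_second(nodes, time_ms):
    """nps recomputed from the reported node count and time."""
    return nodes * 1000 / time_ms


def go_nodes_command(node_limit):
    """Fixed-node search command (standard UCI 'go nodes')."""
    return f"go nodes {node_limit}"


for threads, time_ms, nodes, reported_nps in samples:
    shortfall = movetime_shortfall(REQUESTED_MS, time_ms)
    recomputed = nodes_per_second(nodes, time_ms)
    print(f"{threads} thread(s): {shortfall:.1%} short of 90 s, "
          f"nodes/time = {recomputed:.1f}, reported nps = {reported_nps}")

print(go_nodes_command(50000))  # 50000 is a placeholder node limit
```

Note that nodes divided by reported time does not reproduce the reported nps in either line; I have not checked how Lc0 derives that figure. With a fixed node limit every run does the same amount of search work, so the comparison would then be on elapsed time rather than on nodes reached.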
