Need advice on SMP

Carbec
Posts: 161
Joined: Thu Jan 20, 2022 9:42 am
Location: France
Full name: Philippe Chevalier

Need advice on SMP

Post by Carbec »

Hello,

I implemented SMP as "Lazy SMP with a shared transposition table".
It works well, at least I think :) One thing confuses me, though perhaps it's normal.
Here is the speed in million nodes/sec for 1 to 8 threads:
1 : 5.1
2 : 9.8
4 : 16.5
6 : 21.9
8 : 23.8
I ran the test on a position set (BT2630) with 30 positions, at a search depth of 10.
Is it normal that the curve is not linear? Or is there a bug somewhere?
For information, I have a quad-core Intel Core i7-2600K, with 4 cores and 8 threads.

Thanks for any information.

Philippe
Modern Times
Posts: 3702
Joined: Thu Jun 07, 2012 11:02 pm

Re: Need advice on SMP

Post by Modern Times »

Carbec wrote: Wed Mar 22, 2023 2:39 pm For information, I have a QuadCore Intel Core i7-2600K, with 4 cores and 8 threads.
My thoughts - beyond 4 threads you're accessing logical cores rather than real cores, and you won't get the same performance.
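
To put numbers on that, the figures from the opening post can be turned into speedup and per-thread efficiency. This is only a rough sketch of the arithmetic; the struct and function names are made up for illustration:

```cpp
#include <cstdio>
#include <vector>

// One measurement: thread count and speed in million nodes per second.
struct Sample {
    int threads;
    double mnps;
};

// Speedup relative to the single-thread baseline.
double speedup(const Sample& base, const Sample& s) {
    return s.mnps / base.mnps;
}

// Parallel efficiency: speedup divided by thread count (1.0 = perfectly linear).
double efficiency(const Sample& base, const Sample& s) {
    return speedup(base, s) / s.threads;
}

// Print one line per measurement; the first entry is taken as the baseline.
void print_scaling(const std::vector<Sample>& samples) {
    for (const Sample& s : samples) {
        std::printf("%d threads: %5.1f Mnps  speedup x%.2f  efficiency %.0f%%\n",
                    s.threads, s.mnps, speedup(samples[0], s),
                    100.0 * efficiency(samples[0], s));
    }
}

// Figures from the opening post (i7-2600K: 4 physical cores, 8 logical).
const std::vector<Sample> carbec = {
    {1, 5.1}, {2, 9.8}, {4, 16.5}, {6, 21.9}, {8, 23.8}};
```

Calling print_scaling(carbec) gives an efficiency of roughly 96% at 2 threads, 81% at 4, 72% at 6 and 58% at 8.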
JVMerlino
Posts: 1396
Joined: Wed Mar 08, 2006 10:15 pm
Location: San Francisco, California

Re: Need advice on SMP

Post by JVMerlino »

Carbec wrote: Wed Mar 22, 2023 2:39 pm Hello,

I implemented SMP as "Lazy SMP with a shared transposition table".
It works well, at least I think :) One thing confuses me, though perhaps it's normal.
Here is the speed in million nodes/sec for 1 to 8 threads:
1 : 5.1
2 : 9.8
4 : 16.5
6 : 21.9
8 : 23.8
I ran the test on a position set (BT2630) with 30 positions, at a search depth of 10.
Is it normal that the curve is not linear? Or is there a bug somewhere?
Philippe
A lot will depend on how your implementation splits up the work of searching the tree, but you will never get truly linear performance from multiple cores. Even engines such as Crafty, which uses a more sophisticated SMP implementation (https://www.chessprogramming.org/Dynamic_Tree_Splitting), don't get linear performance. Crafty reports these NPS numbers (in millions) when analyzing the root position for 20 seconds:
1 core = 5.8
2 = 11.3
4 = 17.4
6 = 21.6
8 = 25.7

My mediocre engine (about 2600 on CCRL), which uses an extremely lazy SMP implementation that is wildly non-deterministic, gives these numbers:
1 = 2.0
2 = 3.7
4 = 7.0
6 = 9.8
8 = 11.9

Really, more important is the improvement in time-to-depth. So here are the numbers for my engine for reaching depth 19 from the root position:
1 core = 25.0
2 = 13.9
4 = 9.8
6 = 7.2
8 = 6.0
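
For comparison, both sets of numbers for my engine can be reduced to a speedup factor over one core. A small sketch of that arithmetic (the names are invented for this example, and the time unit does not matter because only the ratio is used):

```cpp
#include <vector>

// One run: thread count, speed in million nodes/sec, and time to reach depth 19.
struct Run {
    int threads;
    double mnps;
    double time_to_depth;
};

// NPS speedup: how many times more nodes per second than the baseline.
double nps_speedup(const Run& base, const Run& r) {
    return r.mnps / base.mnps;
}

// Time-to-depth speedup: how many times faster the target depth is reached.
double ttd_speedup(const Run& base, const Run& r) {
    return base.time_to_depth / r.time_to_depth;
}

// The two tables above, merged by core count.
const std::vector<Run> runs = {
    {1, 2.0, 25.0}, {2, 3.7, 13.9}, {4, 7.0, 9.8}, {6, 9.8, 7.2}, {8, 11.9, 6.0}};
```

At 8 threads this gives an NPS speedup of about 5.95 but a time-to-depth speedup of only about 4.17, so the two metrics clearly measure different things.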
AndrewGrant
Posts: 1953
Joined: Tue Apr 19, 2016 6:08 am
Location: U.S.A
Full name: Andrew Grant

Re: Need advice on SMP

Post by AndrewGrant »

I wildly disagree with JVMerlino's post, and think all of that information is wrong, especially for LazySMP.

You should see near-linear, and possibly greater-than-linear, performance. You can see greater than linear when you reuse static evaluations from the TT. If you see less than linear, dropping off hard as the number of LOGICAL cores goes up, then you probably have some memory issues: true or false sharing of memory that is NOT the transposition table.
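
One common way false sharing sneaks in is per-thread bookkeeping (node counters, for instance) packed side by side in memory. A minimal sketch of the usual fix, assuming 64-byte cache lines and C++17 (so that std::vector honours the over-alignment); the struct and function names are made up for illustration and not taken from any engine in this thread:

```cpp
#include <cstddef>
#include <cstdint>
#include <vector>

// Assumed cache-line size; 64 bytes is typical for x86-64 but not guaranteed.
constexpr std::size_t kCacheLine = 64;

// Packed layout: eight of these fit in one 64-byte line, so threads that each
// bump "their own" counter still fight over the same cache line.
struct PackedCounter {
    std::uint64_t nodes = 0;
};

// Padded layout: alignas rounds the size up to a full line, so each thread's
// counter sits on a cache line of its own.
struct alignas(kCacheLine) PaddedCounter {
    std::uint64_t nodes = 0;
};

static_assert(sizeof(PaddedCounter) == kCacheLine, "one counter per cache line");

// Sum the per-thread counters only when a total is actually needed.
// A real engine would read these atomically or after the threads have stopped.
std::uint64_t total_nodes(const std::vector<PaddedCounter>& per_thread) {
    std::uint64_t sum = 0;
    for (const PaddedCounter& c : per_thread) {
        sum += c.nodes;
    }
    return sum;
}
```

Each search thread would then increment only per_thread[its_id].nodes, and the main thread would call total_nodes() when it prints the node count.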

Time to depth is a useless metric for LazySMP. It only has value for engines that operate on a SINGLE search tree while using multiple threads. Time to depth may or may not change in LazySMP. Things like singular extensions prompt searches to go deeper and wider, inflating the time to depth metric in some cases.

Here is what I get with a bench of depth 13, on 16MB, with various core counts.

Cores  Speed        Scaling
1       1.888 mnps  ( x 1.00 )
2       3.854 mnps  ( x 2.04 )
4       7.474 mnps  ( x 3.96 )
6      10.985 mnps  ( x 5.81 )
As time control goes up, I would suspect you settle somewhere nicely below linear, probably never losing an entire logical core's worth. At a certain point, contention for resources sets in.
smatovic
Posts: 3222
Joined: Wed Mar 10, 2010 10:18 pm
Location: Hamburg, Germany
Full name: Srdja Matovic

Re: Need advice on SMP

Post by smatovic »

I agree with Andrew: with a plain Shared Hash Table (SHT) you should get linear NPS speedup across real cores (UMA). Hyper-Threading (SMT-2) speedup depends on the engine and implementation, and time to depth is the wrong metric for modern LazySMP (beyond vanilla SHT).

Here are my numbers with ABDADA and RMO parallel search:

https://github.com/smatovic/Zeta/blob/m ... esults.txt

--
Srdja
hgm
Posts: 28353
Joined: Fri Mar 10, 2006 10:06 am
Location: Amsterdam
Full name: H G Muller

Re: Need advice on SMP

Post by hgm »

DRAM access for TT probing can be a bottleneck, can't it? No matter how many cores you have, you could never get an NPS larger than the number of cache lines you can load per second.
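
As a back-of-the-envelope version of that bound: divide the memory bandwidth by the cache-line size, then by the number of DRAM line fetches each node costs. The sketch below assumes 64-byte lines, and the bandwidth and misses-per-node figures are inputs you would have to measure yourself; the function names are invented for the example:

```cpp
// Upper bound on cache-line loads per second for a given sustained
// memory bandwidth (bytes per second) and cache-line size (bytes).
double max_lines_per_second(double bandwidth_bytes_per_s,
                            double line_bytes = 64.0) {
    return bandwidth_bytes_per_s / line_bytes;
}

// Resulting ceiling on nodes per second if every node costs
// `misses_per_node` cache-line fetches from DRAM on average.
double nps_ceiling(double bandwidth_bytes_per_s, double misses_per_node,
                   double line_bytes = 64.0) {
    return max_lines_per_second(bandwidth_bytes_per_s, line_bytes) /
           misses_per_node;
}
```

With a purely illustrative 20 GB/s and one miss per node, nps_ceiling(20e9, 1.0) comes out at about 312 million nodes per second. Random TT probes usually achieve far less than peak sequential bandwidth, so the practical ceiling would be lower than this figure.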
smatovic
Posts: 3222
Joined: Wed Mar 10, 2010 10:18 pm
Location: Hamburg, Germany
Full name: Srdja Matovic

Re: Need advice on SMP

Post by smatovic »

hgm wrote: Thu Mar 23, 2023 9:12 am DRAM access for TT probing can be a bottleneck, can't it? No matter how many cores you have, you could never get an NPS larger than the number of cache lines you can load per second.
You can prefetch to hide memory latencies? But yes.
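
A rough sketch of what I mean, assuming GCC or Clang (for __builtin_prefetch; on x86 the _mm_prefetch intrinsic plays the same role) and a power-of-two table size. The entry layout and class are hypothetical, and the table is not made thread-safe here:

```cpp
#include <cstddef>
#include <cstdint>
#include <vector>

// Hypothetical TT entry; real engines pack these fields differently.
struct TTEntry {
    std::uint64_t key = 0;
    std::int16_t score = 0;
    std::uint16_t move = 0;
    std::uint8_t depth = 0;
    std::uint8_t flags = 0;
};

class TranspositionTable {
public:
    // `entries` must be a power of two so the index is a simple mask.
    explicit TranspositionTable(std::size_t entries)
        : table_(entries), mask_(entries - 1) {}

    std::size_t index(std::uint64_t key) const {
        return static_cast<std::size_t>(key & mask_);
    }

    // Hint that this entry will be read soon. A prefetch is only a hint:
    // it does not fault and does not change program state.
    void prefetch(std::uint64_t key) const {
        __builtin_prefetch(&table_[index(key)]);
    }

    // Return the entry if the stored key matches, otherwise nullptr.
    const TTEntry* probe(std::uint64_t key) const {
        const TTEntry& e = table_[index(key)];
        return e.key == key ? &e : nullptr;
    }

    // Always-replace store, kept deliberately simple.
    void store(std::uint64_t key, std::int16_t score, std::uint8_t depth) {
        TTEntry& e = table_[index(key)];
        e.key = key;
        e.score = score;
        e.depth = depth;
    }

private:
    std::vector<TTEntry> table_;
    std::size_t mask_;
};
```

In the search you would compute the child's hash key as early as possible, call tt.prefetch(child_key), then do the make-move work, and probe only afterwards, so the DRAM fetch overlaps with useful computation.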

--
Srdja
Joost Buijs
Posts: 1631
Joined: Thu Jul 16, 2009 10:47 am
Location: Almere, The Netherlands

Re: Need advice on SMP

Post by Joost Buijs »

hgm wrote: Thu Mar 23, 2023 9:12 am DRAM access for TT probing can be a bottleneck, can't it? No matter how many cores you have, you could never get an NPS larger than the number of cache lines you can load per second.
This is exactly the reason why server or HEDT CPUs with 4, 6, 8 or 12 memory channels always perform better with Lazy-SMP than consumer CPUs with just 2 memory channels.
dangi12012
Posts: 1062
Joined: Tue Apr 28, 2020 10:03 pm
Full name: Daniel Infuehr

Re: Need advice on SMP

Post by dangi12012 »

If the number of cache lines per second is an issue, and the TT luckily is indirect storage, you can use these instructions:

_directstoreu_u64 (bypasses cache hierarchy)
_mm_prefetch (prefetch before access; honestly, you don't know the TT address before you have the new Zobrist key, so maybe you have another use case)

_mm256_stream_si256 (bypass cache but more modern and 32 bytes at once)
_mm256_stream_load_si256 (bypass cache but more modern and 32 bytes at once)

So for example I had an application that ran 230% faster after using the above two instructions.
In essence you keep your working memory cache clean and go directly to memory.
If you access the data again immediately, the overhead becomes very big.

In essence this is a thing the compiler cannot do for you. If your application is memory bound, think about reordering your working memory first; if access is truly random, the above instructions may help.
Worlds-fastest-Bitboard-Chess-Movegenerator
Daniel Inführ - Software Developer
smatovic
Posts: 3222
Joined: Wed Mar 10, 2010 10:18 pm
Location: Hamburg, Germany
Full name: Srdja Matovic

Re: Need advice on SMP

Post by smatovic »

Joost Buijs wrote: Thu Mar 23, 2023 9:22 am
hgm wrote: Thu Mar 23, 2023 9:12 am DRAM access for TT probing can be a bottleneck, can't it? No matter how many cores you have, you could never get an NPS larger than the number of cache lines you can load per second.
This is exactly the reason why server or HEDT CPUs with 4, 6, 8 or 12 memory channels always perform better with Lazy-SMP than consumer CPUs with just 2 memory channels.
There must be a sweet spot for the ratio of cores to memory channels, which depends on the engine, I suppose.

--
Srdja