Need advice on SMP

Carbec · Post by **Carbec** » Wed Mar 22, 2023 2:39 pm

Hello,

I implemented SMP as "Lazy SMP with a shared transposition table".
It works well, at least I think.

One thing confuses me, though perhaps it is normal.
These are the speeds in million nodes per second for 1 to 8 threads:
1 : 5.1
2 : 9.8
4 : 16.5
6 : 21.9
8 : 23.8
I ran the test on a position set (BT2630, 30 positions) at a search depth of 10.
Is it normal that the curve is not linear, or is there a bug somewhere?
For information, I have a QuadCore Intel Core i7-2600K, with 4 cores and 8 threads.

Thanks for information

Philippe

Modern Times · Post by **Modern Times** » Wed Mar 22, 2023 5:17 pm

Carbec wrote: ↑Wed Mar 22, 2023 2:39 pm For information, I have a QuadCore Intel Core i7-2600K, with 4 cores and 8 threads.

My thoughts - beyond 4 threads you're accessing logical cores rather than real cores, and you won't get the same performance.
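To put numbers on the scaling in the opening post, here is a minimal sketch that turns the posted figures into speedup and parallel efficiency. The names `Sample`, `speedup`, `efficiency`, `kCarbec` and `print_scaling` are made up for illustration, and the first sample is assumed to be the single-thread baseline.

```cpp
#include <cassert>
#include <cstdio>
#include <vector>

// One benchmark sample: thread count and measured speed in million nodes/sec.
struct Sample {
    int threads;
    double mnps;
};

// Speedup of `s` relative to the baseline sample (assumed to be 1 thread).
double speedup(const Sample& base, const Sample& s) {
    assert(base.mnps > 0.0);
    return s.mnps / base.mnps;
}

// Parallel efficiency: speedup divided by thread count (1.0 would be linear).
double efficiency(const Sample& base, const Sample& s) {
    assert(s.threads > 0);
    return speedup(base, s) / s.threads;
}

// Figures from the opening post (i7-2600K: 4 cores, 8 hardware threads).
const std::vector<Sample> kCarbec = {
    {1, 5.1}, {2, 9.8}, {4, 16.5}, {6, 21.9}, {8, 23.8}};

// Print speedup and efficiency for every sample, using the first as baseline.
void print_scaling(const std::vector<Sample>& samples) {
    for (const Sample& s : samples) {
        std::printf("%d threads: x%.2f speedup, %.0f%% efficiency\n",
                    s.threads, speedup(samples.front(), s),
                    100.0 * efficiency(samples.front(), s));
    }
}
```

Calling `print_scaling(kCarbec)` shows efficiency of roughly 81% at 4 threads and roughly 58% at 8 threads, so the curve flattens most once the thread count exceeds the 4 physical cores.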

JVMerlino · Post by **JVMerlino** » Wed Mar 22, 2023 6:23 pm

Carbec wrote: ↑Wed Mar 22, 2023 2:39 pm Hello,

I implemented SMP as "Lazy SMP with a shared transposition table".
It works well, at least I think.
One thing confuses me, though perhaps it is normal.
These are the speeds in million nodes per second for 1 to 8 threads:
1 : 5.1
2 : 9.8
4 : 16.5
6 : 21.9
8 : 23.8
I ran the test on a position set (BT2630, 30 positions) at a search depth of 10.
Is it normal that the curve is not linear, or is there a bug somewhere?
Philippe

A lot will depend on your implementation of how you split up the work of searching the tree, but you will never get truly linear performance from multiple cores. Even engines such as Crafty, which use a more sophisticated SMP implementation (https://www.chessprogramming.org/Dynamic_Tree_Splitting), don't get linear performance. Crafty reports these NPS numbers (in millions) when analyzing the root position for 20 seconds:
1 core = 5.8
2 = 11.3
4 = 17.4
6 = 21.6
8 = 25.7

My mediocre engine (about 2600 on CCRL), which uses an extremely lazy SMP implementation that is wildly non-deterministic, gives these numbers:
1 = 2.0
2 = 3.7
4 = 7.0
6 = 9.8
8 = 11.9

Really, more important is the improvement in time-to-depth. So here are the numbers for my engine for reaching depth 19 from the root position:
1 core = 25.0
2 = 13.9
4 = 9.8
6 = 7.2
8 = 6.0

AndrewGrant · Post by **AndrewGrant** » Wed Mar 22, 2023 7:11 pm

I wildly disagree with JVMerlino's post, and think all of that information is wrong especially for LazySMP.

You should see near-linear, and possibly greater-than-linear, performance. You can see greater than linear when you reuse static evaluations from the TT. If you see less than linear, dropping off sharply as the number of LOGICAL cores goes up, then you probably have some memory issues: true or false sharing of memory that is NOT the transposition table.
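As an illustration of the false-sharing case, here is a minimal sketch of per-thread counters laid out with and without cache-line padding. Everything in it is an assumption for illustration: the names `PackedStats`, `PaddedStats`, `slot_for` and `neighbours_share_line` are hypothetical, and the 64-byte line size is typical for x86 but not guaranteed.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// Assumed cache-line size in bytes (typical on x86, not guaranteed).
constexpr std::size_t kCacheLine = 64;

// Unpadded per-thread counter: eight of these fit in one cache line, so
// threads bumping neighbouring counters keep invalidating each other's line.
struct PackedStats {
    std::uint64_t nodes = 0;
};

// Padded per-thread counter: alignas rounds the size up to a full line,
// so each thread's counter lives on a line no other thread writes to.
struct alignas(kCacheLine) PaddedStats {
    std::uint64_t nodes = 0;
};
static_assert(sizeof(PaddedStats) == kCacheLine, "one slot per cache line");

// Return the slot owned by the given thread.
template <typename Stats>
Stats& slot_for(std::vector<Stats>& slots, std::size_t thread_id) {
    assert(thread_id < slots.size());
    return slots[thread_id];
}

// True if any two adjacent slots start on the same cache line,
// which is the layout that makes false sharing possible.
template <typename Stats>
bool neighbours_share_line(const std::vector<Stats>& slots) {
    for (std::size_t i = 0; i + 1 < slots.size(); ++i) {
        const auto a = reinterpret_cast<std::uintptr_t>(&slots[i]) / kCacheLine;
        const auto b = reinterpret_cast<std::uintptr_t>(&slots[i + 1]) / kCacheLine;
        if (a == b) return true;
    }
    return false;
}
```

With one slot per search thread, `neighbours_share_line` reports sharing for a `std::vector<PackedStats>` and none for a `std::vector<PaddedStats>`. The same padding idea applies to any other per-thread data that is written often, such as history tables or search stacks, if they happen to be allocated next to each other.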

Time to depth is a useless metric for LazySMP. It only has value for engines that operate on a SINGLE search tree, but using multiple threads. Time to depth may or may not change in LazySMP. Things like singular extensions prompt searches to go deeper and wider, inflating the time to depth metric in some cases.

Here is what I get with a bench of depth 13, on 16MB, with various core counts.

1.  1.888mnps ( x 1.00 )
2.  3.854mnps ( x 2.04 )
4.  7.474mnps ( x 3.96 )
6. 10.985mnps ( x 5.81 )

As time control goes up, I would suspect you settle somewhere nicely below linear. Probably never losing an entire logical core worth. At a certain point contention of resources exists.

smatovic · Post by **smatovic** » Thu Mar 23, 2023 9:05 am

I agree with Andrew: with a plain shared hash table you should get linear NPS speedup across real cores (UMA). Hyper-Threading (SMT-2) speedup depends on the engine and implementation, and time to depth is the wrong metric for modern LazySMP (beyond vanilla SHT).

Here are my numbers with ABDADA and RMO parallel search:

https://github.com/smatovic/Zeta/blob/m ... esults.txt

--
Srdja

hgm · Post by **hgm** » Thu Mar 23, 2023 9:12 am

DRAM access for TT probing can be a bottleneck, not? No matter how many cores you have, you could never get an NPS larger than the number of cache-lines you can load per second.

smatovic · Post by **smatovic** » Thu Mar 23, 2023 9:17 am

hgm wrote: ↑Thu Mar 23, 2023 9:12 am DRAM access for TT probing can be a bottleneck, not? No matter how many cores you have, you could never get an NPS larger than the number of cache-lines you can load per second.

You can prefetch to hide memory latencies? But yes.

--
Srdja

Joost Buijs · Post by **Joost Buijs** » Thu Mar 23, 2023 9:22 am

hgm wrote: ↑Thu Mar 23, 2023 9:12 am DRAM access for TT probing can be a bottleneck, not? No matter how many cores you have, you could never get an NPS larger than the number of cache-lines you can load per second.

This is exactly the reason why server or HEDT CPUs with 4, 6, 8 or 12 memory channels always perform better with Lazy-SMP than consumer CPUs with just 2 memory channels.

dangi12012 · Post by **dangi12012** » Thu Mar 23, 2023 9:33 am

If the number of cache lines / second is an issue and TT is luckily indirect storage you can use these instructions:

_directstoreu_u64 (bypasses cache hierarchy)
_mm_prefetch (prefetch before access; honestly, you don't know the address for the TT before you have the new Zobrist key, so maybe you have another use case)

_mm256_stream_si256 (bypass cache but more modern and 32bytes at once)
_mm256_stream_load_si256 (bypass cache but more modern and 32bytes at once)

So, for example, I had an application that ran 230% faster after using the above 2 instructions.
In essence you keep your working memory cache clean and go directly to memory.
If you access the data immediately again, the overhead becomes very big.

In essence this is something the compiler cannot do for you. If your application is memory bound, think about reordering your working memory first; if access is truly random, the above instructions may help.
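The prefetch idea discussed above can be sketched as follows: issue the prefetch as soon as the new Zobrist key is known (for example right after making a move), do other work, and probe later. This is a minimal sketch under assumptions, not any particular engine's code. `TTEntry`, `TranspositionTable` and their members are hypothetical names, the table is a simple always-replace table with a power-of-two size, and `__builtin_prefetch` (GCC/Clang) is swapped in for `_mm_prefetch` to keep the sketch portable across architectures.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// Hypothetical 16-byte entry; real engines pack more fields.
struct TTEntry {
    std::uint64_t key = 0;
    std::int16_t score = 0;
    std::uint8_t depth = 0;
};

class TranspositionTable {
public:
    // Size must be a power of two so the index is a cheap mask.
    explicit TranspositionTable(std::size_t entries) : table_(entries) {
        assert(entries != 0 && (entries & (entries - 1)) == 0);
    }

    // Hint the CPU to start loading the entry's cache line. This is only
    // a hint: it never faults and does not change program results.
    void prefetch(std::uint64_t key) const {
#if defined(__GNUC__) || defined(__clang__)
        __builtin_prefetch(&table_[index(key)]);
#else
        (void)key;  // no portable prefetch available; do nothing
#endif
    }

    // Return the entry if its stored key matches, otherwise nullptr.
    const TTEntry* probe(std::uint64_t key) const {
        const TTEntry& e = table_[index(key)];
        return e.key == key ? &e : nullptr;
    }

    // Always-replace store, kept simple for the sketch.
    void store(std::uint64_t key, std::int16_t score, std::uint8_t depth) {
        TTEntry& e = table_[index(key)];
        e.key = key;
        e.score = score;
        e.depth = depth;
    }

private:
    std::size_t index(std::uint64_t key) const {
        return static_cast<std::size_t>(key) & (table_.size() - 1);
    }
    std::vector<TTEntry> table_;
};
```

In a search loop the intended order would be: update the key for the child position, call `prefetch(child_key)`, finish the remaining move-making work, then call `probe(child_key)` when entering the child node. How much latency this hides depends on how much work sits between the two calls.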

smatovic · Post by **smatovic** » Thu Mar 23, 2023 9:40 am

Joost Buijs wrote: ↑Thu Mar 23, 2023 9:22 am
hgm wrote: ↑Thu Mar 23, 2023 9:12 am DRAM access for TT probing can be a bottleneck, not? No matter how many cores you have, you could never get an NPS larger than the number of cache-lines you can load per second.
This is exactly the reason why server or HEDT CPUs with 4, 6, 8 or 12 memory channels always perform better with Lazy-SMP than consumer CPUs with just 2 memory channels.

There must be a sweet spot for the ratio of cores to memory channels, which depends on the engine, I suppose.

--
Srdja
