Need advice on SMP

dangi12012 · Post by **dangi12012** » Thu Mar 23, 2023 9:52 am

On that note if we had a zobrist replacement - domain hash that only differs in the lower bits for similar positions then during search more relevant positions would fit into a cacheline. But that is a pipedream since positions differ too much 2 moves down.

On that note this is what I use instead of zobrist with great success:

Code: Select all

	static inline uint32_t hash32(const __m256i value) {
		uint32_t crcv = 0;
		crcv = _mm_crc32_u64(crcv, _mm256_extract_epi64(value, 0));
		crcv = _mm_crc32_u64(crcv, _mm256_extract_epi64(value, 1));
		crcv = _mm_crc32_u64(crcv, _mm256_extract_epi64(value, 2));
		return _mm_crc32_u64(crcv, _mm256_extract_epi64(value, 3));
}

Since zobrist needs 3 lookups per move and above code needs none the code run at similar speed purely inside L1 with the advantage of not needing to be incremental at all too.
I tried the other common boost::hash_combine but it was slower and yielded more wrong intersections.

hgm · Post by **hgm** » Thu Mar 23, 2023 10:26 am

smatovic wrote: ↑Thu Mar 23, 2023 9:17 am
hgm wrote: ↑Thu Mar 23, 2023 9:12 am DRAM access for TT probing can be a bottleneck, not? No matter how many cores you have, you could never get an NPS larger than the number of cache-lines you can load per second.
You can prefetch to hide memory latencies? But yes.

--
Srdja

Prefetching can hide latency. But if throughput is the bottleneck, it could actually backfire, when you speculatively prefetch data that later tuns out to be not needed after all. The time that elapses between the moment that you are sure you will search a certain move, and that you need to have the TT data, cold be too short to perform a DRAM prefetch.

smatovic · Post by **smatovic** » Thu Mar 23, 2023 10:52 am

Yes, it really depends on engine I guess. How many compute cycles compared to one memory fetch, can you effectively prefetch, how many memory fetches per thread compared to cache/memory latency and throughput per thread? As Joost mentioned, would be interesting to know the sweetspot of current Stockfish on different hardware, there have been people in here reporting that they had to populate all their RAM slots (resp. channels) with modules to get more NPS....nevertheless I agree with Andrew in general.

--
Srdja

Carbec · Post by **Carbec** » Thu Mar 23, 2023 1:50 pm

hello,

Thanks for all these answers. Some are above my knowledge, but what I understood is I might have
a memory problem. Some data shared by all the threads, except the TT.
I will look at that.

Philippe

Joost Buijs · Post by **Joost Buijs** » Thu Mar 23, 2023 4:43 pm

smatovic wrote: ↑Thu Mar 23, 2023 9:40 am
Joost Buijs wrote: ↑Thu Mar 23, 2023 9:22 am
hgm wrote: ↑Thu Mar 23, 2023 9:12 am DRAM access for TT probing can be a bottleneck, not? No matter how many cores you have, you could never get an NPS larger than the number of cache-lines you can load per second.
This is exactly the reason why server or HEDT CPU's with 4, 6, 8 or 12 memory channels always perform better with Lazy-SMP than consumer CPU's with just 2 memory channels.
There must be a sweetspot of cores:memory channels ratio, which depends on engine I suppose.

--
Srdja

Of course, the clock frequency of consumer CPU's is usually somewhat higher, if you run with a low number of threads things could be different. With a high number of threads the memory bandwidth has a big influence on the performance of a chess engine. Maybe if you don't probe in quiescence this gets less of an issue.

abulmo2 · Post by **abulmo2** » Sat Mar 25, 2023 12:50 pm

AndrewGrant wrote: ↑Wed Mar 22, 2023 7:11 pm Here is what I get with a bench of depth 13, on 16MB, with various core counts.
Code: Select all
1.  1.888mnps ( x 1.00 )
2.  3.854mnps ( x 2.04 )
4.  7.474mnps ( x 3.96 )
6. 10.985mnps ( x 5.81 )

Beware also of the testing methodology. At fixed depth on my engine (Amoeba), I get poor scaling and unstable results, because once a task reaches the depth limit, it stops searching. I get better results at fixed time, where all tasks keep searching until the time limit is reached:

Code: Select all

    fixed depth (18)      fixed time (1 s./move)
1.  1.996 mnps (× 1.00)  2.100 mnps (x 1.00)
2.  3.779 mnps (× 1.89)  4.067 mnps (x 1.95)
4.  6.530 mnps (× 3.27)  8.330 mnps (x 3.99)
6. 10.201 mnps (× 5.11) 12.531 mnps (x 5.97)
8. 12.742 mnps (× 6.38) 16.584 mnps (x 7.93)

syzygy · Post by **syzygy** » Sun Mar 26, 2023 12:59 am

smatovic wrote: ↑Thu Mar 23, 2023 9:17 am
hgm wrote: ↑Thu Mar 23, 2023 9:12 am DRAM access for TT probing can be a bottleneck, not? No matter how many cores you have, you could never get an NPS larger than the number of cache-lines you can load per second.
You can prefetch to hide memory latencies? But yes.

But probing the TT is not going to run into memory bandwidth limitations, and if multiple cores or threads are probing the TT that probably does not measurably increase memory access latency.

Slowdown for >4 threads is expected because the processor only has 4 cores.
Slowdown for 2-4 threads probably is due to real or false sharing.
(I suppose it could also have to do with L2 cache being shared by 4 active cores, but this doesn't seem to be an issue for most engines.)

syzygy · Post by **syzygy** » Sun Mar 26, 2023 1:27 am

Carbec wrote: ↑Thu Mar 23, 2023 1:50 pm hello,

Thanks for all these answers. Some are above my knowledge, but what I understood is I might have
a memory problem. Some data shared by all the threads, except the TT.
I will look at that.

On Linux you can use "perf c2c" to figure out the addresses where sharing occurs (type "man perf c2c").

syzygy · Post by **syzygy** » Sun Mar 26, 2023 1:36 am

syzygy wrote: ↑Sun Mar 26, 2023 12:59 amand if multiple cores or threads are probing the TT that probably does not measurably increase memory access latency.

Or is this wrong if the cores are accessing the same memory channel?

Does someone know how this works? Are memory accesses effectively pipelined by the memory controller, so that access latencies overlap? Or does the memory controller handle simultaneous cache misses by different cores/threads strictly sequentially?

Carbec · Post by **Carbec** » Mon Mar 27, 2023 10:35 am

Hello,

I use linux Mint, but there is no "perf". I tried to install it, without success.
I had an error about "broken packet". I am certainly not a linux expert, I installed
Mint given its good reputation. Which distro do you use ?

Need advice on SMP

Re: Need advice on SMP

Re: Need advice on SMP

Re: Need advice on SMP

Re: Need advice on SMP

Re: Need advice on SMP

Re: Need advice on SMP

Re: Need advice on SMP

Re: Need advice on SMP

Re: Need advice on SMP

Re: Need advice on SMP