Need advice on SMP


dangi12012
Posts: 1062
Joined: Tue Apr 28, 2020 10:03 pm
Full name: Daniel Infuehr

Re: Need advice on SMP

Post by dangi12012 »

On that note: if we had a Zobrist replacement, i.e. a domain-specific hash that differs only in the lower bits for similar positions, then more relevant positions would fit into a cache line during search. But that is a pipe dream, since positions differ too much two moves down.
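To make the cache-line idea concrete, here is a minimal sketch. It assumes a hypothetical 16-byte TT entry, 64-byte cache lines and a 64-byte-aligned table, so four consecutive entries share one line. The names `TTEntry`, `entry_index` and `cache_line_of` are illustrative only and not taken from any engine.

```cpp
#include <cstddef>
#include <cstdint>

// Hypothetical 16-byte entry; real engines pack their own fields.
struct TTEntry {
    uint64_t key;
    uint64_t data;
};

constexpr std::size_t kCacheLineBytes = 64;  // assumed cache-line size
constexpr std::size_t kEntriesPerLine = kCacheLineBytes / sizeof(TTEntry);

// Table index taken from the low bits of the hash.
// table_size must be a power of two.
constexpr std::size_t entry_index(uint64_t hash, std::size_t table_size) {
    return static_cast<std::size_t>(hash) & (table_size - 1);
}

// Cache line (counted from the start of the table) that the entry lands in,
// assuming the table itself starts on a 64-byte boundary.
constexpr std::size_t cache_line_of(uint64_t hash, std::size_t table_size) {
    return entry_index(hash, table_size) / kEntriesPerLine;
}
```

With this layout, two hashes that differ only in their lowest two bits land in the same cache line, which is the property such a domain hash would have to provide for related positions.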

This is what I use instead of Zobrist, with great success:

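Code:

#include <cstdint>
#include <immintrin.h>  // _mm256_extract_epi64 (AVX), _mm_crc32_u64 (SSE4.2)

// Hash a 256-bit value to 32 bits by chaining the hardware CRC32C
// instruction over its four 64-bit lanes. Needs a 64-bit target with AVX and SSE4.2.
static inline uint32_t hash32(const __m256i value) {
    uint32_t crcv = 0;
    crcv = _mm_crc32_u64(crcv, _mm256_extract_epi64(value, 0));
    crcv = _mm_crc32_u64(crcv, _mm256_extract_epi64(value, 1));
    crcv = _mm_crc32_u64(crcv, _mm256_extract_epi64(value, 2));
    return _mm_crc32_u64(crcv, _mm256_extract_epi64(value, 3));
}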
Zobrist needs 3 lookups per move and the code above needs none, so the two run at similar speed as long as everything stays inside L1, with the added advantage that this hash does not need to be incremental at all.
I also tried the common boost::hash_combine, but it was slower and yielded more collisions.
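For comparison, the incremental Zobrist update that this replaces looks roughly like the sketch below. The table layout and the names `ZobristKeys`, `init_keys` and `update_quiet_move` are assumptions for illustration. A quiet move touches three keys (piece on the from-square, piece on the to-square, side to move); captures, castling and en passant would need more.

```cpp
#include <cstdint>
#include <random>

// Hypothetical key tables: 12 piece kinds (6 per colour) x 64 squares.
struct ZobristKeys {
    uint64_t piece_square[12][64];
    uint64_t side_to_move;
};

// Fill the tables with pseudo-random 64-bit keys from a fixed seed.
inline ZobristKeys init_keys(uint64_t seed) {
    std::mt19937_64 rng(seed);
    ZobristKeys k{};
    for (auto& per_piece : k.piece_square)
        for (auto& key : per_piece)
            key = rng();
    k.side_to_move = rng();
    return k;
}

// Incremental update for a quiet (non-capturing) move: three key lookups.
inline uint64_t update_quiet_move(uint64_t hash, const ZobristKeys& k,
                                  int piece, int from, int to) {
    hash ^= k.piece_square[piece][from];  // remove piece from origin square
    hash ^= k.piece_square[piece][to];    // place piece on target square
    hash ^= k.side_to_move;               // flip side to move
    return hash;
}
```

Because XOR is its own inverse, applying the update for the reverse move restores the original hash, which is what makes the scheme incremental in both directions.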
Worlds-fastest-Bitboard-Chess-Movegenerator
Daniel Inführ - Software Developer
hgm
Posts: 27926
Joined: Fri Mar 10, 2006 10:06 am
Location: Amsterdam
Full name: H G Muller

Re: Need advice on SMP

Post by hgm »

smatovic wrote: Thu Mar 23, 2023 9:17 am
hgm wrote: Thu Mar 23, 2023 9:12 am DRAM access for TT probing can be a bottleneck, not? No matter how many cores you have, you could never get an NPS larger than the number of cache-lines you can load per second.
You can prefetch to hide memory latencies? But yes.

--
Srdja
Prefetching can hide latency. But if throughput is the bottleneck, it could actually backfire when you speculatively prefetch data that later turns out not to be needed after all. The time that elapses between the moment you are sure you will search a certain move and the moment you need the TT data could be too short to perform a DRAM prefetch.
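A minimal sketch of what such a TT prefetch can look like, assuming GCC or Clang (`__builtin_prefetch` is a compiler builtin there) and a hypothetical table class. `TranspositionTable`, `TTEntry` and the method names are illustrative, not from any particular engine.

```cpp
#include <cstddef>
#include <cstdint>
#include <vector>

struct TTEntry {
    uint64_t key;
    uint64_t data;
};

class TranspositionTable {
public:
    // entries_pow2 must be a power of two.
    explicit TranspositionTable(std::size_t entries_pow2) : table_(entries_pow2) {}

    // Ask the CPU to start loading the entry for `hash`. This is only a
    // hint, and it costs memory bandwidth even if the entry is never probed.
    void prefetch(uint64_t hash) const {
#if defined(__GNUC__) || defined(__clang__)
        __builtin_prefetch(&table_[index(hash)]);
#else
        (void)hash;  // no prefetch available in this sketch: do nothing
#endif
    }

    // Return the entry if its stored key matches, otherwise nullptr.
    const TTEntry* probe(uint64_t hash) const {
        const TTEntry& e = table_[index(hash)];
        return e.key == hash ? &e : nullptr;
    }

    void store(uint64_t hash, uint64_t data) {
        table_[index(hash)] = TTEntry{hash, data};
    }

private:
    std::size_t index(uint64_t hash) const {
        return static_cast<std::size_t>(hash) & (table_.size() - 1);
    }
    std::vector<TTEntry> table_;
};
```

The usual pattern is to call `prefetch` as soon as the child position's hash is known and to do other work before calling `probe`. Following the point above, the hint only pays off if enough work fits in between and the move is actually searched.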
smatovic
Posts: 2797
Joined: Wed Mar 10, 2010 10:18 pm
Location: Hamburg, Germany
Full name: Srdja Matovic

Re: Need advice on SMP

Post by smatovic »

Yes, I guess it really depends on the engine: how many compute cycles there are per memory fetch, whether you can prefetch effectively, and how many memory fetches each thread issues relative to cache/memory latency and throughput. As Joost mentioned, it would be interesting to know the sweet spot of current Stockfish on different hardware; people in here have reported that they had to populate all their RAM slots (resp. channels) with modules to get more NPS. Nevertheless, I agree with Andrew in general.

--
Srdja
Carbec
Posts: 144
Joined: Thu Jan 20, 2022 9:42 am
Location: France
Full name: Philippe Chevalier

Re: Need advice on SMP

Post by Carbec »

hello,

Thanks for all these answers. Some are above my knowledge, but what I understood is I might have
a memory problem. Some data shared by all the threads, except the TT.
I will look at that.

Philippe
Joost Buijs
Posts: 1577
Joined: Thu Jul 16, 2009 10:47 am
Location: Almere, The Netherlands

Re: Need advice on SMP

Post by Joost Buijs »

smatovic wrote: Thu Mar 23, 2023 9:40 am
Joost Buijs wrote: Thu Mar 23, 2023 9:22 am
hgm wrote: Thu Mar 23, 2023 9:12 am DRAM access for TT probing can be a bottleneck, not? No matter how many cores you have, you could never get an NPS larger than the number of cache-lines you can load per second.
This is exactly the reason why server or HEDT CPUs with 4, 6, 8 or 12 memory channels always perform better with Lazy-SMP than consumer CPUs with just 2 memory channels.
There must be a sweet spot for the ratio of cores to memory channels, which depends on the engine I suppose.

--
Srdja
Of course, the clock frequency of consumer CPUs is usually somewhat higher, so if you run with a low number of threads things could be different. With a high number of threads, memory bandwidth has a big influence on the performance of a chess engine. Maybe this becomes less of an issue if you don't probe in quiescence.
abulmo2
Posts: 434
Joined: Fri Dec 16, 2016 11:04 am
Location: France
Full name: Richard Delorme

Re: Need advice on SMP

Post by abulmo2 »

AndrewGrant wrote: Wed Mar 22, 2023 7:11 pm Here is what I get with a bench of depth 13, on 16MB, with various core counts.

Code:

1.  1.888mnps ( x 1.00 )
2.  3.854mnps ( x 2.04 )
4.  7.474mnps ( x 3.96 )
6. 10.985mnps ( x 5.81 )
Beware also of the testing methodology. At fixed depth on my engine (Amoeba), I get poor scaling and unstable results, because once a task reaches the depth limit, it stops searching. I get better results at fixed time, where all tasks keep searching until the time limit is reached:

Code:

    fixed depth (18)        fixed time (1 s/move)
1.   1.996 mnps (x 1.00)    2.100 mnps (x 1.00)
2.   3.779 mnps (x 1.89)    4.067 mnps (x 1.95)
4.   6.530 mnps (x 3.27)    8.330 mnps (x 3.99)
6.  10.201 mnps (x 5.11)   12.531 mnps (x 5.97)
8.  12.742 mnps (x 6.38)   16.584 mnps (x 7.93)
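The scaling factors in such a table are just NPS ratios against the single-thread run. A small sketch of that arithmetic follows; `Sample`, `speedup` and `efficiency` are hypothetical helper names, and efficiency here means speedup divided by the thread ratio.

```cpp
// One measurement: thread count and nodes per second (in millions).
struct Sample {
    int threads;
    double mnps;
};

// Speedup relative to the baseline run (normally the 1-thread result).
inline double speedup(const Sample& s, const Sample& base) {
    return s.mnps / base.mnps;
}

// Parallel efficiency: 1.0 means perfect linear scaling.
inline double efficiency(const Sample& s, const Sample& base) {
    return speedup(s, base) * base.threads / s.threads;
}
```

For the fixed-depth column above, `speedup({8, 12.742}, {1, 1.996})` gives about 6.38, i.e. roughly 80% efficiency on 8 threads.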
Richard Delorme
syzygy
Posts: 5584
Joined: Tue Feb 28, 2012 11:56 pm

Re: Need advice on SMP

Post by syzygy »

smatovic wrote: Thu Mar 23, 2023 9:17 am
hgm wrote: Thu Mar 23, 2023 9:12 am DRAM access for TT probing can be a bottleneck, not? No matter how many cores you have, you could never get an NPS larger than the number of cache-lines you can load per second.
You can prefetch to hide memory latencies? But yes.
But probing the TT is not going to run into memory bandwidth limitations, and if multiple cores or threads are probing the TT that probably does not measurably increase memory access latency.
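A rough order-of-magnitude check of this, as a sketch. It assumes 64-byte cache lines and, pessimistically, one DRAM miss per node; the function names are made up for illustration.

```cpp
constexpr double kCacheLineBytes = 64.0;  // assumed cache-line size

// DRAM traffic in bytes/s if every node causes `misses_per_node` line loads.
constexpr double tt_traffic_bytes_per_s(double nodes_per_s, double misses_per_node) {
    return nodes_per_s * misses_per_node * kCacheLineBytes;
}

// The bound quoted above: NPS cannot exceed the number of cache lines
// that can be loaded per second.
constexpr double max_nps(double bandwidth_bytes_per_s, double misses_per_node) {
    return bandwidth_bytes_per_s / (misses_per_node * kCacheLineBytes);
}
```

At 12.5 Mnps this gives about 0.8 GB/s of TT traffic. As an assumed reference point (not hardware mentioned in this thread), dual-channel DDR4-3200 has a theoretical peak of 51.2 GB/s. Random single-line accesses achieve far less than that peak, so this only shows that raw bandwidth leaves a large margin; it says nothing about latency.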

Slowdown for >4 threads is expected because the processor only has 4 cores.
Slowdown for 2-4 threads probably is due to real or false sharing.
(I suppose it could also have to do with L2 cache being shared by 4 active cores, but this doesn't seem to be an issue for most engines.)
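A typical source of false sharing is per-thread data, such as node counters, packed next to each other in one array. The sketch below shows the usual fix of giving each thread its own cache line. It assumes 64-byte lines and at most 8 threads, and all type names are hypothetical.

```cpp
#include <cstddef>
#include <cstdint>

constexpr std::size_t kCacheLine = 64;  // assumed cache-line size
constexpr int kMaxThreads = 8;          // illustrative limit

// Unpadded: all eight counters fit in one 64-byte line, so every
// increment by one thread invalidates that line for the others.
struct PackedCounters {
    uint64_t nodes[kMaxThreads];
};

// Padded: alignas forces each counter onto its own cache line.
struct alignas(kCacheLine) PaddedCounter {
    uint64_t nodes = 0;
};

struct PaddedCounters {
    PaddedCounter per_thread[kMaxThreads];
};

// Sum the per-thread counters, e.g. once at the end of the search.
inline uint64_t total_nodes(const PaddedCounters& c) {
    uint64_t sum = 0;
    for (const PaddedCounter& p : c.per_thread)
        sum += p.nodes;
    return sum;
}
```

The padded layout costs 512 bytes instead of 64, which is negligible compared to the cost of a cache line bouncing between cores on every node.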
syzygy
Posts: 5584
Joined: Tue Feb 28, 2012 11:56 pm

Re: Need advice on SMP

Post by syzygy »

Carbec wrote: Thu Mar 23, 2023 1:50 pm hello,

Thanks for all these answers. Some are above my knowledge, but what I understood is I might have
a memory problem. Some data shared by all the threads, except the TT.
I will look at that.
On Linux you can use "perf c2c" to figure out the addresses where sharing occurs (type "man perf c2c").
syzygy
Posts: 5584
Joined: Tue Feb 28, 2012 11:56 pm

Re: Need advice on SMP

Post by syzygy »

syzygy wrote: Sun Mar 26, 2023 12:59 am and if multiple cores or threads are probing the TT that probably does not measurably increase memory access latency.
Or is this wrong if the cores are accessing the same memory channel?

Does someone know how this works? Are memory accesses effectively pipelined by the memory controller, so that access latencies overlap? Or does the memory controller handle simultaneous cache misses by different cores/threads strictly sequentially?
Carbec
Posts: 144
Joined: Thu Jan 20, 2022 9:42 am
Location: France
Full name: Philippe Chevalier

Re: Need advice on SMP

Post by Carbec »

Hello,

I use Linux Mint, but there is no "perf". I tried to install it, without success:
I got an error about a "broken packet". I am certainly not a Linux expert; I installed
Mint because of its good reputation. Which distro do you use?