On that note: if we had a Zobrist replacement, a domain hash that differs only in the lower bits for similar positions, then during search more relevant positions would fit into a cache line. But that is a pipe dream, since positions already differ too much two moves down.
On that note, this is what I use instead of Zobrist, with great success:
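A minimal sketch of the idea (the mixer here is an assumption, a splitmix64-style finalizer; the point is a from-scratch hash over the raw position words, with no lookup tables and no incremental updates):

#include <cstdint>
#include <cstddef>

// splitmix64-style finalizer: cheap, strong 64-bit mixing, pure register work.
static inline uint64_t mix64(uint64_t x) {
    x += 0x9E3779B97F4A7C15ULL;
    x = (x ^ (x >> 30)) * 0xBF58476D1CE4E5B9ULL;
    x = (x ^ (x >> 27)) * 0x94D049BB133111EBULL;
    return x ^ (x >> 31);
}

// Hash the whole position from scratch: board, side to move, castling
// rights and en-passant square packed into 64-bit words.
uint64_t position_hash(const uint64_t* pos, size_t words) {
    uint64_t h = 0;
    for (size_t i = 0; i < words; ++i)
        h = mix64(h ^ pos[i]);   // no table lookups, no incremental bookkeeping
    return h;
}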
Since Zobrist needs 3 table lookups per move and the above code needs none, the code runs at a similar speed purely inside L1, with the added advantage of not needing to be incremental at all.
I tried the other common choice, boost::hash_combine, but it was slower and produced more collisions.
hgm wrote: ↑Thu Mar 23, 2023 9:12 am
DRAM access for TT probing can be a bottleneck, not? No matter how many cores you have, you could never get an NPS larger than the number of cache-lines you can load per second.
You can prefetch to hide memory latencies? But yes.
--
Srdja
Prefetching can hide latency. But if throughput is the bottleneck, it could actually backfire when you speculatively prefetch data that later turns out not to be needed after all. The time that elapses between the moment you are sure you will search a certain move and the moment you need the TT data could be too short to complete a DRAM prefetch.
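Concretely, the usual pattern is something like the sketch below (tt_table and tt_mask are made-up names, assuming a flat power-of-two table): compute the child's key as soon as you are sure you will search the move, issue the prefetch, and let the make-move and move-ordering work overlap the DRAM access.

#include <cstdint>

struct TTEntry { uint64_t key; uint64_t data; };

extern TTEntry* tt_table;   // hypothetical: TT as a flat array
extern uint64_t tt_mask;    // hypothetical: table size minus one (power of two)

// Call as soon as the child's hash key is known; the make-move and
// move-ordering work done afterwards overlaps the memory latency.
inline void prefetch_tt(uint64_t child_key) {
    __builtin_prefetch(&tt_table[child_key & tt_mask]);   // GCC/Clang builtin
}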
Yes, it really depends on the engine, I guess: how many compute cycles per memory fetch, how many fetches you can effectively prefetch, how many memory fetches per thread compared to cache/memory latency and throughput per thread? As Joost mentioned, it would be interesting to know the sweet spot of current Stockfish on different hardware; people here have reported that they had to populate all their RAM slots (resp. channels) with modules to get more NPS. Nevertheless, I agree with Andrew in general.
Thanks for all these answers. Some are above my knowledge, but what I understood is that I might have a memory problem: some data shared by all the threads other than the TT. I will look at that.
hgm wrote: ↑Thu Mar 23, 2023 9:12 am
DRAM access for TT probing can be a bottleneck, not? No matter how many cores you have, you could never get an NPS larger than the number of cache-lines you can load per second.
This is exactly the reason why server or HEDT CPUs with 4, 6, 8 or 12 memory channels always perform better with Lazy-SMP than consumer CPUs with just 2 memory channels.
There must be a sweet spot in the cores-to-memory-channels ratio, which depends on the engine, I suppose.
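Some rough numbers for intuition (assuming dual-channel DDR4-3200, ~51.2 GB/s peak): 51.2 GB/s divided by 64 B per cache line gives roughly 800M cache lines per second, so with one TT probe per node the hard bandwidth ceiling would be on the order of 800 Mnps, and the latency-bound figure for dependent random accesses is far lower. Every extra channel raises that ceiling proportionally.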
--
Srdja
Of course, the clock frequency of consumer CPUs is usually somewhat higher, so if you run with a low number of threads things could be different. With a high number of threads the memory bandwidth has a big influence on the performance of a chess engine. Maybe if you don't probe in quiescence this becomes less of an issue.
Beware also of the testing methodology. At fixed depth on my engine (Amoeba), I get poor scaling and unstable results, because once a task reaches the depth limit, it stops searching. I get better results at fixed time, where all tasks keep searching until the time limit is reached:

threads   NPS           speedup
1          1.888 Mnps   x 1.00
2          3.854 Mnps   x 2.04
4          7.474 Mnps   x 3.96
6         10.985 Mnps   x 5.81
hgm wrote: ↑Thu Mar 23, 2023 9:12 am
DRAM access for TT probing can be a bottleneck, not? No matter how many cores you have, you could never get an NPS larger than the number of cache-lines you can load per second.
You can prefetch to hide memory latencies? But yes.
But probing the TT is not going to run into memory bandwidth limitations, and if multiple cores or threads are probing the TT that probably does not measurably increase memory access latency.
Slowdown for >4 threads is expected because the processor only has 4 cores.
Slowdown for 2-4 threads probably is due to real or false sharing.
(I suppose it could also have to do with L2 cache being shared by 4 active cores, but this doesn't seem to be an issue for most engines.)
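For reference, the standard fix for false sharing is to give each thread's hot data its own cache line; a minimal sketch, with made-up names:

#include <cstdint>

// False sharing: two threads write variables that happen to live in the
// same 64-byte cache line, so the line ping-pongs between cores.
// alignas(64) gives each thread's hot counters their own line and pads
// the struct to a full 64 bytes.
struct alignas(64) ThreadStats {
    uint64_t nodes = 0;       // written on every node searched
};

ThreadStats stats[64];        // one slot per search thread, no shared lines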
On Linux you can use "perf c2c" to figure out the addresses where sharing occurs (see "man perf-c2c").
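In practice that is two steps: run "perf c2c record -- <your engine command>" while it searches, then "perf c2c report" to see the contended cache lines and the offsets within them.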
syzygy wrote: ↑Sun Mar 26, 2023 12:59 am
and if multiple cores or threads are probing the TT that probably does not measurably increase memory access latency.
Or is this wrong if the cores are accessing the same memory channel?
Does someone know how this works? Are memory accesses effectively pipelined by the memory controller, so that access latencies overlap? Or does the memory controller handle simultaneous cache misses by different cores/threads strictly sequentially?
I use Linux Mint, but there is no "perf". I tried to install it, without success; I got an error about "broken packages". I am certainly not a Linux expert; I installed Mint given its good reputation. Which distro do you use?