The mp performance of stockfish 1.8

Discussion of chess software programming and technical issues.

Moderators: hgm, Rebel, chrisw

rvida
Posts: 481
Joined: Thu Apr 16, 2009 12:00 pm
Location: Slovakia, EU

Re: The mp performance of stockfish 1.8

Post by rvida »

As for cache, you are right of course. What I wrote applies only to AMD NUMA systems (as Bob mentioned), where each node has its own memory controller. Accessing data allocated by a thread running on another node has somewhat higher latency. But as long as the data is still in the cache, it doesn't matter either.
rvida
Posts: 481
Joined: Thu Apr 16, 2009 12:00 pm
Location: Slovakia, EU

Re: The mp performance of stockfish 1.8

Post by rvida »

bob wrote: All you have to do is have each individual thread "zero" its own split blocks, no matter how they are allocated. They can be globally defined in a .c file, or they can be malloc()'ed. What counts is which thread first "touches" a block. The block is faulted into the physical memory of the node that first touches it.
True, if you ensure that the split blocks of different threads are at least 4 KB apart. Memory pages have a granularity of 4 KB.
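The first-touch idea discussed above can be sketched in C: pad each split block to a full 4 KB page so blocks of different threads never share a page, then have each thread zero its own blocks so the kernel faults those pages into that thread's local node memory. This is a minimal sketch, not Stockfish's actual code; SplitBlock, BLOCKS_PER_THREAD, MAX_THREADS and thread_init are illustrative names.

```c
#include <assert.h>
#include <stdalign.h>
#include <string.h>

#define PAGE_SIZE 4096
#define BLOCKS_PER_THREAD 8
#define MAX_THREADS 16

/* One page-aligned, page-sized block: two threads' blocks can never
   end up on the same memory page. */
typedef struct {
    alignas(PAGE_SIZE) unsigned char data[PAGE_SIZE];
} SplitBlock;

static SplitBlock split_blocks[MAX_THREADS][BLOCKS_PER_THREAD];

/* Each worker thread calls this itself (ideally after being pinned to
   its node); the memset() is the "first touch" that makes the kernel
   place the pages in that node's local memory. */
void thread_init(int thread_id) {
    memset(split_blocks[thread_id], 0, sizeof split_blocks[thread_id]);
}
```

Whether the blocks are global (as here) or malloc()'ed does not matter: only which thread touches a page first decides its placement.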
liuzy

Re: The mp performance of stockfish 1.8

Post by liuzy »

I set "Minimum Split Depth" to 7 for 8 cores and reran the test (I ran it 5 times, quitting and restarting the engine each time), and got a remarkable average result: the time speedup factor reaches 7.27!

Then I set "Minimum Split Depth" to 9 for 16 cores, again 5 runs, and it reaches 9.4.

It seems that the MP performance may be better when playing real games, with the help of the hash table. I don't have the tools to test it this way (Bob does), so the results are NOT very scientific. I hope Bob can run some tests on Stockfish 1.8.
bob
Posts: 20943
Joined: Mon Feb 27, 2006 7:30 pm
Location: Birmingham, AL

Re: The mp performance of stockfish 1.8

Post by bob »

rvida wrote:
bob wrote: All you have to do is have each individual thread "zero" its own split blocks, no matter how they are allocated. They can be globally defined in a .c file, or they can be malloc()'ed. What counts is which thread first "touches" a block. The block is faulted into the physical memory of the node that first touches it.
True, if you ensure that the split blocks of different threads are at least 4 KB apart. Memory pages have a granularity of 4 KB.


Or 2 MB or 4 MB if you use large pages.
bob
Posts: 20943
Joined: Mon Feb 27, 2006 7:30 pm
Location: Birmingham, AL

Re: The mp performance of stockfish 1.8

Post by bob »

mcostalba wrote:
rvida wrote: Btw, my feeling is that instead of having a uniform array of NxM "split blocks" (to use your terminology) allocated at program startup, it is more efficient if each of the N threads allocates its own array of M split points.
Why do you think this? Access to the SplitPoint struct is not very frequent, and anyhow, because the SplitPoint struct is big, even in a uniform array the entries end up more than a cache line apart, so as long as each thread accesses its own SplitPoint set there should be no cache contention.

Or am I missing something ?
The NUMA issue for AMD-based machines.

In my case, each thread _always_ works out of a "split block" which contains the local tree data for that thread. Even the initial thread in a non-SMP compile still uses a single split block that contains all local data including tree state.
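The layout described above, where all of a thread's local tree state lives in one block passed explicitly to the search routines, can be sketched as follows. This is a toy illustration under assumed names (TREE, root_block, the fields, and the counting "search"), not Crafty's or Stockfish's actual definitions.

```c
#include <assert.h>

/* All per-thread local state lives in one "split block"; search
   routines take a pointer to it instead of touching globals. */
typedef struct tree {
    int ply;                 /* current search ply */
    long nodes_searched;     /* per-thread counter, no sharing */
    struct tree *parent;     /* split point this block was spawned from */
} TREE;

/* A non-SMP build simply runs the whole search out of this single
   block, so the SMP and non-SMP code paths stay identical. */
static TREE root_block;

/* Toy "search" that only counts nodes, to show how the local state is
   threaded through the recursion via the block pointer. */
long search(TREE *tree, int depth) {
    tree->nodes_searched++;
    if (depth == 0)
        return tree->nodes_searched;
    tree->ply++;
    long result = search(tree, depth - 1);
    tree->ply--;
    return result;
}
```

Because every piece of mutable search state sits inside the block, a helper thread joining at a split point only needs its own block initialized from the parent's; nothing global has to be locked.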
bob
Posts: 20943
Joined: Mon Feb 27, 2006 7:30 pm
Location: Birmingham, AL

Re: The mp performance of stockfish 1.8

Post by bob »

mcostalba wrote:
liuzy wrote:
liuyuan, what value have you used for the "Minimum Split Depth" UCI parameter in the above cases?
Default.
If you have a better suggestion, I can run the test again.
BTW, it seems that the 32-core version can't run correctly; I got a segmentation fault. I'm not quite sure, maybe it was caused by the OS.
It should increase with the number of cores. Try setting it to 7 (the max value) when testing with 16 or even 8 cores.
Have you tried the "group concept" at all? Increasing the minimum split depth has negative effects (and, in fact, is really not that good a way to decide when to split or not). Limiting the number of threads at a single split point, however, has a real effect, particularly as you get into endgames where there are not many moves at any single node.
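A per-split-point cap of the kind described above can be sketched in a few lines. MAX_THREADS_PER_SPLIT and can_join are illustrative names, not engine code; the second condition is one plausible way to require more untried moves than workers already present.

```c
#include <assert.h>

#define MAX_THREADS_PER_SPLIT 4

/* A thread may join a split point only if the per-split cap is not yet
   reached and there are more untried moves than threads already
   working there. In endgames moves_left is small, so this gate binds
   exactly where splitting is least profitable. */
int can_join(int active_threads, int moves_left) {
    return active_threads < MAX_THREADS_PER_SPLIT
        && moves_left > active_threads;
}
```

Unlike a depth threshold, this test adapts to the position itself: a middlegame node with 35 legal moves can absorb the full group, while a pawn ending with 3 moves never will.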