The mp performance of stockfish 1.8

liuzy · Post by **liuzy** » Sat Jul 03, 2010 6:28 pm

I used the following parameters to test the MP performance of the Stockfish 1.8 x64 version. The test searches from the start position to depth 25.

8 cores:
const int MAX_THREADS = 8;
const int ACTIVE_SPLIT_POINTS_MAX = 8;

16 cores:
const int MAX_THREADS = 16;
const int ACTIVE_SPLIT_POINTS_MAX = 16;

and got the following time-to-depth speedups:
1 core: 1
8 cores: 5.98
16 cores: 9.11

The performance is very good.
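
For reference, dividing each speedup by the core count gives the parallel efficiency: about 0.75 on 8 cores and about 0.57 on 16. A minimal sketch of that arithmetic follows; the function name is only illustrative and is not part of Stockfish.

```cpp
#include <cassert>

// Parallel efficiency: measured speedup divided by the number of cores.
// A value of 1.0 would mean perfect linear scaling.
double parallel_efficiency(double speedup, int cores) {
    assert(cores > 0);
    return speedup / cores;
}
```

Calling `parallel_efficiency(5.98, 8)` and `parallel_efficiency(9.11, 16)` reproduces the two figures above from the posted measurements.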

bob · Post by **bob** » Sat Jul 03, 2010 6:33 pm

liuzy wrote:I used the following parameters to test the MP performance of the Stockfish 1.8 x64 version. The test searches from the start position to depth 25.

16 cores:
const int MAX_THREADS = 16;
const int ACTIVE_SPLIT_POINTS_MAX = 16;

8 cores:
const int MAX_THREADS = 8;
const int ACTIVE_SPLIT_POINTS_MAX = 8;

and get the following time speedup data:
---------------------------------------------------
1 core    8 cores    16 cores
---------------------------------------------------
1         5.98       9.11
---------------------------------------------------

The performance is very good.

Certainly "ACTIVE_SPLIT_POINTS_MAX" has to be way too low. On 8 cores I see far larger numbers:

time=2:40 mat=0 n=3204930194 fh=89% nps=19.9M
extensions=29.1M qchecks=78.4M reduced=362.2M pruned=801.2M
predicted=25 evals=1103.7M 50move=1 EGTBprobes=0 hits=0
SMP-> splits=175852 aborts=24219 data=68/512 elap=2:40

That was a 30 10 game on ICC. The "data=" field says there are 512 split blocks available and that, at some point, 68 were in use. Allowing one active split point per thread seems way wrong, unless I misunderstand what the term means in Stockfish.

rvida · Post by **rvida** » Sat Jul 03, 2010 6:48 pm

I guess the meaning of "ACTIVE_SPLIT_POINTS_MAX" is per thread, otherwise when a master has finished his work it would be unable to help its slaves.
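
To make the "per thread" reading concrete, here is a rough sketch of a per-thread cap. The `ThreadState` struct and the three helper functions are hypothetical and are not taken from the Stockfish source; only the two constant names come from this thread.

```cpp
#include <cassert>

constexpr int MAX_THREADS = 8;
constexpr int ACTIVE_SPLIT_POINTS_MAX = 8;

// Hypothetical per-thread bookkeeping, not Stockfish's actual data structure.
struct ThreadState {
    int activeSplitPoints = 0;  // split points this thread currently owns
};

ThreadState threads[MAX_THREADS];

// The cap applies to each thread separately, not to the engine as a whole.
bool can_split(int threadId) {
    assert(threadId >= 0 && threadId < MAX_THREADS);
    return threads[threadId].activeSplitPoints < ACTIVE_SPLIT_POINTS_MAX;
}

// Returns false when the calling thread has already reached its own cap.
bool open_split_point(int threadId) {
    if (!can_split(threadId)) return false;
    ++threads[threadId].activeSplitPoints;
    return true;
}

void close_split_point(int threadId) {
    assert(threadId >= 0 && threadId < MAX_THREADS);
    assert(threads[threadId].activeSplitPoints > 0);
    --threads[threadId].activeSplitPoints;
}
```

With a per-thread counter like this, a master that is waiting at one split point can still pick up work elsewhere and split again, because its limit does not depend on what the other threads have open.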

Richard

mcostalba · Post by **mcostalba** » Sat Jul 03, 2010 7:59 pm

rvida wrote:I guess the meaning of "ACTIVE_SPLIT_POINTS_MAX" is per thread, otherwise when a master has finished his work it would be unable to help its slaves.

Richard

Yes Richard, correct.

liuyuan, what value did you use for the "Minimum Split Depth" UCI parameter in the above cases?

liuzy · Post by **liuzy** » Sat Jul 03, 2010 9:12 pm

mcostalba wrote:liuyuan, what value did you use for the "Minimum Split Depth" UCI parameter in the above cases?

Default.
Do you have a better suggestion? I can run the test again.
BTW, it seems that the 32-core version can't run correctly: I got a segmentation fault. I'm not quite sure why; maybe it is caused by the OS.

bob · Post by **bob** » Sun Jul 04, 2010 2:49 am

rvida wrote:I guess the meaning of "ACTIVE_SPLIT_POINTS_MAX" is per thread, otherwise when a master has finished his work it would be unable to help its slaves.

Richard

Even "per thread" that would seem to be low. I think that for 8 cores use 256 split blocks. One real issue is that on AMD boxes, which are NUMA, you want to have each processor able to grab split blocks that are in local memory, rather than remote memory (which can be 2x-3x slower).

rvida · Post by **rvida** » Sun Jul 04, 2010 3:17 am

bob wrote:
rvida wrote:I guess the meaning of "ACTIVE_SPLIT_POINTS_MAX" is per thread, otherwise when a master has finished his work it would be unable to help its slaves.

Richard
Even "per thread" that would seem to be low. I think that for 8 cores use 256 split blocks. One real issue is that on AMD boxes, which are NUMA, you want to have each processor able to grab split blocks that are in local memory, rather than remote memory (which can be 2x-3x slower).

Excuse me for my ignorance, but I don't really understand what you are arguing about. We were talking about the _number_ of split points (per thread). It has nothing to do with memory/cache locality. Btw my feeling is that instead of having a uniform array of NxM "split blocks" (according to your terminology) allocated at program startup, it is more efficient if each of the N threads allocates its own array of M split points.

Richard

bob · Post by **bob** » Sun Jul 04, 2010 3:54 am

rvida wrote:
bob wrote:
rvida wrote:I guess the meaning of "ACTIVE_SPLIT_POINTS_MAX" is per thread, otherwise when a master has finished his work it would be unable to help its slaves.

Richard
Even "per thread" that would seem to be low. I think that for 8 cores use 256 split blocks. One real issue is that on AMD boxes, which are NUMA, you want to have each processor able to grab split blocks that are in local memory, rather than remote memory (which can be 2x-3x slower).
Excuse me for my ignorance, but I don't really understand what you are arguing about. We were talking about the _number_ of split points (per thread). It has nothing to do with memory/cache locality. Btw my feeling is that instead of having a uniform array of NxM "split blocks" (according to your terminology) allocated at program startup, it is more efficient if each of the N threads allocates its own array of M split points.

Richard

That is exactly what I was talking about. But you probably don't understand the NUMA issue. It doesn't matter how you "allocate" the split blocks. Each CPU on a NUMA box has local memory, which is very fast, and all other memory is remote, which is at least 2x slower (on a 2-way NUMA box). On a 4-CPU box you get one bank of fast memory, two banks that are 2x slower, and the final 1/4 of memory is 3x slower (talking AMD here). All you have to do is have each individual thread "zero" its own split blocks, no matter how they are allocated. They can be globally defined in a .c file, or they can be malloc()'ed. What counts is which thread first "touches" a block: the block is faulted into the physical memory of the node that first touches it.
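
A minimal sketch of the "each thread zeroes its own split blocks" idea, assuming an OS with a first-touch page placement policy. `SplitBlock`, `kBlocksPerThread` and `first_touch_init` are made-up names for illustration, and pinning each worker to a CPU is platform-specific and deliberately left out.

```cpp
#include <cassert>
#include <cstring>
#include <thread>
#include <vector>

// Hypothetical stand-in for an engine's split block; a real one holds search state.
struct SplitBlock {
    char payload[4096];
};

constexpr int kBlocksPerThread = 4;  // illustrative value only

// Each worker zeroes only its own slice of the array. Under a first-touch
// policy, the pages of that slice are then backed by memory on the node
// where the worker was running when it wrote to them.
void first_touch_init(SplitBlock* blocks, int numThreads) {
    assert(blocks != nullptr && numThreads > 0);
    std::vector<std::thread> workers;
    for (int t = 0; t < numThreads; ++t) {
        workers.emplace_back([blocks, t] {
            std::memset(blocks + t * kBlocksPerThread, 0,
                        sizeof(SplitBlock) * kBlocksPerThread);
        });
    }
    for (auto& w : workers) w.join();
}
```

For this to have any effect, the memory must not have been written before the workers run, each worker has to stay on the node it later searches on, and each thread's slice should cover whole pages, since placement is decided per page rather than per block.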

Nothing in my comments referenced cache/memory.

mcostalba · Post by **mcostalba** » Sun Jul 04, 2010 8:40 am

liuzy wrote:
liuyuan, what value did you use for the "Minimum Split Depth" UCI parameter in the above cases?
Default.
Do you have a better suggestion? I can run the test again.
BTW, it seems that the 32-core version can't run correctly: I got a segmentation fault. I'm not quite sure why; maybe it is caused by the OS.

It should increase with the number of cores. Try setting it to 7 (the max value) when testing with 16 or even 8 cores.

mcostalba · Post by **mcostalba** » Sun Jul 04, 2010 8:44 am

rvida wrote: Btw my feeling is that instead of having a uniform array of NxM "split blocks" (according to your terminology) allocated at program startup, it is more efficient if each of the N threads allocates its own array of M split points.

Why do you think this? Access to the SplitPoint struct is not very frequent, and in any case the SplitPoint struct is big, so even in a uniform array the entries end up more than a cache line apart. As long as each thread accesses its own SplitPoint set, there should be no cache contention.

Or am I missing something ?
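
To illustrate the layout I mean, here is a simplified sketch. The fields of `SplitPoint`, the array name `SplitPointStack`, the 64-byte line size and the helper `row_distance` are all assumptions made for the example and do not reproduce the real definitions.

```cpp
#include <cassert>
#include <cstddef>

constexpr std::size_t kCacheLine = 64;  // assumed line size, typical for x86-64
constexpr int MAX_THREADS = 8;
constexpr int ACTIVE_SPLIT_POINTS_MAX = 8;

// Hypothetical, heavily simplified stand-in for the real struct.
struct alignas(kCacheLine) SplitPoint {
    int depth;
    int moveCount;
    char rest[200];  // placeholder for the remaining search state
};

// alignas pads the struct so that no two entries share a cache line.
static_assert(sizeof(SplitPoint) % kCacheLine == 0,
              "each entry should occupy whole cache lines");

// One uniform N x M array; row t is used only by thread t.
SplitPoint SplitPointStack[MAX_THREADS][ACTIVE_SPLIT_POINTS_MAX];

// Byte distance between the first entries of two threads' rows.
std::size_t row_distance(int a, int b) {
    assert(a >= 0 && a < MAX_THREADS && b >= 0 && b < MAX_THREADS);
    const char* pa = reinterpret_cast<const char*>(&SplitPointStack[a][0]);
    const char* pb = reinterpret_cast<const char*>(&SplitPointStack[b][0]);
    return static_cast<std::size_t>(pa < pb ? pb - pa : pa - pb);
}
```

In this sketch the explicit `alignas` guarantees the separation; without it, a large struct would still keep most accesses on different lines, but the last line of one entry could be shared with the first line of the next.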
