The mp performance of stockfish 1.8

liuzy

The mp performance of stockfish 1.8

Post by liuzy »

I used the following parameters to test the MP performance of the Stockfish 1.8 x64 version. The test searches from the start position to depth 25.

8 cores:
const int MAX_THREADS = 8;
const int ACTIVE_SPLIT_POINTS_MAX = 8;

16 cores:
const int MAX_THREADS = 16;
const int ACTIVE_SPLIT_POINTS_MAX = 16;

and get the following time speedup data:
1 core: 1
8 cores: 5.98
16 cores: 9.11

The performance is very good.
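
For reference, the speedups above can be converted to parallel efficiency, i.e. speedup divided by core count. This is only a minimal sketch; the function name is made up for illustration and is not part of Stockfish:

```cpp
#include <cassert>

// Parallel efficiency: measured time speedup divided by the number of cores.
// A value of 1.0 would mean perfect linear scaling.
double parallel_efficiency(double speedup, int cores) {
    assert(cores > 0);
    return speedup / cores;
}
```

With the figures above this gives roughly 0.75 on 8 cores (5.98 / 8) and roughly 0.57 on 16 cores (9.11 / 16), so the second 8 cores contribute noticeably less than the first 8.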
Last edited by liuzy on Sat Jul 03, 2010 6:33 pm, edited 1 time in total.
bob
Posts: 20943
Joined: Mon Feb 27, 2006 7:30 pm
Location: Birmingham, AL

Re: The mp performance of stockfish 1.8

Post by bob »

liuzy wrote:I use the parameters as follow to test the mp performance of stockfish 1.8 x64 version, the test is from the start position, go to depth 25.

16 cores:
const int MAX_THREADS = 16;
const int ACTIVE_SPLIT_POINTS_MAX = 16;

8 cores:
const int MAX_THREADS = 8;
const int ACTIVE_SPLIT_POINTS_MAX = 8;

and get the following time speedup data:
---------------------------------------------------
1 core 8 cores 16 cores
---------------------------------------------------
1 5.98 9.11
---------------------------------------------------

The performance is very good.
Certainly the "ACTIVE_SPLIT_POINTS_MAX" value has to be way too low. On 8 cores I see far larger numbers:

time=2:40 mat=0 n=3204930194 fh=89% nps=19.9M
extensions=29.1M qchecks=78.4M reduced=362.2M pruned=801.2M
predicted=25 evals=1103.7M 50move=1 EGTBprobes=0 hits=0
SMP-> splits=175852 aborts=24219 data=68/512 elap=2:40

That was a 30 10 game on ICC. The "data=" field says there are 512 split blocks available and that, at some point, 68 were in use. Allowing one active split point per thread seems way wrong, unless I misunderstand what the term means in Stockfish.
rvida
Posts: 481
Joined: Thu Apr 16, 2009 12:00 pm
Location: Slovakia, EU

Re: The mp performance of stockfish 1.8

Post by rvida »

I guess the meaning of "ACTIVE_SPLIT_POINTS_MAX" is per thread, otherwise when a master has finished his work it would be unable to help its slaves.

Richard
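
To make the "per thread" reading concrete, here is a rough sketch of a per-thread limit. It assumes that a master which helps its slaves may split again while its earlier split point is still open, so it needs more than one slot. The struct layout, `ThreadSplitStack`, `push` and `pop` are hypothetical names for illustration, not Stockfish's actual code:

```cpp
#include <cassert>

const int MAX_THREADS = 8;
const int ACTIVE_SPLIT_POINTS_MAX = 8;

// Hypothetical, stripped-down split point; a real one holds far more state.
struct SplitPoint {
    int depth = 0;
    int master = -1;
};

// One fixed-size stack of split points per thread, so the limit applies
// to each thread separately rather than to the engine as a whole.
struct ThreadSplitStack {
    SplitPoint stack[ACTIVE_SPLIT_POINTS_MAX];
    int active = 0;  // split points this thread currently owns as master

    // Returns false when this thread has reached its own limit.
    bool push(int depth, int master) {
        if (active >= ACTIVE_SPLIT_POINTS_MAX)
            return false;
        stack[active].depth = depth;
        stack[active].master = master;
        ++active;
        return true;
    }

    void pop() {
        assert(active > 0);
        --active;
    }
};

ThreadSplitStack splitStacks[MAX_THREADS];
```

In this sketch thread 0 can hold up to 8 nested split points of its own, and filling its stack does not stop thread 1 from splitting.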
mcostalba
Posts: 2684
Joined: Sat Jun 14, 2008 9:17 pm

Re: The mp performance of stockfish 1.8

Post by mcostalba »

rvida wrote:I guess the meaning of "ACTIVE_SPLIT_POINTS_MAX" is per thread, otherwise when a master has finished his work it would be unable to help its slaves.

Richard
Yes Richard, correct.

liuyuan, what value did you use for the "Minimum Split Depth" UCI parameter in the above cases?
liuzy

Re: The mp performance of stockfish 1.8

Post by liuzy »

liuyuan, what value have you used for "Minimum Split Depth" UCI parameter in the above cases ?
Default.
Do you have a better suggestion? I can run the test again.
BTW, it seems that the 32-core version can't run correctly: I got a segmentation fault. I'm not quite sure why; maybe it is caused by the OS.
bob
Posts: 20943
Joined: Mon Feb 27, 2006 7:30 pm
Location: Birmingham, AL

Re: The mp performance of stockfish 1.8

Post by bob »

rvida wrote:I guess the meaning of "ACTIVE_SPLIT_POINTS_MAX" is per thread, otherwise when a master has finished his work it would be unable to help its slaves.

Richard
Even "per thread" that would seem to be low. I think that for 8 cores use 256 split blocks. One real issue is that on AMD boxes, which are NUMA, you want each processor to be able to grab split blocks that are in local memory rather than remote memory (which can be 2x-3x slower).
rvida
Posts: 481
Joined: Thu Apr 16, 2009 12:00 pm
Location: Slovakia, EU

Re: The mp performance of stockfish 1.8

Post by rvida »

bob wrote:
rvida wrote:I guess the meaning of "ACTIVE_SPLIT_POINTS_MAX" is per thread, otherwise when a master has finished his work it would be unable to help its slaves.

Richard
Even "per thread" that would seem to be low. I think that for 8 cores use 256 split blocks. One real issue is that on AMD boxes, which are NUMA, you want to have each processor able to grab split blocks that are in local memory, rather than remote memory (which can be 2x-3x slower.
Excuse me for my ignorance, but I don't really understand what you are arguing about. We were talking about the _number_ of split points (per thread). It has nothing to do with memory/cache locality. Btw, my feeling is that instead of having a uniform array of NxM "split blocks" (in your terminology) allocated at program startup, it is more efficient if each of the N threads allocates its own array of M split points.

Richard
bob
Posts: 20943
Joined: Mon Feb 27, 2006 7:30 pm
Location: Birmingham, AL

Re: The mp performance of stockfish 1.8

Post by bob »

rvida wrote:
bob wrote:
rvida wrote:I guess the meaning of "ACTIVE_SPLIT_POINTS_MAX" is per thread, otherwise when a master has finished his work it would be unable to help its slaves.

Richard
Even "per thread" that would seem to be low. I think that for 8 cores use 256 split blocks. One real issue is that on AMD boxes, which are NUMA, you want to have each processor able to grab split blocks that are in local memory, rather than remote memory (which can be 2x-3x slower.
Excuse me for my ignorance, but I don't really understand what You are
arguing about. We were talking about the _number_ of split points (per thread). It has nothing to do with memory/cache locality. Btw my feeling is that instead of having an uniform array of NxM "split blocks" (according to Your terminology) allocated at program startup, it is more efficient if each of the N threads has an array of M split points allocated by themselves.

Richard
That is exactly what I was talking about. But you probably don't understand the NUMA issue. It doesn't matter how you "allocate" the split blocks. Each CPU on a NUMA box has local memory, which is very fast, and all other memory is remote, which is at least 2x as slow (on a 2-way NUMA box). On a 4-CPU box what you get is one bank of fast memory, two banks that are 2x slower, and a final 1/4 of memory that is 3x slower (talking AMD here). All you have to do is have each individual thread "zero" its own split blocks, no matter how they are allocated. They can be globally defined in a .c file, or they can be malloc()'ed. What counts is which thread first "touches" a block: the block is faulted into the physical memory of the node that first touches it.
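
A minimal sketch of the "each thread zeroes its own split blocks" idea follows. It assumes an OS with a first-touch page placement policy and threads that stay on the node where they did the zeroing. `SplitBlock`, `BLOCKS_PER_THREAD` and `first_touch_init` are made-up names, and the block size is only illustrative:

```cpp
#include <cassert>
#include <cstring>
#include <thread>
#include <vector>

// Hypothetical split block, sized to one 4 KB page for illustration.
struct SplitBlock {
    char data[4096];
};

const int BLOCKS_PER_THREAD = 32;  // illustrative value, not from any engine

// Each worker zeroes only its own slice of the array, so that worker is the
// first to write to those pages. Under a first-touch policy (an assumption
// about the OS) the pages are then placed on that worker's NUMA node.
void first_touch_init(SplitBlock* blocks, int threads) {
    assert(blocks != nullptr && threads > 0);
    std::vector<std::thread> workers;
    for (int t = 0; t < threads; ++t) {
        workers.emplace_back([blocks, t] {
            std::memset(blocks + t * BLOCKS_PER_THREAD, 0,
                        sizeof(SplitBlock) * BLOCKS_PER_THREAD);
        });
    }
    for (auto& w : workers)
        w.join();
}
```

For this to have any effect the memory must not have been written before the workers run (so no zero-filling at allocation time), and in a real engine the zeroing would be done by the long-lived search threads themselves rather than by short-lived helpers as in this sketch.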

Nothing in my comments referenced cache/memory.
mcostalba
Posts: 2684
Joined: Sat Jun 14, 2008 9:17 pm

Re: The mp performance of stockfish 1.8

Post by mcostalba »

liuzy wrote:
liuyuan, what value have you used for "Minimum Split Depth" UCI parameter in the above cases ?
Default.
Do you have better suggestion, I can run the test again.
BTW, it seems that the 32 core version cann't run correctly, I got segment fault. I'm not quite sure, maybe it caused by the OS.
It should increase with the number of cores. Try setting it to 7 (the max value) when testing with 16 or even 8 cores.
mcostalba
Posts: 2684
Joined: Sat Jun 14, 2008 9:17 pm

Re: The mp performance of stockfish 1.8

Post by mcostalba »

rvida wrote: Btw my feeling is that instead of having an uniform array of NxM "split blocks" (according to Your terminology) allocated at program startup, it is more efficient if each of the N threads has an array of M split points allocated by themselves.
Why do you think this? Access to the SplitPoint struct is not very frequent and, anyhow, because the SplitPoint struct is big, even in a uniform array the entries end up more than a cache line apart, so as long as each thread accesses its own SplitPoint set there should be no cache contention.

Or am I missing something?
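
One way to check the cache-line argument is sketched below, assuming 64-byte lines and a made-up SplitPoint payload. A struct bigger than a line guarantees that the start addresses are more than a line apart, but the tail of one entry can still share a line with the head of the next unless the struct is aligned to the line size, which `alignas` takes care of here:

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>

const std::size_t CACHE_LINE = 64;  // assumed line size, typical for x86-64

// Hypothetical stand-in for the real struct; alignas pads its size up to a
// multiple of the line size and aligns every array element to a line start.
struct alignas(CACHE_LINE) SplitPoint {
    char payload[500];
};

static_assert(sizeof(SplitPoint) % CACHE_LINE == 0,
              "size must be a whole number of cache lines");

// Index of the cache line that contains address p.
std::uintptr_t line_of(const void* p) {
    assert(p != nullptr);
    return reinterpret_cast<std::uintptr_t>(p) / CACHE_LINE;
}

// True when the last byte of a and the first byte of b are in different
// cache lines, i.e. the two neighbouring entries cannot falsely share a line.
bool no_shared_line(const SplitPoint& a, const SplitPoint& b) {
    const char* lastOfA =
        reinterpret_cast<const char*>(&a) + sizeof(SplitPoint) - 1;
    return line_of(lastOfA) != line_of(&b);
}
```

With this layout, the last entry of one thread's row in an NxM array and the first entry of the next thread's row never share a line, so the uniform array and the per-thread arrays behave the same as far as false sharing is concerned.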