I used the following parameters to test the MP performance of the Stockfish 1.8 x64 version. The test searches from the start position to depth 25.
8 cores:
const int MAX_THREADS = 8;
const int ACTIVE_SPLIT_POINTS_MAX = 8;
16 cores:
const int MAX_THREADS = 16;
const int ACTIVE_SPLIT_POINTS_MAX = 16;
With these settings I got the following time speedups:
1 core: 1
8 cores: 5.98
16 cores: 9.11
The performance is very good.
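To put the speedups above in context, the sketch below computes parallel efficiency (speedup divided by core count) and the Karp-Flatt estimate of the serial fraction. The function names are hypothetical, and applying Karp-Flatt to a parallel alpha-beta search is only a rough assumption, since extra cores also add search overhead.

```cpp
#include <cassert>

// Parallel efficiency: measured speedup divided by the number of cores.
// A value of 1.0 would mean perfect linear scaling.
double parallel_efficiency(double speedup, int cores) {
    assert(cores > 0);
    return speedup / cores;
}

// Karp-Flatt metric: experimentally determined serial fraction,
// e = (1/S - 1/p) / (1 - 1/p), defined for p > 1 cores.
double karp_flatt(double speedup, int cores) {
    assert(cores > 1 && speedup > 0.0);
    const double invP = 1.0 / cores;
    return (1.0 / speedup - invP) / (1.0 - invP);
}
```

With the figures from the post, `parallel_efficiency(5.98, 8)` is about 0.75 and `parallel_efficiency(9.11, 16)` is about 0.57, while `karp_flatt` gives roughly 0.05 in both cases.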
The mp performance of stockfish 1.8
Last edited by liuzy on Sat Jul 03, 2010 6:33 pm, edited 1 time in total.
-
- Posts: 20943
- Joined: Mon Feb 27, 2006 7:30 pm
- Location: Birmingham, AL
Re: The mp performance of stockfish 1.8
liuzy wrote:
I use the following parameters to test the MP performance of the Stockfish 1.8 x64 version; the test is from the start position, going to depth 25.
16 cores:
const int MAX_THREADS = 16;
const int ACTIVE_SPLIT_POINTS_MAX = 16;
8 cores:
const int MAX_THREADS = 8;
const int ACTIVE_SPLIT_POINTS_MAX = 8;
and get the following time speedup data:
---------------------------------------------------
1 core    8 cores    16 cores
---------------------------------------------------
1         5.98       9.11
---------------------------------------------------
The performance is very good.

Certainly "ACTIVE_SPLIT_POINTS_MAX" has to be way too low. On 8 cores I see far larger numbers:
time=2:40 mat=0 n=3204930194 fh=89% nps=19.9M
extensions=29.1M qchecks=78.4M reduced=362.2M pruned=801.2M
predicted=25 evals=1103.7M 50move=1 EGTBprobes=0 hits=0
SMP-> splits=175852 aborts=24219 data=68/512 elap=2:40
That was a 30 10 game on ICC. The "data=" field says there are 512 split blocks available and that, at some point, 68 were in use. Allowing one active split point per thread seems way wrong, unless I misunderstand what the term means in Stockfish.
-
- Posts: 481
- Joined: Thu Apr 16, 2009 12:00 pm
- Location: Slovakia, EU
Re: The mp performance of stockfish 1.8
I guess the meaning of "ACTIVE_SPLIT_POINTS_MAX" is per thread, otherwise when a master has finished his work it would be unable to help its slaves.
Richard
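The per-thread reading of the limit can be sketched as follows. `MAX_THREADS` and `ACTIVE_SPLIT_POINTS_MAX` come from the first post; the `SplitPoint` fields, the `SplitPointPool` type and its methods are hypothetical simplifications, not Stockfish's actual code. Each thread has its own small stack of split points, so a master that goes on to help its slaves can still open a nested split point of its own.

```cpp
#include <cassert>

const int MAX_THREADS = 8;
const int ACTIVE_SPLIT_POINTS_MAX = 8;

// Hypothetical, heavily simplified stand-in for the engine's split point data.
struct SplitPoint {
    int master = -1;  // thread that created this split point
    int depth = 0;    // remaining search depth at the split
};

// One fixed-size stack of split points per thread (assumed layout).
struct SplitPointPool {
    SplitPoint stack[MAX_THREADS][ACTIVE_SPLIT_POINTS_MAX];
    int active[MAX_THREADS] = {};

    // Returns the new split point, or nullptr if this thread's stack is full.
    SplitPoint* split(int master, int depth) {
        assert(master >= 0 && master < MAX_THREADS);
        if (active[master] >= ACTIVE_SPLIT_POINTS_MAX)
            return nullptr;  // limit reached for this thread only
        SplitPoint* sp = &stack[master][active[master]++];
        sp->master = master;
        sp->depth = depth;
        return sp;
    }

    // Pops the most recent split point of the given thread.
    void finish(int master) {
        assert(master >= 0 && master < MAX_THREADS);
        assert(active[master] > 0);
        --active[master];
    }
};
```

Under this reading, the 8-core settings allow up to 8 x 8 = 64 split points in total, not 8.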
-
- Posts: 2684
- Joined: Sat Jun 14, 2008 9:17 pm
Re: The mp performance of stockfish 1.8
rvida wrote:
I guess the meaning of "ACTIVE_SPLIT_POINTS_MAX" is per thread, otherwise when a master has finished his work it would be unable to help its slaves.
Richard

Yes Richard, correct.
liuyuan, what value have you used for the "Minimum Split Depth" UCI parameter in the above cases?
Re: The mp performance of stockfish 1.8
Quote:
liuyuan, what value have you used for the "Minimum Split Depth" UCI parameter in the above cases?

Default.
Do you have a better suggestion? I can run the test again.
BTW, it seems that the 32-core version can't run correctly; I got a segmentation fault. I'm not quite sure why, maybe it is caused by the OS.
-
- Posts: 20943
- Joined: Mon Feb 27, 2006 7:30 pm
- Location: Birmingham, AL
Re: The mp performance of stockfish 1.8
rvida wrote:
I guess the meaning of "ACTIVE_SPLIT_POINTS_MAX" is per thread, otherwise when a master has finished his work it would be unable to help its slaves.
Richard

Even "per thread" that would seem to be low. I think that for 8 cores use 256 split blocks. One real issue is that on AMD boxes, which are NUMA, you want to have each processor able to grab split blocks that are in local memory, rather than remote memory (which can be 2x-3x slower).
-
- Posts: 481
- Joined: Thu Apr 16, 2009 12:00 pm
- Location: Slovakia, EU
Re: The mp performance of stockfish 1.8
bob wrote:
Even "per thread" that would seem to be low. I think that for 8 cores use 256 split blocks. One real issue is that on AMD boxes, which are NUMA, you want to have each processor able to grab split blocks that are in local memory, rather than remote memory (which can be 2x-3x slower).

Excuse me for my ignorance, but I don't really understand what you are arguing about. We were talking about the _number_ of split points (per thread). It has nothing to do with memory/cache locality. Btw, my feeling is that instead of having a uniform array of NxM "split blocks" (in your terminology) allocated at program startup, it is more efficient if each of the N threads allocates its own array of M split points.
Richard
-
- Posts: 20943
- Joined: Mon Feb 27, 2006 7:30 pm
- Location: Birmingham, AL
Re: The mp performance of stockfish 1.8
rvida wrote:
Excuse me for my ignorance, but I don't really understand what you are arguing about. We were talking about the _number_ of split points (per thread). It has nothing to do with memory/cache locality. Btw, my feeling is that instead of having a uniform array of NxM "split blocks" (in your terminology) allocated at program startup, it is more efficient if each of the N threads allocates its own array of M split points.
Richard

That is exactly what I was talking about. But you probably don't understand the NUMA issue. It doesn't matter how you "allocate" the split blocks. Each CPU on a NUMA box has local memory, which is very fast, and all other memory is remote, which is at least 2x as slow (on a 2-way NUMA box). On a 4-CPU box what you get is one bank of fast memory, two banks that are 2x slower, and a final quarter of memory that is 3x slower (talking AMD here). All you have to do is have each individual thread "zero" its own split blocks, no matter how they are allocated. They can be globally defined in a .c file, or they can be malloc()'ed. What counts is which thread first "touches" a block: the block is faulted into the physical memory of the node that first touches it.
Nothing in my comments referenced cache/memory.
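The "each thread zeroes its own split blocks" idea can be sketched roughly as below. `SplitBlock` and `first_touch_init` are hypothetical names, the 4 KiB block size is chosen only so that one block matches a typical page, and the sketch assumes an OS with a first-touch placement policy and worker threads that stay on the node where they ran the initialization. It does not pin threads or query the NUMA topology.

```cpp
#include <cassert>
#include <cstddef>
#include <cstring>
#include <thread>
#include <vector>

// Hypothetical split block; sized to one 4 KiB page for illustration only.
struct SplitBlock {
    unsigned char data[4096];
};

// Each worker zeroes only its own slice of the array. Under a first-touch
// policy, the pages of that slice are then faulted into memory local to the
// node the worker is running on, however the array itself was allocated.
void first_touch_init(SplitBlock* blocks, int threads, int blocksPerThread) {
    assert(blocks != nullptr && threads > 0 && blocksPerThread > 0);
    std::vector<std::thread> workers;
    for (int t = 0; t < threads; ++t) {
        workers.emplace_back([=] {
            SplitBlock* mine =
                blocks + static_cast<std::size_t>(t) * blocksPerThread;
            std::memset(mine, 0, sizeof(SplitBlock) * blocksPerThread);
        });
    }
    for (std::thread& w : workers)
        w.join();
}
```

In a real engine the zeroing would presumably be done by the search threads themselves at startup rather than by short-lived helper threads as here.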
-
- Posts: 2684
- Joined: Sat Jun 14, 2008 9:17 pm
Re: The mp performance of stockfish 1.8
liuzy wrote:
Default.
Do you have a better suggestion? I can run the test again.
BTW, it seems that the 32-core version can't run correctly; I got a segmentation fault. I'm not quite sure why, maybe it is caused by the OS.

It should increase with the number of cores. Try setting it to 7 (the max value) when testing with 16 or even 8 cores.
-
- Posts: 2684
- Joined: Sat Jun 14, 2008 9:17 pm
Re: The mp performance of stockfish 1.8
rvida wrote:
Btw my feeling is that instead of having a uniform array of NxM "split blocks" (according to your terminology) allocated at program startup, it is more efficient if each of the N threads has an array of M split points allocated by themselves.

Why do you think this? Access to the SplitPoint struct is not very frequent, and in any case, because the SplitPoint struct is big, even in a uniform array the elements end up more than a cache line apart, so as long as each thread accesses its own SplitPoint set there should be no cache contention.
Or am I missing something?
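The cache-line argument above can be checked with a small sketch. The 64-byte line size, the 200-byte payload and the helper names are assumptions for illustration; only `MAX_THREADS`, `ACTIVE_SPLIT_POINTS_MAX` and the `SplitPoint` name come from the thread. Note that size alone leaves open the case where the end of one thread's row and the start of the next share a line, so this sketch also aligns the struct to the line size.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>

const int MAX_THREADS = 8;
const int ACTIVE_SPLIT_POINTS_MAX = 8;
const std::size_t CACHE_LINE = 64;  // assumed line size

// Hypothetical large split point, aligned so its size is a multiple of a line.
struct alignas(64) SplitPoint {
    char payload[200];
};
static_assert(sizeof(SplitPoint) % CACHE_LINE == 0,
              "alignment pads the struct to whole cache lines");

// Uniform NxM array: row t belongs to thread t.
SplitPoint SplitPointStack[MAX_THREADS][ACTIVE_SPLIT_POINTS_MAX];

// Index of the first cache line covered by a thread's row.
std::uintptr_t first_line(int thread) {
    assert(thread >= 0 && thread < MAX_THREADS);
    return reinterpret_cast<std::uintptr_t>(&SplitPointStack[thread][0]) / CACHE_LINE;
}

// Index of the last cache line covered by a thread's row.
std::uintptr_t last_line(int thread) {
    assert(thread >= 0 && thread < MAX_THREADS);
    std::uintptr_t begin = reinterpret_cast<std::uintptr_t>(&SplitPointStack[thread][0]);
    return (begin + sizeof(SplitPointStack[thread]) - 1) / CACHE_LINE;
}

// True if the rows of two threads touch at least one common cache line.
bool rows_share_cache_line(int a, int b) {
    return !(last_line(a) < first_line(b) || last_line(b) < first_line(a));
}
```

With this layout no two threads' rows share a line, so false sharing is ruled out even though everything lives in one uniform array.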