info about zappa on 512 cores ?
Moderator: Ras
-
Daniel Shawul
- Posts: 4186
- Joined: Tue Mar 14, 2006 11:34 am
- Location: Ethiopia
info about zappa on 512 cores ?
Does anyone have any idea of the details of the hardware used by Zappa on 512 cores? I suppose it is a NUMA machine from SGI. By how much did it scale? What kind of performance tuning was done? Also, the old CilkChess was run on a similar SMP machine of 256 cores, but scaling was not that good IIRC. So what kind of scaling should be expected from a 256/512-core SMP machine in general? Are there success stories of a cluster chess-playing program in the past? I had a chance to try Scorpio cluster with different interconnects (InfiniBand, Quadrics, Myrinet, GigE, etc.) but no success; it performs badly above 8 processors. I am even having difficulty with scaling on NUMA SMP systems (note I haven't done any kind of optimization, not even simply allocating split blocks at "near" locations). The one I am testing on has 8 nodes with 4 cores each, with an InfiniBand interconnect. I will see if there are gains after I do the memory affinity for each thread.
-
Daniel Shawul
- Posts: 4186
- Joined: Tue Mar 14, 2006 11:34 am
- Location: Ethiopia
numa scaling
I am not getting any speedup on NUMA machines. I used 8 Opteron 8354 quad-core processors (8 x 4 = 32 cores), but Scorpio scales up to 8 cores and not above. I also tested Crafty, and it shows the same behaviour. Then, to see if the problem is NUMA, I tested on another machine: a 2 x 12-core AMD Opteron box connected by 4x QDR InfiniBand to a 2-tiered blocking switch network. Both engines scale up to 12 cores (NPS-wise) but show zero improvement from then on. I think that once NUMA starts to hit, there is no scaling. This time I rearranged the code so that local state such as the pawn hash table, eval hash table and split blocks is allocated on the node where the thread resides. I used the simple "first touch" method for that, but it didn't help.
-
bob
- Posts: 20943
- Joined: Mon Feb 27, 2006 7:30 pm
- Location: Birmingham, AL
Re: numa scaling
Daniel Shawul wrote: I am not getting any speedup on NUMA machines. [...]

When you talk about testing Crafty, can you tell me the hardware? I have run on NUMA boxes quite frequently (in fact, everything made today is NUMA). But with NUMA, terms like InfiniBand don't really fit, as those are message-passing concepts, whereas NUMA is a shared-memory architecture, just with different latencies for different parts of memory.
Crafty is certainly not going to work on a message-passing architecture that implements shared memory in a very faked way. Crafty depends on memory latency being fast. Even infiniband is anything but fast in that regard.
-
Daniel Shawul
- Posts: 4186
- Joined: Tue Mar 14, 2006 11:34 am
- Location: Ethiopia
Re: numa scaling
bob wrote: When you talk about testing Crafty, can you tell me the hardware? [...]

I have already given the hardware I used. Here is the detail for one of them:
Code: Select all
32 cores
8 sockets x 4 cores per socket (8 Opteron 8354 quad core processors)
AMD Opteron @ 2.2 GHz
Type: Compute
Notes: Compute nodes.
Memory: 128.0 GB
Local storage: 0.0 GB
each core = 512 KB L2 cache
each processor has a 2 MB L3 cache that is shared amongst the 4 cores
Operating System CentOS 5.4
Interconnect InfiniBand
Code: Select all
$ numactl --hardware
available: 8 nodes (0-7)
node 0 size: 16140 MB
node 0 free: 68 MB
node 1 size: 16160 MB
node 1 free: 6 MB
node 2 size: 16160 MB
node 2 free: 7 MB
node 3 size: 16160 MB
node 3 free: 4404 MB
node 4 size: 16160 MB
node 4 free: 6988 MB
node 5 size: 16160 MB
node 5 free: 6055 MB
node 6 size: 16160 MB
node 6 free: 6813 MB
node 7 size: 16160 MB
node 7 free: 7759 MB
node distances:
node 0 1 2 3 4 5 6 7
0: 10 20 20 20 20 20 20 20
1: 20 10 20 20 20 20 20 20
2: 20 20 10 20 20 20 20 20
3: 20 20 20 10 20 20 20 20
4: 20 20 20 20 10 20 20 20
5: 20 20 20 20 20 10 20 20
6: 20 20 20 20 20 20 10 20
7: 20 20 20 20 20 20 20 10
$ numactl --show
policy: default
preferred node: current
physcpubind: 0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31
cpubind: 0 1 2 3 4 5 6 7
nodebind: 0 1 2 3 4 5 6 7
membind: 0 1 2 3 4 5 6 7
-
Daniel Shawul
- Posts: 4186
- Joined: Tue Mar 14, 2006 11:34 am
- Location: Ethiopia
transposition tables
I tried three different transposition table schemes on the NUMA machine. The schemes, and how the probe looks in code, are given below.
Code: Select all
local - each core has its own tt and does not share it
global - they all share the tt but is centralized (one processor has it)
distributed - each processor carries a section of the tt
Code: Select all
#if TT_TYPE == 0 //global
const int pid = 0;
#elif TT_TYPE == 1 //distributed
int pid = hash_key % PROCESSOR::n_processors;
#else //local
int pid = processor_id;
#endif
register PPROCESSOR proc = processors[pid];
The distributed tt has a modulo, so it may be a bit slower. The local transposition table, while having a higher NPS, fails badly on a timed search: it reached only depth 21 in the time the others reached depth 24, so it is necessary to share entries. Of the other two, the global tt seems to perform a little better. That is after I rewrote my code to try out a distributed scheme. The YBW authors used a distributed tt scheme; though their tt entries are probed through message passing, it is the same concept. But my tests seem to suggest the global tt is better. The result could be different when the number of processors increases to 256 or so. Has anyone tried a distributed shared tt (not a distributed local tt) on NUMA machines or even clusters?
-
bob
- Posts: 20943
- Joined: Mon Feb 27, 2006 7:30 pm
- Location: Birmingham, AL
Re: numa scaling
Daniel Shawul wrote: Shouldn't this system be called NUMA? In fact numactl shows the following. [...] Both scale up to 8 cores well but nothing afterwards, which I am assuming is when it starts using the interconnect. I would expect that to be the case going from 4 to 8 cores, but there are no problems there?

Your "infiniband" says it all. It is really not a "NUMA" box in the sense I was using, i.e. AMD using their HyperTransport bus as opposed to InfiniBand. I would agree that a NUMA box using InfiniBand has a significant latency issue that I have not addressed in Crafty at all. The only NUMA-related thing I do in Crafty (which fits Intel/AMD NUMA perfectly when you are talking about single-motherboard NUMA boxes) is to make sure that the split blocks for a thread are first touched by the core that thread will run on, so that those virtual pages fault into physical pages that are local to that core. But Crafty is not designed at all for a machine with the high NUMA latency InfiniBand introduces compared to direct bus connections.
-
Daniel Shawul
- Posts: 4186
- Joined: Tue Mar 14, 2006 11:34 am
- Location: Ethiopia
Re: numa scaling
bob wrote: Your "infiniband" says it all. It is really not a "NUMA" box in the sense I was using. [...]

Well, Crafty also displays it as a NUMA machine (8 x 4 ways), so you have an inconsistency there. What is confusing is that it has no problem scaling to 8 processors even though the Opterons are 4-core chips. On another AMD machine with 24 cores (4 x 6 cores) the same thing happens after using more than 12 threads. Again, one would expect the interconnect to be used after 6 cores, but it is not... So the pattern seems to indicate that pairs of nodes are glued by something other than InfiniBand (such as a HyperTransport bus).
bob wrote: The only NUMA-related thing I do in Crafty is to make sure that the split blocks for a thread are first touched by the core that thread will run on. [...]

I am also using the implicit memory allocation method to improve NUMA performance, as suggested in the Intel manual. So I allocate the pawn tt, eval tt and split blocks local to a thread and "touch" them first with that thread. I also tried a distributed shared main hash table. This is the method suggested in the manual, and Feldmann et al. used it even for clusters using message passing. But on the NUMA machine I test on, it seems the division and modulo required for accessing a section of the hash table slow down access time a lot. Thus for now I am using a global hash table bound to one node, which performed a little better than the other methods. From a quick look at Crafty, I understand you do not distribute the tt for NUMA. There also seems to be dead code for allocating interleaved memory that doesn't get used anywhere. It looks like libnuma is used only for displaying the number of NUMA cores...
-
bob
- Posts: 20943
- Joined: Mon Feb 27, 2006 7:30 pm
- Location: Birmingham, AL
Re: numa scaling
Daniel Shawul wrote: Well, Crafty also displays it as a NUMA machine (8 x 4 ways), so you have an inconsistency there. [...] From a quick look at Crafty, I understand you do not distribute the tt for NUMA. There also seems to be dead code for allocating interleaved memory that doesn't get used anywhere. [...]

It's a NUMA box, so Crafty will say so. But it is a VERY POOR NUMA box compared to the typical single-motherboard AMD machines I have used in the past, where you get 4-8 physical processor chips on a single motherboard connected through the HyperTransport bus rather than remotely through InfiniBand. The minute you go remote with InfiniBand, memory latency becomes huge compared to the usual bus architecture AMD/Intel uses, and Crafty has not been even remotely optimized for such an architecture. A simple spin lock is horrible there.
The "mallocinterleaved" code is a Windows mechanism that interleaves the pages of a large memory region over physical RAM. It is not for split blocks and the like, just for the hash memory. Otherwise you end up with the entire hash table in one node's local memory, which creates a "hot spot" that hurts performance. By interleaving pages, each physical processor's memory gets an equal share of the hash table, which spreads out accesses.
-
Daniel Shawul
- Posts: 4186
- Joined: Tue Mar 14, 2006 11:34 am
- Location: Ethiopia
Re: numa scaling
bob wrote: It's a NUMA box, so Crafty will say so. But it is a VERY POOR NUMA box compared to the typical single-motherboard AMD machines I have used in the past. [...]

Well, then why would it scale well up to 8 processors? It should start slowing down at 4. The server is an 8-socket AMD quad-core machine; here is the spec: HP ProLiant G6 server. I was wrong about the other machine, though. That one was a 2-socket x 12-core AMD machine (HP ProLiant G7 server), so it is understandable that it scaled well up to 12 cores but not on more processors.
bob wrote: The "mallocinterleaved" is a Windows mechanism that interleaves pages of a large memory region over the physical RAM. [...]

I am saying you don't use that code anywhere, even though you have it (be it on Windows or Linux). The hash table seems to be allocated in the regular manner, so it follows the first-touch rule and gets bound to one processor. What I do to get the same behaviour as interleaved memory is a bit different: each processor explicitly allocates a portion of the hash table, and hash table stores/probes are done by first locating which processor holds the relevant section of the tt (hash_key modulo n_processors) and then dividing the hash_key by n_processors to get the desired entry.
Edit
I have been using the login node for the tests, which explains why it scaled only up to 8 (other people were using some cores on that node). Submitting jobs properly now shows some speedup for 16 cores, but clearly there is some loss in efficiency due to NUMA. I will test up to 32 cores and see what happens. The InfiniBand is not an issue, because I think it is used to connect this node with other nodes of similar capacity (32 cores/node)... That makes much more sense now.
-
Daniel Shawul
- Posts: 4186
- Joined: Tue Mar 14, 2006 11:34 am
- Location: Ethiopia
Re: numa scaling
Code: Select all
cores    Scorpio              Crafty
         NPS (K)   Scaling    NPS (M)   Scaling
  1       1029      1.00        1.5      1.00
  2       2017      1.96        3.1      2.07
  4       3931      3.82        5.8      3.87
  8       6928      6.73       11.4      7.60
 16       8982      8.73       12.7      8.47