info about zappa on 512 cores?

Discussion of chess software programming and technical issues.


Daniel Shawul
Posts: 4186
Joined: Tue Mar 14, 2006 11:34 am
Location: Ethiopia

Re: numa scaling

Post by Daniel Shawul »

Here is the result of a simple test code showing that memory latencies from one processor to another are not far apart.

Code: Select all

Memory allocated on cpu 0
0: time = 5.22189 seconds
1: time = 6.51558 seconds
2: time = 6.09133 seconds
3: time = 7.2878 seconds
4: time = 7.18063 seconds
5: time = 6.88262 seconds
6: time = 8.77036 seconds
7: time = 8.72347 seconds
8: time = 5.24879 seconds
9: time = 6.52614 seconds
10: time = 6.08169 seconds
11: time = 7.3159 seconds
12: time = 7.18515 seconds
13: time = 6.87356 seconds
14: time = 8.77845 seconds
15: time = 8.71535 seconds
16: time = 5.51238 seconds
17: time = 6.52246 seconds
18: time = 6.08975 seconds
19: time = 7.30738 seconds
20: time = 7.16827 seconds
21: time = 6.8735 seconds
22: time = 8.77116 seconds
23: time = 8.71744 seconds
24: time = 5.28619 seconds
25: time = 6.53422 seconds
26: time = 6.07203 seconds
27: time = 7.31839 seconds
28: time = 7.17994 seconds
29: time = 6.86937 seconds
30: time = 8.7875 seconds
31: time = 8.70591 seconds
Similar results when memory is allocated on other processors. Within the same node the average is about 6 s, while far accesses take about 7.5 s. Not so much of a difference; it is almost an SMP machine, IMO. I am still waiting on a 32-processor result, but so far 16 processors seem to scale badly.
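
For reference, a minimal sketch of this kind of test (not the actual test code; it assumes Linux first-touch allocation and sched_setaffinity, and it times a streaming copy, so it is mostly a bandwidth measure):

Code: Select all

/* Minimal sketch (not the actual test code): pin to each CPU in turn and time
   a sweep over buffers that were first-touched on CPU 0, so the pages live in
   CPU 0's memory node.  Assumes Linux first-touch NUMA placement. */
#define _GNU_SOURCE
#include <sched.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <time.h>
#include <unistd.h>

#define BUF_SIZE  (256u * 1024 * 1024)   /* large enough to defeat the caches */
#define PASSES    16

static void pin_to_cpu(int cpu) {
    cpu_set_t set;
    CPU_ZERO(&set);
    CPU_SET(cpu, &set);
    sched_setaffinity(0, sizeof(set), &set);
}

int main(void) {
    int ncpus = (int)sysconf(_SC_NPROCESSORS_ONLN);
    char *src = malloc(BUF_SIZE), *dst = malloc(BUF_SIZE);

    pin_to_cpu(0);
    memset(src, 1, BUF_SIZE);            /* first touch: pages land on CPU 0's node */
    memset(dst, 2, BUF_SIZE);
    printf("Memory allocated on cpu 0\n");

    for (int cpu = 0; cpu < ncpus; cpu++) {
        pin_to_cpu(cpu);
        struct timespec t0, t1;
        clock_gettime(CLOCK_MONOTONIC, &t0);
        for (int p = 0; p < PASSES; p++)
            memcpy(dst, src, BUF_SIZE);  /* streaming copy: mostly a bandwidth test */
        clock_gettime(CLOCK_MONOTONIC, &t1);
        double sec = (t1.tv_sec - t0.tv_sec) + (t1.tv_nsec - t0.tv_nsec) * 1e-9;
        printf("%d: time = %g seconds\n", cpu, sec);
    }
    free(src);
    free(dst);
    return 0;
}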
bob
Posts: 20943
Joined: Mon Feb 27, 2006 7:30 pm
Location: Birmingham, AL

Re: numa scaling

Post by bob »

Daniel Shawul wrote: Here is the result of a simple test code showing that memory latencies from one processor to another are not far apart.

Code: Select all

Memory allocated on cpu 0
0: time = 5.22189 seconds
1: time = 6.51558 seconds
2: time = 6.09133 seconds
3: time = 7.2878 seconds
4: time = 7.18063 seconds
5: time = 6.88262 seconds
6: time = 8.77036 seconds
7: time = 8.72347 seconds
8: time = 5.24879 seconds
9: time = 6.52614 seconds
10: time = 6.08169 seconds
11: time = 7.3159 seconds
12: time = 7.18515 seconds
13: time = 6.87356 seconds
14: time = 8.77845 seconds
15: time = 8.71535 seconds
16: time = 5.51238 seconds
17: time = 6.52246 seconds
18: time = 6.08975 seconds
19: time = 7.30738 seconds
20: time = 7.16827 seconds
21: time = 6.8735 seconds
22: time = 8.77116 seconds
23: time = 8.71744 seconds
24: time = 5.28619 seconds
25: time = 6.53422 seconds
26: time = 6.07203 seconds
27: time = 7.31839 seconds
28: time = 7.17994 seconds
29: time = 6.86937 seconds
30: time = 8.7875 seconds
31: time = 8.70591 seconds
Similar results when memory is allocated on other processors. Within the same node the average is about 6 s, while far accesses take about 7.5 s. Not so much of a difference; it is almost an SMP machine, IMO. I am still waiting on a 32-processor result, but so far 16 processors seem to scale badly.
The times are pretty variable. And there is some tuning that can be done inside Crafty, as it really expects bus-level speeds, not InfiniBand latency; performance would likely be somewhat better with different internal settings...

Those times don't really say much about how a CHESS program like Crafty will scale. The locks will be important, and at InfiniBand latencies they will likely hurt significantly...
Daniel Shawul
Posts: 4186
Joined: Tue Mar 14, 2006 11:34 am
Location: Ethiopia

Re: numa scaling

Post by Daniel Shawul »

The times are pretty variable. And there is some tuning that can be done inside Crafty, as it really expects bus-level speeds, not InfiniBand latency; performance would likely be somewhat better with different internal settings...
No, they are not. Have you seen the example given with the test code I mentioned?

Code: Select all

Memory allocated on cpu 0
0: time = 7.95079 seconds
1: time = 7.93437 seconds
2: time = 10.2906 seconds
3: time = 10.3224 seconds
This has a 7-second vs. 10-second difference between accessing near and far memory.
Those times don't really say much about how a CHESS program like Crafty will scale. The locks will be important, and at InfiniBand latencies they will likely hurt significantly...
As I have mentioned before, I don't think InfiniBand is involved here. That is used only for connecting "nodes" of 32 cores each. The cluster has thousands of cores, where each node is an HP ProLiant G5 server with 8 sockets x quad-core AMD Opteron. So each CPU is connected to the others through the HyperTransport interface (Machine spec).

Here is the other machine, with 2 sockets x 12-core AMD.
bob
Posts: 20943
Joined: Mon Feb 27, 2006 7:30 pm
Location: Birmingham, AL

Re: numa scaling

Post by bob »

Daniel Shawul wrote:
The times are pretty variable. And there is some tuning that can be done inside Crafty, as it really expects bus-level speeds, not InfiniBand latency; performance would likely be somewhat better with different internal settings...
No, they are not. Have you seen the example given with the test code I mentioned?

Code: Select all

Memory allocated on cpu 0
0: time = 7.95079 seconds
1: time = 7.93437 seconds
2: time = 10.2906 seconds
3: time = 10.3224 seconds
This has a 7-second vs. 10-second difference between accessing near and far memory.
Those times don't really say much about how a CHESS program like Crafty will scale. The locks will be important, and at InfiniBand latencies they will likely hurt significantly...
As I have mentioned before, I don't think InfiniBand is involved here. That is used only for connecting "nodes" of 32 cores each. The cluster has thousands of cores, where each node is an HP ProLiant G5 server with 8 sockets x quad-core AMD Opteron. So each CPU is connected to the others through the HyperTransport interface (Machine spec).

Here is the other machine, with 2 sockets x 12-core AMD.

You don't consider 50% a significant variance??? I did look at the "test code" to see what it is doing, and it is purely a bandwidth measure. There is latency and there is bandwidth. InfiniBand can provide bandwidth, but it is very slow in terms of latency compared to a normal local memory access...

The AMD approach, if you REALLY have 8 chips, is a large ring. And for each hop, the access time goes way up. With 4 chips, you have one local, two 1-hops, and one 2-hop access. With 8, it is 1 local, 2 1-hops, 2 2-hops, 2 3-hops and 1 4-hop, which gets sticky.
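
For what it's worth, a back-of-the-envelope sketch of that ring model (an assumption about the topology, not a measurement) reproduces those counts and gives an average of 1 hop for 4 chips and 2 hops for 8:

Code: Select all

/* Back-of-the-envelope sketch of the ring model described above: average hop
   count from one socket to a uniformly chosen socket, for 4 and 8 chips.
   (Real 8-socket Opteron topologies vary; this only reproduces the counts.) */
#include <stdio.h>

int main(void) {
    int sizes[] = { 4, 8 };
    for (int i = 0; i < 2; i++) {
        int n = sizes[i], total = 0;
        for (int d = 0; d < n; d++) {
            int hops = (d < n - d) ? d : n - d;   /* shortest way around the ring */
            total += hops;
        }
        printf("%d sockets: average %.2f hops\n", n, (double)total / n);
    }
    return 0;
}
So under that assumption, going from 4 to 8 chips doubles the average remote distance, which is why it "gets sticky."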

Crafty was tuned on a 4 chip AMD box... I have not done any sort of tuning (yet) for the newer AMD and/or Intel boxes where everyone is using NUMA now.


We have 3 clusters here that use InfiniBand, and while it looks great compared to other network topologies, it does not look great compared to actual memory over the bus...
Daniel Shawul
Posts: 4186
Joined: Tue Mar 14, 2006 11:34 am
Location: Ethiopia

Re: numa scaling

Post by Daniel Shawul »

You don't consider 50% a significant variance??? I did look at the "test code" to see what it is doing, and it is purely a bandwidth measure. There is latency and there is bandwidth. InfiniBand can provide bandwidth, but it is very slow in terms of latency compared to a normal local memory access...
How did you come up with 50%? I considered the worst and the best, and it is about a 25% variation. Compared to the example given before, this machine seems more SMP-like than that one...
The AMD approach, if you REALLY have 8 chips, is a large ring. And for each hop, the access time goes way up. With 4 chips, you have one local, two 1-hops, and one 2-hop access. With 8, it is 1 local, 2 1-hops, 2 2-hops, 2 3-hops and 1 4-hop, which gets sticky.

Crafty was tuned on a 4 chip AMD box... I have not done any sort of tuning (yet) for the newer AMD and/or Intel boxes where everyone is using NUMA now.


We have 3 clusters here that use InfiniBand, and while it looks great compared to other network topologies, it does not look great compared to actual memory over the bus...
The interconnect between CPUs (quad-core Opterons) is a HyperTransport bus. That is why it scaled well up to 8 processors, which it wouldn't have if it used InfiniBand. So you can't blame it on InfiniBand.
Daniel Shawul
Posts: 4186
Joined: Tue Mar 14, 2006 11:34 am
Location: Ethiopia

Re: numa scaling

Post by Daniel Shawul »

Here is a result on non-NUMA Intel Xeon processors for comparison (4 sockets x quad-core Intel Xeon). Here Scorpio seems to scale slightly better, but both have problems scaling to 8 processors. I think the previous Opteron test showed better scaling at 8 cores. I suspect it is because of contention on the shared FSB.

Code: Select all

cores   Scorpio NPS   speedup   Crafty NPS   speedup
1       1173          1.00       1.8         1.00
2       2323          1.98       3.5         1.94
4       4386          3.74       6.7         3.72
8       7042          6.00      10.5         5.83
And just in case there are doubts about the machine being UMA, here is the latency test.

Code: Select all

Memory allocated on cpu 0
0: time = 8.69122 seconds
1: time = 8.50812 seconds
2: time = 8.47348 seconds
3: time = 8.4931 seconds
4: time = 8.66125 seconds
5: time = 8.50479 seconds
6: time = 8.47527 seconds
7: time = 8.48791 seconds
8: time = 8.84295 seconds
9: time = 8.48066 seconds
10: time = 8.47517 seconds
11: time = 8.48868 seconds
12: time = 8.7562 seconds
13: time = 8.47956 seconds
14: time = 8.47186 seconds
15: time = 8.48825 seconds
bob
Posts: 20943
Joined: Mon Feb 27, 2006 7:30 pm
Location: Birmingham, AL

Re: numa scaling

Post by bob »

Daniel Shawul wrote:
You don't consider 50% a significant variance??? I did look at the "test code" to see what it is doing, and it is purely a bandwidth measure. There is latency and there is bandwidth. InfiniBand can provide bandwidth, but it is very slow in terms of latency compared to a normal local memory access...
How did you come up with 50%? I considered the worst and the best, and it is about a 25% variation. Compared to the example given before, this machine seems more SMP-like than that one...
I took 6 and 9 seconds. Difference = 3 seconds. 3 / 6 = 50%...

Not that it is very significant, since that test code is primarily a memcpy() test, which measures bandwidth rather than latency. At a split point you need some bandwidth, to be sure, when doing the copy/split, but after that it becomes more about latency... And I am not sure what that box does with all the shared read-only data; if you plant it on one node and make the others access it over the InfiniBand switch, that would be an issue for sure...
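
For contrast, a latency-oriented test usually looks like a dependent pointer chase rather than a streaming copy. A minimal hypothetical sketch (not Crafty's code; a serious version would randomize the chain and pin the thread and the memory to specific NUMA nodes):

Code: Select all

/* Minimal pointer-chase sketch (hypothetical): each load depends on the
   previous one, so the time measured is dominated by memory latency rather
   than bandwidth. */
#include <stdio.h>
#include <stdlib.h>
#include <time.h>

#define N_NODES  (64u * 1024 * 1024)     /* 512 MB of pointers, far bigger than the caches */
#define STEPS    (50u * 1000 * 1000)

int main(void) {
    void **chain = malloc(N_NODES * sizeof(void *));

    /* build one big cycle with a stride larger than a page; a shuffled order
       would defeat hardware prefetching even more thoroughly */
    for (size_t i = 0; i < N_NODES; i++)
        chain[i] = &chain[(i + 12345) % N_NODES];

    struct timespec t0, t1;
    clock_gettime(CLOCK_MONOTONIC, &t0);
    void **p = &chain[0];
    for (size_t i = 0; i < STEPS; i++)
        p = (void **)*p;                 /* each load must wait for the one before it */
    clock_gettime(CLOCK_MONOTONIC, &t1);

    double sec = (t1.tv_sec - t0.tv_sec) + (t1.tv_nsec - t0.tv_nsec) * 1e-9;
    printf("%.1f ns per dependent load (last pointer %p)\n",
           sec * 1e9 / STEPS, (void *)p);
    free(chain);
    return 0;
}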



The AMD approach, if you REALLY have 8 chips, is a large ring. And for each hop, the access time goes way up. With 4 chips, you have one local, two 1-hops, and one 2-hop access. With 8, it is 1 local, 2 1-hops, 2 2-hops, 2 3-hops and 1 4-hop, which gets sticky.

Crafty was tuned on a 4 chip AMD box... I have not done any sort of tuning (yet) for the newer AMD and/or Intel boxes where everyone is using NUMA now.


We have 3 clusters here that use InfiniBand, and while it looks great compared to other network topologies, it does not look great compared to actual memory over the bus...
The interconnect between CPUs (quad-core Opterons) is a HyperTransport bus. That is why it scaled well up to 8 processors, which it wouldn't have if it used InfiniBand. So you can't blame it on InfiniBand.
Without knowing exactly how you are testing and exactly what the hardware looks like, scaling is difficult to discuss precisely. Normally, Crafty scales well on a normal AMD box (or Intel box). But I have not done a ton of testing on 6-core-and-up chips, and I have seen some issues there, because with 6 cores you can have 6x the demand with the same HT hardware you have with just one core. So there is definitely a bottleneck. I just have not played around with any AMD hardware in quite a while, since Intel has ruled the roost for several years in terms of chess performance...
Daniel Shawul
Posts: 4186
Joined: Tue Mar 14, 2006 11:34 am
Location: Ethiopia

Re: numa scaling

Post by Daniel Shawul »

I took 6 and 9 seconds. Difference = 3 seconds. 3 / 6 = 50%...
You have to divide by the maximum of 6 and 9 to get the percentage difference. It is not a percentage change going from 6 to 9, as that would be meaningless (the result would be different going from 9 to 6).
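
For concreteness, the two conventions applied to the 6 and 9 second figures give:

Code: Select all

(9 - 6) / 6 = 50%    relative to the smaller (near) time
(9 - 6) / 9 ≈ 33%    relative to the larger (far) time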
Not that it is very significant, since that test code is primarily a memcpy() test, which measures bandwidth rather than latency. At a split point you need some bandwidth, to be sure, when doing the copy/split, but after that it becomes more about latency... And I am not sure what that box does with all the shared read-only data; if you plant it on one node and make the others access it over the InfiniBand switch, that would be an issue for sure...
Forget about the InfiniBand now, since all of the processors I used are on one node with 32 processors, and I tested only up to 16. So the latency comes from the difference in number of hops, plus many other factors on the Opteron. I would guess the constant shared data will be cached really well. Each core has 64 KB L1 + 512 KB L2, and there is a shared 2 MB L3.
Without knowing exactly how you are testing and exactly what the hardware looks like, scaling is difficult to discuss precisely. Normally, Crafty scales well on a normal AMD box (or Intel box). But I have not done a ton of testing on 6-core-and-up chips, and I have seen some issues there, because with 6 cores you can have 6x the demand with the same HT hardware you have with just one core. So there is definitely a bottleneck. I just have not played around with any AMD hardware in quite a while, since Intel has ruled the roost for several years in terms of chess performance...
Well, I already mentioned it is an HP server machine with 8 sockets x quad-core AMD Opterons. The only doubtful thing I provided at first was the InfiniBand interconnect, but I cleared that up a couple of posts ago: that is used for connecting the server machine to other similar machines. Each one has 8 CPUs interconnected by HyperTransport at 1 GHz, so up to 32 cores there is no need for InfiniBand.
By the way, what about the result on the Intel Xeon I gave in the other post (a UMA machine)? Crafty seems to suffer a lot at 8 cores. The timed tests are done in the following way:
1 processor - 1 minute run
2 processors - 2 minute run
4 processors - 4 minute run
etc.
Note that I am comparing only NPS values; if it were speedup we were talking about, it would be much worse. It seems to me that search nowadays is made very selective to get the best parallel performance.

Edit: Suj confirmed that he used the same hardware for Sjeng in the past, an HP ProLiant DL785 G5 server. It is a bit old now compared to the newer, more powerful servers.
Daniel Shawul
Posts: 4186
Joined: Tue Mar 14, 2006 11:34 am
Location: Ethiopia

Re: transposition tables

Post by Daniel Shawul »

I found an interesting bug in the distributed TT a couple of days ago, and the post I made above exemplifies it. Here is the bug, just in case someone finds it interesting too. I used a modulo on the hash key to find which processor owns that section of the TT, but then I used the same hash key to do the probing as well. This effectively throws away most of the hash table when many processors are used: about 90% of the TT with 8 threads and 75% with 4 threads is useless... The problem is that using a modulo of the hash_key to select which processor is going to store a position effectively makes processor 1 store hash_keys ending in 1, processor 2 those ending in 2, processor 3 those ending in 3, etc. So only 1/N of each TT is used with N processors. To avoid that, a division by N should be applied to the key before it is used for probing (it does not necessarily have to be a division). In code:

Code: Select all

owner    = hash_key % N;    /* which processor's TT slice owns this position */
hash_key = hash_key / N;    /* drop those bits before probing the local table */
...
Then I do another operation to determine the entry within each individual TT using the regular method (each table has 2^M entries), which of course helps since it avoids a division. Now this all sounds a bit redundant, and I am not sure it is the only way it can be done. I also wonder if there are more efficient methods, because I can already feel the NPS slowdown with the current method (not much, but it is there). So I will rewind and rephrase the question, just in case someone has a better idea.
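
A hypothetical C sketch of the two-step scheme (made-up names and sizes, not the engine's actual code; it assumes N is a power of two, in which case the buggy version can reach only 1/N of each local table):

Code: Select all

/* Hypothetical sketch of the two-step distributed-TT indexing described above
   (made-up names and sizes). */
#include <stdint.h>
#include <stdio.h>

#define N_PROCS  8                        /* processors, each owning one TT slice    */
#define TT_BITS  22                       /* each local table has 2^M = 2^22 entries */
#define TT_MASK  ((1u << TT_BITS) - 1)

/* Buggy: owner and local index both come from the low bits of the key, so
   processor p only ever receives keys with key % N_PROCS == p and can reach
   only 1/N_PROCS of its own slots (roughly 90% wasted with 8 processors). */
static uint32_t tt_index_buggy(uint64_t key, int *owner) {
    *owner = (int)(key % N_PROCS);
    return (uint32_t)(key & TT_MASK);
}

/* Fixed: strip the bits already spent on owner selection before indexing,
   so every slot of every local table is reachable. */
static uint32_t tt_index_fixed(uint64_t key, int *owner) {
    *owner = (int)(key % N_PROCS);
    key   /= N_PROCS;                     /* discard the owner-selection information */
    return (uint32_t)(key & TT_MASK);
}

int main(void) {
    uint64_t key = 0x123456789abcdef0ULL; /* an arbitrary Zobrist key */
    int owner_b, owner_f;
    uint32_t idx_b = tt_index_buggy(key, &owner_b);
    uint32_t idx_f = tt_index_fixed(key, &owner_f);
    printf("buggy: owner=%d index=%u\n", owner_b, idx_b);
    printf("fixed: owner=%d index=%u\n", owner_f, idx_f);
    return 0;
}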
--------------------------------------------------------------------------------
Given N small hash tables of the same size, with K entries each, how do you hash a position with Zobrist key hash_key? Assume the small hash tables are physically separate (on different processors, if you will).
--------------------------------------------------------------------------------
cheers
diep
Posts: 1822
Joined: Thu Mar 09, 2006 11:54 pm
Location: The Netherlands

Re: info about zappa on 512 cores ?

Post by diep »

Why don't you send Anthony Cozzie an e-mail? He also gave a lecture about it at an NCSA event some years ago; maybe he still has the slides for you...