I took 6 and 9 seconds. Difference = 3 seconds. 3 / 6 = 50%...
You have to divide by the maximum of 6 and 9 to get the percentage difference, which gives 3 / 9 = 33%. It is not a percentage change going from 6 to 9, as that would be meaningless here (the result would be different going from 9 to 6).
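To make the arithmetic concrete, here is a small sketch of the two calculations being argued about. The function names `percent_change` and `percent_difference` are just illustrative labels, not anything from the test code, and "divide by the maximum" is taken literally from the sentence above.

```python
def percent_change(old, new):
    """Directional change from old to new, relative to the starting value."""
    return (new - old) / old * 100.0


def percent_difference(a, b):
    """Symmetric difference, relative to the larger of the two values."""
    return abs(a - b) / max(a, b) * 100.0


if __name__ == "__main__":
    # The 6 s and 9 s timings from the discussion above.
    print("change 6 -> 9: %.1f%%" % percent_change(6, 9))
    print("change 9 -> 6: %.1f%%" % percent_change(9, 6))
    print("difference(6, 9): %.1f%%" % percent_difference(6, 9))
    print("difference(9, 6): %.1f%%" % percent_difference(9, 6))
```

The directional version gives a different magnitude depending on which timing you start from, while the max-based version gives the same answer either way, which is the point being made.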
Not that it is very significant, since that test code is primarily a memcpy() test, which measures bandwidth rather than latency. At a split point you need some bandwidth, to be sure, when doing the copy/split, but after that it becomes more about latency... And I am not sure what that box does with all the shared read-only data. If you plant it on one node and make the others access it over the InfiniBand switch, that would be an issue for sure...
Forget about the InfiniBand now, since all of the processors I used are on one node with 32 processors. I tested only up to 16. So the latency comes from the difference in the number of hops, plus many other Opteron-specific factors. I would guess the constant shared data will be cached really well. Each core has 64 KB L1 + 512 KB L2, plus a shared 2 MB L3.
Without knowing exactly how you are testing and exactly what the hardware looks like, scaling is difficult to discuss precisely. Normally, Crafty scales well on a normal AMD box (or Intel box). But I have not done a ton of testing on chips with 6 or more cores, and I have seen some issues there, because with 6 cores you can have 6x the demand on the same HT hardware you have with just one core. So there is definitely a bottleneck. I just have not played around with any AMD hardware in quite a while, since Intel has ruled the roost for several years in terms of chess performance...
Well, I already mentioned it is an HP server machine with 8 sockets x quad-core AMD Opterons. The only doubtful thing I provided at first was the InfiniBand interconnect, but I cleared that up a couple of posts ago. That is used for connecting the server machine to other similar machines. Each one has 8 CPUs interconnected by a 1 GHz HT link, so up to 32 cores there is no need for the InfiniBand.
Btw, what about the result on the Intel Xeon I gave in the other post (UMA machine)? Crafty seems to suffer a lot at 8 cores. The timed tests are done in the following way:
1 processor - 1 minute run
2 processors - 2 minute run
4 processors - 4 minute run
etc.
Note that I am comparing only NPS values; if we were talking about speedup, it would be much worse. It seems to me that search nowadays is too selective to get the best parallel performance.
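For anyone who wants to reproduce the comparison, this is roughly how an NPS scaling figure can be derived from runs timed as listed above. It is only a sketch: the node counts are made-up placeholders, not measurements from the Opteron or Xeon boxes, and `nps_scaling` is a hypothetical helper name.

```python
def nps(nodes, seconds):
    """Nodes per second for one run."""
    return nodes / seconds


def nps_scaling(runs):
    """runs maps thread count -> (nodes searched, run time in seconds).

    Returns thread count -> (NPS factor vs. the 1-thread run,
    efficiency = factor / thread count).
    """
    base = nps(*runs[1])
    result = {}
    for threads, (nodes, seconds) in sorted(runs.items()):
        factor = nps(nodes, seconds) / base
        result[threads] = (factor, factor / threads)
    return result


if __name__ == "__main__":
    # HYPOTHETICAL node counts, for illustration only.
    # Run times follow the scheme above: n processors -> n minute run.
    runs = {
        1: (60_000_000, 60),
        2: (228_000_000, 120),
        4: (864_000_000, 240),
        8: (2_880_000_000, 480),
    }
    for threads, (factor, eff) in nps_scaling(runs).items():
        print("%2d threads: NPS x%.2f, efficiency %.0f%%"
              % (threads, factor, eff * 100))
```

This only measures raw node throughput. Time-to-depth speedup would have to be measured separately, and as noted above it would come out worse because of search overhead.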
Edit: Suj confirmed that he used the same hardware for Sjeng in the past:
HP ProLiant DL785 G5 server. It is a bit old now compared to the newer, more powerful servers.