I am not saying that there is necessarily something wrong, but these numbers with the new compiler are pretty unbelievable. And they contradict my tests with Crafty 23.6 to 4 cores and Andreas Strangmüller's tests to 16 cores with Crafty 24.1:

bob wrote:
That is based on 4 runs of 24 positions. All I have so far is the 1 cpu run, which didn't change at all, and these 4 20 cpu runs. Yes, they are pretty high, but as to whether something is amiss or not I do not know. I have run literally thousands of positions comparing results with 1 to N cpus. There is always variability, but there have been no instances of bogus moves or bogus scores being chosen. And Crafty has been playing on ICC with this code for months. The only bug I have found in the last 30 days is a statistics-gathering bug in counting the number of reductions that were done, overstepping an array bound. But that bug was in the search, period, not an SMP issue.

Laskos wrote:
These are exceptionally high speedup numbers, surpassing even your DTS results. I have never seen such to 16 or 20 cores. I am tempted to think something is amiss.

bob wrote:
I have started re-running my SMP tests since the Intel compiler provides such a nice NPS improvement (multiple threads only, single thread seems to be the same as before).
Here are my first 20 core test runs (same 24 positions, each run 4 times). All results are geometric means for a 24 position run.
Net results: NPS scaling improved from 14.7 or so to 16.3x. I know part of the bottleneck is 64GB of hash and TLB thrashing, so I do have plans to try mmap() and the rather clumsy 1GB huge pages. But I have not done this as of yet, and am still using the automatic 2MB pages (transparent huge pages in current Linux kernels). Speedup (geomean) went from 12.45 to 15.1 due to the Intel issue plus a few further SMP refinements. BTW the 20 cpu search looked at about 5% more nodes (average) compared to the 1 cpu test, so overhead is pretty well managed for the moment.
          run 1   run 2   run 3   run 4   avg
speedup   15.0    15.0    15.0    15.3    15.1
nps       16.3    16.4    16.3    16.3    16.3
The speedup has really captured my attention, because it is right at the theoretical max (15.1 with max of 16.3 based on NPS numbers). I am fixing to create another set of test positions just to be sure that these positions don't somehow happen to artificially inflate the speedup results. I am not sure exactly how I am going to choose this test set. I was leaning toward either (a) taking a set of positions from a long time control game played on ICC, or else (b) a random set of positions extracted from GM games (the way I extract starting positions for cluster testing).
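As a rough sanity check on the "theoretical max" remark, here is a minimal Python sketch. It assumes a simplified model in which time to depth equals nodes searched divided by NPS, so the effective speedup is roughly the NPS scaling divided by the node overhead ratio. The function names are hypothetical and nothing here is taken from Crafty itself.

```python
import math

def geomean(values):
    """Geometric mean of a list of positive numbers."""
    return math.exp(sum(math.log(v) for v in values) / len(values))

def expected_speedup(nps_scaling, node_ratio):
    """Simplified model (an assumption): time = nodes / nps, so
    speedup = nps_scaling / node_ratio, where node_ratio is
    (nodes searched with N cpus) / (nodes searched with 1 cpu)."""
    return nps_scaling / node_ratio

# Figures quoted in the post: 16.3x NPS scaling, about 5% extra nodes.
bound = expected_speedup(16.3, 1.05)
print(f"expected speedup under this model: {bound:.1f}")

# Per-run speedups from the table above; combining the four run
# figures with a geometric mean is an illustrative choice.
runs = [15.0, 15.0, 15.0, 15.3]
print(f"geomean of the four runs: {geomean(runs):.2f}")
```

Under these assumptions the model gives roughly 15.5, so a measured 15.1 sits just below what the NPS scaling and the 5% node overhead would allow.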
A much larger set of positions would be better, statistically, but not so good practically as the tests would take forever to run. More on this later....
I am certainly looking at the code carefully and testing whatever I can. As of right now, it looks rock-solid to me. It might be that those 24 positions are very favorable to a search by Crafty. I have already run the set posted in this thread using 20 cores. I'm going to pick 60 of them and run to fixed depth with 1 and 20 cores to get a feel for whether the positions might be the issue. I'd rather run 1000 positions but that is simply not practical at these speeds...
http://www.talkchess.com/forum/viewtopi ... 59&start=0
I tested Crafty 23.6 with 30 positions for average 2 minutes per position on 4 cores (average 6 minutes per position on 1 core). The effective speedup on 4 cores I get is 2.7 and NPS speedup 3.2, nowhere near your 3.7 and 3.8 for Crafty 25.0 with the old compiler (maybe even higher with the new compiler).
Andreas Strangmüller to 16 cores got an effective speedup for Crafty 24.1 of maybe 6.0, and bad scaling 8 -> 16 threads, while your results to 16 cores with Crafty 25.0 using the old compiler show a speedup of 10-11 and a very good 8 -> 16 thread scaling of 1.6 for effective speedup and 1.7 for NPS speedup. Maybe even better with the new compiler.
The speedup of 15 on 20 cores with the new compiler is VERY high.
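To put these figures on a common scale, here is a small Python sketch that computes parallel efficiency (effective speedup divided by core count) for the numbers quoted in this post. The helper name is hypothetical, and the 6.0 entry is only the approximate value given above.

```python
def efficiency(speedup, cores):
    """Parallel efficiency: fraction of ideal linear speedup achieved."""
    return speedup / cores

# (label, effective speedup, cores), all taken from the figures in this post.
results = [
    ("Crafty 23.6, my test", 2.7, 4),
    ("Crafty 25.0, old compiler", 3.7, 4),
    ("Crafty 24.1, Strangmueller (approx.)", 6.0, 16),
    ("Crafty 25.0, new compiler", 15.1, 20),
]

for label, speedup, cores in results:
    print(f"{label}: {speedup} on {cores} cores "
          f"= {efficiency(speedup, cores):.0%} efficiency")
```

Seen this way, the 20-core result keeps about three quarters of ideal linear scaling, while the 16-core Crafty 24.1 figure is well under half.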
Anyway, if you confirm your Crafty 25.0 new compiler numbers, you should not only publish the results, but also write a paper explaining the parallel algorithm. I think nobody else gets such speedups. With this sort of SMP implementation, Jonny would have won the WCCC long ago.

