For reference, here are the node speeds from the starting position after 2 minutes, using Houdini 6 with 4 GB of hash (no Large Pages):
- 20 threads (20 cores): 27.2 MN/s
- 40 threads (40 cores): 52.4 MN/s
- 80 threads (40 cores): 67.3 MN/s
Nodes/sec. with last new CPU's!
-
- Posts: 1796
- Joined: Thu Sep 18, 2008 10:24 pm
Re: Nodes/sec. with last new CPU's!
What was the actual (i.e. all-core turbo) clock speed? Intel?
-
- Posts: 1796
- Joined: Thu Sep 18, 2008 10:24 pm
Re: Nodes/sec. with last new CPU's!
Ah, I've just seen it's an E5-2698 v4.
I think that's 2.70 GHz on all cores.
-
- Posts: 10948
- Joined: Wed Jul 26, 2006 10:21 pm
- Full name: Kai Laskos
Re: Nodes/sec. with last new CPU's!
Houdini wrote:
> Houdini wrote:
> > The NUMA scaling you want to test will appear in the results of the current 80 threads (on 2 NUMA nodes) vs 20 threads (on 1 NUMA node) test.
> > After 100 games it's (+28 -7 =65) or about +74±40 Elo.
> > So we currently have the following:
> > Code:
> > 40 hyper-threads (on 1 node ) vs 20 threads (on 1 node ): +13±10 Elo
> > 80 hyper-threads (on 2 nodes) vs 40 threads (on 2 nodes): +11±12 Elo
> > 80 hyper-threads (on 2 nodes) vs 20 threads (on 1 node ): +74±40 Elo
> > The scaling you want to study is the difference between the second and the third result, currently about 60±40 Elo going from 20 to 40 threads.
> I've stopped the 80 hyper-threads vs 20 threads match after 690 games.
> Result is (+194 -48 =448) or about +75±15 Elo.
> That means we have the following, with 1 node using 20 cores, 2 nodes using 40 cores:
> Code:
> 40 hyper-threads (on 1 node ) vs 20 threads (on 1 node ): +13±10 Elo
> 80 hyper-threads (on 2 nodes) vs 40 threads (on 2 nodes): +11±12 Elo
> 80 hyper-threads (on 2 nodes) vs 20 threads (on 1 node ): +75±15 Elo
> From these results the scaling from 20 threads (20 cores) to 40 threads (40 cores) can be estimated as +64±19 Elo. More games would be needed to reduce the error margins, but the over-all picture is quite clear.
> The scaling is surprisingly good, especially if you take into account that a (40+0.4) time control is quite decent with 20 or 40 threads. Applying the (n^0.8) scaling formula, 20 threads at (40+0.4) would be equivalent to 1 thread at (440+4.4) which is similar to IPON's (300+3 with ponder).
Again, pretty unbelievable. So, going from 20 cores on 1 NUMA node to 40 cores on 2 NUMA nodes gives 64 +/- 19 Elo points? Doubling the time in these conditions cannot give more than 80-90 Elo points. So it means an effective speedup of 1.5-1.8 from 1 node to 2 nodes with 20 cores each. On average, even higher than what Peter got. These numbers are hard to believe.
> This also means that Houdini 6 on 20 cores would be competitive with Houdini 5 on 40 cores, or as said on the Houdini web page, "upgrading to Houdini 6 is like doubling the computational power of your computer for chess".
> Great to see the dominance of software over hardware!
I don't know why you are saying that exactly now, when you got spectacular scaling both with cores on the same number of nodes, and with NUMA. If cluster scaling is lower, but comparable and stable, say a constant 1.4 effective speedup (time-to-strength) with each doubling of the cluster, then things look much more favorable for heavy hardware now than before. Up to now I was under the impression that a 3000-core cluster (say 150 20-core machines) gives no more than 100 Elo points advantage compared to a single 20-core machine. Now it seems more like 250-300 Elo points (an effective speedup of maybe 12). Or a 4-node NUMA Xeon (say a total of 96 cores) now seems to give a regular 16-core server a beating of 150 Elo points (about a factor of 4 effective speedup). I was imagining much smaller gains. I was under the impression that software is more dominant, and that multicore monsters are pretty much a waste strength-wise.
Last edited by Laskos on Thu Sep 21, 2017 2:05 pm, edited 1 time in total.
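The Elo figures quoted in this post can be reproduced from the raw match results. The sketch below assumes the usual logistic Elo model and a normal-approximation 95% margin (1.96 standard errors from the per-game win/draw/loss variance). The thread does not say which tool produced the margins, so the error model and the helper names `elo_from_wdl` and `subtract` are assumptions. It also derives the 20 to 40 threads estimate as the difference of two match results, with the margins added in quadrature.

```python
import math

def elo_from_wdl(wins, losses, draws, z=1.96):
    """Elo difference and approximate margin from a match result.

    Assumes the logistic Elo model and independent games
    (an assumption; the thread does not name the tool used).
    """
    games = wins + losses + draws
    score = (wins + 0.5 * draws) / games
    # Per-game variance of the score (win = 1, draw = 0.5, loss = 0).
    var = (wins * (1.0 - score) ** 2
           + draws * (0.5 - score) ** 2
           + losses * (0.0 - score) ** 2) / games
    std_err = math.sqrt(var / games)
    elo = 400.0 * math.log10(score / (1.0 - score))
    # Delta method: slope of the Elo curve at the observed score.
    slope = 400.0 / (math.log(10) * score * (1.0 - score))
    return elo, z * std_err * slope

def subtract(a, b):
    """Difference of two (elo, margin) results, margins in quadrature."""
    return a[0] - b[0], math.hypot(a[1], b[1])

# 80 hyper-threads vs 20 threads, 690 games: (+194 -48 =448)
elo, margin = elo_from_wdl(194, 48, 448)
print(f"80HT vs 20T: {elo:+.0f} +/- {margin:.0f} Elo")

# 40 threads vs 20 threads, as (80HT vs 20T) minus (80HT vs 40T)
diff, diff_margin = subtract((75, 15), (11, 12))
print(f"40T vs 20T: {diff:+d} +/- {diff_margin:.0f} Elo")
```

Under these assumptions the first line comes out at about +75 ± 15 and the second at +64 ± 19, in line with the numbers quoted above.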
-
- Posts: 1796
- Joined: Thu Sep 18, 2008 10:24 pm
Re: Nodes/sec. with last new CPU's!
Laskos wrote:
> Or a 4-node NUMA Xeon (say a total of 96 cores) would give a regular 16-core server a beating of 150 Elo points (about a factor of 4 effective speedup).
I've heard there are issues with the quad-socket machines and that performance - certainly as recently as the v3 - is much lower than expected for chess. I don't know why. It's not simply due to the lower clock speed etc.
-
- Posts: 1471
- Joined: Tue Mar 16, 2010 12:00 am
Re: Nodes/sec. with last new CPU's!
Werewolf wrote:
> What was the actual (i.e. all-core turbo) clock speed? Intel?
The server has ES CPUs; they're running at 2.3 GHz all-core.
Last edited by Houdini on Thu Sep 21, 2017 2:48 pm, edited 1 time in total.
-
- Posts: 1471
- Joined: Tue Mar 16, 2010 12:00 am
Re: Nodes/sec. with last new CPU's!
Laskos wrote:
> Again, pretty unbelievable. So, going from 20 cores on 1 NUMA node to 40 cores on 2 NUMA nodes gives 64 +/- 19 Elo points? Doubling the time in these conditions cannot give more than 80-90 Elo points. So it means an effective speedup of 1.5-1.8 from 1 node to 2 nodes with 20 cores each. On average, even higher than what Peter got. These numbers are hard to believe.
The numbers are better than expected, but that's why you need to run the actual matches - there is surprisingly little hard data available beyond 16 threads.
Note also that this is self-testing; the engine playing itself tends to inflate the Elo differences.
Maybe I can run a final match between (20 threads at 40+0.4) and (20 threads at 80+0.8) so that we also know the Elo improvement from doubling the time.
Laskos wrote:
> I don't know why you are saying that exactly now, when you got spectacular scaling both with cores on the same number of nodes, and with NUMA.
Because despite the nice hardware scaling, the software improvement is even more impressive. Compared to Houdini 3 from 2012 we're now about 250 Elo higher, which means that 1 thread of Houdini 6 would probably be competitive with 16 threads of Houdini 3.
Last edited by Houdini on Thu Sep 21, 2017 2:57 pm, edited 1 time in total.
-
- Posts: 2821
- Joined: Fri Sep 25, 2015 9:38 pm
- Location: Sortland, Norway
Re: Nodes/sec. with last new CPU's!
Houdini wrote:
> Werewolf wrote:
> > What was the actual (i.e. all-core turbo) clock speed? Intel?
> I have ES CPUs, they're running at 2.3 GHz all-core.
Can you check how your Xeon evaluates the famous Spassky vs Fischer position, after 1 or 2 minutes of infinite analysis with 5- or 6-men Syzygy?
[d]5k2/pp4pp/4pp2/1P6/8/P2KP1P1/5P1b/2B5 b - - 0 0
-
- Posts: 1471
- Joined: Tue Mar 16, 2010 12:00 am
Re: Nodes/sec. with last new CPU's!
Houdini wrote:
> Maybe I can run a final match between (20 threads at 40+0.4) and (20 threads at 80+0.8) so that we also know the Elo improvement from doubling the time.
Results from this match are now in. 800 games yield (+195 -58 =547) or 60±13 Elo in favor of the (80+0.8) engine.
The previous matches estimated the scaling from 20 threads (20 cores) to 40 threads (40 cores) as +64±19 Elo.
So it appears that the improvement from doubling the number of threads (20->40) is similar to the doubling of the time (40+0.4->80+0.8). A rather unexpected, but very good result for the Lazy-like SMP used by Houdini 6.
Error margins remain relatively big - it would be rather expensive (both in time and electricity) to reduce them significantly.
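To put the two doublings side by side, the Elo gains can be converted into an "effective speedup" in the sense used earlier in the thread. This is only a rough sketch: it assumes Elo grows linearly with log2 of thinking time around this time control, which the matches do not establish, and it uses the measured 60 Elo per time doubling. It also evaluates the (n^0.8) rule of thumb quoted earlier. The function names are hypothetical.

```python
def effective_speedup(elo_gain, elo_per_doubling):
    """Time multiplier that would give the same Elo gain.

    Assumes Elo is linear in log2(time) (an assumption, not a measured law).
    """
    return 2.0 ** (elo_gain / elo_per_doubling)

def thread_time_factor(threads, exponent=0.8):
    """Equivalent single-thread time multiplier under the n^0.8 rule of thumb."""
    return threads ** exponent

# Measured: 20 -> 40 threads gave +64 Elo, doubling the time gave +60 Elo.
measured = effective_speedup(64, 60)

# Rule of thumb: doubling the thread count is worth 2^0.8 in time.
predicted = thread_time_factor(40) / thread_time_factor(20)

# 20 threads at (40+0.4) expressed as a single-thread time control.
factor = thread_time_factor(20)
base, increment = 40 * factor, 0.4 * factor

print(f"measured effective speedup 20->40 threads: {measured:.2f}")
print(f"n^0.8 prediction for a thread doubling:    {predicted:.2f}")
print(f"20 threads at (40+0.4) ~ 1 thread at ({base:.0f}+{increment:.1f})")
```

With margins of ±19 and ±13 Elo on the two inputs, the measured ratio is very uncertain, so this only illustrates that the observed thread doubling sits at or above what the n^0.8 rule would suggest.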
-
- Posts: 2821
- Joined: Fri Sep 25, 2015 9:38 pm
- Location: Sortland, Norway
Re: Nodes/sec. with last new CPU's!
Regarding the Spassky vs Fischer position above, H6 Pro on an 8-core i7-5960X at 4.1 GHz (5-men Syzygy) evaluates the position after 30. g3 as +0.25.
Maybe if someone analyses it on the Xeon, H6 will finally evaluate the position as triple zero!