Nodes/sec. with last new CPU's!

Houdini · Post by **Houdini** » Thu Sep 21, 2017 12:41 pm

For reference also the node speeds from the starting position after 2 minutes, using Houdini 6 with 4 GB of hash (no Large Pages):
- 20 threads (20 cores): 27.2 MN/s
- 40 threads (40 cores): 52.4 MN/s
- 80 threads (40 cores): 67.3 MN/s
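
For readers who want the ratios behind these figures, a minimal sketch follows. It uses only the three node speeds listed above; the function name `speedup` is just an illustrative helper.

```python
# Node speeds reported above, in million nodes per second (MN/s).
nps = {20: 27.2, 40: 52.4, 80: 67.3}

def speedup(base_threads: int, threads: int) -> float:
    """Ratio of node speeds between two thread counts."""
    return nps[threads] / nps[base_threads]

# Doubling the physical cores (20 -> 40) nearly doubles the node speed,
# while enabling hyper-threading (40 -> 80 threads) adds much less.
print(f"20 -> 40 threads: x{speedup(20, 40):.2f}")
print(f"40 -> 80 threads: x{speedup(40, 80):.2f}")
```

Note that node speed is not playing strength; the match results further down the thread are what measure the actual Elo gain.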

Werewolf · Post by **Werewolf** » Thu Sep 21, 2017 1:54 pm

what was the actual (i.e. all-core turbo) clock speed? Intel?

Werewolf · Post by **Werewolf** » Thu Sep 21, 2017 1:55 pm

Ah just seen E5-2698 v4.

I think that's 2.70 GHz on all cores.

Laskos · Post by **Laskos** » Thu Sep 21, 2017 1:58 pm

Houdini wrote:
Houdini wrote:The NUMA scaling you want to test will appear in the results of the current 80 threads (on 2 NUMA nodes) vs 20 threads (on 1 NUMA node) test.
After 100 games it's (+28 -7 =65) or about +74±40 Elo.

So we currently have the following:
40 hyper-threads (on 1 node ) vs 20 threads (on 1 node ): +13±10 Elo
80 hyper-threads (on 2 nodes) vs 40 threads (on 2 nodes): +11±12 Elo
80 hyper-threads (on 2 nodes) vs 20 threads (on 1 node ): +74±40 Elo
The scaling you want to study is the difference between the second and the third result, currently about 60±40 Elo going from 20 to 40 threads.
I've stopped the 80 hyper-threads vs 20 threads match after 690 games.
Result is (+194 -48 =448) or about +75±15 Elo.

That means we have the following, with 1 node using 20 cores, 2 nodes using 40 cores:
40 hyper-threads (on 1 node ) vs 20 threads (on 1 node ): +13±10 Elo
80 hyper-threads (on 2 nodes) vs 40 threads (on 2 nodes): +11±12 Elo
80 hyper-threads (on 2 nodes) vs 20 threads (on 1 node ): +75±15 Elo
From these results the scaling from 20 threads (20 cores) to 40 threads (40 cores) can be estimated as +64±19 Elo. More games would be needed to reduce the error margins, but the over-all picture is quite clear.
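
The +64±19 figure can be reproduced with a small sketch. It assumes the two matches are independent and that their error margins combine in quadrature; the post does not say how the margin was obtained, so treat this as one plausible reading. The helper name `diff_with_margin` is made up for illustration.

```python
import math

def diff_with_margin(a: float, a_err: float, b: float, b_err: float):
    """Difference a - b of two independent estimates.

    Assumption: the margins are independent, so they add in quadrature.
    """
    return a - b, math.hypot(a_err, b_err)

# (80 threads vs 20 threads) minus (80 threads vs 40 threads)
# leaves the 40-thread vs 20-thread difference.
gain, margin = diff_with_margin(75, 15, 11, 12)
print(f"{gain:+.0f} ± {margin:.0f} Elo")
```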

The scaling is surprisingly good, especially if you take into account that a (40+0.4) time control is quite decent with 20 or 40 threads. Applying the (n^0.8) scaling formula, 20 threads at (40+0.4) would be equivalent to 1 thread at (440+4.4) which is similar to IPON's (300+3 with ponder).
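
As a rough check of that arithmetic, here is a minimal sketch of the (n^0.8) rule of thumb as stated in the post: n threads are treated as worth n^0.8 times the thinking time of a single thread. The function name and its parameters are illustrative, not part of any engine or tool.

```python
def equivalent_single_thread_tc(threads: int, base_seconds: float,
                                increment: float, exponent: float = 0.8):
    """Scale a time control by threads**exponent (rule of thumb from the post)."""
    factor = threads ** exponent
    return base_seconds * factor, increment * factor

base, inc = equivalent_single_thread_tc(20, 40, 0.4)
# 20**0.8 is roughly 11, so (40+0.4) becomes roughly (440+4.4).
print(f"1 thread at about ({base:.0f}+{inc:.1f})")
```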

Again, pretty unbelievable. So, going from 20 cores 1 NUMA node to 40 cores 2 NUMA nodes gives 64 +/- 19 ELO points? The doubling in time in these conditions cannot exceed 80-90 ELO points. So, it means 1.5-1.8 effective speedup from 1 node to 2 nodes with 20 cores each. On average, even higher than what Peter got. These numbers are hard to believe.
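
The conversion from an Elo gain to an effective speedup can be sketched as below. It assumes that strength grows linearly with the logarithm of thinking time, so each doubling is worth a fixed number of Elo points (80-90 here, per the estimate above). That is a simplification, and `effective_speedup` is a hypothetical helper.

```python
def effective_speedup(elo_gain: float, elo_per_doubling: float) -> float:
    """Time factor that would yield elo_gain if each doubling of time
    were worth elo_per_doubling Elo (assumed log-linear scaling)."""
    return 2 ** (elo_gain / elo_per_doubling)

for per_doubling in (80, 90):
    print(f"{per_doubling} Elo per doubling: x{effective_speedup(64, per_doubling):.2f}")
```

With the central value of 64 Elo this lands inside the 1.5-1.8 range quoted above; the ±19 margin widens the range further.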

This also means that Houdini 6 on 20 cores would be competitive with Houdini 5 on 40 cores, or as said on the Houdini web page, "upgrading to Houdini 6 is like doubling the computational power of your computer for chess".
Great to see the dominance of software over hardware!

I don't know why you are saying that exactly now, when you got spectacular scaling both with cores on the same number of nodes, and with NUMA. If cluster scaling is lower, but comparable and stable, say a constant 1.4 effective speedup (time-to-strength) with doubling the cluster, then the things look much more favorable now with heavy hardware than before. Say, up to now I was under the impression that a 3000 core cluster (say 150 20 core machines) gives no more than 100 ELO points advantage compared to a single 20 core machine. Now it seems more like 250-300 ELO points (effective speedup of maybe 12). Or a NUMA 4 node Xeon (say a total of 96 cores) now seems to give a regular 16 core server a beating of 150 ELO points (about a factor of 4 effective speed-up). I was imagining much smaller gains. I was under the impression that software is more dominant, and multicore monsters are pretty much a waste strength-wise.

Werewolf · Post by **Werewolf** » Thu Sep 21, 2017 2:05 pm

Laskos wrote:
Or a NUMA 4 node Xeon (say a total of 96 cores) would give a regular 16 core server a beating of 150 ELO points (about a factor of 4 effective speed-up).

I've heard there are issues with the quad socket machines and that performance - certainly as recently as the v3 - is much lower than expected for chess. I don't know why. It's not simply due to the lower clock speed etc.

Houdini · Post by **Houdini** » Thu Sep 21, 2017 2:35 pm

Werewolf wrote:what was the actual (i.e. all-core turbo) clock speed? Intel?

The server has ES CPUs, they're running at 2.3 GHz all-core.

Houdini · Post by **Houdini** » Thu Sep 21, 2017 2:48 pm

Laskos wrote:Again, pretty unbelievable. So, going from 20 cores 1 NUMA node to 40 cores 2 NUMA nodes gives 64 +/- 19 ELO points? The doubling in time in these conditions cannot exceed 80-90 ELO points. So, it means 1.5-1.8 effective speedup from 1 node to 2 nodes with 20 cores each. On average, even higher than what Peter got. These numbers are hard to believe.

The numbers are better than expected, but that's why you need to run the actual matches - there is surprisingly little hard data available beyond 16 threads.
Note also that this is self-testing, the engine playing itself tends to inflate the Elo differences.
Maybe I can run a final match between (20 threads at 40+0.4) and (20 threads at 80+0.8) so that we also know the Elo improvement from doubling the time.

Laskos wrote:I don't know why you are saying that exactly now, when you got spectacular scaling both with cores on the same number of nodes, and with NUMA.

Because despite the nice hardware scaling, the software improvement is even more impressive. Compared to Houdini 3 from 2012 we're now about 250 Elo higher, which means that 1 thread of Houdini 6 would probably be competitive with 16 threads of Houdini 3.

Nordlandia · Post by **Nordlandia** » Thu Sep 21, 2017 2:49 pm

Houdini wrote:
Werewolf wrote:what was the actual (i.e. all-core turbo) clock speed? Intel?
I have ES CPUs, they're running at 2.3 GHz all-core.

Can you check how your Xeon evaluates the famous Spassky vs Fischer position, with 1 or 2 minutes of infinite analysis and 5- or 6-men Syzygy?

[d]5k2/pp4pp/4pp2/1P6/8/P2KP1P1/5P1b/2B5 b - - 0 0

Houdini · Post by **Houdini** » Fri Sep 22, 2017 2:21 pm

Houdini wrote:Maybe I can run a final match between (20 threads at 40+0.4) and (20 threads at 80+0.8) so that we also know the Elo improvement from doubling the time.

Results from this match are now in. 800 games yield (+195 -58 =547) or 60±13 Elo in favor of the (80+0.8) engine.
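
For anyone who wants to check figures like these, the sketch below converts a win/loss/draw record into an Elo difference with an error margin. It assumes the standard logistic Elo model and a 95% normal-approximation interval based on the per-game score variance; the post does not state which method was actually used, and `elo_from_wld` is an illustrative name.

```python
import math

def elo_from_wld(wins: int, losses: int, draws: int, z: float = 1.96):
    """Elo difference and approximate margin from a match record.

    Assumptions: logistic Elo model, independent games, and a normal
    approximation for the mean score (z = 1.96 for roughly 95%).
    """
    n = wins + losses + draws
    score = (wins + 0.5 * draws) / n
    # Per-game variance of the result (1 for a win, 0.5 draw, 0 loss).
    var = (wins * (1 - score) ** 2
           + draws * (0.5 - score) ** 2
           + losses * score ** 2) / n
    se = math.sqrt(var / n)

    def to_elo(s: float) -> float:
        return -400 * math.log10(1 / s - 1)

    elo = to_elo(score)
    margin = (to_elo(score + z * se) - to_elo(score - z * se)) / 2
    return elo, margin

elo, margin = elo_from_wld(195, 58, 547)
print(f"{elo:+.0f} ± {margin:.0f} Elo")
```

Under these assumptions the 800-game record comes out close to the 60±13 quoted above, and the earlier 690-game record (+194 -48 =448) comes out close to +75±15.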

The previous matches estimated the scaling from 20 threads (20 cores) to 40 threads (40 cores) as +64±19 Elo.
So it appears that the improvement from doubling the number of threads (20->40) is similar to the doubling of the time (40+0.4->80+0.8). A rather unexpected, but very good result for the Lazy-like SMP used by Houdini 6.

Error margins remain relatively big - it would be rather expensive (both in time and electricity) to reduce them significantly.

Nordlandia · Post by **Nordlandia** » Fri Sep 22, 2017 2:31 pm

Regarding the Spassky vs Fischer position above: H6 Pro on an 8-core i7-5960X at 4.1 GHz (5-men Syzygy) evaluates the position after 30. g3 as +0.25.

Maybe if someone analyses it on a Xeon, H6 will finally evaluate the position as triple zero!
