Nodes/sec. with last new CPU's!

Houdini · Post by **Houdini** » Wed Sep 20, 2017 9:34 am

Houdini wrote: I'll start the 40 vs 80 threads match; it will take some time.

After 35 hours and 890 games in the (80 threads at 0.65x speed) versus (40 threads at 1x speed) match, the results are similar to the previous tests.
(+147 -118 =625) or about (+11±12 Elo) in favor of the 80 hyper-threads.
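
For readers who want to check how a win/loss/draw tally maps to an Elo estimate, here is a minimal sketch. It assumes the usual logistic Elo model and a normal approximation for the 95% interval; the function name `elo_from_wdl` is hypothetical, and the thread does not say which tool produced the quoted figures.

```python
import math

def elo_from_wdl(wins, losses, draws, z=1.96):
    """Return (elo, margin) for a match result, seen from the first player.

    Assumes the logistic Elo model and a normal approximation (z = 1.96
    for roughly 95% confidence). This is an illustration, not the tool
    used in the thread.
    """
    n = wins + losses + draws
    score = (wins + 0.5 * draws) / n
    # Per-game variance of the score (1, 0.5 or 0) around its mean.
    var = (wins * (1 - score) ** 2
           + losses * score ** 2
           + draws * (0.5 - score) ** 2) / n
    half_width = z * math.sqrt(var / n)

    def to_elo(s):
        return 400 * math.log10(s / (1 - s))

    elo = to_elo(score)
    # Half the width of the interval after mapping both ends to Elo.
    margin = (to_elo(score + half_width) - to_elo(score - half_width)) / 2
    return elo, margin

elo, margin = elo_from_wdl(147, 118, 625)
print(f"{elo:+.1f} +/- {margin:.1f} Elo")
```

Under these assumptions the 890-game result comes out at roughly +11 Elo with a margin of about 12, in line with the figure quoted above.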

Time control of the games is 40+0.4 (with 80 or 40 threads this produces a very good level of play).
Games are run with Houdini 6 using 2048 MB of hash on a dual Xeon E5-v4 server, 2 x 20 cores at 2.3 GHz.
The CPU consumption reported by the Task Manager is huge: 1381 hours vs 690 hours. In 35 hours of real time, the match has used roughly 2000 hours of CPU time.

The results confirm that it is indeed beneficial to use more threads and suggest that at least up to 80 threads the multi-threaded performance increase is very consistent.

As a final verification of this point I will start an 80 hyper-threads vs 20 threads match on the same server. This should produce something other than a +10 Elo difference.

Laskos · Post by **Laskos** » Wed Sep 20, 2017 12:45 pm

Houdini wrote:
Houdini wrote:I'll start the 40 vs 80 threads match, it will take some time .
After 35 hours and 890 games in the (80 threads at 0.65x speed) versus (40 threads at 1x speed) match, the results are similar to the previous tests.
(+147 -118 =625) or about (+11±12 Elo) in favor of the 80 hyper-threads.

Time control of the games is 40+0.4 (with 80 or 40 threads this produces a very good level of play).
Games are run with Houdini 6 using 2048 MB of hash on a dual Xeon E5-v4 server, 2 x 20 cores at 2.3 GHz.
The CPU consumption reported by the Task Manager is quite huge, 1381 hours vs 690 hours; in 35 hours real time there's been 2000 hours of CPU time.

The results confirm that it is indeed beneficial to use more threads and suggest that at least up to 80 threads the multi-threaded performance increase is very consistent.

As a final verification for this point I will start a 80 hyper-threads vs 20 threads match on the same server. This should produce something else than a +10 Elo difference .

This is indeed very interesting. In the past I also used thread vs hyper-thread testing to derive the scaling: the 1.30 speed-up from hyper-threading (on Intel CPUs) is pretty stable and borderline in terms of strength benefit, so it can give a more precise estimate of the value of a doubling. Besides that, one doesn't need more cores, and in your case the NUMA configuration (which scales a bit differently) is the same.

Maybe instead of your next test, you could test NUMA? I think your factor of, say, 1.7 per doubling, which I called "naive" based on my past experience, stands pretty firm now, at least for Lazy Houdini (and Lazy Texel is close to that). But NUMA issues are a bit more mysterious. I guess you have a 2x20 core NUMA configuration, correct? Can you test 40 cores (40 threads) on 2 CPUs against 20 cores (20 threads) on a single CPU? Very few tests have been done on NUMA scaling (Peter did one with Texel). And probably very few users would need more than 2 multicore CPUs.

Werewolf · Post by **Werewolf** » Wed Sep 20, 2017 12:51 pm

Houdini wrote:
Houdini wrote:I'll start the 40 vs 80 threads match, it will take some time .
After 35 hours and 890 games in the (80 threads at 0.65x speed) versus (40 threads at 1x speed) match, the results are similar to the previous tests.
(+147 -118 =625) or about (+11±12 Elo) in favor of the 80 hyper-threads.

Time control of the games is 40+0.4 (with 80 or 40 threads this produces a very good level of play).
Games are run with Houdini 6 using 2048 MB of hash on a dual Xeon E5-v4 server, 2 x 20 cores at 2.3 GHz.
The CPU consumption reported by the Task Manager is quite huge, 1381 hours vs 690 hours; in 35 hours real time there's been 2000 hours of CPU time.

The results confirm that it is indeed beneficial to use more threads and suggest that at least up to 80 threads the multi-threaded performance increase is very consistent.

As a final verification for this point I will start a 80 hyper-threads vs 20 threads match on the same server. This should produce something else than a +10 Elo difference .

VERY interesting.

I wonder how high we can go on this.
There was a study done a while back, I think by Kai, suggesting that maximum speedup from Lazy with hundreds of threads was about 22x single thread (100%) speed.

Laskos · Post by **Laskos** » Wed Sep 20, 2017 1:04 pm

Werewolf wrote:
Houdini wrote:
Houdini wrote:I'll start the 40 vs 80 threads match, it will take some time .
After 35 hours and 890 games in the (80 threads at 0.65x speed) versus (40 threads at 1x speed) match, the results are similar to the previous tests.
(+147 -118 =625) or about (+11±12 Elo) in favor of the 80 hyper-threads.

Time control of the games is 40+0.4 (with 80 or 40 threads this produces a very good level of play).
Games are run with Houdini 6 using 2048 MB of hash on a dual Xeon E5-v4 server, 2 x 20 cores at 2.3 GHz.
The CPU consumption reported by the Task Manager is quite huge, 1381 hours vs 690 hours; in 35 hours real time there's been 2000 hours of CPU time.

The results confirm that it is indeed beneficial to use more threads and suggest that at least up to 80 threads the multi-threaded performance increase is very consistent.

As a final verification for this point I will start a 80 hyper-threads vs 20 threads match on the same server. This should produce something else than a +10 Elo difference .
VERY interesting.

I wonder how high we can go on this.
There was a study done a while back, I think by Kai, suggesting that maximum speedup from Lazy with hundreds of threads was about 22x single thread (100%) speed.

I guess I was wrong.

I assumed Amdahl's Law, and it fitted very well with Stockfish Lazy up to 16 threads, fitted very very well. Frankly, I don't think that anybody has a clear idea how Lazy works, and how the scaling to many cores in Texel and Houdini with Lazy is better than that of MCTS Go programs, which should in principle scale better than chess engines with alpha-beta. It's all weird and new, to me at least.

If this Lazy weird scaling is valid for clusters too, with say constant 1.3 instead of constant 1.7 effective speed-up per doubling the number of clusters, we could see 5,000 core clusters in the future having a significant advantage over regular hardware.
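
To make the contrast between the two models concrete, here is a small sketch comparing Amdahl's Law with a constant speed-up factor per doubling. The parallel fraction `P` is an assumption chosen only so that the Amdahl ceiling equals the 22x figure mentioned above; it is not a measured value, and both function names are hypothetical.

```python
import math

def amdahl_speedup(n, p):
    """Amdahl's Law: speed-up on n threads when a fraction p of the work is parallel."""
    return 1.0 / ((1.0 - p) + p / n)

def per_doubling_speedup(n, factor):
    """Speed-up if every doubling of the thread count multiplies speed by `factor`."""
    return factor ** math.log2(n)

# Assumption for illustration: pick p so that the ceiling 1 / (1 - p) is 22.
P = 1.0 - 1.0 / 22.0

print("threads  amdahl  x1.7/doubling  x1.3/doubling")
for n in (16, 40, 80, 5000):
    print(f"{n:7d}  {amdahl_speedup(n, P):6.1f}  "
          f"{per_doubling_speedup(n, 1.7):13.1f}  "
          f"{per_doubling_speedup(n, 1.3):13.1f}")
```

The Amdahl curve flattens out below its ceiling however many threads are added, while a constant factor per doubling keeps growing as a power of the thread count. That is why the two models can look similar at moderate thread counts and then diverge.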

Houdini · Post by **Houdini** » Wed Sep 20, 2017 1:37 pm

Laskos wrote:Maybe instead of the next test you make, you can test NUMA? I think your, called by me from my past experience "naive", say 1.7 factor per doubling, stands pretty firm now, at least for Lazy Houdini (and close to that is Lazy Texel). But NUMA issues are a bit more mysterious. I guess you have 2x20 core NUMA configuration, correct? Can you test 40 cores (40 threads) on 2 CPU against 20 cores (20 threads) on single CPU? Very few tests are done to see the NUMA scaling (Peter did with Texel). And probably beyond 2 multicore CPUs very few users would need.

Indeed, it's 2 NUMA nodes each containing a 20-core Xeon.

The NUMA scaling you want to test will appear in the results of the current 80 threads (on 2 NUMA nodes) vs 20 threads (on 1 NUMA node) test.
After 100 games it's (+28 -7 =65) or about +74±40 Elo.

So we currently have the following:

40 hyper-threads (on 1 node ) vs 20 threads (on 1 node ): +13±10 Elo
80 hyper-threads (on 2 nodes) vs 40 threads (on 2 nodes): +11±12 Elo
80 hyper-threads (on 2 nodes) vs 20 threads (on 1 node ): +74±40 Elo

The scaling you want to study is the difference between the second and the third result, currently about 60±40 Elo going from 20 to 40 threads.

ouachita · Post by **ouachita** » Wed Sep 20, 2017 3:15 pm

In short, whereas the three golden rules of engine competition in the past were cores, cores and cores, today those same golden rules are threads, threads and threads.

Robert, I assume you are using the Xeon E5-2698 v4?

Houdini · Post by **Houdini** » Wed Sep 20, 2017 4:05 pm

ouachita wrote:In short, whereas the three golden rules of engine competition in the past were cores, cores and cores, today those same golden rules are threads, threads and threads.

Robert, I assume you are using the Xeon E5-2698 v4?

Correct, the server uses E5-2698 v4.

ouachita · Post by **ouachita** » Wed Sep 20, 2017 9:36 pm

Robert,
Although I might be wrong, your excellent posts today seem inconsistent with the text at your site:

"If your computer supports hyper-threading it is usually recommended not using more threads than physical cores. The additional hyper-threads will yield about 25% to 30% extra node speed, but the inefficiency of the parallel alpha-beta search with the higher number of threads will partially offset this speed gain. This means that the extra hyper-threads will produce only a small increase in Elo – probably at most 10 Elo."

I thought you posted today that 80 threads on 40 cores was stronger than say 40 threads.

Please advise, thx.

Houdini · Post by **Houdini** » Wed Sep 20, 2017 10:54 pm

ouachita wrote:Robert,
Although I might be wrong, your excellent posts today seem inconsistent with the text at your site:

"...This means that the extra hyper-threads will produce only a small increase in Elo – probably at most 10 Elo."

I thought you posted today that 80 threads on 40 cores was stronger than say 40 threads.

Please advise, thx.

Bobby, the results above demonstrate what is written in the manual (see the bold section): 80 hyper-threads are only about 10 Elo stronger than 40 threads.
There is some gain, but it is rather small compared to the apparent huge increase from 40 to 80 threads (or as the Windows Task Manager shows: from 50% CPU usage to 100% CPU usage).

Houdini · Post by **Houdini** » Thu Sep 21, 2017 12:26 pm

Houdini wrote:The NUMA scaling you want to test will appear in the results of the current 80 threads (on 2 NUMA nodes) vs 20 threads (on 1 NUMA node) test.
After 100 games it's (+28 -7 =65) or about +74±40 Elo.

So we currently have the following:
40 hyper-threads (on 1 node ) vs 20 threads (on 1 node ): +13±10 Elo
80 hyper-threads (on 2 nodes) vs 40 threads (on 2 nodes): +11±12 Elo
80 hyper-threads (on 2 nodes) vs 20 threads (on 1 node ): +74±40 Elo
The scaling you want to study is the difference between the second and the third result, currently about 60±40 Elo going from 20 to 40 threads.

I've stopped the 80 hyper-threads vs 20 threads match after 690 games.
Result is (+194 -48 =448) or about +75±15 Elo.

That means we have the following, with 1 node using 20 cores, 2 nodes using 40 cores:

40 hyper-threads (on 1 node ) vs 20 threads (on 1 node ): +13±10 Elo
80 hyper-threads (on 2 nodes) vs 40 threads (on 2 nodes): +11±12 Elo
80 hyper-threads (on 2 nodes) vs 20 threads (on 1 node ): +75±15 Elo

From these results the scaling from 20 threads (20 cores) to 40 threads (40 cores) can be estimated as +64±19 Elo. More games would be needed to reduce the error margins, but the overall picture is quite clear.
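
The +64±19 estimate can be reproduced by subtracting the 80-vs-40 result from the 80-vs-20 result. The sketch below assumes the two matches are independent and that Elo differences simply add, so the error margins combine in quadrature; `subtract_elo` is a hypothetical helper name.

```python
import math

def subtract_elo(a, a_err, b, b_err):
    """Difference a - b of two independent Elo estimates.

    Assumption: the two errors are independent, so they add in quadrature.
    """
    return a - b, math.hypot(a_err, b_err)

# (80 HT vs 20 threads) minus (80 HT vs 40 threads) ~ (40 threads vs 20 threads)
diff, err = subtract_elo(75, 15, 11, 12)
print(f"{diff:+.0f} +/- {err:.0f} Elo")  # prints "+64 +/- 19 Elo"
```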

The scaling is surprisingly good, especially if you take into account that a (40+0.4) time control is quite decent with 20 or 40 threads. Applying the (n^0.8) scaling formula, 20 threads at (40+0.4) would be equivalent to 1 thread at (440+4.4) which is similar to IPON's (300+3 with ponder).
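
The time-control equivalence follows from multiplying both parts of the time control by n^0.8. A minimal sketch, where the function name and its parameters are hypothetical and the exponent is simply the 0.8 quoted above:

```python
def equivalent_single_thread_tc(threads, base_s, inc_s, exponent=0.8):
    """Scale a time control (base + increment, in seconds) by threads ** exponent.

    Assumption: effective speed-up is threads ** exponent, per the n^0.8
    rule of thumb quoted in the post.
    """
    factor = threads ** exponent
    return base_s * factor, inc_s * factor

base, inc = equivalent_single_thread_tc(20, 40, 0.4)
print(f"about {base:.0f}+{inc:.1f} on 1 thread")
```

Since 20^0.8 is close to 11, this gives approximately 440+4.4. The same exponent implies 2^0.8 ≈ 1.74 per doubling, which matches the factor of about 1.7 per doubling discussed earlier in the thread.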

This also means that Houdini 6 on 20 cores would be competitive with Houdini 5 on 40 cores, or as said on the Houdini web page, "upgrading to Houdini 6 is like doubling the computational power of your computer for chess".
Great to see the dominance of software over hardware!
