Nodes/sec. with last new CPU's!

Discussion of anything and everything relating to chess playing software and machines.

Moderators: hgm, Rebel, chrisw

zullil
Posts: 6442
Joined: Tue Jan 09, 2007 12:31 am
Location: PA USA
Full name: Louis Zulli

Re: Nodes/sec. with last new CPU's!

Post by zullil »

Leto wrote:If anyone's interested here are the results I got with my 12 core Xeon and my 4 thread HP laptop, using the 2017-5-22 asmfishw:

Intel Xeon X5650:
Total time: 169889 ms
Nodes Searched: 3593687704
Nodes/Second: 21153151

HP Pavilion Intel i3-3130M:
Total time: 507486 ms
Nodes Searched: 195599135
Nodes/Second: 3854291
How many threads did you use on the Xeon? I did not see instructions on the benchmarking site about setting Threads = physical cores. If you have hyper-threading enabled, you can likely get a higher NPS using perhaps 16 threads. Of course, what this means is open to interpretation, but people seem to like seeing big numbers. :wink:
Dann Corbit
Posts: 12540
Joined: Wed Mar 08, 2006 8:57 pm
Location: Redmond, WA USA

Re: Nodes/sec. with last new CPU's!

Post by Dann Corbit »

Laskos wrote:
Houdini wrote:
Laskos wrote:Even if 0.955 is quite imprecise and varies across engines, the function itself is tame. Milos' result was 2.4 with 0.955 efficiency, comparing Ryzen to that 224-thread Xeon. With a much higher 0.975 efficiency, fitted to what you say you got with Houdini, this factor becomes 3.4, pretty far from the factor of 6 you calculated. And I think that Amdahl's law is the one to be applied to parallel search with alpha-beta.
Chess engine alpha-beta is a lot more complex than the tasks for which Amdahl's law usually is applied.
I see no fundamental reason why the coefficient to use in Amdahl's formula could not depend on the number of threads.
For example:
- 0.955 for 8 threads
- 0.975 for 24 threads
- 0.990 for 80 threads
This, of course, would imply that Amdahl's law is not a very good model of multi-threaded chess engines.

I will end my contribution to this thread by once again saying that only well-controlled tests with high number of threads will allow us to move beyond the current, rather idle speculations :-).
Well, it's quite possible. Peter Österlund got crazy results with his Lazy Texel.
http://www.talkchess.com/forum/viewtopic.php?t=64824

I put the table with Lazy Texel doublings in a more readable form, and added the effective speedup per doubling of cores, and the Amdahl efficiency if Amdahl's law is assumed to hold here. The numbers I added are not perfect, because the doubling-time line is not equal in base strength to the doubling-cores line (only the first row is), but they are quite close.

Elo gain
Errors are about 10 Elo points (2 SD).

Code: Select all

                                                   NUMA=2
Config          X1     X2     X4     X8    X16      X32 
----------------------------------------------------------
Doubling time    -    112    100     85     72       63 
Doubling cores   -     72     76     73     66       44
==========================================================
Effect. speedup       
per doubling         1.56   1.69   1.81    1.89    1.62

Amdahl's
efficiency          0.719  0.900  0.972   0.992   0.981
It's not Amdahl's law by a fair shot, it's not even your formula. As Peter showed, the smaller speedup of 1.62 for 32 vs 16 cores is due mainly to the NUMA=2 configuration. The scaling seems to improve with cores, pretty crazy, close to Robert Hyatt's crazy formula for effective speedup of 1 + 0.7 * (n_cores-1).
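For anyone who wants to check the added rows, the effective-speedup figures follow from the two Elo rows by treating the Elo gain from doubling cores as a fraction of the Elo gain from doubling time. A minimal sketch, using only the numbers from the table above:

```python
# Reproduce the "effective speedup per doubling" row from the Elo table above.
# Assumption: an Elo gain of g_cores from doubling cores is worth the fraction
# g_cores/g_time of a time-doubling, i.e. a speedup of 2**(g_cores/g_time).
elo_per_time_doubling = [112, 100, 85, 72, 63]   # X2..X32, "Doubling time" row
elo_per_core_doubling = [ 72,  76, 73, 66, 44]   # X2..X32, "Doubling cores" row

for g_time, g_cores in zip(elo_per_time_doubling, elo_per_core_doubling):
    speedup = 2 ** (g_cores / g_time)
    print(f"{speedup:.2f}")   # prints 1.56, 1.69, 1.81, 1.89, 1.62
```

The five printed values match the "Effect. speedup per doubling" row of the table.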
Maybe something else is in effect here.
Perhaps there is a fixed startup cost that is paid in full by the first engine that starts. So there is some baseline term that we should subtract.

Amdahl's law is basically pure math. It seems to me that if it is not obeyed, what that really means is that there is something in the model that we have not accounted for.
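For reference, plain Amdahl's law can be evaluated directly for the coefficients quoted earlier in the thread, next to Hyatt's empirical formula 1 + 0.7 * (n_cores-1). A minimal sketch; the parallel fractions p are the ones from the discussion, not new measurements:

```python
# Plain Amdahl's law, S(n) = 1 / ((1-p) + p/n), for the thread/coefficient
# pairs quoted above, compared with Bob Hyatt's empirical speedup formula.
def amdahl(p, n):
    return 1.0 / ((1.0 - p) + p / n)

def hyatt(n):
    return 1.0 + 0.7 * (n - 1)

for p, n in [(0.955, 8), (0.975, 24), (0.990, 80)]:
    print(f"p={p}  n={n:2d}  Amdahl = {amdahl(p, n):5.1f}  Hyatt = {hyatt(n):5.1f}")
```

The two formulas roughly agree at small thread counts but diverge sharply by 80 threads, which is exactly the regime where the thread's disagreement lives.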
Taking ideas is not a vice, it is a virtue. We have another word for this. It is called learning.
But sharing ideas is an even greater virtue. We have another word for this. It is called teaching.
User avatar
Leto
Posts: 2071
Joined: Thu May 04, 2006 3:40 am
Location: Dune

Re: Nodes/sec. with last new CPU's!

Post by Leto »

zullil wrote:
Leto wrote:If anyone's interested here are the results I got with my 12 core Xeon and my 4 thread HP laptop, using the 2017-5-22 asmfishw:

Intel Xeon X5650:
Total time: 169889 ms
Nodes Searched: 3593687704
Nodes/Second: 21153151

HP Pavilion Intel i3-3130M:
Total time: 507486 ms
Nodes Searched: 195599135
Nodes/Second: 3854291
How many threads did you use on the Xeon? I did not see instructions on the benchmarking site about setting Threads = physical cores. If you have hyper-threading enabled, you can likely get a higher NPS using perhaps 16 threads. Of course, what this means is open to interpretation, but people seem to like seeing big numbers. :wink:
I used 12 threads.
Geonerd
Posts: 79
Joined: Fri Mar 10, 2017 1:44 am

Re: Nodes/sec. with last new CPU's!

Post by Geonerd »

AMD "Phenom" x6 at 3.4GHz
Win7, 8gb.

asmFishW_2017-05-22_popcnt
*** bench hash 1024 threads 6 depth 26 realtime 0 ***
info string hash set to 1024 MB no large pages
info string node 0 has threads 0 1 2 3 4 5
1: nodes: 142040275 8991 knps
.
.
.
37: nodes: 2538277 13573 knps
===========================
Total time (ms) : 253063
Nodes searched : 2651985044
Nodes/second : 10479544
jpqy
Posts: 550
Joined: Thu Apr 24, 2008 9:31 am
Location: Belgium

Re: Nodes/sec. with last new CPU's!

Post by jpqy »

User avatar
Houdini
Posts: 1471
Joined: Tue Mar 16, 2010 12:00 am

Re: Nodes/sec. with last new CPU's!

Post by Houdini »

Laskos wrote:It's not Amdahl's law by a fair shot, it's not even your formula. As Peter showed, the smaller speedup of 1.62 for 32 vs 16 cores is due mainly to the NUMA=2 configuration. The scaling seems to improve with cores, pretty crazy, close to Robert Hyatt's crazy formula for effective speedup of 1 + 0.7 * (n_cores-1).
Some new information, I'm currently running Houdini 6 on a dual 20-core Xeon server in a match of 20 threads against 40 hyper-threads.
So basically you have (20 threads at 1x speed) versus (40 threads at 0.65x speed).
After 900 games, the 40 threads engine leads by about 10 Elo points.

The result is very similar to the 12 vs 24 threads match I reported above.
The "naive" formula - implying that every doubling of the number of threads produces about 20% of overhead - seems to hold quite well.

If I find the time, I'll also run a 40 threads vs 80 hyper-threads match.
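The arithmetic behind that match can be checked on the back of an envelope, using only figures quoted in this thread (hyper-threading gives a raw NPS factor of about 1.3; the "naive" formula costs about 20% per thread-doubling) plus one assumed, illustrative value: roughly 70 Elo per doubling of effective search speed, which is not a number from this thread.

```python
import math

# Back-of-envelope check of the 20-threads vs 40-hyper-threads match.
raw_nps_factor    = 1.3   # 40 hyper-threads vs 20 threads: ~30% more nodes
doubling_overhead = 0.8   # "naive" formula: ~20% lost per thread-doubling
elo_per_doubling  = 70    # assumption for illustration, not from the thread

effective_speedup = raw_nps_factor * doubling_overhead       # ~1.04
elo_gain = elo_per_doubling * math.log2(effective_speedup)   # ~+4 Elo
print(f"effective speedup {effective_speedup:.2f}, ~{elo_gain:.0f} Elo")
```

A predicted edge of a few Elo points for the 40-thread side is at least consistent with the reported +10 Elo after 900 games, given the ±10 error bars.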
User avatar
Laskos
Posts: 10948
Joined: Wed Jul 26, 2006 10:21 pm
Full name: Kai Laskos

Re: Nodes/sec. with last new CPU's!

Post by Laskos »

Houdini wrote:
Laskos wrote:It's not Amdahl's law by a fair shot, it's not even your formula. As Peter showed, the smaller speedup of 1.62 for 32 vs 16 cores is due mainly to the NUMA=2 configuration. The scaling seems to improve with cores, pretty crazy, close to Robert Hyatt's crazy formula for effective speedup of 1 + 0.7 * (n_cores-1).
Some new information, I'm currently running Houdini 6 on a dual 20-core Xeon server in a match of 20 threads against 40 hyper-threads.
So basically you have (20 threads at 1x speed) versus (40 threads at 0.65x speed).
After 900 games, the 40 threads engine leads by about 10 Elo points.

The result is very similar to the 12 vs 24 threads match I reported above.
The "naive" formula - implying that every doubling of the number of threads produces about 20% of overhead - seems to hold quite well.

If I find the time, I'll also run a 40 threads vs 80 hyper-threads match.
Yes, Peter seems to get similar results with Lazy Texel, or at least close ones. Here is the thread he posted:
http://www.talkchess.com/forum/viewtopi ... =&start=60

I don't know how this Lazy stuff with alpha-beta can scale better than the Monte Carlo search of AlphaGo (although that ran on a cluster), but it seems something like your 1.75^(doublings) is not naive anymore. Then there are NUMA and cluster issues; Peter studied those too in that thread.
User avatar
Houdini
Posts: 1471
Joined: Tue Mar 16, 2010 12:00 am

Re: Nodes/sec. with last new CPU's!

Post by Houdini »

I've stopped the match after 1200 games, final score is (+196 -149 =855) or about (+13±10 Elo) in favor of the 40 hyper-threads engine.

I'll start the 40 vs 80 threads match, it will take some time :).
Dann Corbit
Posts: 12540
Joined: Wed Mar 08, 2006 8:57 pm
Location: Redmond, WA USA

Re: Nodes/sec. with last new CPU's!

Post by Dann Corbit »

Houdini wrote:I've stopped the match after 1200 games, final score is (+196 -149 =855) or about (+13±10 Elo) in favor of the 40 hyper-threads engine.

I'll start the 40 vs 80 threads match, it will take some time :).
If the threads are hyperthreads, are we sure of what we are measuring?

IOW, is the power of a hyperthread core (when physical cores are already exhausted) equal to the power of a core when the physical core count has not been exhausted?
Taking ideas is not a vice, it is a virtue. We have another word for this. It is called learning.
But sharing ideas is an even greater virtue. We have another word for this. It is called teaching.
User avatar
Houdini
Posts: 1471
Joined: Tue Mar 16, 2010 12:00 am

Re: Nodes/sec. with last new CPU's!

Post by Houdini »

Dann Corbit wrote:
Houdini wrote:I've stopped the match after 1200 games, final score is (+196 -149 =855) or about (+13±10 Elo) in favor of the 40 hyper-threads engine.

I'll start the 40 vs 80 threads match, it will take some time :).
If the threads are hyperthreads, are we sure of what we are measuring?

IOW, is the power of a hyperthread core (when physical cores are already exhausted) equal to the power of a core when the physical core count has not been exhausted?
Hyper-threads are slower; when you double the number of threads you only get about a 30% increase in total node speed.
That's why I call it a match between (20 threads at 1x speed) and (40 threads at 0.65x speed).
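The 0.65x figure follows directly from that ~30% number. A minimal sketch of the arithmetic:

```python
# Where the "0.65x" comes from: doubling 20 full-core threads to 40
# hyper-threads raises total node speed by only ~30%, so each of the
# 40 threads runs at a fraction of a full core's speed.
threads_physical     = 20
total_speed_physical = threads_physical * 1.0    # 20 "units" of NPS
total_speed_hyper    = total_speed_physical * 1.3  # ~30% more nodes overall
per_thread_speed     = total_speed_hyper / 40      # 26 units over 40 threads
print(f"{per_thread_speed:.2f}")   # prints 0.65
```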