Nodes/sec. with last new CPU's!

zullil
Posts: 6442
Joined: Mon Jan 08, 2007 11:31 pm
Location: PA USA
Full name: Louis Zulli

Re: Nodes/sec. with last new CPU's!

Post by zullil » Wed Aug 30, 2017 9:10 pm

Leto wrote:If anyone's interested here are the results I got with my 12 core Xeon and my 4 thread HP laptop, using the 2017-5-22 asmfishw:

Intel Xeon X5650:
Total time: 169889 ms
Nodes Searched: 3593687704
Nodes/Second: 21153151

HP Pavilion Intel i3-3130M:
Total time: 507486 ms
Nodes Searched: 195599135
Nodes/Second: 3854291
How many threads did you use on the Xeon? I did not see any instructions on the benchmarking site about setting Threads = physical cores. If you have hyper-threading enabled, you can likely get a higher NPS using perhaps 16 threads. Of course, what this means is open to interpretation, but people seem to like seeing big numbers. :wink:

Dann Corbit
Posts: 12040
Joined: Wed Mar 08, 2006 7:57 pm
Location: Redmond, WA USA

Re: Nodes/sec. with last new CPU's!

Post by Dann Corbit » Wed Aug 30, 2017 10:26 pm

Laskos wrote:
Houdini wrote:
Laskos wrote:Even if 0.955 is quite imprecise and varies across engines, the function itself is tame. Milos's result was 2.4 with 0.955 efficiency, comparing Ryzen to that 224-threaded Xeon. With the much higher 0.975 efficiency, fitted to what you say you got with Houdini, this factor becomes 3.4, pretty far from the factor of 6 you calculated. And I think that Amdahl's law is the one to apply to parallel search with alpha-beta.
Chess engine alpha-beta is a lot more complex than the tasks for which Amdahl's law usually is applied.
I see no fundamental reason why the coefficient to use in Amdahl's formula could not depend on the number of threads.
For example:
- 0.955 for 8 threads.
- 0.975 for 24 threads
- 0.990 for 80 threads
This, of course, would imply that Amdahl's law is not a very good model of multi-threaded chess engines.
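To see what a thread-dependent coefficient would imply, here is a minimal Python sketch of Amdahl's formula, speedup = 1 / ((1 - p) + p / n), evaluated for the three example pairs listed above. The function name amdahl_speedup is made up for illustration, and the coefficients are only the example values from the post, not measurements.

```python
def amdahl_speedup(p: float, n: int) -> float:
    """Amdahl's law: speedup on n threads when a fraction p of the work is parallel."""
    if not 0.0 <= p <= 1.0 or n < 1:
        raise ValueError("need 0 <= p <= 1 and n >= 1")
    return 1.0 / ((1.0 - p) + p / n)


# Example (coefficient, thread count) pairs quoted in the post above.
examples = [(0.955, 8), (0.975, 24), (0.990, 80)]

for p, n in examples:
    print(f"p={p:.3f}  threads={n:3d}  speedup={amdahl_speedup(p, n):6.2f}")
```

With a fixed p the speedup can never exceed 1 / (1 - p); letting p grow with the thread count removes that cap, which is why such a fit no longer behaves like Amdahl's law.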

I will end my contribution to this thread by once again saying that only well-controlled tests with a high number of threads will allow us to move beyond the current, rather idle speculations :-).
Well, it's quite possible. Peter Österlund got crazy results with his Lazy Texel.
http://www.talkchess.com/forum/viewtopic.php?t=64824

I put the table of Lazy Texel doublings in a more readable form and added the effective speedup per doubling of the number of cores, plus the Amdahl efficiency that results if Amdahl's law is assumed to hold here. The numbers I added are not perfect, because the doubling-time line is not equal in base strength to the doubling-cores line (only in the first row), but they are quite close.

Elo gain
Errors are about 10 Elo points (2 SD).


                                                    NUMA=2
Config               X1     X2     X4     X8    X16    X32
----------------------------------------------------------
Doubling time         -    112    100     85     72     63
Doubling cores        -     72     76     73     66     44
==========================================================
Effect. speedup
per doubling             1.56   1.69   1.81   1.89   1.62

Amdahl's
efficiency              0.719  0.900  0.972  0.992  0.981
It's not Amdahl by a long shot; it's not even your formula. As Peter showed, the smaller speedup of 1.62 for 32 vs 16 cores is due mainly to the NUMA=2 configuration. The scaling seems to improve with the number of cores, which is pretty crazy, and comes close to Robert Hyatt's crazy formula for effective speedup, 1 + 0.7 * (n_cores - 1).
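The two derived rows of the table can be reproduced from the Elo rows. The Python sketch below is an inference from the numbers, not something stated in the post: it assumes the effective speedup per core doubling is 2^(Elo gained by doubling cores / Elo gained by doubling time), and that the Amdahl efficiency p is the value for which Amdahl's law gives exactly that speedup ratio between n and 2n cores. The function names are made up for illustration.

```python
def effective_speedup(elo_cores: float, elo_time: float) -> float:
    """Time-equivalent speedup of one core doubling (assumed: 2 ** Elo ratio)."""
    return 2.0 ** (elo_cores / elo_time)


def amdahl_efficiency(ratio: float, n: int) -> float:
    """Solve S(2n) / S(n) = ratio for p, with S(k) = 1 / ((1 - p) + p / k)."""
    return (ratio - 1.0) / (ratio * (1.0 - 1.0 / (2 * n)) - (1.0 - 1.0 / n))


# (cores before doubling, Elo from doubling time, Elo from doubling cores),
# taken from the X2..X32 columns of the table above.
rows = [(1, 112, 72), (2, 100, 76), (4, 85, 73), (8, 72, 66), (16, 63, 44)]

for n, elo_time, elo_cores in rows:
    r = effective_speedup(elo_cores, elo_time)
    p = amdahl_efficiency(r, n)
    print(f"{n:2d} -> {2 * n:2d} cores: speedup {r:.2f}, Amdahl efficiency {p:.3f}")
```

Under these assumptions the printed values should agree with the last two rows of the table, which suggests the efficiency was fitted separately for each doubling rather than once for the whole curve.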
Maybe something else is in effect here.
Perhaps there is a fixed startup cost that is paid in full by the first engine that starts. So there is some baseline term that we should subtract.

Amdahl's law is basically pure math. It seems to me that if it is not obeyed, what that really means is that there is something in the model that we have not accounted for.
Taking ideas is not a vice, it is a virtue. We have another word for this. It is called learning.
But sharing ideas is an even greater virtue. We have another word for this. It is called teaching.

Leto
Posts: 2052
Joined: Thu May 04, 2006 1:40 am
Location: Dune

Re: Nodes/sec. with last new CPU's!

Post by Leto » Thu Aug 31, 2017 1:02 am

zullil wrote:
How many threads did you use on the Xeon? I did not see any instructions on the benchmarking site about setting Threads = physical cores. If you have hyper-threading enabled, you can likely get a higher NPS using perhaps 16 threads. Of course, what this means is open to interpretation, but people seem to like seeing big numbers. :wink:
I used 12 threads.

Geonerd
Posts: 77
Joined: Fri Mar 10, 2017 12:44 am

Re: Nodes/sec. with last new CPU's!

Post by Geonerd » Thu Aug 31, 2017 1:34 am

AMD "Phenom" x6 at 3.4GHz
Win7, 8gb.

asmFishW_2017-05-22_popcnt
*** bench hash 1024 threads 6 depth 26 realtime 0 ***
info string hash set to 1024 MB no large pages
info string node 0 has threads 0 1 2 3 4 5
1: nodes: 142040275 8991 knps
.
.
.
37: nodes: 2538277 13573 knps
===========================
Total time (ms) : 253063
Nodes searched : 2651985044
Nodes/second : 10479544
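The summary lines of the bench output are related by simple arithmetic: nodes per second is the node count divided by the elapsed time in seconds. Here is a small Python check, assuming the result is truncated to an integer (an assumption, though it matches the Phenom and Xeon totals posted in this thread); the helper name is made up.

```python
def nodes_per_second(nodes: int, time_ms: int) -> int:
    """Nodes searched per second, truncated to an integer."""
    if time_ms <= 0:
        raise ValueError("time_ms must be positive")
    return nodes * 1000 // time_ms


# Totals from the Phenom X6 run above.
print(nodes_per_second(2651985044, 253063))
```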

jpqy
Posts: 532
Joined: Thu Apr 24, 2008 7:31 am
Location: Belgium

Re: Nodes/sec. with last new CPU's!

Post by jpqy » Fri Sep 01, 2017 12:06 pm


Houdini
Posts: 1471
Joined: Mon Mar 15, 2010 11:00 pm

Re: Nodes/sec. with last new CPU's!

Post by Houdini » Mon Sep 18, 2017 2:47 pm

Laskos wrote:It's not Amdahl by a long shot; it's not even your formula. As Peter showed, the smaller speedup of 1.62 for 32 vs 16 cores is due mainly to the NUMA=2 configuration. The scaling seems to improve with the number of cores, which is pretty crazy, and comes close to Robert Hyatt's crazy formula for effective speedup, 1 + 0.7 * (n_cores - 1).
Some new information: I'm currently running Houdini 6 on a dual 20-core Xeon server in a match of 20 threads against 40 hyper-threads.
So basically you have (20 threads at 1x speed) versus (40 threads at 0.65x speed).
After 900 games, the 40-thread engine leads by about 10 Elo points.

The result is very similar to the 12 vs 24 threads match I reported above.
The "naive" formula, which implies that every doubling of the number of threads produces about 20% overhead, seems to hold quite well.

If I find the time, I'll also run a 40 threads vs 80 hyper-threads match.

Laskos
Posts: 10949
Joined: Wed Jul 26, 2006 8:21 pm
Full name: Kai Laskos

Re: Nodes/sec. with last new CPU's!

Post by Laskos » Mon Sep 18, 2017 8:15 pm

Houdini wrote:
Some new information: I'm currently running Houdini 6 on a dual 20-core Xeon server in a match of 20 threads against 40 hyper-threads.
So basically you have (20 threads at 1x speed) versus (40 threads at 0.65x speed).
After 900 games, the 40-thread engine leads by about 10 Elo points.

The result is very similar to the 12 vs 24 threads match I reported above.
The "naive" formula, which implies that every doubling of the number of threads produces about 20% overhead, seems to hold quite well.

If I find the time, I'll also run a 40 threads vs 80 hyper-threads match.
Yes, Peter seems to get similar results with Lazy Texel, at least close. Here is the thread he posted:
http://www.talkchess.com/forum/viewtopi ... =&start=60

I don't know how this Lazy stuff with alpha-beta can scale better than the Monte Carlo search of AlphaGo (although that ran on a cluster), but it seems that something like your 1.75^(doublings) is not naive anymore. Then there are the NUMA and cluster issues; Peter studied them in that thread too.

Houdini
Posts: 1471
Joined: Mon Mar 15, 2010 11:00 pm

Re: Nodes/sec. with last new CPU's!

Post by Houdini » Mon Sep 18, 2017 8:37 pm

I've stopped the match after 1200 games; the final score is (+196 -149 =855), or about (+13±10 Elo), in favor of the 40 hyper-threads engine.

I'll start the 40 vs 80 threads match, it will take some time :).
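For anyone wanting to check the +13±10 figure, here is a rough Python sketch that converts a win/loss/draw record into an Elo difference with an error margin. It assumes the usual logistic Elo model and a 95% interval (z = 1.96) based on the per-game score variance; the post does not say how the margin was actually computed, and the function name is hypothetical.

```python
import math


def elo_from_wdl(wins: int, losses: int, draws: int, z: float = 1.96):
    """Return (elo_difference, margin) for a match record.

    Assumes the logistic Elo model; the margin is z standard errors of the
    mean score, mapped to Elo through the slope of the Elo curve.
    """
    games = wins + losses + draws
    score = (wins + 0.5 * draws) / games
    if not 0.0 < score < 1.0:
        raise ValueError("Elo difference is undefined for a 0% or 100% score")
    elo = 400.0 * math.log10(score / (1.0 - score))
    # Per-game variance of the result (1 for a win, 0.5 for a draw, 0 for a loss).
    variance = (wins * (1.0 - score) ** 2
                + draws * (0.5 - score) ** 2
                + losses * score ** 2) / games
    std_error = math.sqrt(variance / games)
    # d(Elo)/d(score) = 400 / (ln(10) * score * (1 - score))
    margin = z * std_error * 400.0 / (math.log(10) * score * (1.0 - score))
    return elo, margin


elo, margin = elo_from_wdl(wins=196, losses=149, draws=855)
print(f"{elo:+.1f} +/- {margin:.1f} Elo")
```

For +196 -149 =855 this gives roughly +13.6 Elo with a margin of about 10.5, in line with the quoted result. The high draw rate is what keeps the margin this small for only 1200 games.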

Dann Corbit
Posts: 12040
Joined: Wed Mar 08, 2006 7:57 pm
Location: Redmond, WA USA

Re: Nodes/sec. with last new CPU's!

Post by Dann Corbit » Mon Sep 18, 2017 8:41 pm

Houdini wrote:I've stopped the match after 1200 games, final score is (+196 -149 =855) or about (+13±10 Elo) in favor of the 40 hyper-threads engine.

I'll start the 40 vs 80 threads match, it will take some time :).
If the threads are hyperthreads, are we sure of what we are measuring?

IOW, is the power of a hyperthread core (when physical cores are already exhausted) equal to the power of a core when the physical core count has not been exhausted?

Houdini
Posts: 1471
Joined: Mon Mar 15, 2010 11:00 pm

Re: Nodes/sec. with last new CPU's!

Post by Houdini » Mon Sep 18, 2017 8:45 pm

Dann Corbit wrote:
If the threads are hyperthreads, are we sure of what we are measuring?

IOW, is the power of a hyperthread core (when physical cores are already exhausted) equal to the power of a core when the physical core count has not been exhausted?
Hyper-threads are slower; when you double the number of threads you only get about a 30% increase in total node speed.
That's why I call it a match between (20 threads at 1x speed) and (40 threads at 0.65x speed).
