Best Stockfish NPS scaling yet

Jouni
Posts: 3316
Joined: Wed Mar 08, 2006 8:15 pm

Re: Best Stockfish NPS scaling yet

Post by Jouni »

I use the command "stockfish.exe bench 256 2 20 default depth" vs "stockfish.exe bench 256 1 20 default depth". I tested versions stockfish_15030208_x64.exe and SF6 from http://stockfishchess.org/download/.
Jouni
zullil
Posts: 6442
Joined: Tue Jan 09, 2007 12:31 am
Location: PA USA
Full name: Louis Zulli

Re: Best Stockfish NPS scaling yet

Post by zullil »

Jouni wrote:I use the command "stockfish.exe bench 256 2 20 default depth" vs "stockfish.exe bench 256 1 20 default depth". I tested versions stockfish_15030208_x64.exe and SF6 from http://stockfishchess.org/download/.
Thanks. Now I know that your results are not for the version I tested (nolocks), which has yet to be committed (and might never be).

In the future, I think I'll refrain from posting scaling data for uncommitted patches, since this post seems to have generated a lot of confusion.
zullil
Posts: 6442
Joined: Tue Jan 09, 2007 12:31 am
Location: PA USA
Full name: Louis Zulli

Re: Best Stockfish NPS scaling yet

Post by zullil »

zullil wrote:Just tested Joona's nolocks (Retire global lock) patch (see http://tests.stockfishchess.org/tests/v ... 02160ebee8 )

Best NPS scaling from 8 to 16 threads I've seen yet:

Code:

Dual Intel(R) Xeon(R) CPU E5-2650 v2 @ 2.60GHz
Turbo Boost and Hyper-Threading disabled
GNU/Linux 3.18.2-031802-generic x86_64

./stockfish bench 16384 16 300000 default time 
===========================
Total time (ms) : 11100082
Nodes searched  : 278067643640
Nodes/second    : 25050954

./stockfish bench 16384 8 300000 default time 
===========================
Total time (ms) : 11100004
Nodes searched  : 163772870234
Nodes/second    : 14754307

25050954/14754307 = 1.70
Here's a "baseline" 1-thread number for Joona's nolocks version, which resides in his repository and has not been committed yet!

Code:

./stockfish bench 16384 1 300000 default time
===========================
Total time (ms) : 11100000
Nodes searched  : 23435580318
Nodes/second    : 2111313
NPS scaling 1 to 16 threads is 25050954/2111313 = 11.9
NPS scaling 1 to 8 threads is 14754307/2111313 = 7.0

11.9 seems to suggest room for further improvement.:wink:
syzygy
Posts: 5569
Joined: Tue Feb 28, 2012 11:56 pm

Re: Best Stockfish NPS scaling yet

Post by syzygy »

zullil wrote:Here's a "baseline" 1-thread number for Joona's nolocks version, which resides in his repository and has not been committed yet!

Code:

./stockfish bench 16384 1 300000 default time
===========================
Total time (ms) : 11100000
Nodes searched  : 23435580318
Nodes/second    : 2111313
NPS scaling 1 to 16 threads is 25050954/2111313 = 11.9
NPS scaling 1 to 8 threads is 14754307/2111313 = 7.0

11.9 seems to suggest room for further improvement.:wink:
How does this compare to other engines?
Dann Corbit
Posts: 12566
Joined: Wed Mar 08, 2006 8:57 pm
Location: Redmond, WA USA

Re: Best Stockfish NPS scaling yet

Post by Dann Corbit »

zullil wrote:
zullil wrote:Just tested Joona's nolocks (Retire global lock) patch (see http://tests.stockfishchess.org/tests/v ... 02160ebee8 )

Best NPS scaling from 8 to 16 threads I've seen yet:

Code:

Dual Intel(R) Xeon(R) CPU E5-2650 v2 @ 2.60GHz
Turbo Boost and Hyper-Threading disabled
GNU/Linux 3.18.2-031802-generic x86_64

./stockfish bench 16384 16 300000 default time 
===========================
Total time (ms) : 11100082
Nodes searched  : 278067643640
Nodes/second    : 25050954

./stockfish bench 16384 8 300000 default time 
===========================
Total time (ms) : 11100004
Nodes searched  : 163772870234
Nodes/second    : 14754307

25050954/14754307 = 1.70
Here's a "baseline" 1-thread number for Joona's nolocks version, which resides in his repository and has not been committed yet!

Code:

./stockfish bench 16384 1 300000 default time
===========================
Total time (ms) : 11100000
Nodes searched  : 23435580318
Nodes/second    : 2111313
NPS scaling 1 to 16 threads is 25050954/2111313 = 11.9
NPS scaling 1 to 8 threads is 14754307/2111313 = 7.0

11.9 seems to suggest room for further improvement.:wink:
Don't forget Amdahl's Law:

Code:

#include <stdio.h>
#include <stdlib.h>
#include <string.h>

char string[32767];
char *getsafe(char *buffer, int count)
{
    char *result = buffer, *np;
    if ((buffer == NULL) || (count < 1))
        result = NULL;
    else if (count == 1)
        *result = '\0';
    else if ((result = fgets(buffer, count, stdin)) != NULL)
        if ((np = strchr(buffer, '\n')))
            *np = '\0';
    return result;
}


int main(void)
{
    char *p;
    double f;
    unsigned n;
    puts("For fraction, 0 means 100% parallel. 1 means 100% serial.");
    puts("Fraction of program that is serial (0.0 - 1.00):");

    p = getsafe(string, sizeof string);
    f = atof(p);
    for (n = 1; n <= 256; n++)
        printf("Speedup for %u threads is %14.12g\n", n, 1.0 / (f + (1.0 / n) * (1.0-f)));

    return 0;
}
Dann Corbit
Posts: 12566
Joined: Wed Mar 08, 2006 8:57 pm
Location: Redmond, WA USA

Re: Best Stockfish NPS scaling yet

Post by Dann Corbit »

If a program is only 2% serial then 42 is approximately the maximum speedup you can see, with infinite processors.

For fraction, 0 means 100% parallel. 1 means 100% serial.
Fraction of program that is serial (0.0 - 1.00):
.02
Speedup for 1 threads is 1
Speedup for 2 threads is 1.96078431373
Speedup for 3 threads is 2.88461538462
Speedup for 4 threads is 3.77358490566
Speedup for 5 threads is 4.62962962963
Speedup for 6 threads is 5.45454545455
Speedup for 7 threads is 6.25
Speedup for 8 threads is 7.01754385965
Speedup for 9 threads is 7.75862068966
Speedup for 10 threads is 8.47457627119
Speedup for 11 threads is 9.16666666667
Speedup for 12 threads is 9.83606557377
Speedup for 13 threads is 10.4838709677
Speedup for 14 threads is 11.1111111111
Speedup for 15 threads is 11.71875
Speedup for 16 threads is 12.3076923077
Speedup for 17 threads is 12.8787878788
Speedup for 18 threads is 13.4328358209
Speedup for 19 threads is 13.9705882353
Speedup for 20 threads is 14.4927536232
Speedup for 21 threads is 15
Speedup for 22 threads is 15.4929577465
Speedup for 23 threads is 15.9722222222
Speedup for 24 threads is 16.4383561644
Speedup for 25 threads is 16.8918918919
Speedup for 26 threads is 17.3333333333
Speedup for 27 threads is 17.7631578947
Speedup for 28 threads is 18.1818181818
Speedup for 29 threads is 18.5897435897
Speedup for 30 threads is 18.9873417722
Speedup for 31 threads is 19.375
Speedup for 32 threads is 19.7530864198
Speedup for 33 threads is 20.1219512195
Speedup for 34 threads is 20.4819277108
Speedup for 35 threads is 20.8333333333
Speedup for 36 threads is 21.1764705882
Speedup for 37 threads is 21.511627907
Speedup for 38 threads is 21.8390804598
Speedup for 39 threads is 22.1590909091
Speedup for 40 threads is 22.4719101124
Speedup for 41 threads is 22.7777777778
Speedup for 42 threads is 23.0769230769
Speedup for 43 threads is 23.3695652174
Speedup for 44 threads is 23.6559139785
Speedup for 45 threads is 23.9361702128
Speedup for 46 threads is 24.2105263158
Speedup for 47 threads is 24.4791666667
Speedup for 48 threads is 24.7422680412
Speedup for 49 threads is 25
Speedup for 50 threads is 25.2525252525
Speedup for 51 threads is 25.5
Speedup for 52 threads is 25.7425742574
Speedup for 53 threads is 25.9803921569
Speedup for 54 threads is 26.213592233
Speedup for 55 threads is 26.4423076923
Speedup for 56 threads is 26.6666666667
Speedup for 57 threads is 26.8867924528
Speedup for 58 threads is 27.1028037383
Speedup for 59 threads is 27.3148148148
Speedup for 60 threads is 27.5229357798
Speedup for 61 threads is 27.7272727273
Speedup for 62 threads is 27.9279279279
Speedup for 63 threads is 28.125
Speedup for 64 threads is 28.3185840708
Speedup for 65 threads is 28.5087719298
Speedup for 66 threads is 28.6956521739
Speedup for 67 threads is 28.8793103448
Speedup for 68 threads is 29.0598290598
Speedup for 69 threads is 29.2372881356
Speedup for 70 threads is 29.4117647059
Speedup for 71 threads is 29.5833333333
Speedup for 72 threads is 29.7520661157
Speedup for 73 threads is 29.9180327869
Speedup for 74 threads is 30.081300813
Speedup for 75 threads is 30.2419354839
Speedup for 76 threads is 30.4
Speedup for 77 threads is 30.5555555556
Speedup for 78 threads is 30.7086614173
Speedup for 79 threads is 30.859375
Speedup for 80 threads is 31.007751938
Speedup for 81 threads is 31.1538461538
Speedup for 82 threads is 31.2977099237
Speedup for 83 threads is 31.4393939394
Speedup for 84 threads is 31.5789473684
Speedup for 85 threads is 31.7164179104
Speedup for 86 threads is 31.8518518519
Speedup for 87 threads is 31.9852941176
Speedup for 88 threads is 32.1167883212
Speedup for 89 threads is 32.2463768116
Speedup for 90 threads is 32.3741007194
Speedup for 91 threads is 32.5
Speedup for 92 threads is 32.6241134752
Speedup for 93 threads is 32.7464788732
Speedup for 94 threads is 32.8671328671
Speedup for 95 threads is 32.9861111111
Speedup for 96 threads is 33.1034482759
Speedup for 97 threads is 33.2191780822
Speedup for 98 threads is 33.3333333333
Speedup for 99 threads is 33.4459459459
Speedup for 100 threads is 33.5570469799
Speedup for 101 threads is 33.6666666667
Speedup for 102 threads is 33.7748344371
Speedup for 103 threads is 33.8815789474
Speedup for 104 threads is 33.9869281046
Speedup for 105 threads is 34.0909090909
Speedup for 106 threads is 34.1935483871
Speedup for 107 threads is 34.2948717949
Speedup for 108 threads is 34.3949044586
Speedup for 109 threads is 34.4936708861
Speedup for 110 threads is 34.5911949686
Speedup for 111 threads is 34.6875
Speedup for 112 threads is 34.7826086957
Speedup for 113 threads is 34.8765432099
Speedup for 114 threads is 34.9693251534
Speedup for 115 threads is 35.0609756098
Speedup for 116 threads is 35.1515151515
Speedup for 117 threads is 35.2409638554
Speedup for 118 threads is 35.3293413174
Speedup for 119 threads is 35.4166666667
Speedup for 120 threads is 35.5029585799
Speedup for 121 threads is 35.5882352941
Speedup for 122 threads is 35.6725146199
Speedup for 123 threads is 35.7558139535
Speedup for 124 threads is 35.838150289
Speedup for 125 threads is 35.9195402299
Speedup for 126 threads is 36
Speedup for 127 threads is 36.0795454545
Speedup for 128 threads is 36.1581920904
Speedup for 129 threads is 36.2359550562
Speedup for 130 threads is 36.312849162
Speedup for 131 threads is 36.3888888889
Speedup for 132 threads is 36.4640883978
Speedup for 133 threads is 36.5384615385
Speedup for 134 threads is 36.6120218579
Speedup for 135 threads is 36.6847826087
Speedup for 136 threads is 36.7567567568
Speedup for 137 threads is 36.8279569892
Speedup for 138 threads is 36.8983957219
Speedup for 139 threads is 36.9680851064
Speedup for 140 threads is 37.037037037
Speedup for 141 threads is 37.1052631579
Speedup for 142 threads is 37.1727748691
Speedup for 143 threads is 37.2395833333
Speedup for 144 threads is 37.3056994819
Speedup for 145 threads is 37.3711340206
Speedup for 146 threads is 37.4358974359
Speedup for 147 threads is 37.5
Speedup for 148 threads is 37.5634517766
Speedup for 149 threads is 37.6262626263
Speedup for 150 threads is 37.6884422111
Speedup for 151 threads is 37.75
Speedup for 152 threads is 37.8109452736
Speedup for 153 threads is 37.8712871287
Speedup for 154 threads is 37.9310344828
Speedup for 155 threads is 37.9901960784
Speedup for 156 threads is 38.0487804878
Speedup for 157 threads is 38.1067961165
Speedup for 158 threads is 38.1642512077
Speedup for 159 threads is 38.2211538462
Speedup for 160 threads is 38.2775119617
Speedup for 161 threads is 38.3333333333
Speedup for 162 threads is 38.3886255924
Speedup for 163 threads is 38.4433962264
Speedup for 164 threads is 38.4976525822
Speedup for 165 threads is 38.5514018692
Speedup for 166 threads is 38.6046511628
Speedup for 167 threads is 38.6574074074
Speedup for 168 threads is 38.7096774194
Speedup for 169 threads is 38.7614678899
Speedup for 170 threads is 38.8127853881
Speedup for 171 threads is 38.8636363636
Speedup for 172 threads is 38.9140271493
Speedup for 173 threads is 38.963963964
Speedup for 174 threads is 39.0134529148
Speedup for 175 threads is 39.0625
Speedup for 176 threads is 39.1111111111
Speedup for 177 threads is 39.1592920354
Speedup for 178 threads is 39.2070484581
Speedup for 179 threads is 39.2543859649
Speedup for 180 threads is 39.3013100437
Speedup for 181 threads is 39.347826087
Speedup for 182 threads is 39.3939393939
Speedup for 183 threads is 39.4396551724
Speedup for 184 threads is 39.4849785408
Speedup for 185 threads is 39.5299145299
Speedup for 186 threads is 39.5744680851
Speedup for 187 threads is 39.6186440678
Speedup for 188 threads is 39.6624472574
Speedup for 189 threads is 39.7058823529
Speedup for 190 threads is 39.7489539749
Speedup for 191 threads is 39.7916666667
Speedup for 192 threads is 39.8340248963
Speedup for 193 threads is 39.8760330579
Speedup for 194 threads is 39.9176954733
Speedup for 195 threads is 39.9590163934
Speedup for 196 threads is 40
Speedup for 197 threads is 40.0406504065
Speedup for 198 threads is 40.0809716599
Speedup for 199 threads is 40.1209677419
Speedup for 200 threads is 40.1606425703
Speedup for 201 threads is 40.2
Speedup for 202 threads is 40.2390438247
Speedup for 203 threads is 40.2777777778
Speedup for 204 threads is 40.3162055336
Speedup for 205 threads is 40.3543307087
Speedup for 206 threads is 40.3921568627
Speedup for 207 threads is 40.4296875
Speedup for 208 threads is 40.46692607
Speedup for 209 threads is 40.503875969
Speedup for 210 threads is 40.5405405405
Speedup for 211 threads is 40.5769230769
Speedup for 212 threads is 40.6130268199
Speedup for 213 threads is 40.6488549618
Speedup for 214 threads is 40.6844106464
Speedup for 215 threads is 40.7196969697
Speedup for 216 threads is 40.7547169811
Speedup for 217 threads is 40.7894736842
Speedup for 218 threads is 40.8239700375
Speedup for 219 threads is 40.8582089552
Speedup for 220 threads is 40.8921933086
Speedup for 221 threads is 40.9259259259
Speedup for 222 threads is 40.9594095941
Speedup for 223 threads is 40.9926470588
Speedup for 224 threads is 41.0256410256
Speedup for 225 threads is 41.0583941606
Speedup for 226 threads is 41.0909090909
Speedup for 227 threads is 41.1231884058
Speedup for 228 threads is 41.155234657
Speedup for 229 threads is 41.1870503597
Speedup for 230 threads is 41.2186379928
Speedup for 231 threads is 41.25
Speedup for 232 threads is 41.28113879
Speedup for 233 threads is 41.3120567376
Speedup for 234 threads is 41.3427561837
Speedup for 235 threads is 41.3732394366
Speedup for 236 threads is 41.4035087719
Speedup for 237 threads is 41.4335664336
Speedup for 238 threads is 41.4634146341
Speedup for 239 threads is 41.4930555556
Speedup for 240 threads is 41.5224913495
Speedup for 241 threads is 41.5517241379
Speedup for 242 threads is 41.5807560137
Speedup for 243 threads is 41.6095890411
Speedup for 244 threads is 41.638225256
Speedup for 245 threads is 41.6666666667
Speedup for 246 threads is 41.6949152542
Speedup for 247 threads is 41.722972973
Speedup for 248 threads is 41.7508417508
Speedup for 249 threads is 41.7785234899
Speedup for 250 threads is 41.8060200669
Speedup for 251 threads is 41.8333333333
Speedup for 252 threads is 41.8604651163
Speedup for 253 threads is 41.8874172185
Speedup for 254 threads is 41.9141914191
Speedup for 255 threads is 41.9407894737
Speedup for 256 threads is 41.9672131148
zullil
Posts: 6442
Joined: Tue Jan 09, 2007 12:31 am
Location: PA USA
Full name: Louis Zulli

Re: Best Stockfish NPS scaling yet

Post by zullil »

syzygy wrote:
zullil wrote:Here's a "baseline" 1-thread number for Joona's nolocks version, which resides in his repository and has not been committed yet!

Code:

./stockfish bench 16384 1 300000 default time
===========================
Total time (ms) : 11100000
Nodes searched  : 23435580318
Nodes/second    : 2111313
NPS scaling 1 to 16 threads is 25050954/2111313 = 11.9
NPS scaling 1 to 8 threads is 14754307/2111313 = 7.0

11.9 seems to suggest room for further improvement.:wink:
How does this compare to other engines?
On my system---no idea. I don't own Komodo. I suppose I could install a few engines and do some testing. Maybe.
zullil
Posts: 6442
Joined: Tue Jan 09, 2007 12:31 am
Location: PA USA
Full name: Louis Zulli

Re: Best Stockfish NPS scaling yet

Post by zullil »

Dann Corbit wrote:If a program is only 2% serial then 42 is approximately the maximum speedup you can see, with infinite processors.

For fraction, 0 means 100% parallel. 1 means 100% serial.
Fraction of program that is serial (0.0 - 1.00):
.02
Speedup for 1 threads is 1
Speedup for 2 threads is 1.96078431373
Speedup for 3 threads is 2.88461538462
Speedup for 4 threads is 3.77358490566
Speedup for 5 threads is 4.62962962963
Speedup for 6 threads is 5.45454545455
Speedup for 7 threads is 6.25
Speedup for 8 threads is 7.01754385965
Speedup for 9 threads is 7.75862068966
Speedup for 10 threads is 8.47457627119
Speedup for 11 threads is 9.16666666667
Speedup for 12 threads is 9.83606557377
Speedup for 13 threads is 10.4838709677
Speedup for 14 threads is 11.1111111111
Speedup for 15 threads is 11.71875
Speedup for 16 threads is 12.3076923077
Good point. Maybe 11.9 is better than I realized.
bob
Posts: 20943
Joined: Mon Feb 27, 2006 7:30 pm
Location: Birmingham, AL

Re: Best Stockfish NPS scaling yet

Post by bob »

More interesting to just take the limit of that function as # processors approaches infinity. 50x.

Unfortunately measuring the serial part of a program is not trivial, because it also includes parts of the parallel code that deal with locks that cause contention.
bob
Posts: 20943
Joined: Mon Feb 27, 2006 7:30 pm
Location: Birmingham, AL

Re: Best Stockfish NPS scaling yet

Post by bob »

Dann Corbit wrote:
zullil wrote:
zullil wrote:Just tested Joona's nolocks (Retire global lock) patch (see http://tests.stockfishchess.org/tests/v ... 02160ebee8 )

Best NPS scaling from 8 to 16 threads I've seen yet:

Code:

Dual Intel(R) Xeon(R) CPU E5-2650 v2 @ 2.60GHz
Turbo Boost and Hyper-Threading disabled
GNU/Linux 3.18.2-031802-generic x86_64

./stockfish bench 16384 16 300000 default time 
===========================
Total time (ms) : 11100082
Nodes searched  : 278067643640
Nodes/second    : 25050954

./stockfish bench 16384 8 300000 default time 
===========================
Total time (ms) : 11100004
Nodes searched  : 163772870234
Nodes/second    : 14754307

25050954/14754307 = 1.70
Here's a "baseline" 1-thread number for Joona's nolocks version, which resides in his repository and has not been committed yet!

Code:

./stockfish bench 16384 1 300000 default time
===========================
Total time (ms) : 11100000
Nodes searched  : 23435580318
Nodes/second    : 2111313
NPS scaling 1 to 16 threads is 25050954/2111313 = 11.9
NPS scaling 1 to 8 threads is 14754307/2111313 = 7.0

11.9 seems to suggest room for further improvement.:wink:
Don't forget Amdahl's Law:

Code:

#include <stdio.h>
#include <stdlib.h>
#include <string.h>

char string[32767];
char *getsafe(char *buffer, int count)
{
    char *result = buffer, *np;
    if ((buffer == NULL) || (count < 1))
        result = NULL;
    else if (count == 1)
        *result = '\0';
    else if ((result = fgets(buffer, count, stdin)) != NULL)
        if ((np = strchr(buffer, '\n')))
            *np = '\0';
    return result;
}


int main(void)
{
    char *p;
    double f;
    unsigned n;
    puts("For fraction, 0 means 100% parallel. 1 means 100% serial.");
    puts("Fraction of program that is serial (0.0 - 1.00):");

    p = getsafe(string, sizeof string);
    f = atof(p);
    for (n = 1; n <= 256; n++)
        printf("Speedup for %u threads is %14.12g\n", n, 1.0 / (f + (1.0 / n) * (1.0-f)));

    return 0;
}
My math doesn't quite match yours above. With .02 of a program serial,

speedup = 1 / (0.02 + 0.98 / 8) = 7.02

Math is pretty simple. One CPU takes 1.0 time. N CPUs take 0.02 + 0.98 / N, which is 0.02 + 0.98 / 8 for 8 CPUs. 0.98 is done in parallel, 0.02 is done in serial.

The limit of this equation as the number of CPUs goes to infinity is simply 1 / 0.02, which is 50x, not 42. I didn't try to figure out where your math is breaking down. Maybe it is just rounding or truncation error? But 7 looks right for 8 CPUs, and for 16 I get 12.3. This matches my books on parallel computing.
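
Written out, the arithmetic above looks like this. The symbols are chosen here for illustration: f is the serial fraction (0.02 in this example), n is the number of CPUs, and S(n) is the speedup predicted by Amdahl's Law.

```latex
\[
S(n) = \frac{1}{f + \dfrac{1 - f}{n}}, \qquad f = 0.02
\]
\[
S(8) = \frac{1}{0.02 + \dfrac{0.98}{8}} = \frac{1}{0.1425} \approx 7.02,
\qquad
S(16) = \frac{1}{0.02 + \dfrac{0.98}{16}} = \frac{1}{0.08125} \approx 12.31
\]
\[
\lim_{n \to \infty} S(n) = \frac{1}{f} = \frac{1}{0.02} = 50
\]
```

The value near 42 in the table is S(256), the last row the program prints, so it falls short of the true limit of 50.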