Lazy SMP and "lazy cluster" experiments


Dann Corbit
Posts: 12538
Joined: Wed Mar 08, 2006 8:57 pm
Location: Redmond, WA USA

Re: Lazy SMP and "lazy cluster" experiments

Post by Dann Corbit »

Laskos wrote:
Daniel Shawul wrote:Thanks for the tests that confirm ABDADA is clearly better than lazy.

On a constant depth search my lazy gives a time-to-depth speedup of 1.11 while ABDADA gives 1.73 using 2 cores. So the advantage of ABDADA for me is quite significant even though I don't do anything fancy with lazy. YBW is superior to both in my engine, so I still think that is the best algorithm. If you fine tune your YBW (split depth etc.) and use a longer time control, maybe it will be the best for you too.

Daniel
You cannot discard other data. Stockfish YBW was beating Komodo Lazy on single core LTC, and losing terribly to Komodo on 16-32 cores LTC (TCEC). When they (Stockfish) gradually applied Lazy decently, it started beating Komodo Lazy on 16-44 cores pretty harshly, just as it beats it on 1 core.

Houdini YBW (3, 4) was at one time roughly equal on one core to Komodo Lazy, but was losing terribly on 16-32 cores. The main change in Houdini 5 was adopting Lazy, and it gave a sudden gain on 16-44 cores (similar to what happened for Stockfish). Houdini 5 in fact beat Komodo Lazy on the way to the final in TCEC, and only lost to Stockfish Lazy.

All those YBW implementations in top engines were crappy, and only yours is good?

On CCRL and CEGT, the known Lazy engines (Stockfish, Komodo, Andscacs, Houdini 5, Cheng) are scaling from 1 to 4 cores on average at least as well as, and more probably better than, the known YBW engines (Crafty 24, 23; Houdini 4, 3, 2; Rybka 4, 3, 2).

Do I have to dismiss this significant data, and to accept only yours?

Peter's data with Lazy and ABDADA are significant, but I still have to see more empirical data and at longer TC. Quick tricks like time-to-depth will not help here. To me, as of now, the LTC empirical data shows Lazy as applied in Stockfish and Komodo as the best, with Peter's results provisional and to be confirmed, and your statements as irrelevant.
I think that the experiments in SMP and Cluster engines are all very interesting.

I think that what we measure is important. Time to depth is one important measure, and Elo is probably king. But all these things are undecided.

Let's all have cool heads and explore these things rationally.

BTW, I think your analysis will eventually help us to actually understand the game. I thank you for your math.

As everyone knows, math is the queen of science.
And a beautiful queen at that.
Taking ideas is not a vice, it is a virtue. We have another word for this. It is called learning.
But sharing ideas is an even greater virtue. We have another word for this. It is called teaching.
mar
Posts: 2554
Joined: Fri Nov 26, 2010 2:00 pm
Location: Czech Republic
Full name: Martin Sedlak

Re: Lazy SMP and "lazy cluster" experiments

Post by mar »

Daniel Shawul wrote:Martin also threw in the "Stockfish does it" and the "you are being ignorant of the +100 lazy gives to Stockfish" arguments
Correction: I was talking about the +100 elo at 4 cores I got with Cheng, good enough for me considering the small amount of work it cost me back then.
I mentioned SF because they wouldn't switch for no reason (apparently they tested extensively).
Anyway, ABDADA seems very interesting and definitely worth trying.

My only problem was you playing lazy down.
Just because something doesn't work for me (for whatever reason) doesn't mean it can't work for others. Counter moves didn't work for me, for example, but I'd never say it's a stupid idea just because of that.

Anyway, Peter did thorough research on a very interesting topic (as usual).
Dann Corbit
Posts: 12538
Joined: Wed Mar 08, 2006 8:57 pm
Location: Redmond, WA USA

Re: Lazy SMP and "lazy cluster" experiments

Post by Dann Corbit »

I think the really wonderful thing about Lazy is that it works pretty well and it is easy to get right.

Some of the more esoteric methods are so hard to get right that there are practically no known open source implementations to study.

A little fire and brimstone from various advocates does little harm.

It is better to be passionate about things than apathetic.

In the end, we sometimes investigate the claims and in doing so we learn.
That makes it all worthwhile.

I think that the hands of a master can make any algorithm look good in the right setting.
Taking ideas is not a vice, it is a virtue. We have another word for this. It is called learning.
But sharing ideas is an even greater virtue. We have another word for this. It is called teaching.
lucasart
Posts: 3232
Joined: Mon May 31, 2010 1:29 pm
Full name: lucasart

Re: Lazy SMP and "lazy cluster" experiments

Post by lucasart »

Dann Corbit wrote:I think the really wonderful thing about Lazy is that it works pretty well and it is easy to get right.

Some of the more esoteric methods are so hard to get right that there are practically no known open source implementations to study.

A little fire and brimstone from various advocates does little harm.

It is better to be passionate about things than apathetic.

In the end, we sometimes investigate the claims and in doing so we learn.
That makes it all worthwhile.

I think that the hands of a master can make any algorithm look good in the right setting.
The proof is in the pudding. People like Bob or Daniel can wave their hands all they want with abstract arguments; none of them has ever written an engine that scales better than SF and K.

As for me, my experience with lazy is the following (in Demolito):
* TTD: 1.28 on 2 threads, 1.63 on 3 threads, 1.79 on 4 threads (but my machine only has 4 cores, so the 4-thread measure is underestimated due to pollution from the OS and other programs; impossible to quantify how much, though).
* ELO: +190 elo on 4 threads. That's in 8+0.08 self-play, using my 5 ply opening set (very shallow and diversified, designed to reduce draw rate). OK, in rating list conditions (depending on TC and Books), this may shrink down to 100 elo or so. But it's still impressive, and I don't see Scorpio or Crafty showing better results than that.

What happens on 32-64 threads is not really interesting to me, because all we have is anecdotal (unverifiable) evidence from the few people who possess such machines. I don't have such a machine, and for now standard desktop machines have no more than 4 cores. Even 8 cores is very rare. This has been the case for almost a decade now. 4 cores is the de facto standard.

Peter got a gain from lazy to ABDADA going from 32 to 64 cores. Does this prove that lazy is hopeless past 32 cores, and ABDADA is the way to go? Definitely not. There is a lot more to lazy than people imagine, and if fine tuned optimally (as SF tries to do) it's a monster of ELO scaling (not TTD which is the wrong measure).

But yes, perhaps the most optimally tuned lazy cannot beat ABDADA on 64 cores. Not a question of practical relevance today with (almost) everyone running 4 cores.
Theory and practice sometimes clash. And when that happens, theory loses. Every single time.
Dann Corbit
Posts: 12538
Joined: Wed Mar 08, 2006 8:57 pm
Location: Redmond, WA USA

Re: Lazy SMP and "lazy cluster" experiments

Post by Dann Corbit »

lucasart wrote:
Dann Corbit wrote:I think the really wonderful thing about Lazy is that it works pretty well and it is easy to get right.

Some of the more esoteric methods are so hard to get right that there are practically no known open source implementations to study.

A little fire and brimstone from various advocates does little harm.

It is better to be passionate about things than apathetic.

In the end, we sometimes investigate the claims and in doing so we learn.
That makes it all worthwhile.

I think that the hands of a master can make any algorithm look good in the right setting.
The proof is in the pudding. People like Bob or Daniel can wave their hands all they want with abstract arguments; none of them has ever written an engine that scales better than SF and K.

As for me, my experience with lazy is the following (in Demolito):
* TTD: 1.28 on 2 threads, 1.63 on 3 threads, 1.79 on 4 threads (but my machine only has 4 cores, so the 4-thread measure is underestimated due to pollution from the OS and other programs; impossible to quantify how much, though).
* ELO: +190 elo on 4 threads. That's in 8+0.08 self-play, using my 5 ply opening set (very shallow and diversified, designed to reduce draw rate). OK, in rating list conditions (depending on TC and Books), this may shrink down to 100 elo or so. But it's still impressive, and I don't see Scorpio or Crafty showing better results than that.

What happens on 32-64 threads is not really interesting to me, because all we have is anecdotal (unverifiable) evidence from the few people who possess such machines. I don't have such a machine, and for now standard desktop machines have no more than 4 cores. Even 8 cores is very rare. This has been the case for almost a decade now. 4 cores is the de facto standard.

Peter got a gain from lazy to ABDADA going from 32 to 64 cores. Does this prove that lazy is hopeless past 32 cores, and ABDADA is the way to go? Definitely not. There is a lot more to lazy than people imagine, and if fine tuned optimally (as SF tries to do) it's a monster of ELO scaling (not TTD which is the wrong measure).

But yes, perhaps the most optimally tuned lazy cannot beat ABDADA on 64 cores. Not a question of practical relevance today with (almost) everyone running 4 cores.
The change to Lazy SMP brought Komodo and SF a few Elo (I guess less than 10), and that is probably because it is so easy to implement correctly.

The vast superiority of SF and Komodo has very little to do with the SMP schemes and a great deal to do with the pruning schemes.

IMO-YMMV

I think there is a lot of unexplored ground in SMP methodology. I guess that the best techniques still have not been found.
Taking ideas is not a vice, it is a virtue. We have another word for this. It is called learning.
But sharing ideas is an even greater virtue. We have another word for this. It is called teaching.
petero2
Posts: 684
Joined: Mon Apr 19, 2010 7:07 pm
Location: Sweden
Full name: Peter Osterlund

Re: SMP NPS measurements

Post by petero2 »

petero2 wrote:To check if NPS scales linearly with the number of search threads as expected I ran a series of tests on an Amazon EC2 r4.16xlarge instance
...
I let texel search for 20 seconds from the starting position, using different numbers of search threads and different hash table sizes. Each search was repeated five times.

When using 1 thread and 1MB hash, the speed was 1.47 MN/s.

For other combinations of search threads and hash size, the measured speed relative to the 1 thread/1MB value was:

Code: Select all

threads      1M      4M     16M     64M    256M      1G      4G     16G     64G    256G
 1        1       1.011   1.015   1.013   1.034   1.024   0.980   1.011   1.005   0.986
 2        2.010   1.999   2.031   2.015   2.031   2.028   1.970   2.036   2.028   1.980
 4        4.002   3.995   4.008   4.017   4.060   4.033   3.924   4.058   4.038   3.931
 8        7.944   7.966   7.968   8.007   8.012   8.059   7.831   8.068   8.037   7.836
16       15.718  15.822  15.846  15.747  15.882  15.871  15.418  16.007  15.923  15.392
32       30.781  30.855  30.790  30.843  31.267  31.187  30.386  31.690  31.830  30.919
64       36.540  36.070  36.254  36.346  36.779  37.108  36.076  37.836  38.380  38.217
If each entry is divided by the number of used search threads, we get the following relative speeds per thread:

Code: Select all

threads      1M      4M     16M     64M    256M      1G      4G     16G     64G    256G
 1        1       1.011   1.015   1.013   1.034   1.024   0.980   1.011   1.005   0.986
 2        1.005   1.000   1.016   1.008   1.016   1.014   0.985   1.018   1.014   0.990
 4        1.000   0.999   1.002   1.004   1.015   1.008   0.981   1.015   1.010   0.983
 8        0.993   0.996   0.996   1.001   1.001   1.007   0.979   1.008   1.005   0.980
16        0.982   0.989   0.990   0.984   0.993   0.992   0.964   1.000   0.995   0.962
32        0.962   0.964   0.962   0.964   0.977   0.975   0.950   0.990   0.995   0.966
64        0.571   0.564   0.566   0.568   0.575   0.580   0.564   0.591   0.600   0.597
The time in seconds required to initialize the hash table is given by the following table:

Code: Select all

threads      1M      4M     16M     64M    256M      1G      4G     16G     64G    256G
 1        0.023   0.022   0.021   0.041   0.099   0.022   0.022   4.817  19.302  84.250
 2        0.022   0.022   0.022   0.041   0.100   0.023   0.023   4.910  19.527  84.641
 4        0.024   0.022   0.023   0.041   0.100   0.023   0.023   4.922  19.663  84.971
 8        0.023   0.024   0.023   0.043   0.100   0.024   0.023   4.949  19.602  85.991
16        0.026   0.026   0.025   0.044   0.104   0.026   0.026   4.944  19.649  85.229
32        0.030   0.031   0.031   0.049   0.108   0.030   0.030   4.944  19.773  85.196
64        0.041   0.042   0.042   0.060   0.118   0.041   0.040   4.933  19.410  85.199
...
It is quite remarkable that NPS is almost equal when using a 1MB hash table that fits completely in L3 cache, and when using a 64GB hash table, which is around 1000 times larger than the L3 cache.
I find this result quite unexpected, so I repeated it on my 24 core computer:

Using 1 thread and 1MB hash, the speed was 1.65 MN/s.

Speed relative to the 1 thread/1MB hash value:

Code: Select all

threads      1M      4M     16M     64M    256M      1G      4G     16G    mean
 1        1       1.001   1.004   1.004   1.023   1.009   0.979   0.857   0.985
 3        2.729   2.736   2.769   2.733   2.746   2.767   2.713   2.383   2.697
 6        5.242   5.260   5.225   5.191   5.221   5.271   5.134   4.528   5.134
12       10.401  10.460  10.388  10.247  10.246  10.349  10.124   8.787  10.125
24       20.430  20.389  20.474  20.224  20.567  20.688  20.248  19.114  20.267
48       24.061  24.234  24.046  24.190  24.500  24.838  24.185  23.910  24.245
Speed per thread:

Code: Select all

threads      1M      4M     16M     64M    256M      1G      4G     16G    mean
 1        1       1.001   1.004   1.004   1.023   1.009   0.979   0.857   0.985
 3        0.910   0.912   0.923   0.911   0.915   0.922   0.904   0.794   0.899
 6        0.874   0.877   0.871   0.865   0.870   0.879   0.856   0.755   0.856
12        0.867   0.872   0.866   0.854   0.854   0.862   0.844   0.732   0.844
24        0.851   0.850   0.853   0.843   0.857   0.862   0.844   0.796   0.844
48        0.501   0.505   0.501   0.504   0.510   0.517   0.504   0.498   0.505
Initialization time in seconds:

Code: Select all

threads      1M      4M     16M     64M    256M      1G      4G     16G
 1        0.037   0.035   0.036   0.059   0.130   0.034   0.036   7.773
 3        0.033   0.033   0.036   0.059   0.126   0.036   0.034   7.639
 6        0.034   0.035   0.037   0.059   0.116   0.036   0.034   7.780
12        0.037   0.038   0.037   0.057   0.121   0.037   0.035   7.642
24        0.040   0.034   0.036   0.049   0.112   0.036   0.031   7.450
48        0.040   0.032   0.032   0.050   0.100   0.033   0.035   7.708
mean      0.037   0.034   0.036   0.055   0.118   0.035   0.034   7.665  
This result is much closer to what I would theoretically predict based on turbo boost specifications for this computer.

It is as if the EC2 instance was not using turbo boost, although this page claims that it is enabled by default and I did not disable it myself.

The base speed NPS comparison (1.47/1.65 = 0.89) is consistent with turbo boost being enabled, since (2.3+7*0.1)/(2.5+8*0.1) = 0.91, but the EC2 scaling behaves as if it were disabled. Maybe the hardware is overclocked to always run at the highest turbo frequency regardless of the number of cores in use.
petero2
Posts: 684
Joined: Mon Apr 19, 2010 7:07 pm
Location: Sweden
Full name: Peter Osterlund

Page size and prefetch effects on NPS

Post by petero2 »

I have run a set of tests on my 24 core computer to see how memory page size and the use of prefetch instructions affect performance.

I tested all combinations of:
* Page size: 4KB 2MB 1GB
* Software prefetch: on off
* Number of threads: 1 3 6 12 24
* Hash table size: 1GB 2GB 4GB 8GB

For the 1 thread + 1GB hash case I got the following MN/s values:

Code: Select all

     prefetch  
     on     off
4KB  1.62   1.53
2MB  1.65   1.60
1GB  1.67   1.59
I use the 2MB/on value (1.65MN/s) as the base value. Speed relative to base value:

Code: Select all

                    prefetch on                       prefetch off
4KB pages:
threads      1G      2G      4G      8G         1G      2G      4G      8G
 1        0.980   0.989   0.960   0.941      0.929   0.939   0.913   0.891
 3        2.734   2.726   2.659   2.563      2.573   2.574   2.503   2.454
 6        5.145   5.094   5.022   4.909      4.893   4.871   4.786   4.692
12       10.218  10.014   9.779   9.666      9.608   9.574   9.321   9.201
24       20.226  20.003  19.727  19.254     19.224  19.018  18.644  18.350

2MB pages:
 1        1       1.018   1.018   1.013      0.967   0.954   0.964   0.962
 3        2.746   2.824   2.761   2.808      2.643   2.683   2.644   2.622
 6        5.309   5.359   5.319   5.306      5.057   5.039   5.065   5.050
12       10.404  10.450  10.405  10.429      9.892   9.936   9.961   9.904
24       20.611  20.824  20.797  20.891     19.860  19.665  19.684  19.671

1GB pages:
 1        1.010   0.995   0.978   0.934      0.962   0.956   0.928   0.889
 3        2.767   2.739   2.730   2.557      2.609   2.595   2.561   2.449
 6        5.252   5.213   5.141   4.895      4.995   5.005   4.878   4.658
12       10.292  10.325  10.035   9.682      9.927   9.835   9.650   9.212
24       20.670  20.584  20.297  19.618     19.640  19.501  19.289  18.631
The same data presented as relative speed per core:

Code: Select all

                    prefetch on                       prefetch off
4KB pages
threads      1G      2G      4G      8G         1G      2G      4G      8G
 1        0.980   0.989   0.960   0.941      0.929   0.939   0.913   0.891
 3        0.911   0.909   0.886   0.854      0.858   0.858   0.834   0.818
 6        0.857   0.849   0.837   0.818      0.815   0.812   0.798   0.782
12        0.852   0.835   0.815   0.806      0.801   0.798   0.777   0.767
24        0.843   0.833   0.822   0.802      0.801   0.792   0.777   0.765

2MB pages:
 1        1       1.018   1.018   1.013      0.967   0.954   0.964   0.962
 3        0.915   0.941   0.920   0.936      0.881   0.894   0.881   0.874
 6        0.885   0.893   0.887   0.884      0.843   0.840   0.844   0.842
12        0.867   0.871   0.867   0.869      0.824   0.828   0.830   0.825
24        0.859   0.868   0.867   0.870      0.828   0.819   0.820   0.820

1GB pages:
 1        1.010   0.995   0.978   0.934      0.962   0.956   0.928   0.889    
 3        0.922   0.913   0.910   0.852      0.870   0.865   0.854   0.816
 6        0.875   0.869   0.857   0.816      0.832   0.834   0.813   0.776
12        0.858   0.860   0.836   0.807      0.827   0.820   0.804   0.768
24        0.861   0.858   0.846   0.817      0.818   0.813   0.804   0.776
The next table shows hash table initialization time in seconds. Note that prefetch instructions are not used when initializing the hash table, and only one thread is used, so any difference between the "on" and "off" cases, or between thread counts, is just a measure of the unpredictability of memory allocation/initialization speed.

Code: Select all

                    prefetch on                       prefetch off
4KB pages:
threads      1G      2G      4G      8G         1G      2G      4G      8G
 1        0.547   1.015   1.780   3.551      0.566   1.004   1.829   3.481
 3        0.530   0.947   1.809   3.510      0.540   0.929   1.809   3.536
 6        0.487   0.955   1.794   3.482      0.544   0.987   1.817   3.517
12        0.485   0.910   1.746   3.610      0.521   0.931   1.811   3.477
24        0.445   0.894   1.649   3.400      0.459   0.948   1.798   3.348

2MB pages:
 1        0.385   0.623   1.190   2.348      0.373   0.678   1.254   2.322
 3        0.354   0.648   1.182   2.305      0.353   0.623   1.189   2.290
 6        0.339   0.620   1.187   2.301      0.338   0.646   1.180   2.334
12        0.333   0.615   1.157   2.258      0.340   0.605   1.174   2.260
24        0.293   0.527   1.029   2.235      0.290   0.526   1.137   2.017

1GB pages:
 1        0.037   0.035   0.034   0.035      0.036   0.035   0.035   0.035
 3        0.035   0.035   0.035   0.036      0.035   0.034   0.036   0.034
 6        0.036   0.034   0.035   0.035      0.036   0.036   0.035   0.035
12        0.034   0.037   0.034   0.035      0.036   0.035   0.035   0.036
24        0.035   0.031   0.031   0.026      0.032   0.034   0.030   0.032
Some conclusions:

* Initialization time gets smaller for larger page sizes.

* Software prefetch helps in all cases and gives around 5% NPS increase.

* 2MB page size is generally the fastest. This is nice because you can enable transparent huge pages in Linux, which will automatically cause all programs to use 2MB pages when possible.

* 4KB pages are almost as fast as 2MB pages for a 1GB hash table, but performance drops when larger hash tables are used.

I think the most remarkable conclusion here is that 1GB pages are not useful for texel, at least not on the computer used for the test.
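
For reference, here is a minimal sketch of what such an allocation and a prefetching hash probe can look like on Linux. This is hypothetical illustration code, not texel's actual implementation: the table is mapped anonymously, the kernel is asked to back it with transparent 2MB pages, and a GCC/Clang builtin issues the prefetch.

Code: Select all

// Hypothetical sketch: back the hash table with transparent 2MB huge pages
// and prefetch an entry before it is probed. Not texel's actual code.
#include <sys/mman.h>
#include <cstddef>
#include <cstdint>

void* allocHash(size_t bytes)
{
    void* p = mmap(nullptr, bytes, PROT_READ | PROT_WRITE,
                   MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
    if (p == MAP_FAILED)
        return nullptr;
    madvise(p, bytes, MADV_HUGEPAGE);   // ask the kernel for 2MB pages (THP)
    return p;
}

struct TTEntry { uint64_t key; uint64_t data; };

inline void prefetchEntry(const TTEntry* table, size_t mask, uint64_t posKey)
{
    __builtin_prefetch(&table[posKey & mask]);   // GCC/Clang builtin: start the load early
}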
petero2
Posts: 684
Joined: Mon Apr 19, 2010 7:07 pm
Location: Sweden
Full name: Peter Osterlund

Re: Lazy SMP and "lazy cluster" experiments

Post by petero2 »

lucasart wrote:Peter got a gain from lazy to ABDADA going from 32 to 64 cores. Does this prove that lazy is hopeless past 32 cores, and ABDADA is the way to go? Definitely not.
Actually the 32 -> 64 core result I have previously reported (+39 elo) was with my original lazy SMP/cluster implementation. ABDADA was not used in that test. Also this was when going from 1 to 2 cluster nodes, so I expect my lazy SMP would scale better than that on a single computer that has 64 cores.

The tests I have done so far to compare my "approximate ABDADA" implementation with my "lazy SMP" implementation can be summarized as follows:

Code: Select all

cores  dElo
    2   8.3
    2   9.2
    4   9.8
    4   8.2
    8  12.3
   16  18.4
   24  16.0
   40  14.3
This shows that in texel, my implementation of ABDADA is stronger than my implementation of lazy SMP. There are a lot of things those tests do not show though:

* They do not show that ABDADA in general is better than lazy SMP. I have not spent much time optimizing either lazy SMP or ABDADA in texel, so it is unknown what would happen if I did.

* They do not show what would happen in engines other than texel. Other engines might behave differently so even if I spend a lot of time trying to find the optimal SMP/cluster algorithm for texel, it may still not be optimal for other engines.

* They do not show that my ABDADA implementation scales better than my lazy SMP implementation in texel. Even though the elo differences are generally larger for a larger core count, the error bars in each measurement are quite large, and if you plot the table above no clear trend is visible, so it is unknown how to extrapolate this to higher core counts.
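
For reference, the core idea of textbook ABDADA (Jean-Christophe Weill's algorithm) is a two-pass move loop driven by a "currently being searched" counter stored in the shared hash table: the first move is always searched, other moves are deferred if another thread is already on them, and the deferred moves are picked up in a second pass. Below is a toy sketch of that generic idea, with toy types and no locking; it is not texel's "approximate ABDADA".

Code: Select all

// Generic ABDADA core idea, sketched with toy types. In a real engine the
// busy counter lives in the shared transposition table and is updated
// atomically; here a plain map is used just to show the control flow.
#include <algorithm>
#include <cstdint>
#include <unordered_map>
#include <vector>

struct Move { int id; };                          // toy move type
struct Position {                                 // toy position type
    uint64_t key = 0;
    void make(Move m)   { key ^= 0x9E3779B97F4A7C15ull * (m.id + 1); }
    void unmake(Move m) { make(m); }              // XOR is its own inverse
};

std::unordered_map<uint64_t, int> busyCount;      // how many threads search a position

std::vector<Move> generateMoves(const Position&) { return {{1}, {2}, {3}}; }
int evaluate(const Position&) { return 0; }

int search(Position& pos, int alpha, int beta, int depth)
{
    if (depth <= 0)
        return evaluate(pos);
    int best = -30000;
    std::vector<Move> deferred;
    bool first = true;
    for (Move m : generateMoves(pos)) {
        pos.make(m);
        if (!first && busyCount[pos.key] > 0) {
            deferred.push_back(m);                // someone else is on it: defer
            pos.unmake(m);
            continue;
        }
        busyCount[pos.key]++;                     // announce we are searching this child
        int score = -search(pos, -beta, -std::max(alpha, best), depth - 1);
        busyCount[pos.key]--;
        pos.unmake(m);
        first = false;
        best = std::max(best, score);
        if (best >= beta)
            return best;
    }
    for (Move m : deferred) {                     // second pass: pick up deferred moves
        pos.make(m);
        int score = -search(pos, -beta, -std::max(alpha, best), depth - 1);
        pos.unmake(m);
        best = std::max(best, score);
        if (best >= beta)
            return best;
    }
    return best;
}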
lucasart
Posts: 3232
Joined: Mon May 31, 2010 1:29 pm
Full name: lucasart

Re: Lazy SMP and lazy cluster algorithm

Post by lucasart »

petero2 wrote:Lazy SMP algorithm
[...]
Half of the helper threads search at the same depth as the master thread and the other half search at a one ply deeper depth.

When a helper thread finishes its search it will send the search result to the master thread. The helper thread will then restart its current search at a one ply deeper depth. It is expected that this search will be aborted rather quickly. As soon as the master thread receives the helper thread search result, it will continue its algorithm, which causes it to call negaScoutRoot() again with different parameters, and negaScoutRoot() tells all helper threads to start searching using the new parameters.

The master thread is responsible for all UCI communication and when reporting search depth it uses its own search depth. This means that even if a "depth+1" helper thread finished first, the reported depth is "depth", even though in that case the "quality" of the search is "depth+1".

The master thread does not directly talk to all helper threads. Instead all search threads are arranged in a tree, where the master thread is the tree root node. Each tree node has at most 4 children. The tree is created to have minimal height and children are assigned from left to right. For example if there are 16 search threads they are arranged as follows:

Code: Select all

0
|
+--------+-----------+---------+
|        |           |         |
1        2           3         4
|        |           |
+-+-+-+  +-+--+--+   +--+--+
| | | |  | |  |  |   |  |  |
5 6 7 8  9 10 11 12  13 14 15
A search thread is only allowed to talk to its neighbors in the tree graph. When a search thread receives a command from a neighbor it acts on the command and also forwards it to its parent or children as appropriate.

This thread tree structure is not strictly necessary for SMP search, since SMP hardware does not have an excessive number of hardware threads, so the master thread could talk directly to all helper threads. However for clusters the number of possible hardware threads could be very large (the largest super computers have more than one million cores), so in that case it would be very bad if the master thread tried to talk directly to all helper threads.
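
One way to realize the "at most 4 children, minimal height, assigned left to right" layout described above is ordinary 4-ary heap indexing. A small sketch, hypothetical and not necessarily texel's actual construction:

Code: Select all

// Hypothetical sketch of the thread tree layout: a complete 4-ary tree where
// thread i's parent and children follow the usual heap indexing rules.
#include <vector>

constexpr int MaxChildren = 4;

int parentOf(int i) { return i == 0 ? -1 : (i - 1) / MaxChildren; }

std::vector<int> childrenOf(int i, int numThreads)
{
    std::vector<int> c;
    for (int k = 1; k <= MaxChildren; k++) {
        int child = MaxChildren * i + k;          // children assigned left to right
        if (child < numThreads)
            c.push_back(child);
    }
    return c;
}

// With 16 threads this gives thread 0 the children {1,2,3,4}, thread 1 the
// children {5,6,7,8} and thread 3 the children {13,14,15}, as in the diagram.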

The algorithm is NUMA aware. This means that each search thread is locked to a specific NUMA node, and care is taken to allocate thread local memory on the NUMA node where the thread runs. The shared transposition table is allocated on NUMA node 0 if there is enough free memory on that node.
Interesting. My Lazy SMP is rather different. Here's how it works:
* main thread (the one that starts the main() function) does UCI I/O, and is always responsive.
* upon receiving a go command, it spawns a timer thread.
* the timer thread spawns N worker threads.

Timer and Worker(s) only live while the engine is searching. It turns out that spawning a few threads is so fast that the overhead is totally negligible. So you don't need to keep threads alive all the time and deal with condition variables. Simple thread_create() and thread_join() operations are all I ever need.

Already the philosophy is quite different. All workers are equal. There is no concept of main thread among the workers. This simplifies the code, as there is no special case for the main thread.

UCI output (printing completed iterations with depth, score, pv, etc.) is centralised in a global object that contains the last iteration completed (the highest depth completed so far; we don't care which of the equal workers completed it). The object embeds a mutex, obviously. When a worker completes an iteration, it just tries to update the PV (only updated if deeper; when an update occurs, it is printed to screen, and the mutex guarantees no concurrency problem that would cause scrambled output).
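
A minimal sketch of such a mutex-guarded "deepest completed iteration" object, with hypothetical names (this is not Demolito's actual code):

Code: Select all

// Hypothetical sketch: a shared object that keeps the deepest completed
// iteration and serializes the UCI output.
#include <iostream>
#include <mutex>
#include <string>

class SearchReport {
    std::mutex mtx;
    int bestDepth = 0;
public:
    // Called by any worker when it completes an iteration.
    void onIterationDone(int depth, int score, const std::string& pv) {
        std::lock_guard<std::mutex> lock(mtx);
        if (depth <= bestDepth)
            return;                               // only report strictly deeper results
        bestDepth = depth;
        std::cout << "info depth " << depth << " score cp " << score
                  << " pv " << pv << std::endl;   // mutex keeps output unscrambled
    }
    void reset() { std::lock_guard<std::mutex> lock(mtx); bestDepth = 0; }
};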

In the ID loop, before a worker starts an iteration, it checks how many workers are searching this iteration or above. If at least half the workers are searching >= depth, then there is no point in searching this iteration, so it tries the next one (and so on, until it finds a useful iteration to work on).

Search abortion is twofold:
* The Timer can signal all workers to stop immediately (using longjmp() in C, or an exception if it were C++).
* When a worker completes an iteration, it signals all workers working <= depth to stop immediately. When the affected workers stop, they return to the ID loop and look for a useful job to do, as previously described.

All this ensures that you have half the workers at each depth d and d+1. But it does so in a rather martial way. Workers are not allowed to just go their own way in the ID loop. We constantly want to keep them in rank and correct towards 1/2 at d, and 1/2 at d+1.
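
A minimal sketch of that iteration-selection rule, with hypothetical names (not Demolito's actual code), assuming a shared array holding each worker's current depth:

Code: Select all

// Hypothetical sketch of the "skip crowded depths" rule described above.
#include <atomic>

constexpr int MaxWorkers = 64;
std::atomic<int> workerDepth[MaxWorkers];         // depth each worker is currently on
int numWorkers = 4;

// Pick the next iterative-deepening depth for worker 'me', starting from the
// depth it would normally search next.
int nextDepth(int me, int wanted)
{
    for (int d = wanted; ; d++) {
        int atOrAbove = 0;
        for (int w = 0; w < numWorkers; w++)
            if (w != me && workerDepth[w].load(std::memory_order_relaxed) >= d)
                atOrAbove++;
        // If at least half the workers already search this depth or deeper,
        // skip it and consider d+1; otherwise this depth is worth working on.
        if (atOrAbove < numWorkers / 2) {
            workerDepth[me].store(d, std::memory_order_relaxed);
            return d;
        }
    }
}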

Next steps:
1/ The next thing to try is to use more complicated patterns than 1/2 at d and 1/2 at d+1. This has been shown to be very useful with many threads in Stockfish testing. But I cannot see any gain in varying this for 4 threads. And I do not have big machines at my disposal. So I stay with the simple scheme for now.

2/ The other thing to try is obviously ABDADA. Again, I really doubt I will gain anything on 4 threads above what I have. It probably needs 8 or 16 threads to see anything.
Theory and practice sometimes clash. And when that happens, theory loses. Every single time.
Laskos
Posts: 10948
Joined: Wed Jul 26, 2006 10:21 pm
Full name: Kai Laskos

Re: ELO measurements

Post by Laskos »

petero2 wrote:Time doubling vs cores doubling vs cluster nodes doubling

To see how well the algorithm scales, a series of test matches were played where the used hardware resources were successively doubled. Three different ways of doubling hardware resources were tested:
* Doubling the thinking time.
* Doubling the number of cores on a single cluster node.
* Doubling the number of cluster nodes.

The matches were played either on identical "E5-2686 v4 @ 2.30GHz" EC2 nodes, on my 24 core computer (E5-2680 v3 @ 2.50GHz), or on my 16 core computer (E5-2670 0 @ 2.60GHz). Turbo boost was enabled during all tests. The transposition table size was 128MB. The EC2 instances, which were used for all cluster tests, are connected by 10 Gigabit Ethernet and the ping time between nodes is around 90 microseconds.

Matches are run using a slightly modified version of cutechess-cli. The modification uses thread affinities to lock each game to a subset of the available cores. This gives more reliable results, especially on NUMA hardware when running multiple games in parallel.

The results can be summarized by the following table, which shows the elo increase when successively doubling thinking time (row 1), number of cores (row 2), and number of cluster nodes (row 3).

Code: Select all

config   X1   X2   X4   X8  X16  X32
tXc1n1    -  112  100   85   72   63
t1cXn1    -   72   76   73   66   44
t1c1nX    -   50   46   51   29   --
A configuration tAcBnC means the time control is A times the base time control, the number of cores per cluster node is B, and the number of cluster nodes is C. The base time control was 24s/game+0.24s/move. Each entry in the table shows the elo increase compared to the configuration immediately to the left of the entry. For example, the value 100 in the X4 column in row 1 means that when playing 4*base_time vs 2*base_time on a single core and a single cluster node, the elo increase is 100.

The total elo increase between 1*base_time and 32*base_time is: 112+100+85+72+63 = 432.
The total elo increase between 1 core and 32 cores is: 72+76+73+66+44 = 331.
The total elo increase between 1 cluster node and 16 nodes is: 50+46+51+29 = 176.

The 32 node cluster test is missing because my current Amazon instance limit does not let me rent 32 computers at the same time. I might try to get this limit increased in the future.

The 32 core configuration is the only one that was running on more than one NUMA node. The 16 core configuration was run on an EC2 instance having 16 cores per NUMA node so it only used one NUMA node.

Each match was typically 1000 or 2000 games long, and the estimated error in the results was between 4 and 6 elo, one standard deviation. More details about the matches are given in the following table.

Code: Select all

              dElo  sDev  win loss draw  depth1 depth2
 2t vs  1t  +112.0   4.5  778  153 1075   17.65  16.34
 4t vs  2t  +100.0   4.3  690  130 1180   19.12  17.81
 8t vs  4t   +85.4   4.3  615  133 1252   20.58  19.29 
16t vs  8t   +72.1   4.1  552  143 1305   21.90  20.65
32t vs 16t   +62.7   3.9  480  123 1397   23.21  22.05
                   
 2c vs  1c   +72.2   4.5  604  194 1202   17.19  16.68
 4c vs  2c   +76.4   4.5  642  209 1149   17.74  17.07
 8c vs  4c   +73.2   4.3  577  162 1261   18.71  18.00
16c vs  8c   +65.7   5.7  260   73  667   19.89  19.17
32c vs 16c   +44.0   5.3  203   77  720   20.55  19.97
                   
 2n vs  1n   +50.0   6.5  260  117  623   16.68  16.55
 4n vs  2n   +45.8   6.0  252  121  627   17.12  16.71
 8n vs  4n   +51.1   6.1  251  105  644   17.46  17.08
16n vs  8n   +28.6   6.1  203  121  676   18.09  17.70
The estimated standard deviation was computed by grouping game results in pairs of two (pentanomial distribution) since each opening was repeated with colors reversed.
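
In formula form (a standard way to compute this; the exact computation used here may differ), with S_i the score of opening pair i out of N pairs:

Code: Select all

sbar    = sum(S_i) / (2*N)                        with S_i in {0, 0.5, 1, 1.5, 2}
SD_sbar = stddev(S_i) / (2*sqrt(N))
dElo    = 400 * log10( sbar / (1 - sbar) )
SD_dElo = (400 / ln 10) * SD_sbar / ( sbar * (1 - sbar) )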

depth1 and depth2 are the arithmetic mean values of the search depth taken over all moves an engine configuration played in the match. Note that this value depends to a non-trivial extent on the number of moves the engine searched in very simple endgame positions where texel's maximum search depth (100 ply) was reached. There may be better ways to compute a representative average search depth from the PGN data.

Other cluster tests

Some test results using more than one core per cluster node:

Code: Select all

                    dElo  sDev  win loss draw  depth1 depth2
t1c4n4  vs t1c4n1  +92.8   5.8  316   55  629   18.78  17.97
t1c32n2 vs t1c32n1 +39.2   5.6  217  102  705   20.73  20.17
t1c4n2  vs t1c4n1  +37.7   5.7  221  113  666   18.75  18.34

t1c2n2  vs t1c2n1  +31.4   6.7  228  138  634   17.74  17.43
t1c2n4  vs t1c2n2  +42.2   6.1  237  116  647   18.09  17.66
t1c2n8  vs t1c2n4  +31.4   5.7  203  113  684   18.49  18.19
t1c2n16 vs t1c2n8  +32.1   5.7  206  114  680   18.97  18.58
The last four rows can be used to extend the summary table above with one more row:

Code: Select all

config   X1   X2   X4   X8  X16  X32
tXc1n1    -  112  100   85   72   63
t1cXn1    -   72   76   73   66   44
t1c1nX    -   50   46   51   29   --
t1c2nX    -   31   42   31   32   --
Varying transposition table size

With the given hardware texel fills the transposition table at around 16MB/s per thread. This means that the fixed 128MB hash table used in the previous tests is too small to hold the entire search tree when the thinking time, number of cores, or number of cluster nodes is large. To see if this affected the results, some tests were repeated with a 512MB hash table.

Code: Select all

                    dElo  sDev  win loss draw  depth1 depth2
t8c1n1  vs t4c1n1  +97.1   4.2  654  109 1237   20.57  19.24
t16c1n1 vs t8c1n1  +83.4   4.2  604  133 1263   21.89  20.60
t32c1n1 vs t16c1n1 +72.4   4.0  538  127 1335   23.18  21.89
Some tests were also repeated with a 1MB transposition table, which is too small to hold the whole search tree for all configurations.

Code: Select all

                    dElo  sDev  win loss draw  depth1 depth2
t2c1n1 vs t1c1n1  +103.3   4.1  875  182 1343   16.90  15.82
t4c1n1 vs t2c1n1   +77.5   4.4  619  180 1201   18.06  17.02
t8c1n1 vs t4c1n1   +77.0   4.3  603  167 1230   19.11  18.08

t1c2n1 vs t1c1n1   +48.6   4.4  544  266 1190   16.35  16.11
t1c4n1 vs t1c2n1   +38.8   3.1  953  508 2539   16.89  16.48
t1c8n1 vs t1c4n1   +46.3   4.2  465  200 1335   17.52  17.07
Using the above data the summary table can be extended:

Code: Select all

config   X1   X2   X4   X8  X16  X32
tXc1n1    -  112  100   85   72   63
tXc1n1'   -             97   83   72     // 512MB hash
tXc1n1''  -  103   78   77               // 1MB hash
t1cXn1    -   72   76   73   66   44
t1cXn1''  -   49   39   46               // 1MB hash
t1c1nX    -   50   46   51   29   --
t1c2nX    -   31   42   31   32   --
It can be seen that using too small a transposition table hurts more when doubling the number of cores than when doubling the thinking time.

Possible NUMA issues

As mentioned above, the 32 core configuration was the only one running on more than one NUMA node. The result for 32 cores vs 16 cores (+44 elo) was also somewhat lower than expected. To further test NUMA behavior, a 16 core vs 8 core match was played on the 16 core computer (which has two NUMA nodes) and a 24 core vs 12 core match was played on the 24 core computer (which also has two NUMA nodes).

Code: Select all

                    dElo  sDev  win loss draw  depth1 depth2
t1c16n1 vs t1c8n1  +44.4   5.6  219   92  689   19.62  18.95    // 128MB hash
t1c16n1 vs t1c8n1  +48.3   6.1  240  102  658   19.68  19.00    // 1024MB hash
t1c24n1 vs t1c12n1 +41.5   5.9  211   92  697   20.12  19.58    // 128MB hash
t1c24n1 vs t1c12n1 +47.5   6.0  229   93  678   20.39  19.80    // 1024MB hash
The elo increase is similar to the earlier 32 cores vs 16 cores match, and the 16 cores vs 8 cores result (+44.4 elo) is lower than the earlier non-NUMA 16 cores vs 8 cores result (+66 elo).

Test data

All games played for these measurements are available in this archive file.

The tests used the texel development version 1.07a27, available for download here.

Turbo boost effects

Turbo boost was enabled for all tests. This makes the computers run faster than the base clock frequency, but the speed increase is larger when fewer cores are in use. This can potentially skew the results in favor of configurations using few cores, i.e. it can look like the SMP algorithm scales worse than it would scale on hypothetical hardware that runs at the same clock speed regardless of how many cores are in use.

All used computers have very good cooling systems, so the actual turbo boost frequency can be calculated from processor turbo boost specification information as explained here. The following table shows the turbo boost value as a function of the number of active cores:

Code: Select all

             18 17 16 15 14 13 12 11 10  9  8  7  6  5  4  3  2  1   base    incr
E5-2686 v4 :  4  4  4  4  4  4  4  4  4  4  4  4  4  4  4  5  7  7   2.3GHz  0.1GHz
E5-2680 v3 :                    4  4  4  4  4  4  4  4  5  6  8  8   2.5GHz  0.1GHz
E5-2670 0  :                                4  4  5  5  6  6  7  7   2.6GHz  0.1GHz
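
Expressed as a small hypothetical helper (illustrative only, not texel code), using the E5-2680 v3 row of the table:

Code: Select all

// Hypothetical helper: effective clock (GHz) of the E5-2680 v3 as a function
// of the number of active cores (valid for 1..12), using the bins above.
double turboFreqGHz(int activeCores)
{
    const double base = 2.5, incr = 0.1;
    static const int bins[13] = { 0, 8, 8, 6, 5, 4, 4, 4, 4, 4, 4, 4, 4 };
    return base + incr * bins[activeCores];
}
// Example: 1 active core -> 3.3GHz, 12 active cores -> 2.9GHz, i.e. a ~12%
// clock handicap (1 - 2.9/3.3) for the 12 core side, as computed below.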
As an example of the worst case, consider a match played between 1 core and 12 cores on the E5-2680 computer. When the 1 core engine is thinking only one core is active so the clock frequency is 2.5+8*0.1 = 3.3GHz. When the 12 core engine is thinking all 12 cores are active so the clock frequency is 2.5+4*0.1 = 2.9GHz. In this case it would be as if the 12 core engine had a 12% time handicap (1-2.9/3.3), so if we wanted to measure the true algorithmic 12 core vs 1 core improvement, we would have to give the 12 core engine 12% more thinking time.

However, these tests were run in a way that does not trigger the worst case, since only 2*N vs 1*N core matches were played (for various values of N), and as many parallel games as possible were played on each test computer. For example, consider a 4 core vs 2 cores match played on the E5-2680 computer. If only one game were to be played in parallel, the speed difference would be 1-(2.5+5*0.1)/(2.5+8*0.1)=9%. However since only 4 cores are needed for each game, 3 games are played in parallel. This means that at least 2*3 cores are active at any time, so it can be seen from the table above that the turbo boost multiplier is always 4, and hence the speed difference is zero.

I think the largest difference that actually occurred in the test data is 8 cores vs 4 cores played on the E5-2670 computer. In this case the speed difference is 1-(2.6+4*0.1)/(2.6+6*0.1)=6%.
Your results in this post with Lazy Texel are in fact quite remarkable and, after looking more carefully, weird to me. I don't touch the cluster stuff here, just cores, with the second NUMA node coming into play from 16 to 32 cores. I put part of your main table in a form more readable to me.

ELO gain
Errors are about 10 ELO points (2 SD).

Code: Select all

                                                   
                                                  NUMA=2
Config          X1     X2     X4     X8    X16      X32 
----------------------------------------------------------
Doubling time    -    112    100     85     72       63 
Doubling cores   -     72     76     73     66       44
==========================================================
Effective speedup       
per doubling cores   1.56   1.62   1.69   1.72     1.49 
The effective speedup (in time) per core doubling was not so straightforward to determine, because doubling cores puts the engine at a different strength level than doubling time, and a different strength level means a different ELO gain per time doubling. So I had to find a good fit for the ELO gain from time doubling and use it to deduce the effective speedup from doubling cores. One can't simply take each individual doubling, use the ratio of ELO gains as an exponent of 2, and derive the speedup per core doubling. Nor can one just divide the total ELO gain for cores by the total for time, use that as an exponent, and get the total effective speedup. Say, if I did that for X32 (5 doublings), the total effective speedup would be 32**(331/432) ~ 14.2; I get 11.0.
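
In formula form: let E(t) be the fitted ELO-vs-time curve; then for a core doubling measured at base time t0 with gain dElo_cores, the effective speedup s solves

Code: Select all

E(s * t0) - E(t0) = dElo_cores   =>   s = E^-1( E(t0) + dElo_cores ) / t0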

Here is the plot:

[image: fitted ELO gain vs. time doubling curve - not shown]

The exact fit function is not so important for the time doublings of Texel at that 24''+0.24'' time control, but I continued it out to 256 times that, because it is probably not far off. This function was used to derive each effective speedup.

Now the speedups: 1.56, 1.62, 1.69, 1.72, and, crossing onto the second NUMA node, 1.49. From your results too, NUMA behaves with the number of cores similarly to non-NUMA, but at smaller values of effective speedup: small with few cores, larger with many cores. The results are frankly weird to me. This is not Vas Rajlich's empirical diminishing-returns numbers, this is not Robert Houdart's rule of thumb of 1.75**doublings, this is not Amdahl's law; this is... most closely related to Robert Hyatt's seemingly ridiculous rule of thumb:

effective speedup = 1 + 0.7 * (n_cores-1).

for his YBW.

I derided it on this forum several years ago, and several times. But it is the only one that can even vaguely reproduce this Lazy Texel behavior, if you change 0.7 to 0.5 or so.
I plot here these rules of thumb adapted to your data:

[image: rule-of-thumb speedup curves plotted against the data - not shown]

My own tests with top Lazy engines show a speedup of about 1.9 for 1 -> 2 cores and about 1.8 for 2 -> 4 cores, in accordance with the red curve. Your results with Lazy Texel are quite remarkable, and the error margins are not that large; I estimate that the average 2SD for the effective speedup is no more than 0.08.

PS To mention that I didn't even take into account the Turbo Boost effect, which will give even better scaling. But not by much, as I understand from your post, maybe within a maximum of 1-2% for one doubling of cores.