Lazy SMP and "lazy cluster" experiments

Discussion of chess software programming and technical issues.

Moderator: Ras

petero2
Posts: 729
Joined: Mon Apr 19, 2010 7:07 pm
Location: Sweden
Full name: Peter Osterlund

Lazy SMP and "lazy cluster" experiments

Post by petero2 »

Lately I have been doing some experiments with an implementation of lazy SMP in texel. My motives were to see how lazy SMP behaves in general and to have an SMP algorithm that is easier to generalize to cluster computing than texel's old SMP algorithm.

In the following posts I will describe how the algorithm works, present some test results and suggest some ideas for future improvements.
petero2
Posts: 729
Joined: Mon Apr 19, 2010 7:07 pm
Location: Sweden
Full name: Peter Osterlund

Lazy SMP and lazy cluster algorithm

Post by petero2 »

Lazy SMP algorithm

The general design principles for texel's old SMP algorithm were "no waiting" and "be asynchronous whenever possible". Typical lazy SMP algorithms also follow these principles, since there is very little synchronization between threads, except for the transposition table. Other than that there is very little similarity between the old and the new algorithm.

Texel's lazy SMP algorithm uses a shared transposition table, which is thread safe and lockless using the "XOR trick". Other data, such as history tables, killer tables and evaluation caches, are private to each search thread.

The first thread is the master thread and all other threads are helper threads. Whenever the master thread calls the negaMax function from the root search function, it also tells the helper threads to start searching the same position. The helper threads improve the search in two ways. First, they store partial search results in the shared transposition table, which can be useful for other search threads. Second, if a helper thread finishes its search before the master thread is finished, the master thread will detect this and abort its search and use the result from the helper thread instead.

In pseudo code, the algorithm works basically like this:

Code: Select all

Move iterativeDeepening(Position pos) {
    for (depth = 1; depth <= maxDepth && !timeout; depth++) {
        for (all legal moves m) {
            [alpha, beta] = score_from_previous_iteration -/+ aspiration_window_delta;
            pos.makeMove(m);
            score = -negaScoutRoot(pos, -beta, -alpha, depth);
            pos.unMakeMove(m);
            if (score outside aspiration window)
                widen window and redo search;
            if (score > bestScore) {
                bestMove = m;
                bestScore = score;
            }
        }
    }
    tell all helper threads to stop searching
    return bestMove;
}

int negaScoutRoot(Position pos, int alpha, int beta, int depth) {
    tell all helper threads to start searching with params (pos, alpha, beta, depth)
    try {
        return negaScout(pos, alpha, beta, depth);
    } catch (HelperThreadResult result) {
        return result.getScore();
    }
}

int negaScout(Position pos, int alpha, int beta, int depth) {
    if (time to check for abort) {
        if (timeout)
            throw StopSearch();
        if (helper thread has a search result)
            throw HelperThreadResult(helper_thread_score);
    }

    // Normal recursive negaScout implementation ...
}

void helperThreadMainLoop() {
    while (!quit) {
        while (idle)
            wait for search command
        int extraDepth = thread_number % 2;
        try {
            for ( ; depth + extraDepth <= maxDepth; extraDepth++) {
                negaScout2(pos, alpha, beta, depth + extraDepth);
            }
        } catch (StopSearch) {
        }
    }
}

int negaScout2(Position pos, int alpha, int beta, int depth) { // Like negaScout() except for abort condition
    if (time to check for abort) {
        if (stop command received or new search command received)
            throw StopSearch();
    }

    // Normal recursive negaScout implementation ...
}
Communication between the master thread and helper threads is asynchronous, meaning that when a thread sends a command to another thread, it does not wait for the other thread to acknowledge the command.

Standard C++11 threading primitives such as locks and condition variables are used for the communication.

Half of the helper threads search at the same depth as the master thread and the other half search at a one ply deeper depth.

When a helper thread finishes its search it will send the search result to the master thread. The helper thread will then restart its current search at a one ply deeper depth. It is expected that this search will be aborted rather quickly. As soon as the master thread receives the helper thread search result, it will continue its algorithm, which causes it to call negaScoutRoot() again with different parameters, and negaScoutRoot() tells all helper threads to start searching using the new parameters.

The master thread is responsible for all UCI communication and when reporting search depth it uses its own search depth. This means that even if a "depth+1" helper thread finished first, the reported depth is "depth", even though in that case the "quality" of the search is "depth+1".

The master thread does not directly talk to all helper threads. Instead all search threads are arranged in a tree, where the master thread is the tree root node. Each tree node has at most 4 children. The tree is created to have minimal height and children are assigned from left to right. For example if there are 16 search threads they are arranged as follows:

Code: Select all

0
|
+--------+-----------+---------+
|        |           |         |
1        2           3         4
|        |           |
+-+-+-+  +-+--+--+   +--+--+
| | | |  | |  |  |   |  |  |
5 6 7 8  9 10 11 12  13 14 15
A search thread is only allowed to talk to its neighbors in the tree graph. When a search thread receives a command from a neighbor it acts on the command and also forwards it to its parent or children as appropriate.

This thread tree structure is not strictly necessary for SMP search, since SMP hardware does not have an excessive number of hardware threads, so the master thread could talk directly to all helper threads. However for clusters the number of possible hardware threads could be very large (the largest super computers have more than one million cores), so in that case it would be very bad if the master thread tried to talk directly to all helper threads.

The algorithm is NUMA aware. This means that each search thread is locked to a specific NUMA node, and care is taken to allocate thread local memory on the NUMA node where the thread runs. The shared transposition table is allocated on NUMA node 0 if there is enough free memory on that node.

Lazy cluster algorithm

Support for clusters is implemented using the message passing interface (MPI). An MPI program consists of a set of processes, called "processors" in MPI terminology. Different processors do not share any memory and possibly execute on different computers (aka cluster nodes). All communication between processors is performed using message passing, which for example can be implemented by the MPI library using TCP/IP communication over Ethernet.

The texel cluster implementation is a so called "hybrid" MPI implementation, which means that each processor is a multithreaded program and there is only one processor on each computer in the cluster. Inside each processor the threads are arranged in the same tree structure as in the lazy SMP case. Cluster nodes are also connected in a tree structure. The main thread in each processor is connected to the main thread in other processors. This means that the set of all search threads in all cluster nodes forms a tree of trees. For example if each node has 3 cores and there are 7 nodes in the system, the tree structure looks like this:

Code: Select all

n0_t0 +- n0_t1
|     |
|     +- n0_t2
|
+----------------+----------------+---------------+
|                |                |               |
n1_t0 +- n1_t1   n2_t0 +- n2_t1   n3_t0 +- n3_t1  n4_t0 +- n4_t1
|     |                |                |               |
|     +- n1_t2         +- n2_t2         +- n3_t2        +- n4_t2
|
+----------------+
|                |
n5_t0 +- n5_t1   n6_t0 +- n6_t1
      |                |
      +- n5_t2         +- n6_t2
Communication between cluster nodes works logically the same way as between threads within a node, but the implementation uses MPI messages instead of C++11 thread primitives.

It is not necessary for each cluster node to have the same number of cores, even though that was the case in the example above. The number of cluster nodes in the system is given when the program starts. The number of used search threads is however given by the "Threads" UCI parameter. Normally the number of search threads is specified to be the same as the total number of cores in the cluster. It is however possible to specify a different number of search threads. If there are fewer search threads than cores, cores are allocated in breadth first order. If there are more search threads than cores, the extra search threads are spread out evenly over the cluster nodes and cores, which makes it possible to utilize hyperthreading cores.

Using the cluster communication described so far improves the search by making it possible for search threads to finish the search before the master thread (on node 0) and reporting the result up in the tree to the master thread. However, since each processor lives in its own address space, the major lazy SMP search improvement caused by transposition table sharing does not work between cluster nodes. To overcome this limitation a method to distribute transposition table updates to other cluster nodes has been implemented.

Each processor has one transposition table shared between all threads in the process. Whenever a search thread stores a result in the transposition table, it also checks if the stored depth is large enough to make it worthwhile to distribute this information to other cluster nodes. If so, the TT entry is stored in output queues containing entries to be sent to cluster node neighbors. Each cluster neighbor has its own output queue. Spare bandwidth between cluster nodes is used to transmit the queued entries to neighboring nodes. When a node receives a TT entry, it inserts it in its local transposition table, and, if the depth is large enough, also stores it in the output queues for all neighbors except the neighbor the entry came from. Since the cluster nodes form a tree structure, this ensures that each TT entry update can be distributed to all other cluster nodes, but distribution can stop at any cluster node if the TT entry depth is not large enough.

What TT entry depth is considered "large enough" for distribution to other cluster nodes is determined dynamically using a simple procedure. Each output queue is periodically transmitted to the receiving node. If the queue is less than half full when it is about to be sent, the minimum depth for that queue is decreased by 1. If on the other hand the queue becomes full before it was possible to send it to the receiving node, the minimum depth for that queue is increased by 1.

Cluster algorithm properties

From the design it should be clear that speed measured in nodes per second (NPS) will increase proportionally to the number of nodes in the cluster, since all threads will always be searching something as soon as the first "start search" command has reached all search threads.

The reported NPS values in the UCI communication is based on information available in the master thread. Each search thread maintains a local "nodes searched" counter, and counter changes are periodically reported to parent threads, which makes this information propagate up the tree structure towards the master thread. However, at a given point in time there will be counter values that have not yet been propagated to the master thread. Therefore the reported NPS values will be somewhat smaller than the real NPS values, but this error will get smaller the longer the search is running, and after around 10 seconds the error should be quite small even for very large clusters.

The transposition table design makes TT access from search threads very fast compared to more traditional distributed hash table systems where a thread has to ask another cluster node for the result of a TT probe. This is because no cluster communication is required to access a TT entry. On the other hand the asynchronous TT entry broadcasting mechanism means that there will be a delay between the time when a TT entry is created and the time when it has been distributed to all cluster nodes.

Of course NPS in itself is not important for a chess engine. What really matters is how much the extra hardware contributes to making the program play stronger chess. Tests will have to be performed to measure the strength increase.
petero2
Posts: 729
Joined: Mon Apr 19, 2010 7:07 pm
Location: Sweden
Full name: Peter Osterlund

SMP NPS measurements

Post by petero2 »

To check if NPS scales linearly with the number of search threads as expected I ran a series of tests on an Amazon EC2 r4.16xlarge instance. The instance runs Amazon linux and has 480GB RAM. "lscpu" gives the following hardware details:

Code: Select all

Architecture:          x86_64
CPU op-mode(s):        32-bit, 64-bit
Byte Order:            Little Endian
CPU(s):                64
On-line CPU(s) list:   0-63
Thread(s) per core:    2
Core(s) per socket:    16
Socket(s):             2
NUMA node(s):          2
Vendor ID:             GenuineIntel
CPU family:            6
Model:                 79
Model name:            Intel(R) Xeon(R) CPU E5-2686 v4 @ 2.30GHz
Stepping:              1
CPU MHz:               1644.842
BogoMIPS:              4661.99
Hypervisor vendor:     Xen
Virtualization type:   full
L1d cache:             32K
L1i cache:             32K
L2 cache:              256K
L3 cache:              46080K
NUMA node0 CPU(s):     0-15,32-47
NUMA node1 CPU(s):     16-31,48-63
I let texel search for 20 seconds from the starting position, using different number of search threads and different hash table sizes. Each search was repeated five times.

When using 1 thread and 1MB hash, the speed was 1.47 MN/s.

For other combinations of search threads and hash size, the measured speed relative to the 1 thread/1MB value was:

Code: Select all

threads      1M      4M     16M     64M    256M      1G      4G     16G     64G    256G
 1        1       1.011   1.015   1.013   1.034   1.024   0.980   1.011   1.005   0.986
 2        2.010   1.999   2.031   2.015   2.031   2.028   1.970   2.036   2.028   1.980
 4        4.002   3.995   4.008   4.017   4.060   4.033   3.924   4.058   4.038   3.931
 8        7.944   7.966   7.968   8.007   8.012   8.059   7.831   8.068   8.037   7.836
16       15.718  15.822  15.846  15.747  15.882  15.871  15.418  16.007  15.923  15.392
32       30.781  30.855  30.790  30.843  31.267  31.187  30.386  31.690  31.830  30.919
64       36.540  36.070  36.254  36.346  36.779  37.108  36.076  37.836  38.380  38.217
If each entry is divided by the number of used search threads, we get the following relative speeds per thread:

Code: Select all

threads      1M      4M     16M     64M    256M      1G      4G     16G     64G    256G
 1        1       1.011   1.015   1.013   1.034   1.024   0.980   1.011   1.005   0.986
 2        1.005   1.000   1.016   1.008   1.016   1.014   0.985   1.018   1.014   0.990
 4        1.000   0.999   1.002   1.004   1.015   1.008   0.981   1.015   1.010   0.983
 8        0.993   0.996   0.996   1.001   1.001   1.007   0.979   1.008   1.005   0.980
16        0.982   0.989   0.990   0.984   0.993   0.992   0.964   1.000   0.995   0.962
32        0.962   0.964   0.962   0.964   0.977   0.975   0.950   0.990   0.995   0.966
64        0.571   0.564   0.566   0.568   0.575   0.580   0.564   0.591   0.600   0.597
The time in seconds required to initialize the hash table is given by the following table:

Code: Select all

threads      1M      4M     16M     64M    256M      1G      4G     16G     64G    256G
 1        0.023   0.022   0.021   0.041   0.099   0.022   0.022   4.817  19.302  84.250
 2        0.022   0.022   0.022   0.041   0.100   0.023   0.023   4.910  19.527  84.641
 4        0.024   0.022   0.023   0.041   0.100   0.023   0.023   4.922  19.663  84.971
 8        0.023   0.024   0.023   0.043   0.100   0.024   0.023   4.949  19.602  85.991
16        0.026   0.026   0.025   0.044   0.104   0.026   0.026   4.944  19.649  85.229
32        0.030   0.031   0.031   0.049   0.108   0.030   0.030   4.944  19.773  85.196
64        0.041   0.042   0.042   0.060   0.118   0.041   0.040   4.933  19.410  85.199
Analysis

NPS scales almost linearly with the number of search threads and is only very slightly affected by the hash table size. There are some anomalies however:

* For 64 threads the NPS scaling is only 58%, but this is expected because the machine only has 32 cores (but 64 hyperthreads).

* For 4GB hash the NPS is a few percent lower than for other hash sizes. On this machine, 8 GB RAM has been reserved for 1GB huge pages, which means that for hash size 1GB and 4GB, the memory allocation is served using 1GB huge pages. For other hash table sizes 2MB huge pages are used since the "transparent huge pages" feature is enabled in the kernel. It can be seen that the initialization time is significantly smaller for 1GB and 4GB hash sizes, which is expected when 1GB huge pages are used. It is however surprising that NPS is reduced when 4 1GB huge pages are used for the hash table.

* There is a small slowdown when 32 threads are used and when 256GB hash is used. For these test texel was run in NUMA aware mode, which means that it puts all search threads on the same NUMA node if possible. This is possible when using at most 16 threads, but not when using 32 threads. Also, texel tries to allocate the hash table on NUMA node 0, but for the 256GB hash case there is not enough memory on node 0, so some of the hash table is allocated on node 1. The observed NPS values seem reasonable given how texel handles NUMA hardware.

It is quite remarkable that NPS is almost equal when using a 1MB hash table that fits completely in L3 cache, and when using a 64GB hash table, which is around 1000 times larger than the L3 cache. Texel uses prefetch instructions to improve hash table access times. Possibly that causes most of the memory latency to be hidden.
petero2
Posts: 729
Joined: Mon Apr 19, 2010 7:07 pm
Location: Sweden
Full name: Peter Osterlund

ELO measurements

Post by petero2 »

Time doubling vs cores doubling vs cluster nodes doubling

To see how well the algorithm scales, a series of test matches were played where the used hardware resources were successively doubled. Three different ways of doubling hardware resources were tested:
* Doubling the thinking time.
* Doubling the number of cores on a single cluster node.
* Doubling the number of cluster nodes.

The matches were played either on identical "E5-2686 v4 @ 2.30GHz" EC2 nodes, on my 24 core computer (E5-2680 v3 @ 2.50GHz), or on my 16 core computer (E5-2670 0 @ 2.60GHz). Turbo boost was enabled during all tests. The transposition table size was 128MB. The EC2 instances, which were used for all cluster tests, are connected by 10 Gigabit Ethernet and the ping time between nodes is around 90 microseconds.

Matches are run using a slightly modified version of cutechess-cli. The modification uses thread affinities to lock each game to a subset of the available cores. This gives more reliable results, especially on NUMA hardware when running multiple games in parallel.

The results can be summarized by the following table, which shows the elo increase when successively doubling thinking time (row 1), number of cores (row 2), and number of cluster nodes (row 3).

Code: Select all

config   X1   X2   X4   X8  X16  X32
tXc1n1    -  112  100   85   72   63
t1cXn1    -   72   76   73   66   44
t1c1nX    -   50   46   51   29   --
A configuration tAcBnC means the time control is A times the base time control, the number of cores per cluster node is B, and the number of cluster nodes is C. The base time control was 24s/game+0.24s/move. Each entry in the table shows the elo increase compared to the configuration immediately to the left of the entry. For example, the value 100 in the X4 column in row 1 means that when playing 4*base_time vs 2*base_time on a single core and a single cluster node, the elo increase is 100.

The total elo increase between 1*base_time and 32*base_time is: 112+100+85+72+63 = 432.
The total elo increase between 1 core and 32 cores is: 72+76+73+66+44 = 331.
The total elo increase between 1 cluster node and 16 nodes is: 50+46+51+29 = 176.

The 32 node cluster test is missing because my current Amazon instance limit does not let me rent 32 computers at the same time. I might try to get this limit increased in the future.

The 32 core configuration is the only one that was running on more than one NUMA node. The 16 core configuration was run on an EC2 instance having 16 cores per NUMA node so it only used one NUMA node.

Each match was typically 1000 or 2000 games long, and the estimated error in the results was between 4 and 6 elo, one standard deviation. More details about the matches are given in the following table.

Code: Select all

              dElo  sDev  win loss draw  depth1 depth2
 2t vs  1t  +112.0   4.5  778  153 1075   17.65  16.34
 4t vs  2t  +100.0   4.3  690  130 1180   19.12  17.81
 8t vs  4t   +85.4   4.3  615  133 1252   20.58  19.29 
16t vs  8t   +72.1   4.1  552  143 1305   21.90  20.65
32t vs 16t   +62.7   3.9  480  123 1397   23.21  22.05
                   
 2c vs  1c   +72.2   4.5  604  194 1202   17.19  16.68
 4c vs  2c   +76.4   4.5  642  209 1149   17.74  17.07
 8c vs  4c   +73.2   4.3  577  162 1261   18.71  18.00
16c vs  8c   +65.7   5.7  260   73  667   19.89  19.17
32c vs 16c   +44.0   5.3  203   77  720   20.55  19.97
                   
 2n vs  1n   +50.0   6.5  260  117  623   16.68  16.55
 4n vs  2n   +45.8   6.0  252  121  627   17.12  16.71
 8n vs  4n   +51.1   6.1  251  105  644   17.46  17.08
16n vs  8n   +28.6   6.1  203  121  676   18.09  17.70
The estimated standard deviation was computed by grouping game results in pairs of two (pentanomial distribution) since each opening was repeated with colors reversed.

depth1 and depth2 are the arithmetic mean value of the search depth taken over all moves an engine configuration played in the match. Note that this value depends to a non-trivial amount on the number of moves the engine was searching in very simple endgame positions where texel's maximum search depth (100 ply) was reached. There may be better ways to compute a representative average search depth from the PGN data.

Other cluster tests

Some test results using more than one core per cluster node:

Code: Select all

                    dElo  sDev  win loss draw  depth1 depth2
t1c4n4  vs t1c4n1  +92.8   5.8  316   55  629   18.78  17.97
t1c32n2 vs t1c32n1 +39.2   5.6  217  102  705   20.73  20.17
t1c4n2  vs t1c4n1  +37.7   5.7  221  113  666   18.75  18.34

t1c2n2  vs t1c2n1  +31.4   6.7  228  138  634   17.74  17.43
t1c2n4  vs t1c2n2  +42.2   6.1  237  116  647   18.09  17.66
t1c2n8  vs t1c2n4  +31.4   5.7  203  113  684   18.49  18.19
t1c2n16 vs t1c2n8  +32.1   5.7  206  114  680   18.97  18.58
The last four rows can be used to extend the summary table above with one more row:

Code: Select all

config   X1   X2   X4   X8  X16  X32
tXc1n1    -  112  100   85   72   63
t1cXn1    -   72   76   73   66   44
t1c1nX    -   50   46   51   29   --
t1c2nX    -   31   42   31   32   --
Varying transposition table size

With the given hardware texel fills the transposition table at around 16MB/s per thread. This means that the fixed 128MB hash table used in the previous tests is too small to hold the entire search tree when the thinking time, number of cores, or number of cluster nodes is large. To see if this affected the results, some tests were repeated with a 512MB hash table.

Code: Select all

                    dElo  sDev  win loss draw  depth1 depth2
t8c1n1  vs t4c1n1  +97.1   4.2  654  109 1237   20.57  19.24
t16c1n1 vs t8c1n1  +83.4   4.2  604  133 1263   21.89  20.60
t32c1n1 vs t16c1n1 +72.4   4.0  538  127 1335   23.18  21.89
Some tests were also repeated with a 1MB transposition table, which is too small to hold the whole search tree for all configurations.

Code: Select all

                    dElo  sDev  win loss draw  depth1 depth2
t2c1n1 vs t1c1n1  +103.3   4.1  875  182 1343   16.90  15.82
t4c1n1 vs t2c1n1   +77.5   4.4  619  180 1201   18.06  17.02
t8c1n1 vs t4c1n1   +77.0   4.3  603  167 1230   19.11  18.08

t1c2n1 vs t1c1n1   +48.6   4.4  544  266 1190   16.35  16.11
t1c4n1 vs t1c2n1   +38.8   3.1  953  508 2539   16.89  16.48
t1c8n1 vs t1c4n1   +46.3   4.2  465  200 1335   17.52  17.07
Using the above data the summary table can be extended:

Code: Select all

config   X1   X2   X4   X8  X16  X32
tXc1n1    -  112  100   85   72   63
tXc1n1'   -             97   83   72     // 512MB hash
tXc1n1''  -  103   78   77               // 1MB hash
t1cXn1    -   72   76   73   66   44
t1cXn1''  -   49   39   46               // 1MB hash
t1c1nX    -   50   46   51   29   --
t1c2nX    -   31   42   31   32   --
It can be seen that using a too small transposition table hurts more when doubling the number of cores than when doubling the thinking time.

Possible NUMA issues

As mentioned above the 32 core configuration was the only one running on more than one NUMA node. The result 32 cores vs 16 cores (+44 elo) was also somewhat lower than expected. To further test NUMA behavior a 16 core vs 8 core match was played on the 16 core computer (which has two NUMA nodes) and a 24 core vs 12 core match was played on the 24 core computer (which also has two NUMA nodes.)

Code: Select all

                    dElo  sDev  win loss draw  depth1 depth2
t1c16n1 vs t1c8n1  +44.4   5.6  219   92  689   19.62  18.95    // 128MB hash
t1c16n1 vs t1c8n1  +48.3   6.1  240  102  658   19.68  19.00    // 1024MB hash
t1c24n1 vs t1c12n1 +41.5   5.9  211   92  697   20.12  19.58    // 128MB hash
t1c24n1 vs t1c12n1 +47.5   6.0  229   93  678   20.39  19.80    // 1024MB hash
The elo increase is similar to the earlier 32 cores vs 16 cores match, and the 16 cores vs 8 cores result (+44.4 elo) is lower than the earlier non-NUMA 16 cores vs 8 cores result (+66 elo).

Test data

All games played for these measurements are available in this archive file.

The tests used the texel development version 1.07a27, available for download here.

Turbo boost effects

Turbo boost was enabled for all tests. This makes the computers run faster than the base clock frequency, but the speed increase is larger when fewer cores are in use. This can potentially skew the results in favor for configurations using few cores, i.e. it can look like the SMP algorithm scales worse than it would scale on hypothetical hardware that runs at the same clock speed regardless of how many cores are in use.

All used computers have very good cooling systems, so the actual turbo boost frequency can be calculated from processor turbo boost specification information as explained here. The following table shows the turbo boost value as a function on the number of active cores:

Code: Select all

             18 17 16 15 14 13 12 11 10  9  8  7  6  5  4  3  2  1   base    incr
E5-2686 v4 :  4  4  4  4  4  4  4  4  4  4  4  4  4  4  4  5  7  7   2.3GHz  0.1GHz
E5-2680 v3 :                    4  4  4  4  4  4  4  4  5  6  8  8   2.5GHz  0.1GHz
E5-2670 0  :                                4  4  5  5  6  6  7  7   2.6GHz  0.1GHz
As an example of the worst case, consider a match played between 1 core and 12 cores on the E5-2680 computer. When the 1 core engine is thinking only one core is active so the clock frequency is 2.5+8*0.1 = 3.3GHz. When the 12 core engine is thinking all 12 cores are active so the clock frequency is 2.5+4*0.1 = 2.9GHz. In this case it would be as if the 12 core engine had a 12% time handicap (1-2.9/3.3), so if we wanted to measure the true algorithmic 12 core vs 1 core improvement, we would have to give the 12 core engine 12% more thinking time.

However, these tests were run in a way that does not trigger the worst case, since only 2*N vs 1*N core matches were played (for various values of N), and as many parallel games as possible were played on each test computer. For example, consider a 4 core vs 2 cores match played on the E5-2680 computer. If only one game were to be played in parallel, the speed difference would be 1-(2.5+5*0.1)/(2.5+8*0.1)=9%. However since only 4 cores are needed for each game, 3 games are played in parallel. This means that at least 2*3 cores are active at any time, so it can be seen from the table above that the turbo boost multiplier is always 4, and hence the speed difference is zero.

I think the largest difference that actually occurred in the test data is 8 cores vs 4 cores played on the E5-2670 computer. In this case the speed difference is 1-(2.6+4*0.1)/(2.6+6*0.1)=6%.
petero2
Posts: 729
Joined: Mon Apr 19, 2010 7:07 pm
Location: Sweden
Full name: Peter Osterlund

Possible improvements

Post by petero2 »

For large clusters the cluster nodes are often connected in a regular structure, such as a 3D or 5D hyper torus. If the tree of cluster nodes created by the algorithm was created in a way that takes the cluster connectivity graph into account, the communication could be more efficient. For example a breadth first search starting from node 0 could be used to construct the tree. It is unknown how much this would help though, since large clusters also often have faster interconnects than Ethernet, so maybe it is already fast enough without this optimization.

The elo measurements showed that when the transposition table is not large enough to contain the entire search tree, multi core searches suffer more than increased time searches. It might be possible to improve the hash replacement strategy to improve both multi core and single core performance for the too small hash table case.

The tests where more than one NUMA node was used showed worse performance than expected. Texel tries to allocate the entire transposition table on NUMA node 0 and threads running on other NUMA nodes skip the hash table probe if the remaining search depth is 0. This logic is inherited from the old texel SMP algorithm, where measurements showed that it gave a noticeable reduction in the number of remote memory accesses. It is possible that this logic is bad for elo though, in which case it should be removed.

The elo measurements showed that adding more cluster nodes gave less elo improvement than adding the same amount of additional cores in the SMP case. There are at least four cluster overheads compared to SMP:
* The main thread on each cluster node has to handle the MPI communication.
* The main thread on each cluster node has to handle the insertion of hash table entries into the node local hash table.
* The delay between generating a hash table entry and the entry being available for all search threads is larger.
* Hash table entries that have a too small depth are not shared between cluster nodes.
Measurements could be performed to determine the size of these effects. It might be possible to reduce the overheads, for example by offloading some of the work to a special communication thread.

The extra depth used by helper threads is currently computed as "threadNumber % 2". If there is a large number of helper threads it might be better to let some of them search at even larger depths.

Finally, tests have shown that when running on 16 cores on a single computer, the lazy SMP algorithm is around 12 elo weaker than texel's old SMP algorithm. It is not obvious how to extend the old algorithm to clusters, but it is possible that some sort of generalization of that algorithm could perform better on clusters than the current lazy algorithm does.
Daniel Shawul
Posts: 4186
Joined: Tue Mar 14, 2006 11:34 am
Location: Ethiopia

Re: Lazy SMP and "lazy cluster" experiments

Post by Daniel Shawul »

Hi Peter,
I did similar experiments in scorpio a few years ago with smp and cluster algorithms. Three algorithms were included in my tests YBW, ABDADA and SHT(shared hash table or lazy algorithm) for both smp and cluster algorithms with the default being YBW for both. Ofcourse on the cluster, you have the additional issue of a distributed/local transposition table that is fundamental to the performance of the last two algorithms. I think the depth of transposition table entries shared between nodes was too high for my tests because of the slow interconnect, but maybe ABDADA would be a better default option for the cluster on a faster interconnect. My SHT was pretty lazy in that, I don't do different depth searches with different threads/processes. To this day, I feel like playing with root search depths is a kludge and ABDADA is the right algorithm for an algorithm that relies only on shared transposition table. Note that ABDADA was actually first proposed as a distributed algorithm, so is YBW as a matter of fact.
Daniel
petero2
Posts: 729
Joined: Mon Apr 19, 2010 7:07 pm
Location: Sweden
Full name: Peter Osterlund

Re: Lazy SMP and "lazy cluster" experiments

Post by petero2 »

Daniel Shawul wrote:Hi Peter,
I did similar experiments in scorpio a few years ago with smp and cluster algorithms. Three algorithms were included in my tests YBW, ABDADA and SHT(shared hash table or lazy algorithm) for both smp and cluster algorithms with the default being YBW for both. Ofcourse on the cluster, you have the additional issue of a distributed/local transposition table that is fundamental to the performance of the last two algorithms. I think the depth of transposition table entries shared between nodes was too high for my tests because of the slow interconnect, but maybe ABDADA would be a better default option for the cluster on a faster interconnect. My SHT was pretty lazy in that, I don't do different depth searches with different threads/processes. To this day, I feel like playing with root search depths is a kludge and ABDADA is the right algorithm for an algorithm that relies only on shared transposition table. Note that ABDADA was actually first proposed as a distributed algorithm, so is YBW as a matter of fact.
Daniel
Hi Daniel,

I actually read a lot of old interesting posts from you for inspiration while working on texel's cluster algorithm. I remember reading that you had trouble getting good performance because of overhead caused by waiting for responses from other cluster nodes. I wanted to avoid that and therefore only considered algorithms where no node ever has to wait for an answer from any other node during search.

Lazy SMP is easy to implement in a totally asynchronous way, so I went with that. I am not completely happy that I lose around 12 elo on 16 core SMP compared to the old SMP algorithm though. For the time being lazy cluster easily beats non-lazy SMP though, since cluster makes it possible to use many more cores, so that is what I use in HGM's monthly tournaments.

It is possible that ABDADA could work well even if hash table updates are not atomic. I think I read somewhere that ABDADA can be made fault tolerant by not using an "nproc" counter, but instead just use a boolean "visited" flag in the hash table. If it is fault tolerant I think the worst that could happen is that it falls back to lazy SMP behavior if several threads/nodes modify the "visited" flag concurrently. It could be interesting to test this.

Regarding minimum sharing depth, I did a quick and dirty measurement of that for the root node when running on two single threaded cluster nodes connected by 1 gigabit Ethernet (ping time around 0.2ms). The average minimum sharing depth when searching in the start position was 1.04, distributed like this:

Code: Select all

min depth   % of time
0              7.04
1             82.21
2             10.73
3              0.02
I plan to do a more accurate measurement later in several different cluster configurations.

My long term goal is to figure out an algorithm for a distributed asynchronous approximate priority queue, which I could use as a building block when generalizing the old texel SMP algorithm to work in a cluster.
D Sceviour
Posts: 570
Joined: Mon Jul 20, 2015 5:06 pm

Re: Lazy SMP and "lazy cluster" experiments

Post by D Sceviour »

petero2 wrote:It is possible that ABDADA could work well even if hash table updates are not atomic. I think I read somewhere that ABDADA can be made fault tolerant by not using an "nproc" counter, but instead just use a boolean "visited" flag in the hash table. If it is fault tolerant I think the worst that could happen is that it falls back to lazy SMP behavior if several threads/nodes modify the "visited" flag concurrently. It could be interesting to test this.
Interesting. I was playing around with something like this today. Normally, hash depth > current node depth would indicate that the node has already been searched and therefore can return safely with the stored hash value. However, if a stored hash flag said it was "busy" being searched by another thread, then what should be done? If the current thread continues then it is just duplicating another thread. If the current thread wants to avoid this, then what hash value should be returned?
jdart
Posts: 4410
Joined: Fri Mar 10, 2006 5:23 am
Location: http://www.arasanchess.org

Re: Lazy SMP and "lazy cluster" experiments

Post by jdart »

Last time I tried ABDADA it was a big loser, but that was a long time ago and not on modern hardware.

I am kind of afraid to touch my SMP implementation because it works very reliably and it took a long time to get it there. But it doesn't scale as well as I would like. I have this on my to-do list to look at but it is not at the top of the list.

Thanks Peter for your very detailed writeup.

--Jon
petero2
Posts: 729
Joined: Mon Apr 19, 2010 7:07 pm
Location: Sweden
Full name: Peter Osterlund

Re: Lazy SMP and "lazy cluster" experiments

Post by petero2 »

D Sceviour wrote:
petero2 wrote:It is possible that ABDADA could work well even if hash table updates are not atomic. I think I read somewhere that ABDADA can be made fault tolerant by not using an "nproc" counter, but instead just use a boolean "visited" flag in the hash table. If it is fault tolerant I think the worst that could happen is that it falls back to lazy SMP behavior if several threads/nodes modify the "visited" flag concurrently. It could be interesting to test this.
Interesting. I was playing around with something like this today. Normally, hash depth > current node depth would indicate that the node has already been searched and therefore can return safely with the stored hash value. However, if a stored hash flag said it was "busy" being searched by another thread, then what should be done? If the current thread continues then it is just duplicating another thread. If the current thread wants to avoid this, then what hash value should be returned?
The idea was that the busy flag is only a hint that affects in what order moves are searched, so if there is a conflict between the busy flag and the result already stored in a TT entry, the busy flag is ignored.

Something like this (untested):

Code: Select all

int abdada(Position pos, int alpha, int beta, int depth, bool exclusive) {
    // Check for abort
    // Repetition check

    TTEntry& e = tt.probe(pos);
    if (e.cutOff(alpha, beta, depth))
        return e.score;
    if (exclusive && e.busy)
        return BUSY;
    e.busy = true;

    // null move, forward pruning, etc

    MoveList moves = generateMoves(pos);
    bool searched[256];
    int nextDepth[256];
    for (int pass = 0; pass < 2; pass++) {
        for (int i = 0; i < moves.size(); i++) {
            if (pass == 0)
                nextDepth[i] = depth - 1 + extension - reduction;
            else if (searched[i])
                continue;
            bool nextExclusive = pass == 0 && i > 0;
            pos.makeMove(moves[i]);
            int score = -abdada(pos, -beta, -alpha, nextDepth[i], nextExclusive);
            pos.unMakeMove(moves[i]);
            searched[i] = score != -BUSY;

            // Update alpha and bestScore, break if done
        }
    }

    tt.store(pos, bestScore); // Also sets entry.busy = false;
    return bestScore;
}
The idea is that maintaining sequential consistency across cluster nodes is too expensive, so instead have an algorithm that works even if e.busy is sometimes wrong. When it is wrong several threads could search the same subtree simultaneously, possibly causing search overhead. The hope is that this still contributes to elo increase, like in lazy SMP, but there is a risk that it does not work well when the "depth + 1" trick from lazy SMP is not used.

I don't know if this algorithm would be better than lazy SMP, or if it can be combined with the "depth + 1" idea from lazy SMP. It would have to be tested by playing games.