Lazy SMP and "lazy cluster" experiments

D Sceviour · Post by **D Sceviour** » Wed Aug 09, 2017 1:07 am

petero2 wrote:The idea is that maintaining sequential consistency across cluster nodes is too expensive, so instead have an algorithm that works even if e.busy is sometimes wrong. When it is wrong several threads could search the same subtree simultaneously, possibly causing search overhead. The hope is that this still contributes to elo increase, like in lazy SMP, but there is a risk that it does not work well when the "depth + 1" trick from lazy SMP is not used.

The phrase "even if e.busy is sometimes wrong" sounds like a recipe for disaster. The algorithm would still have to wait somewhere to make sure the "searched(i)" moves do not return a busy signal. The abdada search can only complete after all those busy nodes are finished, otherwise it may return an incorrect value.

Maybe try something strange like if alpha>hash_value then return hash_value [untested]. This ensures the previous ply will fail-high on the search. The final score for ply-1 would be on the safe side. A refutation of the "busy" node by another thread should verify the case condition when it is complete.

Daniel Shawul · Post by **Daniel Shawul** » Wed Aug 09, 2017 5:48 pm

I remember reading that you had trouble getting good performance because of overhead caused by waiting for responses from other cluster nodes.

Note sure about that. There is not a lot of waiting in YBW waiting for response from other nodes; messages are processed asynchronously via MPI_IProbe(), and can even be processede as they arrive if the hardware allows it. IIRC the BG/Q has for instance a dedicated 17th core for that purpose. If you allow in any thread to process MPI messages you have a lot more complexity.

Ofcourse the standard YBW has some waiting to do for the last processor to finish, and for the first move to be searched fully before parallelization. But those have barely any effect on performance and the effect is the same for both smp and cluster machines. The problem with the cluster is the split depth is often much larger than that for smp due to often slow interconnect. Other than that the cluster YBW implementation almost gave me the same speedup as the smp implementation both running on an smp implementation.

Lazy SMP is easy to implement in a totally asynchronous way, so I went with that. I am not completely happy that I lose around 12 elo on 16 core SMP compared to the old SMP algorithm though. For the time being lazy cluster easily beats non-lazy SMP though, since cluster makes it possible to use many more cores, so that is what I use in HGM's monthly tournaments.

What I don't like about lazy is that ABDADA has a work sharing mechanism inside TT while lazy fully relies on the threads searching different parts of the tree by adhoc techinques like using different root depth, differnt root move ordering, or intentionally starting one thread slower than the other (just came up with this one now). Why not do all of this on top of ABDADA ? Should be able to perform better. My lazy smp just shares the TT without using such tricks and served as a baseline for performance of YBW and ABDADA.

Have you done multi-node tests for your cluster implementation ? I am sure loosing just 12 elo on an smp machine is something you should be happy about, but performance dips heavily when you start adding in more nodes. On an SMP machine, using pure MPI mode should actually be equivalent to smp mode because message passing mode could be done more efficiently by just copying data on a shared memory.

I think with the simple approach that you have for the cluster the distributed TT scheme has more impact than anything else. My opinion is that if YBW performed better than Lazy on SMP, it should perform better also on the cluster too. Asynchronous schemes often have way too much overhead, but given that we have way too much power on the cluster who knows which algorithm works better for it.

Thanks for the tests,
Daniel

petero2 · Post by **petero2** » Wed Aug 09, 2017 7:07 pm

D Sceviour wrote:
petero2 wrote:The idea is that maintaining sequential consistency across cluster nodes is too expensive, so instead have an algorithm that works even if e.busy is sometimes wrong. When it is wrong several threads could search the same subtree simultaneously, possibly causing search overhead. The hope is that this still contributes to elo increase, like in lazy SMP, but there is a risk that it does not work well when the "depth + 1" trick from lazy SMP is not used.
The phrase "even if e.busy is sometimes wrong" sounds like a recipe for disaster. The algorithm would still have to wait somewhere to make sure the "searched(i)" moves do not return a busy signal. The abdada search can only complete after all those busy nodes are finished, otherwise it may return an incorrect value.

Actually no thread has to wait for any other thread. If you look at my pseudo code it can be seen that all moves are searched exactly once (except in the case of a fail high, which stops the move loop immediately). The first pass searches move 0 and all moves where the busy flag indicates that nobody else is likely to be searching that move. The second pass searches all remaining moves (that's what the searched[] array is used for). If somebody else is already searching the same move in the second pass, that information is ignored and the thread will search the move itself anyway.

Therefore you will get "valid" scores returned by all threads, regardless of what other threads are doing. By "valid" I mean that the returned score is based on searching all possible moves, so it is for example not possible to "forget" to search the only move that recaptures a queen.

Whether or not this will actually cause stronger play than a single threaded search has to be tested. In the worst case the busy flag is wrong all the time and there is none of the "widening" effect that has been observed in lazy SMP.

I plan to test this algorithm and report how it performs.

Actually, reading the pseudo code again, I realize that it might not be clear that alpha and bestScore should only be updated if the move was actually searched, like this:

Code: Select all

int approximateAbdada&#40;Position pos, int alpha, int beta, int depth, bool exclusive&#41; &#123;
    // Check for abort
    // Repetition check

    TTEntry& e = tt.probe&#40;pos&#41;;
    if &#40;e.cutOff&#40;alpha, beta, depth&#41;)
        return e.score;
    if &#40;exclusive && e.busy&#41;
        return BUSY;
    e.busy = true;

    // null move, forward pruning, etc

    MoveList moves = generateMoves&#40;pos&#41;;
    bool searched&#91;256&#93;;
    int nextDepth&#91;256&#93;;
    for &#40;int pass = 0; pass < 2; pass++) &#123;
        for &#40;int i = 0; i < moves.size&#40;); i++) &#123;
            if &#40;pass == 0&#41;
                nextDepth&#91;i&#93; = depth - 1 + extension - reduction;
            else if &#40;searched&#91;i&#93;)
                continue;
            bool nextExclusive = pass == 0 && i > 0;
            pos.makeMove&#40;moves&#91;i&#93;);
            int score = -approximateAbdada&#40;pos, -beta, -alpha, nextDepth&#91;i&#93;, nextExclusive&#41;;
            pos.unMakeMove&#40;moves&#91;i&#93;);
            searched&#91;i&#93; = score != -BUSY;
            if (!searched&#91;i&#93;)               // <------------------------------------- Missing in previous post
                continue;

            // Update alpha and bestScore, break if done
        &#125;
    &#125;

    tt.store&#40;pos, bestScore&#41;; // Also sets entry.busy = false;
    return bestScore;
&#125;

petero2 · Post by **petero2** » Wed Aug 09, 2017 8:06 pm

Daniel Shawul wrote:
I remember reading that you had trouble getting good performance because of overhead caused by waiting for responses from other cluster nodes.
Note sure about that. There is not a lot of waiting in YBW waiting for response from other nodes; messages are processed asynchronously via MPI_IProbe(), and can even be processede as they arrive if the hardware allows it. IIRC the BG/Q has for instance a dedicated 17th core for that purpose. If you allow in any thread to process MPI messages you have a lot more complexity.

Ofcourse the standard YBW has some waiting to do for the last processor to finish, and for the first move to be searched fully before parallelization. But those have barely any effect on performance and the effect is the same for both smp and cluster machines. The problem with the cluster is the split depth is often much larger than that for smp due to often slow interconnect. Other than that the cluster YBW implementation almost gave me the same speedup as the smp implementation both running on an smp implementation.

I think the post where I got that idea was this one, but I realize now that that post is very old and probably not very relevant to what you are doing in current versions of Scorpio.

Daniel Shawul wrote:
Lazy SMP is easy to implement in a totally asynchronous way, so I went with that. I am not completely happy that I lose around 12 elo on 16 core SMP compared to the old SMP algorithm though. For the time being lazy cluster easily beats non-lazy SMP though, since cluster makes it possible to use many more cores, so that is what I use in HGM's monthly tournaments.
What I don't like about lazy is that ABDADA has a work sharing mechanism inside TT while lazy fully relies on the threads searching different parts of the tree by adhoc techinques like using different root depth, differnt root move ordering, or intentionally starting one thread slower than the other (just came up with this one now). Why not do all of this on top of ABDADA ? Should be able to perform better. My lazy smp just shares the TT without using such tricks and served as a baseline for performance of YBW and ABDADA.

I plan to do a hybrid ABDADA/lazy SMP implementation soon and see how it performs in terms of elo. Hopefully within a week.

Daniel Shawul wrote:Have you done multi-node tests for your cluster implementation ? I am sure loosing just 12 elo on an smp machine is something you should be happy about, but performance dips heavily when you start adding in more nodes. On an SMP machine, using pure MPI mode should actually be equivalent to smp mode because message passing mode could be done more efficiently by just copying data on a shared memory.

Yes, I actually reported several such results in the ELO measurements post.

I repeat the most interesting results here:

Two single threaded MPI processors running on different computers compared to a single threaded search on one computer: +50 elo

16 single threaded MPI processors running on different computers compared to a single threaded search on one computer: +50+46+51+29 = +176 elo

4 MPI processors running on different computers, each processor having 4 threads, compared to a non-MPI 4-thread search: +93 elo

2 MPI processors running on different computers, each processor having 32 threads running on a 2-socket NUMA hardware, compared to a non-MPI 32-thread search running on one computer having the same NUMA hardware: +39 elo

Daniel Shawul wrote:I think with the simple approach that you have for the cluster the distributed TT scheme has more impact than anything else. My opinion is that if YBW performed better than Lazy on SMP, it should perform better also on the cluster too. Asynchronous schemes often have way too much overhead, but given that we have way too much power on the cluster who knows which algorithm works better for it.

The problem is that my old algorithm was not YBW, it was a home made algorithm that admittedly has many YBW and DTS-like properties, but still is quite different. For one thing, it keeps all potential parallel tasks in a priority queue data structure, and that structure is only accessed under a lock.

I need to find a way to efficiently parallelize this data structure, and I don't want to implement locking primitives using message passing.

Also the data structure is actually more complicated that a priority queue, because it also has operations to recompute probability estimates and propagate those estimates between parent/child relationships between splitpoints.

D Sceviour · Post by **D Sceviour** » Wed Aug 09, 2017 8:57 pm

This looks interesting as a replacement for YBW. I may try something like this, as my program is having threading added for the first time. It has no constraints from any previous misconceptions. First thing is to get the hash save properly updating and clearing the BUSY bit.

I like the concept. Instead of slave and worker threads, each thread participates equally within a neural network of public library information (hash info). It may be called Integrated Neural Threading (INT). It might even contribute to artificial intelligence if it can get past the philosophical bias some people have towards slave and worker threads.

Daniel Shawul · Post by **Daniel Shawul** » Wed Aug 09, 2017 10:59 pm

I think the post where I got that idea was this one, but I realize now that that post is very old and probably not very relevant to what you are doing in current versions of Scorpio.

Yep, that one is pretty old and I am not even sure if i was using the cluster correctly infact.

I just did a quick test with 16 cores smp machine (Intel(R) Xeon(R) CPU E5-2690 v2 @ 3.00GHz). The number of nodes searched per sec by each core, for a st=10 sec search seems to be more or less the same using all the three methods.

Code: Select all

mt                  1
smp_depth           8
cluster_depth       11
message_poll_nodes  20
smp_type            YBW
cluster_type        YBW

The kilo-nodes per sec searched by each core are:

Code: Select all

ALG/CORES&#58;  1  2 4 8 `6
YBW  &#58;  1337k  1308k 1307k 1200k 1200k
ABDADA &#58; 1337k 1309k 1306k 1220k 1202k
SHT &#58; 1337k 1308k 1306k 1220k 1202k

It is possible to use the best algorithm, interms of lower search over-head (YBW in this instance), on a cluster as well. The original paper reports results for upto a 1000 cluster nodes before reaching stagnation. It maybe that your DTS-YBW like algorithm could still be the best algorithm on the cluster once you figure out the details.

Yes, I actually reported several such results in the ELO measurements post.

I repeat the most interesting results here:

Two single threaded MPI processors running on different computers compared to a single threaded search on one computer: +50 elo

16 single threaded MPI processors running on different computers compared to a single threaded search on one computer: +50+46+51+29 = +176 elo

4 MPI processors running on different computers, each processor having 4 threads, compared to a non-MPI 4-thread search: +93 elo

2 MPI processors running on different computers, each processor having 32 threads running on a 2-socket NUMA hardware, compared to a non-MPI 32-thread search running on one computer having the same NUMA hardware: +39 elo

It is good that you report results of doubling time so that we know we are not aiming for an impossible elo improvement with a parallel algorithm.

I think that if you get a 15 elo improvement from doubling 32n vs 16n, and then we assume anything less than 10 elo is not an improvement, this implies we need new algorithms for clusters >64 nodes and with thousands of cores. I wonder what the Johnny guy is using to run on a cluster with thousands of nodes; not interested in the specifics but just whether it is a smart YBW-like algorithm, or a dumb SHT-like (no pun intended) algorithm.

The problem is that my old algorithm was not YBW, it was a home made algorithm that admittedly has many YBW and DTS-like properties, but still is quite different. For one thing, it keeps all potential parallel tasks in a priority queue data structure, and that structure is only accessed under a lock.

I need to find a way to efficiently parallelize this data structure, and I don't want to implement locking primitives using message passing.

A centralized work queue (probably very inefficient) or a work-stealing like algorithm where each processor keeps its own queue ?

I have never tried it but there is also a transposition-table driven work scheduling where work is sent specifically to the processor that own the state. That is, given the hashkey of a position, you know which processor is going to search it; a novel work-stealing idea for sure but so far has only been used in GO programs i think.

Also the data structure is actually more complicated that a priority queue, because it also has operations to recompute probability estimates and propagate those estimates between parent/child relationships between splitpoints.

A probability estimate for the likelihood of the success of a splitpoint? It seems a complicated algorithm indeed; will be looking forward to the results of ABDADA and your algorithm. You should probably aim for inventing algorithms on clusters with thousands of nodes, as it seems SMP and cluster with few nodes is a "dead issue" IMO.
Daniel

Daniel Shawul · Post by **Daniel Shawul** » Thu Aug 10, 2017 6:02 pm

petero2 wrote:For large clusters the cluster nodes are often connected in a regular structure, such as a 3D or 5D hyper torus. If the tree of cluster nodes created by the algorithm was created in a way that takes the cluster connectivity graph into account, the communication could be more efficient. For example a breadth first search starting from node 0 could be used to construct the tree. It is unknown how much this would help though, since large clusters also often have faster interconnects than Ethernet, so maybe it is already fast enough without this optimization.

You can create groups of nodes with MPI_CART_CREATE to take advantage of the topology but handing just the cluster node, smp core classification should be good enough already. But I like the idea of storing the upper parts of the tree in memory gives a lot of opportunities for asynchronous algorithms that could be scalable on supercomputers. Monte-Carlo tree search does that and it is more scalable than alpha-beta. There is also an asynchronous cluster algorithm APHID for alpha-beta that does something similar.

The elo measurements showed that when the transposition table is not large enough to contain the entire search tree, multi core searches suffer more than increased time searches. It might be possible to improve the hash replacement strategy to improve both multi core and single core performance for the too small hash table case.

The tests where more than one NUMA node was used showed worse performance than expected. Texel tries to allocate the entire transposition table on NUMA node 0 and threads running on other NUMA nodes skip the hash table probe if the remaining search depth is 0. This logic is inherited from the old texel SMP algorithm, where measurements showed that it gave a noticeable reduction in the number of remote memory accesses. It is possible that this logic is bad for elo though, in which case it should be removed.

The elo measurements showed that adding more cluster nodes gave less elo improvement than adding the same amount of additional cores in the SMP case. There are at least four cluster overheads compared to SMP:
* The main thread on each cluster node has to handle the MPI communication.
* The main thread on each cluster node has to handle the insertion of hash table entries into the node local hash table.

You can have any thread process MPI messages including hashtable entry updates if the hardware supports it. Scorpio can work with MPI_THREAD_MULTIPLE but it gives you a lot of headache as compared to MPI_THREAD_SINGLE/FUNNELED.

* The delay between generating a hash table entry and the entry being available for all search threads is larger.

You can not do anything about that since that depends on the speed of the interconnet. But you can do speculative search while waiting for a TT entry from another processor ,and then quit the search if you get a TT hit.

* Hash table entries that have a too small depth are not shared between cluster nodes.
Measurements could be performed to determine the size of these effects. It might be possible to reduce the overheads, for example by offloading some of the work to a special communication thread.

Well if the hardware has that extra thread for communication (e.g. the BGQ's 17th core), then you can do that, otherwise you are loosing one thread that could have been used for search...

The extra depth used by helper threads is currently computed as "threadNumber % 2". If there is a large number of helper threads it might be better to let some of them search at even larger depths.

Indeed. What happens if you have 1000 threads, 500 on depth d and 500 on d+1, That seems too inefficient and probably why lazy won't work for clusters with many nodes..

petero2 · Post by **petero2** » Thu Aug 10, 2017 8:34 pm

Daniel Shawul wrote:
The problem is that my old algorithm was not YBW, it was a home made algorithm that admittedly has many YBW and DTS-like properties, but still is quite different. For one thing, it keeps all potential parallel tasks in a priority queue data structure, and that structure is only accessed under a lock.

I need to find a way to efficiently parallelize this data structure, and I don't want to implement locking primitives using message passing.
A centralized work queue (probably very inefficient) or a work-stealing like algorithm where each processor keeps its own queue ?

It is a centralized data structure. I got it to work reasonably well on SMP/NUMA machines by using various tricks, like:
* Dynamically adjusting the minimum split depth based on measured lock contention.
* Use a hand written heap implementation that is faster than the gcc STL version.
* Use trylock when attempting to insert into the queue. If the lock is busy store the splitpoint locally and then try again in the next search tree node, and perform a batch insert once you get the lock.

Nevertheless it is obvious that a centralized queue is not going to work well for a large cluster (maybe not even for any cluster). I would like to have some data structure that works like a priority queue on a local scale, but does something more scalable on a non-local scale. Some kind of approximate priority queue I guess.

Daniel Shawul wrote:
Also the data structure is actually more complicated that a priority queue, because it also has operations to recompute probability estimates and propagate those estimates between parent/child relationships between splitpoints.
A probability estimate for the likelihood of the success of a splitpoint? It seems a complicated algorithm indeed;

Yes, for each move in a splitpoint a probability is computed that a sequential search would have searched that move. The computation is based on measuring how often a fail high occurs for various move numbers and node types.

This probability changes every time a search for a move in the splitpoint completes. For example, in a position the probability that move 4 needs to be searched might be 70% after the search for move 1 has finished, but if the search for move 2 has also finished, the probability that move 4 is needed might be 95%. Therefore every time a search finishes for a move in a splitpoint, the priority queue needs to be updated.

For splitpoints not created by the master thread, there is another complication. The probability that a move in that splitpoint is needed by the master thread is computed from several parts:
The probability computed by fail high statistics for the move in that splitpoint.
The probability that the splitpoint itself is needed by the master thread.
The probability that the parent splitpoint is needed by the master thread.
...
The final probability is obtained by multiplying those probabilities.

This means that when the search finishes for a move at a splitpoint, the probability needs to be updated also for all child splitpoints.

Quite complicated indeed. That's why I hope some simpler algorithm could perform equally well, or even better.

Daniel Shawul wrote:will be looking forward to the results of ABDADA and your algorithm. You should probably aim for inventing algorithms on clusters with thousands of nodes, as it seems SMP and cluster with few nodes is a "dead issue" IMO.

Yes, I am aiming for thousands of nodes, but I will have no way to actually test that in the foreseeable future.

petero2 · Post by **petero2** » Thu Aug 10, 2017 10:08 pm

Daniel Shawul wrote:
petero2 wrote:For large clusters the cluster nodes are often connected in a regular structure, such as a 3D or 5D hyper torus. If the tree of cluster nodes created by the algorithm was created in a way that takes the cluster connectivity graph into account, the communication could be more efficient. For example a breadth first search starting from node 0 could be used to construct the tree. It is unknown how much this would help though, since large clusters also often have faster interconnects than Ethernet, so maybe it is already fast enough without this optimization.
You can create groups of nodes with MPI_CART_CREATE to take advantage of the topology but handing just the cluster node, smp core classification should be good enough already.

The problem is that the underlying hardware topology is unknown, so I don't know good values to pass to MPI_Cart_create. I suppose a command line parameter to texel where you can specify the hypertorus/hypercube topology would be good enough, since very few people would use this feature anyway.

Daniel Shawul wrote:
* The delay between generating a hash table entry and the entry being available for all search threads is larger.
You can not do anything about that since that depends on the speed of the interconnet.

In texel there is an additional delay because thread 0 periodically polls for new MPI messages, since I don't want to have a separate thread busy waiting for MPI messages. The polling interval could be optimized. Now it is somewhat arbitrarily set to 1000 search nodes.

Daniel Shawul wrote:
* Hash table entries that have a too small depth are not shared between cluster nodes.
Measurements could be performed to determine the size of these effects. It might be possible to reduce the overheads, for example by offloading some of the work to a special communication thread.
Well if the hardware has that extra thread for communication (e.g. the BGQ's 17th core), then you can do that, otherwise you are loosing one thread that could have been used for search...

I was thinking of offloading the part where received TT entries are stored in the transposition table to a separate hyperthread. That work will likely be stuck most of the time because of store buffer stalls. Therefore it might be possible to run it on a free hyperthread without much slowdown of the other hyperthread on the same core (which is doing normal search).

D Sceviour · Post by **D Sceviour** » Fri Aug 11, 2017 12:51 am

The first statistics are rolling in for the hash BUSY probe. From the root position:

Two threads attempt:
Total Nodes = 56695182
first pass busy counter1 1889 nodes
second pass busy counter2 218 nodes

88.5% efficiency in hash busy bit !!
0.9999790% overall efficiency

Four threads attempt:
Total Nodes = 201065925
first pass busy counter1 12672 nodes
second pass busy counter2 1719 nodes

86.4% efficiency in hash busy bit !!
0.99937161% overall efficiency

Undoubtedly, there are still many difficulties to overcome, but the results are too interesting to ignore. If these type of results continue to hold up, it might be suspected there will be a big improvement over the problem of diminishing marginal returns for the number of cores with YBW type threading. That is, elo might actually improve with increasing number of threads. Stay tuned.

Lazy SMP and "lazy cluster" experiments

Re: Lazy SMP and "lazy cluster" experiments

Re: Lazy SMP and "lazy cluster" experiments

Re: Lazy SMP and "lazy cluster" experiments

Re: Lazy SMP and "lazy cluster" experiments

Re: Lazy SMP and "lazy cluster" experiments

Re: Lazy SMP and "lazy cluster" experiments

Re: Possible improvements

Re: Lazy SMP and "lazy cluster" experiments

Re: Possible improvements

Re: Lazy SMP and "lazy cluster" experiments