Lazy SMP and "lazy cluster" experiments
Moderators: hgm, Rebel, chrisw
-
- Posts: 60
- Joined: Thu Jul 07, 2016 10:15 pm
Re: Lazy SMP and "lazy cluster" experiments
Send me the cluster engine for my 2 Xeon 12-core machines.
I want to test whether it works.
-
- Posts: 697
- Joined: Mon Apr 19, 2010 7:07 pm
- Location: Sweden
- Full name: Peter Osterlund
Re: Lazy SMP and "lazy cluster" experiments
main line wrote:Send me the cluster engine for my 2 Xeon 12-core machines. I want to test whether it works.
You can download it from here: texel107a27.7z
To run in cluster mode you must first have a working MPI installation on your computers. If you are running windows, you can download from here: Microsoft MPI v8.1.
The texel archive contains a cluster enabled version for windows, texel64cl.exe. If you run linux, you have to compile a cluster enabled version yourself. It is hopefully easy to figure out how to do that by searching for cluster in the Makefile.
There is some more information in the readme.txt file in the texel archive:
Code: Select all
Cluster
-------
Texel can run on computer clusters by using the MPI system. It has only been
tested using MPICH in linux and MS-MPI in windows but should work with other MPI
implementations as well.
The pre-compiled windows executable texel64cl.exe is compiled and linked against
MS-MPI version 8.1. It requires the MS-MPI redistributable package to be
installed and configured on all computers in the cluster.
Running on a cluster is an advanced functionality and probably requires some
knowledge of cluster systems to set up.
Texel uses a so called hybrid MPI design. This means that it uses a single MPI
process per computer. On each computer it uses threads and shared memory, and
optionally NUMA awareness.
As an example, if you have 4 linux computers called host1, host2, host3, host4
and MPICH installed on all computers, you can start Texel like this:
mpiexec -f machinefile -n 4 /path/to/texel
"machinefile" is a text file that lists the available computers and the number
of MPI processes to run on each computer (which should be 1), like this:
host1:1
host2:1
host3:1
host4:1
After texel has been started, use the "Threads" UCI option to control the total
number of search threads to use. Texel automatically decides how many threads to
use for each computer, and can also handle the case where different computers
have different numbers of CPUs and cores.
-
- Posts: 697
- Joined: Mon Apr 19, 2010 7:07 pm
- Location: Sweden
- Full name: Peter Osterlund
Re: Possible improvements
petero2 wrote:The tests where more than one NUMA node was used showed worse performance than expected. Texel tries to allocate the entire transposition table on NUMA node 0, and threads running on other NUMA nodes skip the hash table probe if the remaining search depth is 0. This logic is inherited from the old texel SMP algorithm, where measurements showed that it gave a noticeable reduction in the number of remote memory accesses. It is possible that this logic is bad for elo though, in which case it should be removed.
I have now run a test match, and the idea to skip hash table probes at depth 0 for NUMA nodes > 0 is really bad for the lazy SMP algorithm. I got a 12 elo improvement by removing that code:
Code: Select all
engines dElo sDev win loss draw depth1 depth2
a28 vs a27 +12.5 2.4 1020 804 4176 19.72 19.71
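The condition that was removed can be sketched like this (hypothetical names; the post does not show Texel's actual code):

```cpp
// Old behavior: threads on NUMA nodes other than node 0 skipped the
// transposition table probe when the remaining depth was 0, to reduce
// remote memory accesses. Always probing instead measured about
// +12 elo in the match above.
bool shouldProbeTT(int numaNode, int depth, bool skipRemoteShallowProbes) {
    if (skipRemoteShallowProbes && numaNode > 0 && depth <= 0)
        return false; // a27 behavior: avoid the remote access
    return true;      // a28 behavior: always probe
}
```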
-
- Posts: 697
- Joined: Mon Apr 19, 2010 7:07 pm
- Location: Sweden
- Full name: Peter Osterlund
Re: Lazy SMP and "lazy cluster" experiments
Daniel Shawul wrote:I wonder what the Johnny guy is using to run on a cluster with thousands of nodes; not interested in the specifics but just whether it is a smart YBW-like algorithm, or a dumb SHT-like (no pun intended) algorithm.
I don't know what the primary cluster algorithm is, but according to this post he uses some of the nodes for "speculative pondering".
-
- Posts: 570
- Joined: Mon Jul 20, 2015 5:06 pm
Re: Lazy SMP and "lazy cluster" experiments
Here is the root move list showing a sample distribution of the number of nodes per thread, for four threads total, using a hash busy bit. Each column represents the number of nodes used by each thread per move. Each run is different because of the different startup times of the threads, so the results can never be duplicated exactly; they are only typically representative.
At the bottom, ctnodes shows the total number of nodes used by each thread. This is a good result: the nodes are approximately equally distributed among all the threads, and there is no sitting around waiting for a parent thread.
I am not satisfied that each thread has the proper aspiration value at root ply 1. Each thread starts with (-infinity, infinity). For now, the root ply 1 threads are corrected with the code below, but perhaps there is a better suggestion.
Code: Select all
Search Iteration = 12
PV [12] 12 e2-e4 e7-e5 g1-f3 g8-f6 b1-c3 b8-c6 f1-b5 d7-d6
Thread nodes 1 2 3 4
______________________________________________________________
1 e2-e4 311343 403850 385643 381976
2 g1-f3 241327 153263 57239 65989
3 e2-e3 13582 1 21780 28041
4 f2-f3 2341 1 7497 7128
5 b1-a3 267 1 1114 1114
6 c2-c4 1316 1 6179 6292
7 g1-h3 1 1 22 1
8 b1-c3 2909 1 24050 25558
9 d2-d4 2256 1 40759 28695
10 d2-d3 1 1 1 17931
11 f2-f4 11443 8146 1 2294
12 c2-c3 1 1 1 1
13 b2-b3 1 1 1 1
14 g2-g4 1 1 1 1
15 g2-g3 1 1 1 1
16 h2-h3 6140 6067 1 1
17 a2-a3 4462 5045 1 1
18 h2-h4 1 9335 1 1
19 a2-a4 1 5955 1 1
20 b2-b4 1 57 1 1
_____________
Total 597395
ctnodes 1 1248234
ctnodes 2 1069643
ctnodes 3 1075838
ctnodes 4 1117619
Code: Select all
if (rootAlpha>alpha)
alpha=rootAlpha;
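In context, that correction might sit at the start of a helper thread's root search, roughly like this (a sketch with hypothetical names):

```cpp
// Helper threads start with the full window (-inf, inf); before they
// search root ply 1, alpha is raised to the best root score already
// proven, so a helper does not waste nodes below an established bound.
struct Window { int alpha; int beta; };

Window clampToRootAlpha(Window w, int rootAlpha) {
    if (rootAlpha > w.alpha)
        w.alpha = rootAlpha;
    return w;
}
```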
-
- Posts: 4185
- Joined: Tue Mar 14, 2006 11:34 am
- Location: Ethiopia
Re: Possible improvements
petero2 wrote:I have now run a test match and the idea to skip hash table probes at depth 0 for NUMA nodes > 0 is really bad for the lazy SMP algorithm. I got a 12 elo improvement by removing that code. It could be that this was bad for the old algorithm too. At the time I added that code I did not play enough games to reliably measure the elo difference.
Code: Select all
engines dElo sDev win loss draw depth1 depth2
a28 vs a27 +12.5 2.4 1020 804 4176 19.72 19.71
The latency for remote memory access may not be big enough for you to avoid 0-depth lookups. In any case, distributing the TT is the correct and scalable approach when you have multiple sockets per node. I did not measure the speedup I got from this setup (scorpio has an option for local/global/distributed TT on a NUMA node and also CLUSTER), but a distributed TT did help some on the cluster.
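A distributed TT partitions entries across the cluster nodes instead of replicating them. The routing step alone might look like this (a sketch; scorpio's actual scheme is not shown in the post):

```cpp
#include <cstdint>

// The position's hash key picks the cluster node that owns the entry;
// probes and stores for that key are then sent to the owner (the MPI
// transfer itself is omitted here).
int ownerNode(uint64_t key, int numNodes) {
    return static_cast<int>(key % static_cast<uint64_t>(numNodes));
}
```

Because the key is effectively random, entries spread evenly across nodes, at the cost of a network round trip for remote probes.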
-
- Posts: 697
- Joined: Mon Apr 19, 2010 7:07 pm
- Location: Sweden
- Full name: Peter Osterlund
Re: Lazy SMP and "lazy cluster" experiments
Daniel Shawul wrote:What I don't like about lazy is that ABDADA has a work sharing mechanism inside the TT, while lazy fully relies on the threads searching different parts of the tree by ad hoc techniques like using different root depths, different root move ordering, or intentionally starting one thread slower than the other (just came up with this one now). Why not do all of this on top of ABDADA? It should be able to perform better. My lazy smp just shares the TT without using such tricks and served as a baseline for the performance of YBW and ABDADA.
petero2 wrote:I plan to do a hybrid ABDADA/lazy SMP implementation soon and see how it performs in terms of elo. Hopefully within a week.
I have finished my first "approximate ABDADA" test now. The algorithm is basically as I outlined before. A few additional comments:
* If a hash probe does not find an existing entry for the position, I do not create a new entry just to be able to set the busy flag. This means that on the first iterative deepening iteration where a position is searched, the busy flag will not be used, but in subsequent iterations a hash entry will exist (unless it has been overwritten by something else) so the busy flag will then be used.
* I disabled the lazy SMP "depth + 1" trick, as I don't know if that makes sense in combination with ABDADA. Disabling it in a pure lazy SMP setting seemed to cost around 5 elo at most, so I hope this is not a big deal.
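The first point can be sketched roughly like this (hypothetical names and entry layout, not Texel's actual hash code):

```cpp
#include <cstdint>

const int BUSY = 55555; // sentinel outside the normal score range

struct HashEntry {
    uint64_t key = 0;
    bool busy = false; // another thread is currently searching this node
};

// If no entry exists for the position, none is created just to hold the
// busy flag, so on the first ID iteration the flag has no effect. Once
// an entry exists, an "exclusive" probe defers the node when another
// thread is already working on it.
int probe(HashEntry* e, uint64_t key, bool exclusive) {
    if (e == nullptr || e->key != key)
        return 0;    // no entry: search normally, set no flag
    if (exclusive && e->busy)
        return BUSY; // defer this move; retry it in a later pass
    e->busy = true;  // mark the node as being searched
    return 1;        // entry found; caller may also use its bounds
}
```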
The result after 2000 games at time control 24+0.24, 4 cores was:
Code: Select all
dElo sDev win loss draw depth1 depth2
ABDADA vs lazy +8.2 4.5 361 314 1325 18.77 18.40
The average search depth increased, which is expected since lazy SMP tends to "go wide" rather than deep.
As this was a quick and dirty implementation, I suspect it has some overhead that negatively affects single threaded search. I think it can be optimized, which hopefully improves both single threaded and multi threaded performance.
-
- Posts: 570
- Joined: Mon Jul 20, 2015 5:06 pm
Re: Lazy SMP and "lazy cluster" experiments
Here are some suggested changes to the pseudo code that I found useful to get it functional:
(1) Define a BUSY_SCORE = arbitrary 55555 or any value providing
MATE_SCORE < BUSY_SCORE < INFINITY
(2) It needs a better name, since it is not really an ABDADA search and is not close to the ABDADA description given in the wiki. Instead of lazy SMP or approximateAbdada(), how about busy SMP?
(3) counter1 and counter2 are global debug variables to test the efficiency of the BUSY bit:
Code: Select all
// debug counters for efficiency test
if (e.busy)
    (exclusion ? counter1++ : counter2++);
(4) An "exclusion" flag sounds better than "exclusive", but it may be a matter of taste.
(5) e.busy = true is placed after null move and forward pruning, since these could produce fast cutoffs.
(6) Clear the exclusion flag for the PVS and reduction re-search so the first search is not lost.
(7) The busy exclusive test is not included in the quiescence hash probe, but I may re-look at that one.
I would post an edited code version, but for some reason the editor here will not accept any changes to the original "approximateAbdada()" code.
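Point (1) can be written as compile-time checks. The MATE_SCORE and INFINITY values below are placeholders, not any engine's actual constants; only the ordering matters:

```cpp
// Placeholder bounds; any engine's actual values work as long as the
// ordering holds, so BUSY_SCORE can never collide with a real score.
const int MATE_SCORE = 32000;
const int INFINITY_SCORE = 64000;
const int BUSY_SCORE = 55555;

static_assert(MATE_SCORE < BUSY_SCORE,
              "BUSY_SCORE must exceed all mate scores");
static_assert(BUSY_SCORE < INFINITY_SCORE,
              "BUSY_SCORE must stay below infinity");
```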
-
- Posts: 4185
- Joined: Tue Mar 14, 2006 11:34 am
- Location: Ethiopia
Re: Lazy SMP and "lazy cluster" experiments
petero2 wrote:I have finished my first "approximate ABDADA" test now. The algorithm is basically as I outlined before.
ABDADA makes much more sense than that lazy hack, which is just a parallel iterative deepening. If it were beneficial to do d+1 before finishing d, then it should be beneficial to do so in the sequential version too ...
petero2 wrote:* If a hash probe does not find an existing entry for the position, I do not create a new entry just to be able to set the busy flag. This means that on the first iterative deepening iteration where a position is searched, the busy flag will not be used, but in subsequent iterations a hash entry will exist (unless it has been overwritten by something else) so the busy flag will then be used.
* I disabled the lazy SMP "depth + 1" trick, as I don't know if that makes sense in combination with ABDADA. Disabling it in a pure lazy SMP setting seemed to cost around 5 elo at most, so I hope this is not a big deal.
Was the loss only 5 elo? That is quite surprising, given that the trick was branded as the magic of lazy smp. How does the d+1 ID trick interact with internal ID?
My results with fixed depth search were very bad for the simple SHT, as detailed in this post. The search overhead was way too big, and the speedup way too small, for a test I did on the bratko-kopec positions:
Code: Select all
        Overhead   Speedup   Nps-scaling
YBW:     1.123      2.070      1.993
ABD:     1.265      1.748      1.915
SHT:     1.997      1.161      1.893
petero2 wrote:The result after 2000 games at time control 24+0.24, 4 cores was:
Code: Select all
               dElo sDev win loss draw depth1 depth2
ABDADA vs lazy +8.2  4.5  361  314 1325  18.77  18.40
Considering this was the first try, it looks good, even though the number of games is too small to be sure. Yep, that is because ABDADA is a proper parallel search algorithm, and quite a novel one compared to other contemporary tree splitting techniques with its use of the TT.
petero2 wrote:The average search depth increased, which is expected since lazy SMP tends to "go wide" rather than deep.
If lazy smp widens the tree and gives a benefit, one should be able to get the same result with just a sequential search. If it has a widening effect, though I don't understand how -- it sure has way too much overhead -- it may be an indication that one's selective search is too aggressive and needs to be toned down. Lazy smp lacks so many things as a parallel algorithm, but people come up with excuses to justify using it.
petero2 wrote:As this was a quick and dirty implementation I suspect it has some overhead that negatively affects single threaded search. I think it can be optimized, which hopefully improves both single threaded and multi threaded performance.
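The post does not define the three columns. One common set of definitions is assumed below; note the table above does not exactly satisfy the implied identity nps-scaling = overhead × speedup, which suggests the actual numbers were averaged per position, so treat this purely as a sketch:

```cpp
struct Metrics { double overhead; double speedup; double npsScaling; };

// Assumed definitions: overhead = ratio of nodes searched by the
// parallel run to the serial run, speedup = wall-time ratio, and
// nps-scaling = nodes-per-second ratio.
Metrics compare(double serialNodes, double serialTime,
                double parNodes, double parTime) {
    Metrics m;
    m.overhead   = parNodes / serialNodes;
    m.speedup    = serialTime / parTime;
    m.npsScaling = (parNodes / parTime) / (serialNodes / serialTime);
    return m;
}
```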
-
- Posts: 12566
- Joined: Wed Mar 08, 2006 8:57 pm
- Location: Redmond, WA USA
Re: Possible improvements
petero2 wrote:For large clusters the cluster nodes are often connected in a regular structure, such as a 3D or 5D hyper torus. If the tree of cluster nodes created by the algorithm was created in a way that takes the cluster connectivity graph into account, the communication could be more efficient. For example a breadth first search starting from node 0 could be used to construct the tree. It is unknown how much this would help though, since large clusters also often have faster interconnects than Ethernet, so maybe it is already fast enough without this optimization.
Daniel Shawul wrote:You can create groups of nodes with MPI_CART_CREATE to take advantage of the topology, but handling just the cluster node / smp core classification should be good enough already.
petero2 wrote:The problem is that the underlying hardware topology is unknown, so I don't know good values to pass to MPI_Cart_create. I suppose a command line parameter to texel where you can specify the hypertorus/hypercube topology would be good enough, since very few people would use this feature anyway.
Why not do a hardware discovery on startup and report it to a controlling node?
It could also be possible to do some sort of ping exchange to calibrate net speed.
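The discovery step alone could be as simple as this (a sketch; sending the result to the controlling node over MPI is omitted):

```cpp
#include <thread>

// Each process detects its own logical core count at startup; the
// result would then be reported to the controlling node.
unsigned discoverCores() {
    unsigned n = std::thread::hardware_concurrency();
    return n > 0 ? n : 1; // hardware_concurrency may return 0 if unknown
}
```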
Taking ideas is not a vice, it is a virtue. We have another word for this. It is called learning.
But sharing ideas is an even greater virtue. We have another word for this. It is called teaching.