Lazy SMP and "lazy cluster" experiments
Moderators: hgm, Rebel, chrisw
-
- Posts: 60
- Joined: Thu Jul 07, 2016 10:15 pm
Re: Lazy SMP and "lazy cluster" experiments
Send me the cluster engine for my 2 Xeon 12-core machines.
I want to test whether it works.
-
- Posts: 697
- Joined: Mon Apr 19, 2010 7:07 pm
- Location: Sweden
- Full name: Peter Osterlund
Re: Lazy SMP and "lazy cluster" experiments
main line wrote:Send me the cluster engine for my 2 Xeon 12-core machines. I want to test whether it works.
You can download it from here: texel107a27.7z
To run in cluster mode you must first have a working MPI installation on your computers. If you are running windows, you can download from here: Microsoft MPI v8.1.
The texel archive contains a cluster enabled version for windows, texel64cl.exe. If you run linux, you have to compile a cluster enabled version yourself. It is hopefully easy to figure out how to do that by searching for cluster in the Makefile.
There is some more information in the readme.txt file in the texel archive:
Code: Select all
Cluster
-------
Texel can run on computer clusters by using the MPI system. It has only been
tested using MPICH in linux and MS-MPI in windows but should work with other MPI
implementations as well.
The pre-compiled windows executable texel64cl.exe is compiled and linked against
MS-MPI version 8.1. It requires the MS-MPI redistributable package to be
installed and configured on all computers in the cluster.
Running on a cluster is an advanced functionality and probably requires some
knowledge of cluster systems to set up.
Texel uses a so called hybrid MPI design. This means that it uses a single MPI
process per computer. On each computer it uses threads and shared memory, and
optionally NUMA awareness.
As an example, if you have 4 linux computers called host1, host2, host3, host4
and MPICH installed on all computers, you can start Texel like this:
mpiexec -f machinefile -n 4 /path/to/texel
"machinefile" is a text file that lists the available computers and the number
of MPI processes to run on each computer (which should be 1), like this:
host1:1
host2:1
host3:1
host4:1
After texel has been started, use the "Threads" UCI option to control the total
number of search threads to use. Texel automatically decides how many threads to
use for each computer, and can also handle the case where different computers
have different numbers of CPUs and cores.
-
- Posts: 697
- Joined: Mon Apr 19, 2010 7:07 pm
- Location: Sweden
- Full name: Peter Osterlund
Re: Possible improvements
petero2 wrote:The tests where more than one NUMA node was used showed worse performance than expected. Texel tries to allocate the entire transposition table on NUMA node 0, and threads running on other NUMA nodes skip the hash table probe if the remaining search depth is 0. This logic is inherited from the old texel SMP algorithm, where measurements showed that it gave a noticeable reduction in the number of remote memory accesses. It is possible that this logic is bad for elo though, in which case it should be removed.
I have now run a test match, and the idea to skip hash table probes at depth 0 for NUMA nodes > 0 is really bad for the lazy SMP algorithm. I got a 12 elo improvement by removing that code:
Code: Select all
engines dElo sDev win loss draw depth1 depth2
a28 vs a27 +12.5 2.4 1020 804 4176 19.72 19.71
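The condition that was removed can be sketched like this (hypothetical names; the post does not show Texel's actual code):

```cpp
// Old behavior: threads on NUMA nodes other than node 0 skipped the
// transposition table probe when the remaining depth was 0, to reduce
// remote memory accesses. Always probing instead measured about
// +12 elo in the match above.
bool shouldProbeTT(int numaNode, int depth, bool skipRemoteShallowProbes) {
    if (skipRemoteShallowProbes && numaNode > 0 && depth <= 0)
        return false; // a27 behavior: avoid the remote access
    return true;      // a28 behavior: always probe
}
```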
-
- Posts: 697
- Joined: Mon Apr 19, 2010 7:07 pm
- Location: Sweden
- Full name: Peter Osterlund
Re: Lazy SMP and "lazy cluster" experiments
Daniel Shawul wrote:I wonder what the Johnny guy is using to run on a cluster with thousands of nodes; not interested in the specifics but just whether it is a smart YBW-like algorithm, or a dumb SHT-like (no pun intended) algorithm.
I don't know what the primary cluster algorithm is, but according to this post he uses some of the nodes for "speculative pondering".
-
- Posts: 570
- Joined: Mon Jul 20, 2015 5:06 pm
Re: Lazy SMP and "lazy cluster" experiments
Here is the root move list showing a sample distribution of the number of nodes per thread, for four threads total, using a hash busy bit. Each column represents the number of nodes used by each thread per move. Each run is different because of the different startup times of the threads, so the results can never be duplicated exactly; they are only typically representative.
At the bottom, ctnodes shows the total number of nodes used by each thread. This is a good result: the nodes are approximately equally distributed among all the threads, and there is no sitting around waiting for a parent thread.
I am not satisfied that each thread has the proper aspiration value at root ply 1. Each thread starts with (-infinity, infinity). For now, the root ply 1 threads are corrected with the code below, but perhaps there is a better suggestion.
Code: Select all
Search Iteration = 12
PV [12] 12 e2-e4 e7-e5 g1-f3 g8-f6 b1-c3 b8-c6 f1-b5 d7-d6
Thread nodes 1 2 3 4
______________________________________________________________
1 e2-e4 311343 403850 385643 381976
2 g1-f3 241327 153263 57239 65989
3 e2-e3 13582 1 21780 28041
4 f2-f3 2341 1 7497 7128
5 b1-a3 267 1 1114 1114
6 c2-c4 1316 1 6179 6292
7 g1-h3 1 1 22 1
8 b1-c3 2909 1 24050 25558
9 d2-d4 2256 1 40759 28695
10 d2-d3 1 1 1 17931
11 f2-f4 11443 8146 1 2294
12 c2-c3 1 1 1 1
13 b2-b3 1 1 1 1
14 g2-g4 1 1 1 1
15 g2-g3 1 1 1 1
16 h2-h3 6140 6067 1 1
17 a2-a3 4462 5045 1 1
18 h2-h4 1 9335 1 1
19 a2-a4 1 5955 1 1
20 b2-b4 1 57 1 1
_____________
Total 597395
ctnodes 1 1248234
ctnodes 2 1069643
ctnodes 3 1075838
ctnodes 4 1117619
Code: Select all
if (rootAlpha>alpha)
alpha=rootAlpha;
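In context, that correction might sit at the start of a helper thread's root search, roughly like this (a sketch with hypothetical names):

```cpp
// Helper threads start with the full window (-inf, inf); before they
// search root ply 1, alpha is raised to the best root score already
// proven, so a helper does not waste nodes below an established bound.
struct Window { int alpha; int beta; };

Window clampToRootAlpha(Window w, int rootAlpha) {
    if (rootAlpha > w.alpha)
        w.alpha = rootAlpha;
    return w;
}
```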
-
- Posts: 4185
- Joined: Tue Mar 14, 2006 11:34 am
- Location: Ethiopia
Re: Possible improvements
petero2 wrote:I have now run a test match and the idea to skip hash table probes at depth 0 for NUMA nodes > 0 is really bad for the lazy SMP algorithm. I got a 12 elo improvement by removing that code. It could be that this was bad for the old algorithm too. At the time I added that code I did not play enough games to reliably measure the elo difference.
Code: Select all
engines dElo sDev win loss draw depth1 depth2
a28 vs a27 +12.5 2.4 1020 804 4176 19.72 19.71
The latency for remote memory access may not be big enough for you to avoid 0-depth lookups. In any case, distributing the TT is the correct and scalable approach when you have multiple sockets per node. I did not measure the speedup I got from this setup (scorpio has an option for local/global/distributed TT on a NUMA node and also CLUSTER), but a distributed TT did help some on the cluster.
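A distributed TT partitions entries across the cluster nodes instead of replicating them. The routing step alone might look like this (a sketch; scorpio's actual scheme is not shown in the post):

```cpp
#include <cstdint>

// The position's hash key picks the cluster node that owns the entry;
// probes and stores for that key are then sent to the owner (the MPI
// transfer itself is omitted here).
int ownerNode(uint64_t key, int numNodes) {
    return static_cast<int>(key % static_cast<uint64_t>(numNodes));
}
```

Because the key is effectively random, entries spread evenly across nodes, at the cost of a network round trip for remote probes.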
-
- Posts: 697
- Joined: Mon Apr 19, 2010 7:07 pm
- Location: Sweden
- Full name: Peter Osterlund
Re: Lazy SMP and "lazy cluster" experiments
Daniel Shawul wrote:What I don't like about lazy is that ABDADA has a work sharing mechanism inside the TT, while lazy fully relies on the threads searching different parts of the tree by ad hoc techniques like using different root depths, different root move ordering, or intentionally starting one thread slower than the other (just came up with this one now). Why not do all of this on top of ABDADA? It should be able to perform better. My lazy smp just shares the TT without using such tricks and served as a baseline for the performance of YBW and ABDADA.
petero2 wrote:I plan to do a hybrid ABDADA/lazy SMP implementation soon and see how it performs in terms of elo. Hopefully within a week.
I have finished my first "approximate ABDADA" test now. The algorithm is basically as I outlined before. A few additional comments:
* If a hash probe does not find an existing entry for the position, I do not create a new entry just to be able to set the busy flag. This means that on the first iterative deepening iteration where a position is searched, the busy flag will not be used, but in subsequent iterations a hash entry will exist (unless it has been overwritten by something else) so the busy flag will then be used.
* I disabled the lazy SMP "depth + 1" trick, as I don't know if that makes sense in combination with ABDADA. Disabling it in a pure lazy SMP setting seemed to cost around 5 elo at most, so I hope this is not a big deal.
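The first point can be sketched roughly like this (hypothetical names and entry layout, not Texel's actual hash code):

```cpp
#include <cstdint>

const int BUSY = 55555; // sentinel outside the normal score range

struct HashEntry {
    uint64_t key = 0;
    bool busy = false; // another thread is currently searching this node
};

// If no entry exists for the position, none is created just to hold the
// busy flag, so on the first ID iteration the flag has no effect. Once
// an entry exists, an "exclusive" probe defers the node when another
// thread is already working on it.
int probe(HashEntry* e, uint64_t key, bool exclusive) {
    if (e == nullptr || e->key != key)
        return 0;    // no entry: search normally, set no flag
    if (exclusive && e->busy)
        return BUSY; // defer this move; retry it in a later pass
    e->busy = true;  // mark the node as being searched
    return 1;        // entry found; caller may also use its bounds
}
```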
The result after 2000 games at time control 24+0.24, 4 cores was:
Code: Select all
dElo sDev win loss draw depth1 depth2
ABDADA vs lazy +8.2 4.5 361 314 1325 18.77 18.40
The average search depth increased, which is expected since lazy SMP tends to "go wide" rather than deep.
As this was a quick and dirty implementation, I suspect it has some overhead that negatively affects single threaded search. I think it can be optimized, which hopefully improves both single threaded and multi threaded performance.
-
- Posts: 570
- Joined: Mon Jul 20, 2015 5:06 pm
Re: Lazy SMP and "lazy cluster" experiments
Here are some suggested changes to the pseudo code that I found useful to get it functional:
(1) Define a BUSY_SCORE = arbitrary 55555 or any value providing
MATE_SCORE < BUSY_SCORE < INFINITY
(2) It needs a better name, since it is not really an ABDADA search and is not close to the ABDADA description given in the wiki. Instead of lazy SMP or approximateAbdada(), how about busy SMP?
(3) counter1 and counter2 are global debug variables to test the efficiency of the BUSY bit:
Code: Select all
// debug counters for efficiency test
if (e.busy)
    (exclusion ? counter1++ : counter2++);
(4) An "exclusion" flag sounds better than "exclusive", but it may be a matter of taste.
(5) e.busy = true is placed after null move and forward pruning, since these could produce fast cutoffs.
(6) Clear the exclusion flag for the PVS and reduction re-search so the first search is not lost.
(7) The busy exclusive test is not included in the quiescence hash probe, but I may re-look at that one.
I would post an edited code version, but for some reason the editor here will not accept any changes to the original "approximateAbdada()" code.
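Point (1) can be written as compile-time checks. The MATE_SCORE and INFINITY values below are placeholders, not any engine's actual constants; only the ordering matters:

```cpp
// Placeholder bounds; any engine's actual values work as long as the
// ordering holds, so BUSY_SCORE can never collide with a real score.
const int MATE_SCORE = 32000;
const int INFINITY_SCORE = 64000;
const int BUSY_SCORE = 55555;

static_assert(MATE_SCORE < BUSY_SCORE,
              "BUSY_SCORE must exceed all mate scores");
static_assert(BUSY_SCORE < INFINITY_SCORE,
              "BUSY_SCORE must stay below infinity");
```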
-
- Posts: 4185
- Joined: Tue Mar 14, 2006 11:34 am
- Location: Ethiopia
Re: Lazy SMP and "lazy cluster" experiments
petero2 wrote:I have finished my first "approximate ABDADA" test now. The algorithm is basically as I outlined before.
ABDADA makes much more sense than that lazy hack, which is just a parallel iterative deepening. If it were beneficial to do d+1 before finishing d, then it should be beneficial to do so in the sequential version too ...
petero2 wrote:* If a hash probe does not find an existing entry for the position, I do not create a new entry just to be able to set the busy flag. This means that on the first iterative deepening iteration where a position is searched, the busy flag will not be used, but in subsequent iterations a hash entry will exist (unless it has been overwritten by something else) so the busy flag will then be used.
* I disabled the lazy SMP "depth + 1" trick, as I don't know if that makes sense in combination with ABDADA. Disabling it in a pure lazy SMP setting seemed to cost around 5 elo at most, so I hope this is not a big deal.
Was the loss only 5 elo? That is quite surprising, given that the trick was branded as the magic of lazy smp. How does the d+1 ID trick interact with internal ID?
My results with fixed depth search were very bad for the simple SHT, as detailed in this post. The search overhead was way too big, and the speedup way too small, for a test I did on the bratko-kopec positions:
Code: Select all
        Overhead   Speedup   Nps-scaling
YBW:     1.123      2.070      1.993
ABD:     1.265      1.748      1.915
SHT:     1.997      1.161      1.893
petero2 wrote:The result after 2000 games at time control 24+0.24, 4 cores was:
Code: Select all
               dElo sDev win loss draw depth1 depth2
ABDADA vs lazy +8.2  4.5  361  314 1325  18.77  18.40
Considering this was the first try, it looks good, even though the number of games is too small to be sure. Yep, that is because ABDADA is a proper parallel search algorithm, and quite a novel one compared to other contemporary tree splitting techniques with its use of the TT.
petero2 wrote:The average search depth increased, which is expected since lazy SMP tends to "go wide" rather than deep.
If lazy smp widens the tree and gives a benefit, one should be able to get the same result with just a sequential search. If it has a widening effect, though I don't understand how -- it sure has way too much overhead -- it may be an indication that one's selective search is too aggressive and needs to be toned down. Lazy smp lacks so many things as a parallel algorithm, but people come up with excuses to justify using it.
petero2 wrote:As this was a quick and dirty implementation I suspect it has some overhead that negatively affects single threaded search. I think it can be optimized, which hopefully improves both single threaded and multi threaded performance.
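The post does not define the three columns. One common set of definitions is assumed below; note the table above does not exactly satisfy the implied identity nps-scaling = overhead × speedup, which suggests the actual numbers were averaged per position, so treat this purely as a sketch:

```cpp
struct Metrics { double overhead; double speedup; double npsScaling; };

// Assumed definitions: overhead = ratio of nodes searched by the
// parallel run to the serial run, speedup = wall-time ratio, and
// nps-scaling = nodes-per-second ratio.
Metrics compare(double serialNodes, double serialTime,
                double parNodes, double parTime) {
    Metrics m;
    m.overhead   = parNodes / serialNodes;
    m.speedup    = serialTime / parTime;
    m.npsScaling = (parNodes / parTime) / (serialNodes / serialTime);
    return m;
}
```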
-
- Posts: 12566
- Joined: Wed Mar 08, 2006 8:57 pm
- Location: Redmond, WA USA
Re: Possible improvements
petero2 wrote:For large clusters the cluster nodes are often connected in a regular structure, such as a 3D or 5D hyper torus. If the tree of cluster nodes created by the algorithm was created in a way that takes the cluster connectivity graph into account, the communication could be more efficient. For example a breadth first search starting from node 0 could be used to construct the tree. It is unknown how much this would help though, since large clusters also often have faster interconnects than Ethernet, so maybe it is already fast enough without this optimization.
Daniel Shawul wrote:You can create groups of nodes with MPI_CART_CREATE to take advantage of the topology, but handling just the cluster node / smp core classification should be good enough already.
petero2 wrote:The problem is that the underlying hardware topology is unknown, so I don't know good values to pass to MPI_Cart_create. I suppose a command line parameter to texel where you can specify the hypertorus/hypercube topology would be good enough, since very few people would use this feature anyway.
Why not do a hardware discovery on startup and report it to a controlling node?
It could also be possible to do some sort of ping exchange to calibrate net speed.
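The discovery step alone could be as simple as this (a sketch; sending the result to the controlling node over MPI is omitted):

```cpp
#include <thread>

// Each process detects its own logical core count at startup; the
// result would then be reported to the controlling node.
unsigned discoverCores() {
    unsigned n = std::thread::hardware_concurrency();
    return n > 0 ? n : 1; // hardware_concurrency may return 0 if unknown
}
```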
Taking ideas is not a vice, it is a virtue. We have another word for this. It is called learning.
But sharing ideas is an even greater virtue. We have another word for this. It is called teaching.