Re: Lazy SMP and "lazy cluster" experiments
Posted: Fri Aug 11, 2017 7:31 am
Send me a cluster engine for my 2 Xeons, 12 cores.
I want to test whether it works.
main line wrote:
    Send me a cluster engine for my 2 Xeons, 12 cores.
    I want to test whether it works.

You can download it from here: texel107a27.7z
Code: Select all
Cluster
-------
Texel can run on computer clusters by using the MPI system. It has only been
tested using MPICH on Linux and MS-MPI on Windows, but should work with other
MPI implementations as well.
The pre-compiled Windows executable texel64cl.exe is compiled and linked against
MS-MPI version 8.1. It requires the MS-MPI redistributable package to be
installed and configured on all computers in the cluster.
Running on a cluster is advanced functionality and probably requires some
knowledge of cluster systems to set up.
Texel uses a so-called hybrid MPI design. This means that it uses a single MPI
process per computer. On each computer it uses threads and shared memory, and
optionally NUMA awareness.
As an example, if you have four Linux computers called host1, host2, host3, host4
and MPICH installed on all computers, you can start Texel like this:
mpiexec -f machinefile -n 4 /path/to/texel
"machinefile" is a text file that lists the available computers and the number
of MPI processes to run on each computer (which should be 1), like this:
host1:1
host2:1
host3:1
host4:1
After Texel has been started, use the "Threads" UCI option to control the total
number of search threads to use. Texel automatically decides how many threads to
use for each computer, and can also handle the case where different computers
have different numbers of CPUs and cores.
petero2 wrote:
    The tests where more than one NUMA node was used showed worse performance than expected. Texel tries to allocate the entire transposition table on NUMA node 0, and threads running on other NUMA nodes skip the hash table probe if the remaining search depth is 0. This logic is inherited from the old texel SMP algorithm, where measurements showed that it gave a noticeable reduction in the number of remote memory accesses. It is possible that this logic is bad for elo though, in which case it should be removed.

I have now run a test match, and the idea of skipping hash table probes at depth 0 for NUMA nodes > 0 is really bad for the lazy SMP algorithm. I got a 12 elo improvement by removing that code:
Code: Select all
engines dElo sDev win loss draw depth1 depth2
a28 vs a27 +12.5 2.4 1020 804 4176 19.72 19.71
Daniel Shawul wrote:
    I wonder what the Johnny guy is using to run on a cluster with thousands of nodes; not interested in the specifics, but just whether it is a smart YBW-like algorithm or a dumb SHT-like (no pun intended) algorithm.

I don't know what the primary cluster algorithm is, but according to this post he uses some of the nodes for "speculative pondering".
Code: Select all
Search Iteration = 12
PV [12] 12 e2-e4 e7-e5 g1-f3 g8-f6 b1-c3 b8-c6 f1-b5 d7-d6
Thread nodes 1 2 3 4
______________________________________________________________
1 e2-e4 311343 403850 385643 381976
2 g1-f3 241327 153263 57239 65989
3 e2-e3 13582 1 21780 28041
4 f2-f3 2341 1 7497 7128
5 b1-a3 267 1 1114 1114
6 c2-c4 1316 1 6179 6292
7 g1-h3 1 1 22 1
8 b1-c3 2909 1 24050 25558
9 d2-d4 2256 1 40759 28695
10 d2-d3 1 1 1 17931
11 f2-f4 11443 8146 1 2294
12 c2-c3 1 1 1 1
13 b2-b3 1 1 1 1
14 g2-g4 1 1 1 1
15 g2-g3 1 1 1 1
16 h2-h3 6140 6067 1 1
17 a2-a3 4462 5045 1 1
18 h2-h4 1 9335 1 1
19 a2-a4 1 5955 1 1
20 b2-b4 1 57 1 1
_____________
Total 597395
ctnodes 1 1248234
ctnodes 2 1069643
ctnodes 3 1075838
ctnodes 4 1117619
Code: Select all
if (rootAlpha > alpha)
    alpha = rootAlpha;
petero2 wrote:
    petero2 wrote:
        The tests where more than one NUMA node was used showed worse performance than expected. Texel tries to allocate the entire transposition table on NUMA node 0, and threads running on other NUMA nodes skip the hash table probe if the remaining search depth is 0. This logic is inherited from the old texel SMP algorithm, where measurements showed that it gave a noticeable reduction in the number of remote memory accesses. It is possible that this logic is bad for elo though, in which case it should be removed.

    I have now run a test match, and the idea of skipping hash table probes at depth 0 for NUMA nodes > 0 is really bad for the lazy SMP algorithm. I got a 12 elo improvement by removing that code. It could be that this was bad for the old algorithm too. At the time I added that code I did not play enough games to reliably measure the elo difference.

    Code: Select all
    engines     dElo  sDev   win  loss  draw  depth1  depth2
    a28 vs a27  +12.5  2.4  1020   804  4176   19.72   19.71

The latency for remote memory access may not be big enough for you to avoid 0-depth lookups. In any case, distributing the TT is the correct and scalable approach when you ...
Daniel Shawul wrote:
    What I don't like about lazy is that ABDADA has a work sharing mechanism inside the TT, while lazy fully relies on the threads searching different parts of the tree by ad hoc techniques like using different root depths, different root move ordering, or intentionally starting one thread slower than the other (just came up with this one now). Why not do all of this on top of ABDADA? It should be able to perform better. My lazy SMP just shares the TT without using such tricks and served as a baseline for the performance of YBW and ABDADA.

petero2 wrote:
    I plan to do a hybrid ABDADA/lazy SMP implementation soon and see how it performs in terms of elo. Hopefully within a week.

I have finished my first "approximate ABDADA" test now. The algorithm is basically as I outlined before. A few additional comments:
Code: Select all
engines          dElo  sDev  win  loss  draw  depth1  depth2
ABDADA vs lazy   +8.2   4.5  361   314  1325   18.77   18.40
Code: Select all
// debug counters for efficiency test
if (e.busy1)
    (exclusion ? counter1++ : counter2++);
petero2 wrote:
    I have finished my first "approximate ABDADA" test now. The algorithm is basically as I outlined before. A few additional comments:

    * If a hash probe does not find an existing entry for the position, I do not create a new entry just to be able to set the busy flag. This means that on the first iterative deepening iteration where a position is searched, the busy flag will not be used, but in subsequent iterations a hash entry will exist (unless it has been overwritten by something else), so the busy flag will then be used.

    * I disabled the lazy SMP "depth + 1" trick, as I don't know if that makes sense in combination with ABDADA. Disabling it in a pure lazy SMP setting seemed to cost around 5 elo at most, so I hope this is not a big deal.

ABDADA makes much more sense than that lazy hack, which is just parallel iterative deepening. If it were beneficial to do d+1 before finishing d, then it should be beneficial to do so in the sequential version too ...

Was the loss only 5 elo? That is quite surprising given that the trick was branded as the magic of lazy SMP. How does the d+1 ID trick interact with internal ID?
Code: Select all
Overhead Speedup Nps-scaling
YBW: 1.123 2.070 1.993
ABD: 1.265 1.748 1.915
SHT: 1.997 1.161 1.893
petero2 wrote:
    The result after 2000 games at time control 24+0.24, 4 cores was:

    Code: Select all
    engines          dElo  sDev  win  loss  draw  depth1  depth2
    ABDADA vs lazy   +8.2   4.5  361   314  1325   18.77   18.40

    Considering this was the first try, it looks good even though the number of games is too small to be sure.

Yep, that is because ABDADA is a proper parallel search algorithm, and quite a novel one compared to other contemporary tree splitting techniques with its use of the TT.
petero2 wrote:
    The average search depth increased, which is expected since lazy SMP tends to "go wide" rather than deep. As this was a quick and dirty implementation, I suspect it has some overhead that negatively affects single-threaded search. I think it can be optimized, which hopefully improves both single-threaded and multi-threaded performance.

If lazy SMP widens the tree and gives a benefit, one should be able to get the same result with just a sequential search. If it has a widening effect -- though I don't understand how, since it surely has way too much overhead -- it may be an indication that one's selective search is too aggressive and needs to be toned down. Lazy SMP lacks so many things as a parallel algorithm, but people come up with excuses to justify using it.
petero2 wrote:
    For large clusters the cluster nodes are often connected in a regular structure, such as a 3D or 5D hypertorus. If the tree of cluster nodes created by the algorithm was created in a way that takes the cluster connectivity graph into account, the communication could be more efficient. For example, a breadth-first search starting from node 0 could be used to construct the tree. It is unknown how much this would help though, since large clusters also often have faster interconnects than Ethernet, so maybe it is already fast enough without this optimization.

Daniel Shawul wrote:
    You can create groups of nodes with MPI_Cart_create to take advantage of the topology, but handling just the cluster-node/SMP-core classification should be good enough already.

petero2 wrote:
    The problem is that the underlying hardware topology is unknown, so I don't know good values to pass to MPI_Cart_create. I suppose a command line parameter to texel where you can specify the hypertorus/hypercube topology would be good enough, since very few people would use this feature anyway.

Why not do a hardware discovery on startup and report it to a controlling node?