Lazy SMP and "lazy cluster" experiments

mar · Post by **mar** » Thu Aug 17, 2017 9:05 am

Daniel Shawul wrote:Lazy smp lacks so many things as a parallel algorithm, but people come up with excuses to justify using it.

100 elo 4 cores, no matter how hard you try to play it down

so, yeah, SF switched from YBW for no reason apparently
(you sound like Bob)

Daniel Shawul · Post by **Daniel Shawul** » Thu Aug 17, 2017 3:19 pm

mar wrote:
Daniel Shawul wrote:Lazy smp lacks so many things as a parallel algorithm, but people come up with excuses to justify using it.
100 elo 4 cores, no matter how hard you try to play it down
so, yeah, SF switched from YBW for no reason apparently
(you sound like Bob)

Because stockfish uses it is not a justification for the efficiency of the algorithm. I bet if they used ABDADA and when (not if) it beats that lazy hack, they would go crazy like before.

ABDADA does the same thing as lazy, it starts n parallel threads at the root but has a better work division scheme than the shoot-and-hope scheme. You can do the "depth trick" assume the imaginary "widening effect" on ABDADA too; and it is almost equally easy for the real lazy people. The fact is that I shot and shot with that lazy hack but no luck for me so far, and infact for many others. It is still inferior to my YBW and ABDADA, only thing i didn't try was the depth trick due to awkwardness of implementation on SMP. Now i am told that it may be worth 5 elos, and Peter got +10 elo over lazy with a quick ABDADA implementation, go figure...

mar · Post by **mar** » Thu Aug 17, 2017 4:51 pm

Daniel Shawul wrote:Because stockfish uses it is not a justification for the efficiency of the algorithm. I bet if they used ABDADA and when (not if) it beats that lazy hack, they would go crazy like before. ABDADA does the same thing as lazy, it starts n parallel threads at the root but has a better work division scheme than the shoot-and-hope scheme. You can do the "depth trick" assume the imaginary "widening effect" on ABDADA too; and it is almost equally easy for the real lazy people. The fact is that I shot and shot with that lazy hack but no luck for me so far, and infact for many others. It is still inferior to my YBW and ABDADA, only thing i didn't try was the depth trick due to awkwardness of implementation on SMP. Now i am told that it may be worth 5 elos, and Peter got +10 elo over lazy with a quick ABDADA implementation, go figure...

My engine gets +100 (4 cores vs 1, according to CCRL), I had "lazy" before SF did. Scorpio gets +100 with YBW; Zappa +115, Crafty +95
Error margins are relatively big and no idea how well 1CPU vs 4CPU are connected in their rating list.

To sum up: +100 isn't bad for something that doesn't work, don't you think?

Daniel Shawul · Post by **Daniel Shawul** » Thu Aug 17, 2017 6:41 pm

mar wrote:
Daniel Shawul wrote:Because stockfish uses it is not a justification for the efficiency of the algorithm. I bet if they used ABDADA and when (not if) it beats that lazy hack, they would go crazy like before. ABDADA does the same thing as lazy, it starts n parallel threads at the root but has a better work division scheme than the shoot-and-hope scheme. You can do the "depth trick" assume the imaginary "widening effect" on ABDADA too; and it is almost equally easy for the real lazy people. The fact is that I shot and shot with that lazy hack but no luck for me so far, and infact for many others. It is still inferior to my YBW and ABDADA, only thing i didn't try was the depth trick due to awkwardness of implementation on SMP. Now i am told that it may be worth 5 elos, and Peter got +10 elo over lazy with a quick ABDADA implementation, go figure...
My engine gets +100 (4 cores vs 1, according to CCRL), I had "lazy" before SF did. Scorpio gets +100 with YBW; Zappa +115, Crafty +95
Error margins are relatively big and no idea how well 1CPU vs 4CPU are connected in their rating list.

To sum up: +100 isn't bad for something that doesn't work, don't you think?

Well I did a quick match between lazy+smp vs YBW in my engine and got this:

Code: Select all

Score of scorpio1 vs scorpio2&#58; 19 - 46 - 35  &#91;0.365&#93; 100
ELO difference&#58; -96.19 +/- 56.31
Finished match

scorpio1=lazy+smp with 2 threads without the depth trick, scorpio2 is YBW 2 threads. That is a 100 elo difference with the error bars...

petero2 · Post by **petero2** » Fri Aug 18, 2017 9:33 pm

petero2 wrote:I have finished my first "approximate ABDADA" test now. The algorithm is basically as I outlined before.
...
The result after 2000 games at time control 24+0.24, 4 cores was:
Code: Select all
                dElo  sDev  win loss draw  depth1 depth2
ABDADA vs lazy  +8.2   4.5  361  314 1325   18.77  18.40
Considering this was the first try, it looks good even though the number of games is too small to be sure.

I have run some more tests now. I am still using the quick and dirty ABDADA implementation. For one thing it is adding 288 extra bytes to each stack frame in the recursive search function. On one core it is a clear loss compared to the previous lazy version, time control 6+0.06:

Code: Select all

                     dElo  sDev   win  loss  draw  depth1 depth2
ABDADA vs lazy 1c    -8.1   1.4  6807  7514 15879   13.75  13.84

On 16 and 24 cores it is a clear win however:

Code: Select all

                     dElo  sDev   win  loss  draw  depth1 depth2
ABDADA vs lazy 16c  +18.4   4.0   353   247  1400   20.29  19.72
ABDADA vs lazy 24c  +16.0   3.9   335   243  1422   20.82  20.21

I also have an old result from when I implemented lazy SMP. On 4 cores, time control 60+0.6, lazy was better than the old DTS-like algorithm:

Code: Select all

                     dElo  sDev   win  loss  draw  depth1 depth2
lazy vs DTS-like 4c +13.4   3.0   781   623  2682   18.81  18.69

I have also run a test where I disabled three "tricks" I do in my lazy SMP implementation in addition to a plain shared hashtable (SHT) algorithm:
1. If a helper thread finishes before the master thread, use the helper thread result.
2. Run half of the helper threads at depth + 1.
3. If a helper thread finishes its search, immediately restart the search at the next depth.
I played the match at time control 24+0.24, 4 cores. The result was:

Code: Select all

                     dElo  sDev   win  loss  draw  depth1 depth2
SHT vs lazy 4c      -17.2   4.3   279   378  1343   18.20  18.32

By summarizing results obtained so far, I get:

Code: Select all

Elo increase 4*time vs 1*time  &#58; 112+100 = 212
Lazy 4c vs 1c                  &#58; 72+76   = 148 &#40;69.8% of 212&#41;
DTS-like 4c vs 1c              &#58; 148-13  = 135 &#40;63.7% of 212&#41;
SHT 4c vs 1c                   &#58; 148-17  = 131 &#40;61.8% of 212&#41;
ABDADA 4c vs 1c                &#58; 148+8   = 156 &#40;73.6% of 212&#41;

Same summary for 16 cores:

Code: Select all

Elo increase 16*time vs 1*time &#58; 112+100+85+72 = 369
Lazy 16c vs 1c                 &#58; 72+76+73+66   = 287 &#40;77.8% of 369&#41;
ABDADA 16c vs 1c               &#58; 287+18        = 305 &#40;82.7% of 369&#41;

Please note that while most of the matches were played at time control 24+0.24, a few matches used a different time control. Also note that adding rating differences will cause the standard deviation of the sum to be larger than the individual standard deviations. Nevertheless I think the above is a reasonably good estimate of how the different SMP algorithms perform in texel.

Since "approximate ABDADA" is the best algorithm I have found so far, my next goal is to try to rewrite it in a way that does not cause a slowdown and corresponding elo loss for the 1 core case.

petero2 · Post by **petero2** » Fri Aug 18, 2017 9:50 pm

petero2 wrote:I am still using the quick and dirty ABDADA implementation. For one thing it is adding 288 extra bytes to each stack frame in the recursive search function. On one core it is a clear loss compared to the previous lazy version, time control 6+0.06:
Code: Select all
                     dElo  sDev   win  loss  draw  depth1 depth2
ABDADA vs lazy 1c    -8.1   1.4  6807  7514 15879   13.75  13.84

I noticed that ABDADA in texel sometimes has an effect on the search tree even on 1 core. This happens because texel does not look for a 3-fold repetition across null moves, because I don't want the losing side to be able to refute two consecutive null moves with a 3-fold repetition.

When searching from the starting position this happens for the first time after the following moves:

e2e4 e7e5 d2d4 e5d4 b1c3 null c3b1 null b1c3

The last Nc3 move repeats the position from 4 plies earlier, which means the busy flag is already set in the hash entry, so the move is delayed until the second pass in the ABDADA loop.

I think this happens far too seldom to have a measurable effect on elo. I believe the elo loss is caused only by the speed loss.

petero2 · Post by **petero2** » Fri Aug 18, 2017 10:00 pm

Dann Corbit wrote:
petero2 wrote:
Daniel Shawul wrote:
petero2 wrote:For large clusters the cluster nodes are often connected in a regular structure, such as a 3D or 5D hyper torus. If the tree of cluster nodes created by the algorithm was created in a way that takes the cluster connectivity graph into account, the communication could be more efficient. For example a breadth first search starting from node 0 could be used to construct the tree. It is unknown how much this would help though, since large clusters also often have faster interconnects than Ethernet, so maybe it is already fast enough without this optimization.
You can create groups of nodes with MPI_CART_CREATE to take advantage of the topology but handing just the cluster node, smp core classification should be good enough already.
The problem is that the underlying hardware topology is unknown, so I don't know good values to pass to MPI_Cart_create. I suppose a command line parameter to texel where you can specify the hypertorus/hypercube topology would be good enough, since very few people would use this feature anyway.

Why not do a hardware discovery on startup and report it to a controlling node?

It could also be possible to do some sort of ping exchange to calibrate net speed.

In fact I already do hardware discovery on each node to find out how many cores and hyperthreads are available. See Cluster::computeConcurrency() in the source code.

Discovering the network topology using pings can be done. In fact it already has been done:

https://www.researchgate.net/publicatio ... t_clusters

It does not seem trivial to obtain fast and accurate results though, so this is not something I want to implement in texel.

Daniel Shawul · Post by **Daniel Shawul** » Sat Aug 19, 2017 4:48 am

Thanks Peter for these tests!

ABDADA is the clear winner here even though it started out loosing to lazy on 1 cpu.
Btw, where is the 288 extra bytes coming from. My TT table entry size is same as before and i think i needed only 1 bit to make ABDADA work.

I am surprized that your DTS-like implementation looses to lazy on 4 core. My YBW (also DTS like because it is peer-to-peer; however, idle threads still get picked up instead of themselves looking for work) beats lazy by a heavy margin and i am puzzled by what my implementation is missing for other people to see big wins with lazy. I don't do the depth+1 trick and others, also i synchronize the threads every iterations, but still i expected to see a significant gain with pure SHT.

I am going to try all the lazy tricks and also increase selectivity until i get the elos other people are getting with lazy.

Dann Corbit · Post by **Dann Corbit** » Sat Aug 19, 2017 6:20 am

Lazy has been well known since 1972:
https://www.youtube.com/watch?v=uflsjU95lRM
Explanations are around the 5:40 mark.

petero2 · Post by **petero2** » Sat Aug 19, 2017 9:56 am

Daniel Shawul wrote:ABDADA is the clear winner here even though it started out loosing to lazy on 1 cpu.
Btw, where is the 288 extra bytes coming from. My TT table entry size is same as before and i think i needed only 1 bit to make ABDADA work.

My TT entry did not increase either. I just changed the size of the depth field from 10 to 9 bits.

In the pseudo code I posted before, I used two arrays:

* One 256 bit "searched[]" array to keep track of which moves were searched in the first ABDADA pass.

* One 256 byte "lrmVec[]" array to keep track of lmr reductions computed in the first ABDADA pass, to ensure that moves delayed to the second pass use the same lmr as they would have used if they had been searched in the first pass. If I just recompute lmr in the second pass, I may get a different result because the calculation depends on the number of legal moves already tested, not just on the move index in the sorted move list.

However, none of those arrays are really needed, because my Move class already has a 4 byte score field, so I can just encode that information as special score range within the move list. I am currently running tests with that modification (and some other tweaks). It looks promising but the tests are not finished yet.

Daniel Shawul wrote:I am surprized that your DTS-like implementation looses to lazy on 4 core. My YBW (also DTS like because it is peer-to-peer; however, idle threads still get picked up instead of themselves looking for work) beats lazy by a heavy margin and i am puzzled by what my implementation is missing for other people to see big wins with lazy. I don't do the depth+1 trick and others, also i synchronize the threads every iterations, but still i expected to see a significant gain with pure SHT.

I am going to try all the lazy tricks and also increase selectivity until i get the elos other people are getting with lazy.

I usually prefer to base what I write on concrete tests, but currently the list of tests I want to run grows faster than I can run the tests, so I will make an exception.

{begin speculation}

1. It might be the case that lazy SMP (and SHT) works better if you have a per-thread history table. The late moves in ALL nodes are mostly ordered by the history heuristic, and if the history table is thread local, there is a decent chance that it becomes different in different threads. This would have the effect that different threads often don't search the same moves at the same time, and would therefore be able to pick up results from other threads from the transposition table.

2. If you do heavy LMR/LMP type reductions, and the reduction a move receives depends on the move order, a side effect of a thread local history table is that different threads will use different reduction amounts for the same move. Since the moves searched early by each thread typically get a lower reduction, and those results are later picked up through the hash table by other threads, the effect is as if the average LMR/LMP reduction is lower than in a single threaded search. It is already known that LMR/LMP is a big loss for fixed depth games, but a big gain in fixed time games, so I expect this effect would cause some elo increase, although not as big an elo increase as when using an SMP algorithm that avoids duplicated work without decreasing LMR/LMP.

3. If you do none of 1 and 2, SHT/lazy SMP may give very little gain, which could explain why SHT has historically been very inefficient in many programs, even though lazy SMP is pretty efficient in many modern programs.

{end speculation}

I intend to eventually run tests to see if any of those conjectures are actually correct.

Lazy SMP and "lazy cluster" experiments

Re: Lazy SMP and "lazy cluster" experiments

Re: Lazy SMP and "lazy cluster" experiments

Re: Lazy SMP and "lazy cluster" experiments

Re: Lazy SMP and "lazy cluster" experiments

Re: Lazy SMP and "lazy cluster" experiments

Re: Lazy SMP and "lazy cluster" experiments

Re: Possible improvements

Re: Lazy SMP and "lazy cluster" experiments

Re: Lazy SMP and "lazy cluster" experiments

Re: Lazy SMP and "lazy cluster" experiments