Recursive DTS-like search algorithm

jdart
Posts: 4366
Joined: Fri Mar 10, 2006 5:23 am
Location: http://www.arasanchess.org

Re: Recursive DTS-like search algorithm

Post by jdart »

I am seeing some falloff starting at > 8 cores and getting worse at higher core counts, where threads are blocked for significant amounts of time waiting for work. In my implementation, each node that can be split checks for idle threads (using a fast check without locking) and then, if that succeeds, does a more detailed check for the YBWC. What I see is that fairly often the first check succeeds and the second one fails. Could be a bug, of course, but it could also be something mis-tuned.
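
A minimal sketch of that two-phase pattern, under assumed names (idleThreads, trySplit) rather than Arasan's actual code: a racy read of a shared idle counter first, then a locked re-check before committing to the split.

Code: Select all

#include <atomic>
#include <mutex>

std::atomic<int> idleThreads{0};   // updated by threads entering/leaving the idle loop
std::mutex splitLock;

// Fast, lock-free pre-check: cheap enough to do at every splittable node.
bool maybeIdle() {
    return idleThreads.load(std::memory_order_relaxed) > 0;
}

// Slower, locked check: verify the YBWC condition and re-check for idle
// threads before committing to a split.
bool trySplit(bool firstMoveSearched, int depth, int minSplitDepth) {
    if (!firstMoveSearched || depth < minSplitDepth)
        return false;
    std::lock_guard<std::mutex> lock(splitLock);
    // The idle count may have changed since the fast check; this is the
    // window in which the second check can fail even though the first passed.
    return idleThreads.load(std::memory_order_relaxed) > 0;
}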

--Jon
syzygy
Posts: 5557
Joined: Tue Feb 28, 2012 11:56 pm

Re: Recursive DTS-like search algorithm

Post by syzygy »

bob wrote:
syzygy wrote:
bob wrote:In DTS you use the information for ALL active searches, all nodes in each of those searches, to choose the node most likely to be an ALL node, in the most useful place in the tree.

Recursion makes that a pain. In theory, anything one can do iteratively one can do recursively. In practice, the book-keeping details are simply not worth the effort, and are likely non-portable as well since you have to deal with the stack in an ugly way...
If we define DTS as "idle threads look for a good split point themselves anywhere in the trees searched by the active threads", then DTS and recursive search do not bite each other. No need to do ugly things with the stack. The only limitation on DTS imposed by a recursive search is that a master thread after running out of work is limited to the subtrees being searched by its slave threads.

See here for more information on my implementation.

The answer to what is likely your main objection: no, I do not copy position structures when doing a split. I let the idle thread copy the moves from root to split node and let it make those moves on its private board. Those moves have to be accessible globally, but that is not difficult to achieve.
Not quite. I am searching at ply=20. I want to split back at ply=10. That is not easy to do. I already have the search state for this path in a split block, but it needs to be moved to another split block. That means making all the previous plies use a different split block while they still hold pointers to the old one in their local stack space. That's not easy to solve.
Once a node becomes a candidate split point, I copy the values of alpha, beta and depth into the position struct that is globally accessible. I don't have any separate split block allocation issues.

When an idle thread joins the split point, it copies those values and it keeps a pointer to the original position struct.
In your case, what you put in your split block is irrelevant. You have to have some sort of shared data structure that contains the moves that are to be searched in parallel.
Each thread has its own move stack. A node becomes a candidate split point only after all moves have already been generated (i.e. after captures and killers have been searched, except for PV nodes, where I make sure to generate everything after the first move has been searched).
Yet the ply back up in the recursive call stack doesn't know about it. Or are you not really using recursive programming for anything other than calling the procedure, and have EVERYTHING else in a global array indexed by ply??
I don't store the position struct on the stack like e.g. Stockfish does (but Crafty doesn't either). There is absolutely no reason to call my search anything other than a recursive search, unless you consider Crafty's to be iterative.
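
For concreteness, here is a hypothetical sketch of the split mechanism described above; all names are invented, and the engine primitives (makeMove, search) are assumed rather than shown. The candidate split point publishes alpha/beta/depth in its globally visible Position, and a joining thread replays the moves from the root on its own private board.

Code: Select all

typedef int Move;
const int MAX_PLY = 128;

struct Position {
    // ... private board state ...
    // Published when the node becomes a candidate split point:
    volatile int splitAlpha, splitBeta, splitDepth;
    Move movesFromRoot[MAX_PLY];   // globally readable path to this node
    int ply;
};

void makeMove(Position& pos, Move m);                       // engine primitives,
int search(Position& pos, int alpha, int beta, int depth);  // assumed to exist

void joinSplitPoint(Position& myBoard, const Position& owner) {
    // Replay the owner's path so the private board reaches the split node.
    for (int i = 0; i < owner.ply; i++)
        makeMove(myBoard, owner.movesFromRoot[i]);
    // Copy the published bounds; a pointer to the owner's struct would be
    // kept so later bound updates and fail highs remain visible.
    search(myBoard, owner.splitAlpha, owner.splitBeta, owner.splitDepth);
}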
bob
Posts: 20943
Joined: Mon Feb 27, 2006 7:30 pm
Location: Birmingham, AL

Re: Recursive DTS-like search algorithm

Post by bob »

syzygy wrote:
bob wrote:
syzygy wrote:
bob wrote:In DTS you use the information for ALL active searches, all nodes in each of those searches, to choose the node most likely to be an ALL node, in the most useful place in the tree.

Recursion makes that a pain. In theory, anything one can do iteratively one can do recursively. In practice, the book-keeping details are simply not worth the effort, and are likely non-portable as well since you have to deal with the stack in an ugly way...
If we define DTS as "idle threads look for a good split point themselves anywhere in the trees searched by the active threads", then DTS and recursive search do not bite each other. No need to do ugly things with the stack. The only limitation on DTS imposed by a recursive search is that a master thread after running out of work is limited to the subtrees being searched by its slave threads.

See here for more information on my implementation.

The answer to what is likely your main objection: no, I do not copy position structures when doing a split. I let the idle thread copy the moves from root to split node and let it make those moves on its private board. Those moves have to be accessible globally, but that is not difficult to achieve.
Not quite. I am searching at ply=20. I want to split back at ply=10. That is not easy to do. I already have the search state for this path in a split block, but it needs to be moved to another split block. That means making all the previous plies use a different split block while they still hold pointers to the old one in their local stack space. That's not easy to solve.
Once a node becomes a candidate split point, I copy the values of alpha, beta and depth into the position struct that is globally accessible. I don't have any separate split block allocation issues.

When an idle thread joins the split point, it copies those values and it keeps a pointer to the original position struct.
In your case, what you put in your split block is irrelevant. You have to have some sort of shared data structure that contains the moves that are to be searched in parallel.
Each thread has its own move stack. A node becomes a candidate split point only after all moves have already been generated (i.e. after captures and killers have been searched, except for PV nodes, where I make sure to generate everything after the first move has been searched).
Yet the ply back up in the recursive call stack doesn't know about it. Or are you not really using recursive programming for anything other than calling the procedure, and have EVERYTHING else in a global array indexed by ply??
I don't store the position struct on the stack like e.g. Stockfish does (but Crafty doesn't either). There is absolutely no reason to call my search anything other than a recursive search, unless you consider Crafty's to be iterative.
That seems like a terribly tight restriction (after all moves have been generated). And it would seem that this would add up to a lot of idle time, whereas the normal YBW criterion only requires one move to be searched before splitting. In my code, that one move can even be the hash best move...
syzygy
Posts: 5557
Joined: Tue Feb 28, 2012 11:56 pm

Re: Recursive DTS-like search algorithm

Post by syzygy »

bob wrote:That seems like a terribly tight restriction (after all moves have been generated). And it would seem that this would add up to a lot of idle time, whereas the normal YBW criterion only requires one move to be searched before splitting. In my code, that one move can even be the hash best move...
I have practically no idle time, since idle threads look for candidate split points themselves. In your approach, idle threads can only join "new" candidate split points at the moment they are found, which means you have to be more liberal in designating a node as a candidate split point.

Splitting only after captures and killers decreases the chance of splitting at a node that fails high early in the move list. Most likely the node does not fail high at all, in which case it was perfect for splitting; otherwise the node fails high at what is basically a random move, which we could not have predicted anyway (I think Crafty does not even attempt to sort those moves).

For PV nodes I do split immediately after the first move, because in these nodes you have a good expectation of having searched the best move first.

So far I haven't done many measurements though, and I will admit that intuition can be deceiving.

In my present approach I do see some inefficiency in the window between the PV node at ply = n being almost finished and the search starting on the PV node at ply = n - 1. All threads join the last move being searched and probably end up doing lots of splits at shallow depths, until no split points are available anymore and they sit idle until the PV node at ply = n is completely finished and the main thread starts on the PV node at ply = n - 1. Of course this is not specific to my implementation but is an inherent property of YBW.

Your DTS paper gives the interesting suggestion to start on the PV node at ply = n - 1 before the PV node at ply = n is completely finished. This is probably tricky to get right, but it seems to ensure that candidate split points at relatively large depths are always available. So I am tempted to try this out at some point.
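
As a hedged sketch of what "idle threads look for candidate split points themselves" might look like (the data structures and the depth-only selection criterion here are my assumptions, not the actual implementation):

Code: Select all

struct SplitPoint {
    int depth;                  // remaining search depth at this node
    bool hasUnstartedMoves;     // maintained by the owning thread
    SplitPoint* next;           // next candidate published by the same thread
};

struct Thread {
    SplitPoint* firstCandidate; // list of published candidate split points
};

// Scan the candidates published by all active threads and pick one;
// here the criterion is simply the largest remaining depth.
SplitPoint* findBestSplitPoint(Thread* threads, int numThreads) {
    SplitPoint* best = nullptr;
    for (int t = 0; t < numThreads; t++)
        for (SplitPoint* sp = threads[t].firstCandidate; sp; sp = sp->next) {
            if (!sp->hasUnstartedMoves)
                continue;
            if (!best || sp->depth > best->depth)
                best = sp;      // a real engine would weigh more than depth
        }
    return best;                // nullptr: nothing available, stay idle
}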
bob
Posts: 20943
Joined: Mon Feb 27, 2006 7:30 pm
Location: Birmingham, AL

Re: Recursive DTS-like search algorithm

Post by bob »

syzygy wrote:
bob wrote:That seems like a terribly tight restriction (after all moves have been generated). And it would seem that this would add up to a lot of idle time, whereas the normal YBW criterion only requires one move to be searched before splitting. In my code, that one move can even be the hash best move...
I have practically no idle time, since idle threads look for candidate split points themselves. In your approach, idle threads can only join "new" candidate split points at the moment they are found, which means you have to be more liberal in designating a node as a candidate split point.

Splitting only after captures and killers decreases the chance of splitting at a node that fails high early in the move list. Most likely the node does not fail high at all, in which case it was perfect for splitting; otherwise the node fails high at what is basically a random move, which we could not have predicted anyway (I think Crafty does not even attempt to sort those moves).

For PV nodes I do split immediately after the first move, because in these nodes you have a good expectation of having searched the best move first.

So far I haven't done many measurements though, and I will admit that intuition can be deceiving.

In my present approach I do see some inefficiency in the window between the PV node at ply = n being almost finished and the search starting on the PV node at ply = n - 1. All threads join the last move being searched and probably end up doing lots of splits at shallow depths, until no split points are available anymore and they sit idle until the PV node at ply = n is completely finished and the main thread starts on the PV node at ply = n - 1. Of course this is not specific to my implementation but is an inherent property of YBW.

Your DTS paper gives the interesting suggestion to start on the PV node at ply = n - 1 before the PV node at ply = n is completely finished. This is probably tricky to get right, but it seems to ensure that candidate split points at relatively large depths are always available. So I am tempted to try this out at some point.
It is easy to implement that last feature, but the problem is that you might well split before you have an alpha bound at ply N. Suppose you have 32 CPUs, as we had on the last Cray I used. If ply N has < 32 moves, DTS would split at ply N-1 for the remainder of those CPUs, with no alpha bound in sight. But once alpha was established, it was very easy to inform those at N-1 about it.

I saw two choices: (1) wait until alpha is known, which might mean CPUs sit idle; (2) start somewhere else speculatively and then update bounds as they become known... At worst, the two are equal, but with reasonable luck, (2) can save time.
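
A minimal illustration of choice (2), with invented names (this is not Cray Blitz code): helpers split at ply N-1 with an open window, and the owner publishes the bound once the first move at ply N returns.

Code: Select all

#include <atomic>

const int INFINITE_SCORE = 32000;

struct SplitPoint {
    std::atomic<int> alpha{-INFINITE_SCORE};   // open until established
    // ... moves, depth, etc.
};

// Called by the owner when the first move at ply N finally returns a value:
void publishAlpha(SplitPoint& sp, int value) {
    int cur = sp.alpha.load();
    while (value > cur && !sp.alpha.compare_exchange_weak(cur, value))
        ;   // helpers re-read sp.alpha at each new node and tighten their window
}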

Current YBW implementations don't do that, of course; they will split when searching any node where at least one branch has been completed...

For up to 16 CPUs, I don't see enough idle time to measure (I do place restrictions on splitting, such as no more than N CPUs splitting at any single point, where N is tunable).

I had decided that once 32-core (and beyond) boxes are readily available, I'd probably start over and write an iterated search...
sje
Posts: 4675
Joined: Mon Mar 13, 2006 7:43 pm

Re: Recursive DTS-like search algorithm

Post by sje »

bob wrote:I had decided that once 32-core (and beyond) boxes are readily available, I'd probably start over and write an iterated search...
As I've written in another thread, I believe that an iterated search is a good idea even if there is only a single core available.

It will likely be many years before we see 32-core machines; they would be very costly and would not be needed by a typical home user with no heavy tasks other than rendering and video ripping. If such machines do appear relatively soon, they would likely have four octo-core CPUs.

I think a more likely development, already in progress, is the deployment of near-supercomputer video cards. These would be used by gamers, who drive much of the market and who will soon be demanding 4K (3840x2160) video. Perhaps the upcoming Apple Mac Pro, which has two very high-end video cards as standard, will lead the way for applications that rely on OpenGL and the like more than on the CPU.
petero2
Posts: 684
Joined: Mon Apr 19, 2010 7:07 pm
Location: Sweden
Full name: Peter Osterlund

Re: Recursive DTS-like search algorithm

Post by petero2 »

My current implementation contains two algorithm improvements and a number of implementation bug fixes.

The first algorithm improvement is that the lock contention on the priority work queue is measured in real time; if it goes above 10%, the minimum split depth is incremented. If it goes below 5% and the current minimum split depth is > 10, the minimum split depth is decremented. This means that the algorithm adapts automatically to endgame positions, where the branching factor is much smaller than in middlegame positions.
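
As a sketch, the adaptation rule might look like the following; the thresholds and the depth floor of 10 are taken from the description above, while the function and variable names are mine, not the engine's.

Code: Select all

// Called periodically with the measured fraction of time threads spend
// blocked on the priority work queue lock.
void adaptMinSplitDepth(double queueContention, int& minSplitDepth) {
    if (queueContention > 0.10)
        minSplitDepth++;        // queue too contended: split less often
    else if (queueContention < 0.05 && minSplitDepth > 10)
        minSplitDepth--;        // contention low: allow shallower splits
}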

The second algorithm improvement addresses disadvantage #2 in the original algorithm:
petero2 wrote: If the master thread is searching move one at a SplitPoint, and a helper thread is searching move two, and the master thread fails low on move one, it will start searching move two. When this happens the helper thread is "kicked out" of that branch and has to start searching somewhere else. The idea is that the master thread should quickly get to the same point where the helper thread was interrupted because of transposition table hits. However, there is still some overhead involved. This seems hard to avoid with the current design, but the hope is that the overhead is small.
To reduce the number of "kick-outs", when a helper thread is about to select a move to search from a potential SplitPoint, it checks whether the next available move has an estimated usefulness probability above 98%. If so, the helper thread selects the move after the next available move instead. This has the effect that the SplitPoint owner has to search the next available move itself, and hopefully, by the time that search is finished, the helper thread has also finished its own search, so that no kick-out is necessary.

An example might make this clearer. Assume that at a SplitPoint the available moves are:

e4, d4, Nf3, Nc3, c4, f4

Let's say the SplitPoint owner has finished searching e4 and is now searching d4. A thread that wants to help at this SplitPoint could choose to search Nf3. However, if it is almost certain that Nf3 needs to be searched (because the confidence that this is an ALL node is very high), it is assumed that Nc3 also has to be searched. In this case the helper thread will start searching Nc3. When the SplitPoint owner finishes the d4 search, it will continue with Nf3, and since no helper thread has been searching that move, it is quite likely that the Nc3 search finishes before the Nf3 search. When this happens, the helper thread will again conclude that it could start searching c4, but it will instead search f4, because the confidence that c4 (and therefore most likely also f4) needs to be searched is high. A sketch of this selection rule follows below.
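
In the sketch the names are invented, not the actual code: pUseful[i] is the estimated probability that move i must be searched at this SplitPoint, and started[i] marks moves already taken by some thread.

Code: Select all

// Returns the index of the move a helper thread should pick, or -1.
int pickHelperMove(const double* pUseful, const bool* started, int numMoves) {
    int next = -1;
    for (int i = 0; i < numMoves; i++)
        if (!started[i]) { next = i; break; }
    if (next < 0)
        return -1;                 // nothing left to help with
    // If the next move is almost certainly needed, leave it for the owner
    // and take the one after it, so the owner never has to kick us out.
    if (pUseful[next] > 0.98 && next + 1 < numMoves && !started[next + 1])
        return next + 1;
    return next;
}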

With these two improvements I get pretty good speedup results if I disable all forward pruning except null move in my program. Using my time-to-depth measuring tool I get:

Code: Select all

pos  d     time   spd2   spd3   spd4      nodes     n2     n3     n4      nps   nps2   nps3   nps4 score   ds2   ds3   ds4
  1 13  158.001  1.741  2.789  3.477  390481446  1.122  1.054  1.118  2471390  1.954  2.940  3.888    21     0     0     0
  2 13   74.117  1.945  3.006  3.761  181807120  1.009  0.983  1.045  2452980  1.962  2.956  3.929    17     0     0     0
  3 12  105.165  1.969  2.874  3.649  258324459  1.002  1.025  1.069  2456380  1.972  2.945  3.901    17     0     0     0
  4 13   92.929  1.873  3.062  3.725  235221371  1.024  0.972  1.057  2531190  1.918  2.976  3.937    23     0     0     0
  5 13   88.262  2.021  2.934  3.728  224639423  0.986  1.005  1.055  2545130  1.992  2.949  3.932    28     0     0     0
  6 14  111.331  1.940  2.314  2.896  289098572  1.023  1.263  1.340  2596740  1.984  2.923  3.879    15     0     0     0
  7 13   86.060  1.851  2.883  3.828  220508128  1.040  1.030  1.030  2562260  1.924  2.971  3.944   -24     0     0     0
  8 13  459.788  3.519  3.549  6.718 1166951782  0.557  0.836  0.589  2538020  1.959  2.967  3.957   -26     0     0     0
  9 13  295.005  2.026  3.026  4.595  761747959  0.974  0.980  0.859  2582150  1.973  2.966  3.945   -29     0     0     0
 10 13  390.187  2.288  3.199  4.188  984488779  0.850  0.932  0.946  2523120  1.945  2.981  3.964   -35     0     0     0
 11 13  161.333  1.865  3.219  4.213  420113648  0.995  0.918  0.929  2604010  1.856  2.954  3.915    22     0     0     0
 12 15  269.960  1.869  2.853  3.892  712708990  1.024  1.036  1.012  2640060  1.914  2.955  3.938    27     0     0     0
 13 14  304.837  1.922  2.884  3.790  795950521  1.028  1.024  1.032  2611070  1.975  2.952  3.912    23     0     0     0
 14 14  192.420  2.054  3.121  3.818  505919224  0.936  0.937  1.014  2629240  1.923  2.924  3.873    29     0     0     0
 15 14   95.780  1.976  2.473  3.364  258563711  0.953  1.167  1.125  2699550  1.883  2.886  3.783    26    -2    -2    -2
 16 15  299.518  1.726  2.489  3.512  791776805  1.143  1.184  1.117  2643500  1.973  2.946  3.922    19     0     0     0
 17 16  525.107  1.680  2.890  3.460 1372305676  1.176  1.018  1.127  2613380  1.976  2.942  3.899     0     0     0     0
 18 15   95.574  2.055  2.643  3.114  254539882  0.956  1.081  1.222  2663280  1.965  2.858  3.805     0     0     0     0
 19 14   23.994  1.721  2.250  2.892   64430788  1.125  1.293  1.329  2685240  1.937  2.909  3.843     0     0     0     0
 20 16  138.037  2.025  2.765  4.324  393727539  0.961  1.044  0.889  2852330  1.946  2.887  3.843    54   -16    -8    -7
 21 14   59.830  1.464  2.558  3.317  174209005  1.252  1.135  1.156  2911710  1.832  2.904  3.834    46     1     1     1
 22 13   11.797  2.007  2.047  5.391   34931060  0.879  1.389  0.702  2961040  1.765  2.843  3.784    18     4     0     4
 23 13   11.601  1.744  2.623  3.403   35027326  1.126  1.091  1.122  3019480  1.965  2.862  3.818    31     0     0     0
 24 14   25.757  1.129  2.604  1.907   76064258  1.706  1.103  1.973  2953100  1.926  2.872  3.762    55     0     0   -13
avg     169.850  1.934  2.794  3.790  441814061  1.035  1.063  1.077  2656097  1.934  2.928  3.884
So the NPS speedup with 4 threads is 3.88 and the time-to-depth speedup is 3.79.

Disabling all forward pruning (LMR, LMP, futility, reverse futility, razoring) is clearly not a good thing to do except possibly when trying to understand how the parallel search algorithm behaves. Unfortunately, when all forward pruning methods are enabled the speedup results are much worse:

Code: Select all

pos  d     time   spd2   spd3   spd4      nodes     n2     n3     n4      nps   nps2   nps3   nps4 score   ds2   ds3   ds4
  1 18   76.198  2.308  2.271  3.036  140705368  0.851  1.286  1.262  1846580  1.964  2.919  3.833     0     0    -3    -4
  2 18   48.578  1.892  2.195  2.818   90289115  1.017  1.337  1.373  1858660  1.925  2.934  3.870     0    -9     0     8
  3 17   54.028  1.812  2.437  2.413  100038381  1.078  1.191  1.610  1851600  1.953  2.904  3.884    16   -12   -29     7
  4 18   41.333  1.928  2.357  3.484   79155826  1.026  1.238  1.112  1915080  1.978  2.919  3.874     3    -3   -19    -3
  5 18   34.182  2.379  1.983  2.120   64808802  0.844  1.488  1.825  1895990  2.009  2.951  3.869     0     0     0     9
  6 19   50.132  1.977  1.653  1.742   95170456  1.022  1.759  2.188  1898390  2.020  2.908  3.812     2     4    12    -2
  7 18   32.034  0.763  1.448  1.605   61156020  2.562  2.037  2.436  1909110  1.953  2.951  3.909   -57    25   -16    -4
  8 18   71.050  1.575  1.695  2.331  136117702  1.233  1.720  1.653  1915810  1.943  2.916  3.853   -66    23    13    12
  9 18   61.141  1.685  2.540  3.037  117226780  1.121  1.151  1.266  1917310  1.888  2.922  3.845   -62    -3     7    10
 10 18  102.272  1.789  3.262  4.286  199391848  1.100  0.915  0.913  1949630  1.968  2.984  3.915   -75    -2   -66   -42
 11 18   43.191  1.832  2.500  2.247   85370535  1.063  1.147  1.676  1976560  1.947  2.868  3.766    28   -18   -15   -13
 12 20   40.677  1.494  1.245  1.271   82464862  1.342  2.344  3.044  2027320  2.005  2.920  3.871     0    20    31    25
 13 19   29.984  0.799  0.865  1.015   60412013  2.459  3.346  3.757  2014820  1.965  2.895  3.813    24   -12   -18    -1
 14 19   49.076  0.852  2.619  1.651   96911003  2.287  1.088  2.329  1974710  1.949  2.848  3.845    20    -5     2     5
 15 19   60.583  1.661  1.798  1.869  126451201  1.169  1.586  2.009  2087240  1.942  2.853  3.756     0     0     0     0
 16 20   26.787  1.200  1.876  1.721   53529373  1.600  1.508  2.196  1998310  1.921  2.830  3.780     4    -4    -4    -4
 17 21   71.311  1.489  1.981  3.441  151231576  1.320  1.452  1.085  2120750  1.965  2.875  3.735     0     0     0     0
 18 20   29.540  1.718  1.870  2.340   65524777  1.120  1.521  1.601  2218160  1.925  2.845  3.745     0     0     0     0
 19 19   47.341  1.458  1.849  2.340  110390929  1.343  1.586  1.646  2331800  1.958  2.932  3.851     0     0     0     0
 20 21   80.663  1.119  0.975  1.281  199634703  1.761  3.022  3.046  2474910  1.971  2.945  3.901     0    70    34    40
 21 19   81.436  1.602  2.076  3.131  203068873  1.228  1.408  1.234  2493610  1.967  2.923  3.863    75     4     1   -58
 22 18   28.240  1.808  1.397  4.831   74345393  1.080  2.079  0.801  2632650  1.953  2.904  3.869    56   -23    18    -6
 23 18   28.886  1.522  1.475  1.317   75867252  1.286  1.979  2.956  2626460  1.958  2.919  3.892    64     0    13    21
 24 19   40.983  1.527  1.486  1.707  105705955  1.261  1.998  2.282  2579290  1.924  2.969  3.895   106     0    17    10
avg      51.235  1.591  1.911  2.376  107290364  1.341  1.674  1.888  2104781  1.956  2.910  3.844
The NPS speedup with 4 threads is still good, 3.84 instead of 3.88, but the time-to-depth speedup is now only 2.38 instead of 3.79. I don't do any intentional "widening" of the search, so the extra nodes searched are most likely wasted.

With forward pruning enabled, it is much more common for the multi-threaded search to return a different score than the single-threaded search (see columns ds2, ds3 and ds4). I am not sure what causes this, but my guess is that the way I propagate history and killer information between threads has some problems.

When a thread starts to search at a SplitPoint, it copies the current killer and history tables from the SplitPoint owner thread. This copy is made without any locking; the assumption is that it is better to avoid the locking overhead and accept the occasional corrupted entries caused by the data races. When a helper thread starts searching a new move at the same SplitPoint it was previously searching at, it keeps its own killer and history tables, so nothing is copied from the SplitPoint owner. When a helper thread has finished searching at a SplitPoint, its killer and history information is thrown away; there is no mechanism to propagate the information back to the SplitPoint owner.
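
A sketch of this handoff under an assumed table layout (not the engine's actual structures); the unlocked copy trades occasional corrupted entries for lower overhead, which costs only move ordering, never correctness.

Code: Select all

#include <cstring>

typedef int Move;
const int MAX_PLY = 128;

struct HeuristicTables {
    int  history[2][64][64];    // [side][from][to], assumed layout
    Move killers[MAX_PLY][2];
};

// Called when a helper thread picks up work at a SplitPoint.
void onStartAtSplitPoint(HeuristicTables& mine, const HeuristicTables& owner,
                         bool firstMoveAtThisSplitPoint) {
    if (firstMoveAtThisSplitPoint)
        std::memcpy(&mine, &owner, sizeof(mine));   // racy by design
    // For later moves at the same SplitPoint the helper keeps its own
    // tables; on leaving the SplitPoint they are simply discarded, and
    // nothing is propagated back to the owner.
}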

Is this a sensible way to handle killer and history information? How are other programs doing this?