Rein Halbersma wrote:
zamar wrote:
petero2 wrote:
For a given move at a potential split point it then computes the expected "efficiency", defined as p1 * p2 * ... * pN * (1 - overhead), where p1, ..., pN are the estimated probabilities that the moves for the current and all "parent" split points need to be searched. Finally it picks the move with the largest estimated efficiency.
An early version of the algorithm is described here.
I've thought along similar lines, so that makes perfect sense to me.
About your algorithm: Why are you not just using YBW?
I'm nowhere near an expert on parallel alpha-beta, but it seems to me that people in this thread are reinventing optimal scheduling of jobs related to potential split points. AFAICS, this is mostly solved in the literature by randomized work-stealing, as pioneered by Cilk-chess and currently available as Cilk-Plus from Intel (there is a g++ branch).
The way they implement work-stealing is with a cactus stack: essentially one deque of split points for each hardware thread. During execution, each thread pushes every potential split point onto its own deque and keeps popping work from the bottom of a randomly chosen deque. There are tons of papers and PhD theses from the Cilk project in which they prove that this kind of scheduling is optimal for fork-join parallelism. Because the randomization automatically load-balances, there is no need for heuristics about the depth in the tree at which a split point is added.
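(For concreteness, the deque scheme described above corresponds roughly to the sketch below. It is a simplified, mutex-based version; real Cilk uses a lock-free deque protocol, and the names SplitTask, WorkDeque and trySteal are invented for illustration only.)

[code]
// Simplified, mutex-based sketch of per-thread work deques; real Cilk uses a
// lock-free deque protocol. SplitTask, WorkDeque and trySteal are invented names.
#include <cstddef>
#include <deque>
#include <mutex>
#include <optional>
#include <random>
#include <vector>

struct SplitTask { int splitPointId; int moveNo; };  // placeholder payload

class WorkDeque {
    std::deque<SplitTask> tasks;
    std::mutex mtx;
public:
    void push(SplitTask t) {                  // owner publishes a potential split point
        std::lock_guard<std::mutex> lk(mtx);
        tasks.push_back(t);
    }
    std::optional<SplitTask> popOwn() {       // owner takes back its newest task
        std::lock_guard<std::mutex> lk(mtx);
        if (tasks.empty()) return std::nullopt;
        SplitTask t = tasks.back();
        tasks.pop_back();
        return t;
    }
    std::optional<SplitTask> steal() {        // thief takes from the opposite end
        std::lock_guard<std::mutex> lk(mtx);
        if (tasks.empty()) return std::nullopt;
        SplitTask t = tasks.front();
        tasks.pop_front();
        return t;
    }
};

// An idle thread picks a random victim deque and tries to steal from it.
std::optional<SplitTask> trySteal(std::vector<WorkDeque>& deques, std::mt19937& rng) {
    std::uniform_int_distribution<std::size_t> pick(0, deques.size() - 1);
    return deques[pick(rng)].steal();
}
[/code]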
I did study work-stealing algorithms and fork-join parallelism before designing my algorithm. However, as far as I know, those algorithms assume that every work unit stored in the work queues will eventually have to be computed by some thread. They don't handle the case where a work unit has an associated probability of turning out to be unnecessary.
In the alpha-beta case, there is no guarantee that stealing a work unit from the bottom of a deque is best, because that work unit might correspond to a PV node where the first move has not yet been searched, while a work unit higher up in the deque may correspond to a strongly expected ALL node.
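To make the contrast concrete, the selection rule from the quoted efficiency formula can be sketched roughly as below: the candidate with the largest p1 * ... * pN * (1 - overhead) is taken, independent of its position in any deque. The names and data layout are hypothetical, not the actual engine code.

[code]
// Rough sketch of scoring candidates by the quoted efficiency estimate
// p1 * p2 * ... * pN * (1 - overhead). Names and layout are hypothetical.
#include <cstddef>
#include <vector>

struct Candidate {
    std::vector<double> probNeeded;  // p1..pN for this and all parent split points
    double overhead;                 // estimated parallelism overhead, in [0, 1)
};

double efficiency(const Candidate& c) {
    double p = 1.0;
    for (double pi : c.probNeeded)
        p *= pi;
    return p * (1.0 - c.overhead);
}

// Pick the globally best candidate, independent of where it sits in any deque.
// Assumes cands is non-empty.
std::size_t bestCandidate(const std::vector<Candidate>& cands) {
    std::size_t best = 0;
    for (std::size_t i = 1; i < cands.size(); ++i)
        if (efficiency(cands[i]) > efficiency(cands[best]))
            best = i;
    return best;
}
[/code]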
That said, a drawback of the priority queue I use is that it is almost guaranteed to become a bottleneck if the number of threads is large enough. I have optimized the algorithm so that this is seldom a problem on my 16-core computer, but I have not been able to test on larger machines.
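The bottleneck is easy to see in a stripped-down form: a single mutex-protected priority queue ordered by the efficiency estimate, which every thread has to lock for each insertion and extraction. Again, this is only an illustrative sketch with made-up names, not the real implementation.

[code]
// Illustrative only: one global, mutex-protected priority queue of split-point
// candidates ordered by their precomputed efficiency. Every insert and every
// extraction takes the same lock, which is the contention problem described above.
#include <mutex>
#include <optional>
#include <queue>

struct ScoredTask {
    double efficiency;   // p1 * ... * pN * (1 - overhead), computed when enqueued
    int splitPointId;    // hypothetical handle to the actual work
    bool operator<(const ScoredTask& o) const { return efficiency < o.efficiency; }
};

class SharedTaskQueue {
    std::priority_queue<ScoredTask> pq;  // max-heap on efficiency
    std::mutex mtx;
public:
    void insert(const ScoredTask& t) {
        std::lock_guard<std::mutex> lk(mtx);
        pq.push(t);
    }
    std::optional<ScoredTask> takeBest() {
        std::lock_guard<std::mutex> lk(mtx);
        if (pq.empty()) return std::nullopt;
        ScoredTask t = pq.top();
        pq.pop();
        return t;
    }
};
[/code]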
At the time I designed the algorithm I did not find any scalable priority queue algorithms I could use, but I did a new search today and found this SprayList paper. Maybe that can be used to improve scalability.
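As a hedged illustration of the general relaxed-priority-queue idea (a MultiQueue-style sketch, not the SprayList algorithm itself): keep several small heaps, insert into a random one, and on extraction take the better of two randomly chosen tops. Strict priority order is lost, but no single lock sees all the traffic.

[code]
// Hedged sketch of a relaxed concurrent priority queue (MultiQueue-style),
// not the SprayList algorithm: several internal heaps, random insertion, and
// extraction takes the better of two randomly chosen tops. Ordering is only
// approximate. Assumes nShards >= 2.
#include <cstddef>
#include <mutex>
#include <optional>
#include <queue>
#include <random>
#include <vector>

template <typename T>
class RelaxedPriorityQueue {
    struct Shard {
        std::priority_queue<T> pq;
        std::mutex mtx;
    };
    std::vector<Shard> shards;

    std::size_t randomShard(std::mt19937& rng) const {
        std::uniform_int_distribution<std::size_t> pick(0, shards.size() - 1);
        return pick(rng);
    }
public:
    explicit RelaxedPriorityQueue(std::size_t nShards) : shards(nShards) {}

    void insert(const T& x, std::mt19937& rng) {
        Shard& s = shards[randomShard(rng)];
        std::lock_guard<std::mutex> lk(s.mtx);
        s.pq.push(x);
    }

    // Take the larger of the tops of two distinct random shards.
    std::optional<T> removeApproxMax(std::mt19937& rng) {
        std::size_t a = randomShard(rng);
        std::size_t b = randomShard(rng);
        if (b == a) b = (a + 1) % shards.size();
        std::scoped_lock lk(shards[a].mtx, shards[b].mtx);  // locks both without deadlock
        auto& qa = shards[a].pq;
        auto& qb = shards[b].pq;
        std::priority_queue<T>* best = nullptr;
        if (!qa.empty()) best = &qa;
        if (!qb.empty() && (best == nullptr || best->top() < qb.top())) best = &qb;
        if (best == nullptr) return std::nullopt;
        T x = best->top();
        best->pop();
        return x;
    }
};
[/code]

The trade-off seems similar to what the SprayList paper proposes: extractions return an element that is only approximately the highest-priority one, which should be acceptable when the priorities are themselves rough probability estimates.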