Lazy-evaluation of futures for parallel work-efficient Alpha-Beta search

Discussion of chess software programming and technical issues.

Moderators: hgm, Rebel, chrisw

dragontamer5788
Posts: 201
Joined: Thu Jun 06, 2019 8:05 pm
Full name: Percival Tiglao

Re: Lazy-evaluation of futures for parallel work-efficient Alpha-Beta search

Post by dragontamer5788 »

smatovic wrote: Fri Jun 21, 2019 9:14 pm
dragontamer5788 wrote: Fri Jun 21, 2019 7:35 pm ...
I've searched for some of the words in your post, and apparently this old topic exists: http://www.talkchess.com/forum3/viewtop ... =7&t=41853

I'll have to go through that topic and see what other parallel-methodologies were proposed and tried. I'll take a look at your code as well, and maybe I'll be inspired with alternative approaches.
Maybe a word on what I have tried on the GPU. During development I got 6 different approaches playing
chess on the GPU; the latter two played decent chess...

1) 'SPPS' AlphaBeta parallelization, 128 threads work on up to 128 moves per node in parallel
2) MCTS without UCT across multiple Compute Units of the GPU
3) LIFO-Stack based load balancing for AlphaBeta search on one Compute Unit of the GPU
4) 'Nagging' and 'Spam' parallelization approach for AlphaBeta search on one Compute Unit of the GPU
5) Parallel BestFirstMiniMax Search, similar to MCTS but with AlphaBeta playouts at leaf nodes
6) Classic parallel AlphaBeta approach with ABDADA and RMO, randomized move order, for parallelization

5) uses the 'one thread one board' approach with thousands of threads and 6) the 'one SIMD one board'
approach with hundreds of workers where 64 threads are coupled to one work-group to work on the same
node in parallel during move generation, move sorting and position evaluation.

IIRC, 3), the LIFO-Stack, gave the best nps throughput, but I could not figure out how to implement
AlphaBeta pruning efficiently.

Zeta milestones:
https://zeta-chess.app26.de/post/zeta-milestones/

Zeta v099k parallel scaling results:
https://github.com/smatovic/Zeta/blob/m ... esults.txt

--
Srdja
Very interesting project. Your #3 is closest to what I'm trying to do, so it's good to hear it had a good raw nodes-per-second throughput. As you point out, Alpha-Beta pruning in parallel seems like an unsolved problem. But I hope to accomplish it in this topic.

Instead of using a LIFO stack for one-compute unit, I plan to use a Deque as if it were a stack within a compute unit. Whenever the Deque fills up, it "work donates" to the global stack, which will allow other compute units to work-steal. A smaller "local" deque will allow for more workstealing, but will require more synchronization. A larger deque will have less workstealing, but will require less communication between compute units.

* Using a Deque as a stack is as simple as "add to tail" and "remove from tail" (local operations within a compute unit).

* The Deque work donation step is as simple as "remove from head". My current plan is to use simple spinlocks for the global work buffer.
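The local-deque discipline above can be sketched in plain C++. This is a conceptual sketch only: `LocalDeque`, `donate()`, and the capacity rule are illustrative names of mine, not code from any engine, and real GPU code would use a fixed circular buffer with atomics rather than `std::deque`.

```cpp
#include <cassert>
#include <cstdint>
#include <deque>
#include <vector>

// Sketch of "deque as a stack, donate from the head".
struct LocalDeque {
    std::deque<int32_t> work;   // tail = the local "stack top"
    std::size_t capacity;       // past this size, donate from the head

    // Local operations behave exactly like a LIFO stack (tail only).
    void push(int32_t task) { work.push_back(task); }
    int32_t pop() { int32_t t = work.back(); work.pop_back(); return t; }

    // Donation: the OLDEST tasks (head) move to the global buffer,
    // where other compute units can steal them.
    void donate(std::vector<int32_t>& global_queue) {
        while (work.size() > capacity) {
            global_queue.push_back(work.front());
            work.pop_front();
        }
    }
};
```

Donating from the head means other compute units steal the oldest (shallowest, and therefore largest) subtrees, which is the usual work-stealing heuristic.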
smatovic
Posts: 2644
Joined: Wed Mar 10, 2010 10:18 pm
Location: Hamburg, Germany
Full name: Srdja Matovic

Re: Lazy-evaluation of futures for parallel work-efficient Alpha-Beta search

Post by smatovic »

dragontamer5788 wrote: Sat Jun 22, 2019 11:47 pm Very interesting project. Your #3 is closest to what I'm trying to do, so it's good to hear it had a good raw nodes-per-second throughput. As you point out, Alpha-Beta pruning in parallel seems like an unsolved problem. But I hope to accomplish it in this topic.

Instead of using a LIFO stack for one-compute unit, I plan to use a Deque as if it were a stack within a compute unit. Whenever the Deque fills up, it "work donates" to the global stack, which will allow other compute units to work-steal. A smaller "local" deque will allow for more workstealing, but will require more synchronization. A larger deque will have less workstealing, but will require less communication between compute units.

* Using a Deque as a stack is as simple as "add to tail" and "remove from tail" (local operations within a compute unit).

* The Deque work donation step is as simple as "remove from head". My current plan is to use simple spinlocks for the global work buffer.
Steven Edwards suggested a doubly-linked list as a data structure with spinlocks
for parallel access, but I have never implemented or tested it:

http://www.talkchess.com/forum3/viewtop ... 93#p425411

It took me some time to get into GPU programming and understand the architecture
better, so some of my posts are full of presumptions.

--
Srdja
Daniel Shawul
Posts: 4185
Joined: Tue Mar 14, 2006 11:34 am
Location: Ethiopia

Re: Lazy-evaluation of futures for parallel work-efficient Alpha-Beta search

Post by Daniel Shawul »

From my understanding: a work-efficient parallel algorithm for Alpha-Beta search is already a "unicorn project" never seen before
As you point out, Alpha-Beta pruning in parallel seems like an unsolved problem. But I hope to accomplish it in this topic.
The algorithm to parallelize alpha-beta maximally is already there, like I mentioned before: namely, the rollouts version of it. So I am not sure
what exactly you are trying to accomplish.
Each thread essentially looks at one leaf node at a time; however, this comes at the cost of keeping track of the [alpha, beta] bounds
inside the tree in a table of some sort. And there is no "search overhead", since each node is looked at once. However, many threads
could hit "closed" nodes (zero alpha-beta window) in a rollout and return to the root without actually doing any real work. That can also happen with standard MCTS when two or more threads want to create children of the same node. These issues are alleviated using virtual loss.

Among the standard algorithms for parallelizing alpha-beta, ABDADA comes closest to this one, since it parallelizes via a shared transposition table. However, each thread still has to traverse the whole tree in a recursive manner, which is not efficient on the GPU.
Parallelizing algorithms such as YBW are not that efficient on the GPU either. Nowadays, even on CPUs with many cores (>=32), algorithms that use recursive decomposition of the tree, like YBW, are falling short. So I don't see a way forward with them on the GPU unless things have changed significantly -- not long ago, recursion was not possible there.
2. Lazy evaluation of score = max(score, alpha) is the hard part. Presume "alpha" initially equals [A, B, Max, -], while "score" initially equals [C, D, Max]. Then max(score, alpha) will generate [C, D, Max, A, B, Max, -, Max], representing the "lazy evaluation" of the alpha-tree... where "A", "B", "C", and "D" are values that will be returned by some spawned task. ("Register renaming" from Tomasulo's Algorithm.) These lists are organized in postfix notation. That is, [C, D, Max, A, B, Max, -, Max] is equivalent to the computation Max(-Max(A, B), Max(C, D)).

3. "beta.wait()" is when this tree is actually evaluated. If beta = [A, B, max, -] on this line, then "beta.wait" will stay on the "Blocked Queue" until A and B are available.
Alpha-beta bounds are beautifully handled in the rollouts version of alpha-beta with a neat proof provided. Adding null-move logic to it was fun and challenging.

If you try to do all this without using big memory tables, threads are surely going to spend a lot of time in the "blocked queue" and face the same problems as everybody else trying to parallelize recursive alpha-beta naively. People have actually tried work-stealing approaches for chess, though on CPUs, e.g. Cilkchess.

Another algorithm using work stealing is transposition-driven work scheduling.

Daniel
dragontamer5788
Posts: 201
Joined: Thu Jun 06, 2019 8:05 pm
Full name: Percival Tiglao

Re: Lazy-evaluation of futures for parallel work-efficient Alpha-Beta search

Post by dragontamer5788 »

smatovic wrote: Sun Jun 23, 2019 9:48 am Steven Edwards suggested a doubly-linked list as a data structure with spinlocks
for parallel access, but I have never implemented or tested it:

http://www.talkchess.com/forum3/viewtop ... 93#p425411

It took me some time to get into GPU programming and understand the architecture
better, so some of my posts are full of presumptions.
While I'm confident that a GPU could implement a parallel linked list efficiently, it would be a non-trivial project, and it has almost never been done before. The following paper has some hints on how a GPU-parallel linked list could be constructed and traversed: http://uenics.evansville.edu/~mr56/ece7 ... rithms.pdf (Specifically: Page 10)

A SIMD linked list would be very different from a traditional linked list. Maybe your malloc() function grabs everything from the same array, so that the list of allocated nodes is easily calculated. (And if you have an easily accessed array of every allocated node, then a linked-list traversal becomes efficient.) Unfortunately, if you already have an array of every allocated node, why not just iterate over that array instead? I can think of a few obscure use cases for a SIMD linked list, but arrays should be preferred on a GPU.
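A tiny C++ sketch of that point (`Pool`, `alloc`, and the two sum functions are illustrative names of mine): when every node is allocated from one backing array, "traversing the list" and "scanning the pool" give the same answer over a fully-live pool, but only the scan is SIMD-friendly, since the pointer chase has one dependent load per step.

```cpp
#include <cassert>
#include <cstdint>
#include <vector>

struct Node { int32_t value; int32_t next; };   // next = pool index, -1 = end

struct Pool {
    std::vector<Node> nodes;                    // the single backing array
    int32_t alloc(int32_t v, int32_t next) {
        nodes.push_back({v, next});
        return static_cast<int32_t>(nodes.size()) - 1;
    }
};

// Pointer chasing: inherently sequential, one dependent load per step.
int32_t sum_by_links(const Pool& p, int32_t head) {
    int32_t s = 0;
    for (int32_t i = head; i != -1; i = p.nodes[i].next) s += p.nodes[i].value;
    return s;
}

// Pool scan: the same answer when the whole pool is live, but every
// element is independent, so a GPU can process the array in parallel.
int32_t sum_by_scan(const Pool& p) {
    int32_t s = 0;
    for (const Node& n : p.nodes) s += n.value;
    return s;
}
```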

--------------

As such, the SIMD design would be to use a circular-buffer deque with NVidia Warp / AMD Wavefront level synchronization. Once you know about a few simple helper functions, it becomes immediately obvious how to implement this kind of code. All you need to know about are the __activemask() and activelane() helper functions... and then take advantage of the "implicit barriers" associated with SIMD. (NVidia Pascal+ architectures will need a __syncwarp() barrier as appropriate, but __syncwarp() is very efficient. AMD's "implementation" of __syncwarp() is simply a no-op: all instructions within a wavefront on an AMD system execute in lockstep, so each instruction is implicitly a thread barrier.)

Code: Select all


// All of this is untested, conceptual pseudocode

uint32_t activelane(){
	return the parallel-prefix sum of __activemask();
	// Ex: if(threadIdx.x % 5 == 0) blah = activelane(); returns 0 on the 0th lane,
	// 1 on the 5th lane, 2 on the 10th lane, etc. Lanes 1-4, 6-9, etc. are all
	// inactive because of the if-statement. activelane() is a magic function that
	// enumerates the active lanes, no matter the context.
	// CUDA describes this algorithm here: https://docs.nvidia.com/cuda/cuda-c-programming-guide/index.html#discovery-pattern-cg
	// AMD has the following: https://scchan.github.io/hcc/namespacehc.html#a518b02705c95b77399f523b078960863
}

void simd_addToTail(Deque d, int32_t toAdd){
	uint32_t mask = __activemask();	// capture BEFORE the branch below diverges;
					// inside the branch only one lane is active
	__shared__ uint32_t myTail;	// assumes one warp per workgroup; with more
					// warps you'd need one slot per warp
	if(activelane() == 0){
		// Compare head vs tail for overflow. Overflow code is complex,
		// and is ignored for this pseudocode.
		// On overflow, removeFromHead to give to the global work queue.
		myTail = atomicAdd(&d.tail, popcount(mask));
	}
	// "myTail" is now the same for every active lane; each lane gets its own slot
	__syncwarp(); // All threads in a warp wait here for each other to prevent race conditions
	d.buffer[(myTail + activelane()) % d.capacity] = toAdd;
}

int32_t simd_removeFromTail(Deque d){
	uint32_t mask = __activemask();	// capture before diverging, as above
	__shared__ uint32_t myTail;
	if(activelane() == 0){
		// Compare head vs tail for underflow, also ignored for this
		// pseudocode. Underflow "worksteals" from the global work queue.
		myTail = atomicSub(&d.tail, popcount(mask));	// returns the OLD tail
	}
	__syncwarp(); // All threads in a warp wait here for each other to prevent race conditions
	// The old tail points one past the last element, so with k = popcount(mask)
	// elements removed, lane i reads slot (tail - k + i).
	return d.buffer[(myTail - popcount(mask) + activelane()) % d.capacity];
}

// Add and remove from head is similar.
If you use __syncthreads() instead of __syncwarp() at those locations, then it can theoretically provide AMD Workgroup / NVidia Block level synchronization. But it would be prone to deadlocks: if one thread is waiting at "addToTail" while other threads are waiting at "removeFromTail", a deadlock will occur. However, deadlock is impossible if your workgroups are equal to the wavefront / warp size (32 threads for NVidia, 64 threads for AMD).

A full lock is probably needed in case of overflow or underflow. It will be a challenge to handle the edge cases elegantly. But the core concept of a SIMD-deque is actually quite easy, as long as we restrict ourselves to Wavefront / Warp-size threadblocks.
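For reference, here is how the "simple spinlock for the global work buffer" idea might look in plain C++ atomics. This is a CPU-side sketch with hypothetical names (`SpinLock`, `GlobalQueue`, `donate`, `steal`); on a GPU the same shape works with an atomic compare-and-swap on a word in global memory, but only one lane per wavefront should contend for the lock, and forward progress across workgroups is not guaranteed on all hardware.

```cpp
#include <atomic>
#include <cassert>
#include <cstdint>
#include <vector>

// A minimal test-and-set spinlock.
struct SpinLock {
    std::atomic_flag f = ATOMIC_FLAG_INIT;
    void lock()   { while (f.test_and_set(std::memory_order_acquire)) { /* spin */ } }
    void unlock() { f.clear(std::memory_order_release); }
};

// Spinlock-guarded global work buffer shared by all compute units.
struct GlobalQueue {
    SpinLock lk;
    std::vector<int32_t> tasks;

    void donate(int32_t t) {        // called when a local deque overflows
        lk.lock();
        tasks.push_back(t);
        lk.unlock();
    }
    bool steal(int32_t& out) {      // called when a local deque underflows
        lk.lock();
        bool ok = !tasks.empty();
        if (ok) { out = tasks.back(); tasks.pop_back(); }
        lk.unlock();
        return ok;
    }
};
```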
dragontamer5788
Posts: 201
Joined: Thu Jun 06, 2019 8:05 pm
Full name: Percival Tiglao

Re: Lazy-evaluation of futures for parallel work-efficient Alpha-Beta search

Post by dragontamer5788 »

Daniel Shawul wrote: Sun Jun 23, 2019 10:55 am
From my understanding: a work-efficient parallel algorithm for Alpha-Beta search is already a "unicorn project" never seen before
As you point out, Alpha-Beta pruning in parallel seems like an unsolved problem. But I hope to accomplish it in this topic.
The algorithm to parallelize alpha-beta maximally is already there, like I mentioned before: namely, the rollouts version of it. So I am not sure
what exactly you are trying to accomplish.
Each thread essentially looks at one leaf node at a time; however, this comes at the cost of keeping track of the [alpha, beta] bounds
inside the tree in a table of some sort. And there is no "search overhead", since each node is looked at once. However, many threads
could hit "closed" nodes (zero alpha-beta window) in a rollout and return to the root without actually doing any real work. That can also happen with standard MCTS when two or more threads want to create children of the same node. These issues are alleviated using virtual loss.

Among the standard algorithms for parallelizing alpha-beta, ABDADA comes closest to this one, since it parallelizes via a shared transposition table. However, each thread still has to traverse the whole tree in a recursive manner, which is not efficient on the GPU.
Parallelizing algorithms such as YBW are not that efficient on the GPU either. Nowadays, even on CPUs with many cores (>=32), algorithms that use recursive decomposition of the tree, like YBW, are falling short. So I don't see a way forward with them on the GPU unless things have changed significantly -- not long ago, recursion was not possible there.
I admit I don't understand ABDADA yet.

But as far as I'm aware, none of those schemes are work-efficient. Work-efficient means that if the serial implementation visits X nodes, then the parallel implementation visits exactly X nodes as well. That is to say: the parallel algorithm should perform exactly the same amount of work as the sequential algorithm.

Because the "rollout" version visits nodes unnecessarily, that immediately disqualifies it from my consideration. I'm hoping to create a truly work-efficient alpha-beta parallel implementation.

Do you know if ABDADA is work-efficient? Or does it speculate optimistically, as YBW does?
If you try to do all this without using big memory tables, threads are surely going to spend a lot of time in the "blocked queue" and face the same problems as everybody else trying to parallelize recursive alpha-beta naively. People have actually tried work-stealing approaches for chess, though on CPUs, e.g. Cilkchess.
I don't expect hardware threads to be spending any time on blocked queues as long as the "active queue" has work to be done. Remember, a "blocked" task simply switches over to an "active" task, allowing for dynamic rescheduling of work.

There may be 16384 SIMD-threads running on a Vega64 GPU, but there will be ~1 Billion nodes to process. Even if 0.1% of the nodes (aka: 1-million nodes) are available for parallel processing at any given time, the Vega64 threads would be able to find work. Don't "wait" on blocked queues. Just switch to a node that is available to be analyzed right now.

I expect my main problem will be writing an efficient implementation of work-queues and futures. As you can see in my previous post, I've "solved" the work-queue within a SIMD workgroup. I just have to figure out the overflow / underflow conditions so that different compute units can actually communicate. (Although, to be fair, the many edge cases for overflow / underflow seem difficult to figure out.)

Blocked tasks aren't free: they need to be regularly reevaluated to see if they're ready yet (that is: checking if "score > beta" is ready to be calculated yet). But we're looking at asymptotic complexity of O(n), where n is the number of blocked tasks. So it is "only" polynomial complexity that I'm tacking onto the system. Since chess is EXPTime / EXPSpace, I hope that the complexity of task-scheduling can be ignored.

--------

In the case of my specific implementation: tasks cannot truly block. Tasks can only spawn a blocked task. Under my system, the GPU threads are very easily proven to be at 100% utilization (100% of the SIMD threads will be working on active tasks). The only "states" for tasks are "Blocked" (when waiting for score>beta) and active (once score>beta is "available", the task can run).

There are some other "blocked" states that I'm coming up against: when buffers fill up, for example. But honestly, most "buffers fill up" situations are handled by a simple abort() out-of-memory error (aka: too complicated, I'm not going to bother handling it. :D :D ). I'm just gonna assume that the 8GB of RAM on the Vega64 is sufficient for the job.
Daniel Shawul
Posts: 4185
Joined: Tue Mar 14, 2006 11:34 am
Location: Ethiopia

Re: Lazy-evaluation of futures for parallel work-efficient Alpha-Beta search

Post by Daniel Shawul »

dragontamer5788 wrote: Sun Jun 23, 2019 4:55 pm
Daniel Shawul wrote: Sun Jun 23, 2019 10:55 am
From my understanding: a work-efficient parallel algorithm for Alpha-Beta search is already a "unicorn project" never seen before
As you point out, Alpha-Beta pruning in parallel seems like an unsolved problem. But I hope to accomplish it in this topic.
The algorithm to parallelize alpha-beta maximally is already there, like I mentioned before: namely, the rollouts version of it. So I am not sure
what exactly you are trying to accomplish.
Each thread essentially looks at one leaf node at a time; however, this comes at the cost of keeping track of the [alpha, beta] bounds
inside the tree in a table of some sort. And there is no "search overhead", since each node is looked at once. However, many threads
could hit "closed" nodes (zero alpha-beta window) in a rollout and return to the root without actually doing any real work. That can also happen with standard MCTS when two or more threads want to create children of the same node. These issues are alleviated using virtual loss.

Among the standard algorithms for parallelizing alpha-beta, ABDADA comes closest to this one, since it parallelizes via a shared transposition table. However, each thread still has to traverse the whole tree in a recursive manner, which is not efficient on the GPU.
Parallelizing algorithms such as YBW are not that efficient on the GPU either. Nowadays, even on CPUs with many cores (>=32), algorithms that use recursive decomposition of the tree, like YBW, are falling short. So I don't see a way forward with them on the GPU unless things have changed significantly -- not long ago, recursion was not possible there.
I admit I don't understand ABDADA yet.

But as far as I'm aware, none of those schemes are work-efficient. Work-efficient means that if the serial implementation visits X nodes, then the parallel implementation visits exactly X nodes as well. That is to say: the parallel algorithm should perform exactly the same amount of work as the sequential algorithm.
With that definition, you would not be able to do any parallelization, since alpha-beta is inherently sequential. Even the young brothers wait (YBW) algorithm speculates that after the first child is searched fully, the [a,b] windows for the siblings are "almost" known, i.e. it assumes very good move ordering. However, it does happen that a sibling sometimes fails high, in which case full-window re-searches are done, thereby introducing overhead. So if you are not willing to do even the minimal level of speculation of YBW, you won't be able to parallelize the subtree at all, because essentially any sibling can fail high and introduce overhead. If we knew the perfect move ordering, we would make the first move without any search.
Because the "rollout" version visits nodes unnecessarily, that immediately disqualifies it from my consideration. I'm hoping to create a truly work-efficient alpha-beta parallel implementation.
No it doesn't. The rollout version closes alpha-beta windows of the node once it is fully searched. So when another thread stumbles upon this node it goes back to the root to look for another job (leaf node). The traditional definition of search overhead is that you search more nodes than the sequential search does -- the rollout version doesn't do this because it searches exactly the same tree as the sequential search.
Do you know if ABDADA is work-efficient? Or does it speculate optimistically, as YBW does?
Not by your definition; it has a lot of search overhead, and so have lots of other algorithms that parallelize alpha-beta.
If you try to do all this without using big memory tables, threads are surely going to spend a lot of time in the "blocked queue" and face the same problems as everybody else trying to parallelize recursive alpha-beta naively. People have actually tried work-stealing approaches for chess, though on CPUs, e.g. Cilkchess.
I don't expect hardware threads to be spending any time on blocked queues as long as the "active queue" has work to be done. Remember, a "blocked" task simply switches over to an "active" task, allowing for dynamic rescheduling of work.

There may be 16384 SIMD-threads running on a Vega64 GPU, but there will be ~1 Billion nodes to process. Even if 0.1% of the nodes (aka: 1-million nodes) are available for parallel processing at any given time, the Vega64 threads would be able to find work. Don't "wait" on blocked queues. Just switch to a node that is available to be analyzed right now.

I expect my main problem will be writing an efficient implementation of work-queues and futures. As you can see in my previous post, I've "solved" the work-queue within a SIMD workgroup. I just have to figure out the overflow / underflow conditions so that different compute units can actually communicate. (Although, to be fair, the many edge cases for overflow / underflow seem difficult to figure out.)

Blocked tasks aren't free: they need to be regularly reevaluated to see if they're ready yet (that is: checking if "score > beta" is ready to be calculated yet). But we're looking at asymptotic complexity of O(n), where n is the number of blocked tasks. So it is "only" polynomial complexity that I'm tacking onto the system. Since chess is EXPTime / EXPSpace, I hope that the complexity of task-scheduling can be ignored.

--------

In the case of my specific implementation: tasks cannot truly block. Tasks can only spawn a blocked task. Under my system, the GPU threads are very easily proven to be at 100% utilization (100% of the SIMD threads will be working on active tasks). The only "states" for tasks are "Blocked" (when waiting for score>beta) and active (once score>beta is "available", the task can run).

There are some other "blocked" states that I'm coming up against: when buffers fill up, for example. But honestly, most "buffers fill up" situations are handled by a simple abort() out-of-memory error (aka: too complicated, I'm not going to bother handling it. :D :D ). I'm just gonna assume that the 8GB of RAM on the Vega64 is sufficient for the job.
The Cilk programming language, predecessor to Intel's TBB -- the inspiration for your project, if I am not mistaken -- was used to write Cilkchess using work-stealing algorithms. Cilkchess was actually the testbed for researching Cilk and, by extension, TBB.
No offence, but I suggest you do some literature review to see the challenges of tree parallelization in the traditional way.
dragontamer5788
Posts: 201
Joined: Thu Jun 06, 2019 8:05 pm
Full name: Percival Tiglao

Re: Lazy-evaluation of futures for parallel work-efficient Alpha-Beta search

Post by dragontamer5788 »

Daniel Shawul wrote: Mon Jun 24, 2019 5:00 pm
dragontamer5788 wrote: Sun Jun 23, 2019 4:55 pm I admit I don't understand ABDADA yet.

But as far as I'm aware, none of those schemes are work-efficient. Work-efficient means that if the serial implementation visits X nodes, then the parallel implementation visits exactly X nodes as well. That is to say: the parallel algorithm should perform exactly the same amount of work as the sequential algorithm.
With that definition, you would not be able to do any parallelization, since alpha-beta is inherently sequential. Even the young brothers wait (YBW) algorithm speculates that after the first child is searched fully, the [a,b] windows for the siblings are "almost" known, i.e. it assumes very good move ordering. However, it does happen that a sibling sometimes fails high, in which case full-window re-searches are done, thereby introducing overhead. So if you are not willing to do even the minimal level of speculation of YBW, you won't be able to parallelize the subtree at all, because essentially any sibling can fail high and introduce overhead. If we knew the perfect move ordering, we would make the first move without any search.
Well, as I said, I fully recognize that what I'm doing is a "unicorn project", something no one has tried to do before. If I fail, that's fine. If I succeed, then it's a new search paradigm. So I might as well attempt my methodology.

I could point out why I think a parallel work-efficient Alpha-Beta is possible... but I'm not entirely sure that any argument could convince you.

EDIT: Okay, let me give a quick explanation of why I think it's possible.

The short-version of my hypothesis is that... PV-nodes are inherently parallel (all children of PV-nodes must be searched, as you have [-infinity, infinity] as alpha-beta). ALL-nodes are parallel in classical Alpha-Beta search (in Alpha-Beta, all children of an ALL node are searched. Other algorithms, such as aspiration-windows, reduce the amount of work by visiting fewer than all children in an ALL-node). CUT-nodes are the only sequential node in Alpha-Beta search.

In effect, the system I'm discussing here places CUT-nodes into a blocked list, while searching PV-nodes and ALL-nodes exhaustively in parallel. CUT-nodes continue to be sequential-only, but my hypothesis is that there are enough PV-nodes and ALL-nodes to keep a 16,000-thread machine busy.
The Cilk programming language, predecessor to Intel's TBB -- the inspiration for your project, if I am not mistaken -- was used to write Cilkchess using work-stealing algorithms. Cilkchess was actually the testbed for researching Cilk and, by extension, TBB.
No offence, but I suggest you do some literature review to see the challenges of tree parallelization in the traditional way.
That's the thing. I have reviewed the parallel algorithms and I've concluded that none are efficient enough for what I want to accomplish. I admit that I don't fully understand them all, but a key component of my project is that I want a work efficient search.

The entire point of parallel algorithms is to improve the strength of chess engines. If a parallelization algorithm requires more and more work the more parallel it gets, then many-core architectures are completely hopeless. Consider Bitonic Sort vs Merge Sort: parallel Bitonic Sort is O(n * lg^2(n)) while Merge Sort is O(n * lg(n)). This means that as N -> infinity, the single-core computer will eventually be "faster" than the parallel Bitonic Sort.

In practice, the difference between O(n * lg^2(n)) and O(n * lg(n)) is inconsequential for practical problems. (I.e., most lists stop at a size of 1 trillion or so, so N does NOT practically move towards infinity.) But I think you can see from this Bitonic-vs-Merge example the "curse" of parallel... but work-inefficient... algorithms. If your algorithm provably requires more work to accomplish the same task, then it will inevitably be slower than a single-core computer (given a large enough N).

I mean, the number of nodes visited in classical Alpha-Beta is O(n^0.5) with good move ordering, where "n" is the number of nodes of a minimax search. I don't know what YBW or ABDADA do exactly, but if they're O(n^0.55) or O(n^0.6) (just slightly worse in growth factor), then those algorithms will get slower and slower the bigger the problem gets, to the point where a single-core processor will inevitably beat a 10,000-core one.
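The Bitonic-vs-Merge argument above can be made concrete with a back-of-envelope cost model. This is a sketch under stated assumptions: constant factors are ignored, and the bitonic sort is credited with a perfect p-way speedup (both function names are mine, purely illustrative).

```cpp
#include <cassert>
#include <cmath>

// Parallel bitonic sort: ~ n * lg(n)^2 total comparisons, divided
// perfectly across p cores.
double parallel_bitonic_cost(double n, double p) {
    double lg = std::log2(n);
    return n * lg * lg / p;
}

// Serial merge sort: ~ n * lg(n) comparisons on one core.
double serial_merge_cost(double n) {
    return n * std::log2(n);
}
```

Comparing the two, the serial sort wins exactly when lg(n) > p; with p = 32 that crossover is around n = 2^32 elements. Real constant factors would move the crossover, but not remove it.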
Last edited by dragontamer5788 on Mon Jun 24, 2019 6:03 pm, edited 2 times in total.
smatovic
Posts: 2644
Joined: Wed Mar 10, 2010 10:18 pm
Location: Hamburg, Germany
Full name: Srdja Matovic

Re: Lazy-evaluation of futures for parallel work-efficient Alpha-Beta search

Post by smatovic »

Under these terms I can imagine only a breadth first like search to be work-efficient.

--
Srdja
dragontamer5788
Posts: 201
Joined: Thu Jun 06, 2019 8:05 pm
Full name: Percival Tiglao

Re: Lazy-evaluation of futures for parallel work-efficient Alpha-Beta search

Post by dragontamer5788 »

In effect, the system I'm discussing here places CUT-nodes into a blocked list, while searching PV-nodes and ALL-nodes exhaustively in parallel. CUT-nodes continue to be sequential-only, but my hypothesis is that there are enough PV-nodes and ALL-nodes to keep a 16,000-thread machine busy
It appears I was imprecise with my initial wording. I apologize. And now that I think of it, I don't think PV, CUT, and ALL nodes properly describe the three cases I'm thinking about.

Let me try again and this time, I'll define my terminology.

* Node Type P: Alpha = -infinity and Beta=infinity. All children of Node-type P can be evaluated in parallel. max(score(child1), score(child2), score(child3)...) is returned. Therefore, all children can be searched in parallel.

* Node Type A: Beta = infinity. Alpha is a finite bound (not -infinity), but the node otherwise functions like Node Type P. max(alpha, score(child1), score(child2), score(child3)...) is returned. Therefore, all children can be searched in parallel.

* Node Type C: With "Beta" as non-infinity, there is a chance of Beta-cutoff each time a child is evaluated. if(score(child) > beta) return beta immediately (do not search any other children). I identify this step as essential to Alpha-Beta Search's efficiency, and wish to create a parallel search which continues this tradition.

Node Types P, A, and C are similar to PV-nodes, ALL-nodes, and CUT-nodes, but there are subtle differences, so it's probably best if I name them something else. As you can see, Node Types P and A can be searched in parallel, and there are approximately sqrt(N) nodes of type P or A in a minimax tree (where N is the number of nodes of the minimax tree). All other nodes (N - sqrt(N)) are of Node Type C.

At a ply of ~10 with a 35x branching factor, there are 2,758,547,353,515,625 nodes in total, and only ~50 million of them are of type P or A. But 50 million nodes is still plenty to search in parallel. And hopefully, there will be over ~16,000 type-C nodes that can be searched in parallel once you run out of P or A nodes to work with.
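The arithmetic above can be checked exactly (a quick sketch; `pow_u64` is just a helper of mine for exact integer powers, and 35^10 still fits comfortably in 64 bits):

```cpp
#include <cassert>
#include <cstdint>

// Exact integer power by repeated multiplication.
uint64_t pow_u64(uint64_t base, unsigned exp) {
    uint64_t r = 1;
    while (exp--) r *= base;
    return r;
}
```

35^10 = 2,758,547,353,515,625 total nodes, and sqrt(35^10) = 35^5 = 52,521,875, which is where the "~50 million type P or A nodes" figure comes from.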
Philipp Bach
Posts: 4
Joined: Thu Jan 01, 2015 12:21 pm

Re: Lazy-evaluation of futures for parallel work-efficient Alpha-Beta search

Post by Philipp Bach »

* Node Type P: Alpha = -infinity and Beta=infinity. All children of Node-type P can be evaluated in parallel. max(score(child1), score(child2), score(child3)...) is returned. Therefore, all children can be searched in parallel.
Isn't the returned score from child1 usually used as a bound for child2, etc.?
If you evaluated all children in parallel, each would still have the bounds Alpha = -infinity and Beta = infinity, so wouldn't it effectively degenerate to a minimax search?