### Lazy-evaluation of futures for parallel work-efficient Alpha-Beta search

Posted:

**Thu Jun 20, 2019 7:50 pm**Introduction

A few weeks ago, I theorized that GPUs can effectively perform work-efficient alpha-beta search. It led to a long discussion elsewhere on the internet, which pretty much ended with "If you think its possible, then you should do it". A couple of weeks later, here I am today ready to learn from the Chess community and begin tackling this project.

From my understanding: a work-efficient parallel algorithm for Alpha-Beta search is already a "unicorn project" never seen before... and I'm complicating the project by trying to make it work on a grossly non-traditional architecture: the GPU. Finally, I'm not even using CUDA, but instead the lesser-used AMD ROCm.

But despite all of these unknowns, the power of the Vega 64 GPU I'm targeting gives me much hope. With 4096 shaders (supporting 16,384 GPU-threads of execution) clocked at 1.5GHz, and with 8GB of HBM2 RAM delivering 483GB/s, the Vega 64 GPU is a monster of a machine: far wider than your typical 6-core CPU. Far more memory bandwidth than your typical 20GB/s DDR4 channel.

Furthermore, the GPGPU architecture is largely similar between NVidia and AMD, any ideas I implement on AMD's GPU should port over to NVidia's GPUs (although the code may have to change slightly). All modern GPUs have specs of this nature: incredible numbers of SIMD cores, and stupefying amounts of RAM bandwidth. It behooves the modern chess programmer to leverage the GPU.

There are many unanswered questions about how this can be done, but this topic hopes to at least answer the problem of search trees.

Previous Research

* https://github.com/StuartRiffle/Jaglavak -- An NVidia GPU-based chess engine that uses MCTS (calculated on the CPU) + Random Rollouts (calculated on the GPU). Not my project, but its proof that a GPU can implement a bitboard-based chess program, even if the MCTS search-tree remains strictly on the CPU-side of this implementation.

* Intel's Task-based Fibonnaci example: https://software.intel.com/en-us/node/506102 -- This is one of my main inspirations for the project. Task-based parallelism captures the structure of recursive algorithms (like Alpha-beta search) elegantly. Unfortunately, Task-based parallelism doesn't seem to have been ported to the GPU architecture yet.

* Stream Compaction: http://www.cse.chalmers.se/~uffe/streamcompaction.pdf -- Stream Compaction is the algorithm used for GPGPU-based raytracers. This is the algorithm I'll be using to implement parallel Task-queues (or parallel Task-stacks). Task-based parallelism for GPUs doesn't exist yet, but the Stream Compaction algorithm should make it possible.

* Futures / Promises -- https://en.wikipedia.org/wiki/Futures_and_promises . The C++ community (and other languages) have implemented future<int> as a synchronization primitive. This does not exist on a GPU yet.

* Tomasulo's Algorithm -- https://en.wikipedia.org/wiki/Tomasulo_algorithm. Tomasulo's Algorithm is the CPUs use to discover which operations can be safely executed out-of-order. This algorithm "discovers" parallelism, even in the (seemingly sequential) CPU Assembly language. I hope to use Tomasulo's Algorithm to implement Futures<> efficiently. Virtually every modern CPU implements this algorithm (or a variant of it) in the CPU's microcode to provide out-of-order execution of code.

Lazy-evaluation of futures for parallel work-efficient Alpha-Beta search

Now that the introductions are out of the way, here's my gameplan that I want to discuss. I hope to implement the following rough algorithm on the GPU (based on ChessProgramming.org Negamax AlphaBeta):

Some notes:

1. "Spawn" will be Task-based parallelism. Spawn roughly correlates to "Place a new task into the work-queue". It is assumed that the available threads (the 16,000+ threads of the Vega64) will be grabbing tasks from this queue. Obviously, efficient implementation of this work-queue is of utter-importance for this whole approach to work.

2. Lazy evaluation of score = max(score, alpha) is the hard part. Presume "alpha" initially equals [A, B, max, -], while "score" initially equals "[C, D, Max]. Then max(score, alpha) will generate [C, D, Max, A, B, Max, -, Max], representing the "lazy evaluation" of the alpha-tree... where "A", "B", "C", and "D" are values that will be returned by some Spawned task. ("Register renaming" from Tomasulo's Algorithm). These lists are organized into postfix notation. That is, [C, D, Max, A, B, Max, -, Max] is equivalent to the computation Max(-Max(A,B), Max(C, D)).

3. "beta.wait()" is when this tree is actually evaluated. If beta = [A, B, max, -] on this line, then "beta.wait" will stay on the "Blocked Queue" until A and B are available.

The mechanism described in #2 should "discover" the innate parallelism across alpha-beta trees. There's a lot of edge cases (what to do if the list fills up? How big should the list be for chess?). But I've mapped out a lot of examples by hand, and I'm confident that a system like this would work now. It appears that the square-root(nodes) of Minimax are immediately available to this parallelism strategy.

As I program this, I'll update the thread. But I wanted to at least generate some discussion. Maybe someone else has made a system similar to mine that I should be aware of? Its always good to read other research papers on this subject.

A few weeks ago, I theorized that GPUs can effectively perform work-efficient alpha-beta search. It led to a long discussion elsewhere on the internet, which pretty much ended with "If you think its possible, then you should do it". A couple of weeks later, here I am today ready to learn from the Chess community and begin tackling this project.

From my understanding: a work-efficient parallel algorithm for Alpha-Beta search is already a "unicorn project" never seen before... and I'm complicating the project by trying to make it work on a grossly non-traditional architecture: the GPU. Finally, I'm not even using CUDA, but instead the lesser-used AMD ROCm.

But despite all of these unknowns, the power of the Vega 64 GPU I'm targeting gives me much hope. With 4096 shaders (supporting 16,384 GPU-threads of execution) clocked at 1.5GHz, and with 8GB of HBM2 RAM delivering 483GB/s, the Vega 64 GPU is a monster of a machine: far wider than your typical 6-core CPU. Far more memory bandwidth than your typical 20GB/s DDR4 channel.

Furthermore, the GPGPU architecture is largely similar between NVidia and AMD, any ideas I implement on AMD's GPU should port over to NVidia's GPUs (although the code may have to change slightly). All modern GPUs have specs of this nature: incredible numbers of SIMD cores, and stupefying amounts of RAM bandwidth. It behooves the modern chess programmer to leverage the GPU.

There are many unanswered questions about how this can be done, but this topic hopes to at least answer the problem of search trees.

Previous Research

* https://github.com/StuartRiffle/Jaglavak -- An NVidia GPU-based chess engine that uses MCTS (calculated on the CPU) + Random Rollouts (calculated on the GPU). Not my project, but its proof that a GPU can implement a bitboard-based chess program, even if the MCTS search-tree remains strictly on the CPU-side of this implementation.

* Intel's Task-based Fibonnaci example: https://software.intel.com/en-us/node/506102 -- This is one of my main inspirations for the project. Task-based parallelism captures the structure of recursive algorithms (like Alpha-beta search) elegantly. Unfortunately, Task-based parallelism doesn't seem to have been ported to the GPU architecture yet.

* Stream Compaction: http://www.cse.chalmers.se/~uffe/streamcompaction.pdf -- Stream Compaction is the algorithm used for GPGPU-based raytracers. This is the algorithm I'll be using to implement parallel Task-queues (or parallel Task-stacks). Task-based parallelism for GPUs doesn't exist yet, but the Stream Compaction algorithm should make it possible.

* Futures / Promises -- https://en.wikipedia.org/wiki/Futures_and_promises . The C++ community (and other languages) have implemented future<int> as a synchronization primitive. This does not exist on a GPU yet.

* Tomasulo's Algorithm -- https://en.wikipedia.org/wiki/Tomasulo_algorithm. Tomasulo's Algorithm is the CPUs use to discover which operations can be safely executed out-of-order. This algorithm "discovers" parallelism, even in the (seemingly sequential) CPU Assembly language. I hope to use Tomasulo's Algorithm to implement Futures<> efficiently. Virtually every modern CPU implements this algorithm (or a variant of it) in the CPU's microcode to provide out-of-order execution of code.

Lazy-evaluation of futures for parallel work-efficient Alpha-Beta search

Now that the introductions are out of the way, here's my gameplan that I want to discuss. I hope to implement the following rough algorithm on the GPU (based on ChessProgramming.org Negamax AlphaBeta):

Code: Select all

```
future<int> alphaBeta( future<int> alpha, future<int> beta, int depthleft ) {
if( depthleft == 0 ) return quiesce( alpha, beta );
for ( all moves) {
future<int> score = Spawn(-alphaBeta( -beta, -alpha, depthleft - 1 )); // "Lazy calculate" the "negate" operation without actually evaluating it.
if(beta != infinity){ //Blocking is unnecessary while beta is infinity
//Block until "score >= beta" can be evaluated.
beta.wait(); // All stored "negate" and "max" operations will be evaluated here, and here only.
score.wait(); // Ditto
if( score >= beta )
return beta;
}
score = max(score, alpha); // "lazy calculate" the "max" operation without actually evaluating it until a "wait" statement is called
}
return alpha; // This is a "lazy symbolic fugure"
}
```

1. "Spawn" will be Task-based parallelism. Spawn roughly correlates to "Place a new task into the work-queue". It is assumed that the available threads (the 16,000+ threads of the Vega64) will be grabbing tasks from this queue. Obviously, efficient implementation of this work-queue is of utter-importance for this whole approach to work.

2. Lazy evaluation of score = max(score, alpha) is the hard part. Presume "alpha" initially equals [A, B, max, -], while "score" initially equals "[C, D, Max]. Then max(score, alpha) will generate [C, D, Max, A, B, Max, -, Max], representing the "lazy evaluation" of the alpha-tree... where "A", "B", "C", and "D" are values that will be returned by some Spawned task. ("Register renaming" from Tomasulo's Algorithm). These lists are organized into postfix notation. That is, [C, D, Max, A, B, Max, -, Max] is equivalent to the computation Max(-Max(A,B), Max(C, D)).

3. "beta.wait()" is when this tree is actually evaluated. If beta = [A, B, max, -] on this line, then "beta.wait" will stay on the "Blocked Queue" until A and B are available.

The mechanism described in #2 should "discover" the innate parallelism across alpha-beta trees. There's a lot of edge cases (what to do if the list fills up? How big should the list be for chess?). But I've mapped out a lot of examples by hand, and I'm confident that a system like this would work now. It appears that the square-root(nodes) of Minimax are immediately available to this parallelism strategy.

As I program this, I'll update the thread. But I wanted to at least generate some discussion. Maybe someone else has made a system similar to mine that I should be aware of? Its always good to read other research papers on this subject.