Lazy-evaluation of futures for parallel work-efficient Alpha-Beta search

Discussion of chess software programming and technical issues.

Moderators: hgm, Rebel, chrisw

dragontamer5788
Posts: 201
Joined: Thu Jun 06, 2019 8:05 pm
Full name: Percival Tiglao

Re: Lazy-evaluation of futures for parallel work-efficient Alpha-Beta search

Post by dragontamer5788 »

I've given some thought to how to evaluate an "incomplete future".

The P3121* branch, and its parents, seems to satisfy all the situations I'm interested in. Here's my hand-done trace of execution:

Code: Select all

F = AlphaBeta(P, -Inf, +Inf)

// F1 and F2 execute in parallel, but are ignored for this example
F3 = -AlphaBeta(P3, -Inf, [F1, F2, Max, -])

F31 = -AlphaBeta(P31, [F1, F2, Max], +Inf)
F3B = BetaBlock(F31 >= [F1, F2, Max, -])

// F311 and F313 execute in parallel, but are ignored for this example
F312 = -AlphaBeta(P312, -Inf, [F1, F2, Max, F311, Max, -])

F3121 = -AlphaBeta(P3121, [F1, F2, Max, F311, Max], +Inf)
F312B = BetaBlocked(F3121 >= [F1, F2, Max, F311, Max, -])

// Penultimate node, shortcut here for simplicity
F3121 = -Max(-eval(P31211), -eval(P31212), -eval(P31213))
Okay, so F1, F2, F311, and F313 are off executing in parallel while the code discovers the value F3121.

Let's actually evaluate things, because numbers are easier to think about than futures. My "evaluation" function for the purposes of these posts is hash(Position) = ((Position * 31415) / 1000) % 100 (the 4th and 5th digits after multiplying by the arbitrary Pi-like number 31415). And remember, 50-99 are negative numbers, while 0-49 are positive numbers. We're working in modulo-100 arithmetic.

"Physically", the numbers are represented from 0 to 99. Logically, the numbers are -50 to 49. (And -50 negated == -50, same quirk as in binary.)

Code: Select all

eval(P31211) = (31211 * 31415) // 1000 % 100 = 93
-eval(P31211) = -93 = 7

eval(P31212) = 24
-eval(P31212) = -24 = 76

eval(P31213) = 56
-eval(P31213) = -56 = 44

F3121 = -Max(-eval(P31211), -eval(P31212), -eval(P31213))

F3121 = -Max(7, -24, 44)
F3121 = -44
So we can evaluate F3121 without knowing what F1, F2, or anything else in the tree is. Secondly, YBWC and ABDADA would NOT have found this node, because F312 is a "younger brother". F312 will not execute until F311* branch has been fully evaluated. So my methodology will safely and work-efficiently find parallel work that YBWC / ABDADA will not (before resorting to speculative execution).
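The hand trace above can be reproduced with a few lines of Python (the function names are mine; the mod-100 signed convention is exactly as described in the previous paragraph):

```python
def eval_hash(position):
    """The toy evaluation from this thread: ((position * 31415) // 1000) % 100."""
    return (position * 31415) // 1000 % 100

def signed(v):
    """Interpret 0-49 as positive and 50-99 as negative (mod-100 analogue of two's complement)."""
    return v - 100 if v >= 50 else v

def neg(v):
    """Negation in modulo-100 arithmetic."""
    return (-v) % 100

assert eval_hash(31211) == 93 and neg(93) == 7
assert eval_hash(31212) == 24 and neg(24) == 76
assert eval_hash(31213) == 56 and neg(56) == 44

# F3121 = -Max(-eval(P31211), -eval(P31212), -eval(P31213)), with Max taken over signed values
f3121 = neg(max([neg(eval_hash(p)) for p in (31211, 31212, 31213)], key=signed))
print(signed(f3121))  # -44
```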

Moving up the tree, we still have:

Code: Select all

F312B = BetaBlocked(F3121 >= [F1, F2, Max, F311, Max, -])
F312B = BetaBlocked(-44 >= [F1, F2, Max, F311, Max, -])
Under my "work-efficiency" requirement, we cannot execute this BetaBlocked tree consisting of F3122 and F3123 until F1, F2, and F311 are done executing. But look at how bad F3121 is: it is -44 (and the absolute minimum score under my eval() methodology is -50). There's a very high chance that speculative execution would work out here.

Now it appears that F311 = -AlphaBeta(P311, ...) == -20. Let's modify the Beta-block question accordingly.

Code: Select all

F312B = BetaBlocked(-44 >= [F1, F2, Max, -20, Max, -])
I don't know where I'm going with this :-). But I think a simple heuristic may exist here for speculative execution. F1 and F2 are "top level" nodes, so they'll probably have a value closer to 0 because there are more choices available to both players. Furthermore, -44 is almost the lowest number by my scheme (the minimum is -50), so it is very unlikely for F1 or F2 to come out as 45+ and actually refute the F312B tree.

So it is probably safe to speculatively execute F312B, even if we don't know the F1 and F2 values yet.

Because all the futures are stored in the BetaBlock task, it is theoretically possible for the scheduler to perform logic on the expression-tree and figure out which Beta-blocked tasks have the lowest chance of refutation. Maybe sorting all BetaBlocked tasks by the score, and executing the smallest score values is best.
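One way to realize that scheduler logic is an ordinary min-heap keyed on each Beta-blocked task's current score. This is a hypothetical sketch; the task names, scores, and unresolved-future sets are invented for illustration:

```python
import heapq

# Each beta-blocked task: (current_score, name, unresolved_futures).
# A lower current score means refutation is less likely, so it is a
# better speculation candidate; a min-heap pops those first.
blocked = []
heapq.heappush(blocked, (-44, "F312B", {"F1", "F2"}))
heapq.heappush(blocked, (-5, "F412B", {"F1"}))
heapq.heappush(blocked, (-30, "F512B", {"F2"}))

score, name, waiting = heapq.heappop(blocked)
print(name)  # F312B: the lowest score, so speculate on it first
```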

If F312B executes speculatively (with F1 and F2 remaining unknown), then F3122 would look as follows:

Code: Select all

F3122 = -AlphaBeta(P3122, [F1, F2, Max, F311, Max], 44)
F312BB = BetaBlocked(F3122 >= [F1, F2, Max, F311, Max, -])
Where F312BB represents the 2nd beta-block in the F312* subtree.

------

Relaxing the work-efficiency requirement will be a last resort, if I cannot figure out a way to be work-efficient here. If I do have to resort to speculative execution here, it seems like simple heuristics (such as sorting the Beta-Blocked list by current minimum score) could go a long way to minimizing wasted work.
dragontamer5788
Posts: 201
Joined: Thu Jun 06, 2019 8:05 pm
Full name: Percival Tiglao

Re: Lazy-evaluation of futures for parallel work-efficient Alpha-Beta search

Post by dragontamer5788 »

I got to play with GraphViz, and have drawn out three graphs.

First: The (simulated) traversal of this algorithm: https://i.imgur.com/y3zAIGl.png

Second: The Negamax traversal of the P11111 tree, so that we have actual values we can use and discuss: https://svgshare.com/i/DuH.svg

EDIT: Third: The sequential Alpha-Beta traversal of the P11111 tree, so that we can see where Beta-cutoffs occur. https://svgshare.com/i/Dt1.svg

Here's an overview of the "syntax" of the simulated traversal:

Image

The full png is too large, so I'll only describe this small section of the P3* branch (3rd move of the root). Alpha is literally the array [infinity, -], while Beta is the array [infinity, -, F1, max, F2, max, -]. Because we have these "Symbolic Expression" arrays, lazy execution of each node can take place... and the traversal of Alpha-Beta can be somewhat out-of-order. F1 and F2 are not yet calculated, but this P3 branch will be explored.

P32 is beta-blocked, because F1 and F2 are needed to resolve the question score >= Beta? (the beta-cutoff question). This simplifies down to F31 >= -max(F1, F2).

We can see from the Negamax traversal (https://svgshare.com/i/DuH.svg) that F1, F2, and F31 will become 29, 23, and -30 respectively (F1 = -AlphaBeta(P1, ...), so F1 is negated compared to what is written at P1). -30 >= -max(29, 23) == -29 is false, so the P32 node will be evaluated (but only after F1, F2, and F31 are all complete). We can double-check this against the sequential Alpha-Beta tree.

I do apologize for the obtuse style, but I made these diagrams for my own self study. At least, I understand the diagrams, and that's the important part. :-)

From this section of the tree, you can see that the algorithm will potentially execute these in parallel: P31111, P31112, P31113, P31211, P31212, P31213, P31311, P31312, and P31313.

Note that P313* is a younger brother of the P31-branch, and yet the algorithm I'm proposing will execute it in parallel. YBWC and ABDADA are unable to execute P313* before P311*, so it seems that my algorithm has "discovered" nodes that can be made parallel.
dragontamer5788
Posts: 201
Joined: Thu Jun 06, 2019 8:05 pm
Full name: Percival Tiglao

Re: Lazy-evaluation of futures for parallel work-efficient Alpha-Beta search

Post by dragontamer5788 »

I think I've come up with the final format for FutureExpressions, so I'll publish my initial thoughts.

A FutureAtom is a 32-bit value consisting of a 7-bit opcode, a 1-bit negation flag, and a 24-bit value. There are 3 opcodes:

1. Symbol -- 24-bit pointer to the "FutureArray", more on that later

2. Value -- A 16-bit signed integer between -32767 and +32767. This will represent centipawns (or probably a more granular number). The value -32768 is non-intuitive with respect to negations, so I'd do everything in my power to avoid it. Another note: +32767 is positive infinity, while -32767 is semantically equivalent to negative infinity. And negate(positive infinity) == negative infinity with this scheme (not true if I used -32768 as -inf), so this should simplify a lot of code.

3. Max -- The 24-bit value says how many "former" array members need to be popped to be evaluated.
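A sketch of the bit-packing, assuming one particular layout (the field positions are my choice; the post only fixes the field widths):

```python
# Hypothetical layout: [31:25] opcode (7 bits) | [24] negation flag | [23:0] value/pointer
OP_SYMBOL, OP_VALUE, OP_MAX = 0, 1, 2

def pack(opcode, negated, value):
    """Pack a FutureAtom into a 32-bit integer."""
    assert 0 <= opcode < 128 and 0 <= value < (1 << 24)
    return (opcode << 25) | (int(negated) << 24) | value

def unpack(atom):
    """Return (opcode, negation_bit, 24-bit value)."""
    return atom >> 25, (atom >> 24) & 1, atom & 0xFFFFFF

atom = pack(OP_MAX, True, 5)     # "-Max popping 5 former array entries"
assert unpack(atom) == (OP_MAX, 1, 5)
assert atom < (1 << 32)          # fits in 32 bits
```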

The "Future Array" is a ~100MB structure consisting of four arrays:

* int16_t value[16777216]; // 16777216 is 24-bits
* uint32_t ready[16777216 / 32]; // Bitmask representing "value" is ready to be read from.
* uint32_t in_use[16777216 / 32]; // Bitmask for allocation + garbage collection
* uint32_t use_count[16777216]; // For reference counting. +1 per FutureExpression, -1 each time the value gets erased or bound.

// 50-bits per symbol. Not bad! Different threads can have different "starting-points" to reduce contention.

Parallel allocation is relatively cheap: "reserved = AtomicExchange(in_use[X/32], 0xFFFFFFFF)" reserves up to 32-bits safely and atomically. You may "lose the race" however, but "reserved" will tell you which bits you managed to win the race with.

If AtomicDecrement(use_count[X]) == 1, then this thread gets stuck with cleanup duty, and has to perform AtomicAnd(&FutureArray.in_use[X/32], ~(1 << (X%32)))

The ready[X/32] flag is set when the task associated with value[X] is done executing.
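A single-threaded model of the reservation and cleanup scheme above (the helper names are mine; real code would use the atomic intrinsics as described, this just demonstrates the bit logic):

```python
# Model of word-at-a-time reservation. AtomicExchange(in_use[X/32], 0xFFFFFFFF)
# returns the OLD word; the bits that were 0 in the old word are the slots
# this thread actually won the race for.
in_use = [0b00000000_00000000_10101010_00001111]  # one 32-bit word, some slots taken

def reserve_word(words, idx):
    old = words[idx]              # models the value AtomicExchange returns
    words[idx] = 0xFFFFFFFF
    return ~old & 0xFFFFFFFF      # bits we won (free before the exchange)

won = reserve_word(in_use, 0)
print(bin(won))  # 0b11111111111111110101010111110000 (the previously-free slots)

def release(words, idx, bit):
    # models AtomicAnd(&in_use[idx], ~(1 << bit)) once use_count hits zero
    words[idx] &= ~(1 << bit) & 0xFFFFFFFF

release(in_use, 0, 4)
assert not (in_use[0] >> 4) & 1   # slot 4 is free again
```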

-------------

With regards to the "speculation" front, I've worked out the logic for Beta-blocked tasks. In effect, the Beta-block question is purely alpha >?= beta. If the answer is yes, the move is refuted and the task should return immediately (no further children should be explored).

Fortunately, "alpha" expressions always consist of a top-level positive max() statement, while "beta" expressions always consist of a top-level negated max() statement. This means the following chain of inequalities is available:

Alpha >= LowerBound(Alpha) >?= UpperBound(Beta) >= Beta

LowerBound([A1, A2, A3...]) seems like it can be computed as follows:

* max() operations treat unbound symbols as -infinity
* -max() operations treat unbound symbols as +infinity (which will be later negated).

For example, the hypothetical LowerBound([11, F2, -max2, F3, 30, max3]) is: LB(max3(30, F3, -max2(F2, 11))) == LB(max3(30, -infinity, -infinity)) == 30

UpperBound([A1, A2, A3...]) seems like it can be computed as follows:

* max() operations treat unbound symbols as +infinity
* -max() operations treat unbound symbols as -infinity

For example, UpperBound([11, F2, -max2, F3, 30, max3]) is: UB(max3(30, F3, -max2(F2, 11))) == UB(max3(30, F3, -11)) == +infinity (from the unbound F3 variable)

These UpperBound and LowerBound functions could certainly be used to (in parallel!) figure out which Beta-blocked tasks are refuted, even before the futures are evaluated. Take P3122 for example, which is BetaBlocked( [F3121] >?= [F1 F2 F311 -max3]). F1 and F2 are relying upon many, many tasks before they can return. But F311 and F3121 are "local", and there's a chance they can refute this branch of the tree long before F1 and F2 are done calculating.

There probably is a good heuristic to use with LowerBound(Alpha) and/or UpperBound(Beta) (quite possibly even LowerBound(Alpha) - UpperBound(Beta) as the basis for a heuristic?). I haven't thought of any yet, but whatever the heuristic will be... I'm putting it into a GPU-parallel SIMD Heap for efficient storing + sorting + priority queue: https://arxiv.org/pdf/1906.06504.pdf
dragontamer5788
Posts: 201
Joined: Thu Jun 06, 2019 8:05 pm
Full name: Percival Tiglao

Re: Lazy-evaluation of futures for parallel work-efficient Alpha-Beta search

Post by dragontamer5788 »

LowerBound([A1, A2, A3...]) seems like it can be computed as follows:

* max() operations treat unbound symbols as -infinity
* -max() operations treat unbound symbols as +infinity (which will be later negated).

For example, the hypothetical LowerBound([11, F2, -max2, F3, 30, max3]) is: LB(max3(30, F3, -max2(F2, 11))) == LB(max3(30, -infinity, -infinity)) == 30
All of this is hogwash. I'm still convinced a LowerBound / UpperBound is possible, but I just found a counter-example.

LowerBound(-max(max(F1, 5))) should equal -infinity, but the procedure described returns -5.

In any case, the development of a proper LowerBound and UpperBound expression could be useful for heuristics and extra early Beta cutoffs.

EDIT: Although, maybe the simplest heuristics are best? The oldest son has the highest priority (literally zero chance of beta-cutoff), but the 2nd son has the next highest priority in any given node (they can only be cutoff by the oldest son). I'm strongly of the opinion that speculation should only be used as a last resort: all other parallel work must be completed before speculation takes place.
dragontamer5788
Posts: 201
Joined: Thu Jun 06, 2019 8:05 pm
Full name: Percival Tiglao

Re: Lazy-evaluation of futures for parallel work-efficient Alpha-Beta search

Post by dragontamer5788 »

To make this methodology practical, I probably should remove the "work efficient" requirement. The absurd RAM requirements are the final nail in the coffin.

Remember: Beta-blocked tasks are the ones waiting on the result of (Alpha >= Beta ??)... these tasks must be stored until the calculation is ready to run. Each Beta-blocked task requires around ~1024 bytes of RAM (150 bytes for the board-state, 256 bytes x2 for FutureExpressions, maybe 512ish bytes for move-generator purposes + other misc bytes). With a branching factor of 35, there will be 52 million Beta-blocked tasks by Ply 10. That's around 50GB of RAM needed to hold all of these tasks, and it grows exponentially to 2TB by Ply 12.
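The arithmetic above can be sanity-checked in a few lines (the ~1024-bytes-per-task figure is taken from the paragraph above; the quoted 52-million count matches 35^5):

```python
# Sanity check of the RAM estimates for beta-blocked tasks.
bytes_per_task = 1024
tasks_ply10 = 35 ** 5
print(tasks_ply10)                           # 52521875, the "52 Million"
print(tasks_ply10 * bytes_per_task / 2**30)  # ~50 GiB
print(35 ** 6 * bytes_per_task / 2**40)      # ~1.7 TiB, the "2TB by Ply 12"
```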

Once a Beta-blocked task is complete, it will only use 50-bits of RAM: 2-bytes for centipawns evaluation, 4-bytes for the use_count, and 2x 1-bit flags for "ready_to_read" and "in_use". Once all FutureExpressions across the program have been updated (aka: use_count drops to zero), these last 50-bits will be deallocated and recycled for new tasks.

Therefore: speculating saves on RAM, and there simply isn't enough RAM to truly do all calculations in parallel (even the small subset of "perfectly parallel" tasks I've described earlier in this topic is too large to hold in RAM today).
dragontamer5788
Posts: 201
Joined: Thu Jun 06, 2019 8:05 pm
Full name: Percival Tiglao

Re: Lazy-evaluation of futures for parallel work-efficient Alpha-Beta search

Post by dragontamer5788 »

dragontamer5788 wrote: Thu Jul 04, 2019 6:56 am There probably is a good heuristic to use with LowerBound(Alpha) and/or UpperBound(Beta) (quite possibly even LowerBound(Alpha) - UpperBound(Beta) as the basis for a heuristic?). I haven't thought of any yet, but whatever the heuristic will be... I'm putting it into a GPU-parallel SIMD Heap for efficient storing + sorting + priority queue: https://arxiv.org/pdf/1906.06504.pdf
AlphaBeta tasks are active and have no speculation. To minimize their memory usage, processing the nodes with the least amount of "DepthRemaining" is all that is needed.

All AlphaBeta tasks will return a variable that will be relevant to some BetaBlocked question: (alpha >= beta ??). For example, F1111111111

------
dragontamer5788 wrote: Thu Jul 04, 2019 7:57 am All of this is hogwash. I'm convinced LowerBound / Upperbound is possible, but I just found a counter-example.

LowerBound(-max(max(F1, 5))) should equal -infinity, but the procedure described returns -5.
Well, I've solved this problem. But I don't think I can use LowerBound or UpperBound outside of extra-early Beta cutoff. The solution is similar to the Upperbound and Lowerbound flags in ABDADA.

In effect, every symbol is represented as the pair <-inf, +inf>. So max(F1, Value) will be computed as <Value, +Infinity>. While -max(F1, Value) will be computed as <-infinity, -value>. The <-infinity, value> is an upperbound, while <value, +infinity> is a lower-bound. These can be represented as Upperbound_Value and Lowerbound_value opcodes on a FutureAtom format.

"Shallow" lower-bounds and upper-bounds will be easy to calculate, but "deep" lower-bounds will need to recursively enter the Symbol tree. I.e., max(F1, 5) seems simple... but F1 itself is going to be -max(F11, F12, F13, F14...), and F11 is -max(F111, F112, F113, F114...). It would take a lot of effort to perform a "deep traversal" for upper-bounds and lower-bounds of an arbitrary expression. However, evaluating a "depth-limited 3-ply lower bound" (check F1, F11, and F111, but do NOT check F1111 and below) could still give good lower-bound and upper-bound values for the purposes of speculative Beta cutoffs.
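The <lower, upper> pair rule can be sketched as a tiny recursive evaluator over expression trees (the tuple encoding is mine, not the FutureAtom format; it only demonstrates the interval logic):

```python
from math import inf

# Expressions: a number, the string "unbound" for an unresolved future,
# ("max", [children]), or ("neg", child). Every node evaluates to an
# <lb, ub> pair; negation swaps and negates the pair.
def bounds(expr):
    if expr == "unbound":
        return (-inf, inf)
    if isinstance(expr, (int, float)):
        return (expr, expr)
    op = expr[0]
    if op == "max":
        child = [bounds(c) for c in expr[1]]
        return (max(lo for lo, _ in child), max(hi for _, hi in child))
    if op == "neg":
        lo, hi = bounds(expr[1])
        return (-hi, -lo)
    raise ValueError(op)

# The earlier counter-example -max(max(F1, 5)):
e = ("neg", ("max", ["unbound", 5]))
print(bounds(e))  # (-inf, -5): the lower bound is -infinity, as it should be
```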
dragontamer5788 wrote: Therefore: speculating saves on RAM, and there simply isn't enough RAM to truly do all calculations in parallel (even the small subset of "perfectly parallel" tasks I've described earlier in this topic is too large to hold in RAM today).
Speculative execution of Beta is surprisingly easy. I walked through P3B as an example. Let me summarize:

Code: Select all

F = AlphaBeta(P, -Inf, +Inf)
F3 = AlphaBeta(P3, -Inf, [F1, F2, -Max2])

"Step Into F3"
--------------------
Alpha=-Inf
Beta = [F1, F2, -Max2]

F31 = AlphaBeta(P31, [F1, F2, Max2], +Inf)
Alpha = max(Alpha, F31)
F3B = BetaBlock: (F31 >= [F1, F2, -Max2] ?)
Basically, Task F is the root task. F3 is spawned from investigating the root. F31 is the "Oldest Brother" we're all familiar with. F3B is the "Block" condition because of my work-efficient assumption.

At this point, let's remove my work-efficient assumption and enter the realm of speculative execution. How would things progress? Personally speaking, I was surprised at how well-behaved everything is.

Code: Select all

Alpha = F31
Beta = [F1, F2, -Max2]

F32 = -AlphaBeta(P32,  [F1, F2, Max2], [-F31])
Alpha = max(Alpha, F32)
BetaBlock: ([F31, F32, Max2] >= [F1, F2, -Max2])
Hey, would you look at that? All the futures execute just fine under a speculative condition. The new Beta-block condition ([F31, F32, Max2] >= [F1, F2, -Max2]) is correct from a blocking perspective too! To work-efficiently compute this Beta-block requires waiting on the results of F31, F32, F1, and F2.

Let's take a look at how F32 will execute:

Code: Select all

F32 = -AlphaBeta(P32,  [F1, F2, Max2], [-F31])
Alpha = [F1, F2, Max2]
Beta = [-F31]

F321 = -AlphaBeta(P321, [F31], [F1, F2, -Max2])
Alpha = max(Alpha, F321)
F32B = BetaBlock( [F1, F2, F321, Max3] >= [-F31] ??)
This BetaBlock is similar to the negated form of the original Beta-block: F3B = BetaBlock: (F31 >= [F1, F2, -Max2] ?). The difference is that F321 was added to the "Max" statement. Otherwise, it is clear that the original Beta-block condition has been preserved.

The "cost" for this speculation is approximately (DepthRemaining x 2): there are (DepthRemaining) Alpha-Beta nodes created (F321, F3211, F32111...), and (DepthRemaining) Beta-blocked nodes created (F32B, F321B, and F3211B). So all things considered, this speculative execution was very cheap, especially on nodes near the bottom of the tree.

My work-efficiency heuristic will probably be number-of-symbols-waiting. In the case of F32B: BetaBlock( [F1, F2, F321, Max3] >= [-F31] ??), there are four symbols (F1, F2, F321, and F31) that the beta-blocked task is waiting on. The more symbols waiting, the less "work efficient" the problem will be. There will be a single, perfectly-work efficient beta-blocked task where all symbols are bound. There will be a large number of Beta-Blocked tasks waiting on only 1-symbol (or maybe 2-symbols).

In contrast, a task like P11122 is waiting on BetaBlocked( F11121 >= -F1111 ??). With only two symbols that it's waiting for, it is probably less likely to be Beta-cutoff.
dragontamer5788
Posts: 201
Joined: Thu Jun 06, 2019 8:05 pm
Full name: Percival Tiglao

Re: Lazy-evaluation of futures for parallel work-efficient Alpha-Beta search

Post by dragontamer5788 »

Here is some amusing math that deeply annoys me.

* Occupancy 16/40 per Compute Unit x 64 Compute Units (Vega64) x 64 threads per wavefront == 65,536 hardware threads. Occupancy 40/40 (163,840 threads) is overkill and probably too difficult to achieve. Occupancy 4/40 (16,384 threads) won't have any possibility of memory-hiding, so 16/40 seems like a reasonable target
* 35 Children per node x 1024 (estimated) x 12-Depth == 420 kB per depth-first search traversal.
* 65536 threads x 420 kB depth-first search traversals == 27 GB.

Uhhhh... that doesn't even get to Q-Search (which also needs #children x depth x 1024 bytes of memory). My 8GB GPU doesn't even have enough RAM to have Occupancy 16/40 depth-first searches going simultaneously. I'm "fixing" this problem by setting an arbitrary limit to 6-children expanded per node at a time. With 6 children + 1 "leftover" node (the "leftover" node will expand out to the other 29 siblings at a later-time), the nodes will fit in... roughly 5GB of RAM through depth 12.
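The estimates above can be re-derived in a few lines (assuming the 1024-bytes-per-task figure from the bullet points; the numbers line up with the quoted 420 kB, 27 GB, and ~5 GB):

```python
# Re-deriving the GPU memory estimates.
per_task = 1024                     # assumed bytes per node/task
threads = 65536                     # 16/40 occupancy x 64 CUs x 64-wide wavefronts

full = 35 * per_task * 12           # bytes per 12-ply depth-first traversal
print(full / 1024)                  # 420.0 kB per traversal
print(threads * full / 2**30)       # 26.25 GiB, the "27 GB"

limited = (6 + 1) * per_task * 12   # 6 expanded children + 1 "leftover" node
print(threads * limited / 2**30)    # 5.25 GiB, the "roughly 5GB"
```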

In any case: it's going to be difficult enough to even get a simple depth-first search to fit on this GPU with a reasonable number of threads (65536 SIMD-threads). I'll have to cut out every unnecessary byte in the task-structure.
smatovic
Posts: 2642
Joined: Wed Mar 10, 2010 10:18 pm
Location: Hamburg, Germany
Full name: Srdja Matovic

Re: Lazy-evaluation of futures for parallel work-efficient Alpha-Beta search

Post by smatovic »

dragontamer5788 wrote: Mon Jul 15, 2019 6:20 am Here is some amusing math that deeply annoys me.

* Occupancy 16/40 per Compute Unit x 64 Compute Units (Vega64) x 64 threads per wavefront == 65,536 hardware threads. Occupany 40/40 ( 163,840 threads) is overkill and probably too difficult to achieve. Occupancy 4/40 (16,384 threads) won't have any possibility of memory-hiding, so 16/40 seems like a reasonable target
* 35 Children per node x 1024 (estimated) x 12-Depth == 420 kB per depth-first search traversal.
* 65536 threads x 420 kB depth-first search traversals == 27 GB.

Uhhhh... that doesn't even get to Q-Search (which also needs #children x depth x 1024 bytes of memory). My 8GB GPU doesn't even have enough RAM to have Occupancy 16/40 depth-first searches going simultaneously. I'm "fixing" this problem by setting an arbitrary limit to 6-children expanded per node at a time. With 6 children + 1 "leftover" node (the "leftover" node will expand out to the other 29 siblings at a later-time), the nodes will fit in... roughly 5GB of RAM through depth 12.

In any case: its going to be difficult enough to even get a simple depth-first search to fit on this GPU with a reasonable amount of threads (65536 SIMD-threads). I'll have to cut out every unnecessary byte in the task-structure.
...maybe you can apply your algorithm fully on a subtree, and stepwise iterate
through the full game tree.

Similar to what Ankan did in gpu perft, to call perft from host for different
subtrees.

--
Srdja
dragontamer5788
Posts: 201
Joined: Thu Jun 06, 2019 8:05 pm
Full name: Percival Tiglao

Re: Lazy-evaluation of futures for parallel work-efficient Alpha-Beta search

Post by dragontamer5788 »

smatovic wrote: Mon Jul 15, 2019 9:18 am
dragontamer5788 wrote: Mon Jul 15, 2019 6:20 am Here is some amusing math that deeply annoys me.

* Occupancy 16/40 per Compute Unit x 64 Compute Units (Vega64) x 64 threads per wavefront == 65,536 hardware threads. Occupany 40/40 ( 163,840 threads) is overkill and probably too difficult to achieve. Occupancy 4/40 (16,384 threads) won't have any possibility of memory-hiding, so 16/40 seems like a reasonable target
* 35 Children per node x 1024 (estimated) x 12-Depth == 420 kB per depth-first search traversal.
* 65536 threads x 420 kB depth-first search traversals == 27 GB.

Uhhhh... that doesn't even get to Q-Search (which also needs #children x depth x 1024 bytes of memory). My 8GB GPU doesn't even have enough RAM to have Occupancy 16/40 depth-first searches going simultaneously. I'm "fixing" this problem by setting an arbitrary limit to 6-children expanded per node at a time. With 6 children + 1 "leftover" node (the "leftover" node will expand out to the other 29 siblings at a later-time), the nodes will fit in... roughly 5GB of RAM through depth 12.

In any case: its going to be difficult enough to even get a simple depth-first search to fit on this GPU with a reasonable amount of threads (65536 SIMD-threads). I'll have to cut out every unnecessary byte in the task-structure.
...maybe you can apply your algorithm fully on an subtree, and stepwise iterate
through the full game tree.

Similar to what Ankan did in gpu perft, to call perft from host for different
subtrees.

--
Srdja
That methodology would work for sure... but fortunately, I have a lot of options available. 27GB isn't completely terrible: my CPU has 32GB of RAM, and HBCC should allow the Vega64 to borrow RAM from the CPU in the worst case. So my setup can in fact run with 27GB used by the GPU; most of that RAM would be shuttled through the (relatively) slow PCIe x16 pipe instead of the local 8GB HBM2, but it would in fact work.

Another thing: Ply0 through Ply4 have less than 65536 nodes available, so these nodes do not expand at the rate I described earlier. The first 5 ply are far cheaper than I estimated.

--------

The real issue is that if I spend all 27GB on the parallelization task-structure, then I don't have any room left over for useful features, like a Transposition Table. Apparently, a ton of RAM is going to be used just on allowing these tasks to operate independently of each other. I think I'm going to just accept the fact that the parallelization data-structures are going to likely be bigger than the transposition table or EGTB.

I can easily cut my per-task data-structure down to 512 bytes. And with some sacrifices, I probably can get down to 384 bytes. Depth Remaining 3+ will only expand maybe 6 children at a time: most of the parallelization opportunity is at the leaf-nodes... at Depth Remaining 1, Depth Remaining 0, and Q-Search.

Plenty of options available to me. It just means that I have to take the memory-problem seriously, even with 8GB+ on modern GPUs.
dragontamer5788
Posts: 201
Joined: Thu Jun 06, 2019 8:05 pm
Full name: Percival Tiglao

Re: Lazy-evaluation of futures for parallel work-efficient Alpha-Beta search

Post by dragontamer5788 »

dragontamer5788 wrote: Mon Jul 15, 2019 6:20 am * Occupancy 16/40 per Compute Unit x 64 Compute Units (Vega64) x 64 threads per wavefront == 65,536 hardware threads. Occupany 40/40 ( 163,840 threads) is overkill and probably too difficult to achieve. Occupancy 4/40 (16,384 threads) won't have any possibility of memory-hiding, so 16/40 seems like a reasonable target
* 35 Children per node x 1024 (estimated) x 12-Depth == 420 kB per depth-first search traversal.
* 65536 threads x 420 kB depth-first search traversals == 27 GB.
I think I have a basic methodology with an incomplete simulation or two that seems to solve this issue.

Limit the size of a "FutureExpression" to only hold ~16 children. This naturally limits the search to 16 children per node... but it goes further than that: it is a "deeper" limitation than children-limited depth-first search would normally be. P.16 for example will spawn with bounds <-inf, -max(F.1, F.2... F.15)>. This will cause P.16.1.2 to try to spawn with <-inf, -max(F.1, F.2... F.15, F.16.1.1)>, but there is not enough space.

In effect, the entire P.16.1.* branch will be explored with only one-child per node. The P.1.1.* branch will spawn with 17 children, but again the P.1.1.16.1.* branch will only spawn with 1-child as well.

So the entire tree "focuses" on the older brother, and saves a lot of space. A breadth-first search with these limitations seems to grow at O(n^3) space instead of O(35^n) space. Of course, as F.1, F.2... are calculated, there will have to be a "scanning" process that updates these FutureExpressions with the actual calculated values. But that's all part of the plan.