ABDADA+ suggestion

smatovic · Post by **smatovic** » Sat Jun 29, 2019 9:10 am

I have implemented ABDADA on gpu and it achieves an time to depth speedup only
up to 32 parallel threads resp. workers. With more than 32 workers there is not
much to gain, tested it with up to 256 workers, but not at longer time controls.

Looking at the algorithm and game tree, I think by adding another defer step
there could be something gained.

From CPW, https://www.chessprogramming.org/ABDADA

Code: Select all

    5. The analysis of a position is done in three phases:
        1.The eldest son is analyzed, regardless of the activities of other processors
        2.Next, all other sons not currently being analyzed by other processors are analyzed
        3.In the final phase, any remaining sons not yet analyzed are searched, i.e. the corresponding entry in the TT indicates this node and its siblings have not been searched to the required depth

Between phase 2. and 3. there could be a clause added:

Code: Select all

        If all other sons are already analyzed by other processors, 
        then return to parent node, continue with next move, and set flag for second iteration (phase #3).

Maybe this kind of clause is already implied in 3., but as I undestand it, in 3.
all remaining sons which search has not yet finished (depth reached) has to be searched.

Unfortunately, I have no workstation atm to test this, so what is your opinion?

--
Srdja

Daniel Shawul · Post by **Daniel Shawul** » Sat Jun 29, 2019 3:13 pm

There is no need for the intermediate step because when you do (3), the first thing the search does is check the hashtable
and it will find hits if the children have been search by other processors. You can even go over all the movelist in the second pass,
instead of moves that aren't exclusively looked at, and not loose efficency because of this.

What do you consider a thread in your implementation (a CUDA thread, or a warp (32 threads)) ?

Daniel

smatovic · Post by **smatovic** » Sat Jun 29, 2019 3:42 pm

Daniel Shawul wrote: ↑Sat Jun 29, 2019 3:13 pm There is no need for the intermediate step because when you do (3), the first thing the search does is check the hashtable
and it will find hits if the children have been search by other processors.

Yes, assumed that the other processors finished their search already and have stored the result in hashtable.

My thought was that the children are not yet searched fully, and too many workers start to search the same children,
cos my scaling drops with 32 workers, but there may be other reasons for this.

Daniel Shawul wrote: ↑Sat Jun 29, 2019 3:13 pm You can even go over all the movelist in the second pass,
instead of moves that aren't exclusively looked at, and not loose efficency because of this.

Ah, cos of values from hashtable.

Daniel Shawul wrote: ↑Sat Jun 29, 2019 3:13 pm What do you consider a thread in your implementation (a CUDA thread, or a warp (32 threads)) ?

In your terms a warp.

Zeta v099 couples 64 gpu threads (CUDA threads) to one 'worker' (Work-Group resp. Block) to work
on the same node in parallel during move gen, move pick and eval, and a shared hash table approach
is used for parallel AlphaBeta across these workers.

...just for the completeness, a little mod to avoid looping:

Code: Select all

    5.The analysis of a position is done in three phases:
        1.The eldest son is analyzed, regardless of the activities of other processors
        2.a Next, all other sons not currently being analyzed by other processors are analyzed
        2.b If parent node is not in phase 3. and
             if all other sons are already analyzed by other processors,
             then return to parent node, skip current sibling, continue with next sibling, and set flag for second iteration (phase 3.)
        3.In the final phase, any remaining sons not yet analyzed are searched, i.e. the corresponding entry in the TT indicates this node and its siblings have not been searched to the required depth

--
Srdja

dragontamer5788 · Post by **dragontamer5788** » Sat Jun 29, 2019 5:33 pm

ABDADA (and YBW) are both speculative algorithms: they do work hoping that the work was not wasted. It would seem to me that the goal of speculative execution is to do two things:

1. Select nodes (and subtrees) that have a high-chance of being necessary work.
2. Avoid nodes (and subtrees) that have a high-chance of being unnecessary.

Lets consider a hypothetical game with 35 available moves per node, that will be evaluated to depth 5. The nodes of the tree are thus labeled P.1.1.1.1.1 through P.35.35.35.35.35, where "P" stands for the root node. Which nodes have the highest chance of being beta-cutoff? And which nodes have the lowest chance?

YBW and ABDADA keenly recognize that the oldest brother (P.1.1.1.1.1, P.2.1.1.1.1.1, ... P.35.1.1.1.1.1) must be evaluated before beta-cutoff can be calculated. So all of these nodes must be evaluated.

However, immediately after these oldest brothers are evaluated, YBW and ABDADA immediately spawn P.2.2.* through P.2.35.*, P.3.2.* through P.3.35.*, etc. etc. And it is clear to me that P.2.2.* is "less speculative" than P.35.2.*. Even with random move ordering, the "youngest brother" has a far higher chance of being beta-cutoff than the "2nd oldest brother".

Consider P.2.2.*. The only condition that this subtree is beta-cutoff is eval(P.2.1) > eval(P.1).

Lets consider P.35.2.*. This subtree is beta-cutoff on the condition eval(P.35.1) > max(eval(P.1), eval(P.2), eval(P.3), eval(P.4), ... eval(P.34). There are 34 "older brothers", all of whom may have a better move than P.35.1.

Under YBW and ABDADA, P.2.2.* and P.35.2.* subtrees have the same priority. I would argue that this is work-inefficient: P.35.2* must have a lower priority than P.2.2.*.

dragontamer5788 · Post by **dragontamer5788** » Sat Jun 29, 2019 6:07 pm

So, if I were to create speculative algorithm, how would it work?

1. Oldest brother must be evaluated to completion (P.1.1.1.1.1, P.2.1.1.1.1, P.35.1.1.1.1). Remember that I'm assuming Depth 5.

2. Instead of spawning all brothers, only spawn 2 or 3 brothers (P.1.2.* through P.1.4.*. "Block" P.5.* through P.35.* due to the increased chance of beta-cutoff).

I would expect an algorithm with this kind of prioritization would have less work wasted. Its not work-efficient compared to Sequential AlphaBeta, but it would be more work efficient than YBW or ABDADA. After all, The entire P.2.* tree could be potentially used to refute P.35.2.*. Why search P.35.2.* when there's such a huge chance of refutation?

dragontamer5788 · Post by **dragontamer5788** » Sat Jun 29, 2019 6:16 pm

smatovic wrote: ↑Sat Jun 29, 2019 9:10 amBetween phase 2. and 3. there could be a clause added:
Code: Select all
        If all other sons are already analyzed by other processors, 
        then return to parent node, continue with next move, and set flag for second iteration (phase #3).
Maybe this kind of clause is already implied in 3., but as I undestand it, in 3.
all remaining sons which search has not yet finished (depth reached) has to be searched.

Unfortunately, I have no workstation atm to test this, so what is your opinion?

Sorry for ranting above. I really should have tried to answer your question instead.

-----------

GPUs have 10,000+ SIMD threads available, but chess only has 35-moves per node.

So thread#0 will start searching P.0.* tree (P.0.* now has 1-processor looking at it). P.1.* tree then has 1-processor. Etc. etc. After 35-processors grab nodes, you have 1-processor analyzing P.1 through P.35.

This now completes #1 and #2 conditions. So this triggers condition #3, as far as I can tell. If you want the 9965 other SIMD-cores of your GPU to actually be doing work, you have to go deeper into the tree somehow.

smatovic · Post by **smatovic** » Sat Jun 29, 2019 6:47 pm

dragontamer5788 wrote: ↑Sat Jun 29, 2019 5:33 pm ABDADA (and YBW) are both speculative algorithms: they do work hoping that the work was not wasted. It would seem to me that the goal of speculative execution is to do two things:

1. Select nodes (and subtrees) that have a high-chance of being necessary work.
2. Avoid nodes (and subtrees) that have a high-chance of being unnecessary.

Lets consider a hypothetical game with 35 available moves per node, that will be evaluated to depth 5. The nodes of the tree are thus labeled P.1.1.1.1.1 through P.35.35.35.35.35, where "P" stands for the root node. Which nodes have the highest chance of being beta-cutoff? And which nodes have the lowest chance?

YBW and ABDADA keenly recognize that the oldest brother (P.1.1.1.1.1, P.2.1.1.1.1.1, ... P.35.1.1.1.1.1) must be evaluated before beta-cutoff can be calculated. So all of these nodes must be evaluated.

However, immediately after these oldest brothers are evaluated, YBW and ABDADA immediately spawn P.2.2.* through P.2.35.*, P.3.2.* through P.3.35.*, etc. etc. And it is clear to me that P.2.2.* is "less speculative" than P.35.2.*. Even with random move ordering, the "youngest brother" has a far higher chance of being beta-cutoff than the "2nd oldest brother".

Consider P.2.2.*. The only condition that this subtree is beta-cutoff is eval(P.2.1) > eval(P.1).

Lets consider P.35.2.*. This subtree is beta-cutoff on the condition eval(P.35.1) > max(eval(P.1), eval(P.2), eval(P.3), eval(P.4), ... eval(P.34). There are 34 "older brothers", all of whom may have a better move than P.35.1.

Under YBW and ABDADA, P.2.2.* and P.35.2.* subtrees have the same priority. I would argue that this is work-inefficient: P.35.2* must have a lower priority than P.2.2.*.

You withheld that ABDADA spawns P.1.1.1.1.1 to P.1.35.* first before it starts
to explore the root nodes.

And considering that a beta-cutoff in chess occur in ~90% on the first move, this
makes sense.

--
Srdja

smatovic · Post by **smatovic** » Sat Jun 29, 2019 6:47 pm

dragontamer5788 wrote: ↑Sat Jun 29, 2019 6:07 pm So, if I were to create speculative algorithm, how would it work?

1. Oldest brother must be evaluated to completion (P.1.1.1.1.1, P.2.1.1.1.1, P.35.1.1.1.1). Remember that I'm assuming Depth 5.

2. Instead of spawning all brothers, only spawn 2 or 3 brothers (P.1.2.* through P.1.4.*. "Block" P.5.* through P.35.* due to the increased chance of beta-cutoff).

I would expect an algorithm with this kind of prioritization would have less work wasted. Its not work-efficient compared to Sequential AlphaBeta, but it would be more work efficient than YBW or ABDADA. After all, The entire P.2.* tree could be potentially used to refute P.35.2.*. Why search P.35.2.* when there's such a huge chance of refutation?

Considering an effective branching factor of 2 to 3 for modern chess engines,
this makes sense to me.

--
Srdja

smatovic · Post by **smatovic** » Sat Jun 29, 2019 6:49 pm

dragontamer5788 wrote: ↑Sat Jun 29, 2019 6:16 pm
smatovic wrote: ↑Sat Jun 29, 2019 9:10 amBetween phase 2. and 3. there could be a clause added:
Code: Select all
        If all other sons are already analyzed by other processors, 
        then return to parent node, continue with next move, and set flag for second iteration (phase #3).
Maybe this kind of clause is already implied in 3., but as I undestand it, in 3.
all remaining sons which search has not yet finished (depth reached) has to be searched.

Unfortunately, I have no workstation atm to test this, so what is your opinion?
Sorry for ranting above. I really should have tried to answer your question instead.

-----------

GPUs have 10,000+ SIMD threads available, but chess only has 35-moves per node.

So thread#0 will start searching P.0.* tree (P.0.* now has 1-processor looking at it). P.1.* tree then has 1-processor. Etc. etc. After 35-processors grab nodes, you have 1-processor analyzing P.1 through P.35.

This now completes #1 and #2 conditions. So this triggers condition #3, as far as I can tell. If you want the 9965 other SIMD-cores of your GPU to actually be doing work, you have to go deeper into the tree somehow.

How many gpu threads you need to run to utilize a gpu fully depends on your code,
mainly global memory access which has latencies of hundreds of clock cycles...

AMD Fury X for example has 4096 stream cores, coupled to 256 SIMD units, coupled to 64 Compute Units.
Each Compute Unit is able to handle 40 active Wavefronts, with 64 threads per Wavefront, so 163840 threads in total.

Zeta v099 is with 64 gpu threads per worker able to utilize the VALU of one SIMD unit with ~50%,
so I do not need to run 163840 threads, but some hundred workers are enough.

--
Srdja

smatovic · Post by **smatovic** » Sat Jun 29, 2019 8:31 pm

smatovic wrote: ↑Sat Jun 29, 2019 6:47 pm
dragontamer5788 wrote: ↑Sat Jun 29, 2019 5:33 pm ABDADA (and YBW) are both speculative algorithms: they do work hoping that the work was not wasted. It would seem to me that the goal of speculative execution is to do two things:

1. Select nodes (and subtrees) that have a high-chance of being necessary work.
2. Avoid nodes (and subtrees) that have a high-chance of being unnecessary.

Lets consider a hypothetical game with 35 available moves per node, that will be evaluated to depth 5. The nodes of the tree are thus labeled P.1.1.1.1.1 through P.35.35.35.35.35, where "P" stands for the root node. Which nodes have the highest chance of being beta-cutoff? And which nodes have the lowest chance?

YBW and ABDADA keenly recognize that the oldest brother (P.1.1.1.1.1, P.2.1.1.1.1.1, ... P.35.1.1.1.1.1) must be evaluated before beta-cutoff can be calculated. So all of these nodes must be evaluated.

However, immediately after these oldest brothers are evaluated, YBW and ABDADA immediately spawn P.2.2.* through P.2.35.*, P.3.2.* through P.3.35.*, etc. etc. And it is clear to me that P.2.2.* is "less speculative" than P.35.2.*. Even with random move ordering, the "youngest brother" has a far higher chance of being beta-cutoff than the "2nd oldest brother".

Consider P.2.2.*. The only condition that this subtree is beta-cutoff is eval(P.2.1) > eval(P.1).

Lets consider P.35.2.*. This subtree is beta-cutoff on the condition eval(P.35.1) > max(eval(P.1), eval(P.2), eval(P.3), eval(P.4), ... eval(P.34). There are 34 "older brothers", all of whom may have a better move than P.35.1.

Under YBW and ABDADA, P.2.2.* and P.35.2.* subtrees have the same priority. I would argue that this is work-inefficient: P.35.2* must have a lower priority than P.2.2.*.
You withheld that ABDADA spawns P.1.1.1.1.1 to P.1.35.* first before it starts
to explore the root nodes.

And considering that a beta-cutoff in chess occur in ~90% on the first move, this
makes sense.

--
Srdja

dragontamer5788 wrote: ↑Sat Jun 29, 2019 5:33 pm ...
Under YBW and ABDADA, P.2.2.* and P.35.2.* subtrees have the same priority. I would argue that this is work-inefficient: P.35.2* must have a lower priority than P.2.2.*.
...

Hmm, maybe I don't get you,
but if you consider move ordering for phase #2 and #3, then P.35.2* has an lower priority than P.2.2.*.

--
Srdja

ABDADA+ suggestion

ABDADA+ suggestion

Re: ABDADA+ suggestion

Re: ABDADA+ suggestion

Re: ABDADA+ suggestion

Re: ABDADA+ suggestion

Re: ABDADA+ suggestion

Re: ABDADA+ suggestion

Re: ABDADA+ suggestion

Re: ABDADA+ suggestion

Re: ABDADA+ suggestion