ABDADA+ suggestion

Discussion of chess software programming and technical issues.

Moderators: hgm, Rebel, chrisw

smatovic
Posts: 2658
Joined: Wed Mar 10, 2010 10:18 pm
Location: Hamburg, Germany
Full name: Srdja Matovic

ABDADA+ suggestion

Post by smatovic »

I have implemented ABDADA on gpu and it achieves an time to depth speedup only
up to 32 parallel threads resp. workers. With more than 32 workers there is not
much to gain, tested it with up to 256 workers, but not at longer time controls.

Looking at the algorithm and game tree, I think by adding another defer step
there could be something gained.

From CPW, https://www.chessprogramming.org/ABDADA

Code: Select all

    5. The analysis of a position is done in three phases:
        1.The eldest son is analyzed, regardless of the activities of other processors
        2.Next, all other sons not currently being analyzed by other processors are analyzed
        3.In the final phase, any remaining sons not yet analyzed are searched, i.e. the corresponding entry in the TT indicates this node and its siblings have not been searched to the required depth

Between phase 2. and 3. there could be a clause added:

Code: Select all

        If all other sons are already analyzed by other processors, 
        then return to parent node, continue with next move, and set flag for second iteration (phase #3).
Maybe this kind of clause is already implied in 3., but as I undestand it, in 3.
all remaining sons which search has not yet finished (depth reached) has to be searched.

Unfortunately, I have no workstation atm to test this, so what is your opinion?

--
Srdja
Daniel Shawul
Posts: 4185
Joined: Tue Mar 14, 2006 11:34 am
Location: Ethiopia

Re: ABDADA+ suggestion

Post by Daniel Shawul »

There is no need for the intermediate step because when you do (3), the first thing the search does is check the hashtable
and it will find hits if the children have been search by other processors. You can even go over all the movelist in the second pass,
instead of moves that aren't exclusively looked at, and not loose efficency because of this.

What do you consider a thread in your implementation (a CUDA thread, or a warp (32 threads)) ?

Daniel
smatovic
Posts: 2658
Joined: Wed Mar 10, 2010 10:18 pm
Location: Hamburg, Germany
Full name: Srdja Matovic

Re: ABDADA+ suggestion

Post by smatovic »

Daniel Shawul wrote: Sat Jun 29, 2019 3:13 pm There is no need for the intermediate step because when you do (3), the first thing the search does is check the hashtable
and it will find hits if the children have been search by other processors.
Yes, assumed that the other processors finished their search already and have stored the result in hashtable.

My thought was that the children are not yet searched fully, and too many workers start to search the same children,
cos my scaling drops with 32 workers, but there may be other reasons for this.
Daniel Shawul wrote: Sat Jun 29, 2019 3:13 pm You can even go over all the movelist in the second pass,
instead of moves that aren't exclusively looked at, and not loose efficency because of this.
Ah, cos of values from hashtable.
Daniel Shawul wrote: Sat Jun 29, 2019 3:13 pm What do you consider a thread in your implementation (a CUDA thread, or a warp (32 threads)) ?
In your terms a warp.

Zeta v099 couples 64 gpu threads (CUDA threads) to one 'worker' (Work-Group resp. Block) to work
on the same node in parallel during move gen, move pick and eval, and a shared hash table approach
is used for parallel AlphaBeta across these workers.

...just for the completeness, a little mod to avoid looping:

Code: Select all

    5.The analysis of a position is done in three phases:
        1.The eldest son is analyzed, regardless of the activities of other processors
        2.a Next, all other sons not currently being analyzed by other processors are analyzed
        2.b If parent node is not in phase 3. and
             if all other sons are already analyzed by other processors,
             then return to parent node, skip current sibling, continue with next sibling, and set flag for second iteration (phase 3.)
        3.In the final phase, any remaining sons not yet analyzed are searched, i.e. the corresponding entry in the TT indicates this node and its siblings have not been searched to the required depth
--
Srdja
dragontamer5788
Posts: 201
Joined: Thu Jun 06, 2019 8:05 pm
Full name: Percival Tiglao

Re: ABDADA+ suggestion

Post by dragontamer5788 »

ABDADA (and YBW) are both speculative algorithms: they do work hoping that the work was not wasted. It would seem to me that the goal of speculative execution is to do two things:

1. Select nodes (and subtrees) that have a high-chance of being necessary work.
2. Avoid nodes (and subtrees) that have a high-chance of being unnecessary.

Lets consider a hypothetical game with 35 available moves per node, that will be evaluated to depth 5. The nodes of the tree are thus labeled P.1.1.1.1.1 through P.35.35.35.35.35, where "P" stands for the root node. Which nodes have the highest chance of being beta-cutoff? And which nodes have the lowest chance?

YBW and ABDADA keenly recognize that the oldest brother (P.1.1.1.1.1, P.2.1.1.1.1.1, ... P.35.1.1.1.1.1) must be evaluated before beta-cutoff can be calculated. So all of these nodes must be evaluated.

However, immediately after these oldest brothers are evaluated, YBW and ABDADA immediately spawn P.2.2.* through P.2.35.*, P.3.2.* through P.3.35.*, etc. etc. And it is clear to me that P.2.2.* is "less speculative" than P.35.2.*. Even with random move ordering, the "youngest brother" has a far higher chance of being beta-cutoff than the "2nd oldest brother".

Consider P.2.2.*. The only condition that this subtree is beta-cutoff is eval(P.2.1) > eval(P.1).

Lets consider P.35.2.*. This subtree is beta-cutoff on the condition eval(P.35.1) > max(eval(P.1), eval(P.2), eval(P.3), eval(P.4), ... eval(P.34). There are 34 "older brothers", all of whom may have a better move than P.35.1.

Under YBW and ABDADA, P.2.2.* and P.35.2.* subtrees have the same priority. I would argue that this is work-inefficient: P.35.2* must have a lower priority than P.2.2.*.
dragontamer5788
Posts: 201
Joined: Thu Jun 06, 2019 8:05 pm
Full name: Percival Tiglao

Re: ABDADA+ suggestion

Post by dragontamer5788 »

So, if I were to create speculative algorithm, how would it work?

1. Oldest brother must be evaluated to completion (P.1.1.1.1.1, P.2.1.1.1.1, P.35.1.1.1.1). Remember that I'm assuming Depth 5.

2. Instead of spawning all brothers, only spawn 2 or 3 brothers (P.1.2.* through P.1.4.*. "Block" P.5.* through P.35.* due to the increased chance of beta-cutoff).

I would expect an algorithm with this kind of prioritization would have less work wasted. Its not work-efficient compared to Sequential AlphaBeta, but it would be more work efficient than YBW or ABDADA. After all, The entire P.2.* tree could be potentially used to refute P.35.2.*. Why search P.35.2.* when there's such a huge chance of refutation?
dragontamer5788
Posts: 201
Joined: Thu Jun 06, 2019 8:05 pm
Full name: Percival Tiglao

Re: ABDADA+ suggestion

Post by dragontamer5788 »

smatovic wrote: Sat Jun 29, 2019 9:10 amBetween phase 2. and 3. there could be a clause added:

Code: Select all

        If all other sons are already analyzed by other processors, 
        then return to parent node, continue with next move, and set flag for second iteration (phase #3).
Maybe this kind of clause is already implied in 3., but as I undestand it, in 3.
all remaining sons which search has not yet finished (depth reached) has to be searched.

Unfortunately, I have no workstation atm to test this, so what is your opinion?
Sorry for ranting above. I really should have tried to answer your question instead.

-----------

GPUs have 10,000+ SIMD threads available, but chess only has 35-moves per node.

So thread#0 will start searching P.0.* tree (P.0.* now has 1-processor looking at it). P.1.* tree then has 1-processor. Etc. etc. After 35-processors grab nodes, you have 1-processor analyzing P.1 through P.35.

This now completes #1 and #2 conditions. So this triggers condition #3, as far as I can tell. If you want the 9965 other SIMD-cores of your GPU to actually be doing work, you have to go deeper into the tree somehow.
smatovic
Posts: 2658
Joined: Wed Mar 10, 2010 10:18 pm
Location: Hamburg, Germany
Full name: Srdja Matovic

Re: ABDADA+ suggestion

Post by smatovic »

dragontamer5788 wrote: Sat Jun 29, 2019 5:33 pm ABDADA (and YBW) are both speculative algorithms: they do work hoping that the work was not wasted. It would seem to me that the goal of speculative execution is to do two things:

1. Select nodes (and subtrees) that have a high-chance of being necessary work.
2. Avoid nodes (and subtrees) that have a high-chance of being unnecessary.

Lets consider a hypothetical game with 35 available moves per node, that will be evaluated to depth 5. The nodes of the tree are thus labeled P.1.1.1.1.1 through P.35.35.35.35.35, where "P" stands for the root node. Which nodes have the highest chance of being beta-cutoff? And which nodes have the lowest chance?

YBW and ABDADA keenly recognize that the oldest brother (P.1.1.1.1.1, P.2.1.1.1.1.1, ... P.35.1.1.1.1.1) must be evaluated before beta-cutoff can be calculated. So all of these nodes must be evaluated.

However, immediately after these oldest brothers are evaluated, YBW and ABDADA immediately spawn P.2.2.* through P.2.35.*, P.3.2.* through P.3.35.*, etc. etc. And it is clear to me that P.2.2.* is "less speculative" than P.35.2.*. Even with random move ordering, the "youngest brother" has a far higher chance of being beta-cutoff than the "2nd oldest brother".

Consider P.2.2.*. The only condition that this subtree is beta-cutoff is eval(P.2.1) > eval(P.1).

Lets consider P.35.2.*. This subtree is beta-cutoff on the condition eval(P.35.1) > max(eval(P.1), eval(P.2), eval(P.3), eval(P.4), ... eval(P.34). There are 34 "older brothers", all of whom may have a better move than P.35.1.

Under YBW and ABDADA, P.2.2.* and P.35.2.* subtrees have the same priority. I would argue that this is work-inefficient: P.35.2* must have a lower priority than P.2.2.*.
You withheld that ABDADA spawns P.1.1.1.1.1 to P.1.35.* first before it starts
to explore the root nodes.

And considering that a beta-cutoff in chess occur in ~90% on the first move, this
makes sense.

--
Srdja
smatovic
Posts: 2658
Joined: Wed Mar 10, 2010 10:18 pm
Location: Hamburg, Germany
Full name: Srdja Matovic

Re: ABDADA+ suggestion

Post by smatovic »

dragontamer5788 wrote: Sat Jun 29, 2019 6:07 pm So, if I were to create speculative algorithm, how would it work?

1. Oldest brother must be evaluated to completion (P.1.1.1.1.1, P.2.1.1.1.1, P.35.1.1.1.1). Remember that I'm assuming Depth 5.

2. Instead of spawning all brothers, only spawn 2 or 3 brothers (P.1.2.* through P.1.4.*. "Block" P.5.* through P.35.* due to the increased chance of beta-cutoff).

I would expect an algorithm with this kind of prioritization would have less work wasted. Its not work-efficient compared to Sequential AlphaBeta, but it would be more work efficient than YBW or ABDADA. After all, The entire P.2.* tree could be potentially used to refute P.35.2.*. Why search P.35.2.* when there's such a huge chance of refutation?
Considering an effective branching factor of 2 to 3 for modern chess engines,
this makes sense to me.

--
Srdja
smatovic
Posts: 2658
Joined: Wed Mar 10, 2010 10:18 pm
Location: Hamburg, Germany
Full name: Srdja Matovic

Re: ABDADA+ suggestion

Post by smatovic »

dragontamer5788 wrote: Sat Jun 29, 2019 6:16 pm
smatovic wrote: Sat Jun 29, 2019 9:10 amBetween phase 2. and 3. there could be a clause added:

Code: Select all

        If all other sons are already analyzed by other processors, 
        then return to parent node, continue with next move, and set flag for second iteration (phase #3).
Maybe this kind of clause is already implied in 3., but as I undestand it, in 3.
all remaining sons which search has not yet finished (depth reached) has to be searched.

Unfortunately, I have no workstation atm to test this, so what is your opinion?
Sorry for ranting above. I really should have tried to answer your question instead.

-----------

GPUs have 10,000+ SIMD threads available, but chess only has 35-moves per node.

So thread#0 will start searching P.0.* tree (P.0.* now has 1-processor looking at it). P.1.* tree then has 1-processor. Etc. etc. After 35-processors grab nodes, you have 1-processor analyzing P.1 through P.35.

This now completes #1 and #2 conditions. So this triggers condition #3, as far as I can tell. If you want the 9965 other SIMD-cores of your GPU to actually be doing work, you have to go deeper into the tree somehow.
How many gpu threads you need to run to utilize a gpu fully depends on your code,
mainly global memory access which has latencies of hundreds of clock cycles...

AMD Fury X for example has 4096 stream cores, coupled to 256 SIMD units, coupled to 64 Compute Units.
Each Compute Unit is able to handle 40 active Wavefronts, with 64 threads per Wavefront, so 163840 threads in total.

Zeta v099 is with 64 gpu threads per worker able to utilize the VALU of one SIMD unit with ~50%,
so I do not need to run 163840 threads, but some hundred workers are enough.

--
Srdja
smatovic
Posts: 2658
Joined: Wed Mar 10, 2010 10:18 pm
Location: Hamburg, Germany
Full name: Srdja Matovic

Re: ABDADA+ suggestion

Post by smatovic »

smatovic wrote: Sat Jun 29, 2019 6:47 pm
dragontamer5788 wrote: Sat Jun 29, 2019 5:33 pm ABDADA (and YBW) are both speculative algorithms: they do work hoping that the work was not wasted. It would seem to me that the goal of speculative execution is to do two things:

1. Select nodes (and subtrees) that have a high-chance of being necessary work.
2. Avoid nodes (and subtrees) that have a high-chance of being unnecessary.

Lets consider a hypothetical game with 35 available moves per node, that will be evaluated to depth 5. The nodes of the tree are thus labeled P.1.1.1.1.1 through P.35.35.35.35.35, where "P" stands for the root node. Which nodes have the highest chance of being beta-cutoff? And which nodes have the lowest chance?

YBW and ABDADA keenly recognize that the oldest brother (P.1.1.1.1.1, P.2.1.1.1.1.1, ... P.35.1.1.1.1.1) must be evaluated before beta-cutoff can be calculated. So all of these nodes must be evaluated.

However, immediately after these oldest brothers are evaluated, YBW and ABDADA immediately spawn P.2.2.* through P.2.35.*, P.3.2.* through P.3.35.*, etc. etc. And it is clear to me that P.2.2.* is "less speculative" than P.35.2.*. Even with random move ordering, the "youngest brother" has a far higher chance of being beta-cutoff than the "2nd oldest brother".

Consider P.2.2.*. The only condition that this subtree is beta-cutoff is eval(P.2.1) > eval(P.1).

Lets consider P.35.2.*. This subtree is beta-cutoff on the condition eval(P.35.1) > max(eval(P.1), eval(P.2), eval(P.3), eval(P.4), ... eval(P.34). There are 34 "older brothers", all of whom may have a better move than P.35.1.

Under YBW and ABDADA, P.2.2.* and P.35.2.* subtrees have the same priority. I would argue that this is work-inefficient: P.35.2* must have a lower priority than P.2.2.*.
You withheld that ABDADA spawns P.1.1.1.1.1 to P.1.35.* first before it starts
to explore the root nodes.

And considering that a beta-cutoff in chess occur in ~90% on the first move, this
makes sense.

--
Srdja
dragontamer5788 wrote: Sat Jun 29, 2019 5:33 pm ...
Under YBW and ABDADA, P.2.2.* and P.35.2.* subtrees have the same priority. I would argue that this is work-inefficient: P.35.2* must have a lower priority than P.2.2.*.
...
Hmm, maybe I don't get you,
but if you consider move ordering for phase #2 and #3, then P.35.2* has an lower priority than P.2.2.*.

--
Srdja