ABDADA+ suggestion

dragontamer5788 · Post by **dragontamer5788** » Sun Jun 30, 2019 8:16 am

@ Srdja, here's the ABDADA paper I'm using: https://dl.acm.org/citation.cfm?id=228345

Hmm, maybe I don't get you,
but if you consider move ordering for phase #2 and #3, then P.35.2* has an lower priority than P.2.2.*.

From my reading of the paper, "iteration = 0" will send a thread to P.35.*. This thread will immediately evaluate P.35.1.*, and after that, P.35.2.* will be evaluated.

Thread #0 enters P.1, sets the P.1.exclusive. When Thread#1 enters P.1, it will read "ON_EVALUATION" due to the P.1.exclusive flag being set. Thread #1 then enters P.2 and sets it as exclusive. Thread #2 gets "ON_EVALUATION" from both P.1 and P.2, so it goes to P.3. Etc. etc. Thread#34 will be the one to enter P.35.*.

After thread#34, all remaining threads (#35 and above) will get "ON_EVALUATION" for all 35 nodes, and therefore leave phase#1 (iteration=0). As such, all these threads will enter P.1.* tree and "focus" on this path. Which is a form of priority I guess.

But thread #34 will stay in P.35.* and eventually have a high chance of wasting work (as it evaluates P.35.2, and P.35.3).

smatovic · Post by **smatovic** » Sun Jun 30, 2019 9:43 am

dragontamer5788 wrote: ↑Sun Jun 30, 2019 8:16 am @ Srdja, here's the ABDADA paper I'm using: https://dl.acm.org/citation.cfm?id=228345

Hmm, maybe I don't get you,
but if you consider move ordering for phase #2 and #3, then P.35.2* has an lower priority than P.2.2.*.
From my reading of the paper, "iteration = 0" will send a thread to P.35.*. This thread will immediately evaluate P.35.1.*, and after that, P.35.2.* will be evaluated.

Thread #0 enters P.1, sets the P.1.exclusive. When Thread#1 enters P.1, it will read "ON_EVALUATION" due to the P.1.exclusive flag being set. Thread #1 then enters P.2 and sets it as exclusive. Thread #2 gets "ON_EVALUATION" from both P.1 and P.2, so it goes to P.3. Etc. etc. Thread#34 will be the one to enter P.35.*.

After thread#34, all remaining threads (#35 and above) will get "ON_EVALUATION" for all 35 nodes, and therefore leave phase#1 (iteration=0). As such, all these threads will enter P.1.* tree and "focus" on this path. Which is a form of priority I guess.

But thread #34 will stay in P.35.* and eventually have a high chance of wasting work (as it evaluates P.35.2, and P.35.3).

Yes, but nobody claimed that it is an perfect scheme...

--
Srdja

dragontamer5788 · Post by **dragontamer5788** » Sun Jun 30, 2019 8:21 pm

smatovic wrote: ↑Sun Jun 30, 2019 9:43 amYes, but nobody claimed that it is an perfect scheme...

--
Srdja

Ah yes, but we're trying to improve the scheme in this topic, yes?

<Between Phase 2 and 3>
If all other sons are already analyzed by other processors,
then return to parent node, continue with next move, and set flag for second iteration (phase #3).

I believe this is your proposal. This rule would send more threads to P35 than classic ABDADA. I think the opposite should happen: fewer threads should be sent to P35, and more threads should favor P1.

smatovic · Post by **smatovic** » Sun Jun 30, 2019 8:52 pm

dragontamer5788 wrote: ↑Sun Jun 30, 2019 8:21 pm
smatovic wrote: ↑Sun Jun 30, 2019 9:43 amYes, but nobody claimed that it is an perfect scheme...

--
Srdja
Ah yes, but we're trying to improve the scheme in this topic, yes?

Touche.

dragontamer5788 wrote: ↑Sun Jun 30, 2019 8:21 pm
<Between Phase 2 and 3>
If all other sons are already analyzed by other processors,
then return to parent node, continue with next move, and set flag for second iteration (phase #3).
I believe this is your proposal. This rule would send more threads to P35 than classic ABDADA. I think the opposite should happen: fewer threads should be sent to P35, and more threads should favor P1.

I disagree.

My guess is that in the middle of the game tree too many processors search
during #3 the same node, so it is better to send some processors back to parent
nodes to be able to evaluate the more interesting older brothers first.

It is not much work to implement and test this, but like I said, I have no gpu
workstation atm. Maybe I will rent something on Amazon AWS or Google Cloud to
figure out where my ABDADA bottleneck is...

--
Srdja

smatovic · Post by **smatovic** » Thu Jul 04, 2019 1:05 am

Okay, I cleaned up my code, and tried my own + suggestion, it has no effect,
at least the scaling seems now stable with up to 32 workers, but I am still
clueless why scaling stalls with more than 32 workers...

Code: Select all

# Zeta v099m, start pos, depth 12, Nvidia V100, tt: 2048 MB, ttabdada: 1536 MB
### workers #nps     #nps speedup  #time in s  #ttd speedup  #relative ttd speedup ###
### 1       86646     1.000000      156.913000  1.000000      1.000000 
### 2       180349    2.081446      64.248000   2.442302      2.442302 
### 4       348558    4.022782      36.207000   4.333775      1.774464 
### 8       704125    8.126457      25.372000   6.184495      1.427046 
### 16      1371378   15.827367     14.236000   11.022267     1.782242 
### 32      2709332   31.268980     9.545000    16.439288     1.491461 
### 64      5482871   63.278986     9.315000    16.845196     1.024691 
### 128     10408893  120.131258    9.321000    16.834353     0.999356 
### 160     12152199  140.251125    10.590000   14.817092     0.880170

According to ABDADA papers, 32 processors reached x16 speedups for chess,
but the stated values of x70 speedup for 128 processors seems odd...

--
Srdja

dragontamer5788 · Post by **dragontamer5788** » Thu Jul 04, 2019 4:38 am

smatovic wrote: ↑Thu Jul 04, 2019 1:05 am Okay, I cleaned up my code, and tried my own + suggestion, it has no effect,
at least the scaling seems now stable with up to 32 workers, but I am still
clueless why scaling stalls with more than 32 workers...
Code: Select all
# Zeta v099m, start pos, depth 12, Nvidia V100, tt: 2048 MB, ttabdada: 1536 MB
### workers #nps     #nps speedup  #time in s  #ttd speedup  #relative ttd speedup ###
### 1       86646     1.000000      156.913000  1.000000      1.000000 
### 2       180349    2.081446      64.248000   2.442302      2.442302 
### 4       348558    4.022782      36.207000   4.333775      1.774464 
### 8       704125    8.126457      25.372000   6.184495      1.427046 
### 16      1371378   15.827367     14.236000   11.022267     1.782242 
### 32      2709332   31.268980     9.545000    16.439288     1.491461 
### 64      5482871   63.278986     9.315000    16.845196     1.024691 
### 128     10408893  120.131258    9.321000    16.834353     0.999356 
### 160     12152199  140.251125    10.590000   14.817092     0.880170
According to ABDADA papers, 32 processors reached x16 speedups for chess,
but the stated values of x70 speedup for 128 processors seems odd...

--
Srdja

Very interesting results.

1. Are you renting a server with only 32-processors on it? NVidia Volta only has 80 SMs.

2. Are you memory-bound?

3. The "obscure" question: are your L1 caches communicating fast enough for ABDADA to function properly? AMD GPUs do not have coherent caches, so you have to forcibly flush the L1 cache (ex: through a memory barrier) if you actually want different compute-units to see the results of a computation. This would be important for the Transposition Table: if the various compute units are keeping the TT in L1 cache and aren't sharing the data fast enough, maybe that's the problem?

So 2 "easy" questions, and 1 "hard" one. That's all I can think of for now.

smatovic · Post by **smatovic** » Thu Jul 04, 2019 3:09 pm

dragontamer5788 wrote: ↑Thu Jul 04, 2019 4:38 am 1. Are you renting a server with only 32-processors on it? NVidia Volta only has 80 SMs.

I did rent a 2core cpu with V100.

Afaik, the V100 has 80 SMs resp. Compute Units, each with 64 cores divided into
four 16 wide SIMD units. But I can run only 2 workers per Compute Unit efficient
on Volta and Turing series with Zeta v099, contrary to Pascal, which is able to
run 4 workers efficient. So my guess is that the integer units are 32 wide on
Volta and Turing, so two integer SIMD units per Compute Unit, 160 in total.

Otherwise Zeta v099 would have some serious bottleneck for nps scaling, but this
did not show up on AMD Fury X with 256 SIMD units.

dragontamer5788 wrote: ↑Thu Jul 04, 2019 4:38 am 2. Are you memory-bound?

Yes and no. I access global memory for TT, ABDADA TT, and storing a move list.
Here an 'empty run' without any memory for TTs that shows pure nps scaling:

Code: Select all

# Zeta v099m, start pos, depth 10, Nvidia V100, tt1:0 MB, tt2.abdada:0 MB
### workers	#nps		#nps speedup	#time in s	#ttd speedup	#relative ttd speedup ###
### 1		90178	  	1.000000	  77.231000	  1.000000	1.000000 
### 2		189790	  	2.104615	  70.000000	  1.103300    	1.103300 
### 4		378751	  	4.200038	  70.079000	  1.102056	0.998873 
### 8		764234	  	8.474728	  69.941000	  1.104231	1.001973 
### 16		1530162	  	16.968241	  69.123000	  1.117298	1.011834 
### 32		3147821	  	34.906751	  65.940000	  1.171231	1.048271 
### 64		6353573	  	70.455909	  66.149000	  1.167531	0.996840 
### 128		12206487	135.359921	  66.626000	  1.159172	0.992841 
### 160		14145328	156.860077	  70.993000	  1.087868	0.938487

Superlinear speedups may occur cos there is some fix overhead for calling the
gpu kernel for each search depth.

dragontamer5788 wrote: ↑Thu Jul 04, 2019 4:38 am 3. The "obscure" question: are your L1 caches communicating fast enough for ABDADA to function properly? AMD GPUs do not have coherent caches, so you have to forcibly flush the L1 cache (ex: through a memory barrier) if you actually want different compute-units to see the results of a computation. This would be important for the Transposition Table: if the various compute units are keeping the TT in L1 cache and aren't sharing the data fast enough, maybe that's the problem?

This is new to me, I have 4 global memory barriers and a lot of local ones,
maybe I have to dig deeper into this.

dragontamer5788 wrote: ↑Thu Jul 04, 2019 4:38 am So 2 "easy" questions, and 1 "hard" one. That's all I can think of for now.

Thanks.

--
Srdja

smatovic · Post by **smatovic** » Thu Jul 04, 2019 3:21 pm

Okay, I locked only nodes with depth>1 during ABDADA phase #2, iteration 0,
changed to detph>0 and results look better...

Code: Select all

################################################################################
# Zeta 099m, startposition, depth 12, best of 4 runs, Nvidia V100:
#
### workers #nps          #nps speedup   #time in s   #ttd speedup   #relative ttd speedup ###
### 1       86827         1.000000       156.586000   1.000000       1.000000 
### 2       180282        2.076336       55.749000    2.808768       2.808768 
### 4       356910        4.110588       35.564000    4.402936       1.567568 
### 8       704741        8.116611       19.637000    7.974029       1.811071 
### 16      1385758       15.959989      14.583000    10.737571      1.346568 
### 32      2786039       32.087242      11.124000    14.076411      1.310949 
### 64      5460849       62.893443      8.838000     17.717357      1.258656 
### 128     10235993      117.889516     7.377000     21.226244      1.198048 
### 160     11639290      134.051505     7.202000     21.742016      1.024299

There is for sure still something to improve, but I am ok with the current
results, here an smp benchmark on Hyatt24 positions, single run:

Code: Select all

# Zeta v099m, Hyatt24, depth 10, Nvidia V100, tt1:2048 MB, tt2.abdada:1536 MB
### workers   #ttd speedup    #rel ttd speedup ###
### 1         1.000000        1.000000 
### 2         1.895372        1.895372 
### 4         3.579926        1.888772 
### 8         5.810123        1.622973 
### 16        7.767453        1.336883 
### 32        10.286573       1.324317 
### 64        10.895601       1.059206 
### 128       12.373947       1.135683 
### 160       11.523171       0.931245

Next step would be to measure Elo instead of time to depth speedups...

--
Srdja

ABDADA+ suggestion

Re: ABDADA+ suggestion

Re: ABDADA+ suggestion

Re: ABDADA+ suggestion

Re: ABDADA+ suggestion

Re: ABDADA+ suggestion

Re: ABDADA+ suggestion

Re: ABDADA+ suggestion

Re: ABDADA+ suggestion