dragontamer5788 wrote: ↑Thu Jul 04, 2019 4:38 am
1. Are you renting a server with only 32-processors on it? NVidia Volta only has 80 SMs.
I rented a server with a 2-core CPU and a V100.
Afaik, the V100 has 80 SMs (i.e. Compute Units), each with 64 cores divided into
four 16-wide SIMD units. But with Zeta v099 I can run only 2 workers per Compute
Unit efficiently on the Volta and Turing series, whereas Pascal runs 4 workers
efficiently. So my guess is that the integer units are 32 wide on Volta and
Turing, which gives two integer SIMD units per Compute Unit, 160 in total.
Otherwise Zeta v099 would have a serious bottleneck in nps scaling, but none
showed up on the AMD Fury X with its 256 SIMD units.
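To make the unit count explicit, here is a minimal Python sketch of the arithmetic. The 80 SMs and 64 cores per SM are the figures quoted above; the 32-wide integer unit is only my guess, not a documented value, and the function name is made up for illustration.

```python
SMS = 80            # SMs / Compute Units on the V100 (figure from the post)
CORES_PER_SM = 64   # cores per SM (figure from the post)

def simd_units(width):
    """Return (units per SM, units in total) for a given SIMD width."""
    per_sm = CORES_PER_SM // width
    return per_sm, per_sm * SMS

# 16 is the nominal SIMD width, 32 is the guessed integer width.
for width in (16, 32):
    per_sm, total = simd_units(width)
    print(f"{width}-wide: {per_sm} units per SM, {total} in total")
```

With the guessed 32-wide units this gives 2 units per SM, which would match the 2 workers per Compute Unit that run efficiently, and 160 workers overall.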
Yes and no. I access global memory for the TT, the ABDADA TT, and for storing a
move list. Here is an 'empty run' without any memory for the TTs, which shows
pure nps scaling:
# Zeta v099m, start pos, depth 10, Nvidia V100, tt1:0 MB, tt2.abdada:0 MB
### workers #nps #nps speedup #time in s #ttd speedup #relative ttd speedup ###
### 1 90178 1.000000 77.231000 1.000000 1.000000
### 2 189790 2.104615 70.000000 1.103300 1.103300
### 4 378751 4.200038 70.079000 1.102056 0.998873
### 8 764234 8.474728 69.941000 1.104231 1.001973
### 16 1530162 16.968241 69.123000 1.117298 1.011834
### 32 3147821 34.906751 65.940000 1.171231 1.048271
### 64 6353573 70.455909 66.149000 1.167531 0.996840
### 128 12206487 135.359921 66.626000 1.159172 0.992841
### 160 14145328 156.860077 70.993000 1.087868 0.938487
Superlinear speedups may occur because there is some fixed overhead for calling
the GPU kernel at each search depth.
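For anyone who wants to recheck the derived columns, this is a small Python sketch of how they can be recomputed from the raw workers, nps and time values in the table. I am assuming that the relative ttd speedup compares each run with the run in the previous row, which matches the printed numbers; the nps efficiency column (nps speedup divided by workers) is my addition and is not part of Zeta's output.

```python
# (workers, nps, time in s), copied from the table above
runs = [
    (1, 90178, 77.231),
    (2, 189790, 70.000),
    (4, 378751, 70.079),
    (8, 764234, 69.941),
    (16, 1530162, 69.123),
    (32, 3147821, 65.940),
    (64, 6353573, 66.149),
    (128, 12206487, 66.626),
    (160, 14145328, 70.993),
]

def speedups(runs):
    """Recompute the speedup columns relative to the first (1 worker) run."""
    _, base_nps, base_time = runs[0]
    prev_time = base_time
    rows = []
    for workers, nps, seconds in runs:
        nps_speedup = nps / base_nps
        rows.append({
            "workers": workers,
            "nps_speedup": nps_speedup,
            "ttd_speedup": base_time / seconds,      # vs. 1 worker
            "rel_ttd_speedup": prev_time / seconds,  # vs. previous row
            "nps_efficiency": nps_speedup / workers, # > 1.0 is superlinear
        })
        prev_time = seconds
    return rows

rows = speedups(runs)
for r in rows:
    print(f"{r['workers']:4d} {r['nps_speedup']:11.6f} {r['ttd_speedup']:9.6f} "
          f"{r['rel_ttd_speedup']:9.6f} {r['nps_efficiency']:6.3f}")
```

An efficiency above 1.0 marks the superlinear rows; it only drops below 1.0 at 160 workers.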
dragontamer5788 wrote: ↑Thu Jul 04, 2019 4:38 am
3. The "obscure" question: are your L1 caches communicating fast enough for ABDADA to function properly? AMD GPUs do not have coherent caches, so you have to forcibly flush the L1 cache (ex: through a memory barrier) if you actually want different compute-units to see the results of a computation. This would be important for the Transposition Table: if the various compute units are keeping the TT in L1 cache and aren't sharing the data fast enough, maybe that's the problem?
This is new to me. I have 4 global memory barriers and a lot of local ones;
maybe I have to dig deeper into this.
dragontamer5788 wrote: ↑Thu Jul 04, 2019 4:38 am
So 2 "easy" questions, and 1 "hard" one. That's all I can think of for now.
Thanks.
--
Srdja