hgm wrote: ↑Thu Jul 11, 2019 11:44 pm
I don't know in which order the threads will visit the tree nodes, but if it is anything like a recursive walk of a sub-tree, it would be faster and more compact to put the moves on a stack.
Bingo. There will absolutely be a local stack for exactly this reason: each physical thread gets a thread-private stack, so from a single-threaded perspective everything is pushing to and popping from that local stack.
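For illustration, here's a minimal sketch of what such a thread-private stack could look like (the names and the CUDA-style syntax are my own assumptions, not finished code; on AMD hardware the HIP equivalent is essentially identical):

```
// Hypothetical per-thread stack of search tasks. Deliberately tiny, so that
// overflow forces work to spill into the global structure described below.
#define LOCAL_STACK_SIZE 16

struct SearchTask {
    unsigned int nodeIdx;     // index of the position in some global node pool
    int depthRemaining;       // plies left to search below this node
};

struct LocalStack {
    SearchTask items[LOCAL_STACK_SIZE];
    int top = 0;              // number of valid entries

    __device__ bool push(const SearchTask &t) {
        if (top >= LOCAL_STACK_SIZE) return false;  // full: offload to the global lists
        items[top++] = t;
        return true;
    }

    __device__ bool pop(SearchTask &out) {
        if (top == 0) return false;                 // empty: try to steal global work
        out = items[--top];
        return true;
    }
};
```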
----------
However: per-thread private stacks mean that Thread#1 might get all the work while Thread#16384 gets no work at all. Ex: Thread#1 picks up AlphaBeta(root-node); without shared data, threads #2 through #16384 won't ever be able to "help out". These stacks must communicate through some kind of global data-structure.
My current plan is to use linked lists as the global data-structure. There will be eight linked lists (each with a spinlock for synchronization): one for DepthRemaining0 nodes, one for DepthRemaining1, DepthRemaining2, ... up to DepthRemaining7+ nodes. I was considering a concurrent GPU queue, but alas, that looks like far more effort, and the eight linked lists would effectively perform the same task.
So when Thread#1 "runs out of space" (which should happen quickly: the local stack will be purposefully small), Thread#1 will offload some of its tasks to the global array of 8 linked lists. Other threads will prefer to "help out" by grabbing the lowest-depth tasks (ex: Depth 0) first, but they'll grab Depth 7+ tasks if work runs out.
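To pin down what I mean, here's a hedged sketch of those depth-bucketed global lists (again my own naming and CUDA-style syntax, reusing the SearchTask struct from the sketch above; an illustration of the idea, not real code):

```
// Bucket i holds tasks with i plies remaining; the last bucket holds 7+.
// Each bucket is an intrusive singly-linked list guarded by a spinlock.
#define NUM_BUCKETS 8

struct WorkNode {
    SearchTask task;
    WorkNode  *next;
};

struct WorkBucket {
    int       lock;    // 0 = free, 1 = held
    WorkNode *head;
};

__device__ WorkBucket g_buckets[NUM_BUCKETS];

__device__ void lockBucket(WorkBucket *b) {
    // Plain test-and-set spinlock. In practice you'd want a single lane per
    // wavefront doing this, to avoid intra-wavefront livelock.
    while (atomicCAS(&b->lock, 0, 1) != 0) { /* spin */ }
    __threadfence();                       // acquire: see the previous owner's writes
}

__device__ void unlockBucket(WorkBucket *b) {
    __threadfence();                       // release: publish our writes first
    atomicExch(&b->lock, 0);
}

// Offload one task when the local stack overflows.
__device__ void offload(WorkNode *n) {
    int d = n->task.depthRemaining;
    WorkBucket *b = &g_buckets[d < NUM_BUCKETS - 1 ? d : NUM_BUCKETS - 1];
    lockBucket(b);
    n->next = b->head;
    b->head = n;
    unlockBucket(b);
}

// Steal work: prefer the shallowest (Depth 0) bucket, fall back to deeper ones.
__device__ WorkNode *steal() {
    for (int d = 0; d < NUM_BUCKETS; d++) {
        WorkBucket *b = &g_buckets[d];
        if (b->head == nullptr) continue;  // cheap racy peek before taking the lock
        lockBucket(b);
        WorkNode *n = b->head;
        if (n) b->head = n->next;
        unlockBucket(b);
        if (n) return n;
    }
    return nullptr;                        // no global work anywhere
}
```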
By keeping the local-stack small, I will ensure good balancing of work across all 16384 physical SIMD-threads of my GPU. However, making the local-stack bigger will reduce the amount of global-synchronization (spinlocks are very, very slow on a GPU). I'll likely have to tune the size for performance.
If you divide up memory in chunks of, say, 512 moves, you could grow a stack in each of those, and only allocate a new chunk when there is 'serious danger' that the remaining empty space in the chunk is not enough to hold the list. The number of moves will never increase from 32 to 200 in a single move; it must be easy to guess an upper limit for the number of moves in a (grand-)daughter with high confidence, e.g. by adding 16. So if you currently have 48 moves, you allocate a new 512-move chunk if the remaining space is less than 64, and otherwise just grow the stack in the current chunk. If in 0.1% of the cases you would overrun the remaining free space, you could abort, allocate a new chunk, and redo the generation there.
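(Restating that quoted heuristic in code, purely for clarity; this is my reading of the suggestion, using the numbers from the quote. My actual plan is a bit different, as described below.)

```
// The suggested chunk-growth heuristic: keep growing the move stack inside
// the current 512-move chunk, and only switch to a fresh chunk when the
// remaining headroom looks dangerously small.
#define CHUNK_MOVES   512
#define SAFETY_MARGIN 16   // a daughter rarely has more than (current + 16) moves

// True if a new chunk should be allocated before generating the next move
// list, given the free slots left in the current chunk and the move count
// of the current position.
__device__ bool needNewChunk(int freeSlotsInChunk, int currentMoveCount) {
    int expectedUpperBound = currentMoveCount + SAFETY_MARGIN;
    return freeSlotsInChunk < expectedUpperBound;   // e.g. 48 moves -> need >= 64 free
}
```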
I've got a fixed-size SIMD-malloc routine figured out for GPUs, based on bitmasks. It turns out that when you have 16384 threads, you can perform a parallel malloc very, very quickly by scanning the bitmask array.
Well, I'm kidding a little bit. But not really. The wavefront size on AMD GPUs is 64 threads (at a minimum, all my code has 64 SIMD threads in flight at any given time). Those 64 threads can each read/write 32 bits at a time, so we're talking 2048 bits scanned per clock-tick per wavefront (with 256 wavefronts working in parallel on my particular GPU). Synchronization is slow, but atomic operations (such as atomic-AND and atomic-OR) are very fast on a GPU. Protect the malloc/free routines with a memory barrier (for proper synchronization) and bam: parallel, GPU-efficient, global fixed-size mallocs.
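Here's a hedged sketch of the kind of bitmask allocator I have in mind (the specific names, sizes and CUDA-style atomics are my own illustration; on AMD the HIP/OpenCL equivalents are analogous):

```
// One bit per 64-byte chunk; a set bit means "in use". Allocation scans 32
// chunks per 32-bit word, tries to claim a free bit with atomicOr, and
// retries if another thread won the race.
#define NUM_CHUNKS   (1u << 20)                  // ~1 million chunks
#define CHUNK_BYTES  64u                         // 32 moves * 2 bytes
#define NUM_WORDS    (NUM_CHUNKS / 32u)          // 32K words of bitmask = 128 kB

__device__ unsigned int  g_bitmask[NUM_WORDS];   // 1 bit per chunk
__device__ unsigned char g_heap[NUM_CHUNKS * CHUNK_BYTES];

__device__ unsigned char *chunkPtr(unsigned int chunk) {
    return &g_heap[(size_t)chunk * CHUNK_BYTES];
}

// Returns a chunk index, or 0xFFFFFFFF if the heap is exhausted.
__device__ unsigned int chunkAlloc(unsigned int startWord) {
    for (unsigned int i = 0; i < NUM_WORDS; i++) {
        unsigned int w = (startWord + i) % NUM_WORDS;  // randomized start point
        unsigned int bits = g_bitmask[w];
        while (bits != 0xFFFFFFFFu) {                  // word still has a free bit
            unsigned int bit = __ffs(~bits) - 1;       // lowest clear bit
            unsigned int old = atomicOr(&g_bitmask[w], 1u << bit);
            if ((old & (1u << bit)) == 0) {            // we claimed it
                __threadfence();                       // order the claim before first use
                return w * 32 + bit;
            }
            bits = old | (1u << bit);                  // lost the race; try the next bit
        }
    }
    return 0xFFFFFFFFu;
}

__device__ void chunkFree(unsigned int chunk) {
    __threadfence();                                   // publish final writes to the chunk
    atomicAnd(&g_bitmask[chunk / 32], ~(1u << (chunk % 32)));
}
```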
Mallocs are made slightly more efficient by giving the wavefronts a random starting location every time they call malloc(), so there are fewer synchronization issues (you don't want all 256 parallel wavefronts hitting the same starting memory location at the same time, every time).
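One possible way to pick that starting point (my own choice of hash, purely illustrative, using NUM_WORDS from the sketch above):

```
// Scatter wavefronts across the bitmask: derive a pseudo-random starting word
// from the wavefront ID and a per-call counter, so repeated mallocs don't all
// hammer word 0.
__device__ unsigned int g_allocTick;   // bumped once per allocation

__device__ unsigned int randomStartWord(unsigned int wavefrontId) {
    unsigned int x = wavefrontId * 2654435761u + atomicAdd(&g_allocTick, 1u);
    x ^= x >> 16;                      // cheap integer hash
    return x % NUM_WORDS;              // pass this to chunkAlloc()
}
```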
1-million chunks x 64 bytes per chunk (32 moves per chunk) == 64MB heap. The bitmask array is only 128kB (i.e. 1-million bits). A wavefront steps through the entire bitmask in just 512 clock ticks (worst case). Because it's a "wavefront malloc", I'm allocating 64 chunks per scan, so the code benefits from economies of scale.
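Working the numbers, just to show the arithmetic holds:

```
// Just the arithmetic from the paragraph above, no new assumptions.
constexpr unsigned long long kChunks      = 1ull << 20;      // ~1 million chunks
constexpr unsigned long long kHeapBytes   = kChunks * 64;    // 64 bytes per chunk
constexpr unsigned long long kMaskBytes   = kChunks / 8;     // one bit per chunk
constexpr unsigned long long kBitsPerTick = 64 * 32;         // 64 lanes x 32 bits each
constexpr unsigned long long kWorstScan   = kChunks / kBitsPerTick;

static_assert(kHeapBytes == 64ull * 1024 * 1024, "64 MB heap");
static_assert(kMaskBytes == 128ull * 1024,       "128 kB bitmask");
static_assert(kWorstScan == 512,                 "512-tick worst-case sweep");
```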