gpu chess summary 2017

Post by smatovic »

I recently tinkered a bit with GPU chess again,
so here are my current conclusions...

A) One Thread One Board

Running one discrete alpha-beta searcher per thread, with thousands of threads
in total, gives the best overall nodes-per-second performance.

But with thousands of workers it is difficult to get an efficient parallel search
algorithm running.
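
To make this concrete, here is a minimal OpenCL C sketch of the design A mapping. Board and alphabeta() are placeholders and not the actual Zeta code; note that OpenCL C forbids recursion, so a real alpha-beta here has to be iterative with an explicit stack.

/* Design A sketch: one work-item = one board = one complete searcher.
   Board and alphabeta() are placeholders, not the actual Zeta code. */

typedef struct { ulong pieces[4]; int side; } Board;

/* stand-in for the real searcher; OpenCL C has no recursion, so an
   actual alpha-beta would loop over an explicit move stack here */
int alphabeta(Board b, int alpha, int beta, int depth)
{
  return (int)(b.pieces[0] & 0xffUL) - 128;   /* dummy score so the sketch compiles */
}

__kernel void search_roots(__global const Board *boards,
                           __global int *scores,
                           const int depth)
{
  const size_t gid = get_global_id(0);   /* thousands of independent searchers */
  Board b = boards[gid];                 /* private copy, no sharing, no syncing */
  scores[gid] = alphabeta(b, -32000, 32000, depth);
}

Inside one searcher there is no synchronization at all; the hard part is coordinating thousands of them, as described above.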

I have tried different load-balancing approaches for this design: Monte Carlo
AB, LIFO-stack-based parallel processing and multiple alpha-beta windows. The
best I could come up with was a parallel Best-First-Minimax-Search.

Implemented in Zeta v097 and v098, it reached about 1800 Elo (CCRL scale) at
about 5 Mnps on an Nvidia GTX 580.
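
For reference, the core loop of best-first minimax (in the spirit of Korf and Chickering): always expand the leaf of the current principal variation, then back the new values up to the root. Below is a single-worker sketch in plain C; the Node layout and the expand() helper are hypothetical, not Zeta's data structures. In a parallel version, many workers run this select/expand/backup cycle concurrently on a shared tree, which needs per-node locking or atomics on top of this.

/* Single-worker sketch of best-first minimax: follow the PV to its leaf,
   expand that leaf, back the new negamax values up.  Node and expand()
   are hypothetical placeholders. */

typedef struct Node Node;
struct Node {
  Node *parent;
  Node *child;     /* first child */
  Node *sibling;   /* next sibling */
  int   score;     /* negamax score from this node's point of view */
};

/* hypothetical: create n's children and give each a static evaluation;
   returns the number of children created */
extern int expand(Node *n);

static Node *best_child(Node *n)   /* child that minimizes the opponent's score */
{
  Node *best = n->child;
  for (Node *c = n->child; c; c = c->sibling)
    if (c->score < best->score) best = c;
  return best;
}

void bfminimax(Node *root, long iterations)
{
  for (long i = 0; i < iterations; i++) {
    Node *n = root;
    while (n->child)               /* 1. selection: walk down the PV */
      n = best_child(n);
    if (expand(n) == 0)            /* 2. expansion: terminal node; a real
                                         implementation would mark it */
      continue;
    while (n) {                    /* 3. backup: negamax values toward the root */
      if (n->child) n->score = -best_child(n)->score;
      n = n->parent;
    }
  }
}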

B) One SIMD Unit One Board

Coupling threads (for example 64) to work together on the same node during
move generation, move picking and evaluation is possible. But the overhead
of SIMD-friendly computation and of syncing the threads to collect results
leads to nps values that cannot outperform the latest high-end CPUs.
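
A rough OpenCL C sketch of what this coupling looks like: one work-group of 64 work-items handles one board, one work-item per square, and the per-square results are collected with a local-memory reduction. count_moves_from() and Board are hypothetical placeholders; the barriers and the reduction are exactly the syncing overhead mentioned above.

/* Design B sketch: one work-group (64 work-items) works on one board,
   one work-item per square. */

typedef struct { ulong pieces[4]; int side; } Board;   /* placeholder */

/* placeholder: a real version would generate this square's moves */
int count_moves_from(const Board *b, const int square)
{
  return (int)((b->pieces[0] >> square) & 1UL);
}

__kernel __attribute__((reqd_work_group_size(64, 1, 1)))
void count_moves(__global const Board *boards, __global int *movecounts)
{
  __local int partial[64];

  const int lid = get_local_id(0);          /* this lane's square, 0..63 */
  const Board b = boards[get_group_id(0)];  /* whole group shares one board */

  partial[lid] = count_moves_from(&b, lid);
  barrier(CLK_LOCAL_MEM_FENCE);             /* sync before collecting results */

  for (int stride = 32; stride > 0; stride >>= 1) {   /* tree reduction 64 -> 1 */
    if (lid < stride)
      partial[lid] += partial[lid + stride];
    barrier(CLK_LOCAL_MEM_FENCE);
  }

  if (lid == 0)
    movecounts[get_group_id(0)] = partial[0];
}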

Maybe a switch from the bitboard design to single (or half or quarter)
precision would give the desired nps values.

The advantage of this design is that there are only hundreds of workers to
feed a parallel search algorithm with.

With this design, in Zeta v099, I implemented an alpha-beta searcher with
standard search enhancements (ID, TT, MVV-LVA, QS, null move, killer and
countermove heuristics, LMR) and Lazy SMP (randomized move ordering,
communication via a shared hash table).
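
A sketch of what the Lazy SMP side of that could look like in OpenCL C: workers differ only in a seed that jitters their move ordering, and they communicate solely through a transposition table in global memory. TTEntry, the xor-store (the classic lockless-hashing trick) and jitter_score() are illustrative assumptions, not necessarily Zeta's exact layout.

/* Lazy SMP sketch: every worker runs the same alpha-beta searcher; only a
   per-worker seed perturbs the move ordering, and workers share work
   implicitly through a transposition table in global memory. */

typedef struct {
  ulong key;    /* zobrist key xor-ed with data, to detect torn writes */
  ulong data;   /* packed score / depth / flag / best move */
} TTEntry;

void tt_store(__global TTEntry *tt, const ulong mask, const ulong key, const ulong data)
{
  const ulong idx = key & mask;
  tt[idx].data = data;
  tt[idx].key  = key ^ data;            /* torn writes are caught on probe */
}

bool tt_probe(__global const TTEntry *tt, const ulong mask, const ulong key, ulong *data)
{
  const ulong idx = key & mask;
  const ulong d   = tt[idx].data;
  if ((tt[idx].key ^ d) != key)
    return false;                       /* miss, or a torn concurrent write */
  *data = d;
  return true;
}

/* per-worker move ordering: equal scores get a worker-dependent tie-break,
   so the workers walk the tree in slightly different orders */
uint jitter_score(const uint base_score, const uint move, const uint worker_seed)
{
  const uint h = (move * 2654435761u) ^ worker_seed;   /* cheap hash */
  return base_score + (h & 7u);
}

The xor trick means a half-written entry simply fails the key check instead of returning corrupt data, so no locks are needed between the workers.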

A single worker (64 threads) at about 45 Knps on an Nvidia GTX 750 reached
1700 Elo, and 64 workers at about 1.5 Mnps reached 2000 Elo (CCRL scale).

I think that with design B) a 3000+ Elo engine is possible on the GPU, but
without significantly increasing the nps throughput per worker there would be
no gain over current CPU incarnations.

C) Kernel Level Recursion

One requirement of this project was that all GPGPU-capable GPUs are supported,
so I used OpenCL 1.x, which does not support kernel level recursion.

Without this feature I doubt that parallel search algorithms like YBWC can be
implemented; at least I could not figure out how.

So modern GPUs could profit from a more efficient parallel search like YBWC
via kernel level recursion.
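
For the record, this is roughly what kernel level recursion looks like in OpenCL 2.x: device-side enqueue lets a running kernel launch child kernels. A hedged sketch, compiled with -cl-std=CL2.0; SplitPoint and the empty child block are placeholders and this is not taken from any Zeta version.

/* OpenCL 2.x device-side enqueue: a kernel launches child kernels itself.
   With this, a YBWC-style split becomes conceivable: once the eldest
   brother of a node has been searched, the remaining sibling moves are
   farmed out as child launches.  SplitPoint is a placeholder and error
   handling of enqueue_kernel is omitted. */

typedef struct { int alpha; int beta; int depth; int move; } SplitPoint;

__kernel void search_node(__global SplitPoint *sp,
                          __global int *scores,
                          const int n_moves)
{
  /* ... search the first (eldest) move here with a normal searcher ... */

  if (get_global_id(0) == 0 && n_moves > 1) {
    queue_t q = get_default_queue();
    /* one child work-item per remaining sibling move */
    enqueue_kernel(q, CLK_ENQUEUE_FLAGS_NO_WAIT,
                   ndrange_1D((size_t)(n_moves - 1)),
                   ^{ /* child: would search one sibling of sp[0] and
                         write its score into 'scores' */ });
  }
}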

Maybe I will do another run when Nvidia supports OpenCL 2.x.

--
Srdja

PS: I took the Zeta sources offline, but you can pm me if you want a copy.