I work currently on a GPU implementation of a Chess Engine and want to introduce a parallel processing scheme i use.
Initial situation:
- tenthousands of parallel threads
- no recursion
- no built-in communication between threads
- little registers amount
- sufficient RAM
I assume a max amount of 128 (maybe 256 in future) possible childs for each node.
So i am able to run 128 threads on one movelist of a chess position in parallel.
Because GPUs have little registers amount (private/local memory) all the movelists have to be stored in GPU-RAM (global memory).
With an array of "search depth * X * 128 * 128 * sizeof(Move)" the gpu is able to run x*128 threads for x chess positions in parallel.
// pid => Process ID, 0-127
// for each search depth
while (sd >= 0)
// for each board and a fix search depth
while (done[sd] < 128 && sd < max_depth) {
// every of the 128 parallel process uses the same initial board
board = globalboards[(sd*128)+done[sd]];
// until end of boards
if (!board)
break;
// every of the 128 parallel process uses an individual move
move = globalmoves[(sd*128*128)+((done[sd])*128)+pid];
// only if our process id gets a move
if (!move)
continue;
// domove
// move up in search tree
sd++;
// copy board to global
globalboards[(sd*128)+pid] = board;
// generate legal moves
moves = genMoves(board);
// and store in global
globalmoves[(sd*128*128) + (pid*128)+i] = move.i;
}
// move down in tree
done[sd] = 0;
sd--;
}
// pid => Process ID, 0-127
// for each search depth
while (sd >= 0)
// for each board and a fix search depth
while (done[sd] < 128 && sd < max_depth) {
// every of the 128 parallel process uses the same initial board
board = globalboards[(sd*128)+done[sd]];
// until end of boards
if (!board)
break;
// every of the 128 parallel process uses an individual move
move = globalmoves[(sd*128*128)+((done[sd])*128)+pid];
// next board in array
done[sd]++
// move up in search tree
sd++;
// only if our process id gets a move
if (!move)
continue;
// domove
// copy board to global
globalboards[(sd*128)+pid] = board;
// generate legal moves
moves = genMoves(board);
// and store in global
globalmoves[(sd*128*128) + (pid*128)+i] = move.i;
}
// move down in tree
done[sd] = 0;
sd--;
}
Could you please describe where you plan to plug in the search itself, i.e. things like move ordering, evaluation, maybe alpha/beta handling, ...? Up to now you have described move generation and parallel processing management.
Could you please describe where you plan to plug in the search itself, i.e. things like move ordering, evaluation, maybe alpha/beta handling, ...? Up to now you have described move generation and parallel processing management.
When i got the alpha-beta-prunning running i will post an update.
pseudo code update with negamax-like scores,
it looks somekind of ugly compared to recursion,
but for those who are interested it should give an idea how to use it on a simd-device...
// pid => Process ID, 0-127
globalboards[sd*128];
globalscores[sd*128];
globalmoves[sd*128*128];
// for each search depth
while (sd >= 0)
// for each board and a fix search depth
while (done[sd] < 128 && sd < max_depth) {
// every of the 128 parallel process uses the same initial board
board = globalboards[(sd*128)+done[sd]];
// until end of boards
if (!board)
break;
// every of the 128 parallel process uses an individual move
move = globalmoves[(sd*128*128)+((done[sd])*128)+pid];
// next board in array
done[sd]++
// move up in search tree
sd++;
// switch site
stm = !stm;
// only if our process id gets a move
if (!move)
continue;
// domove
// copy board to global
globalboards[(sd*128)+pid] = board;
// generate legal moves
moves = genMoves(board);
// and store in global
globalmoves[(sd*128*128) + (pid*128)+i] = moves.i;
// do global negamax
if (sd == max_depth)
for moves.count do i
atom_max(&globalscores[(sd-1)*128+pid], -eval(board, moves.i) );
}
// move down in tree
done[sd] = 0;
sd--;
// switch site
stm = !stm;
// do global negamax
if (sd > 0)
atom_max(&globalscores[(sd-1)*128+done[sd]], -globalscores[(sd)*128+pid] );
}
// for every search depth
while (sd >= 0) {
// and for each move/board with a finite search depth
while (movecounter[sd] > 0 && sd < max_depth) {
decrease(movecounter[sd]);
// get an board for Process
boardindex = getBoardIndex();
// get an move for process
moveindex = getMoveIndex();
// copy board and move from global to local
board = getBoard(boardindex);
move = getMove(moveindex);
// move up in search tree
sd++;
// switch site to move
stm = !stm;
// only if the process got work todo
if (!board || !move)
continue;
domove(board, move);
moves = genMoves(board);
increase(movecounter[sd], moves.count);
}
// move down in search tree
// "clear all counters"
// decrease search depth
sd--;
// switch site to move
stm = !stm;
}
i managed to get a non-recursive, negamax-like search algorithm running on one SIMD Unit of a GPU. But GPUs consist of tens of these SIMD Units and a effective sync between them is simply not possible.
So the idea to pack the hole computation on the GPU is a stopper, somehow the CPU has to be involved to control the flow.
SPPS, one SIMD Unit, 128 Threads: 500 Knps
DPPS, one CPU 2 Ghz Core: 5000 Knps
DPPS, one SIMD Unit 1 Thread 100 Knps
DPPS, one SIMD Unit 2 Threads 200 Knps
DPPS, one SIMD Unit 4 Threads 400 Knps
DPPS, one SIMD Unit 8 Threads 800 Knps
DPPS, one SIMD Unit 16 Threads 1000 Knps
DPPS, one SIMD Unit 32 Threads 1200 Knps
DPPS, one SIMD Unit 64 Threads 1200 Knps