SPPS - a Simple Parallel Processing Scheme

smatovic · Post by **smatovic** » Sat Jun 18, 2011 5:41 pm

I work currently on a GPU implementation of a Chess Engine and want to introduce a parallel processing scheme i use.

Initial situation:
- tenthousands of parallel threads
- no recursion
- no built-in communication between threads
- little registers amount
- sufficient RAM

I assume a max amount of 128 (maybe 256 in future) possible childs for each node.
So i am able to run 128 threads on one movelist of a chess position in parallel.

Because GPUs have little registers amount (private/local memory) all the movelists have to be stored in GPU-RAM (global memory).

With an array of "search depth * X * 128 * 128 * sizeof(Move)" the gpu is able to run x*128 threads for x chess positions in parallel.

Pseudo Code for 128 threads:

Code: Select all

// pid => Process ID, 0-127

// for each search depth
while &#40;sd >= 0&#41;

    // for each board and a fix search depth
    while &#40;done&#91;sd&#93; < 128 && sd < max_depth&#41; &#123;

        // every of the 128 parallel process uses the same initial board
        board   = globalboards&#91;&#40;sd*128&#41;+done&#91;sd&#93;&#93;;

        // until end of boards
        if (!board&#41;
            break;

        // every of the 128 parallel process uses an individual move
        move    = globalmoves&#91;&#40;sd*128*128&#41;+(&#40;done&#91;sd&#93;)*128&#41;+pid&#93;;

        // only if our process id gets a move
        if (!move&#41;
            continue;

        // domove

        // move up in search tree
        sd++;

        // copy board to global
        globalboards&#91;&#40;sd*128&#41;+pid&#93; = board;

        // generate legal moves 
        moves = genMoves&#40;board&#41;;
        // and store in global
        globalmoves&#91;&#40;sd*128*128&#41; + &#40;pid*128&#41;+i&#93; = move.i; 

    &#125;
    // move down in tree
    done&#91;sd&#93; = 0;
    sd--;
&#125;

smatovic · Post by **smatovic** » Sat Jun 18, 2011 9:22 pm

ups, one line added and one moved.

Code: Select all

// pid => Process ID, 0-127

// for each search depth
while &#40;sd >= 0&#41;

    // for each board and a fix search depth
    while &#40;done&#91;sd&#93; < 128 && sd < max_depth&#41; &#123;

        // every of the 128 parallel process uses the same initial board
        board   = globalboards&#91;&#40;sd*128&#41;+done&#91;sd&#93;&#93;;

        // until end of boards
        if (!board&#41;
            break;

        // every of the 128 parallel process uses an individual move
        move    = globalmoves&#91;&#40;sd*128*128&#41;+(&#40;done&#91;sd&#93;)*128&#41;+pid&#93;;

        // next board in array
        done&#91;sd&#93;++

        // move up in search tree
        sd++;

        // only if our process id gets a move
        if (!move&#41;
            continue;

        // domove

        // copy board to global
        globalboards&#91;&#40;sd*128&#41;+pid&#93; = board;

        // generate legal moves
        moves = genMoves&#40;board&#41;;
        // and store in global
        globalmoves&#91;&#40;sd*128*128&#41; + &#40;pid*128&#41;+i&#93; = move.i;

    &#125;
    // move down in tree
    done&#91;sd&#93; = 0;
    sd--;
&#125;

Sven · Post by **Sven** » Sun Jun 19, 2011 10:45 pm

Could you please describe where you plan to plug in the search itself, i.e. things like move ordering, evaluation, maybe alpha/beta handling, ...? Up to now you have described move generation and parallel processing management.

Sven

smatovic · Post by **smatovic** » Mon Jun 20, 2011 12:41 pm

Could you please describe where you plan to plug in the search itself, i.e. things like move ordering, evaluation, maybe alpha/beta handling, ...? Up to now you have described move generation and parallel processing management.

When i got the alpha-beta-prunning running i will post an update.

--
Srdja

smatovic · Post by **smatovic** » Thu Jun 23, 2011 2:54 am

pseudo code update with negamax-like scores,
it looks somekind of ugly compared to recursion,
but for those who are interested it should give an idea how to use it on a simd-device...

Code: Select all

// pid => Process ID, 0-127

globalboards&#91;sd*128&#93;;
globalscores&#91;sd*128&#93;;
globalmoves&#91;sd*128*128&#93;;


// for each search depth
while &#40;sd >= 0&#41;

    // for each board and a fix search depth
    while &#40;done&#91;sd&#93; < 128 && sd < max_depth&#41; &#123;

        // every of the 128 parallel process uses the same initial board
        board   = globalboards&#91;&#40;sd*128&#41;+done&#91;sd&#93;&#93;;

        // until end of boards
        if (!board&#41;
            break;

        // every of the 128 parallel process uses an individual move
        move    = globalmoves&#91;&#40;sd*128*128&#41;+(&#40;done&#91;sd&#93;)*128&#41;+pid&#93;;

        // next board in array
        done&#91;sd&#93;++

        // move up in search tree
        sd++;

        // switch site
        stm = !stm;

        // only if our process id gets a move
        if (!move&#41;
            continue;

        // domove

        // copy board to global
        globalboards&#91;&#40;sd*128&#41;+pid&#93; = board;

        // generate legal moves
        moves = genMoves&#40;board&#41;;
        // and store in global
        globalmoves&#91;&#40;sd*128*128&#41; + &#40;pid*128&#41;+i&#93; = moves.i;

        // do global negamax
        if &#40;sd == max_depth&#41;
            for moves.count do i
                atom_max&#40;&globalscores&#91;&#40;sd-1&#41;*128+pid&#93;, -eval&#40;board, moves.i&#41; );
    &#125;
    // move down in tree
    done&#91;sd&#93; = 0;
    sd--;
    // switch site
    stm = !stm;
    // do global negamax
    if &#40;sd > 0&#41;
        atom_max&#40;&globalscores&#91;&#40;sd-1&#41;*128+done&#91;sd&#93;&#93;, -globalscores&#91;&#40;sd&#41;*128+pid&#93; );

&#125;

smatovic · Post by **smatovic** » Tue Jun 28, 2011 5:12 pm

Here a simplified version of the DPPS i use.

Code: Select all

    // for every search depth
    while &#40;sd >= 0&#41; &#123;

        // and for each move/board with a finite search depth
        while &#40;movecounter&#91;sd&#93; > 0 && sd < max_depth&#41; &#123;

            decrease&#40;movecounter&#91;sd&#93;);
            
            // get an board for Process
            boardindex  = getBoardIndex&#40;);
            // get an move for process
            moveindex   = getMoveIndex&#40;);

            // copy board and move from global to local
            board = getBoard&#40;boardindex&#41;;
            move  = getMove&#40;moveindex&#41;;

            // move up in search tree
            sd++;
            // switch site to move
            stm = !stm;

            // only if the process got work todo
            if (!board || !move&#41;
                continue;

            domove&#40;board, move&#41;;
            moves = genMoves&#40;board&#41;;
            increase&#40;movecounter&#91;sd&#93;, moves.count&#41;;

        &#125;

        // move down in search tree
        // "clear all counters"
        // decrease search depth
        sd--;
        // switch site to move
        stm = !stm;
    &#125;

smatovic · Post by **smatovic** » Mon Jul 04, 2011 9:01 pm

fyi:

i managed to get a non-recursive, negamax-like search algorithm running on one SIMD Unit of a GPU. But GPUs consist of tens of these SIMD Units and a effective sync between them is simply not possible.

So the idea to pack the hole computation on the GPU is a stopper, somehow the CPU has to be involved to control the flow.

--
Srdja

smatovic · Post by **smatovic** » Mon Jul 04, 2011 10:09 pm

Here some roughly calculated numbers:

Code: Select all

SPPS, one SIMD Unit, 128 Threads&#58;      500 Knps

DPPS, one CPU 2 Ghz Core&#58;             5000 Knps

DPPS, one SIMD Unit  1 Thread          100 Knps
DPPS, one SIMD Unit  2 Threads         200 Knps
DPPS, one SIMD Unit  4 Threads         400 Knps
DPPS, one SIMD Unit  8 Threads         800 Knps
DPPS, one SIMD Unit 16 Threads        1000 Knps
DPPS, one SIMD Unit 32 Threads        1200 Knps
DPPS, one SIMD Unit 64 Threads        1200 Knps

--
Srdja

SPPS - a Simple Parallel Processing Scheme

SPPS - a Simple Parallel Processing Scheme

Re: SPPS - a Simple Parallel Processing Scheme

Re: SPPS - a Simple Parallel Processing Scheme

Re: SPPS - a Simple Parallel Processing Scheme

Re: SPPS - a Simple Parallel Processing Scheme

Re: DPPS - a Dynamic Parallel Processing Scheme

Re: GPUs and unsynched SIMD Units

Re: - and some numbers