SPPS - a Simple Parallel Processing Scheme

Discussion of chess software programming and technical issues.

Moderators: hgm, Rebel, chrisw

smatovic
Posts: 2639
Joined: Wed Mar 10, 2010 10:18 pm
Location: Hamburg, Germany
Full name: Srdja Matovic

SPPS - a Simple Parallel Processing Scheme

Post by smatovic »

I work currently on a GPU implementation of a Chess Engine and want to introduce a parallel processing scheme i use.

Initial situation:
- tenthousands of parallel threads
- no recursion
- no built-in communication between threads
- little registers amount
- sufficient RAM

I assume a max amount of 128 (maybe 256 in future) possible childs for each node.
So i am able to run 128 threads on one movelist of a chess position in parallel.

Because GPUs have little registers amount (private/local memory) all the movelists have to be stored in GPU-RAM (global memory).

With an array of "search depth * X * 128 * 128 * sizeof(Move)" the gpu is able to run x*128 threads for x chess positions in parallel.

Pseudo Code for 128 threads:

Code: Select all

// pid => Process ID, 0-127

// for each search depth
while (sd >= 0)

    // for each board and a fix search depth
    while &#40;done&#91;sd&#93; < 128 && sd < max_depth&#41; &#123;

        // every of the 128 parallel process uses the same initial board
        board   = globalboards&#91;&#40;sd*128&#41;+done&#91;sd&#93;&#93;;

        // until end of boards
        if (!board&#41;
            break;

        // every of the 128 parallel process uses an individual move
        move    = globalmoves&#91;&#40;sd*128*128&#41;+(&#40;done&#91;sd&#93;)*128&#41;+pid&#93;;

        // only if our process id gets a move
        if (!move&#41;
            continue;

        // domove

        // move up in search tree
        sd++;

        // copy board to global
        globalboards&#91;&#40;sd*128&#41;+pid&#93; = board;

        // generate legal moves 
        moves = genMoves&#40;board&#41;;
        // and store in global
        globalmoves&#91;&#40;sd*128*128&#41; + &#40;pid*128&#41;+i&#93; = move.i; 

    &#125;
    // move down in tree
    done&#91;sd&#93; = 0;
    sd--;
&#125;




smatovic
Posts: 2639
Joined: Wed Mar 10, 2010 10:18 pm
Location: Hamburg, Germany
Full name: Srdja Matovic

Re: SPPS - a Simple Parallel Processing Scheme

Post by smatovic »

ups, one line added and one moved.

Code: Select all

// pid => Process ID, 0-127

// for each search depth
while &#40;sd >= 0&#41;

    // for each board and a fix search depth
    while &#40;done&#91;sd&#93; < 128 && sd < max_depth&#41; &#123;

        // every of the 128 parallel process uses the same initial board
        board   = globalboards&#91;&#40;sd*128&#41;+done&#91;sd&#93;&#93;;

        // until end of boards
        if (!board&#41;
            break;

        // every of the 128 parallel process uses an individual move
        move    = globalmoves&#91;&#40;sd*128*128&#41;+(&#40;done&#91;sd&#93;)*128&#41;+pid&#93;;

        // next board in array
        done&#91;sd&#93;++

        // move up in search tree
        sd++;

        // only if our process id gets a move
        if (!move&#41;
            continue;

        // domove

        // copy board to global
        globalboards&#91;&#40;sd*128&#41;+pid&#93; = board;

        // generate legal moves
        moves = genMoves&#40;board&#41;;
        // and store in global
        globalmoves&#91;&#40;sd*128*128&#41; + &#40;pid*128&#41;+i&#93; = move.i;

    &#125;
    // move down in tree
    done&#91;sd&#93; = 0;
    sd--;
&#125;

Sven
Posts: 4052
Joined: Thu May 15, 2008 9:57 pm
Location: Berlin, Germany
Full name: Sven Schüle

Re: SPPS - a Simple Parallel Processing Scheme

Post by Sven »

Could you please describe where you plan to plug in the search itself, i.e. things like move ordering, evaluation, maybe alpha/beta handling, ...? Up to now you have described move generation and parallel processing management.

Sven
smatovic
Posts: 2639
Joined: Wed Mar 10, 2010 10:18 pm
Location: Hamburg, Germany
Full name: Srdja Matovic

Re: SPPS - a Simple Parallel Processing Scheme

Post by smatovic »

Could you please describe where you plan to plug in the search itself, i.e. things like move ordering, evaluation, maybe alpha/beta handling, ...? Up to now you have described move generation and parallel processing management.
When i got the alpha-beta-prunning running i will post an update.

--
Srdja
smatovic
Posts: 2639
Joined: Wed Mar 10, 2010 10:18 pm
Location: Hamburg, Germany
Full name: Srdja Matovic

Re: SPPS - a Simple Parallel Processing Scheme

Post by smatovic »

pseudo code update with negamax-like scores,
it looks somekind of ugly compared to recursion,
but for those who are interested it should give an idea how to use it on a simd-device...

Code: Select all

// pid => Process ID, 0-127

globalboards&#91;sd*128&#93;;
globalscores&#91;sd*128&#93;;
globalmoves&#91;sd*128*128&#93;;


// for each search depth
while &#40;sd >= 0&#41;

    // for each board and a fix search depth
    while &#40;done&#91;sd&#93; < 128 && sd < max_depth&#41; &#123;

        // every of the 128 parallel process uses the same initial board
        board   = globalboards&#91;&#40;sd*128&#41;+done&#91;sd&#93;&#93;;

        // until end of boards
        if (!board&#41;
            break;

        // every of the 128 parallel process uses an individual move
        move    = globalmoves&#91;&#40;sd*128*128&#41;+(&#40;done&#91;sd&#93;)*128&#41;+pid&#93;;

        // next board in array
        done&#91;sd&#93;++

        // move up in search tree
        sd++;

        // switch site
        stm = !stm;

        // only if our process id gets a move
        if (!move&#41;
            continue;

        // domove

        // copy board to global
        globalboards&#91;&#40;sd*128&#41;+pid&#93; = board;

        // generate legal moves
        moves = genMoves&#40;board&#41;;
        // and store in global
        globalmoves&#91;&#40;sd*128*128&#41; + &#40;pid*128&#41;+i&#93; = moves.i;

        // do global negamax
        if &#40;sd == max_depth&#41;
            for moves.count do i
                atom_max&#40;&globalscores&#91;&#40;sd-1&#41;*128+pid&#93;, -eval&#40;board, moves.i&#41; );
    &#125;
    // move down in tree
    done&#91;sd&#93; = 0;
    sd--;
    // switch site
    stm = !stm;
    // do global negamax
    if &#40;sd > 0&#41;
        atom_max&#40;&globalscores&#91;&#40;sd-1&#41;*128+done&#91;sd&#93;&#93;, -globalscores&#91;&#40;sd&#41;*128+pid&#93; );

&#125;
smatovic
Posts: 2639
Joined: Wed Mar 10, 2010 10:18 pm
Location: Hamburg, Germany
Full name: Srdja Matovic

Re: DPPS - a Dynamic Parallel Processing Scheme

Post by smatovic »

Here a simplified version of the DPPS i use.

Code: Select all

    // for every search depth
    while &#40;sd >= 0&#41; &#123;

        // and for each move/board with a finite search depth
        while &#40;movecounter&#91;sd&#93; > 0 && sd < max_depth&#41; &#123;

            decrease&#40;movecounter&#91;sd&#93;);
            
            // get an board for Process
            boardindex  = getBoardIndex&#40;);
            // get an move for process
            moveindex   = getMoveIndex&#40;);

            // copy board and move from global to local
            board = getBoard&#40;boardindex&#41;;
            move  = getMove&#40;moveindex&#41;;

            // move up in search tree
            sd++;
            // switch site to move
            stm = !stm;

            // only if the process got work todo
            if (!board || !move&#41;
                continue;

            domove&#40;board, move&#41;;
            moves = genMoves&#40;board&#41;;
            increase&#40;movecounter&#91;sd&#93;, moves.count&#41;;

        &#125;

        // move down in search tree
        // "clear all counters"
        // decrease search depth
        sd--;
        // switch site to move
        stm = !stm;
    &#125;
smatovic
Posts: 2639
Joined: Wed Mar 10, 2010 10:18 pm
Location: Hamburg, Germany
Full name: Srdja Matovic

Re: GPUs and unsynched SIMD Units

Post by smatovic »

fyi:

i managed to get a non-recursive, negamax-like search algorithm running on one SIMD Unit of a GPU. But GPUs consist of tens of these SIMD Units and a effective sync between them is simply not possible.

So the idea to pack the hole computation on the GPU is a stopper, somehow the CPU has to be involved to control the flow.

--
Srdja
smatovic
Posts: 2639
Joined: Wed Mar 10, 2010 10:18 pm
Location: Hamburg, Germany
Full name: Srdja Matovic

Re: - and some numbers

Post by smatovic »

Here some roughly calculated numbers:

Code: Select all

SPPS, one SIMD Unit, 128 Threads&#58;      500 Knps

DPPS, one CPU 2 Ghz Core&#58;             5000 Knps

DPPS, one SIMD Unit  1 Thread          100 Knps
DPPS, one SIMD Unit  2 Threads         200 Knps
DPPS, one SIMD Unit  4 Threads         400 Knps
DPPS, one SIMD Unit  8 Threads         800 Knps
DPPS, one SIMD Unit 16 Threads        1000 Knps
DPPS, one SIMD Unit 32 Threads        1200 Knps
DPPS, one SIMD Unit 64 Threads        1200 Knps

--
Srdja