Alphabeta on top of Engine cluster

Discussion of chess software programming and technical issues.

Moderator: Ras

User avatar
j.t.
Posts: 263
Joined: Wed Jun 16, 2021 2:08 am
Location: Berlin
Full name: Jost Triller

Re: Alphabeta on top of Engine cluster

Post by j.t. »

dangi12012 wrote: Fri Dec 17, 2021 5:38 pm Does anyone here know what SF uses internally for its MT magic or is it best to just read the sourcecode?
It seems to me like Stockfish uses Lazy SMP.

EDIT: I edited out my question "What does MT stand for". I am now editing it back in, to avoid confusion.
Last edited by j.t. on Fri Dec 17, 2021 6:25 pm, edited 2 times in total.
User avatar
Desperado
Posts: 879
Joined: Mon Dec 15, 2008 11:45 am

Re: Alphabeta on top of Engine cluster

Post by Desperado »

j.t. wrote: Fri Dec 17, 2021 6:20 pm
dangi12012 wrote: Fri Dec 17, 2021 5:38 pm Does anyone here know what SF uses internally for its MT magic or is it best to just read the sourcecode?
It seems to me like Stockfish uses Lazy SMP.
I guess he means multithreading.
R. Tomasi
Posts: 307
Joined: Wed Sep 01, 2021 4:08 pm
Location: Germany
Full name: Roland Tomasi

Re: Alphabeta on top of Engine cluster

Post by R. Tomasi »

dangi12012 wrote: Fri Dec 17, 2021 5:36 pm Your information is outdated regarding smp.
Well, maybe not that outdated, after all: https://github.com/official-stockfish/S ... h/pull/468...

dangi12012 wrote: Fri Dec 17, 2021 5:36 pm Why comment that something is impossible. Its like on stackoverflow where 9 people say its impossible and 1 person provides the perfect answer.
Well you ask for simple pseudo-code for a problem that nobody has managed to solve in a "simple" way. That's why you're asking for the impossible. You will not get "simple" pseudo-code for that...
Mergi
Posts: 127
Joined: Sat Aug 21, 2021 9:55 pm
Full name: Jen

Re: Alphabeta on top of Engine cluster

Post by Mergi »

Here's a simple (and actually legit) pseudo code for lazy SMP.

Code: Select all

for thread in threads
	do normal AB search

wait for all threads to finish
select best thread
uci output best thread's move
:shock:

Obviously you can try some tricks with randomizing the root move ordering, so each thread has a very distinct search tree ... but i found that at least for my tests with 4 cores it didn't help at all. The lazier i was with the algorithm the better the results were :D The only thing i do is that if a helper thread falls behind the main thread (thread with id == 0), then it skips some depth and immeditelly tries to do a deeper search.

The only form of communication between threads is the shared transposition table. For SF it is a lockless table where data races can and do occur. Every move that is retrieved from this TT is checked for validity (it can be played on the board without crashing the engine).

Generally the advantage of lazy SMP is that it's exceedingly easy to implement and scales better than most other algos when run on a lot of threads.
dangi12012
Posts: 1062
Joined: Tue Apr 28, 2020 10:03 pm
Full name: Daniel Infuehr

Re: Alphabeta on top of Engine cluster

Post by dangi12012 »

Mergi wrote: Fri Dec 17, 2021 7:47 pm Here's a simple (and actually legit) pseudo code for lazy SMP.

Code: Select all

for thread in threads
	do normal AB search

wait for all threads to finish
select best thread
uci output best thread's move
:shock:

Obviously you can try some tricks with randomizing the root move ordering, so each thread has a very distinct search tree ... but i found that at least for my tests with 4 cores it didn't help at all. The lazier i was with the algorithm the better the results were :D The only thing i do is that if a helper thread falls behind the main thread (thread with id == 0), then it skips some depth and immeditelly tries to do a deeper search.

The only form of communication between threads is the shared transposition table. For SF it is a lockless table where data races can and do occur. Every move that is retrieved from this TT is checked for validity (it can be played on the board without crashing the engine).

Generally the advantage of lazy SMP is that it's exceedingly easy to implement and scales better than most other algos when run on a lot of threads.
Randomizing the move order gives the advantage of giving the TT more "entropy" or unique states. Your Idea is actually very good.
I wouldnt even wait for all threads to finish. I would just go for normal move ordering and threads pick the 2,3,4th ranked move and see whats "behind it". If a move is proved to be best - all threads continue to search below there!

Because AB just converges towards the best solution - the core problem that move ordering cannot be perfect (otherwise you wouldnt need search at all) is still there. AB is just smart with how quickly to stop wasting time.

The idea here is literally normal AB but the definition of a node changes. A node becomes an active thread that does AB search itself. So your top level algorithm does not select optimal moves - it selects the optimal thread that has found the optimal move strategy so far!

When good people come with good answers - I like that a lot.
Worlds-fastest-Bitboard-Chess-Movegenerator
Daniel Inführ - Software Developer
dangi12012
Posts: 1062
Joined: Tue Apr 28, 2020 10:03 pm
Full name: Daniel Infuehr

Re: Alphabeta on top of Engine cluster

Post by dangi12012 »

There we have easy pseudocode for AB:

Code: Select all

int AlphaBeta(int tiefe, int alpha, int beta)
{
    if (tiefe == 0)
        return Bewerten();
    BOOL PVgefunden = FALSE;
    best = -unendlich;
    Zugliste = GeneriereMoeglicheZuege();
    for each (Zug in Zugliste)
    {
        FuehreZugAus(Zug);
        if (PVgefunden)
        {
            wert = -AlphaBeta(tiefe-1, -alpha-1, -alpha);
            if (wert > alpha && wert < beta)
                wert = -AlphaBeta(tiefe-1, -beta, -wert);
        } else
            wert = -AlphaBeta(tiefe-1, -beta, -alpha);
        MacheZugRueckgaengig(Zug);
        if (wert > best)
        {
            if (wert >= beta)
                return wert;
            best = wert;
            if (wert > alpha)
            {
                alpha = wert;
                PVgefunden = TRUE;
            }
        }
    }
    return best;
}
Now we switch reference frames. GeneriereMoeglicheZuege becomes generate all possible AB Threads. Where each thread doesnt doesnt deepen by only the first move (according to sorting) but 1,2,3,4 depending on threadid and skips the other ones. That is AB on top of AB. This will find the optimal threadid which in turn has the optimal move. Repeated searches would have to be stopped so we still need a TT.
Seems to me that parallelsisation is possible after all.

The main reason why this works is that move sorting is not (and cannot be) perfect. If move ordering were perfect you wouldnt need a search.
So branching into moves[0] first and having a concurrent thread that branches into moves[1] at the same time is a very smart thing. All you need is a config parameter that contains the bitmask of what to skip and what to branch into.

The way you skip the expensive synchronisation is having not each thread talk to each other 32*32 but you have the context of an AB tree on top of that. So communication is 32*1.

I see that maybe the smartest move may be (offset in the moves going down the stack) 3,1,0,1,2,1
With perfect move ordering it would be 0,0,0,0,0 -> but this is literally impossible to sort. And thats why parallelisation is not only possible - but overall the faster solution.

Cool Idea!
Worlds-fastest-Bitboard-Chess-Movegenerator
Daniel Inführ - Software Developer
connor_mcmonigle
Posts: 544
Joined: Sun Sep 06, 2020 4:40 am
Full name: Connor McMonigle

Re: Alphabeta on top of Engine cluster

Post by connor_mcmonigle »

dangi12012 wrote: Fri Dec 17, 2021 8:36 pm
Mergi wrote: Fri Dec 17, 2021 7:47 pm Here's a simple (and actually legit) pseudo code for lazy SMP.

Code: Select all

for thread in threads
	do normal AB search

wait for all threads to finish
select best thread
uci output best thread's move
:shock:

Obviously you can try some tricks with randomizing the root move ordering, so each thread has a very distinct search tree ... but i found that at least for my tests with 4 cores it didn't help at all. The lazier i was with the algorithm the better the results were :D The only thing i do is that if a helper thread falls behind the main thread (thread with id == 0), then it skips some depth and immeditelly tries to do a deeper search.

The only form of communication between threads is the shared transposition table. For SF it is a lockless table where data races can and do occur. Every move that is retrieved from this TT is checked for validity (it can be played on the board without crashing the engine).

Generally the advantage of lazy SMP is that it's exceedingly easy to implement and scales better than most other algos when run on a lot of threads.
Randomizing the move order gives the advantage of giving the TT more "entropy" or unique states. Your Idea is actually very good.
I wouldnt even wait for all threads to finish. I would just go for normal move ordering and threads pick the 2,3,4th ranked move and see whats "behind it". If a move is proved to be best - all threads continue to search below there!

Because AB just converges towards the best solution - the core problem that move ordering cannot be perfect (otherwise you wouldnt need search at all) is still there. AB is just smart with how quickly to stop wasting time.

The idea here is literally normal AB but the definition of a node changes. A node becomes an active thread that does AB search itself. So your top level algorithm does not select optimal moves - it selects the optimal thread that has found the optimal move strategy so far!

When good people come with good answers - I like that a lot.
Everyone who has responded to you seems to be in complete agreement as far as I can tell. In any case, you seem to have misread the pseudo code? LazySMP works by launching one normal pvsearch at the root for each thread with a shared hash table. All threads search the same root moves and schemes to randomize the root move order for each thread have been found to be empirically neutral at best. This is the parallelism strategy employed by Stockfish and effectively all top minimax+alpha-beta pruning search engines. It's far simpler than what you suggest.

Stockfish does have some additional parallel search specific heuristics such as "breadcrumbs" and thread count specific LMR coefficients, but these are relatively minor tweaks.
User avatar
mvanthoor
Posts: 1784
Joined: Wed Jul 03, 2019 4:42 pm
Location: Netherlands
Full name: Marcel Vanthoor

Re: Alphabeta on top of Engine cluster

Post by mvanthoor »

dangi12012 wrote: Fri Dec 17, 2021 2:34 am As I said im not interested in discussing problems but only Pseudocode. How would AB look like in its simplest form?
The standard form at least is depth-first, so it descends through the left most branch of the tree first until depth = 0. Then it evaluates the position. Then it goes up one node, and down again one node over to the right, evaluates that again, and so on. After all the leaves have been evaluated, the value of the node above those leaves can be determined. (Either the highest value if it is alpha evaluating , or the lowest value if beta is evaluating.)

To do what you want to do it would mean that you'd have to split up a branch. Maybe it is possible somehow, but it would not be efficient. It is just more efficient to split the move list into a number of threads and then have each list search its chunk of moves. One thread can influence other threats through the hash table: if one thread puts a score/move into the TT, it may cause another thread to skip/cut that position. That is the basis of Lazy SMP. It is easy to implement compared to other SMP algorithms, it's conceptually easy to understand, and to take the cake, it even works very well. Therefore most engines switched to this implementation.
Author of Rustic, an engine written in Rust.
Releases | Code | Docs | Progress | CCRL
User avatar
mvanthoor
Posts: 1784
Joined: Wed Jul 03, 2019 4:42 pm
Location: Netherlands
Full name: Marcel Vanthoor

Re: Alphabeta on top of Engine cluster

Post by mvanthoor »

dangi12012 wrote: Fri Dec 17, 2021 4:02 pm I like montecarlo. It has the nice side effect of being game agnostic as well..

But given 512 waiting engines. What algo is the best for that? Seems to mee that the same parallelisation by different depths works there too?
hashtable and movegen would be maintained by a server. And all the eval would be handled by a cluster.

Advantage: No additional programming required. Also game agnostic since you seperate AB+TT from the acutal engines.

It has been done. See here:

http://www.tckerrigan.com/Chess/Parallel_Search/How_To/

It basically does what what you describe, but within a single executable. You wrap your engine into a class, then build an executable around it so this class (and thus your engine) can be launched multiple times, and share your hash table between the classes. (Obviously your uci / xboard protocol now has to also be provided by that wrapper executable.)

Effectively it is a more convoluted form of Lazy SMP, where you put the entire engine into a thread instead of just the search function.
Author of Rustic, an engine written in Rust.
Releases | Code | Docs | Progress | CCRL
dangi12012
Posts: 1062
Joined: Tue Apr 28, 2020 10:03 pm
Full name: Daniel Infuehr

Re: Alphabeta on top of Engine cluster

Post by dangi12012 »

mvanthoor wrote: Sun Dec 19, 2021 5:21 pm
dangi12012 wrote: Fri Dec 17, 2021 4:02 pm I like montecarlo. It has the nice side effect of being game agnostic as well..

But given 512 waiting engines. What algo is the best for that? Seems to mee that the same parallelisation by different depths works there too?
hashtable and movegen would be maintained by a server. And all the eval would be handled by a cluster.

Advantage: No additional programming required. Also game agnostic since you seperate AB+TT from the acutal engines.

It has been done. See here:

http://www.tckerrigan.com/Chess/Parallel_Search/How_To/

It basically does what what you describe, but within a single executable. You wrap your engine into a class, then build an executable around it so this class (and thus your engine) can be launched multiple times, and share your hash table between the classes. (Obviously your uci / xboard protocol now has to also be provided by that wrapper executable.)

Effectively it is a more convoluted form of Lazy SMP, where you put the entire engine into a thread instead of just the search function.
Thank you very much. Great comment and a nice suprise from the usual "it cannot be done" croud. Good to see that this already exists.
I will put that onto my list to create something like that.

But not as a class wrapper but as an executable wrapper. The uci interface is flexible enought with go perft 1 to be movegenerator and searcher as a distributed service.

Actual TT sharing is more difficult in distributed systems. (maybe the engine is on a different machine). And I could only think of a uci extension that would allow network TT sharing as a new option via memory mapped IO (which is very platform dependent if its a good idea)

Anyays thanks for that link - this is a huge help.
Worlds-fastest-Bitboard-Chess-Movegenerator
Daniel Inführ - Software Developer