Alphabeta on top of Engine cluster

j.t. · Post by **j.t.** » Fri Dec 17, 2021 6:20 pm

dangi12012 wrote: ↑Fri Dec 17, 2021 5:38 pm Does anyone here know what SF uses internally for its MT magic or is it best to just read the sourcecode?

It seems to me like Stockfish uses Lazy SMP.

EDIT: I edited out my question "What does MT stand for". I am now editing it back in, to avoid confusion.

Desperado · Post by **Desperado** » Fri Dec 17, 2021 6:21 pm

j.t. wrote: ↑Fri Dec 17, 2021 6:20 pm
dangi12012 wrote: ↑Fri Dec 17, 2021 5:38 pm Does anyone here know what SF uses internally for its MT magic or is it best to just read the sourcecode?
It seems to me like Stockfish uses Lazy SMP.

I guess he means multithreading.

R. Tomasi · Post by **R. Tomasi** » Fri Dec 17, 2021 6:35 pm

dangi12012 wrote: ↑Fri Dec 17, 2021 5:36 pm Your information is outdated regarding smp.

Well, maybe not that outdated, after all: https://github.com/official-stockfish/S ... h/pull/468...

dangi12012 wrote: ↑Fri Dec 17, 2021 5:36 pm Why comment that something is impossible. Its like on stackoverflow where 9 people say its impossible and 1 person provides the perfect answer.

Well you ask for simple pseudo-code for a problem that nobody has managed to solve in a "simple" way. That's why you're asking for the impossible. You will not get "simple" pseudo-code for that...

Mergi · Post by **Mergi** » Fri Dec 17, 2021 7:47 pm

Here's a simple (and actually legit) pseudo code for lazy SMP.

Code: Select all

for thread in threads
	do normal AB search

wait for all threads to finish
select best thread
uci output best thread's move

Obviously you can try some tricks with randomizing the root move ordering, so each thread has a very distinct search tree ... but i found that at least for my tests with 4 cores it didn't help at all. The lazier i was with the algorithm the better the results were

The only thing i do is that if a helper thread falls behind the main thread (thread with id == 0), then it skips some depth and immeditelly tries to do a deeper search.

The only form of communication between threads is the shared transposition table. For SF it is a lockless table where data races can and do occur. Every move that is retrieved from this TT is checked for validity (it can be played on the board without crashing the engine).

Generally the advantage of lazy SMP is that it's exceedingly easy to implement and scales better than most other algos when run on a lot of threads.

dangi12012 · Post by **dangi12012** » Fri Dec 17, 2021 8:36 pm

Mergi wrote: ↑Fri Dec 17, 2021 7:47 pm Here's a simple (and actually legit) pseudo code for lazy SMP.
Code: Select all
for thread in threads
	do normal AB search

wait for all threads to finish
select best thread
uci output best thread's move
Obviously you can try some tricks with randomizing the root move ordering, so each thread has a very distinct search tree ... but i found that at least for my tests with 4 cores it didn't help at all. The lazier i was with the algorithm the better the results were The only thing i do is that if a helper thread falls behind the main thread (thread with id == 0), then it skips some depth and immeditelly tries to do a deeper search.

The only form of communication between threads is the shared transposition table. For SF it is a lockless table where data races can and do occur. Every move that is retrieved from this TT is checked for validity (it can be played on the board without crashing the engine).

Generally the advantage of lazy SMP is that it's exceedingly easy to implement and scales better than most other algos when run on a lot of threads.

Randomizing the move order gives the advantage of giving the TT more "entropy" or unique states. Your Idea is actually very good.
I wouldnt even wait for all threads to finish. I would just go for normal move ordering and threads pick the 2,3,4th ranked move and see whats "behind it". If a move is proved to be best - all threads continue to search below there!

Because AB just converges towards the best solution - the core problem that move ordering cannot be perfect (otherwise you wouldnt need search at all) is still there. AB is just smart with how quickly to stop wasting time.

The idea here is literally normal AB but the definition of a node changes. A node becomes an active thread that does AB search itself. So your top level algorithm does not select optimal moves - it selects the optimal thread that has found the optimal move strategy so far!

When good people come with good answers - I like that a lot.

dangi12012 · Post by **dangi12012** » Fri Dec 17, 2021 8:59 pm

There we have easy pseudocode for AB:

Code: Select all

int AlphaBeta(int tiefe, int alpha, int beta)
{
    if (tiefe == 0)
        return Bewerten();
    BOOL PVgefunden = FALSE;
    best = -unendlich;
    Zugliste = GeneriereMoeglicheZuege();
    for each (Zug in Zugliste)
    {
        FuehreZugAus(Zug);
        if (PVgefunden)
        {
            wert = -AlphaBeta(tiefe-1, -alpha-1, -alpha);
            if (wert > alpha && wert < beta)
                wert = -AlphaBeta(tiefe-1, -beta, -wert);
        } else
            wert = -AlphaBeta(tiefe-1, -beta, -alpha);
        MacheZugRueckgaengig(Zug);
        if (wert > best)
        {
            if (wert >= beta)
                return wert;
            best = wert;
            if (wert > alpha)
            {
                alpha = wert;
                PVgefunden = TRUE;
            }
        }
    }
    return best;
}

Now we switch reference frames. GeneriereMoeglicheZuege becomes generate all possible AB Threads. Where each thread doesnt doesnt deepen by only the first move (according to sorting) but 1,2,3,4 depending on threadid and skips the other ones. That is AB on top of AB. This will find the optimal threadid which in turn has the optimal move. Repeated searches would have to be stopped so we still need a TT.
Seems to me that parallelsisation is possible after all.

The main reason why this works is that move sorting is not (and cannot be) perfect. If move ordering were perfect you wouldnt need a search.
So branching into moves[0] first and having a concurrent thread that branches into moves[1] at the same time is a very smart thing. All you need is a config parameter that contains the bitmask of what to skip and what to branch into.

The way you skip the expensive synchronisation is having not each thread talk to each other 32*32 but you have the context of an AB tree on top of that. So communication is 32*1.

I see that maybe the smartest move may be (offset in the moves going down the stack) 3,1,0,1,2,1
With perfect move ordering it would be 0,0,0,0,0 -> but this is literally impossible to sort. And thats why parallelisation is not only possible - but overall the faster solution.

Cool Idea!

connor_mcmonigle · Post by **connor_mcmonigle** » Fri Dec 17, 2021 9:10 pm

dangi12012 wrote: ↑Fri Dec 17, 2021 8:36 pm
Mergi wrote: ↑Fri Dec 17, 2021 7:47 pm Here's a simple (and actually legit) pseudo code for lazy SMP.
Code: Select all
for thread in threads
	do normal AB search

wait for all threads to finish
select best thread
uci output best thread's move
Obviously you can try some tricks with randomizing the root move ordering, so each thread has a very distinct search tree ... but i found that at least for my tests with 4 cores it didn't help at all. The lazier i was with the algorithm the better the results were The only thing i do is that if a helper thread falls behind the main thread (thread with id == 0), then it skips some depth and immeditelly tries to do a deeper search.

The only form of communication between threads is the shared transposition table. For SF it is a lockless table where data races can and do occur. Every move that is retrieved from this TT is checked for validity (it can be played on the board without crashing the engine).

Generally the advantage of lazy SMP is that it's exceedingly easy to implement and scales better than most other algos when run on a lot of threads.
Randomizing the move order gives the advantage of giving the TT more "entropy" or unique states. Your Idea is actually very good.
I wouldnt even wait for all threads to finish. I would just go for normal move ordering and threads pick the 2,3,4th ranked move and see whats "behind it". If a move is proved to be best - all threads continue to search below there!

Because AB just converges towards the best solution - the core problem that move ordering cannot be perfect (otherwise you wouldnt need search at all) is still there. AB is just smart with how quickly to stop wasting time.

The idea here is literally normal AB but the definition of a node changes. A node becomes an active thread that does AB search itself. So your top level algorithm does not select optimal moves - it selects the optimal thread that has found the optimal move strategy so far!

When good people come with good answers - I like that a lot.

Everyone who has responded to you seems to be in complete agreement as far as I can tell. In any case, you seem to have misread the pseudo code? LazySMP works by launching one normal pvsearch at the root for each thread with a shared hash table. All threads search the same root moves and schemes to randomize the root move order for each thread have been found to be empirically neutral at best. This is the parallelism strategy employed by Stockfish and effectively all top minimax+alpha-beta pruning search engines. It's far simpler than what you suggest.

Stockfish does have some additional parallel search specific heuristics such as "breadcrumbs" and thread count specific LMR coefficients, but these are relatively minor tweaks.

mvanthoor · Post by **mvanthoor** » Sun Dec 19, 2021 5:16 pm

dangi12012 wrote: ↑Fri Dec 17, 2021 2:34 am As I said im not interested in discussing problems but only Pseudocode. How would AB look like in its simplest form?

The standard form at least is depth-first, so it descends through the left most branch of the tree first until depth = 0. Then it evaluates the position. Then it goes up one node, and down again one node over to the right, evaluates that again, and so on. After all the leaves have been evaluated, the value of the node above those leaves can be determined. (Either the highest value if it is alpha evaluating , or the lowest value if beta is evaluating.)

To do what you want to do it would mean that you'd have to split up a branch. Maybe it is possible somehow, but it would not be efficient. It is just more efficient to split the move list into a number of threads and then have each list search its chunk of moves. One thread can influence other threats through the hash table: if one thread puts a score/move into the TT, it may cause another thread to skip/cut that position. That is the basis of Lazy SMP. It is easy to implement compared to other SMP algorithms, it's conceptually easy to understand, and to take the cake, it even works very well. Therefore most engines switched to this implementation.

mvanthoor · Post by **mvanthoor** » Sun Dec 19, 2021 5:21 pm

dangi12012 wrote: ↑Fri Dec 17, 2021 4:02 pm I like montecarlo. It has the nice side effect of being game agnostic as well..

But given 512 waiting engines. What algo is the best for that? Seems to mee that the same parallelisation by different depths works there too?
hashtable and movegen would be maintained by a server. And all the eval would be handled by a cluster.

Advantage: No additional programming required. Also game agnostic since you seperate AB+TT from the acutal engines.

It has been done. See here:

http://www.tckerrigan.com/Chess/Parallel_Search/How_To/

It basically does what what you describe, but within a single executable. You wrap your engine into a class, then build an executable around it so this class (and thus your engine) can be launched multiple times, and share your hash table between the classes. (Obviously your uci / xboard protocol now has to also be provided by that wrapper executable.)

Effectively it is a more convoluted form of Lazy SMP, where you put the entire engine into a thread instead of just the search function.

dangi12012 · Post by **dangi12012** » Sun Dec 19, 2021 7:03 pm

mvanthoor wrote: ↑Sun Dec 19, 2021 5:21 pm
dangi12012 wrote: ↑Fri Dec 17, 2021 4:02 pm I like montecarlo. It has the nice side effect of being game agnostic as well..

But given 512 waiting engines. What algo is the best for that? Seems to mee that the same parallelisation by different depths works there too?
hashtable and movegen would be maintained by a server. And all the eval would be handled by a cluster.

Advantage: No additional programming required. Also game agnostic since you seperate AB+TT from the acutal engines.

It has been done. See here:

http://www.tckerrigan.com/Chess/Parallel_Search/How_To/

It basically does what what you describe, but within a single executable. You wrap your engine into a class, then build an executable around it so this class (and thus your engine) can be launched multiple times, and share your hash table between the classes. (Obviously your uci / xboard protocol now has to also be provided by that wrapper executable.)

Effectively it is a more convoluted form of Lazy SMP, where you put the entire engine into a thread instead of just the search function.

Thank you very much. Great comment and a nice suprise from the usual "it cannot be done" croud. Good to see that this already exists.
I will put that onto my list to create something like that.

But not as a class wrapper but as an executable wrapper. The uci interface is flexible enought with go perft 1 to be movegenerator and searcher as a distributed service.

Actual TT sharing is more difficult in distributed systems. (maybe the engine is on a different machine). And I could only think of a uci extension that would allow network TT sharing as a new option via memory mapped IO (which is very platform dependent if its a good idea)

Anyays thanks for that link - this is a huge help.

Alphabeta on top of Engine cluster

Re: Alphabeta on top of Engine cluster

Re: Alphabeta on top of Engine cluster

Re: Alphabeta on top of Engine cluster

Re: Alphabeta on top of Engine cluster

Re: Alphabeta on top of Engine cluster

Re: Alphabeta on top of Engine cluster

Re: Alphabeta on top of Engine cluster

Re: Alphabeta on top of Engine cluster

Re: Alphabeta on top of Engine cluster

Re: Alphabeta on top of Engine cluster