Simplest way to implement quick and dirty lazy smp

silentshark · Post by **silentshark** » Fri Jan 04, 2019 2:16 pm

Ok, so just say you haven't got a clue about lazy smp, and are looking for a very quick and dirty way to add to a non-smp program.. which is written in C.

What's the easiest and simplest way to implement? Where's the best start point?

xr_a_y · Post by **xr_a_y** » Fri Jan 04, 2019 2:22 pm

Just run the search in multiple forks using a shared hash table.

In C++11, it is much fun using std::thread/mutex/condition. Stockfish is a very simple and easy to understand implementation.
The one in Minic is derived from Stockfish and I guess quite copy/paste-able.

smatovic · Post by **smatovic** » Sat Jan 05, 2019 4:19 pm

silentshark wrote: ↑Fri Jan 04, 2019 2:16 pm Ok, so just say you haven't got a clue about lazy smp, and are looking for a very quick and dirty way to add to a non-smp program.. which is written in C.

What's the easiest and simplest way to implement? Where's the best start point?

1) Make sure all threads have their own memory stacks except global hash table
2) use the SHT, shared hash table, in all threads for TT-moves and TT-scores
3) test RMO - randomized move order in helper threads,
- root thread performs a normal AB search
- helper threads pick a random move after Nullmove, TT-move, IID-move etc.
- search terminates when any thread finishes its search
4) consider a quick but not dirty ABDADA implementation

--
Srdja

odomobo · Post by **odomobo** » Sat Jan 05, 2019 4:26 pm

Does randomized move order provide a benefit for lazy SMP?

smatovic · Post by **smatovic** » Sat Jan 05, 2019 4:37 pm

odomobo wrote: ↑Sat Jan 05, 2019 4:26 pm Does randomized move order provide a benefit for lazy SMP?

I implemented RMO on gpu and it was superior to plain SHT,
i tried some lazy smp ideas like different search depths,
but these did not work for me.

My ABDADA implementation is superior to RMO,
at least up to 32 threads.

--
Srdja

silentshark · Post by **silentshark** » Sat Jan 05, 2019 6:59 pm

All great suggestions, many thanks

JVMerlino · Post by **JVMerlino** » Sat Jan 05, 2019 7:31 pm

silentshark wrote: ↑Fri Jan 04, 2019 2:16 pm Ok, so just say you haven't got a clue about lazy smp, and are looking for a very quick and dirty way to add to a non-smp program.. which is written in C.

What's the easiest and simplest way to implement? Where's the best start point?

Well, if your code is not thread-safe and you REALLY want the QUICKEST and DIRTIEST implementation, you could do what I did in Myrddin (thanks to an idea from H.G. Muller). Instead of threads, I used processes which are extremely easy to launch in parallel. You just have to make your various hash tables accessible by all processes via shared memory, and then have the master process feed moves via pipes to all of the slaves while they are in analyze mode. They fill up the hash tables, which the main process uses to speed up its search.

It worked a fair bit better than expected. My testing showed +90 elo for 4 CPU, and this is borne out by CCRL testing, which shows +93 (although CCRL has only played about 260 games with the 4 CPU version).

jm

hgm · Post by **hgm** » Sat Jan 05, 2019 9:47 pm

Yeah, I used that in my engine HaQiKi D, also with satisfactory results (eventually). I don't connect the processes (by pipes) in a star topology, though, but as a linear chain. When a process receives the command "cores N" with N > 1 it launches a daughter process, and passes the command "cores N-1" to it. That will set up the chain.

The process that receives the "xboard" command (which is not passed along the chain) knows it is the master process communicating with the GUI. This then allocates the hash memory in a sharable way, and sends its name in a (non-standard) "attach" command (which is passed along to daughters) to the daughter, so the latter can map it in its own memory instead of allocating hash memory.

Game-state changing commands are passed to daughters, as well as any move the master makes itself. This keeps the processes in sync. Whenever the master starts searching, it sends an "analyze" command to its daughter, whenever it stops searching it sends "exit" to stop the analysis (both also passed along).

Basically the whole thing is implemented in the protocol driver loop.

JVMerlino · Post by **JVMerlino** » Sat Jan 05, 2019 10:23 pm

hgm wrote: ↑Sat Jan 05, 2019 9:47 pm Yeah, I used that in my engine HaQiKi D, also with satisfactory results (eventually). I don't connect the processes (by pipes) in a star topology, though, but as a linear chain. When a process receives the command "cores N" with N > 1 it launches a daughter process, and passes the command "cores N-1" to it. That will set up the chain.

The process that receives the "xboard" command (which is not passed along the chain) knows it is the master process communicating with the GUI. This then allocates the hash memory in a sharable way, and sends its name in a (non-standard) "attach" command (which is passed along to daughters) to the daughter, so the latter can map it in its own memory instead of allocating hash memory.

Game-state changing commands are passed to daughters, as well as any move the master makes itself. This keeps the processes in sync. Whenever the master starts searching, it sends an "analyze" command to its daughter, whenever it stops searching it sends "exit" to stop the analysis (both also passed along).

Basically the whole thing is implemented in the protocol driver loop.

The only significant difference between your implementation and mine is that I pass the address for the hash tables as commandline parameters to the slaves. How do you pass commands down the chain if not via pipes?

lucasart · Post by **lucasart** » Sun Jan 06, 2019 1:44 am

JVMerlino wrote: ↑Sat Jan 05, 2019 7:31 pm
silentshark wrote: ↑Fri Jan 04, 2019 2:16 pm Ok, so just say you haven't got a clue about lazy smp, and are looking for a very quick and dirty way to add to a non-smp program.. which is written in C.

What's the easiest and simplest way to implement? Where's the best start point?
Well, if your code is not thread-safe and you REALLY want the QUICKEST and DIRTIEST implementation, you could do what I did in Myrddin (thanks to an idea from H.G. Muller). Instead of threads, I used processes which are extremely easy to launch in parallel. You just have to make your various hash tables accessible by all processes via shared memory, and then have the master process feed moves via pipes to all of the slaves while they are in analyze mode. They fill up the hash tables, which the main process uses to speed up its search.

It worked a fair bit better than expected. My testing showed +90 elo for 4 CPU, and this is borne out by CCRL testing, which shows +93 (although CCRL has only played about 260 games with the 4 CPU version).

jm

I fail to see how this is easier. Honestly, managing subprocesses, and communications via pipes, is a lot more complicated in my experience (especially on POSIX systems, with the fork logic being quite mind bending). And it's hideous, from a design/coding standpoint. Simply start threads in an id_loop() and join when done. That's all there is to it. The rest is details, which you can improve on as you go along.

Simplest way to implement quick and dirty lazy smp

Simplest way to implement quick and dirty lazy smp

Re: Simplest way to implement quick and dirty lazy smp

Re: Simplest way to implement quick and dirty lazy smp

Re: Simplest way to implement quick and dirty lazy smp

Re: Simplest way to implement quick and dirty lazy smp

Re: Simplest way to implement quick and dirty lazy smp

Re: Simplest way to implement quick and dirty lazy smp

Re: Simplest way to implement quick and dirty lazy smp

Re: Simplest way to implement quick and dirty lazy smp

Re: Simplest way to implement quick and dirty lazy smp