Simplest way to implement quick and dirty lazy smp

Discussion of chess software programming and technical issues.

Moderators: hgm, Rebel, chrisw

User avatar
silentshark
Posts: 327
Joined: Sat Mar 27, 2010 7:15 pm

Simplest way to implement quick and dirty lazy smp

Post by silentshark »

Ok, so just say you haven't got a clue about lazy smp, and are looking for a very quick and dirty way to add to a non-smp program.. which is written in C.

What's the easiest and simplest way to implement? Where's the best start point?
User avatar
xr_a_y
Posts: 1871
Joined: Sat Nov 25, 2017 2:28 pm
Location: France

Re: Simplest way to implement quick and dirty lazy smp

Post by xr_a_y »

Just run the search in multiple forks using a shared hash table.

In C++11, it is much fun using std::thread/mutex/condition. Stockfish is a very simple and easy to understand implementation.
The one in Minic is derived from Stockfish and I guess quite copy/paste-able.
smatovic
Posts: 2642
Joined: Wed Mar 10, 2010 10:18 pm
Location: Hamburg, Germany
Full name: Srdja Matovic

Re: Simplest way to implement quick and dirty lazy smp

Post by smatovic »

silentshark wrote: Fri Jan 04, 2019 2:16 pm Ok, so just say you haven't got a clue about lazy smp, and are looking for a very quick and dirty way to add to a non-smp program.. which is written in C.

What's the easiest and simplest way to implement? Where's the best start point?
1) Make sure all threads have their own memory stacks except global hash table
2) use the SHT, shared hash table, in all threads for TT-moves and TT-scores
3) test RMO - randomized move order in helper threads,
- root thread performs a normal AB search
- helper threads pick a random move after Nullmove, TT-move, IID-move etc.
- search terminates when any thread finishes its search
4) consider a quick but not dirty ABDADA implementation

--
Srdja
odomobo
Posts: 96
Joined: Fri Jul 06, 2018 1:09 am
Location: Chicago, IL
Full name: Josh Odom

Re: Simplest way to implement quick and dirty lazy smp

Post by odomobo »

Does randomized move order provide a benefit for lazy SMP?
smatovic
Posts: 2642
Joined: Wed Mar 10, 2010 10:18 pm
Location: Hamburg, Germany
Full name: Srdja Matovic

Re: Simplest way to implement quick and dirty lazy smp

Post by smatovic »

odomobo wrote: Sat Jan 05, 2019 4:26 pm Does randomized move order provide a benefit for lazy SMP?
I implemented RMO on gpu and it was superior to plain SHT,
i tried some lazy smp ideas like different search depths,
but these did not work for me.

My ABDADA implementation is superior to RMO,
at least up to 32 threads.

--
Srdja
User avatar
silentshark
Posts: 327
Joined: Sat Mar 27, 2010 7:15 pm

Re: Simplest way to implement quick and dirty lazy smp

Post by silentshark »

All great suggestions, many thanks :-)
JVMerlino
Posts: 1357
Joined: Wed Mar 08, 2006 10:15 pm
Location: San Francisco, California

Re: Simplest way to implement quick and dirty lazy smp

Post by JVMerlino »

silentshark wrote: Fri Jan 04, 2019 2:16 pm Ok, so just say you haven't got a clue about lazy smp, and are looking for a very quick and dirty way to add to a non-smp program.. which is written in C.

What's the easiest and simplest way to implement? Where's the best start point?
Well, if your code is not thread-safe and you REALLY want the QUICKEST and DIRTIEST implementation, you could do what I did in Myrddin (thanks to an idea from H.G. Muller). Instead of threads, I used processes which are extremely easy to launch in parallel. You just have to make your various hash tables accessible by all processes via shared memory, and then have the master process feed moves via pipes to all of the slaves while they are in analyze mode. They fill up the hash tables, which the main process uses to speed up its search.

It worked a fair bit better than expected. My testing showed +90 elo for 4 CPU, and this is borne out by CCRL testing, which shows +93 (although CCRL has only played about 260 games with the 4 CPU version).

jm
User avatar
hgm
Posts: 27789
Joined: Fri Mar 10, 2006 10:06 am
Location: Amsterdam
Full name: H G Muller

Re: Simplest way to implement quick and dirty lazy smp

Post by hgm »

Yeah, I used that in my engine HaQiKi D, also with satisfactory results (eventually). I don't connect the processes (by pipes) in a star topology, though, but as a linear chain. When a process receives the command "cores N" with N > 1 it launches a daughter process, and passes the command "cores N-1" to it. That will set up the chain.

The process that receives the "xboard" command (which is not passed along the chain) knows it is the master process communicating with the GUI. This then allocates the hash memory in a sharable way, and sends its name in a (non-standard) "attach" command (which is passed along to daughters) to the daughter, so the latter can map it in its own memory instead of allocating hash memory.

Game-state changing commands are passed to daughters, as well as any move the master makes itself. This keeps the processes in sync. Whenever the master starts searching, it sends an "analyze" command to its daughter, whenever it stops searching it sends "exit" to stop the analysis (both also passed along).

Basically the whole thing is implemented in the protocol driver loop.
JVMerlino
Posts: 1357
Joined: Wed Mar 08, 2006 10:15 pm
Location: San Francisco, California

Re: Simplest way to implement quick and dirty lazy smp

Post by JVMerlino »

hgm wrote: Sat Jan 05, 2019 9:47 pm Yeah, I used that in my engine HaQiKi D, also with satisfactory results (eventually). I don't connect the processes (by pipes) in a star topology, though, but as a linear chain. When a process receives the command "cores N" with N > 1 it launches a daughter process, and passes the command "cores N-1" to it. That will set up the chain.

The process that receives the "xboard" command (which is not passed along the chain) knows it is the master process communicating with the GUI. This then allocates the hash memory in a sharable way, and sends its name in a (non-standard) "attach" command (which is passed along to daughters) to the daughter, so the latter can map it in its own memory instead of allocating hash memory.

Game-state changing commands are passed to daughters, as well as any move the master makes itself. This keeps the processes in sync. Whenever the master starts searching, it sends an "analyze" command to its daughter, whenever it stops searching it sends "exit" to stop the analysis (both also passed along).

Basically the whole thing is implemented in the protocol driver loop.
The only significant difference between your implementation and mine is that I pass the address for the hash tables as commandline parameters to the slaves. How do you pass commands down the chain if not via pipes?
User avatar
lucasart
Posts: 3232
Joined: Mon May 31, 2010 1:29 pm
Full name: lucasart

Re: Simplest way to implement quick and dirty lazy smp

Post by lucasart »

JVMerlino wrote: Sat Jan 05, 2019 7:31 pm
silentshark wrote: Fri Jan 04, 2019 2:16 pm Ok, so just say you haven't got a clue about lazy smp, and are looking for a very quick and dirty way to add to a non-smp program.. which is written in C.

What's the easiest and simplest way to implement? Where's the best start point?
Well, if your code is not thread-safe and you REALLY want the QUICKEST and DIRTIEST implementation, you could do what I did in Myrddin (thanks to an idea from H.G. Muller). Instead of threads, I used processes which are extremely easy to launch in parallel. You just have to make your various hash tables accessible by all processes via shared memory, and then have the master process feed moves via pipes to all of the slaves while they are in analyze mode. They fill up the hash tables, which the main process uses to speed up its search.

It worked a fair bit better than expected. My testing showed +90 elo for 4 CPU, and this is borne out by CCRL testing, which shows +93 (although CCRL has only played about 260 games with the 4 CPU version).

jm
I fail to see how this is easier. Honestly, managing subprocesses, and communications via pipes, is a lot more complicated in my experience (especially on POSIX systems, with the fork logic being quite mind bending). And it's hideous, from a design/coding standpoint. Simply start threads in an id_loop() and join when done. That's all there is to it. The rest is details, which you can improve on as you go along.
Theory and practice sometimes clash. And when that happens, theory loses. Every single time.