SMP and helpful master concept

Discussion of chess software programming and technical issues.

Moderators: hgm, Rebel, chrisw

User avatar
Zach Wegner
Posts: 1922
Joined: Thu Mar 09, 2006 12:51 am
Location: Earth

Re: SMP and helpful master concept

Post by Zach Wegner »

Codeman wrote:I understand Bruce's idea, though it seems a little bit messy. But how is the concept of DTS implemented in Crafty?

To stick with bruno causse's example: T1 stops working at N1, helps with the work at N3, and when finished (doing some more work in between if T2 still isn't done) goes back to N1 and finishes the job by propagating the values up.

How does this switching between tasks work for the master process? Are the functions from the first task (N1) still on the call stack when it returns from its task on N3?
Yes, in Crafty.
Or is it possible, once T1 is finished on N1, to give T2 the responsibility of also propagating the values at the end, so that T1 doesn't have to return to the node and could erase its stack?
That's an important part of DTS. It would be theoretically possible to implement it recursively, but it would be just as messy or messier than just doing an iterative search. Don't do it.
bob
Posts: 20943
Joined: Mon Feb 27, 2006 7:30 pm
Location: Birmingham, AL

Re: SMP and helpful master concept

Post by bob »

Codeman wrote:
bob wrote:There are two solutions.

In Crafty, I do what you are describing. I never have a process idle if others have work left to do.

In Ferret, Bruce used way more threads than processors, but had them blocked. Then when you split, the thread that is in control at that point will block, but it will first unblock another thread to take its place and continue to keep all cpus busy. I like my approach better as you don't have to try to figure out how many threads you need to start up front...
I understand Bruce's idea, though it seems a little bit messy. But how is the concept of DTS implemented in Crafty?
It isn't. I can't "poke around" to find the best split point, which was a key point of difference between DTS and YBW-type algorithms.


To stick with bruno causse's example: T1 stops working at N1, helps with the work at N3, and when finished (doing some more work in between if T2 still isn't done) goes back to N1 and finishes the job by propagating the values up.

How does this switching between tasks work for the master process? Are the functions from the first task (N1) still on the call stack when it returns from its task on N3?
This is complicated to explain but I'll give it a try. I have a function "ThreadWait()" where idle processes are sent when the engine initializes. There will initially be N-1 threads here. All they do is check their "pointer to work" and so long as it is zero, they spin. Whenever they are given work to do, their "pointer to work" is set to point to a "split block" that has a search requirement. This code simply calls SearchParallel() to do the search, and when that returns, we go back into the idle loop waiting for more work. So that part is pretty simple. Now for the "master" or original process.

When it discovers it has searched the first branch from any node, AND there are idle processes spinning away in ThreadWait(), it calls a function Thread() which determines how many idle processes there are and then allocates N+1 (N = number of idle processes) split blocks. It copies the current split block to each of those, and then stuffs a pointer to each into a different "pointer to work", which instantly fires the N idle processes off into a search. It then sets its own "pointer to work" to point to that extra split block, then calls ThreadWait(), which would normally spin waiting on work, but it has already given itself something to do. So it, too, calls SearchParallel() and now all threads are busy.

When any of the helper processes completes any available work and can find no more, SearchParallel() returns, which takes us back to ThreadWait() to wait for more work. When the original process runs out of work, it also returns to ThreadWait(), but since it is idle, it can be given work by any of the other threads that are busy.

The special case is that the original process is the one that must return from ThreadWait() and get back into its now-completed search to back things up. There is a special check that essentially says: "OK, I am fixing to spin with no work. If I am the 'master' of a pending split block (the split block that is the parent of the N+1 we allocated above), we cannot return until there are no more threads working on split blocks linked to that one." So we spin, and help others, and spin, and help others, until we get back to the "spin point" and suddenly notice "wow, nobody else is busy at my original split block now, so I can do a normal return and go back into the search code and finish up."

Sounds complicated. It is complicated. But it works flawlessly. The split blocks sort of show the parallel structure at any instant. The zero (0) split block is the original block containing the starting position. If 3 threads are idle when we complete searching the first branch at any ply (this satisfies the YBW condition), then I allocate 4 split blocks, 1-4, and link them to block 0, and the four threads search those in parallel. At the split ply, they use the parent split block, which gives a nice way to share the move list and which also contains the lock to use to access that shared info. It now looks like this:

plies 1 to N: block (0)
plies N+1 to whatever: blocks (1, 2, 3, 4) [one for each active thread]

At any ply below N+1, when a process becomes idle, another process can force it to "join in". If you look at the split block links, 0 is always the root; it will be linked to split blocks for processes that are working together at some point in the tree. One or more of _those_ split blocks can end up pointing to a smaller set of blocks for the processes that are helping that thread...

Messy to be sure, but there is very little accumulated idle time waiting on work. The only flaw, compared to DTS, is that I have to split when there is an idle processor and at least the first branch at a node has been searched already. With DTS I could look at all plies to find the best candidate. If I see a node at ply=2 where we have already searched 10 branches, that is a much better split point than one at ply N (near the frontier) where I just completed the first branch. In Crafty I have to take this latter split point. In Cray Blitz I could take whichever ply looked to be the best candidate for requiring a complete search.


Or is it possible, once T1 is finished on N1, to give T2 the responsibility of also propagating the values at the end, so that T1 doesn't have to return to the node and could erase its stack?

That is probably possible, but "erasing its stack" really means a _lot_ of returns to unwind things. I decided to keep things simpler by using the current approach.

The latter option seems more advantageous to me but I am not quite sure how to implement it recursively.
Edmund
Posts: 670
Joined: Mon Dec 03, 2007 3:01 pm
Location: Barcelona, Spain

Re: SMP and helpful master concept

Post by Edmund »

Thanks for the clarification, Zach and Bob.
A really interesting field of research indeed; I will take some time to think it over.

Edmund