SMP thread goes here

Discussion of chess software programming and technical issues.

Moderators: hgm, Rebel, chrisw

bob
Posts: 20943
Joined: Mon Feb 27, 2006 7:30 pm
Location: Birmingham, AL

Re: SMP thread goes here

Post by bob »

hgm wrote:
bob wrote:It would seem to me that if you can "selectively broadcast" then you must know who is splitting with whom, else you will unintentionally stop the wrong processors here and there.
The essence of broadcasting is that you inject a message into a common medium, without knowing who is listening (or in fact, if anyone is listening at all). It is the listener who decides if he is going to pick up the signal.

In this particular case, the 'medium' would be the particular split node in which the cut-off occurs. THe thread stumbling on the cutoff would set a flag there. Only threads working at the node would poll that flag, the other threads would be unaware of it, and could thus never abort their search in reaction to it. They would be polling their own split nodes.

But even the listeners would not be aware with whom they were splitting. They only have to know _where_ they split.
what about the case of the processors working _below_ that split point at another one? For example you split at ply=3 using processors 1, 2. Later processor 3 becomes idle and splits with processor 1 or 2 at ply=7. When you get the fail high at ply=3, how do you make your broadcast go beyond 1 and 2, and pick up 3 which is not splitting at that point, but which is working below that point and needs to be stopped as well?

That was the gist of my question about your "broadcast" because just doing this at a split point is not good enough. You have to consider the splits that might have happened _below_ that split point and take care of those at the same instant. Which means you somehow have to know who is doing what and where it is being done, so when you decide to stop a sub-tree search, you know who is involved in that subtree at any level, and get 'em all.

I suspect you are somehow not considering that problem...
User avatar
hgm
Posts: 27789
Joined: Fri Mar 10, 2006 10:06 am
Location: Amsterdam
Full name: H G Muller

Re: SMP thread goes here

Post by hgm »

I even already explained how I would solve that problem, above. :shock:

If a broadcast occurs at the split point at ply 3, both CPU #1 and #2 pick it up. Say CPU #3 had a splitpoint with #1 at ply 7. Since CPU #1 would have recieved the broadcast, it would now start cleaning up its tree below ply 3. In the process it would close down the split point at ply 7. That would cause it to re-broadcast the abort message there, so that CPU #3 would pick it up.
bob
Posts: 20943
Joined: Mon Feb 27, 2006 7:30 pm
Location: Birmingham, AL

Re: SMP thread goes here

Post by bob »

hgm wrote:I even already explained how I would solve that problem, above. :shock:

If a broadcast occurs at the split point at ply 3, both CPU #1 and #2 pick it up. Say CPU #3 had a splitpoint with #1 at ply 7. Since CPU #1 would have recieved the broadcast, it would now start cleaning up its tree below ply 3. In the process it would close down the split point at ply 7. That would cause it to re-broadcast the abort message there, so that CPU #3 would pick it up.
Then how is that different than what I have been discussing all along? The only advantage my data structure offers is that anyone can look at the organization of the split blocks to see who is doing what to whom. It makes debugging easier as when something hangs (a common occurrence early in a parallel search development) anybody can dump the "hierarchy" to show who has split with whom at what ply. The blocks can be sanity-checked when playing with a known bug to isolate where something becomes broken. Etc.

I can still think of sticky problems however.

1 and 2 split at ply 3. Then 2 and 4 split at ply=7. 1 finishes and splits with 4 at ply=10. now 2 gets a fail high at ply=3. How does 1 discover this since it has finished at ply=3 already and is working elsewhere. Suppose it is not working on a part of the current sub-tree, but was when it split at ply=3? In my approach, I simply set the stop flag for all siblings at ply=3, then for each sibling, I check to see if they have split somewhere deeper and if so, I recursively repeat the process, stopping all siblings at those split points. In debug mode I can display all of this as it happens to help me pinpoint hangups and deadlocks or crashes quickly.

I think organizing things with simple links is beyond easy, and offers a wealth of debugging assistance when the wrong processor hangs up and you can't find out what it is working on because of that. In my split block scheme, anybody can display the split state of the entire tree very easily and accurately.

Nothing says your approach is wrong, of course. But once you have done this enough, you will realize just how critical the debugging issues are, as there is no other parallel algorithm that offers the same level of potential problems as traversing an alpha/beta tree. It's an inherently sequential algorithm, being run on a parallel platform. In any case, having some global state that can be displayed for debugging is a real advantage that is difficult to survive without.
User avatar
hgm
Posts: 27789
Joined: Fri Mar 10, 2006 10:06 am
Location: Amsterdam
Full name: H G Muller

Re: SMP thread goes here

Post by hgm »

bob wrote:1 and 2 split at ply 3. Then 2 and 4 split at ply=7. 1 finishes and splits with 4 at ply=10. now 2 gets a fail high at ply=3. How does 1 discover this since it has finished at ply=3 already and is working elsewhere. Suppose it is not working on a part of the current sub-tree, but was when it split at ply=3?
I don't get it.

#1 does no longer have to know there is a fail high at ply=3, as it is no longer working on that split point.

#2 cannot get a fail high at ply=3 without the search of the subtree it was working on has been completed. Part of that tree is searched by #4, which again delegated part of the work to #1. The fail high at ply=3 of the move searched by #2 cannot occur before #2 and #4 have finished their part, and the result of it has been passed (via #4) to #2. If #2 finds a fail high at ply=7, it will broadcast from ply=7, which will inform #4. Who again broadcasts at ply=10, to recall #1. Fail highs of #2 above ply=7 do not affect #4 or #1 before they propagate to ply=7 (if they do). Fail highs of #2 above ply=7 cannot occur, as #2 cannot return from ply=7 without #4 contributing its score to it (or at least being informed by the broadcast that its score is no longer needed).

Do you mean that #1 did not have anything to do at ply=3, but cannot return from it because it owns it and would be responsible for returning its score (which it cannot do that before #2 is done)? Then I would not really say that #1 has finished at ply=3. It would still keep ply=3 on its stack of splitpoints, together with the new splitpoint it starts working, ply=10. It would have to keep polling both. This is what I see as the main disadvantage of this approach, that the number of split points on the private split-point stacks can become large, and you have to keep polling them all. To ameliorate this I suggested the alternate polling.
bob
Posts: 20943
Joined: Mon Feb 27, 2006 7:30 pm
Location: Birmingham, AL

Re: SMP thread goes here

Post by bob »

hgm wrote:
bob wrote:1 and 2 split at ply 3. Then 2 and 4 split at ply=7. 1 finishes and splits with 4 at ply=10. now 2 gets a fail high at ply=3. How does 1 discover this since it has finished at ply=3 already and is working elsewhere. Suppose it is not working on a part of the current sub-tree, but was when it split at ply=3?
I don't get it.

#1 does no longer have to know there is a fail high at ply=3, as it is no longer working on that split point.

#2 cannot get a fail high at ply=3 without the search of the subtree it was working on has been completed.
Certainly it can. You split at ply=3, you search two moves m1 and m2 at that ply. my fails low, which causes ply-3 to fail high, searching m2 and the others is not necessary and we don't need those results to fail high...

Part of that tree is searched by #4, which again delegated part of the work to #1. The fail high at ply=3 of the move searched by #2 cannot occur before #2 and #4 have finished their part, and the result of it has been passed (via #4) to #2. If #2 finds a fail high at ply=7, it will broadcast from ply=7, which will inform #4. Who again broadcasts at ply=10, to recall #1. Fail highs of #2 above ply=7 do not affect #4 or #1 before they propagate to ply=7 (if they do). Fail highs of #2 above ply=7 cannot occur, as #2 cannot return from ply=7 without #4 contributing its score to it (or at least being informed by the broadcast that its score is no longer needed).

Do you mean that #1 did not have anything to do at ply=3, but cannot return from it because it owns it and would be responsible for returning its score (which it cannot do that before #2 is done)? Then I would not really say that #1 has finished at ply=3. It would still keep ply=3 on its stack of splitpoints,

That is my point. But currently, it is working deeper in the tree, not knowing that it needs to stop. Because it is not checking the split point at 3 if I read your explanation right, it is only checking the current split point which is far deeper in the tree...

together with the new splitpoint it starts working, ply=10. It would have to keep polling both. This is what I see as the main disadvantage of this approach, that the number of split points on the private split-point stacks can become large, and you have to keep polling them all. To ameliorate this I suggested the alternate polling.
I suggest no polling whatsoever and getting rid of the idea completely. I only do this stuff when I know that others need notification, and everybody checks the current split block "stop" flag frequently, which comes from cache and costs nothing until it gets invalidated by someone posting the stop value.
User avatar
hgm
Posts: 27789
Joined: Fri Mar 10, 2006 10:06 am
Location: Amsterdam
Full name: H G Muller

Re: SMP thread goes here

Post by hgm »

bob wrote:..., and everybody checks the current split block "stop" flag frequently, ....
Well, that is what I call 'polling'. The only way to avoid it would be to rely on signals / interrupts. That seems even less efficient, so there is no way to get rid of polling.

It is just a matter of who is polling what, and who will write the variables being polled. In your scheme there is only one poller, and many split nodes that could cause writing of the flag they poll. In my scheme there are multiple pollers to each flag, and only one split node that causes writing there. That saves me the trouble of (and the inter-cache traffic associated with) keeping track of who is working on each split node.

The question of course is if it wouldn't make the polling much more expensive. That depends on the poll frequency, although polling in itself is not very expensive as long as the variable is not written.

And you did _not_ read the explanation right. I would not only poll the flag of the 'current' split block, but the flags of all split blocks. This is why the threads maintain stacks of split blocks, to know what to poll.
bob
Posts: 20943
Joined: Mon Feb 27, 2006 7:30 pm
Location: Birmingham, AL

Re: SMP thread goes here

Post by bob »

hgm wrote:
bob wrote:..., and everybody checks the current split block "stop" flag frequently, ....
Well, that is what I call 'polling'. The only way to avoid it would be to rely on signals / interrupts. That seems even less efficient, so there is no way to get rid of polling.

It is just a matter of who is polling what, and who will write the variables being polled. In your scheme there is only one poller, and many split nodes that could cause writing of the flag they poll. In my scheme there are multiple pollers to each flag, and only one split node that causes writing there. That saves me the trouble of (and the inter-cache traffic associated with) keeping track of who is working on each split node.

The question of course is if it wouldn't make the polling much more expensive. That depends on the poll frequency, although polling in itself is not very expensive as long as the variable is not written.

And you did _not_ read the explanation right. I would not only poll the flag of the 'current' split block, but the flags of all split blocks. This is why the threads maintain stacks of split blocks, to know what to poll.
#1 principle for a software/hardware designer:

Make the common case fast

My search threads poll exactly one value, which sticks in cache pretty well. Polling multiple variables, in order to avoid doing a little extra work when you want to stop everyone, violates the above principle. I don't have to stop everyone very often. Thousands of times is a large number in a 3 minute search, so it is infrequent. I have to poll every node in order to stop immediately and avoid extra/unnecessary overhead.

That is why I do it as I do. The stops are so infrequent the overhead there is irrelevant. The "checks" (polls) happen so often, I made that as efficient as possible...

Here's some numbers from the last CCT9 event, where searches were over a minute...

SMP-> splits=1467 aborts=139 data=7/128 elap=1:17
SMP-> splits=390 aborts=57 data=5/128 elap=1:36
SMP-> splits=2807 aborts=273 data=5/128 elap=1:36
SMP-> splits=2859 aborts=350 data=5/128 elap=1:33
SMP-> splits=2248 aborts=219 data=6/128 elap=1:40
SMP-> splits=2841 aborts=252 data=5/128 elap=1:24
SMP-> splits=803 aborts=128 data=6/128 elap=1:21
SMP-> splits=440 aborts=44 data=5/128 elap=18.43
SMP-> splits=126 aborts=11 data=4/128 elap=1:04
SMP-> splits=2292 aborts=197 data=5/128 elap=49.31
SMP-> splits=792 aborts=86 data=5/128 elap=1:17
SMP-> splits=827 aborts=66 data=4/128 elap=1:17
SMP-> splits=1254 aborts=100 data=5/128 elap=1:16
SMP-> splits=50 aborts=10 data=3/128 elap=1:15
SMP-> splits=229 aborts=36 data=7/128 elap=40.55
SMP-> splits=699 aborts=60 data=4/128 elap=56.69
SMP-> splits=138 aborts=4 data=4/128 elap=14.14

As you can see, the number of "aborts" (that's the number of times I had to tell other processors at some split point to "cease and desist" is a pretty small number...

Certainly when compared to the total nodes searched which would be the number of times I have to check for the abort flag...
User avatar
hgm
Posts: 27789
Joined: Fri Mar 10, 2006 10:06 am
Location: Amsterdam
Full name: H G Muller

Re: SMP thread goes here

Post by hgm »

Thanks for the numbers. I guess you must split pretty close to the root, to have such a low number of splits when, no doubt, the number of nodes was in the hundreds of millions. This might of course be worse if splitting would also occur often closer to the leaves. (Which is obviously worse for communication overhead, but perhaps you would earn it back in search overhead, who knows?)

The principle is a good one, but as far as I am concerned, it is in the early development phases secondary to "keep it simple". First achieve correctness, then optimize for speed. And "common cas fast" is a speed issue.

Considering the extremely low abort rate, I am surprised you choose for polling a flag at all. This violates your own principle! You could have reduced the common case of polling, which is done hundreds of million times per search, to absolute zero, by not doing it at all. To implement the abort you could simply send a signal to all threads working at the split node. (Or in fact, all threads in your engine, as the aborts happen so infrequently that it is pretty much immaterial how inefficiently you handle it.) The threads could then only poll their flags in the interrupt that occurs when they catch the signal.

You would of course be dependent on how efficiently our OS implements signals; if it would really interrupt the target thread immediately to notify it. or if it would just set a flag and wait for the thread to poll it during the clock interrupt. The latter would not be so good.

Not that my idea to keep the common case fast was the alternate polling scheme. The common case there would be to simply increment a counter and take the branch because you don't want to poll yet. In the less common case where you would want to poll, you would still poll only the one whose turn it is. This doesn't sound like more work than polling your abort flag. (But it is of course slower than doing nothing at all.)
bob
Posts: 20943
Joined: Mon Feb 27, 2006 7:30 pm
Location: Birmingham, AL

Re: SMP thread goes here

Post by bob »

hgm wrote:Thanks for the numbers. I guess you must split pretty close to the root, to have such a low number of splits when, no doubt, the number of nodes was in the hundreds of millions. This might of course be worse if splitting would also occur often closer to the leaves. (Which is obviously worse for communication overhead, but perhaps you would earn it back in search overhead, who knows?)
I split _everywhere_ but in the last N plies (currently N=4, but this varies depending on the machine configuration, I use 4 for 1-8 cores at least). I can split at the root, or anywhere down in the tree until the last 4 plies. So I don't just split at shallow depths, I split anywhere except near the leaves...

[/quote]

The principle is a good one, but as far as I am concerned, it is in the early development phases secondary to "keep it simple". First achieve correctness, then optimize for speed. And "common cas fast" is a speed issue.[/quote]

Actually it isn't. It is far easier to just do something like:

if (tree->abort) return(0);

at the top of search, rather than having to loop across multiple instances of that stop flag. IMHO.

Considering the extremely low abort rate, I am surprised you choose for polling a flag at all. This violates your own principle! You could have reduced the common case of polling, which is done hundreds of million times per search, to absolute zero, by not doing it at all. To implement the abort you could simply send a signal to all threads working at the split node. (
Signals are not safe. Been there, done that, got the T-shirt. They work fine in single-threaded programs. But not in multiple-threads. They don't get delivered instantly, there are timing issues (race conditions) involved. Etc. I tried them back in the 80's in Cray Blitz and gave up. I tried them in Crafty (the alarm signal is a neat way to catch "time is up" with zero overhead) and gave up as they were unreliable in parallel programs.

Or in fact, all threads in your engine, as the aborts happen so infrequently that it is pretty much immaterial how inefficiently you handle it.) The threads could then only poll their flags in the interrupt that occurs when they catch the signal.
The latency is high, and many libraries will fail when a signal interrupts a system call and the EINTR value gets returned, etc. Not a good solution IMHO from several prior attempts.

You would of course be dependent on how efficiently our OS implements signals; if it would really interrupt the target thread immediately to notify it. or if it would just set a flag and wait for the thread to poll it during the clock interrupt. The latter would not be so good.
This failed under both windows, solaris, linux, irix, aix, and others. I did a ton of testing trying out various solutions and finally gave up on the signal idea.

Not that my idea to keep the common case fast was the alternate polling scheme. The common case there would be to simply increment a counter and take the branch because you don't want to poll yet. In the less common case where you would want to poll, you would still poll only the one whose turn it is. This doesn't sound like more work than polling your abort flag. (But it is of course slower than doing nothing at all.)
User avatar
hgm
Posts: 27789
Joined: Fri Mar 10, 2006 10:06 am
Location: Amsterdam
Full name: H G Muller

Re: SMP thread goes here

Post by hgm »

The same story all over again. So operating systems are crappy... It doesn't seem so difficult to program decent signals. Oh well, let's not go into that...

Yes, your polling is ultra-simple code. But of course you need lots of code lines to set up the linked lists of split blocks. In my scheme that part is nonexistent. And later you have to walk that tree to send all the signals. That is not trivial either.

Of course you can argue that these lines are almost never executed, but that doesn't make the code any simpler. A bug there would still bring the program to a grinding halt, even if it is executed once per game...

My probe line would be something like

Code: Select all

if(!count--)
{ if(flag[nr]) { count=0; return(0); }
   count = PROBE_INTERVAL;
   if(!nr--) nr = MaxSpliNode;
}
Not so horribly complex either, and most of the time only a decrement and branch. And it would save lots of complexity elsewhere.