SMP thread goes here

Mangar · Post by **Mangar** » Tue Oct 02, 2007 3:45 pm

Hi,

as I am doing the same thing as hgm I have a question to Bob´s implementation:

If you have only one flag to signal "break" how does a thread know when he should stop breaking? There may be a split point below the split point that causes the break.

On the other hand: Polling all split positions seems not to be that hard. I can´t measure a change if polling at horizont or at horizont - 1 or at horizont -2. Neither in speed nor in amount of nodes searched. It is smaller as the statistic noise.

On the other hand I only test with 4 cpu and have limited the re-re-help. There are never more than 5 split points on my stack.

Greetings Volker

bob · Post by **bob** » Tue Oct 02, 2007 5:49 pm

hgm wrote:True, but you could put it in the same cache line with the other (private) variables that are used every node, so you would be sure it always is an L1 hit. (Or if it would be a miss due to an interrupt / task switch, you would have had to process the miss anyway on behalf of the other variables). For a flag somewhere in shared memory that would not be the case.

You could also put the polling in a piece of code that would be executed only in nodes with a larger remaining depth. E.g. if you have a conditional statement to calculate if LMR should be done, one of the conditions would be something like if(d>5). And of course you could refrain from polling in QS, putting the code in the place where the non-captures are generated, rather than on entry of the node.

Let me remind you that L1 hits are _not_ one-cycle operations any longer. L2 hits are significantly worse... So any memory reference is significantly worse than a register reference, whether it is in L1 or farther away.

bob · Post by **bob** » Tue Oct 02, 2007 5:53 pm

Mangar wrote:Hi,

as I am doing the same thing as hgm I have a question to Bob´s implementation:

If you have only one flag to signal "break" how does a thread know when he should stop breaking? There may be a split point below the split point that causes the break.

On the other hand: Polling all split positions seems not to be that hard. I can´t measure a change if polling at horizont or at horizont - 1 or at horizont -2. Neither in speed nor in amount of nodes searched. It is smaller as the statistic noise.

On the other hand I only test with 4 cpu and have limited the re-re-help. There are never more than 5 split points on my stack.

Greetings Volker

Two different issues. I have one global abort flag that is used when time runs out, for example. For stopping a thread (or threads) at a split point, I have a stop flag in each split block. And I have the split blocks linked together in the form of a tree that corresponds exactly with who is splitting where. At a single split point, if one processor gets an unexpected fail-high, It does the following:

(1) set the local stop flag in each split block linked to the current node.

(2) for each split block in (1), set the local stop flag for any child split blocks they have as other processors could be helping them deeper in the sub-tree.

(3) repeat (2) for each split block where the stop flag is set.

That lets me terminate _everybody_ that is working at this split block or below, without bothering anybody that is working at some other unrelated part of the tree.

Hope that helps...

hgm · Post by **hgm** » Tue Oct 02, 2007 6:12 pm

bob wrote:Let me remind you that L1 hits are _not_ one-cycle operations any longer. L2 hits are significantly worse... So any memory reference is significantly worse than a register reference, whether it is in L1 or farther away.

Well, very nice to remind me. After nearly a month I had completely forgotten that.

Not sure why you remind me of this, though.

Surely you do not want to suggest that a register should be reserved for the counter that determines if it is time to probe the shared abort flag(s). Besides, you were supposed to argue that this probing-once-every-so-many-nodes was slow, not fast. And it doesn't seem to me that a shared flag could be kept in a register.

But, like you remark: the advantage of only incrementing a private counter as 'most common case' would be that it could be made a guaranteed L1 hit, which, as you say, is fast compared to L2 hits. While it is questionable if the abort flag would be permanently L1 cached, being referenced only once per node.

bob · Post by **bob** » Wed Oct 03, 2007 2:37 am

I remind you because you seem to imply regularly that if something hits in level 1 we can treat it as free. It isn't...

that was my only point...

bob · Post by **bob** » Wed Oct 03, 2007 3:46 am

Mangar wrote:Hi,

as I am currently optimizing Spike´s parallel altorithm I whant to join the discussion. I like to talk about the way I implemented it as it seems to be a bit other than most approaches.

Spike uses objects (it is not object oriented). Every data Spike needs for search is stored in a "search" object. Global data is declared as static.
Every thread gets an instance of this object.

Spike has a recursive search. But most search related data is stored in an array of positions and not in the system stack.
The data stored in a position is for example alpha, beta, current value, move list, amount of moves, current position value.

The data needed for a split position is as well stored in the stack array. There is no special data block for split points. :
1. A pointer to a stack position
2. A "volatile" thread count

We have split blocks since we don't have "objects" in normal C. You are doing exactly the same thing I assume, in that when you create a new search object, you have to seed it with data from an existing search object? Which is exactly what I do with the split blocks I use in pure C programming.

There are some implementations I added for speed but they are not neccessary for the algorithm. I will not mention those here.

Once a thread is beginning to search at a new position I call him "position owner". As a postition owner it will search the first move without any parallelism. After searching the first move it will check if the position he owns is a good-enough split position.
If yes he will. Set the psotion thread count to 1 signaling that any thread may help here.

Any thread not working is polling for thread counts > 0 at any position of any thread. If one or more positions with thread counts > 0 are found the best one is joined.
Joining is done by:
1. Increase the thread count of the owner position
2. Store a pointer to the owner position data
3. Doing all neccessairy moves to reach the position
4. Storing the split positions of the orignal thread up to this position (all those are indirect helped positions)
5. Start searching moves.

Leaving is done by
1. Undoing all moves done by joining
2. Setting the position pointer to zero
3. Decreasing the thread count
4. Only for the position owner: "wait" until the thread count is zero. (He is able to help any thread that is helping somewhere at this position but he cannot leave the position)

Sounds very similar to what I do except I am not in a C++ framework, but the basic work done is exactly the same and the way you implement YBW is the way I am doing it as well.

As many thread are able to change the position owner data, increase/decrease of thread count, getting moves, changing values, ... are protected by a critical section.

Cutoffs are handled by polling. Helping threads are polling every helped position for alpha >= beta.

In this implementation a thread has no need to know which thread is helping him at a position. But the other way round is neccessairy. A thread has to know where he is helping. Either directly or indirectl thus by helping a thread that itself is helping another thread.

Currently I have a speedup of about 3.05 with 4 CPU at a depth 14 search. It is measured by calculating 30 times 100 positions to a depth of 14 to reduce statistic noises.
Currently I have additional possibilities to increase speedup. These are:
1. Splitting at root (I don´t do yet)
2. Stopping sibling threads on alpha increase without beta-cutoff.
3. Heuristic when better help siblings than searching a whole new move.

Greeting Volker

Splitting at the root is a winner, but it does add complexity.

Mangar · Post by **Mangar** » Wed Oct 03, 2007 11:02 pm

Hi,

first thanks for all those answers. Then one more point to discuss (or to ask)

Where and when having locks for syncronisation at split points. I don´t know if I use the right name for it. With a lock I mean a synchronisation object (with "enter" and "leave") to secure critical sections that may not be changed by multiple threads at the same time. As far as I know only a write or a read to a machine integer (either 32 bit or 64 bit for 64 bit systems) is guaranteed to not be interrupted by another thread.

I need several locks. This is, maybe not efficient. I am not sure how much time they need but I think that idle time in my implementation is only possible by waiting for a lock. Once I used one lock for every thread (thus multiple split points used the same lock). Next I tried one lock for every split point. In a 4 CPU environment that didn´t change enough to measure a increase in speedup.

Now what I lock (or protect by a syncronisation object):

1. I hold a counter for every split point holding the amount of thread helping. I do lock the access to this counter. (I didn´t do it in first try before I knew that "i++" could be raced. Two i++ from different threads could result in on i++). I need to lock adding and removing a thread to a split point.

2. I lock the change of Alpha and MaxValue at every split point as if (CurVal > MaxVal) MaxVal = CurVal can be raced. The lock is only set if CurVal > MaxVal (I have a fail soft approach thats why I have a MaxVal that may be lower than Alpha). But inside the protected code I check for CurVal > MaxVal again to avoid races.

3. Taking the next move from the list of moves is protected by a lock as CurMove = MoveFld[MoveNo]; MoveNo++ can be raced.

Thats all. Maybe there are tricks to avoid locking at those points? On thing that may be is not implemented good: I do the locking even if there is currently only one thread at a "possible split point" position. In my implementation the a thread adds itself to a split point thus the first thread never knows if he is really the only one without locking.

It is a long time ago I took a loock at crafty code but I remember you have a assembler "interlock-exchange". What is it good for?

Greetings Volker

bob · Post by **bob** » Fri Oct 05, 2007 1:43 am

Mangar wrote:Hi,

first thanks for all those answers. Then one more point to discuss (or to ask)

Where and when having locks for syncronisation at split points. I don´t know if I use the right name for it. With a lock I mean a synchronisation object (with "enter" and "leave") to secure critical sections that may not be changed by multiple threads at the same time. As far as I know only a write or a read to a machine integer (either 32 bit or 64 bit for 64 bit systems) is guaranteed to not be interrupted by another thread.

I need several locks. This is, maybe not efficient. I am not sure how much time they need but I think that idle time in my implementation is only possible by waiting for a lock. Once I used one lock for every thread (thus multiple split points used the same lock). Next I tried one lock for every split point. In a 4 CPU environment that didn´t change enough to measure a increase in speedup.

Now what I lock (or protect by a syncronisation object):

1. I hold a counter for every split point holding the amount of thread helping. I do lock the access to this counter. (I didn´t do it in first try before I knew that "i++" could be raced. Two i++ from different threads could result in on i++). I need to lock adding and removing a thread to a split point.

2. I lock the change of Alpha and MaxValue at every split point as if (CurVal > MaxVal) MaxVal = CurVal can be raced. The lock is only set if CurVal > MaxVal (I have a fail soft approach thats why I have a MaxVal that may be lower than Alpha). But inside the protected code I check for CurVal > MaxVal again to avoid races.

3. Taking the next move from the list of moves is protected by a lock as CurMove = MoveFld[MoveNo]; MoveNo++ can be raced.

Thats all. Maybe there are tricks to avoid locking at those points? On thing that may be is not implemented good: I do the locking even if there is currently only one thread at a "possible split point" position. In my implementation the a thread adds itself to a split point thus the first thread never knows if he is really the only one without locking.

It is a long time ago I took a loock at crafty code but I remember you have a assembler "interlock-exchange". What is it good for?

Greetings Volker

What I have always done is to have a single lock built in to every split block. If I am doing a parallel search at some ply, in general each search has its own split block and nothing needs locking in there. But each split block points back to the same parent, and when any of the processors splitting together there need to update the parent split block, they use the lock in that split block to protect against simultaneous updates.

I never allow more than 4 processors to split at a single point, which means that there will never be more than four processors contending for the same lock. And since I also restrict how deeply in the tree I split, I further reduce the chance for mutual exclusion waiting by making sure each processor has a significant amount of work to do before needing to "report back and lock something".

Mangar · Post by **Mangar** » Fri Oct 05, 2007 7:44 am

Hi,

is there any call to a lock at positions currently not searched in parallel? A second thread may join at any time, updating the parent split point. Thus the thread owning the parent split point must take care of it.

Greetings Volker

bob · Post by **bob** » Fri Oct 05, 2007 11:07 pm

Mangar wrote:Hi,

is there any call to a lock at positions currently not searched in parallel? A second thread may join at any time, updating the parent split point. Thus the thread owning the parent split point must take care of it.

Greetings Volker

Not in my code. Another thread can't spontaneously "split" with an active thread, the active thread has to "invite him in" and start using the lock at that ply for future actions taken there.

SMP thread goes here

Re: SMP thread goes here

Re: SMP thread goes here

Re: SMP thread goes here

Re: SMP thread goes here

Re: SMP thread goes here

Re: SMP thread goes here

Re: SMP thread goes here

Re: SMP thread goes here

Re: SMP thread goes here

Re: SMP thread goes here