SMP thread goes here

Discussion of chess software programming and technical issues.

Moderators: hgm, Rebel, chrisw

bob
Posts: 20943
Joined: Mon Feb 27, 2006 7:30 pm
Location: Birmingham, AL

SMP thread goes here

Post by bob »

Let's re-start the SMP thread as things are now all on the right-hand side, and with several joining in, it is hard to see what goes with what...
Zach Wegner
Posts: 1922
Joined: Thu Mar 09, 2006 12:51 am
Location: Earth

Re: SMP thread goes here

Post by Zach Wegner »

My apologies, I always read in chronological order. :) Here's my reply from the other thread:

bob wrote:I don't like the stack idea although I had originally considered it. The problem is this. A single thread splits at A, then B, then C, then D, then E. And while it is working on E, the other thread working with it at B completes and there is nothing else left to do. I simply clean that up immediately, where in a stack approach you either leave it sitting around which causes wildly excessive split block usage on large systems, or you violate the stack principle and squeeze that entry out instead.
I don't follow. When thread 2 finishes SP B, I detach it from that SP. If it is the last thread, then one of the threads would have already unsplitted every SP below that one. In my approach I assume that whenever an SP is deallocated every SP below it in the tree is already gone. I also unsplit as soon as the number of processors at an SP drops to 1. So, whenever a processor finishes working on a node, it either unsplits in the case that it is the last one (which only happens when it finishes before it gets the chance to receive the UNSPLIT message), or it detaches from every split point and enters a waiting state after telling the last processor to unsplit. I don't see any more split point usage than in your scheme, just more children on each split point. It seems that the way I do it is similar to the way you do it in Crafty, but handling the split stack and the search stack separately so that the master processor doesn't have to hold onto a split point until all the others are done.
It is just an issue of data structures to me, and with my approach, it is easy to take a snapshot to see who is doing what while debugging, since any processor/thread can see every split block and dump the output as it exists in that instant... From experience (my first parallel search was done on a Univac 1100/42 in 1978) I wanted to make debugging as easy as possible, which means when something hangs, "somebody" can display the state of the tree, who is working where and for whom, etc...
I see the same thing. Every active processor in my scheme has the exact same first split point. The only problem with mine is that I can only see the search state below the split point, because the stack actively being used for each process isn't in shared memory. But I can still see the tree of processors, and which split points each is working for...

By the way, this gives me ideas for my interactive debugging facility. Currently it only works with single processors, but I suppose I could change it so that at every debug break point, I could set a "pause" flag. Every child process could then copy its tree state to shared memory and wait for the flag to be cleared. I still have a long way to go with that...
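Something along these lines, perhaps (just a sketch with made-up names; nothing like this exists in my code yet):

Code: Select all

#define MAX_CPUS 16

/* Hypothetical snapshot of whatever tree state should be visible to the debugger. */
typedef struct {
    int ply;
    int current_move;
} TREE_SNAPSHOT;

volatile int debug_pause;                 /* set and cleared by the debugger process */
TREE_SNAPSHOT snapshot[MAX_CPUS];         /* one slot per process, in shared memory  */

void check_debug_pause(int my_id, const TREE_SNAPSHOT *my_state)
{
    if (debug_pause) {
        snapshot[my_id] = *my_state;      /* publish my tree state                   */
        while (debug_pause)               /* spin until the debugger clears the flag */
            ;
    }
}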
bob
Posts: 20943
Joined: Mon Feb 27, 2006 7:30 pm
Location: Birmingham, AL

Re: SMP thread goes here

Post by bob »

Zach Wegner wrote:My apologies, I always read in chronological order. :) Here's my reply from the other thread:

bob wrote:I don't like the stack idea although I had originally considered it. The problem is this. A single thread splits at A, then B, then C, then D, then E. And while it is working on E, the other thread working with it at B completes and there is nothing else left to do. I simply clean that up immediately, where in a stack approach you either leave it sitting around which causes wildly excessive split block usage on large systems, or you violate the stack principle and squeeze that entry out instead.
I don't follow. When thread 2 finishes SP B, I detach it from that SP. If it is the last thread, then one of the threads would have already unsplitted every SP below that one. In my approach I assume that whenever an SP is deallocated every SP below it in the tree is already gone. I also unsplit as soon as the number of processors at an SP drops to 1. So, whenever a processor finishes working on a node, it either unsplits in the case that it is the last one (which only happens when it finishes before it gets the chance to receive the UNSPLIT message), or it detaches from every split point and enters a waiting state after telling the last processor to unsplit. I don't see any more split point usage than in your scheme, just more children on each split point. It seems that the way I do it is similar to the way you do it in Crafty, but handling the split stack and the search stack separately so that the master processor doesn't have to hold onto a split point until all the others are done.
It is just an issue of data structures to me, and with my approach, it is easy to take a snapshot to see who is doing what while debugging, since any processor/thread can see every split block and dump the output as it exists in that instant... From experience (my first parallel search was done on a Univac 1100/42 in 1978) I wanted to make debugging as easy as possible, which means when something hangs, "somebody" can display the state of the tree, who is working where and for whom, etc...
I see the same thing. Every active processor in my scheme has the exact same first split point. The only problem with mine is that I can only see the search state below the split point, because the stack actively being used for each process isn't in shared memory. But I can still see the tree of processors, and which split points each is working for...

By the way, this gives me ideas for my interactive debugging facility. Currently it only works with single processors, but I suppose I could change it so that at every debug break point, I could set a "pause" flag. Every child process could then copy its tree state to shared memory and wait for the flag to be cleared. I still have a long way to go with that...
I am not sure how you can "unsplit all blocks below that point..."

TID1 searches to ply 3 using split block A. It then splits with TID2 and uses split blocks B and C. Later TID3 splits with TID1 (which is on split block B) and gets split blocks D and E. TID4 splits with TID2 (which is on split block C) and gets blocks F and G. This can happen over and over. What happens to split blocks at ply-3 when TID2 finishes its branch but TID1 is still busy and still has split blocks _below_ ply 3 that are active???

I allow splits anywhere, recursively. T1 splits with T2. T1 runs out of work and again splits farther down in the sub-tree T2 is working on. T1 or T2 runs out of work and splits still further down. Those split blocks add up.

On 2 processor systems, I rarely see more than 8 split blocks used. On 4-processor systems this runs up to 55-60. On 8 processor systems, I have run out with a max of 256. On 16 processor systems it _really_ starts to add up.

In your "every processor in my program has the exact same first split point" that is also something I don't generally allow unless I only have 4 processors or less. Otherwise I never let more than 4 work at the same split point as it becomes too inefficient, particularly when there are not enough moves and doing a split produces threads that can't find a thing to do... It sounds like we have some basic differences in how we split the tree. I chose to follow the "a split can occur anywhere, regardless of what has already been split" type of design...
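To make the "too inefficient" part concrete, the decision to split includes a test along these lines (a sketch with made-up names, not the real code):

Code: Select all

#define MAX_THREADS_PER_SPLIT 4

/* Sketch of a split-eligibility test (made-up names). */
int acceptable_split(int workers_already_here, int moves_remaining, int idle_threads)
{
    if (idle_threads == 0)
        return 0;               /* nobody available to help anyway                       */
    if (workers_already_here >= MAX_THREADS_PER_SPLIT)
        return 0;               /* more than 4 at one split point is wasteful            */
    if (moves_remaining < 2)
        return 0;               /* too few moves left; helpers would find nothing to do  */
    return 1;
}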
Zach Wegner
Posts: 1922
Joined: Thu Mar 09, 2006 12:51 am
Location: Earth

Re: SMP thread goes here

Post by Zach Wegner »

bob wrote:I am not sure how you can "unsplit all blocks below that point..."

TID1 searches to ply 3 using split block A. It then splits with TID2 and uses split blocks B and C. Later TID3 splits with TID1 (which is on split block B) and gets split blocks D and E. TID4 splits with TID2 (which is on split block C) and gets blocks F and G. This can happen over and over. What happens to split blocks at ply-3 when TID2 finishes its branch but TID1 is still busy and still has split blocks _below_ ply 3 that are active???
First I will clear up a little semantic difference we have here. It seems your split blocks are allocated one for each processor, one for each split point. Mine are just one for every split point. What would happen in this situation is that when T3 and T4 join those split points, they add the first split point to their own stack. I will call the first split point (your split blocks B+C) split point A. Then D+E become B, and F+G become C. So this means that of the four processors, T1 and T3 will have a split point stack of { NULL, A, B }, and T2 and T4 will have {NULL, A, C }. Split point A will have four children, but only two moves from that node are currently being searched as two processors are split further down the tree along each path. So when T2 backs up to ply 3, it and T4 have already detached from SP C. When it detaches from SP A, there will still be two processors below it, so it won't be deallocated yet. In other words, a split point is only deallocated when there is at most one processor searching the tree below that node.
I allow splits anywhere, recursively. T1 splits with T2. T1 runs out of work and again splits farther down in the sub-tree T2 is working on. T1 or T2 runs out of work and splits still further down. Those split blocks add up.

On 2 processor systems, I rarely see more than 8 split blocks used. On 4-processor systems this runs up to 55-60. On 8 processor systems, I have run out with a max of 256. On 16 processor systems it _really_ starts to add up.

In your "every processor in my program has the exact same first split point" that is also something I don't generally allow unless I only have 4 processors or less. Otherwise I never let more than 4 work at the same split point as it becomes too inefficient, particularly when there are not enough moves and doing a split produces threads that can't find a thing to do... It sounds like we have some basic differences in how we split the tree. I chose to follow the "a split can occur anywhere, regardless of what has already been split" type of design...
I think our approaches are pretty similar, just the data structures used which are different. Keep in mind I am using iterative search. I don't restrict processors to work at the same node, I just put a pointer to the split point on its stack. That way, when there are two processors at a split point, either one of them can return from that split point and be able to make the split point before it their active one. To give an example: T1 and T2 search split point A. T1 splits with T3 below SP A, at SP B. Now when T1 finishes its moves at SP B, it must go into an idle state to find another split point. How does T3 return from SP B? It must know that there is another split point before it, so it can back up a valid score to it.
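In code terms the bookkeeping comes down to something like this (just a sketch with made-up names, not my actual structures):

Code: Select all

#define MAX_CPUS 16

/* Sketch only -- made-up names. */
typedef struct {
    volatile int child_count;          /* processors still attached below this node   */
    volatile int is_child[MAX_CPUS];
    volatile int unsplit_request;      /* asks the last processor left to free the SP */
    volatile int lock;                 /* simple spinlock word                        */
} SP;

static void sp_lock(SP *sp)   { while (__sync_lock_test_and_set(&sp->lock, 1)) ; }
static void sp_unlock(SP *sp) { __sync_lock_release(&sp->lock); }

void detach(int my_id, SP *sp)
{
    int remaining;

    sp_lock(sp);
    sp->is_child[my_id] = 0;
    remaining = --sp->child_count;
    sp_unlock(sp);

    if (remaining <= 1)                /* at most one processor still searching below */
        sp->unsplit_request = 1;       /* so it can unsplit and free the split point  */
}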
bob
Posts: 20943
Joined: Mon Feb 27, 2006 7:30 pm
Location: Birmingham, AL

Re: SMP thread goes here

Post by bob »

Zach Wegner wrote:
bob wrote:I am not sure how you can "unsplit all blocks below that point..."

TID1 searches to ply 3 using split block A. It then splits with TID2 and uses split blocks B and C. Later TID3 splits with TID1 (which is on split block B) and gets split blocks D and E. TID4 splits with TID2 (which is on split block C) and gets blocks F and G. This can happen over and over. What happens to split blocks at ply-3 when TID2 finishes its branch but TID1 is still busy and still has split blocks _below_ ply 3 that are active???
First I will clear up a little semantic difference we have here. It seems your split blocks are allocated one for each processor, one for each split point. Mine are just one for every split point.

My "split blocks" are the local memory required to hold the private tree state each thread needs while searching. Where are you putting that data (the board position, repetition list, and so forth)??

As each thread searches, it needs its own board data structure, hash signature stuff, and several other things that have to be stored in different memory addresses for each active thread. That's what goes in my split blocks, and that's why when I split I make multiple split blocks, so that I can do multiple simultaneous searches without interference.
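Schematically, each split block holds something along these lines (field names made up for illustration, not Crafty's actual layout):

Code: Select all

#define MAX_CPUS  16
#define MAX_MOVES 256

typedef unsigned long long BITBOARD;
typedef int MOVE;                                           /* placeholder move encoding */
typedef struct { BITBOARD occupied[2]; int wtm; } POSITION; /* placeholder board         */

/* Illustrative only -- not the actual Crafty split block. */
typedef struct split_block {
    POSITION pos;                      /* private copy of the board for this thread  */
    BITBOARD hash_signature;           /* private hash signature                     */
    BITBOARD repetition_list[256];     /* private repetition history                 */
    MOVE     move_list[MAX_MOVES];     /* shared move list at the split ply          */
    struct split_block *parent;        /* split block this one was split from        */
    int      thread_id;                /* owner of this block                        */
    volatile int stop;                 /* "stop now and back out"                    */
    volatile int lock;                 /* protects the shared move list              */
} SPLIT_BLOCK;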

What would happen in this situation is that when T3 and T4 join those split points, they add the first split point to their own stack. I will call the first split point (your split blocks B+C) split point A. Then D+E become B, and F+G become C. So this means that of the four processors, T1 and T3 will have a split point stack of { NULL, A, B }, and T2 and T4 will have {NULL, A, C }. Split point A will have four children, but only two moves from that node are currently being searched as two processors are split further down the tree along each path. So when T2 backs up to ply 3, it and T4 have already detached from SP C. When it detaches from SP A, there will still be two processors below it, so it won't be deallocated yet. In other words, a split point is only deallocated when there is at most one processor searching the tree below that node.
I allow splits anywhere, recursively. T1 splits with T2. T1 runs out of work and again splits farther down in the sub-tree T2 is working on. T1 or T2 runs out of work and splits still further down. Those split blocks add up.

On 2 processor systems, I rarely see more than 8 split blocks used. On 4-processor systems this runs up to 55-60. On 8 processor systems, I have run out with a max of 256. On 16 processor systems it _really_ starts to add up.

In your "every processor in my program has the exact same first split point" that is also something I don't generally allow unless I only have 4 processors or less. Otherwise I never let more than 4 work at the same split point as it becomes too inefficient, particularly when there are not enough moves and doing a split produces threads that can't find a thing to do... It sounds like we have some basic differences in how we split the tree. I chose to follow the "a split can occur anywhere, regardless of what has already been split" type of design...
I think our approaches are pretty similar, just the data structures used which are different. Keep in mind I am using iterative search. I don't restrict processors to work at the same node, I just put a pointer to the split point on its stack. That way, when there are two processors at a split point, either one of them can return from that split point and be able to make the split point before it their active one. To give an example: T1 and T2 search split point A. T1 splits with T3 below SP A, at SP B. Now when T1 finishes its moves at SP B, it must go into an idle state to find another split point. How does T3 return from SP B? It must know that there is another split point before it, so it can back up a valid score to it.
OK an iterated search makes a difference, although it would seem that each processor needs its own private memory, probably one copy per thread, period, indexed by thread-ID or something similar? That was how Cray Blitz looked... But it does make it difficult to discuss some specifics since we have such a basically different way of handling the parallel search.
Zach Wegner
Posts: 1922
Joined: Thu Mar 09, 2006 12:51 am
Location: Earth

Re: SMP thread goes here

Post by Zach Wegner »

bob wrote:My "split blocks" are the local memory required to hold the private tree state each thread needs while searching. Where are you putting that data (the board position, repetition list, and so forth)??

As each thread searches, it needs its own board data structure, hash signature stuff, and several other things that have to be stored in different memory addresses for each active thread. That's what goes in my split blocks, and that's why when I split I make multiple split blocks, so that I can do multiple simultaneous searches without interference.
I use processes and shared memory for starters, so this is a little different for me. The local data for each processor remains in place throughout runtime (i.e. I have it as static globals for each process, no pointers). Each split point has a copy of the board state up to that point, including position, game history, and the search stack. Here is my split point structure:

Code: Select all

typedef struct SPLIT_POINT
{
    int id;
    BOOL active;
    struct BOARD board;
    volatile int child_count;
    volatile BOOL update[MAX_CPUS];
    volatile BOOL is_child[MAX_CPUS];
    volatile MOVE move_list[256];
    SEARCH_BLOCK * volatile sb;
    volatile BOOL no_moves_left;
    LOCK_T lock;
} SPLIT_POINT;
So the local data is copied to the split point, and then copied from the split point to other processors local data. This is possible because I maintain within BOARD a stack of split points, so that processors do not only know the immediate split point they are working on, but every one before it in the tree.
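The attach side is then just a copy in the other direction, roughly like this (hypothetical names; lock()/unlock() stand in for whatever LOCK_T provides):

Code: Select all

/* Sketch only: an idle process attaching to an existing split point. */
struct BOARD local_board;               /* this process's static local state          */

void attach_to_split_point(int my_id, SPLIT_POINT *sp)
{
    lock(&sp->lock);                    /* hypothetical LOCK_T primitives             */
    local_board = sp->board;            /* copy position, game history, search stack, */
                                        /* and the split-point stack up to here       */
    sp->is_child[my_id] = 1;
    sp->child_count++;
    unlock(&sp->lock);
}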
What would happen in this situation is that when T3 and T4 join those split points, they add the first split point to their own stack. I will call the first split point (your split blocks B+C) split point A. Then D+E become B, and F+G become C. So this means that of the four processors, T1 and T3 will have a split point stack of { NULL, A, B }, and T2 and T4 will have {NULL, A, C }. Split point A will have four children, but only two moves from that node are currently being searched as two processors are split further down the tree along each path. So when T2 backs up to ply 3, it and T4 have already detached from SP C. When it detaches from SP A, there will still be two processors below it, so it won't be deallocated yet. In other words, a split point is only deallocated when there is at most one processor searching the tree below that node.
I allow splits anywhere, recursively. T1 splits with T2. T1 runs out of work and again splits farther down in the sub-tree T2 is working on. T1 or T2 runs out of work and splits still further down. Those split blocks add up.

On 2 processor systems, I rarely see more than 8 split blocks used. On 4-processor systems this runs up to 55-60. On 8 processor systems, I have run out with a max of 256. On 16 processor systems it _really_ starts to add up.

In your "every processor in my program has the exact same first split point" that is also something I don't generally allow unless I only have 4 processors or less. Otherwise I never let more than 4 work at the same split point as it becomes too inefficient, particularly when there are not enough moves and doing a split produces threads that can't find a thing to do... It sounds like we have some basic differences in how we split the tree. I chose to follow the "a split can occur anywhere, regardless of what has already been split" type of design...
I think our approaches are pretty similar, just the data structures used which are different. Keep in mind I am using iterative search. I don't restrict processors to work at the same node, I just put a pointer to the split point on its stack. That way, when there are two processors at a split point, either one of them can return from that split point and be able to make the split point before it their active one. To give an example: T1 and T2 search split point A. T1 splits with T3 below SP A, at SP B. Now when T1 finishes its moves at SP B, it must go into an idle state to find another split point. How does T3 return from SP B? It must know that there is another split point before it, so it can back up a valid score to it.
OK an iterated search makes a difference, although it would seem that each processor needs its own private memory, probably one copy per thread, period, indexed by thread-ID or something similar? That was how Cray Blitz looked... But it does make it difficult to discuss some specifics since we have such a basically different way of handling the parallel search.
Yes, I haven't looked at Crafty in a while, and it seems like some of the implementation has changed (you converted to processes on *nix, right?). It's good that there is someone to discuss this with, as it looks like I am set to be the fourth person ever to get a DTS implementation running, and two of 'em aren't talking...
hgm
Posts: 27790
Joined: Fri Mar 10, 2006 10:06 am
Location: Amsterdam
Full name: H G Muller

Re: SMP thread goes here

Post by hgm »

bob wrote:I have a flag for each thread that they check frequently, and this flag says "stop now and back out..." The trick is to know who to set that for, because it is not only what is going on at this particular split point, but also any split points deeper into the tree, since they too are no longer useful.
This is essentially the second solution I mentioned: Each thread has one flag, and the originator of a message has to deliver it explicitly to the intended recipients, for which purpose it has to maintain lists of who is working on which split point.

My plan was to handle the message passing to threads at split points lower in the tree indirectly: as the threads directly informed of the abort (because they were working at the split point where the fail high occurred) clean up the tree they were working on, they will close these lower split points. And when they do, they can inform all threads working from there, which will then pass on the message to their daughters, etc. Or in other words, there is no need to connect the split points 'vertically', as the return stack of the interrupted thread already does this implicitly.

It still seems a pain to maintain the list of threads working at each split point, though.

The first solution I mentioned would also use a stack similar to Zach's. But it would not copy any split points from its master there. Just its own. If one of our (direct or indirect) masters is recalled at one of its split points not on our stack, it would eventually reach one of our split points when dismantling its own search tree. And I would rely on the abort message being passed indirectly via that split point.

In this approach there would be no centralized information on who is working on which split point. Each thread would know (from its stack) on which split points it is working, but it wouldn't be aware which other threads are working there (if indeed, any are working there at all). It would just 'broadcast' the abort message to anyone tuned in to it (i.e. working at the split point).

An obvious disadvantage is that each thread has to be tuned in to several broadcasts, i.e., it would have to poll all the split nodes on its private stack. Instead of getting the message delivered on its doorstep, it has to hunt around for it. This is more work for the receiver, but less complicated for the sender.

To limit the polling overhead, I might not poll every node on my split-node stack on every occasion, but (if split[N] is the top element) alternate polling according to a scheme:

N, N-1, N, N-2, N, N-1, N, N-3, N, N-1, N, N-2, N, N-1, N, N-4, N, N-1, N, N-2, N, ...

i.e. the top would be polled every other time, the element below it half as often, etc. In particular, I would poll element N-k of the stack when the k-th bit of a binary counter makes the 0->1 transition. This would poll the split points closer to the root much less frequently, but such split points are also much less likely to change state per unit of time. So this seems justified.
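In code, picking the element to poll is just finding the lowest set bit of a counter (a sketch):

Code: Select all

/* Sketch: which entry of my private split-point stack to poll this time.
   'top' is the index N of the top element; '*counter' is bumped on every poll. */
int next_poll_index(int top, unsigned int *counter)
{
    unsigned int c = ++*counter;       /* the k-th bit just made the 0->1 transition */
    int k = 0;

    while ((c & 1) == 0) {             /* k = index of the lowest set bit of c       */
        c >>= 1;
        k++;
    }
    if (k > top)                       /* stack not that deep: fall back to the root */
        k = top;
    return top - k;                    /* poll split[N - k]                          */
}
Called repeatedly this produces exactly the N, N-1, N, N-2, ... sequence above.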
bob
Posts: 20943
Joined: Mon Feb 27, 2006 7:30 pm
Location: Birmingham, AL

Re: SMP thread goes here

Post by bob »

Zach Wegner wrote:
bob wrote:My "split blocks" are the local memory required to hold the private tree state each thread needs while searching. Where are you putting that data (the board position, repetition list, and so forth)??

As each thread searches, it needs its own board data structure, hash signature stuff, and several other things that have to be stored in different memory addresses for each active thread. That's what goes in my split blocks, and that's why when I split I make multiple split blocks, so that I can do multiple simultaneous searches without interference.
I use processes and shared memory for starters, so this is a little different for me. The local data for each processor remains in place throughout runtime (i.e. I have it as static globals for each process, no pointers). Each split point has a copy of the board state up to that point, including position, game history, and the search stack. Here is my split point structure:

Code: Select all

typedef struct SPLIT_POINT
{
    int id;
    BOOL active;
    struct BOARD board;
    volatile int child_count;
    volatile BOOL update[MAX_CPUS];
    volatile BOOL is_child[MAX_CPUS];
    volatile MOVE move_list[256];
    SEARCH_BLOCK * volatile sb;
    volatile BOOL no_moves_left;
    LOCK_T lock;
} SPLIT_POINT;
So the local data is copied to the split point, and then copied from the split point to other processors local data. This is possible because I maintain within BOARD a stack of split points, so that processors do not only know the immediate split point they are working on, but every one before it in the tree.
I actually use real processes (via fork()) and shared memory (via shmget() and shmat()) because I found too many incompatibilities within pthreads on different platforms.

However I allocate a split block at every split point, so that (a) the children can have their own copies of the chess board and stuff and (b) at the split point, the parent split block is used to manage the shared move list at that one ply, along with the necessary lock value...


What would happen in this situation is that when T3 and T4 join those split points, they add the first split point to their own stack. I will call the first split point (your split blocks B+C) split point A. Then D+E become B, and F+G become C. So this means that of the four processors, T1 and T3 will have a split point stack of { NULL, A, B }, and T2 and T4 will have {NULL, A, C }. Split point A will have four children, but only two moves from that node are currently being searched as two processors are split further down the tree along each path. So when T2 backs up to ply 3, it and T4 have already detached from SP C. When it detaches from SP A, there will still be two processors below it, so it won't be deallocated yet. In other words, a split point is only deallocated when there is at most one processor searching the tree below that node.
I allow splits anywhere, recursively. T1 splits with T2. T1 runs out of work and again splits farther down in the sub-tree T2 is working on. T1 or T2 runs out of work and splits still further down. Those split blocks add up.

On 2 processor systems, I rarely see more than 8 split blocks used. On 4-processor systems this runs up to 55-60. On 8 processor systems, I have run out with a max of 256. On 16 processor systems it _really_ starts to add up.

In your "every processor in my program has the exact same first split point" that is also something I don't generally allow unless I only have 4 processors or less. Otherwise I never let more than 4 work at the same split point as it becomes too inefficient, particularly when there are not enough moves and doing a split produces threads that can't find a thing to do... It sounds like we have some basic differences in how we split the tree. I chose to follow the "a split can occur anywhere, regardless of what has already been split" type of design...
I think our approaches are pretty similar, just the data structures used which are different. Keep in mind I am using iterative search. I don't restrict processors to work at the same node, I just put a pointer to the split point on its stack. That way, when there are two processors at a split point, either one of them can return from that split point and be able to make the split point before it their active one. To give an example: T1 and T2 search split point A. T1 splits with T3 below SP A, at SP B. Now when T1 finishes its moves at SP B, it must go into an idle state to find another split point. How does T3 return from SP B? It must know that there is another split point before it, so it can back up a valid score to it.
OK an iterated search makes a difference, although it would seem that each processor needs its own private memory, probably one copy per thread, period, indexed by thread-ID or something similar? That was how Cray Blitz looked... But it does make it difficult to discuss some specifics since we have such a basically different way of handling the parallel search.
Yes, I haven't looked at Crafty in a while, and it seems like some of the implementation has changed (you converted to processes on *nix, right?). It's good that there is someone to discuss this with, as it looks like I am set to be the fourth person ever to get a DTS implementation running, and two of 'em aren't talking...
That's pretty funny, because neither one of them minded asking me question after question about DTS. :) But I do know what you mean...
bob
Posts: 20943
Joined: Mon Feb 27, 2006 7:30 pm
Location: Birmingham, AL

Re: SMP thread goes here

Post by bob »

hgm wrote:
bob wrote:I have a flag for each thread that they check frequently, and this flag says "stop now and back out..." The trick is to know who to set that for, because it is not only what is going on at this particular split point, but also any split points deeper into the tree, since they too are no longer useful.
This is essentially the second solution I mentioned: Each thread has one flag, and the originator of a message has to deliver it explicitly to the intended recipients, for which purpose it has to maintain lists of who is working on which split point.

My plan was to handle the message passing to threads at split points lower in the tree indirectly: as the threads directly informed of the abort (because they were working at the split point where the fail high occurred) clean up the tree they were working on, they will close these lower split points. And when they do, they can inform all threads working from there, which will then pass on the message to their daughters, etc. Or in other words, there is no need to connect the split points 'vertically', as the return stack of the interrupted thread already does this implicitly.

It still seems a pain to maintain the list of threads working at each split point, though.
For me, it was something that happened naturally and easily. As I copy the data from the parent split block to a new split block for each process searching here, I just note their ids in the parent split block, and I cross-link all the ids into each split block at this ply since they are "siblings". It really only added about 10 lines of code, max, but it gave me a useful debugging tool, in that it is now possible at any instant in time to verify that the split blocks are arranged in a sane manner (since split block[0] is the one used at the root, kind of like inode 0 in unix for the root directory.)
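Roughly like this (made-up names, not the actual Crafty source):

Code: Select all

#define MAX_CPUS 16

/* Sketch only -- made-up names. */
typedef struct split_block_sketch {
    int id;
    int parent_id;
    int nsiblings;
    int sibling_id[MAX_CPUS];          /* split blocks split off at the same ply */
} SB;

void link_siblings(SB *parent, SB *child[], int nchildren)
{
    int i, j;

    for (i = 0; i < nchildren; i++) {
        child[i]->parent_id = parent->id;
        child[i]->nsiblings = nchildren;
        for (j = 0; j < nchildren; j++)
            child[i]->sibling_id[j] = child[j]->id;   /* cross-link all the siblings */
    }
}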

The first solution I mentioned would also use a stack similar to Zach's. But it would not copy any split points from its master there. Just its own. If one of our (direct or indirect) masters is recalled at one of its split points not on our stack, it would eventually reach one of our split points when dismantling its own search tree. And I would rely on the abort message being passed indirectly via that split point.

In this approach there would be no centralized information on who is working on which split point. Each thread would know (from its stack) on which split points it is working, but it wouldn't be aware which other threads are working there (if indeed, any are working there at all). It would just 'broadcast' the abort message to anyone tuned in to it (i.e. working at the split point).
It would seem to me that if you can "selectively broadcast" then you must know who is splitting with whom, else you will unintentionally stop the wrong processors here and there.

An obvious disadvantage is that each thread has to be tuned in to several broadcasts, i.e., it would have to poll all the split nodes on its private stack. Instead of getting the message delivered on its doorstep, it has to hunt around for it. This is more work for the receiver, but less complicated for the sender.

To limit the polling overhead, I might not poll every node on my split-node stack on every occasion, but (if split[N] is the top element) alternate polling according to a scheme:

N, N-1, N, N-2, N, N-1, N, N-3, N, N-1, N, N-2, N, N-1, N, N-4, N, N-1, N, N-2, N, ...

i.e. the top would be polled every other time, the element below it half as often, etc. In particular, I would poll element N-k of the stack when the k-th bit of a binary counter makes the 0->1 transition. This would poll the split points closer to the root much less frequently, but such split points are also much less likely to change state per unit of time. So this seems justified.
hgm
Posts: 27790
Joined: Fri Mar 10, 2006 10:06 am
Location: Amsterdam
Full name: H G Muller

Re: SMP thread goes here

Post by hgm »

bob wrote:It would seem to me that if you can "selectively broadcast" then you must know who is splitting with whom, else you will unintentionally stop the wrong processors here and there.
The essence of broadcasting is that you inject a message into a common medium, without knowing who is listening (or in fact, if anyone is listening at all). It is the listener who decides if he is going to pick up the signal.

In this particular case, the 'medium' would be the particular split node in which the cut-off occurs. The thread stumbling on the cutoff would set a flag there. Only threads working at the node would poll that flag; the other threads would be unaware of it, and could thus never abort their search in reaction to it. They would be polling their own split nodes.

But even the listeners would not be aware with whom they were splitting. They only have to know _where_ they split.
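In code terms the 'broadcast' is no more than a flag in the split node that the listeners poll (a sketch with hypothetical names; in practice the binary-counter scheme above would decide which entry to check on a given occasion):

Code: Select all

/* Sketch: the 'medium' is just a flag in the split node (hypothetical names). */
typedef struct split_node_sketch {
    volatile int abort;                /* set by whoever stumbles on the cutoff */
    /* ...move list, bounds, etc.                                               */
} SPLIT_NODE;

/* Sender: inject the message without knowing who, if anyone, is listening. */
void broadcast_abort(SPLIT_NODE *sp)
{
    sp->abort = 1;
}

/* Receiver: poll the split nodes on my own private stack. */
int aborted(SPLIT_NODE *my_stack[], int top)
{
    int i;

    for (i = top; i >= 0; i--)
        if (my_stack[i] && my_stack[i]->abort)
            return 1;                  /* some split point I work under was aborted */
    return 0;
}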