An interesting parallel search non-bug


Edsel Apostol
Posts: 803
Joined: Mon Jul 17, 2006 5:53 am
Full name: Edsel Apostol

Re: An interesting parallel search non-bug

Post by Edsel Apostol »

bob wrote:
Edsel Apostol wrote:
bob wrote: I have added a simple test to avoid this, because even though it works perfectly at the present time, it seems like a bad idea and a potential bug waiting to happen if I later change something that would break when the same thread shows up at the same split point twice. But it certainly made my head spin for a bit. Finding it was an interesting issue. Then discovering that it is perfectly safe in the current code was a second surprise. I decided that I didn't want a third, which would not occur until some future change breaks this...
It seems this behavior breaks the concept of the helpful master, where a thread that owns a splitpoint, while waiting for its slave threads to finish, should only help the slave threads working for it. Since the thread is not working for itself, it should not help on the splitpoint it created earlier.
Why? All that is required is that the ALL nodes that are potential split points be searched completely. Order doesn't matter. "Helpful master" doesn't get broken at all, as a thread is still the master of its split point. And when thread 0 joins itself at ply=10, it will eventually finish there, go to the thread wait() and notice that the ply=20 split is now done, and start working its way back up the tree.

Not sure why you think this breaks "helpful master".

BTW my helpful master has NEVER had the requirement that it only help those that are helping it. I see no reason why it can't jump across the tree to any viable split point, and have always done it like that. But getting rid of the "master" concept is certainly a big win when moving to full-blown DTS. Now there are no masters anywhere.
In Hannibal we also do splits even at CUT nodes. I've tried doing splits at ALL and PV nodes only, but it is slower NPS-wise (due to fewer splits being available to join), and I didn't measure any advantage over splitting at any node type. This means that when we do pre-splits there is a big enough chance that move 2 causes a cutoff, so it is better that it be searched to completion rather than suspended while we move on to search move 3, etc. If splits are only done at ALL nodes in Crafty (and at the root node, where all the moves are expected to be searched anyway), then it might be a non-issue in Crafty.
It is safe, I think, but inefficient. I made the same mistake in earlier versions of Hannibal. For example, at ply 10 thread 0 pre-splits at move 2, then at ply 20 it splits again. When it finishes ahead of the other slave threads at ply 20 and joins its own earlier splitpoint at ply 10 at (let's say) move 3, it will need to finish all the other moves at that split before it becomes free again. The problem is that it suspended working on the splitpoint it is waiting for at ply 20, and only when it has finished working on the split at ply 10 does it resume working on the ply-20 splitpoint. The inefficiency is that if move 2 at ply 10 causes a cutoff, you have wasted time searching the other moves of that splitpoint at ply 10. NPS-wise it is efficient, but not in terms of TTD.
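
To make that concrete, here is a rough sketch (made-up names, not the actual Hannibal or Crafty code) of the kind of guard involved: an idle thread picks a split point to help, but skips any split point it created itself:

Code:
// Illustrative sketch only: an idle thread scans the active split points and
// skips any it owns, so a master never gets stuck finishing an older split of
// its own while a deeper split it is waiting on sits suspended.
#include <cstddef>

struct SplitPoint {
    int  owner;      // thread id that created this split point
    bool active;     // still has unsearched moves
    int  helpers;    // threads currently attached
};

// Pick a split point for idle thread 'tid' to join, or nullptr if none.
SplitPoint* pick_join(SplitPoint* sp, std::size_t count, int tid) {
    SplitPoint* best = nullptr;
    for (std::size_t i = 0; i < count; ++i) {
        if (!sp[i].active) continue;
        if (sp[i].owner == tid) continue;   // the guard discussed above
        if (!best || sp[i].helpers < best->helpers)
            best = &sp[i];                  // prefer the least-helped split
    }
    return best;
}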

Note that Hannibal already implements these pre-splits, where threads actively look for splits to join instead of being booked by the master thread. This was implemented more than a year ago, I think (it is in version 1.5).
The challenge for me was "nearly lockless", which is critical with 20+ threads. The only lock that has measurable overhead is the NextMove() protection at split points. This took a lot of thought. I have also spent a ton of time making certain that no two adjacent memory words are modified by separate threads, even in places I did not think about... those are performance killers.
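
For illustration, a minimal sketch of the adjacent-words problem and its usual fix, i.e. padding per-thread data out to its own cache line so writes by one thread don't invalidate another thread's line (a hypothetical structure, not Crafty's actual layout):

Code:
// False-sharing avoidance: give each thread's frequently written counters
// their own cache line.
#include <cstddef>
#include <cstdint>

constexpr std::size_t CACHE_LINE = 64;

struct alignas(CACHE_LINE) ThreadCounters {
    std::uint64_t nodes;
    std::uint64_t splits;
    std::uint64_t aborts;
    char pad[CACHE_LINE - 3 * sizeof(std::uint64_t)];  // fill the rest of the line
};

ThreadCounters counters[32];   // counters[t] is written only by thread t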

But then there is DTS, where pre-splits are a thing of the past. I've done a ton of measuring with "split with oneself" and "do not split with oneself" in my code; the difference is absolutely unmeasurable in NPS, time to depth, tree size, etc. But it looked like an accident waiting to happen when I change something next year without thinking about that quirk. My tree overhead has yet to exceed 30%, even with 24 cores... which is not bad. I.e., a 24-core search searches a tree maybe 20-25% larger than the serial tree to the same depth. The NextMove() synchronization is my biggest overhead. DTS can help by splitting close to the root rather than always having to either split where I am or gratuitously split closer to the root. Closer to the root -> less contention for the move list...

As far as your worry about going back to a "wrong split point", there's nothing you can do, and nothing says a different split point is any better. In fact, split points near the root are less likely to have to abort since their move ordering is better. I've done a lot of measuring on aborts by ply; no big surprise that as you get further from the root there are more of 'em. Disproportionately more than expected.

Since I added the "move ordering doesn't change, LMR always reduces split moves by the same amount no matter what" code last year, tree-size variation has dropped quite a bit, too...

Have you measured TTD with any degree of accuracy on 20 cores? My current results are in the 13x-14x range, with the split-with-myself option. I'm trying to rerun the tests with the change, but for 20 cores I saw absolutely no improvement or degradation. It is a totally benign change for me, whether I do it or not...

I will have some 24-core results in a week or two; we have a new cluster with 12 x 2 x 2.5GHz Xeons, as opposed to the current 10 x 2 x 2.6GHz Xeons. It will be interesting to see how turbo-boost works with 24 cores. My 20-core 2.6GHz box runs at 2.9GHz non-stop with Crafty on all 20 cores. I hope the 24-core box will run at around 2.8GHz or so...
NextMove() is the main overhead in Hannibal as well. Threads may be waiting for another thread to finish generating quiet moves, for example, before they can get a move to search. Since we are already doing pre-splits, one idea worth trying is: if you anticipate doing a pre-split at this node, generate all the moves up front, before doing the split and before any threads have joined, to avoid lock contention. Getting a move from the move list should then no longer cause any major lock contention. This depends on the percentage of pre-splits actually being joined, though; if only a small percentage of pre-splits is joined, then generating all the moves at the beginning is a waste of time.
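
A rough sketch of that idea (placeholder types, not Hannibal's real NextMove()): once the whole move list is generated before the split, handing out moves only needs an atomic index instead of a lock around incremental generation:

Code:
#include <atomic>
#include <cstddef>
#include <vector>

using Move = int;  // placeholder move type

struct SplitMoves {
    std::vector<Move>        moves;    // generated once, before any helper joins
    std::atomic<std::size_t> next{0};  // shared cursor, no lock needed afterwards
};

// Returns true and fills 'm' while moves remain; false when the list is exhausted.
bool next_move(SplitMoves& sp, Move& m) {
    std::size_t i = sp.next.fetch_add(1, std::memory_order_relaxed);
    if (i >= sp.moves.size()) return false;
    m = sp.moves[i];
    return true;
}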

I have also made sure that move ordering is the same as in the serial search, so LMR and other pruning based on the number of moves tried should work the same. I made a mistake regarding this before, and it caused a widening effect.

Unfortunately I don't have access to big hardware. I only have an 8-core machine to test my ideas on, so I don't have any data with more than 8 cores. The only result I'm aware of for Hannibal on 20 cores is from TCEC, where it managed to sneak into stage 3 ahead of some stronger engines, which I think is due to better SMP.

I'll be looking forward to your test results. It seems Lazy SMP is the trend nowadays, but I would like to see that YBWC is still better (as long as it is implemented correctly). I'll probably try DTS when I rewrite my engine in the future.
bob
Posts: 20943
Joined: Mon Feb 27, 2006 7:30 pm
Location: Birmingham, AL

Re: An interesting parallel search non-bug

Post by bob »

I don't follow the "split at CUT nodes" idea. You don't know a node is a "CUT node" until it fails high, and splitting then is obviously wrong. Do you mean "predicting" it is a cut node, say because the previous ply has satisfied the YBW criterion and has searched at least one move with no fail high? I don't do that at all. My split criterion is simply YBW. I can split at any ply where I have searched at least one move, regardless of what has happened at previous plies.
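
For illustration, that split test boils down to something like the check below (hypothetical names; the depth and idle-thread guards are typical extra conditions, not necessarily Crafty's exact ones):

Code:
// Young Brothers Wait: a node becomes a candidate split point once at least
// one move has been searched there, regardless of the predicted node type.
struct NodeState {
    int moves_searched;   // moves already completed at this ply
    int depth;            // remaining depth
    int moves_remaining;  // moves still to search here
};

bool ok_to_split(const NodeState& n, int min_split_depth, int idle_threads) {
    return n.moves_searched >= 1          // the YBW criterion itself
        && n.depth >= min_split_depth     // avoid splitting tiny subtrees
        && n.moves_remaining > 1          // something left to share
        && idle_threads > 0;              // someone available to help
}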

If you can send me a Linux executable, or else source with a Linux makefile, and a shell script that will run whatever test you want to try, I can run it on our 20-core box easily enough. Ideally something like "runtest", which would do a make, do the 1-cpu run, then the multiple-cpu runs, and then end. I can tar up the whole directory and send it back.

This is a 2 x 10-core 2660 box that will run all 20 cores at 2.9GHz, so it moves right along...

Let me know if you are interested. Note that you will have to compute your own speedups, but I'd be interested to see your results posted as a point of comparison.

BTW I played with a different NextMove() for split points. It is possible to write a less efficient move generator, but one that can work in parallel, i.e., generate ONE move, spit it out, and continue. I ignored history for the first version of this. It greatly reduced the NextMove() waiting, but it couldn't use history information for ordering. I also tried to pre-generate moves, but only after I had searched at least two moves, trying to avoid some of the overhead you mentioned. I was not really happy with either one. Most of the problem comes from the deep splits, since they do NextMove() much more frequently.

Still some things to tweak, but at least the parallel search stuff is approaching the point of being about as lean as I can make it.

the "gratuitous splits" info from Crafty looks like this for a single position (no idea if this is a representative position, just one taken from a game played on CCC):

splits=3.0M(1.1M) aborts=423.8K joins=9.7M

It did a total of 3.0M splits; 1.1M of those splits were not joined. Joins are high, but there are 20 threads. The aborts look a bit high, but I didn't look at the position to see if it is one of the pathological types... BTW this search took 4:06, so it is fairly long. It searched a total of 21.3 billion nodes, averaging 86M+ nodes per second. In this game, the lowest NPS was 79M, the highest was 96M, and the average was right around 88M-90M.
Edsel Apostol
Posts: 803
Joined: Mon Jul 17, 2006 5:53 am
Full name: Edsel Apostol

Re: An interesting parallel search non-bug

Post by Edsel Apostol »

You are right regarding the CUT node; it is just an expected CUT node. Maybe the reason the idea (of avoiding splits at such nodes) is not successful is that a move has already been searched to satisfy YBW, so the node is no longer an expected CUT node, and avoiding splits there is therefore not a good idea. Right now we split at any ply with at least one move searched, plus some other criteria (minimum depth, etc.).

As for the non-bug you observed when you opened this topic, the only way I can think of to get DTS-like behavior out of a recursive search is to use the C++11 ability to transfer thread ownership. Basically, when the master thread is waiting for its slave threads to finish, in order to be able to search splits other than those of the slaves working for it, it could transfer its thread ownership to a temporary thread and continue searching other splits, while the temporary thread sleeps and waits for the last slave to finish the splitpoint and then transfers ownership to that last slave thread. I'm not sure about the cost of the thread ownership transfer, though, or whether it is efficient. See Ownership Transfer here: http://www.bogotobogo.com/cplusplus/mul ... plus11.php
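
For reference, the ownership-transfer mechanism itself is just std::move on a std::thread handle, as in the small sketch below. Moving the handle is cheap, but it only changes which object owns the thread, so the scheme above would still need its own synchronization:

Code:
// std::thread is movable but not copyable: the handle to a running thread can
// be handed from one owner to another with std::move.
#include <iostream>
#include <thread>
#include <utility>

void worker() { std::cout << "searching...\n"; }

int main() {
    std::thread t1(worker);          // t1 owns the running thread
    std::thread t2 = std::move(t1);  // ownership transferred; t1 is now empty
    // t1.joinable() == false, t2.joinable() == true
    t2.join();
    return 0;
}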

We don't do Linux development at the moment, so we don't have any ready makefiles or scripts. I only have a few positions to run an SMP benchmark on. I gathered those from TCEC games where Hannibal had a hard time resolving a fail high or fail low. I am not sure if that is the right set of positions to test with, though.

Here are the thread statistics from Hannibal on the start position, searching for a minute using 7 threads (4 physical, 3 hyper-threaded) of a quad-core i7-4790.

OUT: thread_id: 0 nodes: 41819407 splits: 0.0589726 joined: 72.2123 threads: 3.84059
OUT: thread_id: 1 nodes: 48842559 splits: 0.0697036 joined: 61.7829 threads: 3.24484
OUT: thread_id: 2 nodes: 48897705 splits: 0.0670522 joined: 62.5919 threads: 3.16426
OUT: thread_id: 3 nodes: 48874092 splits: 0.0689015 joined: 64.2049 threads: 3.28801
OUT: thread_id: 4 nodes: 48756559 splits: 0.0679375 joined: 63.2804 threads: 3.12581
OUT: thread_id: 5 nodes: 46221662 splits: 0.067914 joined: 66.4617 threads: 3.37018
OUT: thread_id: 6 nodes: 48385177 splits: 0.0739235 joined: 65.9053 threads: 3.69588

Nodes are somewhat balanced across all threads except for thread_id 0, the main thread, which means it might have spent some time waiting for the other threads. Split nodes are few, around 0.06% of the nodes (it might be less with 20 threads), though around 4 times Crafty's rate (3M * 100 / 21.3B = 0.014%). The percentage of splits joined is around 65% (similar to Crafty's (3 - 1.1) / 3 = 63.3%). As for "threads", it is the average number of threads joining the splitpoints owned by that thread_id; it averages out to a little less than half of the 7 threads used. For Crafty with 20 threads, joins / splits joined = average threads per split = 9.7 / 1.9 = 5.1 threads. That seems low, unless you have a maximum-threads-per-splitpoint limit set, which we don't use in Hannibal for efficiency reasons.
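
Spelled out as a small throwaway calculation (values copied from the Crafty numbers quoted earlier in the thread), the derived figures are:

Code:
#include <cstdio>

int main() {
    double crafty_splits   = 3.0e6;   // total splits
    double crafty_unjoined = 1.1e6;   // splits never joined
    double crafty_joins    = 9.7e6;   // total joins
    double crafty_nodes    = 21.3e9;  // nodes searched

    double split_rate  = 100.0 * crafty_splits / crafty_nodes;              // ~0.014 %
    double joined_pct  = 100.0 * (crafty_splits - crafty_unjoined)
                               / crafty_splits;                             // ~63.3 %
    double avg_threads = crafty_joins / (crafty_splits - crafty_unjoined);  // ~5.1

    std::printf("split rate %.3f%%, joined %.1f%%, avg threads/split %.1f\n",
                split_rate, joined_pct, avg_threads);
    return 0;
}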
bob
Posts: 20943
Joined: Mon Feb 27, 2006 7:30 pm
Location: Birmingham, AL

Re: An interesting parallel search non-bug

Post by bob »

I enforce a maximum number of threads per split point (I think the number is 12) for split points reasonably far from the root. Closer to the root the parameter is ignored, since the shared data is accessed less often and won't cause a train wreck on NUMA boxes.

The thing I watch most closely is the time output. I display two values: the total time used, and the percentage of that total that was not spent waiting (total time minus waiting time). I want 100%, which means no waiting. I see that averaging 95% or better with the rewrite. But with 20 cores, 95% means one core is "lost". Not what I want. Still working on this, as there are a few more points left to address.
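
The arithmetic behind that figure, with made-up example numbers, is simply (total time minus waiting time) divided by total time; at 95% on 20 cores that works out to one lost core:

Code:
#include <cstdio>

int main() {
    double total_time = 246.0;   // seconds for the whole search (example value)
    double wait_time  = 12.3;    // seconds this thread spent idle (example value)

    double busy_pct   = 100.0 * (total_time - wait_time) / total_time;  // 95.0%
    double cores_lost = 20.0 * (1.0 - busy_pct / 100.0);                // ~1 core

    std::printf("busy %.1f%%, equivalent cores lost %.2f of 20\n",
                busy_pct, cores_lost);
    return 0;
}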
bob
Posts: 20943
Joined: Mon Feb 27, 2006 7:30 pm
Location: Birmingham, AL

Re: An interesting parallel search non-bug

Post by bob »

sje wrote:
bob wrote:I'm going to give some thought to your FSM approach. Never thought of doing the search that way. I do my move selection lock with an FSM, which does eliminate spaghetti code. But I did this in the CB move selection as well. Just not in the search. The more I think about it, the better it sounds, however. More as I start to design this...
There are only two tricky parts with a state machine approach: the state where the graph calls itself to start a new ply, and the state where it returns. Note that more than one state can call, and so more than one state can return. In Symbolic, the next-state variable is kept in a ply-indexed vector (and so is not a local variable). At a state where the graph calls itself, the next-state indicator for the current ply is set, then the ply is incremented, and finally the next-state indicator for the new ply is set. The first value of the next-state indicator at a new ply will almost always be StartAtNewPly, while the indicator giving the return state may vary.

Just get a big piece of blank paper and draw a bunch of circles connected by directed arcs. Works for me. For the first attempt, keep it simple by having only one state which calls and only one state which returns. Things like a null-move search, which will need its own call/return states, can go in later.

In Symbolic, I coded a constant vector of strings indexed by state with each string being the ASCII state name. I used this for single stepping and tracing. Very good for finding bugs.
Paper? PAPER? As Scotty said in The Voyage Home, when he was trying to explain transparent aluminum and was using the mouse like a microphone, and the engineer pointed to the keyboard: "How quaint..." :)

Again, "paper"? How utterly quaint. :)

I think I have some of that lying around somewhere... might be yellow and curled from lack of use...
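
To make the per-ply state-machine idea concrete, here is a stripped-down sketch (hypothetical names, nothing like Symbolic's or Crafty's real code) of a driver that keeps a ply-indexed next-state vector, with one state that "calls" and one that "returns":

Code:
#include <cstdio>

enum State { StartAtNewPly, NextMove, AfterChild, Return, Done };

const char* state_name[] = { "StartAtNewPly", "NextMove",
                             "AfterChild", "Return", "Done" };

int main() {
    const int MAX_PLY = 4;
    State next[MAX_PLY + 1];     // ply-indexed next-state vector
    int moves_left[MAX_PLY + 1];

    int ply = 0;
    next[0] = StartAtNewPly;

    while (true) {
        std::printf("ply %d  state %s\n", ply, state_name[next[ply]]);
        switch (next[ply]) {
        case StartAtNewPly:
            moves_left[ply] = (ply < 2) ? 2 : 0;   // tiny fixed "tree"
            next[ply] = NextMove;
            break;
        case NextMove:
            if (moves_left[ply]-- > 0) {           // "call": descend one ply
                next[ply] = AfterChild;            // where to resume on return
                next[++ply] = StartAtNewPly;
            } else
                next[ply] = Return;
            break;
        case AfterChild:                           // child finished, try next move
            next[ply] = NextMove;
            break;
        case Return:                               // "return": pop one ply
            if (ply == 0) { next[ply] = Done; break; }
            --ply;
            break;
        case Done:
            return 0;
        }
    }
}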
mhull
Posts: 13447
Joined: Wed Mar 08, 2006 9:02 pm
Location: Dallas, Texas
Full name: Matthew Hull

Re: An interesting parallel search non-bug

Post by mhull »

bob wrote:Seems that I am talking myself into this... I'm going to think about Steven's FSM approach. The idea of an FSM state per ply is interesting, and it actually might simplify the CB-style search approach. Going to be a spring issue, however; I'm now running our IT and teaching three courses, so Crafty is a late-night endeavor exclusively right now...
Is there a status report on Crafty DTS?
Matthew Hull
bob
Posts: 20943
Joined: Mon Feb 27, 2006 7:30 pm
Location: Birmingham, AL

Re: An interesting parallel search non-bug

Post by bob »

Not as of yet, no. I've completely cleaned up the old parallel search to the point that it is extremely competitive with DTS on the number of cores I normally test with (20 or fewer). When we begin to get 32-64 core boxes, that will be the motivation. :)