bob wrote: I have added a simple test to avoid this, because even though it works perfectly at the present time, it seems like a bad idea and a potential bug waiting to happen: if I change something later, having the same thread at the same split point twice might break. But it certainly made my head spin for a bit. Finding it was an interesting issue. Then discovering that it is perfectly safe in the current code was a second surprise. I decided that I didn't want a third, which would not occur until some future change breaks this...

Edsel Apostol wrote: It seems this behavior breaks the concept of helpful master, where a thread that owns a split point, while waiting for slave threads to finish, should only help the slave threads working for it. Since the thread is not working for itself, it should not help on the split point it created earlier.

bob wrote: Why? All that is required is that the ALL nodes that are potential split points be searched completely. Order doesn't matter. "Helpful master" doesn't get broken at all, as a thread is still a master of its thread. And when thread 0 joins itself at ply=10, it will eventually finish there, go to ThreadWait() and notice that now ply=20 is done, and start working its way back up the tree.

In Hannibal we also do splits even in CUT nodes. I've tried splitting only in ALL and PV nodes before, but it is slower NPS-wise (due to fewer splits being available to join), and I measured no advantage over splitting at all node types. This means that when we do pre-splits, there is a big enough chance that move 2 causes a cutoff that it is better to search it to completion rather than suspend it and move on to move 3, and so on. If splits are only done in ALL nodes in Crafty (and at the root, where all the moves are expected to be searched anyway), then this may be a non-issue there.
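In sketch form, the split-eligibility test amounts to something like this (invented names and thresholds, not Hannibal's actual code):

```cpp
// Sketch only: illustrative names and thresholds, not Hannibal's real code.
enum NodeType { PV_NODE, CUT_NODE, ALL_NODE };

constexpr int  MIN_SPLIT_DEPTH = 4;     // guessed value; tune per engine
constexpr bool SPLIT_CUT_NODES = true;  // the choice discussed above

bool split_allowed(NodeType nt, int depth, int moves_searched, int idle_threads) {
    if (idle_threads == 0 || depth < MIN_SPLIT_DEPTH)
        return false;
    if (moves_searched < 1)             // young-brothers-wait: search move 1 first
        return false;
    if (nt == CUT_NODE && !SPLIT_CUT_NODES)
        return false;                   // the ALL/PV-only variant that tested slower
    return true;
}
```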
Not sure why you think this breaks "helpful master".
BTW my helpful master has NEVER had the requirement that it only help those that are helping it. I see no reason why it can't jump across the tree to any viable split point, and have always done it like that. But getting rid of the "master" concept is certainly a big win when moving to full-blown DTS. Now there are no masters anywhere.
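A sketch of what that wait-loop scan looks like (illustrative code, not Crafty's source): any active split point with moves left is fair game, with no ownership test, and preferring points nearer the root buys bigger subtrees per join:

```cpp
#include <atomic>

// Sketch of an idle thread's candidate scan. There is no
// "is this thread helping me / am I its master" test at all.
constexpr int MAX_SPLIT_POINTS = 64;   // made-up bound

struct SplitPoint {
    std::atomic<bool> active{false};
    std::atomic<int>  moves_left{0};
    int               depth = 0;       // remaining depth: bigger = nearer the root
};

SplitPoint split_points[MAX_SPLIT_POINTS];

SplitPoint *find_join_candidate() {
    SplitPoint *best = nullptr;
    for (SplitPoint &sp : split_points)
        if (sp.active.load() && sp.moves_left.load() > 0 &&
            (!best || sp.depth > best->depth))   // prefer bigger subtrees
            best = &sp;
    return best;
}
```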
bob wrote: The challenge for me was "nearly lockless", which was a critical key with 20+ threads. The only lock that has measurable overhead is the NextMove() protection at split points. This took a lot of thought. I have also spent a ton of time making certain there are no two adjacent memory words that are modified by separate threads, even in places I did not think about... performance killers.

NextMove() is the main overhead in Hannibal as well. Threads will be waiting for one thread to finish generating quiet moves, for example, before they can get a move to search. Since we are doing pre-splits already, one idea worth trying: if you anticipate a pre-split at a node, generate all the moves before doing the split, while no thread has yet joined, to avoid lock contention. Getting a move from the move list should then no longer cause any major contention. This depends on the percentage of pre-splits that actually get joined, though; if only a small percentage is joined, generating all the moves up front is wasted time.

As for a thread joining its own split point: it is safe, I think, but inefficient. I made the same mistake in earlier versions of Hannibal. For example, at ply 10 thread 0 pre-splits at move 2, then at ply 20 it splits again. When it finishes ahead of the other slave threads at ply 20 and joins the split point it created earlier at ply 10, at (let's say) move 3, it has to finish all the other moves at that split point before it becomes free again. The problem is that it has suspended work on the split point it is actually waiting for at ply 20, and only resumes it once the ply 10 split is done. The inefficiency: if move 2 at ply 10 causes a cutoff, the time spent on the other moves of that ply 10 split point is wasted. NPS-wise it is efficient, but not in terms of TTD (time to depth).
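To make the pre-generation idea above concrete, a minimal sketch (made-up names; it assumes the move list is complete and immutable before the split is published):

```cpp
#include <atomic>
#include <cstdint>

// Sketch of "generate everything before publishing the split".
// Once the list is complete and never modified again, NextMove()
// collapses to a single atomic fetch-and-add: no lock at all.
struct Move { std::uint32_t data = 0; };

struct PreSplitPoint {
    Move moves[256];            // fully generated BEFORE any thread joins
    int  count = 0;
    std::atomic<int> next{0};   // shared cursor

    bool next_move(Move &out) {
        int i = next.fetch_add(1, std::memory_order_relaxed);
        if (i >= count)
            return false;       // no moves left at this split point
        out = moves[i];
        return true;
    }
};
```

The tradeoff is exactly the one noted above: if few pre-splits are ever joined, the up-front generation of quiet moves is wasted work.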
Note that Hannibal already implements these pre-splits, where idle threads actively look for splits to join instead of being booked by the master thread. This was implemented more than a year ago, I think (it is in the 1.5 version).
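Roughly the shape of it (invented names, not Hannibal 1.5's actual code): the searching thread only advertises; idle threads do all the polling and attaching:

```cpp
#include <atomic>

// Sketch of the publish side of a pre-split. The master never books
// or wakes slaves; it posts the split point and keeps searching.
constexpr int MAX_THREADS = 32;        // made-up bound

struct SplitPoint;                     // as sketched earlier

std::atomic<SplitPoint *> join_board[MAX_THREADS];  // one "help wanted" slot per thread

void publish_presplit(int my_id, SplitPoint *sp) {
    join_board[my_id].store(sp, std::memory_order_release);
    // ... continue searching this node's moves as if nothing happened ...
}

// Idle threads poll the board and attach themselves to anything viable.
SplitPoint *poll_for_work() {
    for (auto &slot : join_board)
        if (SplitPoint *sp = slot.load(std::memory_order_acquire))
            return sp;
    return nullptr;
}
```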
But then there is DTS, where pre-splits are a thing of the past. I've done a ton of measuring with split-with-oneself and do-not-split-with-oneself in my code, and the difference is absolutely unmeasurable: NPS, time to depth, tree size, etc. But it looked like an accident waiting to happen when I change something next year without thinking about that quirk. My tree overhead has yet to exceed 30%, even with 24 cores... which is not bad. I.e., a 24-core search searches a tree maybe 20-25% larger than the serial tree to the same depth. The NextMove() synchronization is my biggest overhead. DTS can help by splitting close to the root, rather than always having to either split where I am or do a gratuitous split closer to the root. Closer to the root -> less contention for the move list...
As far as your worry about going back to a "wrong split point" goes, there's nothing you can do, and nothing says a different split point would be any better. In fact, positions near the root are less likely to have to abort, since their move ordering is better. I've done a lot of measuring on aborts by ply; no big surprise that the further you get from the root, the more of them there are. Disproportionately more than expected.
Since I added the "move ordering doesn't change, LMR always reduces split moves the same amount no matter what" code last year, tree variability has dropped quite a bit, too...
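The idea, sketched (not Crafty's source): reductions depend only on depth and the move's index in the ordered list, fixed when the split point is built, never on which thread searches the move or when:

```cpp
#include <algorithm>

// Sketch: the reduction is a pure function of depth and the move's
// index in the ordered list, stamped at split time. Whichever thread
// picks the move up, and in whatever order moves complete, every
// move gets the same reduction it would get in a serial search.
int lmr_reduction(int depth, int order_index) {
    if (depth < 3 || order_index < 4)        // illustrative thresholds
        return 0;
    return std::min(depth - 2, 1 + order_index / 8);
}
```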
Have you measured TTD with any degree of accuracy on 20 cores? My current results are in the 13x-14x range, with the split-with-myself option. I'm trying to rerun the tests with the change, but for 20 cores I saw absolutely no improvement or degradation. A totally benign change, whether I do it or not...
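For comparison, one common way to turn per-position TTD measurements into a single figure like that is the geometric mean of the ratios (a sketch, with hypothetical names):

```cpp
#include <cmath>
#include <cstddef>
#include <vector>

// Sketch: summarize time-to-depth speedup as the geometric mean of
// per-position serial/parallel ratios, so one lucky position with a
// huge ratio can't dominate the average.
double ttd_speedup(const std::vector<double> &serial_sec,
                   const std::vector<double> &parallel_sec) {
    double log_sum = 0.0;
    for (std::size_t i = 0; i < serial_sec.size(); ++i)
        log_sum += std::log(serial_sec[i] / parallel_sec[i]);
    return std::exp(log_sum / static_cast<double>(serial_sec.size()));
}
```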
Will have some 24-core results in a week or two; we have a new cluster with 12 x 2 x 2.5GHz Xeons, as opposed to the current 10 x 2 x 2.6GHz Xeons. It will be interesting to see how turbo boost behaves with 24 cores. My 20-core 2.6GHz box runs at 2.9GHz non-stop with Crafty on all 20 cores. I hope the 24-core box will run at around 2.8GHz or so...
I have also made sure that move ordering is the same as in the serial search, so LMR and other pruning based on the number of moves already tried works the same. I made a mistake with this before, and it caused a widening effect.
Unfortunately I don't have access to big hardware. I only have an 8-core machine to test my ideas on, so I don't have any data for more than 8 cores. The only 20-core result for Hannibal I'm aware of is from TCEC, where it managed to sneak into stage 3 ahead of some stronger engines, which I think was due to better SMP.
I'll be looking forward to your test results. Lazy SMP seems to be the trend nowadays, but I would like to see that YBWC is still better (as long as it is implemented correctly). I'll probably try DTS when I rewrite my engine in the future.