Edsel Apostol wrote: My latest code has only 1% synchronization overhead. I'm already using one lock per split point. I do have one global lock though, in the idle loop and inside the function that splits the remaining moves between threads.
Then you seem to have solved the problem already. A global lock for doing the split should be fine (and seems hard to avoid if active threads are picking up idle threads).
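To make the shape of that global lock concrete, here is a minimal sketch of an active thread claiming idle helpers under a single global mutex. All names (`thread_state`, `assigned_split`, `split`) are invented for illustration; a real engine would keep this state in per-thread objects.

```cpp
#include <mutex>

constexpr int MAX_THREADS = 8;

// Hypothetical per-thread state; in a real engine this lives in a Thread struct.
enum class State { IDLE, SEARCHING };
State thread_state[MAX_THREADS];
void* assigned_split[MAX_THREADS];   // split point each claimed slave will pick up

// One global lock serializes all split bookkeeping.  Splits are rare
// relative to nodes searched, so the contention stays very low.
std::mutex split_mutex;

// Called by an active thread that wants helpers for its split point.
// Returns the number of idle threads it managed to claim.
int split(int master, void* split_point) {
    std::lock_guard<std::mutex> g(split_mutex);
    int helpers = 0;
    for (int i = 0; i < MAX_THREADS; ++i) {
        if (i != master && thread_state[i] == State::IDLE) {
            thread_state[i]   = State::SEARCHING;  // claim the idle thread
            assigned_split[i] = split_point;       // its idle loop will see this
            ++helpers;
        }
    }
    return helpers;
}
```

The idle loop on the slave side would check `assigned_split` (also under `split_mutex`) and jump into the search at that split point; because both sides take the same lock, an active thread can never claim a slave that is simultaneously claiming itself elsewhere.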
About your method, would splitting after good captures and killers leave a lot of idle time for the threads? What would be the compensation for the bigger idle time?
I think I hardly have any idle time (given the NPS speed-up). I already split at depth = 2. The speed-up in the first second of searching the initial position is a bit lower (5.4 on 6 cores), which must partly be due to the PV lines that are being printed, and partly due to the many synchronisations inherent to YBW: when you finish the search of the PV node at ply N (i.e. the search of the first move at the PV node at ply N-1), YBW implies that all threads but one are idle.
And your method of not copying the position into the split point, but instead using the moves from the root to that position: how is it faster than just copying the position at that potential split point?
It might depend on how big your position data is, but copying the moves involves only very little data (even though I don't store them consecutively in a single cache line, which maybe I should try). After that, the slave thread has all it needs and can release the lock. There is no need to copy the hash-key history, etc. Also, I keep a small hash table to speed up draw-by-repetition detection, and I don't want to copy that table at splits.
More importantly, if slave threads actively join a candidate split node without the master knowing it, I don't even have a consistent position state for that node. I don't want to interrupt the master, force it to back up its position all the way to the split node, copy the full state, and then let the master go back to where it was. And I also don't want to make a full copy of the state when creating candidate split points, because the overhead would be way too high. So copying the moves is the only option.
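A sketch of the move-path idea described above, in C++. `Position`, `make_move`, and the hash-key update are stand-ins invented for illustration; the point is only that the master publishes a short move list under the lock, and the slave rebuilds the full position from its own root copy after releasing it.

```cpp
#include <cstdint>
#include <cstring>
#include <mutex>
#include <vector>

// Hypothetical minimal position: just a hash key, updated per move.
struct Position {
    uint64_t key = 0x0123456789ABCDEFULL;   // made-up root-position key
    void make_move(uint16_t m) { key ^= 0x9E3779B97F4A7C15ULL * (m + 1); }
};

struct SplitPoint {
    std::mutex lock;
    uint16_t path[128];     // moves from the root to the split node
    int path_len = 0;
};

// Master side: publish only the move path -- a few hundred bytes at most --
// instead of the whole search state (hash-key history, repetition table, ...).
void publish(SplitPoint& sp, const std::vector<uint16_t>& moves_from_root) {
    std::lock_guard<std::mutex> g(sp.lock);
    sp.path_len = (int)moves_from_root.size();
    std::memcpy(sp.path, moves_from_root.data(),
                sp.path_len * sizeof(uint16_t));
}

// Slave side: copy the short path while holding the lock, release it,
// then reconstruct the position by replaying the moves from the root.
Position join(SplitPoint& sp) {
    uint16_t local[128];
    int n;
    {
        std::lock_guard<std::mutex> g(sp.lock);
        n = sp.path_len;
        std::memcpy(local, sp.path, n * sizeof(uint16_t));
    }
    Position pos;                        // every thread owns a root position
    for (int i = 0; i < n; ++i) pos.make_move(local[i]);
    return pos;
}
```

The slave ends up with exactly the master's position while having held the lock only for a short `memcpy`, and replaying the moves also rebuilds any incremental state (hash-key history, repetition info) that a raw position copy would have had to include.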
I would like to know the success rate of a potential split point being actually searched by the slave threads, have you measured it in your program? I would like to implement it in Hannibal.
I didn't measure it, but most potential split points probably never become a split point. Most potential split points are close to the leaves, and the idle threads look for one close to the root. But the overhead of a node being a potential split point (i.e. having to store some data and take a lock before selecting a move) turns out to be very low in my case.
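What "store some data and take a lock before selecting a move" might look like is sketched below. The struct layout, field names, and the closest-to-the-root selection rule are illustrative assumptions, not the poster's actual code; the master's per-node cost is just one lock round-trip around move selection.

```cpp
#include <mutex>

// Hypothetical candidate split point: every node above some minimum depth
// registers one of these before move selection; almost none are ever joined.
struct Candidate {
    std::mutex lock;
    int  ply = 0;
    int  depth = 0;
    bool joined = false;     // did any slave ever attach here?
    int  moves_picked = 0;
};

// Master side: the only per-node overhead is taking the node's own lock
// around move selection, so concurrent slaves never pick the same move.
int pick_next_move(Candidate& c) {
    std::lock_guard<std::mutex> g(c.lock);
    return c.moves_picked++;   // stand-in for "return next move from the list"
}

// Idle-thread side: among the registered candidates, prefer the one
// closest to the root (smallest ply), since it has the most work left.
Candidate* choose(Candidate* cands[], int n) {
    Candidate* best = nullptr;
    for (int i = 0; i < n; ++i)
        if (!best || cands[i]->ply < best->ply) best = cands[i];
    if (best) {
        std::lock_guard<std::mutex> g(best->lock);
        best->joined = true;
    }
    return best;
}
```

A `joined` flag like this would also be a cheap way to collect the success-rate statistic asked about above: count registered candidates versus candidates with `joined == true` at the end of a search.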
I might try to collect some data if I figure out what data is useful.