syzygy wrote:
bob wrote:
syzygy wrote:
bob wrote:
syzygy wrote:
bob wrote:
Joerg Oster wrote:
Yes, increasing MinSplitDepth for the endgame should definitely help.
That's at least my experience from many, many tries to tweak those parameters.
Also, limiting the number of threads per split point in endgames is important. There is no point in having more threads than reasonable moves to search. But min split depth alone is not enough for endgames, unfortunately. It takes more work, and the mutex locks are pretty bad, period...
Certainly it makes no sense to have more threads than moves at a split node, but I do not see the point of limiting the number of threads more than that. If you have, say, 8 cores and 8 moves at a node (after having searched the first move) with remaining depth 17, it makes little sense to let only 5 threads search these 8 moves and hope that the other 3 threads join in at depth 15 or lower.
Max Threads per Split Point was removed from SF, which seems to help at least somewhat. But it did not solve the problem.
You should try it. I have. It works, particularly out near the tips. If you split 5 plies from the tip and have 8 moves to search using 8 threads, that is far less efficient than simply splitting with 2 processors and using the other 6 somewhere else. The split overhead suddenly becomes a major factor when the subtrees near the tips are small.
OK, near the tips, basically at the minimum split depth, I accept that limiting the number of threads makes sense.
But a few plies above the minimum split depth, I really do not see what it gains you. If min split depth is 7 and you are splitting at depth 9, then it should be a pretty safe bet that having 8 cores at the depth-9 split node is better than having just 5 there and the 3 remaining cores joining split nodes (several times) at depth 7.
Even at the minimum split depth, I do not see that it is necessarily an improvement. If it does help, you could also just increase minimum split depth by 1 or 2 and have the same benefit of decreased splitting overhead.
Apples to oranges. Min split depth says "never split below this point." Min_split_nodes says "don't split until this specific thread has searched this many nodes from this position." The min_thread_group idea from Crafty says "no more than this many threads at any single split point" unless we are within N plies of the root.
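A minimal sketch of how these three constraints might gate a split decision. All names and constant values here are illustrative, not Crafty's actual source; only the ply <= 4 exemption is taken from the discussion further down.

[code]
/* Illustrative constants; the real values are engine- and
   hardware-dependent and would need tuning. */
#define MIN_SPLIT_DEPTH    7   /* never split with less remaining depth */
#define MIN_SPLIT_NODES 2000   /* nodes this thread must search here first */
#define MAX_THREAD_GROUP   4   /* cap on threads at one split point */
#define ROOT_EXEMPT_PLIES  4   /* ignore the cap at ply <= 4 (see below) */

/* Returns nonzero if a split is allowed under all three constraints. */
int ok_to_split(int depth, int ply, unsigned long nodes_searched_here,
                int threads_at_split_point) {
  /* 1. min split depth: the remaining subtree must be big enough
        to amortize the cost of setting up a split point. */
  if (depth < MIN_SPLIT_DEPTH)
    return 0;
  /* 2. min split nodes: this specific thread must have searched at
        least this many nodes from this position before splitting. */
  if (nodes_searched_here < MIN_SPLIT_NODES)
    return 0;
  /* 3. max thread group: limit threads per split point, except
        within a few plies of the root where contention is minimal. */
  if (ply > ROOT_EXEMPT_PLIES && threads_at_split_point >= MAX_THREAD_GROUP)
    return 0;
  return 1;
}
[/code]

The point is that the three tests are independent: the first two decide whether a split here is worth its overhead at all, while the third only limits how many helpers pile onto one split block once a split has been allowed.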
I am only talking about what you call max_thread_group here. (I suppose you meant max_thread_group, not min_thread_group.)
At the root it obviously makes sense to get as many processors as possible busy since those sub-trees represent significant amounts of work.
So at the root you do not want to use a max_thread_group restriction. Same for split nodes relatively close to the root.
Again, I only see max_thread_group making sense at the minimum split depth. But it seems essentially equivalent to increasing min split depth by 1 or 2. The idea is: min split depth ensures that the subtree is big enough for splitting to make sense. If that is the case, splitting for a single move (that has a whole subtree beneath it) makes sense as well. If you need at least 2 or 3 moves to outweigh the splitting overhead, you had better increase min split depth.
Again, I am not talking about min_split_nodes here. That seems to be a good idea and seems likely to address the specific problem that SF has in endgames.
I was simply explaining that I have three constraints.
I ignore max_thread_group (you were correct: max, not min) near the root. The reason I have it at all is that I do not want lots of contention over a single split block. That is much more damaging near the tips, while at the root contention is minimal (if you have 40 moves, the root split block will be locked/accessed at most 40 times, a cost so low it can't even be measured).
This is a change from the current release versions of Crafty, which ignore max_thread_group at the root but not elsewhere. I have greatly modified all of this code in the new version I've been working on for several months, and I have not yet taken the time to discover the "optimal" point at which to ignore max_thread_group. I didn't want to do that until I reach what I consider a final SMP implementation. It will likely take 2+ weeks of testing to find the best value, since a single test takes around 3 days at 5+5 time controls.
I am currently working on a condition I call a "futile split", where a race condition causes a thread to try to split after another thread has already grabbed the idle processors. Not a large cost, but I am going to take it as close to zero as possible without introducing excessive lock overhead.
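For illustration, a sketch of the race and one way to shrink the window. The names (try_split, idle_threads, smp_lock) are hypothetical, not Crafty's code: the cheap unlocked pre-test lets a thread begin a split that is already doomed, so the idle count is re-checked under the lock before any split point is built.

[code]
#include <pthread.h>

extern pthread_mutex_t smp_lock;   /* lock guarding the idle-thread list */
extern volatile int idle_threads;  /* count of threads waiting for work */

/* Attempt a split; returns 0 on a "futile split", i.e. when another
   thread grabbed the idle processors between our unlocked pre-test
   and our lock acquisition. */
int try_split(void) {
  if (idle_threads == 0)           /* cheap unlocked pre-test */
    return 0;
  pthread_mutex_lock(&smp_lock);
  if (idle_threads == 0) {         /* re-check under the lock: the idle
                                      threads may be gone by now */
    pthread_mutex_unlock(&smp_lock);
    return 0;                      /* futile split avoided */
  }
  /* ... claim idle threads and build the split point here ... */
  pthread_mutex_unlock(&smp_lock);
  return 1;
}
[/code]

The double check cannot eliminate the race entirely (the pre-test is deliberately unlocked to keep the fast path cheap), but it keeps a thread from paying the full split-point setup cost only to find nobody left to help.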
BTW, there are other considerations about max_thread_group. Contention is an issue: the more threads you have at a split point, the more contention for that split block. But max_thread_group was originally done for NUMA boxes. If I have 4 sockets with 8 cores per socket, each socket is a NUMA node, with local memory much faster to access than memory on the other nodes. I keep a simple change handy that tries to split with cores on the same node rather than across nodes. When I was working on this, the idea was that max_thread_group would be set to the number of cores on a single socket, for obvious reasons. Then I could make an effort, at least half-way to the tips or further, to split within a socket, not across sockets.
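A sketch of that idea under assumed data structures (node_of[], thread_idle[] and pick_helper() are illustrative, not Crafty's source): near the tips only same-node helpers are accepted, while closer to the root any idle thread will do, since the larger subtrees hide the remote-memory cost.

[code]
#define MAX_THREADS 64

extern int node_of[MAX_THREADS];            /* NUMA node of each thread */
extern volatile int thread_idle[MAX_THREADS];
extern int num_threads;

/* Pick an idle helper for thread 'me'. When 'local_only' is set
   (e.g., splitting in the half of the tree nearer the tips), accept
   only helpers on the same NUMA node. */
int pick_helper(int me, int local_only) {
  int t;
  for (t = 0; t < num_threads; t++) {
    if (!thread_idle[t] || t == me)
      continue;
    if (local_only && node_of[t] != node_of[me])
      continue;                             /* skip remote-node threads */
    return t;                               /* first acceptable helper */
  }
  return -1;                                /* no suitable helper found */
}
[/code]

With max_thread_group set to the cores per socket, a split group can then always fit on one node, so the split block stays in that node's local memory.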
Currently this is not a particularly big deal, since most machines today have one or two sockets. But I have used AMD boxes with 8 sockets, and there it becomes significant. We have a 4x6 (four-socket, six-core) box that I have not yet run on.
So, bottom line: max_thread_group is not trying to limit split costs; I do that via the min_split_depth constraint, which avoids splitting too close to the tips. It is really intended to reduce split block contention, particularly on NUMA architectures (which, it seems, all machines are today, at least from Intel and AMD, although a 1-socket NUMA box doesn't really have a NUMA problem). For split blocks where contention is minimal, it is ignored: currently at any ply <= 4, but that has yet to be tuned.