Stockfish - Decrease of nodes in endgame positions

Discussion of anything and everything relating to chess playing software and machines.

Moderators: hgm, Rebel, chrisw

syzygy
Posts: 5566
Joined: Tue Feb 28, 2012 11:56 pm

Re: Stockfish - Decrease of nodes in endgame positions

Post by syzygy »

bob wrote:The min_thread_group idea from crafty simply says "no more than this many threads at any single split point" unless we are within N plies of the root.
I had missed the "unless we are within N plies of the root" part (if it was mentioned in your earlier comments, I did not check). The "Max Threads per Split Point" restriction of SF did not have this condition, if I am not mistaken.

I would think "unless remaining depth > min_split_depth", i.e. only apply max_thread_group if split depth = min_split_depth, but in any event your "unless" condition removes most of the problems.

I have not yet looked into the crafty sources to see how "N" is determined.
bob
Posts: 20943
Joined: Mon Feb 27, 2006 7:30 pm
Location: Birmingham, AL

Re: Stockfish - Decrease of nodes in endgame positions

Post by bob »

gladius wrote:
bob wrote:
gladius wrote:
Joerg Oster wrote:Here is some preliminary, but very interesting data.
<snip>
Nice! You should queue up an 8-thread test on fishtest.
Do they really do serious 8-thread testing? And this is really more important at 12-16 threads and beyond. 8 is sort of at the low end of expected gain...
Very rarely, as not many machines can run the tests, so it takes a long time. But it is done, yes :).
The reason I asked is that I do it on occasion. I can run 70 games at a time, which hurts (one cluster has 70 nodes with 8 cores per node). And it STILL needs 30K games to properly measure small improvements. That is just over 400 "game cycles". If it takes 1 minute to play a game (really too fast to test SMP changes), that turns into 6+ hours of testing. I generally go for at least 5m+5s, which can make a game average something over 10 minutes, and that turns into over 3 days of testing for a single run.
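
(Spelling out the arithmetic: 30,000 games at 70 in parallel is about 429 "game cycles"; at roughly 1 minute per game that is about 429 minutes of testing, a bit over 7 hours; at 10+ minutes per game it is 4,290+ minutes, right at 3 days.)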
bob
Posts: 20943
Joined: Mon Feb 27, 2006 7:30 pm
Location: Birmingham, AL

Re: Stockfish - Decrease of nodes in endgame positions

Post by bob »

syzygy wrote:
bob wrote:
syzygy wrote:
bob wrote:
syzygy wrote:
bob wrote:
Joerg Oster wrote:Yes, increasing MinSplitDepth for the endgame should definitely help.
That's at least my experience from many, many tries to tweak those parameters.
Also limiting # of threads per split point in endgames is important. No point in having more threads than reasonable moves to search. But min split depth is not enough for endgames, unfortunately. Takes more work and the mutex locks are pretty bad, period...
Certainly it makes no sense to have more threads than moves at a split node, but I do not see the point of limiting the number of threads more than that. If you have say 8 cores and 8 moves at a node (after having searched the first move) with remaining depth 17, it makes little sense to only let 5 threads search these 8 moves and hope that the other 3 threads join in at depth 15 or lower.

Max Threads per Split Point was removed from SF, which seems to help at least somewhat. But it did not solve the problem.
You should try it. I have. It works. Particularly out near the tips. If you split 5 plies from the tip and have 8 moves to search using 8 threads, that is far less efficient than simply splitting with 2 processors and using the other 6 somewhere else. The split overhead suddenly becomes a major factor when the subtrees near the tips are small.
OK, near the tips, basically at the minimum split depth, I accept that limiting the number of threads makes sense.

But some ply above the minimum split depth, I really do not see what it should give you. If min split depth is 7 and you are splitting at depth 9, then it should be a pretty safe bet that having 8 cores at the split node at depth 9 is better than having just 5 at depth 9 and the 3 remaining cores joining split nodes (several times) at depth 7.

Even at the minimum split depth, I do not see that it is necessarily an improvement. If it does help, you could also just increase minimum split depth by 1 or 2 and have the same benefit of decreased splitting overhead.
Apples to oranges. Min split depth simply says "never split below this point." The min_split_nodes simply says "don't split until this specific thread has searched this many nodes from this position." The min_thread_group idea from crafty simply says "no more than this many threads at any single split point" unless we are within N plies of the root.
I am only talking about what you call max_thread_group here. (I suppose you meant max_thread_group, not min_thread_group.)
At the root it obviously makes sense to get as many processors as possible busy since those sub-trees represent significant amounts of work.
So at the root you do not want to use a max_thread_group restriction. Same for split nodes relatively close to the root.

Again, I only see max_thread_group making sense at the minimum split depth. But it seems essentially equivalent to increasing min split depth by 1 or 2. The idea is: min split depth ensures that the subtree is big enough for splitting to make sense. If that is the case, splitting for a single move (that has a whole subtree beneath it) makes sense as well. If you need at least 2 or 3 moves to outweigh the splitting overhead, you better increase min split depth.

Again, I am not talking about min_split_nodes here. That seems to be a good idea and seems likely to address the specific problem that SF has in endgames.
I was simply explaining how I have three constraints.

I ignore max_thread_group (you were correct, max, not min) near the root. The reason I have it is that I do not want to have lots of contention over a single split block. That is much more damaging near the tips, while at the root contention is minimal (if you have 40 moves, the root split block will be locked/accessed at most 40 times, which can't even be measured in terms of cost, it is so low.)

This is a change from current release versions of crafty, which ignore max_thread_group at the root but not elsewhere. I have greatly modified all of this code in the new version I've been working on for several months, and I have not yet taken the time to discover the "optimal" point at which to ignore max_thread_group. I don't want to do that until I reach what I consider a final SMP implementation. It will likely take 2+ weeks of testing to find the best value, since a single test ends up taking 3 days or so at 5+5 time controls.

I am currently working on a condition I call "futile splits", where a race condition causes a thread to try to split after another thread has already grabbed the idle processors. Not a large cost, but I am going to take it as close to zero as possible without introducing excessive lock overhead.
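
To illustrate the direction I mean: make claiming the idle threads a single atomic step, so the loser of the race finds out before doing any of the split work. An untested sketch with made-up names, not the actual Crafty code:

Code:

    #include <atomic>

    std::atomic<unsigned> idle_mask; // one bit per thread, 1 = idle

    // Try to claim up to 'want' idle threads in one atomic step. Returns a
    // mask of the claimed threads, or 0 if another thread won the race.
    unsigned ClaimIdleThreads(int want) {
      unsigned old_mask = idle_mask.load(std::memory_order_relaxed);
      while (old_mask != 0) {
        unsigned claimed = 0, remaining = old_mask;
        for (int i = 0; i < want && remaining != 0; i++) {
          unsigned bit = remaining & (0u - remaining); // lowest set bit
          claimed |= bit;
          remaining &= ~bit;
        }
        // If the CAS fails, another thread changed the mask under us;
        // old_mask is reloaded and we retry with fresh information.
        if (idle_mask.compare_exchange_weak(old_mask, old_mask & ~claimed,
                                            std::memory_order_acq_rel))
          return claimed;
      }
      return 0; // race lost: no idle threads left, so the split is skipped
    }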

BTW there are other considerations about max_thread_group. Contention is an issue. The more threads you have at a split point, the more contention for that split block. max_thread_group was originally done for NUMA boxes. If I have 4 sockets, with 8 cores per socket, each socket turns out to be a NUMA node, with local memory being much faster to access than memory on the other nodes. I can add a simple change that I keep handy, to try to split with cores on the same node, rather than across the nodes. When I was working on this, the idea was "max_thread_group" would be set to the number of cores on a single socket for obvious reasons. Then I could make an effort, at least half-way to the tips or further, to split within a socket, not across sockets.
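
The selection itself can be a simple two-pass walk over the idle list, same socket first. Untested sketch; it assumes threads are pinned to cores in order, and all names are made up:

Code:

    #include <vector>

    // Which socket a thread lives on, assuming threads pinned in core order.
    int SocketOf(int thread_id, int cores_per_socket) {
      return thread_id / cores_per_socket;
    }

    // Pick helpers for 'master': same-socket threads first, remote threads
    // only while the group (master + helpers) is below max_thread_group.
    std::vector<int> PickHelpers(int master, const std::vector<int>& idle,
                                 int cores_per_socket, int max_thread_group) {
      std::vector<int> picked;
      int home = SocketOf(master, cores_per_socket);
      for (int t : idle) // first pass: same socket as the master
        if ((int)picked.size() < max_thread_group - 1 &&
            SocketOf(t, cores_per_socket) == home)
          picked.push_back(t);
      for (int t : idle) // second pass: fill up across sockets
        if ((int)picked.size() < max_thread_group - 1 &&
            SocketOf(t, cores_per_socket) != home)
          picked.push_back(t);
      return picked;
    }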

Currently this is not a particularly big deal, since most machines today are either one socket or two sockets. But I have used AMD boxes with 8 sockets and there it becomes significant. We have a 4x6 box that I have not yet run on.

So, bottom line: max_thread_group is not trying to limit split costs; I do that via the min_split_depth constraint, to avoid splitting too close to the tips. It is really intended to reduce split block contention, particularly on NUMA architectures (which, it seems, all machines today are, at least from Intel and AMD, although a 1-socket box is not exactly a NUMA problem). For split blocks where contention is minimal it is ignored (currently at any ply <= 4, but that has yet to be tuned).
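
Put together, the three constraints plus the near-root exemption boil down to a test like this (an untested sketch of what I described above, not the actual Crafty code; the names and the ply <= 4 cutoff are placeholders):

Code:

    #include <cstdint>

    struct SplitParams {
      int min_split_depth;      // never split with less remaining depth
      uint64_t min_split_nodes; // this thread must have searched this many
                                // nodes here before it may split
      int max_thread_group;     // cap on threads at one split point...
      int root_exempt_ply;      // ...ignored at ply <= this (currently 4)
    };

    bool CanAddThreadAtSplit(const SplitParams& p, int remaining_depth,
                             int ply, uint64_t nodes_searched_here,
                             int threads_already_here) {
      if (remaining_depth < p.min_split_depth)
        return false; // too close to the tips: subtrees too small
      if (nodes_searched_here < p.min_split_nodes)
        return false; // no evidence yet of a big enough subtree
      if (ply > p.root_exempt_ply &&
          threads_already_here >= p.max_thread_group)
        return false; // away from the root, limit split block contention
      return true;
    }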
bob
Posts: 20943
Joined: Mon Feb 27, 2006 7:30 pm
Location: Birmingham, AL

Re: Stockfish - Decrease of nodes in endgame positions

Post by bob »

syzygy wrote:
bob wrote:The min_thread_group idea from crafty simply says "no more than this many threads at any single split point" unless we are within N plies of the root.
I had missed the "unless we are within N plies of the root" part (if it was mentioned in your earlier comments, I did not check). The "Max Threads per Split Point" restriction of SF did not have this condition, if I am not mistaken.

I would think "unless remaining depth > min_split_depth", i.e. only apply max_thread_group if split depth = min_split_depth, but in any event your "unless" condition removes most of the problems.

I have not yet looked into the crafty sources to see how "N" is determined.
Released versions only ignore max_thread_group at ply=1. The current version (yet to be released) ignores it at ply <= 4, but this has yet to be tuned. I don't want to diddle with that until I complete all of the SMP changes (there are several that might well change this value). In fact, I would like to avoid a constant altogether. When searching 30 plies deep, I'd bet that ignoring it below ply 10 would be OK, NUMA or not. When searching 50 plies (endgames), I suspect 20 might work. So there is work left to do to complete this part. Right now it is very simplistic, to say the least.
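
If the constant goes away, the simplest replacement is to scale the exempt zone with iteration depth, something like this (untested, and the 1/3 fraction is just a guess that roughly matches the 10-of-30 and 20-of-50 numbers above):

Code:

    #include <algorithm>

    // Ignore max_thread_group near the root, where contention is minimal,
    // with "near" defined relative to the current iteration depth.
    bool IgnoreMaxThreadGroup(int ply, int iteration_depth) {
      int exempt_plies = std::max(4, iteration_depth / 3); // never below 4
      return ply <= exempt_plies;
    }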
syzygy
Posts: 5566
Joined: Tue Feb 28, 2012 11:56 pm

Re: Stockfish - Decrease of nodes in endgame positions

Post by syzygy »

bob wrote:BTW there are other considerations about max_thread_group. Contention is an issue. The more threads you have at a split point, the more contention for that split block. max_thread_group was originally done for NUMA boxes. If I have 4 sockets, with 8 cores per socket, each socket turns out to be a NUMA node, with local memory being much faster to access than memory on the other nodes. I can add a simple change that I keep handy, to try to split with cores on the same node, rather than across the nodes. When I was working on this, the idea was "max_thread_group" would be set to the number of cores on a single socket for obvious reasons. Then I could make an effort, at least half-way to the tips or further, to split within a socket, not across sockets.
Yes, for NUMA-aware engines it may very well be a good idea.

For SF, contention on the split block lock could be an issue, but this also seems to play a role only near the tips. (Well, it could play a role immediately after splitting, when all slave threads try to grab a move at the same time. It seems to me that it would be better if the master thread provided each slave with a first move to search, since the master can generate in one go these moves, which have to be generated anyway. However, that seems not to have worked so well when it was tried. Maybe the explanation is that more work for the master thread during the splitting operation means all slaves remain idle for a longer time...)
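
Concretely, I mean something along these lines (just a sketch with made-up names, not actual SF code):

Code:

    #include <vector>

    using Move = int; // stand-in for the engine's move type

    struct SlaveSlot { int thread_id; Move first_move; };

    // Run by the master while it still holds the split point lock: deal one
    // move to each slave so none of them needs the lock just to get started;
    // contention only begins with each slave's second move.
    void DealFirstMoves(std::vector<Move>& remaining,
                        std::vector<SlaveSlot>& slaves) {
      for (SlaveSlot& s : slaves) {
        if (remaining.empty())
          break;
        s.first_move = remaining.back();
        remaining.pop_back();
      }
    }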
bob
Posts: 20943
Joined: Mon Feb 27, 2006 7:30 pm
Location: Birmingham, AL

Re: Stockfish - Decrease of nodes in endgame positions

Post by bob »

syzygy wrote:
bob wrote:BTW there are other considerations about max_thread_group. Contention is an issue. The more threads you have at a split point, the more contention for that split block. max_thread_group was originally done for NUMA boxes. If I have 4 sockets, with 8 cores per socket, each socket turns out to be a NUMA node, with local memory being much faster to access than memory on the other nodes. I can add a simple change that I keep handy, to try to split with cores on the same node, rather than across the nodes. When I was working on this, the idea was "max_thread_group" would be set to the number of cores on a single socket for obvious reasons. Then I could make an effort, at least half-way to the tips or further, to split within a socket, not across sockets.
Yes, for NUMA-aware engines it may very well be a good idea.

For SF, contention on the split block lock could be an issue, but this also seems to play a role only near the tips. (Well, it could play a role immediately after splitting, when all slave threads try to grab a move at the same time. It seems to me that it would be better if the master thread provided each slave with a first move to search, since the master can generate in one go these moves, which have to be generated anyway. However, that seems not to have worked so well when it was tried. Maybe the explanation is that more work for the master thread during the splitting operation means all slaves remain idle for a longer time...)
The "giving them the first move" sounds interesting, but then it turns into hackish-type programming, at least for me, since my way of choosing the next move is pretty clean and this would be an exception. But I'm going to think about that, although in reality the contention for the split block lock is really not much of a delay overall.
User avatar
Eelco de Groot
Posts: 4567
Joined: Sun Mar 12, 2006 2:40 am

Re: Stockfish - Decrease of nodes in endgame positions

Post by Eelco de Groot »

gladius wrote:
bob wrote:
gladius wrote:
Joerg Oster wrote:Here is some preliminary, but very interesting data.
<snip>
Nice! You should queue up an 8-thread test on fishtest.
Do they really do serious 8-thread testing? And this is really more important at 12-16 threads and beyond. 8 is sort of at the low end of expected gain...
Very rarely, as not many machines can run the tests, so it takes a long time. But it is done, yes :).
I really hope there is time to do this before an eventual new stage in TCEC. But I admit I would probably implement Joerg's code in Rainbow Serpent right away as well, as opposed to any of the contempt code. Just to say: this is much more attractive than the contempt testing, but that is just my personal opinion, which the testers should ignore. But please test this, Jörg! I am very interested :P

Regards, Eelco
Debugging is twice as hard as writing the code in the first place. Therefore, if you write the code as cleverly as possible, you are, by definition, not smart enough to debug it.
-- Brian W. Kernighan