LMR and null move selectivity

Discussion of chess software programming and technical issues.

Moderators: hgm, Rebel, chrisw

bob
Posts: 20943
Joined: Mon Feb 27, 2006 7:30 pm
Location: Birmingham, AL

Re: LMR and null move selectivity

Post by bob »

michiguel wrote:
bob wrote:
Don wrote:
bob wrote:
xcomponent wrote:
Doesn't seem to hurt. But doesn't help either.
That's true when it comes down to LMR. For null move I would say that there is a noticeable difference. I mean "if (!pvnode)" for null move definitely hurts. I'm talking about my own tests, of course. Maybe someone would disagree.
Same for me. _any_ restriction on where Null-move is tried, with the exception of the transposition table trick, hurts performance in my cluster testing.
What is the transposition table trick?

I am now NOT reducing any move after a null move. This is mostly due to superstition. In testing it comes out 1 or 2 Elo stronger, but with a margin of error much greater than that (about 5), so it's not clear whether it's any kind of benefit or not.
This was first mentioned by someone I do not remember. The idea is this...

When you do a hash probe, often the draft is not sufficient and you can use the best move, but not the score/bound. But if you take the current depth and subtract the usual null-move reduction from it, you can check the draft against that reduced depth, which is the depth a null-move search would use anyway. If the draft is now sufficient, and the probe result says the score or bound will _not_ fail high here, then we just proved that a null-move search would not fail high here either. I return a flag "avoid_null = 1" to Search(), which then refuses to try the null-move since we know it would be wasted.

If that isn't clear, let me know and I will do a more formal explanation.
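A minimal sketch of that probe-time check, assuming a simple hash-entry layout; the names (HashEntry, UPPER_BOUND, R, avoid_null) are hypothetical placeholders here, not Crafty's actual identifiers:

```c
#define R 3               /* assumed null-move depth reduction */
#define UPPER_BOUND 1     /* entry recorded a fail-low: score <= value */

typedef struct {
    int draft;            /* depth this entry was searched to */
    int value;            /* score or bound stored in the entry */
    int bound;            /* bound type flags; UPPER_BOUND = fail-low */
} HashEntry;

/* Return 1 ("avoid_null") when the table already proves that a
   null-move search at this node cannot fail high: the entry's draft
   covers the reduced null-move depth (depth - R - 1), and its score
   is an upper bound that lies below beta. */
int avoid_null(const HashEntry *e, int depth, int beta)
{
    if (e == 0)
        return 0;                     /* no entry: nothing proven */
    if (e->draft >= depth - R - 1 &&  /* deep enough for the null search */
        (e->bound & UPPER_BOUND) &&   /* score is an upper bound ...     */
        e->value < beta)              /* ... and it is below beta        */
        return 1;                     /* null move would fail low: skip it */
    return 0;
}
```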
If I understand correctly, I have been doing exactly that since I added null move (~8 years ago) to Gaviota. I always thought everybody was doing something similar, since it is very simple. Of course, I never tested whether it was better or worse; I assumed that in the worst-case scenario it was neutral.

Miguel
This was added to Crafty version 9.4, which was a 1995-era program. This was also used in Cray Blitz. I believe Murray Campbell and Hsu first mentioned this idea in their singular extensions paper, which also included some null-move search discussion, but I might be wrong about which paper it was in. I do know I got the idea from Murray somewhere around 1990-1992; I'm not sure exactly when, as I have no versions of Cray Blitz recent enough to check the comments.
User avatar
michiguel
Posts: 6401
Joined: Thu Mar 09, 2006 8:30 pm
Location: Chicago, Illinois, USA

Re: LMR and null move selectivity

Post by michiguel »

bob wrote:
michiguel wrote:
bob wrote:
Don wrote:
bob wrote:
xcomponent wrote:
Doesn't seem to hurt. But doesn't help either.
That's true when it comes down to LMR. For null move I would say that there is a noticeable difference. I mean "if (!pvnode)" for null move definitely hurts. I'm talking about my own tests, of course. Maybe someone would disagree.
Same for me. _any_ restriction on where Null-move is tried, with the exception of the transposition table trick, hurts performance in my cluster testing.
What is the transposition table trick?

I am now NOT reducing any move after a null move. This is mostly due to superstition. In testing it comes out 1 or 2 Elo stronger, but with a margin of error much greater than that (about 5), so it's not clear whether it's any kind of benefit or not.
This was first mentioned by someone I do not remember. The idea is this...

When you do a hash probe, often the draft is not sufficient and you can use the best move, but not the score/bound. But if you take the current depth and subtract the usual null-move reduction from it, you can check the draft against that reduced depth, which is the depth a null-move search would use anyway. If the draft is now sufficient, and the probe result says the score or bound will _not_ fail high here, then we just proved that a null-move search would not fail high here either. I return a flag "avoid_null = 1" to Search(), which then refuses to try the null-move since we know it would be wasted.

If that isn't clear, let me know and I will do a more formal explanation.
If I understand correctly, I have been doing exactly that since I added null move (~8 years ago) to Gaviota. I always thought everybody was doing something similar, since it is very simple. Of course, I never tested whether it was better or worse; I assumed that in the worst-case scenario it was neutral.

Miguel
This was added to Crafty version 9.4, which was a 1995-era program. This was also used in Cray Blitz. I believe Murray Campbell and Hsu first mentioned this idea in their singular extensions paper, which also included some null-move search discussion, but I might be wrong about which paper it was in. I do know I got the idea from Murray somewhere around 1990-1992; I'm not sure exactly when, as I have no versions of Cray Blitz recent enough to check the comments.
Since this is a hobby for me, I love to reinvent the wheel :-)

Miguel
frankp
Posts: 228
Joined: Sun Mar 12, 2006 3:11 pm

Re: LMR and null move selectivity -- data

Post by frankp »

bob wrote:
frankp wrote:
bob wrote:
<snip>
games, and then fires off instances of my referee for each node used which takes those scripts and actually plays the indicated opponents using the indicated starting positions. When the match ends, the name tag in each PGN for that match is changed to "Crafty-xx.xRxx-scale" where xx.x is the normal version number, Rxx where xx is a unique number indicating my test version (I also have Txx for versions from Tracy, etc). When I finish I have an enormous PGN collection containing all versions, which gets pushed thru Bayeselo and the result added to a master results file. I also get directories of the form Crafty-xx.xRxx-scale-match which has just the PGN from that specific version/sub-version/scale in case I want to look at them individually later.
<more snip>
Thanks Bob. All this testing and real data, still no substitute for 'speculation' - it would seem ;-)

As a matter of interest, how much do you estimate Crafty's elo has advanced as a consequence of all this rigorous testing?
bob
Posts: 20943
Joined: Mon Feb 27, 2006 7:30 pm
Location: Birmingham, AL

Re: LMR and null move selectivity -- data

Post by bob »

frankp wrote:
bob wrote:
frankp wrote:
bob wrote:
<snip>
games, and then fires off instances of my referee for each node used which takes those scripts and actually plays the indicated opponents using the indicated starting positions. When the match ends, the name tag in each PGN for that match is changed to "Crafty-xx.xRxx-scale" where xx.x is the normal version number, Rxx where xx is a unique number indicating my test version (I also have Txx for versions from Tracy, etc). When I finish I have an enormous PGN collection containing all versions, which gets pushed thru Bayeselo and the result added to a master results file. I also get directories of the form Crafty-xx.xRxx-scale-match which has just the PGN from that specific version/sub-version/scale in case I want to look at them individually later.
<more snip>
Thanks Bob. All this testing and real data, still no substitute for 'speculation' - it would seem ;-)

As a matter of interest, how much do you estimate Crafty's elo has advanced as a consequence of all this rigorous testing?
We started doing this carefully between 22.x and 23.0. 23.0 was 70+ Elo stronger than 22.x (last version, whatever that was). Currently, 23.1 is +25 Elo stronger than 23.0. I am about ready to release it, but we keep coming up with last-minute tweaks that are easy to make and easy to test (many become easy to reject, also). I'd like to get this one out, not because of the +25, but because it has a few pruning changes that others might want to test, and it has some draw offer/claim changes that should solve some reported problems related to a race condition that was discussed at length.

So, in summary, we are about +100 better with 23.1 than the last 22.x version; not world-beating, but moving on up. We have spent a lot of time testing old ideas as well, some of which work, some of which do not. I beat LMR to death a couple of months back, spending about 2.5 months of testing time trying everything I could think of, everything mentioned here, and everything mentioned in current open-source programs. I finally gave up, as nothing was better than the simplistic approach I was already using, no history counter crap, etc.

I'm testing ideas every last day.
Tord Romstad
Posts: 1808
Joined: Wed Mar 08, 2006 9:19 pm
Location: Oslo, Norway

Re: LMR and null move selectivity

Post by Tord Romstad »

bob wrote:I have done a _ton_ of testing using this so-called "threat move" to control pruning. It really doesn't work.
Just to clarify how I use the threat moves, and how I arrived at the current implementation:

Stockfish doesn't have a sharp boundary between the main search and the qsearch, but a 7-ply intermediate phase which I sometimes call a "super qsearch", where we only search the hash table move, captures, checks, passed pawn pushes and the first few other moves (as ordered by our move ordering heuristics). The remaining moves are pruned, unless they appear particularly interesting for some reason. This is where the threat moves enter the picture.

When I first started this massive forward pruning, I pruned a little more aggressively than now. In order to understand the effects of the pruning and the most typical errors, I wrote a special version which didn't actually prune at all, but just kept note of which moves would have been pruned, and wrote the (position, move) pair to a log file whenever a move which would normally have been pruned caused a beta cutoff.

It turned out that in most cases when the first few moves failed low, but a quiet move late in the move list caused a beta cutoff, it was because of some very simple tactical threat, like a hanging piece. The early moves failed low because they left the piece hanging, and the moves which saved the piece appeared too late in the move list. This hanging piece was normally captured by the threat move detected by the null move search. That's how I came up with some null-move-based criteria for when not to prune moves late in the move list.

After adding these tricks, the total number of pruned moves didn't decrease very much, but the relative frequency of incorrectly pruned moves dropped a lot.

I have no idea how much it helps in actual games; I've never tested it very much. At any rate, I think our pruning scheme is still very primitive (it's surprising that it seems to work reasonably well in practice) and has a huge potential for improvement, but I think incorporating information found by the null move search will remain useful in some way.
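The shape of such a criterion could be sketched like this, assuming a simple from/to move encoding; can_prune and the exact conditions are hypothetical illustrations, not Stockfish's actual code, which also considers e.g. moves that block the threat's path or defend the hanging piece:

```c
/* A move, encoded as origin and destination squares (0..63). */
typedef struct { int from, to; } Move;

/* A late quiet move may be pruned unless it appears to answer the
   threat move found by the null-move search: it either moves the
   piece the threat would capture (m.from == threat.to) or contests
   the threat's destination square (m.to == threat.to). */
int can_prune(Move m, Move threat, int has_threat)
{
    if (!has_threat)
        return 1;                /* no threat info: prune freely  */
    if (m.from == threat.to)
        return 0;                /* saves the hanging piece       */
    if (m.to == threat.to)
        return 0;                /* blocks/contests the target    */
    return 1;
}
```

The point of the logged statistics above is exactly to tune which of these exemptions pay for themselves: each extra condition reduces incorrectly pruned cutoff moves, but also shrinks the total number of pruned moves.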
BubbaTough
Posts: 1154
Joined: Fri Jun 23, 2006 5:18 am

Re: LMR and null move selectivity -- data

Post by BubbaTough »

bob wrote:
frankp wrote:
bob wrote:
frankp wrote:
bob wrote:
<snip>
games, and then fires off instances of my referee for each node used which takes those scripts and actually plays the indicated opponents using the indicated starting positions. When the match ends, the name tag in each PGN for that match is changed to "Crafty-xx.xRxx-scale" where xx.x is the normal version number, Rxx where xx is a unique number indicating my test version (I also have Txx for versions from Tracy, etc). When I finish I have an enormous PGN collection containing all versions, which gets pushed thru Bayeselo and the result added to a master results file. I also get directories of the form Crafty-xx.xRxx-scale-match which has just the PGN from that specific version/sub-version/scale in case I want to look at them individually later.
<more snip>
Thanks Bob. All this testing and real data, still no substitute for 'speculation' - it would seem ;-)

As a matter of interest, how much do you estimate Crafty's elo has advanced as a consequence of all this rigorous testing?
We started doing this carefully between 22.x and 23.0. 23.0 was 70+ Elo stronger than 22.x (last version, whatever that was). Currently, 23.1 is +25 Elo stronger than 23.0. I am about ready to release it, but we keep coming up with last-minute tweaks that are easy to make and easy to test (many become easy to reject, also). I'd like to get this one out, not because of the +25, but because it has a few pruning changes that others might want to test, and it has some draw offer/claim changes that should solve some reported problems related to a race condition that was discussed at length.

So, in summary, we are about +100 better with 23.1 than the last 22.x version; not world-beating, but moving on up. We have spent a lot of time testing old ideas as well, some of which work, some of which do not. I beat LMR to death a couple of months back, spending about 2.5 months of testing time trying everything I could think of, everything mentioned here, and everything mentioned in current open-source programs. I finally gave up, as nothing was better than the simplistic approach I was already using, no history counter crap, etc.

I'm testing ideas every last day.
There is no question that Crafty made a significant jump in strength soon after the improved testing regimen was implemented. It was very noticeable. In my opinion it is possible to make a very fine engine without quite as powerful a testing tool as Bob wields, but no one should doubt Bob has a very useful setup that quickly yielded tangible results.

-Sam
bob
Posts: 20943
Joined: Mon Feb 27, 2006 7:30 pm
Location: Birmingham, AL

Re: LMR and null move selectivity -- data

Post by bob »

BubbaTough wrote:
bob wrote:
frankp wrote:
bob wrote:
frankp wrote:
bob wrote:
<snip>
games, and then fires off instances of my referee for each node used which takes those scripts and actually plays the indicated opponents using the indicated starting positions. When the match ends, the name tag in each PGN for that match is changed to "Crafty-xx.xRxx-scale" where xx.x is the normal version number, Rxx where xx is a unique number indicating my test version (I also have Txx for versions from Tracy, etc). When I finish I have an enormous PGN collection containing all versions, which gets pushed thru Bayeselo and the result added to a master results file. I also get directories of the form Crafty-xx.xRxx-scale-match which has just the PGN from that specific version/sub-version/scale in case I want to look at them individually later.
<more snip>
Thanks Bob. All this testing and real data, still no substitute for 'speculation' - it would seem ;-)

As a matter of interest, how much do you estimate Crafty's elo has advanced as a consequence of all this rigorous testing?
We started doing this carefully between 22.x and 23.0. 23.0 was 70+ Elo stronger than 22.x (last version, whatever that was). Currently, 23.1 is +25 Elo stronger than 23.0. I am about ready to release it, but we keep coming up with last-minute tweaks that are easy to make and easy to test (many become easy to reject, also). I'd like to get this one out, not because of the +25, but because it has a few pruning changes that others might want to test, and it has some draw offer/claim changes that should solve some reported problems related to a race condition that was discussed at length.

So, in summary, we are about +100 better with 23.1 than the last 22.x version; not world-beating, but moving on up. We have spent a lot of time testing old ideas as well, some of which work, some of which do not. I beat LMR to death a couple of months back, spending about 2.5 months of testing time trying everything I could think of, everything mentioned here, and everything mentioned in current open-source programs. I finally gave up, as nothing was better than the simplistic approach I was already using, no history counter crap, etc.

I'm testing ideas every last day.
There is no question that Crafty made a significant jump in strength soon after the improved testing regimen was implemented. It was very noticeable. In my opinion it is possible to make a very fine engine without quite as powerful a testing tool as Bob wields, but no one should doubt Bob has a very useful setup that quickly yielded tangible results.

-Sam
I would word it a bit differently. What I can get away with, that most can not, is that I can test ideas that look bad from the get-go, because I can test them at such a fast rate that there is no lost time to speak of. Anybody can actually test over 40,000 games to confirm or reject a new idea's usefulness. And they can possibly even do it in a day if they play very fast games. I can do this same thing in less than an hour for fast games. On our bigger cluster, I can actually get a 40,000 game match played 540 games at a time, which turns into about 75 game cycles. If you play 1 second games, that takes less than 1.5 minutes to complete. If you play 1 minute games, that is about 1.25 hours.

My cluster facilities simply let me be more sloppy since I don't need to worry about testing one extra parameter value, because the cost for doing so is so low I can ignore it and avoid wasting time thinking about what to test, and just test. I sometimes start tests with a wide range of values for a single parameter, playing 40K games for each value. And as I watch, the scores start off low, but each 40K game match shows an improvement as the parameter gets bigger, until it reaches the optimal value and the results start to get worse again. I can terminate at that point and go on to the next test. And I can do all of that in under a half-hour unless I believe that longer games are needed to validate changes measured at fast speeds, which I also do.

With limited hardware, you have to think carefully about what to test because of the time required. Take the null-move trick discussion yesterday: I played 80K games, and also ran longer time controls so that each run played 1-minute games, for 160K total games in about 5 hours.
User avatar
Eelco de Groot
Posts: 4567
Joined: Sun Mar 12, 2006 2:40 am

Re: LMR and null move selectivity

Post by Eelco de Groot »

Zach Wegner wrote:
mcostalba wrote:
bob wrote:
mcostalba wrote:
bob wrote: If that isn't clear, let me know and I will do a more formal explanation.
Could you please do a more formal explanation ? I got the idea but I think I still miss the details.

Thanks in advance
Marco
Sure.
Thanks for your explanation.

I have measured some quick statistics, and it seems that with this trick it would be possible to avoid about 5% of null-move searches (not a biggie indeed, given that null searches are already a small percentage of total search time).

It is anyhow interesting to see that it is not 100% accurate, because a very small percentage of null moves that satisfy the conditions fail high anyway when they shouldn't: about 340 out of more than 120K, so less than 0.3%.

I didn't have the time to further analyse what kind of positions violate the rule.
Actually, when it "fails", it's a good thing. If the bound in the hash table has the UPPER bit, then it wasn't from a null move search since it failed low. So if the normal search failed low but null move will fail high, then you have a zugzwang-type position (or just some search stability noise), and you don't want to null move at all.

EDIT: what you might actually want to do is, if you're null-moving despite whatever's in the hash, and it fails according to the above, store a "zugzwang" bit in the hash table, and THEN don't null-move next time. You still don't get the threat move though, so maybe just don't accept the cutoffs from null move if the hash table says not to. Or maybe a separate "zugzwang threat move" table ;)
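A minimal standalone sketch of that "zugzwang bit" idea, using a tiny side table indexed by the position's hash key; all names and the table layout are hypothetical:

```c
#define TABLE_SIZE 4096

/* 1 = a null-move search failed high here even though the hash table
   said the normal search fails low, so don't trust null move at this
   position again (zugzwang suspect, or search instability). */
static unsigned char zugzwang_bit[TABLE_SIZE];

static unsigned idx(unsigned long long key)
{
    return (unsigned)(key % TABLE_SIZE);
}

/* Called when the contradiction described above is detected. */
void mark_zugzwang(unsigned long long key)
{
    zugzwang_bit[idx(key)] = 1;
}

/* Consulted before trying a null move, or before accepting its
   cutoff, as suggested in the post. */
int null_move_trusted(unsigned long long key)
{
    return !zugzwang_bit[idx(key)];
}
```

With a table this small, unrelated positions share slots, so a set bit only costs an occasional skipped null move; the threat-move variant mentioned above would additionally have to store a move in each slot.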
Just a little idea I had about Marco's observation here, it's just a detail in all the interesting discussions you guys have been having. Maybe I'm wrong, but my guess was that Marco was just seeing the effect of what happens when a verification search does not support the fail high of the null move. Possibly related to a Zugzwang.

Using the hash table information alone, if you have it, is in my understanding just a maximally shortened form of doing a verification search before the null move: because there is enough draft, no actual search is necessary. So I would say it is not entirely true that Crafty does not do a verification search around null move; it does a mean trick with the hash tables before the null move instead, so it does not actually have to call a search :) .

In Ancalagon I also do the verification searches before the null move. A drawback is that there is not always a threat move available, as Tord has been saying, because no null-move search was done to determine the threat. I don't know if that is a big problem or not. Maybe it just means that in that case you can try a few other quiet moves instead of moves that are connected to the null move.

This is especially true for Stockfish and Ancalagon because they can simply prune like mad in the null-window searches, without even considering a reduced search or, even more extreme, jumping into q-search; depending on the depth of the search, they just call "continue" (which in C here actually means "do not continue with this move").

As Johan Cruyff would say in his Amsterdam dialect, "Ieder voordeel heb z'n nadeel" ("every advantage has its disadvantage"): this selectivity, related to the tactical search in Glaurung or Stockfish, is "both a strength and a weakness". And move ordering, in relation to this, is of more importance in Stockfish than in Crafty; that seems obvious. This is just for readers passing by here who may not yet be familiar with the source code of Stockfish :)

Regards, Eelco
Debugging is twice as hard as writing the code in the first
place. Therefore, if you write the code as cleverly as possible, you
are, by definition, not smart enough to debug it.
-- Brian W. Kernighan