lkaufman wrote:
> bob wrote:
> > BTW, one thing I can guarantee. You can NOT tune this stuff with 10s+0.1s type games. We've been tuning null-move parameters, and until you get to something decent (5m+5s or so) you won't get anything useful tuning-wise. Some things look bad (particularly in self-play), some things look break-even. But at decent time controls, some of this will actually start to work. But it takes a ton of testing time.
>
> Have you found that LMR reductions should be more at five minutes than at ten seconds, or less, or is there no pattern? Also, have you found that LMR reductions should be more against a gauntlet than in self-play, less, or no pattern? Same questions for null-move reductions? Finally, what is the evidence for your above statement? After all, Stockfish never tests anything at longer than one minute plus 0.05 seconds, and it's not such a weak program. Thanks in advance for your comments.

We (myself, Tracy and Mike) have been fiddling with null-move and LMR for quite a while. 10s+0.1s just doesn't work. Getting up to 5m+5s, you can begin to see significant differences in whether something works or not, compared to 10s+0.1s or something similar.

You raise two issues, so one at a time:
When this self-test vs. gauntlet question came up the last time, I decided to investigate carefully. An SPRT test is pretty simple, so I simply modified a version of my cluster testing (primarily the shell script that creates the match scripts) to do P vs. P' testing (self-test). I found this to be inconsistent when compared to a gauntlet; i.e., self-test would say "bad" while the gauntlet would say "good". Note that these were not major changes, just tweaks. The most recent change was about slowly turning null move "down" rather than just chopping it off. Longer games showed the change to be a +12 gain, while self-testing showed it to be break-even or slightly worse (the result was so close that way more than 30K games were needed to resolve an accurate score). Ditto for short vs. longer time controls: short said "break-even or slightly worse", but I decided to go for a much longer (5m+5s) game, and there was the improvement.
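Since I said an SPRT test is "pretty simple", here is a minimal sketch of the trinomial GSPRT log-likelihood-ratio bookkeeping that engine-testing frameworks typically use. The Elo hypotheses (elo0/elo1) and alpha/beta below are illustrative defaults, not necessarily what my scripts use:

    import math

    def elo_to_score(elo: float) -> float:
        """Expected per-game score for a given Elo difference."""
        return 1.0 / (1.0 + 10.0 ** (-elo / 400.0))

    def llr(wins: int, draws: int, losses: int, elo0: float, elo1: float) -> float:
        """Log-likelihood ratio of H1 (diff >= elo1) vs. H0 (diff <= elo0)."""
        n = wins + draws + losses
        if n == 0:
            return 0.0
        w, d = wins / n, draws / n
        s = w + d / 2.0                 # observed score per game
        var = (w + d / 4.0) - s * s     # per-game variance of the score
        if var <= 0.0:
            return 0.0
        s0, s1 = elo_to_score(elo0), elo_to_score(elo1)
        return n * (s1 - s0) * (2.0 * s - s0 - s1) / (2.0 * var)

    def sprt_state(wins, draws, losses, elo0=0.0, elo1=5.0, alpha=0.05, beta=0.05):
        """Return 'H1' (accept the change), 'H0' (reject it), or 'continue'."""
        upper = math.log((1.0 - beta) / alpha)   # ~ +2.94 for alpha = beta = 0.05
        lower = math.log(beta / (1.0 - alpha))   # ~ -2.94
        ratio = llr(wins, draws, losses, elo0, elo1)
        if ratio > upper:
            return "H1"
        if ratio < lower:
            return "H0"
        return "continue"

    # Example: after 4000 games, P' is scoring 52.5%, which is decisive here.
    print(sprt_state(wins=1250, draws=1700, losses=1050))  # -> "H1"

The point of the sequential part is that you stop as soon as the LLR crosses either bound, which is exactly where the "fewer games" temptation comes from.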
I've had OK luck with self-testing when making significant changes. But when the changes are small, self-testing takes a lot more games to resolve good/bad than the gauntlet I normally use.

So it is an observation, without a lot of rigor backing it up. I trust gauntlet testing; it has not led me astray so far. Self-testing has given false positives and many more false negatives, unless the change is a significant one.
As for your last comment: nothing says that self-testing at short time controls won't lead to an overall stronger engine; the question is whether it will lead to the BEST that engine can play. SPRT is one of those things that sounds reasonable, but any time you try to cheat the numbers game to play fewer games, there's a gotcha hidden inside.
There might be a better way to choose the starting positions, but if so, I have not discovered what that might be. The number of positions has to be large when playing so many games. I've reduced the number of positions as new and reliable engines have hit the scene (e.g., Scorpio, Senpai, etc.), since more opponents means fewer games per opponent. I've seen the arguments: do you want positions that are dead even for balance, do you want positions that are unbalanced and therefore favor one side (assuming optimal play), or what? I continue to play with this. Currently I reject positions where a short (1 second or so) search says one side is winning, and also positions where that search says the score is between -0.10 and +0.10 (i.e., drawish).

I may get interested enough to use 1-minute searches, let them grind away on part of a cluster (maybe 50-60 games at a time), and use the same sort of culling scheme with some tweaking; i.e., take the FEN, add a "ce" (centipawn evaluation) opcode, and then try to see if that can be used to refine the test positions. Almost-random works well enough, but I wonder if it can be improved on. Right now, existing opening position collections are WAY too small to be useful.
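To make the culling concrete, here is a rough sketch of that filter using the python-chess library. The engine binary name, the 1-second limit as a Limit(time=...) call, and the +/- 1.00 pawn cutoff for "one side is winning" are my assumptions for illustration (the drawish band is the only threshold pinned down above):

    import chess
    import chess.engine

    WIN_THRESHOLD = 1.00   # assumed cutoff for "one side is winning", in pawns
    DRAW_BAND = 0.10       # scores inside +/- 0.10 count as drawish

    def keep_position(fen: str, engine: chess.engine.SimpleEngine) -> bool:
        """Keep a starting position only if a quick search says it is
        neither clearly won for one side nor dead drawish."""
        board = chess.Board(fen)
        info = engine.analyse(board, chess.engine.Limit(time=1.0))
        score = info["score"].white().score(mate_score=100000) / 100.0  # pawns
        if abs(score) >= WIN_THRESHOLD:   # one side already winning: reject
            return False
        if abs(score) <= DRAW_BAND:       # dead even / drawish: reject
            return False
        return True

    if __name__ == "__main__":
        # any UCI engine works; "stockfish" on the PATH is just an example
        engine = chess.engine.SimpleEngine.popen_uci("stockfish")
        with open("positions.fen") as f:
            for line in f:
                fen = line.strip()
                if fen and keep_position(fen, engine):
                    print(fen)
        engine.quit()

What survives is the band of modestly unbalanced positions, which is the intent: not dead level, not already decided.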
I've even thought about taking an ECO classification file and using those positions. At least it would provide broad coverage.
The idea of playing massive numbers of games works. Absolutely. But how far it is from "working optimally" is an unknown (at least for me).
The final thing I don't like about SPRT is that it is not easy to run multiple tests at the same time. SPRT just says "P' is better than P" (or it isn't, or the result is indeterminate). So you have to run P vs P', pick the best, then run that best vs P'', etc. With gauntlet testing that's not necessary: I can run as many different versions as I want at the same time, let BayesElo sort 'em out and tell me which changes were good and which were bad, and then combine the good ones for further testing.
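For anyone who hasn't used it, driving BayesElo over a combined PGN is trivial to script. Here is a sketch; it assumes a "bayeselo" binary on the PATH and one PGN holding every version's gauntlet games, and it uses BayesElo's documented console commands (readpgn/elo/mm/exactdist/ratings, with "x" backing out of each menu):

    import subprocess

    def rate_versions(pgn_path: str) -> str:
        """Feed all gauntlet games to BayesElo and return the rating table."""
        commands = "\n".join([
            f"readpgn {pgn_path}",  # load every game: all versions vs. the gauntlet
            "elo",                  # enter the Elo-estimation submenu
            "mm",                   # maximum-likelihood rating fit
            "exactdist",            # compute confidence intervals
            "ratings",              # print the sorted rating table
            "x", "x",               # leave the submenu, then quit
        ]) + "\n"
        result = subprocess.run(["bayeselo"], input=commands,
                                capture_output=True, text=True, check=True)
        return result.stdout

    if __name__ == "__main__":
        print(rate_versions("all_gauntlets.pgn"))

One rating run over all the versions at once is exactly the "let BayesElo sort 'em out" step; the sequential P vs. P' vs. P'' chain never happens.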