lkaufman wrote:
> bob wrote:
> > BTW, one thing I can guarantee. You can NOT tune this stuff with 10s+0.1s type games. We've been tuning null-move parameters, and until you get to something decent (5m+5s or so) you won't get anything useful tuning-wise. Some things look bad (particularly in self-play), some things look break-even. But at decent time controls, some of this will actually start to work. But it takes a ton of testing time.
>
> Have you found that LMR reductions should be more at five minutes than at ten seconds, or less, or is there no pattern? Also, have you found that LMR reductions should be more against a gauntlet than in self-play, less, or no pattern? Same questions for null-move reductions? Finally, what is the evidence for your above statement? After all, Stockfish never tests anything at longer than one minute plus 0.05 seconds, and it's not such a weak program. Thanks in advance for your comments.

We (myself, Tracy and Mike) have been fiddling with null-move and LMR for quite a while. 10s+0.1s just doesn't work. Getting up to 5m+5s, you can begin to see significant differences in whether something works or not, compared to 10s+0.1s or something similar.

You raise two issues, so one at a time:
When this self-test vs. gauntlet question came up the last time, I decided to investigate carefully. An SPRT test is pretty simple, so I simply modified a version of my cluster testing (primarily the shell script that creates the match scripts) to do P vs. P' testing (self-test). I found this to be inconsistent when compared to a gauntlet; i.e., self-test would say "bad" while the gauntlet would say "good". Note that these were not major changes, just tweaks. The most recent change was about slowly turning null move "down" rather than just chopping it off. Longer games showed the change to be a +12 gain, while self-testing showed it to be break-even or slightly worse (the result was so close that way more than 30K games were needed to resolve an accurate score). Ditto for short vs. longer time controls: short said "break-even or slightly worse", but I decided to go for a much longer (5m+5s) game, and there was the improvement.
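Since I said an SPRT test is "pretty simple", here is a minimal sketch of the trinomial GSPRT log-likelihood-ratio bookkeeping that engine-testing frameworks typically use. The Elo hypotheses (elo0/elo1) and alpha/beta below are illustrative defaults, not necessarily what my scripts use:

    import math

    def elo_to_score(elo: float) -> float:
        """Expected per-game score for a given Elo difference."""
        return 1.0 / (1.0 + 10.0 ** (-elo / 400.0))

    def llr(wins: int, draws: int, losses: int, elo0: float, elo1: float) -> float:
        """Log-likelihood ratio of H1 (diff >= elo1) vs. H0 (diff <= elo0)."""
        n = wins + draws + losses
        if n == 0:
            return 0.0
        w, d = wins / n, draws / n
        s = w + d / 2.0                 # observed score per game
        var = (w + d / 4.0) - s * s     # per-game variance of the score
        if var <= 0.0:
            return 0.0
        s0, s1 = elo_to_score(elo0), elo_to_score(elo1)
        return n * (s1 - s0) * (2.0 * s - s0 - s1) / (2.0 * var)

    def sprt_state(wins, draws, losses, elo0=0.0, elo1=5.0, alpha=0.05, beta=0.05):
        """Return 'H1' (accept the change), 'H0' (reject it), or 'continue'."""
        upper = math.log((1.0 - beta) / alpha)   # ~ +2.94 for alpha = beta = 0.05
        lower = math.log(beta / (1.0 - alpha))   # ~ -2.94
        ratio = llr(wins, draws, losses, elo0, elo1)
        if ratio > upper:
            return "H1"
        if ratio < lower:
            return "H0"
        return "continue"

    # Example: after 4000 games, P' is scoring 52.5%, which is decisive here.
    print(sprt_state(wins=1250, draws=1700, losses=1050))  # -> "H1"

The point of the sequential part is that you stop as soon as the LLR crosses either bound, which is exactly where the "fewer games" temptation comes from.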
I've had OK luck with self-testing when making significant changes. But when the changes are small, self-testing takes a lot more games to resolve good/bad than the gauntlet I normally use.

So it is an observation, without a lot of rigor backing it up. I trust gauntlet testing; it has not led me astray so far. Self-testing has given false positives and many more false negatives, unless the change is a significant one.
As for your last comment: nothing says that self-testing at short time controls won't lead to an overall stronger engine; the question is whether it will lead to the BEST that engine can play. SPRT is one of those things that sounds reasonable, but any time you try to cheat the numbers game to play fewer games, there's a gotcha hidden inside.
There might be a better way to choose the starting positions, but if so, I have not discovered what that might be. The number of positions has to be large when playing so many games. I've reduced the number of positions as new and reliable engines have hit the scene (e.g., Scorpio, Senpai, etc.), since more opponents means fewer games per opponent. I've seen the arguments: do you want positions that are dead even for balance, do you want positions that are unbalanced and therefore favor one side (assuming optimal play), or what? I continue to play with this. Currently I reject positions where a short (1 second or so) search says one side is winning, and also positions where that search says the score is between -0.10 and +0.10 (i.e., drawish).

I may get interested enough to use 1-minute searches, let them grind away on part of a cluster (maybe 50-60 games at a time), and use the same sort of culling scheme with some tweaking; i.e., take the FEN, add a "ce" (centipawn evaluation) opcode, and then try to see if that can be used to refine the test positions. Almost-random works well enough, but I wonder if it can be improved on. Right now, existing opening position collections are WAY too small to be useful.
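To make the culling concrete, here is a rough sketch of that filter using the python-chess library. The engine binary name, the 1-second limit as a Limit(time=...) call, and the +/- 1.00 pawn cutoff for "one side is winning" are my assumptions for illustration (the drawish band is the only threshold pinned down above):

    import chess
    import chess.engine

    WIN_THRESHOLD = 1.00   # assumed cutoff for "one side is winning", in pawns
    DRAW_BAND = 0.10       # scores inside +/- 0.10 count as drawish

    def keep_position(fen: str, engine: chess.engine.SimpleEngine) -> bool:
        """Keep a starting position only if a quick search says it is
        neither clearly won for one side nor dead drawish."""
        board = chess.Board(fen)
        info = engine.analyse(board, chess.engine.Limit(time=1.0))
        score = info["score"].white().score(mate_score=100000) / 100.0  # pawns
        if abs(score) >= WIN_THRESHOLD:   # one side already winning: reject
            return False
        if abs(score) <= DRAW_BAND:       # dead even / drawish: reject
            return False
        return True

    if __name__ == "__main__":
        # any UCI engine works; "stockfish" on the PATH is just an example
        engine = chess.engine.SimpleEngine.popen_uci("stockfish")
        with open("positions.fen") as f:
            for line in f:
                fen = line.strip()
                if fen and keep_position(fen, engine):
                    print(fen)
        engine.quit()

What survives is the band of modestly unbalanced positions, which is the intent: not dead level, not already decided.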
I've even thought about taking an ECO classification file and using those positions. At least it would provide broad coverage.
The idea of playing massive numbers of games works. Absolutely. But how far it is from "working optimally" is an unknown (at least for me).
The final thing I don't like about SPRT is that it is not easy to run multiple tests at the same time. SPRT just says "P' is better than P" (or it isn't, or the result is indeterminate). So you have to run P vs P', pick the best, then run that best vs P'', etc. With gauntlet testing that's not necessary: I can run as many different versions as I want at the same time, let BayesElo sort 'em out and tell me which changes were good and which were bad, and then combine the good ones for further testing.
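For anyone who hasn't used it, driving BayesElo over a combined PGN is trivial to script. Here is a sketch; it assumes a "bayeselo" binary on the PATH and one PGN holding every version's gauntlet games, and it uses BayesElo's documented console commands (readpgn/elo/mm/exactdist/ratings, with "x" backing out of each menu):

    import subprocess

    def rate_versions(pgn_path: str) -> str:
        """Feed all gauntlet games to BayesElo and return the rating table."""
        commands = "\n".join([
            f"readpgn {pgn_path}",  # load every game: all versions vs. the gauntlet
            "elo",                  # enter the Elo-estimation submenu
            "mm",                   # maximum-likelihood rating fit
            "exactdist",            # compute confidence intervals
            "ratings",              # print the sorted rating table
            "x", "x",               # leave the submenu, then quit
        ]) + "\n"
        result = subprocess.run(["bayeselo"], input=commands,
                                capture_output=True, text=True, check=True)
        return result.stdout

    if __name__ == "__main__":
        print(rate_versions("all_gauntlets.pgn"))

One rating run over all the versions at once is exactly the "let BayesElo sort 'em out" step; the sequential P vs. P' vs. P'' chain never happens.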