Hi, I'm developing a new version of my engine, and I want to try various improvements.
I added continuation history, and I obtained an Elo gain of about 20 (testing 1000 games at 5+0.05).
If I then want to try to change something else (say in the search) to see if it brings more improvements, is it better to try against the original version or against the one with continuation history added?
I ask this because I've had some strange results by doing the latter: tested against the updated version, there was an improvement,
but the same upgrade tested against the original version resulted in an Elo loss (meaning it not only lost the 20 Elo gain from cont. history, but even more).
Thanks!
Multiple change testing
Moderators: hgm, Rebel, chrisw
-
- Posts: 17
- Joined: Wed Jul 12, 2023 1:38 pm
- Full name: Giacomo Porpiglia
-
- Posts: 2555
- Joined: Tue Aug 30, 2016 8:19 pm
- Full name: Rasmus Althoff
Re: Multiple change testing
Not enough games IMO. I calculate with the sqrt(N) rule of thumb: In random coin throws, you have an uncertainty of sqrt(N). So in your case, sqrt(1000) = 31.6. Expectation value would be 500. Anything from 46.8% to 53.2% could be just noise in testing. Now, 53.2% would be 22 Elo. Means, your +20 Elo is well within the noise range. For detecting such a change reliably, I think you should use 10000 games, where the same math boils down to 49% to 51%, i.e. 7 Elo.
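The sqrt(N) arithmetic above can be reproduced with a few lines of Python. This is only a sketch of the rule of thumb as described in this post; the function names are made up for illustration, and the Elo conversion is the usual logistic formula.

```python
import math


def score_to_elo(score):
    """Convert an expected score (0..1, exclusive) to an Elo difference."""
    return -400.0 * math.log10(1.0 / score - 1.0)


def sqrt_n_noise(games):
    """Noise band of the sqrt(N) rule of thumb around a 50% score.

    Returns (low_score, high_score, elo_at_high_score).
    """
    half_width = math.sqrt(games) / games  # e.g. 31.6 / 1000 for 1000 games
    high = 0.5 + half_width
    return 0.5 - half_width, high, score_to_elo(high)


if __name__ == "__main__":
    for n in (1000, 10000):
        low, high, elo = sqrt_n_noise(n)
        print(f"{n} games: {low:.1%} to {high:.1%}, about +/-{elo:.1f} Elo")
```

For 1000 games this gives a band of roughly 22 Elo, and for 10000 games roughly 7 Elo, matching the numbers above.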
As for the actual question, I first test in self-play with enough games, typically 50k games. If that validates, I test against a handful of different engines (typically 6 engines with 10k games each) and compare the cumulative result to the baseline. If that also validates, then I accept the change and treat that as new baseline to test against in self-play.
The downside is that I use 10s/game fixed, no increments, and 14 games in parallel. That's not representative for long time controls, but I can get it done on my machine.
Rasmus Althoff
https://www.ct800.net
-
- Posts: 17
- Joined: Wed Jul 12, 2023 1:38 pm
- Full name: Giacomo Porpiglia
Re: Multiple change testing
Thanks for the quick response. As for the number of concurrent games, is it good to set 1 game for each core that my CPU has, or can I increase it even more?
-
- Posts: 2555
- Joined: Tue Aug 30, 2016 8:19 pm
- Full name: Rasmus Althoff
Re: Multiple change testing
I have an eight-core CPU with SMT, i.e. 16 logical cores. With 14 games in parallel, I still have two logical cores for other tasks like web browsing. I also disable the core boost during testing because not only does it make the CPU behave more consistently, it also needs much less energy compared to all-core boost, so that the cooling stays quiet. Can be done in the BIOS, but I have made a script with a little GUI so that I can make that setting from the desktop.
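As a rough sketch of the "logical cores minus two" setup described above, the following Python picks a concurrency value and assembles a cutechess-cli command line. The engine paths, the UCI protocol setting and the round count are assumptions for illustration only; the 10s/game time control is the one mentioned earlier in the thread.

```python
import os


def pick_concurrency(logical_cores=None, reserved=2):
    """Leave `reserved` logical cores free for the desktop, but use at least 1."""
    if logical_cores is None:
        logical_cores = os.cpu_count() or 1
    return max(1, logical_cores - reserved)


def cutechess_args(new_engine, base_engine, rounds, concurrency):
    """Build an argument list for cutechess-cli (paths are hypothetical)."""
    return [
        "cutechess-cli",
        "-engine", f"cmd={new_engine}",
        "-engine", f"cmd={base_engine}",
        "-each", "proto=uci", "tc=10",  # assumes UCI engines, 10s per game
        "-rounds", str(rounds),
        "-concurrency", str(concurrency),
    ]


if __name__ == "__main__":
    n = pick_concurrency()
    # "./engine_new" and "./engine_base" are placeholder names.
    print(" ".join(cutechess_args("./engine_new", "./engine_base", 1000, n)))
```

On a 16-thread machine this yields 14 concurrent games. The list can be passed to `subprocess.run` once the placeholder paths point at real binaries.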
Rasmus Althoff
https://www.ct800.net
-
- Posts: 17
- Joined: Wed Jul 12, 2023 1:38 pm
- Full name: Giacomo Porpiglia
Re: Multiple change testing
Alright, thanks. One last thing: the uncertainty reported by cutechess is not fully trustworthy with just a thousand games or so, right? From what I've seen, it's the range in which the true Elo difference lies, with 95% or 97.5% confidence (I don't remember which one). Is that right?
-
- Posts: 2555
- Joined: Tue Aug 30, 2016 8:19 pm
- Full name: Rasmus Althoff
Re: Multiple change testing
That will depend on how large the Elo difference is. If it's only small, like 20 Elo, then you need a lot more games than with a large difference. Telling that Stockfish is stronger than TSCP won't even need 1000 games, after all. Sometimes, I make a mistake when implementing something, and after 20 games, it's 0.5/20 games vs. the baseline, then I cancel the test already.
I think something like that, yeah. The sqrt(N) rule of thumb is about 95% when the result is rather close to 50%. You'll never get to true 100% as that would require an infinite amount of games. Just like it's theoretically possible that 10000 fair coin tosses all end up with heads.
Rasmus Althoff
https://www.ct800.net
-
- Posts: 2555
- Joined: Tue Aug 30, 2016 8:19 pm
- Full name: Rasmus Althoff
Re: Multiple change testing
Btw., if you test with many games in parallel with high total CPU usage, you may run into engine bugs with race conditions. I've seen a number of engines mishandling that. The issue pops up when there is an input thread that discards most GUI input during engine calculation, but the final move result comes from another thread (search / worker). That needs to either be properly synced, or the input thread needs to buffer instead of discarding. From the outside, such a bug would manifest as an occasionally unresponsive engine, but only under high CPU load.
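A minimal sketch of the "buffer instead of discard" variant, written in Python for brevity: the input thread puts every GUI line into a thread-safe queue, and the main loop joins the previous search before starting the next one, so a command that arrives while a search is finishing is never lost. All names, the fake search and the fixed "bestmove e2e4" reply are made up for illustration; a real engine would do this in its own language and protocol handling.

```python
import queue
import threading


def run_engine(gui_lines, search_time=0.01):
    """Process GUI commands with a buffering input thread; return engine output."""
    inbox = queue.Queue()  # thread-safe FIFO: nothing is ever discarded
    output = []            # list.append is atomic in CPython

    def input_thread():
        for line in gui_lines:
            inbox.put(line)  # buffer, even if a search is running
        inbox.put(None)      # sentinel: GUI closed the pipe

    def search(stop_event):
        stop_event.wait(search_time)    # stand-in for the real search
        output.append("bestmove e2e4")  # placeholder move

    reader = threading.Thread(target=input_thread)
    reader.start()
    worker = None
    stop = threading.Event()
    while True:
        line = inbox.get()
        if line is None:
            break
        if line == "go":
            if worker is not None:
                worker.join()  # sync: previous search must be fully done
            stop = threading.Event()
            worker = threading.Thread(target=search, args=(stop,))
            worker.start()
        elif line == "stop":
            stop.set()
        elif line == "isready":
            output.append("readyok")
    if worker is not None:
        worker.join()
    reader.join()
    return output


if __name__ == "__main__":
    print(run_engine(["go", "isready", "go", "stop"]))
```

Every "go" produces exactly one "bestmove", however closely the commands follow each other, which is the property the discarding design loses under load.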
Rasmus Althoff
https://www.ct800.net
-
- Posts: 27986
- Joined: Fri Mar 10, 2006 10:06 am
- Location: Amsterdam
- Full name: H G Muller
Re: Multiple change testing
Ras wrote: ↑Sat Jul 20, 2024 4:14 pm
Not enough games IMO. I calculate with the sqrt(N) rule of thumb: In random coin throws, you have an uncertainty of sqrt(N). So in your case, sqrt(1000) = 31.6. Expectation value would be 500. Anything from 46.8% to 53.2% could be just noise in testing. Now, 53.2% would be 22 Elo. Means, your +20 Elo is well within the noise range. For detecting such a change reliably, I think you should use 10000 games, where the same math boils down to 49% to 51%, i.e. 7 Elo.
As for the actual question, I first test in self-play with enough games, typically 50k games. If that validates, I test against a handful of different engines (typically 6 engines with 10k games each) and compare the cumulative result to the baseline. If that also validates, then I accept the change and treat that as new baseline to test against in self-play.
The downside is that I use 10s/game fixed, no increments, and 14 games in parallel. That's not representative for long time controls, but I can get it done on my machine.
That is a bit pessimistic. For one, there are draws, so the standard deviation of a single game result is not 0.5 point, but more like 0.4 (for equal WDL; with more draws it becomes even less). So the 95% confidence interval is 1.96 x 40% / 31.6 = 2.5%, or 17.5 Elo. So the result is outside the interval, and the confidence you can have that this is not purely coincidence is actually better than 97.5%. Calling 20 Elo "well within the noise range" of 22 Elo is also a bit of an overstatement, just as saying that it is well outside a range of 17.5 Elo would be; it just means the probability that you would accept a patch based on a fluke result is 2% or 3%, instead of the targeted 2.5%. Hardly a disaster or a lucky break.
What I do in a case of doubt is run my next test for half the games with the patch, and half without. That gives you more games to shrink the error bars on the difference.
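The draw-adjusted interval from the start of this reply can be checked with a short Python sketch. The function names are invented for illustration; the per-game standard deviation is computed directly from assumed win/draw/loss probabilities, and the Elo conversion is the usual logistic formula.

```python
import math


def score_to_elo(score):
    """Convert an expected score (0..1, exclusive) to an Elo difference."""
    return -400.0 * math.log10(1.0 / score - 1.0)


def game_stddev(win, draw, loss):
    """Standard deviation of one game result (1, 0.5 or 0 points)."""
    if abs(win + draw + loss - 1.0) > 1e-9:
        raise ValueError("probabilities must sum to 1")
    mean = win + 0.5 * draw
    second_moment = win + 0.25 * draw
    return math.sqrt(second_moment - mean * mean)


def confidence_margin(games, win, draw, loss, z=1.96):
    """Half-width of the confidence interval as (score fraction, Elo)."""
    margin = z * game_stddev(win, draw, loss) / math.sqrt(games)
    return margin, score_to_elo(0.5 + margin)


if __name__ == "__main__":
    third = 1.0 / 3.0
    sigma = game_stddev(third, third, third)
    margin, elo = confidence_margin(1000, third, third, third)
    print(f"sigma per game: {sigma:.3f}")
    print(f"95% margin over 1000 games: {margin:.1%}, about {elo:.1f} Elo")
```

With equal win, draw and loss rates the per-game standard deviation comes out near 0.41, and the 95% margin over 1000 games near 2.5%, i.e. close to the 17.5 Elo quoted above. A higher draw rate shrinks both.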
-
- Posts: 2555
- Joined: Tue Aug 30, 2016 8:19 pm
- Full name: Rasmus Althoff
Re: Multiple change testing
I don't think so. The order of magnitude with the sqrt(N) rule of thumb for a small difference in strength is pretty accurate. 22 Elo, 17.5 Elo - doesn't matter. 20 Elo is still in that range. My conclusion remains that 1000 games is at least an order of magnitude too small for such a difference and hence runs the risk of accepting bogus patches.
Rasmus Althoff
https://www.ct800.net
-
- Posts: 27986
- Joined: Fri Mar 10, 2006 10:06 am
- Location: Amsterdam
- Full name: H G Muller
Re: Multiple change testing
You always run the risk of accepting bogus patches. A risk on the order of 2% seems acceptable, though. If testing time is your bottleneck, you will make faster progress including an occasional bogus patch than testing every patch to a 0.2% confidence before you accept them. I suppose this can reverse by the time only one in a thousand new ideas is any good.
Note that a bogus patch does not automatically make your engine weaker; the 2% probability for false acceptance in the example is the probability that a neutral patch is accepted through a statistical fluke. The chances that a patch that made you 10 Elo weaker is accepted already drops to 0.13%.
Besides, it is better to now and then retest patches after you applied enough new ones, as you never know how patches affect each other. What used to provide strength in a simple engine might just produce noise after you added more accurate methods to address the same positional feature. Unjustly accepted patches will then likely be eliminated as well. So it is not a disaster to occasionally accept one initially.
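To get a feel for these acceptance probabilities, here is a small Python sketch using a normal approximation. The inputs are assumptions taken from the earlier posts (1000 games, about 0.4 standard deviation per game, acceptance when the measured result exceeds +17.5 Elo), and the function names are made up. The exact percentages shift with those assumptions, so the output only approximates the figures quoted above.

```python
import math


def elo_to_score(elo):
    """Expected score for a given Elo advantage (logistic formula)."""
    return 1.0 / (1.0 + 10.0 ** (-elo / 400.0))


def acceptance_probability(true_elo, threshold_elo, games, sigma=0.4):
    """Chance that the measured score exceeds the acceptance threshold.

    Normal approximation; sigma is the assumed per-game standard deviation.
    """
    stderr = sigma / math.sqrt(games)
    z = (elo_to_score(threshold_elo) - elo_to_score(true_elo)) / stderr
    return 0.5 * math.erfc(z / math.sqrt(2.0))  # upper tail of the normal


if __name__ == "__main__":
    for true_elo in (0.0, -10.0):
        p = acceptance_probability(true_elo, 17.5, 1000)
        print(f"true strength {true_elo:+.0f} Elo: accepted with p = {p:.2%}")
```

A neutral patch passes a couple of percent of the time under these assumptions, while a patch that is really 10 Elo weaker passes on the order of one time in a thousand.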