Multiple change testing

jackk03
Posts: 17
Joined: Wed Jul 12, 2023 1:38 pm
Full name: Giacomo Porpiglia

Multiple change testing

Post by jackk03 »

Hi, I'm developing a new version of my engine, and I want to try various improvements.
I added continuation history, and I obtained an Elo gain of about 20 (testing 1000 games at 5+0.05).
If I then want to change something else (say in the search) to see if it brings further improvement, is it better to test against the original version or against the one with continuation history added?
I ask this because I've had some strange results doing the latter: tested against the updated version, the change was an improvement,
but the same change tested against the original version resulted in an Elo loss (meaning it not only lost the 20 Elo gained from continuation history, but even more).
Thanks!
Ras
Posts: 2555
Joined: Tue Aug 30, 2016 8:19 pm
Full name: Rasmus Althoff

Re: Multiple change testing

Post by Ras »

jackk03 wrote: Sat Jul 20, 2024 3:39 pm I added continuation history, and I obtained an Elo gain of about 20 (testing 1000 games at 5+0.05).
Not enough games IMO. I calculate with the sqrt(N) rule of thumb: in random coin throws, you have an uncertainty of sqrt(N). So in your case, sqrt(1000) = 31.6. The expectation value would be 500 points, so anything from 46.8% to 53.2% could be just noise in testing. Now, 53.2% would be 22 Elo, which means your +20 Elo is well within the noise range. For detecting such a change reliably, I think you should use 10000 games, where the same math works out to 49% to 51%, i.e. 7 Elo.
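
As a rough sketch of that rule-of-thumb arithmetic (Python, pure coin-flip model without draws; the elo() helper is just for illustration):

import math

N = 1000                       # number of games
noise = math.sqrt(N)           # ~31.6 points, roughly a 2-sigma band for a fair coin
low = (N / 2 - noise) / N      # ~46.8% score
high = (N / 2 + noise) / N     # ~53.2% score

def elo(score):                # illustrative helper: score fraction -> Elo difference
    return 400 * math.log10(score / (1 - score))

print(f"noise band: {low:.1%} .. {high:.1%}")                    # 46.8% .. 53.2%
print(f"upper edge in Elo: {elo(high):+.1f}")                    # about +22
print(f"10000 games, same rule: {elo(0.5 + 100 / 10000):+.1f}")  # about +7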

As for the actual question: I first test in self-play with enough games, typically 50k. If that validates, I test against a handful of different engines (typically 6 engines with 10k games each) and compare the cumulative result to the baseline. If that also validates, I accept the change and treat it as the new baseline to test against in self-play.

The downside is that I use 10s/game fixed, no increment, and 14 games in parallel. That's not representative of long time controls, but it's what I can get done on my machine.
Rasmus Althoff
https://www.ct800.net
jackk03
Posts: 17
Joined: Wed Jul 12, 2023 1:38 pm
Full name: Giacomo Porpiglia

Re: Multiple change testing

Post by jackk03 »

Thanks for the quick response. As for the number of concurrent games, is it good to set one game per CPU core, or can I increase it even further?
Ras
Posts: 2555
Joined: Tue Aug 30, 2016 8:19 pm
Full name: Rasmus Althoff

Re: Multiple change testing

Post by Ras »

jackk03 wrote: Sat Jul 20, 2024 4:19 pm As for the number of concurrent games, is it good to set one game per CPU core, or can I increase it even further?
I have an eight-core CPU with SMT, i.e. 16 logical cores. With 14 games in parallel, I still have two logical cores left for other tasks like web browsing. I also disable the core boost during testing: not only does it make the CPU behave more consistently, it also needs much less energy than all-core boost, so the cooling stays quiet. This can be done in the BIOS, but I've made a script with a little GUI so that I can change the setting from the desktop.
Rasmus Althoff
https://www.ct800.net
jackk03
Posts: 17
Joined: Wed Jul 12, 2023 1:38 pm
Full name: Giacomo Porpiglia

Re: Multiple change testing

Post by jackk03 »

Alright, thanks. One last thing: the uncertainty reported by cutechess is not fully trustworthy with just a thousand games or so, right? From what I've seen, it's the range in which the Elo difference really lies, with 95% or 97.5% confidence (I don't remember which). Is that right?
Ras
Posts: 2555
Joined: Tue Aug 30, 2016 8:19 pm
Full name: Rasmus Althoff

Re: Multiple change testing

Post by Ras »

jackk03 wrote: Sat Jul 20, 2024 5:21 pm Alright, thanks. One last thing: the uncertainty reported by cutechess is not fully trustworthy with just a thousand games or so, right?
That depends on how large the Elo difference is. If it's only small, like 20 Elo, then you need a lot more games than with a large difference; showing that Stockfish is stronger than TSCP doesn't even need 1000 games, after all. Sometimes I make a mistake when implementing something, and after 20 games it's at 0.5/20 against the baseline; then I cancel the test right away.
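
A rough sketch of how the required number of games scales with the difference you want to resolve (same coin-flip approximation, no draw correction; expected_score() is just an illustrative helper, and these are bare minimums where the difference only just reaches the band edge, so in practice you want a multiple of them):

import math

def expected_score(elo_diff):      # illustrative helper: Elo difference -> score fraction
    return 1 / (1 + 10 ** (-elo_diff / 400))

for diff in (5, 10, 20, 50, 200):
    margin = expected_score(diff) - 0.5     # score surplus that has to stand out of the noise
    games = (1 / margin) ** 2               # sqrt(N)/N = margin  =>  N = 1/margin**2
    print(f"{diff:>4} Elo: about {games:>7,.0f} games just to reach the band edge")
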
From what I've seen, it's the range in which the Elo difference really lies, with 95% or 97.5% confidence (I don't remember which). Is that right?
I think something like that, yeah. The sqrt(N) rule of thumb is about 95% when the result is rather close to 50%. You'll never get to a true 100% as that would require an infinite number of games. Just like it's theoretically possible that 10000 fair coin tosses all come up heads.
Rasmus Althoff
https://www.ct800.net
Ras
Posts: 2555
Joined: Tue Aug 30, 2016 8:19 pm
Full name: Rasmus Althoff

Re: Multiple change testing

Post by Ras »

Btw., if you test with many games in parallel and high total CPU usage, you may run into engine bugs with race conditions. I've seen a number of engines mishandle this. The issue pops up when there is an input thread that discards most GUI input during engine calculation, but the final move result comes from another thread (search / worker). That either needs to be properly synced, or the input thread needs to buffer instead of discarding. From the outside, such a bug manifests as an occasionally unresponsive engine, but only under high CPU load.
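
The engines themselves are usually C or C++, but the buffering pattern is small enough to sketch. A hypothetical Python illustration (the UCI tokens are real, everything else is made up for the example): the input thread only queues, and the search drains the queue at safe points.

import queue
import sys
import threading

commands = queue.Queue()            # buffered GUI input, never discarded
stop_search = threading.Event()

def input_thread():
    # Read GUI commands and buffer them for the search thread.
    for line in sys.stdin:
        commands.put(line.strip())

def poll_input():
    # Called from the search at safe points; drains buffered commands.
    while True:
        try:
            cmd = commands.get_nowait()
        except queue.Empty:
            return
        if cmd == "stop":
            stop_search.set()        # search checks this flag and returns its move
        elif cmd == "isready":
            print("readyok", flush=True)

threading.Thread(target=input_thread, daemon=True).start()
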
Rasmus Althoff
https://www.ct800.net
hgm
Posts: 27986
Joined: Fri Mar 10, 2006 10:06 am
Location: Amsterdam
Full name: H G Muller

Re: Multiple change testing

Post by hgm »

Ras wrote: Sat Jul 20, 2024 4:14 pm
jackk03 wrote: Sat Jul 20, 2024 3:39 pm I added continuation history, and I obtained an Elo gain of about 20 (testing 1000 games at 5+0.05).
Not enough games IMO. I calculate with the sqrt(N) rule of thumb: in random coin throws, you have an uncertainty of sqrt(N). So in your case, sqrt(1000) = 31.6. The expectation value would be 500 points, so anything from 46.8% to 53.2% could be just noise in testing. Now, 53.2% would be 22 Elo, which means your +20 Elo is well within the noise range. For detecting such a change reliably, I think you should use 10000 games, where the same math works out to 49% to 51%, i.e. 7 Elo.

As for the actual question: I first test in self-play with enough games, typically 50k. If that validates, I test against a handful of different engines (typically 6 engines with 10k games each) and compare the cumulative result to the baseline. If that also validates, I accept the change and treat it as the new baseline to test against in self-play.

The downside is that I use 10s/game fixed, no increment, and 14 games in parallel. That's not representative of long time controls, but it's what I can get done on my machine.
That is a bit pessimistic. For one, there are draws, so the standard deviation of a single game result is not 0.5 points but more like 0.4 (for equal WDL; with more draws it becomes even smaller). So the 95% confidence interval is 1.96 x 40% / 31.6 = 2.5%, or about 17.5 Elo. The result is therefore outside the interval, and the confidence you can have that it is not purely coincidence is actually better than 97.5%. Calling 20 Elo "well within the noise range" of 22 Elo is also a bit of an overstatement, just as saying it would be well outside a range of 17.5 Elo would be; it just means the probability that you would accept a patch based on a fluke result is 2% or 3% instead of the targeted 2.5%. Hardly a disaster, or a lucky break.
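
A quick sketch of that arithmetic (Python, assuming a per-game standard deviation of 0.4 and the normal approximation; the elo() helper is just for illustration):

import math

N = 1000
sd_game = 0.4                          # per-game standard deviation with draws
se = sd_game / math.sqrt(N)            # standard error of the score fraction
half_width = 1.96 * se                 # ~2.5%, i.e. roughly 17.5 Elo

def elo(score):                        # illustrative helper: score fraction -> Elo
    return 400 * math.log10(score / (1 - score))

observed = 1 / (1 + 10 ** (-20 / 400))                 # score fraction matching +20 Elo
z = (observed - 0.5) / se
confidence = 0.5 * (1 + math.erf(z / math.sqrt(2)))    # one-sided normal CDF

print(f"95% band: +/-{half_width:.1%}, about {elo(0.5 + half_width):+.1f} Elo")
print(f"+20 Elo observed is {z:.2f} sigma out, confidence {confidence:.1%}")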

What I do in case of doubt is run my next test with half the games with the patch and half without. That gives you more games to shrink the error bars on the difference.
Ras
Posts: 2555
Joined: Tue Aug 30, 2016 8:19 pm
Full name: Rasmus Althoff

Re: Multiple change testing

Post by Ras »

hgm wrote: Sun Jul 21, 2024 7:10 pm That is a bit pessimistic.
I don't think so. The order of magnitude from the sqrt(N) rule of thumb for small differences in strength is pretty accurate. 22 Elo, 17.5 Elo - it doesn't matter; 20 Elo is still in that range. My conclusion remains that 1000 games is at least an order of magnitude too few for such a difference and hence runs the risk of accepting bogus patches.
Rasmus Althoff
https://www.ct800.net
hgm
Posts: 27986
Joined: Fri Mar 10, 2006 10:06 am
Location: Amsterdam
Full name: H G Muller

Re: Multiple change testing

Post by hgm »

You always run the risk of accepting bogus patches. A risk on the order of 2% seems acceptable, though. If testing time is your bottleneck, you will make faster progress by accepting an occasional bogus patch than by testing every patch down to a 0.2% risk before accepting it. I suppose this can reverse by the time only one in a thousand new ideas is any good.

Note that a bogus patch does not automatically make your engine weaker; the 2% probability of false acceptance in the example is the probability that a neutral patch is accepted through a statistical fluke. The chance that a patch which made you 10 Elo weaker gets accepted already drops to 0.13%.
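
A rough sketch of where a figure of that order comes from, assuming 1000 games, a per-game standard deviation of 0.4, and acceptance when the observed score clears the top of the neutral 95% band; the exact number shifts with the draw rate and the threshold:

import math

N, sd_game = 1000, 0.4
se = sd_game / math.sqrt(N)
threshold = 0.5 + 1.96 * se                    # accept if the observed score clears ~52.5%

true_score = 1 / (1 + 10 ** (10 / 400))        # expected score of a -10 Elo patch, ~48.6%
z = (threshold - true_score) / se
p_accept = 0.5 * (1 - math.erf(z / math.sqrt(2)))   # upper tail of the normal

print(f"P(accepting a -10 Elo patch) ~ {p_accept:.2%}")   # on the order of 0.1%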

Besides, it is better to retest patches now and then after you have applied enough new ones, as you never know how patches affect each other. What used to provide strength in a simple engine might just produce noise after you have added more accurate methods addressing the same positional feature. Unjustly accepted patches will then likely be eliminated as well, so it is not a disaster to occasionally accept one initially.