Chess engine developers usually accept only changes that are proven improvements.
My idea is to accept changes that are not proven improvements but that perform better at a longer time control, in the hope that the change is also an improvement at very long time controls, where direct testing is too expensive.
For example, if some change loses 10 Elo at 10+0.1 based on 40,000 games but loses only 1 Elo at 60+0.6 based on 40,000 games,
then you do not test it at 240+2.4; you simply accept the change in the hope that it gains rating at very long time controls.
Note that I suggest not using the same time control for both engines, because the rating reduction may be due to diminishing returns. Instead, test for example 11+0.11 against 10+0.1, and 66+0.66 against 60+0.6. If you see a clear reduction in playing strength in the first test and a clear improvement in playing strength in the second, then you can say that the change performs better at longer time controls (maybe a 10% time difference is too large and a smaller difference is needed, but the idea is clear).
I thought of this based on the following tests, which suggest that such behaviour is possible, and I wonder whether any programmers use this method.
Note that there is also the opposite idea: rejecting changes that perform worse at long time control, where again "worse" is measured not by Elo but by the fact that with the patch 9+0.9 beats 10+0.1 while 54+0.54 loses against 60+0.6.
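As a sketch, the Elo bookkeeping and the handicap-match decision rule described above might look like the following. The score-to-Elo conversion is the standard logistic model; `handicap_verdict` and its simple sign test are a hypothetical illustration of the post's rule, not any existing framework's code (a real test would also need confidence intervals around each score).

```python
import math

def elo_from_score(score):
    """Convert an average game score (0..1) to an Elo difference.
    Standard logistic model: score = 1 / (1 + 10**(-elo/400))."""
    if not 0 < score < 1:
        raise ValueError("score must be strictly between 0 and 1")
    return -400 * math.log10(1 / score - 1)

def handicap_verdict(stc_score, ltc_score):
    """Decision rule sketched in the post: the patched engine plays with
    10% more time (e.g. 11+0.11 vs 10+0.1, then 66+0.66 vs 60+0.6).
    Accept when it clearly loses the short-TC handicap match but clearly
    wins the long-TC one, i.e. the patch scales with time.
    (Illustrative only: real use needs error bars, not a bare sign test.)"""
    return elo_from_score(stc_score) < 0 < elo_from_score(ltc_score)
```

For instance, `handicap_verdict(0.48, 0.53)` accepts a patch that scores 48% in the short handicap match but 53% in the long one.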
https://tests.stockfishchess.org/tests/ ... 698c0093a8
https://tests.stockfishchess.org/tests/ ... 4303aa8df3
https://tests.stockfishchess.org/tests/ ... 698c008e7e
An idea for a different type of testing
Moderator: Ras
- Posts: 10793 | Joined: Thu Mar 09, 2006 12:37 am | Location: Tel-Aviv, Israel
- Posts: 1957 | Joined: Tue Apr 19, 2016 6:08 am | Location: U.S.A. | Full name: Andrew Grant
Re: An idea for a different type of testing
To your first point, in which you suggest that if an STC tests at -10 elo, but a following LTC tests at only -1 elo, then one can assume that at an ever greater time control, the change is positive:
Couple problems. The obvious one is Elo compression with time control. The non-obvious one is that this method is going to lead you to recklessly test things at LTC that don't deserve the resources. Every day I write STCs that lose 10 Elo. If I tested all of those at LTC, using some fixed game setup to compare Elo as opposed to SPRT tests, I would never get to test anything at all.
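The SPRT mentioned here, as opposed to a fixed-game Elo comparison, can be sketched with the common Gaussian approximation of the trinomial log-likelihood ratio. Production frameworks such as fishtest use refined variants (e.g. pentanomial statistics), so treat the function names, default Elo hypotheses, and bounds below as illustrative.

```python
import math

def expected_score(elo):
    # logistic Elo model: probability-weighted score for a given Elo edge
    return 1 / (1 + 10 ** (-elo / 400))

def sprt_llr(wins, draws, losses, elo0=0, elo1=5):
    """Gaussian approximation of the trinomial SPRT log-likelihood ratio
    for H0: Elo = elo0 vs H1: Elo = elo1 (a sketch, not fishtest's
    actual implementation)."""
    n = wins + draws + losses
    if n == 0:
        return 0.0
    mean = (wins + 0.5 * draws) / n            # average score per game
    var = (wins + 0.25 * draws) / n - mean**2  # per-game score variance
    if var <= 0:
        return 0.0
    s0, s1 = expected_score(elo0), expected_score(elo1)
    return n * (s1 - s0) * (2 * mean - s0 - s1) / (2 * var)

def sprt_bounds(alpha=0.05, beta=0.05):
    """Wald's stopping bounds: reject H1 below the lower bound,
    accept H1 above the upper bound, otherwise keep playing games."""
    return math.log(beta / (1 - alpha)), math.log((1 - beta) / alpha)
```

The point of the stopping rule is exactly the resource argument above: a clearly bad patch crosses the lower bound after a few thousand games instead of consuming a fixed LTC budget.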
- Posts: 87 | Joined: Thu Oct 07, 2021 12:48 am | Location: Warsaw, Poland | Full name: Michal Witanowski
Re: An idea for a different type of testing
How about playing each game with a randomized time control? Instead of just using 10+0.1, 60+1 or whatever, you randomize the time with some exponential distribution (so there will be more games with short time controls). Then, the more games you play, the better an Elo estimate you get spanning all time controls. You could even plot a graph or fit a curve to the sample points (where the X axis is time and the Y axis is Elo gain).
Author of Caissa Chess Engine: https://github.com/Witek902/Caissa
- Posts: 1957 | Joined: Tue Apr 19, 2016 6:08 am | Location: U.S.A. | Full name: Andrew Grant
Re: An idea for a different type of testing
Witek wrote: ↑Sat May 07, 2022 1:25 am How about playing each game with randomized time control? […]
Sounds like introducing even more noise. Trying to compare two tests is already noisy enough. The reason people use SPRT, and not A vs some_other_engine and B vs some_other_engine, is to avoid the noise.
- Posts: 87 | Joined: Thu Oct 07, 2021 12:48 am | Location: Warsaw, Poland | Full name: Michal Witanowski
Re: An idea for a different type of testing
AndrewGrant wrote: ↑Sat May 07, 2022 1:28 am Sounds like introducing even more noise. […]
Doesn't need to be random. Could make it deterministic by preparing a list of, let's say, 100 time controls and running the games in batches so all TCs are covered in one batch.
Author of Caissa Chess Engine: https://github.com/Witek902/Caissa
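The deterministic variant could be sketched like this: a fixed list of geometrically spaced base times (so short and long controls are both represented), replayed identically in every batch. The list size, time range, and increment rule are all illustrative choices, not part of any existing tool.

```python
def make_tc_list(n=100, t_min=1.0, t_max=120.0):
    """Deterministic list of n (base, increment) time controls with base
    times spaced geometrically between t_min and t_max seconds and
    increment = base / 100 (illustrative convention)."""
    ratio = (t_max / t_min) ** (1 / (n - 1))
    return [(t_min * ratio**i, t_min * ratio**i / 100) for i in range(n)]

def batches(tcs, num_batches):
    """Yield (batch_index, tc) pairs; every batch replays the identical
    TC list, so all time controls are covered once per batch."""
    for b in range(num_batches):
        for tc in tcs:
            yield b, tc
```

Because every batch covers the full list, partial results remain comparable: stopping after any whole batch leaves an equal number of games at every time control.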