New version randomness when testing


Michael Sherwin
Posts: 3196
Joined: Fri May 26, 2006 3:00 am
Location: WY, USA
Full name: Michael Sherwin

New version randomness when testing

Post by Michael Sherwin »

It appears that this type of randomness is worse than the randomness due to hardware interactions. Suppose a standard set of positions is used to test against a select few opponent engines, and assume that a change was made that increases an engine's Elo by +10 points. If a 100-game-per-opponent gauntlet is run several times, there will be variance between the separate tests, but the results should nevertheless average out to about +10 Elo.

Version randomness is so bad that in tests of this nature the averaged results can be very different. The reason is that an improved engine can choose better moves that nonetheless lead more often to positions that the new version handles poorly or that the opponent engines handle better. Few test positions combined with few opponents can and do lead to very contrary results: better versions can look worse, and worse versions can look like an improvement.
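To put a rough number on that variance (a sketch of my own, not from the post, assuming a 32% draw rate purely for illustration), here is what the error bar on a single 100-game match actually looks like:

```python
# Rough sketch: how wide the error bar on a 100-game match really is.
# The draw rate and score fraction below are illustrative assumptions.
import math

def elo_from_score(p):
    """Convert a score fraction (0 < p < 1) to an Elo difference."""
    return -400.0 * math.log10(1.0 / p - 1.0)

games = 100
draw_rate = 0.32              # assumed draw fraction, purely illustrative
p = 0.514                     # score fraction corresponding to roughly +10 Elo

# Per-game variance of the score (win=1, draw=0.5, loss=0) with expected
# score p and draw fraction d is p*(1-p) - d/4.
var_per_game = p * (1.0 - p) - draw_rate / 4.0
se = math.sqrt(var_per_game / games)

lo = elo_from_score(p - 2.0 * se)
hi = elo_from_score(p + 2.0 * se)
print(f"measured {elo_from_score(p):+.0f} Elo, ~95% interval "
      f"{lo:+.0f} .. {hi:+.0f} Elo")
```

With those assumptions the interval spans roughly -50 to +70 Elo, which dwarfs the +10 change being measured, and that is before the position/opponent correlation described above makes things worse.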

Tester beware!
bob
Posts: 20943
Joined: Mon Feb 27, 2006 7:30 pm
Location: Birmingham, AL

Re: New version randomness when testing

Post by bob »

Michael Sherwin wrote:It appears that this type of randomness is worse than the randomness due to hardware interactions. Suppose a standard set of positions is used to test against a select few opponent engines, and assume that a change was made that increases an engine's Elo by +10 points. If a 100-game-per-opponent gauntlet is run several times, there will be variance between the separate tests, but the results should nevertheless average out to about +10 Elo.
Think about this:

Suppose you run 100 games using 100 positions against 1 opponent. You get a pretty big error bar, right? OK, duplicate the PGN for those 100 games 5 times. Now you have 500 games, and the error bar is smaller, right? _Wrong_. It will show up as smaller through BayesElo, but that calculation is for data that is uncorrelated. In this case each of the 5 copies of a game from the same position really is the same game, since we just duplicated the PGN, so they are 100% correlated, and you have fooled yourself into thinking you have a smaller error bar than you really have.

Using a small number of positions and playing several games from each position does produce different games, but the outcomes are likely to be more consistent. That means they are correlated, which translates into far more variance than the nominal error bar would lead you to expect.
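A minimal Monte Carlo sketch of the duplicated-PGN thought experiment above (my illustration, with a made-up win probability): the naive error bar shrinks by sqrt(5), but the real run-to-run spread does not move at all, because the copies are 100% correlated.

```python
# Sketch: duplicating each game 5x shrinks the *reported* error bar,
# but not the actual spread across repeated experiments.
import random
import statistics

def run_experiment(positions, copies, p_win=0.55):
    """Play one game per position (coin flip with p_win), then count each
    result `copies` times, as if the PGN had been duplicated."""
    results = [1.0 if random.random() < p_win else 0.0 for _ in range(positions)]
    duplicated = results * copies
    n = len(duplicated)
    mean = sum(duplicated) / n
    # The error bar you would get if all n games were independent
    naive_se = (mean * (1.0 - mean) / n) ** 0.5
    return mean, naive_se

random.seed(1)
means, naive_ses = [], []
for _ in range(2000):
    m, se = run_experiment(positions=100, copies=5)
    means.append(m)
    naive_ses.append(se)

print(f"naive SE reported per run : {statistics.mean(naive_ses):.4f}")
print(f"actual SD across runs     : {statistics.pstdev(means):.4f}")
# The actual spread comes out about sqrt(5) times larger than the
# per-run error bar suggests.
```

Partially correlated games (several distinct games from the same position, or against the same opponent) sit somewhere between the independent case and this extreme, but the direction of the effect is the same.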

Version randomness is so bad that in tests of this nature the averaged results can be very different. The reason is that an improved engine can choose better moves that nonetheless lead more often to positions that the new version handles poorly or that the opponent engines handle better. Few test positions combined with few opponents can and do lead to very contrary results: better versions can look worse, and worse versions can look like an improvement.

Tester beware!
Old news. Use many positions and enough opponents to provide enough games to reduce the error bar to something less than your expected gain or loss in strength...
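A back-of-the-envelope sketch of what "enough games" means here (my numbers, assuming a 32% draw rate and evenly matched engines, not figures from the post):

```python
# Sketch: how many games before the ~95% error bar drops below +10 Elo.
import math

target_elo = 10.0
draw_rate = 0.32                      # assumed draw fraction, illustrative
p = 0.5                               # evenly matched baseline

# Slope of the Elo curve at p: d(Elo)/dp = 400 / (ln 10 * p * (1-p))
slope = 400.0 / (math.log(10.0) * p * (1.0 - p))
target_score = target_elo / slope     # score margin equivalent to 10 Elo

var_per_game = p * (1.0 - p) - draw_rate / 4.0
# Want 2 * sqrt(var / n) <= target_score  =>  n >= 4 * var / target_score^2
n = math.ceil(4.0 * var_per_game / target_score ** 2)
print(f"about {n} games to resolve +{target_elo:.0f} Elo at ~95% confidence")
```

Under these assumptions the answer comes out on the order of a few thousand games, which is consistent with the usual advice that a 10 Elo change cannot be seen reliably in a few hundred games.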