New version randomness when testing


Michael Sherwin
Posts: 3196
Joined: Fri May 26, 2006 3:00 am
Location: WY, USA
Full name: Michael Sherwin

New version randomness when testing

Post by Michael Sherwin »

It appears that this type of randomness is worse than the randomness due to hardware interactions. Suppose a standard set of positions is used to test against a select few opponent engines, and assume that a change was made that increases an engine's Elo by +10 points. If a 100-game-per-opponent gauntlet is run several times, there will be variance between the separate tests, but the results should nevertheless average out to about +10 Elo.

Version randomness is so bad that in tests of this nature the averaged results can be very different. The reason is that an improved engine can choose better moves that nonetheless lead more often to positions that the new version handles poorly or that the opponent engines handle better. Few test positions combined with few opponents can and do lead to very contrary results: better versions can look worse, and worse versions can look like an improvement.
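To put a rough number on that variance (a sketch of my own, not from the post, assuming a 32% draw rate purely for illustration), here is what the error bar on a single 100-game match actually looks like:

```python
# Rough sketch: how wide the error bar on a 100-game match really is.
# The draw rate and score fraction below are illustrative assumptions.
import math

def elo_from_score(p):
    """Convert a score fraction (0 < p < 1) to an Elo difference."""
    return -400.0 * math.log10(1.0 / p - 1.0)

games = 100
draw_rate = 0.32              # assumed draw fraction, purely illustrative
p = 0.514                     # score fraction corresponding to roughly +10 Elo

# Per-game variance of the score (win=1, draw=0.5, loss=0) with expected
# score p and draw fraction d is p*(1-p) - d/4.
var_per_game = p * (1.0 - p) - draw_rate / 4.0
se = math.sqrt(var_per_game / games)

lo = elo_from_score(p - 2.0 * se)
hi = elo_from_score(p + 2.0 * se)
print(f"measured {elo_from_score(p):+.0f} Elo, ~95% interval "
      f"{lo:+.0f} .. {hi:+.0f} Elo")
```

With those assumptions the interval spans roughly -50 to +70 Elo, which dwarfs the +10 change being measured, and that is before the position/opponent correlation described above makes things worse.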

Tester beware!
bob
Posts: 20943
Joined: Mon Feb 27, 2006 7:30 pm
Location: Birmingham, AL

Re: New version randomness when testing

Post by bob »

Michael Sherwin wrote:It appears that this type of randomness is worse than the randomness due to hardware interactions. Suppose a standard set of positions is used to test against a select few opponent engines, and assume that a change was made that increases an engine's Elo by +10 points. If a 100-game-per-opponent gauntlet is run several times, there will be variance between the separate tests, but the results should nevertheless average out to about +10 Elo.
Think about this:

Suppose you run 100 games using 100 positions against 1 opponent. You get a pretty big error bar, right? OK, duplicate the PGN for those 100 games 5 times. Now you have 500 games, and the error bar is smaller, right? _Wrong_. It will show up as smaller through BayesElo, but that calculation is for data that is uncorrelated. In this case each of the 5 copies of a game from the same position really is the same game, since we just duplicated the PGN, so they are 100% correlated, and you have fooled yourself into thinking you have a smaller error bar than you really have.

Using a small number of positions and playing several games from each position does produce different games, but the outcomes are likely to be more consistent. That means they are correlated, which translates into far more variance than the nominal error bar would lead you to expect.
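A minimal Monte Carlo sketch of the duplicated-PGN thought experiment above (my illustration, with a made-up win probability): the naive error bar shrinks by sqrt(5), but the real run-to-run spread does not move at all, because the copies are 100% correlated.

```python
# Sketch: duplicating each game 5x shrinks the *reported* error bar,
# but not the actual spread across repeated experiments.
import random
import statistics

def run_experiment(positions, copies, p_win=0.55):
    """Play one game per position (coin flip with p_win), then count each
    result `copies` times, as if the PGN had been duplicated."""
    results = [1.0 if random.random() < p_win else 0.0 for _ in range(positions)]
    duplicated = results * copies
    n = len(duplicated)
    mean = sum(duplicated) / n
    # The error bar you would get if all n games were independent
    naive_se = (mean * (1.0 - mean) / n) ** 0.5
    return mean, naive_se

random.seed(1)
means, naive_ses = [], []
for _ in range(2000):
    m, se = run_experiment(positions=100, copies=5)
    means.append(m)
    naive_ses.append(se)

print(f"naive SE reported per run : {statistics.mean(naive_ses):.4f}")
print(f"actual SD across runs     : {statistics.pstdev(means):.4f}")
# The actual spread comes out about sqrt(5) times larger than the
# per-run error bar suggests.
```

Partially correlated games (several distinct games from the same position, or against the same opponent) sit somewhere between the independent case and this extreme, but the direction of the effect is the same.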

Version randomness is so bad that in tests of this nature the averaged results can be very different. The reason is that an improved engine can choose better moves that nonetheless lead more often to positions that the new version handles poorly or that the opponent engines handle better. Few test positions combined with few opponents can and do lead to very contrary results: better versions can look worse, and worse versions can look like an improvement.

Tester beware!
Old news. Use many positions and enough opponents to provide enough games to reduce the error bar to something less than your expected gain or loss in strength...
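A back-of-the-envelope sketch of what "enough games" means here (my numbers, assuming a 32% draw rate and evenly matched engines, not figures from the post):

```python
# Sketch: how many games before the ~95% error bar drops below +10 Elo.
import math

target_elo = 10.0
draw_rate = 0.32                      # assumed draw fraction, illustrative
p = 0.5                               # evenly matched baseline

# Slope of the Elo curve at p: d(Elo)/dp = 400 / (ln 10 * p * (1-p))
slope = 400.0 / (math.log(10.0) * p * (1.0 - p))
target_score = target_elo / slope     # score margin equivalent to 10 Elo

var_per_game = p * (1.0 - p) - draw_rate / 4.0
# Want 2 * sqrt(var / n) <= target_score  =>  n >= 4 * var / target_score^2
n = math.ceil(4.0 * var_per_game / target_score ** 2)
print(f"about {n} games to resolve +{target_elo:.0f} Elo at ~95% confidence")
```

Under these assumptions the answer comes out on the order of a few thousand games, which is consistent with the usual advice that a 10 Elo change cannot be seen reliably in a few hundred games.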