It appears that this type of randomness is worse than the randomness due to hardware interactions. Suppose a standard set of positions is used to test against a select few engines, and assume a change was made that increases an engine's Elo by +10 points. If a 100-game-per-engine gauntlet is run several times, there will be variance between the separate tests, but the results will nevertheless average out to about +10 Elo.
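To get a feel for how big that run-to-run variance is, here is a rough Python simulation sketch. The logistic Elo model, the 30% draw rate, and the function names are illustrative assumptions, not measurements from any engine:

```python
import math
import random

def expected_score(elo_diff):
    """Expected score fraction for an Elo advantage of elo_diff (logistic model)."""
    return 1.0 / (1.0 + 10.0 ** (-elo_diff / 400.0))

def simulate_gauntlet(true_elo_gain=10.0, games=100, draw_rate=0.3):
    """Simulate `games` games against equally rated opponents and return the
    Elo difference implied by the final score. The draw rate is an assumption."""
    p = expected_score(true_elo_gain)      # expected score per game
    win_p = p - draw_rate / 2.0            # wins plus half the draws give that score
    score = 0.0
    for _ in range(games):
        r = random.random()
        if r < win_p:
            score += 1.0
        elif r < win_p + draw_rate:
            score += 0.5
    frac = min(max(score / games, 1e-6), 1.0 - 1e-6)   # keep the log finite
    return 400.0 * math.log10(frac / (1.0 - frac))

runs = [simulate_gauntlet() for _ in range(20)]
print("individual 100-game gauntlets:", [round(r) for r in runs])
print("average over 20 runs:", round(sum(runs) / len(runs), 1))
```

Typical output shows single 100-game results swinging by 30 to 40 Elo either side of the true +10, and even the average of 20 such runs still wobbles by several Elo, so "averages out" takes a lot of games.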
Version randomness is so bad that in tests of this nature the average results can be much different. The reason is that an improved engine can choose better moves that nonetheless lead more often to positions that the new version handles poorly or that the opponent engines handle better. Too few test positions combined with too few opponents can and do lead to very contrary results: better versions can look worse, and worse versions can look like an improvement.
Tester beware!
New version randomness when testing
Moderator: Ras
-
Michael Sherwin
- Posts: 3196
- Joined: Fri May 26, 2006 3:00 am
- Location: WY, USA
- Full name: Michael Sherwin
New version randomness when testing
If you are on a sidewalk and the covid goes beep beep
Just step aside or you might have a bit of heat
Covid covid runs through the town all day
Can the people ever change their ways
Sherwin the covid's after you
Sherwin if it catches you you're through
Just step aside or you might have a bit of heat
Covid covid runs through the town all day
Can the people ever change their ways
Sherwin the covid's after you
Sherwin if it catches you you're through
-
bob
- Posts: 20943
- Joined: Mon Feb 27, 2006 7:30 pm
- Location: Birmingham, AL
Re: New version randomness when testing
Michael Sherwin wrote: It appears that this type of randomness is worse than the randomness due to hardware interactions. Suppose a standard set of positions is used to test against a select few engines, and assume a change was made that increases an engine's Elo by +10 points. If a 100-game-per-engine gauntlet is run several times, there will be variance between the separate tests, but the results will nevertheless average out to about +10 Elo.
Think about this: suppose you run 100 games using 100 positions against one opponent. You get a pretty big error bar, right? OK, duplicate the PGN for those 100 games 5 times. Now you have 500 games, and the error bar is smaller, right? _wrong_. It will show up as smaller through BayesElo, but that calculation is for data that is uncorrelated. In your case, each of the 5 games played from the same position (they are really the same game, since we just duplicated the PGN) is 100% correlated, and you have just fooled yourself into thinking you have a smaller error bar than you really have.
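To put rough numbers on the duplicated-PGN example: a naive error bar shrinks with the square root of the game count, so copying the same 100 games five times makes it look about 2.2 times tighter even though no new information was added. A back-of-the-envelope sketch (this is a simplified stand-in, not BayesElo's actual likelihood calculation; the 0.55 score fraction and the 0.4 per-game standard deviation are assumptions):

```python
import math

def naive_elo_error_bar(n_games, score_frac=0.55, per_game_sd=0.4):
    """Approximate 95% error bar in Elo that a tool would report if it treated
    every game as independent. per_game_sd is an assumed standard deviation of
    a single game result (win=1, draw=0.5, loss=0)."""
    se_frac = per_game_sd / math.sqrt(n_games)      # standard error of the score fraction
    # slope of the logistic Elo curve near score_frac converts score error to Elo
    delo_dfrac = 400.0 / (math.log(10.0) * score_frac * (1.0 - score_frac))
    return 1.96 * se_frac * delo_dfrac

original = naive_elo_error_bar(100)
duplicated = naive_elo_error_bar(500)   # same 100 games copied 5 times

print(f"reported error bar, 100 real games:      +/- {original:.0f} Elo")
print(f"reported error bar, 500 duplicated games: +/- {duplicated:.0f} Elo")
print("real uncertainty is unchanged: the extra 'games' are 100% correlated")
```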
Using a small number of positions and playing several games from each position does produce different games, but the outcomes are likely to be more consistent. Which translates to correlated. Which translates into way more variance than you'd expect.
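One way to quantify that "way more variance than you'd expect" is the standard design-effect formula for clustered samples: if m games are played from each position and games sharing a start position have correlation rho, the variance of the average is inflated by 1 + (m - 1) * rho. A small sketch with an assumed correlation of 0.4:

```python
import math

def effective_games(total_games, games_per_position, rho):
    """Number of independent games the sample is really worth, given that
    results from the same start position are correlated with coefficient rho."""
    design_effect = 1.0 + (games_per_position - 1) * rho
    return total_games / design_effect

# 500 games, but only 50 distinct positions (10 games each), assumed rho = 0.4
n_eff = effective_games(500, 10, 0.4)
print(f"effective sample size: {n_eff:.0f} games instead of 500")
print(f"error bar inflated by: {math.sqrt(500 / n_eff):.2f}x")
```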
Old news. Use many positions and enough opponents to provide enough games to reduce the error bar to something less than your expected gain or loss in strength...
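A rule-of-thumb calculation for how many independent games that takes, using the same logistic Elo curve and an assumed per-game standard deviation of 0.4:

```python
import math

def games_needed(target_elo_error, score_frac=0.5, per_game_sd=0.4):
    """Roughly how many independent games are needed for a 95% error bar of
    target_elo_error Elo, using the slope of the logistic Elo curve near score_frac."""
    delo_dfrac = 400.0 / (math.log(10.0) * score_frac * (1.0 - score_frac))
    se_frac_needed = target_elo_error / (1.96 * delo_dfrac)
    return (per_game_sd / se_frac_needed) ** 2

print(f"games for a +/-10 Elo error bar: about {games_needed(10):.0f}")
print(f"games for a +/-5  Elo error bar: about {games_needed(5):.0f}")
```

Under these assumptions a +/-10 Elo error bar takes roughly 3,000 independent games, and halving the error bar quadruples the game count, which is why a few hundred games from a handful of positions cannot resolve a +10 Elo change.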