That is completely irrelevant for this discussion. ELO is tuned to human competition only in the sense that the K constant used in the formula has to be adjusted to respond to changes in the strength of humans quickly enough, without being too sensitive to temporary good or bad results. That is not even relevant in computer testing, where K is not used; we go by performance rating. I sure hope you don't use incremental ratings in your tests.

bob wrote:
The formula used in Elo is tuned to humans. Humans don't suddenly get much stronger. Computers do. Etc.

Don wrote:
You need to review everything posted on this thread, specifically the comment H.G. made about the much greater number of games being needed for gauntlet testing vs head-to-head testing.

bob wrote:
And exactly what statistical principle is this based on???

Michel wrote:
In principle you need fewer games to prove that the new version is stronger than the old version when using self testing.

It takes me 30K games to get to +/-4 Elo, regardless of whether those 30K games are against one opponent or several. Unless I have overlooked something...
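For reference, here is a minimal sketch of the performance-rating calculation being referred to, using the standard logistic Elo model; note that no K factor appears anywhere. The function names and example numbers are purely illustrative, not anyone's actual test harness:

    # Performance rating from a match result, standard logistic Elo model.
    # Illustrative only; not anyone's actual test harness.
    import math

    def elo_diff(score):
        # Elo difference implied by a score fraction (0 < score < 1)
        return 400.0 * math.log10(score / (1.0 - score))

    def performance_rating(avg_opponent_rating, wins, draws, losses):
        games = wins + draws + losses
        score = (wins + 0.5 * draws) / games
        return avg_opponent_rating + elo_diff(score)

    # Example: 5500 wins, 3000 draws, 1500 losses vs ~2800-rated opposition
    print(performance_rating(2800.0, 5500, 3000, 1500))  # about 2947 (70% score ~ +147 Elo)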
ELO was based on chess players and computers are chess players. ELO was not based on modern chess players either. But you take specificity to such great lengths that we would have to re-validate the ELO system every time a new player entered the global rating pool.
Elo was not exactly based on computer play in the first place, so where, exactly, is a reference?
I have said this many times, but you are not listening. I don't CARE to measure the exact ELO improvement; I am only interested in proving that one program is stronger than another.
But again, I did not see anyone explain why the following is supposedly false:
I want to determine how much better version Y is than version X. I can play version Y against X for 30K games to get an error bar of +/-4 Elo. I can play version Y against ANYTHING for 30K games to get an error bar of +/-4 Elo. How can playing against version X require fewer games to get the same accuracy?
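As a rough sanity check on the 30K-games / +/-4 figure, here is a back-of-the-envelope sketch, assuming independent game results and a 95% interval; the function and draw-rate values are illustrative assumptions, not anyone's actual tool. The width depends only on the number of games (and the draw rate), not on which opponent supplies them:

    # Approximate 95% error bar (in Elo) for an n-game match, assuming
    # independent game results and a roughly 50% overall score.
    import math

    def elo_error_bar(n_games, draw_rate=0.0, z=1.96):
        sd = math.sqrt((1.0 - draw_rate) / 4.0)       # per-game std dev of score (win=1, draw=0.5, loss=0)
        se_score = sd / math.sqrt(n_games)            # standard error of the score fraction
        elo_per_score = 400.0 / math.log(10) / 0.25   # Elo per unit of score near 50% (~695)
        return z * se_score * elo_per_score

    print(elo_error_bar(30000))                 # ~3.9 Elo with no draws
    print(elo_error_bar(30000, draw_rate=0.3))  # ~3.3 Elo with 30% draws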
Or are we into the LOS stuff instead?
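Since LOS is mentioned: a minimal sketch of the usual likelihood-of-superiority calculation, under the common normal approximation in which draws carry no information about which side is stronger; the example numbers are made up:

    # LOS: probability that A is stronger than B given the match result,
    # normal approximation on the win/loss difference (draws ignored).
    import math

    def los(wins, losses):
        return 0.5 * (1.0 + math.erf((wins - losses) / math.sqrt(2.0 * (wins + losses))))

    print(los(520, 480))    # ~0.90: suggestive, not conclusive
    print(los(5200, 4800))  # ~1.00: same 52/48 ratio, far more games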
[edit]
OK, I went back and looked.
You are OK with JUST comparing version Y to version X, which exaggerates the Elo difference, as has been shown in every example of testing I have seen.
If the test exaggerates the difference, I don't care. If I only want to find out whether one stick is longer than another, I hold them up side by side and get my answer. If I care HOW much longer it is, then I have to bring out a standardized tape measure.
If I make a program improvement, test it, and then find that it's 10 ELO better but your test only shows it is 5 ELO better, does that mean you will throw out the change? I don't care if it's 5 ELO or 500 ELO; I'm going to keep the change. So I don't care if self-testing slightly exaggerates the results, or exaggerates them a lot, or even if we use some other incompatible rating system.
We never compare the ratings of our tests to the rating lists; they don't even agree anyway. Our only concern is incremental progress over time. From time to time we hand over a binary to someone else who will run a test for us against the top programs and report back to us. But that has no impact on our testing.
As opposed to gauntlet-type testing, where the Elo seems to track pretty accurately with the rating lists?
We use 4x fewer games to get the same statistical significance as you. The scale of the ELO is not relevant, so I would not care if it quadrupled the apparent difference as long as it is consistent. Your argument about the exaggerated difference is so completely irrelevant that I don't know why you argue it.
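For what it's worth, the arithmetic behind the 4x figure can be sketched like this: the number of games needed to resolve an Elo difference d at a fixed confidence level scales roughly as 1/d^2, so if self-testing shows roughly twice the Elo difference (the exaggeration being discussed), the same change reaches significance with about a quarter of the games. The function and numbers below are an illustrative sketch, assuming ~50% scores and no draws:

    # Games needed to resolve an Elo difference at a given z-score confidence.
    # Scales as 1/difference^2, so doubling the apparent difference cuts the
    # required games by about 4x. Illustrative sketch only.
    import math

    def games_needed(elo_diff, z=1.96, draw_rate=0.0):
        sd = math.sqrt((1.0 - draw_rate) / 4.0)       # per-game std dev of score
        elo_per_score = 400.0 / math.log(10) / 0.25   # Elo per unit score near 50%
        return (z * sd * elo_per_score / elo_diff) ** 2

    print(games_needed(5))   # ~18,500 games to resolve a 5 Elo edge
    print(games_needed(10))  # ~4,600 games if the same change shows up as 10 Elo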
So you use fewer games to get an exaggerated Elo difference (a larger error)?
The only relevant thing you have here is to make a case that it's not consistent (not transitive) between versions, but you are choosing the weaker case to argue for some reason. I don't mind a debate, but it should focus on the things that could possibly matter and not on stupid irrelevant stuff.

