That is completely irrelevant for this discussion. ELO is tuned to human competition only in the sense that the K constant used in the formula has to be adjusted to respond to changes in the strength of humans quickly enough, without being too sensitive to temporary good or bad results. That is not even relevant in computer testing, where K is not used; we go by performance rating. I sure hope you don't use incremental ratings in your tests.

bob wrote:
The formula used in Elo is tuned to humans. Humans don't suddenly get much stronger. Computers do. Etc.

Don wrote:
You need to review everything posted on this thread, specifically the comment H.G. made about the much greater number of games being needed for gauntlet testing vs head-to-head testing.

bob wrote:
And exactly what statistical principle is this based on???

Michel wrote:
In principle you need fewer games to prove that the new version is stronger than the old version when using self testing.

It takes me 30K games to get to +/-4 Elo, regardless of whether those 30K games are against one opponent or several. Unless I have overlooked something...
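For reference, here is a minimal sketch of the performance-rating calculation being referred to, using the standard logistic Elo model; note that no K factor appears anywhere. The function names and example numbers are purely illustrative, not anyone's actual test harness:

    # Performance rating from a match result, standard logistic Elo model.
    # Illustrative only; not anyone's actual test harness.
    import math

    def elo_diff(score):
        # Elo difference implied by a score fraction (0 < score < 1)
        return 400.0 * math.log10(score / (1.0 - score))

    def performance_rating(avg_opponent_rating, wins, draws, losses):
        games = wins + draws + losses
        score = (wins + 0.5 * draws) / games
        return avg_opponent_rating + elo_diff(score)

    # Example: 5500 wins, 3000 draws, 1500 losses vs ~2800-rated opposition
    print(performance_rating(2800.0, 5500, 3000, 1500))  # about 2947 (70% score ~ +147 Elo)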
ELO was based on chess players and computers are chess players. ELO was not based on modern chess players either. But you take specificity to such great lengths that we would have to re-validate the ELO system every time a new player entered the global rating pool.
Elo was not exactly based on computer play in the first place, so where, exactly, is a reference?
I have said this many times, but you are not listening. I don't CARE to measure the exact ELO improvement; I am only interested in proving that one program is stronger than another.
But again, I did not see anyone explain why the following is supposedly false:
I want to determine how much better version Y is than version X. I can play version Y against X for 30K games to get an error bar of +/-4 Elo. I can play version Y against ANYTHING for 30K games to get an error bar of +/-4 Elo. How can playing against version X require fewer games to get the same accuracy?
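As a rough sanity check on the 30K-games / +/-4 figure, here is a back-of-the-envelope sketch, assuming independent game results and a 95% interval; the function and draw-rate values are illustrative assumptions, not anyone's actual tool. The width depends only on the number of games (and the draw rate), not on which opponent supplies them:

    # Approximate 95% error bar (in Elo) for an n-game match, assuming
    # independent game results and a roughly 50% overall score.
    import math

    def elo_error_bar(n_games, draw_rate=0.0, z=1.96):
        sd = math.sqrt((1.0 - draw_rate) / 4.0)       # per-game std dev of score (win=1, draw=0.5, loss=0)
        se_score = sd / math.sqrt(n_games)            # standard error of the score fraction
        elo_per_score = 400.0 / math.log(10) / 0.25   # Elo per unit of score near 50% (~695)
        return z * se_score * elo_per_score

    print(elo_error_bar(30000))                 # ~3.9 Elo with no draws
    print(elo_error_bar(30000, draw_rate=0.3))  # ~3.3 Elo with 30% draws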
Or are we into the LOS stuff instead?
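Since LOS is mentioned: a minimal sketch of the usual likelihood-of-superiority calculation, under the common normal approximation in which draws carry no information about which side is stronger; the example numbers are made up:

    # LOS: probability that A is stronger than B given the match result,
    # normal approximation on the win/loss difference (draws ignored).
    import math

    def los(wins, losses):
        return 0.5 * (1.0 + math.erf((wins - losses) / math.sqrt(2.0 * (wins + losses))))

    print(los(520, 480))    # ~0.90: suggestive, not conclusive
    print(los(5200, 4800))  # ~1.00: same 52/48 ratio, far more games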
[edit]
OK, I went back and looked.
You are OK with JUST comparing version Y to version X, which exaggerates the Elo difference, as has been shown in every example of testing I have seen.
If the test exaggerates the difference, I don't care. If I only want to find out whether one stick is longer than another, I hold them up side by side and get my answer. If I care HOW much longer it is, then I have to bring out a standardized tape measure.
If I make a program improvement, test it, and then find that it's 10 ELO better but your test only shows it is 5 ELO better, does that mean you will throw out the change? I don't care if it's 5 ELO or 500 ELO; I'm going to keep the change. So I don't care if self-testing slightly exaggerates the results, or exaggerates them a lot, or even if we use some other incompatible rating system.
We never compare the ratings of our tests to the rating lists; they don't even agree anyway. Our only concern is incremental progress over time. From time to time we hand over a binary to someone else who will run a test for us against the top programs and report back to us. But that has no impact on our testing.
As opposed to gauntlet-type testing, where the Elo seems to track pretty accurately with the rating lists?
We use 4x fewer games to get the same statistical significance as you. The scale of the ELO is not relevant, so I would not care if it quadrupled the apparent difference as long as it is consistent. Your argument about the exaggerated difference is so completely irrelevant that I don't know why you argue it.
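For what it's worth, the arithmetic behind the 4x figure can be sketched like this: the number of games needed to resolve an Elo difference d at a fixed confidence level scales roughly as 1/d^2, so if self-testing shows roughly twice the Elo difference (the exaggeration being discussed), the same change reaches significance with about a quarter of the games. The function and numbers below are an illustrative sketch, assuming ~50% scores and no draws:

    # Games needed to resolve an Elo difference at a given z-score confidence.
    # Scales as 1/difference^2, so doubling the apparent difference cuts the
    # required games by about 4x. Illustrative sketch only.
    import math

    def games_needed(elo_diff, z=1.96, draw_rate=0.0):
        sd = math.sqrt((1.0 - draw_rate) / 4.0)       # per-game std dev of score
        elo_per_score = 400.0 / math.log(10) / 0.25   # Elo per unit score near 50%
        return (z * sd * elo_per_score / elo_diff) ** 2

    print(games_needed(5))   # ~18,500 games to resolve a 5 Elo edge
    print(games_needed(10))  # ~4,600 games if the same change shows up as 10 Elo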
So you use fewer games to get an exaggerated Elo difference (a larger error)?
The only relevant thing you have here is to make a case that it's not consistent (not transitive) between versions, but you are choosing the weaker case to argue for some reason. I don't mind a debate, but it should focus on the things that could possibly matter and not on stupid irrelevant stuff.

