obvious/easy move

Discussion of chess software programming and technical issues.

Moderator: Ras


Re: obvious/easy move - final results

Post by Sven »

hgm wrote:That is a bit more complex. There are two independent measurements there, with results D1 (for A1-B) and D2 (for A2-B) for the rating difference, and errors E1 and E2 in it. That means the program will assign the ratings:

A1 = D1 - (D1+D2)/3 = 2/3*D1 - 1/3*D2
A2 = D2 - (D1+D2)/3 = 2/3*D2 - 1/3*D1
B = 0 - (D1+D2)/3 = -1/3*D1 - 1/3*D2

So the error in the A1 rating would be the combination of 2/3*E1 and 1/3*E2, which (because these are independent errors) gives sqrt(4/9*E1*E1 + 1/9*E2*E2).
But we are not interested in the differences of A1-B resp. A2-B but in the difference A1-A2 since these are the two "candidates".

My question was this: you confirmed that in the self-play case the error of the difference A1-A2 is 2x the error of the ratings of A1 resp. A2, due to the "100% anti-correlation". What is the error of the difference A1-A2 in the gauntlet case if both A1 and A2 have each played twice the number of games compared to the self-play case? (In my example I had 1000 games self-play vs. 2000+2000 games in the gauntlet case.)

After all that was posted here it is sqrt(error(A1_rating)^2 + error(A2_rating)^2), and if we further assume that both ratings have the same error margins then it becomes sqrt(2) * error(A1_rating). Therefore, to get the same accuracy in the error of rating differences when comparing A1-A2 self-play vs. gauntlet we need 2x the number of games for a gauntlet compared to self-play, not 4x.

Right or wrong? :-)

Sven

Re: obvious/easy move - final results

Post by hgm »

Well, from the formula you can see that the reported errors are not independent, as the reported ratings still correlate (but not by 100%):

cov(A1,A2) = cov(2/3*D1-1/3*D2, 2/3*D2-1/3*D1) = -2/9*cov(D1,D1) -2/9*cov(D2,D2) = -2/9*(var(D1) + var(D2))

as D1 and D2 are independent. Now

var(A2) = var(2/3*D2-1/3*D1) = 4/9*var(D2) + 1/9*var(D1)

while

var(A1) = var(2/3*D1-1/3*D2) = 4/9*var(D1) + 1/9*var(D2)

The latter are the numbers the Elo program reports. (I.e. it reports their square roots.)

To calculate the error in the difference you would have to add the variances AND the covariance:

var(A2-A1) = var(A2) + var(A1) - 2*cov(A1, A2) = 5/9*(var(D1) + var(D2)) + 4/9*(var(D1) + var(D2)) = var(D1) + var(D2)

This is indeed what you would expect from calculating it by hand:

var(A2 - A1) = var(D2-D1) = var(D2) + var(D1)

But, as shown above, var(A1) + var(A2) is only 5/9 of that. So you would have to multiply the reported error bars by sqrt(9/5) before square-adding them.
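
A quick numerical check of the above (a sketch with arbitrary standard deviations, using the simplified zero-sum assignment from the formulas rather than any particular rating program):

Code: Select all

import numpy as np

rng = np.random.default_rng(42)
n = 1_000_000

# independent measurement errors of the two match results (hypothetical 10-Elo sd each)
D1 = rng.normal(0.0, 10.0, n)   # measured A1-B rating difference
D2 = rng.normal(0.0, 10.0, n)   # measured A2-B rating difference

# ratings as a zero-sum Elo program would assign them
A1 = 2/3 * D1 - 1/3 * D2
A2 = 2/3 * D2 - 1/3 * D1

print(np.var(A1))             # ~ 55.6 = 4/9*100 + 1/9*100 (squared reported error)
print(np.cov(A1, A2)[0, 1])   # ~ -44.4 = -2/9*(100 + 100)
print(np.var(A2 - A1))        # ~ 200  = var(D1) + var(D2)

Square-adding the reported errors would give only 55.6 + 55.6 = 111.1, which is 5/9 of the true 200; scaling the error bars by sqrt(9/5) before square-adding recovers it.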

Re: obvious/easy move - final results

Post by Sven »

hgm wrote:Well, from the formula you can see that the reported errors are not independent, as the reported ratings still correlate (but not by 100%):

cov(A1,A2) = cov(2/3*D1-1/3*D2, 2/3*D2-1/3*D1) = -2/9*cov(D1,D1) -2/9*cov(D2,D2) = -2/9*(var(D1) + var(D2))

as D1 and D2 are independent. Now

var(A2) = var(2/3*D2-1/3*D1) = 4/9*var(D2) + 1/9*var(D1)

while

var(A1) = var(2/3*D1-1/3*D2) = 4/9*var(D1) + 1/9*var(D2)

The latter are the numbers the Elo program reports. (I.e. it reports their square roots.)

To calculate the error in the difference you would have to add the variances AND the covariance:

var(A2-A1) = var(A2) + var(A1) - 2*cov(A1, A2) = 5/9*(var(D1) + var(D2)) + 4/9*(var(D1) + var(D2)) = var(D1) + var(D2)

This is indeed what you would expect from calculating it by hand:

var(A2 - A1) = var(D2-D1) = var(D2) + var(D1)

But, as shown above, var(A1) + var(A2) is only 5/9 of that. So you would have to multiply the reported error bars by sqrt(9/5) before square-adding them.
I don't see at all how that stuff is related to my question. I'm still interested in a corresponding answer.

The ratings for A1 and A2 from the gauntlet case are surely 100% independent so I really don't know what you are talking about. Your D1, D2 stuff is not related at all to the question about the error of the rating difference of A1 and A2 in the gauntlet case.

Sven

Re: obvious/easy move - final results

Post by hgm »

No, they are NOT independent. I just showed you they had a non-zero covariance. They are just less dependent than in the direct match. In the simplified case where var(D1) = var(D2) = V the variances are 5/9*V, and the covariance -4/9*V. So cov(A2, A1)/sqrt(var(A2)*var(A1)) is only -80%, not -100%. But the ratings still strongly anti-correlate.

Only in the limit where you would replace the gauntlet B by an infinite number of engines (1000 engines each playing only 1 game would be a huge step in that direction) will A1 and A2 become independent.
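
This can be checked numerically. The sketch below assumes every pairing plays a match of the same length (equal noise per measured difference) and fits ratings by least squares with a zero-sum constraint, as a simplified stand-in for a rating program; it reproduces the -80% for a single opponent B and shows the anti-correlation fading as the gauntlet grows:

Code: Select all

import numpy as np

rng = np.random.default_rng(1)

def corr_a1_a2(k, trials=10000, sd=1.0):
    # columns: a1, a2, b1..bk; rows: the 2k measured differences
    # D1j = a1 - bj and D2j = a2 - bj, plus the zero-sum constraint
    rows = []
    for j in range(k):
        r1 = np.zeros(k + 2); r1[0] = 1.0; r1[2 + j] = -1.0
        r2 = np.zeros(k + 2); r2[1] = 1.0; r2[2 + j] = -1.0
        rows += [r1, r2]
    rows.append(np.ones(k + 2))   # a1 + a2 + sum(bj) = 0
    X = np.array(rows)
    a1s, a2s = [], []
    for _ in range(trials):
        y = np.concatenate([rng.normal(0.0, sd, 2 * k), [0.0]])
        sol = np.linalg.lstsq(X, y, rcond=None)[0]
        a1s.append(sol[0]); a2s.append(sol[1])
    return np.corrcoef(a1s, a2s)[0, 1]

for k in (1, 2, 5, 20):
    print(k, round(corr_a1_a2(k), 2))   # k=1 gives ~ -0.80; larger k drifts toward 0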

Re: obvious/easy move - final results

Post by Sven »

hgm wrote:No, they are NOT independent. I just showed you they had a non-zero covariance. They are just less dependent than in the direct match. In the simplified case where var(D1) = var(D2) = V the variances are 5/9*V, and the covariance -4/9*V. So cov(A2, A1)/sqrt(var(A2)*var(A1)) is only -80%, not -100%. But the ratings still strongly anti-correlate.

Only in the limit where you would replace the gauntlet B by an infinite number of engines (1000 engines each playing only 1 game would be a huge step in that direction) will A1 and A2 become independent.
So you say that in reality the formula that was quoted here many times for the error margin of the rating difference for gauntlets is wrong. Maybe you are right that the A1, A2 ratings are not independent, but at least everyone seemed to assume so, which was the reason for the "sqrt(2)" entering the discussion. After your post it seems it should be "sqrt(2 * 9/5)", although I did not understand every detail.

Then we have:

a) Self-play:
N games A1-A2, error margin of rating = ES1 (assumed identical to ES2), error margin of rating difference = 2*ES1

b) Gauntlet with one opponent B:
M games A1-B and M games A2-B, error margin of rating = EG1 (assumed identical to EG2), error margin of rating difference = sqrt(2 * 9/5) * EG1 according to your formula

M+M = X*N and we are looking for X.

Now what is X? Up to now everyone said "X=4". Seems no longer true?
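
A back-of-the-envelope check, assuming (as everywhere in this thread) that the error of a measured rating difference from n games scales as c/sqrt(n):

Code: Select all

from math import sqrt

c = 1.0    # error of a rating difference measured from 1 game (arbitrary units)
N = 1000   # self-play games, as in the example above

err_selfplay = c / sqrt(N)                 # A1-A2 is measured directly

# gauntlet: M games A1-B plus M games A2-B
for M in (1000, 2000):
    var_D1 = var_D2 = c**2 / M
    err_gauntlet = sqrt(var_D1 + var_D2)   # true error of A1-A2
    print(M, err_gauntlet / err_selfplay)  # M=1000 -> 1.41, M=2000 -> 1.00

So matching the self-play accuracy takes M = 2N, i.e. 2M = 4N games in total: X = 4 after all. The sqrt(2 * 9/5) * EG1 route gives the same answer, because EG1^2 = 5/9 * c^2/M, so the 9/5 exactly cancels the 5/9 hidden in the reported error bars.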

And what about bigger gauntlets (>1 opponent)?

Sven

Re: obvious/easy move - final results

Post by Sven »

Sven Schüle wrote:
hgm wrote:That is a bit more complex. There are two independent measurements there, with results D1 (for A1-B) and D2 (for A2-B) for the rating difference, and errors E1 and E2 in it. That means the program will assign the ratings:

A1 = D1 - (D1+D2)/3 = 2/3*D1 - 1/3*D2
A2 = D2 - (D1+D2)/3 = 2/3*D2 - 1/3*D1
B = 0 - (D1+D2)/3 = -1/3*D1 - 1/3*D2

So the error in the A1 rating would be the combination of 2/3*E1 and 1/3*E2, which (because these are independent errors) gives sqrt(4/9*E1*E1 + 1/9*E2*E2).
But we are not interested in the differences of A1-B resp. A2-B but in the difference A1-A2 since these are the two "candidates".
The reason why I completely misunderstood your post when reading it for the first time was that you used the term "rating difference" where you meant "relative rating". The term "rating difference" has been used for the comparison of two ratings in the context of this sub-thread.

Sven

Re: obvious/easy move - final results

Post by hgm »

The formula people were using here was always used correctly, by directly applying it to gauntlet results (which are independent). It is you, however, who drags in an Elo calculating program and wants to interpret the error bars it quotes in the individual ratings. That is an entirely different game, full of pitfalls. These programs shift the ratings around to keep their average zero, and this introduces apparent correlations between the ratings even in cases where common sense says they should have been measured independently. Correlation between the results invalidates the formula. The true formula is

var(R2-R1) = var(R2) + var(R1) - 2*cov(R1, R2)

Only when the covariance is zero does this reduce to the formula that was applied by several people above. That was all correct, because the quantities they were taking the difference of were indeed independent.

The error bars quoted by rating programs can be very misleading (especially when there are few players). This is why these programs often quote LOS separately.
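
For completeness, the usual normal approximation for LOS from a head-to-head result (one standard way to compute it, not tied to any particular rating program; draws cancel out of it):

Code: Select all

from math import erf, sqrt

def los(wins, losses):
    # P(engine A is stronger than engine B), normal approximation
    return 0.5 * (1.0 + erf((wins - losses) / sqrt(2.0 * (wins + losses))))

print(los(120, 100))   # ~0.91: about 91% confidence A is the stronger engine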

Re: obvious/easy move - final results

Post by bob »

Don wrote:
AlvaroBegue wrote:
bob wrote:But again, I did not see anyone explain why the following is supposedly false:

I want to determine how much better version Y is than version X. I can play version Y against X, for 30K games to get an error bar of +/-4. I can play version Y against ANYTHING for 30K games to get an error bar of +/-4. How can playing against version X require fewer games to get the same accuracy?
I'll give it a try. We are not particularly interested in measuring the ELO score of our program: We are interested in measuring if a particular modification to our program is an improvement. If we match the proposed version against the previous version, you get a +/-4 ELO points estimate after 30K games. If you match the new version against ANYTHING, you'll know the ELO of the new version within +/-4 ELO, but we need to compute the difference between this value and the ELO score of the previous version, which also needs to be measured. That takes a factor of 2 more games, and when you subtract two numbers with +/-4 error bars on them, you get a larger error bar of +/-4*sqrt(2). If you want to recover the +/-4 error bar, you need to again double the number of games. That's where the factor of 4 comes from.

Someone (Don?) said that you can get away with "just" a factor of 2 because presumably you have already run the ELO measurement of the previous version. However, this has a bit of a problem: If the testing of the previous version was lucky and got an estimate of ELO inflated by 12 points (a 3-sigma event happens sometimes), now every change will tend to look bad, just because other tests are unlikely to be that lucky.
It was me who said it. However my statement came with some qualifications and I agree the right number is 4x. We have actually run across the problem you describe, at least in the past when we primarily did gauntlets against foreign programs. We would run our 50,000 games and one version would get "lucky" by a few ELO. Even then we knew that this could be a problem because we would find it almost impossible to "beat" this version.

So the self testing solves this problem by simply testing the same version over and over again. There is a tradeoff here though. If you compare the two methods directly you cannot say 4x if you are testing the same version over and over again in one method but not the other.

However, I think 4x is the number you have to go with because these other considerations are logistical and irrelevant from a cost analysis point of view. You have convinced me that the self-testing method (even compared to the rotating gauntlet to save computation) is more than 2x superior. First of all, if the new version tests better you don't ever have to run the old version again, which is a savings that is comparable to the gauntlet's 2x saving I mentioned. But you get something for doing this even if it doesn't test better: some statistical protection against a temporary false read.
Now I am lost once again. Why would I EVER "run the old version again?" I never do this. Crafty is a continuous series of versions, where I run one against the gauntlet and it is then retired forever. If it was better than the previous version, it becomes the new best, but is never tested again. If it is worse, it is discarded unless something about the change deserves further study...

Re: obvious/easy move - final results

Post by bob »

Don wrote:
bob wrote:
Don wrote:
bob wrote:
Michel wrote:
It takes me 30K games to get to +/-4 elo, regardless of whether those 30K games are against 1 opponent, or several. Unless I have overlooked something...
In principle you need fewer games to prove that the new version is stronger than the old version when using self testing.
And exactly what statistical principle is this based on???
You need to review everything posted on this thread, specifically the comment H.G. made about the much greater number of games being needed for gauntlet testing vs head to head testing.

Elo was not exactly based on computer play in the first place, so where, exactly, is a reference?
ELO was based on chess players and computers are chess players. ELO was not based on modern chess players either. But you take specificity to such great lengths that we would have to re-validate the ELO system every time a new player entered the global rating pool.
The formula used in Elo is tuned to humans. Humans don't suddenly get much stronger. Computers do. Etc.
That is completely irrelevant for this discussion. ELO is tuned to human competition only in the sense that the K constant used in the formula has to be adjusted to respond to changes in the strength of humans quickly enough without being too sensitive to temporary good or bad results. That is not even relevant in computer testing, where K is not used. We go by performance rating. I sure hope you don't use incremental ratings in your tests.

But again, I did not see anyone explain why the following is supposedly false:

I want to determine how much better version Y is than version X. I can play version Y against X, for 30K games to get an error bar of +/-4. I can play version Y against ANYTHING for 30K games to get an error bar of +/-4. How can playing against version X require fewer games to get the same accuracy?

Or are we into the LOS stuff instead?

[edit]

OK, I went back and looked.

You are OK with JUST comparing version Y to version X. Which exaggerates the Elo difference, as has been shown in every example of testing I have seen.
I have said this many times but you are not listening. I don't CARE to measure the exact ELO improvement, I am only interested in proving that one program is stronger than another.

If the test exaggerates the difference, I don't care. If I want to only find out if one stick is longer than another I hold them up side by side and get my answer. If I care HOW much longer it is then I have to bring out a standardized tape measure.

If I make a program improvement, test it, and then find that it's 10 ELO better but your test only shows it is 5 ELO better, does that mean you will throw out the change? I don't care if it's 5 ELO or 500 ELO, I'm going to keep the change. So I don't care if self-testing slightly exaggerates the results or if it exaggerates it a lot or even if we use some other incompatible rating system.

As opposed to the gauntlet-type testing where the elo seems to track pretty accurately with rating lists?
We never compare the ratings of our tests to the rating lists. They don't even agree anyway. Our only concern is incremental progress over time. From time to time we hand over a binary to someone else who will run a test for us against the top programs and report back to us. But that has no impact on our testing.

So you use fewer games, to get an exaggerated Elo difference (larger error)? :)
We use 4x fewer games to get the same statistical significance as you. The scale of the ELO is not relevant. So I would not care if it quadruples the apparent difference as long as it is consistent. Your argument about exaggerated difference is so completely irrelevant I don't know why you argue it.

The only relevant thing you have here is to make a case that it's not consistent (not transitive) between versions but you are choosing the weaker case to argue for some reason. I don't mind a debate but it should focus on the things that possibly could matter and not stupid irrelevant stuff.
I have previously stated my primary objection to self-testing. A new version can beat the old version, but play worse against a pool of DIFFERENT programs. I've seen that happen multiple times.

That gives me pause to think. And in my case, where we are talking about 5 minutes of testing or 20 minutes (assuming one goes by your 4x number) the difference between the two is basically insignificant and I prefer the more reliable gauntlet format.

Re: obvious/easy move - final results

Post by Sven »

hgm wrote:The formula people were using here was always used correctly, by directly applying it to gauntlet results (which are independent). It is you, however, who drags in an Elo calculating program, and want to interpret the error bars it quotes in the individual ratings.
Come on, that's not fair. Everyone around here, perhaps except you, is using a rating program to calculate error bars, especially since it is quite inconvenient to do that manually for a large number of games. So don't blame me for doing the same, please. I doubt that there are many people who are really aware of your specific findings about that difference in error bar calculation between rating programs and the "reality". For most people a rating program is a tool that does the work for them.

Within a couple of posts I tried hard to get a convincing answer from you. Only now did you come out with it, even though it must have been clear for everyone what I was talking about.

One final question to you: why are the ratings of A1 and A2 independent in the "reality" but dependent in case of using a rating program?

Sven