obvious/easy move

Discussion of chess software programming and technical issues.

Moderator: Ras

bob
Posts: 20943
Joined: Mon Feb 27, 2006 7:30 pm
Location: Birmingham, AL

Re: obvious/easy move - final results

Post by bob »

Michel wrote:
It takes me 30K games to get to +/-4 elo, regardless of whether those 30K games are against 1 opponent, or several. Unless I have overlooked something...
In principle you need fewer games to prove that the new version is stronger than the old version when using self testing.
And exactly what statistical principle is this based on???

Elo was not exactly based on computer play in the first place, so where, exactly, is a reference?
User avatar
Don
Posts: 5106
Joined: Tue Apr 29, 2008 4:27 pm

Re: obvious/easy move - final results

Post by Don »

bob wrote:
Michel wrote:
It takes me 30K games to get to +/-4 elo, regardless of whether those 30K games are against 1 opponent, or several. Unless I have overlooked something...
In principle you need fewer games to prove that the new version is stronger than the old version when using self testing.
And exactly what statistical principle is this based on???
You need to review everything posted on this thread, specifically the comment H.G. made about the much greater number of games being needed for gauntlet testing vs head-to-head testing.

Elo was not exactly based on computer play in the first place, so where, exactly, is a reference?
ELO was based on chess players, and computers are chess players. ELO was not based on modern chess players either. But you take specificity to such great lengths that we would have to re-validate the ELO system every time a new player entered the global rating pool.
Capital punishment would be more effective as a preventive measure if it were administered prior to the crime.
User avatar
michiguel
Posts: 6401
Joined: Thu Mar 09, 2006 8:30 pm
Location: Chicago, Illinois, USA

Re: obvious/easy move - final results

Post by michiguel »

hgm wrote:Wrong!

1) We don't get the same thing. I say that you need 4000 games (2000 A1-B and 2000 A2-B) to get the same accuracy for the difference between the A versions as with 1000 A1-A2 games. You say you need only 2000.
Correct, it is 4x.

The error of self-testing (Es) is proportional to the square root of the number of games (n):
Es = k * sqrt(n)

Fractional error (divide by n):
Es/n = k / sqrt(n)

Now, let's run two gauntlets, one for version 1 and one for version 2.
If the total number of games in each gauntlet is m:

E1 = k * sqrt(m) // error of the relative Elo value V1 in the gauntlet for the first version
E2 = k * sqrt(m) // error of the relative Elo value V2 in the gauntlet for the second version

The Elo difference calculated from the two gauntlets is V2 - V1, and the variances of the independent measurements add (square of the error = variance, Ec^2 = variance):

Variance = Ec^2 = E1^2 + E2^2
Ec^2 = k^2 * m + k^2 * m
Ec^2 = 2 * k^2 * m

Hence, the total error of the difference V2 - V1 is
Ec = k * sqrt(2) * sqrt(m)

So, the fractional error is
Ec/m = k * sqrt(2) / sqrt(m)

When are the fractional errors for self-testing and the two gauntlets the same?
Ec/m = Es/n when

k / sqrt(n) = sqrt(2) * k / sqrt(m)

sqrt(m)/sqrt(n) = sqrt(2)

m/n = 2

The number of games per gauntlet should be double, but since we have to run two gauntlets, the total number of games needed is 4x.

For instance, if we run 1000 games in self-testing, we need to run 2000 games in gauntlet 1 and 2000 games in gauntlet 2.

Miguel

2) The ratings of A1 and A2 calculated by the rating program are not independent, but 100% anti-correlated. So the error in their difference adds directly, not through the sqrt stuff. You say the quoted data is synthetic, and I really wonder if BayesElo would report it like that. I think it should report 9 Elo error bars in the A1-A2 case, not 18, so that the error bar in the difference will be 18.

In any case, using the error bars reported by rating programs is tricky with few players, as the condition that the average rating is zero creates a correlation between the ratings. So the simple sqrt addition no longer applies, and you have to take account of the covariance as well as the variances.
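
The 4x rule above is easy to sanity-check by simulation. The sketch below (Python) is a minimal illustration, not anyone's actual test harness; the 10 Elo gap, the draw-free games, and the helper names are made up for the example. It plays n self-play games and two gauntlets of m = 2n games each against an equal opponent B, and compares the spread of the two difference estimates.

import math
import random

# Minimal sketch: every game is an independent win/loss (no draws),
# A2 is 10 Elo stronger than A1, and gauntlet opponent B equals A1.

def expected_score(elo_diff):
    # logistic expectation of the stronger side's score
    return 1.0 / (1.0 + 10.0 ** (-elo_diff / 400.0))

def elo_from_score(s):
    # invert the logistic; clamp to avoid log(0) on perfect scores
    s = min(max(s, 1e-6), 1.0 - 1e-6)
    return -400.0 * math.log10(1.0 / s - 1.0)

def score_fraction(games, p):
    # score fraction of independent games won with probability p
    return sum(random.random() < p for _ in range(games)) / games

random.seed(1)
n, m, trials = 1000, 2000, 2000     # m = 2n games per gauntlet
p_self = expected_score(10.0)       # A2's expected score vs A1

self_play = [elo_from_score(score_fraction(n, p_self))
             for _ in range(trials)]
gauntlets = [elo_from_score(score_fraction(m, p_self))    # A2 vs B
             - elo_from_score(score_fraction(m, 0.5))     # A1 vs B
             for _ in range(trials)]

def sd(xs):
    mu = sum(xs) / len(xs)
    return (sum((x - mu) ** 2 for x in xs) / (len(xs) - 1)) ** 0.5

print("self-play, 1000 games:       sd =", round(sd(self_play), 2))
print("two gauntlets, 2x2000 games: sd =", round(sd(gauntlets), 2))

Both spreads come out around 11 Elo: with double-size gauntlets (4x the games in total), the difference estimate is as tight as the self-play one, exactly as the algebra above says.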
User avatar
Don
Posts: 5106
Joined: Tue Apr 29, 2008 4:27 pm

Re: obvious/easy move - final results

Post by Don »

Although 4X is correct, one could argue that you only need 2X, based on the following reasoning:

You start with A vs B. You play A vs foreign opponents using 2X more games and then you play B vs foreign opponents using 2X more games. For this case you have to play 4X more games than just testing A vs B.

HOWEVER, you now have the completed results of the "best" player, which may be A or B. Let's assume B is your new reference player. You make a program change and now you must test program C. You DO NOT NEED to test B again, as you already have his results - you must only test C. From now on you only need to run 2X more games to get equivalent results, as you get additional utility from previously completed tests.

Still, 2X is a lot and I'm not willing to "throw out" half my hardware for this.

michiguel wrote:
hgm wrote:Wrong!

1) We don't get the same thing. I say that you need 4000 games (2000 A1-B and 2000 A2-B) to get the same accuracy for the difference between the A versions as with 1000 A1-A2 games. You say you need only 2000.
Correct, it is 4x. [full derivation and HGM's second point snipped; quoted above]

Miguel
Capital punishment would be more effective as a preventive measure if it were administered prior to the crime.
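
The bookkeeping behind this amortized 2x is worth making concrete. A minimal sketch (Python), assuming michiguel's m = 2n result holds, i.e. a 2n-game gauntlet measures one version as accurately as an n-game head-to-head match measures the difference; the function names and game counts are only for illustration:

# games needed under each scheme for a series of version comparisons

def games_self_play(comparisons, n):
    # every candidate plays n head-to-head games vs the current reference
    return comparisons * n

def games_gauntlet_reuse(comparisons, n):
    # the first comparison must also measure the reference (two gauntlets);
    # afterwards each new candidate needs only its own 2n-game gauntlet
    return 2 * (2 * n) + (comparisons - 1) * (2 * n)

n = 1000
for c in (1, 2, 5, 20, 100):
    sp, gr = games_self_play(c, n), games_gauntlet_reuse(c, n)
    print(f"{c:>3} comparisons: self-play {sp:>7}, gauntlets {gr:>7}, "
          f"ratio {gr / sp:.2f}")

The ratio starts at 4x for a single comparison and falls toward the 2x described above as old gauntlet results keep being reused.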
User avatar
michiguel
Posts: 6401
Joined: Thu Mar 09, 2006 8:30 pm
Location: Chicago, Illinois, USA

Re: obvious/easy move - final results

Post by michiguel »

Don wrote:Although 4X is correct, one could argue that you only need 2X, based on the following reasoning:

You start with A vs B. You play A vs foreign opponents using 2X more games and then you play B vs foreign opponents using 2X more games. For this case you have to play 4X more games than just testing A vs B.

HOWEVER, you now have the completed results of the "best" player, which may be A or B. Let's assume B is your new reference player. You make a program change and now you must test program C. You DO NOT NEED to test B again, as you already have his results - you must only test C. From now on you only need to run 2X more games to get equivalent results, as you get additional utility from previously completed tests.

Still, 2X is a lot and I'm not willing to "throw out" half my hardware for this.

michiguel wrote:
[full derivation and HGM's points snipped; quoted in full above]
Good pragmatic observation, and correct: since the reference is always kept, the more you test, the closer you get to 2x.

Miguel
User avatar
Don
Posts: 5106
Joined: Tue Apr 29, 2008 4:27 pm

Re: obvious/easy move - final results

Post by Don »

michiguel wrote:
Don wrote:Although 4X is correct, one could argue that you only need 2X, based on the following reasoning:

You start with A vs B. You play A vs foreign opponents using 2X more games and then you play B vs foreign opponents using 2X more games. For this case you have to play 4X more games than just testing A vs B.

HOWEVER, you now have the completed results of the "best" player, which may be A or B. Let's assume B is your new reference player. You make a program change and now you must test program C. You DO NOT NEED to test B again, as you already have his results - you must only test C. From now on you only need to run 2X more games to get equivalent results, as you get additional utility from previously completed tests.

Still, 2X is a lot and I'm not willing to "throw out" half my hardware for this.

michiguel wrote:
[full derivation and HGM's points snipped; quoted in full above]
Good pragmatic observation, and correct: since the reference is always kept, the more you test, the closer you get to 2x.

Miguel
Of course that doesn't change the fact that the "real" number is 4X, and being forced to test this way just to "salvage" 2X is a restriction. Sometimes we test 2 versions of a change together as an intermediate tuning step. Usually we don't require as many games, since we simply want to create a reasonable reference version to start with, but that would be 4X more work with Bob's method, which requires you to play each version against other programs.
Capital punishment would be more effective as a preventive measure if it were administered prior to the crime.
bob
Posts: 20943
Joined: Mon Feb 27, 2006 7:30 pm
Location: Birmingham, AL

Re: obvious/easy move - final results

Post by bob »

Don wrote:
bob wrote:
Michel wrote:
It takes me 30K games to get to +/-4 elo, regardless of whether those 30K games are against 1 opponent, or several. Unless I have overlooked something...
In principle you need fewer games to prove that the new version is stronger than the old version when using self testing.
And exactly what statistical principle is this based on???
You need to review everything posted on this thread, specifically the comment H.G. made about the much greater number of games being needed for gauntlet testing vs head-to-head testing.

Elo was not exactly based on computer play in the first place, so where, exactly, is a reference?
ELO was based on chess players, and computers are chess players. ELO was not based on modern chess players either. But you take specificity to such great lengths that we would have to re-validate the ELO system every time a new player entered the global rating pool.
The formula used in Elo is tuned to humans. Humans don't suddenly get much stronger. Computers do. Etc.

But again, I did not see anyone explain why the following is supposedly false:

I want to determine how much better version Y is than version X. I can play version Y against X for 30K games to get an error bar of +/-4. I can play version Y against ANYTHING for 30K games to get an error bar of +/-4. How can playing against version X require fewer games to get the same accuracy?

Or are we into the LOS stuff instead?

[edit]

OK, I went back and looked.

You are OK with JUST comparing version Y to version X, which exaggerates the Elo difference, as has been shown in every example of testing I have seen.

As opposed to gauntlet-type testing, where the Elo seems to track pretty accurately with rating lists?

So you use fewer games to get an exaggerated Elo difference (larger error)? :)
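
For reference, the ±4-after-30K figure can be reproduced from the binomial spread of the match score. A minimal sketch (Python), assuming the score sits near 50% and treating the per-game standard deviation as a parameter (0.5 with no draws; a realistic draw rate pushes it lower); the function name and the z = 1.96 choice of a 95% interval are just for illustration:

import math

def elo_error_bar(n_games, per_game_sd=0.5, z=1.96):
    # standard error of the score fraction after n_games
    se_score = per_game_sd / math.sqrt(n_games)
    # slope of the logistic Elo curve at a 50% score: 400 / (ln 10 * 0.25)
    elo_per_score = 400.0 / (math.log(10.0) * 0.25)
    return z * se_score * elo_per_score

print(elo_error_bar(30000))                   # ~3.9 Elo with no draws
print(elo_error_bar(30000, per_game_sd=0.4))  # ~3.1 Elo with many draws

This depends only on the number of games, not on who the opponent is; the disagreement in the thread is about how many such measurements a version difference needs.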
AlvaroBegue
Posts: 932
Joined: Tue Mar 09, 2010 3:46 pm
Location: New York
Full name: Álvaro Begué (RuyDos)

Re: obvious/easy move - final results

Post by AlvaroBegue »

bob wrote:But again, I did not see anyone explain why the following is supposedly false:

I want to determine how much better version Y is than version X. I can play version Y against X for 30K games to get an error bar of +/-4. I can play version Y against ANYTHING for 30K games to get an error bar of +/-4. How can playing against version X require fewer games to get the same accuracy?
I'll give it a try. We are not particularly interested in measuring the ELO score of our program: we are interested in measuring whether a particular modification to our program is an improvement. If we match the proposed version against the previous version, we get a +/-4 ELO estimate after 30K games. If we match the new version against ANYTHING, we'll know the ELO of the new version within +/-4, but we need to compute the difference between this value and the ELO score of the previous version, which also needs to be measured. That takes a factor of 2 more games, and when you subtract two numbers with +/-4 error bars on them, you get a larger error bar of +/-4*sqrt(2). If you want to recover the +/-4 error bar, you need to again double the number of games. That's where the factor of 4 comes from.

Someone (Don?) said that you can get away with "just" a factor of 2 because presumably you have already run the ELO measurement of the previous version. However, this has a bit of a problem: if the testing of the previous version was lucky and got an ELO estimate inflated by 12 points (a 3-sigma event happens sometimes), then every change will tend to look bad, just because other tests are unlikely to be that lucky.
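
This error-propagation arithmetic fits in a few lines (Python); the numbers are purely illustrative. Independent error bars add in quadrature under subtraction, and shrink with the square root of the game count:

import math

e = 4.0                          # +/- Elo of each gauntlet measurement
e_diff = math.sqrt(e**2 + e**2)  # error of the difference of the two
print(e_diff)                    # 5.66 = 4 * sqrt(2)

# to pull the difference back to +/-4, each error bar must shrink by
# sqrt(2), i.e. each measurement needs 2x the games: 2 matches * 2x = 4x
games_factor = (e_diff / e) ** 2
print(games_factor)              # 2.0 per measurement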
Sven
Posts: 4052
Joined: Thu May 15, 2008 9:57 pm
Location: Berlin, Germany
Full name: Sven Schüle

Re: obvious/easy move - final results

Post by Sven »

michiguel wrote:The error of self-testing (Es) is proportional to the square root of the number of games (n):
Es = k * sqrt(n)

Fractional error (divide by n):
Es/n = k / sqrt(n)
Two points:

a) When you write "error of self-testing", is that the error of the rating distribution or of the difference distribution?

b) HGM wrote in a previous post:
hgm wrote:2) The ratings of A1 and A2 calculated by the rating program are not independent, but 100% anti-correlated. So the error in their difference adds directly, not through the sqrt stuff. You say the quoted data is synthetic, and I really wonder if BayesElo would report it like that. I think it should report 9 Elo error bars in the A1-A2 case, not 18, so that the error bar in the difference will be 18.
and referred to self-testing explicitly in that statement. I think he is right that independence of the ratings is not given for self-testing. Wouldn't that mean that the error of the self-testing difference distribution is "2*" the error of the rating distribution, instead of "sqrt(2)*"?

Sven
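
Sven's "2*" suggestion can be illustrated numerically. A minimal sketch (Python), assuming two equal versions play a pure head-to-head match and the rating program pins the average rating at zero, so r(A1) = -r(A2) by construction; the names and game counts are only for the example:

import math
import random

def elo(s):
    # logistic score-to-Elo conversion, clamped away from 0 and 1
    s = min(max(s, 1e-6), 1.0 - 1e-6)
    return -400.0 * math.log10(1.0 / s - 1.0)

random.seed(1)
n, trials = 1000, 5000
r1, r2 = [], []
for _ in range(trials):
    score = sum(random.random() < 0.5 for _ in range(n)) / n  # A1 vs A2
    d = elo(score)
    r1.append(+d / 2.0)   # zero-average constraint splits the difference
    r2.append(-d / 2.0)

def sd(xs):
    mu = sum(xs) / len(xs)
    return (sum((x - mu) ** 2 for x in xs) / (len(xs) - 1)) ** 0.5

print("sd of each rating:   ", round(sd(r1), 2))
print("sd of the difference:", round(sd([a - b for a, b in zip(r1, r2)]), 2))

The difference's spread comes out exactly twice the individual error bar, which is the "2*" Sven suggests, and it matches HGM's 9-versus-18 example.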
User avatar
Don
Posts: 5106
Joined: Tue Apr 29, 2008 4:27 pm

Re: obvious/easy move - final results

Post by Don »

AlvaroBegue wrote:
bob wrote:But again, I did not see anyone explain why the following is supposedly false:

I want to determine how much better version Y is than version X. I can play version Y against X for 30K games to get an error bar of +/-4. I can play version Y against ANYTHING for 30K games to get an error bar of +/-4. How can playing against version X require fewer games to get the same accuracy?
[Alvaro's explanation of the 4x factor snipped; quoted in full above]

Someone (Don?) said that you can get away with "just" a factor of 2 because presumably you have already run the ELO measurement of the previous version. However, this has a bit of a problem: if the testing of the previous version was lucky and got an ELO estimate inflated by 12 points (a 3-sigma event happens sometimes), then every change will tend to look bad, just because other tests are unlikely to be that lucky.
It was me who said it. However, my statement came with some qualifications, and I agree the right number is 4x. We have actually run across the problem you describe, at least in the past, when we primarily did gauntlets against foreign programs. We would run our 50,000 games and one version would get "lucky" by a few ELO. Even then we knew this could be a problem, because we would find it almost impossible to "beat" this version.

So self-testing solves this problem by simply testing against the same version over and over again. There is a tradeoff here, though: if you compare the two methods directly, you cannot say 4x when one method keeps re-testing the same version and the other does not.

However, I think 4x is the number you have to go with, because these other considerations are logistical and irrelevant from a cost-analysis point of view. You have convinced me that the self-testing method (even compared to a rotating gauntlet that saves computation) is more than 2x superior. First of all, if the new version tests better you never have to run the old version again, which is a savings comparable to the gauntlet's 2x saving I mentioned. And you get something for doing this even if it doesn't test better: some statistical protection against a temporary false read.
Capital punishment would be more effective as a preventive measure if it were administered prior to the crime.