This is getting to be quite confusing, with differing answers depending on what software is assumed etc.
I'll restate the question unambiguously, in a software-independent way. Let's assume we are comparing two versions and want to play enough games to be able to tell which is stronger with an error margin of 5 Elo. We can either play each of them against a foreign gauntlet or play them directly against each other. Let's assume the draw percentage is the same in either case, and let's ignore the possibility that the results may be different even with an infinite sample size.
So the question is: Do we need the same number of total games to be played in each case, do we need twice as many in the case of gauntlet testing (i.e. the same number of games for each program as would be needed for a match between them), or do we need four times as many total games for the gauntlet method (i.e. twice as many games per engine X two engines tested)? It seems that all three answers could be inferred from various posts on this topic.
margin of error
-
- Posts: 5966
- Joined: Sun Jan 10, 2010 6:15 am
- Location: Maryland USA
-
- Posts: 4185
- Joined: Tue Mar 14, 2006 11:34 am
- Location: Ethiopia
Re: margin of error
hgm wrote:I would expect the A-B error to be 10 in that case, because the errors would perfectly anti-correlate.
ernest wrote:Why would they anti-correlate?... (instead of being independent) No no, I believe the best estimate for the A-B error is 7 (the root-sum-of-squares, sqrt(5^2 + 5^2) ≈ 7).
For A's score to rise, B's score has to fall, so the correlation coefficient is -1. If you play two separate gauntlets for A and B, they are independent, since A's score doesn't affect B's; there is no correlation, and the error is sqrt(D1^2 + D2^2).
-
- Posts: 27866
- Joined: Fri Mar 10, 2006 10:06 am
- Location: Amsterdam
- Full name: H G Muller
Re: margin of error
lkaufman wrote:So the question is: Do we need the same number of total games to be played in each case, do we need twice as many in the case of gauntlet testing (i.e. the same number of games for each program as would be needed for a match between them), or do we need four times as many total games for the gauntlet method (i.e. twice as many games per engine X two engines tested)?
You would need 4 times as many games. The direct confrontation would give you an error of 5 Elo. To get that same error from a difference of two measurements, you need to do each of those measurements to an accuracy of 3.5 Elo, because their errors will add (root-mean-square-wise) when taking the difference. And that requires twice as many games in each gauntlet. So four times as many in total.
Re: margin of error
lkaufman wrote:So the question is: Do we need the same number of total games to be played in each case, do we need twice as many in the case of gauntlet testing (i.e. the same number of games for each program as would be needed for a match between them), or do we need four times as many total games for the gauntlet method (i.e. twice as many games per engine X two engines tested)?
hgm wrote:You would need 4 times as many games. The direct confrontation would give you an error of 5 Elo. To get that same error from a difference of two measurements, you need to do each of those measurements to an accuracy of 3.5 Elo, because their errors will add (root-mean-square-wise) when taking the difference. And that requires twice as many games in each gauntlet. So four times as many in total.
I don't really understand this, because you are not really measuring the variance of A-B directly. You still have to calculate the variances and covariances of A and B separately and use the formula for A-B to get the final margin of error. Negative correlation drives up the variance of A-B, as we agreed, so independence of the results of A and B can only decrease the variance of A-B, provided the variances of A and B are kept the same. Let's take an example: a result of 200-200-100 between two engines.
Code:
Var(A)=Var(B)=1/5
Cov(A,B) = -1/5
Var(A-B)=1/5+1/5-2(-1/5)=4/5
Say instead that A played against C, and B played against C, in two independent experiments, and both scored 200-200-100 against C.
Code:
Var(A)=Var(B)=1/5
Cov(A,B)=0 even though Cov(A,C)=Cov(B,C)=-1/5
Var(A-B)= 1/5+1/5=2/5
--------------------------------------------------
Edit: In fact, the standard error of the difference is better in the case where each engine plays against a common third engine:
1st method: s.e. = sqrt(4/5) / sqrt(500) = sqrt(4 / (5 * 500))
2nd method: s.e. = sqrt(2/5) / sqrt(2 * 500) = sqrt(1 / (5 * 500))
We would need to play four times as many games between A and B to reach the same margin of error as the second method; since the second method itself uses 1000 games, that is twice as many games in total.
So on equal grounds, the second method requires half the number of games to reach the same standard error. Quite the opposite of your claim.
----------------------------------------------------
P.S.: I left out confidence intervals and standard errors (which would require division by sqrt(N)) in my variance calculations. So, Larry, you should be careful when applying the formula directly to confidence values.
P.S.1: Detailed calculations
Code:
Var(A) = (200(1-0.5)^2 + 200(0-0.5)^2 + 100(0.5-0.5)^2)/500 = 1/5
Cov(A,B) = (200(1-0.5)(0-0.5) + 200(0-0.5)(1-0.5) + 100(0.5-0.5)(0.5-0.5))/500 = -1/5
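Daniel's numbers above are easy to verify numerically. The following sketch (plain Python, nothing engine-specific; the 200-200-100 result is taken from his example) recomputes the per-game variance, the head-to-head covariance, and the resulting Var(A-B):

```python
# Numerical check of the 200-200-100 example: per-game score variance
# of A, covariance of A and B in a direct match, and Var(A-B).
wins, losses, draws = 200, 200, 100
n = wins + losses + draws

scores_a = [1.0] * wins + [0.0] * losses + [0.5] * draws
scores_b = [1.0 - s for s in scores_a]  # in a direct match B scores the complement

mean_a = sum(scores_a) / n  # 0.5
mean_b = sum(scores_b) / n  # 0.5

var_a = sum((a - mean_a) ** 2 for a in scores_a) / n
cov_ab = sum((a - mean_a) * (b - mean_b)
             for a, b in zip(scores_a, scores_b)) / n
var_diff = var_a + var_a - 2 * cov_ab  # Var(A-B), using Var(B) = Var(A)

print(var_a, cov_ab, var_diff)  # 0.2 -0.2 0.8  (i.e. 1/5, -1/5, 4/5)
```

This reproduces Var(A) = 1/5, Cov(A,B) = -1/5 and Var(A-B) = 4/5 as stated above.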
-
- Posts: 5106
- Joined: Tue Apr 29, 2008 4:27 pm
Re: margin of error
lkaufman wrote:So the question is: Do we need the same number of total games to be played in each case, do we need twice as many in the case of gauntlet testing (i.e. the same number of games for each program as would be needed for a match between them), or do we need four times as many total games for the gauntlet method (i.e. twice as many games per engine X two engines tested)?
hgm wrote:You would need 4 times as many games. The direct confrontation would give you an error of 5 Elo. To get that same error from a difference of two measurements, you need to do each of those measurements to an accuracy of 3.5 Elo, because their errors will add (root-mean-square-wise) when taking the difference. And that requires twice as many games in each gauntlet. So four times as many in total.
This is exactly correct - I ran a simulation to prove it empirically, but intuitively I already believed it would come out 4x.
Essentially I ran 20,000-game simulated matches where player B was 2 Elo stronger than player A, and counted how many times it returned a "correct" result. It turns out that it is correct about 87.3 percent of the time, give or take half a percent. Then I ran another test where both programs played a foreign opponent and compared their scores against the foreign player, measuring them indirectly. This was only reliable about 79% of the time, although it required twice as many games. When I ran this with 40,000-game matches it very closely matched the 87.3% of the initial run - but required 4x as many games.
Capital punishment would be more effective as a preventive measure if it were administered prior to the crime.
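A simulation in the spirit Don describes can be sketched as follows. This is a hypothetical reconstruction, not Don's actual code: the Elo edge (10 instead of 2), game counts, and trial counts are scaled down so it runs quickly, draws are ignored, and the third engine C is assumed to be exactly A's strength.

```python
import random

random.seed(1)  # fixed seed so the sketch is repeatable

ELO_EDGE = 10  # assumed: B is 10 Elo stronger than A
P_B = 1 / (1 + 10 ** (-ELO_EDGE / 400))  # B's expected score, no draws

def score(p, games):
    """Fraction of points scored by a side with per-game win probability p."""
    return sum(random.random() < p for _ in range(games)) / games

def direct(games):
    """Head-to-head: did the stronger engine B score over 50%?"""
    return score(P_B, games) > 0.5

def indirect(games):
    """Gauntlet: A and B each play `games` games vs a C of A's strength."""
    return score(P_B, games) > score(0.5, games)

TRIALS, GAMES = 400, 1000
direct_rate = sum(direct(GAMES) for _ in range(TRIALS)) / TRIALS
indirect_rate = sum(indirect(GAMES) for _ in range(TRIALS)) / TRIALS
indirect_4x = sum(indirect(2 * GAMES) for _ in range(TRIALS)) / TRIALS

# At equal match size, the direct match picks the stronger engine more
# reliably; doubling each gauntlet (4x total games) restores parity.
print(direct_rate, indirect_rate, indirect_4x)
```

With these scaled-down numbers the qualitative pattern matches Don's report: the gauntlet at 2x total games is noticeably less reliable than the direct match, and only at 4x total games do the two methods agree.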
Re: margin of error
lkaufman wrote:So the question is: Do we need the same number of total games to be played in each case, do we need twice as many in the case of gauntlet testing (i.e. the same number of games for each program as would be needed for a match between them), or do we need four times as many total games for the gauntlet method (i.e. twice as many games per engine X two engines tested)?
hgm wrote:You would need 4 times as many games. The direct confrontation would give you an error of 5 Elo. To get that same error from a difference of two measurements, you need to do each of those measurements to an accuracy of 3.5 Elo, because their errors will add (root-mean-square-wise) when taking the difference. And that requires twice as many games in each gauntlet. So four times as many in total.
Don wrote:This is exactly correct - I ran a simulation to prove it empirically, but intuitively I already believed it would come out 4x. Essentially I ran 20,000-game simulated matches where player B was 2 Elo stronger than player A, and counted how many times it returned a "correct" result. It turns out that it is correct about 87.3 percent of the time, give or take half a percent. Then I ran another test where both programs played a foreign opponent and compared their scores against the foreign player, measuring them indirectly. This was only reliable about 79% of the time, although it required twice as many games. When I ran this with 40,000-game matches it very closely matched the 87.3% of the initial run - but required 4x as many games.
Well then I am truly baffled. My calculation says playing A vs B directly requires twice as many games as doing A vs C and B vs C, and now you guys are saying it is the other way round, that it requires 4x as many games. So we have a difference of 8x.
Can you provide data and explain how you calculated the error bars? I am really curious now, as I believe negative correlation drives up the variance of A-B but decreases that of A+B, which I think is causing the confusion here. Well, maybe I am wrong, but we lack a good explanation...
-
- Posts: 6401
- Joined: Thu Mar 09, 2006 8:30 pm
- Location: Chicago, Illinois, USA
Re: margin of error
lkaufman wrote:So the question is: Do we need the same number of total games to be played in each case, do we need twice as many in the case of gauntlet testing (i.e. the same number of games for each program as would be needed for a match between them), or do we need four times as many total games for the gauntlet method (i.e. twice as many games per engine X two engines tested)?
hgm wrote:You would need 4 times as many games. The direct confrontation would give you an error of 5 Elo. To get that same error from a difference of two measurements, you need to do each of those measurements to an accuracy of 3.5 Elo, because their errors will add (root-mean-square-wise) when taking the difference. And that requires twice as many games in each gauntlet. So four times as many in total.
Don wrote:This is exactly correct - I ran a simulation to prove it empirically, but intuitively I already believed it would come out 4x. Essentially I ran 20,000-game simulated matches where player B was 2 Elo stronger than player A, and counted how many times it returned a "correct" result. It turns out that it is correct about 87.3 percent of the time, give or take half a percent. Then I ran another test where both programs played a foreign opponent and compared their scores against the foreign player, measuring them indirectly. This was only reliable about 79% of the time, although it required twice as many games. When I ran this with 40,000-game matches it very closely matched the 87.3% of the initial run - but required 4x as many games.
Daniel Shawul wrote:Well then I am truly baffled. My calculation says playing A vs B directly requires twice as many games as doing A vs C and B vs C, and now you guys are saying the other way round requires 4x as many games. So we have a difference of 8x. Can you provide data and explain how you calculated the error bars? I am really curious now, as I believe negative correlation drives up the variance of A-B but decreases that of A+B, which I think is causing the confusion here. Well, maybe I am wrong, but we lack a good explanation...
HGM's explanation is correct.
A plays B, and the difference (deltaAB) has an error Eab.
Then, with the same number of games, we can calculate deltaAC, and it will have error Eac = Eab (since the numbers of games are the same).
Also, with the same number of games, we can calculate deltaCB, and it will have error Ecb = Eab (since the numbers of games are the same).
So, we can calculate indirectly
deltaAB = deltaAC + deltaCB
Here we can already see that the error of this indirect calculation is bigger than Eab, no matter what, and we are already playing twice as many games.
deltaAC and deltaCB are independent, so the error for the indirect calculation is
IndirectError_ab = sqrt(Eac^2 + Ecb^2)
IndirectError_ab = sqrt(Eac^2 + Eac^2)
IndirectError_ab = sqrt(2*Eac^2)
IndirectError_ab = sqrt(2) * Eac
If we want the IndirectError_ab to be Eab, we have to make Eac = Eab/sqrt(2). We can do that by playing twice as many games, which makes the total 4x.
Miguel
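Miguel's algebra can be restated numerically with the 5-Elo figure from the original question (a sketch; the only assumption beyond the derivation above is that the error scales as 1/sqrt(N)):

```python
import math

Eab = 5.0        # error of the direct A-vs-B match, in Elo
Eac = Ecb = Eab  # same number of games in each gauntlet match

# Errors of independent measurements add root-mean-square-wise:
indirect = math.sqrt(Eac**2 + Ecb**2)  # worse than Eab, already at 2x games

# To push the indirect error back down to Eab, each gauntlet needs an
# error of Eab/sqrt(2); since error scales as 1/sqrt(N), that means
# (Eac/target)^2 = 2x the games per gauntlet, i.e. 4x games in total.
target = Eab / math.sqrt(2)
games_factor_per_gauntlet = (Eac / target) ** 2
total_factor = 2 * games_factor_per_gauntlet

print(round(indirect, 2), round(total_factor, 2))  # 7.07 4.0
```

The 7.07 here is the same root-sum-of-squares figure discussed earlier in the thread for two 5-Elo errors.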
Re: margin of error
lkaufman wrote:So the question is: Do we need the same number of total games to be played in each case, do we need twice as many in the case of gauntlet testing (i.e. the same number of games for each program as would be needed for a match between them), or do we need four times as many total games for the gauntlet method (i.e. twice as many games per engine X two engines tested)?
hgm wrote:You would need 4 times as many games. The direct confrontation would give you an error of 5 Elo. To get that same error from a difference of two measurements, you need to do each of those measurements to an accuracy of 3.5 Elo, because their errors will add (root-mean-square-wise) when taking the difference. And that requires twice as many games in each gauntlet. So four times as many in total.
Don wrote:This is exactly correct - I ran a simulation to prove it empirically, but intuitively I already believed it would come out 4x. Essentially I ran 20,000-game simulated matches where player B was 2 Elo stronger than player A, and counted how many times it returned a "correct" result. It turns out that it is correct about 87.3 percent of the time, give or take half a percent. Then I ran another test where both programs played a foreign opponent and compared their scores against the foreign player, measuring them indirectly. This was only reliable about 79% of the time, although it required twice as many games. When I ran this with 40,000-game matches it very closely matched the 87.3% of the initial run - but required 4x as many games.
Daniel Shawul wrote:Well then I am truly baffled. My calculation says playing A vs B directly requires twice as many games as doing A vs C and B vs C, and now you guys are saying the other way round requires 4x as many games. So we have a difference of 8x. Can you provide data and explain how you calculated the error bars? I am really curious now, as I believe negative correlation drives up the variance of A-B but decreases that of A+B, which I think is causing the confusion here. Well, maybe I am wrong, but we lack a good explanation...
Logically, if I play A vs B and use 20,000 games, both players have played 20,000 games. If I play A vs C, and then a second match B vs C, I have to play at least twice as many games: 20,000 for the first match and 20,000 for the second. In other words, I have wasted a lot of testing resources by involving a third party. So I think it's pretty obvious that the answer is at least 2x.
The other difference is that we are gauging the relative strength of the players indirectly, through player C. But the measurements against player C also have error margins, adding to the imprecision of the results. So it's more than a mere 2x, since we are involving a third party with the error it brings.
I don't know if this is a good analogy, but imagine that we were nearly the same height - an easy way to determine who was taller would be for us to stand back to back and compare directly. Probably both of us would be squirming around a bit, so let's say that each of us could be off by as much as a quarter of an inch in either direction.
Another way to see who is taller is to measure each of us separately, using (let's say) an equally imprecise method - a person holding a paper tape measure that is wrinkled up a bit - and he could also be accurate only to within 1/4 inch either way, as he would have his hands full wrestling with the tape. In the back-to-back measurement you have 2 sources of error; in the tape scenario you have 4 sources of error (the tape measurement is applied twice).
Re: margin of error
Miguel Ballicora wrote:HGM's explanation is correct.
A plays B, and the difference (deltaAB) has an error Eab.
Then, with the same number of games, we can calculate deltaAC, and it will have error Eac = Eab (since the numbers of games are the same).
Also, with the same number of games, we can calculate deltaCB, and it will have error Ecb = Eab (since the numbers of games are the same).
So, we can calculate indirectly
deltaAB = deltaAC + deltaCB
Here we can already see that the error of this indirect calculation is bigger than Eab, no matter what, and we are already playing twice as many games.
deltaAC and deltaCB are independent, so the error for the indirect calculation is
IndirectError_ab = sqrt(Eac^2 + Ecb^2)
IndirectError_ab = sqrt(Eac^2 + Eac^2)
IndirectError_ab = sqrt(2*Eac^2)
IndirectError_ab = sqrt(2) * Eac
If we want the IndirectError_ab to be Eab, we have to make Eac = Eab/sqrt(2). We can do that by playing twice as many games, which makes the total 4x.
Miguel
No, you are missing the inclusion of covariance completely. In the direct A vs B test you have a big covariance, and that affects the variance of A-B big time. Even HGM agreed that, for the example I gave of two standard errors of 5 Elo each, std(A-B) = 10, which your calculation ignores.
Re: margin of error
Don wrote:Logically, if I play A vs B and use 20,000 games, both players have played 20,000 games. If I play A vs C, and then a second match B vs C, I have to play at least twice as many games: 20,000 for the first match and 20,000 for the second. In other words, I have wasted a lot of testing resources by involving a third party. So I think it's pretty obvious that the answer is at least 2x.
Yes, but you too are forgetting covariance. Remember, Remi warned that my formula could be wrong since there is usually covariance. Please look at my calculation and see how the covariance affects A-B significantly: it is equal in magnitude to the variance.
Code:
For A vs B
var(A)=var(B)
cov(A,B)=-sqrt(var(A)var(B))=-var(A)
So var(A-B)=var(A)+var(A)-2(-var(A))=4var(A)
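As a cross-check of this formula (a sketch reusing the 200-200-100 example from earlier in the thread), Var(A-B) can also be computed directly from the per-game score differences, with no covariance bookkeeping at all:

```python
from statistics import pvariance

# Per-game score differences a_i - b_i in a 200-200-100 A-vs-B match:
# +1 for an A win, -1 for an A loss, 0 for a draw.
diffs = [1.0] * 200 + [-1.0] * 200 + [0.0] * 100

var_diff = pvariance(diffs)  # population variance of the differences
var_a = 0.2                  # Var(A) for the same result, from earlier

print(var_diff)              # 0.8, i.e. 4 * Var(A), matching the formula
```

The direct computation and the var(A)+var(B)-2cov(A,B) route agree, which is the point of the covariance term.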