ZirconiumX wrote:Wald.py crashes horribly for me.
Hmm. What is your version of Python?
It might be a white space issue. Python is white space sensitive.
Did you copy paste the script?
ZirconiumX wrote:Wald.py crashes horribly for me.
Are you running Python < 2.5? The Python ternary operator "a if b else c" was added then.
Matthew's output:
Code: Select all
matthew$ ./wald.py
  File "./wald.py", line 238
    return (- 2*math.pi*n*math.exp(gamma*A)*(1 if n%2==0 else -1))/(A**2*gamma**2 + math.pi**2 * n**2)
                                                ^
SyntaxError: invalid syntax
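For what it is worth, if upgrading Python is not an option, the conditional expression on that line can be rewritten in a form that pre-2.5 interpreters accept. A minimal sketch, with a made-up function name and signature since only line 238 is visible in the traceback:
Code: Select all
import math

# Hypothetical wrapper around the expression from line 238 of wald.py.
# (-1)**n equals +1 for even n and -1 for odd n, so it replaces the
# conditional expression "1 if n % 2 == 0 else -1" (added in Python 2.5).
def term(n, gamma, A):
    sign = (-1) ** n
    return (-2 * math.pi * n * math.exp(gamma * A) * sign) / \
           (A ** 2 * gamma ** 2 + math.pi ** 2 * n ** 2)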
Thank you for this interesting analysis.
hgm wrote:Note that when you repeatedly test against the same version (e.g. because on average you reject 4 changes before you accept one and promote it to new reference version) you use your games more efficiently when you play some extra games with the reference version. E.g. when A is your reference version, and you want to compare B, D, E and F with it by playing them all against C, and you want to do 5 x 20k = 100k games in total, you could play 20k-N games B-C, D-C, E-C and F-C each, and 20k+4*N games A-C. The Elo error in A-C is then proportional to 1/sqrt(20k+4*N), and those in the others to 1/sqrt(20k-N), so each of the differences has error sqrt(1/(20k-N) + 1/(20k+4*N)).
Now this error is minimal when V = 1/(20k-N) + 1/(20k+4*N) is minimal.
dV/dN = 1/(20k-N)^2 - 4/(20k+4*N)^2 = 0
1/(20k-N)^2 = 4/(20k+4*N)^2
1/(20k-N) = 2/(20k+4*N)
20k+4*N = 2*(20k-N)
20k+4*N = 40k-2*N
6*N = 20k
N = 3,333
So by playing 33k games A-C and 17k games for the others, you get a more accurate value for the difference of B, D, E, F with A. The error would drop from 0.01 to 0.0094867. Meaning that you could get the same error with only 90% of the games (30k A-C and 15k for each of the four others).
This comes at the expense of a poorer comparison between B, D, E and F. But this is a bit like a beta cutoff: you are not interested in determining which is the poorest of the changes you are going to reject!
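The same optimum can be found with a brute-force scan instead of the derivative; a small sketch (the 20k baseline and the four challengers come from the post above, the rest is illustrative):
Code: Select all
import math

BASE = 20000          # games per pairing in the even split
CHALLENGERS = 4       # B, D, E, F, all measured against A via C

def diff_error(n_extra):
    # Error of one difference (e.g. Elo(B) - Elo(A)), up to a common
    # proportionality constant, when A-C gets 4*n_extra extra games and
    # each challenger match gives up n_extra games.
    return math.sqrt(1.0 / (BASE - n_extra) +
                     1.0 / (BASE + CHALLENGERS * n_extra))

# Scan for the minimum instead of differentiating.
best_n = min(range(0, BASE), key=diff_error)
print(best_n)                    # 3333 = 20k/6 (rounded)
print(diff_error(0))             # ~0.01 (even split)
print(diff_error(best_n))        # ~0.009487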
Yes, I was sleepy and I missed that. Still, it is not relevant to the whole discussion (I do not understand why you want to compare default with exactdist). If you want to obtain errors for these cases with BE, you have to do it with the covariance model. Exactdist and default are gross approximations.
Daniel Shawul wrote:For the third time, I said the ones I am comparing are default and exactscore (which is what I have given the results for). For the third time, the difference between the two is exactly the Gaussian assumption.
michiguel wrote:That is not the only difference!
Daniel Shawul wrote:I am talking about default and exactdist, which is what we are comparing here. One gives 27 and the other 15, and the difference is exactly the Gaussian assumption.
No, both default and covariance assume Gaussian. That has nothing to do with this.
http://www.talkchess.com/forum/viewtopi ... &start=199
The point is that one assumes the opponent's rating is the true rating (a gross approximation that could be valid when there are many opponents and lots of games), and the other does not. That is why you get a difference that is close to 2.
From the link:
Default: assume opponents ratings are their true ratings, and Gaussian distribution.
"exactdist": assume opponents ratings are their true ratings, but does not assume Gaussian distribution. This will produce asymmetric intervals, especially for very high or very low winning rates. Cost is linear in the number of players.
"covariance": assume Gaussian distribution, but not that the rating of opponents are true. This may be very costly if you have thousands of players, but it is more accurate than the default. The cost is cubic in the number of players (it is a matrix inversion).

Miguel
BTW it occurred to me that with your simulation program you could actually verify if the results of wald are correct.
When I test Komodo versions head to head on one particular level on one particular machine, I get about 51% draws. Should I set the draw ratio to 0.51?
I'm running the simulation now and it is not looking good. But I'm also trying to check my own code for correctness.
Michel wrote:BTW it occurred to me that with your simulation program you could actually verify if the results of wald are correct.
I have never confirmed them by simulation, only by some obvious sanity checking (like verifying that certain probabilities sum to 1).
The mathematics for deriving the formulas is a bit complicated so one has to be on the lookout for mistakes.
Code: Select all
alpha: 0.05000000
beta: 0.05000000
WSC: 0.00729533
LSC: -0.00732605
DSC: -0.00003072
H0: -3.56278409
H1: 3.56285335
H1e: -0.00001033
(WALD) correct: 97045 incorrect: 2955 97.0450 145714.8 effort
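For anyone who wants to reproduce this kind of check independently, here is a self-contained sketch of a trinomial sequential (Wald/SPRT) simulation. It is not the model used by wald.py; the draw handling, thresholds and parameters below are illustrative assumptions:
Code: Select all
import math, random

def sprt_run(p_win, p_draw, score0, score1, alpha, beta, max_games=200000):
    # One sequential test: return True if H1 (score = score1) is accepted,
    # False if H0 (score = score0) is accepted.
    def probs(score):
        # Assumption: the draw ratio stays fixed and only the win/loss
        # split moves with the expected score.
        w = score - p_draw / 2.0
        return w, p_draw, 1.0 - w - p_draw
    w0, d0, l0 = probs(score0)
    w1, d1, l1 = probs(score1)
    lower = math.log(beta / (1.0 - alpha))    # accept H0 at or below this
    upper = math.log((1.0 - beta) / alpha)    # accept H1 at or above this
    llr = 0.0
    for _ in range(max_games):
        r = random.random()
        if r < p_win:
            llr += math.log(w1 / w0)
        elif r < p_win + p_draw:
            llr += math.log(d1 / d0)
        else:
            llr += math.log(l1 / l0)
        if llr >= upper:
            return True
        if llr <= lower:
            return False
    return llr > 0.0  # undecided after max_games: lean toward the closer bound

# Example: true expected score 0.53 with 51% draws, testing 0.50 vs 0.53.
random.seed(1)
runs = [sprt_run(p_win=0.53 - 0.51 / 2, p_draw=0.51,
                 score0=0.50, score1=0.53, alpha=0.05, beta=0.05)
        for _ in range(2000)]
print(sum(runs) / float(len(runs)))  # should land near 1 - beta = 0.95
Counting how often the wrong hypothesis is accepted, as in the table above, is then just a matter of running many such tests with a known true score and comparing the error fraction against alpha and beta.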
Exactdist is different from default (it gives half as much variance), so why do you say both are gross? Exactdist gives a result close to covariance.
michiguel wrote:Yes, I was sleepy and I missed that. Still, it is not relevant to the whole discussion (I do not understand why you want to compare default with exactdist). If you want to obtain errors for these cases with BE, you have to do it with the covariance model. Exactdist and default are gross approximations.
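Where a factor of roughly two can come from: if the opponent's rating is treated as exactly known, all of the uncertainty in the match result is attributed to one player, while letting both ratings float roughly doubles the variance of their difference. A toy sketch with made-up numbers (this is not BayesElo's actual computation):
Code: Select all
import math

# Toy numbers: a 20000-game match, per-game score variance bounded by 0.25.
n_games = 20000
var_per_game = 0.25

var_score = var_per_game / n_games        # variance of the average score

# Opponent's rating treated as exactly known: only one noisy estimate.
se_opponent_fixed = math.sqrt(var_score)

# Both ratings uncertain and (roughly) independent: variances add,
# so the variance of the difference is about twice as large.
se_both_uncertain = math.sqrt(var_score + var_score)

print(se_opponent_fixed)                  # ~0.0035 (score units)
print(se_both_uncertain)                  # ~0.0050, i.e. sqrt(2) larger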
Exactly. And the advantage is that once you have accepted a version as an improvement, based on a test with the target confidence but only 15k games, you will eventually get a 'second opinion' on it, while you are testing future changes by playing 15k games for them and 3k games with the new reference, for as long as the reference has not yet reached 30k.
Houdini wrote:It boils down to playing more games with the versions that are actual improvements and are kept as reference for future comparisons.
Correct, I'd rather play the extra 15k (or whatever number) games for the new reference version immediately after its "qualification".
hgm wrote:An alternative strategy could therefore be to immediately crank up the number of games from 15k to 30k for a version that qualified based on the 15k vs (old) 30k comparison. And then run the next version for 15k to compare against that.
Basically Larry and I already do that. When we have a version that beats everything else we immediately suspect that it's over-rated. In fact we made up a term for that, we call it "survivor bias", meaning that if you test 10 different programs the one that is on top is probably the luckiest one, not the best one! So we will generally run a lot more games to verify that it is really best.
hgm wrote:Exactly. And the advantage is that once you have accepted a version as an improvement, based on a test with the target confidence but only 15k games, you will eventually get a 'second opinion' on it.
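That survivor bias is easy to reproduce with a few lines of simulation; a sketch under made-up numbers (ten versions of identical true strength, 5 Elo of measurement noise per test run):
Code: Select all
import random
import statistics

random.seed(0)
TRIALS = 10000
CANDIDATES = 10   # ten versions with identical true strength (0 Elo apart)
NOISE = 5.0       # assumed Elo measurement error (std. dev.) of one test run

# Measured rating of whichever version happens to come out on top.
best_measured = [max(random.gauss(0.0, NOISE) for _ in range(CANDIDATES))
                 for _ in range(TRIALS)]
print(statistics.mean(best_measured))  # roughly +8 Elo, even though every
                                       # version is really identical
The top scorer of a batch of equally strong versions systematically measures several Elo too high, which is exactly why the extra verification games are needed.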
This causes a dilemma if you immediately run into a change that looks like it might qualify. An alternative strategy could therefore be to immediately crank up the number of games from 15k to 30k for a version that qualified based on the 15k vs (old) 30k comparison. And then run the next version for 15k to compare against that.