ZirconiumX wrote:Wald.py crashes horribly for me.
Hmm. What is your version of Python?
It might be a white space issue. Python is white space sensitive.
Did you copy paste the script?
ZirconiumX wrote:Wald.py crashes horribly for me.
Are you running Python < 2.5? The Python ternary operator "a if b else c" was added then.
Matthew's output:
Code: Select all
matthew$ ./wald.py
  File "./wald.py", line 238
    return (- 2*math.pi*n*math.exp(gamma*A)*(1 if n%2==0 else -1))/(A**2*gamma**2 + math.pi**2 * n**2)
                                                ^
SyntaxError: invalid syntax
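For what it is worth, if upgrading Python is not an option, the conditional expression on that line can be rewritten in a form that pre-2.5 interpreters accept. A minimal sketch, with a made-up function name and signature since only line 238 is visible in the traceback:
Code: Select all
import math

# Hypothetical wrapper around the expression from line 238 of wald.py.
# (-1)**n equals +1 for even n and -1 for odd n, so it replaces the
# conditional expression "1 if n % 2 == 0 else -1" (added in Python 2.5).
def term(n, gamma, A):
    sign = (-1) ** n
    return (-2 * math.pi * n * math.exp(gamma * A) * sign) / \
           (A ** 2 * gamma ** 2 + math.pi ** 2 * n ** 2)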
Thank you for this interesting analysis.
hgm wrote:Note that when you repeatedly test against the same version (e.g. because on average you reject 4 changes before you accept one and promote it to new reference version) you use your games more efficiently when you play some extra games with the reference version. E.g. when A is your reference version, and you want to compare B, D, E and F with it by playing them all against C, and you want to do 5 x 20k = 100k games in total, you could play 20k-N games B-C, D-C, E-C and F-C each, and 20k+4*N games A-C. The Elo error in A-C is then proportional to 1/sqrt(20k+4*N), and those in the others to 1/sqrt(20k-N), so each of the differences has error sqrt(1/(20k-N) + 1/(20k+4*N)).
Now this error is minimal when V = 1/(20k-N) + 1/(20k+4*N) is minimal.
dV/dN = 1/(20k-N)^2 - 4/(20k+4*N)^2 = 0
1/(20k-N)^2 = 4/(20k+4*N)^2
1/(20k-N) = 2/(20k+4*N)
20k+4*N = 2*(20k-N)
20k+4*N = 40k-2*N
6*N = 20k
N = 3,333
So by playing 33k games A-C and 17k games for the others, you get a more accurate value for the difference of B, D, E, F with A. The error would drop from 0.01 to 0.0094867. Meaning that you could get the same error with only 90% of the games (30k A-C and 15k for each of the four others).
This comes at the expense of a poorer comparison between B, D, E and F. But this is a bit like a beta cutoff: you are not interested in determining which is the poorest of the changes you are going to reject!
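The same optimum can be found with a brute-force scan instead of the derivative; a small sketch (the 20k baseline and the four challengers come from the post above, the rest is illustrative):
Code: Select all
import math

BASE = 20000          # games per pairing in the even split
CHALLENGERS = 4       # B, D, E, F, all measured against A via C

def diff_error(n_extra):
    # Error of one difference (e.g. Elo(B) - Elo(A)), up to a common
    # proportionality constant, when A-C gets 4*n_extra extra games and
    # each challenger match gives up n_extra games.
    return math.sqrt(1.0 / (BASE - n_extra) +
                     1.0 / (BASE + CHALLENGERS * n_extra))

# Scan for the minimum instead of differentiating.
best_n = min(range(0, BASE), key=diff_error)
print(best_n)                    # 3333 = 20k/6 (rounded)
print(diff_error(0))             # ~0.01 (even split)
print(diff_error(best_n))        # ~0.009487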
Yes, I was sleepy and I missed that. Still, it is not relevant to the whole discussion (I do not understand why you want to compare default with exactdist). If you want to obtain errors for these cases with BE, you have to do it with the covariance model. Exactdist and default are gross approximations.
Daniel Shawul wrote:For the third time, I said the ones I am comparing are default and exactscore (which is what I have given the results for). For the third time, the difference between the two is exactly the Gaussian assumption.
michiguel wrote:That is not the only difference!
Daniel Shawul wrote:I am talking about default and exactdist, which is what we are comparing here. One gives 27 and the other 15, and the difference is exactly the Gaussian assumption.
No, both default and covariance assume Gaussian. That has nothing to do with this.
http://www.talkchess.com/forum/viewtopi ... &start=199
The point is that one assumes the opponent's rating is the true rating (a gross approximation that could be valid when there are many opponents and lots of games), and the other does not. That is why you get a difference that is close to 2.
From the link:
Default: assume opponents ratings are their true ratings, and Gaussian distribution.
"exactdist": assume opponents ratings are their true ratings, but does not assume Gaussian distribution. This will produce asymmetric intervals, especially for very high or very low winning rates. Cost is linear in the number of players.
"covariance": assume Gaussian distribution, but not that the rating of opponents are true. This may be very costly if you have thousands of players, but it is more accurate than the default. The cost is cubic in the number of players (it is a matrix inversion).

Miguel
BTW it occurred to me that with your simulation program you could actually verify if the results of wald are correct.
When I test Komodo versions head to head on one particular level on one particular machine, I get about 51% draws. Should I set the draw ratio to 0.51?
I'm running the simulation now and it is not looking good. But I'm also trying to check my own code for correctness.
Michel wrote:BTW it occurred to me that with your simulation program you could actually verify if the results of wald are correct.
I have never confirmed them by simulation, only by some obvious sanity checking (like verifying that certain probabilities sum to 1).
The mathematics for deriving the formulas is a bit complicated so one has to be on the lookout for mistakes.
Code: Select all
alpha: 0.05000000
beta: 0.05000000
WSC: 0.00729533
LSC: -0.00732605
DSC: -0.00003072
H0: -3.56278409
H1: 3.56285335
H1e: -0.00001033
(WALD) correct: 97045 incorrect: 2955 97.0450 145714.8 effort
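For anyone who wants to reproduce this kind of check independently, here is a self-contained sketch of a trinomial sequential (Wald/SPRT) simulation. It is not the model used by wald.py; the draw handling, thresholds and parameters below are illustrative assumptions:
Code: Select all
import math, random

def sprt_run(p_win, p_draw, score0, score1, alpha, beta, max_games=200000):
    # One sequential test: return True if H1 (score = score1) is accepted,
    # False if H0 (score = score0) is accepted.
    def probs(score):
        # Assumption: the draw ratio stays fixed and only the win/loss
        # split moves with the expected score.
        w = score - p_draw / 2.0
        return w, p_draw, 1.0 - w - p_draw
    w0, d0, l0 = probs(score0)
    w1, d1, l1 = probs(score1)
    lower = math.log(beta / (1.0 - alpha))    # accept H0 at or below this
    upper = math.log((1.0 - beta) / alpha)    # accept H1 at or above this
    llr = 0.0
    for _ in range(max_games):
        r = random.random()
        if r < p_win:
            llr += math.log(w1 / w0)
        elif r < p_win + p_draw:
            llr += math.log(d1 / d0)
        else:
            llr += math.log(l1 / l0)
        if llr >= upper:
            return True
        if llr <= lower:
            return False
    return llr > 0.0  # undecided after max_games: lean toward the closer bound

# Example: true expected score 0.53 with 51% draws, testing 0.50 vs 0.53.
random.seed(1)
runs = [sprt_run(p_win=0.53 - 0.51 / 2, p_draw=0.51,
                 score0=0.50, score1=0.53, alpha=0.05, beta=0.05)
        for _ in range(2000)]
print(sum(runs) / float(len(runs)))  # should land near 1 - beta = 0.95
Counting how often the wrong hypothesis is accepted, as in the table above, is then just a matter of running many such tests with a known true score and comparing the error fraction against alpha and beta.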
Exactdist is different from default (it gives half as much variance), so why do you say both are gross? Exactdist gives a result close to covariance.
michiguel wrote:Yes, I was sleepy and I missed that. Still, it is not relevant to the whole discussion (I do not understand why you want to compare default with exactdist). If you want to obtain errors for these cases with BE, you have to do it with the covariance model. Exactdist and default are gross approximations.
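Where a factor of roughly two can come from: if the opponent's rating is treated as exactly known, all of the uncertainty in the match result is attributed to one player, while letting both ratings float roughly doubles the variance of their difference. A toy sketch with made-up numbers (this is not BayesElo's actual computation):
Code: Select all
import math

# Toy numbers: a 20000-game match, per-game score variance bounded by 0.25.
n_games = 20000
var_per_game = 0.25

var_score = var_per_game / n_games        # variance of the average score

# Opponent's rating treated as exactly known: only one noisy estimate.
se_opponent_fixed = math.sqrt(var_score)

# Both ratings uncertain and (roughly) independent: variances add,
# so the variance of the difference is about twice as large.
se_both_uncertain = math.sqrt(var_score + var_score)

print(se_opponent_fixed)                  # ~0.0035 (score units)
print(se_both_uncertain)                  # ~0.0050, i.e. sqrt(2) larger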
Exactly. And the advantage is that once you have accepted a version as an improvement, based on a test with the target confidence but only 15k games, you will eventually get a 'second opinion' on it, while you are testing future changes by playing 15k games for them and 3k games with the new reference, for as long as the reference has not yet reached 30k.
Houdini wrote:It boils down to playing more games with the versions that are actual improvements and are kept as reference for future comparisons.
Correct, I'd rather play the extra 15k (or whatever number) games for the new reference version immediately after its "qualification".
hgm wrote:An alternative strategy could therefore be to immediately crank up the number of games from 15k to 30k for a version that qualified based on the 15k vs (old) 30k comparison. And then run the next version for 15k to compare against that.
Basically Larry and I already do that. When we have a version that beats everything else we immediately suspect that it's over-rated. In fact we made up a term for that, we call it "survivor bias", meaning that if you test 10 different programs the one that is on top is probably the luckiest one, not the best one! So we will generally run a lot more games to verify that it is really best.
hgm wrote:Exactly. And the advantage is that once you have accepted a version as an improvement, based on a test with the target confidence but only 15k games, you will eventually get a 'second opinion' on it.
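That survivor bias is easy to reproduce with a few lines of simulation; a sketch under made-up numbers (ten versions of identical true strength, 5 Elo of measurement noise per test run):
Code: Select all
import random
import statistics

random.seed(0)
TRIALS = 10000
CANDIDATES = 10   # ten versions with identical true strength (0 Elo apart)
NOISE = 5.0       # assumed Elo measurement error (std. dev.) of one test run

# Measured rating of whichever version happens to come out on top.
best_measured = [max(random.gauss(0.0, NOISE) for _ in range(CANDIDATES))
                 for _ in range(TRIALS)]
print(statistics.mean(best_measured))  # roughly +8 Elo, even though every
                                       # version is really identical
The top scorer of a batch of equally strong versions systematically measures several Elo too high, which is exactly why the extra verification games are needed.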
This causes a dilemma if you immediately run into a change that looks like it might qualify. An alternative strategy could therefore be to immediately crank up the number of games from 15k to 30k for a version that qualified based on the 15k vs (old) 30k comparison. And then run the next version for 15k to compare against that.