I'm using your code.
The 281 is not correlated to the 5%; it's the chance that the wrong conclusion is drawn due to stopping early.
Remember that without the 20-game margin (before starting to check for cut-offs) I got 2,310 rather than 281, everything else being equal. So without the margin you have a 23% chance of drawing the wrong conclusion.
cutechess-cli suggestion (to Ilari)
Moderators: hgm, Rebel, chrisw
-
- Posts: 62
- Joined: Mon Oct 03, 2011 9:40 pm
-
- Posts: 2272
- Joined: Mon Sep 29, 2008 1:50 am
Re: cutechess-cli suggestion (to Ilari)
lucasart wrote:
my method is correct
You can hardly call this "your method"; it is something that everyone thinks of.
Secondly, the method is definitely incorrect. Just plot the graph of a one-dimensional random walk, with time on the x axis and distance on the y axis. You can then also plot the region LOS < 95%: it is the region between two explicit curves. Stopping early means stopping when the random walk leaves the region. This is different from completing the test, as the random walk may leave the region and then return to it. And this really happens a lot.
-
- Posts: 62
- Joined: Mon Oct 03, 2011 9:40 pm
Re: cutechess-cli suggestion (to Ilari)
Edit: whoops too early in the morning.
Last edited by Zlaire on Mon Nov 07, 2011 9:50 am, edited 1 time in total.
-
- Posts: 62
- Joined: Mon Oct 03, 2011 9:40 pm
Re: cutechess-cli suggestion (to Ilari)
My question is: is it such a bad thing that the random walk returns to the region on rare occasions? According to my little test this happens 3% of the time, and in the other 97% it was never going to.
It may be statistically inaccurate, but in a practical sense I'd be willing to accept those 3% in return for a much, much faster test (one which is correct 97% of the time).
-
- Posts: 2272
- Joined: Mon Sep 29, 2008 1:50 am
Re: cutechess-cli suggestion (to Ilari)
Zlaire wrote:
My question is, is it such a bad thing that the random walk returns to the region on rare occasions? According to my little test this happens 3% of the time and in the other 97% it was never going to.
The problem is false positives. Confidence 95% means that there is a 5% probability of rejecting the null hypothesis when it holds. If you stop early with the naive LOS criterion, that probability will be much bigger than 5%.
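This claim is easy to check with a quick simulation. Below is a Monte-Carlo sketch in Python (not code from this thread; the 35/30/35 win/draw/loss split, the 1.645 quantile, the warmup and the game counts are all illustrative assumptions): two equal engines play, and the match stops as soon as the naive LOS criterion fires, which under these conditions is always a wrong conclusion.

```python
import math
import random

def naive_los_type1(n_sims=2000, max_games=2000, warmup=20, z=1.645, seed=1):
    """Fraction of equal-engine matches in which the naive LOS stop
    declares a winner, i.e. the Type I error rate."""
    rng = random.Random(seed)
    errors = 0
    for _ in range(n_sims):
        score, sq = 0, 0          # running score and sum of squared results
        for t in range(1, max_games + 1):
            u = rng.random()
            x = 1 if u < 0.35 else (0 if u < 0.65 else -1)  # 35/30/35 null
            score += x
            sq += x * x
            if t >= warmup:
                mean = score / t
                var = max(sq / t - mean * mean, 1e-12)
                if abs(mean) > z * math.sqrt(var / t):  # naive 95% LOS stop
                    errors += 1
                    break
    return errors / n_sims

rate = naive_los_type1()
print(rate)
```

Since the engines are equal, every early stop is a false positive, and the printed rate lands well above the nominal 5%.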
-
- Posts: 2272
- Joined: Mon Sep 29, 2008 1:50 am
Re: cutechess-cli suggestion (to Ilari)
I thought a bit about the problem mathematically (I am too lazy to google...).
Basically we have to compute a Gaussian integral over a cube in path space. Of course this integral cannot be evaluated exactly (even a Gaussian integral over an interval has no closed form, and here we are in an infinite-dimensional space).
However, we can try the "physics approach": replace the characteristic function of the cube by a Gaussian. Then the integral can (likely) be evaluated exactly.
The question is: how good an approximation is this to the original integral?
To get a feel for this, we add a fourth-order term to the exponential of the Gaussian approximating the hypercube (a fourth-order term can in principle improve the approximation). In physics terminology this would be an "interaction term".
Now we get an integral which can no longer be evaluated exactly. However, there is a standard approximation method using Feynman diagrams: an expansion in powers of the interaction coefficient. If the integral does not depend too much on the interaction term, then we can infer that we have a reasonable approximation.
Now I am sure that if this works, it has already been done...
Full disclosure: I have not checked whether this even works in one dimension (integrating a Gaussian over an interval).
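On the one-dimensional disclaimer: the Gaussian integral over an interval indeed has no elementary antiderivative, but it is exactly expressible through the error function, so the 1-D case at least can be checked numerically. A small Python illustration (not from the thread):

```python
import math

def gauss_interval(a, b):
    """P(a <= Z <= b) for a standard normal, via the error function."""
    return 0.5 * (math.erf(b / math.sqrt(2.0)) - math.erf(a / math.sqrt(2.0)))

def gauss_interval_quad(a, b, n=100000):
    """The same integral by the midpoint rule, as an independent check."""
    h = (b - a) / n
    total = sum(math.exp(-0.5 * (a + (i + 0.5) * h) ** 2) for i in range(n))
    return total * h / math.sqrt(2.0 * math.pi)

print(gauss_interval(-1.0, 1.0))  # ~0.6827, the familiar one-sigma mass
```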
-
- Posts: 62
- Joined: Mon Oct 03, 2011 9:40 pm
Re: cutechess-cli suggestion (to Ilari)
Yes, you wouldn't have a true 95% of course. But is it really "much" less than that?
What I'm proposing is that in these types of quick tests (190 games should take about 10 minutes) you don't really have to worry about 100% accuracy. And considering you only have a 95% LOS to aim for in the first place, is this extra 3% really a problem?
I guess smaller differences than my 55% win rate might cause problems. I need to do some more testing.
Keep in mind I'm keeping this in a practical quick-test sense. For a statistically accurate 95% LOS this is not valid.
-
- Posts: 2272
- Joined: Mon Sep 29, 2008 1:50 am
Re: cutechess-cli suggestion (to Ilari)
Zlaire wrote:
I guess smaller differences than my 55% win rate might cause problems. I need to do some more testing.
Yes, that is the wrong approach.
You have to test with win rate 50% and then measure the probability of
a false positive (a Type I error).
In fact you can tune the required LOS level for stopping early
so that you get 5% probability of a Type I error.
In that way you would have designed an "early stopping test"
which is equivalent to a traditional 95% test with respect to Type I errors.
To evaluate such a test one would then make a graph of the Type II errors:
the probability of accepting the null hypothesis when it is false (this graph has
the winning rate on the x axis and the acceptance probability on the y axis; for
x=50% we of course get y=95%).
This graph can also be made for the traditional test (this is easy).
By comparing the two graphs one can get a feeling for the performance of
both tests. I don't really know how this works out.
EDIT: Of course chess also has draws, which complicates things somewhat more.
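The tuning described above can be sketched by brute force (a Python sketch with illustrative parameters, not a recipe from the thread): simulate many matches between equal engines, record the most extreme normalized score each random walk ever reaches after the warmup, and take the 95th percentile of those maxima as the early-stopping cutoff. By construction only about 5% of null walks ever cross that cutoff, so the early-stopping test gets the intended Type I error.

```python
import math
import random

def tuned_threshold(n_sims=2000, max_games=1000, warmup=20, seed=2):
    """Calibrate the early-stopping cutoff on simulated null matches:
    the 95th percentile of each walk's most extreme normalized score."""
    rng = random.Random(seed)
    maxima = []
    for _ in range(n_sims):
        score, sq, worst = 0, 0, 0.0
        for t in range(1, max_games + 1):
            u = rng.random()
            x = 1 if u < 0.35 else (0 if u < 0.65 else -1)  # equal engines
            score += x
            sq += x * x
            if t >= warmup:
                mean = score / t
                var = max(sq / t - mean * mean, 1e-12)
                worst = max(worst, abs(mean) / math.sqrt(var / t))
        maxima.append(worst)
    maxima.sort()
    return maxima[int(0.95 * n_sims)]

z_star = tuned_threshold()
print(z_star)
```

The resulting cutoff comes out noticeably stricter than the fixed-sample 1.645, which is exactly the price of looking at the data after every game.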
-
- Posts: 3232
- Joined: Mon May 31, 2010 1:29 pm
- Full name: lucasart
Re: cutechess-cli suggestion (to Ilari)
I did my own testing today
=> Code
(in VBA, I did it today at work hiding behind a spreadsheet)
Code: Select all
Function generate&(pwin#, pdraw#)
    ' Returns +1 for a win, 0 for a draw, -1 for a loss.
    Dim x#
    x = Rnd
    generate = IIf(x < pwin, 1, IIf(x < pwin + pdraw, 0, -1))
End Function

Sub test()
    Dim quantile#: quantile = Application.WorksheetFunction.NormSInv(0.95)
    Const p# = 0.4, q# = 0.3, N& = 10000, max_games& = 10000, warmup& = 20
    Dim nwin&, ndraw&, ngames&, simul&, r&, phat#, qhat#, lhat#, muhat#, sigmahat#
    Dim Awins&, Bwins&, totalgames&
    Awins = 0: Bwins = 0: totalgames = 0
    Randomize
    For simul = 1 To N
        ndraw = 0: nwin = 0
        For ngames = 1 To max_games
            r = generate(p, q)
            If r = 1 Then
                nwin = nwin + 1
            ElseIf r = 0 Then
                ndraw = ndraw + 1
            End If
            ' Empirical win/draw/loss frequencies and mean score
            phat = nwin / ngames
            qhat = ndraw / ngames
            lhat = 1 - phat - qhat
            muhat = phat - lhat
            ' Score variance estimate; note the divisor is N (the simulation count)
            sigmahat = Sqr((phat * (1 - muhat) ^ 2 + qhat * (0 - muhat) ^ 2 + lhat * (-1 - muhat) ^ 2) / N)
            ' Naive LOS stop, only checked after the warmup games
            If (ngames >= warmup And sigmahat * quantile < Abs(muhat)) Then
                Exit For
            End If
        Next ngames
        If ngames < max_games Then
            totalgames = totalgames + ngames
            If muhat > 0 Then
                Awins = Awins + 1
            Else
                Bwins = Bwins + 1
            End If
        End If
    Next simul
    Debug.Print "A wins: " & Awins, "B wins: " & Bwins, "avg(nb_games) = " & totalgames / N, "error% = " & Bwins / N
End Sub
=> Results
I took the most realistic assumption: a true score of 55% = 40% wins + 30% draws (+ 30% losses). All game outcomes are independent, which is true in real chess (unless your program has a learning feature).
Dim quantile#: quantile# = Application.WorksheetFunction.NormSInv(0.95)
Const p# = 0.4, q# = 0.3, N& = 10000, max_games& = 10000, warmup = 20&
and ran it 10 times
A wins: 7028 B wins: 2972 avg(nb_games) = 20.1222 error% = 0.2972
A wins: 7134 B wins: 2866 avg(nb_games) = 20.126 error% = 0.2866
A wins: 7136 B wins: 2864 avg(nb_games) = 20.1289 error% = 0.2864
A wins: 7130 B wins: 2870 avg(nb_games) = 20.137 error% = 0.287
A wins: 7108 B wins: 2892 avg(nb_games) = 20.1367 error% = 0.2892
A wins: 7146 B wins: 2854 avg(nb_games) = 20.1299 error% = 0.2854
A wins: 7233 B wins: 2767 avg(nb_games) = 20.1258 error% = 0.2767
A wins: 7162 B wins: 2838 avg(nb_games) = 20.1316 error% = 0.2838
A wins: 7079 B wins: 2921 avg(nb_games) = 20.1286 error% = 0.2921
A wins: 7072 B wins: 2928 avg(nb_games) = 20.1322 error% = 0.2928
So, given an a-priori confidence level of 95%, we realise that the stopping-rule bias reduces this confidence level to only about 71%!
If we double the warmup (and therefore roughly double the number of games), the results are better, but the error rate is still far from 5%:
Dim quantile#: quantile# = Application.WorksheetFunction.NormSInv(0.95)
Const p# = 0.4, q# = 0.3, N& = 10000, max_games& = 10000, warmup = 40&
A wins: 7805 B wins: 2195 avg(nb_games) = 40.084 error% = 0.2195
A wins: 7707 B wins: 2293 avg(nb_games) = 40.0837 error% = 0.2293
A wins: 7848 B wins: 2152 avg(nb_games) = 40.0807 error% = 0.2152
A wins: 7771 B wins: 2229 avg(nb_games) = 40.076 error% = 0.2229
A wins: 7793 B wins: 2207 avg(nb_games) = 40.0802 error% = 0.2207
A wins: 7842 B wins: 2158 avg(nb_games) = 40.0816 error% = 0.2158
A wins: 7834 B wins: 2166 avg(nb_games) = 40.0827 error% = 0.2166
A wins: 7771 B wins: 2229 avg(nb_games) = 40.0857 error% = 0.2229
A wins: 7860 B wins: 2140 avg(nb_games) = 40.0816 error% = 0.214
A wins: 7815 B wins: 2185 avg(nb_games) = 40.0766 error% = 0.2185
=> Conclusion
My naive method based on a LOS stop just doesn't work!
Lucas
-
- Posts: 3232
- Joined: Mon May 31, 2010 1:29 pm
- Full name: lucasart
Re: cutechess-cli suggestion (to Ilari)
Now some GOOD NEWS. I tested the NAS algorithm presented in the paper. There are lots of mistakes and imprecisions in this paper, so I had to correct the stopping condition and assume epsilon = 1 (which is the only case we're interested in here).
It takes a lot more games to reach a conclusion, though, and the number of games needed varies a great deal.
=> Code
Code: Select all
Sub test()
    Randomize
    Const delta# = 0.05, p# = 0.4, q# = 0.3, N& = 10000
    Dim t&, X&, SX&, Xbar#, alpha#, Awins&, Bwins&, i&
    Awins = 0: Bwins = 0
    For i = 1 To N
        SX = 0: t = 0
        Do
            t = t + 1
            X = generate(p, q)   ' same win/draw/loss generator as before
            SX = SX + X
            Xbar = SX / t
            ' NAS stopping radius: shrinks roughly like Sqr(Log(t) / t)
            alpha = Sqr(0.5 / t * Log(t * (t + 1) / delta))
        Loop Until alpha < Abs(Xbar)
        If Xbar > 0 Then
            Awins = Awins + 1
        Else
            Bwins = Bwins + 1
        End If
    Next i
    Debug.Print "Awins: " & Awins, "Bwins: " & Bwins
End Sub
=> Results
It works!
Code: Select all
Awins: 9600 Bwins: 400
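One way to read the t * (t + 1) factor inside alpha: it spreads the total error budget delta over every possible stopping time via a union bound, and this works because the series 1/(t(t+1)) telescopes to 1. A quick arithmetic check (Python, not part of Lucas's code):

```python
# Partial sums of 1/(t(t+1)) = 1/t - 1/(t+1) telescope to 1 - 1/(T+1),
# so per-look error budgets of delta/(t(t+1)) sum to less than delta.
partial = sum(1.0 / (t * (t + 1)) for t in range(1, 10001))
print(partial)  # 1 - 1/10001, just under 1
```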
=> Next step
I'll test the EBStop algorithm. First I need to figure out how to compute the dt, which is not clearly explained in the paper. I'll probably find the answer in the full thesis paper.
Lucas