I'm using your code.
The 281 is not correlated to the 5%; it's the chance that the wrong conclusion is drawn due to stopping early.
Remember that without the 20-game margin (before starting to check for cut-offs) I got 2,310 rather than 281, everything else being equal. So without the margin you have a 23% chance of drawing the wrong conclusion.
cutechess-cli suggestion (to Ilari)
Moderators: hgm, Rebel, chrisw
-
- Posts: 62
- Joined: Mon Oct 03, 2011 9:40 pm
-
- Posts: 2272
- Joined: Mon Sep 29, 2008 1:50 am
Re: cutechess-cli suggestion (to Ilari)
lucasart wrote:
my method is correct
You can hardly call this "your method"; it is something that everyone thinks of.
Secondly, the method is definitely incorrect. Just plot the graph of a one-dimensional random walk, with time on the x axis and distance on the y axis. You can then also plot the region LOS < 95%: it is the region between two explicit curves. Stopping early means stopping when the random walk leaves the region. This is different from completing the test, as the random walk may leave the region and then return to it. And this really happens a lot.
-
- Posts: 62
- Joined: Mon Oct 03, 2011 9:40 pm
Re: cutechess-cli suggestion (to Ilari)
Edit: whoops too early in the morning.
Last edited by Zlaire on Mon Nov 07, 2011 9:50 am, edited 1 time in total.
-
- Posts: 62
- Joined: Mon Oct 03, 2011 9:40 pm
Re: cutechess-cli suggestion (to Ilari)
My question is: is it such a bad thing that the random walk returns to the region on rare occasions? According to my little test this happens 3% of the time, and in the other 97% it was never going to.
It may be statistically inaccurate, but in a practical sense I'd be willing to accept those 3% in return for a much, much faster test (one which is correct 97% of the time).
-
- Posts: 2272
- Joined: Mon Sep 29, 2008 1:50 am
Re: cutechess-cli suggestion (to Ilari)
Zlaire wrote:
My question is, is it such a bad thing that the random walk returns to the region on rare occasions? According to my little test this happens 3% of the time and in the other 97% it was never going to.
The problem is false positives. Confidence 95% means that there is a 5% probability of rejecting the null hypothesis when it holds. If you stop early with the naive LOS criterion, that probability will be much bigger than 5%.
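This claim is easy to check with a quick simulation. Below is a Monte-Carlo sketch in Python (not code from this thread; the 35/30/35 win/draw/loss split, the 1.645 quantile, the warmup and the game counts are all illustrative assumptions): two equal engines play, and the match stops as soon as the naive LOS criterion fires, which under these conditions is always a wrong conclusion.

```python
import math
import random

def naive_los_type1(n_sims=2000, max_games=2000, warmup=20, z=1.645, seed=1):
    """Fraction of equal-engine matches in which the naive LOS stop
    declares a winner, i.e. the Type I error rate."""
    rng = random.Random(seed)
    errors = 0
    for _ in range(n_sims):
        score, sq = 0, 0          # running score and sum of squared results
        for t in range(1, max_games + 1):
            u = rng.random()
            x = 1 if u < 0.35 else (0 if u < 0.65 else -1)  # 35/30/35 null
            score += x
            sq += x * x
            if t >= warmup:
                mean = score / t
                var = max(sq / t - mean * mean, 1e-12)
                if abs(mean) > z * math.sqrt(var / t):  # naive 95% LOS stop
                    errors += 1
                    break
    return errors / n_sims

rate = naive_los_type1()
print(rate)
```

Since the engines are equal, every early stop is a false positive, and the printed rate lands well above the nominal 5%.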
-
- Posts: 2272
- Joined: Mon Sep 29, 2008 1:50 am
Re: cutechess-cli suggestion (to Ilari)
I thought a bit about the problem mathematically (I am too lazy to google...).
Basically we have to compute a Gaussian integral over a cube in path space. Of course this integral cannot be evaluated exactly (even a Gaussian integral over an interval has no closed form, and here we are in an infinite-dimensional space).
However, we can try the "physics approach": replace the characteristic function of the cube by a Gaussian. Then the integral can (likely) be evaluated exactly.
The question is: how good an approximation is this to the original integral?
To get a feel for this, we add a fourth-order term to the exponential of the Gaussian approximating the hypercube (a fourth-order term can in principle improve the approximation). In physics terminology this would be an "interaction term".
Now we get an integral which can no longer be evaluated exactly. However, there is a standard approximation method using Feynman diagrams: an expansion in powers of the interaction coefficient. If the integral does not depend too much on the interaction term, then we can infer that we have a reasonable approximation.
Now I am sure that if this works, it has already been done...
Full disclosure: I have not checked whether this even works in one dimension (integrating a Gaussian over an interval).
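On the one-dimensional disclaimer: the Gaussian integral over an interval indeed has no elementary antiderivative, but it is exactly expressible through the error function, so the 1-D case at least can be checked numerically. A small Python illustration (not from the thread):

```python
import math

def gauss_interval(a, b):
    """P(a <= Z <= b) for a standard normal, via the error function."""
    return 0.5 * (math.erf(b / math.sqrt(2.0)) - math.erf(a / math.sqrt(2.0)))

def gauss_interval_quad(a, b, n=100000):
    """The same integral by the midpoint rule, as an independent check."""
    h = (b - a) / n
    total = sum(math.exp(-0.5 * (a + (i + 0.5) * h) ** 2) for i in range(n))
    return total * h / math.sqrt(2.0 * math.pi)

print(gauss_interval(-1.0, 1.0))  # ~0.6827, the familiar one-sigma mass
```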
-
- Posts: 62
- Joined: Mon Oct 03, 2011 9:40 pm
Re: cutechess-cli suggestion (to Ilari)
Yes, you wouldn't have a true 95% of course. But is it really "much" less than that?
What I'm proposing is that in these types of quick tests (190 games should take about 10 minutes) you don't really have to worry about 100% accuracy. And considering you only have a 95% LOS to aim for in the first place, is this extra 3% really a problem?
I guess smaller differences than my 55% win rate might cause problems. I need to do some more testing.
Keep in mind I'm keeping this in a practical quick-test sense. For a statistically accurate 95% LOS this is not valid.
-
- Posts: 2272
- Joined: Mon Sep 29, 2008 1:50 am
Re: cutechess-cli suggestion (to Ilari)
Zlaire wrote:
I guess smaller differences than my 55% win rate might cause problems. I need to do some more testing.
Yes, that is the wrong approach.
You have to test with win rate 50% and then measure the probability of
a false positive (a Type I error).
In fact you can tune the required LOS level for stopping early
so that you get 5% probability of a Type I error.
In that way you would have designed an "early stopping test"
which is equivalent to a traditional 95% test with respect to Type I errors.
To evaluate such a test one would then make a graph of the Type II errors:
the probability of accepting the null hypothesis when it is false (this graph has
the winning rate on the x axis and the acceptance probability on the y axis; for
x=50% we of course get y=95%).
This graph can also be made for the traditional test (this is easy).
By comparing the two graphs one can get a feeling for the performance of
both tests. I don't really know how this works out.
EDIT: Of course chess also has draws, which complicates things somewhat more.
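The tuning described above can be sketched by brute force (a Python sketch with illustrative parameters, not a recipe from the thread): simulate many matches between equal engines, record the most extreme normalized score each random walk ever reaches after the warmup, and take the 95th percentile of those maxima as the early-stopping cutoff. By construction only about 5% of null walks ever cross that cutoff, so the early-stopping test gets the intended Type I error.

```python
import math
import random

def tuned_threshold(n_sims=2000, max_games=1000, warmup=20, seed=2):
    """Calibrate the early-stopping cutoff on simulated null matches:
    the 95th percentile of each walk's most extreme normalized score."""
    rng = random.Random(seed)
    maxima = []
    for _ in range(n_sims):
        score, sq, worst = 0, 0, 0.0
        for t in range(1, max_games + 1):
            u = rng.random()
            x = 1 if u < 0.35 else (0 if u < 0.65 else -1)  # equal engines
            score += x
            sq += x * x
            if t >= warmup:
                mean = score / t
                var = max(sq / t - mean * mean, 1e-12)
                worst = max(worst, abs(mean) / math.sqrt(var / t))
        maxima.append(worst)
    maxima.sort()
    return maxima[int(0.95 * n_sims)]

z_star = tuned_threshold()
print(z_star)
```

The resulting cutoff comes out noticeably stricter than the fixed-sample 1.645, which is exactly the price of looking at the data after every game.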
-
- Posts: 3232
- Joined: Mon May 31, 2010 1:29 pm
- Full name: lucasart
Re: cutechess-cli suggestion (to Ilari)
I did my own testing today
=> Code
(in VBA, I did it today at work hiding behind a spreadsheet)
Code: Select all
Function generate&(pwin#, pdraw#)
    ' Returns +1 for a win, 0 for a draw, -1 for a loss.
    Dim x#
    x = Rnd
    generate = IIf(x < pwin, 1, IIf(x < pwin + pdraw, 0, -1))
End Function

Sub test()
    Dim quantile#: quantile = Application.WorksheetFunction.NormSInv(0.95)
    Const p# = 0.4, q# = 0.3, N& = 10000, max_games& = 10000, warmup& = 20
    Dim nwin&, ndraw&, ngames&, simul&, r&, phat#, qhat#, lhat#, muhat#, sigmahat#
    Dim Awins&, Bwins&, totalgames&
    Awins = 0: Bwins = 0: totalgames = 0
    Randomize
    For simul = 1 To N
        ndraw = 0: nwin = 0
        For ngames = 1 To max_games
            r = generate(p, q)
            If r = 1 Then
                nwin = nwin + 1
            ElseIf r = 0 Then
                ndraw = ndraw + 1
            End If
            ' Empirical win/draw/loss frequencies and mean score
            phat = nwin / ngames
            qhat = ndraw / ngames
            lhat = 1 - phat - qhat
            muhat = phat - lhat
            ' Score variance estimate; note the divisor is N (the simulation count)
            sigmahat = Sqr((phat * (1 - muhat) ^ 2 + qhat * (0 - muhat) ^ 2 + lhat * (-1 - muhat) ^ 2) / N)
            ' Naive LOS stop, only checked after the warmup games
            If (ngames >= warmup And sigmahat * quantile < Abs(muhat)) Then
                Exit For
            End If
        Next ngames
        If ngames < max_games Then
            totalgames = totalgames + ngames
            If muhat > 0 Then
                Awins = Awins + 1
            Else
                Bwins = Bwins + 1
            End If
        End If
    Next simul
    Debug.Print "A wins: " & Awins, "B wins: " & Bwins, "avg(nb_games) = " & totalgames / N, "error% = " & Bwins / N
End Sub
=> Results
I took the most realistic assumption: a true score of 55% = 40% wins + 30% draws (+ 30% losses). All game outcomes are independent, which is true in real chess (unless your program has a learning feature).
Dim quantile#: quantile# = Application.WorksheetFunction.NormSInv(0.95)
Const p# = 0.4, q# = 0.3, N& = 10000, max_games& = 10000, warmup = 20&
and ran it 10 times
A wins: 7028 B wins: 2972 avg(nb_games) = 20.1222 error% = 0.2972
A wins: 7134 B wins: 2866 avg(nb_games) = 20.126 error% = 0.2866
A wins: 7136 B wins: 2864 avg(nb_games) = 20.1289 error% = 0.2864
A wins: 7130 B wins: 2870 avg(nb_games) = 20.137 error% = 0.287
A wins: 7108 B wins: 2892 avg(nb_games) = 20.1367 error% = 0.2892
A wins: 7146 B wins: 2854 avg(nb_games) = 20.1299 error% = 0.2854
A wins: 7233 B wins: 2767 avg(nb_games) = 20.1258 error% = 0.2767
A wins: 7162 B wins: 2838 avg(nb_games) = 20.1316 error% = 0.2838
A wins: 7079 B wins: 2921 avg(nb_games) = 20.1286 error% = 0.2921
A wins: 7072 B wins: 2928 avg(nb_games) = 20.1322 error% = 0.2928
So, given an a-priori confidence level of 95%, we realise that the stopping-rule bias reduces this confidence level to only about 71%!
If we double the warmup (and therefore roughly double the number of games), the results are better, but the error rate is still far from 5%:
Dim quantile#: quantile# = Application.WorksheetFunction.NormSInv(0.95)
Const p# = 0.4, q# = 0.3, N& = 10000, max_games& = 10000, warmup = 40&
A wins: 7805 B wins: 2195 avg(nb_games) = 40.084 error% = 0.2195
A wins: 7707 B wins: 2293 avg(nb_games) = 40.0837 error% = 0.2293
A wins: 7848 B wins: 2152 avg(nb_games) = 40.0807 error% = 0.2152
A wins: 7771 B wins: 2229 avg(nb_games) = 40.076 error% = 0.2229
A wins: 7793 B wins: 2207 avg(nb_games) = 40.0802 error% = 0.2207
A wins: 7842 B wins: 2158 avg(nb_games) = 40.0816 error% = 0.2158
A wins: 7834 B wins: 2166 avg(nb_games) = 40.0827 error% = 0.2166
A wins: 7771 B wins: 2229 avg(nb_games) = 40.0857 error% = 0.2229
A wins: 7860 B wins: 2140 avg(nb_games) = 40.0816 error% = 0.214
A wins: 7815 B wins: 2185 avg(nb_games) = 40.0766 error% = 0.2185
=> Conclusion
My naive method based on a LOS stop just doesn't work!
Lucas
-
- Posts: 3232
- Joined: Mon May 31, 2010 1:29 pm
- Full name: lucasart
Re: cutechess-cli suggestion (to Ilari)
Now some GOOD NEWS. I tested the NAS algorithm presented in the paper. There are lots of mistakes and imprecisions in this paper, so I had to correct the stopping condition and assume epsilon = 1 (which is the only case we're interested in here).
It takes a lot more games to reach a conclusion, though, and the number of games needed varies a great deal.
=> Code
Code: Select all
Sub test()
    Randomize
    Const delta# = 0.05, p# = 0.4, q# = 0.3, N& = 10000
    Dim t&, X&, SX&, Xbar#, alpha#, Awins&, Bwins&, i&
    Awins = 0: Bwins = 0
    For i = 1 To N
        SX = 0: t = 0
        Do
            t = t + 1
            X = generate(p, q)   ' same win/draw/loss generator as before
            SX = SX + X
            Xbar = SX / t
            ' NAS stopping radius: shrinks roughly like Sqr(Log(t) / t)
            alpha = Sqr(0.5 / t * Log(t * (t + 1) / delta))
        Loop Until alpha < Abs(Xbar)
        If Xbar > 0 Then
            Awins = Awins + 1
        Else
            Bwins = Bwins + 1
        End If
    Next i
    Debug.Print "Awins: " & Awins, "Bwins: " & Bwins
End Sub
=> Results
It works!
Code: Select all
Awins: 9600 Bwins: 400
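One way to read the t * (t + 1) factor inside alpha: it spreads the total error budget delta over every possible stopping time via a union bound, and this works because the series 1/(t(t+1)) telescopes to 1. A quick arithmetic check (Python, not part of Lucas's code):

```python
# Partial sums of 1/(t(t+1)) = 1/t - 1/(t+1) telescope to 1 - 1/(T+1),
# so per-look error budgets of delta/(t(t+1)) sum to less than delta.
partial = sum(1.0 / (t * (t + 1)) for t in range(1, 10001))
print(partial)  # 1 - 1/10001, just under 1
```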
=> Next step
I'll test the EBStop algorithm. First I need to figure out how to compute the dt, which is not clearly explained in the paper. I'll probably find the answer in the full thesis paper.
Lucas