Beta for Stockfish distributed testing

Discussion of anything and everything relating to chess playing software and machines.

Moderator: Ras

lucasart
Posts: 3243
Joined: Mon May 31, 2010 1:29 pm
Full name: lucasart

Re: Beta for Stockfish distributed testing.

Post by lucasart »

Ajedrecista wrote:I obtain slightly better results if I divide by (N - 1) instead of N:

Code: Select all

stdev = math.sqrt(w*(1-mu)**2 + l*(0-mu)**2 + d*(0.5-mu)**2) / math.sqrt(N-1)
Oh come on! Do you have to spam everything I do?
*I* am the one who lectured you about the N-1 last time. Remember?
- Yes, N-1 gives you an unbiased estimator of the variance on a finite sample.
- But the results we use are only *asymptotic*, so you're really splitting hairs, and it won't make a measurable difference for N = 16,000 games. I just used N to avoid divide-by-zero problems.
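As a quick numerical sketch (the win/loss/draw counts below are made up for illustration, not real fishtest data), the two denominators differ only by a factor of sqrt(N / (N - 1)), roughly 1 + 1/(2N):

```python
# A sketch (hypothetical counts) comparing the N and N - 1 denominators
# in the stdev formula quoted above.
import math

w, l, d = 5000, 4800, 6200        # made-up win/loss/draw counts
N = w + l + d                     # 16,000 games
mu = (w + 0.5 * d) / N            # empirical mean score per game

ss = w * (1 - mu) ** 2 + l * (0 - mu) ** 2 + d * (0.5 - mu) ** 2
stdev_n = math.sqrt(ss) / math.sqrt(N)
stdev_n1 = math.sqrt(ss) / math.sqrt(N - 1)

# The ratio is sqrt(N / (N - 1)), about 1 + 1/(2N)
print(stdev_n, stdev_n1)
```

For N = 16,000 the relative difference is around 0.003%, far below the statistical noise of the estimate itself.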
Ajedrecista wrote:

Code: Select all

if los < 0.05:
  result['style'] = '#FF6A6A'
elif los > 0.95:
  result['style'] = '#44EB44'

return result
Given that LOS is a one-sided test and error bars for a given confidence level are two-sided tests (if I am not wrong), I think that the correct thing is:

Code: Select all

if los < 0.025:
  result['style'] = '#FF6A6A'
elif los > 0.975:
  result['style'] = '#44EB44'

return result
You *are* wrong. We want a unilateral test! What we want is:
* green if Proba(elo > 0) >= 95%
* red if Proba(elo > 0) <= 5%
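For reference, a minimal sketch of that one-sided rule, assuming the normal-approximation LOS discussed in this thread (the function names are illustrative, not fishtest's actual code):

```python
# A sketch of the one-sided colouring rule under the usual normal
# approximation; los() and colour() are illustrative names.
import math

def los(w, l, d):
    """Proba(elo > 0): likelihood of superiority under a normal approximation."""
    N = w + l + d
    mu = (w + 0.5 * d) / N                       # mean score per game
    var = (w * (1 - mu) ** 2 + l * (0 - mu) ** 2
           + d * (0.5 - mu) ** 2) / N            # per-game variance
    sigma_mean = math.sqrt(var / N)              # std error of the mean score
    return 0.5 * (1 + math.erf((mu - 0.5) / (sigma_mean * math.sqrt(2))))

def colour(w, l, d):
    p = los(w, l, d)
    if p < 0.05:
        return '#FF6A6A'     # red: Proba(elo > 0) <= 5%
    elif p > 0.95:
        return '#44EB44'     # green: Proba(elo > 0) >= 95%
    return None              # no colour otherwise
```

With these cutoffs an exactly equal engine gets coloured (one way or the other) about 10% of the time, which is where the 90% vs 95% figures debated in this thread come from.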
Theory and practice sometimes clash. And when that happens, theory loses. Every single time.
gladius
Posts: 568
Joined: Tue Dec 12, 2006 10:10 am
Full name: Gary Linscott

Re: Beta for Stockfish distributed testing

Post by gladius »

geots wrote:
ThomasJMiller wrote:
Machine: glinscott - 23 cores
is that just one computer or the sum of several?

Flip a coin..... Thing is, quality control is headed south of the border. It's a nice thought, and you can take your best shot at it- but it'll be quasi at best. Too many hands, too many different personalities and too much different hardware. Everything comes with a price.


gts
Thank you for your constructive feedback. What alternative do you propose? Especially for an engine that is free (i.e. no money to spend on hardware).
mcostalba
Posts: 2684
Joined: Sat Jun 14, 2008 9:17 pm

Re: Beta for Stockfish distributed testing.

Post by mcostalba »

lucasart wrote: You *are* wrong. We want a unilateral test! What we want is:
* green if Proba(elo > 0) >= 95%
* red if Proba(elo > 0) <= 5%
Ok, perhaps I committed too early... sorry for that. I will wait until there is a consensus among statistics experts (I am not one) before reverting.
mcostalba
Posts: 2684
Joined: Sat Jun 14, 2008 9:17 pm

Re: Beta for Stockfish distributed testing.

Post by mcostalba »

mcostalba wrote:
lucasart wrote: You *are* wrong. We want a unilateral test! What we want is:
* green if Proba(elo > 0) >= 95%
* red if Proba(elo > 0) <= 5%
Ok, perhaps I committed too early... sorry for that. I will wait until there is a consensus among statistics experts (I am not one) before reverting.
BTW what about

red if Proba(elo < 0) >= 95%

Would this change the formula?
mcostalba
Posts: 2684
Joined: Sat Jun 14, 2008 9:17 pm

Re: Beta for Stockfish distributed testing

Post by mcostalba »

geots wrote: Flip a coin..... Thing is, quality control is headed south of the border. It's a nice thought, and you can take your best shot at it- but it'll be quasi at best. Too many hands, too many different personalities and too much different hardware. Everything comes with a price.
Oh my gosh !

I would say that _your_ comments on this subject are a coin flip. Not that you are necessarily wrong, but given your background on fishtest, writing a line about this stuff really is like flipping a coin for you.
User avatar
Ajedrecista
Posts: 2217
Joined: Wed Jul 13, 2011 9:04 pm
Location: Madrid, Spain.

Re: Beta for Stockfish distributed testing.

Post by Ajedrecista »

Hello Marco:
mcostalba wrote:
lucasart wrote: You *are* wrong. We want a unilateral test! What we want is:
* green if Proba(elo > 0) >= 95%
* red if Proba(elo > 0) <= 5%
Ok, perhaps I committed too early... sorry for that. I will wait until there is a consensus among statistics experts (I am not one) before reverting.
I am not an expert, so if you want to revert the commit I will not be annoyed. In fact, I enjoy this form of open testing a lot. Good luck with SF!
mcostalba wrote:BTW what about

red if Proba(elo < 0) >= 95%

Would this change the formula?
I see Proba(elo < 0) >= 95% and Proba(elo > 0) < 5% as totally equivalent. Just remember that I am not an expert.
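As a quick sanity check of this equivalence (a sketch with made-up counts, using the normal-approximation LOS from this thread): the approximating distribution is continuous, so Proba(elo < 0) = 1 - Proba(elo > 0) and the two conditions select the same results.

```python
# Numerical check (hypothetical counts) that the two conditions coincide:
# under the continuous normal approximation, Proba(elo = 0) is zero, so
# Proba(elo < 0) = 1 - Proba(elo > 0).
import math

def los(w, l, d):
    # Proba(elo > 0) via the normal approximation used in this thread
    N = w + l + d
    mu = (w + 0.5 * d) / N
    var = (w * (1 - mu) ** 2 + l * (0 - mu) ** 2 + d * (0.5 - mu) ** 2) / N
    return 0.5 * (1 + math.erf((mu - 0.5) / math.sqrt(2 * var / N)))

w, l, d = 450, 550, 1000
p_sup = los(w, l, d)     # Proba(elo > 0)
p_inf = 1.0 - p_sup      # Proba(elo < 0)
print(p_inf >= 0.95, p_sup <= 0.05)   # the two red conditions agree
```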

------------------------

@Lucas:
lucasart wrote:*I* am the one who lectured you about the N-1 last time. Remember?
- Yes, N-1 gives you an unbiased estimator of the variance on a finite sample.
- But the results we use are only *asymptotic*, so you're really splitting hairs, and it won't make a measurable difference for N = 16,000 games. I just used N to avoid divide-by-zero problems.
Of course I remember that you suggested (N - 1) to me, but I already knew it before you wrote it in the forum. I simply had not tried it in my programmes until you brought it up here. (N - 1) will have a small impact when the number of games is low (almost none with thousands of games), although I know that conclusions cannot be drawn so early. Regarding the divide-by-zero issue, I recall that if games < 10, the status was 'Pending...' instead of doing calculations... however, I cannot find it in the sources anymore.
lucasart wrote:You *are* wrong. We want a unilateral test! What we want is:
* green if Proba(elo > 0) >= 95%
* red if Proba(elo > 0) <= 5%
I know that we want a one-sided test like LOS. It seems that I misunderstood the bounds: I thought that the desired behaviour was to leave results uncoloured 95% of the time (97.5% - 2.5% = 95%), while you reduce it to 90% (95% - 5% = 90%), if I understand correctly.

------------------------

Just a thought: it seems that with each passing day it becomes more difficult to write something here on TalkChess without someone 'biting' you. It looks like the day I stop posting here must come soon, for the sake of everybody.

Gary and Marco: sorry for this post being clearly off-topic. My intention was never to spam a thread. Again, tons of good luck with SF, which is an engine that I like a lot.

Regards from Spain.

Ajedrecista.
Adam Hair
Posts: 3226
Joined: Wed May 06, 2009 10:31 pm
Location: Fuquay-Varina, North Carolina

Re: Beta for Stockfish distributed testing.

Post by Adam Hair »

lucasart wrote:
Ajedrecista wrote:I obtain slightly better results if I divide by (N - 1) instead of N:

Code: Select all

stdev = math.sqrt(w*(1-mu)**2 + l*(0-mu)**2 + d*(0.5-mu)**2) / math.sqrt(N-1)
Oh come on! Do you have to spam everything I do ?
I am at a loss here. I have not seen where Jesús has spammed you. I do see that he actively tries to contribute to discussions he has some interest in. The same as you. The only difference is that he is much more polite than you are.
lucasart
Posts: 3243
Joined: Mon May 31, 2010 1:29 pm
Full name: lucasart

Re: Beta for Stockfish distributed testing.

Post by lucasart »

This forum is obviously not the right place to talk about statistics.

I did a study of the type I and type II error risks of the SF testing methodology: the current test, and various early-stopping rules. It's on the FishCooking Google group.
Theory and practice sometimes clash. And when that happens, theory loses. Every single time.
Michel
Posts: 2292
Joined: Mon Sep 29, 2008 1:50 am

Re: Beta for Stockfish distributed testing.

Post by Michel »

I have one small comment.

To compute LOS accurately one should normally discard draws. This has been proved by Remy in this post http://www.talkchess.com/forum/viewtopi ... 05&t=30624 . From the formulas I see floating around here this is not being done.

It could be that you also get (a good approximation to) the right answer if you don't discard draws. I did not check this.

PS. The discard draw rule is of course only valid for a match with two participants.
lucasart
Posts: 3243
Joined: Mon May 31, 2010 1:29 pm
Full name: lucasart

Re: Beta for Stockfish distributed testing.

Post by lucasart »

Michel wrote:I have one small comment.

To compute LOS accurately one should normally discard draws. This has been proved by Remy in this post http://www.talkchess.com/forum/viewtopi ... 05&t=30624 . From the formulas I see floating around here this is not being done.

It could be that you also get (a good approximation to) the right answer if you don't discard draws. I did not check this.

PS. The discard draw rule is of course only valid for a match with two participants.
Try both formulas numerically, and you'll get the same results! What I am doing is simply computing the empirical variance and mean, and using the Gaussian approximation. Nothing fancy there. The LOS formula can be shown to be equivalent to the binomial case where you have removed the draws (asymptotically?).
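To back that up, here is a small numerical comparison (a sketch with arbitrary example scores, using the standard normal-approximation formulas rather than fishtest's exact code) of LOS with draws kept versus draws discarded:

```python
# Sketch: LOS from the trinomial normal approximation (draws kept)
# versus the binomial one (draws discarded). Counts are made up.
import math

def los_with_draws(w, l, d):
    N = w + l + d
    mu = (w + 0.5 * d) / N
    var = (w * (1 - mu) ** 2 + l * (0 - mu) ** 2 + d * (0.5 - mu) ** 2) / N
    return 0.5 * (1 + math.erf((mu - 0.5) / math.sqrt(2 * var / N)))

def los_discard_draws(w, l, d):
    # Draws dropped: a binomial test on decisive games only
    return 0.5 * (1 + math.erf((w - l) / math.sqrt(2 * (w + l))))

for w, l, d in [(600, 500, 900), (510, 490, 1000), (100, 120, 200)]:
    a, b = los_with_draws(w, l, d), los_discard_draws(w, l, d)
    print(f"{a:.5f} {b:.5f} diff={abs(a - b):.2e}")
```

For these cases the two values agree to within a fraction of a percent, consistent with the asymptotic-equivalence claim above.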

Anyway, as you're a mathematician, I would love it if you could join the discussion here instead:
https://groups.google.com/forum/?fromgr ... zELnI8Cy8M

I already did some simulations on type I and type II errors based on various stopping rules, and added some more interesting stuff (continuous stopping).
Theory and practice sometimes clash. And when that happens, theory loses. Every single time.