Beta for Stockfish distributed testing

Discussion of anything and everything relating to chess playing software and machines.

Moderator: Ras

lucasart
Posts: 3243
Joined: Mon May 31, 2010 1:29 pm
Full name: lucasart

Re: Beta for Stockfish distributed testing.

Post by lucasart »

Ajedrecista wrote:I obtain slightly better results if I divide by (N - 1) instead of N:

Code: Select all

stdev = math.sqrt(w*(1-mu)**2 + l*(0-mu)**2 + d*(0.5-mu)**2) / math.sqrt(N-1)
Oh come on! Do you have to spam everything I do?
*I* am the one who lectured you about the N-1 last time. Remember?
- Yes, N-1 gives you an unbiased estimator of the variance on a finite sample.
- But the results we use are only *asymptotic*, so you're really splitting hairs, and it won't make a measurable difference for N = 16,000 games. I just used N to avoid divide-by-zero problems.
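As a quick numerical sketch (the win/loss/draw counts below are made up for illustration, not real fishtest data), the two denominators differ only by a factor of sqrt(N / (N - 1)), roughly 1 + 1/(2N):

```python
# A sketch (hypothetical counts) comparing the N and N - 1 denominators
# in the stdev formula quoted above.
import math

w, l, d = 5000, 4800, 6200        # made-up win/loss/draw counts
N = w + l + d                     # 16,000 games
mu = (w + 0.5 * d) / N            # empirical mean score per game

ss = w * (1 - mu) ** 2 + l * (0 - mu) ** 2 + d * (0.5 - mu) ** 2
stdev_n = math.sqrt(ss) / math.sqrt(N)
stdev_n1 = math.sqrt(ss) / math.sqrt(N - 1)

# The ratio is sqrt(N / (N - 1)), about 1 + 1/(2N)
print(stdev_n, stdev_n1)
```

For N = 16,000 the relative difference is around 0.003%, far below the statistical noise of the estimate itself.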
Ajedrecista wrote:

Code: Select all

if los < 0.05:
  result['style'] = '#FF6A6A'
elif los > 0.95:
  result['style'] = '#44EB44'

return result
Given that LOS is a one-sided test and error bars for a given confidence level are two-sided tests (if I am not wrong), I think that the correct thing is:

Code: Select all

if los < 0.025:
  result['style'] = '#FF6A6A'
elif los > 0.975:
  result['style'] = '#44EB44'

return result
You *are* wrong. We want a unilateral test! What we want is:
* green if Proba(elo > 0) >= 95%
* red if Proba(elo > 0) <= 5%
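For reference, a minimal sketch of that one-sided rule, assuming the normal-approximation LOS discussed in this thread (the function names are illustrative, not fishtest's actual code):

```python
# A sketch of the one-sided colouring rule under the usual normal
# approximation; los() and colour() are illustrative names.
import math

def los(w, l, d):
    """Proba(elo > 0): likelihood of superiority under a normal approximation."""
    N = w + l + d
    mu = (w + 0.5 * d) / N                       # mean score per game
    var = (w * (1 - mu) ** 2 + l * (0 - mu) ** 2
           + d * (0.5 - mu) ** 2) / N            # per-game variance
    sigma_mean = math.sqrt(var / N)              # std error of the mean score
    return 0.5 * (1 + math.erf((mu - 0.5) / (sigma_mean * math.sqrt(2))))

def colour(w, l, d):
    p = los(w, l, d)
    if p < 0.05:
        return '#FF6A6A'     # red: Proba(elo > 0) <= 5%
    elif p > 0.95:
        return '#44EB44'     # green: Proba(elo > 0) >= 95%
    return None              # no colour otherwise
```

With these cutoffs an exactly equal engine gets coloured (one way or the other) about 10% of the time, which is where the 90% vs 95% figures debated in this thread come from.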
Theory and practice sometimes clash. And when that happens, theory loses. Every single time.
gladius
Posts: 568
Joined: Tue Dec 12, 2006 10:10 am
Full name: Gary Linscott

Re: Beta for Stockfish distributed testing

Post by gladius »

geots wrote:
ThomasJMiller wrote:
Machine: glinscott - 23 cores
is that just one computer or the sum of several?

Flip a coin..... Thing is, quality control is headed south of the border. It's a nice thought, and you can take your best shot at it- but it'll be quasi at best. Too many hands, too many different personalities and too much different hardware. Everything comes with a price.


gts
Thank you for your constructive feedback. What alternative do you propose? Especially for an engine that is free (i.e. no money to spend on hardware).
mcostalba
Posts: 2684
Joined: Sat Jun 14, 2008 9:17 pm

Re: Beta for Stockfish distributed testing.

Post by mcostalba »

lucasart wrote: You *are* wrong. We want a unilateral test! What we want is:
* green if Proba(elo > 0) >= 95%
* red if Proba(elo > 0) <= 5%
Ok, perhaps I committed too early... sorry for that. I will wait until there is a consensus among statistics experts (I am not one) before reverting.
mcostalba
Posts: 2684
Joined: Sat Jun 14, 2008 9:17 pm

Re: Beta for Stockfish distributed testing.

Post by mcostalba »

mcostalba wrote:
lucasart wrote: You *are* wrong. We want a unilateral test! What we want is:
* green if Proba(elo > 0) >= 95%
* red if Proba(elo > 0) <= 5%
Ok, perhaps I committed too early... sorry for that. I will wait until there is a consensus among statistics experts (I am not one) before reverting.
BTW what about

red if Proba(elo < 0) >= 95%

Would this change the formula?
mcostalba
Posts: 2684
Joined: Sat Jun 14, 2008 9:17 pm

Re: Beta for Stockfish distributed testing

Post by mcostalba »

geots wrote: Flip a coin..... Thing is, quality control is headed south of the border. It's a nice thought, and you can take your best shot at it- but it'll be quasi at best. Too many hands, too many different personalities and too much different hardware. Everything comes with a price.
Oh my gosh !

I would say that _your_ comments on this subject are a coin flip. Not that you are necessarily wrong, but given your background on fishtest, writing a line about this stuff really is like flipping a coin for you.
User avatar
Ajedrecista
Posts: 2217
Joined: Wed Jul 13, 2011 9:04 pm
Location: Madrid, Spain.

Re: Beta for Stockfish distributed testing.

Post by Ajedrecista »

Hello Marco:
mcostalba wrote:
lucasart wrote: You *are* wrong. We want a unilateral test! What we want is:
* green if Proba(elo > 0) >= 95%
* red if Proba(elo > 0) <= 5%
Ok, perhaps I committed too early... sorry for that. I will wait until there is a consensus among statistics experts (I am not one) before reverting.
I am not an expert, so if you want to revert the commit I will not be annoyed. In fact, I enjoy this form of open testing a lot. Good luck with SF!
mcostalba wrote:BTW what about

red if Proba(elo < 0) >= 95%

Would this change the formula?
I see Proba(elo < 0) >= 95% and Proba(elo > 0) < 5% as totally equivalent. Just remember that I am not an expert.
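As a quick sanity check of this equivalence (a sketch with made-up counts, using the normal-approximation LOS from this thread): the approximating distribution is continuous, so Proba(elo < 0) = 1 - Proba(elo > 0) and the two conditions select the same results.

```python
# Numerical check (hypothetical counts) that the two conditions coincide:
# under the continuous normal approximation, Proba(elo = 0) is zero, so
# Proba(elo < 0) = 1 - Proba(elo > 0).
import math

def los(w, l, d):
    # Proba(elo > 0) via the normal approximation used in this thread
    N = w + l + d
    mu = (w + 0.5 * d) / N
    var = (w * (1 - mu) ** 2 + l * (0 - mu) ** 2 + d * (0.5 - mu) ** 2) / N
    return 0.5 * (1 + math.erf((mu - 0.5) / math.sqrt(2 * var / N)))

w, l, d = 450, 550, 1000
p_sup = los(w, l, d)     # Proba(elo > 0)
p_inf = 1.0 - p_sup      # Proba(elo < 0)
print(p_inf >= 0.95, p_sup <= 0.05)   # the two red conditions agree
```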

------------------------

@Lucas:
lucasart wrote:*I* am the one who lectured you about the N-1 last time. Remember?
- Yes, N-1 gives you an unbiased estimator of the variance on a finite sample.
- But the results we use are only *asymptotic*, so you're really splitting hairs, and it won't make a measurable difference for N = 16,000 games. I just used N to avoid divide-by-zero problems.
Of course I remember that you suggested (N - 1) to me, but I already knew it before you wrote it in the forum. I simply had not tried it in my programmes until you brought it up here. (N - 1) will have a small impact when the number of games is low (almost none with thousands of games), although I know that conclusions cannot be drawn so early. Regarding the divide-by-zero issue, I recall that if games < 10, the status was 'Pending...' instead of doing calculations... however, I cannot find it in the sources anymore.
lucasart wrote:You *are* wrong. We want a unilateral test! What we want is:
* green if Proba(elo > 0) >= 95%
* red if Proba(elo > 0) <= 5%
I know that we want a one-sided test like LOS. It seems that I misunderstood the bounds: I thought that the desired behaviour was to leave results uncoloured 95% of the time (97.5% - 2.5% = 95%), while you reduce it to 90% (95% - 5% = 90%), if I understand correctly.

------------------------

Just a thought: it seems that with each passing day it becomes more difficult to write something here on TalkChess without someone 'biting' you. It looks like the day I stop posting here must come soon, for the sake of everybody.

Gary and Marco: sorry for this post being clearly off-topic. My intention was never to spam a thread. Again, tons of good luck with SF, which is an engine that I like a lot.

Regards from Spain.

Ajedrecista.
Adam Hair
Posts: 3226
Joined: Wed May 06, 2009 10:31 pm
Location: Fuquay-Varina, North Carolina

Re: Beta for Stockfish distributed testing.

Post by Adam Hair »

lucasart wrote:
Ajedrecista wrote:I obtain slightly better results if I divide by (N - 1) instead of N:

Code: Select all

stdev = math.sqrt(w*(1-mu)**2 + l*(0-mu)**2 + d*(0.5-mu)**2) / math.sqrt(N-1)
Oh come on! Do you have to spam everything I do ?
I am at a loss here. I have not seen where Jesús has spammed you. I do see that he actively tries to contribute to discussions he has some interest in. The same as you. The only difference is that he is much more polite than you are.
lucasart
Posts: 3243
Joined: Mon May 31, 2010 1:29 pm
Full name: lucasart

Re: Beta for Stockfish distributed testing.

Post by lucasart »

This forum is obviously not the right place to talk about statistics.

I did a study of the type I and type II error risks of the SF testing methodology: the current test, and various early-stopping rules. It's on the FishCooking Google group.
Theory and practice sometimes clash. And when that happens, theory loses. Every single time.
Michel
Posts: 2292
Joined: Mon Sep 29, 2008 1:50 am

Re: Beta for Stockfish distributed testing.

Post by Michel »

I have one small comment.

To compute LOS accurately one should normally discard draws. This has been proved by Remy in this post http://www.talkchess.com/forum/viewtopi ... 05&t=30624 . From the formulas I see floating around here this is not being done.

It could be that you also get (a good approximation to) the right answer if you don't discard draws. I did not check this.

PS. The discard draw rule is of course only valid for a match with two participants.
lucasart
Posts: 3243
Joined: Mon May 31, 2010 1:29 pm
Full name: lucasart

Re: Beta for Stockfish distributed testing.

Post by lucasart »

Michel wrote:I have one small comment.

To compute LOS accurately one should normally discard draws. This has been proved by Remy in this post http://www.talkchess.com/forum/viewtopi ... 05&t=30624 . From the formulas I see floating around here this is not being done.

It could be that you also get (a good approximation to) the right answer if you don't discard draws. I did not check this.

PS. The discard draw rule is of course only valid for a match with two participants.
Try both formulas numerically, and you'll get the same results! What I am doing is simply computing the empirical variance and mean, and using the Gaussian approximation. Nothing fancy there. The LOS formula can be shown to be equivalent to the binomial case where you have removed the draws (asymptotically?).
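To back that up, here is a small numerical comparison (a sketch with arbitrary example scores, using the standard normal-approximation formulas rather than fishtest's exact code) of LOS with draws kept versus draws discarded:

```python
# Sketch: LOS from the trinomial normal approximation (draws kept)
# versus the binomial one (draws discarded). Counts are made up.
import math

def los_with_draws(w, l, d):
    N = w + l + d
    mu = (w + 0.5 * d) / N
    var = (w * (1 - mu) ** 2 + l * (0 - mu) ** 2 + d * (0.5 - mu) ** 2) / N
    return 0.5 * (1 + math.erf((mu - 0.5) / math.sqrt(2 * var / N)))

def los_discard_draws(w, l, d):
    # Draws dropped: a binomial test on decisive games only
    return 0.5 * (1 + math.erf((w - l) / math.sqrt(2 * (w + l))))

for w, l, d in [(600, 500, 900), (510, 490, 1000), (100, 120, 200)]:
    a, b = los_with_draws(w, l, d), los_discard_draws(w, l, d)
    print(f"{a:.5f} {b:.5f} diff={abs(a - b):.2e}")
```

For these cases the two values agree to within a fraction of a percent, consistent with the asymptotic-equivalence claim above.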

Anyway, as you're a mathematician, I would love it if you could join the discussion here instead:
https://groups.google.com/forum/?fromgr ... zELnI8Cy8M

I already did some simulations on type I and type II errors based on various stopping rules, and added some more interesting stuff (continuous stopping).
Theory and practice sometimes clash. And when that happens, theory loses. Every single time.