about wrong results in tests and wrong claims

Discussion of chess software programming and technical issues.

Moderators: hgm, Dann Corbit, Harvey Williamson

Uri Blass
Posts: 10108
Joined: Thu Mar 09, 2006 12:37 am
Location: Tel-Aviv Israel

Re: about wrong results in tests and wrong claims

Post by Uri Blass »

jundery wrote:
Uri Blass wrote:
bob wrote:
Uri Blass wrote:From the stockfish forum:

"If you perform a statistical test then *only* the final result is relevant. Not the intermediate results. So it makes no sense to say that "after (say) 200 games the result was significant"

This is clearly a wrong claim

If you test A against B (say for 1000 games) and see that after 500 games the result was A 500 B 0, and after 1000 games you see a result of 500-500, then it is clear that something is wrong with the results.

Even if you see A 450 B 50 after 500 games and A 500 B 500 after 1000 games, it is clear that something is wrong with the results.

For example it is possible that A got more cpu time in the first 500 games of the match.

In order to solve this problem, it is better to record the number of nodes per second of A and B (call them n(A) and n(B)) in every game and simply not include games where n(A)/n(B) is significantly different from the expected value.

Unfortunately it seems that the stockfish team prefer to close their eyes and not check the possibility that something is wrong in part of their games.

I think that in spite of this noise, changes that pass stage I and stage II are usually productive, because cases where A gets significantly more than 50% of the cpu time do not happen in enough games to force bad changes to pass the tests. However, I guess there are cases when one program gets significantly more than 50% of the cpu time (say 55%), not in a single game but in some hundreds of consecutive games.

I do not plan to program a tool to prevent this problem, so the stockfish team may not like my post, but at least I have a reason to suspect that there is a problem, based on watching the results.
Here's a question to ponder. If you generate a string of 500 random numbers between 0 and 99, what would you conclude if you got 500 zeroes? Flawed? Perhaps. The probability for getting 500 zeros is not zero, however, so drawing a conclusion would be wrong.

I see significant swings in results in my cluster testing. One impression after 2000 games, completely different result after 30,000...
The probability is not zero, but it is small enough to be practically certain that something in the generation is wrong.

It is the same as the case when two people write the same, or almost the same, chess program.
It is obvious that at least one of the programs is not original, even though in theory there is a small probability that two different people write the same chess program independently.
This shows the problem with your approach, though: ignoring all other considerations and assuming the 1000 tests you are looking at are the only data points to consider.

At a 95% confidence level the results are statistically significant. At 99% they are no longer significant. So whether they are significant depends on the certainty you want.

What you really need to do is run a series of experiments that can then be replicated and verified by others. Running code (or a mathematical proof) will beat out email discussions every time. SPRT had a lot of opposition from people who are now vocal supporters on the Stockfish forum; once the verifiable evidence was presented, the switch in position was almost immediate.
SPRT is better than a fixed number of games, but it is not ideal.
SPRT assumes a test of a simple hypothesis H0 against a simple alternative H1, but in practice there may be very bad changes, so it is not really a case of H0 against H1.

I believe that if nothing is wrong in the testing conditions, stopping early when the result is very significant against the change (not just beyond 95% but beyond 99%) can save games in practice, because there are certainly changes that lose 20 or 30 elo while there are no changes that gain 20 or 30 elo, and SPRT is optimal (or close to optimal) only when practically all the changes are very small, which is not the case.

Note that I believe that there are cases when something is wrong in the testing conditions and the engines do not get the same cpu time
for 300 games or something like that.

These cases do not happen often, but I guess that they do happen, based on the data I saw (of course I am not sure about it, and it is not an extreme case like the 500 zeros I described).
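
A minimal sketch of the nps-ratio filter proposed in the quoted post above, in Python. The field names and the 20% tolerance are assumptions for illustration; this is not something fishtest currently does.

```python
# Sketch of the proposed sanity check: drop games where engine A's measured
# speed relative to engine B deviates too much from the expected ratio.
# Field names and the 20% tolerance are illustrative assumptions.

def filter_games(games, expected_ratio=1.0, tolerance=0.20):
    """Keep only games whose nps ratio n(A)/n(B) is close to expected_ratio."""
    kept = []
    for g in games:
        ratio = g["nps_a"] / g["nps_b"]
        if abs(ratio / expected_ratio - 1.0) <= tolerance:
            kept.append(g)
    return kept

# Example: the third game is dropped because A ran about 40% slower than
# expected, which suggests it did not get its fair share of cpu time.
games = [
    {"nps_a": 1_000_000, "nps_b": 1_010_000, "result": "1-0"},
    {"nps_a":   990_000, "nps_b": 1_000_000, "result": "1/2-1/2"},
    {"nps_a":   600_000, "nps_b": 1_000_000, "result": "0-1"},
]
print(len(filter_games(games)))  # 2
```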
bob
Posts: 20943
Joined: Mon Feb 27, 2006 7:30 pm
Location: Birmingham, AL

Re: about wrong results in tests and wrong claims

Post by bob »

Don wrote:
bob wrote: I don't believe in stopping early, but certainly never to accept a change, only to reject one, since I would rather err by rejecting a good change than by accepting a bad one...
To get the most bang for your buck you have to embrace intuition guided testing to a certain extent. For example WHAT you test generally has no strict scientific basis, you simply test things based on what you consider sound chess principles. We don't try things randomly, so a lot of judgement is involved even in the decision on what to try.

The type of changes we make fall into different categories too. Some are not very interesting and some are very interesting. We usually have a strong sense of whether a change is speculative or just conservative. A speculative change might get more testing before we throw it out simply because it is like a speculative financial investment, it has more risk but also more reward. We don't mind losing a conservative change that we already know is unlikely to provide more than 1 or 2 ELO at best.

We also might not be very interested in a particular change - so we don't lose any sleep over replacing it with a more exciting change if it's not testing well at the moment.

So what happens is that if a change starts out well, we tend to keep it running longer which means there is some manipulation of results. But it's a one way street, we never accept a change that hasn't been tested to full completion and we accept that we sometimes throw out good changes.

It's all about getting the most for the CPU resources we have. Stopping a test that starts out pretty badly has some statistical basis because we don't test anything anyway that we believe will test poorly. So if a test starts out poorly (with a non-trivial bad start) we know the odds in favor of a good result are much lower, and we will decide not to continue.

I actually intend to formalize this at some point so that the "intuition" part is spelled out in advance of the test. We would rate our interest in the test and estimate the payoff and such in advance and then the test would run itself, stopping early based on some formalized stopping rules.

I already did a study and came up with something more general, based on stopping the test early in EITHER case but being far stricter about accepting changes. Using Monte Carlo simulation I created a workable system (which we have not implemented yet) for stopping a test without our input. The simulation measured the types of regressions we would suffer over time based on various stopping rules. Of course this is not based on any a priori inputs to the testing procedure based on our own judgment calls but that could be folded in.

These methods are no good for publishing scientific results however. When comparing 2 programs or even a change and making claims the test must be under much stricter conditions to be statistically admissible, but we are not doing this as a science experiment, we are trying to achieve practical results.
My point still stands. If you want to pick an early stopping point, pick it before you start. Because for two equal programs, it is very likely one will pull ahead before things even out. And that leads to incorrect decisions...

Nothing wrong with stopping early, just so it is done in a sound way...
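
A minimal Monte Carlo sketch of the "early swings" effect described above: for two exactly equal engines (60% draw rate assumed), the Elo estimate after 2,000 games scatters much more than the estimate after 30,000 games. Python, illustrative only.

```python
# Two exactly equal engines: compare the spread of the Elo estimate after
# 2,000 games with the spread after 30,000 games.  The 60% draw rate is an
# assumption for illustration.
import math
import random

def elo_from_score(score_rate):
    """Logistic Elo corresponding to a score rate strictly between 0 and 1."""
    return -400.0 * math.log10(1.0 / score_rate - 1.0)

def one_match(n_games, draw_rate=0.60):
    """Return the Elo estimate at a few checkpoints of a single match."""
    score, checkpoints = 0.0, {}
    for g in range(1, n_games + 1):
        r = random.random()
        if r < draw_rate:
            score += 0.5                      # draw
        elif r < draw_rate + (1.0 - draw_rate) / 2.0:
            score += 1.0                      # A wins (engines are equal by construction)
        if g in (2_000, 30_000):
            checkpoints[g] = elo_from_score(score / g)
    return checkpoints

random.seed(1)
runs = [one_match(30_000) for _ in range(200)]
for n in (2_000, 30_000):
    errors = sorted(abs(run[n]) for run in runs)
    print(f"after {n:>6} games: median |Elo error| ~ {errors[len(errors) // 2]:.1f}")
```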
syzygy
Posts: 5554
Joined: Tue Feb 28, 2012 11:56 pm

Re: about wrong results in tests and wrong claims

Post by syzygy »

Uri Blass wrote:From the stockfish forum:

"If you perform a statistical test then *only* the final result is relevant. Not the intermediate results. So it makes no sense to say that "after (say) 200 games the result was significant"

This is clearly a wrong claim

If you test A against B (say for 1000 games) and see that after 500 games the result was A 500 B 0, and after 1000 games you see a result of 500-500, then it is clear that something is wrong with the results.
Sure, but that does not contradict the statement. The statement assumes the tests are run correctly. If you somehow screw up the tests, both the intermediate AND the final results are meaningless.
AlvaroBegue
Posts: 931
Joined: Tue Mar 09, 2010 3:46 pm
Location: New York
Full name: Álvaro Begué (RuyDos)

Re: about wrong results in tests and wrong claims

Post by AlvaroBegue »

bob wrote:Here's a question to ponder. If you generate a string of 500 random numbers between 0 and 99, what would you conclude if you got 500 zeroes? Flawed? Perhaps. The probability for getting 500 zeros is not zero, however, so drawing a conclusion would be wrong.
This is where you apply Bayes's theorem. Before I generate a string of 500 random numbers between 0 and 99, I have a prior that I am a decent programmer, so I probably wrote the random-number generator correctly. So say I estimate the prior probability of the generator being flawed at 0.01%. Now I observe an event that would be produced with high probability by a flawed generator, and with tiny probability by a correct generator. If I do the math, my posterior probability for the generator being flawed is now very close to 1, so I would say it is flawed. This is a perfectly valid way to reason (perhaps the only way).
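
A minimal numerical version of that argument, using the prior and likelihoods from the post above (Python sketch, working in logs to avoid underflow):

```python
# Posterior odds that the generator is flawed, given 500 zeros in a row.
# Prior (0.01%) and likelihoods follow the post above; a "flawed" generator
# is modelled as one that always outputs zero.
import math

log_prior_flawed = math.log(1e-4)       # 0.01% prior that the RNG is broken
log_prior_ok = math.log(1.0 - 1e-4)
log_lik_flawed = 0.0                    # stuck-at-zero generator: P(500 zeros) = 1
log_lik_ok = 500 * math.log(1 / 100)    # correct uniform 0..99: P(500 zeros) = 100**-500

# posterior log-odds (flawed vs correct) = prior log-odds + log-likelihood ratio
log_odds = (log_prior_flawed - log_prior_ok) + (log_lik_flawed - log_lik_ok)
print(log_odds / math.log(10))          # ~ +996: odds of about 10**996 that it is flawed
```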
bob
Posts: 20943
Joined: Mon Feb 27, 2006 7:30 pm
Location: Birmingham, AL

Re: about wrong results in tests and wrong claims

Post by bob »

AlvaroBegue wrote:
bob wrote:Here's a question to ponder. If you generate a string of 500 random numbers between 0 and 99, what would you conclude if you got 500 zeroes? Flawed? Perhaps. The probability for getting 500 zeros is not zero, however, so drawing a conclusion would be wrong.
This is where you apply Bayes's theorem. Before I generate a string of 500 random numbers between 0 and 99, I have a prior that I am a decent programmer, so I probably wrote the random-number generator correctly. So say I estimate the prior probability of the generator being flawed at 0.01%. Now I observe an event that would be produced with high probability by a flawed generator, and with tiny probability by a correct generator. If I do the math, my posterior probability for the generator being flawed is now very close to 1, so I would say it is flawed. This is a perfectly valid way to reason (perhaps the only way).
That was part of the point I was making. If the large test results are wrong, using intermediate results will be JUST as wrong. The assumption has to be that the test methodology has been carefully designed, implemented, and tested, first.

The other part was that if you use the criterion "stop if one program gets XXX Elo ahead" then you also introduce significant error because of the early swings one naturally sees when testing two engines that are close together in strength.

This "intuition" approach is exactly the opposite of what is really needed.
Michel
Posts: 2271
Joined: Mon Sep 29, 2008 1:50 am

Re: about wrong results in tests and wrong claims

Post by Michel »

This is where you apply Bayes's theorem
Beware of applying Bayes theorem to test design.

As far as I can tell, applying Bayes theorem naively implies that early stopping based on the p-value ("significance level") _is allowed_. Nevertheless this is clearly false, as can be shown in many ways (and everybody agrees on this now).

If you read the literature on Bayesian sequential testing, you will find that the proponents apply a much more convoluted theoretical framework which assigns costs to tests and to incorrect outcomes.
By some miracle the test minimizing expected cost is also an SPRT.
However, it appears impossible to me to assign sensible costs in the case of chess engine testing.

As far as I can tell, it is never explained why applying Bayes theorem directly gives the wrong result. If anyone understands this, I would be happy to learn about it.
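
One simple way to see that naive p-value peeking is not allowed: simulate a change worth exactly zero Elo, apply the usual 95% criterion after every batch of games, and count how often it fires at least once. A Python sketch; the draw rate, batch size and match length are assumptions.

```python
# Peeking at the p-value after every batch rejects the null far more often
# than the nominal 5%, even when the change is worth exactly 0 Elo.
import math
import random

def ever_looks_significant(n_games, batch=100, draw_rate=0.60, z_crit=1.96):
    """True if the running score ever crosses the 95% two-sided threshold
    when checked after every `batch` games (engines are exactly equal)."""
    var_per_game = (1.0 - draw_rate) / 4.0   # variance of a single game score
    score = 0.0
    for g in range(1, n_games + 1):
        r = random.random()
        if r < draw_rate:
            score += 0.5
        elif r < draw_rate + (1.0 - draw_rate) / 2.0:
            score += 1.0
        if g % batch == 0:
            z = (score - 0.5 * g) / math.sqrt(var_per_game * g)
            if abs(z) > z_crit:
                return True
    return False

random.seed(2)
trials = 500
false_alarms = sum(ever_looks_significant(10_000) for _ in range(trials))
print(false_alarms / trials)   # several times the nominal 0.05
```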
lucasart
Posts: 3232
Joined: Mon May 31, 2010 1:29 pm
Full name: lucasart

Re: about wrong results in tests and wrong claims

Post by lucasart »

Uri Blass wrote: SPRT assumes a test of a simple hypothesis H0 against a simple alternative H1, but in practice there may be very bad changes, so it is not really a case of H0 against H1.
Do you really understand how SPRT works ?
Because the more I read your post, the less I understand any of it.

Are you trying to say that you would prefer a test of
H0: elo<elo0 against H1: elo>elo0
Such sequential tests do exist, but they cannot be finite. The expected stopping time of such a test goes to infinity as elo gets closer to elo0, and when elo = elo0 (exactly) the test is (almost surely) infinite. The NAS algorithm (non-monotonic adaptive sampling) is an example. I have experimented with such tests numerically, and the pain (much more costly) is not worth the gain (eliminating the grey zone elo0 < elo < elo1, which in practice introduces an "elo resolution" below which improvements cannot be reliably detected).
Uri Blass wrote: I believe that if nothing is wrong in the testing conditions, stopping early when the result is very significant against the change (not just beyond 95% but beyond 99%) can save games in practice, because there are certainly changes that lose 20 or 30 elo while there are no changes that gain 20 or 30 elo, and SPRT is optimal (or close to optimal) only when practically all the changes are very small, which is not the case.
What's your point? SPRT will refute bad patches quicker if they are really bad. I recommend you write yourself a little simulator, and submit these misconceptions of yours to a numerical test, before you post...

And these 95% and 99% that you are talking about? What do you mean? Do you want to use the p-value? The p-value doesn't mean anything unless you predecide the number of games and look only after N games.
Uri Blass wrote: Note that I believe that there are cases when something is wrong in the testing conditions and the engines do not get the same cpu time
for 300 games or something like that.
If you are referring to fishtest, then I must disagree. The time that each test uses is rescaled based on the timing of './stockfish bench' on each machine.
Theory and practice sometimes clash. And when that happens, theory loses. Every single time.
Uri Blass
Posts: 10108
Joined: Thu Mar 09, 2006 12:37 am
Location: Tel-Aviv Israel

Re: about wrong results in tests and wrong claims

Post by Uri Blass »

SPRT minimizes the cost under some assumptions that are not correct in practice.

The assumption at stage I, for example, is that we need to decide between H0, the change loses 1.5 elo, and H1, the change wins 4.5 elo:

elo0: -1.50 alpha: 0.05 elo1: 4.50 beta:0.05

Not every change loses 1.5 elo or wins 4.5 elo.
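
For illustration, a much simplified sketch of the SPRT mechanics behind those numbers: a Wald sequential test of H0 (elo = -1.5) against H1 (elo = +4.5) with alpha = beta = 0.05, counting decisive games only. It ignores draws and the BayesElo details that fishtest actually uses.

```python
# Simplified win/loss-only SPRT sketch (not fishtest's real implementation).
import math

def win_prob(elo):
    """Expected score of a decisive game under the logistic Elo model."""
    return 1.0 / (1.0 + 10.0 ** (-elo / 400.0))

def sprt_bounds(alpha=0.05, beta=0.05):
    lower = math.log(beta / (1.0 - alpha))    # LLR <= lower: accept H0, reject the patch
    upper = math.log((1.0 - beta) / alpha)    # LLR >= upper: accept H1, accept the patch
    return lower, upper

def llr_update(result, elo0=-1.5, elo1=4.5):
    """Log-likelihood-ratio contribution of one decisive game ('win' or 'loss')."""
    p0, p1 = win_prob(elo0), win_prob(elo1)
    return math.log(p1 / p0) if result == "win" else math.log((1.0 - p1) / (1.0 - p0))

lower, upper = sprt_bounds()
llr = 0.0
for result in ["win", "loss", "win", "win"]:  # feed decisive results as they come in
    llr += llr_update(result)
    if llr <= lower or llr >= upper:
        break                                 # test stops as soon as a bound is crossed
print(round(llr, 4), (round(lower, 2), round(upper, 2)))
```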
lucasart
Posts: 3232
Joined: Mon May 31, 2010 1:29 pm
Full name: lucasart

Re: about wrong results in tests and wrong claims

Post by lucasart »

Uri Blass wrote: SPRT minimizes the cost under some assumptions that are not correct in practice.

The assumption at stage I, for example, is that we need to decide between H0, the change loses 1.5 elo, and H1, the change wins 4.5 elo:

elo0: -1.50 alpha: 0.05 elo1: 4.50 beta:0.05

Not every change loses 1.5 elo or wins 4.5 elo.
This test is a *preselection*:
* a patch whose true value is -1.0 elo has indeed a probability greater than 5% of passing the preselection (type I error)
* a patch that is worth +2 elo has more chance of passing the test than it has of failing, but still has more than 5% chance of failing the test, even if it's an improvement. It's called testing resolution. You cannot have a finite test that does not have this flaw. Please read what I explained above about infinite test.

The real SPRT test is to use elo0 = 0 and elo1 = epsilon, where epsilon (here 6) is your testing resolution.

PS: All these elo numbers are expressed in BayesElo, not in normal Elo. Be very careful with that: with the extremely high values of drawelo (Stockfish self-testing with aggressive draw adjudication produces formidable amounts of draws), the difference between the Elo and the BayesElo scales is very large.
Theory and practice sometimes clash. And when that happens, theory loses. Every single time.
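
For illustration, a sketch of the BayesElo-to-Elo conversion under the usual BayesElo model (win and loss probabilities parameterised by a drawelo value). The drawelo of 300 is an assumed value chosen only to show how much the two scales diverge when there are many draws.

```python
# BayesElo difference -> expected score -> equivalent logistic ("normal") Elo.
# The drawelo value is an illustrative assumption, not a fishtest constant.
import math

def bayeselo_to_score(belo, drawelo):
    """Expected score for a BayesElo difference `belo` and draw parameter `drawelo`."""
    win = 1.0 / (1.0 + 10.0 ** ((-belo + drawelo) / 400.0))
    loss = 1.0 / (1.0 + 10.0 ** ((belo + drawelo) / 400.0))
    return win + 0.5 * (1.0 - win - loss)

def score_to_elo(score):
    return -400.0 * math.log10(1.0 / score - 1.0)

drawelo = 300.0                 # a high drawelo means a very drawish match (assumed)
for belo in (1.5, 4.5, 6.0):
    elo = score_to_elo(bayeselo_to_score(belo, drawelo))
    print(f"{belo:4.1f} BayesElo ~ {elo:4.2f} Elo at drawelo = {drawelo:.0f}")
# With many draws the same BayesElo difference corresponds to a much smaller
# "normal" Elo difference, which is the warning about the two scales above.
```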
Uri Blass
Posts: 10108
Joined: Thu Mar 09, 2006 12:37 am
Location: Tel-Aviv Israel

Re: about wrong results in tests and wrong claims

Post by Uri Blass »

lucasart wrote:
Uri Blass wrote: SPRT assumes a test of a simple hypothesis H0 against a simple alternative H1, but in practice there may be very bad changes, so it is not really a case of H0 against H1.
Do you really understand how SPRT works ?
Because the more I read your post, the less I understand any of it.

Are you trying to say that you would prefer a test of
H0: elo<elo0 against H1: elo>elo0
Such sequential tests do exist, but they cannot be finite. The expected stopping time of such a test goes to infinity as elo gets closer to elo0, and when elo = elo0 (exactly) the test is (almost surely) infinite. The NAS algorithm (non-monotonic adaptive sampling) is an example. I have experimented with such tests numerically, and the pain (much more costly) is not worth the gain (eliminating the grey zone elo0 < elo < elo1, which in practice introduces an "elo resolution" below which improvements cannot be reliably detected).
Uri Blass wrote: I believe that if nothing is wrong in the testing conditions, stopping early when the result is very significant against the change (not just beyond 95% but beyond 99%) can save games in practice, because there are certainly changes that lose 20 or 30 elo while there are no changes that gain 20 or 30 elo, and SPRT is optimal (or close to optimal) only when practically all the changes are very small, which is not the case.
What's your point? SPRT will refute bad patches quicker if they are really bad. I recommend you write yourself a little simulator, and submit these misconceptions of yours to a numerical test, before you post...

And these 95% and 99% that you are talking about? What do you mean? Do you want to use the p-value? The p-value doesn't mean anything unless you predecide the number of games and look only after N games.
Uri Blass wrote: Note that I believe that there are cases when something is wrong in the testing conditions and the engines do not get the same cpu time
for 300 games or something like that.
If you are referring to fishtest, then I must disagree. The time that each test uses is rescaled based on the timing of './stockfish bench' on each machine.
1) My point is that the time of stockfish bench may not be constant on some machines, and there may be cases when another process runs on the same machine and steals cpu time from only one of the programs.

Edit: If every machine runs bench before every game, and you use the results to detect a problem in case of a significant change in the timing, then it may be an improvement (you do not lose much testing time in practice, because running bench is still significantly faster than playing the game).

2) SPRT is also based on the assumption that all the changes are near zero elo.

Let us take an extreme case that does not happen in practice, to show why SPRT is not optimal.
Imagine that 10% of the patches have a serious bug that causes them to lose every game (in practice this is not the case, and there are bad patches that lose 30 elo because of bugs, but I take an extreme example to show that SPRT is not optimal under practical assumptions).

If you use SPRT, the bad patches need to run for 100 games or so to be proved bad.

In practice I can simply have a rule to stop after losing 40-0 with no draws, which means rejecting the bad patches faster without a significant probability of rejecting a good patch (most of the games are draws, so the probability of a good patch losing the first 40 consecutive games is clearly less than 1/(10^20); practically I can say that it is not going to happen, and I clearly save games in 10% of the cases).

The testing time that I save is not very big, but I gain something even if I save only 0.01% of my testing time with practically no damage (not counting the theoretical, less than 0.000000000000000001%, probability of rejecting a good patch).

Note also that my rule is not optimal and I made it only to show that SPRT is not optimal.
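
A quick back-of-the-envelope check of the 0-40 argument above, assuming (purely for illustration) a per-game loss probability of about 0.2 for a patch of equal strength in a drawish self-play match:

```python
# Probability that a patch of equal strength loses its first 40 games in a row,
# with an assumed per-game loss probability of 0.2 (the rest mostly draws).
loss_prob = 0.2
p_forty_straight_losses = loss_prob ** 40
print(p_forty_straight_losses)   # ~1.1e-28, far below the 1/(10**20) bound quoted above
```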