about wrong results in tests and wrong claims

Uri Blass
Posts: 11152
Joined: Thu Mar 09, 2006 12:37 am
Location: Tel-Aviv Israel

about wrong results in tests and wrong claims

Post by Uri Blass »

From the Stockfish forum:

"If you perform a statistical test then *only* the final result is relevant. Not the intermediate results. So it makes no sense to say that 'after (say) 200 games the result was significant'."

This is clearly a wrong claim.

If you test A against B, let's say for 1,000 games, and after 500 games the result is A 500 - B 0, while after 1,000 games the result is 500-500, then it is clear that something is wrong with the results.

Even if you see A 450 - B 50 after 500 games and a final result of A 500 - B 500, it is clear that something is wrong with the results.

For example, it is possible that A got more CPU time in the first 500 games of the match.

In order to solve this problem, it is better to record the number of nodes per second of A and B (call them n(A) and n(B)) in every game, and simply not include games in which n(A)/n(B) is significantly different from the expected value.
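
A minimal sketch of this filtering idea (the record layout, the expected ratio of 1.0 and the 10% tolerance below are illustrative assumptions, not values taken from any actual testing framework):

```python
# Hypothetical sketch: drop games in which the nodes-per-second ratio
# n(A)/n(B) strays too far from its expected value, then score the rest.

def filter_games(games, expected_ratio=1.0, tolerance=0.10):
    """Keep only games whose NPS ratio is close to the expected value.

    `games` is an iterable of dicts with keys 'nps_a', 'nps_b' and 'result'
    (result = 1, 0.5 or 0 from A's point of view).
    """
    kept = []
    for g in games:
        ratio = g['nps_a'] / g['nps_b']
        if abs(ratio - expected_ratio) / expected_ratio <= tolerance:
            kept.append(g)
    return kept

# Example with made-up numbers: the second game is dropped because A ran
# roughly 65% faster than B.
games = [
    {'nps_a': 1_200_000, 'nps_b': 1_180_000, 'result': 1.0},
    {'nps_a': 1_900_000, 'nps_b': 1_150_000, 'result': 1.0},
]
clean = filter_games(games)
score = sum(g['result'] for g in clean) / len(clean)
print(f"{len(clean)} games kept, score for A = {score:.3f}")
```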

Unfortunately, it seems that the Stockfish team prefers to close its eyes and not check the possibility that something is wrong in part of their games.

I think that in spite of this noise, changes that pass stage I and stage II are usually productive, because cases in which A gets significantly more than 50% of the CPU time do not happen in enough games to force bad changes to pass the tests. But I guess there are cases when one program gets significantly more than 50% of the CPU time (say 55%), not in a single game but in some hundreds of consecutive games.

I do not plan to program a tool to prevent this problem, so the Stockfish team may not like my post, but at least I have a reason to suspect that there is a problem, based on watching the results.
Ajedrecista
Posts: 2178
Joined: Wed Jul 13, 2011 9:04 pm
Location: Madrid, Spain.

Re: About wrong results in tests and wrong claims.

Post by Ajedrecista »

Hello Uri:
Uri Blass wrote: From the Stockfish forum:

"If you perform a statistical test then *only* the final result is relevant. Not the intermediate results. So it makes no sense to say that 'after (say) 200 games the result was significant'."

This is clearly a wrong claim.

If you test A against B, let's say for 1,000 games, and after 500 games the result is A 500 - B 0, while after 1,000 games the result is 500-500, then it is clear that something is wrong with the results.

Even if you see A 450 - B 50 after 500 games and a final result of A 500 - B 500, it is clear that something is wrong with the results.

For example, it is possible that A got more CPU time in the first 500 games of the match.

In order to solve this problem, it is better to record the number of nodes per second of A and B (call them n(A) and n(B)) in every game, and simply not include games in which n(A)/n(B) is significantly different from the expected value.

Unfortunately, it seems that the Stockfish team prefers to close its eyes and not check the possibility that something is wrong in part of their games.

I think that in spite of this noise, changes that pass stage I and stage II are usually productive, because cases in which A gets significantly more than 50% of the CPU time do not happen in enough games to force bad changes to pass the tests. But I guess there are cases when one program gets significantly more than 50% of the CPU time (say 55%), not in a single game but in some hundreds of consecutive games.

I do not plan to program a tool to prevent this problem, so the Stockfish team may not like my post, but at least I have a reason to suspect that there is a problem, based on watching the results.
First of all, I am not an expert in Statistics and other related stuff.

Talking about tests with a pre-arranged or fixed number of games (the typical regression tests of 20,000 games in the SF testing framework), stopping the test at an early point because you like the result (for example, when the Elo estimator is outside the 95% confidence bars) introduces a bias into the test, because one of the hypotheses of the test is that the number of games played is N (20,000 for example) and not N - A (the early stopping). Remember that a confidence of 95% implies that you can be wrong 5% of the time.
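
A minimal Monte Carlo sketch of this bias, assuming two equal engines, a uniform win/draw/loss outcome per game and a normal approximation for the score (none of which matches the real SF framework exactly); it compares checking significance only at the end of a fixed-length test against stopping as soon as the running score looks significant:

```python
# Simulate many fixed-length tests between two equal engines and count how
# often they are declared "significant" at the 95% level, with and without
# peeking at intermediate results.
import math
import random

N = 1000          # fixed test length (assumed)
TRIALS = 2000     # number of simulated tests
Z95 = 1.96        # two-sided 95% z threshold
SD = math.sqrt(1.0 / 6.0)   # per-game std. dev. of the assumed 0 / 0.5 / 1 uniform outcome

def run_trial(peek):
    """Return True if the test is declared significant for two equal engines."""
    score = 0.0
    for n in range(1, N + 1):
        score += random.choice((0.0, 0.5, 1.0))   # equal engines by construction
        if peek and n >= 100 and abs(score / n - 0.5) > Z95 * SD / math.sqrt(n):
            return True                           # looked "significant" at an intermediate point
    return abs(score / N - 0.5) > Z95 * SD / math.sqrt(N)

for peek in (False, True):
    rate = sum(run_trial(peek) for _ in range(TRIALS)) / TRIALS
    print(f"peeking={peek}: fraction declared significant ~= {rate:.3f}")
# Without peeking the rate stays near the nominal 0.05; with peeking it is
# several times larger, which is exactly the bias described above.
```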

Unfortunately, I am not smart enough to write an explanation, but other people such as Lucas, Michel, Rémi... can. I give you a link about some hypotheses for LOS:

LOS

------------

Don did an experiment a few months ago: IIRC he ran a test between two development betas of Komodo with N games, and once the test finished, the measured Elo gain was 2 or 3 Elo. Then he ran T mini-tests, each of them with N/T games, so the sum of all those T mini-tests adds up to N games again. The measured Elo gain for each mini-test varied a lot, from -30 to +30 Elo or something like this. Here is the link:

A word for casual testers

If you have a PGN of a regression test from the SF testing framework (20,000 games), split it into p partitions of 20,000/p games each, and compute the Elo estimator for each partition p_i (for example, 100 partitions of 200 games each), I am sure that the variety and distribution of results will be very similar to the variety in Don's experiment. One version can be lucky in the first partition and/or unlucky in the second due to the high variance of only 200 games, and it does not imply that one version had access to more resources during that fraction of the match.
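
A rough sketch of that partition experiment, with the PGN parsing omitted and the game results simulated for two equal engines (an assumption for the demo):

```python
# Split one long match into chunks of 200 games and convert each chunk's
# score into an Elo estimate, to see how much 200-game samples swing.
import math
import random

def elo_from_score(score):
    """Logistic Elo difference implied by a score fraction strictly between 0 and 1."""
    return -400.0 * math.log10(1.0 / score - 1.0)

def partition_elo(results, games_per_partition=200):
    """`results` is a list of 1 / 0.5 / 0 values in game order (e.g. parsed from a PGN)."""
    estimates = []
    for i in range(0, len(results) - games_per_partition + 1, games_per_partition):
        chunk = results[i:i + games_per_partition]
        estimates.append(elo_from_score(sum(chunk) / len(chunk)))
    return estimates

# Demo: 20000 simulated games between two equal engines, 100 partitions of 200.
results = [random.choice((0.0, 0.5, 1.0)) for _ in range(20000)]
estimates = partition_elo(results)
print(f"min {min(estimates):+.1f}, max {max(estimates):+.1f} Elo "
      f"over {len(estimates)} partitions of 200 games")
# Even with identical engines, individual 200-game partitions commonly swing
# by tens of Elo in either direction.
```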

Anyway, you can always try your idea of measuring a nodes-per-second ratio for each version in each game, and plot the evolution of this ratio... maybe you will reach an interesting conclusion worth sharing here on TalkChess.

I encourage bright minds to post more convincing explanations in this thread.

Regards from Spain.

Ajedrecista.
Uri Blass
Posts: 11152
Joined: Thu Mar 09, 2006 12:37 am
Location: Tel-Aviv Israel

Re: About wrong results in tests and wrong claims.

Post by Uri Blass »

I agree that stopping the test at an early point because the Elo estimator is over the 95% confidence bars is not a good idea, and 95% is not what caused me to suspect that something is wrong.

It still does not mean that you can draw no conclusion based on part of the games, and I think that suspecting something is wrong with the test, when you know the change is small but the result is too good, is clearly logical.

Of course, I gave extreme examples, and what happened in the Stockfish testing is not so extreme, but it clearly caused me to suspect that something is wrong.

I posted the relevant case in the Stockfish forum, but it is not the only case that made me suspicious; I also remember other results that did not seem logical to me, when I said that maybe the reason was statistical noise.
Don
Posts: 5106
Joined: Tue Apr 29, 2008 4:27 pm

Re: about wrong results in tests and wrong claims

Post by Don »

Uri Blass wrote: From the Stockfish forum:

"If you perform a statistical test then *only* the final result is relevant. Not the intermediate results. So it makes no sense to say that 'after (say) 200 games the result was significant'."

This is clearly a wrong claim.
There probably is a way to interpret this statistically, but it has to be interpreted differently. In other words, the normal meaning we attach to the error margins is not valid in this case.

Having said that, we stop many tests early, but only when rejecting a change. We accept the possibility that some change might actually be good when we reject it. It's a lot more important not to accept a regression, or, in other words, to "do no harm." It's almost like we are doctors trying various things on a sick patient: we don't want to hurt the patient, so we take fewer chances with "speculative" therapies. In this case a speculative therapy is something we believe is good but haven't properly tested, and which could have bad side effects such as making the patient sicker.


If you test A against B, let's say for 1,000 games, and after 500 games the result is A 500 - B 0, while after 1,000 games the result is 500-500, then it is clear that something is wrong with the results.

Even if you see A 450 - B 50 after 500 games and a final result of A 500 - B 500, it is clear that something is wrong with the results.

For example, it is possible that A got more CPU time in the first 500 games of the match.

In order to solve this problem, it is better to record the number of nodes per second of A and B (call them n(A) and n(B)) in every game, and simply not include games in which n(A)/n(B) is significantly different from the expected value.

Unfortunately, it seems that the Stockfish team prefers to close its eyes and not check the possibility that something is wrong in part of their games.

I think that in spite of this noise, changes that pass stage I and stage II are usually productive, because cases in which A gets significantly more than 50% of the CPU time do not happen in enough games to force bad changes to pass the tests. But I guess there are cases when one program gets significantly more than 50% of the CPU time (say 55%), not in a single game but in some hundreds of consecutive games.

I do not plan to program a tool to prevent this problem, so the Stockfish team may not like my post, but at least I have a reason to suspect that there is a problem, based on watching the results.
Capital punishment would be more effective as a preventive measure if it were administered prior to the crime.
bob
Posts: 20943
Joined: Mon Feb 27, 2006 7:30 pm
Location: Birmingham, AL

Re: about wrong results in tests and wrong claims

Post by bob »

Uri Blass wrote: From the Stockfish forum:

"If you perform a statistical test then *only* the final result is relevant. Not the intermediate results. So it makes no sense to say that 'after (say) 200 games the result was significant'."

This is clearly a wrong claim.

If you test A against B, let's say for 1,000 games, and after 500 games the result is A 500 - B 0, while after 1,000 games the result is 500-500, then it is clear that something is wrong with the results.

Even if you see A 450 - B 50 after 500 games and a final result of A 500 - B 500, it is clear that something is wrong with the results.

For example, it is possible that A got more CPU time in the first 500 games of the match.

In order to solve this problem, it is better to record the number of nodes per second of A and B (call them n(A) and n(B)) in every game, and simply not include games in which n(A)/n(B) is significantly different from the expected value.

Unfortunately, it seems that the Stockfish team prefers to close its eyes and not check the possibility that something is wrong in part of their games.

I think that in spite of this noise, changes that pass stage I and stage II are usually productive, because cases in which A gets significantly more than 50% of the CPU time do not happen in enough games to force bad changes to pass the tests. But I guess there are cases when one program gets significantly more than 50% of the CPU time (say 55%), not in a single game but in some hundreds of consecutive games.

I do not plan to program a tool to prevent this problem, so the Stockfish team may not like my post, but at least I have a reason to suspect that there is a problem, based on watching the results.
Here's a question to ponder. If you generate a string of 500 random numbers between 0 and 99, what would you conclude if you got 500 zeroes? Flawed? Perhaps. The probability for getting 500 zeros is not zero, however, so drawing a conclusion would be wrong.

I see significant swings in results in my cluster testing. One impression after 2000 games, completely different result after 30,000...
Uri Blass
Posts: 11152
Joined: Thu Mar 09, 2006 12:37 am
Location: Tel-Aviv Israel

Re: about wrong results in tests and wrong claims

Post by Uri Blass »

bob wrote:
Uri Blass wrote: From the Stockfish forum:

"If you perform a statistical test then *only* the final result is relevant. Not the intermediate results. So it makes no sense to say that 'after (say) 200 games the result was significant'."

This is clearly a wrong claim.

If you test A against B, let's say for 1,000 games, and after 500 games the result is A 500 - B 0, while after 1,000 games the result is 500-500, then it is clear that something is wrong with the results.

Even if you see A 450 - B 50 after 500 games and a final result of A 500 - B 500, it is clear that something is wrong with the results.

For example, it is possible that A got more CPU time in the first 500 games of the match.

In order to solve this problem, it is better to record the number of nodes per second of A and B (call them n(A) and n(B)) in every game, and simply not include games in which n(A)/n(B) is significantly different from the expected value.

Unfortunately, it seems that the Stockfish team prefers to close its eyes and not check the possibility that something is wrong in part of their games.

I think that in spite of this noise, changes that pass stage I and stage II are usually productive, because cases in which A gets significantly more than 50% of the CPU time do not happen in enough games to force bad changes to pass the tests. But I guess there are cases when one program gets significantly more than 50% of the CPU time (say 55%), not in a single game but in some hundreds of consecutive games.

I do not plan to program a tool to prevent this problem, so the Stockfish team may not like my post, but at least I have a reason to suspect that there is a problem, based on watching the results.
Here's a question to ponder. If you generate a string of 500 random numbers between 0 and 99, what would you conclude if you got 500 zeroes? Flawed? Perhaps. The probability for getting 500 zeros is not zero, however, so drawing a conclusion would be wrong.

I see significant swings in results in my cluster testing. One impression after 2000 games, completely different result after 30,000...
The probability is not zero, but it is small enough to be practically sure that something in the generation is wrong.

It is the same as the case when two people write the same, or almost the same, chess program.
It is obvious that at least one of the programs is not original, even if in theory it is possible, with small probability, that two different people wrote the same chess program independently.
bob
Posts: 20943
Joined: Mon Feb 27, 2006 7:30 pm
Location: Birmingham, AL

Re: about wrong results in tests and wrong claims

Post by bob »

Uri Blass wrote:
bob wrote:
Uri Blass wrote: From the Stockfish forum:

"If you perform a statistical test then *only* the final result is relevant. Not the intermediate results. So it makes no sense to say that 'after (say) 200 games the result was significant'."

This is clearly a wrong claim.

If you test A against B, let's say for 1,000 games, and after 500 games the result is A 500 - B 0, while after 1,000 games the result is 500-500, then it is clear that something is wrong with the results.

Even if you see A 450 - B 50 after 500 games and a final result of A 500 - B 500, it is clear that something is wrong with the results.

For example, it is possible that A got more CPU time in the first 500 games of the match.

In order to solve this problem, it is better to record the number of nodes per second of A and B (call them n(A) and n(B)) in every game, and simply not include games in which n(A)/n(B) is significantly different from the expected value.

Unfortunately, it seems that the Stockfish team prefers to close its eyes and not check the possibility that something is wrong in part of their games.

I think that in spite of this noise, changes that pass stage I and stage II are usually productive, because cases in which A gets significantly more than 50% of the CPU time do not happen in enough games to force bad changes to pass the tests. But I guess there are cases when one program gets significantly more than 50% of the CPU time (say 55%), not in a single game but in some hundreds of consecutive games.

I do not plan to program a tool to prevent this problem, so the Stockfish team may not like my post, but at least I have a reason to suspect that there is a problem, based on watching the results.
Here's a question to ponder. If you generate a string of 500 random numbers between 0 and 99, what would you conclude if you got 500 zeroes? Flawed? Perhaps. The probability for getting 500 zeros is not zero, however, so drawing a conclusion would be wrong.

I see significant swings in results in my cluster testing. One impression after 2000 games, completely different result after 30,000...
The probability is not zero, but it is small enough to be practically sure that something in the generation is wrong.

It is the same as the case when two people write the same, or almost the same, chess program.
It is obvious that at least one of the programs is not original, even if in theory it is possible, with small probability, that two different people wrote the same chess program independently.
My point was, the data is incomplete. Looking at 200 games out of 30,000 is not very revealing. And it gives a GREAT chance for introducing bias. If you pick a random point BEFORE the match starts, saying "if A is winning significantly after 1000 games I will stop" you introduce error, but not nearly as much as if you say "if A is winning significantly at any point during the match I will stop".

Obviously, if the experiment is flawed, all bets are off. But 99+% of experiments are good, once the initial details are worked out. I have RARELY seen an issue with my cluster testing. Every "problem" I have encountered in the past is checked for before, during and after matches, automatically, to be sure no bias is introduced. In such a circumstance, 200 wins in a row just means 200 wins in a row. Self-test seems to make this worse. I ran a test last week while traveling, using my MacBook. After 200 games, it was about 2/3 to 1/3 wins (ignoring the usual 25% draws) in favor of the new version. After 30K games, the old version was better.

I don't believe in stopping early, but certainly never to accept a change, only to reject one, since I would rather err by rejecting a good change than by accepting a bad one...
jundery
Posts: 18
Joined: Thu Mar 14, 2013 5:57 am

Re: about wrong results in tests and wrong claims

Post by jundery »

Uri Blass wrote:
bob wrote:
Uri Blass wrote: From the Stockfish forum:

"If you perform a statistical test then *only* the final result is relevant. Not the intermediate results. So it makes no sense to say that 'after (say) 200 games the result was significant'."

This is clearly a wrong claim.

If you test A against B, let's say for 1,000 games, and after 500 games the result is A 500 - B 0, while after 1,000 games the result is 500-500, then it is clear that something is wrong with the results.

Even if you see A 450 - B 50 after 500 games and a final result of A 500 - B 500, it is clear that something is wrong with the results.

For example, it is possible that A got more CPU time in the first 500 games of the match.

In order to solve this problem, it is better to record the number of nodes per second of A and B (call them n(A) and n(B)) in every game, and simply not include games in which n(A)/n(B) is significantly different from the expected value.

Unfortunately, it seems that the Stockfish team prefers to close its eyes and not check the possibility that something is wrong in part of their games.

I think that in spite of this noise, changes that pass stage I and stage II are usually productive, because cases in which A gets significantly more than 50% of the CPU time do not happen in enough games to force bad changes to pass the tests. But I guess there are cases when one program gets significantly more than 50% of the CPU time (say 55%), not in a single game but in some hundreds of consecutive games.

I do not plan to program a tool to prevent this problem, so the Stockfish team may not like my post, but at least I have a reason to suspect that there is a problem, based on watching the results.
Here's a question to ponder. If you generate a string of 500 random numbers between 0 and 99, what would you conclude if you got 500 zeroes? Flawed? Perhaps. The probability for getting 500 zeros is not zero, however, so drawing a conclusion would be wrong.

I see significant swings in results in my cluster testing. One impression after 2000 games, completely different result after 30,000...
The probability is not zero, but it is small enough to be practically sure that something in the generation is wrong.

It is the same as the case when two people write the same, or almost the same, chess program.
It is obvious that at least one of the programs is not original, even if in theory it is possible, with small probability, that two different people wrote the same chess program independently.
This shows the problem with your approach, though: it ignores all other considerations and assumes the 1,000 games you are looking at are the only data points to consider.

At a 95% confidence level the results are statistically significant. At a 99% confidence level they are no longer significant. So saying they are significant depends on the certainty you want.

What you really need to do is run a series of experiments that can then be replicated and verified by others. Running code (or a mathematical proof) will beat out email discussions every time. SPRT had a lot of opposition from people who are now vocal supporters on the Stockfish forum; once the verifiable evidence was presented, the switch in position was almost immediate.
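
For readers who have not seen it, here is a simplified sketch of how an SPRT stopping rule works. It uses a plain win/loss Bernoulli model with draws ignored and example elo0/elo1 bounds; this is an illustration of the idea, not the exact formula fishtest uses:

```python
# Sequential probability ratio test sketch: keep updating a log-likelihood
# ratio as results arrive and stop as soon as it crosses one of two bounds.
import math

def expected_score(elo):
    return 1.0 / (1.0 + 10.0 ** (-elo / 400.0))

def sprt_bounds(alpha=0.05, beta=0.05):
    return math.log(beta / (1.0 - alpha)), math.log((1.0 - beta) / alpha)

def sprt_state(wins, losses, elo0=0.0, elo1=5.0):
    """Return ('accept H1' | 'accept H0' | 'continue', llr) under a win/loss model."""
    p0, p1 = expected_score(elo0), expected_score(elo1)
    llr = wins * math.log(p1 / p0) + losses * math.log((1.0 - p1) / (1.0 - p0))
    lower, upper = sprt_bounds()
    if llr >= upper:
        return 'accept H1', llr
    if llr <= lower:
        return 'accept H0', llr
    return 'continue', llr

print(sprt_state(wins=620, losses=540))   # this margin is not yet decisive: ('continue', ~1.0)
```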
petero2
Posts: 733
Joined: Mon Apr 19, 2010 7:07 pm
Location: Sweden
Full name: Peter Osterlund

Re: about wrong results in tests and wrong claims

Post by petero2 »

bob wrote: Here's a question to ponder. If you generate a string of 500 random numbers between 0 and 99, what would you conclude if you got 500 zeroes? Flawed? Perhaps. The probability for getting 500 zeros is not zero, however, so drawing a conclusion would be wrong.
If every atom in the observable universe had performed this experiment every nanosecond since the big bang, the probability of obtaining a string of 500 zeros at least once would be approximately:

0.01^500*1e80*13.8e9*365*24*3600*1e9 ~= 1e-894

If we assume that there were a googol of universes that all did this experiment in parallel, the probability for the sequence to come up at least once would be:

1e100*1e-894 = 1e-794

or to make it even more clear:

.0000000000000000000000000000000000000000000000000000000000000000000\
00000000000000000000000000000000000000000000000000000000000000000000\
00000000000000000000000000000000000000000000000000000000000000000000\
00000000000000000000000000000000000000000000000000000000000000000000\
00000000000000000000000000000000000000000000000000000000000000000000\
00000000000000000000000000000000000000000000000000000000000000000000\
00000000000000000000000000000000000000000000000000000000000000000000\
00000000000000000000000000000000000000000000000000000000000000000000\
00000000000000000000000000000000000000000000000000000000000000000000\
00000000000000000000000000000000000000000000000000000000000000000000\
00000000000000000000000000000000000000000000000000000000000000000000\
00000000000000000000000000000000000000000000001

So I say it would be really stupid not to assume that something is wrong with the experiment.
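
As a quick sanity check of these orders of magnitude, the same estimate can be reproduced in log10 space (the raw numbers underflow ordinary floats). The inputs below are the same assumptions as in the post: 1e80 atoms, 13.8e9 years, one trial per atom per nanosecond.

```python
# Reproduce the order-of-magnitude estimate above using base-10 logarithms.
import math

log10_p_single = 500 * math.log10(0.01)   # one specific 500-zero string: 10^-1000
log10_trials = math.log10(1e80) + math.log10(13.8e9 * 365 * 24 * 3600 * 1e9)

print(log10_p_single + log10_trials)        # about -893.4, matching the ~1e-894 figure above
print(log10_p_single + log10_trials + 100)  # a googol of universes: about -793.4, i.e. ~1e-794
```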
Don
Posts: 5106
Joined: Tue Apr 29, 2008 4:27 pm

Re: about wrong results in tests and wrong claims

Post by Don »

bob wrote: I don't believe in stopping early, but certainly never to accept a change, only to reject one, since I would rather err by rejecting a good change than by accepting a bad one...
To get the most bang for your buck you have to embrace intuition-guided testing to a certain extent. For example, WHAT you test generally has no strict scientific basis; you simply test things based on what you consider sound chess principles. We don't try things randomly, so a lot of judgement is involved even in the decision on what to try.

The types of changes we make fall into different categories too. Some are not very interesting and some are very interesting. We usually have a strong sense of whether a change is speculative or just conservative. A speculative change might get more testing before we throw it out, simply because it is like a speculative financial investment: it has more risk but also more reward. We don't mind losing a conservative change that we already know is unlikely to provide more than 1 or 2 Elo at best.

We also might not be very interested in a particular change - so we don't lose any sleep over replacing it with a more exciting change if it's not testing well at the moment.

So what happens is that if a change starts out well, we tend to keep it running longer, which means there is some manipulation of results. But it's a one-way street: we never accept a change that hasn't been tested to full completion, and we accept that we sometimes throw out good changes.

It's all about getting the most from the CPU resources we have. Stopping a test that starts out pretty badly has some statistical basis, because we don't test anything that we believe will test poorly anyway. So if a test starts out poorly (with a non-trivially bad start), we know the odds in favor of a good result are much lower, and we will decide not to continue.

I actually intend to formalize this at some point so that the "intuition" part is spelled out in advance of the test. We would rate our interest in the test, estimate the payoff and so on in advance, and then the test would run itself, stopping early based on some formalized stopping rules.

I already did a study and came up with something more general, based on stopping the test early in EITHER case but being far stricter about accepting changes. Using Monte Carlo simulation I created a workable system (which we have not implemented yet) for stopping a test without our input. The simulation measured the types of regressions we would suffer over time under various stopping rules. Of course this is not based on any a priori inputs to the testing procedure from our own judgment calls, but those could be folded in.
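
That study is not public, so the following is only a toy Monte Carlo in the same spirit; every detail (the logistic game model, the fixed 30% draw rate, the checkpoint size and thresholds) is an illustrative assumption rather than Don's actual procedure. It estimates how often changes of a given true strength survive a simple early-rejection rule:

```python
# Toy simulation of an early-rejection stopping rule: reject a change as soon
# as its running score drops below a threshold, otherwise decide after n_max games.
import random

def play_games(elo, n, draw_rate=0.3):
    """Simulate n games for a change whose expected score follows the logistic
    Elo curve, with an assumed fixed draw rate."""
    expected = 1.0 / (1.0 + 10.0 ** (-elo / 400.0))
    p_win = (expected - draw_rate / 2.0) / (1.0 - draw_rate)
    score = 0.0
    for _ in range(n):
        r = random.random()
        if r < draw_rate:
            score += 0.5
        elif r < draw_rate + (1.0 - draw_rate) * p_win:
            score += 1.0
    return score

def run_test(true_elo, n_max=4000, check_every=500, reject_below=0.49):
    """Return True if the change is accepted under the early-rejection rule."""
    score, played = 0.0, 0
    while played < n_max:
        score += play_games(true_elo, check_every)
        played += check_every
        if score / played < reject_below:
            return False                     # early rejection
    return score / played > 0.5              # final decision after n_max games

random.seed(1)
for true_elo in (-5, 0, 5):
    accept_rate = sum(run_test(true_elo) for _ in range(300)) / 300
    print(f"true gain {true_elo:+d} Elo: accepted {accept_rate:.0%} of the time")
# Varying reject_below, check_every and n_max shows the trade-off Don mentions:
# stricter early rejection saves games but throws away more good changes.
```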

These methods are no good for publishing scientific results, however. When comparing two programs, or even a single change, and making claims, the test must be run under much stricter conditions to be statistically admissible; but we are not doing this as a science experiment, we are trying to achieve practical results.
Capital punishment would be more effective as a preventive measure if it were administered prior to the crime.