SPRT is better than a fixed number of games, but not ideal.

jundery wrote:
This shows the problem with your approach though, ignoring all other considerations, and assuming the 1000 tests you are looking at are the only data points to consider.

Uri Blass wrote:
The probability is not zero but it is small enough to be practically sure that something in the generation is wrong.

bob wrote:
Here's a question to ponder. If you generate a string of 500 random numbers between 0 and 99, what would you conclude if you got 500 zeroes? Flawed? Perhaps. The probability of getting 500 zeros is not zero, however, so drawing a conclusion would be wrong.

Uri Blass wrote:
From the stockfish forum:
"If you perform a statistical test then *only* the final result is relevant. Not the intermediate results. So it makes no sense to say that "after (say) 200 games the result was significant"
This is clearly a wrong claim.
If you test A against B (say for 1000 games) and the score after 500 games is A 500 B 0, while the final score after 1000 games is A 500 B 500, then it is clear that something is wrong with the results.
Even if you see A 450 B 50 after 500 games and A 500 B 500 after 1000 games, it is clear that something is wrong with the results.
For example, it is possible that A got more CPU time in the first 500 games of the match.
In order to solve this problem it is better to record the number of nodes per second of A and B (call them n(A) and n(B)) in every game, and simply not include games where n(A)/n(B) is significantly different from the expected value.
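A minimal sketch of that filtering idea in Python. The game-record layout, the expected ratio of 1.0, and the 10% tolerance are all illustrative assumptions, not anything Stockfish actually uses:

```python
# Hypothetical post-filter: drop games where the measured NPS ratio
# between engines A and B deviates too far from the expected value.
# Each game record is assumed to be a (result, nps_a, nps_b) tuple.

def filter_games(games, expected_ratio=1.0, tolerance=0.10):
    kept = []
    for result, nps_a, nps_b in games:
        ratio = nps_a / nps_b
        # Keep the game only if n(A)/n(B) is within 10% of expectation.
        if abs(ratio / expected_ratio - 1.0) <= tolerance:
            kept.append(result)
    return kept

games = [("1-0", 1.00e6, 1.01e6),   # normal game: kept
         ("1-0", 1.40e6, 0.95e6),   # A got far more CPU: excluded
         ("0-1", 0.99e6, 1.02e6)]   # normal game: kept
print(filter_games(games))  # → ['1-0', '0-1']
```

Only the results of games with a plausible NPS balance would then enter the statistics.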
Unfortunately it seems that the Stockfish team prefers to close its eyes and not check the possibility that something is wrong in part of their games.
I think that in spite of this noise, changes that pass stage I and stage II are usually productive, because cases when A gets significantly more than 50% of the CPU time do not last for enough games to make bad changes pass the tests. But I guess there are cases when one program gets significantly more than 50% of the CPU time (say 55%), not in a single game but over some hundreds of consecutive games.
I do not plan to program a tool to prevent this problem,
so the Stockfish team may not like my post, but at least I have a reason to suspect that there is a problem, based on watching the results.
I see significant swings in results in my cluster testing. One impression after 2000 games, completely different result after 30,000...
It is the same as the case when two people write the same, or almost the same, chess program.
It is obvious that at least one of the programs is not original, even though in theory it is possible, with small probability, that two different people write the same chess program independently.
At 95% confidence the results are statistically significant. At 99% confidence they are no longer significant. So whether they are "significant" depends on the certainty you want.
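A concrete illustration with made-up numbers: 535 wins out of 1000 decisive games against a supposedly equal opponent is significant at the 95% level but not at the 99% level. An exact two-sided binomial test shows this:

```python
from math import comb

def two_sided_p(wins, games):
    # Exact binomial test of H0: P(win) = 0.5, taking twice the
    # one-sided tail probability P(X >= wins) as the two-sided p-value.
    tail = sum(comb(games, k) for k in range(wins, games + 1)) / 2 ** games
    return min(1.0, 2 * tail)

p = two_sided_p(535, 1000)
print(p < 0.05, p < 0.01)  # → True False
```

The same data are "significant" or "not significant" depending purely on the threshold chosen in advance.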
What you really need to do is to run a series of experiments, that then can be replicated and verified by others. Running code (or mathematical proof) will beat out email discussions every time. SPRT had a lot of opposition from people that are now vocal supporters on the Stockfish forum, once the verifiable evidence was presented the switch in position was almost immediate.
SPRT assumes a test of a simple hypothesis H0 against a simple hypothesis H1, but in practice there may be very bad changes, so the real situation is not just H0 against H1.
I believe that if nothing is wrong in the testing conditions,
stopping early when the result is very significant against the change (not only beyond 95% but also beyond 99% confidence) can save games in practice, because there certainly are changes that reduce strength by 20 or 30 Elo points, while there are no changes that increase it by 20 or 30 Elo points. SPRT is optimal, or close to optimal, only when practically all changes are very small, and that is not the case.
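For reference, the SPRT decision rule itself is short. Here is a toy version over wins and losses only; real frameworks such as fishtest also model draws, and the elo0, elo1, alpha, and beta values below are illustrative, not anyone's actual test parameters:

```python
from math import log

def elo_to_score(elo):
    # Expected score from an Elo difference (logistic model).
    return 1.0 / (1.0 + 10.0 ** (-elo / 400.0))

def sprt_state(wins, losses, elo0=0.0, elo1=5.0, alpha=0.05, beta=0.05):
    """Toy SPRT on wins/losses only, testing H0: elo=elo0 vs H1: elo=elo1."""
    p0, p1 = elo_to_score(elo0), elo_to_score(elo1)
    # Log-likelihood ratio of the observed wins/losses under H1 vs H0.
    llr = wins * log(p1 / p0) + losses * log((1 - p1) / (1 - p0))
    lower = log(beta / (1 - alpha))   # stop and accept H0 below this
    upper = log((1 - beta) / alpha)   # stop and accept H1 above this
    if llr >= upper:
        return "accept H1"
    if llr <= lower:
        return "accept H0"
    return "continue"

print(sprt_state(300, 250))  # → continue
```

The test keeps playing games until the accumulated log-likelihood ratio crosses one of the two bounds, which is why it has no fixed game count.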
Note that I believe there are cases when something is wrong in the testing conditions and the engines do not get the same CPU time
for 300 games or so.
These cases do not happen often, but I guess that they do happen, based on the data that I saw (of course I am not sure about it, and it is not an extreme case like the 500 zeros that I described).
