Maybe then, in 1972, even though Fischer "crushed" Spassky to win the world championship, perhaps Spassky was MUCH stronger than Fischer in chess, and should have easily beaten Fischer, according to all this.

bob wrote:
Here's a question to ponder. If you generate a string of 500 random numbers between 0 and 99, what would you conclude if you got 500 zeroes? Flawed? Perhaps. The probability of getting 500 zeros is not zero, however, so drawing a conclusion would be wrong.

Uri Blass wrote:
From the stockfish forum:
"If you perform a statistical test then *only* the final result is relevant. Not the intermediate results. So it makes no sense to say that "after (say) 200 games the result was significant"
This claim is clearly wrong.
If you test A against B (say for 1000 games) and see after 500 games that the score is A 500, B 0, but after 1000 games the final score is 500-500, then it is clear that something is wrong with the results.
Even if you see A 450, B 50 after 500 games and a final score of A 500, B 500, it is clear that something is wrong.
For example, it is possible that A got more CPU time in the first 500 games of the match.
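A minimal sketch of the sanity check this implies (my own illustration, not Uri's code; draws are ignored and each game is modelled as a fair coin under the null hypothesis that A and B are equal):

import math

def window_zscore(wins_a, games):
    """How many standard deviations A's wins in a window of games are
    from an even match (p = 0.5 per game, draws ignored)."""
    expected = 0.5 * games
    sd = math.sqrt(games * 0.25)
    return (wins_a - expected) / sd

# Uri's example: A 450, B 50 after 500 games.
print(window_zscore(450, 500))  # ~17.9

A 450-50 split sits about 18 standard deviations from an even match, so a harness fault such as the CPU-time imbalance above is a far more plausible explanation than luck.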
In order to solve this problem, it is better to record the number of nodes per second of A and B (call them n(A) and n(B)) in every game, and simply not include games where n(A)/n(B) is significantly different from the expected value.
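A rough sketch of that filter (the function name keep_game and the 10% tolerance are my own hypothetical choices; the post itself does not specify a threshold):

def keep_game(n_a, n_b, expected=1.0, tol=0.10):
    """Keep a game only if the measured speed ratio n(A)/n(B) stays
    within tol of the expected ratio (1.0 for equally fast builds)."""
    return abs(n_a / n_b - expected) <= tol * expected

# Per-game records of (n(A), n(B), result) -- illustrative numbers.
games = [(1.02e6, 1.00e6, "1-0"), (1.40e6, 0.95e6, "1-0")]
clean = [g for g in games if keep_game(g[0], g[1])]
print(clean)  # the second game is dropped: A ran ~47% faster than expected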
Unfortunately, it seems that the Stockfish team prefers to close their eyes and not check the possibility that something is wrong in part of their games.
I think that in spite of this noise, changes that pass stage I and stage II are usually productive, because cases where A gets significantly more than 50% of the CPU time do not occur in enough games to let bad changes pass the tests. But I suspect there are cases when one program gets significantly more than 50% of the CPU time (say 55%), not in a single game but over some hundreds of consecutive games.
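A back-of-the-envelope sketch of why a few hundred consecutive biased games matter (the 0.6 score for the CPU-favoured block is an assumed figure, purely for illustration):

import math

def elo(score):
    """Elo difference implied by an overall match score."""
    return -400 * math.log10(1 / score - 1)

games, biased = 1000, 300
fair_score = 0.5     # equal engines, fair conditions
biased_score = 0.6   # assumed score while A enjoys ~55% of the CPU
overall = ((games - biased) * fair_score + biased * biased_score) / games
print(overall, elo(overall))  # 0.53 -> about +21 Elo that A never earned

Under these assumptions the harness alone manufactures an apparent gain of roughly +21 Elo, easily enough to make a neutral patch look like a clear improvement in a borderline test.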
I do not plan to program a tool to prevent this problem, so the Stockfish team may not like my post, but at least I have a reason, based on watching the results, to suspect that there is a problem.
I see significant swings in results in my cluster testing: one impression after 2,000 games, a completely different result after 30,000...
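For scale, a short sketch contrasting ordinary sampling noise with bob's 500-zeros thought experiment (draws ignored for simplicity):

import math

def score_se(n):
    """Standard error of a match score over n games (each game scored
    0 or 1 with p = 0.5; draws ignored for simplicity)."""
    return 0.5 / math.sqrt(n)

print(score_se(2000))   # ~0.011: 2,000-game scores swing by a percent or so
print(score_se(30000))  # ~0.003: 30,000 games pin the score much tighter

# bob's thought experiment: 500 independent uniform draws from 0..99 all
# being 0 has probability 100**-500 = 1e-1000 (too small for a float, so
# print its base-10 logarithm instead) -- "not zero", but overwhelming
# evidence that the generator is broken.
print(500 * math.log10(1 / 100))  # -1000.0

Swings of a percentage point or so between 2,000-game and 30,000-game results are expected noise; runs like 450-50, or 500 zeros in a row, are not.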
So perhaps all human competition means practically nothing.
