about wrong results in tests and wrong claims

Discussion of chess software programming and technical issues.

Moderator: Ras

S.Taylor
Posts: 8514
Joined: Thu Mar 09, 2006 3:25 am
Location: Jerusalem Israel

Re: about wrong results in tests and wrong claims

Post by S.Taylor »

bob wrote:
Uri Blass wrote:From the stockfish forum:

"If you perform a statistical test then *only* the final result is relevant. Not the intermediate results. So it makes no sense to say that "after (say) 200 games the result was significant"

This is clearly a wrong claim

If you test A against B (let's say for 1000 games) and you see after 500 games that the result is A 500 B 0, and after 1000 games the result is 500-500, then it is clear that something is wrong with the results.

Even if you see A 450 B 50 after 500 games and A 500 B 500 after 1000 games, it is clear that something is wrong with the results.

For example it is possible that A got more cpu time in the first 500 games of the match.

In order to solve this problem, it is better to record the number of nodes per second of A and B (call them n(A) and n(B)) in every game, and simply not include games where n(A)/n(B) is significantly different from the expected value.
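
A minimal sketch of such a filter, assuming hypothetical per-game nps records (this is not something the testing framework currently stores in this form):

Code: Select all

EXPECTED_RATIO = 1.0   # expected n(A)/n(B) when both engines get equal CPU time
TOLERANCE = 0.15       # drop games where the ratio is off by more than 15%

def keep_game(nps_a, nps_b):
    """Return True if the game's speed ratio looks normal."""
    ratio = nps_a / nps_b
    return abs(ratio - EXPECTED_RATIO) <= TOLERANCE * EXPECTED_RATIO

# hypothetical per-game records
games = [
    {"result": "1-0", "nps_a": 1_050_000, "nps_b": 1_020_000},
    {"result": "0-1", "nps_a": 1_040_000, "nps_b": 610_000},  # B was starved of CPU
]
clean = [g for g in games if keep_game(g["nps_a"], g["nps_b"])]
print(len(clean), "of", len(games), "games kept")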

Unfortunately it seems that the stockfish team prefer to close their eyes and not check the possibility that something is wrong in part of their games.

I think that in spite of this noise, changes that pass stage I and stage II are usually productive, because cases when A gets significantly more than 50% of the CPU time do not happen in enough games to force bad changes to pass the tests. But I guess there are cases when one program gets significantly more than 50% of the CPU time (let's say 55%), not in a single game but over some hundreds of consecutive games.

I do not plan to program a tool to prevent this problem, so the Stockfish team may not like my post, but at least I have a reason, based on watching results, to suspect that there is a problem.
Here's a question to ponder. If you generate a string of 500 random numbers between 0 and 99, what would you conclude if you got 500 zeroes? Flawed? Perhaps. The probability of getting 500 zeros is (1/100)^500 = 10^-1000, which is astronomically small but not zero, so drawing a firm conclusion would be wrong.

I see significant swings in results in my cluster testing. One impression after 2000 games, completely different result after 30,000...
Maybe then, in 1972, even though Fischer "crushed" Spassky to win the world championship, Spassky was actually MUCH stronger than Fischer at chess, and should have easily beaten Fischer, according to all this.
So perhaps all human competition means practically nothing.
lucasart
Posts: 3243
Joined: Mon May 31, 2010 1:29 pm
Full name: lucasart

Re: about wrong results in tests and wrong claims

Post by lucasart »

Uri Blass wrote: 1) My point is that the time of 'stockfish bench' may not be constant on some machines, and there may be cases when another process runs on the same machine and steals CPU time from only one of the programs.
Here we have to assume that testers are not stupid, and are not running 15 concurrent games on a 4 core machine, while it's busy encoding a video. On a lightweight Linux distribution, the effect you describe is negligible, but with bulky Windows allowed... who knows ;-)
Uri Blass wrote: 2) SPRT is also based on the assumption that all the changes are near zero Elo.

Let's take an extreme case, which does not happen in practice, to show why SPRT is not optimal.
Imagine that 10% of the patches have a serious bug that causes them to lose every game (in practice this is not the case; there are bad patches that lose 30 Elo because of bugs, but I take an extreme example to show that SPRT is not optimal under practical assumptions).

If you use SPRT, the bad patches need to run for 100 games or so to be proved bad.

In practice I can simply have a rule to stop after losing 40-0 with no draws, which means rejecting the bad patches faster without a significant probability of rejecting a good patch (most of the games are draws, so the probability of a good patch losing the first 40 consecutive games is clearly less than 1/10^20; practically I can say that it is not going to happen, and I clearly save games in 10% of the cases).
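
A back-of-the-envelope check of that probability, assuming a 60% draw rate and an equal-strength patch:

Code: Select all

draw_rate = 0.60                      # assumed draw rate at these time controls
loss_prob = (1 - draw_rate) / 2       # per-game loss probability for an equal-strength patch
p_40_losses = loss_prob ** 40         # probability of losing 40 games in a row

print(f"P(40 consecutive losses) = {p_40_losses:.1e}")   # about 1e-28, far below 1e-20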

The testing time that I save is not very big, but I gain something even if I save only 0.01% of my testing time with practically no damage (not counting the theoretical, less than 0.000000000000000001%, probability of rejecting a good patch).

Note also that my rule is not optimal and I made it only to show that SPRT is not optimal.
Ask yourself this:
* How often will your 40-0 rule kick in ?
* If it does, how many games on average will it save ?

The key to optimization is to understand where time is wasted, and fix that. You're basically being "penny wise and pound foolish" here.

What you need is to reduce the number of tests that are selected for the long TC test, which is the real pain. Most patches do not scale well, and something that scores a tiny gain is most likely not going to pass the long TC test.

So in order to save time, I would advocate for using elo0=0 and elo1=6 also in the short TC test.
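
For reference, a minimal sketch of one common normal approximation of the SPRT log-likelihood ratio for match results (a simplification; fishtest's actual implementation differs in the details):

Code: Select all

from math import log

def sprt_llr(wins, draws, losses, elo0, elo1):
    """Approximate log-likelihood ratio of H1 (elo >= elo1) vs H0 (elo <= elo0)."""
    if wins == 0 or losses == 0:
        return 0.0
    n = wins + draws + losses
    w, d = wins / n, draws / n
    s = w + d / 2                        # observed score per game
    var = (w + d / 4) - s * s            # per-game variance of the score
    s0 = 1 / (1 + 10 ** (-elo0 / 400))   # expected score under H0
    s1 = 1 / (1 + 10 ** (-elo1 / 400))   # expected score under H1
    return (s1 - s0) * (2 * s - s0 - s1) / (2 * var / n)

alpha = beta = 0.05
lower, upper = log(beta / (1 - alpha)), log((1 - beta) / alpha)   # about -2.94 and +2.94

llr = sprt_llr(wins=1200, draws=2500, losses=1150, elo0=0, elo1=6)
# accept the patch if llr > upper, reject it if llr < lower, otherwise keep playing
print(round(llr, 2), round(lower, 2), round(upper, 2))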

PS: These discussions should be held in the FishCooking forum, which was created for that. The whole intent is to reduce the audience (fewer trolls, hopefully) and to avoid spamming talkchess with Stockfish-testing-specific stuff that not everyone is interested in. So why do you keep trolling this forum with arguments that started in the FishCooking forum ?
Theory and practice sometimes clash. And when that happens, theory loses. Every single time.
Uri Blass
Posts: 11152
Joined: Thu Mar 09, 2006 12:37 am
Location: Tel-Aviv Israel

Re: about wrong results in tests and wrong claims

Post by Uri Blass »

I think that in order to decide if some testing method is optimal, we need to have some model of what happens in practice.

For example, SPRT with the existing values may give very bad results if you believe that 99.99% of the patches lose 0.1 Elo while 0.01% of the patches gain 1 Elo.

In this case you can expect many bad patches that lose 0.1 Elo to pass both stage I and stage II only because they are lucky, and you can expect a reduction in playing strength even if you do a million different tests,
when of course it would be possible to get some small Elo improvement in the same time simply by playing at least 1 million games before deciding to accept a patch.

Of course this case does not happen, but my point is that you first need to have an a priori opinion about the distribution of Elo change over patches in order to decide what the optimal way to test is.
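
To make the point concrete, here is a rough check of that hypothetical model (a sketch only: a single fixed-length test that accepts anything scoring above 50%, an assumed 60% draw rate, and a normal approximation, which is far simpler than fishtest's two-stage SPRT):

Code: Select all

from math import erfc, sqrt

def pass_prob(elo, n_games, draw_rate=0.60):
    """P(score > 50% over n_games) for a patch of the given true Elo."""
    s = 1 / (1 + 10 ** (-elo / 400))     # expected score per game
    var = s - draw_rate / 4 - s * s      # per-game variance of the score
    se = sqrt(var / n_games)             # standard error of the mean score
    return 0.5 * erfc((0.5 - s) / (se * sqrt(2)))

n = 20_000
p_bad, p_good = pass_prob(-0.1, n), pass_prob(+1.0, n)   # roughly 0.47 and 0.74

# per 1,000,000 submitted patches under the assumed prior
accepted_bad = 999_900 * p_bad        # each accepted bad patch costs 0.1 Elo
accepted_good = 100 * p_good          # each accepted good patch gains 1.0 Elo
net_elo = accepted_good * 1.0 - accepted_bad * 0.1
print(round(p_bad, 2), round(p_good, 2), round(net_elo))   # net change is hugely negative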
Uri Blass
Posts: 11152
Joined: Thu Mar 09, 2006 12:37 am
Location: Tel-Aviv Israel

Re: about wrong results in tests and wrong claims

Post by Uri Blass »

I keep using this forum because the Stockfish forum is not really meant for discussing ideas.

I remember that you accused me of trolling when I used the Stockfish forum to discuss ideas, so in spite of the fact that I used the Stockfish forum for that, I prefer to continue the discussion mainly here.

Uri
Uri Blass
Posts: 11152
Joined: Thu Mar 09, 2006 12:37 am
Location: Tel-Aviv Israel

Re: about wrong results in tests and wrong claims

Post by Uri Blass »

I can add that I think this subject may also be interesting for other programmers; it is not about Stockfish's code but about how to test
things correctly.

Note that I do not think that using elo0=0 and elo1=6 also in the short TC test is a good idea, and I suspect that some of the productive changes that mainly help at longer time controls are going to fail at the short TC if you use elo0=0 and elo1=6.
mcostalba
Posts: 2684
Joined: Sat Jun 14, 2008 9:17 pm

Re: about wrong results in tests and wrong claims

Post by mcostalba »

Uri Blass wrote: so the stockfish team may not like my post but at least I have a reason to suspect that there is a problem based on watching results.
Uri, you are starting from a dubious assumption and creating a theory out of it: unfortunately a well-known pattern that we see almost on a daily basis (I'm not referring to you specifically, but in general many "technical" discussions start from a not-a-problem and spawn an incredible amount of corresponding not-a-solutions).

Uri, you are discussing a possible artifact in this result:

http://tests.stockfishchess.org/tests/v ... 6ac564d845

Code: Select all

Tasks
Idx 	Worker 	Last Updated 	Played 	Wins 	Losses 	Draws 	Crashes
0 	rap-3cores 	2 days ago 	571 / 1000 	123 	114 	334 	0
1 	vdbergh-5cores 	2 days ago 	664 / 1000 	143 	144 	377 	0
2 	jromang-3cores 	2 days ago 	558 / 1000 	123 	131 	304 	0
3 	glinscott-23cores 	2 days ago 	1000 / 1000 	189 	217 	594 	0
4 	iisiraider-6cores 	2 days ago 	788 / 1000 	146 	162 	480 	0
5 	bpfliegel-3cores 	2 days ago 	528 / 1000 	111 	93 	324 	0
6 	tinker-3cores 	2 days ago 	580 / 1000 	117 	122 	341 	0
7 	tvijlbrief-1cores 	2 days ago 	151 / 1000 	30 	22 	99 	0
8 	iisiraider-6cores 	2 days ago 	1000 / 1000 	218 	205 	577 	0
9 	jkiiski-3cores 	2 days ago 	438 / 1000 	103 	89 	246 	0
10 	jundery-3cores 	2 days ago 	552 / 1000 	116 	118 	318 	0
11 	mthoresen-3cores 	2 days ago 	432 / 1000 	96 	94 	242 	0
12 	tinker-3cores 	2 days ago 	301 / 1000 	63 	63 	175 	0
13 	fastgm-3cores 	2 days ago 	552 / 1000 	109 	104 	339 	0
14 	rkl-3cores 	2 days ago 	585 / 1000 	115 	130 	340 	0
15 	tvijlbrief-3cores 	2 days ago 	433 / 1000 	93 	91 	249 	0
16 	drabel-1cores 	2 days ago 	127 / 1000 	26 	31 	70 	0
17 	rkl-1cores 	2 days ago 	193 / 1000 	39 	42 	112 	0
18 	mhunt-3cores 	2 days ago 	497 / 1000 	83 	107 	307 	0
19 	jromang-7cores 	2 days ago 	445 / 1000 	97 	102 	246 	2
20 	rap-3cores 	2 days ago 	477 / 1000 	88 	100 	289 	0
21 	vdbergh-3cores 	2 days ago 	273 / 1000 	56 	45 	172 	0
22 	tvijlbrief-3cores 	2 days ago 	581 / 1000 	134 	128 	319 	0
23 	rap-3cores 	2 days ago 	257 / 1000 	58 	60 	139 	0
24 	rap-1cores 	2 days ago 	67 / 1000 	13 	15 	39 	0
25 	glinscott-23cores 	2 days ago 	1000 / 1000 	215 	215 	570 	0
26 	mthoresen-15cores 	2 days ago 	655 / 1000 	129 	140 	386 	0
27 	glinscott-23cores 	2 days ago 	307 / 1000 	54 	69 	184 	0
28 	iisiraider-6cores 	2 days ago 	8 / 1000 	1 	3 	4 	0
Could you please look at those numbers and point out which worker(s) start to misbehave and at what time (tasks are in chronological order) ?
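
One way to put numbers on that question (a sketch using a few of the rows above and a rough normal approximation):

Code: Select all

from math import sqrt

# (worker, wins, losses, draws) copied from a few rows of the table above
tasks = [
    ("rap-3cores",        123, 114, 334),
    ("glinscott-23cores", 189, 217, 594),
    ("iisiraider-6cores", 218, 205, 577),
    ("mhunt-3cores",       83, 107, 307),
]

total_w = sum(t[1] for t in tasks)
total_l = sum(t[2] for t in tasks)
total_d = sum(t[3] for t in tasks)
pooled = (total_w + total_d / 2) / (total_w + total_l + total_d)

for name, w, l, d in tasks:
    n = w + l + d
    score = (w + d / 2) / n
    var = (w + d / 4) / n - score * score        # per-game variance of the score
    z = (score - pooled) / sqrt(var / n)         # rough z-score vs. the pooled mean
    print(f"{name:20s} score={score:.3f} z={z:+.2f}")   # all |z| stay well below 3 here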
Don
Posts: 5106
Joined: Tue Apr 29, 2008 4:27 pm

Re: about wrong results in tests and wrong claims

Post by Don »

bob wrote:
Don wrote:
bob wrote: I don't believe in stopping early, but certainly never to accept a change, only to reject one, since I would rather err by rejecting a good change than by accepting a bad one...
To get the most bang for your buck you have to embrace intuition-guided testing to a certain extent. For example, WHAT you test generally has no strict scientific basis; you simply test things based on what you consider sound chess principles. We don't try things randomly, so a lot of judgement is involved even in the decision about what to try.

The types of changes we make fall into different categories too. Some are not very interesting and some are very interesting. We usually have a strong sense of whether a change is speculative or just conservative. A speculative change might get more testing before we throw it out simply because, like a speculative financial investment, it has more risk but also more reward. We don't mind losing a conservative change that we already know is unlikely to provide more than 1 or 2 ELO at best.

We also might not be very interested in a particular change - so we don't lose any sleep over replacing it with a more exciting change if it's not testing well at the moment.

So what happens is that if a change starts out well, we tend to keep it running longer, which means there is some manipulation of results. But it's a one-way street: we never accept a change that hasn't been tested to full completion, and we accept that we sometimes throw out good changes.

It's all about getting the most for the CPU resources we have. Stopping a test that starts out pretty badly has some statistical basis because we don't test anything anyway that we believe will test poorly. So if a test starts out poorly (with a non-trivial bad start) we know the odds in favor of a good result are much lower, and we will decide not to continue.

I actually intend to formalize this at some point so that the "intuition" part is spelled out in advance of the test. We would rate our interest in the test and estimate the payoff and such in advance and then the test would run itself, stopping early based on some formalized stopping rules.

I already did a study and came up with something more general, based on stopping the test early in EITHER case but being far stricter about accepting changes. Using Monte Carlo simulation I created a workable system (which we have not implemented yet) for stopping a test without our input. The simulation measured the types of regressions we would suffer over time based on various stopping rules. Of course this is not based on any a priori inputs to the testing procedure based on our own judgment calls but that could be folded in.

These methods are no good for publishing scientific results, however. When comparing two programs, or even a single change, and making claims, the test must be run under much stricter conditions to be statistically admissible; but we are not doing this as a science experiment, we are trying to achieve practical results.
My point still stands. If you want to pick an early stopping point, pick it before you start. Because for two equal programs, it is very likely one will pull ahead before things even out. And that leads to incorrect decisions...

Nothing wrong with stopping early, just so it is done in a sound way...
The Monte Carlo simulation I did was designed to avoid having to wade through the math and understand SPRT, possibly making a mistake. It was based on stopping precisely when you were ahead or behind by some number of games, and the number of games for stopping when behind was less than when stopping ahead. There was also a maximum number of games to be played should you not trigger a stopping rule (based on the number of opening positions in my large test book). If you did not trigger a stopping rule, the change was not kept. I simulated pairs of players that were equal in rating, 1 ELO weaker, and 1 ELO stronger, and I also simulated the draw expectancy that my tests actually return. I wanted to minimize the number of 0 and 1 ELO regressions kept while not missing too many 1 ELO improvements.
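
A sketch of that kind of simulation (the margins, draw rate and game counts below are made-up parameters, not Don's actual values):

Code: Select all

import random

DRAW_RATE = 0.60                                  # assumed draw expectancy
ACCEPT_MARGIN, REJECT_MARGIN, MAX_GAMES = 30, 15, 30_000

def accept_probability(elo, trials=1_000):
    """Estimate how often a patch of the given true Elo gets accepted."""
    s = 1 / (1 + 10 ** (-elo / 400))
    win_p, loss_p = s - DRAW_RATE / 2, 1 - s - DRAW_RATE / 2
    accepted = 0
    for _ in range(trials):
        lead = 0                                  # wins minus losses so far
        for _ in range(MAX_GAMES):
            r = random.random()
            if r < win_p:
                lead += 1
            elif r < win_p + loss_p:
                lead -= 1
            if lead >= ACCEPT_MARGIN:             # stop early and accept
                accepted += 1
                break
            if lead <= -REJECT_MARGIN:            # stop early and reject
                break
        # reaching MAX_GAMES without triggering a rule also counts as a rejection
    return accepted / trials

for elo in (-1, 0, +1):
    print(elo, accept_probability(elo))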

Probably the most important finding of my simulation is that it is going to be very difficult to make progress unless at least half of your candidate changes are actually improvements. For example, if you try ten things and 3 are 1 ELO improvements while the other 7 are 1 ELO regressions, the noise inherent in the test is going to be difficult to overcome. You have 7 chances to accept a regression.

Of course it gets much better if you are measuring big improvements, such as in the early days of program development where everything you try is a big improvement.

So what I learned was that you should probably not accept small improvements unless you have very strict criteria for accepting a change, something such as running multiple tests and requiring it to pass each one.

Unfortunately, there are some changes that actually test a bit weaker at hyper bullet time controls (such as most of us use for our initial test) but are beneficial changes at longer time controls. One could allow the first stage of such tests to be more like a registration process, to weed out the obviously weak versions.
Capital punishment would be more effective as a preventive measure if it were administered prior to the crime.
Uri Blass
Posts: 11152
Joined: Thu Mar 09, 2006 12:37 am
Location: Tel-Aviv Israel

Re: about wrong results in tests and wrong claims

Post by Uri Blass »

Marco,
I do not think that there is a problem in every test that you do
and my suspicion is that it can happen in some hundreds of consecutive games out of 100,000, so if you are lucky you are going to see no problem.

My suspicion was not about this test but about the following one,
where the first part of the test looked too good to be correct.
I do not think that my change was productive (it also got a final score of less than 50%), yet I saw a result that seemed too good to be correct (and I remember an even more significant result than 244-193 for my change earlier, but I did not save it).



http://tests.stockfishchess.org/tests/v ... 6ac564d856

Idx Worker Last Updated Played Wins Losses Draws Crashes
0 glinscott-23cores 2 days ago 1000 / 1000 244 193 563 1
1 glinscott-23cores 2 days ago 1000 / 1000 223 205 572 0
2 vdbergh-5cores 2 days ago 872 / 1000 156 185 531 0
3 mthoresen-15cores 2 days ago 1000 / 1000 197 196 607 0
4 glinscott-23cores 2 days ago 1000 / 1000 218 229 553 0
5 rkl-3cores 2 days ago 567 / 1000 102 110 355 0
6 tvijlbrief-3cores 2 days ago 516 / 1000 101 128 287 0
7 tinker-3cores 2 days ago 463 / 1000 104 94 265 0
8 mthoresen-15cores 2 days ago 1000 / 1000 187 208 605 0
9 glinscott-23cores 2 days ago 1000 / 1000 213 207 580 0
10 jundery-3cores 2 days ago 309 / 1000 69 61 179 0
11 jromang-3cores 2 days ago 305 / 1000 62 59 184 0
12 fastgm-3cores 2 days ago 221 / 1000 57 48 116 0
13 mthoresen-15cores 2 days ago 941 / 1000 183 204 554 0
14 jromang-7cores 2 days ago 322 / 1000 57 70 195 0
15 glinscott-23cores 2 days ago 704 / 1000 143 168 393 0
16 bpfliegel-3cores 2 days ago 134 / 1000 17 30 87 0
17 iisiraider-7cores 2 days ago 88 / 1000 10 20 58 0
18 iisiraider-7cores 2 days ago 90 / 1000 21 24 45 0

It is not the only case when I saw strange results.
Another case was more significant (see task number 4: 67-127 with 267 draws in what was supposed to be an almost 0 Elo change suggests that there was a problem).

http://tests.stockfishchess.org/tests/v ... 6ac564d863


Idx Worker Last Updated Played Wins Losses Draws Crashes
0 mthoresen-15cores 1 days ago 1000 / 1000 232 216 552 0
1 glinscott-23cores 1 days ago 1000 / 1000 216 212 572 0
2 mthoresen-15cores 1 days ago 1000 / 1000 210 202 588 0
3 glinscott-23cores 1 days ago 1000 / 1000 206 219 575 0
4 rap-3cores 1 days ago 461 / 1000 67 127 267 0
5 glinscott-23cores 1 days ago 1000 / 1000 207 211 582 0
6 glinscott-23cores 1 days ago 1000 / 1000 215 214 571 0
7 glinscott-23cores 1 days ago 1000 / 1000 209 212 579 0
8 glinscott-23cores 1 days ago 1000 / 1000 219 203 578 0
9 rkl-1cores 1 days ago 800 / 1000 168 152 480 0
10 glinscott-23cores 1 days ago 1000 / 1000 216 207 577 0
11 glinscott-23cores 1 days ago 1000 / 1000 216 207 577 0
12 jkiiski-3cores 1 days ago 1000 / 1000 182 210 608 0
13 jundery-3cores 1 days ago 231 / 1000 47 52 132 0
14 mthoresen-3cores 1 days ago 258 / 1000 56 51 151 0
15 tinker-3cores 1 days ago 246 / 1000 52 54 140 0
16 fastgm-3cores 1 days ago 243 / 1000 47 58 138 0
17 jromang-3cores 1 days ago 235 / 1000 54 50 131 0
18 glinscott-23cores 1 days ago 971 / 1000 207 218 546 0
19 mschmidt-4cores 1 days ago 221 / 1000 41 51 129 0
20 rkl-3cores 1 days ago 232 / 1000 45 51 136 0
21 tvijlbrief-3cores 1 days ago 237 / 1000 51 46 140 0
22 bpfliegel-3cores 1 days ago 223 / 1000 45 46 132 0
23 jkiiski-3cores 1 days ago 186 / 1000 44 36 106 0
24 vdbergh-5cores 1 days ago 278 / 1000 60 53 165 0
25 jromang-7cores 1 days ago 365 / 1000 70 86 209 0
26 rap-3cores 1 days ago 65 / 1000 13 12 40 0
27 rap-3cores 1 days ago 53 / 1000 8 10 35 0
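
A quick check of task 4 in that list against the hypothesis that the change is worth about 0 Elo (a rough normal approximation):

Code: Select all

from math import sqrt

w, l, d = 67, 127, 267                 # task 4 above
n = w + l + d
score = (w + d / 2) / n
var = (w + d / 4) / n - score * score
z = (score - 0.5) / sqrt(var / n)
print(f"score={score:.3f}  z={z:+.2f}")   # roughly 4.4 standard errors below 50%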
Don
Posts: 5106
Joined: Tue Apr 29, 2008 4:27 pm

Re: about wrong results in tests and wrong claims

Post by Don »

lucasart wrote:
Uri Blass wrote: SPRT assumes testing a simple hypothesis H0 against a simple hypothesis H1, but in practice there may be very bad changes, so it is not really a case of H0 against H1.
Do you really understand how SPRT works ?
Because the more I read your post, the less I understand any of it.

Are you trying to say that you would prefer a test of
H0: elo < elo0 against H1: elo > elo0 ?
Such sequential tests do exist, but they cannot be finite. The expected stopping time of such a test goes to infinity as elo gets closer to elo0, and when elo = elo0 (exactly) the test is (almost surely) infinite. The NAS algorithm is an example (non-monotonic adaptive sampling). I have experimented with such tests numerically, and the pain (much more costly) is not worth the gain (eliminating the grey zone elo0 < elo < elo1, which practically introduces an "Elo resolution" below which improvements cannot be reliably detected).
Uri Blass wrote: I believe that if nothing is wrong in the testing conditions,
stopping early when the result is very significant against the change (not only more than 95% but more than 99%) can practically save games, because there are certainly changes that reduce the Elo by 20 or 30 points while there are no changes that increase the Elo by 20 or 30 points, and SPRT is best or close to best only when practically all the changes are very small changes, which is not the case.
What's your point? SPRT will refute bad patches quicker if they are really bad. I recommend you write yourself a little simulator, and submit these misconceptions of yours to a numerical test, before you post...

And these 95% and 99% that you are talking about? What do you mean? Do you want to use the p-value? The p-value doesn't mean anything unless you predecide the number of games and look only after N games.
Uri Blass wrote: Note that I believe that there are cases when something is wrong in the testing conditions and the engines do not get the same cpu time
for 300 games or something like that.
If you are referring to fishtest, then I must disagree. The time that each test uses is rescaled based on the timing of './stockfish bench' on each machine.
We stopped using our distributed tester when we discovered that the machine a test is run on had a lot to do with the results even with this calibration.

Here is the problem. Program X runs better on AMD. Program Y runs better on Linux. But we calibrate with program Z.

These are not hypothetical examples; this is what actually happens, and your results are going to depend a lot on who is running the test at the time if you combine the results. In our case we discovered that Komodo runs better on Linux relative to any other strong program we are aware of. We used Stockfish for the calibration run, but that probably wasn't the issue. Our changes always looked great when our Linux testers were on-line.

I considered several ways to get around this. One is to only permit self-testing, one version of Komodo against another and in retrospect I think that is how it should be done. A particular change can still be influenced a little by hardware and OS but probably not by enough to worry too much about.

Another way is much more interesting - one could identify each user/machine, track its results over time, and "normalize" them. Our distributed tester reported the OS, platform and user, and conceivably you could attach a specific signature to each user/machine. The basic idea when doing foreign testing is to track the performance of each foreign program ON THAT MACHINE over time when running the same test. The system would notice how well each foreign program performed on each machine (relative to the others) and the appropriate adjustment could be made. Incremental adjustments would be made constantly, and new machines entering the system would start out with an estimated number based on the OS and hardware. If there was an exact match we would start with that. Over time the accuracy of the results would improve.
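
A sketch of how such per-machine tracking could work (a hypothetical data layout; an exponential moving average is just one possible way to make the incremental adjustments):

Code: Select all

from collections import defaultdict

ALPHA = 0.05   # smoothing factor for the incremental updates

# machine_bias[(machine_id, engine)] ~ how far this engine's score on this
# machine tends to deviate from its global average score
machine_bias = defaultdict(float)

def update_bias(machine_id, engine, game_score, global_avg_score):
    """Incrementally track how an engine over- or under-performs on a machine.

    game_score is 1.0 / 0.5 / 0.0 from the engine's point of view;
    global_avg_score is its average score across all machines so far.
    """
    key = (machine_id, engine)
    deviation = game_score - global_avg_score
    machine_bias[key] += ALPHA * (deviation - machine_bias[key])

def adjusted_score(machine_id, engine, game_score):
    """Correct a raw game score by the machine-specific bias estimate."""
    return game_score - machine_bias[(machine_id, engine)]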

We still have this issue to a small extent because both Larry and I have hardware we use to test on. Even though we do the bulk of our testing on Linux, all the machines are different. If you think that should not matter, consider that most of our improvements are in the 1-2 ELO range. After accumulating 3 or 4 of these types of changes we will sometimes run a combined test to prove that we have actually gained the several ELO we believe we have, and sometimes it comes out as no improvement at all.
Two ELO is at the threshold of what can be reliably shown to be an improvement.
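For a rough sense of scale, assuming a 60% draw rate in self-play, the standard error of a measured Elo difference after N games:

Code: Select all

from math import log, sqrt

DRAW_RATE = 0.60
per_game_var = 0.5 - DRAW_RATE / 4 - 0.25       # variance of a single game score
elo_per_score = 400 / (log(10) * 0.25)          # slope of Elo vs. score near 50%

for n in (5_000, 20_000, 50_000, 100_000):
    se_elo = elo_per_score * sqrt(per_game_var / n)
    print(n, round(se_elo, 2))   # about 3.1, 1.6, 1.0 and 0.7 Elo respectively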
Capital punishment would be more effective as a preventive measure if it were administered prior to the crime.
AlvaroBegue
Posts: 932
Joined: Tue Mar 09, 2010 3:46 pm
Location: New York
Full name: Álvaro Begué (RuyDos)

Re: about wrong results in tests and wrong claims

Post by AlvaroBegue »

Here's how I see the problem. Start with a prior distribution of the ELO difference for a proposed change to the engine (people with a lot of experience in automated testing may have a good idea of what this distribution should be).

A testing procedure with early stopping can be described by two functions: an acceptance function A(n), which indicates how many points you need to have after n games to accept the change, and a rejection function R(n), which indicates the number of points at or below which the change is rejected after n games.

Once you have these two ingredients (prior distribution and testing procedure), you can compute an average ELO improvement per game played (a tiny number). If anyone wants to know how this could be done, I can try to explain it in some detail.
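
One way such a computation could be set up (a sketch only: an assumed discrete prior over patch Elo, a 60% draw rate, A(n) and R(n) taken as simple linear boundaries on wins minus losses, and Monte Carlo instead of an exact calculation):

Code: Select all

import random

DRAW_RATE = 0.60
PRIOR = [(-2.0, 0.35), (-0.5, 0.40), (+0.5, 0.20), (+2.0, 0.05)]   # (Elo, weight)

def A(n):   # accept once the patch leads by this many points (wins minus losses)
    return 10 + 0.01 * n

def R(n):   # reject once the patch trails by this many points
    return 5 + 0.01 * n

def elo_per_game(max_games=10_000, trials=1_000):
    """Monte Carlo estimate of the average Elo gained per game played."""
    total_elo, total_games = 0.0, 0
    for _ in range(trials):
        elo = random.choices([e for e, _ in PRIOR], [w for _, w in PRIOR])[0]
        s = 1 / (1 + 10 ** (-elo / 400))
        win_p, loss_p = s - DRAW_RATE / 2, 1 - s - DRAW_RATE / 2
        lead, accepted = 0, False
        for n in range(1, max_games + 1):
            r = random.random()
            lead += 1 if r < win_p else (-1 if r < win_p + loss_p else 0)
            if lead >= A(n):
                accepted = True
                break
            if lead <= -R(n):
                break
        total_games += n
        if accepted:
            total_elo += elo
    return total_elo / total_games   # this is the quantity to maximize

print(elo_per_game())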

Now that we know what we are maximizing, some sort of optimization algorithm would let us design the optimal testing procedure.