Stats, Testing, Quit Early and Dumb Luck

CRoberson
Posts: 2055
Joined: Mon Mar 13, 2006 2:31 am
Location: North Carolina, USA

Stats, Testing, Quit Early and Dumb Luck

Post by CRoberson »

There are some conflicting thoughts here.

When we run a test of 100 games, which has an error margin of +/- 60 Elo, we are only saying that a test result inside this range is consistent with the two programs being equal. So, a test result yielding -55 Elo, or even +55 Elo, says that the two programs may be the same. IOW, there exists insufficient evidence to say that they are dissimilar. If we want to prove them dissimilar, then we can run another 100 games (for a total of 200), which yields an error margin of +/- 40 Elo. If it now says -50, then that is outside the margin and we have sufficient evidence to claim they are dissimilar. Now, understand these margins are at a given confidence level.

Now, for the flip side of that statement.

Let's say we start a run of 200 games, and after 100 games the results say the second program is 100 Elo worse than the first. Can we stop the testing? Statistical theory says yes, because we are outside the margin that corresponds to 100 games (+/- 60 Elo). Given that the result is outside the margin, we can safely say that one program is better than the other at the given confidence level. So, we can safely "Quit Early".
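
As a sketch of how that "quit early" rule could be mechanized, reusing elo_with_margin() from the snippet above (play_game() is a hypothetical stand-in for however you run one game and get back 1, 0.5 or 0):

Code: Select all

def run_match(play_game, max_games=200, check_every=100):
    w = d = l = 0
    for i in range(1, max_games + 1):
        r = play_game()
        if r == 1.0:
            w += 1
        elif r == 0.5:
            d += 1
        else:
            l += 1
        if i % check_every == 0:
            elo, lo, hi = elo_with_margin(w, d, l)
            # Whole interval on one side of zero: the difference is
            # significant at this confidence level, so quit early.
            if lo > 0 or hi < 0:
                return elo, lo, hi, i
    return elo_with_margin(w, d, l) + (max_games,)

One caveat: checking the interval repeatedly and stopping at the first significant result raises the effective false-positive rate above the nominal confidence level, which is why sequential testing methods treat early stopping more carefully.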

Now, here is the conflict.

Bob sees wild variations in results even for large numbers of games. Thus, we cannot be sure of a small run like 100 or 200 games. That is assuming I've interpreted Bob's statements correctly.

Here are my thoughts on why this conflict exists. IIRC, Bob made these observations on runs at extremely fast TCs. The horizon effect is larger in the shorter searches obtained during fast-TC runs than at longer TCs, and I think this causes the problem. In my tests, I don't see these large variations between runs. I do see variations, but not so large that I can't stick to statistical theory when making decisions. In fact, I see a number of repeated games. My TCs are varied: G/3 min + 2 sec, G/2 min + 1 sec, and 40 moves in 45 sec repeating with a 1 sec increment per move. Maybe that is the issue.

Thus, statistical theory is affected by the speed of the tests. So, the question becomes: what is the smallest TC at which we can safely trust statistical margins?
At what TCs does dumb luck not cause decision issues within the confidence levels?
CRoberson
Posts: 2055
Joined: Mon Mar 13, 2006 2:31 am
Location: North Carolina, USA

Re: Stats, Testing, Quit Early and Dumb Luck

Post by CRoberson »

Of course, the real issue is not TC but average depth of search; thus, different programs are affected differently. Still, the question remains: at what speed can we trust statistics, or must we all purchase clusters?
George Tsavdaris
Posts: 1627
Joined: Thu Mar 09, 2006 12:35 pm

Re: Stats, Testing, Quit Early and Dumb Luck

Post by George Tsavdaris »

CRoberson wrote: Now, here is the conflict.

Bob sees wild variations in results even for large numbers of games. Thus, we cannot be sure of a small run like 100 or 200 games. That is assuming I've interpreted Bob's statements correctly.

Here are my thoughts on why this conflict exists. IIRC, Bob made these observations on runs at extremely fast TCs. The horizon effect is larger in the shorter searches obtained during fast-TC runs than at longer TCs, and I think this causes the problem. In my tests, I don't see these large variations between runs. I do see variations, but not so large that I can't stick to statistical theory when making decisions. In fact, I see a number of repeated games. My TCs are varied: G/3 min + 2 sec, G/2 min + 1 sec, and 40 moves in 45 sec repeating with a 1 sec increment per move. Maybe that is the issue.
This is an interesting point, although I think you use the term "horizon effect" incorrectly here.
I guess by "horizon effect" you mean that as search depth increases, the quality of play increases, so there is less of a luck factor in the engines' moves, and so we need fewer games to test their relative strength.


The point you make is that for fast blitz games, for example, you need many more games to be sure about the engines' relative strength than you need at longer time controls.

I'm not at all sure whether this is true, and if it is, I'm very puzzled about why it happens. I guess it's because as the time control increases, search depth increases, so the quality of play increases, so mistakes decrease, and so the number of draws increases; it follows that the fluctuations in the engines' Elo, as more and more games are played, will be a lot smaller.
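
A quick way to sanity-check that intuition (an illustration, not from the original post): with games scored 1 / 0.5 / 0 and an expected score of 0.5, the per-game variance works out to (1 - draw_rate) / 4, so a higher draw rate directly shrinks the spread for a fixed number of games.

Code: Select all

# Per-game score variance at an expected score of 0.5, as a function of
# the draw rate: more draws -> less variance -> tighter error bars.
for draw_rate in (0.1, 0.3, 0.5, 0.7):
    print(draw_rate, (1.0 - draw_rate) / 4.0)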

Thus, statistical theory is affected by the speed of the tests. So, the question becomes: what is the smallest TC at which we can safely trust statistical margins?
Trust them as showing us the absolute truth about the engines' Elo? Never in the pedantic sense.
Only by playing all possible chess positions (with Black and with White) between all the engines in question could we say with absolute confidence which is the strongest engine.

But in the usual sense, when the confidence level is e.g. 95% and the error interval for eng-A versus eng-B is [0, X] with X > 0, one can say that eng-A is better than eng-B for all time controls.
bob
Posts: 20943
Joined: Mon Feb 27, 2006 7:30 pm
Location: Birmingham, AL

Re: Stats, Testing, Quit Early and Dumb Luck

Post by bob »

CRoberson wrote: There are some conflicting thoughts here.

When we run a test of 100 games, which has an error margin of +/- 60 Elo, we are only saying that a test result inside this range is consistent with the two programs being equal. So, a test result yielding -55 Elo, or even +55 Elo, says that the two programs may be the same. IOW, there exists insufficient evidence to say that they are dissimilar. If we want to prove them dissimilar, then we can run another 100 games (for a total of 200), which yields an error margin of +/- 40 Elo. If it now says -50, then that is outside the margin and we have sufficient evidence to claim they are dissimilar. Now, understand these margins are at a given confidence level.

Now, for the flip side of that statement.

Let's say we start a run of 200 games, and after 100 games the results say the second program is 100 Elo worse than the first. Can we stop the testing? Statistical theory says yes, because we are outside the margin that corresponds to 100 games (+/- 60 Elo). Given that the result is outside the margin, we can safely say that one program is better than the other at the given confidence level. So, we can safely "Quit Early".

Now, here is the conflict.

Bob sees wild variations in results even for large numbers of games. Thus, we cannot be sure of a small run like 100 or 200 games. That is assuming I've interpreted Bob's statements correctly.

Here are my thoughts on why this conflict exists. IIRC, Bob made these observations on runs at extremely fast TCs. The horizon effect is larger in the shorter searches obtained during fast-TC runs than at longer TCs, and I think this causes the problem. In my tests, I don't see these large variations between runs. I do see variations, but not so large that I can't stick to statistical theory when making decisions. In fact, I see a number of repeated games. My TCs are varied: G/3 min + 2 sec, G/2 min + 1 sec, and 40 moves in 45 sec repeating with a 1 sec increment per move. Maybe that is the issue.

Thus, statistical theory is affected by the speed of the tests. So, the question becomes: what is the smallest TC at which we can safely trust statistical margins?
At what TCs does dumb luck not cause decision issues within the confidence levels?
I believe the answer to that question is a TC long enough that you can see mate from the root position; otherwise "dumb luck" will always be present. Note that while I see "wild variation", I rarely see multi-sigma variations. If you use a 2-sigma window (2 sigma on either side), you have a pretty good chance of being correct when the two versions produce ratings whose 2-sigma windows do not overlap at all.
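
A minimal sketch of that 2-sigma overlap check, assuming each version's rating comes with a standard deviation (as reported by a tool such as BayesElo); the example numbers are made up:

Code: Select all

def windows_overlap(elo_a, sigma_a, elo_b, sigma_b, k=2.0):
    # Build a k-sigma window around each rating and test for overlap.
    lo_a, hi_a = elo_a - k * sigma_a, elo_a + k * sigma_a
    lo_b, hi_b = elo_b - k * sigma_b, elo_b + k * sigma_b
    return hi_a >= lo_b and hi_b >= lo_a

# Disjoint 2-sigma windows -> a pretty good chance the difference is real.
print(windows_overlap(2650, 5, 2672, 5))  # False: windows do not overlap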

I've recently run my quick 10s+0.1s test as well as 1+1 and 5+5, and I didn't notice any significant difference in how much the match scores jumped around. There is a _lot_ of luck in computer-vs-computer games. There is a lot more skill involved, of course. With enough games, you wash out the luck factor and zero in on the skill factor.

As for your "early exit": we are not seeing any -50 changes that would justify an early exit. We try to weed those out before we do any significant testing. Our changes are usually in the +/- 5 to +/- 10 range. Once in a blue moon we might see a +/- 15 to +/- 20. Yes, if you _really_ screw things up, you can quit quickly. I have seen some -200 and -300 results when I unintentionally broke something, such as combining two incompatible source packages so that something gets totally whacked and we either lose like mad or flag games like mad as the program crashes.
bob
Posts: 20943
Joined: Mon Feb 27, 2006 7:30 pm
Location: Birmingham, AL

Re: Stats, Testing, Quit Early and Dumb Luck

Post by bob »

CRoberson wrote: Of course, the real issue is not TC but average depth of search; thus, different programs are affected differently. Still, the question remains: at what speed can we trust statistics, or must we all purchase clusters?
I believe you can trust statistics at _any_ speed. I've never seen a "speed" term in any statistical formula I can recall. :)
CRoberson
Posts: 2055
Joined: Mon Mar 13, 2006 2:31 am
Location: North Carolina, USA

Re: Stats, Testing, Quit Early and Dumb Luck

Post by CRoberson »

bob wrote:
CRoberson wrote: Of course, the real issue is not TC but average depth of search; thus, different programs are affected differently. Still, the question remains: at what speed can we trust statistics, or must we all purchase clusters?
I believe you can trust statistics at _any_ speed. I've never seen a "speed" term in any statistical formula I can recall. :)

That is my point: maybe there should be one. Yes, luck exists until exhaustive search is possible. The flip side is that luck increases as you decrease search depth.

Does decreased search (a shorter TC) increase the variance of the results? Yes or no. If yes, then matches at shorter TCs have results that are less reliable than the same number of games at longer TCs. Of course, one could argue that the amount of luck increases equally for all programs involved, thus equalizing the issue, but it seems it should still increase the variance and thus widen the statistical margins.
bob
Posts: 20943
Joined: Mon Feb 27, 2006 7:30 pm
Location: Birmingham, AL

Re: Stats, Testing, Quit Early and Dumb Luck

Post by bob »

CRoberson wrote:
bob wrote:
CRoberson wrote: Of course, the real issue is not TC but average depth of search; thus, different programs are affected differently. Still, the question remains: at what speed can we trust statistics, or must we all purchase clusters?
I believe you can trust statistics at _any_ speed. I've never seen a "speed" term in any statistical formula I can recall. :)

That is my point: maybe there should be one. Yes, luck exists until exhaustive search is possible. The flip side is that luck increases as you decrease search depth.

Does decreased search (a shorter TC) increase the variance of the results? Yes or no. If yes, then matches at shorter TCs have results that are less reliable than the same number of games at longer TCs. Of course, one could argue that the amount of luck increases equally for all programs involved, thus equalizing the issue, but it seems it should still increase the variance and thus widen the statistical margins.
I don't quite agree that luck decreases as depth goes up. I might agree if you said "luck decreases as the depth of _one_ player goes up", since that player will see more and likely make fewer bad choices, and when he does make a bad choice, he still has a few plies hidden from his opponent in which he might be able to correct things before they really go bad.

The probability of "luck" is based on the total depth of the game tree, from move N to the end of the game. Since a 15- or 20-ply search is a small fraction of that space, the difference in luck is not that great, assuming both programs get the same increased depth. As you get nearer to the end of the game, luck drops because the size of the remaining tree shrinks, until it eventually reaches the size of your search space and you can see the "ultimate truth" with no error or luck involved at all.

As for the variance? No. In fact, the variance (the error bar calculation) doesn't even include the time, just the results. And the error bar does not change with respect to anything other than the number of wins, the number of losses, and the number of draws.

I get about +/-4 after 40,000 games regardless of the time control. By the time I get to 200,000 games, I am down to around +/- 2 or so.
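
Those two data points are consistent with the error bar shrinking as 1/sqrt(N): quadrupling the games roughly halves the margin. A rough sketch, assuming a 40% draw rate and a score near 50% (both assumptions, not figures from this post; the absolute values also depend on the tool's model and confidence level):

Code: Select all

import math

def elo_margin(n_games, draw_rate=0.4, z=1.96):
    var = (1.0 - draw_rate) / 4.0      # per-game score variance at p = 0.5
    se = math.sqrt(var / n_games)      # standard error of the mean score
    # Near a 50% score, one unit of score is about 1600/ln(10) Elo.
    return z * se * 1600.0 / math.log(10.0)

for n in (40000, 160000, 200000):
    print(n, round(elo_margin(n), 2))  # the margin halves when N quadruples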
Michael Sherwin
Posts: 3196
Joined: Fri May 26, 2006 3:00 am
Location: WY, USA
Full name: Michael Sherwin

Re: Stats, Testing, Quit Early and Dumb Luck

Post by Michael Sherwin »

To use fewer games for a meaningful test, it might be a good idea to test the new version from slightly inferior positions. It would be difficult for dumb luck to play much of a role there. In even positions with lots of choices, it is easy for dumb luck to exceed the error margins.
Desperado
Posts: 879
Joined: Mon Dec 15, 2008 11:45 am

Re: Stats, Testing, Quit Early and Dumb Luck

Post by Desperado »

Hello,

Just beside the current topic: can someone refer me to
a source where the terms "error bar calculation" and "error margin"
are explained? :oops:

I want to know where all the numbers you are using came from.

Thx, Michael
bob
Posts: 20943
Joined: Mon Feb 27, 2006 7:30 pm
Location: Birmingham, AL

Re: Stats, Testing, Quit Early and Dumb Luck

Post by bob »

Desperado wrote: Hello,

Just beside the current topic: can someone refer me to
a source where the terms "error bar calculation" and "error margin"
are explained? :oops:

I want to know where all the numbers you are using came from.

Thx, Michael
Search for BayesElo. Remi has a web page that explains what it does and how, and what the various values it displays measure.