Does time control influence the ELO error bar?

hgm · Post by **hgm** » Sat Apr 25, 2009 6:41 am

Error bars for a given number of games are only dependent on win / draw / loss fractions. But these can depend on TC. See, for instance, Bob's post. So the error bars are dependent on TC.

But in practice only very weakly. The score-fraction error bar is sqrt(score*(1-score) - drawFraction/4) / sqrt(nrOfGames). So for near-equality (score fraction = 0.5), with 36% draws the numerator is 40%, while with 0% draws it would be 50%. So even if draws were to disappear completely in very fast games, the error bar would only be 1.25 times larger, and you would need about 1.56 (= 1.25 squared) times as many games to reach the same error bar as when the draw rate is 36%. So this is no reason to shy away from fast games.
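As a rough sketch of the formula above in Python (the function and variable names are only illustrative, not from any existing tool): it computes the one-sigma error bar of the score fraction and, because that error bar shrinks as 1/sqrt(nrOfGames), the factor by which the number of games must grow to compensate for a drop from 36% draws to 0% draws.

```python
import math

def score_error_bar(score, draw_fraction, games):
    """One-sigma error bar of the score fraction for a match of `games` games."""
    # Per-game variance of the result (win = 1, draw = 0.5, loss = 0).
    variance_per_game = score * (1.0 - score) - draw_fraction / 4.0
    return math.sqrt(variance_per_game) / math.sqrt(games)

# Per-game spread (games = 1) at equal strength, with and without draws.
with_draws = score_error_bar(0.5, 0.36, 1)
no_draws = score_error_bar(0.5, 0.0, 1)

# The error bar scales as 1/sqrt(games), so the game count scales with the square.
error_bar_ratio = no_draws / with_draws
games_ratio = error_bar_ratio ** 2

print(f"error bar ratio: {error_bar_ratio:.3f}, games ratio: {games_ratio:.3f}")
```

Under these assumptions the error bar ratio is 1.25 and the games ratio is its square, a little over 1.5.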

Of course the score itself can be different at another TC; this would be a valid reason to try longer TCs. In this case, deriving the scaling with TC at fast to moderate TC and extrapolating to long TC might be a more efficient method than measuring directly at long TC, as was remarked in another thread. But it is less reliable, as it does rely on an assumption, namely that the derived scaling continues to hold up to this longer TC. But a slightly less reliable estimate that you can do is of course highly preferable over a reliable one that you cannot do and thus do not have.

MattieShoes · Post by **MattieShoes** » Sat Apr 25, 2009 7:59 am

If we want to know how time control affects strength, and you're willing to say the strength line is continuous, wouldn't it make more sense to test with relatively few games with several different time controls and interpolate rather than test extensively at a fast TC and extrapolate?

I'm weak on the lingo, but here's what I was thinking

The faint red line is "real" strength, increasing as TC increases. I generated fake test results with random numbers and inverse norm distribution with stdev of 20 Elo, so error bars are 40 Elo, which should be about 95% confidence. The blue line is a linear regression based on the test results.
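A minimal sketch of this kind of experiment, for anyone who wants to replay it. The "real" strength line used here (1 Elo per minute, starting at 0) is a made-up assumption, as are all the names; only the 20 Elo standard deviation and the 1-60 minute range come from the description above.

```python
import random

def fit_line(xs, ys):
    """Ordinary least-squares fit; returns (slope, intercept)."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    return slope, mean_y - slope * mean_x

def simulate(true_slope=1.0, true_intercept=0.0, stdev=20.0, seed=1):
    """Fake one test result per time control (1..60 minutes) and fit a line."""
    rng = random.Random(seed)  # fixed seed so the run is repeatable
    minutes = list(range(1, 61))
    # Hypothetical "real" strength plus Gaussian measurement noise.
    measured = [true_intercept + true_slope * m + rng.gauss(0.0, stdev)
                for m in minutes]
    return fit_line(minutes, measured)

slope, intercept = simulate()
print(f"fitted slope: {slope:.2f} Elo/minute, intercept: {intercept:.1f} Elo")
```

Changing the seed gives a feel for how much the fitted line wanders around the assumed true one.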

So I guess what I'm wondering is if perhaps a plethora of "fuzzy points" in concatenation is perhaps better than one or two very well defined ones if we want this sort of test. My statistics knowledge is too weak to say with any sort of confidence, but this feels more right. . .

Assuming about 20 games per point, and the points represent 1-60 minutes per side, it'd be 1200 games and take a little bit longer than a single 32,000 game run of one-minute games on Dr. Hyatt's super-cluster. Of course, one could do something in between, more games fewer points... Extrapolating from one point just seems more dangerous though.

MattieShoes · Post by **MattieShoes** » Sat Apr 25, 2009 8:18 am

Though it occurs to me that 20 games isn't enough for 40 Elo error bars probably... Naturally, right after I can't edit the post
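A quick back-of-the-envelope check of that, as a sketch only. It assumes near-equal engines, the 36% draw rate from hgm's example, the usual logistic Elo formula, and a linear conversion from score to Elo around 50% (reasonable only for small differences). The function names are illustrative.

```python
import math

def elo_per_score_unit(score=0.5):
    """Slope of Elo(s) = 400*log10(s / (1 - s)) at the given score fraction."""
    return 400.0 / (math.log(10) * score * (1.0 - score))

def per_game_sd(score=0.5, draw_fraction=0.36):
    """Standard deviation of a single game result (win=1, draw=0.5, loss=0)."""
    return math.sqrt(score * (1.0 - score) - draw_fraction / 4.0)

def elo_error_bar(games, draw_fraction=0.36, score=0.5, sigmas=2.0):
    """Approximate Elo error bar (sigmas = 2 is roughly 95%) after `games` games."""
    sd = per_game_sd(score, draw_fraction) / math.sqrt(games)
    return sigmas * sd * elo_per_score_unit(score)

def games_for_error_bar(target_elo, draw_fraction=0.36, score=0.5, sigmas=2.0):
    """Approximate number of games needed to shrink the error bar to target_elo."""
    per_game_elo = per_game_sd(score, draw_fraction) * elo_per_score_unit(score)
    return math.ceil((sigmas * per_game_elo / target_elo) ** 2)

print(f"20 games: +/- {elo_error_bar(20):.0f} Elo")
print(f"games needed for +/- 40 Elo: {games_for_error_bar(40)}")
```

With these assumptions 20 games give an error bar of roughly 125 Elo, and a 40 Elo error bar needs just under 200 games per point.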

MattieShoes · Post by **MattieShoes** » Sat Apr 25, 2009 9:32 am

Okay, ignore the previous two posts. I've convinced myself that it'd take a very long time to get consistent results. I still like the idea of having multiple data points with different time controls, but there's no way to get around the time cost. So simply using a longer test as confirmation is just as efficient as anything.

I still like the idea of adding a tiny bit of variance to time controls to remove potential weirdness of a specific number, though. Perhaps the upside is marginal, but I can't think of any real downside...

nevatre · Post by **nevatre** » Sat Apr 25, 2009 9:42 am

Yes, and once that information is gathered it might be possible to use it to adapt the search/eval. parameters to the time control.

This is a digression, but what I would like to do is to find a way to use the information in each game apart from the final score.

For example it might be that there is evidence based on the average score of a benefit from a change, but the benefit is not statistically significant at the chosen confidence level. Is there any other evidence that can be gathered from the same experiment to help make the decision to investigate further or abandon the idea?

One idea would be to use the length of the games, perhaps the altered program is tougher resulting in longer losses, and shorter wins. If that is so it would add weight to the evidence from the average score, suggesting that it would be good to investigate further.

Those further investigations might be separate analyses for white/black and for other classes of position, in the hope of understanding the effect of the change. Another possibility is to do some analysis comparing the positions around where score drops occurred, in the hope of finding common theme(s). The results of these further investigations would suggest how the change could be adapted to make it more effective, or abandoned as hopeless.

For most programmers the resource expended on each test/experiment is large, so it makes sense to me to do some more analysis to try and make as much use as possible of the information gathered.

hgm · Post by **hgm** » Sat Apr 25, 2009 10:49 am

MattieShoes wrote:Okay, ignore the previous two posts. I've convinced myself that it'd take a very long time to get consistent results. I still like the idea of having multiple data points with different time controls, but there's no way to get around the time cost. So simply using a longer test as confirmation is just as efficient as anything.

I still like the idea of adding a tiny bit of variance to time controls to remove potential weirdness of a specific number, though. Perhaps the upside is marginal, but I can't think of any real downside...

This is basically a form of orthogonal multi-testing:

If you have to play a large number of games to measure the effect of some change, you might as well vary another parameter over this sample as well (in the same way in the test and the reference), so that you get information on the effect of that parameter for free.

In this case the other (or one of the other) parameters would be the TC. Instead of doing all games at 40/1 you spread them over 40/0:30 to 40/1:30, and get some info on how the change scales over this 3-fold increase of TC.
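One way this could be set up, as a sketch only. The helper names, the even spacing and the three-bucket split are assumptions for illustration, not part of any existing test harness. The same schedule would be played by both the test version and the reference version.

```python
def spread_time_controls(games, low_seconds=30.0, high_seconds=90.0):
    """Per-game time for 40 moves, spread evenly from 40/0:30 to 40/1:30."""
    if games == 1:
        return [(low_seconds + high_seconds) / 2.0]
    step = (high_seconds - low_seconds) / (games - 1)
    return [low_seconds + i * step for i in range(games)]

def mean_score_by_bucket(results, edges=(50.0, 70.0)):
    """Average score per TC bucket; `results` is a list of (seconds, score) pairs."""
    buckets = {"fast": [], "medium": [], "slow": []}
    for seconds, score in results:
        if seconds < edges[0]:
            buckets["fast"].append(score)
        elif seconds < edges[1]:
            buckets["medium"].append(score)
        else:
            buckets["slow"].append(score)
    # Skip empty buckets to avoid dividing by zero.
    return {name: sum(v) / len(v) for name, v in buckets.items() if v}

schedule = spread_time_controls(1000)
print(f"{len(schedule)} games from {schedule[0]:.0f}s to {schedule[-1]:.0f}s per 40 moves")
```

The overall score still uses all games, while comparing the buckets gives a (much noisier) hint of how the change scales with TC.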

krazyken · Post by **krazyken** » Sat Apr 25, 2009 1:06 pm

bob wrote:
krazyken wrote:here.
That's not exactly the correct subject. He asked if time control influenced the error bar, not the overall Elo rating...

Well, if you go to page two of that topic you will see:

bob wrote:
krazyken wrote:I would expect that longer time controls reduce the variability you would get in results (especially if you are avoiding randomization provided by the opening book). Thus I'd suspect the actual number of games needed to show a difference would be smaller with longer time controls.
Unfortunately you would be wrong. From a ton of prior testing...

Variability directly influences the size of the error bars.

mcostalba · Post by **mcostalba** » Sat Apr 25, 2009 1:12 pm

Ok, here is the same puzzle put another way, possibly more constrained.

Suppose I have engine A and engine B, and after testing at 1+0 with 10,000 games I found that A is stronger than B at 99.9% confidence.

I also know from a previous experiment that A stays stronger than B, with the SAME ELO difference, at the 40+0 time control.

Question: How many games at 40+0 do I need before I can state that A is stronger than B at 99.9% confidence?

Note: I don't need an absolute ELO value for the two engines, just to know that A is stronger than B. Suppose I know from previous tests that the ELO difference stays constant across the two time controls.

Hint: the average time per game at 1+0 is 0.75*(1 + 1) = 1.5 minutes, so the total testing time is 1.5*10000 = 15,000 minutes. At 40+0 a game takes 0.75*(40 + 40) = 60 minutes, so that corresponds to 15,000 / 60 = 250 games at 40+0.

Second hint: I don't know the answer

Third hint: mathematician is strongly required here

krazyken · Post by **krazyken** » Sat Apr 25, 2009 1:22 pm

mcostalba wrote:Ok, here is the same puzzle put another way, possibly more constrained.

Suppose I have engine A and engine B, and after testing at 1+0 with 10,000 games I found that A is stronger than B at 99.9% confidence.

I also know from a previous experiment that A stays stronger than B, with the SAME ELO difference, at the 40+0 time control.

Question: How many games at 40+0 do I need before I can state that A is stronger than B at 99.9% confidence?

Note: I don't need an absolute ELO value for the two engines, just to know that A is stronger than B. Suppose I know from previous tests that the ELO difference stays constant across the two time controls.

Hint: the average time per game at 1+0 is 0.75*(1 + 1) = 1.5 minutes, so the total testing time is 1.5*10000 = 15,000 minutes. At 40+0 a game takes 0.75*(40 + 40) = 60 minutes, so that corresponds to 15,000 / 60 = 250 games at 40+0.

Second hint: I don't know the answer

Third hint: mathematician is strongly required here

The answer will depend on what the ELO difference is. You basically want your 99.9% error bar to equal the ELO difference.
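A hedged sketch of what such a calculation might look like. It assumes the logistic Elo formula, a normal approximation, a one-sided 99.9% test, and a draw rate of 36% (borrowed from hgm's example earlier in the thread); none of this was given in the question, and the function names are illustrative. Note that, per hgm's formula, the time control enters only through the draw rate.

```python
import math
from statistics import NormalDist

def expected_score(elo_diff):
    """Expected score fraction of the stronger engine (logistic Elo model)."""
    return 1.0 / (1.0 + 10.0 ** (-elo_diff / 400.0))

def games_needed(elo_diff, draw_fraction=0.36, confidence=0.999):
    """Games needed so the score excess over 50% equals the error bar."""
    if elo_diff <= 0:
        raise ValueError("elo_diff must be positive")
    z = NormalDist().inv_cdf(confidence)  # one-sided z-value, about 3.09 for 99.9%
    score = expected_score(elo_diff)
    sd_per_game = math.sqrt(score * (1.0 - score) - draw_fraction / 4.0)
    # Require (score - 0.5) >= z * sd_per_game / sqrt(games), solved for games.
    return math.ceil((z * sd_per_game / (score - 0.5)) ** 2)

for diff in (10, 20, 50):
    print(f"{diff} Elo difference: about {games_needed(diff)} games")
```

Under these assumptions a difference around 50 Elo needs a few hundred games, in the same ballpark as the 250-game budget from the hint, while 10 Elo needs several thousand.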

mcostalba · Post by **mcostalba** » Sat Apr 25, 2009 1:32 pm

krazyken wrote:
mcostalba wrote:Ok, here is the same puzzle put another way, possibly more constrained.

Suppose I have engine A and engine B, and after testing at 1+0 with 10,000 games I found that A is stronger than B at 99.9% confidence.

I also know from a previous experiment that A stays stronger than B, with the SAME ELO difference, at the 40+0 time control.

Question: How many games at 40+0 do I need before I can state that A is stronger than B at 99.9% confidence?

Note: I don't need an absolute ELO value for the two engines, just to know that A is stronger than B. Suppose I know from previous tests that the ELO difference stays constant across the two time controls.

Hint: the average time per game at 1+0 is 0.75*(1 + 1) = 1.5 minutes, so the total testing time is 1.5*10000 = 15,000 minutes. At 40+0 a game takes 0.75*(40 + 40) = 60 minutes, so that corresponds to 15,000 / 60 = 250 games at 40+0.

Second hint: I don't know the answer

Third hint: mathematician is strongly required here
The answer will depend on what the ELO difference is. You basically want your 99.9% error bar to equal the ELO difference.

Yes! This is what I want. Are you able to come up with a nice formula?

In my humble opinion this could be a very nice subject for a paper or a student project in statistics, because it seems it has never been analyzed before... If you have a cluster, this is also a good way to consume tons of electricity; of course, this is ONLY my humble opinion.
