Does time control influences ELO error bar?

mcostalba · Post by **mcostalba** » Fri Apr 24, 2009 6:32 pm

Because of the title length limit, the title I posted is not quite right.

The correct question is:

Does the convergence speed of the ELO error bar depend on the time control?

For example, if two engines play at 1'+0", I see a lot of fluctuation in the first 1000 games, and the estimated ELO converges to the real ELO quite slowly.

Perhaps if I play at 40+40, I already have a good ELO estimate of the two engines after 500 games.

The rationale could be that what counts for a realistic idea of engine strength is the number of nodes/positions searched by the two engines when they play each other.

If I have tested the two engines for, say, 700 million nodes/positions searched, I will have a good estimate of their relative strength independently of the time control. This means that I need to play, say, 2000 games at 1+0 or just 200 games at 40+40.

This could explain why CCRL / CEGT give a good idea of an engine after just a few hundred games, while testing at blitz time controls takes thousands of games.

Has anybody ever measured the convergence speed versus the time control used?

Thanks
Marco

krazyken · Post by **krazyken** » Fri Apr 24, 2009 6:51 pm

I asked a similar question not too long ago. The result is that there is no evidence to suggest that the time control changes the shape of the results distribution.

bob · Post by **bob** » Fri Apr 24, 2009 8:26 pm

krazyken wrote:I asked a similar question not too long ago. The result is that there is no evidence to suggest that the time control changes the shape of the results distribution.

What it does is increase the number of draws, somewhat proportional to the time limit increase. More draws at longer time controls...

mcostalba · Post by **mcostalba** » Fri Apr 24, 2009 8:46 pm

Could you please post some links to the past discussions?

I am not able to find any info on this subject.

Thanks
Marco

bob · Post by **bob** » Fri Apr 24, 2009 10:30 pm

mcostalba wrote:Could you please post some links to the past discussions?

I am not able to find any info on this subject.

Thanks
Marco

My comment is not part of a previous discussion. It is an observation that I see daily.

Here are some samples from a current test:

Rank Name               Elo    +    - games score oppo. draws
   3 Crafty-23.1R02-1  2623    4    4 31128   53%  2601   22%
   9 Crafty-23.1R02-3  2587    4    4 31128   48%  2601   26%
  12 Crafty-23.1R02-4  2579    4    4 31128   47%  2601   28%

The -1 version is playing _very_ fast games, 10s+0.1s; the -3 version is playing 30s+0.5s; and the -4 version is playing 1m+1s. Notice the draws. The ratings don't mean much, as they are all combined, which makes them look a little fishy, but the draw percentage (last number on each line) goes up as games get longer.

krazyken · Post by **krazyken** » Fri Apr 24, 2009 10:33 pm

here.

MattieShoes · Post by **MattieShoes** » Sat Apr 25, 2009 2:56 am

I would think that since signal increases linearly and noise increases with sqrt(games), they would approach equal given a sufficient number of games. Essentially, each successive game offers less information than the one before it, so as the number of games approaches infinity, the useful information from an additional game approaches zero. Given the increased draw percentage at longer time controls, I'd think there would be a point at which longer games give more useful information than shorter ones in terms of error bars... I suppose one could calculate it out.

I think there's another issue though. One could think of strength as a function of both search efficiency and eval accuracy, but the coefficients change based on time control. So by changing the time control, you're also changing the strength of the engine you're trying to measure...

I'm starting to think some randomness is called for in the time control around the one you're testing for, just to eliminate the possibility of strange results caused by time management in the engines you're testing against (or your own). For instance, rather than testing 3 0 games only, throw in some 2 1, 2 2, 1 3, 1 4, 4 0, 3 1, etc. Odds are the results would end up the same, but if the time it takes is the same anyway, it seems a bit safer.
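A minimal sketch of that idea, assuming each pair of numbers means base minutes plus increment in seconds and that the test harness accepts a time control per game. The names `TIME_CONTROLS` and `schedule` are hypothetical, not part of any particular tool:

```python
import random

# (base minutes, increment seconds); the list from the post, units assumed
TIME_CONTROLS = [(3, 0), (2, 1), (2, 2), (1, 3), (1, 4), (4, 0), (3, 1)]


def schedule(num_games, seed=0):
    """Pick one time control per game, uniformly at random.

    A fixed seed keeps the schedule reproducible between test runs.
    """
    rng = random.Random(seed)
    return [rng.choice(TIME_CONTROLS) for _ in range(num_games)]


if __name__ == "__main__":
    for game, (base, inc) in enumerate(schedule(5), start=1):
        print(f"game {game}: {base} {inc}")
```

Using the same seed for both versions under test would give them the same sequence of time controls, so the comparison stays like for like.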

bob · Post by **bob** » Sat Apr 25, 2009 3:00 am

krazyken wrote:here.

That's not exactly the correct subject. He asked if time control influenced the error bar, not the overall Elo rating...

mcostalba · Post by **mcostalba** » Sat Apr 25, 2009 3:49 am

bob wrote:
krazyken wrote:here.
That's not exactly the correct subject. He asked if time control influenced the error bar, not the overall Elo rating...

Yes, I was asking about convergence speed. To put it in simple words:

If I need 2000 games at 1+0 to verify that engine A is stronger than engine B, how many games are needed if I play at 40+40?

This is somewhat different from the ELO difference, because it is entirely possible that at 1+0 engine A comes out 10 ELO stronger than B, while at 40+40 it comes out 5 ELO points weaker. But this should not change the fact that I can reliably verify these two different results using, for example, 2000 games in the first case and 400 games in the second.

I was thinking about the higher number of draws at longer time controls. I am not an expert in statistics, and perhaps someone here more versed in mathematics could help, but my feeling is that if there are more draws at longer time controls, then the fluctuations should also be smaller, because runs of won and/or lost games should be statistically less frequent. In other words, more draws could translate into less variance for longer games and faster convergence. But, again, a mathematician is needed here to confirm this idea.
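This intuition can be checked with a short derivation. It assumes that games are independent and that each game is a win (1 point), a draw (1/2) or a loss (0) with fixed probabilities; the symbols w (win rate), d (draw rate), s (expected score) and N (number of games) are introduced here only for illustration:

```latex
s = w + \tfrac{d}{2}, \qquad
\operatorname{Var}(X) = w \cdot 1^2 + d \cdot \left(\tfrac{1}{2}\right)^2 - s^2
                      = s(1 - s) - \tfrac{d}{4}, \qquad
\sigma_N = \sqrt{\frac{s(1 - s) - d/4}{N}}
```

At a fixed expected score s, a higher draw rate d lowers the per-game variance, so the standard error of the measured score after N games is smaller. Near s = 1/2 the variance is (1 - d)/4, so the number of games needed for a given error bar scales with (1 - d).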

MattieShoes · Post by **MattieShoes** » Sat Apr 25, 2009 4:29 am

Plugging those numbers in and assuming an overall 50% score, and holding testing time constant, you'd get 95% confidence error bars of:

32,000 games at  28 seconds/game - 0.49%
 8,960 games at 100 seconds/game - 0.91% (+0.42%)
 4,480 games at 200 seconds/game - 1.27% (+0.77%)

If we double the test time:

64,000 games at  28 seconds/game - 0.35%
17,920 games at 100 seconds/game - 0.64% (+0.29%)
 8,960 games at 200 seconds/game - 0.90% (+0.55%)

So it seems the increased draw % isn't enough to counteract the significantly fewer games, but the differences do decrease as testing time increases.
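The figures above can be reproduced with a short script. This is a sketch under stated assumptions: the draw rates of 22%, 26% and 28% from bob's sample are paired with 28, 100 and 200 seconds per game, the score is 50%, games are independent, and the "95%" bar is taken as 2 standard errors (that factor is a guess that happens to match the posted numbers). The function names are illustrative:

```python
import math


def score_error_bar(games, draw_rate, score=0.5, z=2.0):
    """Half-width of the interval on the score fraction.

    Per-game variance is score*(1-score) - draw_rate/4 for results
    of 1, 1/2 and 0 points; z=2.0 is an assumed multiplier.
    """
    variance = score * (1.0 - score) - draw_rate / 4.0
    return z * math.sqrt(variance / games)


# (seconds per game, draw rate); this pairing is an assumption
CONTROLS = [(28, 0.22), (100, 0.26), (200, 0.28)]


def table(total_seconds):
    """Games that fit in the time budget and the error bar in percent."""
    rows = []
    for seconds, draws in CONTROLS:
        games = total_seconds // seconds
        rows.append((games, seconds, 100.0 * score_error_bar(games, draws)))
    return rows


if __name__ == "__main__":
    for budget in (896_000, 1_792_000):  # seconds of total test time
        for games, seconds, bar in table(budget):
            print(f"{games:>6,} games at {seconds:>3} seconds/game - {bar:.2f}%")
        print()
```

Changing `CONTROLS` or the budget shows how far the draw rate would have to rise before the longer games catch up.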

Edit: Marco, from what Dr. Muller said, that is indeed the case: a higher draw % reduces the error bars when all else is equal.
Testing 51% vs 49% gauntlet score with 0% draws requires about 5000 games.
Testing 51% vs 49% gauntlet score with 35% draws requires about 3250 games.

But if those 3250 games take 2 hours apiece instead of 2 minutes apiece, you'd still hit the confidence interval faster testing 1 minute games...
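The game counts and the wall-clock comparison can be sketched as follows. The reading assumed here is that each version plays its own gauntlet of n independent games and the two scores must differ by at least 2 standard errors of their difference; this is one interpretation that lands close to the quoted numbers, not necessarily the exact calculation used. Pairing 0% draws with 2-minute games and 35% draws with 2-hour games is purely illustrative:

```python
def score_variance(score, draw_rate):
    """Per-game variance of the result (1, 1/2 or 0 points)."""
    return score * (1.0 - score) - draw_rate / 4.0


def games_needed(score_a, score_b, draw_rate, z=2.0):
    """Games per gauntlet so that the score gap equals z standard errors.

    The standard error of the difference of two independent scores is
    sqrt((var_a + var_b) / n); z=2.0 is an assumed multiplier.
    """
    gap = abs(score_a - score_b)
    total_var = score_variance(score_a, draw_rate) + score_variance(score_b, draw_rate)
    return z * z * total_var / (gap * gap)


def wall_clock_hours(games, seconds_per_game):
    """Total playing time in hours, ignoring any parallel games."""
    return games * seconds_per_game / 3600.0


if __name__ == "__main__":
    # (draw rate, seconds per game); hypothetical pairing for illustration
    for draws, seconds in ((0.0, 120), (0.35, 7200)):
        n = games_needed(0.51, 0.49, draws)
        hours = wall_clock_hours(n, seconds)
        print(f"draws {draws:.0%}: about {n:.0f} games, {hours:.0f} hours per gauntlet")
```

Under these assumptions the game counts come out within a few games of 5000 and 3250, and the longer games need far more total hours despite the smaller count.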
