Long Time Control Testing

CRoberson · Post by **CRoberson** » Thu Jun 07, 2007 1:25 am

I use both long time control and short time control gauntlets for testing.

I've noticed reduced variability in the long time control testing which is
why I do both.

Anybody else noticed this?

Bob mentioned something a while back that made me think he has.

I never test version vs version. I only test my engine revs vs a suite
of other engines within +/- 100 pts of its current performance.

It seems to me that 20 to 40 matches at long TC's are reliable.

hgm · Post by **hgm** » Thu Jun 07, 2007 11:17 am

Do you see identical games in the gauntlets of your different versions? Or partly identical games, and if so, after how many moves do the corresponding games start to deviate?

CRoberson · Post by **CRoberson** » Thu Jun 07, 2007 4:07 pm

It depends on how drastic the changes were. If the changes were small
then fully repeated games do exist. But, I haven't counted it.

Seems to me that is good. If longer time controls really produce
more consistent results, one might think that would lead to more
repeats. I'll pay attention to that.

hgm · Post by **hgm** » Thu Jun 07, 2007 4:41 pm

CRoberson wrote: It depends on how drastic the changes were. If the changes were small
then fully repeated games do exist.

Well, don't let Bob hear it!

But seriously, if you do have identical games between the tests, that certainly shows that the tests are not independent, and thus that the spread in their difference can be less than their individual standard errors.
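As a rough illustration of that point, the standard error of the difference between two test results depends on how strongly they are correlated. The sketch below uses the textbook formula for the variance of a difference; the function name and the correlation parameter `rho` are assumptions introduced here, not something measured in this thread.

```python
import math


def stderr_of_difference(se_a: float, se_b: float, rho: float) -> float:
    """Standard error of (A - B) for two results with correlation rho.

    Var(A - B) = Var(A) + Var(B) - 2 * rho * SD(A) * SD(B).
    rho = 0 means independent tests; rho near 1 means mostly repeated games.
    """
    if not -1.0 <= rho <= 1.0:
        raise ValueError("rho must lie in [-1, 1]")
    variance = se_a ** 2 + se_b ** 2 - 2.0 * rho * se_a * se_b
    # Guard against tiny negative values caused by floating-point rounding.
    return math.sqrt(max(variance, 0.0))
```

With equal individual standard errors, the difference has a smaller spread than either test alone once `rho` exceeds 0.5, which is the situation that partly repeated games would produce.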

The effect that you observe can then be caused by the games in different tests becoming more independent at shorter time controls, so that they tend to differ more after you make changes, which gives more variability.

I think the explanation Ed gave for this in another thread is quite believable: games that start out identical can at some point diverge because the decision to start one more iteration is taken in one case but not in the other, due to tiny slow-downs of the engine caused by other processes running on the same machine. This slow-down is probably limited to just 100 ms or so (if you are not doing heavy calculation jobs at the same time), so you can only get divergence if an iteration finishes within 100 ms of the nominal time that you use in this decision. The probability of this gets lower if the last iteration lasts longer.
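A back-of-the-envelope version of this argument, as a sketch only: assume (and this is purely a modelling assumption) that the moment an iteration finishes is uniformly spread over a window as long as the last iteration itself, and that a slow-down can only delay the finish by up to `jitter_s`. The chance that the delay pushes the finish across the nominal time is then roughly the ratio of the two. The function and parameter names are hypothetical.

```python
def divergence_probability(jitter_s: float, last_iteration_s: float) -> float:
    """Crude estimate of the chance that timing jitter flips the
    'start another iteration?' decision on one move.

    Assumes the iteration's finish time is uniformly distributed over a
    window of length last_iteration_s around the nominal time.
    """
    if jitter_s < 0 or last_iteration_s <= 0:
        raise ValueError("jitter_s must be >= 0 and last_iteration_s > 0")
    return min(1.0, jitter_s / last_iteration_s)


# Same 100 ms jitter, short versus long final iteration.
p_short = divergence_probability(0.1, 1.0)
p_long = divergence_probability(0.1, 10.0)
```

Under this toy model the per-move chance of divergence falls in proportion to the length of the last iteration, which is consistent with longer time controls producing more repeated games.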

CRoberson · Post by **CRoberson** » Thu Jun 07, 2007 5:20 pm

HGM,

All that makes sense. As far as machine "hiccups" are concerned, I
think my testing methods handle that. I use a dual proc machine
but no program is allowed to be SMP or allowed to ponder. Thus, at
no point in time are two procs used by the testing which allows the
second processor to handle the "hiccups".

Also, my timer algorithm may be the reason for the stability. It would
not be so affected by small "hiccups" in a long time control.

Here is the algorithm.

If a ply completes and the time is up, then stop searching.
During the search itself, don't interrupt the search unless
twice the allotted time has been used.
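A minimal sketch of the two rules above, assuming the engine asks a small helper object about elapsed time. The class name, method names, and the injectable `clock` are illustrative choices, not the actual engine code, and treating "twice the allotted time" as "greater than or equal" is also an assumption.

```python
import time
from typing import Callable


class TimeManager:
    """Two-threshold timer: a soft limit checked between plies and a
    hard limit (twice the allotted time) checked during the search."""

    def __init__(self, allotted_s: float,
                 clock: Callable[[], float] = time.monotonic) -> None:
        self.allotted_s = allotted_s
        self._clock = clock
        self._start = clock()

    def elapsed(self) -> float:
        return self._clock() - self._start

    def stop_after_ply(self) -> bool:
        # Rule 1: a ply has just completed; stop if the time is up.
        return self.elapsed() >= self.allotted_s

    def abort_mid_search(self) -> bool:
        # Rule 2: inside the search, interrupt only at twice the allotment.
        return self.elapsed() >= 2.0 * self.allotted_s
```

The engine would call `stop_after_ply()` once per completed iteration and `abort_mid_search()` periodically inside the search, for example every few thousand nodes.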

Charles

yoshiharu · Post by **yoshiharu** » Thu Jun 07, 2007 7:30 pm

CRoberson wrote: It depends on how drastic the changes were. If the changes were small
then fully repeated games do exist. But, I haven't counted it.

Seems to me that is good. If longer time controls really produce
more consistent results, one might think that would lead to more
repeats. I'll pay attention to that.

Why not try testing from an "iteratively increasing" set of opening positions? You could start with a small set of "very different" positions (maybe taking the letter and first digit of the corresponding ECO code), then refine the graining of your starting set more and more. Each iteration is guaranteed to give different games, and as the "resolution" you allow on the set of starting positions gets finer, you should be able to figure out whether the modification gave a good outcome "up to some graining". This should save you some time and give you a reliable statistical measure, while keeping the idea that "a good line remains a good line".
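One way this refinement could be sketched, assuming each opening position is labelled with a three-character ECO code such as "B20": pick one not-yet-used opening per ECO prefix, then lengthen the prefix on the next pass. The function name and the sample codes are hypothetical.

```python
from typing import Iterable, List, Set


def refine_openings(eco_codes: Iterable[str], prefix_len: int,
                    already_used: Set[str]) -> List[str]:
    """Pick one unused opening per ECO prefix of the given length.

    prefix_len=2 groups by letter plus first digit (the coarse set);
    prefix_len=3 uses the full code (the finest graining).
    """
    chosen = {}
    for code in sorted(eco_codes):
        if code in already_used:
            continue  # never replay a starting position from an earlier pass
        chosen.setdefault(code[:prefix_len], code)
    return list(chosen.values())


codes = ["A00", "A05", "A10", "B20", "B21", "C50"]
coarse = refine_openings(codes, 2, set())
finer = refine_openings(codes, 3, set(coarse))
```

Because already-used codes are skipped, each pass only adds new starting positions, so the games from a finer pass cannot repeat those from a coarser one.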

Cheers, Mauro

bob · Post by **bob** » Fri Jun 08, 2007 5:22 am

CRoberson wrote: I use both long time control and short time control gauntlets for testing.

I've noticed reduced variability in the long time control testing which is
why I do both.

Anybody else noticed this?

Bob mentioned something a while back that made me think he has.

I never test version vs version. I only test my engine revs vs a suite
of other engines within +/- 100 pts of its current performance.

It seems to me that 20 to 40 matches at long TC's are reliable.

long = less variance = true for me. But "less" is not very precise. I found the variance significant no matter what the time control. I have played 80-game matches (40 positions, 2 games per position) and the variance is horrible even at 40/2hr... I'm currently using 2560-game matches against a single opponent and am getting reproducible results with almost no variance.
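To put rough numbers on the match-length effect, here is a sketch of the standard error of a match score under the simplifying assumption that games are independent with fixed win and draw probabilities. The thread itself questions independence, so treat this as an idealised model; the function name and the probabilities used are made up for illustration.

```python
import math


def score_stderr(p_win: float, p_draw: float, n_games: int) -> float:
    """Standard error of the average score per game (win=1, draw=0.5,
    loss=0) over n_games independent games."""
    if n_games <= 0 or p_win < 0 or p_draw < 0 or p_win + p_draw > 1:
        raise ValueError("invalid probabilities or game count")
    mean = p_win + 0.5 * p_draw
    variance = p_win * 1.0 + p_draw * 0.25 - mean ** 2
    return math.sqrt(variance / n_games)


# Hypothetical evenly matched pairing with 30% draws.
se_80 = score_stderr(0.35, 0.30, 80)
se_2560 = score_stderr(0.35, 0.30, 2560)
```

Going from 80 to 2560 games is a 32-fold increase, so under this model the standard error shrinks by a factor of sqrt(32), a bit under 6.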
