Debate: testing at fast time controls

Discussion of anything and everything relating to chess playing software and machines.

Moderator: Ras

Graham Banks
Posts: 44636
Joined: Sun Feb 26, 2006 10:52 am
Location: Auckland, NZ

Re: Debate: testing at fast time controls

Post by Graham Banks »

Dann Corbit wrote:
Kirill Kryukov wrote: {snip}
Do you seriously look for useful novelties in engine-engine games, even in long time controls?
Yes.
Found some, too.
Not surprising to me. :wink:
gbanksnz at gmail.com
Steelman

Re: Debate: testing at fast time controls

Post by Steelman »

bob wrote:
Kempelen wrote:And when you repeat tests at fast time controls, do you see repetition in the results? I have noted that when repeating tests at fast time controls, the results change more than when repeating at slow ones. Have you noticed something similar?
That only means you are not playing enough games. To get to the +/- 4 Elo level, you need to play 40,000 games or so. And +/- 4 still leaves a significant margin for error even with that many games...
How many games would be required for about a +/- 20 Elo margin?
And why not test at both fast and slower speeds? Slower being no less than 20- to 30-minute games. I wish these games could be played at more like 60 or 90 minutes, but that would take (even for Bob) some time.

I ask this because at these fast speeds the playing strength is affected a great deal. The positional and even tactical abilities are reduced. Would that not also affect the evaluation of positions? I have "tuned" my evaluation to play at much slower speeds, not fast ones. I would think that some values would need to be adjusted for fast speeds? Or is this not true?

So my vote is no. I don't think testing at fast speeds gives you the "real" data you are looking for. The data at slower speeds would have to be more accurate and a truer test of playing strength. Unless of course the program is intended to play speed chess all the time.
bob
Posts: 20943
Joined: Mon Feb 27, 2006 7:30 pm
Location: Birmingham, AL

Re: Debate: testing at fast time controls

Post by bob »

Steelman wrote:
bob wrote:
Kempelen wrote:And when you repeat tests at fast time controls, do you see repetition in the results? I have noted that when repeating tests at fast time controls, the results change more than when repeating at slow ones. Have you noticed something similar?
That only means you are not playing enough games. To get to the +/- 4 Elo level, you need to play 40,000 games or so. And +/- 4 still leaves a significant margin for error even with that many games...
How many games would be required for about a +/- 20 Elo margin?
And why not test at both fast and slower speeds? Slower being no less than 20- to 30-minute games. I wish these games could be played at more like 60 or 90 minutes, but that would take (even for Bob) some time.

I ask this because at these fast speeds the playing strength is affected a great deal. The positional and even tactical abilities are reduced. Would that not also affect the evaluation of positions? I have "tuned" my evaluation to play at much slower speeds, not fast ones. I would think that some values would need to be adjusted for fast speeds? Or is this not true?

So my vote is no. I don't think testing at fast speeds gives you the "real" data you are looking for. The data at slower speeds would have to be more accurate and a truer test of playing strength. Unless of course the program is intended to play speed chess all the time.
The basic idea is that if you double the number of games, you reduce the error by a factor of four, as a pretty close estimate. So to get to +/- 1, 64K games would be needed. To reduce the error rate by a factor of 2 (to get to +/- 2) you would need approximately 46,000 games, considering that 32,000 games is +/- 4. (sqrt(2) * 32000)

To estimate the time: a 1+1 time control at 32,000 games takes around 12 hours, which translates to 32,000 / 12 = about 2,700 games per hour. Using 256 processors, that is about 10 games per hour per CPU, or 6 minutes a game on average. There is some lost time in running things on the cluster, but that is close. If you want to multiply the time control by 10, to get to roughly 30 minutes per side per game or 60 minutes total time per game, the test suddenly takes 10x longer, or about 5 full days. That's too long to measure and tweak...
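As an editorial aside, the throughput arithmetic in the post above can be sanity-checked with a few lines of Python. The figures used (32,000 games in ~12 hours on 256 processors at a 1+1 time control) are taken from the post itself; everything else is simple derived arithmetic, not a claim about the actual cluster setup:

```python
# Sanity check of the cluster throughput arithmetic.
# Assumed inputs, taken from the post: 32,000 games in ~12 hours
# on 256 processors at a 1+1 time control.
games = 32000
hours = 12
cpus = 256

games_per_hour = games / hours                   # ~2667 games/hour overall
games_per_hour_per_cpu = games_per_hour / cpus   # ~10.4 games/hour per CPU
minutes_per_game = 60 / games_per_hour_per_cpu   # ~5.8 minutes per game

# Multiplying the time control by 10 multiplies the wall-clock time by ~10:
hours_at_10x = hours * 10                        # 120 hours = 5 full days
print(round(games_per_hour), round(minutes_per_game, 1), hours_at_10x / 24)
```

This matches the post's "about 10 games per hour per cpu", "6 minutes a game", and "about 5 full days" figures to within rounding.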
bob
Posts: 20943
Joined: Mon Feb 27, 2006 7:30 pm
Location: Birmingham, AL

Re: Debate: testing at fast time controls - update

Post by bob »

It is apparently worse than I thought at first look. Here's a sample from current testing...

Code: Select all

   1 Toga2               2665    2    4 428010   59%  2601   22%
   2 Glaurung 2.1        2663    3    2 428010   58%  2601   21%
   3 Crafty-22.9R17-12   2608    3    3  93384   51%  2597   21%
There were lots of other versions of Crafty playing, which I omitted so as not to make this too long. But a _bunch_ of differently tuned versions of Crafty played against 4 opponents, a total of almost 2 million games (done over the past 2-3 days).

Notice the first two lines, with 428010 games each and a +2/-4 or +3/-2 error margin.

So it takes a _ton_ of games to get down to +/- 2 or lower. Way more than I originally thought.
Hart

Re: Debate: testing at fast time controls

Post by Hart »

If the error is a function of the square of the sample size, then don't you need 4x as many games to get half the error?
hgm
Posts: 28391
Joined: Fri Mar 10, 2006 10:06 am
Location: Amsterdam
Full name: H G Muller

Re: Debate: testing at fast time controls - update

Post by hgm »

bob wrote:Notice the first two lines, with 428010 games each, and a +2-4 or +3-2 error margin.
This would make me suspicious of BayesElo. At the very least, the quoted error cannot mean what we think it means. With 428,000 games the 2-sigma error in the win percentage should be 80%/sqrt(428000) = 0.12%, which should result in a 2-sigma Elo confidence interval of 0.85 Elo (in the 30-70% score range).

The quoted error might reflect the uncertainty in the ratings of the opponents, from which the current rating is derived. The error in the rating difference between two players with 428,000 games each should be 1.2 Elo, and the covariance given by BayesElo should reflect that.

If not, it is simply wrong.
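As an editorial aside, hgm's 2-sigma figure can be reproduced directly. The sketch below assumes a per-game score standard deviation of about 0.4 points (typical with a large draw fraction), which is where the 80%/sqrt(N) numerator comes from, and converts to Elo using the slope of the Elo curve at a 50% score; both assumptions are the editor's reading of the post, not hgm's exact derivation:

```python
import math

# 2-sigma band on the score fraction, assuming per-game std dev ~0.4,
# so the 2-sigma band is 0.8/sqrt(N).
games = 428000
two_sigma_score = 0.8 / math.sqrt(games)      # ~0.0012, i.e. ~0.12%

# Near a 50% score, the Elo curve E(p) = -400*log10(1/p - 1) has slope
# dE/dp = 400 / (ln(10) * p * (1 - p)) ~ 695 Elo per unit of score.
slope = 400 / (math.log(10) * 0.5 * 0.5)
two_sigma_elo = two_sigma_score * slope       # ~0.85 Elo
print(round(two_sigma_score * 100, 2), round(two_sigma_elo, 2))
```

This recovers both numbers quoted above: 0.12% on the score and 0.85 Elo.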
bob
Posts: 20943
Joined: Mon Feb 27, 2006 7:30 pm
Location: Birmingham, AL

Re: Debate: testing at fast time controls - update

Post by bob »

hgm wrote:
bob wrote:Notice the first two lines, with 428010 games each, and a +2-4 or +3-2 error margin.
This would make me suspicious of BayesElo. At the very least, the quoted error cannot mean what we think it means. With 428,000 games the 2-sigma error in the win percentage should be 80%/sqrt(428000) = 0.12%, which should result in a 2-sigma Elo confidence interval of 0.85 Elo (in the 30-70% score range).

The quoted error might reflect the uncertainty in the ratings of the opponents, from which the current rating is derived. The error in the rating difference between two players with 428,000 games each should be 1.2 Elo, and the covariance given by BayesElo should reflect that.

If not, it is simply wrong.
All I can say is that there were N versions of Crafty, each playing 8K games against Glaurung 2, Toga2, Fruit2 and Glaurung 1. Knowing that there are exactly 3891 positions in my starting-position test set, and twice that many games per match, you could compute how many different versions of Crafty there were. I assume the extra uncertainty comes from the fact that each version of Crafty is distinct, and each version only plays G1, G2, F2 and T2. But that is just an assumption; those 4 never play each other, and no two versions of Crafty play each other. Given that, it is not so surprising. I believe that the last time I tried, two opponents seemed to follow the expected +/- error as the number of games increased. Here the number of games is high, but the distribution of games between the programs is a bit warped.
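As an editorial aside, the computation bob invites the reader to do works out cleanly from the figures in the post (3891 starting positions, two games per position per opponent, 428010 total games shown for each of the two fully played opponents):

```python
# Back-of-envelope check of the match structure described above:
# each Crafty version plays every opponent twice from each of the
# 3891 starting positions, and each opponent's row shows 428010 games.
positions = 3891
games_per_match = 2 * positions       # 7782, the "8K games" in the post
opponent_total = 428010
versions = opponent_total / games_per_match
print(games_per_match, versions)      # 7782 55.0
```

So the cross-table above reflects roughly 55 distinct Crafty versions per opponent, consistent with the "almost 2 million games" total.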
bob
Posts: 20943
Joined: Mon Feb 27, 2006 7:30 pm
Location: Birmingham, AL

Re: Debate: testing at fast time controls

Post by bob »

Hart wrote:If the error is a function of the square of the sample size, then don't you need 4x as many games to get half the error?
It's the other way around. To divide the error by 2, you need ngames * sqrt(2). To divide by 4, you need 2x as many games.
hgm
Posts: 28391
Joined: Fri Mar 10, 2006 10:06 am
Location: Amsterdam
Full name: H G Muller

Re: Debate: testing at fast time controls

Post by hgm »

Hart is right, Bob is wrong.
bob
Posts: 20943
Joined: Mon Feb 27, 2006 7:30 pm
Location: Birmingham, AL

Re: Debate: testing at fast time controls

Post by bob »

hgm wrote:Hart is right, Bob is wrong.
You are right. I was thinking about it from a "backward" direction... Although it doesn't really explain the other issue we were discussing...

And I knew that of course. I watch the matches as they progress, and can almost tell you the error range for any number of games. And it takes longer and longer to drive it down to small values, as expected...

For example, +/- 7 takes around 8K games. +/-4 or 5 takes 32,000 games.
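As an editorial note, the point the thread converges on (Hart and hgm's correction) is that the error margin scales as 1/sqrt(N), so halving the margin requires 4x as many games. Fitting the constant to the figure in this last post (+/- 7 at 8,000 games) is an illustrative assumption, not an exact property of BayesElo's output:

```python
import math

# Error margin model: margin = k / sqrt(N), with k fitted (as an
# assumption) to the post's figure of +/-7 Elo at 8,000 games.
def margin(n_games, k=7 * math.sqrt(8000)):
    return k / math.sqrt(n_games)

print(round(margin(8000), 1))    # 7.0 by construction
print(round(margin(32000), 1))   # 3.5 -- quadrupling the games halves it
```

The model predicts about +/- 3.5 at 32,000 games, close to the +/- 4 or 5 bob observes, and it makes the cost of small margins concrete: each halving of the error costs a quadrupling of the games.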