Debate: testing at fast time controls

Discussion of anything and everything relating to chess playing software and machines.

Moderator: Ras

Graham Banks
Posts: 44636
Joined: Sun Feb 26, 2006 10:52 am
Location: Auckland, NZ

Re: Debate: testing at fast time controls

Post by Graham Banks »

Dann Corbit wrote:
Kirill Kryukov wrote: {snip}
Do you seriously look for useful novelties in engine-engine games, even in long time controls?
Yes.
Found some, too.
Not surprising to me. :wink:
gbanksnz at gmail.com
Steelman

Re: Debate: testing at fast time controls

Post by Steelman »

bob wrote:
Kempelen wrote:And when you repeat tests at fast time controls, do you see repetition in the results? I have noted that when repeating tests at fast time controls, the results change more than when repeating at slow ones. Have you noticed something similar?
That only means you are not playing enough games. To get to the +/- 4 Elo level, you need to play 40,000 games or so. And +/- 4 still leaves a significant margin for error even with that many games...
How many games would be required for about a +/- 20 Elo margin?
And why not test at both fast and slower speeds? Slower being no less than 20- to 30-minute games. I wish these games could be played at more like 60 or 90 minutes, but that would take (even for Bob) some time.

I ask this because at these fast speeds the playing strength is affected a great deal. The positional and even tactical abilities are reduced. Would that not also affect the evaluation of positions? I have "tuned" my evaluation to play at much slower speeds, not fast ones. I would think that some values would need to be adjusted for fast speeds? Or is this not true?

So my vote is no. I don't think testing at fast speeds gives you the "real" data you are looking for. The data at slower speeds would have to be more accurate and a truer test of playing strength. Unless of course the program is intended to play speed chess all the time.
bob
Posts: 20943
Joined: Mon Feb 27, 2006 7:30 pm
Location: Birmingham, AL

Re: Debate: testing at fast time controls

Post by bob »

Steelman wrote:
bob wrote:
Kempelen wrote:And when you repeat tests at fast time controls, do you see repetition in the results? I have noted that when repeating tests at fast time controls, the results change more than when repeating at slow ones. Have you noticed something similar?
That only means you are not playing enough games. To get to the +/- 4 Elo level, you need to play 40,000 games or so. And +/- 4 still leaves a significant margin for error even with that many games...
How many games would be required for about a +/- 20 Elo margin?
And why not test at both fast and slower speeds? Slower being no less than 20- to 30-minute games. I wish these games could be played at more like 60 or 90 minutes, but that would take (even for Bob) some time.

I ask this because at these fast speeds the playing strength is affected a great deal. The positional and even tactical abilities are reduced. Would that not also affect the evaluation of positions? I have "tuned" my evaluation to play at much slower speeds, not fast ones. I would think that some values would need to be adjusted for fast speeds? Or is this not true?

So my vote is no. I don't think testing at fast speeds gives you the "real" data you are looking for. The data at slower speeds would have to be more accurate and a truer test of playing strength. Unless of course the program is intended to play speed chess all the time.
The basic idea is that if you double the number of games, you reduce the error by a factor of four, as a pretty close estimate. So to get to +/- 1, 64K games would be needed. To reduce the error rate by a factor of 2 (to get to +/- 2) you would need approximately 46,000 games, considering that 32,000 games is +/- 4. (sqrt(2) * 32000)

To estimate the time: a 1+1 time control at 32,000 games takes around 12 hours, which translates to 32,000 / 12 = about 2,700 games per hour. Using 256 processors, that is about 10 games per hour per CPU, or 6 minutes a game on average. There is some lost time in running things on the cluster, but that is close. If you want to multiply the time control by 10, to get to roughly 30 minutes per side per game or 60 minutes total time per game, the test suddenly takes 10x longer, or about 5 full days. That's too long to measure and tweak...
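As an editorial aside, the throughput arithmetic in the post above can be sanity-checked with a few lines of Python. The figures used (32,000 games in ~12 hours on 256 processors at a 1+1 time control) are taken from the post itself; everything else is simple derived arithmetic, not a claim about the actual cluster setup:

```python
# Sanity check of the cluster throughput arithmetic.
# Assumed inputs, taken from the post: 32,000 games in ~12 hours
# on 256 processors at a 1+1 time control.
games = 32000
hours = 12
cpus = 256

games_per_hour = games / hours                   # ~2667 games/hour overall
games_per_hour_per_cpu = games_per_hour / cpus   # ~10.4 games/hour per CPU
minutes_per_game = 60 / games_per_hour_per_cpu   # ~5.8 minutes per game

# Multiplying the time control by 10 multiplies the wall-clock time by ~10:
hours_at_10x = hours * 10                        # 120 hours = 5 full days
print(round(games_per_hour), round(minutes_per_game, 1), hours_at_10x / 24)
```

This matches the post's "about 10 games per hour per cpu", "6 minutes a game", and "about 5 full days" figures to within rounding.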
bob
Posts: 20943
Joined: Mon Feb 27, 2006 7:30 pm
Location: Birmingham, AL

Re: Debate: testing at fast time controls - update

Post by bob »

It is apparently worse than I thought at first look. Here's a sample from current testing...

Code: Select all

   1 Toga2               2665    2    4 428010   59%  2601   22%
   2 Glaurung 2.1        2663    3    2 428010   58%  2601   21%
   3 Crafty-22.9R17-12   2608    3    3  93384   51%  2597   21%
There were lots of other versions of Crafty playing, which I omitted so as not to make this too long. But a _bunch_ of differently tuned versions of Crafty played against 4 opponents, a total of almost 2 million games (done over the past 2-3 days).

Notice the first two lines, with 428010 games each and a +2/-4 or +3/-2 error margin.

So it takes a _ton_ of games to get down to +/- 2 or lower. Way more than I originally thought.
Hart

Re: Debate: testing at fast time controls

Post by Hart »

If the error is a function of the square of the sample size, then don't you need 4x as many games to get half the error?
hgm
Posts: 28391
Joined: Fri Mar 10, 2006 10:06 am
Location: Amsterdam
Full name: H G Muller

Re: Debate: testing at fast time controls - update

Post by hgm »

bob wrote:Notice the first two lines, with 428010 games each, and a +2-4 or +3-2 error margin.
This would make me suspicious of BayesElo. At the very least, the quoted error cannot mean what we think it means. With 428,000 games the 2-sigma error in the win percentage should be 80%/sqrt(428000) = 0.12%, which should result in a 2-sigma Elo confidence interval of 0.85 Elo (in the 30-70% score range).

The quoted error might reflect the uncertainty in the ratings of the opponents, from which the current rating is derived. The error in the rating difference between two players with 428,000 games each should be 1.2 Elo, and the covariance given by BayesElo should reflect that.

If not, it is simply wrong.
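As an editorial aside, hgm's 2-sigma figure can be reproduced directly. The sketch below assumes a per-game score standard deviation of about 0.4 points (typical with a large draw fraction), which is where the 80%/sqrt(N) numerator comes from, and converts to Elo using the slope of the Elo curve at a 50% score; both assumptions are the editor's reading of the post, not hgm's exact derivation:

```python
import math

# 2-sigma band on the score fraction, assuming per-game std dev ~0.4,
# so the 2-sigma band is 0.8/sqrt(N).
games = 428000
two_sigma_score = 0.8 / math.sqrt(games)      # ~0.0012, i.e. ~0.12%

# Near a 50% score, the Elo curve E(p) = -400*log10(1/p - 1) has slope
# dE/dp = 400 / (ln(10) * p * (1 - p)) ~ 695 Elo per unit of score.
slope = 400 / (math.log(10) * 0.5 * 0.5)
two_sigma_elo = two_sigma_score * slope       # ~0.85 Elo
print(round(two_sigma_score * 100, 2), round(two_sigma_elo, 2))
```

This recovers both numbers quoted above: 0.12% on the score and 0.85 Elo.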
bob
Posts: 20943
Joined: Mon Feb 27, 2006 7:30 pm
Location: Birmingham, AL

Re: Debate: testing at fast time controls - update

Post by bob »

hgm wrote:
bob wrote:Notice the first two lines, with 428010 games each, and a +2-4 or +3-2 error margin.
This would make me suspicious of BayesElo. At the very least, the quoted error cannot mean what we think it means. With 428,000 games the 2-sigma error in the win percentage should be 80%/sqrt(428000) = 0.12%, which should result in a 2-sigma Elo confidence interval of 0.85 Elo (in the 30-70% score range).

The quoted error might reflect the uncertainty in the ratings of the opponents, from which the current rating is derived. The error in the rating difference between two players with 428,000 games each should be 1.2 Elo, and the covariance given by BayesElo should reflect that.

If not, it is simply wrong.
All I can say is that there were N versions of Crafty, each playing 8K games against Glaurung 2, Toga2, Fruit2 and Glaurung 1. Knowing that there are exactly 3891 positions in my starting-position test set, and twice that many games per match, you could compute how many different versions of Crafty there were. I assume the extra uncertainty comes from the fact that each version of Crafty is distinct, and each version only plays G1, G2, F2 and T2. But that is just an assumption; those 4 never play each other, and no two versions of Crafty play each other. Given that, it is not so surprising. I believe that the last time I tried, two opponents seemed to follow the expected +/- error as the number of games increased. Here the number of games is high, but the distribution of games between the programs is a bit warped.
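As an editorial aside, the computation bob invites the reader to do works out cleanly from the figures in the post (3891 starting positions, two games per position per opponent, 428010 total games shown for each of the two fully played opponents):

```python
# Back-of-envelope check of the match structure described above:
# each Crafty version plays every opponent twice from each of the
# 3891 starting positions, and each opponent's row shows 428010 games.
positions = 3891
games_per_match = 2 * positions       # 7782, the "8K games" in the post
opponent_total = 428010
versions = opponent_total / games_per_match
print(games_per_match, versions)      # 7782 55.0
```

So the cross-table above reflects roughly 55 distinct Crafty versions per opponent, consistent with the "almost 2 million games" total.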
bob
Posts: 20943
Joined: Mon Feb 27, 2006 7:30 pm
Location: Birmingham, AL

Re: Debate: testing at fast time controls

Post by bob »

Hart wrote:If the error is a function of the square of the sample size, then don't you need 4x as many games to get half the error?
It's the other way around. To divide the error by 2, you need ngames * sqrt(2). To divide by 4, you need 2x as many games.
hgm
Posts: 28391
Joined: Fri Mar 10, 2006 10:06 am
Location: Amsterdam
Full name: H G Muller

Re: Debate: testing at fast time controls

Post by hgm »

Hart is right, Bob is wrong.
bob
Posts: 20943
Joined: Mon Feb 27, 2006 7:30 pm
Location: Birmingham, AL

Re: Debate: testing at fast time controls

Post by bob »

hgm wrote:Hart is right, Bob is wrong.
You are right. I was thinking about it from a "backward" direction... Although it doesn't really explain the other issue we were discussing...

And I knew that of course. I watch the matches as they progress, and can almost tell you the error range for any number of games. And it takes longer and longer to drive it down to small values, as expected...

For example, +/- 7 takes around 8K games. +/-4 or 5 takes 32,000 games.
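As an editorial note, the point the thread converges on (Hart and hgm's correction) is that the error margin scales as 1/sqrt(N), so halving the margin requires 4x as many games. Fitting the constant to the figure in this last post (+/- 7 at 8,000 games) is an illustrative assumption, not an exact property of BayesElo's output:

```python
import math

# Error margin model: margin = k / sqrt(N), with k fitted (as an
# assumption) to the post's figure of +/-7 Elo at 8,000 games.
def margin(n_games, k=7 * math.sqrt(8000)):
    return k / math.sqrt(n_games)

print(round(margin(8000), 1))    # 7.0 by construction
print(round(margin(32000), 1))   # 3.5 -- quadrupling the games halves it
```

The model predicts about +/- 3.5 at 32,000 games, close to the +/- 4 or 5 bob observes, and it makes the cost of small margins concrete: each halving of the error costs a quadrupling of the games.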