Toga The Killer 1Y MP 4CPU is the strongest Toga....

Discussion of anything and everything relating to chess playing software and machines.

Moderator: Ras

pedrox
Posts: 1056
Joined: Fri Mar 10, 2006 6:07 am
Location: Basque Country (Spain)

Re: Toga The Killer 1Y MP 4CPU is the strongest Toga....

Post by pedrox »

If you repeat the 31128 games, I'm not sure you would get a result in the 2607-2616 range.
krazyken

Re: Toga The Killer 1Y MP 4CPU is the strongest Toga....

Post by krazyken »

As for the different manner: there are people who are playing matches of games versus people who are trying out test positions like you are. I'm talking about completely different experiment designs. You choose a fixed number of opponents, a fixed number of positions, and repeat games for each position. Most people posting match data are not using the same criteria, and different experiment designs can exhibit different behaviors. Taking what you see in your experiments and trying to apply it to an experiment of a different design may not always be a good idea; it may be valid sometimes, but it is always important to check the assumptions.

As for the rest: there is a difference between what is useful to you and what is useful to others. ±50 is worthless for your purpose, but for telling me the strength of an engine, ±50 is just fine. The difference between 2557 and 2611 is completely irrelevant and worthless to most chess players. So we appear to be discussing this because we have different definitions of worthless. All valid results are useful to me.
bob
Posts: 20943
Joined: Mon Feb 27, 2006 7:30 pm
Location: Birmingham, AL

Re: Toga The Killer 1Y MP 4CPU is the strongest Toga....

Post by bob »

pedrox wrote:If you repeat the 31128 games, I'm not sure you would get a result in the 2607-2616 range.
Don't wager. :)

Code:

   3 Crafty-23.1R01-3  2605    4    4 31128   51%  2598   25%
   4 Crafty-23.1-3     2605    4    4 31128   51%  2598   25%
   5 Crafty-23.1R01-2  2604    4    4 31128   51%  2598   24%
   6 Crafty-23.1-2     2604    4    4 31128   51%  2598   24%
   7 Crafty-23.1-1     2603    4    4 31128   51%  2598   25%
   8 Crafty-23.1R01-1  2602    5    5 31128   51%  2598   25%
   9 Crafty-23.0-1     2598    4    4 31128   50%  2598   24%
  10 Crafty-23.1R02-1  2598    4    4 31128   50%  2598   24%
  11 Crafty-23.1R02-2  2598    4    4 31128   50%  2598   24%
  12 Crafty-23.0-2     2597    5    5 31128   50%  2598   24%
  13 Crafty-23.0-3     2594    4    4 31128   49%  2598   24%

The above was from a different run, but it was the only one I had handy where I ran each of three different versions three times. The ratings match what is statistically expected.

23.0-1, 2 and 3 are the same version, 3 different runs.

23.0 -> 2594, 2597, 2598.
23.1 -> 2603, 2604, 2605
23.1R01 -> 2602, 2604, 2605 (note that R01 is a cleaned-up 23.1: no changes to any algorithm, just a couple of tiny changes to make the code more readable. I ran these just to see if I had broken something by accident.)

R02 was a test I started but stopped after two runs, as it showed no significant improvement and was perhaps slightly worse.

The runs are _very_ consistent, within the error bar almost every time, as the above shows.
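The +/-4 and +/-5 error bars in the table above can be sanity-checked with the usual normal approximation on game outcomes. A minimal sketch; the W/D/L split below is a made-up record at roughly a 51% score over 31128 games, since the table itself only reports percentages:

```python
import math

def to_elo(score):
    """Elo difference implied by an expected score."""
    return -400 * math.log10(1 / score - 1)

def elo_interval(wins, draws, losses, z=1.96):
    """Elo estimate and ~95% confidence interval from one W/D/L record,
    using a normal approximation on the per-game outcome."""
    n = wins + draws + losses
    score = (wins + 0.5 * draws) / n
    # variance of the {1, 0.5, 0} game outcomes around the mean score
    var = (wins * (1 - score) ** 2
           + draws * (0.5 - score) ** 2
           + losses * score ** 2) / n
    se = math.sqrt(var / n)
    return to_elo(score), to_elo(score - z * se), to_elo(score + z * se)

# hypothetical W/D/L record: 31128 games at roughly a 51% score
estimate, low, high = elo_interval(wins=11300, draws=9000, losses=10828)
print(round(estimate), round(low), round(high))
```

With 31128 games the interval comes out a handful of Elo wide, consistent with runs of the same version landing within a few points of each other.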
bob
Posts: 20943
Joined: Mon Feb 27, 2006 7:30 pm
Location: Birmingham, AL

Re: Toga The Killer 1Y MP 4CPU is the strongest Toga....

Post by bob »

krazyken wrote:As for the different manner: there are people who are playing matches of games versus people who are trying out test positions like you are. I'm talking about completely different experiment designs. You choose a fixed number of opponents, a fixed number of positions, and repeat games for each position. Most people posting match data are not using the same criteria, and different experiment designs can exhibit different behaviors. Taking what you see in your experiments and trying to apply it to an experiment of a different design may not always be a good idea; it may be valid sometimes, but it is always important to check the assumptions.
First, let's make sure we are comparing apples to apples. I am not using "test positions". I am playing complete games, from standard starting positions. The "standard positions" come from a couple of million games: positions after move 10-12, white to move, sorted from most frequent to least frequent, with all duplicates removed. They just make sure that every time I run a test, I play starting from the same openings, rather than dealing with the statistical uncertainty a book introduces.

The only assumption I can think of having made is that 10 games is not enough to draw _any_ conclusion, whether the result is 10-0, 0-10, or anything in between. Even a hundred games is not enough.

As for the rest: there is a difference between what is useful to you and what is useful to others. ±50 is worthless for your purpose, but for telling me the strength of an engine, ±50 is just fine. The difference between 2557 and 2611 is completely irrelevant and worthless to most chess players. So we appear to be discussing this because we have different definitions of worthless. All valid results are useful to me.
OK, two cases:

(1) programmers/authors. This is the case I am discussing. Here we care whether A' is better than A. Since the difference between A and A' is small, this requires a _large_ number of games.

(2) users. This case is far less important, although it still ought to be done with some degree of scientific validity. And here, 10 games is _not_ enough to conclude anything either. And the closer two programs are, the worse this comparison becomes.

But in any case, 10 games is useless for anything but the entertainment value of watching them. 100 games is on the very borderline of being useful, and only if the two programs are not very close together. Yet I see results reported with 20 games where A is supposedly better than B by 15 Elo, ignoring the +/-100 error bar such a result carries.
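The +/-100 figure for a 20-game sample is easy to reproduce. A quick sketch with a made-up 12-8 result (no draws, for simplicity):

```python
import math

def elo(score):
    """Elo difference implied by an expected score."""
    return -400 * math.log10(1 / score - 1)

n, score = 20, 12 / 20                   # hypothetical 12-8 result
se = math.sqrt(score * (1 - score) / n)  # binomial standard error
low, high = score - 1.96 * se, score + 1.96 * se
print(round(elo(score)), round(elo(low)), round(elo(high)))
# the point estimate is about +70 Elo, but the 95% interval
# spans roughly -80 to +260: far wider than the claimed edge
```

A 20-game sample simply cannot resolve a 15-Elo difference; the noise is an order of magnitude larger than the signal.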
Jaimes Conda
Posts: 921
Joined: Mon May 29, 2006 11:18 pm
Location: For now the planet Earth

Re: Toga The Killer 1Y MP 4CPU is the strongest Toga....

Post by Jaimes Conda »

bob wrote:
Jaimes Conda wrote:
bob wrote:
krazyken wrote:
bob wrote:
krazyken wrote:
bob wrote:
bob wrote: Tell me what you can determine from 10 games between A and B. You can't, with any confidence at all, decide which is best. You can't, with any confidence, decide that A and B are at least reliable and can play long matches without one crashing. You can't, with any confidence, decide that neither A or B will uncork an illegal move or fail to recognize an opponent's legal move in a long match.

So exactly _what_ can you conclude from 10 games?
That will depend on the results. 10-0-0 tells you one thing, 0-0-10 tells you another.

Yes, one tells me I won ten games, the other tells me I lost ten games. I can draw _no_ conclusions about which is better. I've shown the data in the past. Many times the first 100 games show something _entirely_ different from what I find after 32,000 games.
Your setup is different. You are using various starting positions, so your results will be dependent upon the order of the starting positions.
Except that the tests start in a random order... so there is no predicting.
I suppose that's why not all of your tests take unpredicted turns.
Not sure what you mean. I produce a ton of scripts to play a set of positions. I queue them up and they are assigned semi-random priorities, mixed in with other jobs. I never know which single run (which will be one opponent, 8 positions, 2 games per position) will be run first or last. Early results _always_ jump all over the place, with the "jumping around" smoothing out as we pass the 5,000-10,000 game level, until we reach the final 32,000-game result, which is very repeatable within the +/-4 error bar. I can eliminate all but one opponent and see the _same_ jumping around until we reach a significant number of games. Crafty vs Glaurung 2.2 is a good example. G2.2 is about +60 better in my testing. But it is not uncommon for Crafty to start off +100 after a few dozen games. Or sometimes -200. And then, as the games pile up, the result settles in at around -60, as it should.

It is just the nature of computer vs computer chess games.
Bob,
Two questions.

1) When you are playing 10,000+ computer matches with given opponents,
what is the game time? E.g., game in one minute, or game in sixty minutes?
Depends on what I am testing. For evaluation ideas, I start with very fast games (10 secs on clock, 10 ms increment) so that I can complete the 32,000 games in an hour or so. For changes that look good, I then will play slower games: 1+1 (one min on clock, 1 sec increment) takes about 12 hours to complete. If I am testing search ideas, I usually run 1+1 games, and for changes that look reasonable, or which intuition suggests might be better at longer time controls, I play longer games. I rarely go past 5+5 except for final verification. 5+5 takes about 2 days to finish; 60+60 is almost 2 weeks.

2) Will the Elo results be approximately the same whether the game is one minute or sixty minutes? Just as an example: Crafty plays 20,000 games at game in one minute, and let's say this gives an Elo of 2700. What would you expect Crafty's Elo to be after 20,000 games with the same opponents at game in two hours?

Jaimes
The last question points out a real problem. There are changes that look good at fast time controls and bad at long ones, and vice versa. But those are fairly rare. I see more of the case where a change either helps at fast games and does little or nothing for longer games, or else it doesn't make any difference in fast games but helps in longer games.
Thanks for the reply.

Jaimes
Veritas Vos Liberabit
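The "early jumping, later settling" behaviour described above is easy to reproduce with a toy model: assume a fixed -60 Elo gap (the Crafty vs Glaurung 2.2 figure from the thread), ignore draws, and watch the running estimate over 32,000 games. Everything else here is a made-up simulation, not the actual cluster setup:

```python
import math
import random

random.seed(1)

def elo(score):
    """Elo difference implied by an expected score."""
    return -400 * math.log10(1 / score - 1)

true_gap = -60                             # assumed true Elo difference
p_win = 1 / (1 + 10 ** (-true_gap / 400))  # expected score at -60 Elo

wins = 0
for games in range(1, 32001):
    wins += random.random() < p_win        # 1 point per win, draws ignored
    if games in (50, 500, 5000, 32000):
        print(games, round(elo(wins / games)))
```

Early snapshots of the running estimate can land a hundred or more Elo away from -60; the 32,000-game figure lands within a couple of points of it.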
krazyken

Re: Toga The Killer 1Y MP 4CPU is the strongest Toga....

Post by krazyken »

bob wrote: OK, two cases:

(1) programmers/authors. This is the case I am discussing. Here we care whether A' is better than A. Since the difference between A and A' is small, this requires a _large_ number of games.

(2) users. This case is far less important, although it still ought to be done with some degree of scientific validity. And here, 10 games is _not_ enough to conclude anything either. And the closer two programs are, the worse this comparison becomes.

But in any case, 10 games is useless for anything but the entertainment value of watching them. 100 games is on the very borderline of being useful, and only if the two programs are not very close together. Yet I see results reported with 20 games where A is supposedly better than B by 15 Elo, ignoring the +/-100 error bar such a result carries.
Point 1 I don't care about. Point 2 I've already shown, with mathematical evidence, to be false. 10 games is information that can be used. I thank you for your time in helping this discussion, and I hope a few people have learned something here; I know I have.
bob
Posts: 20943
Joined: Mon Feb 27, 2006 7:30 pm
Location: Birmingham, AL

Re: Toga The Killer 1Y MP 4CPU is the strongest Toga....

Post by bob »

krazyken wrote:
bob wrote: OK, two cases:

(1) programmers/authors. This is the case I am discussing. Here we care whether A' is better than A. Since the difference between A and A' is small, this requires a _large_ number of games.

(2) users. This case is far less important, although it still ought to be done with some degree of scientific validity. And here, 10 games is _not_ enough to conclude anything either. And the closer two programs are, the worse this comparison becomes.

But in any case, 10 games is useless for anything but the entertainment value of watching them. 100 games is on the very borderline of being useful, and only if the two programs are not very close together. Yet I see results reported with 20 games where A is supposedly better than B by 15 Elo, ignoring the +/-100 error bar such a result carries.
Point 1 I don't care about. Point 2 I've already shown, with mathematical evidence, to be false. 10 games is information that can be used. I thank you for your time in helping this discussion, and I hope a few people have learned something here; I know I have.
Where have you shown, "with mathematical evidence", that it is false?
bnemias
Posts: 373
Joined: Thu Aug 14, 2008 3:21 am
Location: Albuquerque, NM

Re: Toga The Killer 1Y MP 4CPU is the strongest Toga....

Post by bnemias »

Can we just produce some data points to illustrate?

Say, take a linear 32,000-game match between A and A'. Pick n unique random starting points in the list (100, maybe) to produce n 10-game runs. Then it should be easy to compute the accuracy of a run. Also, it might be interesting to post some of the most skewed runs.

In fact, this issue keeps recurring. How about, once it's done, computing how many runs of n produce results within some Elo of the actual difference? Then bookmark the link so you can reference the data any time this issue comes up again. Heh.
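The proposed experiment can be sketched directly. Assuming A and A' are exactly equal in strength and draws are ignored (both simplifications, not anyone's measured data), sample 100 random 10-game windows from a 32,000-game sequence and look at the spread of the mini-match scores:

```python
import random

random.seed(7)

# 1 = win for A, 0 = win for A'; equal strength, draws ignored
match = [random.random() < 0.5 for _ in range(32000)]

window_scores = []
for _ in range(100):
    start = random.randrange(32000 - 10)
    window_scores.append(sum(match[start:start + 10]))  # A's wins out of 10

print(min(window_scores), max(window_scores))
```

Even with perfectly equal engines, some windows look like decisive mini-matches in either direction, which is exactly the point of the exercise.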
bob
Posts: 20943
Joined: Mon Feb 27, 2006 7:30 pm
Location: Birmingham, AL

Re: Toga The Killer 1Y MP 4CPU is the strongest Toga....

Post by bob »

bnemias wrote:Can we just produce some data points to illustrate?

Say, take a linear 32,000-game match between A and A'. Pick n unique random starting points in the list (100, maybe) to produce n 10-game runs. Then it should be easy to compute the accuracy of a run. Also, it might be interesting to post some of the most skewed runs.

In fact, this issue keeps recurring. How about, once it's done, computing how many runs of n produce results within some Elo of the actual difference? Then bookmark the link so you can reference the data any time this issue comes up again. Heh.
I believe I already posted most of what you asked for. I played a 32,000-game match and grabbed the partial results every few seconds. For the first 200 games, the Elo was +/-100 from the truth, all over the place. By the time we hit 1,000 games it had settled down, although it was a little high; by 32,000 games it had settled completely. Three runs with the same programs produced final Elo values within the stated +/-5 error bar BayesElo gives.

We had this same discussion several years ago. Chris Whittington used a different methodology to illustrate this point. He assumed two programs of identical strength, ignored draws, and simply generated a string of 1,000 numbers, either 0 or 1, representing a win for program A with a 1 and a win for program B with a 0. He then searched for strings of 0's or 1's, and posted some analysis showing that a run of 10 wins or losses is hardly unexpected with two equal opponents.

This was the Nth time testing came up. I know everyone wants to be able to see the truth with 10 games. But you don't even see a good guess with 100 games. Unfortunately.
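The described experiment is a few lines to re-create. This sketch assumes exactly what the description says: equal programs, draws ignored, 1,000 coin-flip game results:

```python
import random

random.seed(0)

flips = [random.randint(0, 1) for _ in range(1000)]  # 1 = A wins, 0 = B wins

# find the longest unbroken streak of identical results
longest = current = 1
for prev, cur in zip(flips, flips[1:]):
    current = current + 1 if cur == prev else 1
    longest = max(longest, current)

print(longest)  # longest winning streak by one side
```

For 1,000 flips the longest run typically lands around 9 or 10, so a 10-0 stretch between equal engines is unremarkable over a long series.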
krazyken

Re: Toga The Killer 1Y MP 4CPU is the strongest Toga....

Post by krazyken »

bob wrote:
bnemias wrote:Can we just produce some data points to illustrate?

Say, take a linear 32,000-game match between A and A'. Pick n unique random starting points in the list (100, maybe) to produce n 10-game runs. Then it should be easy to compute the accuracy of a run. Also, it might be interesting to post some of the most skewed runs.

In fact, this issue keeps recurring. How about, once it's done, computing how many runs of n produce results within some Elo of the actual difference? Then bookmark the link so you can reference the data any time this issue comes up again. Heh.
I believe I already posted most of what you asked for. I played a 32,000-game match and grabbed the partial results every few seconds. For the first 200 games, the Elo was +/-100 from the truth, all over the place. By the time we hit 1,000 games it had settled down, although it was a little high; by 32,000 games it had settled completely. Three runs with the same programs produced final Elo values within the stated +/-5 error bar BayesElo gives.

We had this same discussion several years ago. Chris Whittington used a different methodology to illustrate this point. He assumed two programs of identical strength, ignored draws, and simply generated a string of 1,000 numbers, either 0 or 1, representing a win for program A with a 1 and a win for program B with a 0. He then searched for strings of 0's or 1's, and posted some analysis showing that a run of 10 wins or losses is hardly unexpected with two equal opponents.

This was the Nth time testing came up. I know everyone wants to be able to see the truth with 10 games. But you don't even see a good guess with 100 games. Unfortunately.
Well, if you are ignoring draws, that is totally possible, somewhere around 96%. But in the real world, not very likely at all.
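The objection that real games include draws is easy to quantify. A short sketch; the draw rates below are illustrative assumptions, not measured figures from the thread:

```python
# chance that a 10-game stretch between equal engines is a clean sweep,
# for a few assumed draw rates
for draw_rate in (0.0, 0.25, 0.40):
    p_all_decisive = (1 - draw_rate) ** 10
    p_sweep = p_all_decisive * 2 * 0.5 ** 10  # either side wins all 10
    print(draw_rate, round(p_all_decisive, 4), round(p_sweep, 6))
```

With no draws, about one 10-game window in 500 is a sweep; a realistic computer-chess draw rate makes a sweep between equal engines an order of magnitude rarer still.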