A question on testing methodology
Moderator: Ras
-
- Posts: 20943
- Joined: Mon Feb 27, 2006 7:30 pm
- Location: Birmingham, AL
Re: A question on testing methodology
jwes wrote: Any chance you could put up an archive of these games on your ftp site? I would be interested in doing some statistics on the results.
Let me see what the space constraints are.
-
- Posts: 20943
- Joined: Mon Feb 27, 2006 7:30 pm
- Location: Birmingham, AL
Re: A question on testing methodology [here is the data]
First, two Crafty versions (at the moment, these are 23.0 and 23.1... I am having to manually search for the two that were 10 Elo apart, and that is a pain). The first set is with 23.0 and 23.1 playing the usual gauntlet of 5 programs, 4,000 positions.
Code:
Rank Name Elo + - games score oppo. draws
2 Crafty-23.1-1 2654 4 4 40000 57% 2597 22%
3 Glaurung 2.2 2625 5 5 16000 52% 2607 22%
4 Toga2 2616 5 5 16000 51% 2607 23%
5 Crafty-23.0-1 2561 3 4 40000 45% 2597 21%
6 Fruit 2.1 2522 5 5 16000 38% 2607 22%
And now, factoring in the complete RR between the 5 opponents. Again, each plays the other 4 for 4,000 positions, alternating colors as well.
Code:
Rank Name Elo + - games score oppo. draws
2 Crafty-23.1-1 2653 4 3 40000 57% 2597 22%
3 Glaurung 2.2 2625 3 3 48000 54% 2596 23%
4 Toga2 2615 4 4 48000 52% 2597 23%
5 Crafty-23.0-1 2561 3 4 40000 45% 2597 21%
6 Fruit 2.1 2521 4 4 48000 38% 2613 23%
So the game winning percentages change, since the opponents played each other, and there was a very slight reduction in the error margin. But check out the ratings for 23.1 and 23.0: 23.1 dropped by 1 point. Notice that the error margins for the opponents all dropped, since they played 3x as many games, which drops Crafty's error bar a bit (it played no additional games between the two samples).
So pretty much what I saw the last time I ran this kind of test to see if the RR data was useful.
Bob
Re: A question on testing methodology [here is the data]
From what I see, the results look remarkably consistent. It looks like the RR reduced the error somewhat, but not as much as I'd have guessed.
-
- Posts: 20943
- Joined: Mon Feb 27, 2006 7:30 pm
- Location: Birmingham, AL
Re: A question on testing methodology [here is the data]
Remember the "square" stuff. 4x the games to reduce the error by 2x.Hart wrote:From what I see then the results look remarkably consistent. It looks like the RR reduced the error somewhat but not as much as I'd have guessed.
-
- Posts: 803
- Joined: Mon Jul 17, 2006 5:53 am
- Full name: Edsel Apostol
Re: A question on testing methodology
Edsel Apostol wrote: I am currently using 10 opponents for my engine. For every new version or setting that I'm testing I use these 10 opponents. I don't play my engine against itself unless the search and eval are very different.
My question is whether the result from this testing is accurate enough, or whether I also need to run a round-robin match between the opponents to get a more accurate rating. Do an opponent's results against other engines affect the rating of my engine? Does the number of games played by the opponents make the rating of my engine more stable?
For example, format is (Engine, Elo, Number of Games):
Rating List A (Gauntlet):
Opponent1 2900 1000
Opponent2 2875 1000
Opponent3 2850 1000
Opponent4 2825 1000
Opponent5 2800 1000
EngineA 2775 5000
Rating List B (Round Robin):
Opponent1 2900 5000
Opponent2 2875 5000
Opponent3 2850 5000
Opponent4 2825 5000
Opponent5 2800 5000
EngineA 2775 5000
Which rating list is more accurate?
bob wrote: Here's the question: do you want to know whether your engine is moving up or down, or do you want to know exact ratings for everyone? I have tested this exact case and found that the round robin might give more accurate ratings for everyone overall, but it has no influence on the ratings of your two test programs, because your rating and error bar depend on the number of games you play.
You might find the example buried in CCC from a couple of years ago. When I was originally discussing cluster testing, Remi made one suggestion that helped a lot: test A against the gauntlet, then test A' against the same gauntlet, and combine _all_ of the PGN into one file before passing it to BayesElo. Those numbers have been rock-solid to date. I also, for fun, added an equal number of PGN games between each pair of opponents, so that A, A', and each opponent played the same number of games. The ratings changed a bit, but the difference between A and A' did not. And if you think about it, you can play the opponents' round robin first and just save those games, since none of the opponents ever change; then add your A and A' gauntlet PGN to the rest and run it through BayesElo if you want to see how this changes (or doesn't change) the results.
Edsel Apostol wrote: As for your question, I want to know whether versions/settings of my engine perform better compared to each other. I just thought that if the engine opponents had a more stable rating, that would also be reflected in the stability of the rating of the tested engine. This is not the case, though, as shown by the results posted by Adam Hair and by what you've pointed out in your posts.
bob wrote: And you are falling into a classical trap. Ratings are _not_ absolute numbers. If you play your program (two versions, A and A') against a set of opponents, you should be able to set your opponents' ratings to anything you want, and your two versions should still end up in the right order, with the right separation between them. Elo is not an absolute value; the only use for Elo numbers is to subtract them to get the difference. That difference is an absolute number, and is useful.
Edsel Apostol wrote: I understand and I know what you are trying to say, but that is not what my question is all about.
bob wrote: What else could you factor in? Elo is used to predict the future outcome of a game between two opponents, based on past performance by each of them. No need to factor in weather, health, mental state, etc. With enough games, that is included by natural selection.
The question now is whether an opponent engine's rating stability should be included in the computation of the ratings. For example, format is (Engine, Number of Games, Winning Percentage):
Data 1:
EngineA 1000 60%
OpponentA 1000 40%
Data 2:
EngineA 1000 60%
OpponentA 20000 40%
In the above data I have more confidence in the rating produced from Data 2 than in the one from Data 1, since in Data 2 OpponentA has played far more games against other engines, but the current rating computation of both EloStat and BayesElo seems to produce the same rating for EngineA in both cases.
It seems clear to me now how Elo is calculated: it is based mostly on winning percentages and doesn't take other factors into account. Maybe someday someone will discover a more accurate way to calculate ratings.
The current Elo calculation is not that complex, especially EloStat's; that's why it has problems in some extreme scenarios. I think Remi takes more factors into account in BayesElo, with the covariance feature.
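Edsel's Data 1 / Data 2 observation follows directly from the basic Elo model: the estimated rating difference is a function of the score percentage alone, so OpponentA's extra games against third parties never enter it. A minimal sketch of the logistic Elo formula (not EloStat's or BayesElo's exact algorithm):

Code:
import math

def elo_diff_from_score(score):
    """Elo difference implied by a score fraction under the
    logistic Elo model: E = 1 / (1 + 10**(-diff/400))."""
    return -400.0 * math.log10(1.0 / score - 1.0)

# Data 1 and Data 2 both show EngineA scoring 60% vs OpponentA,
# so both imply the same difference, regardless of how many games
# OpponentA has played elsewhere:
print(round(elo_diff_from_score(0.60), 1))  # ~70.4 Elo

What the extra games do change is the confidence in OpponentA's own rating, i.e. the error bars discussed elsewhere in this thread, not the point estimate derived from a fixed winning percentage.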
Edsel Apostol
https://github.com/ed-apostol/InvictusChess
-
- Posts: 803
- Joined: Mon Jul 17, 2006 5:53 am
- Full name: Edsel Apostol
Re: A question on testing methodology [here is the data]
bob wrote: First, two Crafty versions... The first set is with 23.0 and 23.1 playing the usual gauntlet of 5 programs, 4,000 positions. And now, factoring in the complete RR between the 5 opponents... So the game winning percentages change, since the opponents played each other, and there was a very slight reduction in the error margin... So pretty much what I saw the last time I ran this kind of test to see if the RR data was useful.
I think one could only notice a small difference when there is already a large number of games, as in the data above. The difference would be more pronounced with only a small number of positions, where the opponents' total number of games is much bigger than the engine's. This is useful for those of us who have limited testing resources and can only afford to test each version with a little more than 1000 games.
Edsel Apostol
https://github.com/ed-apostol/InvictusChess
-
- Posts: 4052
- Joined: Thu May 15, 2008 9:57 pm
- Location: Berlin, Germany
- Full name: Sven Schüle
Re: A question on testing methodology [here is the data]
That "square" argument is flawed since, as you already mentioned by yourself and I repeated, the RR games are played only once and can be kept forever. For each new engine version to test, you only add its new gauntlet games, so no additional effort once the RR games are completed.bob wrote:Remember the "square" stuff. 4x the games to reduce the error by 2x.Hart wrote:From what I see then the results look remarkably consistent. It looks like the RR reduced the error somewhat but not as much as I'd have guessed.
More points:
1. Could you show us these two tables with only the first 50% and then only 75% of the gauntlet games included? I would like to see whether a reduced number of games is already sufficient to satisfy your error bar criteria with the RR method (the first method will not do, of course, but the corresponding table is needed for comparison).
2. Did you use "covariance" in both cases (with + without RR)?
3. Could you show us the LOS results for both cases, please?
Sven
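For reference, the LOS (likelihood of superiority) Sven asks about in point 3 can be estimated from a head-to-head result with the normal approximation sketched below. This is the generic formula commonly used in computer-chess testing, not necessarily the exact computation BayesElo performs, and the win/loss counts in the example are hypothetical:

Code:
import math

def los(wins, losses):
    """Likelihood of superiority: estimated probability that the
    first engine is the stronger one, from decisive games only
    (draws cancel out of the comparison). Normal approximation."""
    if wins + losses == 0:
        return 0.5  # no decisive games: no evidence either way
    return 0.5 * (1.0 + math.erf((wins - losses) / math.sqrt(2.0 * (wins + losses))))

# Hypothetical result: 300 wins, 260 losses (plus any number of draws).
print(round(los(300, 260), 3))  # ~0.95: likely, but not conclusively, stronger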
-
- Posts: 20943
- Joined: Mon Feb 27, 2006 7:30 pm
- Location: Birmingham, AL
Re: A question on testing methodology [here is the data]
Sven Schüle wrote: That "square" argument is flawed since, as you already mentioned yourself and I repeated, the RR games are played only once and can be kept forever. For each new engine version to test, you only add its new gauntlet games, so there is no additional effort once the RR games are completed.
I wasn't talking about effort. He commented that adding the extra games did not reduce the error bar as much as he expected. I pointed out that to reduce it by a factor of 2, you need 4x the games. The RR did not quadruple the games played.
Sven Schüle wrote: More points:
1. Could you show us these two tables with only the first 50% and then only 75% of the gauntlet games included? I would like to see whether a reduced number of games is already sufficient to satisfy your error-bar criteria with the RR method (the first method will not do, of course, but the corresponding table is needed for comparison).
2. Did you use "covariance" in both cases (with + without RR)?
3. Could you show us the LOS results for both cases, please?
I have been using "exactdist", as Remi recommended when this was first discussed.
The input to BayesElo is as follows:
readpgn master.pgn
readpgn pgn
elo
offset 2600
mm
exactdist
ratings
x
master.pgn contains any PGN I want included from older tests, such as 23.0; pgn contains the PGN games from the current test. I offset the ratings to 2600 to eliminate negative numbers, since at one point I was parsing the BayesElo output in another program.
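A minimal sketch of running that same command script non-interactively, assuming a bayeselo binary on the PATH and the two PGN files in the working directory (BayesElo reads commands from standard input, so the script can simply be piped in):

Code:
import subprocess

# The command script from the post, fed to bayeselo on stdin.
# Assumes a "bayeselo" binary on PATH and both PGN files present.
commands = """readpgn master.pgn
readpgn pgn
elo
offset 2600
mm
exactdist
ratings
x
"""

result = subprocess.run(["bayeselo"], input=commands,
                        capture_output=True, text=True)
print(result.stdout)  # ratings table, offset to 2600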
-
- Posts: 20943
- Joined: Mon Feb 27, 2006 7:30 pm
- Location: Birmingham, AL
Re: A question on testing methodology [here is the data]
Edsel Apostol wrote: I think one could only notice a small difference when there is already a large number of games, as in the data above. The difference would be more pronounced with only a small number of positions, where the opponents' total number of games is much bigger than the engine's. This is useful for those of us who have limited testing resources and can only afford to test each version with a little more than 1000 games.
The error bar for the RR participants will go down a lot if you play enough games. But with a limited number of games for your program, you are stuck with a big error margin. I'll run 1000 games and give the BayesElo output for that, then include the large RR results and run again. Will do this right after an 11:00am class I have coming up.
Should be able to get this done by 12:00 or so.
-
- Posts: 20943
- Joined: Mon Feb 27, 2006 7:30 pm
- Location: Birmingham, AL
Re: A question on testing methodology [here is the data]
bob wrote: I'll run 1000 games and give the BayesElo output for that, then include the large RR results and run again. Will do this right after an 11:00am class I have coming up. Should be able to get this done by 12:00 or so.
Actually, this took almost no time. I ran 1040 games with the two versions of Crafty.
Test results alone:
Code:
Rank Name Elo + - games score oppo. draws
2 Crafty-23.1-1 2654 17 16 1040 58% 2595 23%
3 Toga2 2639 25 25 416 54% 2613 24%
4 Glaurung 2.2 2617 25 25 416 51% 2613 23%
5 Crafty-23.0-1 2571 16 16 1040 47% 2595 22%
6 Fruit 2.1 2532 25 26 416 39% 2613 22%
And now with the RR games between the opponents included:
Code:
Rank Name Elo + - games score oppo. draws
2 Crafty-23.1-1 2654 17 17 1040 58% 2595 23%
3 Glaurung 2.2 2623 5 4 32416 54% 2589 24%
4 Toga2 2613 4 4 32416 53% 2591 23%
5 Crafty-23.0-1 2570 16 17 1040 47% 2595 22%
6 Fruit 2.1 2519 4 4 32416 37% 2614 23%