A question on testing methodology

Discussion of chess software programming and technical issues.

bob
Posts: 20943
Joined: Mon Feb 27, 2006 7:30 pm
Location: Birmingham, AL

Re: A question on testing methodology

Post by bob »

jwes wrote:Any chance you could put up an archive of these games on your ftp site? I would be interested in doing some statistics on the results.
Let me see what the space constraints are.
bob
Posts: 20943
Joined: Mon Feb 27, 2006 7:30 pm
Location: Birmingham, AL

Re: A question on testing methodology [here is the data]

Post by bob »

First, two crafty versions (at the moment, these are 23.0 and 23.1... I am having to manually search for the two that were 10 elo apart and that is a pain). The first set is with 23.0 and 23.1 playing the usual gauntlet of 5 programs, 4,000 positions.

Code: Select all

Rank Name                  Elo    +    - games score oppo. draws
   2 Crafty-23.1-1        2654    4    4 40000   57%  2597   22% 
   3 Glaurung 2.2         2625    5    5 16000   52%  2607   22% 
   4 Toga2                2616    5    5 16000   51%  2607   23% 
   5 Crafty-23.0-1        2561    3    4 40000   45%  2597   21% 
   6 Fruit 2.1            2522    5    5 16000   38%  2607   22% 
And now by factoring in the complete RR between the 5 opponents. Again, each plays the other 4 for 4,000 positions, alternating colors as well.

Code: Select all

Rank Name                  Elo    +    - games score oppo. draws
   2 Crafty-23.1-1        2653    4    3 40000   57%  2597   22% 
   3 Glaurung 2.2         2625    3    3 48000   54%  2596   23% 
   4 Toga2                2615    4    4 48000   52%  2597   23% 
   5 Crafty-23.0-1        2561    3    4 40000   45%  2597   21% 
   6 Fruit 2.1            2521    4    4 48000   38%  2613   23% 
So the game winning percentages change, since the opponents played each other. And there was a very slight reduction in the error margin. But check out the ratings for 23.1 and 23.0. 23.1 dropped by 1 point... Notice that the error margins for the opponents all dropped since they played 3x as many games... which drops Crafty's error bar a bit (it played no additional games between the two samples).

So pretty much what I saw the last time I ran this kind of test to see if the RR data was useful.

Bob
Hart

Re: A question on testing methodology [here is the data]

Post by Hart »

From what I see, the results look remarkably consistent. It looks like the RR reduced the error somewhat, but not as much as I'd have guessed.
bob
Posts: 20943
Joined: Mon Feb 27, 2006 7:30 pm
Location: Birmingham, AL

Re: A question on testing methodology [here is the data]

Post by bob »

Hart wrote:From what I see, the results look remarkably consistent. It looks like the RR reduced the error somewhat, but not as much as I'd have guessed.
Remember the "square" stuff. 4x the games to reduce the error by 2x.
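To put a rough number on that square-root relationship, here is a back-of-the-envelope sketch using a simple normal approximation of the match score (not BayesElo's exact computation); the 57% score and 22% draw rate are taken from the tables above, and the function itself is only illustrative:

Code: Select all

import math

def elo_error_bar(score, draw_ratio, games, z=1.96):
    # Approximate 95% error bar, in Elo, for a measured score fraction.
    win = score - draw_ratio / 2            # fraction of wins
    loss = 1.0 - score - draw_ratio / 2     # fraction of losses
    # Per-game variance of the score (each game scores 1, 0.5 or 0).
    var = (win * (1.0 - score) ** 2 +
           draw_ratio * (0.5 - score) ** 2 +
           loss * (0.0 - score) ** 2)
    se = z * math.sqrt(var / games)         # error bar on the score itself
    def to_elo(s):
        return -400.0 * math.log10(1.0 / s - 1.0)
    return (to_elo(score + se) - to_elo(score - se)) / 2

for games in (10000, 40000, 160000):
    print(games, round(elo_error_bar(0.57, 0.22, games), 1))
# 4x the games roughly halves the error bar; 16x cuts it to a quarter.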
Edsel Apostol
Posts: 803
Joined: Mon Jul 17, 2006 5:53 am
Full name: Edsel Apostol

Re: A question on testing methodology

Post by Edsel Apostol »

bob wrote:
Edsel Apostol wrote:
bob wrote:
Edsel Apostol wrote:
bob wrote:
Edsel Apostol wrote:I am currently using 10 opponents for my engine. For every new version or setting that I'm testing I use these 10 opponents. I don't play my engine against itself unless the search and eval are substantially different.

My question is whether the result from this testing is accurate enough, or do I need to also run a round-robin match between the opponents to get a more accurate rating? Do the opponents' results against other engines affect the rating of an engine? Does the number of games played by the opponents make the rating of an engine more stable?

For example:

Format is (Engine, Elo, Number of Games)

Rating List A: (Gauntlet)
Opponent1 2900 1000
Opponent2 2875 1000
Opponent3 2850 1000
Opponent4 2825 1000
Opponent5 2800 1000
EngineA 2775 5000

Rating List B: (Round Robin)
Opponent1 2900 5000
Opponent2 2875 5000
Opponent3 2850 5000
Opponent4 2825 5000
Opponent5 2800 5000
EngineA 2775 5000

Which rating list is more accurate?
Here's the question. Do you want to know whether your engine is moving up or down, or do you want to know exact ratings for everyone? I have tested this exact case and found that the round robin might give more accurate ratings for everyone overall. But it has no influence on the ratings for your two test programs, because your rating and error bar depend on the number of games you play.

You might find the example buried in CCC a couple of years ago. When I was originally discussing cluster testing, Remi made one suggestion that helped a lot: test A against the gauntlet, then test A' against the same gauntlet, and combine _all_ of the PGN into one file before passing it to BayesElo. Those numbers have been rock-solid to date. I also, for fun, added an equal number of PGN games between each pair of opponents, so that A, A', and each opponent played the same number of games. The ratings changed a bit, but the difference between A and A' did not. And if you think about it, you can play the round-robin between the gauntlet opponents first and just save those games, since none of the opponents are changing at all; then add your A and A' vs the gauntlet PGN to the rest and run it through BayesElo if you want to see whether this changes the results.
As for your question, I want to know whether versions/settings of my engine perform better compared to each other. I just thought that if the engine opponents have a more stable rating, it would also reflect in the stability of the rating of the tested engine. This is not the case, though, as shown by the results posted by Adam Hair and as you've pointed out in your posts.

The question now is: should the opponent engines' rating stability be included in the computation of the ratings? For example:

Format is (Engine, Number of Games, Winning Percentage)

Data 1:
EngineA 1000 60%
OpponentA 1000 40%

Data 2:
EngineA 1000 60%
OpponentA 20000 40%

In the above data, I have more confidence in the rating produced from Data 2 than in the one from Data 1, as Data 2 has more games played by OpponentA against other engines, but the current computation of ratings by EloStat and BayesElo seems to produce the same rating for EngineA from Data 1 and Data 2.
And you are falling into a classical trap. Ratings are _not_ absolute numbers. If you play your program (two versions, A and A') against a set of opponents, you should be able to set your opponent ratings to anything you want, and your two versions should still end up in the right order, with the right separation between them. Elo is not an absolute value; the only use for Elo numbers is to subtract them to get the difference. That difference is an absolute number, and is useful.
I understand and I know what you are trying to say, but that is not what my question is all about.

It seems clear to me now how Elo is being calculated: it is based mostly on winning percentages and doesn't take other factors into account. Maybe someday someone will discover a more accurate way to calculate ratings.
What else could you factor in? Elo is used to predict the future outcome of a game between two opponents, based on past performance by each of them. No need to factor in weather, health, mental state, etc. Enough games and that is included by natural selection.
:lol:

The current Elo calculation is not that complex, especially that of EloStat. That's why it has problems in some extreme scenarios. I think Remi takes more factors into account than just the number of games in BayesElo, with the covariance feature.
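For reference, the standard logistic Elo curve (a simplification of what EloStat and BayesElo actually do) maps a winning percentage directly to a rating difference, which is why Data 1 and Data 2 above come out the same for EngineA; the opponent's own game count never enters the formula. A minimal sketch using the example numbers from above:

Code: Select all

import math

def elo_diff_from_score(score):
    # Rating difference implied by a score fraction under the logistic model.
    return -400.0 * math.log10(1.0 / score - 1.0)

def expected_score(rating_a, rating_b):
    # Expected score of A against B; the inverse of the mapping above.
    return 1.0 / (1.0 + 10.0 ** ((rating_b - rating_a) / 400.0))

print(round(elo_diff_from_score(0.60), 1))    # ~70.4 Elo for a 60% score
print(round(expected_score(2900, 2775), 2))   # ~0.67 for Opponent1 vs EngineA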
Edsel Apostol
Posts: 803
Joined: Mon Jul 17, 2006 5:53 am
Full name: Edsel Apostol

Re: A question on testing methodology [here is the data]

Post by Edsel Apostol »

bob wrote:First, two crafty versions (at the moment, these are 23.0 and 23.1... I am having to manually search for the two that were 10 elo apart and that is a pain). The first set is with 23.0 and 23.1 playing the usual gauntlet of 5 programs, 4,000 positions.

Code: Select all

   2 Crafty-23.1-1        2654    4    4 40000   57%  2597   22% 
   3 Glaurung 2.2         2625    5    5 16000   52%  2607   22% 
   4 Toga2                2616    5    5 16000   51%  2607   23% 
   5 Crafty-23.0-1        2561    3    4 40000   45%  2597   21% 
   6 Fruit 2.1            2522    5    5 16000   38%  2607   22% 
And now by factoring in the complete RR between the 5 opponents. Again, each plays the other 4 for 4,000 positions, alternating colors as well.

Code: Select all

   2 Crafty-23.1-1        2653    4    3 40000   57%  2597   22% 
   3 Glaurung 2.2         2625    3    3 48000   54%  2596   23% 
   4 Toga2                2615    4    4 48000   52%  2597   23% 
   5 Crafty-23.0-1        2561    3    4 40000   45%  2597   21% 
   6 Fruit 2.1            2521    4    4 48000   38%  2613   23% 
So the game winning percentages change, since the opponents played each other. And there was a very slight reduction in the error margin. But check out the ratings for 23.1 and 23.0. 23.1 dropped by 1 point... Notice that the error margins for the opponents all dropped since they played 3x as many games... which drops Crafty's error bar a bit (it played no additional games between the two samples).

So pretty much what I saw the last time I ran this kind of test to see if the RR data was useful.

Bob
I think one would only notice a small difference if there is already a big number of games, as in the data above. The differences would be more pronounced when there is only a small number of positions, so that the total number of the opponents' games is much bigger than the engine's number of games. This would be useful for those of us who have limited testing resources and can only afford to test each version with a little more than 1000 games.
Sven
Posts: 4052
Joined: Thu May 15, 2008 9:57 pm
Location: Berlin, Germany
Full name: Sven Schüle

Re: A question on testing methodology [here is the data]

Post by Sven »

bob wrote:
Hart wrote:From what I see, the results look remarkably consistent. It looks like the RR reduced the error somewhat, but not as much as I'd have guessed.
Remember the "square" stuff. 4x the games to reduce the error by 2x.
That "square" argument is flawed since, as you mentioned yourself and I repeated, the RR games are played only once and can be kept forever. For each new engine version to test, you only add its new gauntlet games, so there is no additional effort once the RR games are completed.

More points:

1. Could you show us these two tables with only the first 50% and then only 75% of the gauntlet games included? I would like to see whether a reduced number of games is already sufficient to satisfy your error bar criteria with the RR method (the first method will not do, of course, but the corresponding table is needed for comparison).

2. Did you use "covariance" in both cases (with + without RR)?

3. Could you show us the LOS (likelihood of superiority) results for both cases, please? (A rough sketch of the statistic follows below.)

Sven
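LOS here is the likelihood of superiority: the probability that one engine is genuinely stronger than the other, given the observed result. A rough sketch of the statistic using the usual normal approximation over decisive games, not the exact values BayesElo reports:

Code: Select all

import math

def los(wins, losses):
    # Likelihood of superiority: P(A is stronger than B).  Draws are
    # uninformative about which side is stronger, so only decisive
    # games enter the approximation.
    if wins + losses == 0:
        return 0.5
    return 0.5 * (1.0 + math.erf((wins - losses) /
                                 math.sqrt(2.0 * (wins + losses))))

print(round(los(520, 480), 3))    # hypothetical 52/48 split of decisives
print(round(los(5200, 4800), 3))  # same ratio, 10x the games: near certainty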
bob
Posts: 20943
Joined: Mon Feb 27, 2006 7:30 pm
Location: Birmingham, AL

Re: A question on testing methodology [here is the data]

Post by bob »

Sven Schüle wrote:
bob wrote:
Hart wrote:From what I see, the results look remarkably consistent. It looks like the RR reduced the error somewhat, but not as much as I'd have guessed.
Remember the "square" stuff. 4x the games to reduce the error by 2x.
That "square" argument is flawed since, as you mentioned yourself and I repeated, the RR games are played only once and can be kept forever. For each new engine version to test, you only add its new gauntlet games, so there is no additional effort once the RR games are completed.
I wasn't talking about effort. He commented that adding the extra games did not reduce the error bar as much as he expected. I pointed out that to reduce it by a factor of 2, you need 4x the games. The RR did not quadruple the games played.
More points:

1. Could you show us these two tables with only the first 50% and then only 75% of the gauntlet games included? I would like to see whether a reduced number of games is already sufficient to satisfy your error bar criteria with the RR method (the first method will not do, of course, but the corresponding table is needed for comparison).

2. Did you use "covariance" in both cases (with + without RR)?

3. Could you show us the LOS results for both cases, please?

Sven
I have been using "exactdist", as Remi recommended when this was first discussed.

The input to BayesElo is as follows:

Code: Select all

readpgn master.pgn
readpgn pgn
elo
offset 2600
mm
exactdist
ratings
x

master.pgn has any PGN I want included from older tests, such as the 23.0 games. pgn has the PGN games from the current test. I offset the ratings to 2600 to eliminate negative numbers, since at one point I was parsing the BayesElo output in another program.
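The same run can be scripted; here is a minimal sketch assuming the bayeselo binary is on the PATH and reads the commands above from standard input (the file names match the ones above; everything else is illustrative):

Code: Select all

import subprocess

def bayeselo_ratings(pgn_files, offset=2600):
    # Build the same command list as above: read each PGN file, then
    # compute ratings with mm + exactdist, offset to avoid negative Elo.
    cmds = "".join(f"readpgn {f}\n" for f in pgn_files)
    cmds += f"elo\noffset {offset}\nmm\nexactdist\nratings\nx\n"
    out = subprocess.run(["bayeselo"], input=cmds, text=True,
                         capture_output=True, check=True)
    return out.stdout

# master.pgn holds the fixed round-robin and older-version games;
# pgn holds the games from the current test.
print(bayeselo_ratings(["master.pgn", "pgn"]))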
bob
Posts: 20943
Joined: Mon Feb 27, 2006 7:30 pm
Location: Birmingham, AL

Re: A question on testing methodology [here is the data]

Post by bob »

Edsel Apostol wrote:
bob wrote:First, two crafty versions (at the moment, these are 23.0 and 23.1... I am having to manually search for the two that were 10 elo apart and that is a pain). The first set is with 23.0 and 23.1 playing the usual gauntlet of 5 programs, 4,000 positions.

Code: Select all

   2 Crafty-23.1-1        2654    4    4 40000   57%  2597   22% 
   3 Glaurung 2.2         2625    5    5 16000   52%  2607   22% 
   4 Toga2                2616    5    5 16000   51%  2607   23% 
   5 Crafty-23.0-1        2561    3    4 40000   45%  2597   21% 
   6 Fruit 2.1            2522    5    5 16000   38%  2607   22% 
And now by factoring in the complete RR between the 5 opponents. Again, each plays the other 4 for 4,000 positions, alternating colors as well.

Code: Select all

   2 Crafty-23.1-1        2653    4    3 40000   57%  2597   22% 
   3 Glaurung 2.2         2625    3    3 48000   54%  2596   23% 
   4 Toga2                2615    4    4 48000   52%  2597   23% 
   5 Crafty-23.0-1        2561    3    4 40000   45%  2597   21% 
   6 Fruit 2.1            2521    4    4 48000   38%  2613   23% 
So the game winning percentages change, since the opponents played each other. And there was a very slight reduction in the error margin. But check out the ratings for 23.1 and 23.0. 23.1 dropped by 1 point... Notice that the error margins for the opponents all dropped since they played 3x as many games... which drops Crafty's error bar a bit (it played no additional games between the two samples).

So pretty much what I saw the last time I ran this kind of test to see if the RR data was useful.

Bob
I think one would only notice a small difference if there is already a big number of games, as in the data above. The differences would be more pronounced when there is only a small number of positions, so that the total number of the opponents' games is much bigger than the engine's number of games. This would be useful for those of us who have limited testing resources and can only afford to test each version with a little more than 1000 games.
The error bar for the RR participants will go down a lot if you play enough games. But with a limited number of games for your program, you are stuck with a big error margin. I'll run 1000 games and give the BayesElo output for that, then include the large RR results and run again. Will do this right after an 11:00am class I have coming up.

Should be able to get this done by 12:00 or so.
bob
Posts: 20943
Joined: Mon Feb 27, 2006 7:30 pm
Location: Birmingham, AL

Re: A question on testing methodology [here is the data]

Post by bob »

bob wrote:
Edsel Apostol wrote:
bob wrote:First, two crafty versions (at the moment, these are 23.0 and 23.1... I am having to manually search for the two that were 10 elo apart and that is a pain). The first set is with 23.0 and 23.1 playing the usual gauntlet of 5 programs, 4,000 positions.

Code: Select all

   2 Crafty-23.1-1        2654    4    4 40000   57%  2597   22% 
   3 Glaurung 2.2         2625    5    5 16000   52%  2607   22% 
   4 Toga2                2616    5    5 16000   51%  2607   23% 
   5 Crafty-23.0-1        2561    3    4 40000   45%  2597   21% 
   6 Fruit 2.1            2522    5    5 16000   38%  2607   22% 
And now by factoring in the complete RR between the 5 opponents. Again, each plays the other 4 for 4,000 positions, alternating colors as well.

Code: Select all

   2 Crafty-23.1-1        2653    4    3 40000   57%  2597   22% 
   3 Glaurung 2.2         2625    3    3 48000   54%  2596   23% 
   4 Toga2                2615    4    4 48000   52%  2597   23% 
   5 Crafty-23.0-1        2561    3    4 40000   45%  2597   21% 
   6 Fruit 2.1            2521    4    4 48000   38%  2613   23% 
So the game winning percentages change, since the opponents played each other. And there was a very slight reduction in the error margin. But check out the ratings for 23.1 and 23.0. 23.1 dropped by 1 point... Notice that the error margins for the opponents all dropped since they played 3x as many games... which drops Crafty's error bar a bit (it played no additional games between the two samples).

So pretty much what I saw the last time I ran this kind of test to see if the RR data was useful.

Bob
I think one would only notice a small difference if there is already a big number of games, as in the data above. The differences would be more pronounced when there is only a small number of positions, so that the total number of the opponents' games is much bigger than the engine's number of games. This would be useful for those of us who have limited testing resources and can only afford to test each version with a little more than 1000 games.
The error bar for the RR participants will go down a lot if you play enough games. But with a limited number of games for your program, you are stuck with a big error margin. I'll run 1000 games and give the BayesElo output for that, then include the large RR results and run again. Will do this right after an 11:00am class I have coming up.

Should be able to get this done by 12:00 or so.
Actually, this took almost no time. I ran 1040 games with the two versions of Crafty.

Test results alone:

Code: Select all

Rank Name                  Elo    +    - games score oppo. draws
   2 Crafty-23.1-1        2654   17   16  1040   58%  2595   23% 
   3 Toga2                2639   25   25   416   54%  2613   24% 
   4 Glaurung 2.2         2617   25   25   416   51%  2613   23% 
   5 Crafty-23.0-1        2571   16   16  1040   47%  2595   22% 
   6 Fruit 2.1            2532   25   26   416   39%  2613   22% 
Test results + RR games:

Code: Select all

Rank Name                  Elo    +    - games score oppo. draws
   2 Crafty-23.1-1        2654   17   17  1040   58%  2595   23% 
   3 Glaurung 2.2         2623    5    4 32416   54%  2589   24% 
   4 Toga2                2613    4    4 32416   53%  2591   23% 
   5 Crafty-23.0-1        2570   16   17  1040   47%  2595   22% 
   6 Fruit 2.1            2519    4    4 32416   37%  2614   23% 
As I said, you cannot "cheat the sample-size gods". Adding the RR games is just not going to help recognize the effect of even +/- 10 Elo changes, which are large, much less the 3-5 Elo changes we search for. More games between the RR opponents do not affect our error margin much at all.
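As a rough back-of-the-envelope check on those numbers (a simple normal-approximation sketch, not BayesElo's math), this is roughly how many games it takes before the 95% error bar shrinks below a given true Elo edge:

Code: Select all

import math

def games_needed(elo_edge, draw_ratio=0.3, z=1.96):
    # Games needed so the 95% error bar is smaller than the edge itself.
    score = 1.0 / (1.0 + 10.0 ** (-elo_edge / 400.0))    # expected score
    win = score - draw_ratio / 2
    loss = 1.0 - score - draw_ratio / 2
    sigma = math.sqrt(win * (1.0 - score) ** 2 +
                      draw_ratio * (0.5 - score) ** 2 +
                      loss * score ** 2)                  # per-game std dev
    elo_per_score = 400.0 / math.log(10.0) / (score * (1.0 - score))
    return math.ceil((z * sigma * elo_per_score / elo_edge) ** 2)

for edge in (10, 5, 3):
    print(edge, games_needed(edge))
# roughly 3,000+ games to resolve 10 Elo, ~13,000 for 5, ~36,000 for 3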