A question on testing methodology

Adam Hair
Posts: 3226
Joined: Wed May 06, 2009 10:31 pm
Location: Fuquay-Varina, North Carolina

Re: A question on testing methodology

Post by Adam Hair »

It appears that with Bayeselo it does not matter if you run a complete
round robin or if you just run a gauntlet. I took the games of TL20090922
and TL20080620 that I posted recently and tried different scenarios.
Note: TL20090922 and TL20080620 did not play each other.

Here is a subset of the games:

Code: Select all

Rank Name                       Elo    +    - games score oppo. draws 
   1 Bright-0.4a3(2CPU)          86   55   55   110   65%   -17   27% 
   2 Fruit23-EM64T               18   53   53   110   55%   -17   34% 
   3 TwistedLogic20090922_x64     3   38   38   216   49%     8   34% 
   4 Delfi 5.4 (2CPU)           -22   54   54   108   50%   -17   31% 
   5 TwistedLogic20080620       -36   38   38   221   44%     9   33% 
   6 Spike1.2 Turin             -48   52   52   109   45%   -17   42% 
Here are the same games plus games between the other engines:

Code: Select all

Rank Name                       Elo    +    - games score oppo. draws 
   1 Bright-0.4a3(2CPU)         104   29   29   275   67%   -21   28% 
   2 Fruit23-EM64T               11   28   28   275   52%    -2   35% 
   3 TwistedLogic20090922_x64     3   32   32   216   49%     9   34% 
   4 Delfi 5.4 (2CPU)           -34   28   28   270   44%     7   33% 
   5 TwistedLogic20080620       -37   32   32   221   44%     9   33% 
   6 Spike1.2 Turin             -47   28   28   273   41%    10   40% 
The same subset with Rybka2.3.2a and Stockfish 1.5.1 added:

Code: Select all

Rank Name                       Elo    +    - games score oppo. draws 
   1 Stockfish_151_x64(2CPU)    205   60   60   108   86%   -77   22% 
   2 Rybkav2.3.2a.w32           156   56   56   108   81%   -77   25% 
   3 Bright-0.4a3(2CPU)          27   51   51   110   65%   -78   27% 
   4 Fruit23-EM64T              -43   49   49   110   55%   -78   34% 
   5 TwistedLogic20090922_x64   -44   28   28   324   40%    26   34% 
   6 Delfi 5.4 (2CPU)           -82   50   50   108   50%   -77   31% 
   7 Spike1.2 Turin            -108   48   48   109   45%   -77   42% 
   8 TwistedLogic20080620      -110   30   30   329   33%    25   27% 
Those same games plus Rybka Vs Stockfish:

Code: Select all

Rank Name                       Elo    +    - games score oppo. draws 
   1 Stockfish_151_x64(2CPU)    194   29   29   377   79%   -28   27% 
   2 Rybkav2.3.2a.w32           137   27   27   377   71%   -20   31% 
   3 Bright-0.4a3(2CPU)          34   35   35   218   48%    43   31% 
   4 Fruit23-EM64T              -34   36   36   218   39%    43   33% 
   5 TwistedLogic20090922_x64   -43   28   28   324   40%    25   34% 
   6 Delfi 5.4 (2CPU)           -80   37   37   216   34%    44   28% 
   7 Spike1.2 Turin             -98   37   37   217   31%    44   33% 
   8 TwistedLogic20080620      -110   29   29   329   33%    25   27% 
Here are all of the games played by both versions of TL:

Code: Select all

Rank Name                        Elo    +    - games score oppo. draws 
   1 Stockfish_151_x64(2CPU)     232   65   65   108   86%   -49   22% 
   2 Rybkav2.3.2a.w32            183   60   60   108   81%   -49   25% 
   3 Rybka v2.2n2x64(2CPU)       179   63   63    92   81%   -44   29% 
   4 Thinker54d Inertx64(2CPU)   178   61   61   106   79%   -49   21% 
   5 Stockfish_14_x64_ja(2CPU)   169   58   58   108   81%   -49   30% 
   6 Glaurung22_x64_ja(2CPU)     167   59   59   108   80%   -49   29% 
   7 Toga141se-2cpu               92   55   55   107   70%   -49   30% 
   8 Bright-0.4a3(2CPU)           54   54   54   110   65%   -50   27% 
   9 Crafty_230_x64_ja(2CPU)      -8   52   52   106   56%   -50   39% 
  10 Fruit23-EM64T               -15   52   52   110   55%   -50   34% 
  11 TwistedLogic20090922_x64    -20   17   17  1125   47%     5   31% 
  12 Naum2.0_x64                 -31   53   53   106   53%   -49   36% 
  13 Delfi 5.4 (2CPU)            -51   53   53   109   50%   -49   30% 
  14 Scorpio_21_x64_ja(2cpu)     -71   54   54   108   47%   -49   25% 
  15 Frenzee_Feb08_x64           -72   54   54   107   47%   -49   28% 
  16 TwistedLogic20080620        -79   17   17  1119   40%     2   28% 
  17 Spike1.2 Turin              -80   51   51   109   45%   -49   42% 
  18 Et_Chess_130108            -100   55   55   105   43%   -50   24% 
  19 BugChess2_V1_6_3_x64       -118   55   55   108   41%   -49   24% 
  20 Booot415                   -132   54   54   108   38%   -49   31% 
  21 Colossus2008b              -154   53   53   108   34%   -49   35% 
  22 Movei00_8_438(10 10 10)    -157   54   54   105   34%   -49   35% 
  23 Alaric707                  -167   55   55   108   34%   -49   25% 
And now all games played by these participants. Please note that not all
opponents have played each other.

Code: Select all

Rank Name                        Elo    +    - games score oppo. draws 
   1 Stockfish_151_x64(2CPU)     213   21   21   915   76%    12   28% 
   2 Rybka v2.2n2x64(2CPU)       208   22   22   779   72%    45   33% 
   3 Rybkav2.3.2a.w32            176   20   20   917   72%     8   31% 
   4 Thinker54d Inertx64(2CPU)   172   25   25   504   57%   121   39% 
   5 Stockfish_14_x64_ja(2CPU)   161   20   20   862   69%    22   33% 
   6 Glaurung22_x64_ja(2CPU)     120   31   31   336   53%    97   35% 
   7 Toga141se-2cpu              111   25   25   507   47%   129   37% 
   8 Bright-0.4a3(2CPU)           81   27   27   475   54%    53   32% 
   9 Fruit23-EM64T                -6   27   27   475   41%    63   33% 
  10 TwistedLogic20090922_x64    -20   17   17  1125   47%     5   31% 
  11 Crafty_230_x64_ja(2CPU)     -25   33   33   306   32%   106   32% 
  12 Naum2.0_x64                 -29   33   33   306   31%   106   32% 
  13 Delfi 5.4 (2CPU)            -58   28   28   470   33%    70   28% 
  14 Spike1.2 Turin              -65   27   27   472   32%    70   33% 
  15 Frenzee_Feb08_x64           -78   36   36   307   27%   105   22% 
  16 TwistedLogic20080620        -79   17   17  1118   40%     2   28% 
  17 Scorpio_21_x64_ja(2cpu)     -98   37   37   307   26%   104   18% 
  18 Et_Chess_130108            -100   55   55   105   43%   -50   24% 
  19 Booot415                   -105   36   36   308   23%   105   25% 
  20 BugChess2_V1_6_3_x64       -118   54   54   108   41%   -50   24% 
  21 Alaric707                  -146   44   44   215   23%    72   22% 
  22 Colossus2008b              -155   53   53   108   34%   -50   35% 
  23 Movei00_8_438(10 10 10)    -158   54   54   105   34%   -50   35% 
There is just a slight difference in the first two sets of examples, and none
in the last set.

It seems that, in general, gauntlets will give you the same information as
round robin tournaments. If your engine performs poorly against one opponent
that is in turn very weak against the other engines, then there would be some
difference between a gauntlet and a round robin. But how likely is that?
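The scenarios above differ only in which games are fed to BayesElo, so this kind of comparison is easy to reproduce by filtering one PGN collection. Below is a minimal sketch using only the Python standard library; the input file name is a placeholder, while the engine names are taken from the tables above.

Code: Select all

# Filter a PGN file down to only the games that involve one of the test
# engines, so BayesElo can be run on the "gauntlet only" subset and on the
# full file separately.  The file name "all_games.pgn" is a placeholder.
import re

TEST_ENGINES = {"TwistedLogic20090922_x64", "TwistedLogic20080620"}

with open("all_games.pgn", encoding="latin-1") as f:
    text = f.read().strip()

# A new game starts at its tag section; "[Event " after a blank line is a
# safe boundary for splitting the file into individual games.
games = re.split(r'\n\s*\n(?=\[Event )', text)

gauntlet = []
for game in games:
    white = re.search(r'\[White "([^"]*)"\]', game)
    black = re.search(r'\[Black "([^"]*)"\]', game)
    names = {m.group(1) for m in (white, black) if m}
    if names & TEST_ENGINES:          # at least one test engine played
        gauntlet.append(game)

with open("gauntlet_only.pgn", "w", encoding="latin-1") as f:
    f.write("\n\n".join(gauntlet) + "\n")

print(f"{len(gauntlet)} of {len(games)} games involve a test engine")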
bob
Posts: 20943
Joined: Mon Feb 27, 2006 7:30 pm
Location: Birmingham, AL

Re: A question on testing methodology

Post by bob »

Edsel Apostol wrote:I am currently using 10 opponents for my engine. For every new version or setting that I'm testing, I use these 10 opponents. I don't play my engine against itself unless the search and eval are very different.

My question is whether the result from this testing is accurate enough, or do I also need to run a round robin among the opponents to get a more accurate rating? Do the opponents' results against other engines affect the rating of an engine? Does the number of games played by the opponents make the rating of an engine more stable?

For example:

Format is (Engine, Elo, Number of Games)

Rating List A: (Gauntlet)
Opponent1 2900 1000
Opponent2 2875 1000
Opponent3 2850 1000
Opponent4 2825 1000
Opponent5 2800 1000
EngineA 2775 5000

Rating List B: (Round Robin)
Opponent1 2900 5000
Opponent2 2875 5000
Opponent3 2850 5000
Opponent4 2825 5000
Opponent5 2800 5000
EngineA 2775 5000

Which rating list is more accurate?
Here's the question. Do you want to know whether your engine is moving up or down, or do you want to know exact ratings for everyone? I have tested this exact case and found that the round robin might give more accurate ratings for everyone overall. But it has no influence on the ratings for your two test programs, because your rating and error bar are dependent on the number of games you play.

You might find the example buried in CCC a couple of years ago. When I was originally discussing cluster testing, Remi made one suggestion that helped a lot: test A against the gauntlet, then test A' against the same gauntlet, and combine _all_ of the PGN into one file before passing it to BayesElo. Those numbers have been rock-solid to date. I also, for fun, added an equal number of PGN games between each pair of opponents, so that A, A', and each opponent played the same number of games. The ratings changed a bit, but the difference between A and A' did not. And if you think about it, you can play the gauntlet round robin first and just save those games, since none of the opponents are changing at all; then add your A and A' vs. gauntlet PGN to the rest and run it through BayesElo if you want to see how this changes (or doesn't change) the results.
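That workflow (two gauntlets plus an optional pre-played opponents round robin, all concatenated and fed to BayesElo in one run) is easy to script. A rough sketch is shown below, assuming a bayeselo binary on the PATH and placeholder file names; the command sequence reflects common BayesElo usage but should be verified against your version.

Code: Select all

# Concatenate the gauntlet PGN for A and A' (plus an optional, pre-played
# opponents round robin) and push everything through BayesElo in one run.
# File names are placeholders; "bayeselo" is assumed to be on the PATH, and
# the command sequence (readpgn / elo / mm / exactdist / ratings) reflects
# common BayesElo usage but should be checked against your version.
import subprocess

pgn_files = ["A_vs_gauntlet.pgn", "Aprime_vs_gauntlet.pgn", "opponents_rr.pgn"]

with open("combined.pgn", "w") as out:
    for name in pgn_files:
        with open(name) as f:
            out.write(f.read())
            out.write("\n")

commands = "\n".join([
    "readpgn combined.pgn",  # load every game into a single rating pool
    "elo",                   # enter the rating-calculation interface
    "mm",                    # maximum-likelihood fit
    "exactdist",             # compute the error bars
    "ratings",               # print the rating table
    "x",                     # leave the elo interface
    "x",                     # quit bayeselo
]) + "\n"

result = subprocess.run(["bayeselo"], input=commands,
                        capture_output=True, text=True)
print(result.stdout)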
bob
Posts: 20943
Joined: Mon Feb 27, 2006 7:30 pm
Location: Birmingham, AL

Re: A question on testing methodology

Post by bob »

Edsel Apostol wrote:
michiguel wrote:
Hart wrote:I would think a one-time RR for opponents 1-5 should be enough to establish their ratings and give you better results. This just came up in another thread, and while I am not sure what the expert opinion is, it makes sense to know their relative ratings beforehand in order to more accurately gauge improvements in your program. In other words, the more the players are connected, the better your results.
You are right if you are interested in knowing the rating of the engine, but IMO not if you want to know how much progress an engine made compared to the previous version.
Unless the rating program's calculation is wrongly affected by this, the influence of games between third parties should be minimal or close to zero. After all, what is important in this case is the difference between the performance of Engine_A and that of Engine_A* (modified) against the same gauntlet (i.e. not the error of each Elo, but the error of the difference between the engines).

Miguel
What if the rating difference between Engine_A and Engine_A* is different in the two lists? Which rating list should be trusted, the gauntlet alone or the gauntlet plus the round robin among the opponents?

If, for example, Engine_A and Engine_A* have the same winning percentage against the gauntlet, they would have the same Elo rating even though they perform differently against the individual opponents. When you take the results of the opponents' round robin into account, is it possible that the ratings of Engine_A and Engine_A* wouldn't be the same?
When I tested this idea, it never happened. A and A' will have one rating when played against the gauntlet. They will likely have different ratings when they play the gauntlet and the gauntlet then plays a RR as well. But the difference should not be significant, and it would be quite difficult to make everything work out exactly. For example, A and A' combined will play each gauntlet program twice as many games as any single gauntlet program plays against another. That introduces a bias as well.

But for determining whether A or A' is better, when I tried this very test a year or two ago, there was no difference.
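Miguel's point about the error of the difference, and Bob's observation that the A-A' gap barely moves, can be checked with a back-of-the-envelope calculation. Below is a rough sketch that treats each game result as an independent win/loss (ignoring BayesElo's draw model, which overstates the error somewhat); the 52%/54% scores and the 5000-game count are made up for illustration.

Code: Select all

# Rough estimate of the Elo difference between A and A' from their gauntlet
# scores, plus the error of that difference (delta method on a binomial
# score).  This ignores draws and is only meant to show that the error of
# the difference is what matters, not each absolute Elo value.
import math

def score_to_elo(score):
    """Logistic mapping from a score fraction to an Elo offset."""
    return -400.0 * math.log10(1.0 / score - 1.0)

def elo_standard_error(score, n_games):
    """Standard error of the Elo offset implied by a binomial score."""
    se_score = math.sqrt(score * (1.0 - score) / n_games)
    slope = 400.0 / (math.log(10.0) * score * (1.0 - score))
    return se_score * slope

def version_difference(score_new, score_old, n_games):
    diff = score_to_elo(score_new) - score_to_elo(score_old)
    se = math.hypot(elo_standard_error(score_new, n_games),
                    elo_standard_error(score_old, n_games))
    return diff, se

# Hypothetical example: A' scores 54%, A scores 52%, each over 5000 games.
diff, se = version_difference(0.54, 0.52, 5000)
print(f"A' - A = {diff:+.1f} Elo, 95% interval +/- {1.96 * se:.1f}")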
bob
Posts: 20943
Joined: Mon Feb 27, 2006 7:30 pm
Location: Birmingham, AL

Re: A question on testing methodology

Post by bob »

Sven Schüle wrote:
michiguel wrote:
Hart wrote:I would think a one time RR for opponents 1-5 should be enough to establish their ratings and give you better results. This just came up in another thread and while I am not sure what the expert opinion is it makes sense that you know what their relative ratings beforehand to more accurately gauge improvements in your program. In other words, the more players are connected, the better your results.
You are right if you are interested to know the rating for the engine, but IMO, not if you want to know how much an engine made progress compared to the previous version.
Unless the calculation of the rating program is wrongly affected by this, the influence of games between third parties should be minimal or close to zero. After all, what it is important in this case is the difference between Engine_A and the performance of Engine_A* (modified) against the same gauntlet (i.e. not the error of each ELO, but the error of the difference between the engines).

Miguel
This topic was indeed discussed in the past (sorry for not providing the link here), but for me there was no satisfying conclusion. My point was, and still is, that an additional RR between the opponents should also improve the error bars for the ratings of Engine_A and Engine_A*. To prove this would require the following comparison:

Method 1:
- play gauntlet of A against opponents
- play gauntlet of A* against opponents
- make rating list of all these games and look at ratings and error bars for A and A*

Method 2:
- same as method 1 but also play RR between opponents and include these games as well

The assumption that the ratings of A and A* are not affected by the choice of method 1 or 2 may hold, but it is possible that method 2 improves the error bars and therefore *may* help to reduce the number of games required to reach the defined maximum error bars. My idea behind this is that playing against "more stable" opponents should also result in a "more stable" rating.

I don't recall whether this has really been tested by someone.

Sven
I did this the year before last and reported the results here. It does improve the accuracy of the ratings for each gauntlet member, since otherwise their ratings are calculated only from games against your two versions, whose own ratings come from playing against everybody. But there was no significant difference in the ratings of A and A' when doing that. The good thing is that if you insist on doing this, you can play the gauntlet round robin just once and save the PGN, since those programs are not changing. You then play A and A' vs. the gauntlet, add in the gauntlet RR PGN, and run it through BayesElo.
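Sven's Method 1 vs. Method 2 comparison can also be mocked up with simulated data, which is roughly what Bob describes having done with real games. The sketch below uses a plain Bradley-Terry model fitted by minorization-maximization as a stand-in for BayesElo (no draw model, no colour advantage); the engine names, true Elo values, and game counts are all invented for illustration.

Code: Select all

# Toy comparison of Method 1 (gauntlets only) and Method 2 (gauntlets plus
# an opponents round robin) on simulated decisive games, using a plain
# Bradley-Terry fit.  It only illustrates how the estimated A' - A
# difference reacts to adding third-party games.
import math
import random

random.seed(1)
TRUE_ELO = {"A": 0, "Aprime": 10, "Opp1": 80, "Opp2": 40,
            "Opp3": 0, "Opp4": -40, "Opp5": -80}

def play(p, q, n):
    """Simulate n decisive games between p and q; return wins for p."""
    e = 1.0 / (1.0 + 10.0 ** ((TRUE_ELO[q] - TRUE_ELO[p]) / 400.0))
    return sum(random.random() < e for _ in range(n))

def fit_elo(results, iters=500):
    """Minorization-maximization fit of a plain Bradley-Terry model."""
    players = sorted({p for pair in results for p in pair})
    gamma = {p: 1.0 for p in players}
    wins = {p: 0.0 for p in players}
    for (p, q), (wp, wq) in results.items():
        wins[p] += wp
        wins[q] += wq
    for _ in range(iters):
        new = {}
        for p in players:
            denom = 0.0
            for (a, b), (wa, wb) in results.items():
                if p in (a, b):
                    other = b if p == a else a
                    denom += (wa + wb) / (gamma[p] + gamma[other])
            new[p] = wins[p] / denom
        mean = sum(new.values()) / len(new)
        gamma = {p: g / mean for p, g in new.items()}
    return {p: 400.0 * math.log10(g) for p, g in gamma.items()}

opponents = ["Opp1", "Opp2", "Opp3", "Opp4", "Opp5"]
N = 400  # simulated games per pairing

# Method 1: A and A' each play a gauntlet against the five opponents.
results = {}
for tester in ("A", "Aprime"):
    for opp in opponents:
        w = play(tester, opp, N)
        results[(tester, opp)] = (w, N - w)
method1 = fit_elo(dict(results))

# Method 2: the same gauntlet games plus a full opponents round robin.
for i, p in enumerate(opponents):
    for q in opponents[i + 1:]:
        w = play(p, q, N)
        results[(p, q)] = (w, N - w)
method2 = fit_elo(results)

print("Method 1: A' - A =", round(method1["Aprime"] - method1["A"], 1))
print("Method 2: A' - A =", round(method2["Aprime"] - method2["A"], 1))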
jwes
Posts: 778
Joined: Sat Jul 01, 2006 7:11 am

Re: A question on testing methodology

Post by jwes »

Another interesting experiment would be to compare the Elo difference found with gauntlet testing to the one found by playing A against A' directly.
Hart

Re: A question on testing methodology

Post by Hart »

What are the implications if one of your opponents varies in strength by as much as 60 Elo, well outside the normal 95% confidence intervals, when tested against two different versions of your program? Shouldn't your opponents' ratings remain constant throughout if they are in fact the same programs?
Edsel Apostol
Posts: 803
Joined: Mon Jul 17, 2006 5:53 am
Full name: Edsel Apostol

Re: A question on testing methodology

Post by Edsel Apostol »

Adam Hair wrote:It appears that with Bayeselo it does not matter if you run a complete
round robin or if you just run a gauntlet. I took the games of TL20090922
and TL20080620 that I posted recently and tried different scenarios.
Note: TL20090922 and TL20080620 did not play each other.

There is just a slight difference in the first two sets of examples, and none
in the last set.

It seems that, in general, gauntlets will give you the same information as
round robin tournaments. If your engine performs poorly against one opponent
that is in turn very weak against the other engines, then there would be some
difference between a gauntlet and a round robin. But how likely is that?
Thanks for the data you've posted, Adam. It answered most of my questions. It seems that the formula/algorithm for computing the Elo is quite simple and is based only on average winning percentages; it doesn't take the rating performance of the opponents into account.
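For comparison, the simple score-based calculation being described is essentially the classical performance-rating formula: the average opponent rating plus the logistic transform of the score percentage. A tiny sketch using figures from the first table above; BayesElo performs a joint maximum-likelihood fit over all pairings, so its numbers will only roughly agree with this.

Code: Select all

# Classical performance rating: average opponent rating plus the logistic
# transform of the score fraction.  The 49% / +8 figures come from the
# TwistedLogic20090922_x64 line in the first table above; the result only
# approximates what BayesElo's joint fit reports (+3 there).
import math

def performance_rating(score, avg_opponent_elo):
    return avg_opponent_elo - 400.0 * math.log10(1.0 / score - 1.0)

print(round(performance_rating(0.49, 8)))   # roughly +1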
Edsel Apostol
Posts: 803
Joined: Mon Jul 17, 2006 5:53 am
Full name: Edsel Apostol

Re: A question on testing methodology

Post by Edsel Apostol »

bob wrote:
Edsel Apostol wrote:I am currently using 10 opponents for my engine. For every new version or setting that I'm testing, I use these 10 opponents. I don't play my engine against itself unless the search and eval are very different.

My question is whether the result from this testing is accurate enough, or do I also need to run a round robin among the opponents to get a more accurate rating? Do the opponents' results against other engines affect the rating of an engine? Does the number of games played by the opponents make the rating of an engine more stable?

For example:

Format is (Engine, Elo, Number of Games)

Rating List A: (Gauntlet)
Opponent1 2900 1000
Opponent2 2875 1000
Opponent3 2850 1000
Opponent4 2825 1000
Opponent5 2800 1000
EngineA 2775 5000

Rating List B: (Round Robin)
Opponent1 2900 5000
Opponent2 2875 5000
Opponent3 2850 5000
Opponent4 2825 5000
Opponent5 2800 5000
EngineA 2775 5000

Which rating list is more accurate?
Here's the question. Do you want to know whether your engine is moving up or down, or do you want to know exact ratings for everyone? I have tested this exact case and found that the round robin might give more accurate ratings for everyone overall. But it has no influence on the ratings for your two test programs, because your rating and error bar are dependent on the number of games you play.

You might find the example buried in CCC a couple of years ago. When I was originally discussing cluster testing, Remi made one suggestion that helped a lot: test A against the gauntlet, then test A' against the same gauntlet, and combine _all_ of the PGN into one file before passing it to BayesElo. Those numbers have been rock-solid to date. I also, for fun, added an equal number of PGN games between each pair of opponents, so that A, A', and each opponent played the same number of games. The ratings changed a bit, but the difference between A and A' did not. And if you think about it, you can play the gauntlet round robin first and just save those games, since none of the opponents are changing at all; then add your A and A' vs. gauntlet PGN to the rest and run it through BayesElo if you want to see how this changes (or doesn't change) the results.
As for your question, I want to know whether different versions/settings of my engine perform better compared to each other. I just thought that if the opponent engines have more stable ratings, that would also be reflected in the stability of the rating of the tested engine. This is not the case, though, as shown by the results posted by Adam Hair and as you've pointed out in your posts.

The question now is: should the stability of the opponent engines' ratings be included in the computation of the ratings? For example:

Format is (Engine, Number of Games, Winning Percentage)

Data 1:
EngineA 1000 60%
OpponentA 1000 40%

Data 2:
EngineA 1000 60%
OpponentA 20000 40%

In the above data, I have more confidence in the rating produced from Data 2 than in the one from Data 1, since in Data 2 OpponentA has played more games against other engines. But the current rating computation of EloStat and BayesElo seems to produce the same rating for EngineA from Data 1 and Data 2.
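One way to see why both data sets give EngineA the same rating, and nearly the same error bar: EngineA's uncertainty is driven by the 1000 games it played itself, not by how many games OpponentA has played elsewhere. Below is a rough binomial sketch (draws and BayesElo's full covariance treatment are ignored) showing the 95% interval for a 60% score as a function of an engine's own game count.

Code: Select all

# Approximate Elo and 95% confidence interval implied by a score fraction
# and a game count, treating each result as an independent win/loss (no
# draw model).  EngineA's interval is set by its own 1000 games; the extra
# 19000 games OpponentA plays in Data 2 narrow OpponentA's interval instead.
import math

def elo_interval(score, n_games):
    def to_elo(s):
        return -400.0 * math.log10(1.0 / s - 1.0)
    margin = 1.96 * math.sqrt(score * (1.0 - score) / n_games)
    return to_elo(score), to_elo(score - margin), to_elo(score + margin)

for n in (1000, 20000):
    mid, lo, hi = elo_interval(0.60, n)
    print(f"{n:6d} games at 60%: {mid:+.0f} Elo, 95% CI [{lo:+.0f}, {hi:+.0f}]")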
bob
Posts: 20943
Joined: Mon Feb 27, 2006 7:30 pm
Location: Birmingham, AL

Re: A question on testing methodology

Post by bob »

jwes wrote:Another interesting experiment would be to compare the Elo difference found with gauntlet testing to the one found by playing A against A' directly.
I've done that and was unhappy with the results. For some changes, A vs. A' showed A' to be +100 Elo better, where the gauntlet suggested more like +8 to +10. There were quite a few contradictory results when I played with the idea, whereas the gauntlet always seemed to be consistent.
bob
Posts: 20943
Joined: Mon Feb 27, 2006 7:30 pm
Location: Birmingham, AL

Re: A question on testing methodology

Post by bob »

Hart wrote:What are the implications if one of your opponents varies in strength by as much as 60 Elo, well outside the normal 95% confidence intervals, when tested against two different versions of your program? Shouldn't your opponents' ratings remain constant throughout if they are in fact the same programs?
Why would they? In the Elo system, both players are affected by the outcome of their games. This sounds like the classic misunderstanding that a rating is an absolute measure. It isn't; it is a very relative measure. The difference between the ratings of two opponents should not change after you complete a long match, regardless of the starting ratings of either. The absolute values might be significantly different, since starting ratings affect the final ratings. I _never_ look at the actual Elo value, because it is really meaningless. I look at the difference in Elo between the two programs I am interested in, as that is what tells me how things are changing.
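Bob's point that only the difference matters follows directly from the Elo expected-score formula, which depends only on the rating gap; shifting every rating by the same constant changes nothing. A two-line illustration:

Code: Select all

# The Elo expected-score formula depends only on the rating difference, so
# adding the same constant to every rating (a different starting offset)
# leaves every prediction unchanged.
def expected_score(r_a, r_b):
    return 1.0 / (1.0 + 10.0 ** ((r_b - r_a) / 400.0))

print(expected_score(2800, 2750))   # ~0.571 for a 50-point gap
print(expected_score(100, 50))      # identical: same 50-point gap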