A question on testing methodology

Discussion of chess software programming and technical issues.

Moderator: Ras

User avatar
nthom
Posts: 112
Joined: Thu Mar 09, 2006 6:15 am
Location: Australia

Re: A question on testing methodology

Post by nthom »

michiguel wrote:
Hart wrote:http://talkchess.com/forum/viewtopic.php?t=30676

This is what I am referring to:

Code: Select all

1 Twisted Logic 20090922    2839   23   22  1432   90%  2437    3%
2 Hermann 2.5               2647   17   16  1428   74%  2437    6% 

Code: Select all

1 Twisted Logic 20090922    2770   20   20  1432   85%  2428    3%
2 Hermann 2.5               2666   17   17  1428   77%  2428    6% 
That was a gauntlet run. The difference between these two engines is 192 in the first case and 104 in the second, a difference of 88 Elo between the two sets. Even if both gauntlet matches were included in the same BayesElo analysis, I can't believe it would do more than halve that difference, in which case it would still be well outside the 95% confidence intervals. Should a 5 Elo change in your program really cause two of your opponents to be rated 88 Elo further apart?
Yes, it can! Many changes in one engine have a great impact on how it performs against one specific opponent. So, in your case, TL and Hermann based their ratings only on how they performed against the test engine. That is why they can fluctuate like crazy. That is also why we have to test against a variety of opponents.

Look at Hermann and TL: the change was positive for one and negative for the other, and the calculation reflects that.

Miguel
The change I made in LittleThought between the two tests (the source of these results) was to improve (?) its knowledge of pinned pieces in the second run. However, it slowed the engine down so much that overall it became worse.
User avatar
michiguel
Posts: 6401
Joined: Thu Mar 09, 2006 8:30 pm
Location: Chicago, Illinois, USA

Re: A question on testing methodology

Post by michiguel »

Edsel Apostol wrote:
michiguel wrote:
Edsel Apostol wrote:
Adam Hair wrote:It appears that with Bayeselo it does not matter if you run a complete
round robin or if you just run a gauntlet. I took the games of TL20090922
and TL20080620 that I posted recently and tried different scenarios.
Note: TL20090922 and TL20080620 did not play each other

There is just a slight difference in the first two sets of examples, and none
in the last set.

It seems that, in general, gauntlets will give you the same information as
round robin tournaments. It does seem that if your engine performs poorly
against one opponent that is very weak against the other engines then
there would be some difference between gauntlet and round robin. But,
how likely is that?
Thanks for the data you've posted, Adam. It answered most of my questions. It seems that the formula/algorithm for computing the Elo is quite simple: it is based only on average winning percentages and does not take the rating performance of the opponents into account.
No, you are getting it all wrong!!

Miguel
Please elaborate. Maybe you're confusing the Elo computation for humans with the one for engines. I am talking about the computation for engines here. That is just my observation, by the way.
No, you observed the result, but you did not observe how it is calculated. You see that the results are compatible with your idea that the calculation is simple, and then conclude that it must be. No, it is not that simple. The exact methodology may vary between programs, but the concept is the same. I doubt that programs just take an average of the opponents and look up the Elo from the winning percentage. Mine doesn't :-)
For each game you have to balance the chances to win or lose based on a Gaussian curve (BTW, mine does not use a Gaussian curve) determined by the rating difference with your opponent. The problem is that you do not know those ratings either, so you have to find a solution in which the probabilities calculated this way are consistent with the probabilities observed (games won and lost).
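To make the idea concrete, here is a toy sketch in C (explicitly not BayesElo's algorithm and not the one in Miguel's program; the crosstable is made up and the update rule is an arbitrary choice): the ratings are adjusted until the scores predicted game by game from the rating differences agree with the scores actually observed.

Code: Select all

/* Toy illustration only: invented crosstable; the anchoring and the
   conservative update step are arbitrary choices, not anybody's real code. */
#include <stdio.h>
#include <math.h>

#define N 4   /* engines in this toy pool */

static double expected(double ri, double rj)   /* logistic Elo expectation of i vs j */
{
    return 1.0 / (1.0 + pow(10.0, (rj - ri) / 400.0));
}

int main(void)
{
    /* made-up crosstable: points[i][j] = points scored by i against j */
    double points[N][N] = { { 0, 6, 7, 9 }, { 4, 0, 5, 8 }, { 3, 5, 0, 6 }, { 1, 2, 4, 0 } };
    double games [N][N] = { { 0,10,10,10 }, {10, 0,10,10 }, {10,10, 0,10 }, {10,10,10, 0 } };
    double R[N] = { 0 };
    const double slope = 0.25 * log(10.0) / 400.0;  /* max change of expected score per Elo, per game */

    for (int it = 0; it < 1000; it++) {
        for (int i = 0; i < N; i++) {
            double scored = 0, predicted = 0, n = 0;
            for (int j = 0; j < N; j++) {
                scored    += points[i][j];
                predicted += games[i][j] * expected(R[i], R[j]);
                n         += games[i][j];
            }
            /* conservative step: move the rating toward the value where
               predicted and observed scores agree */
            R[i] += (scored - predicted) / (n * slope);
        }
        /* ratings are only relative: pin the pool average to 0 */
        double avg = 0;
        for (int i = 0; i < N; i++) avg += R[i] / N;
        for (int i = 0; i < N; i++) R[i] -= avg;
    }
    for (int i = 0; i < N; i++)
        printf("engine %d: %+5.0f\n", i, R[i]);
    return 0;
}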

Miguel


For example, in computer chess, ratings are calculated from a version's winning percentage over the total games it has played. In human chess this is not the case: players are given a provisional rating, the rating is then updated based on further games, and the opponents' ratings are taken into account. I'm wondering whether the updated rating is equivalent to simply summing up the total games played by that player and computing the rating from that.
It is supposed to be, if you play an infinite number of games. The way it is calculated for humans is a gross approximation that has worked really well for practical purposes. Computationally, humans approach their rating by a "steepest descent-like" algorithm.
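For comparison, a minimal sketch of that incremental, "steepest descent-like" update (the K factor, starting rating and game sequence below are arbitrary; real federations use more elaborate rules):

Code: Select all

/* After each game the rating takes one small step toward the value that
   would have predicted the observed result: R' = R + K * (score - expected). */
#include <stdio.h>
#include <math.h>

static double expected(double r_own, double r_opp)
{
    return 1.0 / (1.0 + pow(10.0, (r_opp - r_own) / 400.0));
}

int main(void)
{
    double rating = 2000.0;        /* arbitrary provisional starting rating */
    const double K = 20.0;         /* step size of the update               */

    /* invented sequence of games: opponent rating and score (1, 0.5 or 0) */
    double opp[]   = { 2100, 2100, 1900, 2200, 2000, 2050 };
    double score[] = { 0.5,  1.0,  1.0,  0.0,  1.0,  0.5  };
    int n = sizeof(opp) / sizeof(opp[0]);

    for (int i = 0; i < n; i++) {
        double e = expected(rating, opp[i]);
        rating += K * (score[i] - e);    /* one gradient-like step per game */
        printf("game %d: scored %.1f vs %.0f (expected %.2f) -> rating %.1f\n",
               i + 1, score[i], opp[i], e, rating);
    }
    return 0;
}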

Miguel
Edsel Apostol
Posts: 803
Joined: Mon Jul 17, 2006 5:53 am
Full name: Edsel Apostol

Re: A question on testing methodology

Post by Edsel Apostol »

michiguel wrote:
Edsel Apostol wrote:
michiguel wrote:
Edsel Apostol wrote:
Adam Hair wrote:It appears that with Bayeselo it does not matter if you run a complete
round robin or if you just run a gauntlet. I took the games of TL20090922
and TL20080620 that I posted recently and tried different scenarios.
Note: TL20090922 and TL20080620 did not play each other

There is just a slight difference in the first two sets of examples, and none
in the last set.

It seems that, in general, gauntlets will give you the same information as
round robin tournaments. It does seem that if your engine performs poorly
against one opponent that is very weak against the other engines then
there would be some difference between gauntlet and round robin. But,
how likely is that?
Thanks for the data you've posted, Adam. It answered most of my questions. It seems that the formula/algorithm for computing the Elo is quite simple: it is based only on average winning percentages and does not take the rating performance of the opponents into account.
No, you are getting it all wrong!!

Miguel
Please elaborate. Maybe you're confusing the Elo computation for humans with the one for engines. I am talking about the computation for engines here. That is just my observation, by the way.
No, you observed the result, but you did not observe how it is calculated. You see that the results are compatible with your idea that the calculation is simple, and then conclude that it must be. No, it is not that simple. The exact methodology may vary between programs, but the concept is the same. I doubt that programs just take an average of the opponents and look up the Elo from the winning percentage. Mine doesn't :-)
For each game you have to balance the chances to win or lose based on a Gaussian curve (BTW, mine does not use a Gaussian curve) determined by the rating difference with your opponent. The problem is that you do not know those ratings either, so you have to find a solution in which the probabilities calculated this way are consistent with the probabilities observed (games won and lost).

Miguel
The reason I said the calculation is simple is that it is based only on the win/draw/loss information from the current set of games.

Here's the relevant algorithm I've found on the WB forum:

Code: Select all

1) Use number of wins, loss, and draws
   W = number of wins, L = number of lost, D = number of draws
   n = number of games (W + L + D)
   m = mean value

2) Apply the following formulas to compute s 
  ( SQRT: square root of. )

   x = W*(1-m)*(1-m) + D*(0.5-m)*(0.5-m) + L*(0-m)*(0-m)   
   s = SQRT( x/(n-1) )

3) Compute error margin A (use 1.96  for 95% confidence)
   
   A = 1.96 * s / SQRT(n)

4) State with 95% confidence:
   The 'real' result should be somewhere in between m-A to m+A

5) Lookup the ELO figures with the win% from m-A and m+A to get the lower and higher values in the error margin.
http://www.open-aurec.com/wbforum/viewt ... ?f=4&t=949
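For what it's worth, the recipe above can be turned into a few lines of C. This is just my reading of it (in particular, I take "mean value" to be the score fraction (W + 0.5*D)/n, and use the usual logistic formula for the Elo lookup); the W/D/L numbers are placeholders.

Code: Select all

#include <stdio.h>
#include <math.h>

static double score_to_elo(double p)      /* Elo difference implied by score fraction p */
{
    return -400.0 * log10(1.0 / p - 1.0);
}

int main(void)
{
    double W = 300, D = 250, L = 200;      /* placeholder result */
    double n = W + D + L;
    double m = (W + 0.5 * D) / n;          /* step 1: mean score */

    /* step 2: sample standard deviation of the per-game score */
    double x = W * (1 - m) * (1 - m) + D * (0.5 - m) * (0.5 - m) + L * (0 - m) * (0 - m);
    double s = sqrt(x / (n - 1));

    /* step 3: 95% error margin of the mean score */
    double A = 1.96 * s / sqrt(n);

    /* steps 4-5: translate the score interval into an Elo interval */
    printf("score %.1f%% +/- %.1f%%\n", 100 * m, 100 * A);
    printf("Elo %+.0f  (95%% interval %+.0f .. %+.0f)\n",
           score_to_elo(m), score_to_elo(m - A), score_to_elo(m + A));
    return 0;
}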

By the way, does your program's rating output differ from that of Elostat and Bayeselo?

For example, in computer chess, ratings are calculated from a version's winning percentage over the total games it has played. In human chess this is not the case: players are given a provisional rating, the rating is then updated based on further games, and the opponents' ratings are taken into account. I'm wondering whether the updated rating is equivalent to simply summing up the total games played by that player and computing the rating from that.
It is supposed to be, if you play an infinite number of games. The way it is calculated for humans is a gross approximation that has worked really well for practical purposes. Computationally, humans approach their rating by a "steepest descent-like" algorithm.

Miguel
User avatar
michiguel
Posts: 6401
Joined: Thu Mar 09, 2006 8:30 pm
Location: Chicago, Illinois, USA

Re: A question on testing methodology

Post by michiguel »

Edsel Apostol wrote:
michiguel wrote:
Edsel Apostol wrote:
michiguel wrote:
Edsel Apostol wrote:
Adam Hair wrote:It appears that with Bayeselo it does not matter if you run a complete
round robin or if you just run a gauntlet. I took the games of TL20090922
and TL20080620 that I posted recently and tried different scenarios.
Note: TL20090922 and TL20080620 did not play each other

There is just a slight difference in the first two sets of examples, and none
in the last set.

It seems that, in general, gauntlets will give you the same information as
round robin tournaments. It does seem that if your engine performs poorly
against one opponent that is very weak against the other engines then
there would be some difference between gauntlet and round robin. But,
how likely is that?
Thanks for the data you've posted, Adam. It answered most of my questions. It seems that the formula/algorithm for computing the Elo is quite simple: it is based only on average winning percentages and does not take the rating performance of the opponents into account.
No, you are getting it all wrong!!

Miguel
Please elaborate. Maybe you're confusing the Elo computation for humans with the one for engines. I am talking about the computation for engines here. That is just my observation, by the way.
No, you observed the result, but you did not observe how it is calculated. You see that the results are compatible with your idea that the calculation is simple, and then conclude that it must be. No, it is not that simple. The exact methodology may vary between programs, but the concept is the same. I doubt that programs just take an average of the opponents and look up the Elo from the winning percentage. Mine doesn't :-)
For each game you have to balance the chances to win or lose based on a Gaussian curve (BTW, mine does not use a Gaussian curve) determined by the rating difference with your opponent. The problem is that you do not know those ratings either, so you have to find a solution in which the probabilities calculated this way are consistent with the probabilities observed (games won and lost).

Miguel
The reason I said the calculation is simple is that it is based only on the win/draw/loss information from the current set of games.

Here's the relevant algorithm I've found on the WB forum:

Code: Select all

1) Use number of wins, loss, and draws
   W = number of wins, L = number of lost, D = number of draws
   n = number of games (W + L + D)
   m = mean value

2) Apply the following formulas to compute s 
  ( SQRT: square root of. )

   x = W*(1-m)*(1-m) + D*(0.5-m)*(0.5-m) + L*(0-m)*(0-m)   
   s = SQRT( x/(n-1) )

3) Compute error margin A (use 1.96  for 95% confidence)
   
   A = 1.96 * s / SQRT(n)

4) State with 95% confidence:
   The 'real' result should be somewhere in between m-A to m+A

5) Lookup the ELO figures with the win% from m-A and m+A to get the lower and higher values in the error margin.
http://www.open-aurec.com/wbforum/viewt ... ?f=4&t=949

By the way, does your program's rating output differ from that of Elostat and Bayeselo?

For example, in computer chess, ratings are calculated from a version's winning percentage over the total games it has played. In human chess this is not the case: players are given a provisional rating, the rating is then updated based on further games, and the opponents' ratings are taken into account. I'm wondering whether the updated rating is equivalent to simply summing up the total games played by that player and computing the rating from that.
It is supposed to be, if you play an infinite number of games. The way it is calculated for humans is a gross approximation that has worked really well for practical purposes. Computationally, humans approach their rating by a "steepest descent-like" algorithm.

Miguel
That is how SSDF did it, according to the post. Yes, that is very simplistic, and it will give you problems regardless of whether you include the extra round-robin games or use the gauntlet alone. The reason it is simplistic is that ratings should be calculated game by game, not against the average of your opponents. Would you rather play two opponents that are 400 points higher, or one opponent that is 800 points higher and one with your own rating? That is a fatal flaw in how human ratings are calculated too. That is why strong players feel that if they play one weak player in a pool of otherwise even players they end up losing points. They are right, and that is why they end up avoiding tournaments where they would face weak players.
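A quick back-of-the-envelope check of that two-opponents question, using nothing but the standard logistic expectation:

Code: Select all

/* Two opponents rated +400 and the pair {+800, +0} have the same average
   opponent (+400), but the expected scores are very different. */
#include <stdio.h>
#include <math.h>

static double expected(double diff)   /* expected score vs an opponent `diff` Elo above you */
{
    return 1.0 / (1.0 + pow(10.0, diff / 400.0));
}

int main(void)
{
    double two_at_400  = expected(400) + expected(400);   /* ~0.18 points out of 2 */
    double split_800_0 = expected(800) + expected(0);     /* ~0.51 points out of 2 */

    printf("two opponents at +400:   expected %.2f / 2 (%.1f%%)\n", two_at_400,  50 * two_at_400);
    printf("one at +800, one at +0:  expected %.2f / 2 (%.1f%%)\n", split_800_0, 50 * split_800_0);
    return 0;
}

So a method that only looks at the average opponent predicts the same score (about 9%) in both cases, while game by game the second case is worth roughly 25%.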

But I digress; I doubt Bayeselo does anything like that, and I don't with my program. It is possible that EloStat does something like that, because I have heard complaints about it. But that is what we are discussing here.

Several years ago I tried my system with Leo on WBEC results, and Leo told me that he found significant differences from EloStat. That is all I can say. He wanted to use it, but my program was not very "friendly" for computer chess yet (I was running NCAA volleyball ratings during the 90's) and I did not find the time to adjust it. I quit CC for a while soon after that. Now I have started to use it for testing and I adapted it to read PGN files. Maybe I should release it if anybody is interested.

Miguel
Sven
Posts: 4052
Joined: Thu May 15, 2008 9:57 pm
Location: Berlin, Germany
Full name: Sven Schüle

Re: A question on testing methodology

Post by Sven »

bob wrote:
Sven Schüle wrote:
michiguel wrote:
Hart wrote:I would think a one-time RR for opponents 1-5 should be enough to establish their ratings and give you better results. This just came up in another thread, and while I am not sure what the expert opinion is, it makes sense to know their relative ratings beforehand in order to gauge improvements in your program more accurately. In other words, the more the players are connected, the better your results.
You are right if you are interested in knowing the rating of the engine, but IMO not if you want to know how much progress an engine made compared to the previous version.
Unless the calculation of the rating program is wrongly affected by this, the influence of games between third parties should be minimal or close to zero. After all, what is important in this case is the difference between the performance of Engine_A and the performance of Engine_A* (modified) against the same gauntlet (i.e. not the error of each Elo, but the error of the difference between the engines).

Miguel
This topic was indeed discussed already in the past (sorry for not providing the link here) but for me there was no satisfying conclusion. My point was, and still is, that an additional RR between the opponents should improve the error bars also for the ratings of Engine_A and Engine_A*. To prove this would require the following comparison:

Method 1:
- play gauntlet of A against opponents
- play gauntlet of A* against opponents
- make rating list of all these games and look at ratings and error bars for A and A*

Method 2:
- same as method 1 but also play RR between opponents and include these games as well

The assumption that the ratings of A and A* are not affected by the choice of method 1 or 2 may hold, but it is possible that method 2 improves the error bars and therefore *may* help to reduce the number of games required to reach the defined maximum error bars. My idea behind this is that playing against "more stable" opponents should also result in a "more stable" rating.

I don't recall whether this has really been tested by someone.

Sven
This was done by me the year before last, and the results were reported here. It does improve the accuracy of the ratings of the gauntlet members, since otherwise they are calculated only from games against your two versions, which in turn are calculated by playing your two versions against everybody. But there was no significant difference in the ratings of A and A' when doing that. The good thing is that if you insist on doing this, you can play the gauntlet round robin just once and save the PGN, since those programs are not changing. You then play A and A' vs the gauntlet, add in the gauntlet RR PGN, and run it through BayesElo.
You, like most other people posting in this thread, are still missing my key point. You are right in stating that the rating difference between A and A' (sometimes also called A* in this thread) remains nearly unchanged when adding RR games between the gauntlet opponents. But please have a look at Adam's data posted above:
Adam Hair wrote:Here is a subset of the games:

Code: Select all

Rank Name                       Elo    +    - games score oppo. draws 
   1 Bright-0.4a3(2CPU)          86   55   55   110   65%   -17   27% 
   2 Fruit23-EM64T               18   53   53   110   55%   -17   34% 
   3 TwistedLogic20090922_x64     3   38   38   216   49%     8   34% 
   4 Delfi 5.4 (2CPU)           -22   54   54   108   50%   -17   31% 
   5 TwistedLogic20080620       -36   38   38   221   44%     9   33% 
   6 Spike1.2 Turin             -48   52   52   109   45%   -17   42% 
Here are the same games plus games between the other engines:

Code: Select all

Rank Name                       Elo    +    - games score oppo. draws 
   1 Bright-0.4a3(2CPU)         104   29   29   275   67%   -21   28% 
   2 Fruit23-EM64T               11   28   28   275   52%    -2   35% 
   3 TwistedLogic20090922_x64     3   32   32   216   49%     9   34% 
   4 Delfi 5.4 (2CPU)           -34   28   28   270   44%     7   33% 
   5 TwistedLogic20080620       -37   32   32   221   44%     9   33% 
   6 Spike1.2 Turin             -47   28   28   273   41%    10   40% 
Without RR games, both TL versions have ratings with error bars of +/- 38. Adding the RR games changes the error bars to +/- 32. Do you notice that difference? And what does it mean for the number of gauntlet games you need to play until your error bars reach your predefined goal, e.g. +/- 4 or 5 as in your own Crafty testing? Will you need the same number of games, more games, or fewer games? (A note for all readers: please do not count the RR games here, since they are played only once and can be kept "forever"; there is no need to replay them when adding gauntlets for versions A'', A''', A'''' ...)
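Just to put rough numbers on the how-many-games question, assuming (simplistically) that the error bars shrink with the square root of the number of gauntlet games, and ignoring that part of the +/- 32 already comes from the fixed RR games:

Code: Select all

#include <stdio.h>

/* n_target ~ n_now * (margin_now / margin_target)^2  under a 1/sqrt(n) scaling */
static double games_needed(double games_now, double margin_now, double margin_target)
{
    double f = margin_now / margin_target;
    return games_now * f * f;
}

int main(void)
{
    double target = 5.0;   /* e.g. the +/- 4..5 goal mentioned above */

    printf("starting from 216 games at +/-38: ~%.0f games for +/-%.0f\n",
           games_needed(216, 38, target), target);
    printf("starting from 216 games at +/-32: ~%.0f games for +/-%.0f\n",
           games_needed(216, 32, target), target);
    return 0;
}

Under that crude assumption the +/- 32 starting point needs roughly 30% fewer gauntlet games than the +/- 38 one to reach the same target.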

Btw I found the old threads, they are from August 2008. Here are some example links (I searched for "robin" in my own postings):

http://www.talkchess.com/forum/viewtopic.php?p=205814#205814
http://www.talkchess.com/forum/viewtopic.php?p=205596#205596
http://www.talkchess.com/forum/viewtopic.php?p=206874#206874
http://www.talkchess.com/forum/viewtopic.php?p=207971#207971

One of the open issues from those days still seems to be open: will the use of the BayesElo command "covariance" further increase the advantage of adding RR games?
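For what it is worth, the reason covariance matters here is the usual propagation formula: the error of the difference is sqrt(varA + varA* - 2*cov), so a positive covariance between the two ratings makes the difference better determined than the individual error bars suggest. A tiny illustration with invented numbers (this is not BayesElo output):

Code: Select all

#include <stdio.h>
#include <math.h>

int main(void)
{
    double sA = 30.0, sB = 30.0;   /* individual error bars of A and A*      */
    double rho = 0.8;              /* assumed correlation between the two    */
    double cov = rho * sA * sB;

    double ignoring_cov = sqrt(sA * sA + sB * sB);            /* naive combination */
    double with_cov     = sqrt(sA * sA + sB * sB - 2 * cov);  /* with covariance   */

    printf("error of (A - A*), covariance ignored: +/- %.1f\n", ignoring_cov);
    printf("error of (A - A*), covariance used:    +/- %.1f\n", with_cov);
    return 0;
}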

Sven
User avatar
michiguel
Posts: 6401
Joined: Thu Mar 09, 2006 8:30 pm
Location: Chicago, Illinois, USA

Re: A question on testing methodology

Post by michiguel »

Sven Schüle wrote:
bob wrote:
Sven Schüle wrote:
michiguel wrote:
Hart wrote:I would think a one-time RR for opponents 1-5 should be enough to establish their ratings and give you better results. This just came up in another thread, and while I am not sure what the expert opinion is, it makes sense to know their relative ratings beforehand in order to gauge improvements in your program more accurately. In other words, the more the players are connected, the better your results.
You are right if you are interested in knowing the rating of the engine, but IMO not if you want to know how much progress an engine made compared to the previous version.
Unless the calculation of the rating program is wrongly affected by this, the influence of games between third parties should be minimal or close to zero. After all, what is important in this case is the difference between the performance of Engine_A and the performance of Engine_A* (modified) against the same gauntlet (i.e. not the error of each Elo, but the error of the difference between the engines).

Miguel
This topic was indeed discussed already in the past (sorry for not providing the link here) but for me there was no satisfying conclusion. My point was, and still is, that an additional RR between the opponents should improve the error bars also for the ratings of Engine_A and Engine_A*. To prove this would require the following comparison:

Method 1:
- play gauntlet of A against opponents
- play gauntlet of A* against opponents
- make rating list of all these games and look at ratings and error bars for A and A*

Method 2:
- same as method 1 but also play RR between opponents and include these games as well

The assumption that the ratings of A and A* are not affected by the choice of method 1 or 2 may hold, but it is possible that method 2 improves the error bars and therefore *may* help to reduce the number of games required to reach the defined maximum error bars. My idea behind this is that playing against "more stable" opponents should also result in a "more stable" rating.

I don't recall whether this has really been tested by someone.

Sven
This was done by me the year before last, and the results were reported here. It does improve the accuracy of the ratings of the gauntlet members, since otherwise they are calculated only from games against your two versions, which in turn are calculated by playing your two versions against everybody. But there was no significant difference in the ratings of A and A' when doing that. The good thing is that if you insist on doing this, you can play the gauntlet round robin just once and save the PGN, since those programs are not changing. You then play A and A' vs the gauntlet, add in the gauntlet RR PGN, and run it through BayesElo.
You, like most other people posting in this thread, are still missing my key point. You are right in stating that the rating difference between A and A' (sometimes also called A* in this thread) remains nearly unchanged when adding RR games between the gauntlet opponents. But please have a look at Adam's data posted above:
Adam Hair wrote:Here is a subset of the games:

Code: Select all

Rank Name                       Elo    +    - games score oppo. draws 
   1 Bright-0.4a3(2CPU)          86   55   55   110   65%   -17   27% 
   2 Fruit23-EM64T               18   53   53   110   55%   -17   34% 
   3 TwistedLogic20090922_x64     3   38   38   216   49%     8   34% 
   4 Delfi 5.4 (2CPU)           -22   54   54   108   50%   -17   31% 
   5 TwistedLogic20080620       -36   38   38   221   44%     9   33% 
   6 Spike1.2 Turin             -48   52   52   109   45%   -17   42% 
Here are the same games plus games between the other engines:

Code: Select all

Rank Name                       Elo    +    - games score oppo. draws 
   1 Bright-0.4a3(2CPU)         104   29   29   275   67%   -21   28% 
   2 Fruit23-EM64T               11   28   28   275   52%    -2   35% 
   3 TwistedLogic20090922_x64     3   32   32   216   49%     9   34% 
   4 Delfi 5.4 (2CPU)           -34   28   28   270   44%     7   33% 
   5 TwistedLogic20080620       -37   32   32   221   44%     9   33% 
   6 Spike1.2 Turin             -47   28   28   273   41%    10   40% 
Without RR games, both TL versions have ratings with error bars of +/- 38. Adding the RR games changes the error bars to +/- 32. Do you notice that difference? And what does it mean for the number of gauntlet games you need to play until your error bars reach your predefined goal, e.g. +/- 4 or 5 as in your own Crafty testing? Will you need the same number of games, more games, or fewer games? (A note for all readers: please do not count the RR games here, since they are played only once and can be kept "forever"; there is no need to replay them when adding gauntlets for versions A'', A''', A'''' ...)

Btw I found the old threads, they are from August 2008. Here are some example links (I searched for "robin" in my own postings):

http://www.talkchess.com/forum/viewtopic.php?p=205814#205814
http://www.talkchess.com/forum/viewtopic.php?p=205596#205596
http://www.talkchess.com/forum/viewtopic.php?p=206874#206874
http://www.talkchess.com/forum/viewtopic.php?p=207971#207971

One of the open issues from those days still seems to be open: will the use of the BayesElo command "covariance" further increase the advantage of adding RR games?

Sven
I get your point, but there is one thing that is never discussed and it relates directly to your good observation:

The +/- in this context is meaningless. You do not want to know the error bar of (A) in this pool, or the error bar of (A*) in the pool. You want to know the error bar of the number (A - A*). This can be much smaller than a comparison of the two separate error bars would suggest. This is something most people overlook. DeltaAA* may vary much less than A or A*.

The error changes when you include the RR because what you are seeing is the error of each engine within the pool. If you include an infinite number of RR games, the error may converge to reflect the real DeltaAA*. But the accuracy of the test never changed; you just never knew the real error! You just decreased the error of the error :-)

Miguel
PS: It may be better to run simulations to find out what kind of error you have than to run the RR.
Sven
Posts: 4052
Joined: Thu May 15, 2008 9:57 pm
Location: Berlin, Germany
Full name: Sven Schüle

Re: A question on testing methodology

Post by Sven »

michiguel wrote:
Sven Schüle wrote:
bob wrote:
Sven Schüle wrote:[...] My point was, and still is, that an additional RR between the opponents should improve the error bars also for the ratings of Engine_A and Engine_A*. [...]
This was done by me the year before last, and the results were reported here. It does improve the accuracy of the ratings of the gauntlet members, since otherwise they are calculated only from games against your two versions, which in turn are calculated by playing your two versions against everybody. But there was no significant difference in the ratings of A and A' when doing that. The good thing is that if you insist on doing this, you can play the gauntlet round robin just once and save the PGN, since those programs are not changing. You then play A and A' vs the gauntlet, add in the gauntlet RR PGN, and run it through BayesElo.
[...] You are right in stating that the rating difference between A and A' (sometimes also called A* in this thread) remains nearly unchanged when adding RR games between the gauntlet opponents. But please have a look at Adam's data posted above:
Adam Hair wrote:Here is a subset of the games:

Code: Select all

Rank Name                       Elo    +    - games score oppo. draws 
   1 Bright-0.4a3(2CPU)          86   55   55   110   65%   -17   27% 
   2 Fruit23-EM64T               18   53   53   110   55%   -17   34% 
   3 TwistedLogic20090922_x64     3   38   38   216   49%     8   34% 
   4 Delfi 5.4 (2CPU)           -22   54   54   108   50%   -17   31% 
   5 TwistedLogic20080620       -36   38   38   221   44%     9   33% 
   6 Spike1.2 Turin             -48   52   52   109   45%   -17   42% 
Here are the same games plus games between the other engines:

Code: Select all

Rank Name                       Elo    +    - games score oppo. draws 
   1 Bright-0.4a3(2CPU)         104   29   29   275   67%   -21   28% 
   2 Fruit23-EM64T               11   28   28   275   52%    -2   35% 
   3 TwistedLogic20090922_x64     3   32   32   216   49%     9   34% 
   4 Delfi 5.4 (2CPU)           -34   28   28   270   44%     7   33% 
   5 TwistedLogic20080620       -37   32   32   221   44%     9   33% 
   6 Spike1.2 Turin             -47   28   28   273   41%    10   40% 
Without RR games, both TL versions have ratings with error bars of +/- 38. Adding the RR games changes the error bars to +/- 32. [...]
I get your point, but there is one thing that is never discussed and it relates directly to your good observation:

The +/- in this context is meaningless. You do not want to know the error bar of (A) in this pool, or the error bar of (A*) in the pool. You want to know the error bar of the number (A - A*). This can be much smaller than a comparison of the two separate error bars would suggest. This is something most people overlook. DeltaAA* may vary much less than A or A*.
The +/- is not meaningless. In conjunction with the rating delta (A - A*) you can already guess how likely the statement "A* is an improvement over A" is, depending on the degree of overlap of the two rating intervals. But as recently discussed in another thread, it is much better to calculate the LOS instead for this purpose. So Adam should do this for both methods (without and with RR games); then we'll see whether the RR games have an influence on the LOS value. I am still confident that with RR games included you will need fewer games to reach the same quality of measurement (expressed by error bars or by LOS) than without.
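For reference, one common way to get a LOS number directly from the win/loss counts, treating draws as uninformative (an approximation, and not necessarily what BayesElo reports):

Code: Select all

#include <stdio.h>
#include <math.h>

/* probability that A* is stronger than A, given the observed wins and losses */
static double los(double wins, double losses)
{
    return 0.5 * (1.0 + erf((wins - losses) / sqrt(2.0 * (wins + losses))));
}

int main(void)
{
    double wins = 120, losses = 100;   /* hypothetical A* vs A result, draws left out */
    printf("LOS = %.1f%%\n", 100.0 * los(wins, losses));
    return 0;
}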
michiguel wrote:The error changes when you include the RR because what you are seeing is the error of each engine within the pool. If you include an infinite number of RR games, the error may converge to reflect the real DeltaAA*. But the accuracy of the test never changed; you just never knew the real error! You just decreased the error of the error :-)
I can't follow that. If I decrease the error for A and A* then I also decrease the error of DeltaAA*, as can be seen from the following small example:

Code: Select all

Name Elo    +    -
A      0   20   20
A*    30   20   20

The rating of A  lies within [-20 .. +20].
The rating of A* lies within [+10 .. +50].
Both intervals overlap, so it is not clear whether A* is better than A.


Name Elo    +    -
A      0   10   10
A*    30   10   10

The rating of A  lies within [-10 .. +10].
The rating of A* lies within [+20 .. +40].
A* is clearly better than A.
Sven
User avatar
michiguel
Posts: 6401
Joined: Thu Mar 09, 2006 8:30 pm
Location: Chicago, Illinois, USA

Re: A question on testing methodology

Post by michiguel »

Sven Schüle wrote:
michiguel wrote:
Sven Schüle wrote:
bob wrote:
Sven Schüle wrote:[...] My point was, and still is, that an additional RR between the opponents should improve the error bars also for the ratings of Engine_A and Engine_A*. [...]
This was done by me the year before last, and the results were reported here. It does improve the accuracy of the ratings of the gauntlet members, since otherwise they are calculated only from games against your two versions, which in turn are calculated by playing your two versions against everybody. But there was no significant difference in the ratings of A and A' when doing that. The good thing is that if you insist on doing this, you can play the gauntlet round robin just once and save the PGN, since those programs are not changing. You then play A and A' vs the gauntlet, add in the gauntlet RR PGN, and run it through BayesElo.
[...] You are right in stating that the rating difference between A and A' (sometimes also called A* in this thread) remains nearly unchanged when adding RR games between the gauntlet opponents. But please have a look at Adam's data posted above:
Adam Hair wrote:Here is a subset of the games:

Code: Select all

Rank Name                       Elo    +    - games score oppo. draws 
   1 Bright-0.4a3(2CPU)          86   55   55   110   65%   -17   27% 
   2 Fruit23-EM64T               18   53   53   110   55%   -17   34% 
   3 TwistedLogic20090922_x64     3   38   38   216   49%     8   34% 
   4 Delfi 5.4 (2CPU)           -22   54   54   108   50%   -17   31% 
   5 TwistedLogic20080620       -36   38   38   221   44%     9   33% 
   6 Spike1.2 Turin             -48   52   52   109   45%   -17   42% 
Here are the same games plus games between the other engines:

Code: Select all

Rank Name                       Elo    +    - games score oppo. draws 
   1 Bright-0.4a3(2CPU)         104   29   29   275   67%   -21   28% 
   2 Fruit23-EM64T               11   28   28   275   52%    -2   35% 
   3 TwistedLogic20090922_x64     3   32   32   216   49%     9   34% 
   4 Delfi 5.4 (2CPU)           -34   28   28   270   44%     7   33% 
   5 TwistedLogic20080620       -37   32   32   221   44%     9   33% 
   6 Spike1.2 Turin             -47   28   28   273   41%    10   40% 
Without RR games, both TL versions have ratings with error bars of +/- 38. Adding the RR games changes the error bars to +/- 32. [...]
I get your point, but there is one thing that is never discussed and it relates directly to your good observation:

The +/- in this context is meaningless. You do not want to know the error bar of (A) in this pool, or the error bar of (A*) in the pool. You want to know the error bar of the number (A - A*). This can be much smaller than a comparison of the two separate error bars would suggest. This is something most people overlook. DeltaAA* may vary much less than A or A*.
The +/- is not meaningless. In conjunction with the rating delta (A - A*) you can already guess how likely the statement "A* is an improvement over A" is, depending on the degree of overlap of the two rating intervals.
That is exactly what I am saying you cannot do, yet everybody does it. Well, you can, but the uncertainty could be grossly overestimated. Comparing the overlaps is not strictly correct, particularly with a small set of opponents (and even more so if they did not play each other). You can easily have a situation in which A = 2400 +/- 30, A* = 2450 +/- 30 and DeltaAA* = 50 +/- 20.

Both A and A* could have a poorly determined Elo relative to the pool, yet a well determined difference between them.

Play 10000 games between A and B. Then make them play a gauntlet against 10 opponents, only 10 games each. Put everything together and look at the error bars. I predict you will get a case like the one I mentioned above.
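That prediction is easy to probe with a small simulation. The sketch below is only illustrative: a toy fitter stands in for BayesElo, there are no draws, "only 10 games" is read as 10 games per pairing, and the "true" ratings are invented. It replays the whole experiment many times and reports how much the fitted ratings of A and B scatter, compared to how much their fitted difference scatters.

Code: Select all

#include <stdio.h>
#include <stdlib.h>
#include <math.h>

#define N      12        /* A, B and 10 gauntlet opponents       */
#define H2H    10000     /* head-to-head games between A and B   */
#define G      10        /* games against each gauntlet opponent */
#define REPS   300       /* replications of the whole experiment */

static double elo_true[N] = { 0, 20, -80, -60, -40, -20, 0, 20, 40, 60, 80, 100 };

static double expected(double ri, double rj)
{
    return 1.0 / (1.0 + pow(10.0, (rj - ri) / 400.0));
}

static double play(int i, int j, int games)   /* points scored by i against j, no draws */
{
    double pts = 0;
    for (int g = 0; g < games; g++)
        if ((double)rand() / RAND_MAX < expected(elo_true[i], elo_true[j]))
            pts += 1.0;
    return pts;
}

/* toy fit: conservative Newton steps until expected and observed scores agree,
   with the pool average pinned to zero */
static void fit(double pts[N][N], double games[N][N], double R[N])
{
    const double slope = 0.25 * log(10.0) / 400.0;   /* max d(expected score)/d(Elo) per game */
    for (int i = 0; i < N; i++) R[i] = 0;
    for (int it = 0; it < 500; it++) {
        for (int i = 0; i < N; i++) {
            double s = 0, e = 0, n = 0;
            for (int j = 0; j < N; j++) {
                s += pts[i][j];
                e += games[i][j] * expected(R[i], R[j]);
                n += games[i][j];
            }
            if (n > 0) R[i] += (s - e) / (n * slope);
        }
        double avg = 0;
        for (int i = 0; i < N; i++) avg += R[i] / N;
        for (int i = 0; i < N; i++) R[i] -= avg;
    }
}

int main(void)
{
    double a = 0, a2 = 0, b = 0, b2 = 0, d = 0, d2 = 0;

    for (int rep = 0; rep < REPS; rep++) {
        double pts[N][N] = {{0}}, games[N][N] = {{0}};

        pts[0][1] = play(0, 1, H2H);               /* many A vs B games */
        pts[1][0] = H2H - pts[0][1];
        games[0][1] = games[1][0] = H2H;

        for (int e = 0; e < 2; e++)                /* short gauntlets for A and B */
            for (int o = 2; o < N; o++) {
                pts[e][o] = play(e, o, G);
                pts[o][e] = G - pts[e][o];
                games[e][o] = games[o][e] = G;
            }

        double R[N];
        fit(pts, games, R);
        a += R[0]; a2 += R[0] * R[0];
        b += R[1]; b2 += R[1] * R[1];
        double diff = R[1] - R[0];
        d += diff; d2 += diff * diff;
    }
    printf("spread of A's rating:      %.1f\n", sqrt(a2 / REPS - (a / REPS) * (a / REPS)));
    printf("spread of B's rating:      %.1f\n", sqrt(b2 / REPS - (b / REPS) * (b / REPS)));
    printf("spread of the difference:  %.1f\n", sqrt(d2 / REPS - (d / REPS) * (d / REPS)));
    return 0;
}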


But as recently discussed in another thread, it is much better to calculate the LOS instead for this purpose.
Exactly, if LOS is what I think it is.
So Adam should do this for both methods (without and with RR games); then we'll see whether the RR games have an influence on the LOS value. I am still confident that with RR games included you will need fewer games to reach the same quality of measurement (expressed by error bars or by LOS) than without.
If the LOS can be calculated correctly, it should be very similar in both cases.

michiguel wrote:The error changes when you include the RR because what you are seeing is the error of each engine within the pool. If you include an infinite number of RR games, the error may converge to reflect the real DeltaAA*. But the accuracy of the test never changed; you just never knew the real error! You just decreased the error of the error :-)
I can't follow that. If I decrease the error for A and A* then I also decrease the error of DeltaAA*, as can be seen from the following small example:

Code: Select all

Name Elo    +    -
A      0   20   20
A*    30   20   20

The rating of A  lies within [-20 .. +20].
The rating of A* lies within [+10 .. +50].
Both intervals overlap, so it is not clear whether A* is better than A.


Name Elo    +    -
A      0   10   10
A*    30   10   10

The rating of A  lies within [-10 .. +10].
The rating of A* lies within [+20 .. +40].
A* is clearly better than A.
That is not necessarily correct.

Let's assume that you have three points A, B and C on the same line.
A is 1 meter from B, and both of them are about 1 km from C. You measure the three distances A-B, A-C and B-C: the first with a ruler (good to about a millimeter), the other two with some sort of GPS system (good to about a meter). Then you calculate the distance of each of those points to the center of mass. All of them will be roughly 0.5 km +/- 1 meter. Yet you cannot say that the error of A - B is +/- 2 meters; the error of that particular measurement is still about 1 mm.
This may not be a good analogy, but I am trying to illustrate what I mean.
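The numbers behind the analogy, in case it helps (same invented precisions as above: about 1 m for the GPS-derived positions, about 1 mm for the ruler):

Code: Select all

#include <stdio.h>
#include <math.h>

int main(void)
{
    double gps_sigma   = 1.0;      /* m: error of each position relative to the center of mass */
    double ruler_sigma = 0.001;    /* m: error of the direct A-B measurement                    */

    /* error you would quote for A-B if you subtracted the two noisy positions */
    double subtracted = sqrt(gps_sigma * gps_sigma + gps_sigma * gps_sigma);

    printf("A-B from subtracting the positions: +/- %.3f m\n", subtracted);
    printf("A-B measured directly:              +/- %.3f m\n", ruler_sigma);
    return 0;
}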

Miguel
Hart

Re: A question on testing methodology

Post by Hart »

michiguel wrote:
Hart wrote:
michiguel wrote:
Hart wrote:http://talkchess.com/forum/viewtopic.php?t=30676

This is what I am referring to:

Code: Select all

1 Twisted Logic 20090922    2839   23   22  1432   90%  2437    3%
2 Hermann 2.5               2647   17   16  1428   74%  2437    6% 

Code: Select all

1 Twisted Logic 20090922    2770   20   20  1432   85%  2428    3%
2 Hermann 2.5               2666   17   17  1428   77%  2428    6% 
That was a gauntlet run. The difference between these two engines is 192 in the first case and 104 in the second, a difference of 88 Elo between the two sets. Even if both gauntlet matches were included in the same BayesElo analysis, I can't believe it would do more than halve that difference, in which case it would still be well outside the 95% confidence intervals. Should a 5 Elo change in your program really cause two of your opponents to be rated 88 Elo further apart?
Yes, it can! Many changes in one engine have a great impact on how it performs against one specific opponent. So, in your case, TL and Hermann based their ratings only on how they performed against the test engine. That is why they can fluctuate like crazy. That is also why we have to test against a variety of opponents.

Look at Hermann and TL: the change was positive for one and negative for the other, and the calculation reflects that.

Miguel
Obviously, it can, as the results show. The question is: should it happen? What does it say about your performance being 9 Elo lower against opponents whose ratings are separated by as much as 88 Elo between the two sets? I do not understand how your results would be anything but better if your opponents' relative ratings were fixed beforehand, which is obviously not the case in gauntlets.
The problem is that there is one assumption in ratings that is false: if I get "better" against A, then I get "better" against B. That is true most of the time, but not always. Sometimes one change makes you better against A but worse against B. If A and B are tested only against you, their ratings will fluctuate a lot. That makes A's and B's ratings inaccurate; I do not argue with that. If you include games to make A's and B's ratings more accurate, you get a better picture of A relative to B, but it does not affect yours.

Miguel
My bold.

How could this be so? Wouldn't smaller errors in your opponents' ratings necessarily lead to smaller errors in yours? If not, why? Shouldn't scoring 50% against an engine with error bars of +/- 30 tell you less about your performance than if your opponent has error bars of +/- 15?
User avatar
michiguel
Posts: 6401
Joined: Thu Mar 09, 2006 8:30 pm
Location: Chicago, Illinois, USA

Re: A question on testing methodology

Post by michiguel »

Hart wrote:
michiguel wrote:
Hart wrote:
michiguel wrote:
Hart wrote:http://talkchess.com/forum/viewtopic.php?t=30676

This is what I am referring to:

Code: Select all

1 Twisted Logic 20090922    2839   23   22  1432   90%  2437    3%
2 Hermann 2.5               2647   17   16  1428   74%  2437    6% 

Code: Select all

1 Twisted Logic 20090922    2770   20   20  1432   85%  2428    3%
2 Hermann 2.5               2666   17   17  1428   77%  2428    6% 
That was a gauntlet run. The difference between these two engines is 192 in the first case and 104 in the second, a difference of 88 Elo between the two sets. Even if both gauntlet matches were included in the same BayesElo analysis, I can't believe it would do more than halve that difference, in which case it would still be well outside the 95% confidence intervals. Should a 5 Elo change in your program really cause two of your opponents to be rated 88 Elo further apart?
Yes, it can! Many changes in one engine have a great impact on how it performs against one specific opponent. So, in your case, TL and Hermann based their ratings only on how they performed against the test engine. That is why they can fluctuate like crazy. That is also why we have to test against a variety of opponents.

Look at Hermann and TL: the change was positive for one and negative for the other, and the calculation reflects that.

Miguel
Obviously, it can, as the results show. The question is: should it happen? What does it say about your performance being 9 Elo lower against opponents whose ratings are separated by as much as 88 Elo between the two sets? I do not understand how your results would be anything but better if your opponents' relative ratings were fixed beforehand, which is obviously not the case in gauntlets.
The problem is that there is one assumption in ratings that is false: if I get "better" against A, then I get "better" against B. That is true most of the time, but not always. Sometimes one change makes you better against A but worse against B. If A and B are tested only against you, their ratings will fluctuate a lot. That makes A's and B's ratings inaccurate; I do not argue with that. If you include games to make A's and B's ratings more accurate, you get a better picture of A relative to B, but it does not affect yours.

Miguel
My bold.

How could this be so? Wouldn't smaller errors in your opponents' ratings necessarily lead to smaller errors in yours?
I do not think this is necessarily true.
If not, why?
Because the errors are relative to the average of the pool, which is not important for our purpose.

Shouldn't scoring 50% against an engine with error bars of +/- 30 tell you less about your performance than if your opponent has error bars of +/- 15?
For instance, A beats C 60-40. B vs C is 50-50. A is better than B, and C is the point of reference. It really should not matter where that point of reference is; the relative strength of A versus B would still be the same.
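In numbers, with the plain logistic conversion (ignoring error bars, and with an arbitrary rating for C):

Code: Select all

#include <stdio.h>
#include <math.h>

static double score_to_elo(double p)   /* Elo difference implied by score fraction p */
{
    return -400.0 * log10(1.0 / p - 1.0);
}

int main(void)
{
    double c_rating = 2500.0;                  /* arbitrary: shift C and only the absolute
                                                  numbers for A and B move with it          */
    double a = c_rating + score_to_elo(0.60);  /* A scores 60% against C */
    double b = c_rating + score_to_elo(0.50);  /* B scores 50% against C */

    printf("A ~ %.0f, B ~ %.0f, A - B ~ %.0f\n", a, b, a - b);
    return 0;
}

Whatever rating C is given, A comes out roughly 70 Elo above B.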

Miguel