The change I made in LittleThought between the two tests (the source of these results) was to improve (?) its knowledge of pinned pieces in the second run. However, it made the engine so much slower that overall it made it worse.
michiguel wrote:Yes it can! Many changes in one engine have a great impact on how they perform against one specific opponent. So, in your case, TL and Hermann based their rating on how they perform against the test engine only. That is why they could fluctuate like crazy. That is also why we have to test against a variety of opponents.
Hart wrote:http://talkchess.com/forum/viewtopic.php?t=30676
This is what I am referring to:
Code: Select all
Rank Name                    Elo    +    -  games  score  oppo.  draws
   1 Twisted Logic 20090922 2839   23   22   1432    90%   2437     3%
   2 Hermann 2.5             2647   17   16   1428    74%   2437     6%
That was a gauntlet run. The difference between these two engines in the first case is 192, and in the second 104, for a difference of 88 Elo between the two sets. Even if both these gauntlet matches were included in the same BayesElo analysis, I can't believe that it would more than halve the difference, in which case the difference is still well outside the 95% confidence intervals. Should a 5 Elo change in your program really cause two of your opponents to be rated 88 Elo further apart?
Code: Select all
Rank Name                    Elo    +    -  games  score  oppo.  draws
   1 Twisted Logic 20090922 2770   20   20   1432    85%   2428     3%
   2 Hermann 2.5             2666   17   17   1428    77%   2428     6%
Look at Hermann and TL, the change made it positive for one and negative for the other one! And the calculation is reflecting that.
Miguel
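For readers following the numbers: under the standard logistic Elo model, which rating tools approximate (BayesElo also models draws and fits all results jointly, so this is only a rough illustration), a score percentage maps directly to an Elo difference. The short sketch below plugs in the score percentages from the two quoted tables.
Code: Select all
import math

def elo_diff_from_score(score):
    """Elo difference implied by a score fraction under the logistic model."""
    return -400.0 * math.log10(1.0 / score - 1.0)

# Score percentages of the two gauntlet runs quoted above (against roughly the same opposition).
for label, tl, hermann in [("first run", 0.90, 0.74), ("second run", 0.85, 0.77)]:
    gap = elo_diff_from_score(tl) - elo_diff_from_score(hermann)
    print(f"{label}: implied TL - Hermann gap is about {gap:.0f} Elo")
# Prints roughly 200 and 91 Elo, in the same ballpark as the reported 192 and 104.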
A question on testing methodology
michiguel
Re: A question on testing methodology
No, you observed the result, but you did not observe how it is calculated. You see that the results are compatible with your idea that the calculation is simple and conclude that they must be. No, it is not that simple. The exact methodology may vary between programs, but the concept is the same. I doubt that programs just take an average of the opponents and look up ELO from winning percentage. Mine doesn't.
Edsel Apostol wrote:Please elaborate. Maybe you're confusing the elo computation for humans and for engines. I am talking about the computation for engines here. That is just my observation, by the way.
michiguel wrote:No, you are getting it all wrong!!
Edsel Apostol wrote:Thanks for the data you've posted, Adam. It answered most of my questions. It seems that the formula/algorithm for solving the elo is just simple and is only based on average winning percentages, and it doesn't take into account the rating performance of the opponents.
Adam Hair wrote:It appears that with Bayeselo it does not matter if you run a complete
round robin or if you just run a gauntlet. I took the games of TL20090922
and TL20080620 that I posted recently and tried different scenarios.
Note: TL20090922 and TL20080620 did not play each other
There is just a slight difference in the first two sets of examples, and none
in the last set.
It seems that, in general, gauntlets will give you the same information as
round robin tournaments. It does seem that if your engine performs poorly
against one opponent that is very weak against the other engines then
there would be some difference between gauntlet and round robin. But,
how likely is that?
Miguel

You have to balance, in each game, the chances to win or lose based on a Gaussian curve (BTW, mine does not use a Gaussian curve) determined by the difference with your opponent. The problem is that you do not know that either, and you have to come up with a solution that reconciles those calculated probabilities with the probabilities observed (games won and lost).
Miguel
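What Miguel describes amounts to a maximum-likelihood fit: choose all ratings at once so that the predicted per-game probabilities best match the observed results, game by game, rather than averaging the opponents. Below is a minimal sketch of that idea (logistic model, no draw model, plain gradient ascent; the function names and the tiny example results are invented for illustration, and real tools such as BayesElo use more refined machinery).
Code: Select all
import math

def expected(r_i, r_j):
    """Predicted score of the player rated r_i against the player rated r_j."""
    return 1.0 / (1.0 + 10.0 ** ((r_j - r_i) / 400.0))

def fit_ratings(games, players, iters=5000, step=2.0):
    """games: list of (i, j, score) with score in {1, 0.5, 0} from i's point of view."""
    ratings = {p: 0.0 for p in players}
    counts = {p: sum(1 for g in games if p in (g[0], g[1])) for p in players}
    for _ in range(iters):
        grad = {p: 0.0 for p in players}
        for i, j, s in games:
            e = expected(ratings[i], ratings[j])
            grad[i] += s - e            # i over-performed its prediction -> push its rating up
            grad[j] -= s - e
        for p in players:
            ratings[p] += step * grad[p] / max(1, counts[p])
        mean = sum(ratings.values()) / len(ratings)
        ratings = {p: r - mean for p, r in ratings.items()}   # anchor the pool average at 0
    return ratings

# A scores 2.5/4 against B, B scores 2.5/4 against C; A and C never meet.
results = [("A", "B", 1), ("A", "B", 1), ("A", "B", 0), ("A", "B", 0.5),
           ("B", "C", 1), ("B", "C", 1), ("B", "C", 0), ("B", "C", 0.5)]
print(fit_ratings(results, ["A", "B", "C"]))   # roughly A +89, B 0, C -89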
It is supposed to be, if you play an infinite number of games. The way it is calculated for humans is a gross approximation that has worked really well for practical purposes. Computationally, humans approach their rating by a "steepest descent-like" algorithm.
For example, in computer chess, ratings are calculated based on a version's winning percentage over the total games it has played. In human chess this is not the case: they are given a probationary rating, then the rating is updated after that based on further games, and the opponents' ratings are taken into account. I'm wondering if the updated rating is equivalent to when one just sums up the total games played by that player and computes the rating from that.
Miguel
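To illustrate Miguel's point about the human-style update being an approximation that gets close with enough games, here is a toy comparison (single fixed-strength opponent, no draws, fixed K-factor, all numbers invented): the FIDE-style incremental update R' = R + K(S - E) ends up in the same neighbourhood as the one-shot performance rating computed from the total score.
Code: Select all
import math, random

def expected(r_a, r_b):
    """Expected score of a player rated r_a against one rated r_b (logistic model)."""
    return 1.0 / (1.0 + 10.0 ** ((r_b - r_a) / 400.0))

random.seed(1)
true_elo, opp_elo, K, n = 2500.0, 2400.0, 10.0, 2000
rating, score = 2400.0, 0.0          # provisional starting rating, running score
for _ in range(n):
    s = 1.0 if random.random() < expected(true_elo, opp_elo) else 0.0
    score += s
    rating += K * (s - expected(rating, opp_elo))        # incremental (human-style) update

perf = opp_elo - 400.0 * math.log10(1.0 / (score / n) - 1.0)  # performance rating over all games
print(f"incremental: {rating:.0f}   performance over all {n} games: {perf:.0f}")
# Both land in the neighbourhood of the true 2500 (with a fixed K the incremental
# rating keeps fluctuating around it instead of converging exactly).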
Edsel Apostol
Re: A question on testing methodology
The reason I said that the calculation is simple is that it is based only on the win, draw, and loss information of the current set of games.
michiguel wrote:No, you observed the result, but you did not observe how it is calculated. [...] I doubt that programs just take an average of the opponents and look up ELO from winning percentage. Mine doesn't. [...]
Here's the relevant algorithm I've found on the WB forum: http://www.open-aurec.com/wbforum/viewt ... ?f=4&t=949
Code: Select all
1) Use number of wins, loss, and draws
W = number of wins, L = number of lost, D = number of draws
n = number of games (W + L + D)
m = mean value
2) Apply the following formulas to compute s
( SQRT: square root of. )
x = W*(1-m)*(1-m) + D*(0.5-m)*(0.5-m) + L*(0-m)*(0-m)
s = SQRT( x/(n-1) )
3) Compute error margin A (use 1.96 for 95% confidence)
A = 1.96 * s / SQRT(n)
4) State with 95% confidence:
The 'real' result should be somewhere in between m-A to m+A
5) Lookup the ELO figures with the win% from m-A and m+A to get the lower and higher values in the error margin.
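For convenience, the recipe above translates almost line by line into code. One detail the description leaves implicit is the mean value m, which is taken here as the average per-game score (W + D/2)/n; the Elo lookup in step 5 uses the standard logistic table, and the example figures are invented.
Code: Select all
import math

def elo_from_score(p):
    """Step 5: Elo difference corresponding to a score fraction p (logistic lookup)."""
    return -400.0 * math.log10(1.0 / p - 1.0)

def score_and_margin(W, L, D, z=1.96):
    """Steps 1-3: mean score m and error margin A at the given confidence level."""
    n = W + L + D
    m = (W + 0.5 * D) / n                                   # mean value (assumed definition)
    x = W * (1 - m) ** 2 + D * (0.5 - m) ** 2 + L * (0 - m) ** 2
    s = math.sqrt(x / (n - 1))
    return m, z * s / math.sqrt(n)

# Hypothetical example: 120 wins, 80 losses, 100 draws.
m, A = score_and_margin(120, 80, 100)
print(f"score {m:.3f} +/- {A:.3f}  ->  Elo between {elo_from_score(m - A):+.0f} and {elo_from_score(m + A):+.0f}")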
By the way, does your program's ratings output differ from Elostat and Bayeselo?
Edsel Apostol
https://github.com/ed-apostol/InvictusChess
michiguel
Re: A question on testing methodology
That is how the SSDF did it, according to the post. Yes, that is very simplistic. Again, it is simplistic, and it will give you problems regardless of whether you include the extra round-robin games or use the gauntlet alone. The reason why it is simplistic is that ratings should be calculated game by game, not over the average of your opponents. Do you prefer to play two opponents that are 400 points higher, or one opponent that is 800 points higher and one that has your same rating? That is a fatal flaw in how human ratings are calculated too. That is the reason why strong players feel that if they play one weak player in a pool of even players they end up losing points. They are right, and that is why they end up avoiding tournaments where they would face weak players.
Edsel Apostol wrote:The reason I said that the calculation is simple is that the calculation is only based on the win, draw, loss information of the current set of games. [...]
But I digress. I doubt Bayeselo does anything like that, and I don't with my program. It is possible that EloStat does something like that, because I have heard complaints about it. But that is what we are discussing here.
Several years ago I tried my system with Leo on WBEC results, and Leo told me that he found significant differences with EloStat. That is all I can say. He wanted to use it, but my program was not very "friendly" for computer chess yet (I was running NCAA volleyball ratings during the 90's) and I did not find the time to adjust it. I quit CC for a while soon after that. Now I have started to use it for testing and have adjusted it to read PGN files. Maybe I should release it if anybody is interested.
Miguel
Sven Schüle
Re: A question on testing methodology
You, as most other people posting in this thread, are still missing my key point. You are right stating that the rating difference between A and A' (sometimes also called A* in this thread) remains nearly unchanged when adding RR games between the gauntlet opponents. But please have a look at Adam's data posted above:
bob wrote:This was done by me year before last and the results reported here. It does improve the accuracy of the ratings for each gauntlet member, since otherwise they are only calculated by playing games against your two versions, which are calculated by playing your two versions against everybody. But there was no significant difference in the ratings of A and A' doing that. The good thing is that if you insist on doing this, you can play the gauntlet round robin just once and save the PGN, since those programs are not changing. And you then play A and A' vs the gauntlet, add in the gauntlet RR pgn, and run it through BayesElo.
Sven Schüle wrote:This topic was indeed discussed already in the past (sorry for not providing the link here), but for me there was no satisfying conclusion. My point was, and still is, that an additional RR between the opponents should improve the error bars also for the ratings of Engine_A and Engine_A*. To prove this would require the following comparison:
michiguel wrote:You are right if you are interested in knowing the rating of the engine, but IMO, not if you want to know how much an engine has progressed compared to the previous version.
Hart wrote:I would think a one-time RR for opponents 1-5 should be enough to establish their ratings and give you better results. This just came up in another thread, and while I am not sure what the expert opinion is, it makes sense that you know their relative ratings beforehand to more accurately gauge improvements in your program. In other words, the more players are connected, the better your results.
Unless the calculation of the rating program is wrongly affected by this, the influence of games between third parties should be minimal or close to zero. After all, what is important in this case is the difference between Engine_A and the performance of Engine_A* (modified) against the same gauntlet (i.e. not the error of each ELO, but the error of the difference between the engines).
Miguel
Method 1:
- play gauntlet of A against opponents
- play gauntlet of A* against opponents
- make rating list of all these games and look at ratings and error bars for A and A*
Method 2:
- same as method 1 but also play RR between opponents and include these games as well
The assumption that the ratings of A and A* are not affected by choice of method 1 or 2 may hold but it is possible that method 2 improves error bars and therefore *may* help to reduce the required number of games to reach the defined maximum error bars. My idea behind this is that playing against "more stable" opponents should also result in a "more stable" rating.
I don't recall whether this has really been tested by someone.
Sven
Without RR games both TL versions have ratings with error bars of +/- 38. Adding RR games changes the error bars to +/- 32. Do you notice that difference? And what does that mean for the number of gauntlet games you need to play until your error bars reach your predefined goal, e.g. +/- 4 or 5 like in your own Crafty testing? Will you need the same number of games, more games, or fewer games? (A note for all readers: please do not count the RR games here, since they will be played only once and can be kept "forever"; there is no need to replay them when adding gauntlets for versions A'', A''', A'''' ...)
Adam Hair wrote:Here is a subset of the games:
Code: Select all
Rank Name                      Elo    +    -  games  score  oppo.  draws
   1 Bright-0.4a3(2CPU)         86   55   55    110    65%    -17    27%
   2 Fruit23-EM64T              18   53   53    110    55%    -17    34%
   3 TwistedLogic20090922_x64    3   38   38    216    49%      8    34%
   4 Delfi 5.4 (2CPU)          -22   54   54    108    50%    -17    31%
   5 TwistedLogic20080620      -36   38   38    221    44%      9    33%
   6 Spike1.2 Turin            -48   52   52    109    45%    -17    42%
Here are the same games plus games between the other engines:
Code: Select all
Rank Name                      Elo    +    -  games  score  oppo.  draws
   1 Bright-0.4a3(2CPU)        104   29   29    275    67%    -21    28%
   2 Fruit23-EM64T              11   28   28    275    52%     -2    35%
   3 TwistedLogic20090922_x64    3   32   32    216    49%      9    34%
   4 Delfi 5.4 (2CPU)          -34   28   28    270    44%      7    33%
   5 TwistedLogic20080620      -37   32   32    221    44%      9    33%
   6 Spike1.2 Turin            -47   28   28    273    41%     10    40%
Btw I found the old threads, they are from August 2008. Here are some example links (I searched for "robin" in my own postings):
http://www.talkchess.com/forum/viewtopic.php?p=205814#205814
http://www.talkchess.com/forum/viewtopic.php?p=205596#205596
http://www.talkchess.com/forum/viewtopic.php?p=206874#206874
http://www.talkchess.com/forum/viewtopic.php?p=207971#207971
One of the open issues from these days seems to be still open: will the use of the BayesElo command "covariance" further increase the advantage of adding RR games?
Sven
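A back-of-the-envelope answer to Sven's question about the number of games: error margins shrink roughly with the square root of the number of games, so the games needed to reach a target margin scale with the square of the ratio of margins. The sketch below is only that rough scaling and ignores that part of the +/-38 to +/-32 improvement came from the one-time RR games rather than from extra gauntlet games.
Code: Select all
def games_needed(current_games, current_margin, target_margin):
    """1/sqrt(n) scaling: halving the error margin takes about four times the games."""
    return int(round(current_games * (current_margin / target_margin) ** 2))

# Each TL version has about 220 gauntlet games in the tables above.
for margin in (38, 32):
    print(f"from +/-{margin} after 220 games to +/-5: about {games_needed(220, margin, 5)} games")
# Roughly 12,700 vs. 9,000 games, i.e. tighter starting bars translate into noticeably fewer games.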
michiguel
Re: A question on testing methodology
I get your point, but there is one thing that is never discussed, and it relates directly to your good observation:
Sven Schüle wrote:You, as most other people posting in this thread, are still missing my key point. [...] Without RR games both TL versions have ratings with error bars of +/- 38. Adding RR games changes the error bars to +/- 32. Do you notice that difference? [...]
The +/- in this context is meaningless. You do not want to know the error bar of (A) in this pool, or the error bar of (A*) in the pool. You want to know the error bar of the number (A - A*). This could be much smaller than a comparison of the two separate error bars would suggest. This is something most people overlook. DeltaAA* may vary much less than A or A*.
The error changes when you include the RR because what you are seeing is the error of each engine's rating within the pool. If you include an infinite number of RR games, the error may converge to reflect the real DeltaAA*. But the accuracy of the test never changed; you just never knew the real error! You just decreased the error of the error.

Miguel
PS: It may be better to run simulations to find out what kind of error you have than to run the RR.
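In the spirit of Miguel's PS, here is a stylized simulation (not a BayesElo run; the noise magnitudes are invented) of why the difference can be known much better than either individual rating: the hypothetical `pool_shift` term stands for the uncertainty in where the opponent pool sits, which hits A and A* identically and therefore cancels out of A* - A.
Code: Select all
import math, random

random.seed(3)
true_a, true_a_star = 0.0, 20.0
pool_sd, own_sd = 30.0, 10.0         # shared pool-placement noise and each engine's own noise (invented)
a_samples, diff_samples = [], []
for _ in range(20000):
    pool_shift = random.gauss(0.0, pool_sd)                 # common to both measurements
    a      = true_a      + pool_shift + random.gauss(0.0, own_sd)
    a_star = true_a_star + pool_shift + random.gauss(0.0, own_sd)
    a_samples.append(a)
    diff_samples.append(a_star - a)

def halfwidth(xs):
    mean = sum(xs) / len(xs)
    return 1.96 * math.sqrt(sum((x - mean) ** 2 for x in xs) / (len(xs) - 1))

print(f"A measured alone : +/-{halfwidth(a_samples):.0f} Elo")      # about +/-62
print(f"A* - A           : +/-{halfwidth(diff_samples):.0f} Elo")   # about +/-28: much tighter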
Sven Schüle
Re: A question on testing methodology
The +/- is not meaningless. In conjunction with the rating delta (A - A*) you can already guess how likely the statement "A* is an improvement over A" is, depending on the degree of overlap of the two rating intervals. But as recently discussed in another thread, it is much better to calculate LOS instead for this purpose. So Adam should do this for both methods (without and with RR games); then we'll see whether RR games have an influence on the LOS value. I am still confident that with RR games included, you will need fewer games to reach the same quality of measurement (expressed by error bars or by LOS) than without.
michiguel wrote:I get your point, but there is one thing that is never discussed and it relates directly to your good observation: [...] The +/- in this context is meaningless. You do not want to know the error bar of (A) in this pool, or the error bar of (A*) in the pool. You want to know the error bar of the number (A - A*). [...]
I can't follow that. If I decrease the error for A and A*, then I also decrease the error of DeltaAA*, as can be seen from the small example below.
michiguel wrote:The error changes when you include the RR because what you are seeing is the error of each A in the pool of engines. [...]
Code: Select all
Name Elo + -
A 0 20 20
A* 30 20 20
The rating of A lies within [-20 .. +20].
The rating of A* lies within [+10 .. +50].
Both intervals overlap, so it is not clear whether A* is better than A.
Name Elo + -
A 0 10 10
A* 30 10 10
The rating of A lies within [-10 .. +10].
The rating of A* lies within [+20 .. +40].
A* is clearly better than A.
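Since LOS comes up above: a common approximation used in engine testing estimates the likelihood of superiority from the decisive games of a direct match via the normal distribution, LOS ≈ Φ((W - L) / sqrt(W + L)). When the two versions only share a gauntlet, as in this thread, BayesElo derives an LOS matrix from its joint fit instead, but the quantity answers the same question. A small sketch with invented match figures:
Code: Select all
import math

def los(wins, losses):
    """Likelihood of superiority from decisive games (normal approximation, draws ignored)."""
    if wins + losses == 0:
        return 0.5
    return 0.5 * (1.0 + math.erf((wins - losses) / math.sqrt(2.0 * (wins + losses))))

# Invented results: the same +30 win surplus, but different numbers of decisive games.
print(f"W=530 L=500 -> LOS {los(530, 500):.3f}")
print(f"W=130 L=100 -> LOS {los(130, 100):.3f}")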
michiguel
Re: A question on testing methodology
That is exactly what I am saying you cannot do, and yet everybody does it. Well, you can, but it could be grossly overestimated. Comparing the overlaps is not strictly correct, particularly with a small set of opponents (and even more so if they did not play each other). You can easily have a situation in which A = 2400 +/- 30, A* = 2450 +/- 30, and DeltaAA* = 50 +/- 20.
Sven Schüle wrote:The +/- is not meaningless. In conjunction with the rating delta (A - A*) you can already guess how likely the statement "A* is an improvement over A" is, depending on the degree of overlapping of both rating intervals. [...]
Both A and A* could have a poorly determined Elo relative to the pool, yet a well-determined difference between them.
Play A and B 10000 games against each other. Make them each play a gauntlet against 10 opponents, with only 10 games. Put everything together and look at the error bars. I predict you will get a case like the one I mentioned above.
Exactly, if LOS is what I think it is.
Sven Schüle wrote:But as recently discussed in another thread, it is much better to calculate LOS instead for this purpose.
If LOS can be calculated correctly, it should be very similar in both cases.
Sven Schüle wrote:So Adam should do this for both methods (without and with RR games), then we'll see whether RR games have an influence on the LOS value. I am still confident that with RR games included, you will need fewer games to reach the same quality (expressed by error bars or by LOS) of your measurement than without.
That is not necessarily correct.
Sven Schüle wrote:I can't follow that. If I decrease the error for A and A* then I also decrease the error of DeltaAA*, as can be seen from the following small example: [...]
Let's assume that you have three points on the same line: A, B, and C.
A is 1 meter from B, and both of them are about 1 km from C. You measure the three distances A-B, A-C and B-C: the first with a ruler, and the other two with some sort of GPS system. Then you calculate the distance of each of those points to the center of mass. All of them will be ~0.5 km +/- 1 meter. You cannot say that the error of A - B is +/- 2 meters; the error of that particular measurement is still about 1 mm, because it comes from the ruler.
This may not be a good analogy but I am trying to illustrate what I mean.
Miguel
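Miguel's ruler-and-GPS analogy is the textbook error-propagation identity Var(A - B) = Var(A) + Var(B) - 2*Cov(A, B): when two estimates share most of their error, the covariance term removes it from the difference. That is also why the BayesElo "covariance" command Sven mentioned earlier is relevant. A tiny numeric sketch, with an invented covariance value:
Code: Select all
import math

def sd_of_difference(sd_a, sd_b, covariance):
    """Error propagation for a difference of two (possibly correlated) estimates."""
    return math.sqrt(sd_a ** 2 + sd_b ** 2 - 2.0 * covariance)

print(sd_of_difference(30.0, 30.0, 0.0))     # independent estimates: ~42 Elo
print(sd_of_difference(30.0, 30.0, 800.0))   # strongly correlated (shared opponents): ~14 Elo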
Re: A question on testing methodology
My bold.
michiguel wrote:The problem is that there is one assumption in ratings that is false. That is, if I get "better" against A, then I get "better" against B. That is true most of the time, but not always. Sometimes one change makes you better against A, but worse against B. If A and B are tested only against you, their ratings will fluctuate a lot. That is inaccurate for A and B ratings; I do not argue that. If you include games to make A and B ratings more accurate, you get a better picture of A relative to B, but it does not affect yours.
Hart wrote:Obviously, it can, as the results show. The question is: should this happen? What does it say about your performance being 9 Elo less against opponents whose ratings are separated by as much as 88 Elo between two sets? I do not understand how your results would be anything but better if your opponents' relative ratings were fixed beforehand, which is obviously not the case in gauntlets.
michiguel wrote:Yes it can! Many changes in one engine have a great impact on how they perform against one specific opponent. [...]
How could this be so? Wouldn't smaller errors in your opponents' ratings necessarily be conducive to smaller errors in yours? If not, why? Shouldn't scoring 50% against an engine with error bars at +/- 30 tell you less about your performance than if your opponent has error bars of +/- 15?
michiguel
Re: A question on testing methodology
I do not think this is necessarily true.
Hart wrote:My bold. [...] How could this be so? Wouldn't smaller errors in your opponents' ratings necessarily be conducive to smaller errors in yours?
Because the errors are relative to the average of the pool, which is not important to our purpose.
Hart wrote:If not, why?
For instance, A beats C 60-40%, while B vs C is 50-50. A is better than B, and C is the point of reference. It really should not matter where that point of reference is; the relative strength between A and B would still be the same.
Hart wrote:Shouldn't scoring 50% against an engine with error bars at +/- 30 tell you less about your performance than if your opponent has error bars of +/- 15?
Miguel
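Putting numbers on Miguel's example: under the logistic model a 60% score is about +70 Elo, so A sits about 70 above C while B sits level with C, and A - B stays at about 70 Elo no matter what absolute rating (or pool-relative error bar) C happens to be assigned. A small check, with the two placements of C invented for illustration:
Code: Select all
import math

def elo(p):
    """Elo difference implied by a score fraction p (logistic model)."""
    return -400.0 * math.log10(1.0 / p - 1.0)

for c_rating in (2400.0, 2800.0):             # wherever the reference engine C is placed
    a = c_rating + elo(0.60)                   # A scores 60% against C
    b = c_rating + elo(0.50)                   # B scores 50% against C
    print(f"C at {c_rating:.0f}: A = {a:.0f}, B = {b:.0f}, A - B = {a - b:.0f}")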