Sven Schüle wrote:
The +/- is not meaningless. In conjunction with the rating delta (A - A*) you can already guess how likely the statement "A* is an improvement over A" is, depending on the degree of overlap between the two rating intervals. But as recently discussed in another thread, it is much better to calculate LOS instead for this purpose. So Adam should do this for both methods (without and with RR games), then we'll see whether RR games have an influence on the LOS value. I am still confident that with RR games included, you will need fewer games to reach the same quality of measurement (expressed by error bars or by LOS) than without.
In this case the LOS will virtually not change. The LOS computed when I combined the 2 gauntlets was 99% for TL20090922, and that would not change if I computed it for all of the games.
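For anyone who wants a quick LOS estimate by hand, the usual shortcut is a normal approximation that ignores draws: LOS = 0.5 * (1 + erf((wins - losses) / sqrt(2 * (wins + losses)))). Below is only a sketch of that approximation, not BayesElo's exact Bayesian computation, and the sample numbers are invented:

    from math import erf, sqrt

    def los(wins, losses):
        """Approximate likelihood that the side with more wins is truly
        stronger, using a normal approximation and ignoring draws."""
        if wins + losses == 0:
            return 0.5
        return 0.5 * (1.0 + erf((wins - losses) / sqrt(2.0 * (wins + losses))))

    # e.g. 620 wins vs 560 losses (numbers made up for illustration)
    print(round(los(620, 560), 3))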
By the way, I used covariance instead of exactdist. You can tell by the fact that the confidence intervals are symmetric for all the engines, regardless of the number of games played and the relative strength of the opponents.
That was a gauntlet run. The difference between these two engines in the first case is 192, and in the second 104, for a difference of 88 Elo between the two sets. Even if both these gauntlet matches were included in the same BayesElo analysis, I can't believe that it would more than halve the difference, in which case the difference would still be well outside the 95% confidence intervals. Should a 5 Elo change in your program really cause two of your opponents to be rated 88 Elo further apart?
Yes it can! Many changes in one engine have a great impact on how it performs against one specific opponent. So, in your case, TL's and Hermann's ratings are based on how they perform against the test engine only. That is why they can fluctuate like crazy. That is also why we have to test against a variety of opponents.
Look at Hermann and TL, the change made it positive for one and negative for the other one! And the calculation is reflecting that.
Miguel
Obviously, it can, as the results show. The question is: should this happen? What does it say about your performance being 9 Elo less against opponents whose ratings are separated by as much as 88 Elo between the two sets? I do not understand how your results would be anything but better if your opponents' relative ratings were fixed beforehand, which is obviously not the case in gauntlets.
The problem is that there is one assumption in ratings that is false. That is: if I get "better" against A, then I get "better" against B. That is true most of the time, but not always. Sometimes one change makes you better against A, but worse against B. If A and B are tested only against you, their ratings will fluctuate a lot. That is inaccurate for A's and B's ratings; I do not argue that. If you include games to make A's and B's ratings more accurate, you get a better picture of A relative to B, but it does not affect yours.
Miguel
My bold.
How could this be so? Wouldn't smaller errors in your opponents' ratings necessarily be conducive to smaller errors in yours? If not, why? Shouldn't scoring 50% against an engine with error bars of +/- 30 tell you less about your performance than if your opponent has error bars of +/- 15?
The opponent's error bars have nothing to do with how your error bars are computed. Your error bars are determined by the square root of the number of games you played and the standard deviation in the distribution of your game results relative to your unknown true strength.
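To illustrate that claim with numbers, here is a rough sketch (my own construction, not what BayesElo does internally) that builds a 95% error bar from one engine's own results only: the sample standard deviation of the per-game scores divided by the square root of the number of games, then mapped onto the Elo scale. The result counts used here are hypothetical, and note that the opponents' games against each other never enter the computation.

    from math import log10, sqrt

    def elo_from_score(s):
        """Map a score fraction (0..1, exclusive) to an Elo difference."""
        return -400.0 * log10(1.0 / s - 1.0)

    def error_bar(wins, draws, losses):
        """95% Elo interval for one engine from its own results only."""
        n = wins + draws + losses
        mean = (wins + 0.5 * draws) / n
        # sample variance of the per-game scores (1, 0.5, 0)
        var = (wins * (1 - mean) ** 2 + draws * (0.5 - mean) ** 2
               + losses * (0 - mean) ** 2) / (n - 1)
        se = sqrt(var / n)                      # standard error of the mean score
        lo, hi = mean - 1.96 * se, mean + 1.96 * se
        return elo_from_score(mean), elo_from_score(lo), elo_from_score(hi)

    # hypothetical gauntlet result: 450 wins, 300 draws, 250 losses
    print(error_bar(450, 300, 250))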
Edsel Apostol wrote: I am currently using 10 opponents for my engine. For every new version or settings that I'm testing I use these 10 opponents. I don't play my engine against itself unless the search and eval are very different.
My question is whether the result from this testing is accurate enough, or do I need to run a round robin match between the opponents as well to get a more accurate rating? Do the opponents' results against other engines affect the rating of an engine? Does the number of games played by the opponents make the rating of an engine more stable?
Here's the question: do you want to know whether your engine is moving up or down, or do you want to know exact ratings for everyone? I have tested this exact case and found that the round robin might give more accurate ratings for everyone, overall. But it has no influence on the ratings for your two test programs, because your rating and error bar depend on the number of games you play.
You might find the example buried in CCC a couple of years ago. When I was originally discussing cluster testing, Remi made one suggestion that helped a lot: test A against the gauntlet, then test A' against the same gauntlet, and combine _all_ of the PGN into one file before passing it to BayesElo. Those numbers have been rock-solid to date. I also, for fun, added an equal number of PGN games between each pair of opponents, so that A, A', and each opponent played the same number of games. The ratings changed a bit, but the difference between A and A' did not. And if you think about it, you can play the opponents' round-robin first and just save those games, since none of the opponents are changing at all, then add your A and A' vs the gauntlet PGN to the rest and run it through BayesElo if you want to see how this changes (or doesn't change) the results.
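A minimal sketch of the bookkeeping described above, under the assumption that the gauntlet and round-robin games sit in separate PGN files (the file names here are made up):

    # Combine all gauntlet PGNs (and optionally a pre-played round-robin PGN)
    # into a single file so BayesElo rates everything in one pool.
    import shutil

    inputs = ["A_vs_gauntlet.pgn", "Aprime_vs_gauntlet.pgn", "opponents_rr.pgn"]

    with open("combined.pgn", "wb") as out:
        for name in inputs:
            with open(name, "rb") as f:
                shutil.copyfileobj(f, out)
            out.write(b"\n")   # keep games separated between files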
As for your question, I want to know whether versions/settings of my engine perform better compared to each other. I just thought that if the engine opponents have a more stable rating, it will also reflect on the stability of the rating of the tested engine. This is not the case though, as shown by the results posted by Adam Hair and by what you've pointed out in your posts.
The question now is: should the opponent engines' rating stability be included in the computation of the ratings? For example:
Format is (Engine, Number of Games, Winning Percentage)
Data 1:
EngineA 1000 60%
OpponentA 1000 40%
Data 2:
EngineA 1000 60%
OpponentA 20000 40%
In the above data, I have more confidence in the rating produced from Data 2 compared to the one from Data 1, as Data 2 has more games played by OpponentA against other engines, but the current computation of ratings by EloStat and BayesElo seems to produce the same rating for EngineA in Data 1 and Data 2.
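To make the Data 1 / Data 2 comparison concrete, here is a rough check under a plain binomial model (my own sketch, not how EloStat or BayesElo compute their intervals): EngineA's interval comes out identical in both data sets because it is built from EngineA's own 1000 games, while the extra 20000 games only narrow OpponentA's interval.

    from math import sqrt

    def score_interval(games, score):
        """95% interval for a score fraction under a simple binomial model."""
        se = sqrt(score * (1.0 - score) / games)
        return score - 1.96 * se, score + 1.96 * se

    # Data 1 and Data 2 above: EngineA has 1000 games at 60% in both.
    print(score_interval(1000, 0.60))    # EngineA, same in both data sets
    # OpponentA's own interval does tighten with more games:
    print(score_interval(1000, 0.40))
    print(score_interval(20000, 0.40))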
And you are falling into a classical trap. Ratings are _not_ absolute numbers. If you play your program (two versions, A and A') against a set of opponents, you should be able to set your opponent ratings to anything you want, and your two versions should still end up in the right order, with the right separation between them. Elo is not an absolute value; the only use for Elo numbers is to subtract them to get the difference. That difference is an absolute number, and is useful.
I understand and I know what you are trying to say, but that is not what my question is all about.
It seems clear to me now how Elo is being calculated: it is based mostly on winning percentages and doesn't take other factors into account. Maybe someday someone will discover a more accurate way to calculate ratings.
What else could you factor in? Elo is used to predict the future outcome of a game between two opponents, based on past performance by each of them. No need to factor in weather, health, mental state, etc. Enough games and that is included by natural selection.
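For reference, the prediction Elo encodes is just the logistic expectation curve; a minimal sketch:

    def expected_score(elo_diff):
        """Expected score of the stronger side given an Elo difference."""
        return 1.0 / (1.0 + 10.0 ** (-elo_diff / 400.0))

    # e.g. an 88 Elo edge predicts roughly a 62% score
    print(round(expected_score(88), 3))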
That was a gauntlet run. The difference between these two engines in the first case is 192, and in the second 104, for a difference of 88 Elo between the two sets. Even if both these gauntlet matches were included in the same BayesElo analysis, I can't believe that it would more than halve the difference, in which case the difference would still be well outside the 95% confidence intervals.
You are talking about a 92 Elo difference. If you want more accuracy, the first step is to play more games to get the error bar down. Every program in the gauntlet has a high error bar. To be fair, I don't play 1,400 games and look at results, so I have no idea how my results would compare. I wait until the full 40,000 games are played, where the error bar drops to +/- 4 or 5, and make decisions there. There are two levels of questions and answers here: which is better, A or B? If they are significantly different, it takes fewer games to determine the answer. If you want a very exact answer on "how much better", then up goes the number of games. If the programs are very close, as in A vs A', then a bunch of games is the only option.
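As a rough sanity check on those numbers (a back-of-the-envelope sketch, not BayesElo's actual computation), the error bar shrinks with the square root of the number of games, which is why 40,000 games gives an error bar several times smaller than 1,400 games; the 30% draw ratio below is an assumption:

    from math import log10, sqrt

    def approx_error_bar(games, draw_ratio=0.30):
        """Very rough 95% Elo error bar around a 50% score,
        assuming a fixed fraction of draws."""
        # per-game score variance at 50% with the given draw ratio
        var = 0.25 * (1.0 - draw_ratio)
        se = 1.96 * sqrt(var / games)          # 95% half-width on the score
        return -400.0 * log10(1.0 / (0.5 + se) - 1.0)

    print(round(approx_error_bar(1400), 1))    # roughly +/- 15 Elo
    print(round(approx_error_bar(40000), 1))   # roughly +/- 3 Elo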
Sure, common sense says that the more games you have the better, but what does this have to do with these two identical engines being rated 88 Elo further apart after playing an engine that is probably not more than 9 Elo weaker? The results would seem preposterous even if BayesElo used a 4-sigma confidence interval, let alone 2, so I fail to see the relevance of the relatively small sample size.
Suppose you play a round-robin, 1 game per pair of opponents. What will the error bars and the resulting Elo numbers be? Scattered Elos and big errors, right? As you refine the ratings of the opponents, the overall error bar drops, as does the "scattering effect" of Elo numbers that are way off. The more stable your opponents' Elo values are, the more correct your two test versions will be. I simply play enough games so that the accuracy is good enough that I don't see this kind of fluctuation.
I'll run this experiment today and post the results. I'll play the usual 8K x 5 opponents, and post the BayesElo output, then I will play a full RR between the 5 opponents, add that to the PGN and post the BayesElo result for that as well. Then we can talk about data with a large enough sample size to really be trustworthy. And compare them without having each player show such huge error bars.
For the gauntlet match, could you use two versions of Crafty that are not separated far in strength, that is, A vs your 5 opponents and then A* versus the same opponents under the same conditions? Then produce the BayesElo analysis for the two separate gauntlets, and then the final analysis with the RR games thrown in?
That is almost always the case (the two versions are not far apart). I have some results where the two are separated by 8 after 40,000 games. Close enough? I can add those PGNs to the RR pgn and give the original and combined results.
Looks like I have to refute myself to some degree on my earlier statement that the opponents' error bars have nothing to do with how yours are computed. The intervals are not estimated independently.
To quote Remi:
The most accurate method computes the whole covariance matrix. It has a cost quadratic in the number of players, so I do not use it as the default, because it would take too much time and memory when rating thousands of players.
So, if you are rating a few players and would like better confidence intervals, you should run the "covariance" command in bayeselo, after "mm".
Also, it is important to understand that a confidence interval in general has little meaning, because Elo ratings are relative. They are not estimated independently: there is some covariance in the uncertainty. That is to say, you cannot estimate the probability that one program is stronger than another by just looking at their confidence intervals. That is why I made the "los" command (for "likelihood of superiority"). "los" is a more significant indicator. In order to compare two programs, instead of running two separate experiments with two separate PGN files and looking at the confidence intervals, it is a lot better to have only one PGN file with all the games, and compare the two programs with the los command.
Rémi
I was thinking that there was another cause for the change in the intervals in the results I posted, yet it seems the change in uncertainty (given the small number of games) played a role also.
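Putting Rémi's advice into practice, a session would look roughly like the sketch below; the thread itself confirms the "mm", "covariance", and "los" commands, the other command names are as I recall them from bayeselo's interactive prompt, and the bayeselo binary location and PGN file name are assumptions:

    # Feed Remi's suggested command sequence to bayeselo via stdin.
    # Assumes a "bayeselo" binary on PATH and a combined PGN file.
    import subprocess

    commands = "\n".join([
        "readpgn combined.pgn",  # all games, both versions plus opponents, in one file
        "elo",                   # enter the rating sub-prompt
        "mm",                    # maximum-likelihood ratings
        "covariance",            # full covariance matrix for the intervals
        "ratings",               # print ratings with error bars
        "los",                   # likelihood-of-superiority matrix
        "x",                     # leave the sub-prompt
        "x",                     # quit
    ]) + "\n"

    result = subprocess.run(["bayeselo"], input=commands,
                            capture_output=True, text=True)
    print(result.stdout)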
Hart wrote: Even better that you have a good idea of the difference beforehand, that is, 8 Elo.
Remember that I know this because I played the two versions against the same set of 5 opponents, same 4K positions, and found that +8 elo gain. I am going to report those results, and then add in the RR among the 5 opponents to the same PGN file and report that.
BTW, finding this +8 elo data is not so easy. I save the PGN by version number, so that we can compare any two versions whenever we want. I have 83 sets of PGN, which is 3.3 million games. I am hunting for those two versions (I organize by version number, not by elo gain unfortunately). Meanwhile the RR test is running, a total of 4! * 8000 games, or almost 200,000 games. Will take about 5 hours or so to run. More later tonight.