A question on testing methodology

Discussion of chess software programming and technical issues.



Re: A question on testing methodology

Post by bob »

Edsel Apostol wrote:
bob wrote:
Edsel Apostol wrote: I am currently using 10 opponents for my engine. For every new version or setting that I'm testing, I use these 10 opponents. I don't play my engine against itself unless the search and eval are very different.

My question is whether the result from this testing is accurate enough, or do I need to also run a round robin match among the opponents to get a more accurate rating? Do the opponents' results against other engines affect the rating of an engine? Does the number of games played by the opponents make the rating of an engine more stable?

For example:

Format is (Engine, Elo, Number of Games)

Rating List A: (Gauntlet)
Opponent1 2900 1000
Opponent2 2875 1000
Opponent3 2850 1000
Opponent4 2825 1000
Opponent5 2800 1000
EngineA 2775 5000

Rating List B: (Round Robin)
Opponent1 2900 5000
Opponent2 2875 5000
Opponent3 2850 5000
Opponent4 2825 5000
Opponent5 2800 5000
EngineA 2775 5000

Which rating list is more accurate?
Here's the question: do you want to know whether your engine is moving up or down, or do you want to know exact ratings for everyone? I have tested this exact case and found that the round robin might give more accurate ratings for everyone overall. But it has no influence on the ratings of your two test programs, because your rating and error bar depend on the number of games you play.

You might find the example buried in CCC a couple of years ago. When I was originally discussing cluster testing, Remi made one suggestion that helped a lot: test A against the gauntlet, then test A' against the same gauntlet, and combine _all_ of the PGN into one file before passing it to BayesElo. Those numbers have been rock-solid to date. For fun, I also added an equal number of PGN games between each pair of opponents, so that A, A', and each opponent played the same number of games. The ratings changed a bit, but the difference between A and A' did not. And if you think about it, you can play the gauntlet round-robin first and just save those games, since none of the opponents are changing at all, and then add your A and A' vs. gauntlet PGN to the rest and run it through BayesElo if you want to see how this changes (or doesn't change) the results.
As for your question, I want to know whether versions/settings of my engine perform better relative to each other. I just thought that if the opponent engines have more stable ratings, that stability would also be reflected in the rating of the engine being tested. This is not the case, though, as shown by the results posted by Adam Hair and as you've pointed out in your posts.

The question now is: should the opponent engines' rating stability be included in the computation of the ratings? I mean, for example:

Format is (Engine, Number of Games, Winning Percentage)

Data 1:
EngineA 1000 60%
OpponentA 1000 40%

Data 2:
EngineA 1000 60%
OpponentA 20000 40%

In the above data, I have more confidence in the rating produced from Data 2 than in the one from Data 1, as Data 2 has more games played by OpponentA against other engines, but the current rating computation of EloStat and BayesElo seems to produce the same rating for EngineA in Data 1 and Data 2.
And you are falling into a classical trap. Ratings are _not_ absolute numbers. If you play your program (two versions, A and A') against a set of opponents, you should be able to set your opponent ratings to anything you want, and your two versions should still end up in the right order, with the right separation between them. Elo is not an absolute value; the only use for Elo numbers is to subtract them to get the difference. That is an absolute number, and it is useful.
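
To illustrate the point, here is a minimal sketch of the standard logistic Elo expectancy formula with made-up ratings (not anyone's actual test data): shifting every rating by the same constant changes nothing, so only the differences carry information.

Code: Select all

# Sketch of the standard logistic Elo model: only rating *differences* matter.
# The ratings below are illustrative placeholders, not results from any test.

def expected_score(r_a, r_b):
    """Expected score of A against B under the logistic Elo model."""
    return 1.0 / (1.0 + 10.0 ** ((r_b - r_a) / 400.0))

a, a_prime, opponent = 2775.0, 2785.0, 2850.0

# Shifting every rating by the same constant leaves all expected scores,
# and therefore the A vs. A' comparison, completely unchanged.
for shift in (0.0, 100.0, -500.0):
    print(shift,
          round(expected_score(a + shift, opponent + shift), 4),
          round(expected_score(a_prime + shift, opponent + shift), 4))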

Re: A question on testing methodology

Post by Hart »

http://talkchess.com/forum/viewtopic.php?t=30676

This is what I am referring to:

Code: Select all

1 Twisted Logic 20090922    2839   23   22  1432   90%  2437    3%
2 Hermann 2.5               2647   17   16  1428   74%  2437    6% 

Code: Select all

1 Twisted Logic 20090922    2770   20   20  1432   85%  2428    3%
2 Hermann 2.5               2666   17   17  1428   77%  2428    6% 
That was a gauntlet run. The difference between these two engines is 192 in the first case and 104 in the second, for a difference of 88 Elo between the two sets. Even if both of these gauntlet matches were included in the same BayesElo analysis, I can't believe it would more than halve the difference, in which case the difference is still well outside the 95% confidence intervals. Should a 5 Elo change in your program really cause two of your opponents to be rated 88 Elo further apart?

Re: A question on testing methodology

Post by bob »

Hart wrote:http://talkchess.com/forum/viewtopic.php?t=30676

This is what I am referring to:

Code: Select all

1 Twisted Logic 20090922    2839   23   22  1432   90%  2437    3%
2 Hermann 2.5               2647   17   16  1428   74%  2437    6% 

Code: Select all

1 Twisted Logic 20090922    2770   20   20  1432   85%  2428    3%
2 Hermann 2.5               2666   17   17  1428   77%  2428    6% 
That was a gauntlet run. The difference between these two engines is 192 in the first case and 104 in the second, for a difference of 88 Elo between the two sets. Even if both of these gauntlet matches were included in the same BayesElo analysis, I can't believe it would more than halve the difference, in which case the difference is still well outside the 95% confidence intervals.
You are talking about a difference of 92. If you want more accuracy, the first step is to play more games to get the error bar down. Every program in the gauntlet has a high error bar. To be fair, I don't play 1400 games and look at results, so I have no idea how my results would compare. I wait until the full 40,000 games are played, where the error bar drops to +/- 4 or 5, and make decisions there. There are two levels of questions and answers here. Which is better, A or B? If they are significantly different, it takes fewer games to determine the answer. If you want a very exact answer on "how much better", then up goes the number of games. If the programs are very close, as in A vs A', then a bunch of games is the only option.
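
To get a rough feel for how the error bar shrinks with the number of games, here is a back-of-the-envelope sketch using a normal approximation on the average score. This is not how BayesElo computes its intervals, and the W/D/L counts below are made up for illustration.

Code: Select all

# Back-of-the-envelope 95% error bar on a measured Elo difference, using a
# normal approximation on the average score. Not BayesElo's actual method,
# and the W/D/L counts are made up for illustration.
import math

def elo_and_error_bar(wins, draws, losses):
    n = wins + draws + losses
    score = (wins + 0.5 * draws) / n                      # average score
    var = (wins * (1.0 - score) ** 2
           + draws * (0.5 - score) ** 2
           + losses * (0.0 - score) ** 2) / n             # per-game variance
    se = math.sqrt(var / n)                               # std. error of mean
    def to_elo(s):
        s = min(max(s, 1e-9), 1.0 - 1e-9)
        return -400.0 * math.log10(1.0 / s - 1.0)
    return to_elo(score), to_elo(score + 1.96 * se) - to_elo(score)

# At a near-50% score, the 40,000-game error bar comes out several times
# smaller than the 1,400-game one.
print(elo_and_error_bar(12000, 16000, 12000))   # ~40,000 games
print(elo_and_error_bar(420, 560, 420))         # ~1,400 games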

Re: A question on testing methodology

Post by Hart »

bob wrote:
Hart wrote:http://talkchess.com/forum/viewtopic.php?t=30676

This is what I am referring to:

Code: Select all

1 Twisted Logic 20090922    2839   23   22  1432   90%  2437    3%
2 Hermann 2.5               2647   17   16  1428   74%  2437    6% 

Code: Select all

1 Twisted Logic 20090922    2770   20   20  1432   85%  2428    3%
2 Hermann 2.5               2666   17   17  1428   77%  2428    6% 
That was a gauntlet run. The difference between these two engines is 192 in the first case and 104 in the second, for a difference of 88 Elo between the two sets. Even if both of these gauntlet matches were included in the same BayesElo analysis, I can't believe it would more than halve the difference, in which case the difference is still well outside the 95% confidence intervals.
You are talking about a difference of 92. If you want more accuracy, the first step is to play more games to get the error bar down. Every program in the gauntlet has a high error bar. To be fair, I don't play 1400 games and look at results, so I have no idea how my results would compare. I wait until the full 40,000 games are played, where the error bar drops to +/- 4 or 5, and make decisions there. There are two levels of questions and answers here. Which is better, A or B? If they are significantly different, it takes fewer games to determine the answer. If you want a very exact answer on "how much better", then up goes the number of games. If the programs are very close, as in A vs A', then a bunch of games is the only option.
Sure, common sense says that the more games you have the better, but what does this have to do with these same two engines being rated 88 Elo further apart after playing an engine that is probably not more than 9 Elo weaker? The results would seem preposterous even if BayesElo used a 4-sigma confidence interval, let alone 2, so I fail to see the relevance of the relatively small sample size.

Re: A question on testing methodology

Post by Edsel Apostol »

bob wrote:
Edsel Apostol wrote:
bob wrote:
Edsel Apostol wrote: I am currently using 10 opponents for my engine. For every new version or setting that I'm testing, I use these 10 opponents. I don't play my engine against itself unless the search and eval are very different.

My question is whether the result from this testing is accurate enough, or do I need to also run a round robin match among the opponents to get a more accurate rating? Do the opponents' results against other engines affect the rating of an engine? Does the number of games played by the opponents make the rating of an engine more stable?

For example:

Format is (Engine, Elo, Number of Games)

Rating List A: (Gauntlet)
Opponent1 2900 1000
Opponent2 2875 1000
Opponent3 2850 1000
Opponent4 2825 1000
Opponent5 2800 1000
EngineA 2775 5000

Rating List B: (Round Robin)
Opponent1 2900 5000
Opponent2 2875 5000
Opponent3 2850 5000
Opponent4 2825 5000
Opponent5 2800 5000
EngineA 2775 5000

Which rating list is more accurate?
Here's the question: do you want to know whether your engine is moving up or down, or do you want to know exact ratings for everyone? I have tested this exact case and found that the round robin might give more accurate ratings for everyone overall. But it has no influence on the ratings of your two test programs, because your rating and error bar depend on the number of games you play.

You might find the example buried in CCC a couple of years ago. When I was originally discussing cluster testing, Remi made one suggestion that helped a lot: test A against the gauntlet, then test A' against the same gauntlet, and combine _all_ of the PGN into one file before passing it to BayesElo. Those numbers have been rock-solid to date. For fun, I also added an equal number of PGN games between each pair of opponents, so that A, A', and each opponent played the same number of games. The ratings changed a bit, but the difference between A and A' did not. And if you think about it, you can play the gauntlet round-robin first and just save those games, since none of the opponents are changing at all, and then add your A and A' vs. gauntlet PGN to the rest and run it through BayesElo if you want to see how this changes (or doesn't change) the results.
As for your question, I want to know whether versions/settings of my engine perform better relative to each other. I just thought that if the opponent engines have more stable ratings, that stability would also be reflected in the rating of the engine being tested. This is not the case, though, as shown by the results posted by Adam Hair and as you've pointed out in your posts.

The question now is: should the opponent engines' rating stability be included in the computation of the ratings? I mean, for example:

Format is (Engine, Number of Games, Winning Percentage)

Data 1:
EngineA 1000 60%
OpponentA 1000 40%

Data 2:
EngineA 1000 60%
OpponentA 20000 40%

In the above data, I have more confidence in the rating produced from Data 2 than in the one from Data 1, as Data 2 has more games played by OpponentA against other engines, but the current rating computation of EloStat and BayesElo seems to produce the same rating for EngineA in Data 1 and Data 2.
And you are falling into a classical trap. Ratings are _not_ absolute numbers. If you play your program (two versions, A and A') against a set of opponents, you should be able to set your opponent ratings to anything you want, and your two versions should still end up in the right order, with the right separation between them. Elo is not an absolute value; the only use for Elo numbers is to subtract them to get the difference. That is an absolute number, and it is useful.
I understand and I know what you are trying to say, but that is not what my question is all about.

It seems clear to me now how Elo is being calculated: it is based mostly on winning percentages and doesn't take other factors into account. Maybe someday someone will discover a more accurate way to calculate ratings.
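
For reference, here is a minimal sketch of the kind of winning-percentage-based calculation being described (an EloStat-style performance rating; BayesElo's Bayesian model is more elaborate, but its input is still just the game results). The numbers plugged in are only illustrative.

Code: Select all

# Sketch of an EloStat-style performance rating: the number comes purely from
# the winning percentage against the (average) opponent rating. BayesElo's
# Bayesian model is more involved, but its input is still just the results.
import math

def performance_rating(avg_opponent_elo, score_percent):
    s = min(max(score_percent / 100.0, 1e-9), 1.0 - 1e-9)
    return avg_opponent_elo + 400.0 * math.log10(s / (1.0 - s))

# The number of games the opponents played elsewhere never enters the
# formula, which matches the observation about Data 1 vs. Data 2 above.
print(round(performance_rating(2437, 90)))   # roughly 2819
print(round(performance_rating(2437, 85)))   # roughly 2738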

Re: A question on testing methodology

Post by michiguel »

Hart wrote:http://talkchess.com/forum/viewtopic.php?t=30676

This is what I am referring to:

Code: Select all

1 Twisted Logic 20090922    2839   23   22  1432   90%  2437    3%
2 Hermann 2.5               2647   17   16  1428   74%  2437    6% 

Code: Select all

1 Twisted Logic 20090922    2770   20   20  1432   85%  2428    3%
2 Hermann 2.5               2666   17   17  1428   77%  2428    6% 
That was a gauntlet run. The difference between these two engines is 192 in the first case and 104 in the second, for a difference of 88 Elo between the two sets. Even if both of these gauntlet matches were included in the same BayesElo analysis, I can't believe it would more than halve the difference, in which case the difference is still well outside the 95% confidence intervals. Should a 5 Elo change in your program really cause two of your opponents to be rated 88 Elo further apart?
Yes, it can! Many changes in an engine have a great impact on how it performs against one specific opponent. So, in your case, TL and Hermann based their ratings only on how they performed against the test engine. That is why they could fluctuate like crazy. That is also why we have to test against a variety of opponents.

Look at Hermann and TL: the change made it positive for one and negative for the other! And the calculation is reflecting that.
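
A back-of-the-envelope check with a simple performance-rating formula (not BayesElo's actual computation): the score changes shown in the two tables above already imply swings of roughly this size when an opponent's rating rests only on its games against the test engine.

Code: Select all

# Back-of-the-envelope check with a simple performance-rating formula
# (not BayesElo's actual computation): when an opponent's rating rests only
# on its score against the test engine, small score changes mean big swings.
import math

def elo_swing(score_before, score_after):
    diff = lambda s: 400.0 * math.log10(s / (1.0 - s))
    return diff(score_after) - diff(score_before)

print(round(elo_swing(0.90, 0.85)))   # Twisted Logic, 90% -> 85%: about -80
print(round(elo_swing(0.74, 0.77)))   # Hermann, 74% -> 77%: about +28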

Miguel

Re: A question on testing methodology

Post by michiguel »

Edsel Apostol wrote:
Adam Hair wrote: It appears that with Bayeselo it does not matter if you run a complete round robin or if you just run a gauntlet. I took the games of TL20090922 and TL20080620 that I posted recently and tried different scenarios. Note: TL20090922 and TL20080620 did not play each other.

There is just a slight difference in the first two sets of examples, and none in the last set.

It seems that, in general, gauntlets will give you the same information as round robin tournaments. It does seem that if your engine performs poorly against one opponent that is very weak against the other engines, then there would be some difference between gauntlet and round robin. But how likely is that?
Thanks for the data you've posted, Adam. It answered most of my questions. It seems that the formula/algorithm for computing the Elo is quite simple: it is based only on average winning percentages and doesn't take the rating performance of the opponents into account.
No, you are getting it all wrong!!

Miguel

Re: A question on testing methodology

Post by Hart »

michiguel wrote:
Hart wrote:http://talkchess.com/forum/viewtopic.php?t=30676

This is what I am referring to:

Code: Select all

1 Twisted Logic 20090922    2839   23   22  1432   90%  2437    3%
2 Hermann 2.5               2647   17   16  1428   74%  2437    6% 

Code: Select all

1 Twisted Logic 20090922    2770   20   20  1432   85%  2428    3%
2 Hermann 2.5               2666   17   17  1428   77%  2428    6% 
That was a gauntlet run. The difference between these two engines is 192 in the first case and 104 in the second, for a difference of 88 Elo between the two sets. Even if both of these gauntlet matches were included in the same BayesElo analysis, I can't believe it would more than halve the difference, in which case the difference is still well outside the 95% confidence intervals. Should a 5 Elo change in your program really cause two of your opponents to be rated 88 Elo further apart?
Yes, it can! Many changes in an engine have a great impact on how it performs against one specific opponent. So, in your case, TL and Hermann based their ratings only on how they performed against the test engine. That is why they could fluctuate like crazy. That is also why we have to test against a variety of opponents.

Look at Hermann and TL: the change made it positive for one and negative for the other! And the calculation is reflecting that.

Miguel
Obviously it can happen, as the results show. The question is: should this happen? What does it say about your performance being 9 Elo lower against opponents whose ratings are separated by as much as 88 Elo between the two sets? I do not understand how your results would be anything but better if your opponents' relative ratings were fixed beforehand, which is obviously not the case in gauntlets.

Re: A question on testing methodology

Post by michiguel »

Hart wrote:
michiguel wrote:
Hart wrote:http://talkchess.com/forum/viewtopic.php?t=30676

This is what I am referring to:

Code: Select all

1 Twisted Logic 20090922    2839   23   22  1432   90%  2437    3%
2 Hermann 2.5               2647   17   16  1428   74%  2437    6% 

Code: Select all

1 Twisted Logic 20090922    2770   20   20  1432   85%  2428    3%
2 Hermann 2.5               2666   17   17  1428   77%  2428    6% 
That was a gauntlet run. The difference between these two engines is 192 in the first case and 104 in the second, for a difference of 88 Elo between the two sets. Even if both of these gauntlet matches were included in the same BayesElo analysis, I can't believe it would more than halve the difference, in which case the difference is still well outside the 95% confidence intervals. Should a 5 Elo change in your program really cause two of your opponents to be rated 88 Elo further apart?
Yes, it can! Many changes in an engine have a great impact on how it performs against one specific opponent. So, in your case, TL and Hermann based their ratings only on how they performed against the test engine. That is why they could fluctuate like crazy. That is also why we have to test against a variety of opponents.

Look at Hermann and TL: the change made it positive for one and negative for the other! And the calculation is reflecting that.

Miguel
Obviously it can happen, as the results show. The question is: should this happen? What does it say about your performance being 9 Elo lower against opponents whose ratings are separated by as much as 88 Elo between the two sets? I do not understand how your results would be anything but better if your opponents' relative ratings were fixed beforehand, which is obviously not the case in gauntlets.
The problem is that there is one assumption in ratings that is false: if I get "better" against A, then I also get "better" against B. That is true most of the time, but not always. Sometimes one change makes you better against A but worse against B. If A and B are tested only against you, their ratings will fluctuate a lot. That is inaccurate for A's and B's ratings; I do not argue that. If you include games to make A's and B's ratings more accurate, you get a better picture of A relative to B, but it does not affect yours.

Miguel

Re: A question on testing methodology

Post by Edsel Apostol »

michiguel wrote:
Edsel Apostol wrote:
Adam Hair wrote: It appears that with Bayeselo it does not matter if you run a complete round robin or if you just run a gauntlet. I took the games of TL20090922 and TL20080620 that I posted recently and tried different scenarios. Note: TL20090922 and TL20080620 did not play each other.

There is just a slight difference in the first two sets of examples, and none in the last set.

It seems that, in general, gauntlets will give you the same information as round robin tournaments. It does seem that if your engine performs poorly against one opponent that is very weak against the other engines, then there would be some difference between gauntlet and round robin. But how likely is that?
Thanks for the data you've posted, Adam. It answered most of my questions. It seems that the formula/algorithm for computing the Elo is quite simple: it is based only on average winning percentages and doesn't take the rating performance of the opponents into account.
No, you are getting it all wrong!!

Miguel
Please elaborate. Maybe you're confusing the Elo computation for humans with the one for engines. I am talking about the computation for engines here. That is just my observation, by the way.

For example, in computer chess, ratings are calculated based on a version's winning percentage over the total games it has played. In human chess this is not the case: players are given a provisional rating, and that rating is then updated based on further games, taking the opponents' ratings into account. I'm wondering whether the incrementally updated rating is equivalent to simply summing up the total games played by that player and computing the rating from that.
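
For comparison, here is a minimal sketch of the two approaches being contrasted: a human-style incremental update with a K-factor versus a one-shot performance rating computed from the overall score. The K value, ratings, and results below are illustrative placeholders only.

Code: Select all

# Sketch of the two approaches contrasted above: a human-style incremental
# Elo update (K-factor) versus a one-shot performance rating over all games.
# K, the ratings, and the results are illustrative placeholders.
import math

def expected(r, opp):
    return 1.0 / (1.0 + 10.0 ** ((opp - r) / 400.0))

def incremental_update(start, games, k=20.0):
    """Human-style rating: updated game by game with a K-factor."""
    r = start
    for opp, result in games:          # result is 1, 0.5 or 0
        r += k * (result - expected(r, opp))
    return r

def performance(games):
    """Engine-style rating: computed once from the overall score."""
    avg_opp = sum(opp for opp, _ in games) / len(games)
    s = sum(res for _, res in games) / len(games)
    s = min(max(s, 1e-9), 1.0 - 1e-9)
    return avg_opp + 400.0 * math.log10(s / (1.0 - s))

games = [(2800, 1), (2850, 0.5), (2900, 0), (2825, 1), (2875, 0.5)] * 20
print(round(incremental_update(2775, games)))   # path-dependent estimate
print(round(performance(games)))                # one-shot estimate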