A question on testing methodology

Discussion of chess software programming and technical issues.

Moderator: Ras

Edsel Apostol
Posts: 803
Joined: Mon Jul 17, 2006 5:53 am
Full name: Edsel Apostol

A question on testing methodology

Post by Edsel Apostol »

I am currently using 10 opponents for my engine. For every new version or setting that I'm testing I use these 10 opponents. I don't play my engine against itself unless the search and eval are very different.

My question is whether the result from this testing is accurate enough, or do I need to also run a round robin match between the opponents to get a more accurate rating? Do the opponents' results against other engines affect the rating of an engine? Does the number of games played by the opponents make the rating of an engine more stable?

For example:

Format is (Engine, Elo, Number of Games)

Rating List A: (Gauntlet)
Opponent1 2900 1000
Opponent2 2875 1000
Opponent3 2850 1000
Opponent4 2825 1000
Opponent5 2800 1000
EngineA 2775 5000

Rating List B: (Round Robin)
Opponent1 2900 5000
Opponent2 2875 5000
Opponent3 2850 5000
Opponent4 2825 5000
Opponent5 2800 5000
EngineA 2775 5000

Which rating list is more accurate?
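
For illustration only, here is a minimal sketch (not from the discussion) of how a gauntlet result maps to a performance rating under the usual logistic Elo model. The opponent ratings are those of Rating List A; the overall score is a made-up value.

Code: Select all

import math

def elo_diff_from_score(score):
    # Logistic Elo model: expected score s -> rating difference in Elo
    return -400.0 * math.log10(1.0 / score - 1.0)

opponent_elos = [2900, 2875, 2850, 2825, 2800]  # opponents of Rating List A
score = 0.46                                    # EngineA's overall gauntlet score (hypothetical)

avg_opp = sum(opponent_elos) / len(opponent_elos)
print("Performance rating: %.0f" % (avg_opp + elo_diff_from_score(score)))

Note that this simple performance number uses only EngineA's own games; the question above is whether adding the opponents' games against each other changes anything.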
Hart

Re: A question on testing methodology

Post by Hart »

I would think a one-time RR for opponents 1-5 should be enough to establish their ratings and give you better results. This just came up in another thread, and while I am not sure what the expert opinion is, it makes sense to know their relative ratings beforehand so you can more accurately gauge improvements in your program. In other words, the more the players are connected, the better your results.
michiguel
Posts: 6401
Joined: Thu Mar 09, 2006 8:30 pm
Location: Chicago, Illinois, USA

Re: A question on testing methodology

Post by michiguel »

Hart wrote:I would think a one-time RR for opponents 1-5 should be enough to establish their ratings and give you better results. This just came up in another thread, and while I am not sure what the expert opinion is, it makes sense to know their relative ratings beforehand so you can more accurately gauge improvements in your program. In other words, the more the players are connected, the better your results.
You are right if you are interested in knowing the rating of the engine, but IMO not if you want to know how much progress an engine made compared to the previous version.
Unless the calculation of the rating program is wrongly affected by this, the influence of games between third parties should be minimal or close to zero. After all, what is important in this case is the difference between the performance of Engine_A and the performance of Engine_A* (modified) against the same gauntlet (i.e. not the error of each Elo, but the error of the difference between the engines).

Miguel
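
A rough sketch of that point, with invented win/draw/loss counts: the quantity of interest is the uncertainty of the difference between the two gauntlet results, and it can be estimated from Engine_A's and Engine_A*'s own games alone (first-order/delta-method approximation).

Code: Select all

import math

def elo(s):
    # Logistic model: score fraction -> Elo difference vs the opponent pool
    return 400.0 * math.log10(s / (1.0 - s))

def score_and_se(wins, draws, losses):
    # Score fraction and its standard error from raw game counts
    n = wins + draws + losses
    s = (wins + 0.5 * draws) / n
    var = (wins * (1.0 - s) ** 2 + draws * (0.5 - s) ** 2 + losses * s ** 2) / n
    return s, math.sqrt(var / n)

def elo_se(s, se_s):
    # Delta method: d(elo)/ds = 400 / (ln 10 * s * (1 - s))
    return 400.0 / (math.log(10.0) * s * (1.0 - s)) * se_s

# Invented results of Engine_A and Engine_A* against the same gauntlet
s_a, se_a = score_and_se(420, 240, 340)
s_b, se_b = score_and_se(445, 250, 305)

diff = elo(s_b) - elo(s_a)
se_diff = math.sqrt(elo_se(s_a, se_a) ** 2 + elo_se(s_b, se_b) ** 2)
print("Elo(A*) - Elo(A) = %+.1f +/- %.1f (1 sigma)" % (diff, se_diff))

Games between the opponents never enter this difference, which is why their influence should be close to zero here.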
Edsel Apostol
Posts: 803
Joined: Mon Jul 17, 2006 5:53 am
Full name: Edsel Apostol

Re: A question on testing methodology

Post by Edsel Apostol »

michiguel wrote:
Hart wrote:I would think a one-time RR for opponents 1-5 should be enough to establish their ratings and give you better results. This just came up in another thread, and while I am not sure what the expert opinion is, it makes sense to know their relative ratings beforehand so you can more accurately gauge improvements in your program. In other words, the more the players are connected, the better your results.
You are right if you are interested in knowing the rating of the engine, but IMO not if you want to know how much progress an engine made compared to the previous version.
Unless the calculation of the rating program is wrongly affected by this, the influence of games between third parties should be minimal or close to zero. After all, what is important in this case is the difference between the performance of Engine_A and the performance of Engine_A* (modified) against the same gauntlet (i.e. not the error of each Elo, but the error of the difference between the engines).

Miguel
What if the rating difference between Engine_A and Engine_A* differs between the two lists? Which rating list should be trusted: the gauntlet alone, or the gauntlet plus the round robin match of the opponents?

If, for example, Engine_A and Engine_A* have the same winning percentage in the gauntlet, they would have the same Elo rating even though they perform differently against the individual opponents. When you also consider the results of the round robin between the opponents, is it possible that the ratings of Engine_A and Engine_A* would no longer be the same?
michiguel
Posts: 6401
Joined: Thu Mar 09, 2006 8:30 pm
Location: Chicago, Illinois, USA

Re: A question on testing methodology

Post by michiguel »

Edsel Apostol wrote:
michiguel wrote:
Hart wrote:I would think a one-time RR for opponents 1-5 should be enough to establish their ratings and give you better results. This just came up in another thread, and while I am not sure what the expert opinion is, it makes sense to know their relative ratings beforehand so you can more accurately gauge improvements in your program. In other words, the more the players are connected, the better your results.
You are right if you are interested in knowing the rating of the engine, but IMO not if you want to know how much progress an engine made compared to the previous version.
Unless the calculation of the rating program is wrongly affected by this, the influence of games between third parties should be minimal or close to zero. After all, what is important in this case is the difference between the performance of Engine_A and the performance of Engine_A* (modified) against the same gauntlet (i.e. not the error of each Elo, but the error of the difference between the engines).

Miguel
What if the rating difference between Engine_A and Engine_A* differs between the two lists? Which rating list should be trusted: the gauntlet alone, or the gauntlet plus the round robin match of the opponents?

If, for example, Engine_A and Engine_A* have the same winning percentage in the gauntlet, they would have the same Elo rating even though they perform differently against the individual opponents. When you also consider the results of the round robin between the opponents, is it possible that the ratings of Engine_A and Engine_A* would no longer be the same?
1) You should include A and A* together (i.e. treat both gauntlets as one data set), so you will see the difference directly.
2) If both engines score 50%, they should have the same rating.

Miguel
Edsel Apostol
Posts: 803
Joined: Mon Jul 17, 2006 5:53 am
Full name: Edsel Apostol

Re: A question on testing methodology

Post by Edsel Apostol »

michiguel wrote:
Edsel Apostol wrote:
michiguel wrote:
Hart wrote:I would think a one-time RR for opponents 1-5 should be enough to establish their ratings and give you better results. This just came up in another thread, and while I am not sure what the expert opinion is, it makes sense to know their relative ratings beforehand so you can more accurately gauge improvements in your program. In other words, the more the players are connected, the better your results.
You are right if you are interested in knowing the rating of the engine, but IMO not if you want to know how much progress an engine made compared to the previous version.
Unless the calculation of the rating program is wrongly affected by this, the influence of games between third parties should be minimal or close to zero. After all, what is important in this case is the difference between the performance of Engine_A and the performance of Engine_A* (modified) against the same gauntlet (i.e. not the error of each Elo, but the error of the difference between the engines).

Miguel
What if the rating difference between Engine_A and Engine_A* differs between the two lists? Which rating list should be trusted: the gauntlet alone, or the gauntlet plus the round robin match of the opponents?

If, for example, Engine_A and Engine_A* have the same winning percentage in the gauntlet, they would have the same Elo rating even though they perform differently against the individual opponents. When you also consider the results of the round robin between the opponents, is it possible that the ratings of Engine_A and Engine_A* would no longer be the same?
1) You should include A and A* together (i.e. treat both gauntlets as one data set), so you will see the difference directly.
2) If both engines score 50%, they should have the same rating.

Miguel
@1 I'm already doing that. I'm just wondering whether there is an effect on the rating difference between the tested versions if the results of the round robin between the opponents are also included when calculating the rating list. Would the rating difference between the engines remain the same, or would it be influenced by the results of the opponents' round robin match?

@2 You mean that they will only have the same rating if they both have a 50% winning percentage? What about other cases, let's say for example when both engines score 60% on average?
mcostalba
Posts: 2684
Joined: Sat Jun 14, 2008 9:17 pm

Re: A question on testing methodology

Post by mcostalba »

Edsel Apostol wrote:I am currently using 10 opponents for my engine. For every new version or setting that I'm testing I use these 10 opponents. I don't play my engine against itself unless the search and eval are very different.
Hi Edsel,

Can I ask what time control you use?

That seems like a lot of games to validate a single patch. Either you have many PCs, or you probably use a very fast time control so as not to slow down development.

Thanks
Marco
Sven
Posts: 4052
Joined: Thu May 15, 2008 9:57 pm
Location: Berlin, Germany
Full name: Sven Schüle

Re: A question on testing methodology

Post by Sven »

michiguel wrote:
Hart wrote:I would think a one-time RR for opponents 1-5 should be enough to establish their ratings and give you better results. This just came up in another thread, and while I am not sure what the expert opinion is, it makes sense to know their relative ratings beforehand so you can more accurately gauge improvements in your program. In other words, the more the players are connected, the better your results.
You are right if you are interested in knowing the rating of the engine, but IMO not if you want to know how much progress an engine made compared to the previous version.
Unless the calculation of the rating program is wrongly affected by this, the influence of games between third parties should be minimal or close to zero. After all, what is important in this case is the difference between the performance of Engine_A and the performance of Engine_A* (modified) against the same gauntlet (i.e. not the error of each Elo, but the error of the difference between the engines).

Miguel
This topic was indeed discussed in the past (sorry for not providing the link here), but for me there was no satisfying conclusion. My point was, and still is, that an additional RR between the opponents should also improve the error bars for the ratings of Engine_A and Engine_A*. To prove this would require the following comparison:

Method 1:
- play gauntlet of A against opponents
- play gauntlet of A* against opponents
- make rating list of all these games and look at ratings and error bars for A and A*

Method 2:
- same as method 1 but also play RR between opponents and include these games as well

The assumption that the ratings of A and A* are not affected by the choice of method 1 or 2 may hold, but it is possible that method 2 improves the error bars and therefore *may* help to reduce the number of games required to reach the defined maximum error bars. My idea behind this is that playing against "more stable" opponents should also result in a "more stable" rating.

I don't recall whether this has really been tested by someone.

Sven
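
One way to check this hypothesis would be a small simulation of the two methods. The sketch below is only an outline of such an experiment, not something that was run in this thread: it assumes true ratings, simulates decisive games only (no draws) under a Bradley-Terry model, fits ratings with the standard minorization-maximization iteration, and compares the run-to-run spread of the estimated A* - A difference with and without the extra round robin between the opponents. All numbers (ratings, 200 games per pairing, 50 replicates) are assumptions.

Code: Select all

import math
import random

TRUE_ELO = {"A": 2775, "A*": 2780, "O1": 2900, "O2": 2875,
            "O3": 2850, "O4": 2825, "O5": 2800}
OPPONENTS = ["O1", "O2", "O3", "O4", "O5"]
GAMES_PER_PAIRING = 200

def simulate(n, elo_i, elo_j):
    # Wins for player i out of n decisive games against player j
    p = 1.0 / (1.0 + 10.0 ** ((elo_j - elo_i) / 400.0))
    return sum(random.random() < p for _ in range(n))

def fit_bt(players, games, iters=200):
    # Bradley-Terry MM fit; games is a list of (i, j, wins_i, wins_j)
    gamma = {p: 1.0 for p in players}
    for _ in range(iters):
        for p in players:
            wins = sum(wi for (i, j, wi, wj) in games if i == p)
            wins += sum(wj for (i, j, wi, wj) in games if j == p)
            denom = sum((wi + wj) / (gamma[i] + gamma[j])
                        for (i, j, wi, wj) in games if p in (i, j))
            if wins > 0 and denom > 0.0:
                gamma[p] = wins / denom
    base = gamma["A"]
    return {p: 400.0 * math.log10(g / base) for p, g in gamma.items()}

def one_replicate(with_rr):
    games = []
    for e in ("A", "A*"):                      # the two gauntlets
        for o in OPPONENTS:
            w = simulate(GAMES_PER_PAIRING, TRUE_ELO[e], TRUE_ELO[o])
            games.append((e, o, w, GAMES_PER_PAIRING - w))
    if with_rr:                                # extra RR between the opponents
        for a in range(len(OPPONENTS)):
            for b in range(a + 1, len(OPPONENTS)):
                o1, o2 = OPPONENTS[a], OPPONENTS[b]
                w = simulate(GAMES_PER_PAIRING, TRUE_ELO[o1], TRUE_ELO[o2])
                games.append((o1, o2, w, GAMES_PER_PAIRING - w))
    ratings = fit_bt(list(TRUE_ELO), games)
    return ratings["A*"] - ratings["A"]

for with_rr, label in ((False, "method 1: gauntlets only"),
                       (True, "method 2: gauntlets + opponent RR")):
    diffs = [one_replicate(with_rr) for _ in range(50)]
    mean = sum(diffs) / len(diffs)
    sd = math.sqrt(sum((d - mean) ** 2 for d in diffs) / (len(diffs) - 1))
    print("%s: mean A*-A = %+.1f Elo, spread = %.1f Elo" % (label, mean, sd))

Whether the opponent RR actually shrinks the spread of the A* - A difference can then be read directly from the output.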
Edsel Apostol
Posts: 803
Joined: Mon Jul 17, 2006 5:53 am
Full name: Edsel Apostol

Re: A question on testing methodology

Post by Edsel Apostol »

mcostalba wrote:
Edsel Apostol wrote:I am currently using 10 opponents for my engine. For every new version or setting that I'm testing I use these 10 opponents. I don't play my engine against itself unless the search and eval are very different.
Hi Edsel,

Can I ask what time control you use?

That seems like a lot of games to validate a single patch. Either you have many PCs, or you probably use a very fast time control so as not to slow down development.

Thanks
Marco
I was using a 2+0.05 seconds time control before; now I've switched to 10+0.1. I have testers who helped me with my testing before (Audy Arandela and Tobias Lagemann). Tobias literally played millions of games to tune our settings. TL wouldn't be that strong without the two of them helping me.

I don't really play lots of games for each version. I just use a small set of opening positions, varied enough to cover important factors like passed pawn pushes, pawn storms, king attacks, etc. I have no choice but to trust the results of a thousand games. It's a compromise given the limited testing resources I have.
Hart

Re: A question on testing methodology

Post by Hart »

This makes pretty good sense. In a gauntlet match I saw in another thread, one of the opponents varied in strength by more than 60 Elo after more than 1000 games, well outside the 95% confidence intervals, even though the two versions of the tested engine were identical. My thinking is that the ratings of your opponents should remain constant, so that the only variation seen is between the two programs you mean to compare, and this could be established beforehand by playing several thousand games in a RR.
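
As a rough sanity check on such numbers, here is a back-of-the-envelope sketch (my own, assuming a 35% draw rate and a 50% score) of how much two independent measurements of the same engine can be expected to differ purely by chance:

Code: Select all

import math

def elo_halfwidth_95(n_games, draw_rate=0.35, score=0.5):
    # Approximate 95% half-width (in Elo) of a measured rating after n games,
    # via the per-game trinomial variance and the delta method.
    var_per_game = score * (1.0 - score) - draw_rate / 4.0
    se_score = math.sqrt(var_per_game / n_games)
    delo_dscore = 400.0 / (math.log(10.0) * score * (1.0 - score))
    return 1.96 * delo_dscore * se_score

for n in (200, 500, 1000, 2000, 5000):
    hw = elo_halfwidth_95(n)
    # Two independent measurements differ by more than sqrt(2) times the
    # half-width only about 5% of the time.
    print("%5d games: +/- %4.1f Elo, run-to-run swings beyond %4.1f Elo are rare"
          % (n, hw, math.sqrt(2.0) * hw))

Under these assumptions, a swing of more than 60 Elo after 1000+ games would indeed be far outside what sampling error alone explains.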