more on engine testing

Discussion of chess software programming and technical issues.

Moderator: Ras

Rémi Coulom
Posts: 438
Joined: Mon Apr 24, 2006 8:06 pm

Re: more on engine testing

Post by Rémi Coulom »

Sven Schüle wrote:
Rémi Coulom wrote:
Sven Schüle wrote: What I learn from your statement is that rating only the games from a gauntlet of an engine A against some opponents may give results that are less reliable than rating these games together with games between the opponents.
No, this is not what I meant. With the same amount of computing power, I believe that the opposite may be true, but I am not sure.
Now I'm lost again ... What do you mean by "same amount" here? When adding inter-opponent games, I would expect that each of these matches has the same number of games as each of the matches "A vs. opponent". So there is additional computing power necessary (e.g. for 5 opponents 10*80 games in addition to the other 5*80), but only once in the beginning, since these inter-opponent games can always be reused when another new version of A comes to testing.
Yes, inter-opponent games can be re-used, and improve the reliability of ratings. But games between the opponents and previous versions of your program can be re-used too. And I see no reason why they would improve the accuracy of opponent ratings less than inter-opponent games.

This means that the idea that merging the two sets of games into a single PGN file is better than two separate runs of bayeselo generalizes to more than two sets of games.

Rémi
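For what it's worth, the merged-run workflow Rémi describes would look something like the following bayeselo command script. The file names are hypothetical, the "#" annotations are explanatory (not bayeselo syntax), and the command names (readpgn, mm, exactdist, ratings) should be checked against your bayeselo version:

```text
readpgn gauntlet_A.pgn           # new version A vs. the opponents
readpgn inter_opponent.pgn       # opponents vs. each other (reusable)
readpgn previous_versions.pgn    # old versions of A vs. the opponents (reusable)
elo                              # enter the rating-computation mode
mm                               # maximum-likelihood (minorization-maximization) fit
exactdist                        # compute error bars
ratings                          # print the single joint rating list
```

The point of the single run is that all three game sets constrain the opponents' ratings at once, instead of each run estimating them independently.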
hgm
Posts: 28354
Joined: Fri Mar 10, 2006 10:06 am
Location: Amsterdam
Full name: H G Muller

Re: more on engine testing

Post by hgm »

Rémi Coulom wrote: In the case you describe, I don't think there is any problem with the behavior of bayeselo. In case there is a result of 100-0 between the two groups and 200 virtual draws, this means that every player has played only two games.
I don't think that is true. How can you draw that conclusion? I explicitly wrote that each player had 100 games. It is just that only one of those games of each A player was against a player from group B. The other 99 were against the other A players (and similarly for each B player).

So the relative ratings of the players within group A, or within group B, are known with extremely high certainty. The only question is how group B should be positioned w.r.t. A on the rating scale. And there was a 100-0 result between them. That should be enough evidence to conclude that group B is trash: a 100-0 result is very significant. No one in his right mind would think that the best player from group B would be about as strong as the average player of group A.

That BayesElo is incapable of drawing this conclusion shows that its behavior is far from OK, and in fact extremely disappointing...

Your reply seems to suggest that the virtual draws are assigned only along the links of the pairing network that have a non-zero number of games. But I don't think that is true (or correct, for that matter). The number of virtual draws assigned to each link is independent of the number of games played along that link, and zero is no exception. So each A player gets virtual draws against ALL the B players, even though he only played one of them.

If you would assign the virtual draws only along non-empty links, I can also easily construct examples where that fails dramatically. E.g. if A and B each play C1 - C1000 once (and no other games). How would you assign the virtual draws, such that the C players do not get too few, and A and B do not get too many?
Last edited by hgm on Thu Aug 07, 2008 7:41 pm, edited 1 time in total.
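hgm's point about where the virtual draws go can be explored with a toy model. The sketch below is NOT BayesElo's actual algorithm, just a plain maximum-likelihood Elo fit by gradient ascent on a made-up 2+2-player version of the scenario (two balanced players per group, one inter-group game won by A, two virtual draws per pair). It compares putting the virtual-draw prior on every pair against putting it only on pairs that actually met:

```python
import itertools

LN10_400 = 2.302585092994046 / 400.0  # log-likelihood gradient scale factor

def expected(ra, rb):
    """Logistic expected score of a player rated ra against one rated rb."""
    return 1.0 / (1.0 + 10.0 ** ((rb - ra) / 400.0))

def fit(games, players, iters=10000, lr=40.0):
    """Maximum-likelihood Elo fit by gradient ascent.
    games: list of (i, j, score_for_i, weight); ratings are mean-centered."""
    r = {p: 0.0 for p in players}
    for _ in range(iters):
        g = {p: 0.0 for p in players}
        for i, j, s, w in games:
            e = expected(r[i], r[j])
            g[i] += w * LN10_400 * (s - e)
            g[j] -= w * LN10_400 * (s - e)
        for p in players:
            r[p] += lr * g[p]
        m = sum(r.values()) / len(r)  # pin the mean rating to zero
        for p in players:
            r[p] -= m
    return r

players = ["a1", "a2", "b1", "b2"]
real_games = [
    ("a1", "a2", 1.0, 5.0), ("a1", "a2", 0.0, 5.0),  # balanced 5-5 inside A
    ("b1", "b2", 1.0, 5.0), ("b1", "b2", 0.0, 5.0),  # balanced 5-5 inside B
    ("a1", "b1", 1.0, 1.0),                          # only inter-group game: A wins
]
played_pairs = {("a1", "a2"), ("b1", "b2"), ("a1", "b1")}
all_pairs = set(itertools.combinations(players, 2))

def group_gap(prior_pairs):
    """Mean rating of group A minus group B, with 2 virtual draws per prior pair."""
    data = real_games + [(i, j, 0.5, 2.0) for (i, j) in prior_pairs]
    r = fit(data, players)
    return (r["a1"] + r["a2"]) / 2.0 - (r["b1"] + r["b2"]) / 2.0

print("gap, prior on played pairs only:", round(group_gap(played_pairs)))
print("gap, prior on all pairs:        ", round(group_gap(all_pairs)))
```

With the prior on all pairs, the extra cross-group draws pull the two groups together and shrink the estimated gap, which is exactly the behavior hgm describes.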
Rémi Coulom
Posts: 438
Joined: Mon Apr 24, 2006 8:06 pm

Re: more on engine testing

Post by Rémi Coulom »

hgm wrote:I don't think that is true. How can you draw that conclusion? I explicitly wrote that each player had 100 games. It is just that only one of those games of each A player was against a player from group B. The other 99 were against the other A players (and similarly for each B player).
Then, in that case, only 1/100 of the prior goes into the link between the two groups. So the strength difference between the two groups would be very high.

Rémi
hgm
Posts: 28354
Joined: Fri Mar 10, 2006 10:06 am
Location: Amsterdam
Full name: H G Muller

Re: more on engine testing

Post by hgm »

I was adding to my message above while your answer crossed it. Please take note of what I wrote. What you say is NOT how BayesElo behaves. Run it on the results of the ChessWar promo, and use the ratings it spews out to retro-predict the group result of the first round (strongest half against weakest half). It predicts totally wrong.
mathmoi
Posts: 290
Joined: Mon Mar 13, 2006 5:23 pm
Location: Québec
Full name: Mathieu Pagé

Re: more on engine testing

Post by mathmoi »

hgm wrote:So the relative ratings of the players within group A, or within group B, are known with extremely high certainty. The only question is how group B should be positioned w.r.t. A on the rating scale. And there was a 100-0 result between them. That should be enough evidence to conclude that group B is trash: a 100-0 result is very significant. No one in his right mind would thig that the best player from group B would be about as strong as the average player of group A.
How can you say that? The best player from group B might be the best of all from both groups. It is not impossible for the best player to lose his only game against a player from group A.

You can't conclude that the best player from group B is weaker than the average player of group A just because group A is generally stronger than group B.
hgm
Posts: 28354
Joined: Fri Mar 10, 2006 10:06 am
Location: Amsterdam
Full name: H G Muller

Re: more on engine testing

Post by hgm »

Yes you can, if he scored only 62% in group B, playing all the other players there, and group B as a whole was crushed 100-0 by group A. He is only ~85 Elo stronger than the average player of group B, and the 100-0 shows that, on average, group B is at least about 1000 Elo weaker.
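hgm's numbers follow from the standard logistic expected-score model, where the Elo difference implied by an expected score s is 400·log10(s/(1−s)). A quick sanity check, not tied to any particular rating tool:

```python
import math

def elo_from_score(s):
    """Elo difference implied by an expected score s (0 < s < 1),
    using the standard logistic model."""
    return 400.0 * math.log10(s / (1.0 - s))

print(round(elo_from_score(0.62)))    # 85 -> hgm's "~85 Elo" for a 62% scorer
print(round(elo_from_score(0.997)))   # 1009 -> ~99.7% score corresponds to ~1000 Elo
```

A literal 100% score maps to an infinite difference, which is exactly why rating tools need some prior (such as virtual draws) to keep a 100-0 result finite.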
bob
Posts: 20943
Joined: Mon Feb 27, 2006 7:30 pm
Location: Birmingham, AL

Re: more on engine testing

Post by bob »

I would draw a completely different conclusion. Ratings within each pool should be highly accurate. Ratings between the pools can't possibly be accurate with so few games. So the old GIGO applies here. The original data is garbage from the perspective of being useful to predict overall ratings...
Sven
Posts: 4052
Joined: Thu May 15, 2008 9:57 pm
Location: Berlin, Germany
Full name: Sven Schüle

Re: more on engine testing

Post by Sven »

hgm wrote:Yes you can, if he scored only 62% in group B, paying all other players there, and group B as a whole was crushed 100-0 by group A. He is only ~85 Elo stronger than the average player of group B. And the 100-0 shows you that on the average, group B is at least about 1000 Elo weaker, on the average.
There is no rating for a whole group, only for single players. The competition is not a competition of two groups but of individual players. A 100-0 result of A players against B players, where each A player met one B opponent, does not say much about the playing strength of any individual participant. It is also possible that 95 A players are stronger than their opponents but 5 are weaker, or even that only 30 A players are stronger and 70 are weaker; you can't tell from one game each, nor who is stronger by how much.

Sven
Rémi Coulom
Posts: 438
Joined: Mon Apr 24, 2006 8:06 pm

Re: more on engine testing

Post by Rémi Coulom »

hgm wrote:I was adding to my message above, while your answer crossed it. Please take note of what I wrote. What you say is NOT how BayesElo behaves. Run it on the results of the ChessWar promo, and use the ratings it spews out to retro-predict the group-result of the first round (strongest half against weakest half) It predicts totally wrong.
I generated artificial data with 2 * 100 players, a round-robin in each group, and 100 games between the two groups. In the round-robin, each result was a win with 50% probability and a loss with 50% probability. The rating list looks like this:

Code:

Rank Name   Elo    +    - games score oppo. draws 
   1 171    495   47   47   199   59%   415    0% 
   2 168    491   47   47   199   58%   416    0% 
[...]
  99 134    359   47   47   199   43%   417    0% 
 100 117    355   47   47   199   43%   417    0% 
 101 60    -337   47   47   199   59%  -417    0% 
 102 3     -368   47   47   199   56%  -417    0% 
[...]
I think it is completely OK.

If you give me more precise pointers to which PGN file gives which strange results for ChessWar, I will investigate it.

Rémi
Uri Blass
Posts: 10803
Joined: Thu Mar 09, 2006 12:37 am
Location: Tel-Aviv Israel

Re: more on engine testing

Post by Uri Blass »

bob wrote:I would draw a completely different conclusion. Ratings within each pool should be highly accurate. Ratings between the pools can't possibly be accurate with so few games. So the old GIGO applies here. The original data is garbage from the perspective of being useful to predict overall ratings...
I agree that you cannot know the exact rating if one group wins 100-0 against the second group, because the difference could be 1000 Elo or could be 2000 Elo. But you can be practically sure that the difference is very big, so if an Elo program gives you a small difference of 100 or 200 Elo, then that is something that had better be corrected.

Uri
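Uri's "practically sure that the difference is very big" can be put in numbers. Under the logistic model, and with the simplifying assumptions of independent games and no draws, the probability of a clean 100-0 sweep at a given Elo difference is:

```python
def sweep_prob(diff_elo, n=100):
    """Probability that the stronger side wins all n games, given an Elo
    difference of diff_elo, assuming independent games and no draws."""
    p = 1.0 / (1.0 + 10.0 ** (-diff_elo / 400.0))
    return p ** n

for d in (200, 500, 1000, 2000):
    print(f"{d:4} Elo: P(100-0) = {sweep_prob(d):.2g}")
```

A 100-0 sweep is essentially impossible at a 200 Elo gap and still has under 1% probability at 500 Elo, but becomes the typical outcome around 1000 Elo and above, which supports hgm's "at least about 1000 Elo" reading.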