more on engine testing

Discussion of chess software programming and technical issues.

Moderator: Ras

Rémi Coulom
Posts: 438
Joined: Mon Apr 24, 2006 8:06 pm

Re: more on engine testing

Post by Rémi Coulom »

Note, by the way, that on the same data, if you run the "covariance" command before listing the ratings, you'll get this instead:

Code: Select all

Rank Name   Elo    +    - games score oppo. draws 
   1 171    495  154  154   199   59%   415    0% 
Covariance takes into consideration the uncertainty of the link between the two groups.

Rémi
User avatar
hgm
Posts: 28354
Joined: Fri Mar 10, 2006 10:06 am
Location: Amsterdam
Full name: H G Muller

Re: more on engine testing

Post by hgm »

bob wrote:I would draw a completely different conclusion. Ratings within each pool should be highly accurate. Ratings between the pools can't possibly be accurate with so few games. So the old GIGO applies here. The original data is garbage from the perspective of being useful to predict overall ratings...
Well, of course you would consider a 100-0 result garbage, because 100 is too few games. With you, even a 25,000-game result is garbage...

But when I test engines, you can be pretty sure that the engine scoring 0 in a 100-0 match was not the stronger one...
User avatar
hgm
Posts: 28354
Joined: Fri Mar 10, 2006 10:06 am
Location: Amsterdam
Full name: H G Muller

Re: more on engine testing

Post by hgm »

Sven Schüle wrote:There is no rating for a whole group, only for single players. The competition is not a competition of two groups but of individual players. A 100-0 result of A players against B players, where each A player met one B opponent, does not say much about the playing strength of any participant. It is also possible that 95 A players are stronger than their opponents but 5 A players are weaker, or even that only 30 A players are stronger and 70 A players are weaker, you can't tell that from 1 game, nor who is stronger by how much.
If the relative rating of the players within a group is extremely certain from the large number of games within the group, the group indeed starts to behave as if it were a single player (probably receiving a known handicap in some of the games).

Look, there is little point in arguing about this. If you don't believe it, just generate synthetic data for the case I described, where you assign ratings such that within the groups no player scores outside the range 38-62% (in a round-robin), and between the groups you will have a 100-0 result. If you make the ratings of the groups overlap, there is no way you will be able to do that, not even if you continue trying for a thousand years.
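As a quick illustration of why overlapping ratings cannot produce such a result, here is a small Monte Carlo sketch under the standard logistic Elo model (the group sizes, the ~80-Elo within-group spread, and the ~1000-Elo gap are illustrative numbers, not taken from the thread):

```python
import math
import random

def p_win(d):
    # logistic Elo model: win probability at rating advantage d
    return 1 / (1 + 10 ** (-d / 400))

random.seed(1)
# two groups of 100 players; ratings within a group span ~80 Elo,
# and the groups are separated by roughly 1000 Elo (hypothetical numbers)
group_a = [1000 + random.uniform(-40, 40) for _ in range(100)]
group_b = [random.uniform(-40, 40) for _ in range(100)]

# one cross-group game per player: A[i] vs B[i]
a_wins = sum(random.random() < p_win(group_a[i] - group_b[i])
             for i in range(100))
print(a_wins)  # with a gap this large, a 100-0 sweep is the typical outcome
```

Shrink the gap so that the two rating ranges overlap and the sweep disappears almost immediately, which is exactly the point above.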
User avatar
hgm
Posts: 28354
Joined: Fri Mar 10, 2006 10:06 am
Location: Amsterdam
Full name: H G Muller

Re: more on engine testing

Post by hgm »

Rémi Coulom wrote:I generated artificial data with 2 * 100 players, a round-robin in each group, and 100 games between the two groups. In the round-robin, results were with 50% probability a win, and with 50% probability a loss. The rating list looks like this:

Code: Select all

Rank Name   Elo    +    - games score oppo. draws 
   1 171    495   47   47   199   59%   415    0% 
   2 168    491   47   47   199   58%   416    0% 
[...]
  99 134    359   47   47   199   43%   417    0% 
 100 117    355   47   47   199   43%   417    0% 
 101 60    -337   47   47   199   59%  -417    0% 
 102 3     -368   47   47   199   56%  -417    0% 
[...]
Yes, you are right, this looks OK. I cannot give you the link to the PGN on which I did the observation right now, as the Chess-War website seems to be down at the moment.

But if the poor result I got on that PGN file is not due to virtual-draw assignment (as I supposed, since the anomaly seemed to disappear when I reduced the prior to almost nothing), it raises the question of how you do assign the virtual draws.

I assumed that, if there were 200 players, like in the example above, and prior = 2, you would assign 2/99 draws along any link, so that each player in total has 2 draws. But this would mean that there are about 2/99 * 10,000 ≈ 202 virtual draws connecting groups A and B, making the total result effectively 201-101. And that is not compatible with the result you print here, so apparently it is not what you do.

So what is your algorithm for assigning the virtual draws to the links of the pairing network???

And in particular, how would you assign them in the other example (of the two gauntlets, of A and B against C1-C1000)?
Rémi Coulom
Posts: 438
Joined: Mon Apr 24, 2006 8:06 pm

Re: more on engine testing

Post by Rémi Coulom »

hgm wrote:I assumed that, if there were 200 players, like in the example above, and prior = 2, you would assign 2/99 draws along any link, so that each player in total has 2 draws. But this would mean that there are about 2/99 * 10,000 ≈ 202 virtual draws connecting groups A and B, making the total result effectively 201-101. And that is not compatible with the result you print here, so apparently it is not what you do.

So what is your algorithm for assigning the virtual draws to the links of the pairing network???

And in particular, how would you assign them in the other example (of the two gauntlets, of A and B against C1-C1000)?
The virtual draws are distributed evenly over all the games. If the prior is 2, and only two players play against each other, 2 virtual draws are added to the match. To make this work in the more general case, the number of virtual draws added between two players is, for every game, 1/N1 + 1/N2, where N1 is the number of games played by player 1 and N2 is the number of games played by player 2. The number of virtual draws only depends on the number of players, not on the number of games.

So, if A plays the same number of games against C1-C1000, then 1,001 virtual draws will be added for every pair.
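The rule as stated can be sketched as follows (a reconstruction from this description, not BayesElo's actual source; the `prior/2` scaling is my assumption, chosen so that a plain head-to-head match with prior 2 gets exactly 2 virtual draws):

```python
def virtual_draws(games, prior=2.0):
    # games: dict mapping a (player1, player2) pair to the number of
    # real games played between them
    total = {}  # total games played by each player
    for (a, b), g in games.items():
        total[a] = total.get(a, 0) + g
        total[b] = total.get(b, 0) + g
    # each game contributes 1/N1 + 1/N2 virtual draws to its pairing,
    # scaled by prior/2 (assumed) so a two-player match gets `prior` draws
    return {(a, b): (prior / 2) * g * (1 / total[a] + 1 / total[b])
            for (a, b), g in games.items()}

# a single 50-game match: 2 virtual draws in total, regardless of length
print(virtual_draws({("A", "B"): 50}))   # {('A', 'B'): 2.0}

# 100-player round-robin, 1 game per pair: 2/99 virtual draws per link,
# i.e. 2 virtual draws in total per player, matching the figure above
rr = virtual_draws({(i, j): 1 for i in range(100) for j in range(i + 1, 100)})
print(round(rr[(0, 1)], 6))              # 0.020202  (= 2/99)
```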

As I said in my previous message, this may cause a problem by adding too few virtual draws between two densely connected parts of the graph. It may be better to also have an independent prior, i.e. a virtual player with some draws against everybody. But when I tried to optimize the prior by cross-validation on WBEC data (and other data), I found that adding that virtual player did not improve predictions. I suppose that in every normal pairing situation, the current scheme of bayeselo should work.

It is always possible to invent artificial situations where the prior is wrong, anyway, whatever the prior.

If we are going to use the output of the program to find out whether one version is stronger than another, then it is good to have a strong prior. Maybe in the case you have in mind, when trying to make predictions from very few games, a weaker prior is better.

Rémi
Rémi Coulom
Posts: 438
Joined: Mon Apr 24, 2006 8:06 pm

Re: more on engine testing

Post by Rémi Coulom »

hgm wrote: I assumed that, if there were 200 players, like in the example above, and prior = 2, you would assign 2/99 draws along any link, so that each player in total has 2 draws.
This is correct.
But this would mean that there are about 2/99 * 10,000 ≈ 202 virtual draws connecting groups A and B, making the total result effectively 201-101. And that is not compatible with the result you print here, so apparently it is not what you do.
I don't understand this. 99 = 98 in the round-robin + 1 against the other group, right? That is only 100 games between the two groups, so 2/99 * 100 = 2.02 virtual draws.

Rémi
User avatar
hgm
Posts: 28354
Joined: Fri Mar 10, 2006 10:06 am
Location: Amsterdam
Full name: H G Muller

Re: more on engine testing

Post by hgm »

Rémi Coulom wrote:The virtual draws are distributed evenly over all the games. If the prior is 2, and only two players play against each other, 2 virtual draws are added to the match. To make this work in the more general case, the number of virtual draws added between two players is, for every game, 1/N1 + 1/N2, where N1 is the number of games played by player 1 and N2 is the number of games played by player 2. The number of virtual draws only depends on the number of players, not on the number of games.
OK, this is different from what I thought. So this cannot be the explanation for the strange results I got on the ChessWar PromoDivision. What seems a bit inconsistent to me intuitively, though, is that the number of draws assigned to a link is independent of the number of games played along that link... unless that number happens to be zero. This causes a suspicious discontinuity, somehow.
So, if A plays the same number of games against C1-C1000, then 1,001 virtual draws will be added for every pair.
Well, with the virtual-draw assignment scheme as I understand it now, it is not the previous example that becomes the Achilles heel of BayesElo, but this two-gauntlets one: by assigning 1,001 virtual draws to each link, A and B each get connected to the C group by 1,001 virtual draws. If they played each C opponent only once, there were only 1,000 real games. So with a 990-10 result in favor of A (and B losing 990-10), the virtual draws make the effective result 1490.5-510.5. That only corresponds to some 200 Elo between A or B and the C-group average, so 400 between A and B.

But the score difference 99% vs 1% for two identical 1000-game gauntlets should allow you to conclude with a pretty large confidence that the Elo difference is over 1200.

Note that comparing gauntlet results is a very common practice, and that a gauntlet against 1000 different opponents with 1 game each would be considered a more reliable measure of Elo difference than one against 10 opponents with 10 games each.
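The arithmetic above can be checked with the standard logistic mapping from score fraction to Elo (a sketch; BayesElo's own model treats draws differently, so the numbers are only indicative):

```python
import math

def elo_from_score(s):
    # invert the logistic model s = 1 / (1 + 10**(-d/400))
    return 400 * math.log10(s / (1 - s))

# raw gauntlet result: 990 points out of 1000
print(round(elo_from_score(990 / 1000)))     # 798 Elo above the C average

# with 1001 virtual draws added (500.5 points per side): 1490.5 out of 2001
print(round(elo_from_score(1490.5 / 2001)))  # 186 Elo above the C average
```

So with the virtual draws, the two identical gauntlets end up roughly 2 × 186 ≈ 370 Elo apart, instead of the 1200+ the raw 99% vs 1% scores would support.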

It seems to me that what is needed here is an assignment of virtual draws that somehow 'anchors' the C players (perhaps to each other, or to a virtual player), without putting too much of a burden on the A and B players.
As I said in my previous message, this may cause a problem by adding too few virtual draws between two densely connected parts of the graph.
Well, I don't think the new example falls in that category.
It may be better to also have an independent prior, i.e. a virtual player with some draws against everybody. But when I tried to optimize the prior by cross-validation on WBEC data (and other data), I found that adding that virtual player did not improve predictions. I suppose that in every normal pairing situation, the current scheme of bayeselo should work.

It is always possible to invent artificial situations where the prior is wrong, anyway, whatever the prior.

If we are going to use the output of the program to find out whether one version is stronger than another, then it is good to have a strong prior. Maybe in the case you have in mind, when trying to make predictions from very few games, a weaker prior is better.

Rémi
Sven
Posts: 4052
Joined: Thu May 15, 2008 9:57 pm
Location: Berlin, Germany
Full name: Sven Schüle

Re: more on engine testing

Post by Sven »

Rémi Coulom wrote:Note, by the way, that on the same data, if you run the "covariance" command before listing the ratings, you'll get this instead:

Code: Select all

Rank Name   Elo    +    - games score oppo. draws 
   1 171    495  154  154   199   59%   415    0% 
Covariance takes into consideration the uncertainty of the link between the two groups.

Rémi
Hi Rémi,
could you please post the whole result after "covariance", or at least the most typical part of it, since I would like to see the error margins given for the other group's players, too.

Thanks,
Sven
Rémi Coulom
Posts: 438
Joined: Mon Apr 24, 2006 8:06 pm

Re: more on engine testing

Post by Rémi Coulom »

hgm wrote:Well, with the virtual-draw assignment scheme as I understand it now, it is not the previous example that becomes the Achilles heel of BayesElo, but this two-gauntlets one: by assigning 1,001 virtual draws to each link, A and B each get connected to the C group by 1,001 virtual draws. If they played each C opponent only once, there were only 1,000 real games. So with a 990-10 result in favor of A (and B losing 990-10), the virtual draws make the effective result 1490.5-510.5. That only corresponds to some 200 Elo between A or B and the C-group average, so 400 between A and B.

But the score difference 99% vs 1% for two identical 1000-game gauntlets should allow you to conclude with a pretty large confidence that the Elo difference is over 1200.

Note that comparing gauntlet results is a very common practice, and that a gauntlet against 1000 different opponents with 1 game each would be considered a more reliable measure of Elo difference than one against 10 opponents with 10 games each.
I agree that the prior is wrong in this situation.
It seems to me that what is needed here is an assignment of virtual draws that somehow 'anchors' the C players (perhaps to each other, or to a virtual player), without putting too much of a burden on the A and B players.
Yes, this is the idea I suggested in my previous posts. Your example convinces me that I should add this as an option to bayeselo.

Rémi
Rémi Coulom
Posts: 438
Joined: Mon Apr 24, 2006 8:06 pm

Re: more on engine testing

Post by Rémi Coulom »

Sven Schüle wrote: Hi Rémi,
could you please post the whole result after "covariance", or at least the most typical part of it, since I would like to see the error margins given for the other group's players, too.

Thanks,
Sven
They are the same for all the players.

Note that it is a symmetric (Gaussian) approximation of the posterior distribution. With 100% wins or losses, the confidence interval should be strongly asymmetric.
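Why the interval turns asymmetric at extreme scores can be seen from a quick numerical posterior (a sketch with a flat prior on the Elo difference and a 99-1 result; this is an illustration, not BayesElo's actual "exactdist" computation):

```python
import math

def win_prob(d):
    # logistic model: win probability at Elo advantage d
    return 1 / (1 + 10 ** (-d / 400))

# unnormalized posterior over the Elo difference for a 99-1 result,
# evaluated on a grid with a flat prior
grid = [d / 10 for d in range(-2000, 20001)]   # -200 .. +2000 Elo, 0.1 steps
post = [win_prob(d) ** 99 * (1 - win_prob(d)) for d in grid]
norm = sum(post)

def quantile(q):
    # first grid point where the cumulative posterior reaches q
    acc = 0.0
    for d, p in zip(grid, post):
        acc += p / norm
        if acc >= q:
            return d

lo, med, hi = quantile(0.025), quantile(0.5), quantile(0.975)
print(med - lo, hi - med)  # the upper margin is roughly twice the lower one
```

A Gaussian approximation would report a single symmetric margin here, which is exactly what the "covariance" and default outputs do.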

Bayeselo offers 4 different algorithms for computing confidence intervals. This is the list of options, from the least accurate and fastest to the most accurate and slowest:
  • Default: assume opponents' ratings are their true ratings, and a Gaussian distribution.
  • "exactdist": assume opponents' ratings are their true ratings, but do not assume a Gaussian distribution. This will produce asymmetric intervals, especially for very high or very low winning rates. Cost is linear in the number of players.
  • "covariance": assume a Gaussian distribution, but not that the opponents' ratings are their true ratings. This may be very costly if you have thousands of players, but it is more accurate than the default. The cost is cubic in the number of players (it is a matrix inversion).
  • "jointdist": computes a numerical estimation of the whole distribution. It is the most accurate, but the cost is exponential in the number of players. It may work for 3-4 players. You should reduce the resolution of the discretization for more players.
Rémi