more on engine testing

Discussion of chess software programming and technical issues.

Moderator: Ras

hgm
Posts: 28354
Joined: Fri Mar 10, 2006 10:06 am
Location: Amsterdam
Full name: H G Muller

Re: more on engine testing

Post by hgm »

Rémi Coulom wrote:When the number of games is high and the winning rate is not close to 0% or 100%, there is virtually no difference between expected rating and maximum a posteriori.
This is very tricky: sometimes there are 'hidden massacres' in the pairing network:

Suppose you have 200 players. A group of 100 of them plays a complete round robin (i.e. 99 games each), and they turn out to be all about equally strong, so the scores are normally distributed with sigma ~4, and all players had results between 38 and 61 (+/-3 sigma should be enough to catch all 100 players, as the probability that a player exceeds that by chance is <1%). The same is true for the other group of 100 players.

But now suppose that each of the players from group A has played one game against a player of group B (and vice versa), and that all the A players won that game. (This is pretty much what happens in the first round of a Swiss tournament, seeded by rating.)

The one extra game hardly affects the results of the individual players: they now each have 100 games, and for the A players the score was between 39% and 62%, for the B players between 38% and 61%. All very far from 100%. But B as a group was slaughtered by the A players, with a 100% score!

This case is totally mishandled by BayesElo. It will not succeed in getting the average Elo difference between the A and B groups anywhere near correct, while the 100 games played between the two should be enough to give a good clue (or in this case, a good lower limit for the difference).
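
To make the effect concrete, here is a small simulation sketch (Python; it ignores draws and assumes every within-group game is a fair coin flip, which is only an approximation of the scenario above):

Code: Select all

import random

random.seed(1)
N = 100  # players per group

def round_robin_scores(n):
    # complete round robin; each game decided by a fair coin flip
    scores = [0] * n
    for i in range(n):
        for j in range(i + 1, n):
            if random.random() < 0.5:
                scores[i] += 1
            else:
                scores[j] += 1
    return scores

a = round_robin_scores(N)  # group A, within-group results
b = round_robin_scores(N)  # group B, within-group results

a = [s + 1 for s in a]     # one cross game each, always won by the A player
games = 99 + 1             # total games per player

print("A scores: %.0f%%..%.0f%%" % (min(a) / games * 100.0, max(a) / games * 100.0))
print("B scores: %.0f%%..%.0f%%" % (min(b) / games * 100.0, max(b) / games * 100.0))
print("A vs B over the 100 cross games: 100-0")

Every individual score stays near 50%, yet the inter-group score is a massacre.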
bob
Posts: 20943
Joined: Mon Feb 27, 2006 7:30 pm
Location: Birmingham, AL

Re: more on engine testing

Post by bob »

Rémi Coulom wrote:Hi,

I have just noticed this thread, and don't have time to read it all. I'll do it later.

Maybe someone already said this, but you have to understand something very important about the confidence intervals of bayeselo: bayeselo has 3 different methods for computing confidence intervals. The default algorithm is the fastest, but not the most accurate: it acts as if the estimated ratings of the opponents were their true ratings. This will underestimate the uncertainty in general, and particularly for a program that played a lot of games against opponents who played only a few.

The most accurate method computes the whole covariance matrix. It has a cost quadratic in the number of players, so I do not use it as the default, because it would take too much time and memory when rating thousands of players.

So, if you are rating a few players and would like better confidence intervals, you should run the "covariance" command in bayeselo, after "mm".

Also, it is important to understand that a confidence interval in general has little meaning, because elo ratings are relative. They are not estimated independently: there is some covariance in the uncertainty. That is to say, you cannot estimate the probability that one program is stronger than another by just looking at their confidence intervals. That is why I made the "los" command (for "likelihood of superiority"). "los" is a more significant indicator. In order to compare two programs, instead of running two separate experiments with two separate PGN files and looking at the confidence intervals, it is a lot better to have only one PGN file with all the games, and to compare the two programs with the los command.
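
As an aside, for two programs that have met head to head there is a common stand-alone approximation of this idea (this is not bayeselo's computation, which works on the full rating model; the sketch below is just the usual normal approximation over decisive games only):

Code: Select all

import math

def los(wins, losses):
    # likelihood of superiority from decisive head-to-head games,
    # via a normal approximation; draws cancel out and are ignored
    if wins + losses == 0:
        return 0.5
    return 0.5 * (1.0 + math.erf((wins - losses) / math.sqrt(2.0 * (wins + losses))))

print(los(60, 40))  # ~0.977: the winner is stronger with ~97.7% likelihood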

Rémi
I was not contesting your computations. I was trying to show that matches between computers exhibit some unusual variability, and that even 25,000-game matches can produce completely different Elo estimates...

My issue is: how many games will it take to get a _very_ accurate estimate of a rating, so that I can determine whether a modest change is good or bad? And more recently the question has become not only "can I play enough games?" but "is it even possible to measure small changes, given how much randomness there is in games played between computers?"
bob
Posts: 20943
Joined: Mon Feb 27, 2006 7:30 pm
Location: Birmingham, AL

Re: more on engine testing

Post by bob »

Let me make sure I understand your "graph" comment.

I am playing matches between Crafty and 5 different opponents, a total of roughly 25,000 games. I do use one large PGN file, but it would be _very_ difficult (if not impossible) to properly order the games into the sequence in which they were physically played. What I end up doing is creating small PGN files of the form matchX.Y.Z, where X is the number of the opponent, Y is the position number, and Z is the number of the individual match. To explain Z: I run 40 positions, 4 games per position against each opponent, and then 32 of those tests. Z is a number between 1 and 32.

I may have Y and Z backward, but that is most likely irrelevant here. When I combine these into a single PGN file, the original files are lexically ordered by filename, which is probably not the order in which the games were actually played. But when I run a second run with Crafty' (rather than Crafty), the ordering is at least consistent between the two different tests.

Should I combine the two PGN collections into one to run the test? Or is running two separate tests and comparing the Elos the solution? Something tells me that combining will produce different results than two separate tests.

edit:

Another question (and I can/will test this for my own benefit): do you believe that the results would be better analyzed if I did the following:

Use Crafty and Crafty'. Rather than a separate test for each, combine Crafty and Crafty' plus the 5 opponents, and play a full round robin among all 7 programs, then compare the final ratings for Crafty/Crafty'. Do you believe that fewer games would be needed? For example, in the two ugly results I posted at the top of this thread (the only two big runs in the post where the ratings were significantly different), that is 40 positions times 4 games per position, or 160 games total for a single "mini-match" between two opponents. With a mix of 7 programs, this requires 720 mini-matches to play everyone against everyone, or a total of 115,200 games. I can clearly save by playing only 2 games per position to alternate colors, but that is still almost 60,000 games.
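
As a sanity check on this kind of arithmetic: a round robin among n programs has n*(n-1)/2 pairings, and each pairing costs positions x games-per-position x repeats games. A throwaway sketch (Python; the parameter values are just the ones described earlier in the thread, so the totals are only illustrative):

Code: Select all

def round_robin_cost(engines, positions, games_per_position, repeats):
    # number of pairings in a complete round robin, and total games played
    pairings = engines * (engines - 1) // 2
    return pairings, pairings * positions * games_per_position * repeats

print(round_robin_cost(7, 40, 4, 32))  # (21, 107520)
print(round_robin_cost(7, 40, 2, 32))  # (21, 53760)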

It seems important to use this many opponents, if not more, so the only way to make it tractable for the non-cluster user would be to use fewer starting positions, which would also seem to leave gaps in the testing results.
bob
Posts: 20943
Joined: Mon Feb 27, 2006 7:30 pm
Location: Birmingham, AL

Re: more on engine testing

Post by bob »

Hmmm... that is what I asked before I saw your post. The good news is that the "round robin" among the opponents of Crafty/Crafty' would be a one-off, and the results can be saved. Then it becomes a matter of dealing with 3 PGN files:

1. PGN of all opponents vs all opponents (would not change until a new opponent was added to the mix).

2. PGN for crafty vs each opponent

3. PGN for crafty' vs each opponent.

Then combine all 3 and run BayesElo on them. If I make another change to Crafty, I redo 3, combine the original 1+2 with the new 3, and run BayesElo again.
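
A minimal way to automate that workflow is to pipe a command script into bayeselo (a sketch; the filenames are hypothetical, and the command list assumes bayeselo's usual readpgn/elo/mm/ratings commands plus the covariance and los commands mentioned in this thread):

Code: Select all

import subprocess

commands = "\n".join([
    "readpgn opponents_rr.pgn",      # 1. all opponents vs all opponents
    "readpgn crafty_gauntlet.pgn",   # 2. Crafty vs each opponent
    "readpgn crafty2_gauntlet.pgn",  # 3. Crafty' vs each opponent
    "elo",         # enter the rating module
    "mm",          # maximum-likelihood fit
    "covariance",  # full covariance matrix for better intervals
    "ratings",     # print the rating list
    "los",         # likelihood-of-superiority matrix
    "x",           # leave the rating module
    "x",           # quit
]) + "\n"

result = subprocess.run(["bayeselo"], input=commands,
                        capture_output=True, text=True)
print(result.stdout)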

This has to be run on my cluster; too many games otherwise. I am working on my referee program now, to add the ability to adjudicate a game as won/lost/drawn, as some of the opponents use the "Result" command to supply bogus results that screw things up. While waiting on Rémi's response, I am going to prepare the referee and then try producing this data. It will take a couple of days to play that many games to "seed" the Elo PGN file...

And perhaps this will answer the question originally posed with something other than "stampee foot, bug, stampee foot, incompetent researcher, stampee foot, impossible, stampee foot, cluster broken, stampee foot, bad software, stampee foot, bad assumptions, etc."

Amazing what a _rational_ discussion can uncover, if indeed this solves the problem, which is yet to be seen of course.
Rémi Coulom
Posts: 438
Joined: Mon Apr 24, 2006 8:06 pm

Re: more on engine testing

Post by Rémi Coulom »

Sven Schüle wrote: What I learn from your statement is that rating only the games from a gauntlet of an engine A against some opponents may give results that are less reliable than rating these games together with games between the opponents.
No, this is not what I meant. With the same amount of computing power, I believe that the opposite may be true, but I am not sure.
Then, at some point, there is a version A'. Let it play games against the same opponents, not necessarily (but optionally) also including A.
It is probably not a good idea to mix self-play with play against opponents.

What I mean is that if you wish to compare A and A', you should
  • Play plenty of games between A and B1, B2, B3, B4
  • Play plenty of games between A' and the same B1, B2, B3, B4
  • Estimate the ratings of A and A' with Bayeselo by combining all the games into one single PGN file. Don't do two runs of bayeselo on the two sets of files.
I hope it is clear now.

Rémi
Rémi Coulom
Posts: 438
Joined: Mon Apr 24, 2006 8:06 pm

Re: more on engine testing

Post by Rémi Coulom »

hgm wrote:
Rémi Coulom wrote:When the number of games is high and the winning rate is not close to 0% or 100%, there is virtually no difference between expected rating and maximum a posteriori.
This is very tricky: sometimes there are 'hidden massacres' in the pairing network:

Suppose you have 200 players. A group of 100 of them plays a complete round robin (i.e. 99 games each), and they turn out to be all about equally strong, so the scores are normally distributed with sigma ~4, and all players had results between 38 and 61 (+/-3 sigma should be enough to catch all 100 players, as the probability that a player exceeds that by chance is <1%). The same is true for the other group of 100 players.

But now suppose that each of the players from group A has played one game against a player of group B (and vice versa), and that all the A players won that game. (This is pretty much what happens in the first round of a Swiss tournament, seeded by rating.)

The one extra game hardly affects the results of the individual players: they now each have 100 games, and for the A players the score was between 39% and 62%, for the B players between 38% and 61%. All very far from 100%. But B as a group was slaughtered by the A players, with a 100% score!
Yes, OK, this is also a case where it does not work. Let's say it holds as long as no two parts of the graph are connected only by results close to 0% or 100%.
This case is totally mishandled by BayesElo. It will not succeed in getting the average Elo difference between the A and B groups anywhere near correct, while the 100 games played between the two should be enough to give a good clue (or in this case, a good lower limit for the difference).
Well, I don't see how the behaviour of BayesElo is incorrect. What would be a "correct" way to handle a 100% winning rate? In BayesElo, this is handled by the prior. It looks OK to me. If you feed this situation to Bayeselo, it will give a big difference of playing strength between the losing group and the winning group. I cannot see what else it could do.

If what you mean is that the confidence intervals are completely meaningless in this situation, then I completely agree. But the los matrix should give good results.

Rémi
Rémi Coulom
Posts: 438
Joined: Mon Apr 24, 2006 8:06 pm

Re: more on engine testing

Post by Rémi Coulom »

Bob:

The order of games inside each PGN file does not matter. The output of bayeselo should be the same, whatever the order. Any difference is either a bug or a small numerical error.

If you run two gauntlets, one with Crafty vs plenty of opponents, the other with Crafty' vs the same opponents, then you'll get much more meaningful results by combining the two sets of games into one single PGN file, rather than using two separate runs of bayeselo over each game set.

The reason for the additional accuracy is twofold:
  • The opponents will have played twice as many games, so their ratings will be more accurate.
  • With two separate runs, the offset used to normalize the ratings is arbitrary, and its value may vary by as much as the uncertainty in the rating of the most uncertain players.
It would be interesting to do this on the data you posted in your first message. Just give a different name to crafty in each pgn file, and feed them all to bayeselo. Then run the "los" command to get the matrix of likelihood of superiority. Of course you may sometimes get more than 95%, or even 99%, but I would be surprised if you get 99% much more than 1% of the time.
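
A minimal sketch of that renaming step (Python; the filenames and the two new names are invented for the example):

Code: Select all

import re

def rename_player(pgn_text, old, new):
    # rewrite [White "..."] / [Black "..."] tags; assumes standard PGN headers
    pattern = re.compile(r'\[(White|Black) "%s"\]' % re.escape(old))
    return pattern.sub(lambda m: '[%s "%s"]' % (m.group(1), new), pgn_text)

combined = []
for fname, name in [("crafty_run.pgn", "Crafty-old"),
                    ("crafty_prime_run.pgn", "Crafty-new")]:
    with open(fname) as f:
        combined.append(rename_player(f.read(), "Crafty", name))

with open("combined.pgn", "w") as f:
    f.write("\n".join(combined))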

I am not sure that playing games between the opponents would be an efficient use of CPU time.

Rémi
hgm
Posts: 28354
Joined: Fri Mar 10, 2006 10:06 am
Location: Amsterdam
Full name: H G Muller

Re: more on engine testing

Post by hgm »

Rémi Coulom wrote:Well, I don't see how the behaviour of BayesElo is incorrect. What would be a "correct" way to handle a 100% winning rate? In BayesElo, this is handled by the prior. It looks OK to me. If you feed this situation to Bayeselo, it will give a big difference of playing strength between the losing group and the winning group. I cannot see what else it could do.
But that is exactly the point: when you feed it this data, it does not give you a big difference between the two. Perhaps 300 Elo or so, while it is obvious that the difference should be at least 900 Elo.

At least, that is what happens with the default prior setting; if you take a near-zero prior, things improve. But without a prior, BayesElo basically ceases to be BayesElo.

The problem is that BayesElo distributes the virtual draws equally over all pairs of players. And with 200 players, and 2 virtual draws per player, there are 400 virtual draws. As approximately half of the 200*199/2 pairs of players are from different groups (one in A, the other in B), BayesElo counts about 200 virtual draws between the groups. So a result of 100-0 is counted as if it were 200-100. And that doesn't look so bad for group B...
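
In numbers, for the inter-group games alone (a naive logistic score-to-Elo conversion, not BayesElo's full fit, which also sees the within-group games):

Code: Select all

import math

cross_wins = 100             # real inter-group games, all won by group A
virtual_draws_between = 200  # rough count of inter-group virtual draws

# effective score for A over the inter-group games once the draws are mixed in
s = (cross_wins + 0.5 * virtual_draws_between) / (cross_wins + virtual_draws_between)

elo = -400.0 * math.log10(1.0 / s - 1.0)
print("effective score %.1f%%, naive Elo gap ~%.0f" % (100.0 * s, elo))

Instead of the huge gap the raw 100-0 result would imply, the diluted 200-100 score corresponds to only about 120 Elo by this naive conversion.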
Sven
Posts: 4052
Joined: Thu May 15, 2008 9:57 pm
Location: Berlin, Germany
Full name: Sven Schüle

Re: more on engine testing

Post by Sven »

Rémi Coulom wrote:
Sven Schüle wrote: What I learn from your statement is that rating only the games from a gauntlet of an engine A against some opponents may give results that are less reliable than rating these games together with games between the opponents.
No, this is not what I meant. With the same amount of computing power, I believe that the opposite may be true, but I am not sure.
Now I'm lost again ... What do you mean by "same amount" here? When adding inter-opponent games, I would expect that each of these matches has the same number of games as each of the matches "A vs. opponent". So there is additional computing power necessary (e.g., for 5 opponents, 10*80 games in addition to the other 5*80), but only once at the beginning, since these inter-opponent games can always be reused when another new version of A comes up for testing.

Regarding the reliability of the results, I thought that at least the confidence intervals of the opponents would become better after adding all inter-opponent games, since the opponents now have many more games against different programs, instead of having played against only one (A). For the confidence interval of A, my hope was that it would become better, too.

Perhaps I will have to do some experiments on my own to understand these things better under practical conditions.

Sven
Rémi Coulom
Posts: 438
Joined: Mon Apr 24, 2006 8:06 pm

Re: more on engine testing

Post by Rémi Coulom »

hgm wrote:
Rémi Coulom wrote:Well, I don't see how the behaviour of BayesElo is incorrect. What would be a "correct" way to handle a 100% winning rate? In BayesElo, this is handled by the prior. It looks OK to me. If you feed this situation to Bayeselo, it will give a big difference of playing strength between the losing group and the winning group. I cannot see what else it could do.
But that is exactly the point: when you feed it this data, it does not give you a big difference between the two. Perhaps 300 Elo or so, while it is obvious that the difference should be at least 900 Elo.

At least, that is what happens with the default prior setting; if you take a near-zero prior, things improve. But without a prior, BayesElo basically ceases to be BayesElo.

The problem is that BayesElo distributes the virtual draws equally over all pairs of players. And with 200 players, and 2 virtual draws per player, there are 400 virtual draws. As approximately half of the 200*199/2 pairs of players are from different groups (one in A, the other in B), BayesElo counts about 200 virtual draws between the groups. So a result of 100-0 is counted as if it were 200-100. And that doesn't look so bad for group B...
In the case you describe, I don't think there is any problem with the behavior of bayeselo. If there is a result of 100-0 between the two groups and 200 virtual draws, this means that every player has played only two games. It is not reasonable to give a huge rating difference based on only two games. With such a small number of games, the two groups do not exist as groups; in order to make them groups, they need to play several games inside each group, so that the ratings of all the members become strongly correlated.

Imagine the extreme case where every player has played only one game, against the other group. In this case we have a 100-0 score between the two groups. But it cannot be treated like a 100-0 score between two individuals: the ratings of the players inside each group are completely uncorrelated. So, the two virtual draws will produce a reasonable strength difference.

In order to have this correlation, we need more games inside each group. Just one additional game, as in your situation, produces extremely weak correlation, especially for those players who won or lost. The more games they play inside each group, the weaker the effect of the prior on the 100 games that link the two groups. So the two groups will move farther apart as they play more games inside each group.

So, I believe this is a good example of a situation where bayeselo behaves particularly well, especially if you compare it to Elostat. As soon as each group has played enough games to be strongly correlated, they will reach that 900-Elo-point difference.

In fact, I would say that if there is a problem with bayeselo in this situation, it is rather the reverse of what you said: as the number of games inside each group goes to infinity, the difference in playing strength between the two groups will go to infinity, although its evaluation is based on only 100 games.

In order to prevent this from happening, it might be a good idea to add an "absolute prior" to bayeselo. For instance, add a virtual player with a virtual draw against everybody. That may be a good additional prior option to add to the program.
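
In data form, that suggestion could look something like this (a sketch; the game representation and the anchor name are invented for illustration):

Code: Select all

# each game as (player1, player2, score_for_player1): 1 win, 0 loss, 0.5 draw
def add_absolute_prior(games, players, anchor="virtual_anchor"):
    # one virtual player that has drawn one virtual game against every
    # real player, tying all ratings to a common reference point
    return games + [(anchor, p, 0.5) for p in players]

games = [("A1", "B1", 1.0), ("A2", "B2", 1.0)]   # toy data
players = ["A1", "A2", "B1", "B2"]
print(add_absolute_prior(games, players))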

Rémi