Note that there is something called performance rating, and I thought in the beginning that the rating for chess programs is based on performance in all the games, because it is not logical to give a rating based on something else (otherwise the last games are going to be more important and you will get a lot of noise).
The data did not suggest otherwise: I looked at the percentages in your table and did not find anything that seems to contradict the ratings, and a bigger percentage means a bigger rating difference, based on your data.
Looking at the data from the first post, I see that in part of the cases you played 5120 games against a single opponent, and in part of the cases you played 5115 or 5119 games against the same opponent (opponent-21.7 got only 5115 games in one of the runs).
It is clear from that data that you did not repeat the same experiment twice, and I do not know if the error is small (from missing a few games) or if the error is big.
Without the PGN we can say nothing.
Uri
more on engine testing
- Uri Blass
- Posts: 10803
- Joined: Thu Mar 09, 2006 12:37 am
- Location: Tel-Aviv Israel
- bob
- Posts: 20943
- Joined: Mon Feb 27, 2006 7:30 pm
- Location: Birmingham, AL
Re: more on engine testing
Uri Blass wrote:
Note that there is something called performance rating, and I thought in the beginning that the rating for chess programs is based on performance in all the games.
<snipped>

Performance rating is based on the opponents' ratings. But in these tests, we have _zero_ ratings. So where do you start? Again, whether or not this is the best way to rate programs, I don't know. And I suspect what we are doing is basically flawed, in that computers and humans are completely different in their chess playing and results, yet we are using a system based on the human model. You will _rarely_ find three humans such that A beats B 75% of the time, B beats C 75% of the time, and C beats A 75% of the time (or pick the numbers so that the same idea holds true). But for computers it is not that uncommon at all, because they are simply different from humans.
Your other issue, the slightly different number of games, is something else, and something I am looking at. When a game ends, I open the specific PGN file that goes with that pair of opponents and append the new game to the end. On _very_ rare occasions, two games from the same "set" finish at exactly the same time. Both append their PGN to the original file, and only the last one to close the new file actually gets its game added. I have not tried to fix this as it is a bit sticky, and with so many games I have not been concerned. All that happens is that a game or two or three gets lost on occasion. All the games are played, but a couple out of 5000+ just don't get counted.
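For what it's worth, here is a minimal sketch (in Python, with a hypothetical file name; this is not Bob's referee code, just one standard way out) of how such an append race can be serialized with a POSIX advisory lock:

Code:

import fcntl

def append_game(pgn_path: str, game_text: str) -> None:
    """Append one finished game; the exclusive lock prevents two
    games that finish simultaneously from clobbering each other."""
    with open(pgn_path, "a") as f:
        fcntl.flock(f, fcntl.LOCK_EX)  # block until no other writer holds the file
        f.write(game_text)
        f.flush()                      # make sure the bytes land before unlocking
        fcntl.flock(f, fcntl.LOCK_UN)

# hypothetical usage: one call per finished game, from any number of processes
append_game("crafty-vs-opponent.pgn", '[Event "test run"]\n...\n')

With the lock held, the second finisher simply waits for the first, so no game is lost.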
- Uri Blass
- Posts: 10803
- Joined: Thu Mar 09, 2006 12:37 am
- Location: Tel-Aviv Israel
Re: more on engine testing
bob wrote:
performance rating is based on opponent's ratings. But in these tests, we have _zero_ ratings. So where do you start?
<snipped>

1) For the question of where do you start: you can give Crafty a fixed rating of 2500 and calculate the performance ratings of the opponents. After doing that, you can adjust the ratings so that the average of all ratings is 0.

2) For the case that A beats B 75% of the time, B beats C 75% of the time, and C beats A 75% of the time: even if I reduce it to 60%, I do not know the names of 3 chess programs for which this happens in a match of 100 games. I think that it practically does not happen between programs when we test with no opening books from fixed positions, and I see no reason that FRC is going to be different from normal chess from fixed positions.

Here is the complete rating list of FRC programs when the programs do not use an opening book:
http://www.computerchess.org.uk/ccrl/40 ... t_all.html

Based on looking at the results: in all cases where A scored at least 57 points out of 100 against B, A has a bigger rating than B.

Uri
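Uri's point 1) is easy to make concrete. A minimal sketch in Python, assuming the usual logistic Elo model; the opponent names and scores are hypothetical numbers, not taken from Bob's data:

Code:

import math

def elo_diff(score: float) -> float:
    """Elo difference implied by an expected score (0 < score < 1)."""
    return -400.0 * math.log10(1.0 / score - 1.0)

anchor = 2500.0                                    # Crafty pinned at 2500
scores_vs_crafty = {"opp-A": 0.55, "opp-B": 0.48}  # hypothetical scores vs Crafty
ratings = {"Crafty": anchor}
for name, s in scores_vs_crafty.items():
    ratings[name] = anchor + elo_diff(s)           # performance rating of the opponent

offset = sum(ratings.values()) / len(ratings)      # then translate so the average is 0
ratings = {name: r - offset for name, r in ratings.items()}
print(ratings)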
- Rémi Coulom
- Posts: 438
- Joined: Mon Apr 24, 2006 8:06 pm
Re: more on engine testing
Hi,
I have just noticed this thread, and don't have time to read it all. I'll do it later.
Maybe someone already said this, but you have to understand something very important about the confidence intervals of bayeselo: bayeselo has 3 different methods for computing confidence intervals. The default algorithm is the fastest, but not the most accurate: it acts as if the estimated ratings of the opponents were their true ratings. This will underestimate the uncertainty in general, and particularly for a program that played a lot of games against opponents who played only a few.
The most accurate method computes the whole covariance matrix. It has a cost quadratic in the number of players, so I do not use it as the default, because it would take too much time and memory when rating thousands of players.
So, if you are rating a few players and would like better confidence intervals, you should run the "covariance" command in bayeselo, after "mm".
Also, it is important to understand that a confidence interval in general has little meaning, because Elo ratings are relative. They are not estimated independently: there is some covariance in the uncertainty. That is to say, you cannot estimate the probability that one program is stronger than another by just looking at their confidence intervals. That is why I made the "los" command (for "likelihood of superiority"). "los" is a more significant indicator. In order to compare two programs, instead of running two separate experiments with two separate PGN files and looking at the confidence intervals, it is a lot better to have only one PGN file with all the games, and compare the two programs with the los command.
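To make the workflow concrete, here is a sketch of a bayeselo session. The "mm", "covariance" and "los" commands are the ones Rémi names above; "readpgn" and "ratings" are my assumption of the usual surrounding commands, and the # annotations are explanations, not bayeselo input:

Code:

readpgn all_games.pgn   # one PGN file containing every game of both programs
elo                     # enter the rating-estimation menu
mm                      # fit the ratings
covariance              # compute the full covariance matrix for better intervals
ratings                 # print ratings with confidence intervals
los                     # print the pairwise likelihood-of-superiority matrix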
Rémi
-
- Posts: 4052
- Joined: Thu May 15, 2008 9:57 pm
- Location: Berlin, Germany
- Full name: Sven Schüle
Re: more on engine testing
Hi Rémi,
I have two questions regarding BayesElo. The key point is, when comparing two engine versions A and A': do you get the same rating difference of A and A' regardless of whether you only include games of A/A' vs. some opponents, or you also add games between these opponents?
Take the following four scenarios, each resulting in a certain set of games being rated:
Sc.1) Player A meets K opponents B1 .. BK, N games each match, with equally distributed colors, so a total of K * N games gets rated. K > 1, N > 0 but "large enough to avoid trivial cases".
Sc.2) Player A plays a round-robin tournament together with K other players B1 .. BK, again with N games each match and equally distributed colors, so a total of (K * (K-1) / 2) * N games gets rated. K > 1, N > 0 as above. All games of A are exactly copied from scenario 1, so we only add games between B1 .. BK.
Sc.3) Exchange A by a different version A', and let A' play against B1 .. BK the same way as in Sc.1.
Sc.4) Modify the set of games rated in Sc.2 by exchanging all games of A by all games of A' as played in Sc.3.
Now here are my questions.
Q1: Does Sc.2 produce a different relative rating for player A than Sc.1 when using BayesElo, or is the rating of A independent from games between opponents?
Q2: If Q1 is answered with "yes, produces different rating" [edited], is the rating difference between A' in Sc.3 and A in Sc.1 the same as between A' in Sc.4 and A in Sc.2?
There were different opinions about this within the thread, so I am also interested in what the BayesElo expert thinks about it.
Sven
-
- Posts: 28354
- Joined: Fri Mar 10, 2006 10:06 am
- Location: Amsterdam
- Full name: H G Muller
Re: more on engine testing
Uri Blass wrote:
It does not seem to me logical.
If the rating has 51% probability to be 1500 and 49% probability to be 1600, then I think that 1549 is a better estimate than 1500, which is the maximum-likelihood rating (the example is simple; of course, in practice a rating is a continuous variable and not a discrete one).

Exactly!
You hit upon the core of a slumbering discussion I have with Rémi. Of course the case you quote as an example (1500 or 1600, with no probability in between) does not occur in practice. The rating model uses a smooth, monotonic function for score vs. Elo difference, and that means the likelihoods will be smooth, single-peaked functions of the rating differences.
But in some very common cases (like Swiss tournaments), the likelihoods can be very skewed distributions: their maximum can be quite far from their average, due to asymmetry. And if you use the maximum-likelihood estimate for the ratings in such a case, the expected average error, as well as the expected maximum error, in the score you predict with the aid of those ratings will be much larger than if you had taken the expectation value of the ratings under the likelihood function. So BayesElo performs like crap when you calculate ratings in such a case (e.g. the ChessWar Promo division).
It is not easy to formulate the problem in such a way that an exact mathematical solution is possible. Bayesian analysis is always tricky: it must assume a prior likelihood for the ratings. But how to assign that? In a sense the rating difference between two players is an arbitrary derived quantity; the fundamental quantity is the win probability. So would you take a prior likelihood that is flat on the score scale, or on the rating-difference scale? That is not the same, as expected scores are not linear functions of the rating difference. For more than two players it is not obvious what to do at all: for 4 players there are 6 win probabilities, but only 4 ratings, and only 3 independent rating differences. So the space of ratings is embedded as a contorted manifold in the higher-dimensional space of win probabilities, and it is not obvious what a 'flat probability distribution' means on such a non-flat space.
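HGM's point about skewed likelihoods is easy to see numerically. A small illustration of my own (not BayesElo's code), assuming the logistic Elo model and a flat prior on the rating difference: with only a handful of games, the posterior is visibly asymmetric, and its maximum and its mean disagree:

Code:

import numpy as np

def expected_score(d):
    """Expected score at Elo difference d (logistic model)."""
    return 1.0 / (1.0 + 10.0 ** (-d / 400.0))

d = np.linspace(-1000.0, 1000.0, 20001)  # candidate rating differences
dx = d[1] - d[0]
wins, losses = 4, 1                      # tiny sample -> skewed posterior
likelihood = expected_score(d) ** wins * (1.0 - expected_score(d)) ** losses
posterior = likelihood / (likelihood.sum() * dx)  # flat prior, normalized

d_max = d[np.argmax(posterior)]          # maximum of the posterior
d_mean = (d * posterior).sum() * dx      # expectation value of the posterior
print(f"maximum: {d_max:+.0f} Elo, mean: {d_mean:+.0f} Elo")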
- Rémi Coulom
- Posts: 438
- Joined: Mon Apr 24, 2006 8:06 pm
Re: more on engine testing
Sven Schüle wrote:
Hi Rémi,
I have two questions regarding BayesElo. The key point is, when comparing two engine versions A and A': do you get the same rating difference of A and A' regardless of whether you only include games of A/A' vs. some opponents, or you also add games between these opponents?

No. The games between these opponents will have an effect on their ratings, which in turn has an effect on the ratings of A and A'.

Sven Schüle wrote:
<snipped>
Q1: Does Sc.2 produce a different relative rating for player A than Sc.1 when using BayesElo, or is the rating of A independent from games between opponents?

Yes, a different rating, in general.

Sven Schüle wrote:
Q2: If Q1 is answered with "yes, produces different rating", is the rating difference between A' in Sc.3 and A in Sc.1 the same as between A' in Sc.4 and A in Sc.2?

No, for the same reason.
I repeat that the most important idea is that if you wish to compare A and A', you should evaluate their ratings from one single PGN file that contains all the games both played. And the graph of players linked by games should be connected. It makes no sense to compare two ratings obtained in two separate subsets of games.
Rémi
Last edited by Rémi Coulom on Wed Aug 06, 2008 2:00 pm, edited 1 time in total.
- Rémi Coulom
- Posts: 438
- Joined: Mon Apr 24, 2006 8:06 pm
Re: more on engine testing
hgm wrote:
<snipped>
You hit upon the core of a slumbering discussion I have with Rémi.

I also agree that computing the expected rating would be good. But I don't know how to do it. When the number of games is high and the winning rate is not close to 0% or 100%, there is virtually no difference between the expected rating and the maximum a posteriori.
Rémi
- Rémi Coulom
- Posts: 438
- Joined: Mon Apr 24, 2006 8:06 pm
Re: more on engine testing
Rémi Coulom wrote:
It makes no sense to compare two ratings obtained in two separate subsets of games.

I think I will explain this a little more, because it is the root of many misunderstandings in this thread, so it may deserve some more detail.
First, Elo ratings are not absolute: they indicate relative strength. They can be translated by any constant value, since the probabilities of winning are computed as a function of the difference of Elo ratings. In Bayeselo, ratings are normalized by translating them so that the average is zero.
This means that if you play two sets of games with the same pairing scheme, as Bob did, the ratings should converge to the same values as the number of games goes to infinity. So, in a way, the comparison of the ratings from those two experiments does make sense.
But the confidence interval given by Bayeselo is not at all a confidence interval for the limit rating that would be obtained when the number of games goes to infinity. If you wish to get an estimate of this limit, and thus make comparisons between two different sets of games played in similar conditions, you must also add the uncertainty of the evaluation of the constant that was used to translate the ratings.
This additional uncertainty should be of similar magnitude to the uncertainty of the players who played few games.
So I hope it is clear now why you should put all the games in the same PGN, and use los to estimate the significance of the comparison.
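For readers who want a feel for the numbers: a common closed-form approximation of the likelihood of superiority from a single head-to-head result is sketched below. This is my own sketch of the usual Gaussian approximation, not necessarily the exact computation the los command performs; draws carry no signal about which program is stronger, so only wins and losses enter:

Code:

import math

def los(wins: int, losses: int) -> float:
    """Approximate P(A is stronger than B) from A's result against B
    (Gaussian approximation; draws are ignored)."""
    if wins + losses == 0:
        return 0.5
    return 0.5 * (1.0 + math.erf((wins - losses) / math.sqrt(2.0 * (wins + losses))))

print(f"{los(60, 40):.3f}")  # 60 wins, 40 losses: about 0.977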
Rémi
-
- Posts: 4052
- Joined: Thu May 15, 2008 9:57 pm
- Location: Berlin, Germany
- Full name: Sven Schüle
Re: more on engine testing
Rémi Coulom wrote:
<snipped>
I repeat that the most important idea is that if you wish to compare A and A', you should evaluate their ratings from one single PGN file that contains all the games both played. And the graph of players linked by games should be connected. It makes no sense to compare two ratings obtained in two separate subsets of games.

Thanks, Rémi, for your clear statement.
An interesting question would of course be _why_ the inter-opponent games influence the ratings of A and A' (Uri, for instance, wrote that he thought the opposite was true), but most important for me is the conclusion from your statement, perhaps as advice for testing. I'll try to approach it now.
What I learn from your statement is that rating only the games from a gauntlet of an engine A against some opponents may give results that are less reliable than rating these games together with games between the opponents.
So one of the first steps would be to select a fixed set of opponents, and let them play a round-robin. The number of games per match, here and in all further steps, is chosen "high enough"; I intentionally leave open what this means exactly.
Then start testing versions of engine A. Play games of A against the selected opponents, and add these to the existing PGN file.
Then, at some point there is a version A'. Let it play games against the same opponents, not necessarily (but optionally) including also A.
Now calculate ratings. At this point, A and A' can be compared quite reliably.
The same can be repeated with further versions A'', A''' and so on, by always adding games of the newest version against the same opponents to the existing PGN file and then repeating the Elo calculation.
(Btw, this method appears much simpler and also better to me than my own proposal, since you need only one PGN file for everything, and since the games of A' are also used for the "new" rating of A.)
This should produce results that say more about changes in playing strength between versions of A than the method that omits the inter-opponent games.
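A small sketch of the bookkeeping this implies; the file names are hypothetical, and driving bayeselo through its standard input is my assumption of how one would script it. Keep one master PGN, append each new gauntlet to it, and re-rate the whole file:

Code:

import subprocess

def add_games(master_pgn: str, new_games_pgn: str) -> None:
    """Append a finished gauntlet (or the opponents' round-robin) to the master PGN."""
    with open(new_games_pgn) as src, open(master_pgn, "a") as dst:
        dst.write(src.read())

def rate(master_pgn: str) -> str:
    """Run bayeselo over the whole game set and return its report."""
    commands = f"readpgn {master_pgn}\nelo\nmm\ncovariance\nratings\nlos\nx\nx\n"
    result = subprocess.run(["bayeselo"], input=commands,
                            text=True, capture_output=True)
    return result.stdout

add_games("master.pgn", "A-prime-gauntlet.pgn")  # e.g. after testing version A'
print(rate("master.pgn"))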
At least this is what I understood from your post; do you agree?
If you do, I expect that it may have some impact (a positive one, I hope!) on the testing process of some engine authors who might have relied on the simpler method up to now. I would also propose that Bob try to apply this method, and come back with data showing whether his current observations are still present with that "hopefully improved" evaluation method.
Personally I want to add that I'm still not sure whether the effect Bob describes will really disappear this way. I just propose that he tries this direction, as one possible way to find an explanation.
Sven