lkaufman wrote:I won't argue about what Bayeselo "should" do, but the bottom line is that the assumed drawelo value has a major effect. Mark Watkins gave me an extreme example showing that for a 76% score, which "should" give a 200 Elo difference, Bayeselo with default values will output anywhere from about 160 to 240 Elo difference depending on the percentage of draws. So this is clearly a major issue for computer testing, because the actual draw percentage is not close to the figure implied by the defaults. Whether this is a flaw in BayesElo or just a feature I leave to others to debate. You may also be correct that the assumed two virtual draws significantly further compress the ratings.
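The effect Mark Watkins describes can be sketched numerically. Under the Bayeselo win/draw/loss model (as published by Rémi Coulom), a fixed drawelo makes the maximum-likelihood Elo difference depend on the draw rate, not just on the score. The sketch below is not BayesElo itself: it uses a crude grid search, ignores the first-move advantage, and assumes 97.3 as the default drawelo (my recollection of BayesElo's default; the exact value does not change the point).

```python
# Sketch: ML Elo difference under the Bayeselo model with drawelo held fixed.
# With f(x) = 1/(1 + 10^(x/400)):
#   P(win)  = f(drawelo - delta)
#   P(loss) = f(drawelo + delta)
#   P(draw) = 1 - P(win) - P(loss)
# (White's first-move advantage is ignored here for simplicity.)
import math

def f(x):
    return 1.0 / (1.0 + 10.0 ** (x / 400.0))

def mle_delta(wins, draws, losses, drawelo):
    """Grid-search the Elo difference that maximizes the log-likelihood."""
    best_d, best_ll = 0.0, -math.inf
    d = 0.0
    while d <= 400.0:
        pw, pl = f(drawelo - d), f(drawelo + d)
        pd = 1.0 - pw - pl
        ll = wins * math.log(pw) + draws * math.log(pd) + losses * math.log(pl)
        if ll > best_ll:
            best_d, best_ll = d, ll
        d += 0.5
    return best_d

# Two samples, both scoring 76% over 1000 games, with different draw rates:
low_draws  = mle_delta(wins=660, draws=200, losses=140, drawelo=97.3)  # 20% draws
high_draws = mle_delta(wins=560, draws=400, losses=40,  drawelo=97.3)  # 40% draws
print(low_draws, high_draws)  # same score, noticeably different Elo estimates
```

With these (made-up) numbers the drawish sample comes out tens of Elo points lower than the decisive one, despite the identical 76% score, which is exactly the sensitivity the quote complains about.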
BayesElo has a command "mm" which is usually called without arguments. If you call "mm 1 1" instead, the program will estimate the actual values of "advantage" and "drawelo" from the given PGN database and use them for the subsequent rating calculation. You can also use "mm 0 1" to have only the "drawelo" parameter recalculated while keeping the default for "advantage". I have tried both for the IPON games. There was a small rescaling of rating values: the top 20 engines got about 9-13 Elo points more than before, and the bottom 20 engines correspondingly fewer points. The difference between recalculating "advantage" and keeping its default value was negligible.
And the rating difference between Houdini 2.0 and Komodo 4 remained at 40 Elo points.
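For reference, the session Sven describes would look roughly like this (command names from the BayesElo documentation; "ipon.pgn" is a placeholder filename, and the annotations are mine, not BayesElo syntax):

```
readpgn ipon.pgn   # load the games
elo                # enter the rating-calculation menu
mm 1 1             # refit both "advantage" and "drawelo" before the MM fit
ratings            # print the rating list
```

Calling plain "mm" at the third step instead would keep the built-in defaults for both parameters.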
Regarding the "prior" and the two virtual draws assumed by BayesElo: I don't think this has any measurable influence given the high number of games we have at IPON, CCRL or other bigger engine rating systems. AFAIK these two draws are only added between opponents who have actually played each other, and since most games are played between engines which are not many hundreds of Elo points apart, the actual influence of two additional draws should be very small in practice.
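A back-of-the-envelope check supports this. The sketch below is not BayesElo's actual Bayesian machinery; it just asks how far two extra draws move a plain logistic Elo estimate for one pairing at a realistic sample size (the 1000 games and 60% score are made-up numbers):

```python
# Rough sanity check: shift in a plain logistic Elo estimate for one pairing
# when two virtual draws (+1 point, +2 games) are added to the tally.
import math

def elo_from_score(p):
    """Standard logistic Elo difference implied by a score fraction p."""
    return 400.0 * math.log10(p / (1.0 - p))

games, score = 1000, 0.60                         # e.g. 600/1000 vs one opponent
with_prior = (score * games + 1.0) / (games + 2)  # two virtual draws folded in
shift = elo_from_score(score) - elo_from_score(with_prior)
print(round(shift, 2))  # well under 1 Elo point at this sample size
```

The shift only becomes noticeable for pairings with very few games or a very lopsided score, which matches the observation that it hardly matters for large rating lists.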
lkaufman wrote:Others who have written in this thread about why averaging performances doesn't predict the resultant rating are ignoring the fact that they are calculated by BayesElo. The fact that they don't match outputs of other calculations such as EloStat is irrelevant if BayesElo is used.
Larry, I have two questions:
1) What is the idea you have in mind about "match performance ratings" in chess engine tournaments? How do you think they are, or should be, obtained, given one large PGN file of "all" games?
2) What difference between EloStat and BayesElo regarding the applicability of "averaging" do you have in mind?
Sven