Modern Times wrote: ↑Sat Oct 16, 2021 7:09 pm
CCRL did reduce its ratings by 100 Elo a few years back because we felt they were too high.
Raphexon wrote: ↑Sun Oct 17, 2021 12:05 am
Bayesianelo is nice when a game has no or few draws.
Yes, I think that the issue with BayesElo is that it assumes a constant draw percentage between equally rated opponents (a parameter setting, but once set, a constant for all games rated). But as we know, the draw percentage between 3600-rated engines is much higher than the draw percentage between 2600-rated engines. So it is fundamentally flawed for chess, where there are a lot of draws and the draw percentage varies with level. Ordo doesn't have this problem.
Bayeselo is my preference. Some say that it compresses ratings, I'd turn that around and say that Ordo expands them. I don't think one is better than the other, they both have sound statistical grounding, but they work differently.
Daniel Shawul had this to say about the two last year:
forum3/viewtopic.php?f=7&t=73761&p=8413 ... lo#p841358
by Daniel Shawul » Sat Apr 25, 2020 2:08 pm
Why are you using Ordo anyway? Clearly it has inferior algorithms to Bayeselo, which is based on a Bayesian approach.
Here https://www.remi-coulom.fr/Bayesian-Elo/ Remi discusses some of its advantages over the prior algorithm, EloStat.
For the calculation of accurate standard deviations, there is an option to calculate the covariance matrix (a bit slower); it is not the default.
Ordo probably uses Monte Carlo sampling of some sort for that, but in Bayeselo you find better theory and algorithms.
Bayeselo does take the home-field advantage (color) into consideration, but originally did not take the draw ratio into account.
It was later extended to do so using the Davidson model, which turned out to be the best of the three draw models tried.
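For context, the Davidson draw model mentioned above adds a draw term proportional to the geometric mean of the two players' Bradley-Terry strengths. A minimal sketch of the win/draw/loss probabilities it produces (the Elo-to-strength conversion and the value of the draw parameter `nu` are illustrative assumptions on my part, not Bayeselo's actual code):

```python
import math

def davidson_probs(r1, r2, nu=0.3):
    """Win/draw/loss probabilities for player 1 under the Davidson model.

    r1, r2: Elo-like ratings; nu: draw parameter (larger nu = more draws).
    """
    # Convert Elo ratings to Bradley-Terry strengths.
    g1 = 10 ** (r1 / 400)
    g2 = 10 ** (r2 / 400)
    # Davidson draw term: proportional to the geometric mean of strengths.
    d = nu * math.sqrt(g1 * g2)
    total = g1 + g2 + d
    return g1 / total, d / total, g2 / total
```

Note that for equal opponents the draw probability is nu / (2 + nu), so nu directly controls the draw rate, which is what the extension fits from the data instead of assuming a constant.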
Remi has a lot of history inside the computer Go community, and it shows...
Ordo is nicer for (modern computer) chess.
But all current rating systems have the problem that rating differences contract with longer time limits (human or engine, doesn't matter) due to more draws, and also that when opening positions are stipulated (with color reversal) the amount of one side's advantage also affects rating spread. Ideally a rating system should be immune (in terms of overall spread) to time control and opening choice in reversal play. I don't know of any that qualify, but I do have a proposal that might solve or at least dramatically reduce these two problems.

I'm assuming reversal testing of paired games with specified start positions. My proposal is simply to discard from the database to be rated all pairs of games that result in a 1 to 1 tied score (whether due to two draws or two wins), then run the remaining games through Ordo. This is completely fair, but will obviously result in much larger rating differences.

Extremely drawish openings will be mostly tossed due to many tied matches, while easily won positions will also be tossed for the same reason. What's left (when fairly equal opponents are playing) will be openings where both a win and a draw are plausible results.

If you want the ratings to resemble human Elo, just scale the differences (from the reference engine) by whatever percentage is needed. With this method I am certain that we will see continued progress at a good rate for years to come; a slightly better engine should win most of the matches that are not tied, even if most of them are tied and discarded. The Stockfish crowd should love this idea, they already talk about their results in such two-game matches rather than in individual games. It could even be used for human chess where pairs of games are played.
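The discard step in the proposal is simple enough to sketch. Assuming a hypothetical data layout where each reversal pair is recorded as the first engine's score in each of the two games (1.0 win, 0.5 draw, 0.0 loss), the filter might look like:

```python
def filter_tied_pairs(game_pairs):
    """Drop reversal pairs whose two-game score is tied 1-1.

    game_pairs: list of (score_game1, score_game2), each the first
    engine's score: 1.0 win, 0.5 draw, 0.0 loss.
    Returns the individual game scores from the surviving pairs,
    ready to be fed to a rating tool such as Ordo.
    """
    kept = []
    for g1, g2 in game_pairs:
        # A pair of draws (0.5 + 0.5) and a win apiece (1.0 + 0.0)
        # both sum to 1.0 and are tossed; decisive pairs survive.
        if g1 + g2 != 1.0:
            kept.extend([g1, g2])
    return kept
```

Both kinds of tie are caught by the same sum test, which matches the proposal: the filter doesn't care whether the 1-1 came from two draws or from each side winning once with colors reversed.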