CCRL question re bayeselo

lkaufman · Post by **lkaufman** » Fri Apr 28, 2023 2:47 am

I know that CCRL uses BayesElo by default. My question is whether it uses the default values for eloAdvantage and eloDraw, which are 32.8 and 97.3 respectively. Most likely the answer is "yes". Also, is there any scaling factor used to make the rating differences closer to normal Elo ratings? The issue is that those values were calculated a long time ago (around 2006, I believe), from data at a time when engines were MUCH weaker than now. If the current dataset were analyzed, I would expect these values to be much higher, and if those higher values were used I think we would see a much larger spread of the Elo ratings than we do now. This may be part of the reason that the CCRL lists seem to show no progress even though CEGT (which uses ORDO) does show progress. What really exaggerates the effect is that eloDraw would be much higher still if we only looked at games by NNUE engines. This is the biggest problem when comparing BayesElo ratings to regular (i.e. ORDO-calculated) Elo ratings. The assumption of a constant value for eloAdvantage and eloDraw is incompatible with normal Elo ratings; it leads to a flattening of the curve at the top.
This doesn't mean that BayesElo is inferior; it may very well be a better system. It's just not going to be valid for predicting what FIDE ratings these engines might get (even with adjustment for level and scaling issues). It may well be that with BayesElo (with current default values for the params) we are very near the maximum possible Elo, but with ORDO (or with BayesElo with the params continuously updated based on the top engine results) we might still have considerable room to grow.
One very interesting point about BayesElo with current defaults: if I understand it correctly, the ratings it generates will basically "assume" that a 1.00 advantage in Stockfish 15.1 (or Dragon 3.2), which should mean 50% actual wins, always corresponds to eloDraw, which is set to 97.3. That means that an advantage right on the win/draw line, such as (roughly) White starting with two moves instead of one, should always be worth 97.3 Elo regardless of the level of the engines. Of course with normal Elo ratings that would steadily rise as the ratings rise; draw odds won't matter much at 1200 level but, as we know, at 3600 level they are huge. This in turn means that with BayesElo, any handicap should have a clearly defined Elo value, which would be 97.3 multiplied by the eval of the handicap in question. Of course this is only as good as the values generated by the engines; in my opinion they are accurate enough up to knight odds (for Stockfish 15.1 or Dragon 3.2), but start to break down beyond that, presumably due to net training not caring much once a position passes 99% or so win probability.
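The 50% claim can be checked numerically. The following is a minimal Python sketch, assuming the win/draw/loss model given in the BayesElo write-up (a logistic curve shifted by eloAdvantage and eloDraw). The function name `bayeselo_probs` and its argument names are made up for illustration and are not part of the BayesElo program; the defaults are the 32.8 and 97.3 quoted above.

```python
def bayeselo_probs(delta, advantage=32.8, draw_elo=97.3):
    """Return (P(White wins), P(draw), P(Black wins)).

    delta is White's rating minus Black's rating, in Elo.
    Assumed model: f(x) = 1 / (1 + 10^(x/400)),
      P(White wins) = f(draw_elo - delta - advantage)
      P(Black wins) = f(draw_elo + delta + advantage)
      P(draw)       = 1 - P(White wins) - P(Black wins)
    """
    def f(x):
        return 1.0 / (1.0 + 10.0 ** (x / 400.0))

    p_win = f(draw_elo - delta - advantage)
    p_loss = f(draw_elo + delta + advantage)
    return p_win, 1.0 - p_win - p_loss, p_loss


# Ignore the colour advantage and give White an edge exactly equal to eloDraw:
# under this model White then wins half of the games.
p_win, p_draw, p_loss = bayeselo_probs(97.3, advantage=0.0)
print(f"win={p_win:.3f} draw={p_draw:.3f} loss={p_loss:.3f}")
```

Under this model the edge needed for 50% wins is always eloDraw, whatever the absolute ratings are, which is the level-independence described above.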
Math guys, please let me know if I'm making any mistakes above.

Graham Banks · Post by **Graham Banks** » Fri Apr 28, 2023 8:07 am

Default values are used.

bayeselo_rating_options = mm\n

We did try alternate values. An issue was that the pure list database and the complete database would theoretically require different values. But we don't have those now.

bayeselo_rating_options = prior 0.1\nmm 1 1\nscale 1\n
bayeselo_rating_options = mm 1 1\nscale 1\n
bayeselo_rating_options = advantage 31.7407\ndrawelo 117.861\nmm\n

Worth noting also that there is no problem on the FRC list in measuring gains from Stockfish. It is blitz and 1CPU though.

lkaufman · Post by **lkaufman** » Fri Apr 28, 2023 6:22 pm

Graham Banks wrote: ↑Fri Apr 28, 2023 8:07 am Default values are used.

bayeselo_rating_options = mm\n

We did try alternate values. An issue was that the pure list database and the complete database would theoretically require different values. But we don't have those now.

bayeselo_rating_options = prior 0.1\nmm 1 1\nscale 1\n
bayeselo_rating_options = mm 1 1\nscale 1\n
bayeselo_rating_options = advantage 31.7407\ndrawelo 117.861\nmm\n

Worth noting also that there is no problem on the FRC list in measuring gains from Stockfish. It is blitz and 1CPU though.

Thanks, but you quote the drawelo value as being 117.861, whereas everything I've seen shows the default to be 97.3. Perhaps you have an updated version with drawelo calculated from more recent data? If not, can you account for the discrepancy between 117.861 and 97.3? That's a big difference! The 117.861 value seems much more reasonable to me, though still too low for current engines.

Modern Times · Post by **Modern Times** » Fri Apr 28, 2023 11:06 pm

lkaufman wrote: ↑Fri Apr 28, 2023 6:22 pm Thanks, but you quote the drawelo value as being 117.861, whereas everything I've seen shows the default to be 97.3. Perhaps you have an updated version with drawelo calculated from more recent data? If not, can you account for the discrepancy between 117.861 and 97.3? That's a big difference! The 117.861 value seems much more reasonable to me, though still too low for current engines.

Those numbers were from the blitz list more than 10 years ago. How they were derived is lost in time. The point was not the numbers themselves but that our processing scripts can pass these parameters to bayeselo, and we have experimented in the past.

lkaufman · Post by **lkaufman** » Fri Apr 28, 2023 11:49 pm

Modern Times wrote: ↑Fri Apr 28, 2023 11:06 pm
lkaufman wrote: ↑Fri Apr 28, 2023 6:22 pm Thanks, but you quote the drawelo value as being 117.861, whereas everything I've seen shows the default to be 97.3. Perhaps you have an updated version with drawelo calculated from more recent data? If not, can you account for the discrepancy between 117.861 and 97.3? That's a big difference! The 117.861 value seems much more reasonable to me, though still too low for current engines.
Those numbers were from the blitz list more than 10 years ago. How they were derived is lost in time. The point was not the numbers themselves but that our processing scripts can pass these parameters to bayeselo, and we have experimented in the past.

Ten years is an eternity in computer chess. Isn't it time to update those parameters based on current data? I would expect a large increase in the drawelo, which I think would result in a significant spreading of the ratings, perhaps making the spread comparable to Elo-based lists such as CEGT.

Modern Times · Post by **Modern Times** » Sat Apr 29, 2023 12:29 am

They aren't used, we use bayeselo defaults. They were just experiments conducted more than 10 years ago.

If there were values we should use that would give more accurate ratings, we'd use them. And they would be different for all 3 lists. But what are they, and would the experts agree? Certainly above my pay grade. In the absence of that, defaults are the only option. I've nothing further to add.

lkaufman · Post by **lkaufman** » Sat Apr 29, 2023 1:34 am

Modern Times wrote: ↑Sat Apr 29, 2023 12:29 am They aren't used, we use bayeselo defaults. They were just experiments conducted more than 10 years ago.

If there were values we should use that would give more accurate ratings, we'd use them. And they would be different for all 3 lists. But what are they, and would the experts agree? Certainly above my pay grade. In the absence of that, defaults are the only option. I've nothing further to add.

Thanks, now I understand: you actually use the 97.3 value for drawElo, not the 117+ value. Yes, they should be different for the 3 lists, but I think that if you derived the values (using the methods described in the BayesElo description) from the current blitz list, that would be good enough, and far better than using values derived from data nearly 20 years old. I don't think there would be disagreement on how to get the values if you agreed to use one specific list for the data. The draw rate for that set is 38%; since some pairings are mismatches to some degree, the draw rate between near-equal opponents is higher, perhaps 40% or a bit more. As I understand things, a drawElo of 120 would mean 1/3 draws between equals, while a drawElo of 192 would mean half draws between equals, so this dataset would probably give a drawElo in between these numbers, perhaps in the 150 to 160 range. Compared to 97.3, that is an enormous difference! The exact number you use isn't so important, but any number in or near the 150 to 160 range should produce much more accurate ratings than using 97.3, which was from a time when draws were relatively uncommon in engine play. Anyway, just as an experiment you could run the data with 150 or 160 to see what the effect might be.
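The relationship between drawElo and the draw rate between equals can be written down directly. Here is a rough Python sketch, assuming the BayesElo model with equal ratings and with eloAdvantage taken as zero for simplicity (a nonzero advantage changes the numbers only slightly). The function names are invented for this illustration.

```python
import math


def draw_rate_between_equals(draw_elo):
    """Draw probability for two equally rated engines, advantage assumed 0.

    Each side wins with probability 1 / (1 + 10^(draw_elo/400)),
    and the remainder is the draw probability.
    """
    return 1.0 - 2.0 / (1.0 + 10.0 ** (draw_elo / 400.0))


def draw_elo_from_draw_rate(draw_rate):
    """Invert the formula above: drawElo that yields the given draw rate."""
    if not 0.0 <= draw_rate < 1.0:
        raise ValueError("draw_rate must be in [0, 1)")
    return 400.0 * math.log10((1.0 + draw_rate) / (1.0 - draw_rate))


for d in (97.3, 120, 150, 160, 192):
    print(f"drawElo {d:>5}: draw rate {draw_rate_between_equals(d):.1%}")

for rate in (0.38, 0.40, 0.43):
    print(f"draw rate {rate:.0%}: drawElo {draw_elo_from_draw_rate(rate):.1f}")
```

With these assumptions a 40% draw rate between equals maps to a drawElo of roughly 147, so the 150 to 160 range corresponds to a draw rate a little above 40%.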

Gabor Szots · Post by **Gabor Szots** » Sat Apr 29, 2023 10:33 am

We would gladly use new values for the calculations, only we don't know how to obtain such values. In the original description of BayesElo there is a link to a paper (https://sites.stat.psu.edu/~dhunter/papers/bt.pdf) which describes a procedure but that is beyond our comprehension. Maybe someone could undertake the job of calculating values for each of the 3 lists.

mig2004 · Post by **mig2004** » Sat Apr 29, 2023 1:45 pm

"..Also, is there any scaling factor used to make the rating differences closer to normal Elo ratings? "

I did some tinkering many years ago with the mm index(??), which by default was something like 112. I managed to bring the Bayeselo rating results to within 50 Elo points of Ordo, its best competitor. Since then I have been using it with 119.

I wanted to diminish the tendency to depress the ratings that is typical of bayeselo with its standard default parameters. Also, I wanted to adjust it to a very heterogeneous competition where each engine played fewer games within a large pool of other engines.

My best interpretation is that it changes the clustering factor in very heterogeneous databases like the one I create with my engine tournaments (more than a thousand engines). I can say that it has worked well so far. For me, speed of execution was also a concern. Ordo took close to 50 minutes to calculate Elo in a tournament with a thousand engines and very heterogeneous matching. It came to the point that Ordo's algorithm was not converging at all and the calculations would just not get done. With the mod mentioned, bayeselo did the same calculations in a matter of seconds, and within 50 Elo points of what Ordo did. Using decimal places might be even more accurate. And if you use a fast machine, the speed issue would not be a concern.

lkaufman · Post by **lkaufman** » Sat Apr 29, 2023 6:42 pm

Gabor Szots wrote: ↑Sat Apr 29, 2023 10:33 am We would gladly use new values for the calculations, only we don't know how to obtain such values. In the original description of BayesElo there is a link to a paper (https://sites.stat.psu.edu/~dhunter/papers/bt.pdf) which describes a procedure but that is beyond our comprehension. Maybe someone could undertake the job of calculating values for each of the 3 lists.

The only parameter that you really need to get right is the eloDraw value; the eloAdvantage value doesn't vary so much from one data set to another (as the quoted values from some sample show), and it doesn't have a big effect on Elo values in general when colors are alternated. I don't know how to derive the exact value for eloDraw from data, but I believe I can tell you how to get an approximate value for it, which should be good enough, and far better than using nearly 20-year-old data. Just take your data set, filter it to only include "close" matches (I'd say within 25 Elo, but the exact number isn't important, it will have only a tiny effect on the final answer), and calculate the percentage of draws. Once you have that, it is easy to derive the eloDraw value from the equations in the BayesElo write-up online; if you need help with that I can show you how to do so. It won't be exactly right, but it should be within a couple percent or so of the actual figure. This is a good example of the principle "Don't let the perfect be the enemy of the good"; just because you can't do something perfectly doesn't mean you shouldn't do it very well. If you just want to get an idea of what the consequences would be, try eloDraw = 150. I think the ratings will spread out a lot, but I'm not sure.
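The approximate procedure described above could be sketched as follows in Python. This is only an illustration under stated assumptions: games are given as hypothetical `(white_elo, black_elo, result)` tuples with PGN-style result strings, the sample data is invented, eloAdvantage is ignored, and the function name `estimate_draw_elo` is not part of BayesElo or any CCRL script.

```python
import math


def estimate_draw_elo(games, max_diff=25.0):
    """Approximate eloDraw from the draw rate among closely matched games.

    games: iterable of (white_elo, black_elo, result), where result is
           "1-0", "0-1" or "1/2-1/2" (assumed input format).
    max_diff: keep only games whose rating gap is at most this many Elo.
    Returns (draw_rate, draw_elo).
    """
    close = [res for w, b, res in games if abs(w - b) <= max_diff]
    if not close:
        raise ValueError("no games within the rating window")
    draw_rate = sum(1 for res in close if res == "1/2-1/2") / len(close)
    if draw_rate >= 1.0:
        raise ValueError("all close games were drawn; eloDraw is unbounded")
    # Between equals (advantage assumed 0): draw = 1 - 2 / (1 + 10^(d/400)),
    # which inverts to d = 400 * log10((1 + draw) / (1 - draw)).
    draw_elo = 400.0 * math.log10((1.0 + draw_rate) / (1.0 - draw_rate))
    return draw_rate, draw_elo


# Invented sample: ten close games (four drawn) plus some mismatches
# that the filter should discard.
sample = (
    [(3500, 3510, "1/2-1/2")] * 4
    + [(3500, 3510, "1-0")] * 3
    + [(3510, 3500, "0-1")] * 3
    + [(3500, 3000, "1-0")] * 5
)
rate, d = estimate_draw_elo(sample)
print(f"draw rate among close games: {rate:.1%}, estimated eloDraw: {d:.1f}")
```

The resulting number could then be tried as the `drawelo` value in the rating options, in the same way as the experimental `drawelo 117.861` line quoted earlier in the thread.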
