Why the errorbar is wrong ... simple example!

Ajedrecista · Post by **Ajedrecista** » Wed Feb 24, 2016 12:26 pm

Hello:

Ozymandias wrote:I'm well over 250k games per engine. In this case, there shouldn't be any rating oscillation over 1 ELO point, for more than 5% of the engines, correct?

Wow! Especially if these are not fixed depth=1 games. With a confidence level of 95%, that's right, if I am not mistaken. Would you mind sharing the rating list? If you feed the PGN into BayesElo or Ordo, you will see really low error bars.

My next question is: how did you achieve it? I am puzzled.

Regards from Spain.

Ajedrecista.

Ozymandias · Post by **Ozymandias** » Wed Feb 24, 2016 3:21 pm

Ajedrecista wrote:Hello:

Ozymandias wrote:I'm well over 250k games per engine. In this case, there shouldn't be any rating oscillation over 1 ELO point, for more than 5% of the engines, correct?
Wow! Especially if these are not fixed depth=1 games. With a confidence level of 95%, that's right, if I am not mistaken. Would you mind sharing the rating list? If you feed the PGN into BayesElo or Ordo, you will see really low error bars.

My next question is: how did you achieve it? I am puzzled.

Regards from Spain.

Ajedrecista.

I'm not willing to publish the specifics, but I've never seen any engine report single-digit depths, so no, they aren't "depth=1 games". I'm actually testing openings (engine testing is a byproduct), which means low-quality games are of no interest to me; even if engines didn't misbehave at low depths, I wouldn't consider those settings. But the thing is that, even with at least ten times the number of games you mention, almost invariably with each update to the DB, Ordo reports an ELO oscillation for the newest engine. The ones which have been playing longer don't exhibit any fluctuation at all, but I don't witness a 95% confidence in the results. I'm talking about small fluctuations, as much as 4 ELO points, but that's outside the margin of error, and it happens more often than 5% of the time.

To achieve a high number of games, you just need many cores running over a long period of time 24/7, at a fast TC. Take a look at the guys at FishTest, they have even more games than me.

michiguel · Post by **michiguel** » Wed Feb 24, 2016 4:19 pm

Ozymandias wrote:
I'm not willing to publish the specifics, but I've never seen any engine report single-digit depths, so no, they aren't "depth=1 games". I'm actually testing openings (engine testing is a byproduct), which means low-quality games are of no interest to me; even if engines didn't misbehave at low depths, I wouldn't consider those settings. But the thing is that, even with at least ten times the number of games you mention, almost invariably with each update to the DB, Ordo reports an ELO oscillation for the newest engine. The ones which have been playing longer don't exhibit any fluctuation at all, but I don't witness a 95% confidence in the results. I'm talking about small fluctuations, as much as 4 ELO points, but that's outside the margin of error, and it happens more often than 5% of the time.

To achieve a high number of games, you just need many cores running over a long period of time 24/7, at a fast TC. Take a look at the guys at FishTest, they have even more games than me.

The errors displayed by default use the average of the pool as a "reference". When you introduce a new engine, the "meaning" of the average changes, and so do the reference and the relative values. As a result, there could be variability in the numbers. It depends on what the user wants to do, but if you fix the rating of one particular engine (the "anchor"), all the errors are now referred to the "anchor" engine (unless you use the switch -V) and some of the variability may disappear. Errors are more meaningful and robust when they refer to head-to-head match-ups. Head-to-head comparisons and their respective errors (-e <file>, or fixing one engine with -A and -a) should be influenced much less by the addition of new engines.

Of course, if you have thousands of engines, the average of the pool becomes really solid.

Miguel
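The shift of reference described above can be illustrated with a small sketch. It is not Ordo's algorithm and it does not compute errors; it only shows how point ratings expressed relative to the pool average move when an engine is added, while ratings expressed relative to an anchor stay put. The engine names, strengths, and helper functions are hypothetical.

```python
def relative_to_pool_average(strengths, pool_average=0.0):
    """Shift all ratings so that their mean equals pool_average."""
    offset = pool_average - sum(strengths.values()) / len(strengths)
    return {name: s + offset for name, s in strengths.items()}


def relative_to_anchor(strengths, anchor, anchor_rating=0.0):
    """Shift all ratings so that the anchor engine has anchor_rating."""
    offset = anchor_rating - strengths[anchor]
    return {name: s + offset for name, s in strengths.items()}


# Hypothetical "true" strengths on an arbitrary scale.
pool = {"A": 0.0, "B": 50.0, "C": 100.0}
bigger_pool = dict(pool, D=250.0)  # a strong newcomer joins the list

# With the pool average as reference, B's displayed rating moves.
print(relative_to_pool_average(pool)["B"], relative_to_pool_average(bigger_pool)["B"])

# With engine A as anchor, B's displayed rating does not move.
print(relative_to_anchor(pool, "A")["B"], relative_to_anchor(bigger_pool, "A")["B"])
```

The differences between engines are identical in both views; only the reference they are reported against changes.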

Ozymandias · Post by **Ozymandias** » Wed Feb 24, 2016 4:32 pm

michiguel wrote:
The errors displayed by default use the average of the pool as a "reference". When you introduce a new engine, the "meaning" of the average changes, and so do the reference and the relative values. As a result, there could be variability in the numbers. It depends on what the user wants to do, but if you fix the rating of one particular engine (the "anchor"), all the errors are now referred to the "anchor" engine (unless you use the switch -V) and some of the variability may disappear. Errors are more meaningful and robust when they refer to head-to-head match-ups. Head-to-head comparisons and their respective errors (-e <file>, or fixing one engine with -A and -a) should be influenced much less by the addition of new engines.

Of course, if you have thousands of engines, the average of the pool becomes really solid.

Miguel

The oldest engine is used as an anchor. New engines will have an initial rating with at least 2.5 million games. Subsequent updates will see this engine's rating fluctuate, until it stabilizes at +/- 4 ELO points of its initial rating.

bob · Post by **bob** » Wed Feb 24, 2016 6:48 pm

Ozymandias wrote:
bob wrote:The real elo can NOT lie outside the error bar 95% of the time.
If number of games is the only thing you need, to determine the error bar, how many do you need, to get it below 1?

Draw rate has an influence, but for my testing, 30K gets it to +/- 3, and usually 100K or so will get it to +/- 1. I have not tried to get it below 1, as this gets extremely expensive.
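As a rough cross-check of how the number of games and the draw rate drive the error bar, here is a minimal sketch under a normal approximation. It assumes two engines of nearly equal strength, fully independent games, and z = 1.96 for 95% confidence; it is not what BayesElo or Ordo compute internally, and the function names are made up for illustration. Under these assumptions it reproduces the 463725*(1 - draw ratio) rule of thumb quoted elsewhere in this thread.

```python
import math

Z_95 = 1.96                          # assumed two-sided 95% normal quantile
ELO_PER_SCORE = 1600 / math.log(10)  # slope of Elo vs. score fraction at 50%


def error_bar_elo(games, draw_ratio, z=Z_95):
    """Approximate 95% Elo half-width for two near-equal engines."""
    # At a 50% expected score the per-game score variance is (1 - draw_ratio) / 4.
    std_of_mean_score = math.sqrt((1.0 - draw_ratio) / (4.0 * games))
    return z * ELO_PER_SCORE * std_of_mean_score


def games_needed(target_elo, draw_ratio, z=Z_95):
    """Independent games needed to reach a given 95% Elo half-width."""
    return (z * ELO_PER_SCORE / target_elo) ** 2 * (1.0 - draw_ratio) / 4.0


if __name__ == "__main__":
    for draw_ratio in (0.0, 0.4, 0.5):
        needed = games_needed(1.0, draw_ratio)
        print(f"draw ratio {draw_ratio:.0%}: about {needed:,.0f} games for +/- 1 Elo")
```

With no draws this gives roughly 463.7k games for +/- 1 Elo, and fewer as the draw ratio rises. The figure only holds if the games really are independent trials.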

bob · Post by **bob** » Wed Feb 24, 2016 6:55 pm

Ozymandias wrote:
Ajedrecista wrote:Hi again:

Ozymandias wrote:So, for a database containing a 40-50% of drawn games, we'd need a number of games closer to 185.5k than to 463.7k, in order to achieve a sub 1 error bar, correct?
Well, in this case the number of games would be 463725*(1 - 0.4) ~ 278.2k games or 463725*(1 - 0.5) = 231.9k games... each engine! Not the sum of games of all the engines. It is difficult, isn't it?
Not really, I'm well over 250k games per engine. In this case, there shouldn't be any rating oscillation over 1 ELO point, for more than 5% of the engines, correct?

Depends. This is an "independent trials" statistical analysis. If you just stuff the same game into the mix 100K times, it is pretty obvious that Elo calculations will see the error bar drop to the 1-2 range. But it is also pretty obvious that the result will be wrong, because that would not be 100K independent trials.

This means that duplicate games are NFG. Or if you use starting positions, and some of them are pre-determined wins for black or white, or draws, then you have fewer independent trials, and while a program like BayesElo will give you a small error bar, it will be completely wrong.

Testing is not easy.

In my testing, for example, I am now using 6 opponents, 2500 different (almost hand-picked but using automation to screen them) positions to play a total of 5K games per each of 6 opponents. This is not exactly 30K independent trials, since at least each pair of games shares the same starting position. This test at a time control of 10s +0.1s increment is currently taking me about 6 hours to run until we get our new cluster up and going... (6 hours using 40 cores).
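The point about duplicate games can be illustrated with a small simulation. It is only a sketch under stated assumptions: two equal engines, a 40% draw ratio, and a naive error bar computed as if every game in the file were independent. The function names and parameters are hypothetical and unrelated to BayesElo or Ordo.

```python
import math
import random
import statistics

ELO_PER_SCORE = 1600 / math.log(10)  # slope of Elo vs. score fraction at 50%


def play(rng, draw_ratio=0.4):
    """Result of one game between equal engines: 1, 0.5 or 0."""
    r = rng.random()
    if r < draw_ratio:
        return 0.5
    return 1.0 if r < draw_ratio + (1.0 - draw_ratio) / 2.0 else 0.0


def naive_error_bar(scores):
    """95% Elo half-width, treating every game as an independent trial."""
    n = len(scores)
    mean = sum(scores) / n
    variance = sum((s - mean) ** 2 for s in scores) / n
    return 1.96 * ELO_PER_SCORE * math.sqrt(variance / n)


def elo_estimate(scores):
    """Elo difference implied by the mean score."""
    mean = sum(scores) / len(scores)
    return 400.0 * math.log10(mean / (1.0 - mean))


def experiment(unique_games, copies, runs=200, seed=1):
    """Return (average reported bar, observed 95% spread of the estimates)."""
    rng = random.Random(seed)
    estimates, bars = [], []
    for _ in range(runs):
        base = [play(rng) for _ in range(unique_games)]
        scores = base * copies  # every unique game is stuffed in `copies` times
        estimates.append(elo_estimate(scores))
        bars.append(naive_error_bar(scores))
    return statistics.fmean(bars), 1.96 * statistics.pstdev(estimates)


if __name__ == "__main__":
    for copies in (1, 20):
        reported, observed = experiment(500, copies)
        print(f"copies={copies}: reported +/- {reported:.1f}, observed +/- {observed:.1f}")
```

With a single copy of each game the reported and observed spreads should roughly agree. With 20 copies the reported bar shrinks while the observed spread of the estimates stays about the same, which is the sense in which a small error bar can be wrong.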

bob · Post by **bob** » Wed Feb 24, 2016 6:59 pm

Ozymandias wrote:
The oldest engine is used as an anchor. New engines will have an initial rating with at least 2.5 million games. Subsequent updates will see this engine's rating fluctuate, until it stabilizes at +/- 4 ELO points of its initial rating.

You are doing something badly wrong somewhere. 2.5 M games should not have a +/- 4 Elo error bar by any known method of calculation I know of. The more common problem is that the REAL error bar is larger than the reported error bar because of duplicate games or openings...

I.e., if you pick a particular opening that white always wins, then even though you don't get the same games repeated, you get the same result. Those are not independent trials, and that violates the basic underlying assumption of Elo calculations, making the numbers meaningless.

bob · Post by **bob** » Wed Feb 24, 2016 7:03 pm

Frank Quisinsky wrote:Hello Bob,

thanks for your time and explanations / hints again!

I would like to share the following information:

With 26 opponents ... 1.400 games
With 24 opponents ... 1.600 games
...
for the same stable Elo!

But NOT the same error bar. Elo numbers are meaningless without the error bar. All you know is that, with (usually) 95% confidence, the actual Elo lies within N +/- the error bar.

Elo is not an absolute number.

The problem is ...
The error looks at the quantity of games only! As a result, I get the same output no matter how many opponents I use.

But with more opponents, fewer games are necessary (again, for the same stable rating).

The Elo calculation programs are not able to give us this information. Maybe this information should have a name other than "error". "Opp. stability factor" could be the name.

You wrote:
That being said, when you want to compare two programs in terms of Elo, the more common opponents they have, the better the accuracy of the final Elo numbers, since they are "coupled" by common opponents. 
No doubt about it ...

The 5% you wrote about is another topic.
Randomly produced? There is a chance of more or less than 5%.

I am very insistent on this topic because I have reasons for it.

"Producing a strong result while saving electricity and time"

I think that, for the same stable result with 60 opponents, I need only 20 games per pairing.

= 1700 pairings (60 opponents) x 20 games = 34.000 games (each one vs. each one).

And I produce a stronger and more accurate rating than with a lot more games and fewer opponents.

That's what I have in mind.
I would like to know that very exactly!

Best
Frank

It is all about the number of _independent_ games you have. More is better. But not if you use the same starting positions multiple times with the same pair of players.
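The pairing count in the quoted plan looks rounded. Assuming "each one vs. each one" means a full round robin among 60 engines, the number of pairings is 60*59/2. The sketch below (a hypothetical helper, not part of any rating tool) computes the exact totals.

```python
from math import comb


def round_robin_games(players, games_per_pairing):
    """Return (pairings, total games) for a full round robin."""
    pairings = comb(players, 2)  # every unordered pair of players meets
    return pairings, pairings * games_per_pairing


pairings, total = round_robin_games(60, 20)
print(pairings, total)  # 1770 35400
```

So the quoted 1700 pairings and 34.000 games are close to, but slightly below, the exact round-robin totals. The total still only counts as that many trials if the games are independent.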

Ozymandias · Post by **Ozymandias** » Wed Feb 24, 2016 7:38 pm

bob wrote:This is an "independent trials" statistical analysis. If you just stuff the same game into the mix 100K times, it is pretty obvious that Elo calculations will see the error bar drop to the 1-2 range. But it is also pretty obvious that the result will be wrong, because that would not be 100K independent trials.

This means that duplicate games are NFG. Or if you use starting positions, and some of them are pre-determined wins for black or white, or draws, then you have fewer independent trials, and while a program like BayesElo will give you a small error bar, it will be completely wrong.

The first part is covered with more than 8 million unique starting positions. More than 5 million already tested, about 3 million left.

The second part could obviously be a problem for engine rating, because about 5% of the games finish before 10 ply of the starting position. But that's exactly what I'm trying to filter out, bad opening lines. The fact that I'm not getting as accurate a rating as I could, for engines, isn't an awful problem, because I only need to know if a new one is clearly (10 ELO) better or worse. It'd be nice to have finer grain, but that's it.

bob wrote:You are doing something badly wrong somewhere. 2.5 M games should not have a +/- 4 Elo error bar by any known method of calculation I know of. The more common problem is that the REAL error bar is larger than the reported error bar because of duplicate games or openings...

As I said, I don't even run simulations under Ordo, to find out the error bar, because I'm going to find out what the real one is anyway (about +/- 4 for the minimum 2.5 mill).

As an example, I'm looking at the last two updates, where the addition to the roster is SugaR 2.0. After the initial 830k games (an exceptionally low number), it got a rating of XX53. The subsequent burst of the usual 2.5 mill, where it performed at XX49, brought the current rating to XX50 after 3.3 million games. That translates to a 3 ELO point drop after the initial run.
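The arithmetic of this example can be checked with a game-weighted average of the two runs. This is only an approximation, assuming that for such small differences the combined rating behaves like a linear average of performances; Ordo fits all games jointly, so this is not its method. Only the last two digits of the ratings are used, as in the post.

```python
def pooled_rating(runs):
    """Game-weighted mean of (games, performance) pairs."""
    total_games = sum(games for games, _ in runs)
    return sum(games * rating for games, rating in runs) / total_games


# (games played, last two digits of the performance rating)
sugar_runs = [(830_000, 53), (2_500_000, 49)]
print(round(pooled_rating(sugar_runs)))  # 50
```

The weighted mean lands on XX50, consistent with the reported rating after 3.3 million games.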

michiguel · Post by **michiguel** » Wed Feb 24, 2016 8:38 pm

Ozymandias wrote:
The oldest engine is used as an anchor. New engines will have an initial rating with at least 2.5 million games. Subsequent updates will see this engine's rating fluctuate, until it stabilizes at +/- 4 ELO points of its initial rating.

Without looking at the real numbers I cannot say much, but my guess is that you may have a point about possibly detecting a deviation from normal behavior. If my guess is correct, it would mean that certain engines may perform slightly better against some specific opponents and slightly worse against others. But to actually detect this (if it really exists) you need many, many games to reduce the statistical error, which is what you have done.

Miguel
