Some history from Bob please

lauriet · Post by **lauriet** » Sat Apr 26, 2014 2:43 pm

Hey Bob,

As a bit of a newbie, catching up on the history:
What, when, why, and how was it realized that it takes so many games
to establish an Elo difference between programs?

Regards
Laurie

syzygy · Post by **syzygy** » Sat Apr 26, 2014 2:47 pm

That's just basic statistics.

Repeatedly flip a coin and count the number of heads and tails. At what moment can you tell with any confidence that the coin is biased?

This is the subject of statistics, a branch of mathematics.
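To make the coin example concrete, here is a minimal sketch using the normal approximation to the binomial distribution. The function names and the 95% level (z = 1.96) are illustrative assumptions, not something stated in this thread.

```python
import math

def heads_interval(heads, flips, z=1.96):
    """Approximate confidence interval for the true probability of heads.

    Uses the normal approximation to the binomial; z=1.96 corresponds to
    roughly 95% confidence (an assumed, conventional choice).
    """
    p = heads / flips
    half_width = z * math.sqrt(p * (1 - p) / flips)
    return p - half_width, p + half_width

def looks_biased(heads, flips, z=1.96):
    """True if a fair coin (p = 0.5) falls outside the interval."""
    low, high = heads_interval(heads, flips, z)
    return not (low <= 0.5 <= high)

# The same 55% heads rate at two sample sizes.
print(looks_biased(55, 100))      # interval still contains 0.5
print(looks_biased(5500, 10000))  # interval excludes 0.5
```

With 100 flips, 55% heads is still compatible with a fair coin; with 10,000 flips the same percentage is not. Small Elo differences between engines behave the same way.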

bob · Post by **bob** » Sat Apr 26, 2014 3:21 pm

lauriet wrote:Hey Bob,

As a bit of a newbie, catching up on the history:
What, when, why, and how was it realized that it takes so many games
to establish an Elo difference between programs?

Regards
Laurie

When Elo wrote his book?

Most of us did not pay any attention to the discussion about error bars, and 95% confidence intervals, and such. I started using a massive number of games back around 2001-2002. Over the next year or two I noticed some unexpected behavior here and there and started a discussion here several times. After discussions with Remi (BayesElo) I began to develop a better understanding about how many games were needed, how many starting positions were needed, why more than one opponent was useful, etc.

The only rule I use is that the error bar has to fit well inside the probable Elo range for the change being tested. If the expected gain/loss is small, a large number of games is needed. To figure out whether null-move works or not, you can get by with far fewer games since the expected gain is quite large (+80 to +120 depending).
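As a rough illustration of that rule, the sketch below estimates how many games it takes for a 95% error bar to shrink to a given Elo width. It is not the method used by BayesElo or by any poster here; it assumes two equally strong engines, no draws (the worst case for variance), the standard logistic Elo formula, and a normal approximation. The function names are hypothetical.

```python
import math

def expected_score(elo_diff):
    """Expected score for a player rated elo_diff above the opponent
    (standard logistic Elo formula)."""
    return 1.0 / (1.0 + 10.0 ** (-elo_diff / 400.0))

def games_needed(target_elo, z=1.96, sigma=0.5):
    """Games needed so the error bar is about +/- target_elo.

    Assumptions (illustrative only): the engines are near-equal,
    sigma=0.5 is the per-game standard deviation with no draws,
    and z=1.96 gives roughly 95% confidence.
    """
    # Score margin that corresponds to target_elo around a 50% score.
    margin = expected_score(target_elo) - 0.5
    return math.ceil((z * sigma / margin) ** 2)

for elo in (5, 20, 100):
    print(elo, games_needed(elo))
```

Under these assumptions a +/-5 Elo error bar needs on the order of tens of thousands of games, while +/-100 Elo needs only a few dozen. Draws reduce the per-game variance, so real matches need somewhat fewer games than this worst case.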

hgm · Post by **hgm** » Sat Apr 26, 2014 4:06 pm

Of course it also has to do with the fact that today there are hundreds of engines in thousands of versions, all practically identical, so that the Elo differences you want to measure are almost non-existent. In the days where there were only 5 programs, which differed hundreds of Elo in strength, not many games were needed at all to determine their Elo difference to the desired precision.

syzygy · Post by **syzygy** » Sat Apr 26, 2014 4:28 pm

hgm wrote:Of course it also has to do with the fact that today there are hundreds of engines in thousands of versions, all practically identical, so that the Elo differences you want to measure are almost non-existent. In the days where there were only 5 programs, which differed hundreds of Elo in strength, not many games were needed at all to determine their Elo difference to the desired precision.

And it was possible to observe with the naked eye that the engine stopped playing 2.Ke2

lauriet · Post by **lauriet** » Mon Apr 28, 2014 3:37 am

Does that mean the ratings on sites like SSDF are pretty unreliable?
Sometimes they only have 50 to 100 games against a machine that is rated close to the one being tested.

Regards
Laurie.

Vinvin · Post by **Vinvin** » Mon Apr 28, 2014 6:56 am

lauriet wrote:Does that mean the ratings on sites like SSDF are pretty unreliable?
Sometimes they only have 50 to 100 games against a machine that is rated close to the one being tested.

Regards
Laurie.

http://ssdf.bosjo.net/list.htm has given the error bar ("+ -") for a long time.

zullil · Post by **zullil** » Mon Apr 28, 2014 6:34 pm

syzygy wrote: And it was possible to observe with the naked eye that the engine stopped playing 2.Ke2

Might be better than 2.Rh2.

2.Ke2 at least implies that move 1 was decent.

bob · Post by **bob** » Mon Apr 28, 2014 8:28 pm

lauriet wrote:Does that mean the ratings on sites like SSDF are pretty unreliable?
Sometimes they only have 50 to 100 games against a machine that is rated close to the one being tested.

Regards
Laurie.

Depends on the TOTAL number of games a program has played. You can play 30K games against one opponent, or 1 game against 30K opponents (the latter is actually preferred).

syzygy · Post by **syzygy** » Mon Apr 28, 2014 9:57 pm

zullil wrote:
syzygy wrote: And it was possible to observe with the naked eye that the engine stopped playing 2.Ke2
Might be better than 2.Rh2.

2.Ke2 at least implies that move 1 was decent.

Maybe the very first version of my engine wasn't so bad after all
