Some history from Bob please

lauriet
Posts: 199
Joined: Sun Nov 03, 2013 9:32 am

Some history from Bob please

Post by lauriet »

Hey Bob,

As a bit of a newbie, catching up on the history:
What, when, why, and how was it realized that it takes so many games
to establish an Elo difference between programs?

Regards
Laurie
syzygy
Posts: 5869
Joined: Tue Feb 28, 2012 11:56 pm

Re: Some history from Bob please

Post by syzygy »

That's just basic statistics.

Repeatedly flip a coin and count the number of heads and tails. At what moment can you tell with any confidence that the coin is biased?

This is the subject of statistics, a branch of mathematics.
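
To put a rough number on the coin example, here is a minimal sketch. It assumes the usual normal approximation to the binomial and a two-sided 95% interval (z = 1.96); the function name `flips_needed` is made up for illustration. It estimates how many flips it takes before the interval around the observed heads rate no longer includes 0.5.

```python
import math

def flips_needed(p_heads, z=1.96):
    """Approximate number of flips before a coin with true heads
    probability p_heads can be told apart from a fair coin.

    Assumption: normal approximation to the binomial, two-sided
    interval with the given z (1.96 is roughly 95%).
    """
    bias = abs(p_heads - 0.5)
    if bias == 0:
        raise ValueError("a fair coin never looks biased in the long run")
    variance = p_heads * (1.0 - p_heads)  # variance of a single flip
    # Require z * sqrt(variance / n) <= bias and solve for n.
    return math.ceil(z * z * variance / (bias * bias))

for p in (0.60, 0.55, 0.51):
    print(p, flips_needed(p))
```

The count grows with the inverse square of the bias: shrinking the bias by a factor of ten takes roughly a hundred times as many flips, which is the same reason small Elo differences need so many games.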
bob
Posts: 20943
Joined: Mon Feb 27, 2006 7:30 pm
Location: Birmingham, AL

Re: Some history from Bob please

Post by bob »

lauriet wrote:Hey Bob,

As a bit of a newbie, catching up on the history:
What, when, why, and how was it realized that it takes so many games
to establish an Elo difference between programs?

Regards
Laurie
When Elo wrote his book? :) Most of us did not pay any attention to the discussion about error bars, and 95% confidence intervals, and such. I started using a massive number of games back around 2001-2002. Over the next year or two I noticed some unexpected behavior here and there and started a discussion here several times. After discussions with Remi (BayesElo) I began to develop a better understanding about how many games were needed, how many starting positions were needed, why more than one opponent was useful, etc.

The only rule I use is that the error bar has to fit well inside the probable Elo range for the change being tested. If the expected gain/loss is small, a bunch of games is needed. To figure out whether null-move works or not, you can get by with far fewer games since the expected gain is quite large (+80 to +120 depending).
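
As a rough illustration of that rule, the sketch below estimates the 95% error bar for a given number of games, and the number of games needed to reach a target error bar. It rests on several assumptions that are not from the post: the two engines are close to equal in strength (so the logistic Elo curve is linearized around a 50% score), games are independent, and the draw ratio is known. The names `elo_error_bar` and `games_needed` are hypothetical. For large differences such as the null-move case the linearization is crude, so treat those numbers as order-of-magnitude only.

```python
import math

# Slope of the logistic Elo curve at a 50% score:
# D = 400 * log10(s / (1 - s))  =>  dD/ds at s = 0.5 is 1600 / ln(10).
ELO_PER_SCORE = 1600.0 / math.log(10)

def per_game_sigma(draw_ratio):
    """Std dev of one game's score (1, 0.5 or 0) for two equal engines."""
    return 0.5 * math.sqrt(1.0 - draw_ratio)

def elo_error_bar(games, draw_ratio=0.0, z=1.96):
    """Approximate +/- Elo error bar after `games` independent games."""
    return z * ELO_PER_SCORE * per_game_sigma(draw_ratio) / math.sqrt(games)

def games_needed(error_bar_elo, draw_ratio=0.0, z=1.96):
    """Approximate games needed to shrink the error bar to the target."""
    sigma = per_game_sigma(draw_ratio)
    return math.ceil((z * ELO_PER_SCORE * sigma / error_bar_elo) ** 2)

# A big change can be confirmed quickly; a small one needs far more games.
for target in (100, 20, 5):
    print(f"+/-{target} Elo: about {games_needed(target)} games")
```

Under these assumptions an error bar of about 100 Elo is reached within a few dozen games, while an error bar of 5 Elo takes on the order of tens of thousands. A higher draw ratio lowers the per-game variance and so reduces the count somewhat.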
hgm
Posts: 28456
Joined: Fri Mar 10, 2006 10:06 am
Location: Amsterdam
Full name: H G Muller

Re: Some history from Bob please

Post by hgm »

Of course it also has to do with the fact that today there are hundreds of engines in thousands of versions, all practically identical, so that the Elo differences you want to measure are almost non-existent. In the days where there were only 5 programs, which differed hundreds of Elo in strength, not many games were needed at all to determine their Elo difference to the desired precision.
syzygy
Posts: 5869
Joined: Tue Feb 28, 2012 11:56 pm

Re: Some history from Bob please

Post by syzygy »

hgm wrote:Of course it also has to do with the fact that today there are hundreds of engines in thousands of versions, all practically identical, so that the Elo differences you want to measure are almost non-existent. In the days where there were only 5 programs, which differed hundreds of Elo in strength, not many games were needed at all to determine their Elo difference to the desired precision.
And it was possible to observe with the naked eye that the engine stopped playing 2.Ke2 :-)
lauriet
Posts: 199
Joined: Sun Nov 03, 2013 9:32 am

Re: Some history from Bob please

Post by lauriet »

Does that mean the ratings on sites like SSDF are pretty unreliable?
Sometimes they only have 50 to 100 games against a machine that is rated close to the one being tested.

Regards
Laurie.
Vinvin
Posts: 5312
Joined: Thu Mar 09, 2006 9:40 am
Full name: Vincent Lejeune

Re: Some history from Bob please

Post by Vinvin »

lauriet wrote:Does that mean the ratings on sites like SSDF are pretty unreliable?
Sometimes they only have 50 to 100 games against a machine that is rated close to the one being tested.

Regards
Laurie.
http://ssdf.bosjo.net/list.htm has given the error bars (the "+" and "-" columns) for a long time.
zullil
Posts: 6442
Joined: Tue Jan 09, 2007 12:31 am
Location: PA USA
Full name: Louis Zulli

Re: Some history from Bob please

Post by zullil »

syzygy wrote: And it was possible to observe with the naked eye that the engine stopped playing 2.Ke2 :-)
Might be better than 2.Rh2.

2.Ke2 at least implies that move 1 was decent. :wink:
bob
Posts: 20943
Joined: Mon Feb 27, 2006 7:30 pm
Location: Birmingham, AL

Re: Some history from Bob please

Post by bob »

lauriet wrote:Does that mean the ratings on sites like SSDF are pretty unreliable?
Sometimes they only have 50 to 100 games against a machine that is rated close to the one being tested.

Regards
Laurie.
Depends on the TOTAL number of games a program has played. You can play 30K games against one opponent, or 1 game against 30K opponents (the latter is actually preferred).
syzygy
Posts: 5869
Joined: Tue Feb 28, 2012 11:56 pm

Re: Some history from Bob please

Post by syzygy »

zullil wrote:
syzygy wrote: And it was possible to observe with the naked eye that the engine stopped playing 2.Ke2 :-)
Might be better than 2.Rh2.

2.Ke2 at least implies that move 1 was decent. :wink:
Maybe the very first version of my engine wasn't so bad after all :-)