Stockfish Development Version

TShackel
Posts: 313
Joined: Sat Apr 05, 2014 12:09 am
Location: Neenah, WI, United States

Stockfish Development Version

Post by TShackel »

Hi,

I've been using the April 12th development version in most of my testing, since it proved to be several Elo stronger than Stockfish 6.0 in my own tests. However, I tested the April 12th development version against the most recent development version, and after 332 games the recent version is down by 2 Elo. I know it's not a thousand games yet, but 332 is quite a few games to start drawing a conclusion from.

Does anyone have an idea why this is? Normally the most recent development version is stronger than the previous versions.

Sincerely,

Tim.
reflectionofpower
Posts: 1668
Joined: Fri Mar 01, 2013 5:28 pm
Location: USA

Re: Stockfish Development Version

Post by reflectionofpower »

TShackel wrote:Hi,

I've been using the April 12th development version in most of my testing, since it proved to be several Elo stronger than Stockfish 6.0 in my own tests. However, I tested the April 12th development version against the most recent development version, and after 332 games the recent version is down by 2 Elo. I know it's not a thousand games yet, but 332 is quite a few games to start drawing a conclusion from.

Does anyone have an idea why this is? Normally the most recent development version is stronger than the previous versions.

Sincerely,

Tim.
Sometimes the previous version is slightly stronger; it happens every now and then. If you check here: http://computerchess.org.uk/ccrl/4040/r ... t_all.html you'll see some previous versions are stronger than the more recent ones.
"Without change, something sleeps inside us, and seldom awakens. The sleeper must awaken." (Dune - 1984)

Lonnie
TShackel
Posts: 313
Joined: Sat Apr 05, 2014 12:09 am
Location: Neenah, WI, United States

Re: Stockfish Development Version

Post by TShackel »

reflectionofpower wrote: Sometimes the previous version is slightly stronger; it happens every now and then. If you check here: http://computerchess.org.uk/ccrl/4040/r ... t_all.html you'll see some previous versions are stronger than the more recent ones.
Hi,

I just found that my tests, at least through 332 games, disagreed with the results of the Stockfish team.

With regard to the rating list you gave, the reason some older versions are rated higher than newer ones is that the new version was run on 1 CPU and the old one on 4 CPUs.

Sincerely,

Tim.
Laskos
Posts: 10948
Joined: Wed Jul 26, 2006 10:21 pm
Full name: Kai Laskos

Re: Stockfish Development Version

Post by Laskos »

TShackel wrote:
reflectionofpower wrote: Sometimes the previous version is slightly stronger; it happens every now and then. If you check here: http://computerchess.org.uk/ccrl/4040/r ... t_all.html you'll see some previous versions are stronger than the more recent ones.
Hi,

I just found that my tests, at least through 332 games, disagreed with the results of the Stockfish team.
You disagree over a 2 Elo point difference in 332 games. You must consider the Stockfish developers really brainless to play tens of thousands of games in order to check for a 3 Elo point improvement.
reflectionofpower
Posts: 1668
Joined: Fri Mar 01, 2013 5:28 pm
Location: USA

Re: Stockfish Development Version

Post by reflectionofpower »

Laskos wrote:
TShackel wrote:
reflectionofpower wrote: Sometimes the previous version is slightly stronger; it happens every now and then. If you check here: http://computerchess.org.uk/ccrl/4040/r ... t_all.html you'll see some previous versions are stronger than the more recent ones.
Hi,

I just found that my tests, at least through 332 games, disagreed with the results of the Stockfish team.
You disagree over a 2 Elo point difference in 332 games. You must consider the Stockfish developers really brainless to play tens of thousands of games in order to check for a 3 Elo point improvement.
This is an obvious argument. There is always some variance, and 2 points doesn't suggest anything unusual. As you can see from the CCRL list, every rating comes with a +/- margin, for example:

Image
"Without change, something sleeps inside us, and seldom awakens. The sleeper must awaken." (Dune - 1984)

Lonnie
TShackel
Posts: 313
Joined: Sat Apr 05, 2014 12:09 am
Location: Neenah, WI, United States

Re: Stockfish Development Version

Post by TShackel »

Laskos wrote: You disagree over a 2 Elo point difference in 332 games. You must consider the Stockfish developers really brainless to play tens of thousands of games in order to check for a 3 Elo point improvement.
Umm, relax. I don't need to be quoted all the mantras about engine testing. I want my own tests to agree with what they think they're improving; otherwise I don't believe it.

I don't enjoy having my intelligence insulted. 332 games is quite a few to start getting a real comparison. Maybe the Stockfish team isn't playing enough games. I could get a more reliable result with 100,000 games. See, you can always say more games are needed, even to the Stockfish team themselves.

My point is that there should be only improvement and no regression. I didn't say a 2 Elo difference was large, but it is in the wrong direction!

Sincerely,

Tim.
zullil
Posts: 6442
Joined: Tue Jan 09, 2007 12:31 am
Location: PA USA
Full name: Louis Zulli

Re: Stockfish Development Version

Post by zullil »

TShackel wrote: 332 games is quite a few to start getting a real comparison.
Gee, my gut reaction is that 332 games is way too few to conclude much of anything with any confidence. But I'm not a statistician. :wink:
TShackel
Posts: 313
Joined: Sat Apr 05, 2014 12:09 am
Location: Neenah, WI, United States

Re: Stockfish Development Version

Post by TShackel »

zullil wrote:Gee, my gut reaction is that 332 games is way too few to conclude much of anything with any confidence. But I'm not a statistician. :wink:
332 games is enough to get posted on CEGT's long time control rating list! Are you saying their rating lists aren't valid? Second, the Stockfish team is far from perfect, or they would've done 100,000 games of testing for each change. See, you can always make the excuse that more games are required. That doesn't mean we don't count our result in the meantime.

Tim.
reflectionofpower
Posts: 1668
Joined: Fri Mar 01, 2013 5:28 pm
Location: USA

Re: Stockfish Development Version

Post by reflectionofpower »

zullil wrote:
TShackel wrote: 332 games is quite a few to start getting a real comparison.
Gee, my gut reaction is that 332 games is way too few to conclude much of anything with any confidence. But I'm not a statistician. :wink:
It's getting bad. Maybe I should interact with humanity. I read your comment and I was looking for the "thumbs up" icon. :lol:
"Without change, something sleeps inside us, and seldom awakens. The sleeper must awaken." (Dune - 1984)

Lonnie
Laskos
Posts: 10948
Joined: Wed Jul 26, 2006 10:21 pm
Full name: Kai Laskos

Re: Stockfish Development Version

Post by Laskos »

zullil wrote:
TShackel wrote: 332 games is quite a few to start getting a real comparison.
Gee, my gut reaction is that 332 games is way too few to conclude much of anything with any confidence. But I'm not a statistician. :wink:
It's not about being a statistician. It's more about a square root.
Square root of 4 is 2 because 2*2=4
Square root of 25 is 5 because 5*5=25
Square root of 324 is 18 because 18*18=324, very close to those 332 (games played).

The resolution power in Elo points for a match of 324 games is:

700 Elo points divided by the square root of 324
that is
700 / 18 ≈ 39 Elo points.

That is, in this match of 324 games one cannot detect with any certainty a difference smaller than 39 Elo points. That is far from the 2 Elo points "detected" in the OP.
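To put those numbers in perspective, here is a minimal Python sketch that converts an Elo difference into an expected score using the standard logistic Elo formula, and then into a points margin over a match. The function names are made up for illustration and do not come from any testing tool mentioned in this thread.

```python
def expected_score(elo_diff: float) -> float:
    """Expected score per game for the stronger side (standard logistic Elo model)."""
    return 1.0 / (1.0 + 10.0 ** (-elo_diff / 400.0))


def points_above_even(elo_diff: float, games: int) -> float:
    """Points the stronger side is expected to score beyond a 50% result."""
    return (expected_score(elo_diff) - 0.5) * games


if __name__ == "__main__":
    for diff in (2, 39):
        margin = points_above_even(diff, 324)
        print(f"{diff} Elo over 324 games: about {margin:.1f} points above an even score")
```

Under this model, a 2 Elo edge over 324 games works out to roughly one point above an even score, which is the swing produced by a single game ending in a win instead of a loss. A 39 Elo edge corresponds to roughly 18 points.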

For N games, the formula is 700 Elo points divided by the square root of N.

This is called 3 standard deviations confidence, but no one has to remember its name.
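The rule of thumb can be sketched in a few lines of Python. The constant 700 and the formula are taken from this post. The function names are made up for illustration, and solving the formula for N is a simple rearrangement added here, not something stated in the thread.

```python
import math

RULE_OF_THUMB_ELO = 700.0  # constant quoted in the post above


def resolution_elo(games: int) -> float:
    """Smallest Elo difference the rule of thumb says a match of `games` can resolve."""
    if games <= 0:
        raise ValueError("games must be positive")
    return RULE_OF_THUMB_ELO / math.sqrt(games)


def games_needed(elo_diff: float) -> int:
    """Rearranged rule: number of games before `elo_diff` becomes resolvable."""
    if elo_diff <= 0:
        raise ValueError("elo_diff must be positive")
    return math.ceil((RULE_OF_THUMB_ELO / elo_diff) ** 2)


if __name__ == "__main__":
    print(round(resolution_elo(324), 1))  # 700 / 18
    print(games_needed(2))
    print(games_needed(3))
```

By this rule, resolving a 2 Elo difference takes about 122,500 games and a 3 Elo difference about 54,445. That is in line with the remark earlier in the thread that the developers play tens of thousands of games to check a 3 Elo improvement.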