Hi,
I've been using the April 12th development version in most of my testing since it proved to be several Elo stronger than Stockfish 6 in my own testing. However, I tested the April 12th development version against the most recent development version, and after 332 games the recent version is down by 2 Elo against the April 12th version. I know it's not up to a thousand games yet, but 332 is quite a few games to start drawing a conclusion from.
Does anyone have an idea why this is? Normally the most recent development version is stronger than the previous versions.
Sincerely,
Tim.
Stockfish Development Version
Moderator: Ras
-
TShackel
- Posts: 313
- Joined: Sat Apr 05, 2014 12:09 am
- Location: Neenah, WI, United States
-
reflectionofpower
- Posts: 1668
- Joined: Fri Mar 01, 2013 5:28 pm
- Location: USA
Re: Stockfish Development Version
Sometimes the previous version is slightly stronger; it happens every now and then. If you check here: http://computerchess.org.uk/ccrl/4040/r ... t_all.html you'll see some previous versions are stronger than the more recent ones.
"Without change, something sleeps inside us, and seldom awakens. The sleeper must awaken." (Dune - 1984)
Lonnie
-
TShackel
Re: Stockfish Development Version
Hi,
I just found that my tests, at least through 332 games, disagreed with the results of the Stockfish team.
With regard to the rating list you gave, the reason some older versions are rated higher than newer ones is that the newer version is running on 1 CPU and the older one on 4 CPUs.
Sincerely,
Tim.
-
Laskos
- Posts: 10948
- Joined: Wed Jul 26, 2006 10:21 pm
- Full name: Kai Laskos
Re: Stockfish Development Version
You disagree over a 2 Elo point difference in 332 games. You must consider the Stockfish developers really brainless to play tens of thousands of games in order to check for a 3 Elo point improvement.
-
reflectionofpower
Re: Stockfish Development Version
This is an obvious argument. There is always some variability, and 2 points doesn't suggest anything unusual. As you can see from the CCRL list, there is a +/- deviation, for example:

-
TShackel
Re: Stockfish Development Version
Laskos wrote: You disagree over a 2 Elo point difference in 332 games. You must consider the Stockfish developers really brainless to play tens of thousands of games in order to check for a 3 Elo point improvement.
Umm, relax. I don't need to be quoted all the mantras about engine testing. I want my own tests to agree with what they think they're improving; otherwise I don't believe it.
I don't enjoy having my intelligence insulted. 332 games is quite a few to start getting a real comparison. Maybe the Stockfish team isn't playing enough games. I can get a more reliable result with 100,000 games. See, you can always say more games are important, even to the Stockfish team themselves.
My point is that there should be only improvement and not any regression. I didn't say a 2 Elo difference was large, but it is in the wrong direction!
Sincerely,
Tim.
-
zullil
- Posts: 6442
- Joined: Tue Jan 09, 2007 12:31 am
- Location: PA USA
- Full name: Louis Zulli
Re: Stockfish Development Version
TShackel wrote: 332 games is quite a few to start getting a real comparison.
Gee, my gut reaction is that 332 games is way too few to conclude much of anything with any confidence. But I'm not a statistician.
-
TShackel
Re: Stockfish Development Version
332 games is enough to get posted on CEGT's long time control rating list! Are you saying their rating lists aren't valid? And second of all, the Stockfish team is far from perfect, or they would've done 100,000 games of testing for each change. See, you can always make the excuse that more games are required. That doesn't mean we don't count our result in the meantime.
Tim.
-
reflectionofpower
Re: Stockfish Development Version
zullil wrote: Gee, my gut reaction is that 332 games is way too few to conclude much of anything with any confidence. But I'm not a statistician.
It's getting bad. Maybe I should interact with humanity. I read your comment and I was looking for the "thumbs up" icon.
-
Laskos
Re: Stockfish Development Version
zullil wrote: Gee, my gut reaction is that 332 games is way too few to conclude much of anything with any confidence. But I'm not a statistician.
It's not about being a statistician. It's more about a square root.
Square root of 4 is 2 because 2*2=4
Square root of 25 is 5 because 5*5=25
Square root of 324 is 18 because 18*18=324, very close to those 332 (games played).
The resolving power in Elo points for a match of 324 games is:
700 Elo points divided by the square root of 324
that is
700 / 18 ≈ 39 Elo points.
That is, in this match of 324 games one cannot reliably detect any difference smaller than about 39 Elo points. So the 2 Elo points "detected" in the OP are far below that.
For N games the formula is 700 Elo points divided by the square root of N.
It's called 3 standard deviations confidence, but no one has to remember its name.
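For anyone who wants to plug in their own numbers, here is a minimal Python sketch of the rule of thumb above. It only encodes the 700 / sqrt(N) approximation from this post; the function names `elo_resolution` and `games_needed` are made up for illustration, and the constant 700 is taken as given rather than derived.

```python
import math


def elo_resolution(games: int, constant: float = 700.0) -> float:
    """Rule-of-thumb 3-standard-deviation resolution, in Elo, for a match
    of `games` games, using the 700 / sqrt(N) approximation from the post."""
    if games <= 0:
        raise ValueError("games must be a positive integer")
    return constant / math.sqrt(games)


def games_needed(elo_diff: float, constant: float = 700.0) -> int:
    """Invert the same rule of thumb: roughly how many games are needed
    before a difference of `elo_diff` Elo becomes detectable."""
    if elo_diff <= 0:
        raise ValueError("elo_diff must be positive")
    return math.ceil((constant / elo_diff) ** 2)


if __name__ == "__main__":
    print(round(elo_resolution(324), 1))  # 324 games, as in the post
    print(round(elo_resolution(332), 1))  # the 332 games actually played
    print(games_needed(3))                # games to resolve a 3 Elo gain
    print(games_needed(2))                # games to resolve a 2 Elo gap
```

Under this approximation, resolving a 2 or 3 Elo difference takes tens of thousands of games or more, which fits the earlier remark about how many games the Stockfish developers play to check a small improvement.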