Stockfish Development Version

TShackel · Post by **TShackel** » Sun Jun 14, 2015 5:17 pm

Hi,

I've been using the April 12th development version in most of my testing, since it proved to be several Elo stronger than Stockfish 6.0 in my own tests. However, I tested the April 12th development version against the most recent development version, and after 332 games the recent version is down by 2 Elo. I know it's not up to a thousand games yet, but 332 is quite a few games to start drawing a conclusion from.

Does anyone have an idea why this is? Normally the most recent development version is stronger than the previous versions.

Sincerely,

Tim.

reflectionofpower · Post by **reflectionofpower** » Sun Jun 14, 2015 6:19 pm

TShackel wrote:Hi,

I've been using the April 12th development version in most of my testing, since it proved to be several Elo stronger than Stockfish 6.0 in my own tests. However, I tested the April 12th development version against the most recent development version, and after 332 games the recent version is down by 2 Elo. I know it's not up to a thousand games yet, but 332 is quite a few games to start drawing a conclusion from.

Does anyone have an idea why this is? Normally the most recent development version is stronger than the previous versions.

Sincerely,

Tim.

Sometimes the previous version is slightly stronger; it happens every now and then. If you check here: http://computerchess.org.uk/ccrl/4040/r ... t_all.html you'll see some previous versions are stronger than the more recent ones.

TShackel · Post by **TShackel** » Sun Jun 14, 2015 6:27 pm

reflectionofpower wrote: Sometimes the previous version is slightly stronger; it happens every now and then. If you check here: http://computerchess.org.uk/ccrl/4040/r ... t_all.html you'll see some previous versions are stronger than the more recent ones.

Hi,

I just found that my tests disagreed, at least through 332 games, with the results of the Stockfish team.

With regard to the rating list you gave, the reason some older versions are rated stronger than newer ones is that the new version is running on 1 CPU and the old one on 4 CPUs.

Sincerely,

Tim.

Laskos · Post by **Laskos** » Sun Jun 14, 2015 7:52 pm

TShackel wrote:
reflectionofpower wrote: Sometimes the previous version is slightly stronger; it happens every now and then. If you check here: http://computerchess.org.uk/ccrl/4040/r ... t_all.html you'll see some previous versions are stronger than the more recent ones.
Hi,

I just found that my tests disagreed, at least through 332 games, with the results of the Stockfish team.

You disagree over a 2 Elo point difference in 332 games. You must consider the Stockfish developers really brainless to play tens of thousands of games in order to check for a 3 Elo point improvement.

reflectionofpower · Post by **reflectionofpower** » Sun Jun 14, 2015 8:05 pm

Laskos wrote:
TShackel wrote:
reflectionofpower wrote: Sometimes the previous version is slightly stronger; it happens every now and then. If you check here: http://computerchess.org.uk/ccrl/4040/r ... t_all.html you'll see some previous versions are stronger than the more recent ones.
Hi,

I just found that my tests disagreed, at least through 332 games, with the results of the Stockfish team.
You disagree over a 2 Elo point difference in 332 games. You must consider the Stockfish developers really brainless to play tens of thousands of games in order to check for a 3 Elo point improvement.

This is an obvious argument. There is always some variability, and 2 points doesn't suggest anything unusual. As you can see from the CCRL list, there is a +/- deviation, for example:

TShackel · Post by **TShackel** » Sun Jun 14, 2015 8:06 pm

Laskos wrote: You disagree over a 2 Elo point difference in 332 games. You must consider the Stockfish developers really brainless to play tens of thousands of games in order to check for a 3 Elo point improvement.

Umm, relax. I don't need to be quoted all the mantras about engine testing. I want my own tests to agree with what they think they're improving; otherwise I don't believe it.

I don't enjoy having my intelligence insulted. 332 games is quite a few to start getting a real comparison. Maybe the Stockfish team isn't playing enough games. I could get a more reliable result with 100,000 games. See, you can always say more games are needed, even to the Stockfish team themselves.

My point is that there should be only improvement and no regression. I didn't say a 2 Elo difference was large, but it is in the wrong direction!

Sincerely,

Tim.

zullil · Post by **zullil** » Sun Jun 14, 2015 8:17 pm

TShackel wrote: 332 games is quite a few to start getting a real comparison.

Gee, my gut reaction is that 332 games is way too few to conclude much of anything with any confidence. But I'm not a statistician.

TShackel · Post by **TShackel** » Sun Jun 14, 2015 8:23 pm

zullil wrote:Gee, my gut reaction is that 332 games is way too few to conclude much of anything with any confidence. But I'm not a statistician.

332 games is enough to get posted on CEGT's long time control rating list! Are you saying their rating lists aren't valid? And second of all, the Stockfish team is far from perfect, or they would've done 100,000 games of testing for each change. See, you can always make the excuse that more games are required. That doesn't mean we don't count our result in the meantime.

Tim.

reflectionofpower · Post by **reflectionofpower** » Sun Jun 14, 2015 8:32 pm

zullil wrote:
TShackel wrote: 332 games is quite a few to start getting a real comparison.
Gee, my gut reaction is that 332 games is way too few to conclude much of anything with any confidence. But I'm not a statistician.

It's getting bad. Maybe I should interact with humanity. I read your comment and I was looking for the "thumbs up" icon.

Laskos · Post by **Laskos** » Sun Jun 14, 2015 8:56 pm

zullil wrote:
TShackel wrote: 332 games is quite a few to start getting a real comparison.
Gee, my gut reaction is that 332 games is way too few to conclude much of anything with any confidence. But I'm not a statistician.

It's not about being a statistician. It's more about a square root.
Square root of 4 is 2 because 2*2=4
Square root of 25 is 5 because 5*5=25
Square root of 324 is 18 because 18*18=324, very close to those 332 (games played).

The resolving power, in Elo points, of a match of 324 games is:

700 Elo points divided by the square root of 324
that is
700 / 18 ≈ 39 Elo points.

That is, in this match of 324 games one cannot detect with any certainty a difference smaller than 39 Elo points. That is far from the 2 Elo points "detected" in the OP.

For N games, the formula is 700 Elo points divided by the square root of N.

And it's called 3 standard deviations confidence, but no one has to remember its name.
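The rule of thumb above is easy to check numerically. The following is a minimal Python sketch, not anything from the Stockfish testing framework: the function names `resolution_elo` and `games_needed` are made up for illustration, and the constant 700 is taken directly from the post as an approximation rather than derived here.

```python
import math

# Constant from the rule of thumb quoted in the post (roughly 3 standard
# deviations); treated here as an approximation, not an exact value.
RULE_OF_THUMB_ELO = 700.0


def resolution_elo(games: int) -> float:
    """Smallest Elo difference the rule of thumb treats as detectable."""
    if games <= 0:
        raise ValueError("games must be a positive integer")
    return RULE_OF_THUMB_ELO / math.sqrt(games)


def games_needed(elo_diff: float) -> int:
    """Invert the rule: games required before elo_diff becomes detectable."""
    if elo_diff <= 0:
        raise ValueError("elo_diff must be positive")
    return math.ceil((RULE_OF_THUMB_ELO / elo_diff) ** 2)


if __name__ == "__main__":
    for n in (324, 1_000, 100_000):
        print(f"{n:>7} games -> about {resolution_elo(n):.1f} Elo resolution")
    for diff in (2, 3):
        print(f"{diff} Elo -> about {games_needed(diff)} games needed")
```

Under this rule, resolving a 2 Elo difference takes about 122,500 games and a 3 Elo difference about 54,445, which is consistent with the tens of thousands of games mentioned earlier in the thread.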
