Stockfish Development Version

lucasart
Posts: 3243
Joined: Mon May 31, 2010 1:29 pm
Full name: lucasart

Re: Stockfish Development Version

Post by lucasart »

TShackel wrote:Hi,

I've been using the April 12th development version in most of my testing, since it proved to be several Elo stronger than Stockfish 6.0 in my own testing. However, I tested the April 12th development version against the most recent development version, and after 332 games the recent version is down by 2 Elo. I know it's not a thousand games yet, but 332 is quite a few games to start drawing a conclusion from.

Does anyone have an idea why this is? Normally the most recent development version is stronger than the previous versions.

Sincerely,

Tim.
1/ You cannot conclude anything from seeing -2 Elo after 332 games. Detecting a 2 Elo regression with 95% confidence requires an SPRT(-2,0), which is *much* costlier than that.

2/ You are using the abrok.eu builds, and anything you conclude from those could be an artifact of how abrok.eu compiles them (e.g. not using the right version of mingw-gcc, or not doing profile-guided optimization). Unless you compile both versions yourself, in exactly the same way, you cannot make any meaningful observation.
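To put a rough number on point 1/, here is a small Python sketch using the rule of thumb quoted elsewhere in this thread (about 700 Elo divided by the square root of the number of games). The function name `elo_margin` is made up for illustration, and the rule ignores the draw rate, so treat the result as a ballpark and not as an exact confidence interval.

```python
import math

def elo_margin(games, constant=700.0):
    """Rough Elo error margin after `games` games (rule of thumb: constant / sqrt(N))."""
    if games <= 0:
        raise ValueError("games must be positive")
    return constant / math.sqrt(games)

measured_diff = -2.0          # Elo difference reported in the original post
margin = elo_margin(332)      # number of games reported in the original post

print(f"margin after 332 games: +/- {margin:.1f} Elo")
print("within noise" if abs(measured_diff) < margin else "outside noise")
```

With a margin in the tens of Elo, a measured -2 cannot be told apart from zero.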
Theory and practice sometimes clash. And when that happens, theory loses. Every single time.
thekingman
Posts: 35
Joined: Mon Mar 16, 2015 6:17 am

Re: Stockfish Development Version

Post by thekingman »

TShackel wrote:
In spite of error bars everyone seems to conclude which engines are stronger based on the trustworthy CEGT lists.
The online rating lists are our "best estimates", but there is of course 'noise' in these estimates. If one engine is rated higher than another, it's probably stronger, but our certainty in that conclusion depends on the number of games played. Check out how the CCRL rating list handles this, with a 'Likelihood of superiority' column, representing confidence that a given engine is actually stronger than the one rated just below it: http://www.computerchess.org.uk/ccrl/4040/

Note in particular that while there is extremely high confidence in Stockfish being stronger than Houdini, we can only be 87% sure (from this data) that Komodo is stronger than Stockfish, despite being rated 15 points higher. For an even more dramatic example, look at Gull vs Equinox - despite over 700 games played by each engine, the mere 3 point gap means that you may as well flip a coin to pick the stronger of the two.
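For anyone who wants to play with the idea, here is a minimal Python sketch of a likelihood-of-superiority calculation. It uses a common normal approximation based only on wins and losses; whether CCRL computes its column exactly this way is an assumption on my part, and the game counts below are hypothetical, not taken from any rating list.

```python
import math

def likelihood_of_superiority(wins, losses):
    """Approximate probability that the side with `wins` wins is the stronger one.

    Draws are left out: they do not say which side is stronger.
    """
    decisive = wins + losses
    if decisive == 0:
        return 0.5  # no decisive games, no evidence either way
    return 0.5 * (1.0 + math.erf((wins - losses) / math.sqrt(2.0 * decisive)))

# Hypothetical head-to-head results for illustration only.
print(f"60 wins vs 40 losses:   {likelihood_of_superiority(60, 40):.3f}")
print(f"205 wins vs 200 losses: {likelihood_of_superiority(205, 200):.3f}")
```

A small surplus of wins over many decisive games stays close to 0.5, which is the coin-flip situation described above.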
TShackel wrote: But anyhow, what would be the number of games I would need to provide to have a reliable measure of Elo? Maybe we could improve the Stockfish team's methods and increase their 30,000 game tests to 100,000 while we're at it. But nevertheless, I would like to know the number I would need to reach to call it a real measure of Elo.
Using the formula Laskos provided, to be able to say with confidence that version A is stronger than version B given that you have measured a 2 Elo difference, you need (700/2)^2 = 122,500 games.

Of course, the number of games required to show that A > B drops drastically as the rating gap increases - about 5000 is enough for a 10 point gap, and about 200 is enough for a 50 point gap (i.e., Stockfish vs Houdini).
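The arithmetic above can be sketched in a few lines of Python. This simply inverts the 700 / sqrt(N) rule of thumb from Laskos; the function name `games_needed` is hypothetical, and the rule itself is an approximation that ignores the draw rate.

```python
import math

def games_needed(elo_gap, constant=700.0):
    """Games needed before a measured gap of `elo_gap` Elo exceeds constant / sqrt(N)."""
    if elo_gap <= 0:
        raise ValueError("elo_gap must be positive")
    return math.ceil((constant / elo_gap) ** 2)

for gap in (2, 10, 50):
    print(f"{gap:>2} Elo gap: about {games_needed(gap)} games")
```

The quadratic growth is the whole story: halving the gap you want to resolve quadruples the number of games.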

However, this is not to suggest that you can't call your results a real measure of Elo. They are - just an extremely noisy measure, with the noise so much stronger than the true signal that you can't say anything with confidence. But all Elo measurements are estimates, and have their own associated noise - what matters is understanding the noise to know what you can and cannot conclude from your estimate.
reflectionofpower
Posts: 1669
Joined: Fri Mar 01, 2013 5:28 pm
Location: USA

Re: Stockfish Development Version

Post by reflectionofpower »

thekingman wrote: [...] the mere 3 point gap means that you may as well flip a coin to pick the stronger of the two.
So what you're saying is "heads or tails", and don't waste my $$ on college?
"Without change, something sleeps inside us, and seldom awakens. The sleeper must awaken." (Dune - 1984)

Lonnie
Lyudmil Tsvetkov
Posts: 6052
Joined: Tue Jun 12, 2012 12:41 pm

Re: Stockfish Development Version

Post by Lyudmil Tsvetkov »

TShackel wrote: [...] Normally the most recent development version is stronger than the previous versions.
Nothing is taken for granted in this life.

Maybe there will not even be SF 7. Maybe the framework will fall apart, no matter how slim such a possibility might look currently. Maybe some of the main programmers will be able to contribute significantly less, so progress will be significantly slowed down.

Maybe Komodo will also not be able to make big progress in the future, no matter how unrealistic such a possibility might look currently.

Maybe another strong engine will take their place at the very top, or maybe we have just witnessed the peak of computer chess for a few years to come. There is no guarantee that progress will be as rapid as it has been over the last few years.

In any case, what I am almost fully certain of is that it will take SF longer to reach +50-60 Elo and be released as SF 7 than the 8 months it took to release SF 6.

I also do not think that we will ever again witness the 3-4 month Komodo jumps from version to version.

As engines get stronger, it is only reasonable to suppose it will become increasingly difficult to gain further strength.

So maybe just enjoy what we have for a while and hope the future will bring something better.
ernest
Posts: 2059
Joined: Wed Mar 08, 2006 8:30 pm

Re: Stockfish Development Version

Post by ernest »

Laskos wrote: For N games the formula is 700 Elo points divided by the square root of N
Actually, as you well know, the correct value for sigma (the standard deviation) also depends on the draw rate.
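A rough Python sketch of that dependence, under assumptions that are mine and not from the thread: the two engines are about equal in strength, games are independent, and each game is a win, draw or loss. Under those assumptions the per-game score variance is (1 - draw rate) / 4, and the slope of the Elo curve at a 50% score is 1600 / ln(10). The function name `elo_sigma` is made up for illustration.

```python
import math

def elo_sigma(games, draw_rate):
    """One-sigma Elo uncertainty for a match between roughly equal engines."""
    if games <= 0:
        raise ValueError("games must be positive")
    if not 0.0 <= draw_rate < 1.0:
        raise ValueError("draw_rate must be in [0, 1)")
    # Per-game standard deviation of the score when wins and losses
    # are equally likely: variance = (1 - draw_rate) / 4.
    per_game_sd = math.sqrt((1.0 - draw_rate) / 4.0)
    # Slope of Elo with respect to score at a 50% score:
    # 400 / (ln 10 * 0.5 * 0.5).
    elo_per_score = 1600.0 / math.log(10.0)
    return elo_per_score * per_game_sd / math.sqrt(games)

for draw_rate in (0.0, 0.3, 0.6):
    print(f"draw rate {draw_rate:.0%}: sigma after 1000 games = "
          f"{elo_sigma(1000, draw_rate):.1f} Elo")
```

With no draws, two sigma comes out very close to 700 / sqrt(N), so under these assumptions the quoted rule behaves like a two-sigma bound for decisive games, and the real error bar shrinks as the draw rate rises.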

And has anybody noticed that the error bars given by the Chessbase/Fritz GUI during engine matches are completely idiotic?
reflectionofpower
Posts: 1669
Joined: Fri Mar 01, 2013 5:28 pm
Location: USA

Re: Stockfish Development Version

Post by reflectionofpower »

Lyudmil Tsvetkov wrote: [...] As engines get stronger, it is only reasonable to suppose it will be increasingly more difficult to increase strength.
Unless we start seeing gigantic leaps in processing power, I have to agree with you.
"Without change, something sleeps inside us, and seldom awakens. The sleeper must awaken." (Dune - 1984)

Lonnie