Stockfish Development Version

lucasart · Post by **lucasart** » Mon Jun 15, 2015 1:05 am

TShackel wrote:Hi,

I've been using the April 12th development version in most of my testing, since it proved to be several Elo stronger than Stockfish 6.0 in my own testing. However, I tested the April 12th development version against the most recent development version, and after 332 games the recent version is down by 2 Elo against the April 12th development version. I know it's not up to a thousand games yet, but 332 is quite a few games to start drawing a conclusion from.

Does anyone have an idea why this is? Normally the most recent development version is stronger than the previous versions.

Sincerely,

Tim.

1/ You cannot conclude anything from seeing -2 Elo after 332 games. Detecting a 2 Elo regression with 95% confidence requires an SPRT(-2,0), which is *much* costlier than that.

2/ You are using abrok.eu compiles, and anything you conclude from those could be the result of abrok.eu doing lame compiles (e.g. not using the right version of mingw-gcc, or not doing profile-guided optimizations). Unless you can compile both versions yourself, in exactly the same way, you cannot make any meaningful observation.

thekingman · Post by **thekingman** » Mon Jun 15, 2015 1:41 am

TShackel wrote:
In spite of error bars everyone seems to conclude which engines are stronger based on the trustworthy CEGT lists.

The online rating lists are our "best estimates", but there is of course 'noise' in these estimates. If one engine is rated higher than another, it's probably stronger, but our certainty in that conclusion is dependent on number of games. Check out how the CCRL rating list handles this, with a 'Likelihood of superiority' column, representing confidence that a given engine is actually stronger than the one rated just below it: http://www.computerchess.org.uk/ccrl/4040/

Note in particular that while there is extremely high confidence in Stockfish being stronger than Houdini, we can only be 87% sure (from this data) that Komodo is stronger than Stockfish, despite being rated 15 points higher. For an even more dramatic example, look at Gull vs Equinox - despite over 700 games played by each engine, the mere 3 point gap means that you may as well flip a coin to pick the stronger of the two.
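For a single head-to-head match, a rough sketch of such a "likelihood of superiority" figure is below. This is the commonly used normal approximation based only on wins and losses, not necessarily what CCRL computes for its list, and the function name and the win/loss counts are made up for illustration.

```python
import math

def likelihood_of_superiority(wins: int, losses: int) -> float:
    """Approximate probability that the first engine is the stronger one.

    Normal approximation: draws are ignored, because they carry no
    information about which side is stronger.
    """
    decisive = wins + losses
    if decisive == 0:
        return 0.5  # no decisive games: nothing to go on, so a coin flip
    z = (wins - losses) / math.sqrt(2.0 * decisive)
    return 0.5 * (1.0 + math.erf(z))

# Hypothetical results, for illustration only
print(likelihood_of_superiority(60, 50))    # a small lead, moderate confidence
print(likelihood_of_superiority(600, 500))  # same ratio, ten times the games
```

With the same win/loss ratio, the figure moves toward 100% as the number of decisive games grows, which is why a small rating gap over few games says very little.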

TShackel wrote: But anyhow, how many games would I need to play to have a reliable measure of Elo? Maybe we could improve the Stockfish team's methods and increase their 30,000 game tests to 100,000 while we're at it. But nevertheless, I would like to know the number I would need to reach to call it a real measure of Elo.

Using the formula Laskos provided, to be able to say with confidence that version A is stronger than version B given that you have measured a 2 Elo difference, you need (700/2)^2 = 122,500 games.

Of course, the number of games required to show that A > B drops drastically as the rating gap increases - 5000 is enough for a 10 point gap, and 200 is enough for a 50 point gap (ie, Stockfish vs Houdini).
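The arithmetic behind those numbers can be sketched as follows, assuming only the 700/sqrt(N) rule of thumb quoted in this thread. The function names are made up for illustration.

```python
import math

K = 700.0  # constant from the rule of thumb quoted in this thread

def error_margin_elo(n_games: int) -> float:
    """Approximate Elo error margin after n_games, per the 700/sqrt(N) rule."""
    return K / math.sqrt(n_games)

def games_needed(elo_gap: float) -> int:
    """Games needed before the error margin shrinks to the measured gap."""
    return math.ceil((K / elo_gap) ** 2)

for gap in (2, 10, 50):
    print(f"{gap} Elo gap: about {games_needed(gap)} games")

# The 332-game match from the opening post
print(f"332 games: error margin of about {error_margin_elo(332):.0f} Elo")
```

By the same rule, the error margin after 332 games is in the high 30s of Elo, far larger than the 2 Elo difference being discussed.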

However, this is not to suggest that you can't call your results a real measure of Elo. They are - just an extremely noisy measure, with the noise so much stronger than the true signal that you can't say anything with confidence. But all Elo measurements are estimates, and have their own associated noise - what matters is understanding the noise to know what you can and cannot conclude from your estimate.

reflectionofpower · Post by **reflectionofpower** » Mon Jun 15, 2015 5:36 am

thekingman wrote:
TShackel wrote:
In spite of error bars everyone seems to conclude which engines are stronger based on the trustworthy CEGT lists.
The online rating lists are our "best estimates", but there is of course 'noise' in these estimates. If one engine is rated higher than another, it's probably stronger, but our certainty in that conclusion is dependent on number of games. Check out how the CCRL rating list handles this, with a 'Likelihood of superiority' column, representing confidence that a given engine is actually stronger than the one rated just below it: http://www.computerchess.org.uk/ccrl/4040/

Note in particular that while there is extremely high confidence in Stockfish being stronger than Houdini, we can only be 87% sure (from this data) that Komodo is stronger than Stockfish, despite being rated 15 points higher. For an even more dramatic example, look at Gull vs Equinox - despite over 700 games played by each engine, the mere 3 point gap means that you may as well flip a coin to pick the stronger of the two.

TShackel wrote: But anyhow, how many games would I need to play to have a reliable measure of Elo? Maybe we could improve the Stockfish team's methods and increase their 30,000 game tests to 100,000 while we're at it. But nevertheless, I would like to know the number I would need to reach to call it a real measure of Elo.
Using the formula Laskos provided, to be able to say with confidence that version A is stronger than version B given that you have measured a 2 Elo difference, you need (700/2)^2 = 122,500 games.

Of course, the number of games required to show that A > B drops drastically as the rating gap increases - 5000 is enough for a 10 point gap, and 200 is enough for a 50 point gap (ie, Stockfish vs Houdini).

However, this is not to suggest that you can't call your results a real measure of Elo. They are - just an extremely noisy measure, with the noise so much stronger than the true signal that you can't say anything with confidence. But all Elo measurements are estimates, and have their own associated noise - what matters is understanding the noise to know what you can and cannot conclude from your estimate.

So what you're saying is "heads or tails", and don't waste my $$ on college?

Lyudmil Tsvetkov · Post by **Lyudmil Tsvetkov** » Mon Jun 15, 2015 11:20 am

TShackel wrote:Hi,

I've been using the April 12th development version in most of my testing, since it proved to be several Elo stronger than Stockfish 6.0 in my own testing. However, I tested the April 12th development version against the most recent development version, and after 332 games the recent version is down by 2 Elo against the April 12th development version. I know it's not up to a thousand games yet, but 332 is quite a few games to start drawing a conclusion from.

Does anyone have an idea why this is? Normally the most recent development version is stronger than the previous versions.

Sincerely,

Tim.

Nothing is taken for granted in this life.

Maybe there will not even be SF 7. Maybe the framework will fall apart, no matter how slim such a possibility might look currently. Maybe some of the main programmers will be able to contribute significantly less, so progress will be significantly slowed down.

Maybe Komodo will also not be able to make big progress in the future, no matter how unrealistic such a possibility might look currently.

Maybe another strong engine will take their place at the very top, or maybe we have just witnessed the peak of computer chess for a few years to come. No guarantee progress will be made so rapidly as it has been over the last few years.

In any case, what I am almost fully certain of is that it will take SF longer to gain +50-60 Elo and be released as SF 7 than the 8 months it took to release SF 6.

I also do not think that we will ever witness again the 3-4 months Komodo jumps from version to version.

As engines get stronger, it is only reasonable to suppose that it will become increasingly difficult to gain strength.

So maybe just enjoy what we have for a while and hope the future will bring something better.

ernest · Post by **ernest** » Fri Jun 19, 2015 1:38 am

Laskos wrote: For N games, the formula is 700 Elo points divided by the square root of N.

Actually, as you well know, the correct value for sigma (standard-deviation) is also dependent on the draw rate.
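A minimal sketch of that dependence, under assumptions that are not from this thread: two engines of equal strength, independent games, and the usual logistic Elo curve. With draw rate d, the per-game score has standard deviation sqrt(1 - d) / 2, and near a 50% score one unit of score is worth 1600 / ln(10) Elo. The function name is made up for illustration.

```python
import math

# Slope of the logistic Elo curve at a 50% score, in Elo per unit of score
ELO_PER_SCORE = 1600.0 / math.log(10)

def elo_sigma(n_games: int, draw_rate: float) -> float:
    """One standard deviation of the measured Elo difference.

    Assumes equal-strength engines and independent games.
    """
    per_game_sd = math.sqrt(1.0 - draw_rate) / 2.0
    return ELO_PER_SCORE * per_game_sd / math.sqrt(n_games)

for d in (0.0, 0.3, 0.6):
    two_sigma = 2.0 * elo_sigma(332, d)
    print(f"draw rate {d:.0%}: 2-sigma error of about {two_sigma:.0f} Elo after 332 games")
```

Under these assumptions, two standard deviations with no draws at all works out to roughly 695/sqrt(N), so the 700/sqrt(N) rule looks like the zero-draw worst case; a higher draw rate narrows the error bars.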

And has anybody noticed that the error bars given by the Chessbase/Fritz GUI during engine matches are completely idiotic ?...

reflectionofpower · Post by **reflectionofpower** » Fri Jun 19, 2015 1:45 am

Lyudmil Tsvetkov wrote:
TShackel wrote:Hi,

I've been using the April 12th development version in most of my testing, since it proved to be several Elo stronger than Stockfish 6.0 in my own testing. However, I tested the April 12th development version against the most recent development version, and after 332 games the recent version is down by 2 Elo against the April 12th development version. I know it's not up to a thousand games yet, but 332 is quite a few games to start drawing a conclusion from.

Does anyone have an idea why this is? Normally the most recent development version is stronger than the previous versions.

Sincerely,

Tim.
Nothing is taken for granted in this life.

Maybe there will not even be SF 7. Maybe the framework will fall apart, no matter how slim such a possibility might look currently. Maybe some of the main programmers will be able to contribute significantly less, so progress will be significantly slowed down.

Maybe Komodo will also not be able to make big progress in the future, no matter how unrealistic such a possibility might look currently.

Maybe another strong engine will take their place at the very top, or maybe we have just witnessed the peak of computer chess for a few years to come. No guarantee progress will be made so rapidly as it has been over the last few years.

In any case, what I am almost fully certain of is that it will take SF longer to gain +50-60 Elo and be released as SF 7 than the 8 months it took to release SF 6.

I also do not think that we will ever witness again the 3-4 months Komodo jumps from version to version.

As engines get stronger, it is only reasonable to suppose that it will become increasingly difficult to gain strength.

So maybe just enjoy what we have for a while and hope the future will bring something better.

Unless we start seeing gigantic leaps in processing power then I have to agree with you.
