Stockfish Development Version


reflectionofpower
Posts: 1668
Joined: Fri Mar 01, 2013 5:28 pm
Location: USA

Re: Stockfish Development Version

Post by reflectionofpower »

Laskos wrote:
zullil wrote:
TShackel wrote: 332 games is quite a few to start getting a real comparison.
Gee, my gut reaction is that 332 games is way too few to conclude much of anything with any confidence. But I'm not a statistician. :wink:
It's not about being a statistician. It's more about a square root.
Square root of 4 is 2 because 2*2=4
Square root of 25 is 5 because 5*5=25
Square root of 324 is 18 because 18*18=324, very close to those 332 (games played).

The resolution power in Elo points for a match of 324 games is:

700 Elo points divided by the square root of 324
that is
700 / 18 ≈ 39 Elo points.

That is, one cannot detect for sure in this match of 324 games anything smaller than a 39 Elo point difference. That is far from the 2 Elo points "detected" in the OP.

For N games the formula is 700 Elo points divided by the square root of N.

And it's called 3 standard deviations confidence, but no one has to remember its name.
Meet me at the casino. I would like to discuss something with you. :wink:

I have read the MIT book. There were also some people who developed a "workaround" for the roulette table: https://www.youtube.com/watch?v=CiWHcpU6snM

Interesting doc, I highly recommend it.
"Without change, something sleeps inside us, and seldom awakens. The sleeper must awaken." (Dune - 1984)

Lonnie
thekingman
Posts: 35
Joined: Mon Mar 16, 2015 6:17 am

Re: Stockfish Development Version

Post by thekingman »

TShackel wrote:
zullil wrote:Gee, my gut reaction is that 332 games is way too few to conclude much of anything with any confidence. But I'm not a statistician. :wink:
332 games is enough to get posted on CEGT's long time control rating list! Are you saying their rating lists aren't valid? And second of all, the Stockfish team is far from perfect, or they would've done 100,000 games of testing for each change. See, you can always make the excuse that more games are required. That doesn't mean we don't count our result in the meantime.

Tim.
Error bars, my friend. Every major rating list includes error bars, because they are extremely important for understanding what the results actually mean. You'll notice that on CEGT's long time control rating list, Stockfish 6 (with 300 games) is listed as 3154 +/- 39. Your sample size is similar, so your error bars will be about the same too. Therefore, while your best estimate is that the latest version is 2 Elo weaker, it could actually be up to 35 Elo stronger and your results would not be too surprising.

If you do the math, it should work out to ~45% chance of the latest version being stronger than the older one, despite the -2 Elo from your tests. This is without incorporating any outside information, just your results and basic statistics. I strongly suggest you read some of the available information on likelihood of superiority, which is what you are posting about, to understand why a -2 Elo estimation based on 332 games is not at all informative: https://chessprogramming.wikispaces.com ... Likelihood of superiority
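
For anyone who wants to check the arithmetic, here is a rough sketch in Python. It uses the 700 / sqrt(N) rule of thumb quoted earlier in the thread and a plain normal approximation for the likelihood of superiority. The function names are made up for illustration, and treating CEGT's +/- 39 as a 95% interval is an assumption, not something the list states here.

```python
import math

def error_margin_elo(games, constant=700.0):
    """Rule of thumb from the thread: about constant / sqrt(N) Elo (quoted as ~3 standard deviations)."""
    return constant / math.sqrt(games)

def likelihood_of_superiority(elo_diff, sigma_elo):
    """Normal approximation: probability that the true difference is above zero,
    given the measured difference and a one-standard-deviation error in Elo."""
    z = elo_diff / sigma_elo
    return 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))

games = 332
measured_diff = -2.0                      # Elo, as reported in the OP
margin = error_margin_elo(games)          # a bit over 38 Elo

# Assumption 1: the rule-of-thumb margin is 3 standard deviations.
los_3sigma = likelihood_of_superiority(measured_diff, margin / 3.0)
# Assumption 2: a +/- 39 error bar is a 95% interval (about 1.96 standard deviations).
los_95 = likelihood_of_superiority(measured_diff, 39.0 / 1.96)

print(f"margin for {games} games: {margin:.1f} Elo")
print(f"LOS (3-sigma reading): {los_3sigma:.2f}")
print(f"LOS (95% reading):     {los_95:.2f}")
```

Both readings land in the mid-40s percent, in the same neighbourhood as the ~45% mentioned above. A proper calculation would use the actual win/draw/loss counts, which are not given here.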
bob
Posts: 20943
Joined: Mon Feb 27, 2006 7:30 pm
Location: Birmingham, AL

Re: Stockfish Development Version

Post by bob »

reflectionofpower wrote:
Meet me at the casino. I would like to discuss something with you. :wink:

I have read the MIT book. There were also some people who developed a "workaround" for the roulette table: https://www.youtube.com/watch?v=CiWHcpU6snM

Interesting doc, I highly recommend it.
There have been MANY who have beaten the roulette wheel, but you take a chance on visiting the state prison for most methods. Some used to "clock" wheels looking for zones or numbers that hit with greater than expected frequency, caused by wheels out of balance, slight differences in the bumpers, very consistent spins, etc. The better solutions involve electronic clocking, but that is illegal. There have been cases of people gutting a cell phone, inserting a laser and computer system, and actually measuring the speed of the wheel and the speed of the ball; after a few spins it can predict the decay very accurately. It won't hit a specific number very often, but it has been shown to nail the quadrant very accurately, and that's enough to win; the tighter the bound, the better the win rate.

You'd probably enjoy some of the exploits of old-timers in Vegas. From the original card-counting of Thorp's system, through the Taft shuffle-tracking computer for blackjack, to the roulette clockers and such. Being a long-time card counter (blackjack) I considered all of that "basic education reading." :) There are lots of other things, from tricking slot machines to a group that actually learned to beat a continuous shuffle machine that was supposedly unbeatable in a blackjack game. Until they got their hands on one and discovered a flaw. :)

BTW Taft's shuffle-tracking computer (wearable as in the video you linked to) had similar issues. He almost caught his pants on fire once and did burn his leg. Funny things happen when you wear batteries powerful enough to run 1970s-era minicomputers. :)
syzygy
Posts: 6023
Joined: Tue Feb 28, 2012 11:56 pm

Re: Stockfish Development Version

Post by syzygy »

TShackel wrote:I know it's not to a thousand games yet, but 332 is quite a few games to start drawing a conclusion from.
If it's about 2 Elo you cannot draw any conclusion from 332 games. That simple.
reflectionofpower
Posts: 1668
Joined: Fri Mar 01, 2013 5:28 pm
Location: USA

Re: Stockfish Development Version

Post by reflectionofpower »

bob wrote:
There have been MANY who have beaten the roulette wheel, but you take a chance on visiting the state prison for most methods. Some used to "clock" wheels looking for zones or numbers that hit with greater than expected frequency, caused by wheels out of balance, slight differences in the bumpers, very consistent spins, etc. The better solutions involve electronic clocking, but that is illegal. There have been cases of people gutting a cell phone, inserting a laser and computer system, and actually measuring the speed of the wheel and the speed of the ball; after a few spins it can predict the decay very accurately. It won't hit a specific number very often, but it has been shown to nail the quadrant very accurately, and that's enough to win; the tighter the bound, the better the win rate.

You'd probably enjoy some of the exploits of old-timers in Vegas. From the original card-counting of Thorp's system, through the Taft shuffle-tracking computer for blackjack, to the roulette clockers and such. Being a long-time card counter (blackjack) I considered all of that "basic education reading." :) There are lots of other things, from tricking slot machines to a group that actually learned to beat a continuous shuffle machine that was supposedly unbeatable in a blackjack game. Until they got their hands on one and discovered a flaw. :)

BTW Taft's shuffle-tracking computer (wearable as in the video you linked to) had similar issues. He almost caught his pants on fire once and did burn his leg. Funny things happen when you wear batteries powerful enough to run 1970s-era minicomputers. :)
Yeah, in that doc that was the main problem: they sweat naturally, and then they start squirming and acting funny because now they are sitting in "Old Sparky". You have facial recognition and database sharing across the board. It's hard to do, but if you're a lone wolf you could probably pull it off even now with different disguises and careful planning.
TShackel
Posts: 313
Joined: Sat Apr 05, 2014 12:09 am
Location: Neenah, WI, United States

Re: Stockfish Development Version

Post by TShackel »

thekingman wrote:Error bars, my friend. Every major rating list includes error bars, because they are extremely important for understanding what the results actually mean. You'll notice that on CEGT's long time control rating list, Stockfish 6 (with 300 games) is listed as 3154 +/- 39. Your sample size is similar, so your error bars will be about the same too. Therefore, while your best estimate is that the latest version is 2 Elo weaker, it could actually be up to 35 Elo stronger and your results would not be too surprising.

If you do the math, it should work out to ~45% chance of the latest version being stronger than the older one, despite the -2 Elo from your tests. This is without incorporating any outside information, just your results and basic statistics. I strongly suggest you read some of the available information on likelihood of superiority, which is what you are posting about, to understand why a -2 Elo estimation based on 332 games is not at all informative: https://chessprogramming.wikispaces.com ... Likelihood of superiority
In spite of error bars, everyone seems to conclude which engines are stronger based on the trustworthy CEGT lists.

But anyhow, how many games would I need to provide to have a reliable measure of Elo? Maybe we could improve the Stockfish team's methods and increase their 30,000-game tests to 100,000 while we're at it. But nevertheless, I would like to know the number I would need to reach to call it a real measure of Elo.

Sincerely,

Tim.
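
One rough way to put a number on that question is to invert the 700 / sqrt(N) rule of thumb quoted earlier in the thread. The sketch below assumes that rule holds (it depends on the draw rate and on how much confidence you want), and the function names are hypothetical.

```python
import math

def games_needed(target_elo, constant=700.0):
    """Invert margin = constant / sqrt(N): games needed so the margin shrinks to target_elo."""
    return math.ceil((constant / target_elo) ** 2)

def margin_for(games, constant=700.0):
    """Approximate resolvable Elo difference after a given number of games."""
    return constant / math.sqrt(games)

for elo in (39, 10, 4, 2):
    print(f"to resolve {elo:>2} Elo: about {games_needed(elo):,} games")

print(f"margin after 30,000 games: about {margin_for(30_000):.1f} Elo")
```

Under this rule, resolving a 2 Elo difference takes on the order of 120,000 games, and a 30,000-game test still leaves a margin of around 4 Elo.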
reflectionofpower
Posts: 1668
Joined: Fri Mar 01, 2013 5:28 pm
Location: USA

Re: Stockfish Development Version

Post by reflectionofpower »

TShackel wrote:
In spite of error bars, everyone seems to conclude which engines are stronger based on the trustworthy CEGT lists.

But anyhow, how many games would I need to provide to have a reliable measure of Elo? Maybe we could improve the Stockfish team's methods and increase their 30,000-game tests to 100,000 while we're at it. But nevertheless, I would like to know the number I would need to reach to call it a real measure of Elo.

Sincerely,

Tim.
I look at the top 3 and they're pretty much equal in my eyes. Houdini 4 is 1.5 years old; amazing that it is still only 62 points behind!! If Houdart escaped from the straitjacket and made it to version 5, he would kick hiney!
bob
Posts: 20943
Joined: Mon Feb 27, 2006 7:30 pm
Location: Birmingham, AL

Re: Stockfish Development Version

Post by bob »

TShackel wrote:
In spite of error bars, everyone seems to conclude which engines are stronger based on the trustworthy CEGT lists.

But anyhow, how many games would I need to provide to have a reliable measure of Elo? Maybe we could improve the Stockfish team's methods and increase their 30,000-game tests to 100,000 while we're at it. But nevertheless, I would like to know the number I would need to reach to call it a real measure of Elo.

Sincerely,

Tim.
The Stockfish team is not running 30K-game tests. They are using SPRT with self-play, a completely different animal.

I run 30K-game tests all the time, as that drops the error bar to about +/- 3 or 4, which is usually good enough.
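
For readers who have not met SPRT (sequential probability ratio test): instead of fixing the number of games in advance, you keep playing and stop as soon as the evidence clearly favors one of two hypotheses. The sketch below uses a common normal-approximation log-likelihood ratio. It is only an illustration, not a copy of what the Stockfish testing framework does, and the hypotheses (0 Elo vs. 5 Elo), the 5% error rates, and all names are assumptions chosen for the example.

```python
import math

def expected_score(elo_diff):
    """Logistic Elo model: expected score for a given Elo advantage."""
    return 1.0 / (1.0 + 10.0 ** (-elo_diff / 400.0))

def sprt_llr(wins, draws, losses, elo0, elo1):
    """Approximate log-likelihood ratio of H1 (elo1) against H0 (elo0)."""
    n = wins + draws + losses
    if n == 0:
        return 0.0
    mean = (wins + 0.5 * draws) / n
    var = (wins + 0.25 * draws) / n - mean * mean   # per-game score variance
    if var <= 0.0:
        return 0.0
    s0, s1 = expected_score(elo0), expected_score(elo1)
    return n * (s1 - s0) * (2.0 * mean - s0 - s1) / (2.0 * var)

def sprt_decision(llr, alpha=0.05, beta=0.05):
    """Wald's stopping bounds for the chosen error rates."""
    lower = math.log(beta / (1.0 - alpha))
    upper = math.log((1.0 - beta) / alpha)
    if llr >= upper:
        return "accept H1"
    if llr <= lower:
        return "accept H0"
    return "continue"

# Made-up results, purely for illustration.
for w, d, l in ((60, 80, 50), (6000, 8000, 5000), (5000, 8000, 6000)):
    llr = sprt_llr(w, d, l, elo0=0.0, elo1=5.0)
    print(f"+{w} ={d} -{l}: LLR {llr:+.2f} -> {sprt_decision(llr)}")
```

The point of the sequential approach is that the test length adapts to the data: lopsided results stop early, while a change worth only a point or two can run for a very long time before either bound is crossed.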
bob
Posts: 20943
Joined: Mon Feb 27, 2006 7:30 pm
Location: Birmingham, AL

Re: Stockfish Development Version

Post by bob »

reflectionofpower wrote:
Yeah, in that doc that was the main problem: they sweat naturally, and then they start squirming and acting funny because now they are sitting in "Old Sparky". You have facial recognition and database sharing across the board. It's hard to do, but if you're a lone wolf you could probably pull it off even now with different disguises and careful planning.
It is difficult. The Vegas casinos all exchange data on suspected card counters, suspected cheaters, suspected ace trackers, suspected anything at all. If I could walk into a Vegas casino and just count and play, I'd retire immediately. But it is a game of cat and mouse. You have to count, but look like a "gambler" by doing occasional things against the count (aka "cover" betting in BJ). You have to become natural enough at counting that (a) you don't move your lips :) and (b) you can carry on one or more conversations and STILL count. And in some cases, you might want to stand between two tables, count both, and jump in when the odds are in your favor (aka "Wonging", after the pseudonym of a well-known Vegas counter). You learn to use various tricks (a non-existent cell phone call to get away from a table where the count has tanked, or suddenly recognizing a long-lost friend across the room for the same reason). To be successful at counting cards is MUCH harder than just learning Hi-Lo and the basic strategy departure counts. Some get "drunk" without drinking anything more than "tea". Some have an accomplice who stands behind them and urges them to make stupid plays, both when the count is good and when it is bad. Of course you try to only make those "plays" when the count is right, while looking like your accomplice is almost forcing you to split those 10's or 6's.

And you will STILL get "backed off", where they will say something like "I'm sorry, but your blackjack game is too strong for us; you can play any other game in the casino, but not blackjack." (And of course, there is only ONE game in the casino that can be beaten, and you just lost the ability to play it.) So you need disguises to avoid facial recognition, a career in "acting" to avoid attracting attention from the eye (in the sky, aka the security or game-protection folks), and such. If you don't get picked up on, you can play as much as you want. Once recognized, it's over there, at least until next year. I always use a 60-minute rule myself: no more than 60 minutes in one casino, to avoid playing long enough for game-protection (or the pit boss too, of course) to take an interest. Fortunately, in Vegas, you have lots of choices. I like Fremont Street since the casinos are lined up side-by-side. On the Strip you end up walking a _lot_ since the mega-resorts are more spaced out (with a few exceptions like MGM/Tropicana/NYNY). I've played in every one. I won't be going back to Caesars, unfortunately (a $100 min single-deck game was my undoing there; dead-accurate counting gets more attention at single-deck games than anywhere, something I learned the hard way).

I enjoy doing this because one can have a good time and actually make some money along the way (which easily offsets my slots-happy wife). :)

As I tell anyone I talk to, I do NOT gamble. But playing a game where you make the odds in your favor rather than in the house's is NOT gambling, it is known as "a sure thing". All the other casino games are losers with VERY rare exceptions. There used to be some decent video poker games that paid out more than 100% if you knew what you were doing. But video poker is not poker so you are forced to use a completely different "basic strategy" where you always chase the big payouts like a royal flush. If you have 5 spades, with A, K and 10, who would give up a flush in the hand for a possible royal flush? :)

Answer: someone that understands video poker. :)
bob
Posts: 20943
Joined: Mon Feb 27, 2006 7:30 pm
Location: Birmingham, AL

Re: Stockfish Development Version

Post by bob »

syzygy wrote:
TShackel wrote:I know it's not to a thousand games yet, but 332 is quite a few games to start drawing a conclusion from.
If it's about 2 Elo you cannot draw any conclusion from 332 games. That simple.
In fact, if it is only 2 Elo, you can't draw any conclusions from 33,200 games either.