Toga The Killer 1Y MP 4CPU is the strongest Toga....

bob · Post by **bob** » Tue Jun 30, 2009 5:39 am

bob wrote:
Rank Name Elo + - games score oppo. draws
1 Fruit 2.1 2644 70 66 16 66% 2556 31%
2 Crafty-23.1-1 2556 66 70 16 34% 2644 31%

There's a 16-game match, with an error bar that is 150 points wide.

Here's how that ended:

Rank Name Elo + - games score oppo. draws
1 Crafty-23.1-1 2623 5 5 7782 56% 2577 26%
2 Fruit 2.1 2577 5 5 7782 44% 2623 26%

Which one would you trust? Which one would give you a _reasonable_ idea of which is better? That is all I have been saying, from the get-go. If you want to draw a conclusion from that first set of data, fine. But it is not very meaningful, particularly when, as Paul Harvey used to say, "now here's the rest of the story" and you get the final match results (only 8K games; waiting 15 minutes was enough to make the point here).

Ryan Benitez wrote:
If someone does not already agree that 10 or 16 games is not enough, they cannot be helped. Maybe you are hunting such people out in case you see them at a poker table some day? It is always good to know who lacks elementary math skills at the poker table.

krazyken wrote:
Strange thing is, I have a degree in Math, but I guess you haven't been following along. The question isn't whether more games is better; the question is whether or not a set of games has no value whatsoever. Just because a small sample isn't always right does not mean that it is always wrong. If you have set things up correctly, it will be right far more often than it will be wrong.

bob wrote:
That is simply _WRONG_.

I've given several examples. There is no "right way" to set up a test so that 10 games will tell you with any sort of usable certainty that A is better or worse than B. Simply no way.

krazyken wrote:
Proof by example? My professors have never let me get away with that.
I suppose it could be possible that BayesELO is using an algorithm that requires a minimum sample size. It is more likely that there are some assumptions inherent in the algorithms that are not being satisfied by some of your samples. Regardless, BayesELO is not the only possible statistical tool. Just because it has trouble with a particular sample doesn't mean we need to throw away all the rest of the statistical tools we have at our disposal and declare the sample worthless.

At present BayesElo is the _best_ tool we have to take game results and produce Elo ratings for comparison. Everyone has been using Elo ratings for comparison since the book was published. Feel free to use any other scheme you want. But to draw conclusions from 10 games is to draw conclusions from essentially random numbers. The sample size is simply too small for any sort of comparison. Entertainment, perhaps. But not useful. And I am not interested in the follow-up that "but if we take those results and add them to lots of others..." as that is a silly argument. Of course 10 games is useful if you play 100 such matches to get 1000 games. But taken by themselves, they are not useful. Yet that is how they are being reported and used.
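To make the sample-size point concrete, here is a rough sketch of how the uncertainty in a match score shrinks with the number of games. It is not BayesElo's method (BayesElo uses a Bayesian model with a prior, so its error bars will differ); it is a plain normal approximation on the mean score, converted to Elo with the standard logistic formula. The function names and the assumption that games are independent are mine, for illustration only.

```python
import math

def elo_from_score(score):
    """Elo difference implied by an expected score under the logistic Elo model."""
    return -400.0 * math.log10(1.0 / score - 1.0)

def score_interval(score, draw_rate, games, z=1.96):
    """Approximate 95% interval for the mean score (normal approximation).

    Assumes independent games with fixed win/draw/loss probabilities.
    """
    win = score - draw_rate / 2.0
    loss = 1.0 - win - draw_rate
    # Per-game variance of the result, which is 1, 0.5 or 0 points.
    var = (win * (1.0 - score) ** 2
           + draw_rate * (0.5 - score) ** 2
           + loss * score ** 2)
    half_width = z * math.sqrt(var / games)
    return score - half_width, score + half_width

# Figures taken from the two tables quoted above: the 16-game line is
# Fruit's score, the 7782-game line is Crafty's score.
for games, score, draws in [(16, 0.66, 0.31), (7782, 0.56, 0.26)]:
    lo, hi = score_interval(score, draws, games)
    print(games, "games: Elo range",
          round(elo_from_score(lo)), "to", round(elo_from_score(hi)))
```

Under these assumptions the 16-game interval straddles a 50% score (so either engine could be the stronger one), while the 7782-game interval does not.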

krazyken · Post by **krazyken** » Tue Jun 30, 2009 7:51 am


The strange thing is that "essentially random" numbers will show a pattern. If you take a number of random selections of 10 games out of your 32000 games, you will find that the scores from those random selections are approximately normally distributed. Obviously you can't know where on the distribution one particular sample will lie, and the safest bet is to discard it. Because these samples are approximately normally distributed, there is a good probability that most samples will lie close to the truth. Obviously increasing the size of the sample reduces the variation. Yes, people who use the results of 10 games as their sole basis for a conclusion are not really concerned with reality, but discouraging people from posting their results is far worse.
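One way to put numbers on "how often is a short match right" is to compute the outcome distribution directly. The sketch below assumes independent games with fixed win/draw/loss probabilities taken from the 7782-game result quoted in this thread (56% score with 26% draws, which gives win = 0.43, draw = 0.26, loss = 0.31 for the stronger engine). The function name is hypothetical, and real engine games are not perfectly independent, so treat this as an illustration rather than a verdict.

```python
from math import comb

def match_outcome_probs(games, win, draw):
    """Return (ahead, tied, behind) probabilities for the stronger engine.

    Enumerates every (wins, draws, losses) split of a match of `games`
    games and adds up the multinomial probability of each split.
    """
    loss = 1.0 - win - draw
    ahead = tied = behind = 0.0
    for w in range(games + 1):
        for d in range(games - w + 1):
            l = games - w - d
            prob = (comb(games, w) * comb(games - w, d)
                    * win ** w * draw ** d * loss ** l)
            if w > l:
                ahead += prob
            elif w == l:
                tied += prob
            else:
                behind += prob
    return ahead, tied, behind

for n in (10, 16, 100):
    ahead, tied, behind = match_outcome_probs(n, win=0.43, draw=0.26)
    print(f"{n:3d} games: ahead {ahead:.3f}  tied {tied:.3f}  behind {behind:.3f}")
```

The "ahead" column is the chance that the match at least points in the right direction; it says nothing about how far the measured Elo difference is from the true one.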

nevatre · Post by **nevatre** » Tue Jun 30, 2009 1:48 pm

I am not sure I agree with that formula

This is what I reckon...

The probability of a run of 10 wins in a match of 1000 games is about 38.5%, if the probability of a win is 0.5. That is larger than I guessed.

With equally matched programs and 15% draws, so prob(win) = (1 - 0.15)/2 = 0.425, or roughly 0.4, the probability of a run of 10 wins in 1000 games is about 10.4%.

The probability of a run of 10 wins in a match of 100 games is about 4.4%, if the probability of a win is 0.5.

With equally matched programs and 15% draws, so prob(win) = 0.425, or roughly 0.4, the probability of a run of 10 wins in 100 games is about 1.0%.

krazyken · Post by **krazyken** » Tue Jun 30, 2009 6:03 pm

What formula are you suggesting?

nevatre · Post by **nevatre** » Tue Jun 30, 2009 6:21 pm

Kenny,

I calculated the probabilities in two ways.

The first is a simulation, just generating a lot of 1000 game sequences and counting the number of runs of 10 wins.

The second is using a recurrence relation

P[N] = P[N-1] + (1 - P[N-R-1]) (1-p) p^R

where p is the probability of a win, R is the length of a run (R=10), and P[N] is the probability of a run of length R in N games. I set P[N]=0 for N=0,...,9 and P[10]=p^R, then use the recurrence to calculate P[N] for N>10.

The recurrence comes from considering that a run of R in N games can occur in two mutually exclusive ways: a run of R in the first N-1 games (hence P[N-1] on the RHS of the recurrence), or no run of R in the first N-R-1 games, followed by a non-win and then R wins (hence the other term on the RHS).

I think it is not easy to write a useful formula for P[N] directly.
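Both approaches described above can be sketched in a few lines. This is an illustrative reimplementation, not nevatre's own code: the function names, the trial count, and the fixed random seed are assumptions. A draw or a loss both count as a "non-win" that breaks the run.

```python
import random

def run_probability(n_games, run_len, p_win):
    """P(at least one run of run_len consecutive wins in n_games games).

    Uses P[N] = P[N-1] + (1 - P[N-R-1]) * (1-p) * p^R with P[N] = 0 for
    N < R and P[R] = p^R.
    """
    prob = [0.0] * (n_games + 1)
    if n_games >= run_len:
        prob[run_len] = p_win ** run_len
    # Probability of "a non-win followed by run_len wins".
    tail = (1.0 - p_win) * p_win ** run_len
    for n in range(run_len + 1, n_games + 1):
        prob[n] = prob[n - 1] + (1.0 - prob[n - run_len - 1]) * tail
    return prob[n_games]

def simulate(n_games, run_len, p_win, trials, seed=0):
    """Monte Carlo estimate of the same probability."""
    rng = random.Random(seed)
    hits = 0
    for _ in range(trials):
        streak = 0
        for _ in range(n_games):
            streak = streak + 1 if rng.random() < p_win else 0
            if streak >= run_len:
                hits += 1
                break
    return hits / trials

for n_games, p_win in [(1000, 0.5), (1000, 0.425), (100, 0.5), (100, 0.425)]:
    exact = run_probability(n_games, 10, p_win)
    approx = simulate(n_games, 10, p_win, trials=2000)
    print(f"N={n_games:4d} p={p_win}: recurrence {exact:.4f}, simulation {approx:.4f}")
```

The recurrence costs O(N) arithmetic operations, so it is the cheaper of the two; the simulation is mainly useful as an independent cross-check.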

krazyken · Post by **krazyken** » Tue Jun 30, 2009 7:34 pm


I used a simple setup of counting:
say for a run of 10 in 100
there are 91 possible places in the sequence for the run to occur:
(0)R(90)
(1)R(89)
(2)R(88)
...
(88)R(2)
(89)R(1)
(90)R(0)

Since we don't care what the other 90 spots contain, the probability of filling any of them acceptably is 1. R is a fixed sequence of 10 wins, so that probability is p^10. So it is simply a case of multiplying the number of possibilities by the probability of each possibility. This agrees with your calculations where p = .4, but your p = .5 doesn't match up.
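The counting argument can be checked against brute-force enumeration on a match short enough to list every sequence. Summing p^R over all start positions counts a sequence once for every position at which it contains a run (a run of 11 wins is counted twice, for example), so the total is an upper bound on the probability rather than the probability itself, and it can even exceed 1. The sketch below uses hypothetical function names and a deliberately tiny case (12 games, runs of 3) so that all 4096 sequences can be enumerated; "L" stands for any non-win.

```python
from itertools import product

def counting_estimate(n_games, run_len, p_win):
    """(number of start positions) * p^R, as in the counting argument above."""
    return (n_games - run_len + 1) * p_win ** run_len

def exact_by_enumeration(n_games, run_len, p_win):
    """Sum the probability of every win/non-win sequence containing a run."""
    target = "W" * run_len
    total = 0.0
    for seq in product("WL", repeat=n_games):
        s = "".join(seq)
        if target in s:
            wins = s.count("W")
            total += p_win ** wins * (1.0 - p_win) ** (n_games - wins)
    return total

for p_win in (0.5, 0.4):
    est = counting_estimate(12, 3, p_win)
    exact = exact_by_enumeration(12, 3, p_win)
    print(f"p={p_win}: counting estimate {est:.4f}, exact {exact:.4f}")
```

For p = 0.5 the counting estimate here is 10 * 0.125 = 1.25, which cannot be a probability. The two figures get closer as p^R shrinks, because overlapping or repeated runs become rare.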

bob · Post by **bob** » Tue Jun 30, 2009 7:36 pm


I didn't discourage him from posting his results. I discouraged drawing any conclusions from the results that were posted.

krazyken · Post by **krazyken** » Tue Jun 30, 2009 7:55 pm

bob wrote:
I didn't discourage him from posting his results. I discouraged drawing any conclusions from the results that were posted.

You are right, it wasn't you, but this discussion was started because someone else posted discouraging remarks, and it may all just be a semantics issue. I know the discussion helped clear up a few things for me along the way. I've even had time to read the Hunter paper on MM algorithms. BayesELO is a very nice tool, but it is not always the best way to determine if A is better than A'. You could probably get more accurate results with smaller numbers with a simple Student's t-test.

nevatre · Post by **nevatre** » Tue Jun 30, 2009 9:19 pm

I think the probability must be larger than you have calculated. There are other longer sub-sequences containing a run of 10, there is a chance of two runs of 10 separated by a draw, for example, or a single run of 11, and so on. All those other arrangements should be counted because the whole sequence of 1000 games would still contain a run of 10.

krazyken · Post by **krazyken** » Tue Jun 30, 2009 9:54 pm

nevatre wrote:I think the probability must be larger than you have calculated. There are other longer sub-sequences containing a run of 10, there is a chance of two runs of 10 separated by a draw, for example, or a single run of 11, and so on. All those other arrangements should be counted because the whole sequence of 1000 games would still contain a run of 10.

Well, because the positions outside the run could contain anything, they certainly could contain more wins, or be all wins, but they don't change the truth of at least 1 run existing in that series.

If the probability of a win is .5, then there is a 96.7% chance of having a run of 10 wins in 1000 games. Bob tells me his probability of a draw is 22%, so I use .39 as the probability of a win and get about 8%. In the first probability I calculated, I overestimated the draw rate at 33%, which came close to 1%.
