Engine Testing - Statistics

UncombedCoconut · Post by **UncombedCoconut** » Sun Jan 17, 2010 12:53 pm

Edmund wrote:Maybe I am doing something wrong here, but I am still not getting quite the same values. eg n=100; alpha=5% --> should be 57.5%

I guess it is a bit complicated. To match the table's 5% column I translate as follows: First, since that page discusses 1-sided tests and my program wants to do 2-sided tests, I tell my program alpha=0.1. (5% probability of mistakenly saying program A is stronger, 5% probability of mistakenly saying program B is stronger.)
Second, I set cutoff = 0 each round and look at the largest cutoff allowed each round. (The table assumes you don't terminate early, so that's the same as terminating if score<0.)
Finally, I translate the reported max-cutoff values: if the program says "h" half-points, then the winning percentage must be >= h/(2g) to continue, hence > (2g-h)/(2g) to conclude superiority, hence >=(2g-h+0.5)/(2g) to conclude superiority. This last number should match those in Mr. Ciarrochi's table.

Edmund wrote:Furthermore, would you elaborate on the "Optimize for very different strength." feature, I don't quite understand that.

It just means, insert a chance of early termination whenever it is possible (i.e., doesn't push the entire test's probability of false positive over the requested alpha). These insertions are likely to pay off if one program is far superior, and to hurt otherwise. It's not a practical way to design the superiority test, but it and the opposite approach (constant # of games with no early termination rules) are the simplest to code.

Edmund wrote:Great idea! I didn't think of that. The problem I see though is that you may worsen rounding errors by that. Do you have any ideas on how to travers the permutations most efficiently given that win is changed at last (to generate the output) and to minimize the rounding errors (so maybe go from biggest to smallest)

Sure, rounding errors could happen. I wouldn't worry much because none of the operations cancel significant digits. Furthermore you do the same number of ops regardless of the order of recursion (since you increment w, d, and l the same # of times). So I'd make an arbitrary choice (and also memoize results if my goal was to output a table), then compare answers after changing doubles to floats in the program to gauge the roundoff errors.

Edmund wrote:Back to the original problem; any ideas on how we could use these new findings to answer the question on when to stop without the need of a simulation.

Well, now we have (assuming my code doesn't have more nasty bugs) a huge family of statistically sound tests, some more aggressive than others about stopping early. To choose between them you'd need to guess how probable each ELO difference is, since that determines the average # of games these tests require. I haven't done this yet, but I'd start with a very crude guess and see whether you can even get a significant improvement in average testing time vs the standard method of running a fixed # of games.

bob · Post by **bob** » Sun Jan 17, 2010 5:14 pm

mcostalba wrote:
bob wrote: Wasn't arguing with him. Was arguing _with_ him and against the idea of stopping when you like or don't like the result.
I think you cannot state something like this without doing the math.

An easier way, for people that is not able to produce the needed statistical formula (almost everybody in this forum I would guess , perhaps only Rémi Coulom has the theoretical background to be able to write some formulas) is to use a game simulator and plot the corresponding table. Also this is not trivial because you have to ask the correct questions to your simulator.

Admittedly I didn't think a lot on this and I would need a bit of time to come up with a proper simulation procedure, but I see this task doable by me, while instead deduce the statistical formulas is beyond my skills.

P.S: Of course just stating "can be done" or "cannot be done" is pure handwaving, I would think we all agree on this point

The math is not complex and can be found in lots of books. My best example is the game of blackjack or 21. Typical game has a player advantage of -0.5% (house wins about 1/2 of 1% of every bet made). If you count cards, you can swing this to a +1.0-1.5% advantage for the player. But with an edge that thin, and with cards that are still random, your "bankroll" will slowly trend upward since you do have a real advantage. But it goes up and down, sometimes slowly, sometimes wildly, but averaging about +1% of every bet played. If you take the non-card-counter, he _still_ sees the up and down swings. If you use the old voodoo approach to playing these kinds of games, which only rely on money management, such as progressive betting schemes (the old double your bet if you lose and you are guaranteed to win one betting "unit" is an example. Yes such a scheme will work. If your bankroll is unlimited. But as a card counter, I have lost 20 consecutive hands. If you start off with a $10 bet, after 20 losses you need to bet $10,000,000.00. Do you have that much? Probably doesn't matter since the table maximum is not going to be much beyond $5,000 to $10,000 in Vegas. This is a so-called Martingale progression and it does _not_ work in the real world for the reasons given, although it passes the "logic test" when real constraints such as total bankroll and table limits are ignored. )

You are doing the same thing with the Elo information. It goes up and down, as the data I posted clearly shows. If you stop when it looks good, the test is meaningless. yes, if you totally bust the code it will plummet like a rock and you could stop quickly when the results look horrible. But to make the test statistically valid, the "stop point" has to be something set _before" the test starts, not something that is based on the test results themselves while the match is "in progress".

That said, one can compute how bad things might get, worse case, or best case, and choose to stop if the results drop below this threshold, or if they go above. But you have to be _sure_ you are willing to accept the resulting inaccuracy. Simply makes more sense to have a fixed stop point set before you start. Anything else is based on a fallacy that is well-known but not well-understood. Sudden up-swings or down-swings are expected. And they can't be what you use to measure overeall results.

BubbaTough · Post by **BubbaTough** » Sun Jan 17, 2010 5:20 pm

bob wrote:
mcostalba wrote:
bob wrote: Wasn't arguing with him. Was arguing _with_ him and against the idea of stopping when you like or don't like the result.
I think you cannot state something like this without doing the math.

An easier way, for people that is not able to produce the needed statistical formula (almost everybody in this forum I would guess , perhaps only Rémi Coulom has the theoretical background to be able to write some formulas) is to use a game simulator and plot the corresponding table. Also this is not trivial because you have to ask the correct questions to your simulator.

Admittedly I didn't think a lot on this and I would need a bit of time to come up with a proper simulation procedure, but I see this task doable by me, while instead deduce the statistical formulas is beyond my skills.

P.S: Of course just stating "can be done" or "cannot be done" is pure handwaving, I would think we all agree on this point
The math is not complex and can be found in lots of books. My best example is the game of blackjack or 21. Typical game has a player advantage of -0.5% (house wins about 1/2 of 1% of every bet made). If you count cards, you can swing this to a +1.0-1.5% advantage for the player. But with an edge that thin, and with cards that are still random, your "bankroll" will slowly trend upward since you do have a real advantage. But it goes up and down, sometimes slowly, sometimes wildly, but averaging about +1% of every bet played. If you take the non-card-counter, he _still_ sees the up and down swings. If you use the old voodoo approach to playing these kinds of games, which only rely on money management, such as progressive betting schemes (the old double your bet if you lose and you are guaranteed to win one betting "unit" is an example. Yes such a scheme will work. If your bankroll is unlimited. But as a card counter, I have lost 20 consecutive hands. If you start off with a $10 bet, after 20 losses you need to bet $10,000,000.00. Do you have that much? Probably doesn't matter since the table maximum is not going to be much beyond $5,000 to $10,000 in Vegas. This is a so-called Martingale progression and it does _not_ work in the real world for the reasons given, although it passes the "logic test" when real constraints such as total bankroll and table limits are ignored. )

You are doing the same thing with the Elo information. It goes up and down, as the data I posted clearly shows. If you stop when it looks good, the test is meaningless. yes, if you totally bust the code it will plummet like a rock and you could stop quickly when the results look horrible. But to make the test statistically valid, the "stop point" has to be something set _before" the test starts, not something that is based on the test results themselves while the match is "in progress".

That said, one can compute how bad things might get, worse case, or best case, and choose to stop if the results drop below this threshold, or if they go above. But you have to be _sure_ you are willing to accept the resulting inaccuracy. Simply makes more sense to have a fixed stop point set before you start. Anything else is based on a fallacy that is well-known but not well-understood. Sudden up-swings or down-swings are expected. And they can't be what you use to measure overeall results.

I agree the math is quite doable by many here given an introduction to statistics book. The tricky part is not the equations but the assumptions. I am not positive I follow your analogy though, what are we trying to predict? Whether our bankroll will be positive after Y hands?

-Sam

zamar · Post by **zamar** » Sun Jan 17, 2010 5:33 pm

bob wrote: That said, one can compute how bad things might get, worse case, or best case, and choose to stop if the results drop below this threshold, or if they go above. But you have to be _sure_ you are willing to accept the resulting inaccuracy. Simply makes more sense to have a fixed stop point set before you start. Anything else is based on a fallacy that is well-known but not well-understood. Sudden up-swings or down-swings are expected. And they can't be what you use to measure overeall results.

I think that most of us in here agree what you are saying, but I want to express things from a different point of view:

* As an inexperienced programmer I _often_ write patches which are extremely bad. In Stockfish development we have a rule that each patch which gets accepted must be tested in 1000 games match versus original.

* Now if after 150 games result is Mod - Orig: 50 - 100, there is no point of continuing anymore, patch is completely garbage. But I'd like to get scientifically valid condition when to stop.

* Another example is that if after 1000 games result is +30 elo (which is rear, but sometimes do happen!), there is no need for further games. But if it is only say +6elo, there might be need for another 1000 games match. But I'd really like to know what happens to error bars in this case. Because now it's question about conditional probability I cannot use the usual formula for the error bars.

In short this comes down to question: We want to commit patches which are improvements and discard the rest. We want the success rate to be X% (for example 90%). What is the optimal testing strategy when you want to minimize the number of games?

mcostalba · Post by **mcostalba** » Sun Jan 17, 2010 6:33 pm

bob wrote: But you have to be _sure_ you are willing to accept the resulting inaccuracy.

It seems not natural for you to accept that _any_ match result has an intrinsic accuracy that can be big or small, but there is always. Perhaps is due to the fact that you are used with the cluster

In you reasoning you use words like " test is meaningless" or something like this but I think you missed the point that in statistic you have to accept a _finite_ accuracy.

What we are asking is for a method to detect with a given accuracy of say 95%, when early stopping if engine A seems much lower/higher then engine B.

Please, focus on the fact that we don't want an oracle answer, but we accept answer could be false (in 5% of cases).

BTW sequential probability ( http://en.wikipedia.org/wiki/Sequential ... ratio_test ) is not easy and you don't get it reading some introductionary statistical book.

Sven · Post by **Sven** » Sun Jan 17, 2010 7:29 pm

zamar wrote:
bob wrote:That said, one can compute how bad things might get, worse case, or best case, and choose to stop if the results drop below this threshold, or if they go above. But you have to be _sure_ you are willing to accept the resulting inaccuracy. Simply makes more sense to have a fixed stop point set before you start. Anything else is based on a fallacy that is well-known but not well-understood. Sudden up-swings or down-swings are expected. And they can't be what you use to measure overeall results.
I think that most of us in here agree what you are saying, but I want to express things from a different point of view:

* As an inexperienced programmer I _often_ write patches which are extremely bad. In Stockfish development we have a rule that each patch which gets accepted must be tested in 1000 games match versus original.

* Now if after 150 games result is Mod - Orig: 50 - 100, there is no point of continuing anymore, patch is completely garbage. But I'd like to get scientifically valid condition when to stop.

* Another example is that if after 1000 games result is +30 elo (which is rear, but sometimes do happen!), there is no need for further games. But if it is only say +6elo, there might be need for another 1000 games match. But I'd really like to know what happens to error bars in this case. Because now it's question about conditional probability I cannot use the usual formula for the error bars.

In short this comes down to question: We want to commit patches which are improvements and discard the rest. We want the success rate to be X% (for example 90%). What is the optimal testing strategy when you want to minimize the number of games?

You could calculate the LOS matrix using BayesElo every 50 games or so (EDIT: for *all* games up to that point, of course, so the games included in LOS matrix grow by time), and see how the LOS develops for the modified version. Then next time you define a goal like "I stop as soon as the LOS is >= 95% (98%, 99% ?) for one of the versions" and run your test.

However, the bad news is that nobody can tell you whether reaching an LOS of >= 95% after, say, 150 games means that adding more games will never let LOS fall below these 95% again. You only know a *likelihood*, not the truth.

Make the experiment by yourself. Play a 1000 games match between two versions, calculate ELO and error bars (or even LOS), then split the 1000 games that have been played into 10 groups of 100 subsequent games and repeat the same calculations for all of these groups (EDIT: or for the first 100, the first 200, 300, ...). Then look at the results. Bob did something like that, you may be surprised about how the scores can be different between these groups.

Sven

bob · Post by **bob** » Sun Jan 17, 2010 7:40 pm

BubbaTough wrote:
bob wrote:
mcostalba wrote:
bob wrote: Wasn't arguing with him. Was arguing _with_ him and against the idea of stopping when you like or don't like the result.
I think you cannot state something like this without doing the math.

An easier way, for people that is not able to produce the needed statistical formula (almost everybody in this forum I would guess , perhaps only Rémi Coulom has the theoretical background to be able to write some formulas) is to use a game simulator and plot the corresponding table. Also this is not trivial because you have to ask the correct questions to your simulator.

Admittedly I didn't think a lot on this and I would need a bit of time to come up with a proper simulation procedure, but I see this task doable by me, while instead deduce the statistical formulas is beyond my skills.

P.S: Of course just stating "can be done" or "cannot be done" is pure handwaving, I would think we all agree on this point
The math is not complex and can be found in lots of books. My best example is the game of blackjack or 21. Typical game has a player advantage of -0.5% (house wins about 1/2 of 1% of every bet made). If you count cards, you can swing this to a +1.0-1.5% advantage for the player. But with an edge that thin, and with cards that are still random, your "bankroll" will slowly trend upward since you do have a real advantage. But it goes up and down, sometimes slowly, sometimes wildly, but averaging about +1% of every bet played. If you take the non-card-counter, he _still_ sees the up and down swings. If you use the old voodoo approach to playing these kinds of games, which only rely on money management, such as progressive betting schemes (the old double your bet if you lose and you are guaranteed to win one betting "unit" is an example. Yes such a scheme will work. If your bankroll is unlimited. But as a card counter, I have lost 20 consecutive hands. If you start off with a $10 bet, after 20 losses you need to bet $10,000,000.00. Do you have that much? Probably doesn't matter since the table maximum is not going to be much beyond $5,000 to $10,000 in Vegas. This is a so-called Martingale progression and it does _not_ work in the real world for the reasons given, although it passes the "logic test" when real constraints such as total bankroll and table limits are ignored. )

You are doing the same thing with the Elo information. It goes up and down, as the data I posted clearly shows. If you stop when it looks good, the test is meaningless. yes, if you totally bust the code it will plummet like a rock and you could stop quickly when the results look horrible. But to make the test statistically valid, the "stop point" has to be something set _before" the test starts, not something that is based on the test results themselves while the match is "in progress".

That said, one can compute how bad things might get, worse case, or best case, and choose to stop if the results drop below this threshold, or if they go above. But you have to be _sure_ you are willing to accept the resulting inaccuracy. Simply makes more sense to have a fixed stop point set before you start. Anything else is based on a fallacy that is well-known but not well-understood. Sudden up-swings or down-swings are expected. And they can't be what you use to measure overeall results.
I agree the math is quite doable by many here given an introduction to statistics book. The tricky part is not the equations but the assumptions. I am not positive I follow your analogy though, what are we trying to predict? Whether our bankroll will be positive after Y hands?

-Sam

The idea is this. If you start a session, and play until you are one unit ahead, and then quit, you are one unit up. And with the significant variance in the game of 21, the odds of your going up at one particular instant are quite high. And each time you do this, you win one bet. Suppose you play on a $5 min bet table. Every session will end up at +5 bucks. Except for the occasional session where you start off with a loss and continue to play and lose.

The point is that lifetime winnings occurs over a lifetime, counting everything. If you quit because you are ahead, you start off at the same probability of winning the next hand whether you quit and don't play until tomorrow, or you stay at the table and play the next hand. But I know of many gamblers that swear they are winning overall, just by playing the game without any sort of "advantage play" to really give them an actual edge over the house. It is always the "I win more sessions than I lose" idea. If you play 20 sessions, of 500 hands each (3-4-5 hours of play depending on crowd) you play 10,000 hands. If you flat-bet $10 per hand, you then had $100,000 of "action". and the house will keep pretty close to $500 bucks of your money. It doesn't matter whether you change that and play 1 session of 10,000 hands, or 10,000 sessions of one hand each. You end up the same $500 in the hole.

Now if you do as some con-artists do, and play multiple sessions and report the number of times you walked away a winner vs a lower, you can win more sessions, while still losing that same 500 bucks, because you can lose more in one session than you win in all the rest. But if you claim "more winning sessions than losing sessoins" a "noob" will buy that book, which is the way the author gets rich, not by playing the game with his losing system. And the "noob" will verify that he wins more sessions than he loses, all the time asking himself "where in the devil is my money going?"

If you test and when you like the result, you stop, all you did was find a place where the god of variance was kind to you, and reach the conclusion you wanted to reach. If you play toward some set goal (say N thousand games, then you ignore that variance, and the more games you play, the smaller that variance becomes.

If you look at the data I posted the other day, after a thousand games, I could have stopped, and been "sure" that this version was better, because the variance had taken the Elo up quite a bit. It did come down to reality, given enough games as you could see.

bob · Post by **bob** » Sun Jan 17, 2010 7:46 pm

zamar wrote:
bob wrote: That said, one can compute how bad things might get, worse case, or best case, and choose to stop if the results drop below this threshold, or if they go above. But you have to be _sure_ you are willing to accept the resulting inaccuracy. Simply makes more sense to have a fixed stop point set before you start. Anything else is based on a fallacy that is well-known but not well-understood. Sudden up-swings or down-swings are expected. And they can't be what you use to measure overeall results.
I think that most of us in here agree what you are saying, but I want to express things from a different point of view:

* As an inexperienced programmer I _often_ write patches which are extremely bad. In Stockfish development we have a rule that each patch which gets accepted must be tested in 1000 games match versus original.

* Now if after 150 games result is Mod - Orig: 50 - 100, there is no point of continuing anymore, patch is completely garbage. But I'd like to get scientifically valid condition when to stop.

* Another example is that if after 1000 games result is +30 elo (which is rear, but sometimes do happen!), there is no need for further games. But if it is only say +6elo, there might be need for another 1000 games match. But I'd really like to know what happens to error bars in this case. Because now it's question about conditional probability I cannot use the usual formula for the error bars.

In short this comes down to question: We want to commit patches which are improvements and discard the rest. We want the success rate to be X% (for example 90%). What is the optimal testing strategy when you want to minimize the number of games?

Enough games so that you can separate the two programs reliably, by driving the SD down to a small enough value that you can be confident (to some extent) that the one with the larger Elo is actually better. In a 1000 game match, 30 elo swings are certainly possible. I believe I had that in the results I posted in fact, where at one point it was +20-something better than the best, and after a couple of thousand games it was barely better, and after 40K games it was 10 or so worse than the best so far.

You have only two choices. Vary the number of games which will determine the confidence in your result. Or establish a specific confidence you want and let that dictate the number of games you want. Note that 90% is pretty "iffy". one of ten changes will be mis-evaluated, which is pretty high. Even 95% causes one in twenty to produce wrong results.

BubbaTough · Post by **BubbaTough** » Sun Jan 17, 2010 7:51 pm

bob wrote: The idea is this. If you start a session, and play until you are one unit ahead, and then quit, you are one unit up. And with the significant variance in the game of 21, the odds of your going up at one particular instant are quite high. And each time you do this, you win one bet. Suppose you play on a $5 min bet table. Every session will end up at +5 bucks. Except for the occasional session where you start off with a loss and continue to play and lose.

The point is that lifetime winnings occurs over a lifetime, counting everything. If you quit because you are ahead, you start off at the same probability of winning the next hand whether you quit and don't play until tomorrow, or you stay at the table and play the next hand. But I know of many gamblers that swear they are winning overall, just by playing the game without any sort of "advantage play" to really give them an actual edge over the house. It is always the "I win more sessions than I lose" idea. If you play 20 sessions, of 500 hands each (3-4-5 hours of play depending on crowd) you play 10,000 hands. If you flat-bet $10 per hand, you then had $100,000 of "action". and the house will keep pretty close to $500 bucks of your money. It doesn't matter whether you change that and play 1 session of 10,000 hands, or 10,000 sessions of one hand each. You end up the same $500 in the hole.

Now if you do as some con-artists do, and play multiple sessions and report the number of times you walked away a winner vs a lower, you can win more sessions, while still losing that same 500 bucks, because you can lose more in one session than you win in all the rest. But if you claim "more winning sessions than losing sessoins" a "noob" will buy that book, which is the way the author gets rich, not by playing the game with his losing system. And the "noob" will verify that he wins more sessions than he loses, all the time asking himself "where in the devil is my money going?"

If you test and when you like the result, you stop, all you did was find a place where the god of variance was kind to you, and reach the conclusion you wanted to reach. If you play toward some set goal (say N thousand games, then you ignore that variance, and the more games you play, the smaller that variance becomes.

If you look at the data I posted the other day, after a thousand games, I could have stopped, and been "sure" that this version was better, because the variance had taken the Elo up quite a bit. It did come down to reality, given enough games as you could see.

I understand the fallacy you are referring to, and that many gamblers fall into. I am just not sure the analogy is fitting just right.

Is the "bankroll" equivalent to ELO?
Is the "stopping after you are down $500" equivalent to stopping your test when you are down a certain number of games?
Is the "lifetime" equivalent to playing an infinity number of games in your test?

I guess its a little silly for me to nitpick the analogy. I think what is throwing me is that, generally speaking in gambling, you figure out your odds with math, and deduce how control your bankroll based on that. If I am understanding the analogy correctly, in computer chess you are kind of doing the opposite. You are watching your results, and trying to deduce the math.

-Sam

bob · Post by **bob** » Sun Jan 17, 2010 7:54 pm

mcostalba wrote:
bob wrote: But you have to be _sure_ you are willing to accept the resulting inaccuracy.
It seems not natural for you to accept that _any_ match result has an intrinsic accuracy that can be big or small, but there is always. Perhaps is due to the fact that you are used with the cluster

In you reasoning you use words like " test is meaningless" or something like this but I think you missed the point that in statistic you have to accept a _finite_ accuracy.

No argument there. But you are apparently not seeing how _big_ that "inaccuracy" is when you talk of a hundred or a thousand games. That's the problem I was explaining.

What we are asking is for a method to detect with a given accuracy of say 95%, when early stopping if engine A seems much lower/higher then engine B.

There are tables to do this and you can compute them. The wider the "gap" between two programs, the fewer games you need. The closer they are, the more games you need. So there isn't any one number. It would be a 2d matrix with accuracy on one axis, difference in Elo between A and B on the other. And the matrix value would be the number of games needed to trust that result with whatever accuracy you want. But 100 game matches are not useful at all unless you absolutely wreck the program with a horrible change. In thinking back over the past 2 years of improving Crafty, I can't remember any +20 changes we made. Many were 1-2-3 Elo changes which take a large number of games. Every now and then we'd get an 8-10-12 Elo boost, which still needs a large number of games. Many changes show up as zero improvement, and would require millions of games to accurately measure. For those, we use intuition. If the change seems to be good in theory, and it hurts nothing with respect to Elo, then we will keep it, assuming that the gain is too small to measure in reasonably finite time.

Please, focus on the fact that we don't want an oracle answer, but we accept answer could be false (in 5% of cases).

The data I produce here is, according to Remi, based on 95% confidence. That takes a ton of games, not a hundred or two. BayesElo gives you the rating confidence interval directly.

BTW sequential probability ( http://en.wikipedia.org/wiki/Sequential ... ratio_test ) is not easy and you don't get it reading some introductionary statistical book.

No idea. But you do get it by visiting any of the online blackjack forums and start to discuss "progressive betting strategies". There are world-class statisticians which will be quite happy to explain why the idea is broken. It happens every day.

Engine Testing - Statistics

Re: Engine Testing - Statistics

Re: Engine Testing - Statistics

Re: Engine Testing - Statistics

Re: Engine Testing - Statistics

Re: Engine Testing - Statistics

Re: Engine Testing - Statistics

Re: Engine Testing - Statistics

Re: Engine Testing - Statistics

Re: Engine Testing - Statistics

Re: Engine Testing - Statistics