Two people have independently told me that, although they read this forum, the various threads about testing have grown too long and heated for them to keep up with. For such weary readers, let me summarize what has happened so far. My secondary motive is to clarify what I think my contribution to the discussion has been so far, because recently that has become a bone of contention as much as how testing suites should be run. The quick recap:
bob said, "These test runs show that nobody's engine testing has the significance we pretend it does."
hgm said, "The problem is you. Your games can't be independent."
bob said, "It is impossible that the results are dependent. Name one way they could be."
hgm said, "How should I know why your tests are dependent? Could be not randomizing CPU, leaving ponder on, not clearing hash table, etc. Give me the PGN's and other data so I can tell you how your test is broken."
bob said, "None of you explanations are true."
hgm said, "Your games can't be independent."
bob said, "Stampee foot."
various said, "Could we contribute here?"
hgm(bob), "Sure, but first note that bob(hgm) is a total idiot."
sven said, "Could BayesElo be the problem?"
remi said, "No."
karl said, "The games can be mathematically dependent without being causally dependent. Why are we poking around in corners for the cause of dependence when there is an obvious source? If we play the same position sixty-four times with the same opponents at the same time control we can expect correlated results."
hgm said, "That's probably not the source of dependence."
martin said, "Let's test it."
karl said, "Martin's test results prove I was right all along"
bob said, "What karl said proves I was right all along."
hgm said, "What karl said proves I was right all along."
And that's where we stand today. We are one big happy family, united in pursuit of the truth!
Testing thread summary for the weary
-
- Posts: 28356
- Joined: Fri Mar 10, 2006 10:06 am
- Location: Amsterdam
- Full name: H G Muller
Re: Testing thread summary for the weary
Good summary!
I want to comment on the point where it ascribes to me the statement:
"That's probably not the source of dependence."
I would agree with that if it said "the observed dependence in Bob's first two runs".
This was a very particular kind of dependence, between the games within a run, while apparently lacking (or being far smaller) between games in different runs. Dependence (correlation) between games within a run drives up the standard deviation of the run results, while correlation between games of different runs reduces that variation. If all games, within and between runs, are equally correlated, the effects exactly cancel.
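To spell out that cancellation in the simplest model: suppose every game result has variance σ², games within the same run are pairwise correlated with coefficient ρ_w, games of different runs with ρ_b, and a run result is the mean of n games. Then, writing X̄ and Ȳ for the results of two runs,

```latex
\mathrm{Var}(\bar X) = \frac{\sigma^2}{n}\bigl[1 + (n-1)\rho_w\bigr], \qquad
\mathrm{Cov}(\bar X,\bar Y) = \rho_b\,\sigma^2,

\mathrm{Var}(\bar X - \bar Y)
  = \frac{2\sigma^2}{n}\bigl[1 + (n-1)\rho_w\bigr] - 2\rho_b\,\sigma^2
  \;=\; \frac{2\sigma^2}{n}\,(1-\rho) \quad\text{when } \rho_w = \rho_b = \rho .
```

The (n-1)ρ_w term is what inflates the scatter of run results, and the between-run covariance subtracts almost exactly the same amount, so for equal (and small) ρ the run-to-run scatter looks essentially like that of independent games. (This is only a sketch with one common correlation coefficient, of course; the real correlation structure of the games is exactly what is in dispute.)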
The source of dependence that you pointed out is one that I also pointed out 8 months ago, and again at the very beginning of the first thread, where Bob dismissed it with:

bob wrote: BTW the stuff about the 40 positions, or the 5 opponents, is way beside the point. No matter what 40 positions I choose (and it would seem to me that the smaller the number, the more stable the results), I ought to be able to produce a stable answer about those 40 positions; whether or not that carries over to other positions depends on how "general" those positions are (and the Silver positions are pretty generic/representative of opening positions).

That source of dependence, however, correlates games of different runs with each other just as much as it correlates games within the same run, and can thus not be the source of any observed hyper-variability of run results. Which was what the discussion was all about. That the results would be crap anyway, even if they had converged perfectly, had already been known for 8 months, and was, as Bob said, of no interest.
-
- Posts: 819
- Joined: Sat Mar 11, 2006 3:15 am
- Location: Guadeloupe (french caribbean island)
Re: Testing thread summary for the weary
Fritzlein wrote: Two people have independently told me that, although they read this forum, the various threads about testing have grown too long and heated for them to keep up with. For such weary readers, let me summarize what has happened so far.
Karl, I would like to take this opportunity to thank you for your contribution. Your comments were clear and showed a scientific mind at work.
The topic has been hot for me for a long time and guess what... I screwed up by testing repeatedly from the same limited number of positions, repeating matches in the hope that time jitter would magically randomize things and let the truth emerge.
I was definitely wrong and wasted months of computing power. But it's always good to know what you did wrong (oh, and it's not the only thing I have done wrong!).

I would also like to thank Bob for all the time and computing power he has contributed, as well as HGM for his tireless refusals when something looked obviously wrong.
And thank you to all the people who have contributed to this topic in one way or another.
// Christophe
-
- Posts: 20943
- Joined: Mon Feb 27, 2006 7:30 pm
- Location: Birmingham, AL
Re: Testing thread summary for the weary
Only thing that is left is to figure out how many games are necessary. It would be nice to make sure that the large number of positions (and their variety/balance) are a reasonable set, which I will be happy to provide. Then to refine this to determine how many games are needed for a +/- 1 Elo comparison, a +/- 10 Elo comparison, and maybe something even bigger, so that a major change can be accepted or rejected more quickly.
bob said, "These test runs show that nobody's engine testing has the significance we pretend it does."
hgm said, "The problem is you. Your games can't be independent."
bob said, "It is impossible that the results are dependent. Name one way they could be."
hgm said, "How should I know why your tests are dependent? Could be not randomizing CPU, leaving ponder on, not clearing hash table, etc. Give me the PGN's and other data so I can tell you how your test is broken."
bob said, "None of you explanations are true."
hgm said, "Your games can't be independent."
bob said, "Stampee foot."
various said, "Could we contribute here?"
hgm(bob), "Sure, but first note that bob(hgm) is a total idiot."
sven said, "Could BayesElo be the problem?"
remi said, "No."
karl said, "The games can be mathematically dependent without being causally dependent. Why are we poking around in corners for the cause of dependence when there is an obvious source? If we play the same position sixty-four times with the same opponents at the same time control we can expect correlated results."
hgm said, "That's probably not the source of dependence."
martin said, "Let's test it."
karl said, "Martin's test results prove I was right all along"
bob said, "What karl said proves I was right all along."
hgm said, "What karl said proves I was right all along."
And that where we stand today. We are one big happy family united in pursuit of the truth!
Karl I would like to take this opportunity to thank you for your contribution. Your comments were clear and showing a scientific mind at work.
The topic has been hot for me for a long time and guess what... I screwed up by testing repeatedly from the same limited number of positions, repeating matches in the hope that time jitter would magically randomize and let the truth emerge.
I was definitely wrong and wasted months of computing power. But it's always good to know what you did wrong (oh and it's not the only thing I have done wrong!).
I would also like to thank Bob for all the time and computing power he has contributed, as well as HGM for his tireless refusals when something looked obviously wrong.
And thank you to all the people who have contributed to this topic in a way or another.
// Christophe
Still a lot of questions, but Karl gave a good explanation of the problem and then made a reasonable suggestion that seems to be panning out...
-
- Posts: 819
- Joined: Sat Mar 11, 2006 3:15 am
- Location: Guadeloupe (french caribbean island)
Re: Testing thread summary for the weary
bob wrote: Only thing that is left is to figure out how many games are necessary. It would be nice to make sure that the large number of positions (and their variety/balance) are a reasonable set, which I will be happy to provide. Then to refine this to determine how many games are needed for a +/- 1 Elo comparison, a +/- 10 Elo comparison, and maybe something even bigger, so that a major change can be accepted or rejected more quickly.
My hope was that once the basis of a sane experiment was stated, it would be easy to compute the margin of error with some (say 95%) confidence from the number of games played.
Then it could be checked experimentally, if you have enough cluster time to contribute.
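For what it is worth, here is a minimal sketch of that computation, assuming independent games, a score near 50% and a given draw rate (plain binomial statistics, nothing specific to BayesElo or to anybody's cluster); it also shows roughly how many games the +/- 1 and +/- 10 Elo comparisons mentioned above would take:

```python
import math

ELO_PER_SCORE = 400.0 / math.log(10.0) / 0.25   # slope of the Elo curve at a 50% score, ~695

def elo_margin_95(n_games: int, draw_rate: float = 0.3) -> float:
    """95% margin of error, in Elo, of the score of n independent games."""
    var_per_game = 0.25 - draw_rate / 4.0        # variance of a single game result (0, 0.5 or 1)
    sigma_score = math.sqrt(var_per_game / n_games)
    return 1.96 * sigma_score * ELO_PER_SCORE

def games_needed(elo_margin: float, draw_rate: float = 0.3) -> int:
    """Number of independent games before the 95% margin shrinks to elo_margin."""
    var_per_game = 0.25 - draw_rate / 4.0
    sigma_needed = elo_margin / (1.96 * ELO_PER_SCORE)
    return math.ceil(var_per_game / sigma_needed ** 2)

print(round(elo_margin_95(40000), 1))   # roughly +/- 3 Elo for a 40,000-game run
print(games_needed(10.0))               # a few thousand games for +/- 10 Elo
print(games_needed(1.0))                # a few hundred thousand games for +/- 1 Elo
```

Correlated games would of course need correspondingly more.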
One suggestion is to use slight time handicaps and to try to measure these handicaps in terms of Elo. From the experimental data of the SSDF and the CCRL, it seems that doubling the thinking time gives 60 to 70 Elo points.
Do we have any better way to control the strength of an engine than using controlled time handicaps? If not, then using time handicaps could be the only way to experimentally check the validity of the testing procedure.
Now that I think about it, maybe a material handicap could be used as well, but it is harder to vary in small increments. And it cannot be used with an existing collection of test positions.
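For reference, assuming the 60-70 Elo per doubling figure above (say 65, though the real number surely depends on the engines and on the absolute time control), a time-handicap ratio would translate into Elo roughly like this:

```python
import math

ELO_PER_DOUBLING = 65.0   # assumed; somewhere in the 60-70 range quoted from SSDF/CCRL data

def handicap_to_elo(time_ratio: float) -> float:
    """Expected Elo advantage of the side that gets time_ratio times as much thinking time."""
    return ELO_PER_DOUBLING * math.log2(time_ratio)

print(round(handicap_to_elo(1.10), 1))   # giving one side 10% more time: about 9 Elo
print(round(handicap_to_elo(2.00), 1))   # a full doubling: 65 Elo by construction
```

Small ratios like 1.05 or 1.10 would then provide calibrated differences of a handful of Elo points, which is exactly the size of effect the testing procedure is supposed to resolve.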
// Christophe
-
- Posts: 28356
- Joined: Fri Mar 10, 2006 10:06 am
- Location: Amsterdam
- Full name: H G Muller
Re: Testing thread summary for the weary
Left for you to figure out, that is. The rest of the world of course already knows this, as they do read my posts. But take your time, and eventually you will get there. (Well, any bets on this? We were looking for dead-cert betting opportunities in the other thread, not?)

bob wrote: Only thing that is left is to figure out how many games are necessary.



Re: Testing thread summary for the weary
I've not read the other threads; they were too long.
The question is: how to test two versions of an engine?
What about methods that eliminate all randomness? Has anybody tried that?
HJ.
Re: Testing thread summary for the weary
You have to do exactly the opposite.

Harald Johnsen wrote: What about methods that eliminate all randomness? Has anybody tried that?
Make sure to have enough (randomly selected) positions, so that the various sides of each engine get used. Only then is the generally used math valid (well, so it seems), because it needs "random".
Tony
-
- Posts: 28356
- Joined: Fri Mar 10, 2006 10:06 am
- Location: Amsterdam
- Full name: H G Muller
Re: Testing thread summary for the weary
If the data is sufficiently random, the results will eventually become as accurate as you want. The bad thing is that it takes an extremely large number of games before it is accurate enough to be useful for highly developed engines that will not experience large Elo jumps from one version to the next (something like 10,000 to 1,000,000 games).
The standard method for reducing the number of needed samples in a high-variance population, when you are looking for small differences, is paired sampling. In this method you eliminate most of the variance of the population by correlating the samples before taking the difference. E.g. if you want to know how much boys grow on average in their 14th year, you could measure N randomly selected boys on their 13th birthday (sample A) and M different, independently randomly selected boys on their 14th birthday (sample B), and then calculate the average of sample A, the average of sample B, and the difference of those two averages.
The error would then be totally dominated by the intrinsic, large variability of how tall boys are, and you would need exceedingly large samples to reduce that sampling noise to a level where it becomes comparable to how much they had grown. On the other hand, if you measure the same set of boys one year later, you could know what you want from a small sample. The averages of the samples at age 13 and 14 would likely deviate very much from the true average height of the population, but as this quantity cancels out exactly, rather than in a statistical-average sense, when you take the difference, you would get a very accurate number for the growth rate, no matter how much smaller it is than the intrinsic variation.
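As a toy illustration of the difference (with made-up numbers, nothing to do with any engine test), here is a small simulation sketch of the two estimators:

```python
import random

random.seed(1)

TRUE_GROWTH = 5.0                       # cm, the small quantity we want to estimate

def height_at_13():
    return random.gauss(160.0, 8.0)     # cm, large spread between boys

def unpaired_estimate(n):
    """Measure n boys at 13 and n *different* boys at 14, subtract the averages."""
    a = [height_at_13() for _ in range(n)]
    b = [height_at_13() + TRUE_GROWTH for _ in range(n)]
    return sum(b) / n - sum(a) / n

def paired_estimate(n):
    """Measure the *same* n boys at 13 and again at 14, average the individual differences."""
    diffs = []
    for _ in range(n):
        h13 = height_at_13()
        h14 = h13 + TRUE_GROWTH + random.gauss(0.0, 0.5)   # small individual variation in growth
        diffs.append(h14 - h13)
    return sum(diffs) / n

def scatter(estimator, n, reps=2000):
    """Standard deviation of the estimator over many repeated experiments."""
    xs = [estimator(n) for _ in range(reps)]
    m = sum(xs) / reps
    return (sum((x - m) ** 2 for x in xs) / reps) ** 0.5

print("unpaired:", round(scatter(unpaired_estimate, 50), 2))
print("paired:  ", round(scatter(paired_estimate, 50), 2))
```

The unpaired estimator's error is dominated by the 8 cm population spread; the paired one only sees the 0.5 cm individual noise, which is the whole point of pairing.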
The problem is that Chess games played with slightly modified engines, even if you could eliminate all sources of randomness by playing from identical positions at a fixed number of nodes (rather than on time), are independent samples: they play completely different games. And the outcome of a Chess game has a high intrinsic variance (it is all or nothing).
My proposal for doing the equivalent of paired sampling, which I made some years ago, was to make sure that each version of the engine thinks about the same positions, for all positions occurring in the game, rather than just the initial position. This leads to the so-called tree-comparison test, which is feasible if the difference between the engines is truly small (so that they deviate only once every so many moves, rather than every move).
The idea is to play both A and A' (the engines you want to compare) against B, from the same starting position. After a few moves, A and A' will finally play a different move. From that point on they would play different games, breaking the pairing and creating variability that couples through into the result difference. So that is not what you do. Instead, you undo the move of A' (but remember it) and force the move A played into A'. Then they are again in the same position, and you continue both games (A-B and A'-B), following the same procedure, as if that new position were the initial position of the game (i.e. recursively). After the games end, you have a list of all the remembered positions and moves where A' wanted to deviate. You then go back to these positions, force the move of A' into A, and repeat the procedure from there (again, recursively). You continue to do this until you have explored both branches from every point where the two engines wanted to deviate.
If the engines only deviate every 10 moves or so, the tree will not be excessively large. (A game lasts perhaps 60 moves on average, so you would have 64 different game ends.) You would of course adjudicate games that develop a decisive advantage for one player, to prevent exploring many alternative game paths with a dead-certain outcome. For every node of this tree you can define the win fraction as the average of the win fractions of its two daughter nodes, while a node where the game ended (rather than branched) would of course get the actual game result as its win fraction. This way you would know, for every branch point, how much the win fraction of the move chosen by A' differs from the win fraction of the move chosen by A, and thus how much better A' did compared to A in terms of win fraction.
Easy as pie. But I haven't tried it yet. Perhaps I should build this as a standard option in WinBoard, next to normal match mode ('-treeMatch'). It would require loading a third Chess program, though. Or (probably better) a way to switch engines while WinBoard is running. And it would only work for comparing engines that are very close; otherwise the tree would get too big. But that is exactly what you need it for, of course.
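To make the win-fraction bookkeeping concrete, here is a minimal sketch of just the tree scoring (purely illustrative; the game playing, move forcing and adjudication, i.e. the hard part, are not shown, and the '-treeMatch' option mentioned above does not exist yet):

```python
from dataclasses import dataclass
from typing import List, Optional

@dataclass
class Node:
    result: Optional[float] = None          # game result vs B (1, 0.5 or 0) if the game ended here
    a_line: Optional["Node"] = None         # continuation after the move A chose
    a_prime_line: Optional["Node"] = None   # continuation after the move A' wanted instead

def win_fraction(node: Node) -> float:
    """The game result at a leaf, otherwise the average of the two daughters."""
    if node.result is not None:
        return node.result
    return 0.5 * (win_fraction(node.a_line) + win_fraction(node.a_prime_line))

def branch_differences(node: Node) -> List[float]:
    """At every branch point, how much better (positive) or worse (negative)
    the line A' wanted scores than the line A actually played."""
    if node.result is not None:
        return []
    here = win_fraction(node.a_prime_line) - win_fraction(node.a_line)
    return [here] + branch_differences(node.a_line) + branch_differences(node.a_prime_line)

# Tiny example: one branch point; the line of A ends in a draw, the line of A' ends in a win.
tree = Node(a_line=Node(result=0.5), a_prime_line=Node(result=1.0))
print(win_fraction(tree))         # 0.75
print(branch_differences(tree))   # [0.5]
```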
-
- Posts: 20943
- Joined: Mon Feb 27, 2006 7:30 pm
- Location: Birmingham, AL
Re: Testing thread summary for the weary
I think that while a reasonable confidence interval can be established, it is going to have to be tested (and most likely adjusted) experimentally, because there is going to be some unexpected correlation in any test set. For example, I didn't do anything to measure the "chess-hamming-distance" (new term, but basically the number of squares that are different between two positions). If you have two positions where the only difference is something minor like a pawn on a2 or a3, then those results will likely show correlation. Since I am not quite sure what the proper method for position selection would be to minimize correlation, sets of starting positions will probably always exhibit some of the extra randomness I have always seen. Note that even these 40,000 game runs produce Elos that have a range of 5-6 or so, which is a bit wide to detect small improvements in a program.
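One literal way to compute such a "chess-hamming-distance", purely to illustrate the idea (whether counting differing squares is the right similarity measure for selecting starting positions is of course another question):

```python
def expand_board(fen: str) -> list:
    """Return the 64 squares of the board part of a FEN string, '.' for an empty square."""
    squares = []
    for rank in fen.split()[0].split('/'):
        for ch in rank:
            if ch.isdigit():
                squares.extend('.' * int(ch))
            else:
                squares.append(ch)
    return squares

def chess_hamming(fen_a: str, fen_b: str) -> int:
    """Number of squares whose contents differ between the two positions."""
    return sum(a != b for a, b in zip(expand_board(fen_a), expand_board(fen_b)))

# The example from the post: the initial position with the a-pawn on a2 versus on a3
# differs on exactly two squares.
start = "rnbqkbnr/pppppppp/8/8/8/8/PPPPPPPP/RNBQKBNR w KQkq - 0 1"
a3    = "rnbqkbnr/pppppppp/8/8/8/P7/1PPPPPPP/RNBQKBNR b KQkq - 0 1"
print(chess_hamming(start, a3))   # 2
```

Positions within a couple of squares of each other could then be pruned from (or at least flagged in) a candidate set of starting positions.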