New testing thread

Re: Correlated data discussion

Post by hgm »

bob wrote:Another one: do not draw conclusions from _one_ partial result. The test has not even finished yet and here you go, drawing a conclusion. The way the test is run, the partial results can be highly misleading due to white/black bias. One does best by waiting until _all_ the data is in.
You think that was a conclusion from your partial result? Actually this is freshman's teaching material, the basics of every experimental science...
Where do I deny that? I am onto _accurate_ measurement.
Oh, man, why don't you wait until you are awake, before you start posting. It says you deny that the game results will practically be the same when you change the node count by 100,000. Or are you denying now that you deny that?

Re: 4 sets of data

Post by bob »

hgm wrote:
bob wrote:No, but if you'd just follow the discussion once in a while,
you would see that reducing the number of games is good enough to test the hypothesis that a round-robin will stabilize the ratings more.
Well, as I pointed out, even two games would be enough to 'test' the hypothesis that 1+1=2. Test whatever trivialities you want, but don't expect me to follow your muddling with interest.
However, would it be possible for either (a) you _do_ follow a specific discussion and post comments related to it or (b) if you choose to not follow context, then also choose to not make comments that have nothing to do with the discussion?

Someone asked me to run the test. It _did_ change the results. I ran 4 runs to see if the variability was lower, higher, or the same.

One does have to have an attention span measured in something longer than milliseconds to be able to carry on discussions here of course. I've never proposed using 800 games for the good/bad testing.
A blatant lie. Even in the previous post you were complaining that the 800-game runs were not good enough for the tiny differences you wanted to measure. "tons of variation", remember? (Yes, hard, I admit. You did say that more than a msec ago...)
I have maintained from the beginning that 800 games is not enough to satisfy my _ultimate_ goal of measuring small differences. Yes. And someone proposed an experiment that I ran with reduced data to get an idea of how the results looked with C vs world and then with a RR format. And I provided the data. And off you go into the wild blue yonder without even understanding the original request that I simply satisfied. And no, it had nothing to do with 1+1 = 2...

But I did suggest using it to test the hypothesis that playing a round robin would help, otherwise the test would take over a week for one run, not a day.

Get that now? Nobody is saying _reduce_ the number of games for real testing. At least not me...
Are you sure you don't mean "FORget it now?" :lol:
Nope. That is what you are continually doing. I am saying _remember_ what the current discussion is about. And one _can_ have multiple simultaneous discussions on slightly different ideas, _and_ keep them all separate and distinct. At least I can.
Easy enough. Because the engines run on "virgin" machines each time. The machines do not cyclically ramp up their clock, then ramp it back down.
Indeed not, but I was not asking about the machines, but about the engines. The engines do have access to the date and time of day, do they not? It would be trivial for me to write an engine that would wreck your testing: Just add the line:
And if someone has done that then most all past testing is flawed, since I am using the same unmodified source that was used in the CCRL, SSDF, etc testing. And btw, that would _not_ corrupt the test and make your "dependency" condition show up. A program would simply play better about 1/2 the time. So even that statistical suggestion would be wrong. A binary decision based on even/odd months is not going to produce any sort of dependency in the data. Would produce variation if your sample interval matched up, but not dependency.


if( timeOfDay.month & 1 ) thinkingTime /= 2;

and that engine would play 70 Elo weaker in odd months than in even months. So if you would make a 25,000-game run with it in June, and repeat that run in July, you would get totally different results. The score would be off by as much as 10%.
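
To spell the booby trap out a bit more, a minimal sketch of such a time-allocation routine might look like the following (the function and variable names, allocate_time, time_left and moves_to_go, are invented for the illustration; only the month test comes from the line above):

#include <stdio.h>
#include <time.h>

/* Hypothetical engine time allocation that secretly depends on the calendar. */
static double allocate_time(double time_left, int moves_to_go)
{
    time_t now = time(NULL);
    struct tm *t = localtime(&now);        /* the engine reads the wall clock */
    double thinking_time = time_left / moves_to_go;

    if ((t->tm_mon + 1) & 1)               /* tm_mon is 0..11, so +1 gives 1..12 */
        thinking_time /= 2;                /* half the thinking time in odd months */

    return thinking_time;
}

int main(void)
{
    printf("allocated: %.1f s\n", allocate_time(300.0, 40));
    return 0;
}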

So are your machines virgin enough that you reset the system time to zero before each run? Safe bet you don't...
No, because the TOD is always synced to NTP sources. You have suggested a way to break the test. If anyone thought that was _really_ the issue, the only flaw with that reasoning is that _all_ the programs would have to be doing that, and in a way that would get in sync with my testing. Month is too coarse when you run a test in a day, except right at the boundary. However, it is certainly easy enough to verify that this doesn't happen, if someone wants to. The source for all the engines _is_ available. But I suspect no one would consider stomping through source to see if _that_ is actually happening. Because if it was, it would show up in other testing as well.
Which would not introduce a dependency anyway. So somehow the act of winning or losing a game on machine X would have to alter the clock on machine X so that the next time a game is played there, the altered clock has to somehow affect (in a consistent way) the outcome of that game.

It is all absolutely poppycock and you (as well as everyone else) knows that is simply not possible. I also don't want to have to prove that sunspots, cosmic rays, close-proximity black holes, gravity waves, etc are introducing dependencies also. So back to the real world. If my cluster produces dependencies, so does everyone else's testing, particularly those just using a paltry single machine or 2 or 4.
Even if that were true (and it does not necessarily have to be true, as the problem might be in one of the engines, and other people might not use that engine), the main point is that others do not care. They do not make 25,000-game runs, so their errors are always dominated by sampling statistics. Systematic errors that are 6-sigma in a 25,000-game run are only 0.6 sigma in a 250-game run, and of no consequence. You could not see them, and they would not affect your testing accuracy.
What on earth does that mean? So you are _now_ going to tell me that you would get more accurate results with 250 games than with 25,000? :) Simply because the SD is now much larger? :) Hint: that is _not_ a good definition of accuracy...
If you want to be able to do precision measurements, you have to work harder to eliminate noise sources. Every physicist knows this.
And if you had been reading, you would have noted that I have attempted to do this. No books, no learning, no pondering, no SMP search. A clean directory with just engine executables that is zapped after each game to avoid hidden files that could contain anything from persistent hash to evaluation weight changes. All we are left with is a pool of engines, using the same time control repeatedly, playing the same positions repeatedly, playing on the same pool of identical processors repeatedly. No, the timing is not exact. But since that is how everyone is testing, it would seem that is something that is going to have to be dealt with as is. Yes, a program can on occasion fail low just because it gets to search a few more nodes due to timing jitter. Yes, that might burn many seconds of CPU time to resolve. Yes, that will change the time per move for the rest of the game. Yes, that may change the outcome.

However, if I am designing a telescope that is going to be mounted on the surface of planet earth, I am not going to waste much time planning on using a perfect vacuum where there is no air diffraction going on, because that is not going to be possible since only a very few have enough resources to shoot something like Hubble up into orbit. And it is immaterial because we have already decided this will be surface-mounted.
Your signal/noise ratio is so bad, I am not aware of your offering any useful signal anyway. So I believe I will just "muddle on" and before long you will actually know how to test engines as a result, since you obviously do not know how at present... And neither do I, at present. But at least _I_ am working on changing that situation.

Re: Correlated data discussion

Post by bob »

hgm wrote:
bob wrote:Another one: do not draw conclusions from _one_ partial result. The test has not even finished yet and here you go, drawing a conclusion. The way the test is run, the partial results can be highly misleading due to white/black bias. One does best by waiting until _all_ the data is in.
You think that was a conclusion from your partial result? Actually this is freshman's teaching material, the basics of every experimental science...
Going to try to sell me a bridge next? Re-read what you wrote about this suggesting that the first run was bad, and the second was good. All based on a partial result that might show the first result was typical and the second was atypical, after the entire thing is played out. So _yes_ you did draw a conclusion from partial data. Dance around it all you want, of course. But your post is there for anyone to read...

Where do I deny that? I am onto _accurate_ measurement.
Oh, man, why don't you wait until you are awake, before you start posting. It says you deny that the game results will practically be the same when you change the node count by 100,000. Or are you denying now that you deny that?
Did you read my experimental explanation and results? _Sometimes_ changing the node count by 1000 changes the game outcomes significantly. Sometimes it makes _no_ change whatsoever. Why is that so hard??? Sometimes changing the node count by only 1 changes a game or two, although in the very few such tests I ran, the result was not changed, just the moves actually played varied at some point.

So I would not bet on _anything_ dealing with this, except that in general, the best program will win a match more times than it loses. But individual games are _far_ too variable to bet on.

BTW

Post by bob »

You made a duplicate post. Are you having issues with very slow posting times? I've been trying to determine if it is just CCC or something in general on my end as I am seeing the same.

Re: 4 sets of data

Post by hgm »

bob wrote:However, would it be possible for either (a) you _do_ follow a specific discussion and post comments related to it or (b) if you choose to not follow context, then also choose to not make comments that have nothing to do with the discussion?
You would have to be more specific. That you perhaps cannot see the relevance of my remarks does not necessarily mean that they have nothing to do with the discussion. Insofar as the discussion has to do with anything in the first place, that is...
Someone asked me to run the test. It _did_ change the results. I ran 4 runs to see if the variability was lower, higher, or the same.
Did it? What change are you talking about? 2 Elo on Crafty's rating, which had an error bar of +/- 18 in the first place? You call that a change? Your statement "It _did_ change the results" is actually very questionable: what is 'it' here? Did that (insignificant) change in the results occur because you included games between the other engines, or simply because you redid the test (or some games of the test)? The Crafty rating would very likely change much more than 2 points (namely 9 points = 1 SD) if you would have redone the Crafty games without playing the opponents against each other. And it would likely have changed in another way if you only redid the games of the opponents against each other. The conclusion that including games between opponents is what changed the result is therefore not justified.
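
For reference, an error bar of that size is just counting statistics. A rough sketch of the arithmetic, assuming about 800 Crafty games in the run, a per-game score SD of about 0.4 (typical with a fair draw rate), and roughly 7 Elo per percentage point near a 50% score; all three figures are assumptions for the illustration, not taken from the actual runs:

#include <math.h>
#include <stdio.h>

int main(void)
{
    double games = 800.0;          /* assumed number of Crafty games in one run */
    double per_game_sd = 0.4;      /* assumed score SD of a single game */
    double elo_per_percent = 7.0;  /* slope of the Elo curve near equality */

    double score_sd = per_game_sd / sqrt(games);          /* SD of the match score fraction */
    double elo_sd   = score_sd * 100.0 * elo_per_percent; /* convert to Elo */

    printf("1 SD = %.1f Elo, 2 SD error bar = +/- %.1f Elo\n", elo_sd, 2.0 * elo_sd);
    return 0;
}

Under these assumptions 1 SD comes out near 10 Elo and a 2 SD error bar near +/- 20, which is where figures like "9 points = 1 SD" and "+/- 18" come from.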

And if someone has done that then most all past testing is flawed, since I am using the same unmodified source that was used in the CCRL, SSDF, etc testing.
A good tester would recognize such an engine as variable, and delete its results from the tests. This is why you have to keep track of variance and correlations within the results of each engine.
And btw, that would _not_ corrupt the test and make your "dependency" condition show up. A program would simply play better about 1/2 the time. So even that statistical suggestion would be wrong. A binary decision based on even/odd months is not going to produce any sort of dependency in the data.
It most certainly will. The data will be highly correlated timewise, and the stochastic processes producing them will be highly dependent in the mathematical sense. That the causal relationship is in fact that both depend on the month of the year is something the math does not care about.
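
A tiny numerical illustration of that kind of dependence (the 55%/45% split per month is an invented figure, used only to make the point):

#include <stdio.h>

/* Suppose an engine wins 55% of its games in even months and 45% in odd
   months.  Two games taken from the same run share the month, so their
   results are dependent in the mathematical sense, even though each result
   on its own looks like a fair 50% coin. */
int main(void)
{
    double p_even = 0.55, p_odd = 0.45;

    double p_win  = 0.5 * (p_even + p_odd);                  /* marginal: 0.50   */
    double p_both = 0.5 * (p_even * p_even + p_odd * p_odd); /* joint:    0.2525 */

    printf("P(win) = %.4f, P(win)^2 = %.4f, P(both win) = %.4f\n",
           p_win, p_win * p_win, p_both);
    /* 0.2525 != 0.2500, so the two game results are not independent. */
    return 0;
}
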
Would produce variation if your sample interval matched up, but not dependency.

No, because the TOD is always synced to NTP sources. You have suggested a way to break the test. If anyone thought that was _really_ the issue, the only flaw with that reasoning is that _all_ the programs would have to be doing that, and in a way that would get in sync with my testing. Month is too coarse when you run a test in a day, except right at the boundary. However, it is certainly easy enough to verify that this doesn't happen, if someone wants to. The source for all the engines _is_ available. But I suspect no one would consider stomping through source to see if _that_ is actually happening. Because if it was, it would show up in other testing as well.
You might not be able to see it at all by looking at the source, just as you will in general not be able to debug a program by only looking at its source. The behavior I described might be the result of a well-hidden bug. The only reliable way to find bugs is by looking at how they manifest themselves.

This is not far-fetched. It did actually happen to me. At some point I noted that Joker, for a search from the opening position, would sometimes produce a different score at the same search depth, even when 'random' was switched off. It turned out that the local array in the evaluation routine that held the backward-most Pawn of each file was only initialized up to the g-file. And the byte for the h-file coincided with a memory address that was used to hold the high-order byte of a clock variable, read to know if the search should be aborted. During weeks when this value was large and positive, the backward-most pawn was found without error. When it was negative, the negative value stuck like an off-board ghost Pawn, and caused horrendous misjudgements in the Pawn evaluation, leading to disastrous Pawn moves, losing the game.
Even if that were true (and it does not necessarily have to be true, as the problem might be in one of the engines, and other people might not use that engine), the main point is that others do not care. They do not make 25,000-game runs, so their errors are always dominated by sampling statistics. Systematic errors that are 6-sigma in a 25,000-game run are only 0.6 sigma in a 250-game run, and of no consequence. You could not see them, and they would not affect your testing accuracy.
What on earth does that mean? So you are _now_ going to tell me that you would get more accurate results with 250 games than with 25,000? :) Simply because the SD is now much larger? :) Hint: that is _not_ a good definition of accuracy...
Are you incapable of drawing any correct inferences whatsoever? An elephant does not care about the weight of a dog on its back, while that same dog would crush a spider. You conclude from that that elephants are too small to be crushed????
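
The arithmetic behind the 6-sigma / 0.6-sigma remark, as a sketch: the statistical error of a match score shrinks as 1/sqrt(N), so a fixed systematic bias grows relative to it as sqrt(N). The 1.5% bias and the 0.4 per-game SD below are illustrative assumptions chosen so the numbers land near the quoted figures:

#include <math.h>
#include <stdio.h>

int main(void)
{
    double bias = 0.015;               /* assumed fixed systematic error in the score */
    double per_game_sd = 0.4;          /* assumed per-game score SD */
    int runs[] = { 250, 25000 };

    for (int i = 0; i < 2; i++) {
        double sigma = per_game_sd / sqrt((double)runs[i]);  /* statistical error of the run */
        printf("%5d games: bias of %.1f%% = %.1f sigma\n",
               runs[i], 100.0 * bias, bias / sigma);
    }
    return 0;
}

This prints roughly 0.6 sigma for 250 games and 6 sigma for 25,000 games: the same bias is invisible in a short run and glaring in a long one.
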
And if you had been reading, you would have noted that I have attempted to do this. No books, no learning, no pondering, no SMP search. A clean directory with just engine executables that is zapped after each game to avoid hidden files that could contain anything from persistent hash to evaluation weight changes. All we are left with is a pool of engines, using the same time control repeatedly, playing the same positions repeatedly, playing on the same pool of identical processors repeatedly. No, the timing is not exact. But since that is how everyone is testing, it would seem that is something that is going to have to be dealt with as is. Yes, a program can on occasion fail low just because it gets to search a few more nodes due to timing jitter. Yes, that might burn many seconds of CPU time to resolve. Yes, that will change the time per move for the rest of the game. Yes, that may change the outcome.
Yes, all very nice. But it did not reduce the variance to the desired (and designed!) level. And it is results that count. An expensive-looking, spotless car is no good if it refuses to start. Even if the engine is brand new and still under factory warranty, the gas tank is full, and the battery is charged, you would still have to walk...
However, if I am designing a telescope that is going to be mounted on the surface of planet earth, I am not going to waste much time planning on using a perfect vacuum where there is no air diffraction going on, because that is not going to be possible since only a very few have enough resources to shoot something like Hubble up into orbit. And it is immaterial because we have already decided this will be surface-mounted.
A very educational example, as people are now building huge Earth-based telescopes which correct for air turbulence by using adaptive optics, and which exceed the limits of their mirror size by being grouped into clusters of interferometers, as is done for radio telescopes. If you cannot remove a noise source, you will have to learn to live with it, and outsmart it...

Re: BTW

Post by hgm »

bob wrote:You made a duplicate post. Are you having issues with very slow posting times? I've been trying to determine if it is just CCC or something in general on my end as I am seeing the same.
CCC is very slow today, also for me.

Re: Correlated data discussion

Post by hgm »

bob wrote:
hgm wrote: You think that was a conclusion from your partial result? Actually this is freshman's teaching material, the basics of every experimental science...
Going to try to sell me a bridge next? Re-read what you wrote about this suggesting that the first run was bad, and the second was good. All based on a partial result that might show the first result was typical and the second was atypical, after the entire thing is played out. So _yes_ you did draw a conclusion from partial data. Dance around it all you want, of course. But your post is there for anyone to read...
Oh, you were talking about my remark on which run was bad, and not on the lessons to be drawn? Am I supposed to smell that you put comments two paragraphs later than the one to which they refer?

The 'conclusion' you object to says 'it seems' (for everyone to read, fortunately). I would call that a prediction more than a conclusion. And that it is a prediction does not imply that it is wrong. The prediction was indeed based on the idea that the partial result you were showing us was representative of the entire result, and not only games with Crafty white and the black games still to come. What was all this boasting about randomly shuffling games in time and over cores? That suddenly does not apply anymore?
Did you read my experimental explanation and results? _Sometimes_ changing the node count by 1000 changes the game outcomes significantly. Sometimes it makes _no_ change whatsoever. Why is that so hard??? Sometimes changing the node count by only 1 changes a game or two, although in the very few such tests I ran, the result was not changed, just the moves actually played varied at some point.

So I would not bet on _anything_ dealing with this, except that in general, the best program will win a match more times than it loses. But individual games are _far_ too variable to bet on.
You are contradicting yourself here. Either changing the node count only affects the outcome of the game rarely, and in that case it would be very safe to bet on the result of the replay of that individual game, or it changes it nearly always, in which case betting would be a random gamble. You can't have both, so which is it? In particular, to stick to the given example, what is the probability that the game result changes there: 1%, 10% or 50%?

Re: Correlated data discussion

Post by Fritzlein »

bob wrote:And that is the effect I am talking about. A tiny evaluation change changes the shape of the tree, and that is all that is needed to change the result, and the really interesting part is, the change in shape is _not_ correlated with the result. One time fewer nodes makes a better result, next time it makes a worse one...
hgm wrote:Well, this is something Bob vehemently denies, and in this respect I have no reason at all to doubt his statement. I think you would lose tons of money, as in general Fruit is stronger than Crafty, and the (not so extremely rare) exception that Crafty beats Fruit for this precise combination of conditions will not carry over to cases with even a tiny difference in number of nodes. This is the 'butterfly effect'.
MartinBryant wrote:I'd be careful how you bet your money!

Bob has already demo'd that tiny changes (1,000 nodes) can make move choices and hence the game result fluctuate wildly, and my own experience/experiments agree with this. It may be that the result is so sensitive to the node count that the addition of just a _single_ node would do the same. (Now that would be an interesting experiment.) The results through the 200,001 possible games may be totally chaotic.

In fact, because Fruit is stronger than Crafty, your Crafty win at 3,000,000 nodes could well be sitting in the middle of a sea of losses.

If you took your +/-100,000 node interval and ran 201 games at the 1,000 node steps I'd happily bet MY money that Fruit would win the majority.
Wow, it's very interesting that everyone disagrees with me here. It really is a money-making (or money-losing) opportunity. :) I quite believe all of you that a tiny change in node count can change the entire course of the game. However, I'm not at all convinced by these anecdotal reports that my bet is a money-loser.

The conclusion that you are arriving at is that, because a small change in node count can (and often does) change the results, there is no correlation between repeated plays with similar node counts. Has anyone measured the correlation? Are there any statistics on this? If there are no statistics, let me ask how good you think you are at distinguishing a coin that lands heads 55% of the time from one that lands heads 45% of the time on the basis of looking at a few flips.
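
For what it is worth, a rough normal-approximation sketch of the coin question (purely illustrative numbers):

#include <math.h>
#include <stdio.h>

/* After n flips, how many standard deviations separate the expected head
   counts of a 55% coin and a 45% coin? */
int main(void)
{
    double p_hi = 0.55, p_lo = 0.45;
    for (int n = 10; n <= 320; n *= 2) {
        double sd  = sqrt(p_hi * (1.0 - p_hi) * n);   /* SD of the head count */
        double gap = (p_hi - p_lo) * n;               /* expected difference in heads */
        printf("%4d flips: gap of %.1f heads = %.1f SD\n", n, gap, gap / sd);
    }
    return 0;
}

It takes on the order of a hundred flips before the two possibilities are separated by about 2 SD, so a handful of games tells you almost nothing about which side of 50% you are on.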

Martin, let me make sure that you are offering me the same bet I am offering you. Crafty is playing against Fruit from the same starting position 201 times at different fixed node counts. From Bob's posted results we know that Fruit scores about 60% against Crafty. However, in my bet I stipulate that I get to look at one of the results first, let's say the middle one of the 201. If Crafty didn't win that one, then no bet. If Crafty did win that one, then I take Crafty to score over 100 points in the other 200 games, and you take Fruit to score over 100 points in the other 200 games. You think that one win is likely to be a fluke in the middle of a sea of losses, whereas I think that one win is likely to be correlated with all the other games. So you think you would win money from me in this way?

Well, I could be wrong, but I am still offering, until someone actually measures how much correlation there is. Maybe Bob would even take a sporting interest in our disagreement and play out our bet in this way: He takes random positions from his set of 3000+ until he finds one where Crafty beats Fruit at a median fixed node count (corresponding to the time control at which Fruit was generally winning 60%). He plays it at the 200 neighboring node counts, and marks down whether Crafty or Fruit won more. Then he picks more random positions until there is another one Crafty wins, etc., repeating 100 times. Well, that's 20,000+ games to play, but if I'm right and Crafty won more than 50 of those 100 bets, it would give us some insight into a very fundamental point.
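
A sketch of that experiment as a driver loop. Everything here is hypothetical: play_game() is only a stand-in that draws independent results with Fruit scoring 60% (the zero-correlation assumption), draws are ignored, and in the real experiment it would actually play Crafty vs Fruit from the given position at the given fixed node count:

#include <stdio.h>
#include <stdlib.h>

static int play_game(int pos, long nodes)   /* stand-in, not a real engine match */
{
    (void)pos; (void)nodes;
    return (rand() % 100 < 40) ? +1 : -1;   /* +1 = Crafty win, -1 = Fruit win */
}

int main(void)
{
    const long median_nodes = 3000000;      /* node count of the 'seed' game */
    int crafty_bets_won = 0;

    for (int bet = 0; bet < 100; bet++) {
        int pos;
        do { pos = rand(); }                           /* stand-in for drawing a position */
        while (play_game(pos, median_nodes) <= 0);     /* ...until Crafty wins the seed game */

        int crafty = 0, fruit = 0;
        for (long d = -100000; d <= 100000; d += 1000) {   /* the 200 neighbouring games */
            if (d == 0) continue;                          /* skip the seed game itself */
            if (play_game(pos, median_nodes + d) > 0) crafty++; else fruit++;
        }
        if (crafty > fruit) crafty_bets_won++;
    }
    printf("Crafty won %d of the 100 bets\n", crafty_bets_won);
    return 0;
}

With the zero-correlation stand-in, Crafty wins essentially none of the 100 bets; Fritzlein's wager is that real games at neighbouring node counts are correlated enough to turn that around.
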
hgm wrote:
Fritzlein wrote:What we apparently disagree about is the potential that we will be able to re-use positions in a way that is independent. I highly doubt it can be done.
Is this just a gut feeling, or is this doubt somehow based on mathematical calculation?
It's just a gut feeling with no mathematical calculation. However, there is no mathematical calculation on the other side either, just anecdotal evidence that changing node counts totally changes playouts, and references to the butterfly effect. Let's get some numbers and see whose intuition is accurate!

Re: Correlated data discussion

Post by Uri Blass »

Fritzlein wrote:
bob wrote:And that is the effect I am talking about. A tiny evaluation change changes the shape of the tree, and that is all that is needed to change the result, and the really interesting part is, the change in shape is _not_ correlated with the result. One time fewer nodes makes a better result, next time it makes a worse one...
hgm wrote:Well, this is something Bob vehemently denies, and in this respect I have no reason at all to doubt his statement. I think you would lose tons of money, as in general Fruit is stronger than Crafty, and the (not so extremely rare) exception that Crafty beats Fruit for this precise combination of conditions will not carry over to cases with even a tiny difference in number of nodes. This is the 'butterfly effect'.
MartinBryant wrote:I'd be careful how you bet your money!

Bob has already demo'd that tiny changes (1,000 nodes) can make move choices and hence the game result fluctuate wildly, and my own experience/experiments agree with this. It may be that the result is so sensitive to the node count that the addition of just a _single_ node would do the same. (Now that would be an interesting experiment.) The results through the 200,001 possible games may be totally chaotic.

In fact, because Fruit is stronger than Crafty, your Crafty win at 3,000,000 nodes could well be sitting in the middle of a sea of losses.

If you took your +/-100,000 node interval and ran 201 games at the 1,000 node steps I'd happily bet MY money that Fruit would win the majority.
Wow, it's very interesting that everyone disagrees with me here. It really is a money-making (or money-losing) opportunity. :) I quite believe all of you that a tiny change in node count can change the entire course of the game. However, I'm not at all convinced by these anecdotal reports that my bet is a money-loser.

The conclusion that you are arriving at is that, because a small change in node count can (and often does) change the results, there is no correlation between repeated plays with similar node counts. Has anyone measured the correlation? Are there any statistics on this? If there are no statistics, let me ask how good you think you are at distinguishing a coin that lands heads 55% of the time from one that lands heads 45% of the time on the basis of looking at a few flips.

Martin, let me make sure that you are offering me the same bet I am offering you. Crafty is playing against Fruit from the same starting position 201 times at different fixed node counts. From Bob's posted results we know that Fruit scores about 60% against Crafty. However, in my bet I stipulate that I get to look at one of the results first, let's say the middle one of the 201. If Crafty didn't win that one, then no bet. If Crafty did win that one, then I take Crafty to score over 100 points in the other 200 games, and you take Fruit to score over 100 points in the other 200 games. You think that one win is likely to be a fluke in the middle of a sea of losses, whereas I think that one win is likely to be correlated with all the other games. So you think you would win money from me in this way?

Well, I could be wrong, but I am still offering, until someone actually measures how much correlation there is. Maybe Bob would even take a sporting interest in our disagreement and play out our bet in this way: He takes random positions from his set of 3000+ until he finds one where Crafty beats Fruit at a median fixed node count (corresponding to the time control at which Fruit was generally winning 60%). He plays it at the 200 neighboring node counts, and marks down whether Crafty or Fruit won more. Then he picks more random positions until there is another one Crafty wins, etc., repeating 100 times. Well, that's 20,000+ games to play, but if I'm right and Crafty won more than 50 of those 100 bets, it would give us some insight into a very fundamental point.
hgm wrote:
Fritzlein wrote:What we apparently disagree about is the potential that we will be able to re-use positions in a way that is independent. I highly doubt it can be done.
Is this just a gut feeling, or is this doubt somehow based on mathematical calculation?
It's just a gut feeling with no mathematical calculation. However, there is no mathematical calculation on the other side either, just anecdotal evidence that changing node counts totally changes playouts, and references to the butterfly effect. Let's get some numbers and see whose intuition is accurate!
My opinion is that nothing is clear to me.

I believe that there is correlation but I am not sure if the correlation is high enough for you to win the bet.

Uri

Re: Correlated data discussion

Post by bob »

hgm wrote:
bob wrote:
hgm wrote: You think that was a conclusion from your partial result? Actually this is freshman's teaching material, the basics of every experimental science...
Going to try to sell me a bridge next? Re-read what you wrote about this suggesting that the first run was bad, and the second was good. All based on a partial result that might show the first result was typical and the second was atypical, after the entire thing is played out. So _yes_ you did draw a conclusion from partial data. Dance around it all you want, of course. But your post is there for anyone to read...
Oh, you were talking about my remark on which run was bad, and not on the lessons to be drawn? Am I supposed to smell that you put comments two paragraphs later than the one to which they refer?

The 'conclusion' you object to says 'it seems' (for everyone to read, fortunately). I would call that a prediction more than a conclusion. And that it is a prediction does not imply that it is wrong. The prediction was indeed based on the idea that the partial result you were showing us was representative of the entire result, and not only games with Crafty white and the black games still to come. What was all this boasting about randomly shuffling games in time and over cores? That suddenly does not apply anymore?
Did you read my experimental explanation and results? _Sometimes_ changing the node count by 1000 changes the game outcomes significantly. Sometimes it makes _no_ change whatsoever. Why is that so hard??? Sometimes changing the node count by only 1 changes a game or two, although in the very few such tests I ran, the result was not changed, just the moves actually played varied at some point.

So I would not bet on _anything_ dealing with this, except that in general, the best program will win a match more times than it loses. But individual games are _far_ too variable to bet on.
You are contradicting yourself here. Either changing the node count only affects the outcome of the game rarely, and in that case it would be very safe to bet on the result of the replay of that individual game, or it changes it nearly always, in which case betting would be a random gamble. You can't have both, so which is it? In particular, to stick to the given example, what is the probability that the game result changes there: 1%, 10% or 50%?
Ok, sorry. I thought that if I change something in an unpredictable way, so that the results are changed in an unpredictable way, then even though the overall results are not changed with respect to who wins and who loses, I could assume that picking any one of the runs I posted produces different and contradictory Elo numbers, while the programs finish up in the same order each time.

I guess I don't pay enough attention to what is going on to realize that no matter which one of those random matches I pick, the same program won them all. By different amounts, but that was all.


sarcasm off

BTW, for the intellectually dense readers, here is another way to express this, in _very_ simple terms.

On a single game of chess, between any two programs on the planet, using any two hardware platforms available, I would not bet any more money than I could afford to lose without causing any pain. On a longer match, I would be willing to risk much more because the variability goes way down, particularly if one program tends to win all such matches, even though it will probably never win every single game.
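
As a rough illustration of why the longer match is the safer bet (a sketch assuming the stronger program scores 55% per game and ignoring draws; both are purely illustrative simplifications):

#include <math.h>
#include <stdio.h>

int main(void)
{
    double p = 0.55;                       /* assumed per-game score of the stronger side */
    int lengths[] = { 1, 20, 100, 800 };

    for (int i = 0; i < 4; i++) {
        int n = lengths[i];
        double mean = n * p, sd = sqrt(n * p * (1.0 - p));
        double z = (mean - n / 2.0) / sd;            /* distance of the mean above 50% */
        double p_ahead = 0.5 * erfc(-z / sqrt(2.0)); /* normal approximation to P(score > n/2) */
        printf("%4d games: P(stronger side comes out ahead) ~ %.2f\n", n, p_ahead);
    }
    return 0;
}

Under these assumptions the stronger side is barely a favorite in one game, but comes out ahead in a long match almost every time, which is the variability difference being described.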