YATT.... (Yet Another Testing Thread)

Discussion of chess software programming and technical issues.

Moderator: Ras

Dirt
Posts: 2851
Joined: Wed Mar 08, 2006 10:01 pm
Location: Irvine, CA, USA

Re: YATT.... (Yet Another Testing Thread)

Post by Dirt »

bob wrote:So I have a random bug, that happens infrequently, that somehow biases the results where the games take a day or two to play?
That's the most likely situation, although I wouldn't personally call it a bug, as I don't think there is anything wrong with the hardware or software.
bob wrote:And now that we have changed the number of starting positions, suddenly that "bug" has been fixed, even though _nothing_ has been changed except for the file containing the PGN records.
If by PGN records you mean the enlarged set of starting positions, then yes, and not only that but I think we all expected the changes you made to fix the problem.
bob wrote:Even though others have found the same high variability in results from the same starting position, etc?
That's almost irrelevant. If there was a lot more variability in the results from a position, you probably wouldn't have seen the inconsistencies you did, but your results still wouldn't have been a reliable measure of Crafty's strength changes.
bob wrote:Until I see more of this random behavior, which has not yet been seen after four even bigger tests (doesn't mean it won't, but it hasn't yet) I feel somewhat comfortable in concluding that since the only thing that changed between the two experiments is the starting positions and the number of them, that most likely _that_ is what is making a difference.
Almost anything you could do to randomize the playing conditions should have helped. Random time controls, random evaluation offsets, random cache sizes, even files or processes left over from the previous use of the node. Everything you were doing to get more consistent testing conditions was only making your problem worse. But only adding more start positions was going to fix your accuracy problems, so once you started down that road, with a commitment to adding enough positions that you wouldn't need any more randomness, it seemed pointless to worry about the consistency problem.
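The effect described here is easy to reproduce in a toy Monte Carlo model. This is only a sketch with made-up numbers (SPREAD, the per-position spread of expected score, is an assumption, not a measured value): every start position gets its own expected score, and all repeats of that position share it, which is the block correlation in its simplest form. With 40 positions repeated 640 times the run-to-run spread of the final score comes out several times larger than with 25600 distinct positions, even though the total game count is identical.

Code: Select all

import random

GAMES = 25600   # games per simulated run (hypothetical round number)
RUNS = 200      # how many runs to simulate
SPREAD = 0.08   # assumed spread of the per-position expected score

def run_mean(num_positions):
    """Average score of one simulated run using num_positions start positions."""
    repeats = GAMES // num_positions
    score = 0
    for _ in range(num_positions):
        p = min(max(random.gauss(0.5, SPREAD), 0.0), 1.0)
        # all repeats of one position share the same expectation p,
        # so their outcomes are correlated rather than independent
        score += sum(random.random() < p for _ in range(repeats))
    return score / GAMES

for n in (40, 25600):
    means = [run_mean(n) for _ in range(RUNS)]
    mu = sum(means) / RUNS
    sd = (sum((m - mu) ** 2 for m in means) / RUNS) ** 0.5
    print(f"{n:5d} positions: run-to-run SD of the score = {sd:.4f}")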
bob
Posts: 20943
Joined: Mon Feb 27, 2006 7:30 pm
Location: Birmingham, AL

Re: YATT.... (Yet Another Testing Thread)

Post by bob »

Dirt wrote:
bob wrote:So I have a random bug, that happens infrequently, that somehow biases the results where the games take a day or two to play?
That's the most likely situation, although I wouldn't personally call it a bug, as I don't think there is anything wrong with the hardware or software.
bob wrote:And now that we have changed the number of starting positions, suddenly that "bug" has been fixed, even though _nothing_ has been changed except for the file containing the PGN records.
If by PGN records you mean the enlarged set of starting positions, then yes, and not only that but I think we all expected the changes you made to fix the problem.
bob wrote:Even though others have found the same high variability in results from the same starting position, etc?
That's almost irrelevant. If there was a lot more variability in the results from a position, you probably wouldn't have seen the inconsistencies you did, but your results still wouldn't have been a reliable measure of Crafty's strength changes.
bob wrote:Until I see more of this random behavior, which has not yet been seen after four even bigger tests (doesn't mean it won't, but it hasn't yet) I feel somewhat comfortable in concluding that since the only thing that changed between the two experiments is the starting positions and the number of them, that most likely _that_ is what is making a difference.
Almost anything you could do to randomize the playing conditions should have helped. Random time controls, random evaluation offsets, random cache sizes, even files or processes left over from the previous use of the node. Everything you were doing to get more consistent testing conditions was only making your problem worse. But only adding more start positions was going to fix your accuracy problems, so once you started down that road, with a commitment to adding enough positions that you wouldn't need any more randomness, it seemed pointless to worry about the consistency problem.
No disagreement. But if you look at past posts, the reasoning was always "there is some sort of correlation between the games that is caused by some unknown problem in your cluster". Karl came along and suggested, fairly convincingly, that the natural correlation produced when the same program plays the same position over and over was more than enough to produce those kinds of results. He suggested a possible solution, which I tried, and so far it does look promising... I should probably also run some smaller runs to see if they look more consistent than the original runs did when I first started. But so far, Karl seems to be right on with his comments. So far...
bob
Posts: 20943
Joined: Mon Feb 27, 2006 7:30 pm
Location: Birmingham, AL

Re: YATT.... (Yet Another Testing Thread)

Post by bob »

Uri Blass wrote:
Dirt wrote:
Uri Blass wrote:Note that I do not agree with Karl about eliminating the black/white pairs, and I think that the correlation introduced by playing black/white of the same position is correlation that reduces the error.

If there are unbalanced positions it is obvious: in the extreme case that the result depends only on the position, you get exactly 50% if you play the position white/black, whereas if you do not, you get noise that does not depend on the strength of the engines.

I believe that the positions are not extremely unbalanced, but I also do not believe that they are extremely balanced; white may score 55% in some positions even between equal engines, so having white/black for the same position has the clear advantage of reducing noise.

Uri
If we were trying to measure the strength of Crafty against the other engines, then I could see that playing a position only with Crafty white would add noise. I think that is always in practice going to be of some concern, and that is a reason for Crafty to play it both as black and white. But theoretically we are only interested in measuring the difference between two versions of Crafty, which will both play the position the same way, so I don't see any additional noise being added to this measurement at all.
You are right. It is important to have the same position white and black only if you test version X against version X+1 of the same program.

Bob does not test in this way but we have no proof that this type of test cannot be productive.

IM Larry Kaufman from the Rybka team told us that Rybka tests mainly RybkaX+1 against RybkaX.

Uri
Don't follow. My tests are run the same way every time. If I were to alternate colors and play 2N positions once rather than N positions twice, the same opponents would play the same positions with the same colors every single time. The order in which the games are played gets pretty scrambled, but before anything is run I have a shell script that prepares the individual match scripts sequentially to avoid this problem.

If you mean I don't play N vs N', you are correct. I do not believe that kind of testing is as accurate as testing against a gauntlet of opponents, where you can see whether a change helps against some and hurts against others. N vs N' doesn't give you that kind of information.
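A toy simulation of the pairing question above (all numbers assumed, two equally strong engines, with the per-position imbalance deliberately exaggerated so the effect is visible): with the same total number of games, playing each position from both colors cancels the white-advantage noise that single-color games leave in the measured score.

Code: Select all

import random

TOTAL_GAMES = 2000
EDGE_SD = 0.3    # assumed SD of the per-position white advantage (exaggerated)
TRIALS = 2000

def measured_score(paired):
    """A's measured score against an equally strong B over TOTAL_GAMES games."""
    score = 0
    positions = TOTAL_GAMES // 2 if paired else TOTAL_GAMES
    for _ in range(positions):
        b = random.gauss(0.0, EDGE_SD)   # white's edge in this position
        colors = (1, -1) if paired else (random.choice((1, -1)),)
        for c in colors:                 # c = 1 means A plays white
            p = min(max(0.5 + c * b, 0.0), 1.0)
            score += random.random() < p
    return score / TOTAL_GAMES

for paired in (True, False):
    xs = [measured_score(paired) for _ in range(TRIALS)]
    mu = sum(xs) / TRIALS
    sd = (sum((x - mu) ** 2 for x in xs) / TRIALS) ** 0.5
    label = "paired (both colors)" if paired else "single random color "
    print(f"{label}: SD of measured score = {sd:.4f}")

With realistically balanced positions the gap shrinks, which fits Dirt's point that for measuring the difference between two Crafty versions the pairing matters less than it does for absolute scores.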
bob
Posts: 20943
Joined: Mon Feb 27, 2006 7:30 pm
Location: Birmingham, AL

Re: YATT.... (Yet Another Testing Thread)

Post by bob »

hgm wrote:
bob wrote:It now appears that there was a correlation issue, but not one anyone seemed to grasp until Karl came along.
This is still absolute bullshit. Karl stated that the results would be farther from the truth when you used fewer positions. But they would have been closer to each other, as they used the same small set of positions. Karl's remark that being closer to the truth necessarily implies that they were closer to each other was even plain wrong, as my counter-example shows.
OK. Here you go. First a direct quote from karl:

============================================================
can lead to different moves and even different game outcomes. However, we
are doing _almost_ the same thing in each repetition, so although the
results of the 64 repetitions are not perfectly correlated, they are highly
correlated, and far from mathematically independent.

When we do the calculation of the standard deviation, we will not be
understating it by a full factor of 8 as we did in the case of Trials C & D,
but we will still be understating it by almost that much, enough to explain
away the supposed mathematical impossibility. Note that I am specifically
not assuming that whatever changed between Trials E & F gave a systematic
disadvantage to Crafty. I am allowing that the change had a random effect
that sometimes helped and sometimes hurt. My assumption is merely that the
random effect didn't apply to each playout independently, but rather
affected each block of 64 playouts in coordinated fashion.
============================================================

Now, based on that, either (a) "bullshit" is simply the first idea you get whenever you read a post here or (b) you wouldn't recognize bullshit if you stepped in it.

He said _exactly_ what I said he said. Notice the "enough to explain away..." This quote followed the first one I posted from him last week when we started this discussion.
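For reference, the arithmetic behind Karl's "almost a factor of 8" can be written down directly. For $n$ games played in blocks of $m$ near-identical repetitions with within-block correlation $\rho$ (taken as a single constant for this sketch),

\[
\operatorname{Var}(\bar{X}) \;=\; \frac{\sigma^2}{n}\,\bigl(1 + (m-1)\rho\bigr),
\]

so the naive i.i.d. estimate $\sigma/\sqrt{n}$ understates the true standard deviation by a factor of $\sqrt{1+(m-1)\rho}$. With $m = 64$ and $\rho = 1$ that factor is exactly $\sqrt{64} = 8$; with $\rho$ just below 1 it is "almost" 8, which is the understatement Karl describes.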


And based on the results so far, his idea of eliminating the black/white pairs may also be a good one, since a pair of games, same players, same position, is going to produce a significant correlation between the two results for positions that are not absolutely equal, or which are not equal with respect to the two opponents.
This is also wrong. Unbalanced positions are bad no matter if you pair them or not. It becomes more difficult to express a small improvement in a game that you are almost certainly going to lose anyway. The improvement then usually only means you can delay the inevitable somewhat longer.
Again, don't buy it at all. If a position is so unbalanced, the two outcomes will be perfectly correlated and cancel out. A single game per position gives twice as many games, hopefully twice as many that are not too unbalanced.
And unfortunately, that is simply a matter of fact when choosing a significant number of positions... But the results are not "corrupted". I've had (and posted here) way too many of these same kinds of results, using these same positions. I am currently running another 4 sets with the new approach, this time making sure that I can save the PGN. 8 runs with consistent results will be a huge change from what I was getting with about the same number of games before, but using 100 times fewer positions.
Well, I already told you nearly a year ago that the error caused by the low number of positions was starting to dominate the error in your result for the number of games you play. But that is only the error between your results and the true strength difference, which does not show up in the difference between runs with identical programs, as all these runs used the same small set of positions, and thus suffer equally when these positions are not representative.

The data of your first runs was corrupted. If you had saved the PGN, you could have known that from the internal variation. (And you would have known how many positions you need to use.) Now you can only know that by repeating the run. But you are not going to do that. So in the end you will know nothing, except how to determine a square root by playing Chess games.
Yeah, yeah. Got it. "stampee foot. Data corrupted. Stampee foot. bad testing. etc..."

Hopefully we will get this fixed, in spite of your foot music... then you can move on to some other topic where you are absolutely correct, except when you aren't...
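Sarcasm aside, the remark about the internal variation in the PGN points at a standard technique: from the per-game results of a single run, a bootstrap over start positions estimates the run-to-run error without replaying anything, provided games sharing a start position are resampled together (they are the correlated ones). A minimal sketch, with a hypothetical results layout (results[i] holds Crafty's scores, 1/0.5/0, for every game that began from start position i):

Code: Select all

import random

def bootstrap_sd(results, resamples=1000):
    """Estimate the SD of the run's mean score by resampling whole
    positions, so correlated games stay together."""
    n = len(results)
    means = []
    for _ in range(resamples):
        picked = [results[random.randrange(n)] for _ in range(n)]
        means.append(sum(map(sum, picked)) / sum(map(len, picked)))
    mu = sum(means) / resamples
    return (sum((m - mu) ** 2 for m in means) / resamples) ** 0.5

# toy input: 40 positions whose 640 repeats are perfectly correlated
fake = [[random.choice((1.0, 0.0))] * 640 for _ in range(40)]
print(f"bootstrap SD of the mean score: {bootstrap_sd(fake):.4f}")

On the toy input the estimate lands near 0.5/sqrt(40): the 25600 games behave like 40 independent ones.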
bob
Posts: 20943
Joined: Mon Feb 27, 2006 7:30 pm
Location: Birmingham, AL

more data...

Post by bob »

OK, one run of the new set has finished, and the PGN was saved successfully. First, here are the original 4 runs, plus the new (5th) run.

Code: Select all

Tue Aug 12 00:49:44 CDT 2008
time control = 1+1
crafty-22.2R4
Rank Name                   Elo    +    - games score oppo. draws
   1 Glaurung 2-epsilon/5   108    7    7  7782   67%   -21   20% 
   2 Fruit 2.1               62    7    6  7782   61%   -21   23% 
   3 opponent-21.7           25    6    6  7780   57%   -21   33% 
   4 Glaurung 1.1 SMP        10    6    6  7782   54%   -21   20% 
   5 Crafty-22.2            -21    4    4 38908   46%     4   23% 
   6 Arasan 10.0           -185    7    7  7782   29%   -21   19% 
Tue Aug 12 11:36:10 CDT 2008
time control = 1+1
crafty-22.2R4
Rank Name                   Elo    +    - games score oppo. draws
   1 Glaurung 2-epsilon/5   110    6    7  7782   67%   -19   21% 
   2 Fruit 2.1               63    6    7  7782   61%   -19   23% 
   3 opponent-21.7           26    6    6  7782   57%   -19   33% 
   4 Glaurung 1.1 SMP         7    6    7  7782   54%   -19   20% 
   5 Crafty-22.2            -19    4    3 38910   47%     4   23% 
   6 Arasan 10.0           -187    6    7  7782   28%   -19   19% 
Wed Aug 13 00:53:43 CDT 2008
time control = 1+1
crafty-22.2R4
Rank Name                   Elo    +    - games score oppo. draws
   1 Glaurung 2-epsilon/5   109    7    6  7782   67%   -16   20% 
   2 Fruit 2.1               63    6    7  7782   61%   -16   24% 
   3 opponent-21.7           23    6    6  7781   56%   -16   32% 
   4 Glaurung 1.1 SMP         3    6    7  7782   53%   -16   21% 
   5 Crafty-22.2            -16    4    3 38909   47%     3   23% 
   6 Arasan 10.0           -182    7    7  7782   28%   -16   19% 
Wed Aug 13 14:19:47 CDT 2008
time control = 1+1
crafty-22.2R4
Rank Name                   Elo    +    - games score oppo. draws
   1 Glaurung 2-epsilon/5   111    7    7  7782   68%   -20   21% 
   2 Fruit 2.1               71    6    7  7782   62%   -20   23% 
   3 opponent-21.7           17    6    6  7780   56%   -20   34% 
   4 Glaurung 1.1 SMP        11    6    7  7782   54%   -20   20% 
   5 Crafty-22.2            -20    3    4 38908   47%     4   23% 
   6 Arasan 10.0           -191    7    7  7782   28%   -20   18% 
Fri Aug 15 00:22:40 CDT 2008
time control = 1+1
crafty-22.2R4a
Rank Name                   Elo    +    - games score oppo. draws
   1 Glaurung 2-epsilon/5   105    7    7  7782   67%   -19   21% 
   2 Fruit 2.1               62    7    6  7782   61%   -19   24% 
   3 opponent-21.7           23    6    6  7782   56%   -19   33% 
   4 Glaurung 1.1 SMP        11    6    7  7782   54%   -19   20% 
   5 Crafty-22.2            -19    4    3 38910   47%     4   23% 
   6 Arasan 10.0           -181    7    7  7782   29%   -19   19% 
olympus% 

Another -19, to go with -16, -19, -20 and -21.  So still consistent.  I have three more of these scheduled.  Once they are done, I will then do what I originally intended: combine them two at a time (changing the name of one crafty-22.2 to something else) and run a double set through BayesElo to see how stable those numbers are, since there will be some commonality between the two runs then.  I will do that six times to combine all possible pairs of PGN results.  More when that is done.  At 12 hours a run, and with the second run starting about an hour ago, this will be done early Saturday morning.

Looks more and more like the cluster testing is _not_ "broken" or "corrupted".  Too few positions repeated too many times seems to be the unintuitive issue (unintuitive to me, that is).
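The combining step is mechanical enough to sketch. This assumes the PGN player tags spell the engine name exactly as BayesElo prints it; the file names are placeholders.

Code: Select all

# Give Crafty a new name in one run's PGN so BayesElo treats the two
# runs as two distinct players when the files are concatenated.
def relabel(src, dst, old="Crafty-22.2", new="Crafty-22.2b"):
    with open(src) as fin, open(dst, "w") as fout:
        for line in fin:
            # rewrite only the player tags, never the move text
            if line.startswith(('[White "', '[Black "')):
                line = line.replace(old, new)
            fout.write(line)

relabel("run2.pgn", "run2-relabeled.pgn")
# then: cat run1.pgn run2-relabeled.pgn > combined.pgn
# and feed combined.pgn to BayesElo as usual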
hgm
Posts: 28354
Joined: Fri Mar 10, 2006 10:06 am
Location: Amsterdam
Full name: H G Muller

Re: YATT.... (Yet Another Testing Thread)

Post by hgm »

bob wrote:I suppose you guys are going to keep up with the "stampee feet, error, stampee feet, bug, stampee feet, etc" no matter what happens?
Again, completely wrong. As soon as there is new data relevant to this matter, we will of course refine or reconsider our position. But as you are not providing relevant data, what was true at t=0 remains as true as ever...
How many runs have I posted over the past 2 years with that wild variability?
Apart from this time, I have seen only one. And later it turned out this was likely selected data, not recently taken, that happened to be "lying around" for some unknown reason. When you showed more typical data taken through the same procedure, the anomaly was totally absent. In all other runs I have seen, the 'wild' aspect existed only in your imagination, and the variability was absolutely normal.
And now, suddenly, at least for 4 full runs plus one partial, everything looks perfectly normal.
As it should and as it does.
And yet that variability is the result of a bug, when _everything_ about all the runs I have done has been absolutely identical.
Apparently, you still don't understand. There isn't anything like "that variability". Variability comes in kinds, distinguished by quantitative analysis. This remark is at the level of "I have shown you 10 flies now, and still you maintain that the Elephant I showed you the other day is big. How many more flies will I have to show you before you will admit that all animals are equal?"...
Same exact programs, same exact starting positions, same time control, same hash sizes, same processors, same operating systems, same compilers, same executables, same referee program, in short, same _everything except for the new set of more starting positions.
"Same jungle, same month of the year, same safari truck. Only this time I was wearing a green helmet in stead of a red one. And I haven't caught any big flies with a trunk and tusks in the week that I did that"
A rational person would eventually assume that if one thing was changed, and the result changed significantly, then that one change is most likely the thing that caused the result to change.
Wrong. This is in general what superstitious persons assume. Something unlikely but bad happens to them when they were wearing a red hat, and then they will never wear that color hat again in their life.
Others just continue to shout bug, over and over, without _ever_ being able to give me a scenario, knowing _exactly_ what both of these clusters look like, that I could use to _make_ those odd results happen, without modifying anything that is known to be constant (executables, nodes, etc.)
As discussed, not everything was the same. The time was different, and you did not make any effort to prevent the engines from knowing that time. There could even be a bug in your OS that is coupled to a high byte of the time.
Uri Blass
Posts: 10800
Joined: Thu Mar 09, 2006 12:37 am
Location: Tel-Aviv Israel

Re: YATT.... (Yet Another Testing Thread)

Post by Uri Blass »

bob wrote:
Dirt wrote:
bob wrote:So I have a random bug, that happens infrequently, that somehow biases the results where the games take a day or two to play?
That's the most likely situation, although I wouldn't personally call it a bug, as I don't think there is anything wrong with the hardware or software.
bob wrote:And now that we have changed the number of starting positions, suddenly that "bug" has been fixed, even though _nothing_ has been changed except for the file containing the PGN records.
If by PGN records you mean the enlarged set of starting positions, then yes, and not only that but I think we all expected the changes you made to fix the problem.
bob wrote:Even though others have found the same high variability in results from the same starting position, etc?
That's almost irrelevant. If there was a lot more variability in the results from a position, you probably wouldn't have seen the inconsistencies you did, but your results still wouldn't have been a reliable measure of Crafty's strength changes.
bob wrote:Until I see more of this random behavior, which has not yet been seen after four even bigger tests (doesn't mean it won't, but it hasn't yet) I feel somewhat comfortable in concluding that since the only thing that changed between the two experiments is the starting positions and the number of them, that most likely _that_ is what is making a difference.
Almost anything you could do to randomize the playing conditions should have helped. Random time controls, random evaluation offsets, random cache sizes, even files or processes left over from the previous use of the node. Everything you were doing to get more consistent testing conditions was only making your problem worse. But only adding more start positions was going to fix your accuracy problems, so once you started down that road, with a commitment to adding enough positions that you wouldn't need any more randomness, it seemed pointless to worry about the consistency problem.
No disagreement. But if you look at past posts, the reasoning was always "there is some sort of correlation between the games that is caused by some unknown problem in your cluster". Karl came along and suggested, fairly convincingly, that the natural correlation produced when the same program plays the same position over and over was more than enough to produce those kinds of results. He suggested a possible solution, which I tried, and so far it does look promising... I should probably also run some smaller runs to see if they look more consistent than the original runs did when I first started. But so far, Karl seems to be right on with his comments. So far...
Karl did not suggest what you say. I remember that Karl agreed that you should get the same wrong results, close to each other, if you repeat exactly the same experiment that you did earlier.

It is possible that the change in the conditions was very small (something equivalent to the cluster becoming 0.1% slower), and in this case the change is not going to cause significant noise in the new experiment (additional noise of probably less than 0.1 Elo is not important).

Uri
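Uri's order of magnitude checks out against the common rule of thumb that a doubling of speed is worth somewhere around 50 to 70 Elo (an assumption, not a value measured for these engines):

\[
\Delta E \;\approx\; 70 \cdot \log_2(1.001) \;\approx\; 70 \cdot 0.0014 \;\approx\; 0.1 \text{ Elo},
\]

which is far below the error bars of any of the runs posted above.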
bob
Posts: 20943
Joined: Mon Feb 27, 2006 7:30 pm
Location: Birmingham, AL

Re: YATT.... (Yet Another Testing Thread)

Post by bob »

Uri Blass wrote:
bob wrote:
Dirt wrote:
bob wrote:So I have a random bug, that happens infrequently, that somehow biases the results where the games take a day or two to play?
That's the most likely situation, although I wouldn't personally call it a bug, as I don't think there is anything wrong with the hardware or software.
bob wrote:And now that we have changed the number of starting positions, suddenly that "bug" has been fixed, even though _nothing_ has been changed except for the file containing the PGN records.
If by PGN records you mean the enlarged set of starting positions, then yes, and not only that but I think we all expected the changes you made to fix the problem.
bob wrote:Even though others have found the same high variability in results from the same starting position, etc?
That's almost irrelevant. If there was a lot more variability in the results from a position, you probably wouldn't have seen the inconsistencies you did, but your results still wouldn't have been a reliable measure of Crafty's strength changes.
bob wrote:Until I see more of this random behavior, which has not yet been seen after four even bigger tests (doesn't mean it won't, but it hasn't yet) I feel somewhat comfortable in concluding that since the only thing that changed between the two experiments is the starting positions and the number of them, that most likely _that_ is what is making a difference.
Almost anything you could do to randomize the playing conditions should have helped. Random time controls, random evaluation offsets, random cache sizes, even files or processes left over from the previous use of the node. Everything you were doing to get more consistent testing conditions was only making your problem worse. But only adding more start positions was going to fix your accuracy problems, so once you started down that road, with a commitment to adding enough positions that you wouldn't need any more randomness, it seemed pointless to worry about the consistency problem.
No disagreement. But if you look at past posts, the reasoning was always "there is some sort of correlation between the games that is caused by some unknown problem in your cluster". Karl came along and suggested, fairly convincingly, that the natural correlation produced when the same program plays the same position over and over was more than enough to produce those kinds of results. He suggested a possible solution, which I tried, and so far it does look promising... I should probably also run some smaller runs to see if they look more consistent than the original runs did when I first started. But so far, Karl seems to be right on with his comments. So far...
Karl did not suggest what you say. I remember that Karl agreed that you should get the same wrong results, close to each other, if you repeat exactly the same experiment that you did earlier.

It is possible that the change in the conditions was very small (something equivalent to the cluster becoming 0.1% slower), and in this case the change is not going to cause significant noise in the new experiment (additional noise of probably less than 0.1 Elo is not important).

Uri
You can claim he didn't say it all you want. But just read the _direct_ quote from him that I gave to HGM. He said _exactly_ what I said he said. And there is _no_ way to slow the cluster by 0.1%, so that's a dead horse and it is time to stop trying to ride it... it isn't going anywhere.
bob
Posts: 20943
Joined: Mon Feb 27, 2006 7:30 pm
Location: Birmingham, AL

Re: more data... with correct formatting

Post by bob »

Code: Select all

Tue Aug 12 00:49:44 CDT 2008
time control = 1+1
crafty-22.2R4
Rank Name                   Elo    +    - games score oppo. draws
   1 Glaurung 2-epsilon/5   108    7    7  7782   67%   -21   20% 
   2 Fruit 2.1               62    7    6  7782   61%   -21   23% 
   3 opponent-21.7           25    6    6  7780   57%   -21   33% 
   4 Glaurung 1.1 SMP        10    6    6  7782   54%   -21   20% 
   5 Crafty-22.2            -21    4    4 38908   46%     4   23% 
   6 Arasan 10.0           -185    7    7  7782   29%   -21   19% 
Tue Aug 12 11:36:10 CDT 2008
time control = 1+1
crafty-22.2R4
Rank Name                   Elo    +    - games score oppo. draws
   1 Glaurung 2-epsilon/5   110    6    7  7782   67%   -19   21% 
   2 Fruit 2.1               63    6    7  7782   61%   -19   23% 
   3 opponent-21.7           26    6    6  7782   57%   -19   33% 
   4 Glaurung 1.1 SMP         7    6    7  7782   54%   -19   20% 
   5 Crafty-22.2            -19    4    3 38910   47%     4   23% 
   6 Arasan 10.0           -187    6    7  7782   28%   -19   19% 
Wed Aug 13 00:53:43 CDT 2008
time control = 1+1
crafty-22.2R4
Rank Name                   Elo    +    - games score oppo. draws
   1 Glaurung 2-epsilon/5   109    7    6  7782   67%   -16   20% 
   2 Fruit 2.1               63    6    7  7782   61%   -16   24% 
   3 opponent-21.7           23    6    6  7781   56%   -16   32% 
   4 Glaurung 1.1 SMP         3    6    7  7782   53%   -16   21% 
   5 Crafty-22.2            -16    4    3 38909   47%     3   23% 
   6 Arasan 10.0           -182    7    7  7782   28%   -16   19% 
Wed Aug 13 14:19:47 CDT 2008
time control = 1+1
crafty-22.2R4
Rank Name                   Elo    +    - games score oppo. draws
   1 Glaurung 2-epsilon/5   111    7    7  7782   68%   -20   21% 
   2 Fruit 2.1               71    6    7  7782   62%   -20   23% 
   3 opponent-21.7           17    6    6  7780   56%   -20   34% 
   4 Glaurung 1.1 SMP        11    6    7  7782   54%   -20   20% 
   5 Crafty-22.2            -20    3    4 38908   47%     4   23% 
   6 Arasan 10.0           -191    7    7  7782   28%   -20   18% 
Fri Aug 15 00:22:40 CDT 2008
time control = 1+1
crafty-22.2R4a
Rank Name                   Elo    +    - games score oppo. draws
   1 Glaurung 2-epsilon/5   105    7    7  7782   67%   -19   21% 
   2 Fruit 2.1               62    7    6  7782   61%   -19   24% 
   3 opponent-21.7           23    6    6  7782   56%   -19   33% 
   4 Glaurung 1.1 SMP        11    6    7  7782   54%   -19   20% 
   5 Crafty-22.2            -19    4    3 38910   47%     4   23% 
   6 Arasan 10.0           -181    7    7  7782   29%   -19   19% 
olympus% 
hgm
Posts: 28354
Joined: Fri Mar 10, 2006 10:06 am
Location: Amsterdam
Full name: H G Muller

Re: YATT.... (Yet Another Testing Thread)

Post by hgm »

bob wrote:It now appears that there was a correlation issue, but not one anyone seemed to grasp until Karl came along.
Karl wrote:============================================================
can lead to different moves and even different game outcomes. However, we
are doing _almost_ the same thing in each repetition, so although the
results of the 64 repetitions are not perfectly correlated, they are highly
correlated, and far from mathematically independent.

When we do the calculation of the standard deviation, we will not be
understating it by a full factor of 8 as we did in the case of Trials C & D,
but we will still be understating it by almost that much, enough to explain
away the supposed mathematical impossibility. Note that I am specifically
not assuming that whatever changed between Trials E & F gave a systematic
disadvantage to Crafty. I am allowing that the change had a random effect
that sometimes helped and sometimes hurt. My assumption is merely that the
random effect didn't apply to each playout independently, but rather
affected each block of 64 playouts in coordinated fashion.
============================================================

Now, based on that, either (a) "bullshit" is simply the first idea you get whenever you read a post here or (b) you wouldn't recognize bullshit if you stepped in it.

He said _exactly_ what I said he said. Notice the "enough to explain away..." This quote followed the first one I posted from him last week when we started this discussion.
If you think that quote says the same as you said, then indeed, they both count as bullshit. (You have ripped it so badly out of context that I really cannot see what claim is actually being made in that quote.)

What Karl actually said, referring to the long story you have taken this snippet from, was:
I was talking about the standard deviation of the mean of test results in comparison to the true mean winning percentage. This seems to me the relevant number: we want to know what the expected measurement error of our test is, and we want to drive this measurement error as near to zero as possible.

I was not talking about the standard deviation between a test run and an identical or nearly-identical test run. We can make that variation exactly zero if we want to. Big deal. Who wants to get the wrong answer again and again with high precision? (Well maybe it is important to know somehow whether or not _every_ variable has been controlled, but that's a practical question, not a math question.)

I have not explained why the original two 25600-game test runs failed to get the same wrong answer twice with high precision.
So if "it appears there was a correleation issue" refers to the 'hyper-variability' in your initial two runs with 40 positions (i.e. the 6-sigma difference between the first two 25,000-game runs), you cannot possibly have said exactly the same thing as Karl, as you have been talking about something he explicitly denies having said anything about. If not, you should explain us what correlation issue that appeared you were talking about.

So you see, for those who read, it is not so difficult to recognize bullshit. And you are our most generous supplier of it, not Karl....
This is also wrong. Unbalanced positions are bad no matter if you pair them or not. It becomes more difficult to express a small improvement in a game that you are almost certainly going to lose anyway. The improvement then usually only means you can delay the inevitable somewhat longer.
Again, don't buy it at all. If a position is so unbalanced, the two outcomes will be perfectly correlated and cancel out. A single game per position gives twice as many games, hopefully twice as many that are not too unbalanced.
This does not seem to make much sense. Are you sure you do not mean "twice as many positions"? In any case, one can fix the number of positions or the number of games in any way one likes. About the cancelling out, you seem to miss the distinction between cancelling out within one run and cancelling out in the difference between runs (made with the engine versions we wanted to compare), just as you miss it on the subject of correlated games in general, where correlating games within one run has exactly the opposite effect on run variability as correlating them between runs has.
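The within-run versus between-run distinction is easy to see in a toy model (made-up numbers again, the same block structure as in the earlier sketches): two runs built from the same 40-position set agree with each other noticeably better than two runs built from fresh position sets, even though both are equally far from the truth.

Code: Select all

import random

POS, REP, TRIALS, SPREAD = 40, 64, 1000, 0.08

def run_score(ps):
    """Score of one run: REP correlated repeats of each position."""
    wins = sum(sum(random.random() < p for _ in range(REP)) for p in ps)
    return wins / (len(ps) * REP)

def fresh_positions():
    return [min(max(random.gauss(0.5, SPREAD), 0.0), 1.0)
            for _ in range(POS)]

def sd(xs):
    mu = sum(xs) / len(xs)
    return (sum((x - mu) ** 2 for x in xs) / len(xs)) ** 0.5

shared, fresh = [], []
for _ in range(TRIALS):
    ps = fresh_positions()
    shared.append(run_score(ps) - run_score(ps))  # same set in both runs
    fresh.append(run_score(fresh_positions()) -
                 run_score(fresh_positions()))    # independent sets
print(f"SD of run difference, shared positions: {sd(shared):.4f}")
print(f"SD of run difference, fresh positions:  {sd(fresh):.4f}")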
Yeah, yeah. Got it. "stampee foot. Data corrupted. Stampee foot. bad testing. etc..."
As long as you allow your imagination to run amok like that, the only thing that is clear is that you still don't get it at all.
Hopefully we will get this fixed, in spite of your foot music... then you can move on to some other topic where you are absolutely correct, except when you aren't...
Of course. Now that you are finally doing what I suggested 8 months ago, you will have fixed it in spite of my advice. My mistake. I should not have suggested it at that time; then you could have fixed it much earlier. The insight you display is brilliant, as usual... :lol: :lol: :lol: