YATT.... (Yet Another Testing Thread)

Discussion of chess software programming and technical issues.

Moderators: hgm, Rebel, chrisw

Tony

Re: YATT.... (Yet Another Testing Thread)

Post by Tony »

bob wrote:
hgm wrote:
bob wrote:It now appears that there was a correlation issue, but not one anyone seemed to grasp until Karl came along.
This is still absolute bullshit. Karl stated that the results would be farther from the truth when you used fewer positions. But they would have been closer to each other, as they used the same small set of positions. Karl's remark that being closer to the truth necessarily implies that they were closer to each other was even plainly wrong, as my counter-example shows.
OK. Here you go. First a direct quote from karl:

============================================================
can lead to different moves and even different game outcomes. However, we
are doing _almost_ the same thing in each repetition, so although the
results of the 64 repetitions are not perfectly correlated, they are highly
correlated, and far from mathematically independent.

When we do the calculation of the standard deviation, we will not be
understating it by a full factor of 8 as we did in the case of Trials C & D,
but we will still be understating it by almost that much, enough to explain
away the supposed mathematical impossibility. Note that I am specifically
not assuming that whatever changed between Trials E & F gave a systematic
disadvantage to Crafty. I am allowing that the change had a random effect
that sometimes helped and sometimes hurt. My assumption is merely that the
random effect didn't apply to each playout independently, but rather
affected each block of 64 playouts in coordinated fashion.
============================================================
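For what it's worth, the effect Karl describes is easy to reproduce in a toy simulation. The sketch below is not Karl's calculation; the block size of 64 replays comes from the discussion, while the base score rate and the size of the shared per-position "jolt" are made-up numbers chosen only for illustration.

Code:

import random, statistics

# Toy model: each of N_POSITIONS starting positions is replayed REPLAYS times
# per run.  A shared per-position "jolt" shifts the win probability of all
# replays of that position together, so the replays are correlated.
N_POSITIONS = 40
REPLAYS = 64          # block size mentioned in the quote
BASE_P = 0.46         # assumed true score rate
JOLT_SD = 0.05        # assumed size of the shared per-position effect
N_RUNS = 2000

def one_run():
    total = 0
    for _ in range(N_POSITIONS):
        p = min(max(BASE_P + random.gauss(0.0, JOLT_SD), 0.0), 1.0)
        total += sum(random.random() < p for _ in range(REPLAYS))
    return total / (N_POSITIONS * REPLAYS)

means = [one_run() for _ in range(N_RUNS)]
n_games = N_POSITIONS * REPLAYS
naive_se = (BASE_P * (1 - BASE_P) / n_games) ** 0.5   # pretends all games are independent
true_sd = statistics.pstdev(means)                     # actual run-to-run spread

print(f"naive SE assuming {n_games} independent games: {naive_se:.4f}")
print(f"observed run-to-run SD of the mean:           {true_sd:.4f}")

With any non-trivial jolt size the observed spread comes out larger than the naive standard error, which is exactly the sense in which the independence assumption understates the standard deviation.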

Now, based on that, either (a) "bullshit" is simply the first idea you get whenever you read a post here or (b) you wouldn't recognize bullshit if you stepped in it.

He said _exactly_ what I said he said. Notice the "enough to explain away..." This quote followed the first one I posted from him last week when we started this discussion.


And based on the results so far, his idea of eliminating the black/white pairs may also be a good one, since a pair of games, same players, same position, is going to produce a significant correlation between the two results for positions that are not absolutely equal, or which are not equal with respect to the two opponents.
This is also wrong. Unbalanced positions are bad no matter if you pair them or not. It becomes more difficult to express a small improvement in a game that you are almost certainly going to lose anyway. The improvement then usually only means you can delay the inevitable somewhat longer.
Again, don't buy it at all. If a position is so unbalanced, the two outcomes will be perfectly correlated and cancel out. A single game per position gives twice as many games, hopefully twice as many that are not too unbalanced.
Is this true ?

With equal strength (50% winchance)

1 unbalanced position, played twice : => 1 - 1

1 unbalanced, 1 balanced => 1.5 - 0.5

perfect world result 1 - 1

With unequal strength (100% winchance for 1):

1 unbalanced position, played twice : => 1 - 1

1 unbalanced, 1 balanced 2 possibilities
stronger gets winning position => 2 - 0
weaker gets winning position => 1 - 1

perfect world result 2-0
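A tiny enumeration reproduces these numbers, assuming the unbalanced position is a certain win for whoever gets the favourable side (p_bal is the chance that engine A wins a balanced game):

Code:

# Reproduces the expected scores in the example above.  p_bal is the
# probability that engine A wins a balanced game; the unbalanced position is
# assumed to be a certain win for whoever has the favourable side.

def expected_scores(p_bal):
    paired = (1.0, 1.0)              # played with both colours: the forced wins cancel
    a_if_a_gets_win = 1.0 + p_bal    # singles: A got the winning side of the unbalanced position
    a_if_b_gets_win = 0.0 + p_bal    # singles: B got the winning side
    return paired, a_if_a_gets_win, a_if_b_gets_win

for label, p_bal in [("equal strength (p_bal = 0.5)", 0.5),
                     ("A wins every balanced game (p_bal = 1.0)", 1.0)]:
    paired, a1, a2 = expected_scores(p_bal)
    print(label)
    print("  unbalanced position played twice :", paired)
    print("  singles, A gets the winning side : %.1f - %.1f" % (a1, 2 - a1))
    print("  singles, B gets the winning side : %.1f - %.1f" % (a2, 2 - a2))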

Tony
Dirt
Posts: 2851
Joined: Wed Mar 08, 2006 10:01 pm
Location: Irvine, CA, USA

Re: YATT.... (Yet Another Testing Thread)

Post by Dirt »

Tony wrote:
bob wrote:A single game per position gives twice as many games, hopefully twice as many that are not too unbalanced.
Is this true ?

With equal strength (50% winchance)

1 unbalanced position, played twice : => 1 - 1

1 unbalanced, 1 balanced => 1.5 - 0.5

perfect world result 1 - 1

With unequal strength (100% winchance for 1):

1 unbalanced position, played twice : => 1 - 1

1 unbalanced, 1 balanced 2 possibilities
stronger gets winning position => 2 - 0
weaker gets winning position => 1 - 1

perfect world result 2-0

Tony
I don't know what you mean by "perfect world", but I'm pretty sure I wouldn't get your point even if I did.

Keeping in mind that we are going to compare CraftyA and CraftyB against, say, Fruit, the situations I see are:

Code:

1 unbalanced position, played twice : =>CraftyA - Fruit  1 - 0
                                        Fruit - CraftyA  1 - 0
                                        CraftyB - Fruit  1 - 0
                                        Fruit - CraftyB  1 - 0
Which gives us exactly no information about whether CraftyA or CraftyB is stronger.

Code:

1 unbalanced position, 1 balanced : =>CraftyA - Fruit  1 - 0
                                      Fruit - CraftyA  varies
                                      CraftyB - Fruit  1 - 0
                                      Fruit - CraftyB  varies
Now we have two useful games and two useless games, which is an improvement. The net effect is that with twice as many positions we are twice as likely to choose a bad one, but it only does half the damage to the testing. This is about a wash, but it reduces the chance of a worst-case outcome, so it may help a little. A larger benefit is that we explore more of the possible openings, which I think is partly offset by a lowish correlation between how an engine performs on the black and white sides of an opening. The main drawbacks I see are that it will take more effort to produce so many positions, and it will be harder to detect unbalanced positions if Crafty is always the same color.
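A quick sanity check of the "about a wash" intuition, with a made-up 10% chance that any sampled position turns out too unbalanced:

Code:

import random, statistics

# Made-up numbers for illustration only: with pairing, N positions are each
# played twice and a bad position wastes 2 games; with single games, 2N
# positions are each played once and a bad position wastes 1 game.
P_BAD = 0.1            # assumed chance a sampled position is too unbalanced
N = 40                 # positions in the paired scheme
TRIALS = 20_000

paired  = [2 * sum(random.random() < P_BAD for _ in range(N))     for _ in range(TRIALS)]
singles = [1 * sum(random.random() < P_BAD for _ in range(2 * N)) for _ in range(TRIALS)]

for name, wasted in [("paired ", paired), ("singles", singles)]:
    print(name, "wasted games: mean %.2f  sd %.2f"
          % (statistics.mean(wasted), statistics.pstdev(wasted)))

The expected number of wasted games comes out the same either way, but the spread is smaller with single games, which is the "reduces the chance of a worst-case outcome" part.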

I'm not sure what the effect of using a perfect engine (100% win chance, Rybka 20?) would be, but it wouldn't be good, so why would it, or anything even close to it, be used at all? The opponent engines would be chosen to be neither too strong nor too weak.
Fritzlein

Re: YATT.... (Yet Another Testing Thread)

Post by Fritzlein »

Dirt wrote:
bob wrote:So I have a random bug, that happens infrequently, that somehow biases the results where the games take a day or two to play?
That's the most likely situation, although I wouldn't personally call it a bug, as I don't think there is anything wrong with the hardware or software.
I think we might agree, but I want to be careful with the words to make sure of that. My guess is that this "random bug" (actually not a bug but an unknown change in the conditions of the testing system) doesn't in fact "bias the results" in the sense of systematically helping or hurting any bot. Rather it provides a random jolt to the results that is equally likely to hurt or help each bot in each position. People had pointed out before I joined the discussion that such a random jolt would not produce statistically out of bounds results, but it could indeed produce out of bounds results in the following case: replays within test run 1 are internally correlated, replays within test run 2 are internally correlated, but test run 1 and test run 2 (because of the "random bug") are not correlated. Now that Martin has provided hard evidence of internal correlation based on replayed positions, this scenario becomes more likely.

Although I am hard pressed to think of what might make the two runs different at all, it seems to be an easier stretch for the imagination than thinking what might make the two runs biased in different ways.
Dirt wrote:Almost anything you could do to randomize the playing conditions should have helped. Random time controls, random evaluation offsets, random cache sizes, even files or processes left over from the previous use of the node.
Again, I think we might agree, but I want to be careful about the words. The purpose of randomizing playing conditions is to kill correlations between repeated measurements. Things like using different starting positions and different opponents will reduce correlation, so they are good changes to make.

But some kinds of randomizing introduce correlations, and thus are bad. For example, randomizing the positions so much that we use unbalanced positions is bad, because that makes the result of Crafty vs. Glaurung more correlated to the result of Crafty vs. Fruit from the same position. (Unless we never re-use the position under any circumstances.)

Also some kind of randomizing just introduces noise, so it is also bad. For example, randomizing the time control so that Crafty gets 1 to 10 seconds to think and independently Fruit gets 1 to 10 seconds to think will certainly kill off correlations between Crafty vs. Fruit results on replays of the same position, but it will also conceal exactly what we are trying to measure, because it makes the winner less correlated to the true playing strength of each engine. Indeed, for statistically beautifully behaved results, we could randomize so much that each game was essentially a fair coin flip, which would take care of all statistical anomalies, but would also prevent us from measuring anything at all.
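A toy illustration of that trade-off (a made-up model and made-up numbers, purely for illustration): treat each game as decided by a fixed strength edge plus independent per-game noise. The bigger the noise, the less the winner tells you about the edge you are trying to measure.

Code:

import math

EDGE = 0.2          # assumed "true" strength edge of engine A over engine B

def expected_score(noise_scale):
    # logistic link: P(A wins) when the outcome is decided by edge + noise
    return 1.0 / (1.0 + math.exp(-EDGE / noise_scale))

for s in (0.25, 0.5, 1.0, 2.0, 4.0, 8.0):
    p = expected_score(s)
    print(f"noise scale {s:>5}: expected score {p:.3f} "
          f"(signal above 50%: {p - 0.5:+.3f})")

As the noise grows, the expected score collapses toward 50%; each game degenerates toward the fair coin flip described above.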

I'm sure you were not suggesting such outrageous testing procedures, but I just wanted to add caveats to the idea that introducing any kind of randomness is going to improve test design.
bob
Posts: 20943
Joined: Mon Feb 27, 2006 7:30 pm
Location: Birmingham, AL

Re: YATT.... (Yet Another Testing Thread)

Post by bob »

hgm wrote:
bob wrote:It now appears that there was a correlation issue, but not one anyone seemed to grasp until Karl came along.
Karl wrote:============================================================
can lead to different moves and even different game outcomes. However, we
are doing _almost_ the same thing in each repetition, so although the
results of the 64 repetitions are not perfectly correlated, they are highly
correlated, and far from mathematically independent.

When we do the calculation of the standard deviation, we will not be
understating it by a full factor of 8 as we did in the case of Trials C & D,
but we will still be understating it by almost that much, enough to explain
away the supposed mathematical impossibility. Note that I am specifically
not assuming that whatever changed between Trials E & F gave a systematic
disadvantage to Crafty. I am allowing that the change had a random effect
that sometimes helped and sometimes hurt. My assumption is merely that the
random effect didn't apply to each playout independently, but rather
affected each block of 64 playouts in coordinated fashion.
============================================================

Now, based on that, either (a) "bullshit" is simply the first idea you get whenever you read a post here or (b) you wouldn't recognize bullshit if you stepped in it.

He said _exactly_ what I said he said. Notice the "enough to explain away..." This quote followed the first one I posted from him last week when we started this discussion.
If you think that quote says the same as you said, then indeed, they both count as bullshit. (You have ripped it so badly out of context, that I really cannot see what claim is actually being made in that quote.)

What Karl actually said, referring to his long story where you have taken this snippet from was:
I was talking about the standard deviation of the mean of test results in comparison to the true mean winning percentage. This seems to me the relevant number: we want to know what the expected measurement error of our test is, and we want to drive this measurement error as near to zero as possible.

I was not talking about the standard deviation between a test run and an identical or nearly-identical test run. We can make that variation exactly zero if we want to. Big deal. Who wants to get the wrong answer again and again with high precision? (Well maybe it is important to know somehow whether or not _every_ variable has been controlled, but that's a practical question, not a math question.)

I have not explained why the original two 25600-game test runs failed to get the same wrong answer twice with high precision.
So if "it appears there was a correleation issue" refers to the 'hyper-variability' in your initial two runs with 40 positions (i.e. the 6-sigma difference between the first two 25,000-game runs), you cannot possibly have said exactly the same thing as Karl, as you have been talking about something he explicitly denies having said anything about. If not, you should explain us what correlation issue that appeared you were talking about.

So you see, for those who read it is not so difficult to recognize bullshit. And you are our most generous supplier of it, not Karl....
This is also wrong. Unbalanced positions are bad no matter if you pair them or not. It becomes more difficult to express a small improvement in a game that you are almost certainly going to lose anyway. The improvement then usually only means you can delay the inevitable somewhat longer.
Again, don't buy it at all. If a position is so unbalanced, the two outcomes will be perfectly correlated and cancel out. A single game per position gives twice as many games, hopefully twice as many that are not too unbalanced.
This does not seem to make much sense. Are you sure you do not mean "twice as many positions"? In any case, one can fix the number of positions or the number of games in any way one likes. About the cancelling out, you seem to fail to make the distinction between cancelling out within one run and cancelling out in the difference between runs (made with the engine versions we wanted to compare). Just as you fail to make it on the subject of correlated games in general, where correlating games within one run has exactly the opposite effect on run variability as correlating them between runs has.
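For illustration, here is a rough sketch of those two opposite effects (made-up numbers, not anyone's actual test data): a shared per-position bias inflates the spread of a single run's mean, yet it cancels out of the difference between two runs that reuse the same positions.

Code:

import random, statistics

N_POS, GAMES_PER_POS, TRIALS = 40, 64, 1000
BASE_P, POS_SD = 0.46, 0.05            # assumed base score and per-position bias

def run_mean(biases):
    total = 0
    for b in biases:
        p = min(max(BASE_P + b, 0.0), 1.0)
        total += sum(random.random() < p for _ in range(GAMES_PER_POS))
    return total / (N_POS * GAMES_PER_POS)

singles, diff_shared, diff_fresh = [], [], []
for _ in range(TRIALS):
    pos_a = [random.gauss(0.0, POS_SD) for _ in range(N_POS)]
    pos_b = [random.gauss(0.0, POS_SD) for _ in range(N_POS)]
    singles.append(run_mean(pos_a))
    diff_shared.append(run_mean(pos_a) - run_mean(pos_a))   # both runs reuse the same positions
    diff_fresh.append(run_mean(pos_a) - run_mean(pos_b))    # each run draws its own positions

naive_se = (BASE_P * (1 - BASE_P) / (N_POS * GAMES_PER_POS)) ** 0.5
print("naive SE if all games were independent :", round(naive_se, 4))
print("SD of a single run's mean              :", round(statistics.pstdev(singles), 4))
print("SD of run difference, shared positions :", round(statistics.pstdev(diff_shared), 4))
print("SD of run difference, fresh positions  :", round(statistics.pstdev(diff_fresh), 4))

The single-run spread exceeds the naive independent-games figure (the within-run effect), while the shared-position differences come out tighter than the fresh-position ones (the between-run effect).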
Yeah, yeah. got it. "stampee foot. Data corrupted. Stampee foot. bad testing. etc..."
As long as you allow your imagination to run amok like that, the only thing that is clear is that you still don't get it at all.
Hopefully we will get this fixed, in spite of your foot music... then you can move on to some other topic where you are absolutely correct, except when you aren't...
Of course. When you are now finally doing as I suggested 8 months ago, you will have fixed it in spite of my advice. My mistake. I should not have suggested it at that time, then you could have fixed it much earlier. The insight you display is brilliant, as usual... :lol: :lol: :lol:
A. How about showing me a post you wrote, where the primary point was "Bob, the problem with your testing is that you are using only 40 positions (even though many others are using the same positions). If you would just increase the number of positions significantly, and no longer play each position more than twice, then your results will exhibit far less random behavior, and the results won't be nearly so far outside the normal statistical expectation."

B. I suppose you _might_ have said that, but most likely, if you did, it was buried deep inside so much bullshit that no normal human would have tried to delve it out without full SCUBA gear.

But please show me where you clearly said that the Silver test, or the Nunn test, or the Noomen test and such are no good. Because I certainly do not remember reading such. In fact, I believe you have _always_ been on the side of the audience that believes that such a large number of games is not necessary to evaluate changes...

ball is in your court...

Karl made a suggestion, Uri made a suggestion on where to get the positions, and I tried it. Had a _reasonable_ suggestion been made 8 months ago (where that comes from is unknown as these testing threads have been going on for almost 2 years now) I'd certainly have tried it. Just as I did this time.
bob
Posts: 20943
Joined: Mon Feb 27, 2006 7:30 pm
Location: Birmingham, AL

Re: YATT.... (Yet Another Testing Thread)

Post by bob »

Tony wrote:
bob wrote:
hgm wrote:
bob wrote:It now appears that there was a correlation issue, but not one anyone seemed to grasp until Karl came along.
This is still absolute bullshit. Karl stated that the results would be farther from the truth when you used fewer positions. But they would have been closer to each other, as they used the same small set of positions. Karl's remark that being closer to the truth necessarily implies that they were closer to each other was even plainly wrong, as my counter-example shows.
OK. Here you go. First a direct quote from karl:

============================================================
can lead to different moves and even different game outcomes. However, we
are doing _almost_ the same thing in each repetition, so although the
results of the 64 repetitions are not perfectly correlated, they are highly
correlated, and far from mathematically independent.

When we do the calculation of the standard deviation, we will not be
understating it by a full factor of 8 as we did in the case of Trials C & D,
but we will still be understating it by almost that much, enough to explain
away the supposed mathematical impossibility. Note that I am specifically
not assuming that whatever changed between Trials E & F gave a systematic
disadvantage to Crafty. I am allowing that the change had a random effect
that sometimes helped and sometimes hurt. My assumption is merely that the
random effect didn't apply to each playout independently, but rather
affected each block of 64 playouts in coordinated fashion.
============================================================

Now, based on that, either (a) "bullshit" is simply the first idea you get whenever you read a post here or (b) you wouldn't recognize bullshit if you stepped in it.

He said _exactly_ what I said he said. Notice the "enough to explain away..." This quote followed the first one I posted from him last week when we started this discussion.


And based on the results so far, his idea of eliminating the black/white pairs may also be a good one, since a pair of games, same players, same position, is going to produce a significant correlation between the two results for positions that are not absolutely equal, or which are not equal with respect to the two opponents.
This is also wrong. Unbalanced positions are bad no matter if you pair them or not. It becomes more difficult to express a small improvement in a game that you are almost certainly going to lose anyway. The improvement then usually only means you can delay the inevitable somewhat longer.
Again, don't buy it at all. If a position is so unbalanced, the two outcomes will be perfectly correlated and cancel out. A single game per position gives twice as many games, hopefully twice as many that are not too unbalanced.
Is this true ?

With equal strength (50% winchance)

1 unbalanced position, played twice : => 1 - 1

1 unbalanced, 1 balanced => 1.5 - 0.5

perfect world result 1 - 1

With unequal strength (100% winchance for 1):

1 unbalanced position, played twice : => 1 - 1

1 unbalanced, 1 balanced 2 possibilities
stronger gets winning position => 2 - 0
weaker gets winning position => 1 - 1

perfect world result 2-0

Tony
The issue was "independent or non-correlated results." In a 2-game match on an unbalanced position, the two results are correlated. Think about the extreme case: 100 positions, all unbalanced. So you get 100 wins and 100 losses whatever you change. Now take 200 positions, 100 unbalanced, 100 pretty even. Changes you make are not going to affect the unbalanced results, but will affect the other 100 games. Which set will give the most useful information???
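To put a rough number on that, a sketch with a made-up 2% real improvement (illustrative numbers only) shows how much each set of positions lets the improvement move the total score:

Code:

# Each position is played twice (once per colour).  A decided position always
# returns exactly one point per pair, so a real improvement can only show up
# in the balanced games.  DELTA is a made-up 2% improvement in win probability.
DELTA = 0.02

def expected_score(n_unbalanced, n_balanced, delta):
    decided  = n_unbalanced * 1.0                 # always exactly 1 point per unbalanced pair
    balanced = n_balanced * 2 * (0.5 + delta)     # the improvement shows up here only
    return decided + balanced

for name, unb, bal in [("100 unbalanced,   0 balanced", 100, 0),
                       ("100 unbalanced, 100 balanced", 100, 100)]:
    before = expected_score(unb, bal, 0.0)
    after  = expected_score(unb, bal, DELTA)
    print(f"{name}: {before:.0f} -> {after:.0f} points "
          f"(shift {after - before:+.1f})")

The all-unbalanced set cannot move at all, no matter how large the improvement; only the balanced half of the second set responds.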
bob
Posts: 20943
Joined: Mon Feb 27, 2006 7:30 pm
Location: Birmingham, AL

Re: YATT.... (Yet Another Testing Thread)

Post by bob »

hgm wrote:
bob wrote:I suppose you guys are going to keep up with the "stampee feet, error, stampee feet, bug, stampee feet, etc" no matter what happens?
Again, completely wrong. As soon as there would be new data relevant to this matter, we would of course refine / reconsider our position. But as you are not providing relevant data, what was true at t=0 remains as true as ever...
How many runs have I posted over the past 2 years with that wild variability?
Apart from this time, I have seen only one. And later it turned out this was likely selected data, not recently taken, that happened to be "lying around" for some unknown reason. When you showed more typical data taken through the same procedure, the anomaly was totally absent. In all other runs I have seen the 'wild' aspect existed only in your imagination, and the variability was absolutely normal.
"was likely selected data..." Did I not hear this _every_ time I posted results? If you re-read your _first_ response to the most recent data, did you not say the same thing yet again? So that has been a constant in all your posts. "selected data" or, if not selected data, then "you have a bug or problem". But I posted _several_ sets of data over the past 1.5 years or so, all showing the same stuff. Some more, some less random. But all far more random than the recent results. So I am "selecting data" to show more randomness than is there, then I now must be selecting data to show less randomness. And exactly _what_ am I getting out of running all these tests, and according to you I must be running 100,000 times as much as I am showing to be able to select things that are 6 sigma apart?

The old war-cry is getting old, and you need to move on. I've not selected any data at all, in any way other than _exactly_ what I have stated in the past. And the present. I run 'em as fast as I can and report exactly what comes out, exactly as it comes out.
And now, suddenly, at least for 4 full runs plus one partial, everything looks perfectly normal.
As it should and as it does.
And yet that variability is the result of a bug, when _everything_ about all the runs I have done has been absolutely identical.
Apparently, you still don't understand. There isn't anything like "that variability". Variability comes in kinds, distinguished by quantitative analyses. This remark is at the level of "I have shown you 10 flies now, and still you maintain that the elephant I showed you the other day is big. How many more flies will I have to show you before you will admit that all animals are equal?"...
Same exact programs, same exact starting positions, same time control, same hash sizes, same processors, same operating systems, same compilers, same executables, same referee program, in short, same _everything except for the new set of more starting positions.
"Same jungle, same month of the year, same safari truck. Only this time I was wearing a green helmet in stead of a red one. And I haven't caught any big flies with a trunk and tusks in the week that I did that"
So we resort to nonsense when we can't come up with a reasonable argument to explain the behavior? "bug". "correlation". "ignorant tester". "cherry-picked results". etc...

A rational person would eventually assume that if one thing was changed, and the result changed significantly, then that one change is most likely the thing that caused the result to change.
Wrong. This is in general what superstitious persons assume. Something unlikely but bad happens to them when they were wearing a red hat, and then they will never wear that color hat again in their life.
That is the type of comment I would expect from a 6th grader. This was _not_ a one-time event. That's where your argument fails. Yes, if something unlikely happens (bird craps on head) 5 times in a row, but only when I am wearing a red hat, then I would logically conclude that the hat is playing a role. Certainly more likely than to assume it is some other influence that we can't measure, when it is much more likely that the bird is just attracted to the red hat. It actually does happen in nature, in fact. You can find it in the literature.

Others just continue to shout bug, over and over, without _ever_ being able to give me a scenario, knowing _exactly_ what both of these clusters look like, that I could use to _make_ those odd results happen, without modifying anything that is known to be constant (executables, nodes, etc.)
As discussed, not everything was the same. The time was different, and you did not make any effort to prevent the engines from knowing that time. There could even be a bug in your OS that is coupled to a high byte of the time.
Do you know how unix keeps time? Didn't think so. Hint: the high byte has _zero_ to do with anything here. As far as making any effort to prevent engines from knowing that time, do _you_ test like that? What about those that test commercial engines and have no source code? That is so far beyond stupid, it takes sunlight 6 months to get from stupid to there...

Next we go on to cosmic rays and such and their potential effect on electronic circuits? While the most likely issue is right out in plain sight...

Karl pretty clearly (at least to me) explained how replaying the same positions multiple times would tend to produce results that have a significant correlation, and that by doing so, the SD we are measuring becomes wrong. I had never considered that, until he offered a clear, concise, and logical argument explaining that, and then made a simple suggestion that could be tested to confirm his explanation. Those are the kinds of suggestions that lead to positive results. Not "bullshit". "stampee feet". "cluster has bug". "make sure programs can't determine time" and other such complete and utter bullshit. (And I do know how to recognize bullshit, of which we have plenty of examples above).
bob
Posts: 20943
Joined: Mon Feb 27, 2006 7:30 pm
Location: Birmingham, AL

Re: more data... with correct formatting

Post by bob »

bob wrote:

Code:

Tue Aug 12 00:49:44 CDT 2008
time control = 1+1
crafty-22.2R4
Rank Name                   Elo    +    - games score oppo. draws
   1 Glaurung 2-epsilon/5   108    7    7  7782   67%   -21   20% 
   2 Fruit 2.1               62    7    6  7782   61%   -21   23% 
   3 opponent-21.7           25    6    6  7780   57%   -21   33% 
   4 Glaurung 1.1 SMP        10    6    6  7782   54%   -21   20% 
   5 Crafty-22.2            -21    4    4 38908   46%     4   23% 
   6 Arasan 10.0           -185    7    7  7782   29%   -21   19% 
Tue Aug 12 11:36:10 CDT 2008
time control = 1+1
crafty-22.2R4
Rank Name                   Elo    +    - games score oppo. draws
   1 Glaurung 2-epsilon/5   110    6    7  7782   67%   -19   21% 
   2 Fruit 2.1               63    6    7  7782   61%   -19   23% 
   3 opponent-21.7           26    6    6  7782   57%   -19   33% 
   4 Glaurung 1.1 SMP         7    6    7  7782   54%   -19   20% 
   5 Crafty-22.2            -19    4    3 38910   47%     4   23% 
   6 Arasan 10.0           -187    6    7  7782   28%   -19   19% 
Wed Aug 13 00:53:43 CDT 2008
time control = 1+1
crafty-22.2R4
Rank Name                   Elo    +    - games score oppo. draws
   1 Glaurung 2-epsilon/5   109    7    6  7782   67%   -16   20% 
   2 Fruit 2.1               63    6    7  7782   61%   -16   24% 
   3 opponent-21.7           23    6    6  7781   56%   -16   32% 
   4 Glaurung 1.1 SMP         3    6    7  7782   53%   -16   21% 
   5 Crafty-22.2            -16    4    3 38909   47%     3   23% 
   6 Arasan 10.0           -182    7    7  7782   28%   -16   19% 
Wed Aug 13 14:19:47 CDT 2008
time control = 1+1
crafty-22.2R4
Rank Name                   Elo    +    - games score oppo. draws
   1 Glaurung 2-epsilon/5   111    7    7  7782   68%   -20   21% 
   2 Fruit 2.1               71    6    7  7782   62%   -20   23% 
   3 opponent-21.7           17    6    6  7780   56%   -20   34% 
   4 Glaurung 1.1 SMP        11    6    7  7782   54%   -20   20% 
   5 Crafty-22.2            -20    3    4 38908   47%     4   23% 
   6 Arasan 10.0           -191    7    7  7782   28%   -20   18% 
Fri Aug 15 00:22:40 CDT 2008
time control = 1+1
crafty-22.2R4a
Rank Name                   Elo    +    - games score oppo. draws
   1 Glaurung 2-epsilon/5   105    7    7  7782   67%   -19   21% 
   2 Fruit 2.1               62    7    6  7782   61%   -19   24% 
   3 opponent-21.7           23    6    6  7782   56%   -19   33% 
   4 Glaurung 1.1 SMP        11    6    7  7782   54%   -19   20% 
   5 Crafty-22.2            -19    4    3 38910   47%     4   23% 
   6 Arasan 10.0           -181    7    7  7782   29%   -19   19% 
olympus% 
Here is the next partial run. Should be done in a bit. Cluster load is a bit variable at the moment so sometimes I can use all 260 CPUs, sometimes I get down to half of that (or even lower).

Code:

35032 game(s) loaded, 0 game(s) with unknown result ignored.
Rank Name                   Elo    +    - games score oppo. draws
   1 Glaurung 2-epsilon/5   110    7    7  7000   68%   -20   22% 
   2 Fruit 2.1               62    7    7  6998   61%   -20   24% 
   3 opponent-21.7           25    7    7  7001   57%   -20   33% 
   4 Glaurung 1.1 SMP        10    6    7  7006   54%   -20   20% 
   5 Crafty-22.2            -20    3    4 35032   46%     4   24% 
   6 Arasan 10.0           -188    7    7  7027   28%   -20   20% 
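(As a crude cross-check on runs of this size, assuming the ~38,900 games were independent, the standard error of Crafty's score fraction and the matching Elo error bar work out roughly as below. This ignores opponent-rating uncertainty and whatever draw model the rating tool uses, so it will not match the ± columns above exactly; the score and draw rate are just read off the tables.)

Code:

import math

games = 38908
score = 0.465          # roughly the 46-47% shown in the tables above
draws = 0.23           # roughly the draw rate shown above

wins = score - draws / 2
var_per_game = wins + draws / 4 - score ** 2      # E[x^2] - mu^2 for x in {0, 0.5, 1}
se_score = math.sqrt(var_per_game / games)

elo_per_score = 400.0 / (math.log(10) * score * (1 - score))   # local slope of the Elo curve
print(f"SE of the score fraction : {se_score:.4f}")
print(f"approx. 1-sigma Elo error: {se_score * elo_per_score:.2f}")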
hgm
Posts: 28123
Joined: Fri Mar 10, 2006 10:06 am
Location: Amsterdam
Full name: H G Muller

Re: YATT.... (Yet Another Testing Thread)

Post by hgm »

bob wrote:A. How about showing me a post you wrote, where the primary point was "Bob, the problem with your testing is that you are using only 40 positions (even though many others are using the same positions).
How about this one? http://www.talkchess.com/forum/viewtopi ... 75&start=0 :roll:
If you would just increase the number of positions significantly, and no longer play each position more than twice, then your results will exhibit far less random behavior, and the results won't be nearly so far outside the normal statistical expectation."
Of course the latter is not in that post, because it is simply not true. And no one except you has claimed it would be true, here in this thread, least of all Karl! Whether you pair positions or not, or whether you play them 1 time or 10,000 times, none of that can have the slightest effect on how far the variance of the results will fall outside normal statistical expectations, as long as you do the same thing in every run. It only has an effect on how far the results may differ from the quantity you were interested in measuring. But as you do not know the value of that quantity beforehand, you would never notice anything strange, no matter how far off your results were.
B. I suppose you _might_ have said that, but most likely, if you did, it was buried deep inside so much bullshit that no normal human would have tried to delve it out without full SCUBA gear.
Seems to me even the title says it all. But if that makes it "buried too deep" for you to read... Well, that really says it all too, for the effort you are prepared to make in order to read what others write, wouldn't you agree?

And sorry, I do apologize... It was not 8 months ago, it was 11 months ago. :?
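To illustrate that distinction (made-up bias size; the 40-position and 64-replay figures are the ones from this thread): replaying the same positions over and over shrinks the noise of the measured score, but not its offset from the quantity you are actually trying to measure; only drawing more positions shrinks that.

Code:

import random, statistics

TRUE_P, POS_SD, TRIALS = 0.50, 0.05, 1000   # assumed true score and per-position bias

def gauntlet_error(n_positions, replays):
    # draw a fresh sample of positions, each with its own bias, play them,
    # and return the deviation of the measured score from the truth
    biases = [random.gauss(0.0, POS_SD) for _ in range(n_positions)]
    score = 0
    for b in biases:
        p = min(max(TRUE_P + b, 0.0), 1.0)
        score += sum(random.random() < p for _ in range(replays))
    return score / (n_positions * replays) - TRUE_P

for n_pos, replays in [(40, 1), (40, 64), (2560, 1)]:
    errs = [gauntlet_error(n_pos, replays) for _ in range(TRIALS)]
    print(f"{n_pos:>5} positions x {replays:>2} replays "
          f"({n_pos * replays:>5} games): RMS error vs truth {statistics.pstdev(errs):.4f}")

With the same 2560 games, 40 positions replayed 64 times each comes out noticeably less accurate than 2560 distinct positions played once.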
bob
Posts: 20943
Joined: Mon Feb 27, 2006 7:30 pm
Location: Birmingham, AL

Re: YATT.... (Yet Another Testing Thread)

Post by bob »

OK, here's the "conclusions" you gave:


======================================================
When we use only a limited number of positions, the sampling of these positions from the larger set of possible positions also causes a statistical error. If we are unlucky, we select an above-average number of positions where our engine happens to be poor (e.g. it has the tendency to make a certain mistake when playing one of the colors from that position). That would mean we observe an apparently poorer performance of our engine against that opponent than it would have in real games. That error would not decrease by increasing the number of games from the given positions. It could only decrease by increasing the number of different starting positions.

If we would just measure the performance of our engine (and its various versions) based on repeating the same 80-game gauntlet many times against the same opponent, we would in particular run the risk of adopting changes that happen to change the preference for a certain bad move in one of the initial positions, while averaged over all possible positions (including those we don't test from) it would actually make the engine weaker. E.g. if one of the positions would offer the possibility for a Knight move that is bad for deep tactical reasons, a positionally driven tendency to play that move (which does not even involve the opponent!) might cause that game to be mostly lost, instead of giving an average 50-50 score that would be more representative against this opponent. So you would lose 0.5/80 ≈ 0.63% on the gauntlet score. "Freezing" that knight by upping the piece-square value of the square it is on to the point where another move is better will earn you that 0.63% in an easy way. Even if it loses 0.3% in average play on the remaining positions, and in the grand total of all possible positions, the change would still evaluate as good.

Conclusion: very precise measurement of the gauntlet result by repeating the gauntlet, to an accuracy better than or similar to the inverse of the number of positions in your gauntlet, will start to train your engine to score better in the gauntlet, even if this goes at the expense of playing good chess. One should be careful not to accept small "improvements" that are not sufficiently far above the statistical noise caused by sampling the positions.
==========================================================

So we are concerned about "training" on the small set of positions, and we are concerned about the possibility of randomly choosing bad positions (these were not randomly chosen, of course; they were intentionally chosen as representative openings in fairly balanced positions).

So exactly where is any argument suggesting _anything_ related to what Karl pointed out? You are talking about errors with accepting/rejecting improvements because of the poor choice of positions. Which has exactly _zero_ to do with the current discussion.

try again...