YATT.... (Yet Another Testing Thread)

Discussion of chess software programming and technical issues.

Moderators: hgm, Rebel, chrisw

hgm
Posts: 27809
Joined: Fri Mar 10, 2006 10:06 am
Location: Amsterdam
Full name: H G Muller

Re: YATT.... (Yet Another Testing Thread)

Post by hgm »

bob wrote:It now appears that there was a correlation issue, but not one anyone seemed to grasp until Karl came along.
This is still absolute bullshit. Karl stated that the results would be farther from the truth when you used fewer positions. But they would have been closer to each other, as they used the same small set of positions. Karl's remark that being closer to the truth necessarily implies that they were closer to each other was even plain wrong, as my counter-example shows.
And based on the results so far, his idea of eliminating the black/white pairs may also be a good one, since a pair of games, same players, same position, is going to produce a significant correlation between the positions that are not absolutely equal, or which are not equal with respect to the two opponents.
This is also wrong. Unbalanced positions are bad no matter if you pair them or not. It becomes more difficult to express a small improvement in a game that you are almost certainly going to lose anyway. The improvement then usually only means you can delay the inevitable somewhat longer.
And unfortunately, that is simply a matter of fact when choosing a significant number of positions... But the results are not "corrupted". I've had (and posted here) way too many of these same kinds of results, using these same positions. I am currently running another 4 sets with the new approach, this time making sure that I can save the PGN. 8 runs with consistent results will be a huge change from what I was getting with about the same number of games before, but using 100 times fewer positions.
Well, I already told you nearly a year ago that the error caused by the low number of positions was starting to dominate the error in your result for the number of games you play. But that is only the error between your results and the true strength difference, which does not show up in the difference between runs with identical programs, as all these runs used the same small set of positions, and thus suffer equally when these positions are not representative.

The data of your first runs was corrupted. If you had saved the PGN, you could have known that from the internal variation. (And you would have known how many positions you need to use.) Now you can only know that by repeating the run. But you are not going to do that. So in the end you will know nothing, except how to determine a square root by playing Chess games.
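
To make the distinction concrete, here is a minimal Monte Carlo sketch in Python (the position count, bias spread and game counts are made-up numbers for illustration, not anything from the actual cluster runs): with a small position set whose biases are drawn once and then reused, two runs land close to each other, while both can sit well away from the true score.

Code:

import random

TRUE_SCORE = 0.50          # true expected score between the two engines
N_POSITIONS = 40           # small, fixed position set (hypothetical size)
GAMES_PER_POS = 100        # games per position in each run

random.seed(1)
# Per-position bias, drawn ONCE and shared by every run (spread is assumed).
bias = [random.gauss(0.0, 0.10) for _ in range(N_POSITIONS)]

def run():
    """One run over the same fixed position set; returns the average score."""
    total = 0.0
    for b in bias:
        p = min(max(TRUE_SCORE + b, 0.0), 1.0)   # expected score in this position
        wins = sum(random.random() < p for _ in range(GAMES_PER_POS))
        total += wins / GAMES_PER_POS
    return total / N_POSITIONS

r1, r2 = run(), run()
print("run 1: %.3f   run 2: %.3f   (close to each other)" % (r1, r2))
print("true score: %.3f  (both runs share the same positional error)" % TRUE_SCORE)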
oystein

Re: YATT.... (Yet Another Testing Thread)

Post by oystein »

hgm wrote:
oystein wrote:So there really is a difference between the 2 sets of starting positions. I think more test with different starting sets would be interesting.
The difference is that the data using the first set of starting positions is somehow corrupted, as it seems now, in particular the first run. So averaging the runs corrupts everything. The proper way to process the data would be to discard the first run. (Or the part of the first run that was corrupted, but as only the total result of that run has survived, we can no longer make that distinction.)

Lacking details that allow us to judge runs for acceptability based on their internal details, the proper way to deal with the situation would be to redo the run with the first set of positions, and use 2:1 voting to decide which to discard.
I agree that the result from the first 25,000-game run is suspicious. I did write that this run should be removed, but deleted it before I posted because I thought it had been "discussed" enough without getting anywhere.

If we remove the suspicious run we get:

Code:

#games                 6397     25595    38908 38910 38909 38908
 
Fruit 2.1                37        38       58    59    60    67
opponent-21.7            29        28       21    22    20    13
Glaurung 1.1 SMP         25        16        6     3     0     7
This is pretty convincing to me.
Uri Blass
Posts: 10307
Joined: Thu Mar 09, 2006 12:37 am
Location: Tel-Aviv Israel

Re: YATT.... (Yet Another Testing Thread)

Post by Uri Blass »

hgm wrote:
bob wrote:It now appears that there was a correlation issue, but not one anyone seemed to grasp until Karl came along.
This is still absolute bullshit. Karl stated that the results would be farther from the truth when you used fewer positions. But they would have been closer to each other, as they used the same small set of positions. Karl's remark that being closer to the truth necessarily implies that they were closer to each other was even plain wrong, as my counter-example shows.
And based on the results so far, his idea of eliminating the black/white pairs may also be a good one, since a pair of games, same players, same position, is going to produce a significant correlation between the positions that are not absolutely equal, or which are not equal with respect to the two opponents.
This is also wrong. Unbalanced positions are bad no matter if you pair them or not. It becomes more difficult to express a small improvement in a game that you are almost certainly going to lose anyway. The improvement then usually only means you can delay the inevitable somewhat longer.
And unfortunately, that is simply a matter of fact when choosing a significant number of positions... But the results are not "corrupted". I've had (and posted here) way too many of these same kinds of results, using these same positions. I am currently running another 4 sets with the new approach, this time making sure that I can save the PGN. 8 runs with consistent results will be a huge change from what I was getting with about the same number of games before, but using 100 times fewer positions.
Well, I already told you nearly a year ago that the error caused by the low number of positions was starting to dominate the error in your result for the number of games you play. But that is only the error between your results and the true strength difference, which does not show up in the difference between runs with identical programs, as all these runs used the same small set of positions.

The data of your first runs was corrupted. If you had saved the PGN, you could have known that from the internal variation. (And you would have known how many positions you need to use.) Now you can only know that by repeating the run. But you are not going to do that. So in the end you will know nothing, except how to determine a square root by playing Chess games.
I do not suggest Bob to repeat the same run that is not good for measuring small changes.

Note that it is possible that the corruption does not happen often, in which case there is a good chance that repeating the first run is not going to help Bob find the error, because the error is simply not going to appear in the next few runs.

I simply suggest that he save the PGN (and he does), so that later it can help him find errors.

Note that I do not agree with Karl about eliminating the black/white pairs; I think the correlation introduced by playing black and white from the same position is a correlation that reduces the error.

If there are unbalanced positions this is obvious: in the extreme case where the result depends only on the position, you get exactly 50% if you play the position both white and black, whereas you get extra noise that has nothing to do with the strength of the engines if you do not.

I believe the positions are not extremely unbalanced, but I also do not believe they are extremely balanced; white may score 55% in some positions between equal engines, so having white/black for the same position has the clear advantage of reducing noise.
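
To illustrate this with a rough simulation (all numbers here are made up; the imbalances are assumed, not measured from any real position set): when each position is played once with each colour, the colour advantage cancels inside the pair, while random colour assignment leaves it in the run-to-run spread.

Code:

import random, statistics

N_PAIRS = 500        # 1000 games per simulated run
TRIALS = 3000        # repeated runs, used only to estimate the spread

def game(p):
    """One game for engine A; p is A's expected score with this position/colour."""
    return 1.0 if random.random() < p else 0.0

def run(paired):
    total = 0.0
    for _ in range(N_PAIRS):
        edge = random.uniform(0.0, 0.4)          # assumed White advantage, mild to heavy
        if paired:                               # both colours: the edge cancels
            total += game(0.5 + edge) + game(0.5 - edge)
        else:                                    # random colours: the edge stays in as noise
            for _ in range(2):
                total += game(0.5 + edge if random.random() < 0.5 else 0.5 - edge)
    return total / (2 * N_PAIRS)

random.seed(2)
print("paired   spread: %.4f" % statistics.pstdev(run(True) for _ in range(TRIALS)))
print("unpaired spread: %.4f" % statistics.pstdev(run(False) for _ in range(TRIALS)))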

Uri
oystein

Re: YATT.... (Yet Another Testing Thread)

Post by oystein »

bob wrote:One thing to keep in mind. There are several ways to test, and each shows different things. A general set of starting positions, if it is a small set, can put an engine in a position that (a) it plays poorly, and (b) will never see in a real game because its opening book does not allow that opening to be played.

The larger and more varied set of positions probably does a better overall job of measuring strength between the engines, but that is not exactly what I am looking for. I am wanting to measure the difference between two versions of the same program, that use the same starting positions against the same opponents, so the only thing that varies is the changes to the new version.
But the measured difference between your versions may vary with the test set, so it is important to get a representative test set. That's why I think more tests with different sets would be interesting. But perhaps you already know a lot about this (I am new here, as you may have noticed).
It appears to require a _large_ number of games, measured in the tens of thousands, to measure small changes, assuming it is even possible to do this... My intent with these tests is to attempt to quantify the _minimum_ testing necessary to say whether a change is good or bad...
How many elo points is "small changes"?
bob
Posts: 20943
Joined: Mon Feb 27, 2006 7:30 pm
Location: Birmingham, AL

Re: YATT.... (Yet Another Testing Thread)

Post by bob »

Uri Blass wrote:
bob wrote:
hgm wrote:
oystein wrote:So there really is a difference between the 2 sets of starting positions. I think more test with different starting sets would be interesting.
The difference is that the data using the first set of starting positions is somehow corrupted, as it seems now, in particular the first run. So averaging the runs corrupts everything. The proper way to process the data would be to discard the first run. (Or the part of the first run that was corrupted, but as only the total result of that run has survived, we can no longer make that distinction.)

Lacking details that allow us to judge runs for acceptability based on their internal details, the proper way to deal with the situation would be to redo the run with the first set of positions, and use 2:1 voting to decide which to discard.
This "corrupted" is simply wrong. It now appears that there was a correlation issue, but not one anyone seemed to grasp until Karl came along. And based on the results so far, his idea of eliminating the black/white pairs may also be a good one, since a pair of games, same players, same position, is going to produce a significant correlation between the positions that are not absolutely equal, or which are not equal with respect to the two opponents. And unfortunately, that is simply a matter of fact when choosing a significant number of positions... But the results are not "corrupted". I've had (and posted here) way too many of these same kinds of results, using these same positions. I am currently running another 4 sets with the new approach, this time making sure that I can save the PGN. 8 runs with consistent results will be a huge change from what I was getting with about the same number of games before, but using 100 times fewer positions.
My opinion is that the results were corrupted, but without having the data (PGN) I have no way to know what was corrupted. (It may be that it is a bug that does not happen every day and that you did not see it in the games you checked; that is why it is important to have the PGN, so that later you can analyze what happened.)
Without the PGN, discussion about it is not going to be productive.
When I have time, and nothing useful to run on the cluster, I will re-run the original positions again, since even though I don't have the PGN, I have everything needed to reproduce PGN samples that ought to show some variability. The next two might be worse, or they might look normal, but those results should be repeatable.

The correlation that Karl talked about is a correlation that can give the same wrong result again and again, not a correlation that can explain the results that you got.

Karl has no explanation for the fact that you did not get almost the same wrong result twice.
He had a pretty good explanation about correlation however, and explained how that _could_ produce those results...

I did not talk earlier about the correlation that Karl meant, because I was trying to explain your results, not to help you design an experiment to measure small changes.

I thought it was important to find out what you did wrong, for the sake of future tests, because the same type of mistake might also cause errors in otherwise correct tests to measure small changes.

Uri
So I have a random bug, that happens infrequently, that somehow biases the results where the games take a day or two to play? And now that we have changed the number of starting positions, suddenly that "bug" has been fixed, even though _nothing_ has been changed except for the file containing the PGN records. Even though others have found the same high variability in results from the same starting position, etc?

Until I see more of this random behavior, which has not yet been seen after four even bigger tests (doesn't mean it won't, but it hasn't yet) I feel somewhat comfortable in concluding that since the only thing that changed between the two experiments is the starting positions and the number of them, that most likely _that_ is what is making a difference.
bob
Posts: 20943
Joined: Mon Feb 27, 2006 7:30 pm
Location: Birmingham, AL

Re: YATT.... (Yet Another Testing Thread)

Post by bob »

Uri Blass wrote:
hgm wrote:
bob wrote:It now appears that there was a correlation issue, but not one anyone seemed to grasp until Karl came along.
This is still absolute bullshit. Karl stated that the results would be farther from the truth when you used fewer positions. But they would have been closer to each other, as they used the same small set of positions. Karl's remark that being closer to the truth necessarily implies that they were closer to each other was even plain wrong, as my counter-example shows.
And based on the results so far, his idea of eliminating the black/white pairs may also be a good one, since a pair of games, same players, same position, is going to produce a significant correlation between the positions that are not absolutely equal, or which are not equal with respect to the two opponents.
This is also wrong. Unbalanced positions are bad no matter if you pair them or not. It becomes more difficult to express a small improvement in a game that you are almost certainly going to lose anyway. The improvement then usually only means you can delay the inevitable somewhat longer.
And unfortunately, that is simply a matter of fact when choosing a significant number of positions... But the results are not "corrupted". I've had (and posted here) way too many of these same kinds of results, using these same positions. I am currently running another 4 sets with the new approach, this time making sure that I can save the PGN. 8 runs with consistent results will be a huge change from what I was getting with about the same number of games before, but using 100 times fewer positions.
Well, I already told you nearly a year ago that the error caused by the low number of positions was starting to dominate the error in your result for the number of games you play. But that is only the error between your results and the true strength difference, which does not show up in the difference between runs with identical programs, as all these runs used the same small set of positions.

The data of your first runs was corrupted. If you had saved the PGN, you could have known that from the internal variation. (And you would have known how many positions you need to use.) Now you can only know that by repeating the run. But you are not going to do that. So in the end you will know nothing, except how to determine a square root by playing Chess games.
I do not suggest Bob to repeat the same run that is not good for measuring small changes.

Note that it is possible that the corruption does not happen often, in which case there is a good chance that repeating the first run is not going to help Bob find the error, because the error is simply not going to appear in the next few runs.

I simply suggest that he save the PGN (and he does), so that later it can help him find errors.

Note that I do not agree with Karl about eliminating the black/white pairs; I think the correlation introduced by playing black and white from the same position is a correlation that reduces the error.

If there are unbalanced positions this is obvious: in the extreme case where the result depends only on the position, you get exactly 50% if you play the position both white and black, whereas you get extra noise that has nothing to do with the strength of the engines if you do not.

I believe the positions are not extremely unbalanced, but I also do not believe they are extremely balanced; white may score 55% in some positions between equal engines, so having white/black for the same position has the clear advantage of reducing noise.

Uri
I suppose you guys are going to keep up with the "stampee feet, error, stampee feet, bug, stampee feet, etc" no matter what happens? How many runs have I posted over the past 2 years with that wild variability? And now, suddenly, at least for 4 full runs plus one partial, everything looks perfectly normal. And yet that variability is the result of a bug, when _everything_ about all the runs I have done has been absolutely identical. Same exact programs, same exact starting positions, same time control, same hash sizes, same processors, same operating systems, same compilers, same executables, same referee program, in short, same _everything except for the new set of more starting positions. A rational person would eventually assume that if one thing was changed, and the result changed significantly, then that one change is most likely the thing that caused the result to change. Others just continue to shout bug, over and over, without _ever_ being able to give me a scenario, knowing _exactly_ what both of these clusters look like, that I could use to _make_ those odd results happen, without modifying anything that is known to be constant (executables, nodes, etc.)
bob
Posts: 20943
Joined: Mon Feb 27, 2006 7:30 pm
Location: Birmingham, AL

Re: YATT.... (Yet Another Testing Thread)

Post by bob »

oystein wrote:
bob wrote:One thing to keep in mind. There are several ways to test, and each shows different things. A general set of starting positions, if it is a small set, can put an engine in a position that (a) it plays poorly, and (b) will never see in a real game because its opening book does not allow that opening to be played.

The larger and more varied set of positions probably does a better overall job of measuring strength between the engines, but that is not exactly what I am looking for. I am wanting to measure the difference between two versions of the same program, that use the same starting positions against the same opponents, so the only thing that varies is the changes to the new version.
But the measured difference between your versions may vary with the test set, so it is important to get a representative test set. That's why I think more tests with different sets would be interesting. But perhaps you already know a lot about this (I am new here, as you may have noticed).
It appears to require a _large_ number of games, measured in the tens of thousands, to measure small changes, assuming it is even possible to do this... My intent with these tests is to attempt to quantify the _minimum_ testing necessary to say whether a change is good or bad...
How many elo points is "small changes"?
That's a _good_ question to ask, and I have no answer. I'd hope that a small change would be at least one Elo, if it is an improvement. But that seems to be on the far side of ridiculous. In this test, for example, 200 "small changes" and we would be at the top of the list? Somehow I don't think so. So, as I postulated previously, perhaps we can't really measure small improvements, as they might well be less than 1 Elo.
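
For a rough sense of scale, here is a back-of-envelope sketch in Python (ordinary sampling arithmetic; the per-game standard deviation of ~0.4 is an assumption corresponding to a typical draw rate, nothing engine-specific): it estimates how many games are needed before the 95% error bar on a measured score shrinks below a given Elo difference.

Code:

import math

def games_needed(elo_diff, per_game_sd=0.40, z=1.96):
    """Roughly how many games before the 95% error bar on a measured score
    is smaller than elo_diff.  per_game_sd ~0.4 assumes a typical draw rate;
    near equality, 1 Elo is about ln(10)/1600 ~ 0.00144 in score fraction."""
    score_diff = math.log(10) / 1600 * elo_diff
    return (z * per_game_sd / score_diff) ** 2

for d in (10, 5, 2, 1):
    print("%2d Elo: about %8.0f games" % (d, games_needed(d)))

With those assumptions, a few Elo already needs tens of thousands of games, and a 1-Elo change only clears the noise somewhere in the hundreds of thousands, which fits the feeling that changes that small may not be measurable in practice.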

To rehash _way_ old discussion, this first came up when I added the so-called "history pruning" to crafty a couple of years back or so. (now called LMR and most do not use the history information for this any longer). In any case, the first thing I saw posted was Uri's comment that he had a better history value than the default for Fruit. So I decided to test Crafty using the old positions and run a few hundred games per test for "accuracy". And I got _nothing_ useful. You would think that if you start at zero and work your way up to the max possible value, somewhere in there the scores will start to improve, then as you pass the max, they start to decline again. So I ran that test and the resulting "plot" looked like noise rather than the expected smooth up and down curve. Thinking I had a bug, I ran the same test on fruit. same result. Which led me to believe that either (a) history counters were no good or (b) the testing was somehow missing the mark. I then started running the _same_ test over and over to see how much that varied, expecting identical results each time since these things are generally deterministic. But not so. The results varied wildly although nothing was changed between runs.

Leading us to where we are today...
Dirt
Posts: 2851
Joined: Wed Mar 08, 2006 10:01 pm
Location: Irvine, CA, USA

Re: YATT.... (Yet Another Testing Thread)

Post by Dirt »

Uri Blass wrote:Note that I do not agree with Karl about eliminating the black/white pairs; I think the correlation introduced by playing black and white from the same position is a correlation that reduces the error.

If there are unbalanced positions this is obvious: in the extreme case where the result depends only on the position, you get exactly 50% if you play the position both white and black, whereas you get extra noise that has nothing to do with the strength of the engines if you do not.

I believe the positions are not extremely unbalanced, but I also do not believe they are extremely balanced; white may score 55% in some positions between equal engines, so having white/black for the same position has the clear advantage of reducing noise.

Uri
If we were trying to measure the strength of Crafty against the other engines then I could see that playing a position only with Crafty white would add noise. I think that is always in practice going to be of some concern, and that is a reason for Crafty to play it both as black and white. But theoretically we are only interested in measuring the difference between two versions of Crafty, who will both play the position the same way, so I don't see any additional noise being added to this measurement at all.
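
A quick sanity check of that argument, with made-up numbers (the position bias and the true difference are assumed, purely for illustration): whatever colour bias the position has lands in both versions' scores and drops out of the version-to-version difference.

Code:

import random, statistics

N_GAMES = 20000           # games per version against the common opponent
TRUE_DIFF = 0.01          # assumed true score gain of the new version
WHITE_BIAS = 0.15         # assumed imbalance of the one-sided test position

def measured_diff():
    old = sum(random.random() < 0.50 + WHITE_BIAS             for _ in range(N_GAMES))
    new = sum(random.random() < 0.50 + WHITE_BIAS + TRUE_DIFF for _ in range(N_GAMES))
    return (new - old) / N_GAMES

random.seed(3)
diffs = [measured_diff() for _ in range(200)]
print("mean measured difference: %.4f  (the bias cancels)" % statistics.mean(diffs))
print("spread:                   %.4f  (ordinary sampling noise)" % statistics.pstdev(diffs))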
Uri Blass
Posts: 10307
Joined: Thu Mar 09, 2006 12:37 am
Location: Tel-Aviv Israel

Re: YATT.... (Yet Another Testing Thread)

Post by Uri Blass »

bob wrote:
Uri Blass wrote:
hgm wrote:
bob wrote:It now appears that there was a correlation issue, but not one anyone seemed to grasp until Karl came along.
This is still absolute bullshit. Karl stated that the results would be farther from the truth when you used fewer positions. But they would have been closer to each other, as they used the same small set of positions. Karl's remark that being closer to the truth necessarily implies that they were closer to each other was even plain wrong, as my counter-example shows.
And based on the results so far, his idea of eliminating the black/white pairs may also be a good one, since a pair of games, same players, same position, is going to produce a significant correlation between the positions that are not absolutely equal, or which are not equal with respect to the two opponents.
This is also wrong. Unbalanced positions are bad no matter if you pair them or not. It becomes more difficult to express a small improvement in a game that you are almost certainly going to lose anyway. The improvement then usually only means you can delay the inevitable somewhat longer.
And unfortunately, that is simply a matter of fact when choosing a significant number of positions... But the results are not "corrupted". I've had (and posted here) way too many of these same kinds of results, using these same positions. I am currently running another 4 sets with the new approach, this time making sure that I can save the PGN. 8 runs with consistent results will be a huge change from what I was getting with about the same number of games before, but using 100 times fewer positions.
Well, I already told you nearly a year ago that the error caused by the low number of positions was starting to dominate the error in your result for the number of games you play. But that is only the error between your results and the true strength difference, which does not show up in the difference between runs with identical programs, as all these runs used the same small set of positions.

The data of your first runs was corrupted. If you had saved the PGN, you could have known that from the internal variation. (And you would have known how many positions you need to use.) Now you can only know that by repeating the run. But you are not going to do that. So in the end you will know nothing, except how to determine a square root by playing Chess games.
I do not suggest Bob to repeat the same run that is not good for measuring small changes.

Note that it is possible that the corruption does not happen often, in which case there is a good chance that repeating the first run is not going to help Bob find the error, because the error is simply not going to appear in the next few runs.

I simply suggest that he save the PGN (and he does), so that later it can help him find errors.

Note that I do not agree with Karl about eliminating the black/white pairs; I think the correlation introduced by playing black and white from the same position is a correlation that reduces the error.

If there are unbalanced positions this is obvious: in the extreme case where the result depends only on the position, you get exactly 50% if you play the position both white and black, whereas you get extra noise that has nothing to do with the strength of the engines if you do not.

I believe the positions are not extremely unbalanced, but I also do not believe they are extremely balanced; white may score 55% in some positions between equal engines, so having white/black for the same position has the clear advantage of reducing noise.

Uri
I suppose you guys are going to keep up with the "stampee feet, error, stampee feet, bug, stampee feet, etc" no matter what happens? How many runs have I posted over the past 2 years with that wild variability? And now, suddenly, at least for 4 full runs plus one partial, everything looks perfectly normal. And yet that variability is the result of a bug, when _everything_ about all the runs I have done has been absolutely identical. Same exact programs, same exact starting positions, same time control, same hash sizes, same processors, same operating systems, same compilers, same executables, same referee program, in short, same _everything except for the new set of more starting positions. A rational person would eventually assume that if one thing was changed, and the result changed significantly, then that one change is most likely the thing that caused the result to change. Others just continue to shout bug, over and over, without _ever_ being able to give me a scenario, knowing _exactly_ what both of these clusters look like, that I could use to _make_ those odd results happen, without modifying anything that is known to be constant (executables, nodes, etc.)
There are results that are simply almost impossible mathematically.
If you see something happen that has a probability of less than 1/100,000, it is logical to believe that it is probably a bug, even if you have no idea what the bug is.
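
For what it is worth, here is the kind of arithmetic that statement rests on, as a small Python sketch (the 25,000 games and the two-percentage-point gap in the example call are purely illustrative, not the actual run results; the per-game standard deviation of ~0.4 assumes a typical draw rate): under pure sampling noise, two runs of the same match almost never differ by much.

Code:

import math

def p_two_runs_differ(n_games, score_gap, per_game_sd=0.40):
    """Two-sided probability that two independent runs of n_games each of the
    same match differ by at least score_gap in score fraction
    (normal approximation)."""
    sd_of_difference = per_game_sd * math.sqrt(2.0 / n_games)
    z = score_gap / sd_of_difference
    return math.erfc(z / math.sqrt(2.0))

# e.g. two 25,000-game runs whose overall scores differ by 2 percentage points:
print("%.1e" % p_two_runs_differ(25000, 0.02))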

It is possible that the bug was in your previous experiment and is not in the new one, but after reading that you insist that it is not a bug, I agree with H.G. Muller that it may be better for you to repeat the previous test, not to measure small changes but to find the bug, or to discover that the result you got was really not a typical result.

Uri
Uri Blass
Posts: 10307
Joined: Thu Mar 09, 2006 12:37 am
Location: Tel-Aviv Israel

Re: YATT.... (Yet Another Testing Thread)

Post by Uri Blass »

Dirt wrote:
Uri Blass wrote:Note that I do not agree with Karl about eliminating the black/white pairs; I think the correlation introduced by playing black and white from the same position is a correlation that reduces the error.

If there are unbalanced positions this is obvious: in the extreme case where the result depends only on the position, you get exactly 50% if you play the position both white and black, whereas you get extra noise that has nothing to do with the strength of the engines if you do not.

I believe the positions are not extremely unbalanced, but I also do not believe they are extremely balanced; white may score 55% in some positions between equal engines, so having white/black for the same position has the clear advantage of reducing noise.

Uri
If we were trying to measure the strength of Crafty against the other engines then I could see that playing a position only with Crafty white would add noise. I think that is always in practice going to be of some concern, and that is a reason for Crafty to play it both as black and white. But theoretically we are only interested in measuring the difference between two versions of Crafty, who will both play the position the same way, so I don't see any additional noise being added to this measurement at all.
You are right.
It is important to play the same position with both white and black only if you test version X against version X+1 of the same program.

Bob does not test in this way, but we have no proof that this type of test cannot be productive.

IM Larry Kaufman from the Rybka team told us that Rybka is tested mainly as RybkaX+1 against RybkaX.

Uri