YATT.... (Yet Another Testing Thread)


bob
Posts: 20943
Joined: Mon Feb 27, 2006 7:30 pm
Location: Birmingham, AL

YATT.... (Yet Another Testing Thread)

Post by bob »

First, the fourth and final run finished a couple of hours ago. The cluster was loaded, but the other jobs ran far faster than I expected, so my testing slipped back in and finished. Here's the complete set of four runs:

Code: Select all

Tue Aug 12 00:49:44 CDT 2008
time control = 1+1
crafty-22.2R4
Rank Name                   Elo    +    - games score oppo. draws
   1 Glaurung 2-epsilon/5   108    7    7  7782   67%   -21   20%
   2 Fruit 2.1               62    7    6  7782   61%   -21   23%
   3 opponent-21.7           25    6    6  7780   57%   -21   33%
   4 Glaurung 1.1 SMP        10    6    6  7782   54%   -21   20%
   5 Crafty-22.2            -21    4    4 38908   46%     4   23%
   6 Arasan 10.0           -185    7    7  7782   29%   -21   19%
Tue Aug 12 11:36:10 CDT 2008
time control = 1+1
crafty-22.2R4
Rank Name                   Elo    +    - games score oppo. draws
   1 Glaurung 2-epsilon/5   110    6    7  7782   67%   -19   21%
   2 Fruit 2.1               63    6    7  7782   61%   -19   23%
   3 opponent-21.7           26    6    6  7782   57%   -19   33%
   4 Glaurung 1.1 SMP         7    6    7  7782   54%   -19   20%
   5 Crafty-22.2            -19    4    3 38910   47%     4   23%
   6 Arasan 10.0           -187    6    7  7782   28%   -19   19%
Wed Aug 13 00:53:43 CDT 2008
time control = 1+1
crafty-22.2R4
Rank Name                   Elo    +    - games score oppo. draws
   1 Glaurung 2-epsilon/5   109    7    6  7782   67%   -16   20%
   2 Fruit 2.1               63    6    7  7782   61%   -16   24%
   3 opponent-21.7           23    6    6  7781   56%   -16   32%
   4 Glaurung 1.1 SMP         3    6    7  7782   53%   -16   21%
   5 Crafty-22.2            -16    4    3 38909   47%     3   23%
   6 Arasan 10.0           -182    7    7  7782   28%   -16   19%
Wed Aug 13 14:19:47 CDT 2008
time control = 1+1
crafty-22.2R4
Rank Name                   Elo    +    - games score oppo. draws
   1 Glaurung 2-epsilon/5   111    7    7  7782   68%   -20   21%
   2 Fruit 2.1               71    6    7  7782   62%   -20   23%
   3 opponent-21.7           17    6    6  7780   56%   -20   34%
   4 Glaurung 1.1 SMP        11    6    7  7782   54%   -20   20%
   5 Crafty-22.2            -20    3    4 38908   47%     4   23%
   6 Arasan 10.0           -191    7    7  7782   28%   -20   18%

So -16, -19, -20 and -21.

Now I am going to take the three huge pgn files I have, and run them through BayesElo two at a time, changing the name of crafty-22.2 in one, to see how the two versions compare, knowing that they are _identical_ in reality.
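(For illustration, here is a rough Python sketch of how such a rename could be set up; the file names and the new player tag are my own assumptions, not what was actually used.)

Code: Select all

# Copy one run's PGN while renaming Crafty in the White/Black tags,
# so BayesElo treats the two identical versions as different players.
def rename_player(src, dst, old="Crafty-22.2", new="Crafty-22.2-copy"):
    with open(src) as fin, open(dst, "w") as fout:
        for line in fin:
            # only touch the tag pairs, never the movetext
            if line.startswith('[White "') or line.startswith('[Black "'):
                line = line.replace(old, new)
            fout.write(line)

rename_player("run1.pgn", "run1-copy.pgn")
# then concatenate run1-copy.pgn with another run's PGN and feed the result to BayesElo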

Actually I am not going to do that just yet. :) Just discovered a major issue with the new test. Since I run so many matches of 2 games each, I managed to create enough different filenames that I ran out of i-nodes. I am going to try to clean this mess up a bit, and then I will simply have to re-run if I decide that the data might be interesting. I am also going to have to find a better way to name these files, since 40,000 individual PGN files is somewhat problematic. (I am not the only user making thousands of small files, but I am probably going to be the only one who tries to solve this so it won't haunt me again.) This might have been the issue with the occasional missing PGN file or two in past runs, where the total number of games was always off by just a few.

ah well...

BTW this isn't a cluster problem. I run things over there, then copy the files back to my office machine. I had a large raid0 filesystem I was using, but had forgotten I had formatted it for large files (endgame tables), which limits the number of i-nodes (one i-node per file) to gain more data blocks. I have re-formatted with the default i-node count so this will not be an issue next time around.
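(As an aside, one way around the i-node limit would be to fold the per-match files into a single PGN per run; a minimal Python sketch follows, under my own assumptions about the file naming, not the actual fix used.)

Code: Select all

import glob, os

def merge_small_pgns(pattern="match-*.pgn", out_path="run.pgn"):
    # append each finished 2-game match file to one big per-run PGN,
    # then delete the small file to reclaim its i-node
    with open(out_path, "a") as out:
        for path in sorted(glob.glob(pattern)):
            with open(path) as f:
                out.write(f.read())
                out.write("\n")
            os.remove(path)

merge_small_pgns()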
sje
Posts: 4675
Joined: Mon Mar 13, 2006 7:43 pm

Re: YATT.... (Yet Another Testing Thread)

Post by sje »

Do you need to keep all of the game scores? Would not the player identities and the results be sufficient?

A while back I wrote a PGN library game duplicate remover routine. To make this work fast without having to store all the game text in memory, I wrote a hash generator that calculated a 128 bit signature based on the positions in the game and the order in which they occurred. Perhaps something like this would be of assistance if you're concerned about replicated games tarnishing the statistical analysis.
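(A minimal sketch of that idea in Python, using the python-chess library and a 128-bit MD5 over the sequence of positions; this is my own reconstruction, not sje's actual code.)

Code: Select all

import hashlib
import chess.pgn

def game_signature(game):
    # 128-bit signature over the positions of one game, in the order played
    h = hashlib.md5()
    board = game.board()
    h.update(board.board_fen().encode())
    for move in game.mainline_moves():
        board.push(move)
        h.update(board.board_fen().encode())
    return h.hexdigest()

def unique_games(pgn_path):
    # yield games from a PGN file, skipping positional duplicates
    seen = set()
    with open(pgn_path) as f:
        while True:
            game = chess.pgn.read_game(f)
            if game is None:
                break
            sig = game_signature(game)
            if sig not in seen:
                seen.add(sig)
                yield game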
bob
Posts: 20943
Joined: Mon Feb 27, 2006 7:30 pm
Location: Birmingham, AL

Re: YATT.... (Yet Another Testing Thread)

Post by bob »

sje wrote:Do you need to keep all of the game scores? Would not the player identities and the results be sufficient?

A while back I wrote a PGN library game duplicate remover routine. To make this work fast without having to store all the game text in memory, I wrote a hash generator that calculated a 128 bit signature based on the positions in the game and the order in which they occurred. Perhaps something like this would be of assistance if you're concerned about replicated games tarnishing the statistical analysis.
The request was for the PGN, which has things like node id, date/time the games were played, etc., so that any unexpected correlation might be tracked down to a particular time range, a particular node, etc.

I've fixed my match maker program and am running a test right now, and have already re-run mkfs on the file system that will hold this stuff...
hgm
Posts: 27796
Joined: Fri Mar 10, 2006 10:06 am
Location: Amsterdam
Full name: H G Muller

Re: YATT.... (Yet Another Testing Thread)

Post by hgm »

bob wrote:So -16, -19, -20 and -21.

Now I am going to take the three huge pgn files I have, and run them through BayesElo two at a time, changing the name of crafty-22.2 in one, to see how the two versions compare, knowing that they are _identical_ in reality.
Seems to me that a simple multiplication by 6/5 should be sufficient in this simple Crafty vs. World case. If Crafty performed, say, 2% worse in one of the runs, its rating would drop 14 Elo compared to the average of its opponents. If you kept the opponent ratings fixed, it would mean Crafty's absolute rating would drop 14. But that would mean that the average of all ratings (Crafty + World) would drop by 14/6. BayesElo does not allow that, and would shift the rating scale up by 14/6 points to keep the average at zero. So instead of dropping 14 Elo, BayesElo would report a drop of 14 - 14/6 = 14*(5/6). Which will not produce a very shocking difference.

To recover the hypothetical situation of fixed (average) opponent ratings, you can thus simply multiply the reported result of the individual run by 6/5. The main problem with BayesElo for this purpose is that the precision with which the results are reported starts to be insufficient: we really would like to see digits behind the decimal point for Elo and score here.

16, 19, 20 and 21 average to 19, with a standard deviation of sqrt((9+0+1+4)/4) = sqrt(3.5) = 1.9. (Actually we would have to divide by 3 here, rather than 4, since we base the SD on an average that is also derived from the same data; this would give SD = sqrt(14/3) = 2.16.) This is in perfect agreement with the reported 95% confidence interval (2 sigma) of 4 Elo.

After scaling with the factor 6/5, the observed SD of individual runs would increase to 2.6 Elo. The theoretical value for 38,000 games would be 1.6. But it is really hard to draw any conclusions from that, since the observed SD is mainly driven by a single large deviation (the 16), which could have suffered from rounding.
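(The arithmetic is easy to check; here is a quick Python sketch, my own addition, of the 6/5 correction and the run-to-run SD, using the four reported Crafty ratings.)

Code: Select all

from statistics import mean, pstdev, stdev

reported = [-16, -19, -20, -21]          # Crafty-22.2 in the four runs
corrected = [6/5 * r for r in reported]  # undo the BayesElo zero-sum shift

print(mean(reported))                     # -19
print(round(pstdev(reported), 2))         # 1.87  (divide by n)
print(round(stdev(reported), 2))          # 2.16  (divide by n-1)
print([round(c, 1) for c in corrected])   # [-19.2, -22.8, -24.0, -25.2]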
oystein

Re: YATT.... (Yet Another Testing Thread)

Post by oystein »

I have made a summary of the runs of Crafty vs. the world that I have found in these threads. I have shifted the results so that the mean of Crafty's opponents' Elos is 0.

The 800-game runs, all with the 40 Silver positions.

Code: Select all

#games                   799   799   799   800      800   800   800   800
 
Glaurung 2-epsilon/5     117    77   111   124       86    98   140   116
Fruit 2.1                 45    31    19    57       67    35    28    14
opponent-21.7              9    57    71     6       13    36    13    25
Glaurung 1.1 SMP          57    42   -37    18        9    40    36    36
Crafty-22.2              -22   -21   -14   -40      -23   -36   -41   -30
Arasan 10.0             -230  -209  -163  -206     -177  -208  -218  -192

Combination of the 800-game runs above, the 25,000-game runs with the 40 Silver positions, and the 38,000-game runs with 3,800 different starting positions.

Code: Select all

#games                 6397     25597 25595    38908 38910 38909 38908
 
Glaurung 2-epsilon/5    109       123   114      104   106   106   107
Fruit 2.1                37        38    38       58    59    60    67
opponent-21.7            29        28    28       21    22    20    13
Glaurung 1.1 SMP         25         2    16        6     3     0     7
Crafty-22.2             -28         2   -23      -25   -23   -19   -24
Arasan 10.0            -200      -193  -197     -189  -191  -185  -195


Note that Fruit is rated about 20-25 Elo stronger in the last 4 runs, while opponent-21.7 and Glaurung 1.1 are rated about 10-20 Elo weaker. So there really is a difference between the 2 sets of starting positions. I think more tests with different starting sets would be interesting.
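(A quick back-of-the-envelope check of that gap, using the Fruit numbers from the table above; this is my own addition, in Python.)

Code: Select all

# Fruit 2.1 in the two ~25,000-game Silver-position runs vs the four
# 3800-position runs; each big run carries roughly a +/-7 Elo interval,
# so a consistent gap of this size is well outside run-to-run noise.
fruit_silver = [38, 38]
fruit_varied = [58, 59, 60, 67]
gap = sum(fruit_varied) / len(fruit_varied) - sum(fruit_silver) / len(fruit_silver)
print(round(gap, 1))   # 23.0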
bob
Posts: 20943
Joined: Mon Feb 27, 2006 7:30 pm
Location: Birmingham, AL

Re: YATT.... (Yet Another Testing Thread)

Post by bob »

oystein wrote:I have made a summary of the runs of Crafty vs. the world that I have found in these threads. I have shifted the results so that the mean of Crafty's opponents' Elos is 0.

The 800-game runs, all with the 40 Silver positions.

Code: Select all

#games                   799   799   799   800      800   800   800   800
 
Glaurung 2-epsilon/5     117    77   111   124       86    98   140   116
Fruit 2.1                 45    31    19    57       67    35    28    14
opponent-21.7              9    57    71     6       13    36    13    25
Glaurung 1.1 SMP          57    42   -37    18        9    40    36    36
Crafty-22.2              -22   -21   -14   -40      -23   -36   -41   -30
Arasan 10.0             -230  -209  -163  -206     -177  -208  -218  -192

Combination of the 800-game runs above, the 25,000-game runs with the 40 Silver positions, and the 38,000-game runs with 3,800 different starting positions.

Code: Select all

#games                 6397     25597 25595    38908 38910 38909 38908
 
Glaurung 2-epsilon/5    109       123   114      104   106   106   107
Fruit 2.1                37        38    38       58    59    60    67
opponent-21.7            29        28    28       21    22    20    13
Glaurung 1.1 SMP         25         2    16        6     3     0     7
Crafty-22.2             -28         2   -23      -25   -23   -19   -24
Arasan 10.0            -200      -193  -197     -189  -191  -185  -195


Note that Fruit is rated about 20-25 Elo stronger in the last 4 runs, while opponent-21.7 and Glaurung 1.1 are rated about 10-20 Elo weaker. So there really is a difference between the 2 sets of starting positions. I think more tests with different starting sets would be interesting.
One thing to keep in mind. There are several ways to test, and each shows different things. A general set of starting positions, if it is a small set, can put an engine in a position that (a) it plays poorly, and (b) it will never see in a real game, because its opening book does not allow that opening to be played.

The larger and more varied set of positions probably does a better overall job of measuring strength between the engines, but that is not exactly what I am looking for. I want to measure the difference between two versions of the same program, using the same starting positions against the same opponents, so that the only thing that varies is the change made in the new version.

It appears to require a _large_ number of games, measured in the tens of thousands, to measure small changes, assuming it is even possible to do this... My intent with these tests is to attempt to quantify the _minimum_ testing necessary to say whether a change is good or bad...
User avatar
hgm
Posts: 27796
Joined: Fri Mar 10, 2006 10:06 am
Location: Amsterdam
Full name: H G Muller

Re: YATT.... (Yet Another Testing Thread)

Post by hgm »

oystein wrote:So there really is a difference between the 2 sets of starting positions. I think more tests with different starting sets would be interesting.
The difference is that the data using the first set of starting positions is somehow corrupted, as it seems now, in particular the first run. So averaging the runs corrupts everything. The proper way to process the data would be to discard the first run. (Or the part of the first run that was corrupted, but as only the total result of that run has survived, we can no longer make that distinction.)

Lacking details that would allow us to judge runs for acceptability based on their internal characteristics, the proper way to deal with the situation would be to redo the run with the first set of positions, and use 2:1 voting to decide which to discard.
hgm
Posts: 27796
Joined: Fri Mar 10, 2006 10:06 am
Location: Amsterdam
Full name: H G Muller

Re: YATT.... (Yet Another Testing Thread)

Post by hgm »

bob wrote: It appears to require a _large_ number of games, measured in the tens of thousands, to measure small changes, assuming it is even possible to do this... My intent with these tests is to attempt to quantify the _minimum_ testing necessary to say whether a change is good or bad...
The error in score % for N games is 45%/sqrt(N), and as 1% is about 7 Elo for approximately equal engines, the SD in Elo equals 315/sqrt(N) and the 95% confidence interval is +/- 630/sqrt(N).

The difference in rating of two engines with N games each (against the same set of opponents) has a sqrt(2) larger error bar, i.e. 891/sqrt(N). So if you want to measure a difference of D Elo, you have

D = 891/sqrt(N)

sqrt(N) = 891/D

N = (891/D)^2

So if you want to measure a difference of about 9 Elo with 95% confidence, you indeed need about 10,000 games. If you want to measure 1 Elo, you need roughly 800,000 games. (For each engine.) If you can avoid all other errors, that is... But there is no way you could ever do it with less.
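(The same back-of-the-envelope formula in a short Python sketch, under the assumptions stated above of a 45% per-game score SD and 7 Elo per percentage point; my own addition.)

Code: Select all

import math

def games_needed(d_elo, score_sd=0.45, elo_per_pct=7.0):
    # 95% half-width (2 sigma) on one engine's rating from a single game, in Elo
    half_width_one = 2 * score_sd * 100 * elo_per_pct     # = 630
    # the difference of two ratings is sqrt(2) wider
    half_width_diff = math.sqrt(2) * half_width_one       # ~ 891
    return math.ceil((half_width_diff / d_elo) ** 2)

for d in (10, 5, 2, 1):
    print(d, games_needed(d))
# 10 Elo -> ~7,900 games; 5 -> ~31,800; 2 -> ~198,500; 1 -> ~794,000 (per engine)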
bob
Posts: 20943
Joined: Mon Feb 27, 2006 7:30 pm
Location: Birmingham, AL

Re: YATT.... (Yet Another Testing Thread)

Post by bob »

hgm wrote:
oystein wrote:So there really is a difference between the 2 sets of starting positions. I think more tests with different starting sets would be interesting.
The difference is that the data using the first set of starting positions is somehow corrupted, as it seems now, in particular the first run. So averaging the runs corrupts everything. The proper way to process the data would be to discard the first run. (Or the part of the first run that was corrupted, but as only the total result of that run has survived, we can no longer make that distinction.)

Lacking details that would allow us to judge runs for acceptability based on their internal characteristics, the proper way to deal with the situation would be to redo the run with the first set of positions, and use 2:1 voting to decide which to discard.
This "corrupted" is simply wrong. It now appears that there was a correlation issue, but not one anyone seemed to grasp until Karl came along. And based on the results so far, his idea of eliminating the black/white pairs may also be a good one, since a pair of games, same players, same position, is going to produce a significant correlation between the positions that are not absolutely equal, or which are not equal with respect to the two opponents. And unfortunately, that is simply a matter of fact when choosing a significant number of positions... But the results are not "corrupted". I've had (and posted here) way too many of these same kinds of results, using these same positions. I am currently running another 4 sets with the new approach, this time making sure that I can save the PGN. 8 runs with consistent results will be a huge change from what I was getting with about the same number of games before, but using 100 times fewer positions.
Uri Blass
Posts: 10282
Joined: Thu Mar 09, 2006 12:37 am
Location: Tel-Aviv Israel

Re: YATT.... (Yet Another Testing Thread)

Post by Uri Blass »

bob wrote:
hgm wrote:
oystein wrote:So there really is a difference between the 2 sets of starting positions. I think more tests with different starting sets would be interesting.
The difference is that the data using the first set of starting positions is somehow corrupted, as it seems now, in particular the first run. So averaging the runs corrupts everything. The proper way to process the data would be to discard the first run. (Or the part of the first run that was corrupted, but as only the total result of that run has survived, we can no longer make that distinction.)

Lacking details that would allow us to judge runs for acceptability based on their internal characteristics, the proper way to deal with the situation would be to redo the run with the first set of positions, and use 2:1 voting to decide which to discard.
This "corrupted" is simply wrong. It now appears that there was a correlation issue, but not one anyone seemed to grasp until Karl came along. And based on the results so far, his idea of eliminating the black/white pairs may also be a good one, since a pair of games with the same players and the same starting position is going to produce a significant correlation whenever that position is not absolutely equal, or not equal with respect to the two opponents. And unfortunately, that is simply a matter of fact when choosing a significant number of positions... But the results are not "corrupted". I've had (and posted here) way too many of these same kinds of results, using these same positions. I am currently running another 4 sets with the new approach, this time making sure that I can save the PGN. 8 runs with consistent results will be a huge change from what I was getting with about the same number of games before, but using 100 times fewer positions.
My opinion is that the results were corrupted, but without having the data (pgn) I have no way to know what was corrupted. (It may be that it is a bug that does not happen every day and you did not see it in the games that you checked; that is the reason it is important to have the pgn, so that later you can analyze what happened.)
When we do not have the pgn, discussion about it is not going to be productive.

The correlation that Karl talked about is correlation that can give the same wrong result again and again, not correlation that can explain the results that you got.

Karl has no explanation for the fact that you did not get almost the same wrong result twice.

I did not talk earlier about the correlation that Karl meant, because I was trying to explain your results, not trying to help you design an experiment to measure small changes.

I thought it was important to find what went wrong, for the sake of future tests, because the same type of mistake might also cause errors in correct tests meant to measure small changes.

Uri