evaluating testing results, one more time

Discussion of chess software programming and technical issues.

Moderator: Ras

bob
Posts: 20943
Joined: Mon Feb 27, 2006 7:30 pm
Location: Birmingham, AL

evaluating testing results, one more time

Post by bob »

Here is a bit of an oddity for a question. I have finally finished the fruit-interpolation changes to Crafty's evaluation. None of the values are tuned at all, and some parts of the evaluation dealing with recognizing drawish positions and the like are still locked out. I added a small feature to the commands so that I can "scale" a particular evaluation term by some percentage. The idea was then to pick some evaluation term that is important (usually the "term" is an array of values), scale it up and down, and run a cluster test to see whether the change is better or worse. And I noticed something that is perhaps not unexpected, but it makes interpreting the results more interesting. First, three runs:

Code:

Sat Sep 27 16:20:42 CDT 2008
time control = 1000+10
crafty-22.2R10
scale 110
Rank Name               Elo    +    - games score oppo. draws  ---------LOS----------
   1 Glaurung 2.1       159    4    3 46692   70%     4   16%     100100100100100100100100100100
   2 Fruit 2.1           65    4    4 46691   58%     4   20%    0   100100100100100100100100100
   3 opponent-21.7       32    4    4 46692   54%     4   27%    0  0    99100100100100100100100
   4 Crafty-22.2R6       19    4    5 38910   52%    -5   18%    0  0  0    99 99100100100100100
   5 Crafty-22.2R9       13    4    4 38910   51%    -5   19%    0  0  0  0    75 99 99100100100
   6 Crafty-22.2R10      11    4    5 38910   51%    -5   19%    0  0  0  0 24    99 99100100100
   7 Crafty-22.2R3        0    4    5 38910   50%    -5   18%    0  0  0  0  0  0    99 99 99100
   8 Crafty-22.2R2       -6    4    5 38910   49%    -5   18%    0  0  0  0  0  0  0    96 99100
   9 Glaurung 1.1 SMP   -10    4    4 46692   48%     4   17%    0  0  0  0  0  0  0  3    98100
  10 Crafty-22.2R1      -14    4    4 38909   48%    -5   18%    0  0  0  0  0  0  0  0  1   100
  11 Arasan 10.0       -268    4    4 46692   18%     4   12%    0  0  0  0  0  0  0  0  0  0
Sat Sep 27 17:24:45 CDT 2008
time control = 1000+10
crafty-22.2R10
scale 120
Rank Name               Elo    +    - games score oppo. draws  ---------LOS----------
   1 Glaurung 2.1       157    3    4 46692   70%     4   16%     100100100100100100100100100100
   2 Fruit 2.1           64    4    3 46690   58%     4   20%    0   100100100100100100100100100
   3 opponent-21.7       32    4    4 46692   54%     4   26%    0  0    99100100100100100100100
   4 Crafty-22.2R6       19    4    5 38910   52%    -5   18%    0  0  0    99 99100100100100100
   5 Crafty-22.2R9       12    4    4 38910   51%    -5   19%    0  0  0  0    74 99 99100100100
   6 Crafty-22.2R10      11    4    5 38909   51%    -5   19%    0  0  0  0 25    99 99100100100
   7 Crafty-22.2R3        0    4    5 38910   50%    -5   18%    0  0  0  0  0  0    99 99 99100
   8 Crafty-22.2R2       -6    4    5 38910   49%    -5   18%    0  0  0  0  0  0  0    96 99100
   9 Glaurung 1.1 SMP    -9    4    4 46692   48%     4   17%    0  0  0  0  0  0  0  3    99100
  10 Crafty-22.2R1      -14    4    4 38909   48%    -5   18%    0  0  0  0  0  0  0  0  0   100
  11 Arasan 10.0       -267    4    4 46692   18%     4   12%    0  0  0  0  0  0  0  0  0  0
Sat Sep 27 18:29:19 CDT 2008
time control = 1000+10
crafty-22.2R10
scale 130
Rank Name               Elo    +    - games score oppo. draws  ---------LOS----------
   1 Glaurung 2.1       159    4    4 46692   70%     4   16%     100100100100100100100100100100
   2 Fruit 2.1           65    4    4 46691   58%     4   20%    0   100100100100100100100100100
   3 opponent-21.7       31    4    3 46692   54%     4   26%    0  0    99100100100100100100100
   4 Crafty-22.2R6       19    4    4 38910   52%    -4   18%    0  0  0    99 99100100100100100
   5 Crafty-22.2R9       13    4    4 38910   51%    -4   19%    0  0  0  0    91 99 99100100100
   6 Crafty-22.2R10      10    4    4 38910   51%    -4   19%    0  0  0  0  8    99 99100100100
   7 Crafty-22.2R3        0    5    4 38910   50%    -4   18%    0  0  0  0  0  0    99 99 99100
   8 Crafty-22.2R2       -6    4    4 38910   49%    -4   18%    0  0  0  0  0  0  0    95 99100
   9 Glaurung 1.1 SMP    -9    3    4 46692   48%     4   17%    0  0  0  0  0  0  0  4    99100
  10 Crafty-22.2R1      -14    4    5 38909   48%    -4   18%    0  0  0  0  0  0  0  0  0   100
  11 Arasan 10.0       -268    4    4 46692   18%     4   12%    0  0  0  0  0  0  0  0  0  0
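For reference, the "scale n" line at the top of each run just means that one selected evaluation term is multiplied by n percent before it is used. Below is a minimal sketch of that idea; scale_pct and passed_pawn_value[] are made-up names used only for illustration, not Crafty's actual identifiers or values.

Code:

#include <stdio.h>

/* Sketch of the "scale n" idea: multiply one selected evaluation term
   by n percent.  scale_pct and passed_pawn_value[] are hypothetical
   names and numbers, not Crafty's actual identifiers.                */

static int scale_pct = 120;               /* e.g. the command "scale 120" */
static const int passed_pawn_value[8] =   /* per-rank base values, purely */
    { 0, 10, 20, 35, 60, 100, 160, 0 };   /* illustrative numbers         */

static int scaled_passed_pawn(int rank) {
  return passed_pawn_value[rank] * scale_pct / 100;
}

int main(void) {
  int rank;
  for (rank = 1; rank < 7; rank++)
    printf("rank %d: base %3d  scaled %3d\n",
           rank, passed_pawn_value[rank], scaled_passed_pawn(rank));
  return 0;
}

So scale 110 boosts the chosen term by 10%, scale 130 by 30%, and scale 100 leaves it unchanged.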
First, what is being tested? Crafty-22.2Rn is the question, with 22.2R10 being the one that is modified. Each Crafty-22.2Rn has played the usual 40K-game match against each of 5 opponents, and as Remi suggested, I then combined all the PGN into one file. So here's what is what: R1 was the original version of 22.2, R2 added q-search checks, R3 added null-move R=3 everywhere, and R4, etc., are just versions where I added the new interpolation scoring. R10 was the last version. At the top of each of the three Elo outputs you will see the "scale n" line; scale 110 means passed-pawn scores are scaled up to 110% of normal (a 10% boost), scale 120 to 120%, and so on. Notice that the Elo for 22.2R10 doesn't change between the first two tests, which originally made me think there was little difference. But when you look more closely, the programs at the top dropped slightly in Elo in the second test.

I would rather see their Elo remain static, or at least fix the Elo of the top program so that the others change with respect to it, rather than having the Elo of the programs above and below my test version change a bit.

Any suggestions here???

One note: none of the 22.2Rn results can change except for 22.2R10, because they are not playing any additional games. The PGN from all their games is included, but their results are fixed. R10 is playing 40K games for each scale factor, and then the entire mess is run through Bayeselo. I should probably try this without including the old Rn versions and see how they compare then; I'd expect the same effect. But what I really want is to see Crafty's rating change when it gets better, which is not quite what I am seeing.

I'm beginning to think that what I should do here is save all the PGN from all the different "scale" matches and combine it so that all the versions pop out in one giant Elo calculation. That is "in progress"; I will report on it tomorrow. It will require almost a million games, so it will take 24 hours or so, maybe a little less...
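One detail about that combined run: Bayeselo identifies players by the names in the [White "..."] and [Black "..."] PGN tags, so the R10 games from the scale-110, 120, and 130 runs need distinct engine names in the PGN, or they will all collapse into a single Crafty-22.2R10 entry. Presumably the cluster scripts can tag each run distinctly already; if not, a small filter along these lines would do it. This is a sketch only: rename_player and the "-s110" suffix convention are hypothetical, and the matching is simple line-based substitution.

Code:

/* rename_player.c -- hypothetical helper: rewrite the engine name in the
   [White "..."] / [Black "..."] tags of a PGN stream, e.g.
       ./rename_player "Crafty-22.2R10" "Crafty-22.2R10-s110" < scale110.pgn
   so that each "scale" run shows up as its own player in Bayeselo.      */
#include <stdio.h>
#include <string.h>

int main(int argc, char **argv) {
  char line[4096];

  if (argc != 3) {
    fprintf(stderr, "usage: %s <old-name> <new-name> < in.pgn > out.pgn\n",
            argv[0]);
    return 1;
  }
  while (fgets(line, sizeof line, stdin)) {
    if (!strncmp(line, "[White \"", 8) || !strncmp(line, "[Black \"", 8)) {
      char *name = line + 8;              /* start of the quoted player name */
      size_t len = strcspn(name, "\"");   /* length up to the closing quote  */
      if (len == strlen(argv[1]) && !strncmp(name, argv[1], len)) {
        fwrite(line, 1, 8, stdout);       /* emit the tag prefix unchanged   */
        fputs(argv[2], stdout);           /* emit the new player name        */
        fputs(name + len, stdout);        /* closing quote, bracket, newline */
        continue;
      }
    }
    fputs(line, stdout);                  /* copy everything else verbatim   */
  }
  return 0;
}

After renaming, plain concatenation of the per-scale PGN files produces the single input file for the big rating calculation.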
Dirt
Posts: 2851
Joined: Wed Mar 08, 2006 10:01 pm
Location: Irvine, CA, USA

Re: evaluating testing results, one more time

Post by Dirt »

bob wrote:First, what is being tested? Crafty-22.2Rn is the question, with 22.2R10 being the one that is modified. Each Crafty-22.2Rn has played the usual 40K-game match against each of 5 opponents, and as Remi suggested, I then combined all the PGN into one file. So here's what is what: R1 was the original version of 22.2, R2 added q-search checks, R3 added null-move R=3 everywhere, and R4, etc., are just versions where I added the new interpolation scoring. R10 was the last version. At the top of each of the three Elo outputs you will see the "scale n" line; scale 110 means passed-pawn scores are scaled up to 110% of normal (a 10% boost), scale 120 to 120%, and so on. Notice that the Elo for 22.2R10 doesn't change between the first two tests, which originally made me think there was little difference. But when you look more closely, the programs at the top dropped slightly in Elo in the second test.

I would rather see their Elo remain static, or at least fix the Elo of the top program so that the others change with respect to it, rather than having the Elo of the programs above and below my test version change a bit.

Any suggestions here???

One note: none of the 22.2Rn results can change except for 22.2R10, because they are not playing any additional games. The PGN from all their games is included, but their results are fixed. R10 is playing 40K games for each scale factor, and then the entire mess is run through Bayeselo. I should probably try this without including the old Rn versions and see how they compare then; I'd expect the same effect. But what I really want is to see Crafty's rating change when it gets better, which is not quite what I am seeing.

I'm beginning to think that what I should do here is save all the PGN from all the different "scale" matches and combine it so that all the versions pop out in one giant Elo calculation. That is "in progress"; I will report on it tomorrow. It will require almost a million games, so it will take 24 hours or so, maybe a little less...
Remi said to combine your PGNs into one file, but I think you have three files here, one each for 110, 120, and 130. Those are what I think he meant for you to combine. Include R9, etc., if you're interested in comparing with them too.
bob
Posts: 20943
Joined: Mon Feb 27, 2006 7:30 pm
Location: Birmingham, AL

Re: evaluating testing results, one more time

Post by bob »

Dirt wrote:
bob wrote:First, what is being tested? Crafty-22.2Rn is the question, with 22.2R10 being the one that is modified. Each Crafty-22.2Rn has played the usual 40K-game match against each of 5 opponents, and as Remi suggested, I then combined all the PGN into one file. So here's what is what: R1 was the original version of 22.2, R2 added q-search checks, R3 added null-move R=3 everywhere, and R4, etc., are just versions where I added the new interpolation scoring. R10 was the last version. At the top of each of the three Elo outputs you will see the "scale n" line; scale 110 means passed-pawn scores are scaled up to 110% of normal (a 10% boost), scale 120 to 120%, and so on. Notice that the Elo for 22.2R10 doesn't change between the first two tests, which originally made me think there was little difference. But when you look more closely, the programs at the top dropped slightly in Elo in the second test.

I would rather see their Elo remain static, or at least fix the Elo of the top program so that the others change with respect to it, rather than having the Elo of the programs above and below my test version change a bit.

Any suggestions here???

One note: none of the 22.2Rn results can change except for 22.2R10, because they are not playing any additional games. The PGN from all their games is included, but their results are fixed. R10 is playing 40K games for each scale factor, and then the entire mess is run through Bayeselo. I should probably try this without including the old Rn versions and see how they compare then; I'd expect the same effect. But what I really want is to see Crafty's rating change when it gets better, which is not quite what I am seeing.

I'm beginning to think that what I should do here is save all the PGN from all the different "scale" matches and combine it so that all the versions pop out in one giant Elo calculation. That is "in progress"; I will report on it tomorrow. It will require almost a million games, so it will take 24 hours or so, maybe a little less...
Remi said to combine your PGNs into one file, but I think you have three files here, one each for 110, 120, and 130. Those are what I think he meant for you to combine. Include R9, etc., if you're interested in comparing with them too.
I realized that after I made the post. Too late to delete it. I am running the test again with automatic combining of all the PGNs enabled. Unfortunately, our A/C in the cluster room is once again failing and the cluster is down until at least this afternoon...