For the last few weeks I have been devoted strictly to optimizing my evaluation function using the Texel method. At first I was almost ready to chuck the whole idea, because after minimizing the error and getting my first set of weights I not only didn't get any improvement, I lost Elo!!! I did eventually find a bug in my code that led to this result, but it still left me with some doubt about whether this was a worthwhile endeavor. So, I decided to conduct an experiment to prove to myself that this method works.

Taking 500K positions from games played by computers rated 2800 Elo or greater, I started up my iterative process that minimizes the squared error, to see whether it affects the Elo or not. The process is the one described on the Chess Programming Wiki page for Texel tuning. The only design change I made is to go through the weights in random order on each iteration, to try to spread out the changes. While iteratively searching for better and better solutions I saved intermediate sets of weights along the way. Initially I saved weights after 5, 10, 15, 20, 25, 30, 50, 75, 100 and 150 iterations, and my routine converged after 157 iterations. I then set up all those weights in a round-robin tournament, along with the iteration-0 progenitor (paragon), to see how they relate to one another in terms of Elo. What I was trying to show myself is that, at least up to the point of overfitting, the Elo goes up as the error goes down. Below is a rough sketch of the error term being minimized, and then the results of that first experiment.
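For anyone unfamiliar with the method, here is a minimal sketch of the quantity the tuner drives down, assuming a plain linear evaluation over extracted feature counts. The Position layout, evaluate() and the scaling constant K shown here are illustrative placeholders, not my actual code; a real engine would plug in its own evaluation and its own fitted K.

Code: Select all
#include <cmath>
#include <cstddef>
#include <vector>

// One quiet training position: its extracted feature counts and the game result.
struct Position {
    std::vector<double> features;  // e.g. material counts, PST sums, etc.
    double result;                 // outcome from white's view: 1.0, 0.5 or 0.0
};

// Placeholder evaluation: a plain dot product of features and weights, in
// centipawns.  A real engine would call its own eval here instead.
double evaluate(const Position& pos, const std::vector<double>& weights)
{
    double score = 0.0;
    for (std::size_t i = 0; i < weights.size(); ++i)
        score += pos.features[i] * weights[i];
    return score;
}

// Mean squared error over the training set.  K maps centipawns to an expected
// score; it is fitted once against the original weights and then held fixed
// while the evaluation weights are tuned.
double meanSquaredError(const std::vector<Position>& positions,
                        const std::vector<double>& weights,
                        double K = 1.13)  // example value only, engine-specific
{
    double error = 0.0;
    for (const Position& pos : positions) {
        double q       = evaluate(pos, weights);
        double sigmoid = 1.0 / (1.0 + std::pow(10.0, -K * q / 400.0));
        error += (pos.result - sigmoid) * (pos.result - sigmoid);
    }
    return error / positions.size();
}

And the results of that first experiment: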
Code: Select all
Rank Name                          Elo     +/-   Games   Score    Draw
   1 pass75                          9      11    1100   51.4%   72.2%
   2 pass158                         4      10    1100   50.6%   75.5%
   3 pass100                         3      10    1100   50.5%   75.6%
   4 pass50                          3      10    1100   50.4%   75.5%
   5 pass15                          3      10    1100   50.4%   75.5%
   6 pass30                          2      10    1100   50.3%   76.5%
   7 pass5                          -1      10    1100   49.9%   74.9%
   8 pass40                         -2      10    1100   49.8%   77.7%
   9 pass10                         -3      10    1100   49.5%   76.3%
  10 pass25                         -5      10    1100   49.3%   76.6%
  11 paragon                        -6      11    1100   49.2%   73.6%
  12 pass20                         -8      10    1100   48.8%   75.8%
These results weren't quite what I was hoping for, but they did show, in a very general sense, that the Elo goes up with more passes (i.e. less error). The weights in the upper half came from an aggregate of 428 passes, while the lower half came from 43, counting paragon as pass 0. However, as you can see, there isn't an ordering that pops out at you and lets you say a 40-pass solution is better than a 15-pass solution. Still, it would appear I could pick up 15 Elo by choosing pass 75.

I repeated the experiment with a different data set. This time I sampled 2 million records from human grandmaster games (i.e. > 2500 Elo) and preserved the weights at passes 10, 20, 30, 40, 50, 60, 70, 80, 90, 100, 110, 120, 130, 140 and 150, and my routine actually converged on the 157th pass. I then set up another round-robin tournament. Below is a sketch of the pass-by-pass loop itself, including the randomized weight order and the intermediate snapshots, followed by the results.
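For completeness, here is a rough sketch of that loop, building on the meanSquaredError() sketch above. The +/-1 step size, the file naming in saveWeights() and the fixed seed are just for illustration, not a description of my actual implementation.

Code: Select all
#include <algorithm>
#include <cstdio>
#include <numeric>
#include <random>
#include <vector>

// Placeholder: dump a snapshot of the current weights so that set can later
// play in the round-robin (the file naming is only for illustration).
void saveWeights(const std::vector<double>& weights, int pass)
{
    char name[32];
    std::snprintf(name, sizeof(name), "pass%d.txt", pass);
    if (FILE* f = std::fopen(name, "w")) {
        for (double w : weights) std::fprintf(f, "%g\n", w);
        std::fclose(f);
    }
}

// Local search: on every pass visit the weights in a freshly shuffled order,
// nudge each by +/-1, and keep any change that lowers the error.  Stop when a
// full pass produces no improvement at all.
std::vector<double> tune(const std::vector<Position>& positions,
                         std::vector<double> weights)
{
    std::mt19937 rng(12345);                   // any fixed seed will do
    std::vector<std::size_t> order(weights.size());
    std::iota(order.begin(), order.end(), std::size_t{0});

    double best = meanSquaredError(positions, weights);
    for (int pass = 1; ; ++pass) {
        bool improved = false;
        std::shuffle(order.begin(), order.end(), rng);   // randomized visit order
        for (std::size_t i : order) {
            const double steps[] = {1.0, -1.0};
            for (double step : steps) {
                weights[i] += step;
                double e = meanSquaredError(positions, weights);
                if (e < best) { best = e; improved = true; break; }
                weights[i] -= step;                       // revert if no gain
            }
        }
        saveWeights(weights, pass);   // intermediate snapshot for the tournament
        if (!improved) break;         // converged: no single-step change helped
    }
    return weights;
}

And here are the results of the second tournament: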
Code: Select all
Rank Name                          Elo     +/-   Games    Wins  Losses   Draws   Points   Score    Draw
   1 pass110                        12       9    1024     116      82     826    529.0   51.7%   80.7%
   2 pass50                          7      10    1024     117      97     810    522.0   51.0%   79.1%
   3 pass40                          6      10    1024     112      94     818    521.0   50.9%   79.9%
   4 pass150                         5      10    1024     121     105     798    520.0   50.8%   77.9%
   5 pass60                          5       9    1024     105      89     830    520.0   50.8%   81.1%
   6 pass157                         5      10    1024     116     101     807    519.5   50.7%   78.8%
   7 pass140                         5      10    1024     123     108     793    519.5   50.7%   77.4%
   8 pass70                          3       9    1024     104      94     826    517.0   50.5%   80.7%
   9 pass100                         1      10    1024     109     105     810    514.0   50.2%   79.1%
  10 pass20                          0      10    1024     110     109     805    512.5   50.0%   78.6%
  11 pass130                        -0      10    1024     103     104     817    511.5   50.0%   79.8%
  12 pass90                         -5      10    1024     106     120     798    505.0   49.3%   77.9%
  13 pass80                         -5      10    1024      97     113     814    504.0   49.2%   79.5%
  14 pass30                         -6      10    1024     103     121     800    503.0   49.1%   78.1%
  15 pass120                        -7       9    1024      90     110     824    502.0   49.0%   80.5%
  16 pass10                        -13      11    1024     111     148     765    493.5   48.2%   74.7%
  17 paragon                       -15      10    1024     101     144     779    490.5   47.9%   76.1%
These are very similar results to my first experiment. So, while the Texel method does work in general, you may need to test all of the solutions it creates to determine which ones are actually better. Of course, my testing was limited to 100 games per pairing in the first experiment and 64 per pairing in the second, so it's very possible the results would change significantly if the number of games were increased. Even so, I doubt they would ever line up such that higher passes (i.e. less error) were always better than lower passes. In the meantime it looks like I may be able to pick up 27 Elo!