Tapered Evaluation and MSE (Texel Tuning)

Discussion of chess software programming and technical issues.

Moderator: Ras

Desperado
Posts: 879
Joined: Mon Dec 15, 2008 11:45 am

Re: Tapered Evaluation and MSE (Texel Tuning)

Post by Desperado »

..."Modifying the parameter set" is what the algorithm does all the time...
Here is the misunderstanding, imo. It is not about modifying the parameter set.
I was talking about modifying the parameters of the search setting. That affects how the parameter set gets modified in the end.

Meanwhile I would agree if you say that introducing iterations, because of using different step sizes instead of one,
does change the basic algorithm and makes it a variant of it. OK, no problem.

But I too started with one step size and used only one unit (my unit was defined as 5).
At least that is not a modification of the algorithm; it simply increments or decrements by a different unit.

Talking about it further will not get either of us anywhere. I think we are both able to work out what the other wants to express.
Everything else is nitpicking.
Ronald
Posts: 161
Joined: Tue Jan 23, 2018 10:18 am
Location: Rotterdam
Full name: Ronald Friederich

Re: Tapered Evaluation and MSE (Texel Tuning)

Post by Ronald »

hgm wrote: Wed Jan 20, 2021 5:04 pm
Sven wrote: Tue Jan 19, 2021 11:52 pm It is unlikely to find "the minimum MSE" with Texel tuning. The algorithm stops at a local minimum of the MSE, and there may be several of that kind. It seems you ignore this fact.
A good optimizer should not get stuck in local optima. And note that for terms in which the evaluation is linear (like most terms in hand-crafted evaluations, and certainly the piece values) there shouldn't be any local optima at all. I don't think the sigmoid correction can change that, as it is equivalent to fitting the centi-Pawn scores with a diminishing weight for the scores far from zero.
The terms of an evaluation function are not independent, the most obvious being the piece values, which are basically values relative to each other (we even express them in "centipawns"). This means that changing the value of one element influences the optimal value of other elements. Doesn't this imply that when optimizing with Texel-tuning-like algorithms, where you change one term at a time, you will nearly always end up in a local optimum, and that using different step sizes / different sequences of terms leads to different local optima?
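
For concreteness, a minimal sketch of the one-term-at-a-time loop being discussed (plain coordinate descent with a fixed step). The mse() here is only a toy stand-in for the real error over a test set, and the target values are made up for the example:

Code: Select all

#include <stdio.h>

#define NPARAM 5

/* Toy stand-in for the real mean squared error over a test set:
   a smooth bowl with a known minimum, so the sketch actually runs. */
double mse(int p[NPARAM])
{
  static const int target[NPARAM] = { 100, 325, 350, 500, 950 };
  double sum = 0;
  for (int j = 0; j < NPARAM; j++) {
    double d = p[j] - target[j];
    sum += d * d;
  }
  return sum;
}

int main(void)
{
  int p[NPARAM] = { 0 };   /* start values */
  int step = 1;            /* one fixed "unit" */
  double best = mse(p);
  int improved = 1;
  while (improved) {       /* sweep until a full pass gives no improvement */
    improved = 0;
    for (int j = 0; j < NPARAM; j++) {  /* change one term at a time */
      double e;
      p[j] += step;                     /* try increasing */
      e = mse(p);
      if (e < best) { best = e; improved = 1; continue; }
      p[j] -= 2 * step;                 /* otherwise try decreasing */
      e = mse(p);
      if (e < best) { best = e; improved = 1; continue; }
      p[j] += step;                     /* neither helped: restore */
    }
  }
  printf("%d %d %d %d %d  mse=%f\n", p[0], p[1], p[2], p[3], p[4], best);
  return 0;
}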
hgm
Posts: 28384
Joined: Fri Mar 10, 2006 10:06 am
Location: Amsterdam
Full name: H G Muller

Re: Tapered Evaluation and MSE (Texel Tuning)

Post by hgm »

Well, as I demonstrated, there will not be any local minima other than the global minimum, as long as the test set is not obviously flawed by failing to provide WDL info for some large range of scores.

Some examples: running a simplistic optimizer (just randomly increasing / decreasing the eval parameters by 1cP, and trying that up to 1000 times to get a better MSE) on a synthetic test set of random positions, for various sizes of the set, fitting the 5 piece values, gave the following results:

Code: Select all

100000, start at 0:     92 304 327 468 889 5396.180698
100000, start at exact: 93 305 328 470 892 5396.172637
 10000, start at 0:     91 303 328 486 897  528.024707
 10000, start at exact: 92 304 329 487 900  528.024649
  1000, start at 0:     76 312 339 499 895   51.970099
  1000, start at exact: 77 313 340 501 899   51.969961
It doesn't matter whether I start all piece values at 0, or whether I start with the values that were used to generate the results for the set (100/325/350/500/950). It always converges to the same minimum. That the best values differ between the sets is just an artifact of the particular set, and the probability of a set accidentally implying something weird diminishes as the set gets larger.

Interestingly, it thinks it can improve on the original values, even in asymptotically large sets. This is caused by the (simulated) positional term that helped determine the game results (but could not be fitted by the evaluation that was tuned). Since I used a symmetric distribution for that term, it has the effect of shifting the result towards a draw: if the material advantage is already large, getting some extra positional advantage on top of it improves the average result less than an equal positional disadvantage would reduce it. (The positional term I applied ranged from -330 to +330 cP, but would be between -20 and +20 cP in 50% of the cases.)

Anyway, it seems that this talk about local minima is just a red herring, unless you were fitting highly non-linear terms such as those sometimes used in calculating King safety. And it remains to be seen whether any of those local minima would be any good at all, both MSE-wise and Elo-wise.
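
The averaging effect can be checked directly with the same sigmoid: for a position that is, say, +300 in material, the mean of the expected scores under a symmetric ±200 cP positional swing is lower than the expected score of the bare material edge. The numbers below are only illustrative, not taken from the test sets above:

Code: Select all

#include <stdio.h>
#include <math.h>

double sigm(double x) { return 1. / (1. + exp(-2.2 / 400 * x)); }

int main(void)
{
  double mat = 300, pos = 200;          /* material edge, symmetric positional swing */
  double avg = 0.5 * (sigm(mat + pos) + sigm(mat - pos));
  printf("sigm(%.0f)               = %.3f\n", mat, sigm(mat));  /* about 0.839 */
  printf("avg of sigm(%.0f +- %.0f) = %.3f\n", mat, pos, avg);  /* about 0.787 */
  return 0;
}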
hgm
Posts: 28384
Joined: Fri Mar 10, 2006 10:06 am
Location: Amsterdam
Full name: H G Muller

Re: Tapered Evaluation and MSE (Texel Tuning)

Post by hgm »

Ronald wrote: Thu Jan 21, 2021 12:03 pm The terms of an evaluation function are not independent, the most obvious being the piece values, which are basically values relative to each other (we even express them in "centipawns"). This means that changing the value of one element influences the optimal value of other elements. Doesn't this imply that when optimizing with Texel-tuning-like algorithms, where you change one term at a time, you will nearly always end up in a local optimum, and that using different step sizes / different sequences of terms leads to different local optima?
The piece values are not just relative when you express them in terms of how they skew the result. The steepness of the sigmoid ('K') defines an absolute scale. The test positions will tell you what score advantage you get for being a Pawn ahead, and there is nothing arbitrary in that. If you scaled all piece values by the same factor, the materially imbalanced positions would no longer predict the correct score.
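
As a small numeric illustration of that absolute scale, using the same sigmoid that appears in the code further down (K = 2.2/400): a 100 cP advantage corresponds to an expected score of about 0.63, while scaling all values by two, so that the same material advantage evaluates to 200 cP, would predict about 0.75 and no longer match results generated at the original scale. A minimal check:

Code: Select all

#include <stdio.h>
#include <math.h>

/* expected score for a centipawn advantage x; same sigmoid as in the tuner code */
double sigm(double x)
{
  return 1. / (1. + exp(-2.2 / 400 * x));
}

int main(void)
{
  printf("100 cP ahead: %.3f\n", sigm(100));   /* about 0.634 */
  printf("200 cP ahead: %.3f\n", sigm(200));   /* about 0.750 */
  return 0;
}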
Ronald
Posts: 161
Joined: Tue Jan 23, 2018 10:18 am
Location: Rotterdam
Full name: Ronald Friederich

Re: Tapered Evaluation and MSE (Texel Tuning)

Post by Ronald »

Does your optimizer use a Texel-like method, i.e. changing one term at a time? My question is: does using different step sizes or a different sequence of changing the terms influence the outcome of the optimization?
hgm
Posts: 28384
Joined: Fri Mar 10, 2006 10:06 am
Location: Amsterdam
Full name: H G Muller

Re: Tapered Evaluation and MSE (Texel Tuning)

Post by hgm »

No, it can also change multiple terms at the same time. If you don't do that, you might get stuck in places that are not even local optima. I used the smallest possible step size (as the piece values are integers in centiPawn).

To be sure to find a local minimum you would always have to reduce your step size as you approach it. Otherwise you will reach a situation where all new trials overshoot the minimum, and look worse, even though stepping a smaller amount in that direction would have given you an improvement.
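
A minimal sketch of that step-size reduction on a one-parameter toy problem (the complete program below keeps the fixed 1 cP step instead): with a large fixed step every trial near the optimum overshoots, while halving the step whenever both directions look worse lets it settle on the exact minimum.

Code: Select all

#include <stdio.h>

/* toy one-parameter objective with its minimum at x = 337 */
double err(int x) { return (double)(x - 337) * (x - 337); }

int main(void)
{
  int x = 0, step = 256;
  double best = err(x);
  while (step >= 1) {
    double up = err(x + step), down = err(x - step);
    if (up < best)        { x += step; best = up;   }  /* step towards the minimum */
    else if (down < best) { x -= step; best = down; }
    else                  step /= 2;  /* both directions overshoot: halve the step */
  }
  printf("found x = %d, err = %f\n", x, best);  /* ends exactly at 337 */
  return 0;
}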

This is the complete code:

Code: Select all

#include <stdio.h>
#include <math.h>

#define NR 10000

int mat[NR][11];
int val[] = { 100, 325, 350, 500, 950, 100, 325, 350, 500, 950 };
int cnt[] = { 8, 2, 2, 2, 1, 8, 2, 2, 2, 1 };
int w[] = { 0, 1, 1, 2, 4, 0, 1, 1, 2, 4 };
//int best[10] = {100, 325, 350, 500, 950}, trial[10];
int best[10], trial[10];

double r()
{
  return (unsigned int) rand() / (double) (1U << 31);
}

double sigm(int x)
{
  return 1./(1. + exp(-2.2/400 * x));
}

double mse(int fit[10])
{
  double sum = 0, e;
  int i, j, eval;
  for(i=0; i<NR; i++) {
    eval = 0;
    for(j = 0; j<10; j++) {
      eval += (j < 5 ? fit[j] : -fit[j-5])*mat[i][j];
    }
    e = sigm(eval) - 0.5*mat[i][10];
    sum += e*e;
  }
  return sum;
}

void main()
{
  int i, j, k;
  double res;
  // create synthetic test set
  for(i=0; i<NR; i++) {
    double f, d, x, frac = r();
    int score = 0, phase = 0;
    for(j=0; j<10; j++) {
      int n = cnt[j];
      for(k=0; k<cnt[j]; k++) if(r() > frac) n--; // randomly delete some pieces
      mat[i][j] = n;
      score += (j < 5 ? val[j] : -val[j-5])*n;    // calculate material eval
      phase += w[j]*n;
    }
    f = 4*r() - 2;
    score += (f*f*f*f*f + f)*10 + 0.5; // add random positional term
    f = sigm(score); // average result for this position
    d = f*(1-f);     // assume a draw fraction for this average result
    x = r();         // assign an actual WDL result conforming to the average
    mat[i][10] = (x > f + d ? 0 : x > f - d ? 1 : 2);
  }
  // start optimizing
  i = 1; res = mse(best);
  do {
    double new;
    for(j=0; j<5; j++) { // generate new trial vector randomly
      double f = r();
      trial[j] = best[j];
      if(f > 0.75) trial[j]++; else
      if(f < 0.25) trial[j]--;
    }
    new = mse(trial); i++; // try it out
    if(new < res) {        // it is better
      printf("%d %d %d %d %d %f\n", trial[0], trial[1], trial[2], trial[3], trial[4], new);
      res = new; i = 1;
      for(j=0; j<5; j++) best[j] = trial[j];
    }
  } while(i < 1000); // give up after 1000 unsuccessful trials
}
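(A note for anyone trying this: with gcc it needs the math library for exp(), e.g. gcc -O2 tune.c -lm; the file name is just a placeholder.)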
hgm
Posts: 28384
Joined: Fri Mar 10, 2006 10:06 am
Location: Amsterdam
Full name: H G Muller

Re: Tapered Evaluation and MSE (Texel Tuning)

Post by hgm »

I now also tried a tapered evaluation, which for result generation uses the values

int val[] = { 70, 325, 350, 450, 950, 100, 300, 350, 500, 1100 };

(first 5 opening, last 5 end-game). Fitting on a set of 10000 gives:

Code: Select all

starting at 0:               59 308 320 432 910 96 270 339 495 1035 529.194240
starting with exact values:  60 308 320 430 895 97 273 342 503 1060 529.140631
It converges to practically the same values, no matter where you start.
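
For readers wondering how the tapered case plugs into the same framework: a plausible approach is to interpolate the opening and end-game scores linearly on the phase counter the program already computes (24 at full material with the w[] weights above). The post does not show the modified code, so the helper below only illustrates that interpolation, using the Rook values from the generating set in this post:

Code: Select all

#include <stdio.h>

#define MAXPHASE 24   /* 2*(8*0 + 2*1 + 2*1 + 2*2 + 1*4) with the w[] weights above */

/* blend opening and end-game scores by game phase (MAXPHASE = opening, 0 = end-game) */
int tapered(int mg, int eg, int phase)
{
  if (phase > MAXPHASE) phase = MAXPHASE;  /* cap, e.g. after promotions */
  return (mg * phase + eg * (MAXPHASE - phase)) / MAXPHASE;
}

int main(void)
{
  /* a Rook up, with the generating values from this post: 450 opening, 500 end-game */
  printf("phase 24: %d\n", tapered(450, 500, 24));  /* 450 */
  printf("phase 12: %d\n", tapered(450, 500, 12));  /* 475 */
  printf("phase  0: %d\n", tapered(450, 500, 0));   /* 500 */
  return 0;
}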
Sven
Posts: 4052
Joined: Thu May 15, 2008 9:57 pm
Location: Berlin, Germany
Full name: Sven Schüle

Re: Tapered Evaluation and MSE (Texel Tuning)

Post by Sven »

hgm wrote: Thu Jan 21, 2021 12:46 pm No, it can also change multiple terms at the same time. If you don't do that, you might get stuck in places that are not even local optima. I used the smallest possible step size (as the piece values are integers in centiPawn).

To be sure to find a local minimum you would always have to reduce your step size as you approach it. Otherwise you will reach a situation where all new trials overshoot the minimum, and look worse, even though stepping a smaller amount in that direction would have given you an improvement.

This is the complete code:

Code: Select all

#include <stdio.h>
#include <math.h>

#define NR 10000

int mat[NR][11];
int val[] = { 100, 325, 350, 500, 950, 100, 325, 350, 500, 950 };
int cnt[] = { 8, 2, 2, 2, 1, 8, 2, 2, 2, 1 };
int w[] = { 0, 1, 1, 2, 4, 0, 1, 1, 2, 4 };
//int best[10] = {100, 325, 350, 500, 950}, trial[10];
int best[10], trial[10];

double r()
{
  return (unsigned int) rand() / (double) (1U << 31);
}

double sigm(int x)
{
  return 1./(1. + exp(-2.2/400 * x));
}

double mse(int fit[10])
{
  double sum = 0, e;
  int i, j, eval;
  for(i=0; i<NR; i++) {
    eval = 0;
    for(j = 0; j<10; j++) {
      eval += (j < 5 ? fit[j] : -fit[j-5])*mat[i][j];
    }
    e = sigm(eval) - 0.5*mat[i][10];
    sum += e*e;
  }
  return sum;
}

void main()
{
  int i, j, k;
  double res;
  // create synthetic test set
  for(i=0; i<NR; i++) {
    double f, d, x, frac = r();
    int score = 0, phase = 0;
    for(j=0; j<10; j++) {
      int n = cnt[j];
      for(k=0; k<cnt[j]; k++) if(r() > frac) n--; // randomly delete some pieces
      mat[i][j] = n;
      score += (j < 5 ? val[j] : -val[j-5])*n;    // calculate material eval
      phase += w[j]*n;
    }
    f = 4*r() - 2;
    score += (f*f*f*f*f + f)*10 + 0.5; // add random positional term
    f = sigm(score); // average result for this position
    d = f*(1-f);     // assume a draw fraction for this average result
    x = r();         // assign an actual WDL result conforming to the average
    mat[i][10] = (x > f + d ? 0 : x > f - d ? 1 : 2);
  }
  // start optimizing
  i = 1; res = mse(best);
  do {
    double new;
    for(j=0; j<5; j++) { // generate new trial vector randomly
      double f = r();
      trial[j] = best[j];
      if(f > 0.75) trial[j]++; else
      if(f < 0.25) trial[j]--;
    }
    new = mse(trial); i++; // try it out
    if(new < res) {        // it is better
      printf("%d %d %d %d %d %f\n", trial[0], trial[1], trial[2], trial[3], trial[4], new);
      res = new; i = 1;
      for(j=0; j<5; j++) best[j] = trial[j];
    }
  } while(i < 1000); // give up after 1000 unsuccessful trials
}
Your r() function returns values between 0.0 and (RAND_MAX / 2147483648.0), where the latter is about 1.526e-05 for a typical RAND_MAX=32767. That means r() is always very small. In particular, it is never greater than 0.25 (in fact much smaller). Is that intended? Also, what does it mean for the created test sets? Do they have the intended variety, given that the "random positional term" is always either -339 or -340, the draw fraction is almost zero, and thus the WDL result is always 2?
Sven Schüle (engine author: Jumbo, KnockOut, Surprise)
hgm
Posts: 28384
Joined: Fri Mar 10, 2006 10:06 am
Location: Amsterdam
Full name: H G Muller

Re: Tapered Evaluation and MSE (Texel Tuning)

Post by hgm »

Sven wrote: Thu Jan 21, 2021 2:38 pm Your r() function returns values between 0.0 and (RAND_MAX / 2147483648.0), where the latter is about 1.526e-05 for a typical RAND_MAX=32767.
Well, apparently the rand() function used by my compiler is not very typical. Of course the very first thing I did was make the program do 100 rand() calls and print their results, from which I saw that they went up to about 2 billion. So none of the problems that could have occurred if I had done it wrong actually happen here.
Sven
Posts: 4052
Joined: Thu May 15, 2008 9:57 pm
Location: Berlin, Germany
Full name: Sven Schüle

Re: Tapered Evaluation and MSE (Texel Tuning)

Post by Sven »

hgm wrote: Thu Jan 21, 2021 2:43 pm
Sven wrote: Thu Jan 21, 2021 2:38 pm Your r() function returns values between 0.0 and (RAND_MAX / 2147483648.0), where the latter is about 1.526e-05 for a typical RAND_MAX=32767.
Well, apparently the rand() function used by my compiler is not very typical. Of course the very first thing I did was make the program do 100 rand() calls and print their results, from which I saw that they went up to about 2 billion. So none of the problems that could have occurred if I had done it wrong actually happen here.
Replacing (1U << 31) by RAND_MAX and #including <stdlib.h> gives better results.

Now what is your conclusion, given that your tuning algorithm does not resemble Texel tuning at all, and that your test setup uses random test data that are different each time you run the tuner? Should we use random positions for eval parameter tuning?
Sven Schüle (engine author: Jumbo, KnockOut, Surprise)
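
For completeness, a minimal sketch of a portable r() along the lines suggested above, dividing by RAND_MAX + 1.0 so that the result stays in [0,1) regardless of the platform's RAND_MAX:

Code: Select all

#include <stdio.h>
#include <stdlib.h>

/* uniform in [0,1) whatever RAND_MAX happens to be */
double r()
{
  return rand() / ((double) RAND_MAX + 1.0);
}

int main(void)
{
  for (int i = 0; i < 5; i++) printf("%f\n", r());
  return 0;
}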