tuning via maximizing likelihood


Daniel Shawul
Posts: 4185
Joined: Tue Mar 14, 2006 11:34 am
Location: Ethiopia

Re: maximum a posteriori

Post by Daniel Shawul »

I have got two issues:

a) The scale of centi-pawns vs Elo is not one-to-one when we have significant draw percentages, so despite our previous assumption that no scaling is needed, this needs to be addressed, especially for Davidson. The tuning results I got using Davidson are significantly ramped up compared to the other models. Assuming the centi-pawn vs winning-percentage relation is logistic, we will try to morph the three draw models to give evaluation scores comparable to those obtained with no draw model. The criterion I am using to calculate the scaling factor is that the slope at eval=0 should match that of the logistic curve. Let me know if there is a better way to calculate the scaling factor, especially for Davidson, which is slightly off after scaling with this method. In code, it is:

Code: Select all

//scale elos so that they look more like elostat's
//  Match slopes at 0 elo difference using df(x)/dx = K/4
void calculate_scale() {
    const double K = log(10)/400.0;
    double df;
    if(eloModel == 0) {            //Rao-Kupper
        double dg = elo_to_gamma(eloDraw);
        double f = 1 / (1 + dg);
        df = f * (1 - f) * K;
    } else if(eloModel == 1) {     //Davidson
        double dg = elo_to_gamma(eloDraw) - 1;
        df = (dg / pow(2+dg,2.0)) * K;
    } else if(eloModel == 2) {     //Gaussian (Glenn-David)
        const double pi = 3.14159265359;
        double x = -eloDraw/400.0;
        df = exp(-x*x) / (400.0 * sqrt(pi));
    }
    eloScale = (4.0 / K) * df;
    printf("EloScale %f\n",eloScale);
}
For Davidson, the factor (nu = dg) is calculated so that eloDraw=0 gives dg=0. This wasn't the case in the code I posted previously, but I think it should be like that.
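
In other words (a small check, assuming elo_to_gamma(x) = exp(K*x), consistent with the Python script later in this thread):

Code: Select all

import math

K = math.log(10.0) / 400.0

def davidson_nu(eloDraw):
    # nu = elo_to_gamma(eloDraw) - 1 with gamma = exp(K*elo),
    # so eloDraw = 0 gives nu = 0, i.e. the draw model vanishes.
    return math.exp(K * eloDraw) - 1.0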

b) Some evaluation terms dealing with imbalances are very problematic for tuning. I had a bonus for a major or minor piece versus pawns. Tuning increased this bonus from its default value of 45 to as much as 500! I think this is because the dataset probably doesn't contain enough positions where a side is a piece up and still doesn't win. When I carefully changed the evaluation condition so that a side must be up a piece AND down by at least as many pawns as the piece is worth, the tuning kept the value at more or less 45. I can imagine this kind of thing being problematic for other people tuning their eval too.
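
A hypothetical sketch of the guarded condition (names and default values made up):

Code: Select all

def piece_vs_pawns_bonus(my_pieces, opp_pieces, my_pawns, opp_pawns,
                         piece_value_in_pawns=3, bonus=45):
    # Award the imbalance bonus only when the side is up a piece AND
    # down by at least the piece's worth in pawns.
    up_a_piece = my_pieces > opp_pieces
    pawns_down = opp_pawns - my_pawns
    return bonus if up_a_piece and pawns_down >= piece_value_in_pawns else 0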

I will do the drawishness simulations to convergence after I fix these issues.

Daniel
Evert
Posts: 2929
Joined: Sat Jan 22, 2011 12:42 am
Location: NL

Re: maximum a posteriori

Post by Evert »

Daniel Shawul wrote: b) Some evaluation terms dealing with imbalances are very problematic for tuning. I had a bonus for a major or minor piece versus pawns. Tuning increased this bonus from its default value of 45 to as much as 500! I think this is because the dataset probably doesn't contain enough positions where a side is a piece up and still doesn't win. When I carefully changed the evaluation condition so that a side must be up a piece AND down by at least as many pawns as the piece is worth, the tuning kept the value at more or less 45. I can imagine this kind of thing being problematic for other people tuning their eval too.
Yes, I found that ideally the positions should exhibit the features I'm trying to fit (other positions just add noise). If you use a fitting method that uses the Jacobian, the condition number gives you a hint of how well the features are represented.
Values can also be unstable due to bugs in the implementation of the evaluation; I've found a few of those that way.
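
A minimal sketch of that check (numpy; assuming J is the tuning Jacobian with one row per position and one column per parameter):

Code: Select all

import numpy as np

def feature_condition_report(J):
    # Columns that are (almost) all zero mean the feature hardly ever
    # fires in the dataset, so its parameter is essentially unconstrained.
    print("positions per feature:", np.count_nonzero(J, axis=0))
    # A large condition number signals under-represented or near-collinear
    # features, i.e. an ill-conditioned fit.
    print("condition number of J:", np.linalg.cond(J))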
Daniel Shawul
Posts: 4185
Joined: Tue Mar 14, 2006 11:34 am
Location: Ethiopia

Re: maximum a posteriori

Post by Daniel Shawul »

About the scaling: it turns out it is better to match the slope at the eloDraw point instead of at the 0 point. Here are pictures that show before and after scaling with the two criteria, i.e. matching the slope at the 0 point or at the eloDraw point.

First, using eloDraw=50, which doesn't show any difference between the two methods:

[images: before/after scaling curves for the three models, eloDraw=50]


But using a very large value of eloDraw=250, matching the slope at 250 is better for Davidson:

[images: before/after scaling curves for the three models, eloDraw=250]

The plotted slopes are scaled up by a factor of 400.

Daniel
Michel
Posts: 2272
Joined: Mon Sep 29, 2008 1:50 am

Re: maximum a posteriori

Post by Michel »

Daniel wrote: b) Some evaluation terms dealing with imbalances are very problematic for tuning. I had a bonus for a major or minor piece versus pawns. Tuning increased this bonus from its default value of 45 to as much as 500! I think this is because the dataset probably doesn't contain enough positions where a side is a piece up and still doesn't win. When I carefully changed the evaluation condition so that a side must be up a piece AND down by at least as many pawns as the piece is worth, the tuning kept the value at more or less 45. I can imagine this kind of thing being problematic for other people tuning their eval too.
You know of course as well as I do that one usually tries to address this by adding a (somewhat ad hoc) regularization term to the objective function. The theoretical justification for doing so seems somewhat unclear.

Daniel wrote: About the scaling: it turns out it is better to match the slope at the eloDraw point instead of at the 0 point. Here are pictures that show before and after scaling with the two criteria, i.e. matching the slope at the 0 point or at the eloDraw point.
If the aim is to have the evaluation function reflect the expected score of a position via the standard logistic function, then scaling is indeed necessary if one of the standard draw models is being used.

However, for simple symmetry reasons it seems weird to do the matching at any point other than ev_score=0.
Ideas=science. Simplification=engineering.
Without ideas there is nothing to simplify.
AlvaroBegue
Posts: 931
Joined: Tue Mar 09, 2010 3:46 pm
Location: New York
Full name: Álvaro Begué (RuyDos)

Re: maximum a posteriori

Post by AlvaroBegue »

Michel wrote:
Daniel wrote: b) Some evaluation terms dealing with imbalances are very problematic for tuning. I had a bonus for a major or minor piece versus pawns. Tuning increased this bonus from its default value of 45 to as much as 500! I think this is because the dataset probably doesn't contain enough positions where a side is a piece up and still doesn't win. When I carefully changed the evaluation condition so that a side must be up a piece AND down by at least as many pawns as the piece is worth, the tuning kept the value at more or less 45. I can imagine this kind of thing being problematic for other people tuning their eval too.
You know of course as well as I do that one usually tries to address this by adding a (somewhat ad hoc) regularization term to the objective function. The theoretical justification for doing so seems somewhat unclear.
The theoretical justification for regularization is not that unclear. I'll introduce the standard naming for these things from statistics, since I shouldn't assume everyone is familiar with the terminology. We are considering a large family of models that might explain our observations (all the possible settings of the parameters). The Bayesian approach to this situation requires that we start with a probability distribution over the set of models (known as a "prior"), which expresses our beliefs about what's reasonable (e.g., the bonus you mention having a value of 500 isn't reasonable). The data then allows us to refine this and obtain another probability distribution (known as the "posterior") by using Bayes's formula to compute the probability of a model given the data. One way to estimate the model parameters is to pick the model with the highest posterior probability. In the case where our prior distribution is flat, this estimator is called the "maximum likelihood estimator".

Now for the relevant part: in some cases regularization is exactly equivalent to starting with a prior probability distribution that is not flat. For instance, L2 regularization of a linear model (a.k.a. "Tikhonov regularization" or "ridge regression") corresponds to a Gaussian prior centered around zero, with diagonal covariance and equal variance in every parameter. I seem to remember that the regularization coefficient is inversely proportional to the variance of the prior, or something like that; I have done the computation before but I can't remember the details.

https://en.wikipedia.org/wiki/Tikhonov_ ... rpretation
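
For what it's worth, a quick sketch of that computation (a standard result, assuming a linear model with Gaussian noise of variance sigma_n^2 and a zero-mean Gaussian prior with variance sigma_p^2 per parameter):

Code: Select all

-\log P(\theta \mid D)
  = \frac{1}{2\sigma_n^2} \lVert y - X\theta \rVert^2
  + \frac{1}{2\sigma_p^2} \lVert \theta \rVert^2
  + \mathrm{const}

Minimizing this is exactly ridge regression with lambda = sigma_n^2 / sigma_p^2, so the coefficient is indeed inversely proportional to the variance of the prior.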
Daniel Shawul
Posts: 4185
Joined: Tue Mar 14, 2006 11:34 am
Location: Ethiopia

Re: maximum a posteriori

Post by Daniel Shawul »

Michel wrote: If the aim is to have the evaluation function reflect the expected score of a position via the standard logistic function, then scaling is indeed necessary if one of the standard draw models is being used.

However, for simple symmetry reasons it seems weird to do the matching at any point other than ev_score=0.
Yes, we basically agree, but the point of symmetry is not ev_score=0 when we have those two additional parameters. Our target should be to match slopes at ev_score = eloDraw - eloHome, the idea being to match the slope at the inflection point. With ridiculous values of eloDraw=250 and eloHome=500, the previous criterion of matching at ev_score=250 fails miserably, while the new one at ev_score=-250 works well. Here is a little Python script to play with:

Code: Select all

import matplotlib.pyplot as plt
import numpy as np
import math

eloD = 250
eloH = 500
off = eloD - eloH #set the inflection point as the point where we match slopes
K = math.log(10.0)/400
K1 = 1.0/400
nu = math.exp(K*eloD) - 1
rpi = math.sqrt(3.14159265359)

N=1000
x=np.arange(-N*1.0,N*2.0,1.0)
r=np.copy(x)
d=np.copy(x)
g=np.copy(x)
dr=np.copy(x)
dd=np.copy(x)
dg=np.copy(x)
for i in range(len(x)):
    r[i] = 1.0 / (1 + math.exp(-K*(x[i]+eloH-eloD)))
    dr[i] = (r[i] * (1 - r[i])) * K * 400	

    g[i] = (1 + math.erf(K1*(x[i]+eloH-eloD)))/2.0
    dg[i] = math.exp(-pow(K1*(x[i]+eloH-eloD),2.0)) / rpi

    f = 1.0 / (1 + math.exp(-K*(x[i]+eloH)))
    df = f * (1 - f)
    dx = nu * math.sqrt(df)
    b = 1 + dx
    c = (nu * f * (1 - 2 * f)) / (2 * math.sqrt(df))
    d[i] = f / b
    dd[i] = ((b - c) / (b * b)) * df * K * 400

plt.plot(x,r,'r',label='RK')
plt.plot(x,g,'b',label='GD')
plt.plot(x,d,'g',label='DV')
plt.plot(x,dr,'r--',label='RK-slope')
plt.plot(x,dg,'b--',label='GD-slope')
plt.plot(x,dd,'g--',label='DV-slope')
plt.legend(loc='upper left')
plt.xlabel('evaluation score')
plt.ylabel('winning probability')
k = 'Before scaling: eloDraw=' + str(eloD) + ', eloHome='+str(eloH)
plt.title(k)
k = 'before'+str(eloD)+'.png'
plt.savefig(k)
plt.show()



der = (1.0 / 4) * K * 400
scale_r = dr[N+off] / der
scale_g = dg[N+off] / der
scale_d = dd[N+off] / der
print(scale_r, scale_g, scale_d)

for i in range(len(x)):
    r[i] = 1.0 / (1 + math.exp(-K*(1.0/scale_r)*(x[i]+eloH-eloD)))
    dr[i] = (r[i] * (1 - r[i])) * K * 400	

    g[i] = (1 + math.erf(K1*(1.0/scale_g)*(x[i]+eloH-eloD)))/2.0
    dg[i] = math.exp(-pow(K1*(1.0/scale_g)*(x[i]+eloH-eloD),2.0)) / rpi

    f = 1.0 / (1 + math.exp(-K*(1.0/scale_d)*(x[i]+eloH)))
    df = f * (1 - f)
    dx = nu * math.sqrt(df)
    b = 1 + dx
    c = nu * f * (1 - 2 * f) / (2 * math.sqrt(df))
    d[i] = f / b
    dd[i] = ((b - c) / (b * b)) * df*(1.0/scale_d) * K * 400

plt.plot(x,r,'r',label='RK')
plt.plot(x,g,'b',label='GD')
plt.plot(x,d,'g',label='DV')
plt.plot(x,dr,'r--',label='RK-slope')
plt.plot(x,dg,'b--',label='GD-slope')
plt.plot(x,dd,'g--',label='DV-slope')
plt.legend(loc='upper left')
plt.xlabel('evaluation score')
plt.ylabel('winning probability')
k = 'After scaling by matching slope at ' + str(off) + ': eloDraw=' + str(eloD) + ', eloHome='+str(eloH)
plt.title(k)
k = 'after'+str(eloD)+'_' + str(off)+'.png'
plt.savefig(k)
plt.show()
Daniel Shawul
Posts: 4185
Joined: Tue Mar 14, 2006 11:34 am
Location: Ethiopia

Re: maximum a posteriori

Post by Daniel Shawul »

A third point:

Since most evaluation functions are linear, we can compute and store the Jacobian for all the FENs in the database once at startup. The Jacobian is a constant matrix of size (number_of_positions) x (number_of_parameters). Then we never have to call the engine's eval() again to compute gradients. I haven't tried it, but this should be very efficient compute-wise, and it avoids the need for automatic differentiation while getting the benefit of fewer eval() calls.

With the stochastic sampling I am using it can be a little inefficient, though, since only a fraction of the positions is used in each iteration. With more iterations, it should benefit more.
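
A minimal sketch of the idea (numpy; all names hypothetical), assuming the eval is score = J @ theta with the standard logistic link:

Code: Select all

import numpy as np

def precompute_jacobian(positions, extract_features):
    # One feature row per position, so eval(pos_i, theta) == J[i] @ theta.
    # J is built once at startup; eval() is never called again.
    return np.array([extract_features(p) for p in positions], dtype=float)

def grad_neg_log_likelihood(J, theta, results, K=np.log(10.0) / 400.0):
    # results[i] is the game outcome in [0, 1] for position i.
    p = 1.0 / (1.0 + np.exp(-K * (J @ theta)))     # predicted scores
    # Gradient of the negative log-likelihood: one matrix-vector
    # product per iteration, no eval() calls.
    return K * (J.T @ (p - results)) / len(results)

With stochastic sampling you would just slice rows of J (and the corresponding results) for the current mini-batch instead of recomputing any features.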
AlvaroBegue
Posts: 931
Joined: Tue Mar 09, 2010 3:46 pm
Location: New York
Full name: Álvaro Begué (RuyDos)

Re: maximum a posteriori

Post by AlvaroBegue »

Daniel Shawul wrote: A third point:

Since most evaluation functions are linear[...]
Mine certainly isn't. I am not sure how you would express in a linear function that endgames with opposite-colored bishops are drawish, or that you need to coordinate several pieces to mount an effective attack on the king. Those things are naturally non-linear.
Daniel Shawul
Posts: 4185
Joined: Tue Mar 14, 2006 11:34 am
Location: Ethiopia

Re: maximum a posteriori

Post by Daniel Shawul »

Matching the slopes at score = eloDraw - eloHome works so well that, for the first time, Davidson matches the other two models with automatically calculated scales. I am going to make this the default. Here are the computed Elos for CEGT 40/120 with the three models; eloHome and eloDraw come out to 191 and 60, respectively.

Code: Select all

daniel@daniel-Satellite-C855:~/bopo$ ./bopo
$: read cegt.pgn
Games 6200	
$: model 1
model 1
$: mm
100. 2.097133e-04
158. 8.725856e-06
Time 0.01 sec
$: scale
scale 0.557966
$: ratings
 1.                          Komodo 4.0 x64    246     31     31    350    66.3%   107    46.9%
 2.                         Houdini 1.5 x64    242     25     25    550    64.4%   121    40.7%
 3.                           Rybka 4.0 x64    231     25     25    550    64.0%   111    48.4%
 4.                         Komodo 2.03 x64    196     26     26    500    61.7%    97    45.4%
 5.                       Stockfish 2.0 x64    185     26     26    500    61.2%    89    48.8%
 6.                        Critter 0.90 x64    142     24     24    550    57.2%    81    46.0%
 7.                            Naum 4.2 x64    103     20     20    800    53.0%    78    49.0%
 8.                         Chiron 1.1a x64     60     33     33    300    42.3%   125    44.0%
 9.                               Spike 1.4     59     26     26    500    46.8%    84    42.0%
10.                    Deep Shredder 12 x64     58     21     21    700    50.9%    50    45.9%
11.                      Deep Sjeng ct 2010     53     31     31    350    38.7%   148    48.9%
12.                          Rybka 1.2f x64     20     21     21    750    60.5%   -70    45.9%
13.                            Gull 1.2 x64      4     27     27    450    40.0%    88    40.9%
14.                Deep Junior 12.5.0.3 x64      0     29     29    400    54.1%   -37    47.8%
15.                           Spark 1.0 x64     -7     33     33    300    54.7%   -47    40.7%
16.                                Fritz 12    -15     25     25    550    47.2%     9    46.0%
17.                             Hiarcs 13.1    -42     25     25    550    37.6%    64    41.8%
18.                                Fritz 11    -50     33     33    300    57.8%  -116    42.3%
19.                   Deep Sjeng WC2008 x64    -83     23     23    650    42.5%   -18    40.3%
20.                           Loop 2007 x64   -145     30     30    350    47.7%  -125    45.7%
21.                         Spike 1.2 Turin   -167     26     26    450    47.0%  -141    44.7%
22.                           Toga II 1.3.1   -191     31     31    350    36.6%   -77    36.0%
23.                             Fruit 2.2.1   -195     24     24    600    37.1%   -84    37.8%
24.                               Hiarcs 10   -208     28     28    400    41.6%  -138    40.8%
25.                         Bobcat 2.75 x64   -227     31     31    350    35.4%  -102    39.4%
26.                                 Ktulu 8   -257     33     33    300    35.3%  -132    34.0%
$: model 0
model 0
$: mm
85. 7.873132e-06
Time 0.00 sec
$: scale
scale 0.858535
$: ratings
 1.                         Houdini 1.5 x64    244     24     24    550    64.4%   120    40.7%
 2.                          Komodo 4.0 x64    242     30     30    350    66.3%   107    46.9%
 3.                           Rybka 4.0 x64    227     24     24    550    64.0%   111    48.4%
 4.                         Komodo 2.03 x64    195     25     25    500    61.7%    97    45.4%
 5.                       Stockfish 2.0 x64    182     25     25    500    61.2%    89    48.8%
 6.                        Critter 0.90 x64    141     24     24    550    57.2%    81    46.0%
 7.                            Naum 4.2 x64    103     20     20    800    53.0%    78    49.0%
 8.                         Chiron 1.1a x64     64     32     32    300    42.3%   124    44.0%
 9.                               Spike 1.4     60     25     25    500    46.8%    83    42.0%
10.                    Deep Shredder 12 x64     58     21     21    700    50.9%    50    45.9%
11.                      Deep Sjeng ct 2010     50     30     30    350    38.7%   147    48.9%
12.                          Rybka 1.2f x64     21     21     21    750    60.5%   -70    45.9%
13.                            Gull 1.2 x64      2     27     27    450    40.0%    88    40.9%
14.                Deep Junior 12.5.0.3 x64      0     28     28    400    54.1%   -36    47.8%
15.                           Spark 1.0 x64     -4     33     33    300    54.7%   -47    40.7%
16.                                Fritz 12    -12     24     24    550    47.2%     9    46.0%
17.                             Hiarcs 13.1    -39     24     24    550    37.6%    64    41.8%
18.                                Fritz 11    -46     32     32    300    57.8%  -117    42.3%
19.                   Deep Sjeng WC2008 x64    -82     22     22    650    42.5%   -17    40.3%
20.                           Loop 2007 x64   -144     30     30    350    47.7%  -123    45.7%
21.                         Spike 1.2 Turin   -165     27     27    450    47.0%  -140    44.7%
22.                           Toga II 1.3.1   -194     31     31    350    36.6%   -77    36.0%
23.                             Fruit 2.2.1   -195     24     24    600    37.1%   -83    37.8%
24.                               Hiarcs 10   -208     29     29    400    41.6%  -137    40.8%
25.                         Bobcat 2.75 x64   -225     31     31    350    35.4%  -102    39.4%
26.                                 Ktulu 8   -262     34     34    300    35.3%  -131    34.0%
$: model 2
model 2
$: cg
Iteration     0: -13287.523355 6.200000e+03 0.150000 {60.556408 1.000000 197.977849 1.000000}
Iteration    10: -12383.840314 3.714725e+01 0.150000 {55.010422 1.000000 176.446668 1.000000}
Iteration    20: -12052.078816 9.584121e+00 0.150000 {55.920514 1.000000 185.379153 1.000000}
Iteration    30: -11912.548375 5.538627e+00 0.150000 {57.988576 1.000000 189.688186 1.000000}
Iteration    40: -11851.402021 1.598067e+00 0.150000 {59.096819 1.000000 190.830564 1.000000}
Iteration    50: -11836.311248 3.785031e-01 0.150000 {59.078966 1.000000 191.199800 1.000000}
Iteration    60: -11833.076483 6.807339e-02 0.150000 {58.957378 1.000000 191.418866 1.000000}
Iteration    70: -11832.426771 1.925283e-02 0.150000 {59.013454 1.000000 191.601267 1.000000}
Iteration    80: -11832.212272 6.422767e-03 0.150000 {59.108343 1.000000 191.767091 1.000000}
Iteration    90: -11832.146908 1.697782e-03 0.150000 {59.144294 1.000000 191.852996 1.000000}
Iteration   100: -11832.132521 3.012616e-04 0.150000 {59.143888 1.000000 191.880569 1.000000}
Iteration   110: -11832.129912 6.479464e-05 0.150000 {59.145690 1.000000 191.894245 1.000000}
Iteration   120: -11832.129185 2.451105e-05 0.150000 {59.150025 1.000000 191.904046 1.000000}
Iteration   130: -11832.128931 6.218225e-06 0.150000 {59.152546 1.000000 191.908939 1.000000}
Time 0.22 sec
$: scale
scale 0.877872
$: ratings
 1.                          Komodo 4.0 x64    241     30     30    350    66.3%   105    46.9%
 2.                         Houdini 1.5 x64    239     24     24    550    64.4%   119    40.7%
 3.                           Rybka 4.0 x64    226     24     24    550    64.0%   110    48.4%
 4.                         Komodo 2.03 x64    192     25     25    500    61.7%    95    45.4%
 5.                       Stockfish 2.0 x64    181     25     25    500    61.2%    88    48.8%
 6.                        Critter 0.90 x64    140     24     24    550    57.2%    80    46.0%
 7.                            Naum 4.2 x64    101     19     19    800    53.0%    77    49.0%
 8.                         Chiron 1.1a x64     59     32     32    300    42.3%   122    44.0%
 9.                               Spike 1.4     58     25     25    500    46.8%    83    42.0%
10.                    Deep Shredder 12 x64     57     21     21    700    50.9%    49    45.9%
11.                      Deep Sjeng ct 2010     51     30     30    350    38.7%   145    48.9%
12.                          Rybka 1.2f x64     19     20     20    750    60.5%   -69    45.9%
13.                            Gull 1.2 x64      2     27     27    450    40.0%    86    40.9%
14.                Deep Junior 12.5.0.3 x64      0     28     28    400    54.1%   -36    47.8%
15.                           Spark 1.0 x64     -7     32     32    300    54.7%   -46    40.7%
16.                                Fritz 12    -13     24     24    550    47.2%     9    46.0%
17.                             Hiarcs 13.1    -40     24     24    550    37.6%    62    41.8%
18.                                Fritz 11    -48     32     32    300    57.8%  -115    42.3%
19.                   Deep Sjeng WC2008 x64    -81     22     22    650    42.5%   -17    40.3%
20.                           Loop 2007 x64   -142     29     29    350    47.7%  -122    45.7%
21.                         Spike 1.2 Turin   -163     26     26    450    47.0%  -138    44.7%
22.                           Toga II 1.3.1   -188     30     30    350    36.6%   -76    36.0%
23.                             Fruit 2.2.1   -191     23     23    600    37.1%   -82    37.8%
24.                               Hiarcs 10   -205     28     28    400    41.6%  -135    40.8%
25.                         Bobcat 2.75 x64   -222     30     30    350    35.4%  -101    39.4%
26.                                 Ktulu 8   -254     33     33    300    35.3%  -129    34.0%
$: quit
For my eval tuning, though, Davidson still seems to give distinctly different values than the other models (especially for the king_attack weight). I guess it is what it is and there is nothing I can do about that.

Daniel
Daniel Shawul
Posts: 4185
Joined: Tue Mar 14, 2006 11:34 am
Location: Ethiopia

Re: maximum a posteriori

Post by Daniel Shawul »

AlvaroBegue wrote:
Daniel Shawul wrote:A third point :

Since most evaluation functions are linear[...]
Mine certainly isn't. I am not sure how you would express in a linear function that endgames with opposite-colored bishops are drawish, or that you need to coordinate several pieces to mount an effective attack on the king. Those things are naturally non-linear.
Note that we are talking about linearity with respect to the parameters. In other words, no two parameters should multiply or divide each other (they can be added and subtracted), nor should a parameter be used as a test condition for applying another evaluation term. With this definition, my evaluation function looks totally linear to me. For example, for king safety I use the number of attacks from pieces, calculate the phase, etc., and multiply by KING_SAFETY_WEIGHT. Since the phase is determined by the number of pieces (not the summed value of the pieces), and the number of attacks by the presence of a piece (not its value), it is still linear.

The opposite-colored bishops case should also be a property of the position and therefore linear in the parameters. I am calculating the gradient of the eval with respect to the parameters for each position, so if you don't change the position the gradient stays the same. With the stochastic sampling you sample different positions, so the mean of the gradients differs from iteration to iteration; but if you consider all positions, you only need to compute it once. Computing and storing the Hessian is probably overkill, especially with a large number of parameters.
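
As a toy illustration (the helper functions are hypothetical): the features can be as nonlinear in the position as you like, as long as each tunable parameter enters only as a multiplier.

Code: Select all

def evaluate(pos, params):
    # The features are nonlinear functions of the position (counts,
    # phases, boolean properties), but the score is linear in params,
    # so d(score)/d(params) == features, a constant per position that
    # can be precomputed once.
    features = [
        king_attack_count(pos) * game_phase(pos),    # hypothetical helpers
        1.0 if opposite_colored_bishops(pos) else 0.0,
    ]
    return sum(w * f for w, f in zip(params, features))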

Can you give an example of an eval term that is non-linear by this definition?