tuning via maximizing likelihood

Discussion of chess software programming and technical issues.

Moderators: hgm, Rebel, chrisw

Rein Halbersma
Posts: 741
Joined: Tue May 22, 2007 11:13 am

Re: tuning via maximizing likelihood

Post by Rein Halbersma »

Michel wrote:Another important issue is "model mismatch", that is, when the true evaluation function (predicting the true w/l probabilities via the logistic function) and the heuristic evaluation function have different forms (e.g. E(x)=a+bx, H(c)=cx, where x is a scalar property of the position).

Both ML and LS will come up with a value for c, which will be different in both cases. Unfortunately there is no objective criterion (yet?) to decide which value is better. So this issue cannot be handled theoretically.
Model misspecification aside, there is a theoretical sense in which MLE is superior to non-linear least squares (NLS).

For any finite sample of N observations generated from a model with the "true" parameter vector, you can compute the vector of parameter estimates for both MLE and NLS. You can also do this for multiple such finite samples, generating a sampling distribution of parameter estimates.

Under some mild conditions, both the MLE and NLS estimates converge in probability to the "true" parameter vector as the sample size N goes to infinity. In econometrics jargon, this means that both estimators are "consistent". In other words, for large enough samples the MLE and NLS parameter estimates will be nearly the same in practice.

However, the standard deviation of the sampling distribution of the parameter estimates shrinks at best as fast as 1/sqrt(N) (equivalently, the variance shrinks at best as 1/N). Only MLE attains this bound asymptotically, and in econometrics jargon it is therefore called the "most efficient" of all consistent estimators. NLS has a larger variance around the "true" parameter in finite samples.
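The efficiency claim is easy to check by simulation. Below a quick Monte Carlo sketch in Python (my own illustration using numpy/scipy, not from the textbook): a one-parameter logistic model, fitted repeatedly by MLE and by NLS on fresh samples; across replications the spread of the MLE estimates should come out no larger than that of the NLS estimates.

Code: Select all

```python
import numpy as np
from scipy.optimize import minimize_scalar

rng = np.random.default_rng(1)
N, REPS, w_true = 500, 200, 0.8

def logistic(z):
    return 1.0 / (1.0 + np.exp(-z))

mle_est, nls_est = [], []
for _ in range(REPS):
    # fresh finite sample from the "true" model
    x = rng.normal(size=N)
    y = (rng.random(N) < logistic(w_true * x)).astype(float)

    def nll(w):  # negative log-likelihood (MLE objective)
        p = np.clip(logistic(w * x), 1e-12, 1 - 1e-12)
        return -np.sum(y * np.log(p) + (1 - y) * np.log(1 - p))

    def sse(w):  # sum of squared residuals (NLS objective)
        return np.sum((y - logistic(w * x)) ** 2)

    mle_est.append(minimize_scalar(nll, bounds=(-5, 5), method="bounded").x)
    nls_est.append(minimize_scalar(sse, bounds=(-5, 5), method="bounded").x)

# both estimators center on w_true; MLE typically has the smaller spread
print("MLE var:", np.var(mle_est), "NLS var:", np.var(nls_est))
```

Both sampling distributions center on the true value 0.8; the difference shows up only in their spread.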

Source: Micro-econometrics textbook by Cameron & Trivedi, chapter 5.
Rein Halbersma
Posts: 741
Joined: Tue May 22, 2007 11:13 am

Re: tuning via maximizing likelihood

Post by Rein Halbersma »

Michel wrote:
LogL converges 3x faster (fewer iterations) than MSE if the positions have a win rate of 50% . For a win rate of 90% this advantage is ~9x, and for 99% win rate the advantage is ~35x. So the higher the win rate, the faster LogL converges compared to MSE. Note that in all cases the fitted parameter value is virtually identical.
While I also would like to believe that ML is superior to least squares I am not sure your test shows this. I think all you have been measuring is the performance of R's cg method for optimizing certain specific 1-variable functions.
I used the conjugate gradient method because in my application domain (draughts), the number of parameters is usually in the tens of thousands or even hundreds of thousands, which precludes methods that require a Hessian matrix of second-order derivatives (such as Newton's method). For similar reasons, gradient methods are the de facto standard in deep learning.

For the curious: the large number of parameters arises as follows. You can divide the 10x10 draughts board into 4 x 4 overlapping regions of 4x4 squares each (the overlap is 2 squares in each case). Each 4x4 region has 8 valid squares (draughts is played on the dark squares only), each of which can take on 3 values (black, white or empty; kings are disregarded for positional patterns). Then one can compute an index in the range 1 ... 3^8 for each of the 16 regions.

In the eval, one simply computes the indices for each of the 16 patterns (depending on the board occupancy for the position being evaluated), and then one does 16 table lookups to get the corresponding parameter value. This is all fitted using either MSE or MLE, starting from initial values equal to zero. Some L1 or L2 regularization can also be used.
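The index computation and lookup can be sketched as follows (Python; the region layout here is a hypothetical placeholder, not the actual overlapping 4x4 layout, and indices are 0-based, i.e. 0 .. 3^8 - 1):

Code: Select all

```python
import random

NUM_REGIONS, SQUARES_PER_REGION = 16, 8
TABLE_SIZE = 3 ** SQUARES_PER_REGION  # 6561 entries per region

# Hypothetical region definitions: each region lists 8 board squares.
# A real engine would use the overlapping 4x4 windows described above.
regions = [[r * 8 + i for i in range(8)] for r in range(NUM_REGIONS)]

# One weight table per region, fitted offline (here: initial values of zero).
weights = [[0.0] * TABLE_SIZE for _ in range(NUM_REGIONS)]

def region_index(board, squares):
    """Base-3 index from the contents (0=empty, 1=white, 2=black) of 8 squares."""
    idx = 0
    for sq in squares:
        idx = idx * 3 + board[sq]
    return idx

def evaluate(board):
    """16 table lookups, one per region, summed."""
    return sum(weights[r][region_index(board, regions[r])]
               for r in range(NUM_REGIONS))

board = [random.choice([0, 1, 2]) for _ in range(128)]
print(evaluate(board))  # 0.0 with the all-zero initial weights
```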

This type of evaluation function was pioneered in draughts by Fabien Letouzey's program Scan (3-time and reigning champion of the Computer Olympiad). Other patterns are of course also possible.
Daniel Shawul
Posts: 4185
Joined: Tue Mar 14, 2006 11:34 am
Location: Ethiopia

Re: tuning via maximizing likelihood

Post by Daniel Shawul »

We still need to determine A since d/dx log( L(Ax+b) ) = (1 - L(Ax+b)) * A, and this would mean using automatic differentiation via operator overloading or something like that. I was hoping there would be a surrogate minimizer of the ML objective similar to the one for the BT model, but I now see that this is impossible, at least without keeping track of factors for each parameter inside eval -- which I wanted to avoid.
jdart
Posts: 4366
Joined: Fri Mar 10, 2006 5:23 am
Location: http://www.arasanchess.org

Re: tuning via maximizing likelihood

Post by jdart »

Go programs also have gigantic numbers of evaluation parameters.

--Jon
Daniel Shawul
Posts: 4185
Joined: Tue Mar 14, 2006 11:34 am
Location: Ethiopia

Re: tuning via maximizing likelihood

Post by Daniel Shawul »

There was some typo in the sign of the draw score which I corrected below.

score is evaluation of position in centi-pawns relative to the side to move.
home_score is the score of being the side to move (not necessarily white)
draw_score is the threshold score of getting a draw

Rao-Kupper

Code: Select all

p(win:score) = logistic(score + home_score - draw_score) 
p(loss:score) = logistic(-score - home_score - draw_score) 
p(draw:score) = 1 - p(win:score) - p(loss:score) 
Glenn-David

Code: Select all

p(win:score) = gaussian(score + home_score - draw_score) 
p(loss:score) = gaussian(-score - home_score - draw_score) 
p(draw:score) = 1 - p(win:score) - p(loss:score) 
Davidson

Code: Select all

draw_gamma = score_to_gamma(draw_score)
f = draw_gamma * sqrt(logistic(score + home_score) *  logistic(-score - home_score) ) 
p(win:score) = logistic(score + home_score) / (1 + f) 
p(loss:score) = logistic(-score - home_score) / (1 + f) 
p(draw:score) = 1 - p(win:score) - p(loss:score) 
where:

Code: Select all

logistic(score) = 1 / (1 + pow(10.0, -score / 400.0))
gaussian(score) = (1 + erf(score / 400.0)) / 2
score_to_gamma(score) = pow(10.0, score / 400.0)
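Here is a small Python transcription of the three models (my own sketch; I use the increasing Elo logistic 1/(1+10^(-s/400)), so that a larger score gives a larger win probability, and likewise an increasing erf):

Code: Select all

```python
import math

def logistic(s):
    # Elo-style logistic, increasing in s
    return 1.0 / (1.0 + 10.0 ** (-s / 400.0))

def gaussian(s):
    return (1.0 + math.erf(s / 400.0)) / 2.0

def score_to_gamma(s):
    return 10.0 ** (s / 400.0)

def rao_kupper(score, home_score, draw_score):
    w = logistic(score + home_score - draw_score)
    l = logistic(-score - home_score - draw_score)
    return w, 1.0 - w - l, l

def glenn_david(score, home_score, draw_score):
    w = gaussian(score + home_score - draw_score)
    l = gaussian(-score - home_score - draw_score)
    return w, 1.0 - w - l, l

def davidson(score, home_score, draw_score):
    pw = logistic(score + home_score)
    pl = logistic(-score - home_score)
    f = score_to_gamma(draw_score) * math.sqrt(pw * pl)
    return pw / (1.0 + f), 1.0 - (pw + pl) / (1.0 + f), pl / (1.0 + f)

for model in (rao_kupper, glenn_david, davidson):
    w, d, l = model(100.0, 0.0, 150.0)
    print(model.__name__, round(w, 3), round(d, 3), round(l, 3))
```

With home_score = 0 and a positive draw_score, all three give symmetric win/loss probabilities at score 0, and the three probabilities always sum to 1 by construction.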
Daniel Shawul
Posts: 4185
Joined: Tue Mar 14, 2006 11:34 am
Location: Ethiopia

maximum a posteriori

Post by Daniel Shawul »

Using a Bayesian approach, we can add a prior distribution over the parameters, which acts as a regularizer for the ML estimates. So we do a maximum a posteriori (MAP) estimate rather than a maximum likelihood (ML) one. So far, I have not been able to get better results than my existing parameters for the eval terms, so it makes sense to keep that momentum (or absence of it) going for a while by introducing a prior distribution over the evaluation parameters (theta).

Code: Select all

P(theta:results) ∝ P(results:theta)*P(theta) = likelihood*prior
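Concretely, with an independent Gaussian prior on each parameter, MAP reduces to ML plus an L2 penalty. A minimal sketch (my own illustration, assuming a logistic outcome model):

Code: Select all

```python
import numpy as np

# MAP sketch: logistic outcome model with an independent Gaussian prior
# N(mu_i, sigma^2) on each parameter. Maximizing the posterior is the same
# as minimizing  negative log-likelihood + sum((theta - mu)^2) / (2 sigma^2),
# i.e. an L2 penalty pulling theta toward the prior means mu (for example,
# one's existing hand-tuned eval values).

def neg_log_posterior(theta, x, y, mu, sigma):
    p = np.clip(1.0 / (1.0 + np.exp(-(x @ theta))), 1e-12, 1 - 1e-12)
    nll = -np.sum(y * np.log(p) + (1 - y) * np.log(1 - p))
    penalty = np.sum((theta - mu) ** 2) / (2.0 * sigma ** 2)
    return nll + penalty
```

A small sigma keeps the fit close to the existing parameters; letting sigma go to infinity recovers plain ML.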
Rein Halbersma
Posts: 741
Joined: Tue May 22, 2007 11:13 am

Re: tuning via maximizing likelihood

Post by Rein Halbersma »

Michel wrote: It would nice to understand this for evaluation functions depending on more parameters. The model would be a "true" (unknown) evaluation function E(x1,...,xn) predicting the correct w/l ratios (using the logistic function) depending on measurable parameters x1,..,xn and a heuristic evaluation function H(x1,...,xn).

I guess in first approximation E and H can be assumed to be linear in x1,...,xn. The challenge is to match up the coefficients of H with those of E.

In a realistic analysis draws should also be incorporated. The presence of draws may smooth out the effect of evaluation errors.
OK, I did some more thorough analyses. Below is a code snippet in R (Works On My Machine).

What I do is generate a 4-term evaluation function. Two terms are 0/1 variables, generated by a binomial process. The other two are normally distributed variables.

I fix the "true" eval parameters to be -0.6190392 0.4054651 0.1000000 -0.1500000 respectively. I then generate linear predictors (called "eta", as in the literature on generalized linear models) and add logistic noise to obtain a latent variable. I encode the outcome as 0, 0.5 or 1 depending on which interval between the theta thresholds the latent variable falls into. I also fix the "true" theta parameters to -1.386294 1.386294 (this corresponds to a 60% draw rate).

After that, I run two models: 1) Non-Linear Least Squares, using a continuous LHS (y1 in my program) that takes the values 0, 0.5 and 1, regressed on the eval features; 2) a Cumulative Link Model, using the outcome as a 3-level factor (y3 in my program is its one-hot encoding), regressed on the eval features while also fitting the draw threshold parameters. I extract the fitted coefficients and compute the linear predictors (the evals used during search).

It turns out that there is perfect linear correlation between the two evals!! The NLS estimates are about 65% of the size of the CLM estimates, shrunk in order to "squeeze" the eval score towards the value of a draw. The CLM estimates don't have to shrink, because that model already has the theta parameters as generalized intercepts.

Code: Select all

library(ordinal)
# make this reproducible 
set.seed(47110815) 

# Generate N logistically distributed values 
N = 1e5
x = cbind(rbinom(N, 1, .5), rbinom(N, 1, .5), rnorm(N), rnorm(N))
(w_star = c(qlogis(.35), qlogis(.6), .1, -.15))
theta_star = c(qlogis(.2), qlogis(.8))
eta = as.vector(x %*% w_star)
y_latent = eta + rlogis(N, 0, 1)

# 0/0.5/1 encoded
y1 = ifelse(y_latent > theta_star[2], 1, 
  ifelse(y_latent > theta_star[1], 1/2, 0)
)

# one-hot encoded of length 3
y3 = apply(cbind(y1 == 0, y1 == 1/2 , y1 == 1), c(1,2), as.integer)

# Use R library functions to get the most accurate results 
w0 = rep(0, length(w_star))

(nls1.est = nls(y1 ~ plogis(x %*% w), start = list(w = w0)))
(-logLik(nls1.est) / N) 
(clm1.est = clm(factor(y1) ~ x))
(-logLik(clm1.est) / N)

nls1.eta = x %*% coef(nls1.est)
clm1.eta = x %*% coef(clm1.est)[-(1:2)]

plot(clm1.eta, nls1.eta)
summary(lm(nls1.eta ~ clm1.eta))
Output:

Code: Select all

> (nls1.est = nls(y1 ~ plogis(x %*% w), start = list(w = w0)))
Nonlinear regression model
  model: y1 ~ plogis(x %*% w)
   data: parent.frame()
      w1       w2       w3       w4 
-0.39761  0.25896  0.06154 -0.10136 
 residual sum-of-squares: 9977

Number of iterations to convergence: 2 
Achieved convergence tolerance: 5.337e-06
> (-logLik(nls1.est) / N) 
'log Lik.' 0.2664951 (df=5)
> (clm1.est = clm(factor(y1) ~ x))
formula: factor(y1) ~ x

 link  threshold nobs  logLik    AIC       niter max.grad cond.H 
 logit flexible  1e+05 -94636.09 189284.17 6(0)  1.47e-12 1.1e+01

Coefficients:
      x1       x2       x3       x4 
-0.60832  0.40026  0.09369 -0.15488 

Threshold coefficients:
 0|0.5  0.5|1 
-1.380  1.387 
> (-logLik(clm1.est) / N)
'log Lik.' 0.9463609 (df=6)
> 
> nls1.eta = x %*% coef(nls1.est)
> clm1.eta = x %*% coef(clm1.est)[-(1:2)]
> 
> plot(clm1.eta, nls1.eta)
> summary(lm(nls1.eta ~ clm1.eta))

Call:
lm(formula = nls1.eta ~ clm1.eta)

Residuals:
       Min         1Q     Median         3Q        Max 
-0.0041136 -0.0010737  0.0000032  0.0010759  0.0034799 

Coefficients:
              Estimate Std. Error t value Pr(>|t|)    
(Intercept) -1.467e-03  4.136e-06  -354.7   <2e-16 ***
clm1.eta     6.523e-01  9.861e-06 66150.5   <2e-16 ***
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

Residual standard error: 0.001266 on 99998 degrees of freedom
Multiple R-squared:      1,	Adjusted R-squared:      1 
F-statistic: 4.376e+09 on 1 and 99998 DF,  p-value: < 2.2e-16
For completeness (code not shown), I also reproduced the above R library functions with hand-written mean-squared error and log-likelihood formulas, and the outcomes agreed perfectly.
Michel
Posts: 2272
Joined: Mon Sep 29, 2008 1:50 am

Re: tuning via maximizing likelihood

Post by Michel »

Daniel Shawul wrote:We still need to determine A since d/dx log( L(Ax+b) ) = (1 - L(Ax+b)) * A, and this would mean using automatic differentiation via operator overloading or something like that. I was hoping there would be a surrogate minimizer of the ML objective similar to the one for the BT model, but I now see that this is impossible, at least without keeping track of factors for each parameter inside eval -- which I wanted to avoid.
It may be simpler. If the evaluation function is linear in a parameter (are there good examples where it is not?) you can trivially differentiate it numerically with respect to that parameter. The result will be exact.

In this way one can even compute the Hessian of the objective function. Of course if you have a lot of parameters the Hessian will be a big matrix. But this is not such a big issue on modern computers. For example numpy has no problem whatsoever inverting a 1000x1000 matrix.

I guess if you need 36*3^8 parameters, as Rein was reporting for draughts, then using a second order method seems out of the question.
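For a linear eval the finite difference is indeed exact, and independent of the step size. A minimal Python sketch (the parameter and feature values are made-up dyadic numbers, so even floating point introduces no rounding here):

Code: Select all

```python
# If eval(params) is linear in each parameter, a one-sided finite difference
# with any step h recovers the coefficient exactly (no truncation error).
def evaluate(params, features):
    # linear eval: dot product of parameters and position features
    return sum(p * f for p, f in zip(params, features))

def numeric_partial(params, features, i, h=1.0):
    bumped = list(params)
    bumped[i] += h
    return (evaluate(bumped, features) - evaluate(params, features)) / h

features = [2.0, -1.0, 0.5]
params = [0.25, 0.5, -0.75]
grads = [numeric_partial(params, features, i) for i in range(3)]
print(grads)  # → [2.0, -1.0, 0.5], the feature values, recovered exactly
```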
Ideas=science. Simplification=engineering.
Without ideas there is nothing to simplify.
Michel
Posts: 2272
Joined: Mon Sep 29, 2008 1:50 am

Re: maximum a posteriori

Post by Michel »

Daniel wrote:Using a Bayesian approach, we can add a prior distribution over the parameters, which acts as a regularizer for the ML estimates. So we do a maximum a posteriori (MAP) estimate rather than a maximum likelihood (ML) one. So far, I have not been able to get better results than my existing parameters for the eval terms, so it makes sense to keep that momentum (or absence of it) going for a while by introducing a prior distribution over the evaluation parameters (theta).
I do not think putting on a prior will make much of a difference, except perhaps for the speed of convergence.

What might be an interesting experiment is to make the draw_elo parameter (in the BE model) itself a linear combination of the features (i.e. it would vary from position to position). A model with a constant draw_elo parameter is obviously incorrect as in the endgame the expected draw rate should be much higher.

If implemented literally this would roughly double the number of parameters. There is an increased risk of over fitting however so one could cut this down by only including features that are expected to have an influence on the draw ratio, like game phase and perhaps king safety.
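As an illustration of the idea (my own sketch, with made-up coefficients and a natural-log logistic rather than any particular engine's scaling): let the draw threshold grow linearly with game phase.

Code: Select all

```python
import math

def logistic(z):
    return 1.0 / (1.0 + math.exp(-z))

def probs(score, phase, d0=0.5, d1=1.5):
    # the draw parameter is itself a linear function of a feature:
    # phase runs from 0 (opening) to 1 (endgame)
    draw = d0 + d1 * phase
    w = logistic(score - draw)
    l = logistic(-score - draw)
    return w, 1.0 - w - l, l

# at equal score, the draw probability rises toward the endgame
print(round(probs(0.0, 0.0)[1], 3), round(probs(0.0, 1.0)[1], 3))
```

Fitting d0 and d1 alongside the eval features would add only a few parameters when restricted to draw-relevant features like game phase.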
Ideas=science. Simplification=engineering.
Without ideas there is nothing to simplify.
jdart
Posts: 4366
Joined: Fri Mar 10, 2006 5:23 am
Location: http://www.arasanchess.org

Re: tuning via maximizing likelihood

Post by jdart »

r * log( logistic(score) ) + (1 - r) * log( 1 - logistic(score) )
This may have been covered, at least indirectly, but the issue I see with this function and draws is that the loss is not zero when the predicted score is 0.5 and the actual score is 0.5. The loss for a draw does reach its minimum there, but that minimum is not zero.

Each draw increases the objective even when it is predicted correctly, and by more than most won and lost positions contribute (since many of those will be predicted more or less accurately).
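This is easy to check numerically: a perfectly predicted draw still contributes ln 2 ≈ 0.693 per position to the negated objective, whereas a confidently and correctly predicted win contributes almost nothing.

Code: Select all

```python
import math

def cross_entropy(r, p):
    # negative of the quoted objective term for one position
    return -(r * math.log(p) + (1 - r) * math.log(1 - p))

print(round(cross_entropy(0.5, 0.5), 4))   # perfectly predicted draw → 0.6931
print(round(cross_entropy(1.0, 0.99), 4))  # well-predicted win → 0.0101
```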

--Jon