tuning via maximizing likelihood

Discussion of chess software programming and technical issues.

Moderators: hgm, Rebel, chrisw

Michel
Posts: 2272
Joined: Mon Sep 29, 2008 1:50 am

Re: tuning via maximizing likelihood

Post by Michel »

Another important issue is "model mismatch", that is, when the true evaluation function (predicting the true w/l probabilities via the logistic function) and the heuristic evaluation function have different forms (e.g. E(x)=a+bx, H(x)=cx, where x is a scalar property of the position).

Both ML and LS will come up with a value for c, but the two values will in general differ. Unfortunately there is no objective criterion (yet?) to decide which value is better. So this issue cannot be handled theoretically.
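A minimal Python sketch of this mismatch (the constants a=0.5, b=1.5 and the uniform x are made up): win/loss data is generated from E(x)=a+bx, and the mismatched H(x)=cx is then fitted by both criteria; the two values of c generally differ.

Code: Select all

import numpy as np
from scipy.optimize import minimize_scalar

rng = np.random.default_rng(0)
logistic = lambda z: 1.0 / (1.0 + np.exp(-z))

# "true" model E(x) = a + b*x generates the win/loss outcomes
a, b = 0.5, 1.5
x = rng.uniform(-2.0, 2.0, 10000)
y = (rng.random(10000) < logistic(a + b * x)).astype(float)  # 1 = win, 0 = loss

# mismatched heuristic H(x) = c*x, fitted by ML (cross-entropy) and by LS
ml = lambda c: -np.mean(y * np.log(logistic(c * x)) + (1 - y) * np.log(logistic(-c * x)))
ls = lambda c: np.mean((y - logistic(c * x)) ** 2)

c_ml = minimize_scalar(ml, bounds=(0.0, 10.0), method='bounded').x
c_ls = minimize_scalar(ls, bounds=(0.0, 10.0), method='bounded').x
print(c_ml, c_ls)  # the two criteria generally settle on different values of c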
Ideas=science. Simplification=engineering.
Without ideas there is nothing to simplify.
Michel
Posts: 2272
Joined: Mon Sep 29, 2008 1:50 am

Re: tuning via maximizing likelihood

Post by Michel »

I did some reading on Google and it seems that the log likelihood objective function (usually called cross-entropy) is actually standard in neural network training where it has been observed that it often has better convergence properties than least squares (as in Rein's toy experiment). So it seems certainly worthwhile to experiment with it for evaluation tuning...
Ideas=science. Simplification=engineering.
Without ideas there is nothing to simplify.
Daniel Shawul
Posts: 4185
Joined: Tue Mar 14, 2006 11:34 am
Location: Ethiopia

Re: tuning via maximizing likelihood

Post by Daniel Shawul »

I am going to plug in draw and home-advantage parameters to be tuned along with the evaluation parameters. If the evaluation is asymmetric and has a side-to-move bonus, we don't need the home-advantage parameter. But the draw parameter could be important if, for example, scores in the range [-50,+50] centi-pawns mostly lead to draws. The ML objective with these new parameters is then:

Code: Select all

obj = 1/N sum_i=1,N ( -log( P(result_i: score_i) ) )
where result_i=win/loss/draw

For the three Elo models I have in the Bopo rating tool (Glenn-David, Rao-Kapper and Davidson) the formulas are:

Rao-Kapper

Code: Select all

p(win:score) = logistic(score + home_score - draw_score)
p(loss:score) = logistic(-score - home_score - draw_score)
p(draw:score) = 1 - p(win:score) - p(loss:score)
Glenn-David

Code: Select all

p(win:score) = gaussian(score + home_score - draw_score)
p(loss:score) = gaussian(-score - home_score - draw_score)
p(draw:score) = 1 - p(win:score) - p(loss:score)
Davidson

Code: Select all

f = draw_gamma * sqrt(logistic(score + home_score) *  logistic(-score - home_score) )
p(win:score) = logistic(score + home_score) / (1 + f)
p(loss:score) = logistic(-score - home_score) / (1 + f)
p(draw:score) = 1 - p(win:score) - p(loss:score)
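A minimal Python sketch of this objective under the Rao-Kapper form above; the Elo-scale logistic and the 1/0.5/0 coding of result_i are assumptions made for the sketch.

Code: Select all

import numpy as np

def logistic(d):
    # Elo-scale logistic, 1/(1 + 10^(-d/400)); the scale just has to match
    # the units used for score, home_score and draw_score
    return 1.0 / (1.0 + 10.0 ** (-d / 400.0))

def rao_kapper_probs(score, home_score, draw_score):
    # win/draw/loss probabilities as in the formulas above
    p_win  = logistic(score + home_score - draw_score)
    p_loss = logistic(-score - home_score - draw_score)
    p_draw = 1.0 - p_win - p_loss
    return p_win, p_draw, p_loss

def objective(scores, results, home_score, draw_score):
    # obj = 1/N * sum_i -log( P(result_i : score_i) ), results coded 1 / 0.5 / 0
    total = 0.0
    for s, r in zip(scores, results):
        p_win, p_draw, p_loss = rao_kapper_probs(s, home_score, draw_score)
        p = p_win if r == 1 else (p_draw if r == 0.5 else p_loss)
        total -= np.log(p)
    return total / len(scores)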
Daniel
Daniel Shawul
Posts: 4185
Joined: Tue Mar 14, 2006 11:34 am
Location: Ethiopia

Re: tuning via maximizing likelihood

Post by Daniel Shawul »

Michel wrote:I did some reading on Google and it seems that the log likelihood objective function (usually called cross-entropy) is actually standard in neural network training where it has been observed that it often has better convergence properties than least squares (as in Rein's toy experiment). So it seems certainly worthwhile to experiment with it for evaluation tuning...
Not only that, the LS method for neural network training is flawed as it does not yield ML estimates of parameters. See this 1990 paper http://pubmedcentralcanada.ca/pmcc/arti ... 4-0306.pdf
Michel
Posts: 2272
Joined: Mon Sep 29, 2008 1:50 am

Re: tuning via maximizing likelihood

Post by Michel »

Rao-Kapper
Code:

p(win:score) = logistic(score + home_score - draw_score)
p(loss:score) = logistic(-score - home_score - draw_score)
p(draw:score) = 1 - p(win:score) - p(loss:score)

Glenn-David
Code:

p(win:score) = gaussian(score + home_score - draw_score)
p(loss:score) = gaussian(-score - home_score - draw_score)
p(draw:score) = 1 - p(win:score) - p(loss:score)

Davidson
Code:

f = draw_gamma * sqrt(logistic(score + home_score) * logistic(-score - home_score) )
p(win:score) = logistic(score + home_score) / (1 + f)
p(loss:score) = logistic(-score - home_score) / (1 + f)
p(draw:score) = 1 - p(win:score) - p(loss:score)
I assume "score_i" is the result of the evaluation function for position_i? You would also need a proportionality constant to translate score into elo. Unless of course you would do away with the convention that the evaluation function is expressed in pawns...
Ideas=science. Simplification=engineering.
Without ideas there is nothing to simplify.
Daniel Shawul
Posts: 4185
Joined: Tue Mar 14, 2006 11:34 am
Location: Ethiopia

Re: tuning via maximizing likelihood

Post by Daniel Shawul »

I assume "score_i" is the result of the evaluation function for position_i? You would also need a proportionality constant to translate score into elo. Unless of course you would do away with the convention that the evaluation function is expressed in pawns...
Yes, but for most engines that use centi-pawns, 1 Pawn = 100 centi-pawns = 100 Elo according to: Pawn advantage vs Elo, so I didn't need to use a scaling factor.
I tested the models using the default bayeselo values eloDraw=97 and eloHome=32, and all seem to give reasonable piece values. But I am afraid that eloDraw=97 could be too high for tuning purposes. These are kept constant for now but could be tuned along with the other parameters.

Do you see a way we could use Minorization-Maximization instead of conjugate gradient for optimizing parameters here? Remi used it, for example, to select the best move using a generalized Bradley-Terry model, where the features (patterns) compete with one another https://www.remi-coulom.fr/Amsterdam200 ... tterns.pdf. Each feature is considered a player with an Elo. However, here we have patterns that add up (rather than compete), so I do not think MM is applicable.
User avatar
Evert
Posts: 2929
Joined: Sat Jan 22, 2011 12:42 am
Location: NL

Re: tuning via maximizing likelihood

Post by Evert »

Michel wrote:You would also need a proportionality constant to translate score into elo. Unless of course you would do away with the convention that the evaluation function is expressed in pawns...
I find that expressing the evaluation function in (nominal) pawn units, and using logistic(x)=1/(1+exp(-x)), gives me a scale factor of 1. This is for a "clean" evaluation function that was written and tuned from scratch. I think others have found similar results (personally, I found this unexpected).
Michel
Posts: 2272
Joined: Mon Sep 29, 2008 1:50 am

Re: tuning via maximizing likelihood

Post by Michel »

daniel wrote:Yes, but for most engines that use centi-pawns, 1 Pawn = 100 centi-pawns = 100 Elo according to: Pawn advantage vs Elo, so I didn't need to use a scaling factor.
Ah yes! I had forgotten this.
Do you see a way we could use Minorization-Maximization instead of conjugate gradient for optimizing parameters here?
Not immediately, I think. But MM treats one parameter at a time, and perhaps that approach works here too, provided the evaluation function is linear in the parameters.

For simplicity I will use the BE model. Let x be a parameter. Treating all other parameters as constants, the function to be optimized has the form

sum_i log L(A_i x + B_i)   (*)

where L(x)=1/(1+exp(-x)) is the logistic function. Now L satisfies the functional equation L(x)+L(-x)=1 and the differential-functional equation L'(x)=L(x)L(-x). So for example (log L(x))'= L(-x)=1-L(x). Therefore the first and second derivatives of (*) with respect to x are trivial to compute and one can use Newton (perhaps doing only a single step before moving on to the next parameter).
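A minimal Python sketch of that single-parameter Newton step; the arrays A and B holding the coefficients A_i and B_i are just illustrative names.

Code: Select all

import numpy as np

def logistic(z):
    return 1.0 / (1.0 + np.exp(-z))

def newton_step(A, B, x):
    # one Newton step for maximizing sum_i log L(A_i*x + B_i) in x,
    # using (log L(z))' = L(-z) and L'(z) = L(z)*L(-z)
    z = A * x + B
    grad = np.sum(A * logistic(-z))                     # first derivative w.r.t. x
    hess = -np.sum(A * A * logistic(z) * logistic(-z))  # second derivative, always <= 0
    return x - grad / hess                              # moves uphill since hess < 0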

PS. I am not sure if this is relevant at all, but if the initial x is large in absolute value the derivative of (*) is nearly constant, and then Newton is perhaps not the best method (one might get the "bouncing around" behaviour). For large absolute values of x one can regard (*) as being piecewise linear, and in that case it is perhaps best to use that approximation.
Ideas=science. Simplification=engineering.
Without ideas there is nothing to simplify.
Rein Halbersma
Posts: 741
Joined: Tue May 22, 2007 11:13 am

Re: tuning via maximizing likelihood

Post by Rein Halbersma »

Michel wrote:I thought it would be nice to analyze your toy example theoretically.

Assume x is the value of the evaluation function (suitably scaled) and let p be the measured win ratio (no draws in the example). Let L(-) be the logistic function.

In ML method we optimize

-(p*log L(x)+(1-p)*log L(-x))

For x-->+infty this behaves as (1-p)*x and for x-->-infty it behaves as -p*x.

So slant asymptotes (gradient = constant).
It is worthwhile to note that generalizing from a continuous, Gaussian-distributed variable to a discrete, logistically distributed one is very tricky and, depending on the route, can be misleading (a general point; your exposition above is of course correct).

E.g. the log-likelihood for outcomes that can take on C categories is most conveniently written as

Code: Select all

LogL = 1/N sum_{i=1}^N sum_{j=1}^C delta(Y_i = j) log(Prob(Y_i = j | X_i))
Here, only 1 term per observation is active. For binomially distributed variables this can also be written as:

Code: Select all

LogL = 1/N sum_{i=1}^N ((1 - Y_i) * log(Prob(0|X_i)) + Y_i * log(Prob(1|X_i)))
Again, only 1 term per observation is active.

However, the latter form is also numerically valid when plugging in Y_i = 1/2, which suggests that it could incorporate draws as well, by treating a draw as the average of a win and a loss. But this is a false generalization, in the sense that it is no longer a bona fide log-likelihood. Instead, one needs to use the former shape of the log-likelihood with C=3, using e.g. an ordered logit model with 3 - 1 = 2 threshold parameters to generate triples of probabilities (of which only 1 contributes to the log-likelihood).
In the LS method we optimize

p*L(-x)^2+(1-p)*L(x)^2

For x-->+infty this tends to 1-p and for x-->-infty it tends to p.

So horizontal asymptotes (gradient=0).
Again, also for the MSE it can be tricky to generalize from the continuous to the discrete case. For outcomes taking on C categories, the general formula is

Code: Select all

MSE = 1/N sum_{i=1}^N sum_{j=1}^C (delta(Y_i = j) - Prob(Y_i = j|X_i))^2
Here, all C terms contribute to the MSE. For binomially distributed variables, the two terms are identical and this can be written as

Code: Select all

MSE = 1/N sum_{i=1}^N { (delta(Y_i = 0) - Prob(0|X_i))^2 + (delta(Y_i = 1) - Prob(1|X_i))^2 }
    = 1/N sum_{i=1}^N { ((1 - delta(Y_i = 1)) - (1 - Prob(1|X_i)))^2 + (delta(Y_i = 1) - Prob(1|X_i))^2 }
    = 2/N sum_{i=1}^N (delta(Y_i = 1) - Prob(1|X_i))^2
    = 2/N sum_{i=1}^N (Y_i - Prob(1|X_i))^2
Because this form is almost identical (apart from a factor of 2) to the continuous case, it is easy to forget that it is a special case for the C-valued categorical outcomes!
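A quick numeric check of this reduction, with made-up 0/1 outcomes and probabilities:

Code: Select all

import numpy as np

Y  = np.array([1.0, 0.0, 1.0, 1.0, 0.0])   # made-up 0/1 outcomes
P1 = np.array([0.7, 0.2, 0.9, 0.4, 0.5])   # made-up Prob(1|X_i)

both_terms = np.mean(((1 - Y) - (1 - P1)) ** 2 + (Y - P1) ** 2)  # sum over both categories
reduced    = 2 * np.mean((Y - P1) ** 2)                          # the reduced form
assert np.isclose(both_terms, reduced)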

Again, one could plug in Y_i = 1/2 in the latter form for the MSE, but it would no longer be a bona fide MSE for a probability distribution. Instead, one needs to generate probability triples and use the former form for the MSE of a 3-valued outcome.

TL;DR: both MSE and LogL can be generalized from the 2-valued to the 3-valued case, but plugging in Y = 1/2 in the 2-valued formulas is not a good idea (even if it is well-behaved numerically). It is ad hoc and does not follow from probability theory. Instead, use the proper 3-valued probabilities, e.g. from an ordered logit model.
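A minimal sketch of such an ordered logit for loss/draw/win, with two threshold parameters t_loss < t_win (the names and the 0/1/2 coding are illustrative):

Code: Select all

import numpy as np

def logistic(z):
    return 1.0 / (1.0 + np.exp(-z))

def wdl_probs(score, t_loss, t_win):
    # ordered logit with C = 3 categories and C - 1 = 2 thresholds (t_loss < t_win)
    cum_loss = logistic(t_loss - score)   # P(Y <= loss)
    cum_draw = logistic(t_win - score)    # P(Y <= draw)
    return cum_loss, cum_draw - cum_loss, 1.0 - cum_draw   # P(loss), P(draw), P(win)

def log_likelihood(scores, results, t_loss, t_win):
    # results coded 0 = loss, 1 = draw, 2 = win; only the observed category contributes
    total = 0.0
    for s, r in zip(scores, results):
        total += np.log(wdl_probs(s, t_loss, t_win)[r])
    return total / len(scores)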
Rein Halbersma
Posts: 741
Joined: Tue May 22, 2007 11:13 am

Re: tuning via maximizing likelihood

Post by Rein Halbersma »

Michel wrote: While I also would like to believe that ML is superior to least squares I am not sure your test shows this. I think all you have been measuring is the performance of R's cg method for optimizing certain specific 1-variable functions.

However this does not show that other optimization methods will behave in the same way! For example if an oracle had told the optimizer to express the objective functions in terms of plogis(b) then the least squares objective function would have converged in a single step!
In general, you are correct that transforming the RHS of the non-linear equation Y = f(X * b) into the shape f^inv(Y) = X * b will transform the problem to linear least squares, for which a closed form solution exists.

However, for 0/1 valued outcomes, this transformation is ill-formed, because you get (for logistic): log(Y/(1-Y)) = X * b (this is why logistic regression is sometimes called log-odds regression). Taking Y = 0 or 1, shows that this precludes least squares directly. So the problem is truly non-linear.