Another important issue is "model mismatch": the true evaluation function (predicting the true win/loss probabilities via the logistic function) and the heuristic evaluation function have different functional forms (e.g. E(x) = a + b*x versus H(x) = c*x, where x is a scalar property of the position).
Both ML and LS will come up with a value for c, and the two values will in general differ. Unfortunately there is no objective criterion (yet?) to decide which value is better, so this issue cannot be handled theoretically.
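As a sketch of how the mismatch plays out numerically (all constants, the grid, and the function names below are illustrative, not from the thread): generate win probabilities from the "true" model logistic(a + b*x), then fit the mismatched heuristic H(x) = c*x under both criteria and compare the two values of c.

```python
import math

def logistic(z):
    return 1.0 / (1.0 + math.exp(-z))

# "True" model E(x) = a + b*x; mismatched heuristic H(x) = c*x.
a, b = 0.5, 1.0                             # illustrative constants
xs = [-2.0 + 0.1 * k for k in range(41)]
ps = [logistic(a + b * x) for x in xs]      # true win probabilities

def ml_objective(c):
    # negative log-likelihood (cross-entropy) of the heuristic
    total = 0.0
    for x, p in zip(xs, ps):
        q = logistic(c * x)
        total -= p * math.log(q) + (1.0 - p) * math.log(1.0 - q)
    return total

def ls_objective(c):
    # expected squared error between outcome and predicted win probability
    total = 0.0
    for x, p in zip(xs, ps):
        q = logistic(c * x)
        total += p * (1.0 - q) ** 2 + (1.0 - p) * q ** 2
    return total

grid = [0.01 * k for k in range(1, 501)]    # crude 1-D grid search over c
c_ml = min(grid, key=ml_objective)
c_ls = min(grid, key=ls_objective)
```

The LS gradient damps each residual by an extra factor L'(c*x), so under mismatch the two criteria generally settle on different values of c, which is exactly the ambiguity described above.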
tuning via maximizing likelihood
Moderators: hgm, Rebel, chrisw
-
- Posts: 2272
- Joined: Mon Sep 29, 2008 1:50 am
Re: tuning via maximizing likelihood
Ideas=science. Simplification=engineering.
Without ideas there is nothing to simplify.
-
- Posts: 2272
- Joined: Mon Sep 29, 2008 1:50 am
Re: tuning via maximizing likelihood
I did some reading on Google and it seems that the log likelihood objective function (usually called cross-entropy) is actually standard in neural network training, where it has been observed to often have better convergence properties than least squares (as in Rein's toy experiment). So it certainly seems worthwhile to experiment with it for evaluation tuning...
Ideas=science. Simplification=engineering.
Without ideas there is nothing to simplify.
-
- Posts: 4185
- Joined: Tue Mar 14, 2006 11:34 am
- Location: Ethiopia
Re: tuning via maximizing likelihood
I am going to plug in draw and home-advantage parameters to be tuned along with the evaluation parameters. If the evaluation is asymmetric and has a side-to-move bonus, we don't need the home-advantage parameter. But the draw parameter could be important if, for example, a score within [-50,+50] centi-pawns most likely leads to a draw. The ML objective with these new parameters is then:

Code: Select all
obj = 1/N sum_i=1,N ( -log( P(result_i : score_i) ) )

where result_i = win/loss/draw.

For the three elo models I have in the Bopo rating tool (Glenn-David, Rao-Kapper and Davidson) the formulas are

Rao-Kapper

Code: Select all
p(win:score)  = logistic(score + home_score - draw_score)
p(loss:score) = logistic(-score - home_score - draw_score)
p(draw:score) = 1 - p(win:score) - p(loss:score)

Glenn-David

Code: Select all
p(win:score)  = gaussian(score + home_score - draw_score)
p(loss:score) = gaussian(-score - home_score - draw_score)
p(draw:score) = 1 - p(win:score) - p(loss:score)

Davidson

Code: Select all
f = draw_gamma * sqrt( logistic(score + home_score) * logistic(-score - home_score) )
p(win:score)  = logistic(score + home_score) / (1 + f)
p(loss:score) = logistic(-score - home_score) / (1 + f)
p(draw:score) = 1 - p(win:score) - p(loss:score)

Daniel
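A minimal Python sketch of the three models (function names are mine; "gaussian" is assumed to mean the standard normal CDF, and scores are assumed already scaled to rating units). Note the loss terms use a negated draw_score so that the draw probability stays non-negative:

```python
import math

def logistic(z):
    return 1.0 / (1.0 + math.exp(-z))

def gaussian(z):
    # standard normal CDF (assumed meaning of "gaussian" in the post)
    return 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))

def rao_kapper(score, home_score, draw_score):
    p_win = logistic(score + home_score - draw_score)
    # minus draw_score here keeps p_draw >= 0 for draw_score >= 0
    p_loss = logistic(-score - home_score - draw_score)
    return p_win, p_loss, 1.0 - p_win - p_loss

def glenn_david(score, home_score, draw_score):
    p_win = gaussian(score + home_score - draw_score)
    p_loss = gaussian(-score - home_score - draw_score)
    return p_win, p_loss, 1.0 - p_win - p_loss

def davidson(score, home_score, draw_gamma):
    w = logistic(score + home_score)
    l = logistic(-score - home_score)
    f = draw_gamma * math.sqrt(w * l)
    # since w + l = 1, the draw probability works out to f / (1 + f)
    return w / (1.0 + f), l / (1.0 + f), 1.0 - (w + l) / (1.0 + f)
```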
-
- Posts: 4185
- Joined: Tue Mar 14, 2006 11:34 am
- Location: Ethiopia
Re: tuning via maximizing likelihood
Not only that, the LS method for neural network training is flawed as it does not yield ML estimates of the parameters. See this 1990 paper: http://pubmedcentralcanada.ca/pmcc/arti ... 4-0306.pdf

Michel wrote:
I did some reading on Google and it seems that the log likelihood objective function (usually called cross-entropy) is actually standard in neural network training where it has been observed that it often has better convergence properties than least squares (as in Rein's toy experiment). So it seems certainly worthwhile to experiment with it for evaluation tuning...
-
- Posts: 2272
- Joined: Mon Sep 29, 2008 1:50 am
Re: tuning via maximizing likelihood
I assume "score_i" is the result of the evaluation function for position_i? You would also need a proportionality constant to translate score into elo. Unless of course you would do away with the convention that the evaluation function is expressed in pawns...

daniel wrote:
Rao-Kapper
Code:
p(win:score) = logistic(score + home_score - draw_score)
p(loss:score) = logistic(-score - home_score - draw_score)
p(draw:score) = 1 - p(win:score) - p(loss:score)
Glenn-David
Code:
p(win:score) = gaussian(score + home_score - draw_score)
p(loss:score) = gaussian(-score - home_score - draw_score)
p(draw:score) = 1 - p(win:score) - p(loss:score)
Davidson
Code:
f = draw_gamma * sqrt(logistic(score + home_score) * logistic(-score - home_score) )
p(win:score) = logistic(score + home_score) / (1 + f)
p(loss:score) = logistic(-score - home_score) / (1 + f)
p(draw:score) = 1 - p(win:score) - p(loss:score)
Ideas=science. Simplification=engineering.
Without ideas there is nothing to simplify.
-
- Posts: 4185
- Joined: Tue Mar 14, 2006 11:34 am
- Location: Ethiopia
Re: tuning via maximizing likelihood
Yes, but for most engines that use centi-pawns, 1 Pawn = 100 centi-pawns = 100 elo according to Pawn advantage vs Elo, so I didn't need to use a scaling factor.

Michel wrote:
I assume "score_i" is the result of the evaluation function for position_i? You would also need a proportionality constant to translate score into elo. Unless of course you would do away with the convention that the evaluation function is expressed in pawns...
I tested the models using the eloDraw=97 and eloHome=32 default bayeselo values, and all seem to give reasonable piece values. But I am afraid that eloDraw=97 could be too high for tuning purposes. These are kept constant for now but could be tuned along with the parameters.
Do you see a way we could use Minorization-Maximization (MM) instead of conjugate gradient for optimizing the parameters here? Rémi used it, for example, to select the best move with a generalized Bradley-Terry model in which the features (patterns) compete with one another: https://www.remi-coulom.fr/Amsterdam200 ... tterns.pdf. Each feature is treated as a player with an elo. However, here the features add up rather than compete, so I do not think MM is directly applicable.
-
- Posts: 2929
- Joined: Sat Jan 22, 2011 12:42 am
- Location: NL
Re: tuning via maximizing likelihood
I find that expressing the evaluation function in (nominal) pawn units, and using logistic(x)=1/(1+exp(-x)), gives me a scale factor of 1. This is for a "clean" evaluation function that was written and tuned from scratch. I think others have found similar results (personally, I thought this was unexpected).

Michel wrote:
You would also need a proportionality constant to translate score into elo. Unless of course you would do away with the convention that the evaluation function is expressed in pawns...
-
- Posts: 2272
- Joined: Mon Sep 29, 2008 1:50 am
Re: tuning via maximizing likelihood
Ah yes! I had forgotten this.

daniel wrote:
Yes, but for most engines that use centi-pawns, 1 Pawn = 100 centi-pawns = 100 elo according to Pawn advantage vs Elo, so I didn't need to use a scaling factor.

Not immediately, I think. But MM treats one parameter at a time. That perhaps works here too, provided the evaluation function is linear in the parameters.

daniel wrote:
Do you see a way we could use Minorization Maximization instead of conjugate gradient for optimizing parameters here?
For simplicity I will use the BE model. Let x be a parameter. Treating all other parameters as constants the function to be optimized has the form
sum_i log L(Ai x+Bi) (*)
where L(x)=1/(1+exp(-x)) is the logistic function. Now L satisfies the functional equation L(x)+L(-x)=1 and the differential-functional equation L'(x)=L(x)L(-x). So for example (log L(x))'= L(-x)=1-L(x). Therefore the first and second derivatives of (*) with respect to x are trivial to compute and one can use Newton (perhaps doing only a single step before moving on to the next parameter).
PS. I am not sure if this is relevant at all but if the initial x is large in absolute value the derivative of (*) is nearly constant and then Newton is perhaps not the best method (one might get the "bouncing around" behaviour). For large absolute values of x one can regard (*) as being stepwise linear and in that case it is perhaps best to use that approximation.
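The per-parameter Newton step described above can be sketched as follows (the A_i, B_i toy data are illustrative). It uses only the identities (log L)'(z) = 1 - L(z) and L'(z) = L(z)(1 - L(z)):

```python
import math

def logistic(z):
    return 1.0 / (1.0 + math.exp(-z))

def grad(x, A, B):
    # first derivative of sum_i log L(A_i*x + B_i), via (log L)'(z) = 1 - L(z)
    return sum(a * (1.0 - logistic(a * x + b)) for a, b in zip(A, B))

def hess(x, A, B):
    # second derivative, via L'(z) = L(z) * (1 - L(z)); always <= 0 (concave)
    total = 0.0
    for a, b in zip(A, B):
        Li = logistic(a * x + b)
        total -= a * a * Li * (1.0 - Li)
    return total

# illustrative data: a finite maximum exists because A contains both signs
A = [1.0, -0.5, 2.0, 0.7]
B = [0.2, 0.1, -0.3, 0.4]
x = 0.0
for _ in range(20):
    x -= grad(x, A, B) / hess(x, A, B)   # Newton step toward grad = 0
```

In a real tuner one would, as suggested, do only one such step per parameter before moving on to the next, and guard the step size in the saturated region Michel warns about.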
Ideas=science. Simplification=engineering.
Without ideas there is nothing to simplify.
-
- Posts: 741
- Joined: Tue May 22, 2007 11:13 am
Re: tuning via maximizing likelihood
It is worthwhile to note that generalizing from a continuous, Gaussian-distributed variable to a discrete, logistically distributed variable is very tricky and, depending on the route, can be misleading (a general point; your exposition above is correct of course).

Michel wrote:
I thought it would be nice to analyze your toy example theoretically.
Assume x is the value of the evaluation function (suitably scaled) and let p be the measured win ratio (no draws in the example). Let L(-) be the logistic function.
In ML method we optimize
-(p*log L(x)+(1-p)*log L(-x))
For x --> +infty this behaves as (1-p)*x, and for x --> -infty it behaves as -p*x.
So slant asymptotes (gradient = constant).
E.g. the log-likelihood for outcomes that can take on C categories is most conveniently written as
Code: Select all
LogL = 1/N sum_{i=1}^N sum_{j=1}^C delta(Y_i = j) log(Prob(Y_i = j | X_i))
which for C = 2 (a 0/1-valued outcome) reduces to

Code: Select all
LogL = 1/N sum_{i=1}^N ((1 - Y_i) * log(Prob(0|X_i)) + Y_i * log(Prob(1|X_i)))
However, the latter form is also numerically valid when plugging in Y_i = 1/2, which suggests that it could incorporate draws as well, imposing that 1 draw is the average of a win and a loss. But this is a false generalization, in the sense that it no longer is a bona fide log-likelihood. Instead, one needs to use the former shape of the log-likelihood with C=3, and using e.g. an ordered logit model with 3 - 1 = 2 threshold parameters to generate triples of probabilities (for which only 1 contributes to the log-likelihood).
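A sketch of such an ordered logit model with C = 3 and two thresholds t1 < t2 (the function names and the result encoding are mine):

```python
import math

def logistic(z):
    return 1.0 / (1.0 + math.exp(-z))

def ordered_logit_probs(score, t1, t2):
    """(p_win, p_draw, p_loss) from an ordered logit with thresholds t1 < t2."""
    p_loss = logistic(t1 - score)
    p_win = 1.0 - logistic(t2 - score)
    p_draw = 1.0 - p_loss - p_win   # = logistic(t2 - score) - logistic(t1 - score)
    return p_win, p_draw, p_loss

def mean_log_likelihood(data, t1, t2):
    """data: iterable of (score, result) with result 0 = loss, 1 = draw, 2 = win.
    Only the probability of the observed category contributes, as in the
    C-category LogL formula above."""
    total = 0.0
    n = 0
    for score, result in data:
        p_win, p_draw, p_loss = ordered_logit_probs(score, t1, t2)
        total += math.log((p_loss, p_draw, p_win)[result])
        n += 1
    return total / n
```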
Again, also for the MSE it can be tricky to generalize from the continuous to the discrete case.

Michel wrote:
In the LS method we optimize
p*L(-x)^2 + (1-p)*L(x)^2
For x --> +infty this behaves as 1-p, and for x --> -infty it behaves as p.
So horizontal asymptotes (gradient = 0).

For outcomes taking on C categories, the general formula is
Code: Select all
MSE = 1/N sum_{i=1}^N sum_{j=1}^C (delta(Y_i = j) - Prob(Y_i = j|X_i))^2
Code: Select all
MSE = 1/N sum_{i=1}^N { (delta(Y_i = 0) - Prob(0|X_i))^2 + (delta(Y_i = 1) - Prob(1|X_i))^2 }
    = 1/N sum_{i=1}^N { ((1 - delta(Y_i = 1)) - (1 - Prob(1|X_i)))^2 + (delta(Y_i = 1) - Prob(1|X_i))^2 }
    = 2/N sum_{i=1}^N (delta(Y_i = 1) - Prob(1|X_i))^2
    = 2/N sum_{i=1}^N (Y_i - Prob(1|X_i))^2
Again, one could plug in Y_i = 1/2 in the latter form for the MSE, but it would no longer be a bona fide MSE for a probability distribution. Instead, one needs to generate probability triples and use the former form for the MSE of a 3-valued outcome.
TL;DR both MSE and LogL can be generalized from the 2-valued to the 3-valued case, but plugging in Y = 1/2 in the 2-valued formulas is not a good idea (even if it is well-behaved numerically). It is ad hoc and does not follow from probability theory. Instead, use the proper 3-valued probabilities, e.g. from an ordered logit model.
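The binary identity derived above, and the 3-valued form, can be checked with a few lines of Python (function names are mine):

```python
def binary_mse_terms(y, p1):
    # per-observation MSE summed over both categories of a 0/1 outcome y,
    # where p1 = Prob(Y = 1 | X); should equal 2 * (y - p1)**2
    d1 = 1.0 if y == 1 else 0.0
    d0 = 1.0 - d1
    return (d0 - (1.0 - p1)) ** 2 + (d1 - p1) ** 2

def three_valued_mse(y, probs):
    # per-observation MSE for a C = 3 outcome; probs = (p0, p1, p2), summing to 1
    return sum(((1.0 if y == j else 0.0) - probs[j]) ** 2 for j in range(3))
```

For win/draw/loss data one would feed three_valued_mse the probability triples from, e.g., an ordered logit model, rather than plugging Y = 1/2 into binary_mse_terms.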
-
- Posts: 741
- Joined: Tue May 22, 2007 11:13 am
Re: tuning via maximizing likelihood
In general, you are correct that transforming the RHS of the non-linear equation Y = f(X * b) into the shape f^inv(Y) = X * b will turn the problem into linear least squares, for which a closed-form solution exists.

Michel wrote:
While I also would like to believe that ML is superior to least squares, I am not sure your test shows this. I think all you have been measuring is the performance of R's cg method for optimizing certain specific 1-variable functions.
However this does not show that other optimization methods will behave in the same way! For example, if an oracle had told the optimizer to express the objective functions in terms of plogis(b), then the least squares objective function would have converged in a single step!
However, for 0/1-valued outcomes this transformation is ill-formed, because you get (for the logistic) log(Y/(1-Y)) = X * b (this is why logistic regression is sometimes called log-odds regression). Taking Y = 0 or 1 shows that this precludes least squares directly. So the problem is truly non-linear.
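A small illustration of both halves of the argument (all names and data below are made up): for continuous responses strictly inside (0,1) the logit transform reduces the fit to linear least squares, while for a 0/1 outcome the transform is undefined.

```python
import math

def logit(y):
    return math.log(y / (1.0 - y))

# Continuous win ratios strictly inside (0, 1): transform, then fit
# logit(y) = b * x by one-parameter least squares (closed form).
xs = [0.5, 1.0, 1.5, 2.0]
ys = [0.62, 0.73, 0.82, 0.88]      # made-up measured win ratios
zs = [logit(y) for y in ys]
b = sum(x * z for x, z in zip(xs, zs)) / sum(x * x for x in xs)

# A 0/1 outcome breaks the transform: logit(1) would need log(1/0).
blew_up = False
try:
    logit(1.0)
except ZeroDivisionError:
    blew_up = True                  # so the 0/1 problem is truly non-linear
```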