AlvaroBegue wrote:
At this point all the facts are understood by all the participants in this thread. I raised two issues to think about, and think about them we have. Now everyone can decide how bothered they are by them, and empirical evidence can speak as to whether they matter in practice or not.
Actually, this is not true empirically. It is precisely because log-likelihood penalizes outlying predictions more heavily than MSE does that the weights converge more quickly.
Below is a test program in R (install R and use an IDE like RStudio to run it). I tested both LogL and MSE for a simple eval with only a constant term, fitted on a million positions. These positions are generated from a logistic distribution with a fixed win percentage that corresponds to the "true value" to be fitted.
LogL converges 3x faster (fewer iterations) than MSE if the positions have a win rate of 50%. For a win rate of 90% this advantage grows to ~9x, and for a 99% win rate to ~35x. So the higher the win rate, the faster LogL converges compared to MSE. Note that in all cases the fitted parameter value is virtually identical.
My conclusion so far is that LogL is to be preferred because it converges faster. This is perfectly consistent with its better gradient behavior for extreme predictions; a small gradient check after the test program below illustrates the point. I see no drawbacks.
Code:
# Loss functions and gradients for
# a) non-linear least squares minimization
# b) logistic log-likelihood maximization (implemented as minimizing the negative log-likelihood)
logl.loss = function(b, y) {
  n = length(y)
  obj = -sum((1 - y) * log(1 - plogis(b)) + y * log(plogis(b))) / n
  return(obj)
}
logl.grad = function(b, y) {
  n = length(y)
  deriv = -sum(y - plogis(b)) / n
  return(deriv)
}
mse.loss = function(b, y) {
  n = length(y)
  obj = 1/2 * sum((y - plogis(b))^2) / n
  return(obj)
}
mse.grad = function(b, y) {
  n = length(y)
  # same error term as logl.grad, but damped by the extra factor plogis(b) * (1 - plogis(b))
  deriv = -sum(y - plogis(b)) * plogis(b) * (1 - plogis(b)) / n
  return(deriv)
}
# make this reproducible
set.seed(42)
# Compute intercept "b_star" (the "true" parameter value of the underlying logistic distribution) for a given scoring percentage
score = .5 # change to any value of interest, e.g. .9, .99 etc.
b_star = qlogis(score)
# Generate a million logistically distributed values
N = 1e6
y_latent = rlogis(N, location = b_star)
# Encoded as 0-1
y = ifelse(y_latent > 0, 1, 0)
# Compute sample mean and corresponding intercept
y_hat = mean(y)
b_hat = qlogis(y_hat)
# Use base R library functions to get the most accurate results
glm(y ~ 1, family = binomial)
nls(y ~ plogis(b), data = data.frame(y), start = list(b = 0))
# Use conjugate gradient on the hand-coded loss functions and gradients to compare convergence;
# the $counts element of each optim() result gives the number of function and gradient evaluations used
optim(c(0), logl.loss, logl.grad, y = y, method = "CG", control = list(maxit = 1e6))
optim(c(0), mse.loss, mse.grad, y = y, method = "CG", control = list(maxit = 1e6))
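To make the gradient remark above concrete: with only a constant term, logl.grad reduces to -(mean(y) - plogis(b)), while mse.grad is the same quantity multiplied by plogis(b) * (1 - plogis(b)), a factor that vanishes for extreme predictions. A minimal sketch of this, run after the program above; b = 5 (a predicted win rate of plogis(5) ~ 0.993) is just an arbitrarily chosen extreme value:

Code:
# Quick check of the gradient behaviour at an extreme prediction.
# Reuses logl.grad, mse.grad and the vector y from the program above.
b_extreme = 5
logl.grad(b_extreme, y)   # -(mean(y) - plogis(5)): stays of the order of the prediction error
mse.grad(b_extreme, y)    # same error term, damped by plogis(5) * (1 - plogis(5)) ~ 0.0066
mse.grad(b_extreme, y) / logl.grad(b_extreme, y)   # the damping factor itself

The more extreme the prediction, the stronger this damping, which is consistent with MSE needing far more CG evaluations at high win rates in the runs above.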