AlvaroBegue wrote:
At this point all the facts are understood by all the participants in this thread. I raised two issues to think about, and think about them we have. Now everyone can decide how bothered they are by them, and empirical evidence can speak as to whether they matter in practice or not.
Actually, this is not true empirically. It is precisely because log-likelihood penalizes outlying predictions more heavily than MSE does that the weights converge more quickly.
Below is a test program in R (install R and use an IDE like RStudio to run it). I tested both LogL and MSE for a simple eval with only a constant term, fitted on a million positions. These positions are generated from a logistic distribution with a fixed win percentage that corresponds to the "true value" to be fitted.
LogL converges 3x faster (fewer iterations) than MSE if the positions have a win rate of 50%. For a win rate of 90% this advantage grows to ~9x, and for a 99% win rate to ~35x. So the higher the win rate, the faster LogL converges compared to MSE. Note that in all cases the fitted parameter value is virtually identical.
My conclusion so far is that LogL is to be preferred because it converges faster. This is perfectly consistent with its better gradient behavior for extreme predictions; a small gradient check after the test program below illustrates the point. I see no drawbacks.
Code:
# Loss functions and gradients for
# a) non-linear least squares minimization
# b) logistic log-likelihood maximization (implemented as minimizing the negative log-likelihood)
logl.loss = function(b, y) {
  n = length(y)
  obj = -sum((1 - y) * log(1 - plogis(b)) + y * log(plogis(b))) / n
  return(obj)
}
logl.grad = function(b, y) {
  n = length(y)
  deriv = -sum(y - plogis(b)) / n
  return(deriv)
}
mse.loss = function(b, y) {
  n = length(y)
  obj = 1/2 * sum((y - plogis(b))^2) / n
  return(obj)
}
mse.grad = function(b, y) {
  n = length(y)
  # same error term as logl.grad, but damped by the extra factor plogis(b) * (1 - plogis(b))
  deriv = -sum(y - plogis(b)) * plogis(b) * (1 - plogis(b)) / n
  return(deriv)
}
# make this reproducible
set.seed(42)
# Compute intercept "b_star" (the "true" parameter value of the underlying logistic distribution) for a given scoring percentage
score = .5 # change to any value of interest, e.g. .9, .99 etc.
b_star = qlogis(score)
# Generate a million logistically distributed values
N = 1e6
y_latent = rlogis(N, location = b_star)
# Encoded as 0-1
y = ifelse(y_latent > 0, 1, 0)
# Compute sample mean and corresponding intercept
y_hat = mean(y)
b_hat = qlogis(y_hat)
# Use base R library functions to get the most accurate results
glm(y ~ 1, family = binomial)
nls(y ~ plogis(b), data = data.frame(y), start = list(b = 0))
# Use conjugate gradient on the hand-coded loss functions and gradients to compare convergence;
# the $counts element of each optim() result gives the number of function and gradient evaluations used
optim(c(0), logl.loss, logl.grad, y = y, method = "CG", control = list(maxit = 1e6))
optim(c(0), mse.loss, mse.grad, y = y, method = "CG", control = list(maxit = 1e6))
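To make the gradient remark above concrete: with only a constant term, logl.grad reduces to -(mean(y) - plogis(b)), while mse.grad is the same quantity multiplied by plogis(b) * (1 - plogis(b)), a factor that vanishes for extreme predictions. A minimal sketch of this, run after the program above; b = 5 (a predicted win rate of plogis(5) ~ 0.993) is just an arbitrarily chosen extreme value:

Code:
# Quick check of the gradient behaviour at an extreme prediction.
# Reuses logl.grad, mse.grad and the vector y from the program above.
b_extreme = 5
logl.grad(b_extreme, y)   # -(mean(y) - plogis(5)): stays of the order of the prediction error
mse.grad(b_extreme, y)    # same error term, damped by plogis(5) * (1 - plogis(5)) ~ 0.0066
mse.grad(b_extreme, y) / logl.grad(b_extreme, y)   # the damping factor itself

The more extreme the prediction, the stronger this damping, which is consistent with MSE needing far more CG evaluations at high win rates in the runs above.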