tuning via maximizing likelihood

Discussion of chess software programming and technical issues.

Moderators: hgm, Rebel, chrisw

Daniel Shawul
Posts: 4185
Joined: Tue Mar 14, 2006 11:34 am
Location: Ethiopia

Re: tuning via maximizing likelihood

Post by Daniel Shawul »

jdart wrote:
The value of the loss is also not zero for wins and draws that are predicted correctly, so I don't see why that is a problem?
The value for correctly predicted losses and wins is very close to zero. Not exactly zero, because the logistic function only asymptotically approaches 1 as the score increases, or 0 as the score decreases.

--Jon
I don't think so. Note that the result could be a win or a loss even with a drawish evaluation (score) of 0. The log-likelihood in this case is log(0.5) = -0.3 whether the result is a win, loss, or draw. A reasonable score of +200 for a win gives a log-likelihood of -0.12, so we cannot assume that only positions with score > 1000 are correctly predicted as wins.
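These figures can be checked with a short sketch, assuming the usual logistic mapping from centipawns to expected score with a 400-point scale and base-10 logarithms (which is what reproduces the -0.3 and -0.12 above; the function names are mine):

```cpp
#include <cmath>

// Expected score (win probability, ignoring draws) for a centipawn eval,
// using the common logistic with a 400-point scale.
double win_prob(double score_cp) {
    return 1.0 / (1.0 + std::pow(10.0, -score_cp / 400.0));
}

// Base-10 log-likelihood of a win at the given eval.
double log10_likelihood_win(double score_cp) {
    return std::log10(win_prob(score_cp));
}
```

win_prob(0) = 0.5 gives a log-likelihood of about -0.30, and win_prob(200) of roughly 0.76 gives about -0.12, matching the numbers above; even a +1000 eval only approaches zero loss asymptotically, as jdart notes.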
Michel
Posts: 2272
Joined: Mon Sep 29, 2008 1:50 am

Re: maximum a posteriori

Post by Michel »

I already have separate MG and EG values for the most important parameters; is that what you are proposing?
For example.

At the very least it makes a lot of sense to have "draw_elo_MG" and "draw_elo_EG". The actual value of draw_elo used for a particular position would be interpolated between these two values depending on the "game phase" (which is a feature of the position). The draw model parameters draw_elo_MG and draw_elo_EG can be optimized together with the parameters used in the evaluation function.

But one could also make draw_elo for a particular position depend on other features of the position, like king safety. This would create some more draw model parameters.

Note that I mainly like this idea for its mathematical elegance. It may not make any difference in practice. On the other hand it may also lead to an objective model for the "drawishness" of a position which seems interesting.
Ideas=science. Simplification=engineering.
Without ideas there is nothing to simplify.
Daniel Shawul
Posts: 4185
Joined: Tue Mar 14, 2006 11:34 am
Location: Ethiopia

Re: maximum a posteriori

Post by Daniel Shawul »

Michel wrote:
I already have separate MG and EG values for the most important parameters; is that what you are proposing?
For example.

At the very least it makes a lot of sense to have "draw_elo_MG" and "draw_elo_EG". The actual value of draw_elo used for a particular position would be interpolated between these two values depending on the "game phase" (which is a feature of the position). The draw model parameters draw_elo_MG and draw_elo_EG can be optimized together with the parameters used in the evaluation function.

But one could also make draw_elo for a particular position depend on other features of the position, like king safety. This would create some more draw model parameters.

Note that I mainly like this idea for its mathematical elegance. It may not make any difference in practice. On the other hand it may also lead to an objective model for the "drawishness" of a position which seems interesting.
Ok, I made a first attempt at this, considering only material. I added slopes for draw_elo and home_elo:

Code: Select all

draw_elo = ELO_DRAW + phase * ELO_DRAW_SLOPE
home_elo = ELO_HOME + phase * ELO_HOME_SLOPE
where phase = 0 in the opening and 1 with no pieces left.

The result using the Davidson model after a few iterations is

Code: Select all

ELO_DRAW=62, ELO_DRAW_SLOPE=-8
ELO_HOME=20, ELO_HOME_SLOPE=-3
With phase running from 0 (opening) to 1 (no pieces left), both the drawishness and the home advantage decrease toward the endgame, the home advantage at a lower rate. The starting values for ELO_DRAW and ELO_HOME were 97 and 32 respectively.
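Plugging the tuned numbers back into the interpolation gives the endpoint values directly (a trivial sketch; the two functions just restate the formulas above):

```cpp
// Tuned values quoted above (Davidson model, material-only phase).
const double ELO_DRAW = 62.0, ELO_DRAW_SLOPE = -8.0;
const double ELO_HOME = 20.0, ELO_HOME_SLOPE = -3.0;

// phase = 0 in the opening, 1 with no pieces left.
double draw_elo(double phase) { return ELO_DRAW + phase * ELO_DRAW_SLOPE; }
double home_elo(double phase) { return ELO_HOME + phase * ELO_HOME_SLOPE; }
```

So draw_elo runs from 62 in the opening down to 54 with no pieces left, and home_elo from 20 down to 17.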

Daniel
Daniel Shawul
Posts: 4185
Joined: Tue Mar 14, 2006 11:34 am
Location: Ethiopia

Re: maximum a posteriori

Post by Daniel Shawul »

Ok, I added king safety as one of the parameters that affect drawishness, and it seems to be clearly the major factor: it was the first parameter to be quickly modified by the CG iterations. I am not sure I got the definitions of phase right in my previous post, so I will define them here with the code I used, to avoid mistakes in wording.

Code: Select all

	double factor_m = material / 62.0;  // cumulative non-pawn material for both sides,
	                                    // with values 9-5-3-3 for Q-R-B-N (goes from 1.0 to 0.0)
	double factor_k = ksafety / 100.0;  // cumulative king safety for both sides, scaled to
	                                    // one pawn value (goes from 0.0 to maybe 5.0)
	int eloH = ELO_HOME + factor_m * ELO_HOME_SLOPE_PHASE
	                    + factor_k * ELO_HOME_SLOPE_KSAFETY;
	int eloD = ELO_DRAW + factor_m * ELO_DRAW_SLOPE_PHASE
	                    + factor_k * ELO_DRAW_SLOPE_KSAFETY;
Result after a few iterations

Code: Select all

ELO_HOME 10
ELO_DRAW 65
ELO_HOME_SLOPE_PHASE -2
ELO_DRAW_SLOPE_PHASE -16
ELO_HOME_SLOPE_KSAFETY 19
ELO_DRAW_SLOPE_KSAFETY -25
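To get a feel for these numbers, the two formulas can be evaluated at a few illustrative factor settings (the functions mirror the code above; the specific factor values are made up):

```cpp
// Values quoted above, after a few CG iterations.
const double ELO_HOME = 10.0, ELO_HOME_SLOPE_PHASE = -2.0, ELO_HOME_SLOPE_KSAFETY = 19.0;
const double ELO_DRAW = 65.0, ELO_DRAW_SLOPE_PHASE = -16.0, ELO_DRAW_SLOPE_KSAFETY = -25.0;

// factor_m: non-pawn material / 62 (1.0 at the start, 0.0 with no pieces).
// factor_k: combined king safety in pawn units (0.0 for quiet positions).
int elo_home(double factor_m, double factor_k) {
    return (int)(ELO_HOME + factor_m * ELO_HOME_SLOPE_PHASE
                          + factor_k * ELO_HOME_SLOPE_KSAFETY);
}
int elo_draw(double factor_m, double factor_k) {
    return (int)(ELO_DRAW + factor_m * ELO_DRAW_SLOPE_PHASE
                          + factor_k * ELO_DRAW_SLOPE_KSAFETY);
}
```

A quiet full-material position gives elo_draw(1.0, 0.0) = 49, while the same material with one pawn's worth of mutual king attack gives elo_draw(1.0, 1.0) = 24: attacked kings cut the predicted drawishness roughly in half, which is the effect described above.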
Michel
Posts: 2272
Joined: Mon Sep 29, 2008 1:50 am

Re: maximum a posteriori

Post by Michel »

Result after a few iterations
Code:

ELO_HOME 10
ELO_DRAW 65
ELO_HOME_SLOPE_PHASE -2
ELO_DRAW_SLOPE_PHASE -16
ELO_HOME_SLOPE_KSAFETY 19
ELO_DRAW_SLOPE_KSAFETY -25
Thanks. Very nice! I looked at your code and it was what I had in mind.

If the converged values are similar to the above, then it seems that the model is at least compatible with common sense!
Ideas=science. Simplification=engineering.
Without ideas there is nothing to simplify.
Daniel Shawul
Posts: 4185
Joined: Tue Mar 14, 2006 11:34 am
Location: Ethiopia

Re: maximum a posteriori

Post by Daniel Shawul »

I have run into two issues.

a) The scale of centipawns vs. Elo is not one-to-one when there are significant draw percentages, so despite our previous assumption that no scaling is needed, this has to be addressed, especially for Davidson. The tuning results I got using Davidson are significantly ramped up compared to the other models. Assuming the centipawn vs. winning-percentage relation is logistic, we will try to morph the three draw models so that they give evaluation scores comparable to those obtained with no draw model. The criterion I am using to calculate the scaling factor is that the slope at eval = 0 should match that of the logistic curve. Let me know if there is a better way to calculate the scaling factor, especially for Davidson, which is slightly off after scaling with this method. In code, it is:

Code: Select all

//scale elos so that they look more like elostat's
//  match slopes at 0 elo difference using df(x)/dx = K/4
void calculate_scale() {
    const double K = log(10) / 400.0;
    double df;
    if(eloModel == 0) {
        double dg = elo_to_gamma(eloDraw);
        double f = 1 / (1 + dg);
        df = f * (1 - f) * K;
    } else if(eloModel == 1) {
        double dg = elo_to_gamma(eloDraw) - 1;
        df = (dg / pow(2 + dg, 2.0)) * K;
    } else if(eloModel == 2) {
        const double pi = 3.14159265359;
        double x = -eloDraw / 400.0;
        df = exp(-x * x) / (400.0 * sqrt(pi));
    }
    eloScale = (4.0 / K) * df;
    printf("EloScale %f\n", eloScale);
}
For Davidson, the factor (nu = dg) is calculated so that eloDraw = 0 gives dg = 0. This wasn't the case in the code I posted previously, but I think it should be like that.
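As a sanity check on the slope-matching criterion (a sketch of only the eloModel == 0 branch, assuming elo_to_gamma is the usual gamma = 10^(elo/400) conversion): when eloDraw = 0 the model reduces to the plain logistic, so the computed scale must come out as exactly 1, and it shrinks as eloDraw grows.

```cpp
#include <cmath>

// gamma = 10^(elo/400), BayesElo-style (assumed convention).
double elo_to_gamma(double elo) { return std::pow(10.0, elo / 400.0); }

// Scale for the first draw model: ratio of the model's slope at eval = 0
// to the plain logistic slope K/4; the K factors cancel out.
double elo_scale_model0(double eloDraw) {
    double dg = elo_to_gamma(eloDraw);
    double f  = 1.0 / (1.0 + dg);
    return 4.0 * f * (1.0 - f);
}
```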

b) Some evaluation terms dealing with imbalances are very problematic for tuning. I had a bonus for a major or minor piece vs. pawns. Tuning increased the value of this bonus from a default of 45 to as much as 500! I think this is because the dataset probably doesn't contain enough positions where a side is a piece up and does not win. When I changed the evaluation condition to require that a side be up a piece AND down by at least enough pawns to equal the piece's value, the tuning kept the value at more or less 45. I can imagine this kind of thing could be a problem for other people tuning their eval too.

I will run the drawishness simulations to convergence after I fix these issues.

Daniel
User avatar
Evert
Posts: 2929
Joined: Sat Jan 22, 2011 12:42 am
Location: NL

Re: maximum a posteriori

Post by Evert »

Daniel Shawul wrote: b) Some evaluation terms dealing with imbalances are very problematic for tuning. I had a bonus for a major or minor piece vs. pawns. Tuning increased the value of this bonus from a default of 45 to as much as 500! I think this is because the dataset probably doesn't contain enough positions where a side is a piece up and does not win. When I changed the evaluation condition to require that a side be up a piece AND down by at least enough pawns to equal the piece's value, the tuning kept the value at more or less 45. I can imagine this kind of thing could be a problem for other people tuning their eval too.
Yes, I found that ideally the positions should exhibit the features I'm trying to fit (other positions just add noise). If you use a fitting method that uses the Jacobian, the condition number gives you a hint about how well the features are represented.
Values can also be unstable due to bugs in the implementation of the evaluation; I've found a few of those that way.
Daniel Shawul
Posts: 4185
Joined: Tue Mar 14, 2006 11:34 am
Location: Ethiopia

Re: maximum a posteriori

Post by Daniel Shawul »

About the scaling: it turns out it is better to match the slope at the eloDraw point instead of at 0. Here are pictures that show before and after scaling with the two criteria, i.e. matching the slope at 0 or at the eloDraw point.

First using eloDraw=50 which doesn't show any difference with the two methods

[images: model curves before and after scaling, eloDraw = 50]


But using a very large value of eloDraw=250, matching the slope at 250 is better for Davidson
[images: model curves before and after scaling, eloDraw = 250]

The slopes in the plots are scaled up by a factor of 400.

Daniel
Michel
Posts: 2272
Joined: Mon Sep 29, 2008 1:50 am

Re: maximum a posteriori

Post by Michel »

Daniel wrote: b) Some evaluation terms dealing with imbalances are very problematic for tuning. I had a bonus for a major or minor piece vs. pawns. Tuning increased the value of this bonus from a default of 45 to as much as 500! I think this is because the dataset probably doesn't contain enough positions where a side is a piece up and does not win. When I changed the evaluation condition to require that a side be up a piece AND down by at least enough pawns to equal the piece's value, the tuning kept the value at more or less 45. I can imagine this kind of thing could be a problem for other people tuning their eval too.
You know as well as I do, of course, that one usually tries to address this by adding a (somewhat ad hoc) regularization term to the objective function. The theoretical justification for doing this seems somewhat unclear.

Daniel wrote: About the scaling: it turns out it is better to match the slope at the eloDraw point instead of at 0. Here are pictures that show before and after scaling with the two criteria, i.e. matching the slope at 0 or at the eloDraw point.
If the aim is to have the evaluation function reflect the expected score of a position via the standard logistic function then scaling is indeed necessary if one of the standard draw models is being used.

However, for simple symmetry reasons it seems weird to do the matching at any point other than ev_score = 0.
Ideas=science. Simplification=engineering.
Without ideas there is nothing to simplify.
AlvaroBegue
Posts: 931
Joined: Tue Mar 09, 2010 3:46 pm
Location: New York
Full name: Álvaro Begué (RuyDos)

Re: maximum a posteriori

Post by AlvaroBegue »

Michel wrote:
Daniel wrote: b) Some evaluation terms dealing with imbalances are very problematic for tuning. I had a bonus for a major or minor piece vs. pawns. Tuning increased the value of this bonus from a default of 45 to as much as 500! I think this is because the dataset probably doesn't contain enough positions where a side is a piece up and does not win. When I changed the evaluation condition to require that a side be up a piece AND down by at least enough pawns to equal the piece's value, the tuning kept the value at more or less 45. I can imagine this kind of thing could be a problem for other people tuning their eval too.
You know as well as I do, of course, that one usually tries to address this by adding a (somewhat ad hoc) regularization term to the objective function. The theoretical justification for doing this seems somewhat unclear.
The theoretical justification for regularization is not that unclear. I'll introduce the standard naming for these things from statistics, because I shouldn't assume everyone is familiar with the terminology. We are considering a large family of models that might explain our observations (all the possible settings of the parameters). The Bayesian approach to this situation requires that we start with a probability distribution over the set of models (known as the "prior"), which expresses our beliefs about what's reasonable (e.g., one of the bonuses you mention having a value of 500 isn't reasonable). The data allows us to refine this and obtain another probability distribution (known as the "posterior") by using Bayes's formula to compute the probability of a model given the data. One way to estimate our model parameters is to pick the model with the highest posterior probability (the "maximum a posteriori", or MAP, estimator). In the case where our prior probability distribution is flat, this estimator is called the "maximum likelihood estimator".

Now for the relevant part: In some cases regularization is exactly equivalent to starting with a prior probability distribution that is not flat. For instance, L2 regularization of a linear model (a.k.a. "Tikhonov regularization" or "ridge regression") corresponds to a Gaussian prior centered around zero, with diagonal covariance and equal variance in every parameter. I seem to remember that the regularization coefficient is inversely proportional to the variance of the prior, or something like that; I have done the computation before but I can't remember.

https://en.wikipedia.org/wiki/Tikhonov_ ... rpretation
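The computation alluded to above is short. Assuming Gaussian observation noise with variance sigma_n^2 and an independent zero-mean Gaussian prior with variance sigma_p^2 on each parameter theta_i, the negative log-posterior is:

```latex
-\log P(\theta \mid D)
  = -\log P(D \mid \theta) - \log P(\theta) + \text{const}
  = \frac{1}{2\sigma_n^2}\sum_i \bigl(y_i - f_\theta(x_i)\bigr)^2
    + \frac{1}{2\sigma_p^2}\,\lVert\theta\rVert^2 + \text{const}
```

Multiplying through by 2*sigma_n^2 shows that maximizing the posterior is the same as minimizing the squared loss plus lambda * ||theta||^2 with lambda = sigma_n^2 / sigma_p^2, so the regularization coefficient is indeed inversely proportional to the variance of the prior.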