This may have been covered, at least indirectly, but the issue I see with this function and draws is that the value of the loss is not zero when the predicted score is 0.5 and the actual score is 0.5. The loss function for draws does have its minimum value in that case. But it is not zero.
Each draw will increase the value of the objective even if it is predicted correctly, and this increase is more than what most loss and win positions contribute (because many will be more or less accurately predicted).
--Jon
This was the objective function discussed in the OP but it is not really how maximum likelihood is supposed to work in the presence of draws. One needs to choose a draw model. Luckily there are reasonable draw models available which are used by rating tools. This is discussed in subsequent posts.
A draw model involves at least one extra parameter which serves as a proxy for the draw ratio (draw_elo in the Bayes Elo model), though it may involve more parameters.
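For concreteness, here is a sketch of how the Bayes Elo model turns a strength difference and a single draw_elo parameter into win/draw/loss probabilities (written from memory of Rémi Coulom's bayeselo; the function and struct names are mine):

```cpp
#include <cassert>
#include <cmath>

// Logistic expected score for an Elo difference d.
double logistic(double d) { return 1.0 / (1.0 + std::pow(10.0, -d / 400.0)); }

struct WDL { double win, draw, loss; };

// Bayes Elo sketch: draw_elo pushes both the win and the loss curves
// down, and the draw probability is whatever remains.
WDL bayeselo_wdl(double delta, double draw_elo) {
    WDL p;
    p.win  = logistic(delta - draw_elo);
    p.loss = logistic(-delta - draw_elo);
    p.draw = 1.0 - p.win - p.loss;
    return p;
}
```

With delta = 0 the win and loss probabilities are symmetric, and a larger draw_elo produces more draws, which is exactly the behaviour the single extra parameter is meant to capture.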
Whether or not the textbook approach yields better results in actual games than more ad hoc approaches is of course impossible to know in advance.
Ideas=science. Simplification=engineering.
Without ideas there is nothing to simplify.
This may have been covered, at least indirectly, but the issue I see with this function and draws is that the value of the loss is not zero when the predicted score is 0.5 and the actual score is 0.5. The loss function for draws does have its minimum value in that case. But it is not zero.
The value of the loss is not zero for wins and losses that are predicted correctly either, so I don't see why that is a problem?
Originally, I was thinking of a situation where the draws are replaced with wins and losses (no draw model). The implicit draw model in this equation is 'sort of' a special case of the Davidson model where 2 draws = 1 win + 1 loss.
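For reference, the Davidson model itself can be sketched like this (my own naming; nu is the draw parameter). Note that p_draw = nu * sqrt(p_win * p_loss), so scoring each draw as half a win plus half a loss reproduces the Davidson likelihood up to the constant nu:

```cpp
#include <cassert>
#include <cmath>

struct DavidsonWDL { double win, draw, loss; };

// Davidson draw model sketch: each side has a strength gamma, and nu
// controls the draw rate. Names are mine, for illustration only.
DavidsonWDL davidson_wdl(double g1, double g2, double nu) {
    double d = g1 + g2 + nu * std::sqrt(g1 * g2);
    return { g1 / d, nu * std::sqrt(g1 * g2) / d, g2 / d };
}
```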
Substituting r=0.5 in the original loss function, you get a minimum of -log(0.5) at a predicted score of 0.5, which is positive rather than zero.
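Whatever the exact form, the point is easy to check numerically. Assuming the loss is the cross-entropy E(p) = -r*ln(p) - (1-r)*ln(1-p) (my assumption; the formula in the OP may differ):

```cpp
#include <cassert>
#include <cmath>

// Cross-entropy loss for actual score r and predicted score p.
// For r = 0.5 the minimum over p is ln(2) at p = 0.5, i.e. not zero,
// while a confidently predicted win (r = 1, p near 1) approaches zero.
double cross_entropy(double r, double p) {
    return -r * std::log(p) - (1.0 - r) * std::log(1.0 - p);
}
```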
Daniel wrote: Using a Bayesian approach, we can add a prior distribution over the parameters, which acts like a regularization for the ML estimates. So we do a maximum a posteriori (MAP) estimate rather than a maximum likelihood (ML) one. So far, I have not been able to get better results than my existing parameters for the eval terms, so it makes sense to keep that momentum ('or absence of it') going for a while by introducing a prior distribution over the evaluation parameters (theta).
I do not think putting on a prior will make much of a difference, except perhaps for the speed of convergence.
You are right, I am not getting anything out of it except for slowing down convergence. I did not try what I proposed as a prior, but something that could only serve as regularization: I added a virtual set of results that favours the set of parameters currently being considered, by assigning an mse=0 to a tiny fraction of the total number of positions. I think the prior was very useful for elo estimation because we often have very few games for estimating strength.
What might be an interesting experiment is to make the draw_elo parameter (in the BE model) itself a linear combination of the features (i.e. it would vary from position to position). A model with a constant draw_elo parameter is obviously incorrect as in the endgame the expected draw rate should be much higher.
I tried making the drawElo and homeAdvantage parameters dynamic and it works. The problem is I am not getting any improvement from tuning parameters, so I cannot gauge whether determining drawElo dynamically improves things.
If implemented literally this would roughly double the number of parameters. There is an increased risk of overfitting, however, so one could cut this down by only including features that are expected to have an influence on the draw ratio, like game phase and perhaps king safety.
I already have separate MG and EG values for most important parameters, is that what you are proposing ?
The value of the loss is not zero for wins and losses that are predicted correctly either, so I don't see why that is a problem?
The value for correctly predicted losses and wins is very close to zero, though not exactly zero, because the logistic function only approaches 1 asymptotically as the score increases, or 0 as it decreases.
--Jon
I don't think so. Note that the result could be a win or a loss even with a drawish evaluation (score) of 0. The log-likelihood in this case is log(0.5) = -0.3 whether the result is a win/loss or a draw. A reasonable score of +200 for a win gives you a log-likelihood of -0.12, so we cannot assume only positions with score > 1000 are correctly predicted as a win.
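These figures are easy to verify, assuming base-10 logs and the usual logistic 1/(1 + 10^(-score/400)):

```cpp
#include <cassert>
#include <cmath>

// Log-likelihood (base 10) of a win given a centipawn score, under the
// standard logistic win-probability curve with the 400 scaling.
double win_loglik(double score) {
    double p = 1.0 / (1.0 + std::pow(10.0, -score / 400.0));
    return std::log10(p);
}
```

A score of 0 gives log10(0.5) ≈ -0.3, and +200 gives roughly -0.12, matching the numbers above.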
I already have separate MG and EG values for most important parameters, is that what you are proposing ?
For example.
At the very least it makes a lot of sense to have "draw_elo_MG" and "draw_elo_EG". The actual value of draw_elo used for a particular position would be interpolated between these two values depending on the "game phase" (which is a feature of the position). The draw model parameters draw_elo_MG and draw_elo_EG can be optimized together with the parameters used in the evaluation function.
But one could also make draw_elo for a particular position depend on other features of the position, like king safety. This would create some more draw model parameters.
Note that I mainly like this idea for its mathematical elegance. It may not make any difference in practice. On the other hand it may also lead to an objective model for the "drawishness" of a position which seems interesting.
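Concretely, the interpolation could look like this (a sketch; all names are made up):

```cpp
#include <cassert>

// Interpolate draw_elo between a middlegame and an endgame value using
// the game phase (1.0 = full middlegame, 0.0 = bare endgame).
// draw_elo_MG and draw_elo_EG would be tuned together with the
// evaluation parameters; all names here are illustrative.
double interpolated_draw_elo(double phase, double draw_elo_MG, double draw_elo_EG) {
    return phase * draw_elo_MG + (1.0 - phase) * draw_elo_EG;
}
```

Since the expected draw rate is higher in the endgame, one would expect the tuner to settle on draw_elo_EG > draw_elo_MG.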
Ideas=science. Simplification=engineering.
Without ideas there is nothing to simplify.
Ok, I did a first attempt at this, considering only material. I added slopes for the draw_elo and home_elo as linear functions of a material-based phase factor.
The drawishness increases with game phase, while the home advantage increases at a lower rate. The starting values for ELO_DRAW and ELO_HOME were 97 and 32 respectively.
Ok, I added king safety as one of the parameters that affect drawishness, and it seems to be clearly the major factor: it was the first parameter to be quickly modified by the CG iterations. I am not sure I got the definitions of phase right in my previous post, so I will define them here with the code I used, to avoid mistakes in wording it.
double factor_m = material / 62.0; // cumulative piece values (no pawns) for both sides, with 9-5-3-3 for Q-R-B-N (factor goes from 1.0 down to 0.0)
double factor_k = ksafety / 100.0; // cumulative king safety for both sides, scaled to 1 pawn value (factor goes from 0.0 up to maybe 5.0)

int eloH = ELO_HOME + factor_m * ELO_HOME_SLOPE_PHASE
                    + factor_k * ELO_HOME_SLOPE_KSAFETY;
int eloD = ELO_DRAW + factor_m * ELO_DRAW_SLOPE_PHASE
                    + factor_k * ELO_DRAW_SLOPE_KSAFETY;
a) The scale of centi-pawns vs elo is not one-to-one when we have significant draw percentages, so despite our previous assumption that no scaling is needed, this needs to be addressed, especially for Davidson. The tuning results I got using Davidson are significantly ramped up compared to the other models. Assuming the centi-pawn vs winning-percentage relation is logistic, we will try to morph the three draw models to give evaluation scores like those obtained with no draw model. The criterion I am using to calculate the scaling factor is that the slope at eval=0 should match that of the logistic curve. Let me know if there is a better method to calculate the scaling factor, especially for Davidson, which is slightly off after scaling with this method. In code, it is:
// scale elos so that they look more like elostat's
// Match slopes at 0 elo difference using df(x)/dx = K/4
void calculate_scale() {
    const double K = log(10) / 400.0;
    double df;
    if (eloModel == 0) {
        double dg = elo_to_gamma(eloDraw);
        double f = 1 / (1 + dg);
        df = f * (1 - f) * K;
    } else if (eloModel == 1) {
        double dg = elo_to_gamma(eloDraw) - 1;
        df = (dg / pow(2 + dg, 2.0)) * K;
    } else if (eloModel == 2) {
        const double pi = 3.14159265359;
        double x = -eloDraw / 400.0;
        df = exp(-x * x) / (400.0 * sqrt(pi));
    }
    eloScale = (4.0 / K) * df;
    printf("EloScale %f\n", eloScale);
}
For Davidson, the factor (nu=dg) is calculated so that eloDraw=0 gives dg=0. This wasn't the case in the previous code I posted, but I think it should be like that.
b) Some evaluation terms dealing with imbalances are very problematic for tuning. I had a bonus for a major or minor piece vs pawns. Tuning increased the value of this bonus from a default of 45 to as much as 500! I think this is because the dataset probably doesn't contain enough positions where a side is a piece up and does not win. When I carefully changed the evaluation condition so that a side must be up a piece AND down by at least the pawns that equal the piece's value, the tuning kept more or less the 45 value. I can imagine this kind of thing could be problematic for other people tuning their eval too.
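The tightened condition was along these lines (a sketch with made-up names and values, not the exact code):

```cpp
#include <cassert>

// Sketch: award the piece-vs-pawns bonus only when the side is up a
// piece AND down at least the equivalent value in pawns, so the tuner
// only sees genuine imbalance positions. Values are illustrative.
const int PAWN  = 100;
const int MINOR = 300;
const int BONUS = 45;

// piece_diff / pawn_diff: side-to-move's surplus in minor pieces / pawns.
int piece_vs_pawns_bonus(int piece_diff, int pawn_diff) {
    if (piece_diff >= 1 && -pawn_diff * PAWN >= MINOR)
        return BONUS;
    return 0;
}
```

Without the pawn condition, positions that are simply a piece up (and winning) dominate, which is what drove the bonus toward 500 during tuning.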
I will run the drawishness simulations to convergence once I have fixed these issues.