Rémi Coulom wrote: Very interesting. I am curious to see the results.
I had started to implement alternative models in bayeselo at the time I wrote the unfinished paper I posted here earlier. But I did not try to MM them. My plan was to use Newton's method or Conjugate Gradient. I don't expect it will be possible to apply MM to Glenn-David.
I recommend normalizing elo scales by having the same derivative at zero of the expected gain (p(win)+p(draw)/2). That's how I did it for the original bayeselo.
Rémi
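To illustrate that normalization, here is a minimal Python sketch that matches the slope at zero of the expected gain E(d) = p(win) + p(draw)/2. The Davidson-style parameterization below (elo difference split symmetrically, a nu parameter for draw propensity) is my own assumption for illustration, not the actual bayeselo code.

Code:
import math

def logistic_expected_gain(d):
    # reference scale: plain logistic expected score for an Elo difference d
    return 1.0 / (1.0 + 10.0 ** (-d / 400.0))

def davidson_expected_gain(d, nu=1.0):
    # assumed Davidson-style model on an Elo scale:
    # p_win : p_draw : p_loss = q_i : nu*sqrt(q_i*q_j) : q_j with q = 10^(elo/400)
    qi, qj = 10.0 ** (d / 800.0), 10.0 ** (-d / 800.0)
    denom = qi + qj + nu * math.sqrt(qi * qj)
    return (qi + 0.5 * nu * math.sqrt(qi * qj)) / denom

def slope_at_zero(f, h=1e-4):
    # symmetric numerical derivative at d = 0
    return (f(h) - f(-h)) / (2.0 * h)

# multiply the model's rating differences by this factor
# to put them on the standard logistic elo scale
scale = slope_at_zero(davidson_expected_gain) / slope_at_zero(logistic_expected_gain)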
Hi Rémi,
What seems very popular nowadays is that all sorts of engines do learning in a rather hard manner. By "hard" I mean: difficult to turn off.
Note that I do not think that most engines do learning in a hard manner.
If we talk only about learning from the hash table, which is probably the most common problem, then it is easy to disable this learning by telling the engine to quit after every game and loading the engine again.
Even in the worst case, I think that it is easy to turn off learning even if the engine does not support disabling it.
Simply keep a clean copy of every engine.
After every game, delete every file that the engine generated during the game, and also delete the engine itself (to prevent it from learning by modifying its exe file).
Save the PGN in a folder that the engine knows nothing about, so it cannot use the PGN of the previous game to learn.
After that, duplicate the clean copy to get a fresh copy for the next game, and run another game.
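A rough Python sketch of that clean-copy loop (every path and the runner script here are made-up placeholders):

Code:
import shutil
import subprocess

PRISTINE = "engines/pristine/my_engine"   # hypothetical untouched copy
WORKDIR = "engines/work/my_engine"        # directory the engine actually runs in

for game in range(100):
    shutil.rmtree(WORKDIR, ignore_errors=True)   # wipe everything the engine wrote
    shutil.copytree(PRISTINE, WORKDIR)           # fresh files, fresh exe
    # hypothetical runner; it must save the PGN outside WORKDIR
    subprocess.run(["./run_one_game.sh", WORKDIR], check=True)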
If you don't like the prior, you can remove it in bayeselo. But my feeling is that it should always be better to have a prior. Depending on the situation, it might be possible to find a better value than the default, but it would require serious testing to know. The default prior was tuned for WBEC data.
I spent some time searching for a suitable formula to calculate the scaling factor, but unfortunately I couldn't find one. I tried to match the slopes at different winning percentages, but it didn't work. The slopes of the standard logistic curve and this particular model seem to be very close, which is why I was getting scale=1. So I leave it up to the user to select a suitable 'scale' parameter after invoking the 'elostat' command. There does not seem to be a glaring problem with the derivation. Here is a modified file for anyone interested in experimenting with it. I am done with this method, so I am moving on to something else.
The real question is which of the two models matches the data best. Could you simply plot the draw rate as a function of rating difference and compare it with the data frequency for both? That may give a first impression. The real test would be to compare prediction ability.
Well, without proper scaling I am not sure whether anything can be deduced from the plots. But anyway, here are the plots of winning percentage and draw ratio for both models. The elo differences are shown before any scaling is applied, for both methods. I am sure it is something trivial I missed, but why aren't we getting the correct logistic curve anyway, since we used it to calculate the elo differences? Shifting, as occurs in elostat, makes sense, but a change in the relative elo difference doesn't. What is responsible for the need for scaling in both Bayesian models? I used 10-elo bins to make the plots.
Winning percentage and Draw ratio for both models
Number of games in bins
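For anyone who wants to reproduce the binning, here is a sketch of how the empirical frequencies could be computed, assuming games is a list of (elo_white, elo_black, result) tuples with result in {1, 0.5, 0} (the name and layout are assumptions, not code from this thread):

Code:
from collections import defaultdict

BIN = 10  # Elo bin width, as used for the plots above

def bin_draw_rate(games):
    counts = defaultdict(lambda: [0, 0])      # bin index -> [draws, games]
    for elo_w, elo_b, result in games:
        b = round((elo_w - elo_b) / BIN)
        counts[b][1] += 1
        if result == 0.5:
            counts[b][0] += 1
    # center of each bin -> empirical draw frequency
    return {b * BIN: d / n for b, (d, n) in sorted(counts.items())}

# the winning percentage per bin works the same way, counting result == 1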
Ok, now I have fixed all the problems. The previous plot was wrong because the draw ratio was halved. Now both models seem to fit the winning percentage and draw ratio well. Both models definitely fit the data, but how to tell which is better I don't know.
Edit: I found another problem with the previous plot. Since I didn't differentiate colors but still used eloAdvantage in the win-probability calculations, the result was offset by some amount. Now that I have removed eloAdvantage from the probability calculations, the plots seem to fit even better. I shouldn't have used eloAdvantage, since I didn't check for white/black wins. And my wishful thinking sees a better fit for Davidson's draw model towards the top, maybe.
Very nice! The Davidson model looks clearly better to me.
I believe the right way to measure would be to compare prediction quality. For instance, compute ratings with half of the games, then measure the average log-likelihood of the other half based on the computed ratings. Doing it in a proper Bayesian way would require integrating over the posterior instead of using the maximum-a-posteriori estimate, but not integrating should give a good idea already.
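A sketch of that test in Python; fit_ratings and prob are stand-ins for whichever model and fitting code is being evaluated (both names are assumptions, nothing here is existing code):

Code:
import math
import random

def predictive_loglik(games, fit_ratings, prob):
    # games: list of (white, black, result) with result in {1, 0.5, 0}
    random.shuffle(games)
    train, test = games[:len(games) // 2], games[len(games) // 2:]
    ratings = fit_ratings(train)       # MAP estimate, not the full posterior
    total = 0.0
    for white, black, result in test:
        # players unseen in training fall back to the prior mean (0)
        d = ratings.get(white, 0.0) - ratings.get(black, 0.0)
        p_win, p_draw, p_loss = prob(d)
        total += math.log({1: p_win, 0.5: p_draw, 0: p_loss}[result])
    return total / len(test)           # higher (less negative) predicts better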
Ok, I am working on a conjugate gradient minimizer for the likelihood. It is having a bit of difficulty converging, so it might take me some time. There aren't multiple maxima, are there? Anyway, maybe tomorrow I will have something for the Glenn-David model. But I do not expect it to be significantly different from the Rao-Kupper model. Plotting the graphs for an assumed strength variance of 200 elo for the Glenn-David model, I couldn't see much difference from the Rao-Kupper model. It will be clear tomorrow after I calculate the frequencies. Btw, is there a standard value of the variance for the Glenn-David model?
Edit: But I see that the Glenn-David model is mid-way between the two models, so it could be better. I will change the variance to a value that is equivalent to the one used for the logistic model, and see if this good fit below disappears.
Edit 2: The CG method has converged now on the Rao-Kupper model and gave the same results as MM, which is a good thing. I did it on a smaller database, so it could go wrong on a bigger database or when I change the model to Glenn-David. Later.
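On the multiple-maxima question: one thing worth noting is that the pure maximum-likelihood surface is flat along "add the same constant to every rating", so a CG minimizer can drift unless the scale is anchored (fix one player, or penalize the mean). Here is a rough sketch with SciPy's CG on the Rao-Kupper model in its elo form (p_win = f(d - drawElo), p_loss = f(-d - drawElo), ignoring eloAdvantage as in the plots above); the per-pair count aggregation is assumed to be done elsewhere:

Code:
import numpy as np
from scipy.optimize import minimize

def f(x):
    # logistic curve in Elo units
    return 1.0 / (1.0 + 10.0 ** (-x / 400.0))

def neg_loglik(params, pairs, n):
    # params = n ratings followed by drawElo; pairs: {(i, j): (wins, draws, losses)}
    ratings, draw_elo = params[:n], params[n]
    total = 0.0
    for (i, j), (w, d, l) in pairs.items():
        delta = ratings[i] - ratings[j]
        p_win = f(delta - draw_elo)
        p_loss = f(-delta - draw_elo)
        p_draw = max(1.0 - p_win - p_loss, 1e-12)
        total -= w * np.log(p_win) + d * np.log(p_draw) + l * np.log(p_loss)
    # soft anchor on the mean rating removes the translation ridge
    return total + 1e-3 * np.sum(ratings) ** 2

# x0 = np.zeros(n + 1); x0[n] = 100.0   # flat start, positive drawElo
# res = minimize(neg_loglik, x0, args=(pairs, n), method='CG')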
Well, all models work now on the smaller databases, but it is so damn slow on the larger CCRL database that I used to produce the plots. I am going to have to optimize it aggressively to get results within a reasonable time. The most time-consuming part is calculating the log-likelihood, so some sort of caching is necessary. I am thinking of switching to a poorer grid-optimization method that does one variable at a time, because I can cache the likelihood calculations efficiently that way. Of course, when optimizing thetaW and thetaD I need to calculate the likelihoods from scratch, but for the rest most of it can be reused... Just thinking out loud.
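One possible shape for that caching, as a sketch (all names illustrative): aggregate the PGN into per-pair counts once, so the log-likelihood is a sum over player pairs rather than over individual games; keep each pair's term cached, refresh only the pairs touching the player whose rating just moved, and recompute everything only when a global parameter like thetaW or thetaD changes.

Code:
from collections import defaultdict

def aggregate(games):
    # one pass over all games: (i, j) -> [wins, draws, losses] for i vs j
    pairs = defaultdict(lambda: [0, 0, 0])
    for i, j, result in games:
        pairs[(i, j)][{1: 0, 0.5: 1, 0: 2}[result]] += 1
    return dict(pairs)

class CachedLoglik:
    def __init__(self, pairs, pair_loglik):
        self.pairs = pairs
        self.pair_loglik = pair_loglik        # (i, j, w, d, l) -> log-likelihood term
        self.by_player = defaultdict(list)    # player -> pairs that involve them
        for ij in pairs:
            self.by_player[ij[0]].append(ij)
            self.by_player[ij[1]].append(ij)
        self.refresh_all()

    def refresh_all(self):
        # needed whenever a global parameter (thetaW, thetaD) changes
        self.terms = {ij: self.pair_loglik(*ij, *self.pairs[ij]) for ij in self.pairs}

    def update_player(self, k):
        # after moving rating k, only pairs involving k are stale
        for ij in self.by_player[k]:
            self.terms[ij] = self.pair_loglik(*ij, *self.pairs[ij])

    def total(self):
        return sum(self.terms.values())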