lucasart wrote:Michel wrote:the log-likelyhood ratio is then (since the trinomial coeffients will simplify):
ln(L1/L0)(%w,%d) = (pw1-pw0).ln(%w) + (pd1-pd0).ln(%d) + (pl1-pl0).ln(1-%w-%d)
I have difficulty parsing this formula. I think it should be (this forum should really implement mathml, as all decent forums do these days).
log(pw1/pw0).%w + log(pd1/pd0).%d + log(pl1/pl0).(1-%w-%d).
Here log(pw1/pw0), log(pd1/pd0), log(pl1/pl0) are precomputed numbers.
So you see that you are really doing a one-dimensional random walk. This is the key to analyzing the problem mathematically.
On second thoughts, we should really look at Xi not Xi/n, because the law of Xi is what we want to know, while the law of Xi/n depends on n... So the correct loglikelyhood ratio should be
Code: Select all
LLR = ln(pw1/pw0).nb_win + ln(pd1/pd0).nb_draw + ln(pl1/pl0).nb_loss
I implemented the SPTR algorithm and tested it:
=> Code
Code: Select all
#include <math.h>
static double log_likelyhood, pw1, pw0, pd1, pd0, pl1, pl0, a, b;
static void proba_elo(double elo, double *pwin, double *pdraw, double *ploss)
/* from BayesElo: computes probabilities of win and draw, given the elo difference. What this does
* is basically make a realistic assumption on (pwin,pdraw) for a given elo difference, that is for
* a given theoretical score pwin+pdraw/2 */
{
static const double draw_elo = 97.3;
*pwin = 1 / (1 + pow(10, (-elo + draw_elo) / 400));
*ploss = 1 / (1 + pow(10, ( elo + draw_elo) / 400));
*pdraw = 1.0 - *pwin - *ploss;
}
void init_SPRT(double elo0, double elo1, double alpha, double beta)
{
log_likelyhood = 0.0;
proba_elo(elo0, &pw0, &pd0, &pl0);
proba_elo(elo1, &pw1, &pd1, &pl1);
a = log(beta / (1 - alpha));
b = log((1 - beta) / alpha);
}
int stop_SPRT(int game_result)
{
if (1 == game_result)
log_likelyhood += log(pw1/pw0);
else if (0 == game_result)
log_likelyhood += log(pd1/pd0);
else
log_likelyhood += log(pl1/pl0);
if (log_likelyhood > b)
return 1; // accept H1: elo_diff = elo1
else if (log_likelyhood < a)
return -1; // accept H0: elo_diff = elo1
else
return 0; // continue sampling
}
So after a call to init_SPRT(), everytime you call stop_SPRT(), it returns:
* 0 if no conclusion can be drawn (continue sampling)
* 1 if H1 is accepted (elo=elo1)
* -1 if H0 is accepted (elo=elo0)
=> Results
In practice, what happens is the following:
1/ when the true elo difference is elo > elo1, H1 is accepted in 100% cases, which is correct.
2/ when the true elo difference is < elo0, H0 is accepted in 100% cases, which is correct.
3/ when the true elo difference is between elo0 and elo1, any kind of nonsense can come out. For example when testing elo_diff = 0 vs elo_diff = +10, with a random generator P(win)=0.4 P(draw)=0.2, I get 96% conclusions on H1, ie it concludes 96% of the time that B is stronger than A, although they have the exact same elo by construction.
In terms of speed of convergence:
Cases 1/ and 2/ offer a good speed of convergence. Best performance is achieved by SPTR when close to the bounds elo0 and elo1, while still outside the interval (elo0,elo1). When we are far from the bounds, my algorithm is the one that concludes the fastest however, and SPTR performs pretty much like the empirical bernstein algorithm.
Case 3/ SPTR concludes comparatively fast, but it gives any kind of nonsense. EBStop here is the most accurate (no mistake ever, for any elo diff != 0 even a tiny difference), and my hacky algo also makes very very few mistakes (typically between 0.0%-0.3% as we get close to elo0) and conclused about twice faster than EBStop.
Now in real life, I am testing A (my modified program) versus B (my previous version of the program). And I want to know if A is better than B, if it is equal, or if it is worse. With a margin of error delta=5%, let's say. I don't even care by how much elo yet, I just want a definite answer to that simple question, with the least error possible and a fairly small number of games.
If the true difference is between my arbitrary bounds elo0 and elo1, which happens a lot if not most of the times in reality. I'm completely screwed!
The problem is that you have no way of knowing ig you are in case 1 2 or 3 a priori, and even setting a stopping time maximum could certainly lead to a catastrophic number of errors in many cases.
So that's why, I insist on saying that discussing things theoretically is not enough. If you don't experiment, you'll never understand if and how it works... (or doesn't work in this case)