The IPON BayesElo mystery solved.

Discussion of computer chess matches and engine tournaments.

Moderators: hgm, Rebel, chrisw

Laskos
Posts: 10948
Joined: Wed Jul 26, 2006 10:21 pm
Full name: Kai Laskos

Re: The IPON BayesElo mystery solved.

Post by Laskos »

lkaufman wrote:
hgm wrote:
lkaufman wrote:It suddenly occurred to me that if BayesElo is correct, then when doing normal sequential ratings (like USCF or FIDE) wouldn't it be correct to rate each draw twice, or alternatively to only half-rate wins and losses? That doesn't sound right, but it would seem to be logical if the underlying assumption of BayesElo is right. Only in that way would one draw = one win plus one loss.
This sounds reasonable. But I thought FIDE ratings were based on a Gaussian model, and I never calculated how it would work out for that. But it cannot be excluded that draws should have a different weight from wins or losses.

Of course Sonas showed that the FIDE model is no good at all, so it could use a lot more changes than just the draw weight.
FIDE ratings may use the normal distribution rather than the logistic, I'm not sure, but the differences are insignificant except at extreme Elo differences. But FIDE considers only the player's score against the given opponents, not how it was composed of wins and draws, so for FIDE one win plus one loss is identical to two draws. If the model behind BayesElo implies that one win plus one loss is the same as one draw, that is a HUGE difference. Ironically, if FIDE were to switch to counting draws twice to conform to BayesElo, the effect would be to favor top players who had fewer draws. These days, many events award prizes based on effectively giving each player only 1/3 of a point for draws. So a revised BayesElo-like FIDE rating system would accidentally work to favor the strong players who made fewer draws, the same ones favored by this prize distribution! Of course it would have the opposite effect for tail-enders, but their ratings are generally considered less important.
This idea of double-rating draws is not of just hypothetical importance to me. As a member of the USCF rating committee, I am in a position to propose the idea for serious consideration. But one thing about it bothers me. Let's say K (the limiting value of a loss against a very weak opponent) is 32. Normally a huge upset win would get 32 and a huge upset draw would get 16, but if we give draws double credit they also get 32, which is absurd. So maybe double credit for draws just works if the players are closely matched, and the multiplier gradually decays to 1 as the rating spread increases. Does this sound right to you? Can you make a better proposal for modifying sequential ratings in the spirit of BayesElo?
I didn't follow this discussion; I just assumed the logistic distribution out of ignorance. The curve is empirical, and since the models in common use are the logistic and the Gaussian, I saw no problem with that. I don't believe that a 5-Elo difference is meaningless. I mean, if the differences are 340 points and 345 points, then the 5-point difference between those differences is probably meaningless, but a 5-point absolute difference must look similar under the Gaussian and logistic models. I don't think the problem was about the shape of the curve, logistic, Gaussian, or any similarly close empirical curve.
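For what it's worth, the closeness of the two curves outside the tails is easy to check numerically. A minimal sketch in Python, comparing the logistic expected-score curve with a Gaussian one whose slope is matched at zero (the matching scale is my own choice for illustration, not FIDE's exact table):

Code: Select all

import math
from statistics import NormalDist

def logistic_score(d):
    # Expected score at Elo difference d under the logistic model
    return 1.0 / (1.0 + 10 ** (-d / 400))

# Gaussian model whose slope at d = 0 matches the logistic curve
# (logistic slope ln(10)/1600 vs normal slope 1/(s*sqrt(2*pi)))
s = 1600 / (math.log(10) * math.sqrt(2 * math.pi))
gauss = NormalDist(0.0, s)

for d in (5, 100, 200, 400, 800):
    print(d, round(logistic_score(d), 4), round(gauss.cdf(d), 4))
# Agreement is within about 0.005 up to ~200 Elo; only toward the
# tails do the two models diverge appreciably, which is Kai's point.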

Kai
Michel
Posts: 2277
Joined: Mon Sep 29, 2008 1:50 am

Re: The IPON BayesElo mystery solved.

Post by Michel »

hgm wrote:Indeed, this function seems to have many amazing properties.
Still, come to think of it: your proof only shows the equivalence of
a draw with 1 win and 1 loss up to 2nd-order corrections in drawelo.

My original point, that the likelihood function itself depends on first-order
corrections in drawelo, still stands, I think.

If the number of wins differs from the number of losses, the likelihood
function depends on drawelo, and hence so does the extremum.

Of course, if the Elos are close, the number of wins will not be
too different from the number of losses, so the correction is probably
small. I did not take the trouble to estimate it.
The derivative (a hyperbolic secant) is also an eigen-function of the Fourier transform. (For Gaussians this is well known, but it was a surprise to me that sech is too.)
Funny, somebody in electronics asked me this recently. Like you, I told
him that I thought Gaussians were the only such functions....
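Michel's first-order point is easy to check numerically: fix a result set with more wins than losses and watch the maximum-likelihood rating difference move as drawelo changes. A minimal sketch (the counts are hypothetical; everything is on the natural logistic scale, where 1 unit is about 174 Elo):

Code: Select all

import numpy as np
from scipy.optimize import minimize_scalar

def F(u):
    # Logistic expected-score curve on the natural scale
    return 1.0 / (1.0 + np.exp(-u))

def neg_log_lik(x, W, D, L, d):
    # Model as described in the thread: P(win) = F(x-d), P(loss) = F(-x-d)
    p_win, p_loss = F(x - d), F(-x - d)
    p_draw = 1.0 - p_win - p_loss
    return -(W * np.log(p_win) + D * np.log(p_draw) + L * np.log(p_loss))

W, D, L = 60, 30, 40        # hypothetical counts with wins != losses
for d in (0.1, 0.3, 0.5):   # drawelo values on the natural scale
    res = minimize_scalar(neg_log_lik, args=(W, D, L, d),
                          bounds=(-3.0, 3.0), method='bounded')
    print(f"drawelo={d:.1f}  ML rating difference={res.x:+.4f}")
# The extremum drifts as drawelo changes; with W == L it would stay at 0.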
hgm
Posts: 27945
Joined: Fri Mar 10, 2006 10:06 am
Location: Amsterdam
Full name: H G Muller

Re: The IPON BayesElo mystery solved.

Post by hgm »

Sven Schüle wrote:As far as I understood, BayesElo does not change the weight of draws for calculating the rating, but just for the uncertainty values.
There is no explicit weighting of the games in BayesElo, but it follows implicitly from the way it calculates the maximum-likelihood estimate of the ratings. In a Bayesian treatment, the likelihood for a given rating is proportional to the probability that the observed results would occur if this rating were the true one. So if the probability of a win at rating x equals F(x-d), and of a loss F(-x-d), the likelihood for rating x with 1 loss and 1 win would be (proportional to) F(x-d)*F(-x-d). The probability of a single draw at rating (difference) x (and thus the likelihood for x with a single draw) would be 1-F(x-d)-F(-x-d).

Now for the logistic, F(x-d)*F(-x-d) happens to be the same function (up to normalization, with deviations of O(d^2)) as 1-F(x-d)-F(-x-d). Thus when BayesElo is maximizing the total likelihood for the ratings (which also contains factors for the likelihood of all other results), the win+loss gives exactly the same contribution to the likelihood as the single draw.

This is all a great coincidence. It is in general not true that the product of a win and a loss probability is proportional to the draw probability, or even to any power of the draw probability. So in other rating models an exact equivalence of win+loss to some number of draws in general does not occur. You can always compare the likelihood curve of win+loss (the product of the win and loss probability, renormalized to a surface area of 1) with that of a single draw based on their width (e.g. expressed as standard deviation), and in general these widths will not be equal. For larger numbers of games, you have to raise the likelihoods to a power, and any curve with a quadratic top raised to a high enough power will start to look like a Gaussian (the Central Limit Theorem). So in the limit of large numbers of games there will again be exact equivalence between N win+loss combinations and M draws, so that, loosely speaking, you can say that 1 draw has as much weight as N/M win+loss combinations.
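A quick numerical check of this coincidence, as a sketch (F is the plain logistic on the natural scale, and d stands in for drawElo):

Code: Select all

import numpy as np

def F(u):
    # Logistic expected-score curve (natural scale)
    return 1.0 / (1.0 + np.exp(-u))

d = 0.3                        # stand-in for drawElo
x = np.linspace(-3.0, 3.0, 7)  # rating differences

win_loss = F(x - d) * F(-x - d)      # likelihood of one win + one loss
draw = 1.0 - F(x - d) - F(-x - d)    # likelihood of one draw

# If the two are proportional, this ratio is flat in x, so a win+loss
# pair pulls the maximum-likelihood fit exactly like a single draw.
print(win_loss / draw)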
hgm
Posts: 27945
Joined: Fri Mar 10, 2006 10:06 am
Location: Amsterdam
Full name: H G Muller

Re: The IPON BayesElo mystery solved.

Post by hgm »

Michel wrote:Funny, somebody in electronics asked me this recently. Like you I told
him that I thought Gaussians were the only such functions....
Actually I once made an observation in this area which I never could find in any textbook. Gaussians are eigen-functions of the operator H = -(d/dx)^2 + x^2 = p^2 + x^2, known as the Hamiltonian for the harmonic oscillator in quantum mechanics. But the Gaussian is only the ground-state eigen-function, and you can get other eigen-functions with larger eigen-values by multiplying it by Hermite polynomials. All these eigen-functions are also eigen-functions of the Fourier transform, since an FT maps p = -i d/dx onto x and vice versa, so the operator H remains invariant under it, H being symmetric in x and p. This is all well known from elementary quantum-mechanics texts.

But this set me thinking: operators that have the same (complete set of) eigen-functions must be functions of each other, by simply mapping the eigen-values. Now H has eigen-values 2N+1, while the FT conserves the (L2-)norm, and thus can only have eigen-values with absolute value 1, i.e. on the unit circle of the complex plane. Explicitly calculating some FTs of the Hermite functions quickly revealed the pattern: their eigen-values under FT are 1, i, -1 or -i. N=0 (the Gaussian) for H maps onto 1 for FT, N=1 to i, and so on. In other words, N maps onto i^N = exp(i N PI/2).

So unlike those of H, the eigen-values of the FT are degenerate. This means that any linear combination of eigen-functions with the same eigen-value will also be an eigen-function. This is why eigen-functions other than the Hermite functions are possible. Apparently the sech (with FT eigen-value 1) can be written as a series of Hermite functions with N = 0, 4, 8, ....

But it also means that FT = exp(i PI/4 (H-1)), which I had never seen mentioned anywhere. It also means that the harmonic oscillator generates the FT through a 'time evolution' (d/dt)U(t) = i/2 (H-1) U(t), U(0)=I, where U(PI/2) then becomes the FT. I guess this should have been obvious, as a harmonic oscillator converts positions to momenta and vice versa in a quarter cycle of its oscillation (even classically), and in quantum mechanics momenta are obtained by FT of the position wave function (and vice versa).

Nevertheless, I thought it was all very remarkable and beautiful.
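The sech self-reciprocity is also easy to verify by brute-force quadrature; a sketch using the unitary FT convention (the scaling sqrt(PI/2) is what makes sech map onto itself under this convention):

Code: Select all

import numpy as np

a = np.sqrt(np.pi / 2.0)             # scaling that makes sech self-reciprocal
x = np.linspace(-40.0, 40.0, 20001)  # wide grid; sech decays exponentially
f = 1.0 / np.cosh(a * x)

for k in (0.0, 0.5, 1.0, 2.0):
    # Unitary FT: (1/sqrt(2*pi)) * integral of f(x) exp(-i*k*x) dx
    ft = np.trapz(f * np.exp(-1j * k * x), x) / np.sqrt(2.0 * np.pi)
    print(k, ft.real, 1.0 / np.cosh(a * k))  # eigen-value 1: FT(f) = f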
hgm
Posts: 27945
Joined: Fri Mar 10, 2006 10:06 am
Location: Amsterdam
Full name: H G Muller

Re: The IPON BayesElo mystery solved.

Post by hgm »

Laskos wrote:I don't know BayesElo, but what's this "BayesElo assumes that one win and one draw predicts the same as two wins and one loss."? With trinomials, draws are not equivalent to any combination of wins or losses; only when one approximates trinomials with binomials must draws be modeled in terms of wins/losses. The approximation is inherently inaccurate for large ranges of distributions. If I understood correctly, BayesElo is making some assumptions about draws as related to wins/losses? I repeat, any such assumption is an approximation, better or worse depending on the concrete results.
BayesElo simply calculates the maximum-likelihood estimate for the ratings, given the results and the rating model. For independent games the total likelihood is the product of the likelihoods of the individual game results (and the prior, if you use one).

To know the likelihood factor contributed by a draw, you would have to know the draw probability as a function of rating difference, not just the average score (= P_win + 0.5*P_draw) as a function of Elo difference.

In BayesElo this modelling assumes that players perform in a given game according to a performance rating that is drawn from a distribution of ratings around a certain mean (and it is this mean that we call their Elo rating, and that we want to determine), with a given distribution shape equal for all players (Gaussian, Logistic, ....). So far nothing special compared to other rating models.

But now there is the additional assumption that when the performance ratings the players stochastically draw for the game differ by less than a certain threshold (the 'drawElo' parameter), the game ends in a draw, and only when they differ by more than this threshold does the game end in a win for the better-performing player (i.e. the one that drew the higher performance rating).

This sounds like a pretty reasonable assumption to me. When a player with Elo rating 1800 performs like 2000 on a good day and meets a player performing like 1990, why would the 10-Elo difference not be enough to force a win when (say) 30 would have been needed, while for a 1900-rated player performing like 2000 a 10-Elo margin would already suffice for a win (and a 10-Elo deficit would have meant a loss)? I see no reason why a player performing like 2000 should have a draw margin that depends on his base rating. (Of course for individual players the draw margin could differ, e.g. because they would rather play on and lose than accept any draw, but the distribution width of individual players could also differ from what the model assumes, and we are only looking for things here that are valid on average over a large number of players.)

But as with any fit, merely tuning the model parameters for minimum deviation from the observed results is of course sloppy science; careful science requires testing the model by checking how good this 'best fit' actually is on an absolute scale (and perhaps then going for a better model).
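The model is straightforward to simulate; a minimal Monte Carlo sketch (the mean ratings, the drawElo margin, and the per-player logistic performance spread are all assumed numbers for illustration, not BayesElo's exact parameterization):

Code: Select all

import numpy as np

rng = np.random.default_rng(1)
n_games = 200_000
elo_a, elo_b = 2800, 2750     # assumed mean ratings
draw_elo = 130.0              # assumed draw margin in Elo
scale = 400.0 / np.log(10.0)  # assumed per-player performance spread

# Each player draws a performance rating around its mean for every game
perf_a = rng.logistic(elo_a, scale, n_games)
perf_b = rng.logistic(elo_b, scale, n_games)

diff = perf_a - perf_b
wins = np.mean(diff > draw_elo)            # A out-performs B by > draw_elo
draws = np.mean(np.abs(diff) <= draw_elo)  # within the draw margin
losses = np.mean(diff < -draw_elo)
print(f"W/D/L = {wins:.3f}/{draws:.3f}/{losses:.3f}, "
      f"score = {wins + draws / 2:.3f}")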
hgm
Posts: 27945
Joined: Fri Mar 10, 2006 10:06 am
Location: Amsterdam
Full name: H G Muller

Re: The IPON BayesElo mystery solved.

Post by hgm »

lkaufman wrote:Let's say K (the limiting value of a loss against a very weak opponent) is 32. Normally a huge upset win would get 32 and a huge upset draw would get 16, but if we give draws double credit they also get 32, which is absurd. So maybe double credit for draws just works if the players are closely matched, and the multiplier gradually decays to 1 as the rating spread increases. Does this sound right to you? Can you make a better proposal for modifying sequential ratings in the spirit of BayesElo?
Well, calculating ratings for humans is a different game from calculating them for engines. Human ratings can vary in time, and any viable system must be able to handle that by allowing its rating to vary in time. Engine ratings have a fixed (time-independent) value, which can be extracted from all games the engine (and others) has ever played, by running them through programs like EloStat or BayesElo. Adapting the rating in time adds a whole new problem. (Sonas also investigated how to optimize that.)

But to address your example: if the WDL model used by BayesElo were valid, it would indeed be logical for a draw against an opponent to have the same effect as a win+loss against him. As you observe, this seems a bit weird in cases where the current system would not deduct any Elo for a loss. As soon as you lose points by losing a game, a win+loss would not bring you as much as a win alone. But if there is any problem in that, I guess it is due to not changing the ratings at all on the 'expected' outcome. This can only be justified when there is no probability at all for the stronger player to lose or draw. And under those conditions both a loss and a draw would be evidence that the rating difference is in reality much smaller than the assumed one, which had better be corrected as quickly as possible in both cases.

Note that the problem you sketch only occurs when you weight draws by a factor of 2 or higher. If actual analysis of the WDL probabilities as a function of rating difference pointed to a weighting of, say, 1.5, you would not have this apparent problem.
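In sequential (K-factor) terms such a weighting would just multiply the update for drawn games; a hypothetical sketch, including Larry's pathological case (the draw_weight parameter and its values are assumptions for illustration, not an existing USCF rule):

Code: Select all

def elo_update(rating, opp_rating, score, K=32.0, draw_weight=1.5):
    # One sequential rating update; drawn games count draw_weight times
    expected = 1.0 / (1.0 + 10 ** ((opp_rating - rating) / 400))
    w = draw_weight if score == 0.5 else 1.0
    return rating + K * w * (score - expected)

# With weight 2, a huge upset draw gains K*2*(0.5-0) ~ 32, practically
# the same as K*(1-0) ~ 32 for a win -- Larry's "absurd" case, hence
# his suggestion to decay the weight toward 1 as the rating gap grows.
print(elo_update(1400, 2400, 0.5, draw_weight=2.0))  # upset draw: ~ +31.8
print(elo_update(1400, 2400, 1.0))                   # upset win:  ~ +31.9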
IWB
Posts: 1539
Joined: Thu Mar 09, 2006 2:02 pm

Re: The IPON BayesElo mystery solved.

Post by IWB »

Hello Larry,
lkaufman wrote:
Your draw percentage is normal, but it is much higher than the draw percentage assumed by BayesElo. This is the main cause of the above phenomenon. You could fix it by using the option that lets BayesElo calculate the "DrawElo" from your own data, but I'm not recommending that you do so. Reducing or eliminating "PRIOR" would also help, but again I'm not recommending that.
Just for fun this would be my current full list with a drawelo of 35.6:

Code: Select all

Rank Name                      Elo    +    - games score oppo. draws
   1 Houdini 2.0 STD          2998   14   13  2900   79%  2786   25% 
   2 Houdini 1.5a             2991   12   11  4000   79%  2779   26% 
   3 Critter 1.4 SSE42        2961   14   14  2400   77%  2773   32% 
   4 Komodo 4 SSE42           2960   14   14  2500   76%  2780   30% 
   5 Komodo 3 SSE42           2950   13   13  2800   74%  2783   31% 
   6 Deep Rybka 4.1 SSE42     2941   11   11  3700   72%  2796   37% 
   7 Deep Rybka 4             2940   10   10  4900   74%  2775   33% 
   8 Critter 1.2              2939   12   12  3100   72%  2790   37% 
   9 Komodo 2.03 DC SSE42     2938   13   13  2700   74%  2775   30% 
  10 Houdini 1.03a            2937   13   12  3200   79%  2738   30% 
  11 Stockfish 2.1.1 JA       2928   11   11  3500   69%  2797   36% 
  12 Critter 1.01 SSE42       2911   12   12  2800   70%  2777   36% 
  13 Stockfish 2.01 JA        2910   12   12  3100   72%  2762   35% 
  14 Rybka 3 mp               2893   10   10  4200   77%  2716   31% 
  15 Stockfish 1.9.1 JA       2891   12   12  3000   71%  2756   36% 
  16 Critter 0.90 SSE42       2886   11   11  3400   68%  2768   36% 
  17 Stockfish 1.7.1 JA       2880   12   12  2900   73%  2726   33% 
  18 Rybka 3 32b              2845   15   15  1700   70%  2721   35% 
  19 Stockfish 1.6.x JA       2831   12   12  2600   69%  2714   37% 
  20 Chiron 1.1a              2830   12   12  2600   56%  2791   39% 
  21 Komodo64 1.3 JA          2830   11   11  3300   59%  2773   37% 
  22 Naum 4.2                 2824    8    8  6800   58%  2776   40% 
  23 Critter 0.80             2819   12   12  2800   64%  2730   36% 
  24 Fritz 13 32b             2818   12   12  2600   54%  2792   39% 
  25 Komodo 1.2 JA            2805   10   10  3700   59%  2751   40% 
  26 Deep Shredder 12         2800    8    8  7900   55%  2769   38% 
  27 Rybka 2.3.2a mp          2800   11   11  3500   62%  2729   40% 
  28 Gull 1.2                 2796   10   10  3800   49%  2803   36% 
  29 Gull 1.1                 2792   11   11  3100   54%  2767   38% 
  30 Critter 0.70             2792   14   14  1900   58%  2740   36% 
  31 Deep Sjeng c't 2010 32b  2790    9    9  4800   49%  2799   38% 
  32 Naum 4.1                 2789   13   13  2300   64%  2706   40% 
  33 Spike 1.4 32b            2787   10   10  3900   47%  2803   38% 
  34 Komodo 1.0 JA            2785   11   11  2900   61%  2721   42% 
  35 Deep Fritz 12 32b        2781    8    8  6300   52%  2770   38% 
  36 Naum 4                   2778   12   12  2700   60%  2717   40% 
  37 Rybka 2.2n2 mp           2778   13   13  2100   62%  2705   40% 
  38 Gull 1.0a                2772   12   13  2300   55%  2745   39% 
  39 Rybka 1.2f               2767   13   12  2400   66%  2673   36% 
  40 Stockfish 1.5.1 JA       2767   14   14  1900   59%  2712   38% 
  41 Protector 1.4.0          2764   10   10  4000   45%  2799   36% 
  42 Hannibal 1.1             2763   11   11  3300   44%  2804   38% 
  43 spark-1.0 SSE42          2760    9   10  4500   44%  2801   39% 
  44 HIARCS 13.2 MP 32b       2754   10   10  4300   43%  2799   36% 
  45 Fritz 12 32b             2751   13   13  2000   55%  2724   40% 
  46 HIARCS 13.1 MP 32b       2740   10   11  3600   48%  2753   37% 
  47 Deep Junior 12.5         2738   11   11  3600   40%  2806   33% 
  48 Deep Fritz 11 32b        2736   17   16  1300   57%  2693   39% 
  49 Doch64 1.2 JA            2727   15   15  1600   51%  2719   41% 
  50 spark-0.4                2725   11   11  3100   47%  2744   39% 
  51 Stockfish 1.4 JA         2725   15   14  1700   50%  2726   38% 
  52 Zappa Mexico II          2725    7    7  9200   45%  2757   37% 
  53 Shredder Bonn 32b        2723   13   13  2200   51%  2718   36% 
  54 Protector 1.3.2 JA       2713    9    8  5300   45%  2748   38% 
  55 Critter 0.60             2713   13   13  2200   49%  2721   37% 
  56 Deep Shredder 11         2707   12   12  2700   52%  2694   36% 
  57 Doch64 09.980 JA         2704   15   15  1500   47%  2720   38% 
  58 Naum 3.1                 2699   11   11  3000   50%  2694   39% 
  59 Hannibal 1.0a            2697   10   10  4200   38%  2775   33% 
  60 Onno-1-1-1               2697   10   10  4300   45%  2731   40% 
  61 Deep Junior 12           2697   11   11  3600   38%  2777   30% 
  62 Deep Onno 1-2-70         2696    8    9  5800   36%  2787   36% 
  63 Zappa Mexico I           2695   13   13  2200   56%  2664   41% 
  64 Rybka 1.0 Beta           2694   13   13  2300   45%  2733   35% 
  65 Spark-0.3 VC(a)          2692   10   10  3600   45%  2722   40% 
  66 Onno-1-0-0               2691   17   17  1200   50%  2693   41% 
  67 Deep Sjeng WC2008        2689    9    9  5600   43%  2729   37% 
  68 Toga II 1.4 beta5c BB    2686    7    7  8300   39%  2756   37% 
  69 Strelka 2.0 B            2684   11   11  3900   32%  2804   33% 
  70 Deep Junior 11.2         2682   12   12  2900   41%  2745   31% 
  71 Umko 1.2 SSE42           2678   12   12  3100   31%  2808   34% 
  72 Hiarcs 12.1 MP 32b       2677    9    8  5600   43%  2719   38% 
  73 Deep Sjeng 3.0           2676   16   16  1400   43%  2719   34% 
  74 Shredder Classic 4 32b   2667   14   14  1800   51%  2660   38% 
  75 Critter 0.52b            2667   12   12  2600   42%  2716   39% 
  76 Naum 2.2 32b             2660   16   16  1300   47%  2674   45% 
  77 Deep Junior 11.1a        2659   12   12  2800   41%  2714   34% 
  78 Umko 1.1 SSE42           2654   11   11  3900   29%  2792   33% 
  79 Glaurung 2.2 JA          2653   12   12  2600   40%  2714   38% 
  80 Rybka 1.0 Beta 32b       2652   18   18  1100   46%  2675   37% 
  81 Deep Junior 2010         2651   11   11  3100   39%  2719   31% 
  82 HIARCS 11.2 32b          2647   14   14  1900   44%  2686   38% 
  83 Fruit 05/11/03 32b       2647    9    9  4400   40%  2704   41% 
  84 Loop 2007                2640    8    8  7700   31%  2766   33% 
  85 Toga II 1.2.1a           2636   15   15  1600   45%  2668   41% 
  86 ListMP 11                2634   12   12  2600   38%  2708   37% 
  87 Jonny 4.00 32b           2632   10   10  5000   27%  2797   28% 
  88 LoopMP 12 32b            2632   15   15  1500   42%  2677   38% 
  89 Deep Shredder 10         2627   10   10  4400   40%  2690   33% 
  90 Tornado 4.80             2627   13   14  2600   26%  2799   27% 
  91 Twisted Logic 20100131x  2623   11   11  3500   33%  2737   30% 
  92 Crafty 23.3 JA           2618   10   10  5000   25%  2797   27% 
  93 Spike 1.2 Turin 32b      2607    8    8  7700   31%  2733   33% 
  94 Deep Sjeng 2.7 32b       2588   16   16  1400   33%  2689   36% 
  95 Crafty 23.1 JA           2575   11   11  3800   26%  2735   28%

The rating changes, but the ranking stays more or less the same (at least at the top). If I used this list as my default, the speculation about why Komodo beats Houdini yet is not ranked ahead of it would be even worse ...!
On the other hand, having the top closer together would be more exciting :-) ...!

And yet, thinking about it: wouldn't calculating the list with the correct values for the draw rate be more correct, more precise ...? I don't know, I have to think about it ...

Bye
Ingo

PS: Houdini 2.0 should have a SSE42 at the end of its name as well, I will change that in the future!
lkaufman
Posts: 5981
Joined: Sun Jan 10, 2010 6:15 am
Location: Maryland USA

Re: The IPON BayesElo mystery solved.

Post by lkaufman »

IWB wrote:Hello Larry,
lkaufman wrote:
Your draw percentage is normal, but it is much higher than the draw percentage assumed by BayesElo. This is the main cause of the above phenomenon. You could fix it by using the option that lets BayesElo calculate the "DrawElo" from your own data, but I'm not recommending that you do so. Reducing or eliminating "PRIOR" would also help, but again I'm not recommending that.
Just for fun this would be my current full list with a drawelo of 35.6:

The rating changes, but the ranking stays more or less the same (at least at the top). If I used this list as my default, the speculation about why Komodo beats Houdini yet is not ranked ahead of it would be even worse ...!
On the other hand, having the top closer together would be more exciting :-) ...!

And yet, thinking about it: wouldn't calculating the list with the correct values for the draw rate be more correct, more precise ...? I don't know, I have to think about it ...

Bye
Ingo

PS: Houdini 2.0 should have a SSE42 at the end of its name as well, I will change that in the future!
You made a huge mistake here. Drawelo is not the same as the draw percentage. I think your draw percentage implies a drawelo of well above a hundred, maybe 120 to 140 or so. There is supposed to be an option in BayesElo to let the program calculate the drawelo from your data; that is the way to do what you tried to do here. I suggest you run it that way and post the result here. I predict that if you do it right, the ratings of the top engines will go up a decent amount, maybe 15 or 20 Elo.
As to whether doing this is "right": yes, technically it is. Certainly there can be no objection to switching to that option in BayesElo. The downside is that it will disagree with EloStat by much more than at present (EloStat understates rating differences), and it will correlate less well with ratings that would be obtained against humans. However, since there are no longer serious human vs. top-engine matches without handicap, maybe this is not important.
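For reference, under the model described earlier in the thread the draw rate between evenly matched opponents pins down drawelo directly (a sketch ignoring the white-advantage term):

Code: Select all

import math

def drawelo_from_draw_rate(draw_rate):
    # Invert P(draw | equal ratings) = 1 - 2*F(-drawelo),
    # with F(u) = 1/(1 + 10**(-u/400)) the logistic score curve:
    # drawelo = 400 * log10((1 + draw_rate) / (1 - draw_rate))
    return 400.0 * math.log10((1.0 + draw_rate) / (1.0 - draw_rate))

print(drawelo_from_draw_rate(0.10))  # ~ 35: roughly the 35.6 Ingo used
print(drawelo_from_draw_rate(0.37))  # ~ 135: inside Larry's "120 to 140"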
Houdini
Posts: 1471
Joined: Tue Mar 16, 2010 12:00 am

Re: The IPON BayesElo mystery solved.

Post by Houdini »

IWB wrote:PS: Houdini 2.0 should have a SSE42 at the end of its name as well, I will change that in the future!
Hello Ingo,

There is no "SSE42" version of Houdini 1.5 or 2, so why would you want to confuse your readers?
The official (and only) names of the versions you've tested are "Houdini 1.5a" and "Houdini 2.0 Standard".

Robert
IWB
Posts: 1539
Joined: Thu Mar 09, 2006 2:02 pm

Re: The IPON BayesElo mystery solved.

Post by IWB »

Houdini wrote: There is no "SSE42" version of Houdini 1.5 or 2, so why would you want to confuse your readers?
The official (and only) names of the versions you've tested are "Houdini 1.5a" and "Houdini 2.0 Standard".
True, but I have already received two questions about what the difference would be if Houdini used SSE as well ... and it is using SSE ... so which is more confusing: adding SSE to the end of Houdini's name to tell everyone that it uses SSE, or leaving it as it is and letting a lot of people think it is NOT using SSE?

The best solution would be to add an extra remark that the engine uses SSE without changing the name, but that is much more work every time I make the list than just editing the name.

I don't know ...

Bye
Ingo