The IPON BayesElo mystery solved.

IWB · Post by **IWB** » Thu Jan 05, 2012 4:28 pm

lkaufman wrote:
You made a huge mistake here. Drawelo is not the same as draw percentage...

OK, next try.

This is what I am doing right now:

------------------------------------------
readpgn 161200.pgn

pgn >result.pgn
elo
mm
exactdist
offset 2800 Deep Shredder 12
ratings >ratings_bayes.dat
details >individual.dat
los > LOS.dat
-------------------------------------------

Before the mm I have this possibility:
quote bayeselo:
"drawelo [x] ..... get[set] draw Elo"

In my example above I set this to 35.6, which is wrong - OK. But what else should I do?
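To illustrate why drawelo and draw percentage are different quantities, here is a minimal Python sketch. It assumes the win/draw/loss model described in the BayesElo documentation (the model is not spelled out in this thread), and the function names are made up for this example. Under that assumption, a drawelo of 35.6 corresponds to only about 10% draws between equal engines, while a 35.6% draw rate between equals needs a drawelo of roughly 129.

```python
import math

def f(delta):
    """Logistic Elo curve: 1 / (1 + 10^(delta / 400))."""
    return 1.0 / (1.0 + 10.0 ** (delta / 400.0))

def outcome_probs(elo_white, elo_black, drawelo, advantage=0.0):
    """(P(white wins), P(draw), P(black wins)).

    Assumed model, taken from the BayesElo documentation and not
    from this thread:
      P(white wins) = f(elo_black - elo_white - advantage + drawelo)
      P(black wins) = f(elo_white - elo_black + advantage + drawelo)
    """
    p_white = f(elo_black - elo_white - advantage + drawelo)
    p_black = f(elo_white - elo_black + advantage + drawelo)
    return p_white, 1.0 - p_white - p_black, p_black

def drawelo_for_draw_rate(draw_rate):
    """drawelo giving `draw_rate` between equal engines (advantage = 0)."""
    p_win = (1.0 - draw_rate) / 2.0
    return 400.0 * math.log10(1.0 / p_win - 1.0)

# Equal engines, drawelo set to the observed draw percentage (the mistake).
_, p_draw, _ = outcome_probs(2800, 2800, drawelo=35.6)
print(f"drawelo 35.6 -> draw rate between equals: {p_draw:.3f}")

# The drawelo that would actually produce 35.6% draws between equals.
print(f"draw rate 0.356 -> drawelo: {drawelo_for_draw_rate(0.356):.2f}")
```

In practice the value is not set by hand at all; letting the program estimate it from the games avoids this conversion.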

I have this
quote bayeselo:
prior [x] ....... get[set] prior (= number of virtual draws)

and this:
quote bayeselo:
"mm [a] [d] ...... compute maximum likelihood Elo:
a: flag to compute advantage (default = 0)
d: flag to compute elodraw (default = 0)"

I guess "elodraw" is not the same as the drawelo you are talking about.

Is there anybody out there who can tell me how to change the little file I showed above to meet your requirements?

Thanks
Ingo

PS: There is a result.pgn in my download, anyone can do that with elostat as well if he wants to ...

Sven · Post by **Sven** » Thu Jan 05, 2012 4:38 pm

"elodraw" should actually be the same as "drawelo". I think this is a tiny inaccuracy in the program's help text.

Change "mm" into "mm 0 1" to let BayesElo recalculate the "drawelo" parameter and use that one instead of the built-in default. If you instead use "mm 1 1" then also the "advantage" parameter will be treated that way, indicating the ELO equivalent of playing white.
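As a minimal sketch of that change, the following Python helper writes out the command sequence from Ingo's file with only the "mm" line altered. The helper name and its parameters are hypothetical; every command string is copied from the file posted above, and the script only builds the text, it does not run BayesElo.

```python
def bayeselo_commands(pgn_file, advantage_flag=0, drawelo_flag=1):
    """Return the command list from the post, with explicit flags on 'mm'.

    advantage_flag / drawelo_flag map to the [a] and [d] arguments of
    'mm' as quoted from the program's help text in this thread.
    """
    return [
        f"readpgn {pgn_file}",
        "pgn >result.pgn",
        "elo",
        f"mm {advantage_flag} {drawelo_flag}",  # was plain "mm"
        "exactdist",
        "offset 2800 Deep Shredder 12",
        "ratings >ratings_bayes.dat",
        "details >individual.dat",
        "los > LOS.dat",
    ]

# "mm 0 1": estimate drawelo from the games, keep the default advantage.
print("\n".join(bayeselo_commands("161200.pgn")))
```

Passing `advantage_flag=1` as well gives the "mm 1 1" variant mentioned above.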

I did the same already, albeit with an older snapshot of the IPON games from a couple of days ago, and got something like an overall rescaling that gave the top engines about 9-13 ELO points more.

EDIT: I think that most, if not all other rating lists use BayesElo the same way you did until now, since that is how it is described in the documentation as a typical example. So I would like to point out again that you are not to blame for anything here, not at all. We are only talking about a possible improvement in accuracy which, if it is confirmed (and I would really like to read Rémi's opinion about it), would be applicable for others, too.

Sven

Houdini · Post by **Houdini** » Thu Jan 05, 2012 4:42 pm

IWB wrote:True, but right now I got already 2 questions what the difference would be if Houdini would use SSE as well ... and it is using SSE ... so what is more confusing? Introducing a SSE at the end of Houdini to tell everyone that it is using SSE or leaving it as it is and let a lot of people think it is NOT using SSE?

The best would be to add an extra remark if the engine is using SSE and not change the name, but that is much more work every time I make the list than just editing the name.

I don't know ...

Bye
Ingo

I don't understand why you feel the need to specify that an engine is using "SSE42"; in 2012 that's a given.
You could simply drop all the "SSE42" tags and mention on the page that SSE42 versions of all engines are used, if available. It would definitely improve the readability of the list.

Robert

IWB · Post by **IWB** » Thu Jan 05, 2012 4:43 pm

Sven Schüle wrote:"elodraw" should actually be the same as "drawelo". I think this is a tiny inaccuracy in the program's help text.

Change "mm" into "mm 0 1" to let BayesElo recalculate the "drawelo" parameter and use that one instead of the built-in default. If you instead use "mm 1 1" then also the "advantage" parameter will be treated that way, indicating the ELO equivalent of playing white.

...

OK, thx Sven, I tried mm 0 1 already, but wasn't sure if this is really the right thing.

Anyhow, with mm 0 1 the IPON looks like this:

Code:

Rank Name                      Elo    +    - games score oppo. draws
   1 Houdini 2.0 STD          3023   11   11  2900   79%  2782   25% 
   2 Houdini 1.5a             3017   10   10  4000   79%  2774   26% 
   3 Critter 1.4 SSE42        2984   12   12  2400   77%  2766   32% 
   4 Komodo 4 SSE42           2982   12   12  2500   76%  2774   30% 
   5 Komodo64 3 SSE42         2972   11   11  2800   74%  2778   31% 
   6 Deep Rybka 4.1 SSE42     2962    9    9  3700   72%  2793   37% 
   7 Deep Rybka 4             2960    8    8  4900   74%  2769   33% 
   8 Critter 1.2              2958   10   10  3100   72%  2786   37% 
   9 Houdini 1.03a            2957   11   11  3200   79%  2726   30% 
  10 Komodo 2.03 DC SSE42     2957   11   11  2700   74%  2768   30% 
  11 Stockfish 2.1.1 JA       2947   10   10  3500   69%  2794   36% 
  12 Critter 1.01 SSE42       2929   11   11  2800   70%  2771   36% 
  13 Stockfish 2.01 JA        2927   10   10  3100   72%  2754   35% 
  14 Rybka 3 mp               2908    9    9  4200   77%  2700   31% 
  15 Stockfish 1.9.1 JA       2907   10   10  3000   71%  2747   36% 
  16 Critter 0.90 SSE42       2901   10   10  3400   68%  2761   36% 
  17 Stockfish 1.7.1 JA       2892   11   11  2900   73%  2711   33% 
  18 Rybka 3 32b              2856   13   13  1700   70%  2706   35% 
  19 Stockfish 1.6.x JA       2838   11   11  2600   69%  2697   37% 
  20 Komodo64 1.3 JA          2836   10   10  3300   59%  2767   37% 
  21 Chiron 1.1a              2834   11   11  2600   56%  2787   39% 
  22 Naum 4.2                 2828    7    7  6800   58%  2770   40% 
  23 Critter 0.80             2823   10   10  2800   64%  2716   36% 
  24 Fritz 13 32b             2820   10   11  2600   54%  2788   39% 
  25 Komodo 1.2 JA            2807    9    9  3700   59%  2741   40% 
  26 Rybka 2.3.2a mp          2802    9    9  3500   62%  2715   40% 
  27 Deep Shredder 12         2800    6    6  7900   55%  2762   38% 
  28 Gull 1.2                 2794    9    9  3800   49%  2802   36% 
  29 Critter 0.70             2791   12   12  1900   58%  2728   36% 
  30 Gull 1.1                 2791   10   10  3100   54%  2760   38% 
  31 Naum 4.1                 2788   11   11  2300   64%  2688   40% 
  32 Deep Sjeng c't 2010 32b  2787    8    8  4800   49%  2796   38% 
  33 Komodo 1.0 JA            2783   10   10  2900   61%  2706   42% 
  34 Spike 1.4 32b            2783    9    9  3900   47%  2802   38% 
  35 Deep Fritz 12 32b        2778    7    7  6300   52%  2764   38% 
  36 Rybka 2.2n2 mp           2777   12   11  2100   62%  2686   40% 
  37 Naum 4                   2776   10   10  2700   60%  2701   40% 
  38 Gull 1.0a                2768   11   11  2300   55%  2734   39% 
  39 Rybka 1.2f               2765   11   11  2400   66%  2649   36% 
  40 Stockfish 1.5.1 JA       2763   12   12  1900   59%  2696   38% 
  41 Protector 1.4.0          2756    9    9  4000   45%  2796   36% 
  42 Hannibal 1.1             2755   10    9  3300   44%  2803   38% 
  43 spark-1.0 SSE42          2752    8    8  4500   44%  2799   39% 
  44 HIARCS 13.2 MP 32b       2744    8    8  4300   43%  2797   36% 
  45 Fritz 12 32b             2744   12   12  2000   55%  2709   40% 
  46 HIARCS 13.1 MP 32b       2729    9    9  3600   48%  2744   37% 
  47 Deep Fritz 11 32b        2726   15   14  1300   57%  2672   39% 
  48 Deep Junior 12.5         2726    9    9  3600   40%  2804   33% 
  49 Doch64 1.2 JA            2714   13   13  1600   51%  2704   41% 
  50 Stockfish 1.4 JA         2712   13   13  1700   50%  2712   38% 
  51 spark-0.4                2711   10   10  3100   47%  2734   39% 
  52 Zappa Mexico II          2711    6    6  9200   45%  2748   37% 
  53 Shredder Bonn 32b        2709   11   11  2200   51%  2703   36% 
  54 Critter 0.60             2697   11   11  2200   49%  2706   37% 
  55 Protector 1.3.2 JA       2697    8    8  5300   45%  2738   38% 
  56 Deep Shredder 11         2691   10   10  2700   52%  2674   36% 
  57 Doch64 09.980 JA         2685   14   14  1500   47%  2705   38% 
  58 Naum 3.1                 2680   10   10  3000   50%  2674   39% 
  59 Onno-1-1-1               2678    8    8  4300   45%  2718   40% 
  60 Zappa Mexico I           2678   11   11  2200   56%  2638   41% 
  61 Hannibal 1.0a            2678    9    9  4200   38%  2769   33% 
  62 Deep Onno 1-2-70         2677    8    8  5800   36%  2782   36% 
  63 Deep Junior 12           2676    9    9  3600   38%  2772   30% 
  64 Rybka 1.0 Beta           2675   11   11  2300   45%  2720   35% 
  65 Spark-0.3 VC(a)          2672    9    9  3600   45%  2707   40% 
  66 Onno-1-0-0               2670   15   15  1200   50%  2673   41% 
  67 Deep Sjeng WC2008        2667    8    8  5600   43%  2716   37% 
  68 Toga II 1.4 beta5c BB    2664    6    7  8300   39%  2747   37% 
  69 Strelka 2.0 B            2663    9    9  3900   32%  2802   33% 
  70 Deep Junior 11.2         2659   10   10  2900   41%  2734   31% 
  71 Umko 1.2 SSE42           2655   10   10  3100   31%  2807   34% 
  72 Hiarcs 12.1 MP 32b       2653    7    7  5600   43%  2704   38% 
  73 Deep Sjeng 3.0           2651   14   14  1400   43%  2705   34% 
  74 Shredder Classic 4 32b   2642   12   12  1800   51%  2632   38% 
  75 Critter 0.52b            2641   10   11  2600   42%  2699   39% 
  76 Naum 2.2 32b             2633   14   14  1300   47%  2651   45% 
  77 Deep Junior 11.1a        2631   10   10  2800   41%  2697   34% 
  78 Umko 1.1 SSE42           2627    9    9  3900   29%  2789   33% 
  79 Rybka 1.0 Beta 32b       2624   16   16  1100   46%  2652   37% 
  80 Glaurung 2.2 JA          2622   11   11  2600   40%  2698   38% 
  81 Deep Junior 2010         2621   10   10  3100   39%  2704   31% 
  82 HIARCS 11.2 32b          2617   12   12  1900   44%  2664   38% 
  83 Fruit 05/11/03 32b       2616    8    8  4400   40%  2686   41% 
  84 Loop 2007                2610    7    7  7700   31%  2759   33% 
  85 Toga II 1.2.1a           2604   13   13  1600   45%  2643   41% 
  86 Jonny 4.00 32b           2603    9    9  5000   27%  2795   28% 
  87 ListMP 11                2601   11   11  2600   38%  2691   37% 
  88 LoopMP 12 32b            2598   14   14  1500   42%  2654   38% 
  89 Tornado 4.80             2597   11   12  2600   26%  2797   27% 
  90 Deep Shredder 10         2592    8    8  4400   40%  2669   33% 
  91 Twisted Logic 20100131x  2589   10    9  3500   33%  2725   30% 
  92 Crafty 23.3 JA           2587    8    9  5000   25%  2795   27% 
  93 Spike 1.2 Turin 32b      2569    7    7  7700   31%  2721   33% 
  94 Deep Sjeng 2.7 32b       2545   14   14  1400   33%  2669   36% 
  95 Crafty 23.1 JA           2533    9    9  3800   26%  2722   28%

Different, but compared with the online list not that different. We would have exactly the same questions about the meaningless individual 100-game results ...

Bye
Ingo

Sven · Post by **Sven** » Thu Jan 05, 2012 4:45 pm

IWB wrote:OK, thx Sven, I tried mm 0 1 already, but wasn't sure if this is really the right thing.

You were fast ... please note also my EDIT above.
Sven

IWB · Post by **IWB** » Thu Jan 05, 2012 5:02 pm

Houdini wrote: ...
I don't understand why you feel the need to specify that an engine is using "SSE42", in 2012 that's a given.
You could simply drop all the "SSE42" tags and mention on the page that SSE42 versions of all engines are used, if available. It would definitely improve the readability of the list.

I fully agree on the 'given' and the 'readability', but it is naive to believe that people read the conditions (neither mine nor the ones of the other lists)! To this day I get requests asking why my rating is lower than that of ... (put in any list you want).

So, by doing as you proposed everything would be clear and I would have more work!

Again: It is not easy!

Bye
Ingo

IWB · Post by **IWB** » Thu Jan 05, 2012 5:08 pm

Sven Schüle wrote:
IWB wrote:OK, thx Sven, I tried mm 0 1 already, but wasn't sure if this is really the right thing.
You were fast ... please note also my EDIT above.
Sven

I read the edit now and that is fine. I don't feel offended in any way by the discussion. Actually, if the mathematicians (isn't there something magic in that word?) agree here, I can change immediately.

There are just three arguments against change:

1. Looking at the different results, the discussion about the individual performance of engines will be the same.
2. Comparing the IPON with the other lists will become even more complicated.
3. The difference between the live rating calculation (pure Elo formula) and the final result would be bigger, which means more discussions.

Anyhow, thanks, it is an interesting discussion as far as I can follow.

Bye
Ingo

lkaufman · Post by **lkaufman** » Thu Jan 05, 2012 7:47 pm

IWB wrote:
Sven Schüle wrote:
IWB wrote:OK, thx Sven, I tried mm 0 1 already, but wasn't sure if this is really the right thing.
You were fast ... please note also my EDIT above.
Sven
I read the edit now and that is fine. I don't feel offended in any way by the discussion. Actually, if the mathematicians (isn't there something magic in that word?) agree here, I can change immediately.

There are just three arguments against change:

1. Looking at the different results, the discussion about the individual performance of engines will be the same.
2. Comparing the IPON with the other lists will become even more complicated.
3. The difference between the live rating calculation (pure Elo formula) and the final result would be bigger, which means more discussions.

Anyhow, thanks, it is an interesting discussion as far as I can follow.

Bye
Ingo

I disagree only about point 1. The complaint is that your ratings for top engines are somewhat lower than the average of the performances. This would be partly (but not fully) corrected by making the change. I'm a bit puzzled as to why it isn't almost fully corrected by it.

So you would have fewer complaints on point number 1, but more on points 2 and 3. So if you want to leave everything as is, that's fine; it's a tradeoff between the above considerations and the fact that the new way is technically more correct.

Ajedrecista · Post by **Ajedrecista** » Fri Jan 06, 2012 11:46 am

Hello:

I have compared two formulæ for calculating standard deviations: the one I usually use:

Code:

sd = sqrt{(1/n)·[mu·(1 - mu) - D/4]}

And another one that I found thanks to this recent post by user Ruxy Sylwyka:

http://u.cs.biu.ac.il/~koppel/papers/expertga-oct21.pdf

Code:

s = sqrt{[1/(n - 1)]·[W·(1 - mu)² + D·(1/2 - mu)² + L·mu²]}

It can be found on the last page of a 16-page PDF. It is also here (second post):

http://www.open-aurec.com/wbforum/viewtopic.php?t=949

If we do not take into account (1/n) and [1/(n - 1)] (which are very similar when n grows), here is my comparison:

n = number of games
w = number of won games
d = number of drawn games
l = number of lost games

n = w + d + l
W = w/n ; D = d/n ; L = l/n
W + D + L = 1

mu = (w + d/2)/n = W + D/2
1 - mu = (d/2 + l)/n = D/2 + L

Comparison without square roots, (1/n) and [1/(n - 1)]:

W·(1 - mu)² + D·(1/2 - mu)² + L·mu² = mu·(1 - mu) - D/4
W·(1 - 2·mu + mu²) + D·(1/4 - mu + mu²) + L·mu² = mu - mu² - D/4
W - 2·mu·W + mu²·W + D/4 - mu·D + mu²·D + L·mu² = mu - mu² - D/4
mu²·(W + D + L + 1) + mu·(-2·W - D - 1) + W + D/2 = 0
mu²·(1 + 1) + mu·(-2·W - D - 1) + mu = 0
2·mu² - mu·(2W + D) = 0
2·mu² - 2·mu² = 0

If I am not wrong, the only difference between these two formulæ is (1/n) and [1/(n - 1)].
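A short numerical check of this conclusion, as a Python sketch with hypothetical function names. It uses the same 100-game record (+40 =30 -30) as the example that follows. One possible reading, which is not confirmed anywhere in this thread, is that the paper's s with W, D, L as game counts is the sample standard deviation of a single game result; dividing it by sqrt(n) gives the standard deviation of the mean score, which then differs from sd only by the factor sqrt[n/(n - 1)].

```python
import math

def sd_usual(w, d, l):
    """sd = sqrt{(1/n)·[mu·(1 - mu) - D/4]}, with D the draw ratio d/n."""
    n = w + d + l
    mu = (w + d / 2) / n
    return math.sqrt((mu * (1 - mu) - (d / n) / 4) / n)

def s_paper_counts(w, d, l):
    """The paper's formula with w, d, l taken as game counts."""
    n = w + d + l
    mu = (w + d / 2) / n
    squares = w * (1 - mu) ** 2 + d * (0.5 - mu) ** 2 + l * mu ** 2
    return math.sqrt(squares / (n - 1))

w, d, l = 40, 30, 30
n = w + d + l

s = s_paper_counts(w, d, l)          # spread of one game result
s_mean = s / math.sqrt(n)            # spread of the mean score over n games
sd = sd_usual(w, d, l)

print("s (counts)      :", round(s, 5))
print("s / sqrt(n)     :", round(s_mean, 5))
print("sd              :", round(sd, 5))
print("ratio, expected :", s_mean / sd, math.sqrt(n / (n - 1)))
```

Using ratios instead of counts inside the paper's formula, as done in this post, gives the same number as s / sqrt(n).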

In the paper, W is the number of won games, while I have used the win ratio (the same for D and L). I have done this because the standard deviation of the paper is abnormally large, so I suppose that it was a little error. An example:

Code:

n = 100 (+40 = 30 - 30)

mu = 0.55

s = sqrt{(1/99)·[40·(0.45)² + 30·(-0.05)² + 30·(0.55)²]} ~ 0.41742 ; (1.96)·s ~ 0.81815

95% confidence ~ 1.96-sigma: mu ± (1.96)·s ~ [-0.26815, 1.36815] (Strange for me).

-------------------------------------------------

sd = sqrt{(1/n)·[mu·(1 - mu) - D/4]}
rd = 400·log[mu/(1 - mu)]
rd(+) = 400·log{[mu + k·(sd)]/[1 - mu - k·(sd)]}
rd(-) = 400·log{[mu - k·(sd)]/[1 - mu + k·(sd)]}
e(+) = [rd(+)] - rd > 0
e(-) = [rd(-)] - rd < 0
<e> = ±[|e(+)| + |e(-)|]/2 = ±{[e(+)] - [e(-)]}/2

These are part of my calculations; regarding <e>, it can be calculated in this way (just operating with the properties of logarithms):

<e> = ± 200·log{ {[mu + k·(sd)]·[1 - mu + k·(sd)]} / {[mu - k·(sd)]·[1 - mu - k·(sd)]} }

Where k gives the confidence level (k ~ 1.96 for 95% confidence, k = 2 for ~ 95.45% confidence...). Comments and/or corrections are welcome.
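The formulas above can be sketched in Python as follows. The function names are made up for this example, log is taken as the base-10 logarithm, and rd is assumed to be the usual Elo difference 400·log[mu/(1 - mu)]. The sketch only works while mu ± k·sd stays strictly between 0 and 1.

```python
import math

def elo_diff(p):
    """Elo difference for an expected score p, with 0 < p < 1."""
    return 400.0 * math.log10(p / (1.0 - p))

def elo_error_margins(w, d, l, k=1.96):
    """Return (rd, e_plus, e_minus, e_avg) for a w/d/l record."""
    n = w + d + l
    mu = (w + d / 2) / n
    sd = math.sqrt((mu * (1 - mu) - (d / n) / 4) / n)
    rd = elo_diff(mu)
    e_plus = elo_diff(mu + k * sd) - rd    # > 0
    e_minus = elo_diff(mu - k * sd) - rd   # < 0
    e_avg = (e_plus - e_minus) / 2
    return rd, e_plus, e_minus, e_avg

def e_avg_closed_form(w, d, l, k=1.96):
    """<e> from the single-logarithm expression."""
    n = w + d + l
    mu = (w + d / 2) / n
    ks = k * math.sqrt((mu * (1 - mu) - (d / n) / 4) / n)
    num = (mu + ks) * (1 - mu + ks)
    den = (mu - ks) * (1 - mu - ks)
    return 200.0 * math.log10(num / den)

rd, e_plus, e_minus, e_avg = elo_error_margins(40, 30, 30)
print(f"rd = {rd:.1f}, e(+) = {e_plus:.1f}, e(-) = {e_minus:.1f}, <e> = {e_avg:.1f}")
print(f"closed form <e> = {e_avg_closed_form(40, 30, 30):.1f}")
```

The two ways of computing <e> agree, which is just the logarithm identity used in the post.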

Regards from Spain.

Ajedrecista.

IWB · Post by **IWB** » Fri Jan 06, 2012 3:06 pm

Hello Larry

lkaufman wrote:
IWB wrote: 1. Looking at the different results, the discussion about the individual performance of engines will be the same.
I disagree only about point 1. The complaint is that your ratings for top engines are somewhat lower than the average of the performances. This would be partly (but not fully) corrected by making the change. I'm a bit puzzled as to why it isn't almost fully corrected by it.

So you would have less complaints on point number 1, ...

I checked that, and this is the result:

Code:

 Default:
   4 Komodo 4 SSE42           2975 2500.0 (1892.5 : 607.5)                               Perf.
                                   100.0 ( 51.5 :  48.5) Houdini 2.0 STD          3016    3026
                                   100.0 ( 45.0 :  55.0) Critter 1.4 SSE42        2977    2942
                                   100.0 ( 51.5 :  48.5) Deep Rybka 4.1 SSE42     2956    2966
                                   100.0 ( 53.5 :  46.5) Critter 1.2              2952    2976
                                   100.0 ( 52.5 :  47.5) Stockfish 2.1.1 JA       2941    2958
                                   100.0 ( 65.5 :  34.5) Chiron 1.1a              2833    2944
                                   100.0 ( 69.5 :  30.5) Naum 4.2                 2827    2970
                                   100.0 ( 70.0 :  30.0) Fritz 13 32b             2819    2966
                                   100.0 ( 68.0 :  32.0) Deep Shredder 12         2800    2930
                                   100.0 ( 75.0 :  25.0) Gull 1.2                 2795    2985
                                   100.0 ( 79.0 :  21.0) Deep Sjeng c't 2010 32b  2788    3018
                                   100.0 ( 77.0 :  23.0) Spike 1.4 32b            2785    2994
                                   100.0 ( 78.5 :  21.5) Protector 1.4.0          2759    2983
                                   100.0 ( 80.0 :  20.0) Hannibal 1.1             2758    2998
                                   100.0 ( 85.0 :  15.0) spark-1.0 SSE42          2755    3056
                                   100.0 ( 87.5 :  12.5) HIARCS 13.2 MP 32b       2748    3086
                                   100.0 ( 83.5 :  16.5) Deep Junior 12.5         2731    3012
                                   100.0 ( 88.5 :  11.5) Zappa Mexico II          2716    3070
                                   100.0 ( 90.5 :   9.5) Deep Onno 1-2-70         2684    3075
                                   100.0 ( 90.5 :   9.5) Strelka 2.0 B            2671    3062
                                   100.0 ( 87.5 :  12.5) Umko 1.2 SSE42           2664    3002
                                   100.0 ( 88.0 :  12.0) Loop 2007                2621    2967
                                   100.0 ( 89.5 :  10.5) Jonny 4.00 32b           2614    2986
                                   100.0 ( 93.0 :   7.0) Tornado 4.80             2608    3057
                                   100.0 ( 92.5 :   7.5) Crafty 23.3 JA           2598    3034

                                                                                    Aver. 3003

 DrawElo:
   4 Komodo 4 SSE42 mm01      2982 2500.0 (1892.5 : 607.5)                               Perf.
                                   100.0 ( 51.5 :  48.5) Houdini 2.0 STD          3023    3033
                                   100.0 ( 45.0 :  55.0) Critter 1.4 SSE42        2984    2949
                                   100.0 ( 51.5 :  48.5) Deep Rybka 4.1 SSE42     2962    2972
                                   100.0 ( 53.5 :  46.5) Critter 1.2              2958    2982
                                   100.0 ( 52.5 :  47.5) Stockfish 2.1.1 JA       2947    2964
                                   100.0 ( 65.5 :  34.5) Chiron 1.1a              2834    2945
                                   100.0 ( 69.5 :  30.5) Naum 4.2                 2828    2971
                                   100.0 ( 70.0 :  30.0) Fritz 13 32b             2820    2967
                                   100.0 ( 68.0 :  32.0) Deep Shredder 12         2800    2930
                                   100.0 ( 75.0 :  25.0) Gull 1.2                 2794    2984
                                   100.0 ( 79.0 :  21.0) Deep Sjeng c't 2010 32b  2787    3017
                                   100.0 ( 77.0 :  23.0) Spike 1.4 32b            2783    2992
                                   100.0 ( 78.5 :  21.5) Protector 1.4.0          2756    2980
                                   100.0 ( 80.0 :  20.0) Hannibal 1.1             2755    2995
                                   100.0 ( 85.0 :  15.0) spark-1.0 SSE42          2752    3053
                                   100.0 ( 87.5 :  12.5) HIARCS 13.2 MP 32b       2744    3082
                                   100.0 ( 83.5 :  16.5) Deep Junior 12.5         2726    3007
                                   100.0 ( 88.5 :  11.5) Zappa Mexico II          2711    3065
                                   100.0 ( 90.5 :   9.5) Deep Onno 1-2-70         2677    3068
                                   100.0 ( 90.5 :   9.5) Strelka 2.0 B            2663    3054
                                   100.0 ( 87.5 :  12.5) Umko 1.2 SSE42           2655    2993
                                   100.0 ( 88.0 :  12.0) Loop 2007                2610    2956
                                   100.0 ( 89.5 :  10.5) Jonny 4.00 32b           2603    2975
                                   100.0 ( 93.0 :   7.0) Tornado 4.80             2597    3046
                                   100.0 ( 92.5 :   7.5) Crafty 23.3 JA           2587    3023

                                                                                    Aver. 3000

While at default the difference between the average individual performance and the final rating is 28 Elo (3003 vs. 2975), with the recalculated drawelo it is just 18 Elo (3000 vs. 2982). So, overall "mm 0 1" seems to give the better result ...
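The "Perf." values in the two tables appear to follow the plain logistic Elo formula, opponent rating plus 400·log[p/(1 - p)]; that is an inference from the numbers, not something stated in the post. A minimal Python sketch under that assumption, with a hypothetical function name and three rows taken from the default table:

```python
import math

def performance(score_pct, opp_rating):
    """Performance rating against one opponent (assumed logistic formula)."""
    p = score_pct / 100.0
    return opp_rating + 400.0 * math.log10(p / (1.0 - p))

# (score in %, opponent rating) from three rows of the default table.
rows = [
    (51.5, 3016),  # Houdini 2.0 STD, table shows 3026
    (45.0, 2977),  # Critter 1.4 SSE42, table shows 2942
    (93.0, 2608),  # Tornado 4.80, table shows 3057
]

perfs = [performance(score, rating) for score, rating in rows]
for (score, rating), perf in zip(rows, perfs):
    print(f"{score:5.1f}% vs {rating}: perf {perf:.0f}")

# The "Aver." line is then the plain mean over all opponents in a table;
# here it is taken over these three sample rows only.
print("mean of the sample rows:", round(sum(perfs) / len(perfs)))
```

Averaging per-opponent performances this way weights a 93% result against a weak engine the same as a 51.5% result against a strong one, which is part of why the average and the fitted rating need not match.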

Points 2 and 3 of my previous posting still need to be taken care of.

Bye
Ingo
