The IPON BayesElo mystery solved.

Discussion of computer chess matches and engine tournaments.

Moderators: hgm, Rebel, chrisw

lkaufman
Posts: 5981
Joined: Sun Jan 10, 2010 6:15 am
Location: Maryland USA

Re: The IPON BayesElo mystery solved.

Post by lkaufman »

hgm wrote:
lkaufman wrote: I won't argue about what BayesElo "should" do, but the bottom line is that the assumed drawelo value has a major effect. Mark Watkins gave me an extreme example showing that for a 76% score, which "should" give a 200 Elo difference, BayesElo with default values will output anywhere from about 160 to 240 Elo difference depending on the percentage of draws.
Yes, of course it depends on the percentage of draws. But that is an entirely different matter from it depending on the (assumed) drawValue.

With the hyperbolic secant it uses as its Elo model, a single draw is equivalent to one loss plus one win. (By a mathematical coincidence, (d/dx)F(x) = F(x)·(1 - F(x)) = F(x)·F(-x); with a Gaussian distribution you would not have that.) This is true for any assumed 'drawValue'. It just means that a 75% score obtained by one draw and one win predicts exactly the same as two wins and one loss (after 'expanding' the draw), which is a 67% score!

This conclusion is only dependent on the shape of the rating curve.
So this is clearly a major issue for computer testing, because the actual draw percentage is not close to the figure implied by the defaults. Whether this is a flaw in BayesElo or just a feature I leave to others to debate.
I am not sure what the problem is, then. If you believe the sech rating model to be correct, this is a true effect (draws are stronger evidence that the players are close in rating than wins and losses with the same total score). Any analysis based on that model should get it too. If you don't believe that, you should base the analysis on another score-vs-Elo curve.
Okay, that's a very informative post. You say that BayesElo assumes that one win and one draw predicts the same as two wins and one loss. This seems wrong to me. I would think one draw should be considered like (half a win + half a loss), not like (one win plus one loss). In other words one win and one loss are like two draws, not like one draw. At least that's the assumption of the real Elo rating system and of Elostat, as well as the way events are scored. To me it makes BayesElo suspect, although I'm open-minded on this and could be convinced otherwise. Do you really believe that this model underlying BayesElo is more correct than the normal assumption?
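To make sure I understand the claim, here is a little numeric check I put together (my own sketch of the trinomial model as you describe it, with an arbitrary drawelo of 100; this is not BayesElo's actual code):

Code: Select all

import math

def F(x):
    # Logistic Elo curve: expected score for a rating advantage of x
    return 1.0 / (1.0 + 10.0 ** (-x / 400.0))

def likelihood(x, wins, draws, losses, drawelo=100.0):
    # Trinomial model: P(win) = F(x - drawelo), P(loss) = F(-x - drawelo),
    # P(draw) = whatever is left over
    pw = F(x - drawelo)
    pl = F(-x - drawelo)
    return pw ** wins * (1.0 - pw - pl) ** draws * pl ** losses

for x in (-200, -50, 0, 50, 200):
    # Ratio of the likelihood of "one win + one draw" to "two wins + one loss"
    print(x, likelihood(x, 1, 1, 0) / likelihood(x, 2, 0, 1))

The ratio comes out as the same constant (about 2.16 with this drawelo) at every Elo difference, so under this model the two samples really do carry identical information about the rating gap, which is your point. Whether the model itself is right is of course still the question.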
User avatar
michiguel
Posts: 6401
Joined: Thu Mar 09, 2006 8:30 pm
Location: Chicago, Illinois, USA

Re: The IPON BayesElo mystery solved.

Post by michiguel »

lkaufman wrote:
ThatsIt wrote:
lkaufman wrote: [...snip...]
Others who have written in this thread about why averaging performances doesn't predict the resulting rating are ignoring the fact that the ratings in question are calculated by BayesElo. The fact that they don't match the output of other calculations such as EloStat is irrelevant if BayesElo is used.
Look at your own post:
http://www.talkchess.com/forum/viewtopi ... 22&t=41655
The mystery you wrote about was displayed by EloStat, not by BayesElo!

Best wishes,
G.S.
The phenomenon occurs with both EloStat and BayesElo. With EloStat there is no mystery: it's due to the incorrectness of the model itself, which (improperly) averages ratings. With BayesElo the reasons for the discrepancy are much less obvious. I attributed it to PRIOR, but how much of an effect that has is not yet clear to me.
FWIW, these are the ratings calculated with my program ordo, normalized so that DS12 is 2800, as IPON does.

Miguel

Code: Select all

Average rating = 2745.500000

SORTED by RATING
               Houdini 2.0 STD: 2277.5 /  2900, 78.5%   rating = 3035.7
                  Houdini 1.5a: 3162.5 /  4000, 79.1%   rating = 3030.0
             Critter 1.4 SSE42: 1853.0 /  2400, 77.2%   rating = 3001.0
                Komodo 4 SSE42: 1892.5 /  2500, 75.7%   rating = 2995.3
              Komodo64 3 SSE42: 2075.5 /  2800, 74.1%   rating = 2983.8
          Deep Rybka 4.1 SSE42: 2655.0 /  3700, 71.8%   rating = 2978.3
                   Critter 1.2: 2232.0 /  3100, 72.0%   rating = 2973.6
                  Deep Rybka 4: 3627.0 /  4900, 74.0%   rating = 2973.2
                 Houdini 1.03a: 2520.0 /  3200, 78.8%   rating = 2971.3
          Komodo 2.03 DC SSE42: 1985.5 /  2700, 73.5%   rating = 2966.5
            Stockfish 2.1.1 JA: 2426.5 /  3500, 69.3%   rating = 2957.9
            Critter 1.01 SSE42: 1970.0 /  2800, 70.4%   rating = 2938.3
             Stockfish 2.01 JA: 2246.0 /  3100, 72.5%   rating = 2937.7
            Stockfish 1.9.1 JA: 2131.0 /  3000, 71.0%   rating = 2916.0
                    Rybka 3 mp: 3228.0 /  4200, 76.9%   rating = 2915.9
            Critter 0.90 SSE42: 2327.5 /  3400, 68.5%   rating = 2908.3
            Stockfish 1.7.1 JA: 2131.0 /  2900, 73.5%   rating = 2901.0
                   Rybka 3 32b: 1191.5 /  1700, 70.1%   rating = 2857.8
            Stockfish 1.6.x JA: 1792.5 /  2600, 68.9%   rating = 2842.2
               Komodo64 1.3 JA: 1946.0 /  3300, 59.0%   rating = 2837.3
                   Chiron 1.1a: 1453.5 /  2600, 55.9%   rating = 2835.0
                      Naum 4.2: 3927.5 /  6800, 57.8%   rating = 2831.7
                  Critter 0.80: 1795.5 /  2800, 64.1%   rating = 2823.3
                  Fritz 13 32b: 1406.5 /  2600, 54.1%   rating = 2820.9
                 Komodo 1.2 JA: 2175.0 /  3700, 58.8%   rating = 2807.9
               Rybka 2.3.2a mp: 2172.5 /  3500, 62.1%   rating = 2804.0
              Deep Shredder 12: 4346.0 /  7900, 55.0%   rating = 2800.0
                      Gull 1.2: 1854.5 /  3800, 48.8%   rating = 2794.0
                  Critter 0.70: 1107.0 /  1900, 58.3%   rating = 2790.6
                      Gull 1.1: 1675.5 /  3100, 54.0%   rating = 2789.9
                      Naum 4.1: 1465.0 /  2300, 63.7%   rating = 2788.8
       Deep Sjeng c't 2010 32b: 2333.0 /  4800, 48.6%   rating = 2786.8
                 Komodo 1.0 JA: 1756.5 /  2900, 60.6%   rating = 2784.3
                 Spike 1.4 32b: 1843.5 /  3900, 47.3%   rating = 2781.7
             Deep Fritz 12 32b: 3268.5 /  6300, 51.9%   rating = 2777.3
                        Naum 4: 1628.5 /  2700, 60.3%   rating = 2775.3
                Rybka 2.2n2 mp: 1311.5 /  2100, 62.5%   rating = 2774.7
                     Gull 1.0a: 1254.0 /  2300, 54.5%   rating = 2766.0
            Stockfish 1.5.1 JA: 1128.5 /  1900, 59.4%   rating = 2761.8
                    Rybka 1.2f: 1578.5 /  2400, 65.8%   rating = 2760.8
               Protector 1.4.0: 1789.5 /  4000, 44.7%   rating = 2755.7
                  Hannibal 1.1: 1436.5 /  3300, 43.5%   rating = 2751.9
               spark-1.0 SSE42: 1965.5 /  4500, 43.7%   rating = 2750.5
            HIARCS 13.2 MP 32b: 1850.0 /  4300, 43.0%   rating = 2743.0
                  Fritz 12 32b: 1091.0 /  2000, 54.5%   rating = 2740.3
            HIARCS 13.1 MP 32b: 1734.5 /  3600, 48.2%   rating = 2727.3
              Deep Junior 12.5: 1442.5 /  3600, 40.1%   rating = 2725.7
             Deep Fritz 11 32b:  744.5 /  1300, 57.3%   rating = 2720.8
                 Doch64 1.2 JA:  820.5 /  1600, 51.3%   rating = 2710.8
                     spark-0.4: 1458.0 /  3100, 47.0%   rating = 2708.9
              Stockfish 1.4 JA:  849.0 /  1700, 49.9%   rating = 2708.5
               Zappa Mexico II: 4152.0 /  9200, 45.1%   rating = 2707.6
             Shredder Bonn 32b: 1119.0 /  2200, 50.9%   rating = 2705.4
                  Critter 0.60: 1072.0 /  2200, 48.7%   rating = 2694.0
            Protector 1.3.2 JA: 2361.5 /  5300, 44.6%   rating = 2693.7
              Deep Shredder 11: 1412.0 /  2700, 52.3%   rating = 2685.4
              Doch64 09.980 JA:  710.0 /  1500, 47.3%   rating = 2682.7
                    Onno-1-1-1: 1923.0 /  4300, 44.7%   rating = 2674.9
                Deep Junior 12: 1356.0 /  3600, 37.7%   rating = 2674.8
                 Hannibal 1.0a: 1600.0 /  4200, 38.1%   rating = 2674.3
                      Naum 3.1: 1514.5 /  3000, 50.5%   rating = 2673.3
                Zappa Mexico I: 1221.0 /  2200, 55.5%   rating = 2672.7
              Deep Onno 1-2-70: 2109.0 /  5800, 36.4%   rating = 2672.4
                Rybka 1.0 Beta: 1023.5 /  2300, 44.5%   rating = 2671.7
               Spark-0.3 VC(a): 1625.0 /  3600, 45.1%   rating = 2668.7
                    Onno-1-0-0:  594.5 /  1200, 49.5%   rating = 2666.1
             Deep Sjeng WC2008: 2434.5 /  5600, 43.5%   rating = 2663.7
         Toga II 1.4 beta5c BB: 3255.5 /  8300, 39.2%   rating = 2660.0
              Deep Junior 11.2: 1176.0 /  2900, 40.6%   rating = 2658.6
                 Strelka 2.0 B: 1255.5 /  3900, 32.2%   rating = 2656.5
            Hiarcs 12.1 MP 32b: 2427.5 /  5600, 43.3%   rating = 2651.1
                Umko 1.2 SSE42:  956.0 /  3100, 30.8%   rating = 2649.0
                Deep Sjeng 3.0:  601.5 /  1400, 43.0%   rating = 2648.5
                 Critter 0.52b: 1097.0 /  2600, 42.2%   rating = 2637.5
        Shredder Classic 4 32b:  922.5 /  1800, 51.2%   rating = 2637.4
             Deep Junior 11.1a: 1153.0 /  2800, 41.2%   rating = 2627.5
                  Naum 2.2 32b:  614.0 /  1300, 47.2%   rating = 2625.8
                Umko 1.1 SSE42: 1146.0 /  3900, 29.4%   rating = 2620.6
              Deep Junior 2010: 1210.0 /  3100, 39.0%   rating = 2618.7
               Glaurung 2.2 JA: 1027.5 /  2600, 39.5%   rating = 2617.9
            Rybka 1.0 Beta 32b:  506.0 /  1100, 46.0%   rating = 2617.7
               HIARCS 11.2 32b:  827.0 /  1900, 43.5%   rating = 2612.9
            Fruit 05/11/03 32b: 1774.0 /  4400, 40.3%   rating = 2610.3
                     Loop 2007: 2396.5 /  7700, 31.1%   rating = 2602.6
                Toga II 1.2.1a:  716.5 /  1600, 44.8%   rating = 2600.2
                Jonny 4.00 32b: 1330.5 /  5000, 26.6%   rating = 2598.3
                     ListMP 11:  987.5 /  2600, 38.0%   rating = 2595.9
                 LoopMP 12 32b:  635.0 /  1500, 42.3%   rating = 2593.9
                  Tornado 4.80:  672.0 /  2600, 25.8%   rating = 2592.1
              Deep Shredder 10: 1754.0 /  4400, 39.9%   rating = 2590.2
       Twisted Logic 20100131x: 1140.0 /  3500, 32.6%   rating = 2585.7
                Crafty 23.3 JA: 1241.5 /  5000, 24.8%   rating = 2581.1
           Spike 1.2 Turin 32b: 2349.5 /  7700, 30.5%   rating = 2563.5
            Deep Sjeng 2.7 32b:  465.5 /  1400, 33.2%   rating = 2539.9
                Crafty 23.1 JA: 1002.0 /  3800, 26.4%   rating = 2528.5
lkaufman
Posts: 5981
Joined: Sun Jan 10, 2010 6:15 am
Location: Maryland USA

Re: The IPON BayesElo mystery solved.

Post by lkaufman »

michiguel wrote: FWIW, these are the ratings calculated with my program ordo, normalized so that DS12 is 2800, as IPON does.

Miguel

[...ratings table snipped; identical to the one posted above...]
You don't say how your program works, but I'm guessing it uses standard Elo rating adjustments and iterates until convergence; is that roughly correct? This would correspond to the way most people would expect ratings from such data to work: they should produce ratings that would not change if the games of one player were all rated as a single event. If I'm right, it shows that the top program drops 20 Elo with BayesElo using defaults, of which it was reported earlier in this thread that 13 are due to using the defaults rather than deriving the optimum parameter values from the data. If all this is correct, the other 7 are probably due to the use of PRIOR. Perhaps if PRIOR is turned off as well, in addition to deriving the parameters from the data, BayesElo might virtually match your output. There would still be minor differences, but perhaps no noticeable systematic ones.
So it still looks like "Mystery Solved", the only question being exactly how much of the 20 Elo disparity is due to PRIOR and how much to the defaults. There remains only the question of whether an iterative method like the one I'm guessing you used is superior to BayesElo or not. Thanks to HGM's posts I see that the differences are fundamental and can be significant, though they are not huge with the given data pool.
User avatar
michiguel
Posts: 6401
Joined: Thu Mar 09, 2006 8:30 pm
Location: Chicago, Illinois, USA

Re: The IPON BayesElo mystery solved.

Post by michiguel »

lkaufman wrote:
michiguel wrote: FWIW, these are the ratings calculated with my program ordo, normalized so that DS12 is 2800, as IPON does.

Miguel

[...ratings table snipped; identical to the one posted above...]
You don't say how your program works, but I'm guessing it uses standard Elo rating adjustments and iterates until convergence; is that roughly correct?
I am not using Elo, but your intuition is not totally off, because the calculation is iterative. The iteration is just a tool, though. The important thing is that at the end of the calculation you can pick any engine, recalculate its individual rating from its opponents' ratings, and it will make sense. The system assumes a "Boltzmann-like" distribution between different levels of "energy" (engine strength = energy). This means that the equation relating strength to expected score ends up being a logistic function. This curve is a bit more spread at the tails than an integrated Gaussian, but it is quite similar. That is the reason, I think, that the scale is a tiny bit more spread (i.e. Houdini gets a higher rating at the top, Crafty a lower one at the bottom). It has an interesting feature: if A beats B 10 to 1, and B beats C 10 to 1, then A beats C 100 to 1.

The parameters were chosen to make the scale similar to Elo, for people accustomed to it. At least, a difference of 200 Elo points equals 200 points in this rating; other differences map slightly differently. For practical purposes the discrepancy is insignificant unless the difference is big, i.e. more than 400 points or so.
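Roughly, you can think of it as a Bradley-Terry type iteration on the logistic scale. A simplified sketch of that kind of calculation (not the actual code of ordo, and with made-up toy data) would be:

Code: Select all

import math

def fit_ratings(results, iterations=500, anchor=None):
    # results: list of (a, b, score_a, games), draws already folded into the score.
    # gamma_p = 10**(rating_p/400); the odds of p beating q are gamma_p/gamma_q,
    # which is why 10:1 over B and B's 10:1 over C imply 100:1 over C.
    players = {p for a, b, _, _ in results for p in (a, b)}
    gamma = {p: 1.0 for p in players}
    score = {p: 0.0 for p in players}
    for a, b, s, n in results:
        score[a] += s
        score[b] += n - s
    for _ in range(iterations):
        new = {}
        for p in players:
            denom = 0.0
            for a, b, s, n in results:
                if a == p:
                    denom += n / (gamma[p] + gamma[b])
                elif b == p:
                    denom += n / (gamma[p] + gamma[a])
            new[p] = score[p] / denom   # standard minorization-maximization update
        gamma = new
    ratings = {p: 400.0 * math.log10(g) for p, g in gamma.items()}
    if anchor:                          # pin one engine, as IPON pins DS12 to 2800
        name, value = anchor
        shift = value - ratings[name]
        ratings = {p: r + shift for p, r in ratings.items()}
    return ratings

# Toy data: (A, B, points scored by A, games played)
games = [("A", "B", 60.0, 100), ("A", "C", 82.0, 100), ("B", "C", 65.0, 100)]
print(fit_ratings(games, anchor=("C", 2700.0)))

At the fixed point, each engine's expected score against the final opponent ratings equals its actual score, which is the self-consistency I mentioned above.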


This would correspond to the way most people would expect ratings from such data to work: they should produce ratings that would not change if the games of one player were all rated as a single event.
Correct.

If I'm right, it shows that the top program drops 20 Elo with BayesElo using defaults, of which it was reported earlier in this thread that 13 are due to using the defaults rather than deriving the optimum parameter values from the data. If all this is correct, the other 7 are probably due to the use of PRIOR. Perhaps if PRIOR is turned off as well, in addition to deriving the parameters from the data, BayesElo might virtually match your output. There would still be minor differences, but perhaps no noticeable systematic ones.
So it still looks like "Mystery Solved", the only question being exactly how much of the 20 Elo disparity is due to PRIOR and how much to the defaults. There remains only the question of whether an iterative method like the one I'm guessing you used is superior to BayesElo or not. Thanks to HGM's posts I see that the differences are fundamental and can be significant, though they are not huge with the given data pool.
I do not know what BayesElo is doing, but the difference here could be because a different formula is used. Still, note that there are minor differences between the two lists, and some engines that are 9th in one are 10th in the other, etc.

I believe this shows that fighting for 5 Elo points or so is not worth it. It is a meaningless difference, IMHO. It only means something in head-to-head competition (with enough games, of course).

Miguel
Michel
Posts: 2277
Joined: Mon Sep 29, 2008 1:50 am

Re: The IPON BayesElo mystery solved.

Post by Michel »

hgm wrote: I don't get it. The drawValue shouldn't affect the ratings in BayesElo, should it? Given a certain rating difference x, one can calculate the probability of a draw, and it will be higher if drawValue is higher (because it will be equal to F(x+drawValue) - F(x-drawValue), where F is the cumulative Elo distribution). But unless drawValue is ridiculously large, the shape of this draw probability distribution is practically independent of it, as the expression is a quite accurate estimate of 2*drawValue*(d/dx)F(x), i.e. proportional to the bell-shaped Elo curve itself.
In retrospect I think this observation should be modified: the value of drawelo also affects the likelihood assigned to a win or a loss, which the above analysis does not take into account.

For large drawelo, BayesElo assigns a lower likelihood to non-draw results, so it should expand the Elo ratings to accommodate them.
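A quick toy calculation of that effect (my own sketch, using the trinomial model exactly as hgm wrote it; the 60/32/8 sample and the two drawelo values are arbitrary):

Code: Select all

import math

def F(x):
    return 1.0 / (1.0 + 10.0 ** (-x / 400.0))

def mle_diff(wins, draws, losses, drawelo):
    # Maximize the trinomial log-likelihood over a grid of Elo differences
    def loglike(x):
        pw, pl = F(x - drawelo), F(-x - drawelo)
        return (wins * math.log(pw) + losses * math.log(pl)
                + draws * math.log(1.0 - pw - pl))
    return max((x / 10.0 for x in range(-8000, 8001)), key=loglike)

# The same 76% score (60 wins, 32 draws, 8 losses) fitted with two drawelo values:
print(mle_diff(60, 32, 8, drawelo=100))   # smaller fitted difference
print(mle_diff(60, 32, 8, drawelo=300))   # noticeably larger fitted difference

The larger drawelo value indeed stretches the fitted rating difference for exactly the same result.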
User avatar
Ajedrecista
Posts: 1992
Joined: Wed Jul 13, 2011 9:04 pm
Location: Madrid, Spain.

Re: The IPON BayesElo mystery solved.

Post by Ajedrecista »

Hello:

First of all, I must say that my knowledge of statistics and Elo calculation is very limited, so you probably will not draw any interesting conclusions from this post.

Code: Select all

Engine A vs Engine B (ELO 2755) 60.0-40.0 perf=2825
Engine A vs Engine C (ELO 2705) 82.0-18.0 perf=2968
Engine A vs Engine D (ELO 2815) 62.0-38.0 perf=2900
Engine A vs Engine E (ELO 2800) 65.0-35.0 perf=2908
Engine A vs Engine F (ELO 2680) 85.0-15.0 perf=2981
Taking this example, what I do is the following:
(Referred to Engine A):

n = number of games
w = number of wins
l = number of loses
d = number of draws
D = draw ratio
mu = relative score
r_i = rating of i-th opponent
<r> = average rating of N opponents
rd = rating difference
sd = standard deviation
rd(+) = upper rating difference
rd(-) = lower rating difference
e(+) = uncertainty between rd and rd(+)
e(-) = uncertainty between rd and rd(-)
<e> = average uncertainty

n = w + l + d
D = d/n
mu = (w + d/2)/n
1 - mu = (d/2 + l)/n

<r> = (1/N)·[SUM_i=1, ..., N (r_i)]
rd = 400·log[mu/(1 - mu)]
sd = sqrt{(1/n)·[mu·(1 - mu) - D/4]}
rd(+) = 400·log{[mu + (1.96)·sd]/[1 - mu - (1.96)·sd]}
rd(-) = 400·log{[mu - (1.96)·sd]/[1 - mu + (1.96)·sd]}
e(+) = [rd(+)] - rd > 0
e(-) = [rd(-)] - rd < 0
<e> = ±[|e(+)| + |e(-)|]/2 = ±{[e(+)] - [e(-)]}/2
K = |<e>·sqrt(n)|

K is a 'sanity check': usual values (most of the time, but not always) for 95% confidence (~ 1.96-sigma) are between 500 and 600, according to my limited experience. That does not mean that K cannot be less than 500 or greater than 600.

Rating difference interval (with 1.96-sigma confidence ~ 95% confidence): ]<r> + rd(-), <r> + rd(+)[

(Calculations have been done with a Casio calculator, so may contain errors).

==================================================================================================

n = 500:

354 - 146 (+300, =108, -92)
rd ~ +153.86
(1.96)·sd ~ 3.42564%
(1.96)·n·sd ~ 17.1282 points

rd(+) ~ +183.75 ; e(+) ~ +29.89
rd(-) ~ +125.97 ; e(-) ~ -27.89
<e> ~ ± 28.89 ; K = |<e>·sqrt(n)| ~ 646

[Rating difference interval (with 1.96-sigma confidence ~ 95% confidence)] ~ ]2876.97, 2934.75[
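For anyone who wants to double-check the Casio arithmetic, the same steps can be run in a few lines of Python (only the formulas listed above, nothing else):

Code: Select all

import math

n, w, d, l = 500, 300, 108, 92
r_avg = (2755 + 2705 + 2815 + 2800 + 2680) / 5          # <r> = 2751

mu = (w + d / 2) / n                                     # relative score
D = d / n                                                # draw ratio
rd = 400 * math.log10(mu / (1 - mu))                     # rating difference
sd = math.sqrt((mu * (1 - mu) - D / 4) / n)              # standard deviation
rd_hi = 400 * math.log10((mu + 1.96 * sd) / (1 - mu - 1.96 * sd))   # rd(+)
rd_lo = 400 * math.log10((mu - 1.96 * sd) / (1 - mu + 1.96 * sd))   # rd(-)
e_avg = (rd_hi - rd_lo) / 2                              # <e>
K = e_avg * math.sqrt(n)                                 # sanity check

print(f"rd = {rd:+.2f}, (1.96)*sd = {1.96 * sd * 100:.5f}%")
print(f"rd(+) = {rd_hi:+.2f}, rd(-) = {rd_lo:+.2f}, <e> = +/-{e_avg:.2f}, K = {K:.0f}")
print(f"95% interval: ]{r_avg + rd_lo:.2f}, {r_avg + rd_hi:.2f}[")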
The formula for standard deviation (sigma) was extracted from here (post #52):

http://immortalchess.net/forum/showthre ... 237&page=3

I suppose that the term D/4 comes from P·(1 - P)·D, where the maximum variability for this is P = 1 - P = 1/2, and then the term is D/4, but I do not know for sure. The rest of the method is mine (with my misconceptions). Comparing with Gerhard's results for 95% confidence:

Code: Select all

Wins   = 300 
Draws  = 108 
Losses = 92 
Av.Op. Elo = 2751 

Result     : 354.0/500 (+300,=108,-92)
Perf.      : 70.8 %
Margins    :
 95 %      : (+  3.3,-  3.5 %) -> [ 67.3, 74.1 %]

Elo        : 2905
Margins    :
 95 %      : (+ 29,- 29) -> [2876,2934]
Quite similar results (~ +1 Elo in my case, with respect to Mr. Sonnabend's). In my case I have only one margin (~ 3.4%), while Gerhard has two (3.3% and 3.5%).

But with my method, it is as if Engine A had played 500 games against a single 2751-rated engine with a draw ratio of 21.6%. I can instead calculate five different standard deviations for the five different opponents (B, C, D, E and F):
sd_i = sqrt{(1/n_i)·[mu_i·(1 - mu_i) - D_i/4]}
_i stands for the i-th opponent

sd_B = (1/10)·sqrt(0.24 - D_B/4)
sd_C = (1/10)·sqrt(0.1476 - D_C/4)
sd_D = (1/10)·sqrt(0.2356 - D_D/4)
sd_E = (1/10)·sqrt(0.2275 - D_E/4)
sd_F = (1/10)·sqrt(0.1275 - D_F/4)

<sd> = average standard deviation
<sd> = (1/5)·sqrt{SUM_i=1, ..., 5 [(sd_i)²]} = (1/50)·sqrt[0.9782 - (D_B + D_C + D_D + D_E + D_F)/4]

And again we have (1.96)<sd>
I hope there are no typos. Maybe my calculations are full of misconceptions... corrections are welcome.

Regards from Spain.

Ajedrecista.
QED
Posts: 60
Joined: Thu Nov 05, 2009 9:53 pm

Re: The IPON BayesElo mystery solved.

Post by QED »

Gerhard Sonnabend wrote:

Code: Select all

Engine A vs Engine B (ELO 2755) 60.0-40.0 perf=2825
Engine A vs Engine C (ELO 2705) 82.0-18.0 perf=2968
Engine A vs Engine D (ELO 2815) 62.0-38.0 perf=2900
Engine A vs Engine E (ELO 2800) 65.0-35.0 perf=2908
Engine A vs Engine F (ELO 2680) 85.0-15.0 perf=2981
If I had to come up with a simple explanation, I would start by explaining why the simple average seemed like a good idea.

The expected score from 100 games is a non-linear function of the rating difference. The difference itself is linear, and the correct computation should respect this linearity, so averaging is not ruled out yet. But the general formula is a weighted average.

A simple average gives every match equal weight. This reflects the fact that every match has the same number of games. The same number of games means the same amount of information obtained, right? Wrong. This is where the non-linearity kicks in.

In a wide range of statistical models, the uncertainty, expressed by sigma, is inversely proportional to the square root of the number of games. But I suspect that the matches against C and F have a larger sigma in rating terms: matches between uneven opponents are simply more predictable and not so sensitive to the exact value of the rating difference.

If it is true that sigma is a better indicator of information content than the mere number of games, then I would use the average of performance ratings weighted by their sigma raised to the minus two. This is still not the same as the complete BayesElo computation, but it should be less off.
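To make this concrete, here is a rough sketch with the match data quoted above (the per-match sigmas are my own back-of-the-envelope binomial estimates over 100 games each, ignoring draws, so the exact numbers should not be taken too seriously):

Code: Select all

import math

# (opponent rating, score out of 100 games, performance rating) from the table above
matches = [(2755, 60.0, 2825), (2705, 82.0, 2968), (2815, 62.0, 2900),
           (2800, 65.0, 2908), (2680, 85.0, 2981)]

n = 100.0
weights, perfs = [], []
for opponent, score, perf in matches:
    p = score / n
    sigma_score = math.sqrt(p * (1.0 - p) / n)   # binomial sigma of the score fraction
    # Convert to a sigma in rating points via the slope of the logistic Elo curve:
    # d(Elo)/d(p) = 400 / (ln 10 * p * (1 - p))
    sigma_elo = sigma_score * 400.0 / (math.log(10.0) * p * (1.0 - p))
    weights.append(1.0 / sigma_elo ** 2)
    perfs.append(perf)

simple = sum(perfs) / len(perfs)
weighted = sum(wt * r for wt, r in zip(weights, perfs)) / sum(weights)
print(f"simple average of performances:  {simple:.1f}")
print(f"sigma^-2 weighted average:       {weighted:.1f}")

The lopsided matches (against C and F) get the smallest weights, so the weighted average lands a bit below the simple one.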

On the other hand, there are other factors such as the prior or the value of drawelo, together with chess specifics such as "playing style", contempt, or drawishness increasing with the overall level of play. I consider all of that important mainly for explaining why the performance against B, D and E is lower than average (weighted or not). But I expect the weighted average to be less off also in a model case, with a binomial distribution, where the differences between match performances are purely statistical.
lkaufman
Posts: 5981
Joined: Sun Jan 10, 2010 6:15 am
Location: Maryland USA

Re: The IPON BayesElo mystery solved.

Post by lkaufman »

michiguel wrote:
I do not know what BayesElo is doing, but the difference here could be because a different formula is used. Still, note that there are minor differences between the two lists, and some engines that are 9th in one are 10th in the other, etc.

I believe this shows that fighting for 5 Elo points or so is not worth it. It is a meaningless difference, IMHO. It only means something in head-to-head competition (with enough games, of course).

Miguel
First point: there are two versions of the Elo formula, going back to the publication of Elo's book on ratings. One uses the normal distribution and the other uses the logistic, exactly the one you say your model uses. Some rating agencies use the normal, others the logistic; I believe at least the USCF uses the logistic. I don't know whether BayesElo uses the normal or the logistic. If the logistic, it should match your formula exactly or almost exactly with the proper options set and no PRIOR. As you say, the differences between the two distributions are in general pretty tiny, except at the extremes. So you have independently re-invented one version of the Elo formula. I did the same in the early 1970s, before the book was published; I was using the logistic version of "Elo" to rate blitz events before he published the book in which that idea was introduced.
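For the curious, the two curves are easy to compare numerically (this assumes Elo's original per-player sigma of 200 for the normal version, which is the usual parameterization as far as I know):

Code: Select all

from math import erf

def logistic(d):
    # Logistic expectancy: 1 / (1 + 10^(-d/400))
    return 1.0 / (1.0 + 10.0 ** (-d / 400.0))

def normal(d, sigma=200.0):
    # Normal-model expectancy: the difference of two performances has sigma*sqrt(2),
    # so P = Phi(d / (sigma*sqrt(2))) = 0.5 * (1 + erf(d / (2*sigma)))
    return 0.5 * (1.0 + erf(d / (2.0 * sigma)))

for d in (50, 100, 200, 400, 600):
    print(f"{d:4d}   logistic {logistic(d):.3f}   normal {normal(d):.3f}")

The two expectancy curves differ by well under one percent for moderate differences and by only a point or so even at 400-600 Elo, which is why the choice matters so little in practice.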
I disagree about five points being "meaningless". It doesn't matter whether you play head-to-head or against identical opponents; if you play enough games it will have real meaning. Given the samples that the rating agencies actually play, its meaning is limited. But it's rather moot: we strive to gain five points because if we do it ten times, we have fifty points! We don't release a new version if it's just five or ten Elo better in our judgment.
User avatar
hgm
Posts: 27945
Joined: Fri Mar 10, 2006 10:06 am
Location: Amsterdam
Full name: H G Muller

Re: The IPON BayesElo mystery solved.

Post by hgm »

I had never heard the term 'logistic' before, but I assume it refers to the expression

score = 100% / (1 + exp(-EloDifference/WIDTH)) = F(EloDifference)

which is indeed a commonly used alternative to the Gaussian (normal) model (using WIDTH = 400/ln(10), so the exponential becomes 10^(-EloDiff/400)). This is what BayesElo uses, except that it does not model only the average score, but wins, draws and losses separately, by

wins = F(EloDiff - drawValue)

from which it automatically follows (as one player's win is the other player's loss)

losses = F(-EloDiff - drawValue) = 100% - F(EloDiff + drawValue)

(F(-x) = 1 - F(x) for all x)

and thus

draws = 100% - wins - losses = F(EloDiff + drawValue) - F(EloDiff - drawValue)

From this it follows that a draw between two players is twice as strong evidence for their equality as a win or loss is for their inequality. But that conclusion is of course only as good as the model predicting the WDL frequencies.

The only way to say anything sensible about that is to actually plot win, draw and loss frequencies as a function of Elo difference (e.g. take a huge set of games, calculate the ratings, divide the games over rating-difference bins, calculate the WDL percentages for each bin, and plot the results in the same graph as F(EloDiff)).

If this confirms the model, the ratings were OK. If not, you should repeat the rating calculation with an improved model, plot the results again using the new ratings, and so on, until you reach consistency.
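Schematically, the binning step would look something like this (just a sketch; I assume you already have a list of (whiteElo, blackElo, result) tuples from somewhere):

Code: Select all

from collections import defaultdict

def F(x):
    # The logistic curve to overlay on the observed frequencies
    return 1.0 / (1.0 + 10.0 ** (-x / 400.0))

def wdl_by_diff(games, bin_width=50):
    # games: iterable of (rating_a, rating_b, result), result = 1, 0.5 or 0
    # from player A's point of view. Returns per-bin W/D/L fractions.
    bins = defaultdict(lambda: [0, 0, 0])            # [wins, draws, losses]
    for ra, rb, result in games:
        b = int(round((ra - rb) / bin_width)) * bin_width
        if result == 1:
            bins[b][0] += 1
        elif result == 0.5:
            bins[b][1] += 1
        else:
            bins[b][2] += 1
    table = {}
    for b, (w, d, l) in sorted(bins.items()):
        n = w + d + l
        table[b] = (w / n, d / n, l / n, n)
    return table

# Each bin's observed win fraction can then be plotted against F(diff - drawValue),
# and the draw fraction against F(diff + drawValue) - F(diff - drawValue).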
lkaufman
Posts: 5981
Joined: Sun Jan 10, 2010 6:15 am
Location: Maryland USA

Re: The IPON BayesElo mystery solved.

Post by lkaufman »

Surely someone has done this somewhere; perhaps Jeff Sonas has done it or could do it easily with his database. For me the interesting question is to compare the predictions of the BayesElo model with the standard assumption that two draws are identical to one win and one loss. The basic assumption made by BayesElo seems suspect to me.