Komodo run - Ingo list revisited

lkaufman · Post by **lkaufman** » Sun Nov 10, 2013 4:38 pm

IWB wrote:
lkaufman wrote: A simple way to check this out would be to calculate a rating list including both Houdini Cont. 0 and Houdini Cont. 1 on the same list, and run it with both Ordo and BayesElo.
Which engine would you like to have fixed, and at what rating?

Bye
Ingo

Well, it doesn't really matter as I'm interested in differences, but to make the difference obvious you could set Houdini 3 contempt 0 to 3000 for both lists. Then we only have to compare the last two digits of the rating of default Houdini 3 on the Ordo and BayesElo lists. I'll predict that the difference will fall by something like 30-40%, but I could be totally wrong.

IWB · Post by **IWB** » Sun Nov 10, 2013 5:15 pm

Here is the result with Bayes and ORDO, I hope this is what you wanted to have. If not please tell me.

Bayes mm 0 1:

Rank Name                      Elo    +    - games score oppo. draws
   1 Houdini 3 STD            3020   10   10  3000   78%  2791   27%
   2 K113300                  3013   10    9  3150   77%  2801   32%
   3 Houdini 3 Con 0          3000   10   10  3000   77%  2791   35%
   4 Stockfish 4              2966    9    9  3150   72%  2803   38%
   5 Gull 2.2                 2929    9    9  3150   67%  2805   40%
   6 Critter 1.4a             2928    9    9  3150   66%  2805   41%
   7 Deep Rybka 4.1           2901    9    9  3150   63%  2806   42%
   8 Hannibal 1.4a            2819    9    9  3150   51%  2810   43%
   9 Chiron 1.5               2799    9    9  3150   48%  2811   40%
  10 Protector 1.5.0          2792    9    9  3150   47%  2812   44%
  11 Naum 4.2                 2788    9    9  3150   47%  2812   40%
  12 HIARCS 14 WCSC 32b       2769    9    9  3150   44%  2813   40%
  13 Deep Shredder 12         2755    9    9  3150   42%  2813   38%
  14 Jonny 6.00               2752    9    9  3150   42%  2814   37%
  15 Deep Sjeng c't 2010 32b  2735    9    9  3150   40%  2814   39%
  16 Spike 1.4 32b            2730    9    9  3150   39%  2815   41%
  17 spark-1.0                2718    9    9  3150   37%  2815   38%
  18 Deep Junior 13.3         2697    9    9  3150   35%  2816   32%
  19 Booot 5.2.0              2695    9    9  3150   34%  2816   37%
  20 Quazar 0.4               2687    9    9  3150   33%  2817   35%
  21 Zappa Mexico II          2676    9    9  3150   32%  2817   35%
  22 Toga II 3.0 32b          2666    9    9  3150   30%  2818   35%

Ordo:

   # PLAYER                     : RATING    POINTS  PLAYED     (%)
   1 Houdini 3 STD              : 3012.2    2349.0    3000   78.3%
   2 K113300                    : 3010.4    2421.0    3150   76.9%
   3 Houdini 3 Con 0            : 3000.0    2315.5    3000   77.2%
   4 Stockfish 4                : 2961.1    2257.5    3150   71.7%
   5 Gull 2.2                   : 2917.1    2097.5    3150   66.6%
   6 Critter 1.4a               : 2916.2    2094.0    3150   66.5%
   7 Deep Rybka 4.1             : 2888.4    1987.0    3150   63.1%
   8 Hannibal 1.4a              : 2795.9    1609.0    3150   51.1%
   9 Chiron 1.5                 : 2774.6    1519.5    3150   48.2%
  10 Protector 1.5.0            : 2767.6    1490.0    3150   47.3%
  11 Naum 4.2                   : 2765.2    1480.0    3150   47.0%
  12 HIARCS 14 WCSC 32b         : 2744.2    1392.0    3150   44.2%
  13 Deep Shredder 12           : 2730.8    1336.0    3150   42.4%
  14 Jonny 6.00                 : 2726.9    1320.0    3150   41.9%
  15 Deep Sjeng c't 2010 32b    : 2708.7    1245.0    3150   39.5%
  16 Spike 1.4 32b              : 2703.3    1223.0    3150   38.8%
  17 spark-1.0                  : 2691.6    1175.5    3150   37.3%
  18 Deep Junior 13.3           : 2672.0    1097.5    3150   34.8%
  19 Booot 5.2.0                : 2666.0    1074.0    3150   34.1%
  20 Quazar 0.4                 : 2660.9    1054.0    3150   33.5%
  21 Zappa Mexico II            : 2647.6    1003.0    3150   31.8%
  22 Toga II 3.0 32b            : 2636.1     960.0    3150   30.5%

To the mathematically impaired user the two lists look only slightly different. "I" usually consider everything within 10 Elo as equal, since 10 Elo means nothing to a human (and testing beyond that is a waste of energy) - we simply can't feel it!

Bye
Ingo

lkaufman · Post by **lkaufman** » Sun Nov 10, 2013 5:42 pm

IWB wrote:Here is the result with Bayes and ORDO, I hope this is what you wanted to have. If not please tell me.


So the result was a 39% drop in the rating difference between the two contempt values by using ORDO (from 20 Elo under BayesElo to 12.2 under ORDO), within the 30 to 40% I predicted. Note that this 8 Elo drop in the difference has nothing to do with error margins; it reflects a difference between the two rating systems on the same data. I also note that rating differences are in general larger on the ORDO run, which makes the 39% drop even more significant.
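The 39% figure can be checked directly from the ratings posted in the two lists. The following is a minimal Python sketch that uses only those posted numbers. The helper names and the idea of normalising the gap by each list's top-to-bottom spread are illustrative assumptions, not something either rating tool reports:

```python
# Ratings copied from the BayesElo and ORDO lists posted in this thread.
bayes = {"Houdini 3 STD": 3020.0, "Houdini 3 Con 0": 3000.0, "Toga II 3.0 32b": 2666.0}
ordo = {"Houdini 3 STD": 3012.2, "Houdini 3 Con 0": 3000.0, "Toga II 3.0 32b": 2636.1}


def contempt_gap(ratings):
    """Rating difference between default Houdini 3 and the contempt-0 version."""
    return ratings["Houdini 3 STD"] - ratings["Houdini 3 Con 0"]


def spread(ratings):
    """Distance between the highest and lowest rating in the (partial) list."""
    return max(ratings.values()) - min(ratings.values())


gap_bayes = contempt_gap(bayes)
gap_ordo = contempt_gap(ordo)

# Plain percentage drop in the gap when switching from BayesElo to ORDO.
drop_pct = 100.0 * (gap_bayes - gap_ordo) / gap_bayes

# Assumption: express each gap as a fraction of that list's overall spread,
# to account for ORDO stretching the whole list slightly more than BayesElo.
rel_bayes = gap_bayes / spread(bayes)
rel_ordo = gap_ordo / spread(ordo)
rel_drop_pct = 100.0 * (1.0 - rel_ordo / rel_bayes)

print(f"gap BayesElo: {gap_bayes:.1f}  gap ORDO: {gap_ordo:.1f}")
print(f"drop in gap: {drop_pct:.1f}%")
print(f"drop relative to list spread: {rel_drop_pct:.1f}%")
```

The spread-normalised number comes out somewhat larger than the plain one, which is the sense in which the wider ORDO list makes the drop "even more significant".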
It appears to me that using ORDO would substantially reduce the effect of using contempt, and would produce ratings that are much closer to what they would be if all pairings were within a hundred elo points. But of course we would need more examples than just this one to prove the point.
So I'm asking the mathematicians reading this if I am correct in saying that BayesElo puts more weight on mismatches than ORDO does, and I'm also asking if any programmers might want to run simulations to test the hypothesis. For example, if we start with a real data set (could be IPON or some other engine rating list), and add a mismatch with a surprising result (for example a 2800 engine scoring 40 out of 100 against a 3000 engine), I would expect the BayesElo rating to be more strongly affected by this than the ORDO rating (i.e. in the given example to rise by more).
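To show why 40 out of 100 would count as a surprising result in that example, here is a minimal Python sketch using the standard logistic Elo curve. It is only the textbook expected-score formula, not the fitting procedure of either BayesElo or ORDO, and the function names are made up for illustration:

```python
import math


def expected_score(rating, opponent):
    """Expected score per game under the standard logistic Elo curve."""
    return 1.0 / (1.0 + 10.0 ** ((opponent - rating) / 400.0))


def performance_rating(opponent, score_fraction):
    """Rating at which the logistic curve predicts the observed score fraction."""
    return opponent + 400.0 * math.log10(score_fraction / (1.0 - score_fraction))


# The hypothetical mismatch from the post: a 2800 engine scores 40/100 vs a 3000 engine.
expected = expected_score(2800, 3000)
performance = performance_rating(3000, 40 / 100)

print(f"expected score of 2800 vs 3000: {expected:.3f}")
print(f"performance implied by 40/100 vs 3000: {performance:.0f}")
```

On this curve a 200-point underdog is expected to score roughly 24%, so 40% corresponds to a performance of about 2930. How far each tool would actually move the 2800 engine toward that figure is the open question; this sketch does not answer it.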

Modern Times · Post by **Modern Times** » Sun Nov 10, 2013 7:12 pm

It also depends on which parameters you use for BayesELO.

lkaufman · Post by **lkaufman** » Sun Nov 10, 2013 7:53 pm

Modern Times wrote:It also depends on which parameters you use for BayesELO.

I know that this affects the spread of the ratings and the effect of draws, but I don't think it affects the fundamental tendency of BayesElo to give more weight to mismatches than ORDO does. But not having a math degree, I could be wrong about this.

Modern Times · Post by **Modern Times** » Sun Nov 10, 2013 8:52 pm

lkaufman wrote: But not having a math degree, I could be wrong about this.

Neither do I... but I wonder what
mm 1 1
scale 1

and just
mm 1 1

would produce.

lkaufman · Post by **lkaufman** » Sun Nov 10, 2013 9:07 pm

Modern Times wrote:
lkaufman wrote: But not having a math degree, I could be wrong about this.
Neither do I... but I wonder what
mm 1 1
scale 1

and just
mm 1 1

would produce.

If scale means what it sounds like then the only effect of setting that value should be to expand or contract the range of ratings by some percentage. This would have nothing to do with the issue being discussed here.

Modern Times · Post by **Modern Times** » Sun Nov 10, 2013 9:49 pm

Maybe not, but I'd like to see it.
