Edit: I left out an important detail. Rather than simply multiplying by 100, I actually rounded to 0 decimal places. The calculation was:
SELECT round(los_prob * 100, 0). Now anyone should be able to replicate it exactly.
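For anyone replicating outside SQL, here is a minimal Python sketch of the same rounding and binning. One caveat on an assumption here: most SQL round() implementations round halves away from zero, while Python's round() rounds halves to even, which matters only for values landing exactly on a half-percent.

```python
from collections import Counter

def bin_los(los_probs):
    """Round each LOS probability in [0, 1] to a whole-percent bin,
    mirroring SELECT round(los_prob * 100, 0), and count items per bin.
    (SQL round() rounds halves away from zero; Python's round() rounds
    halves to even, which differs only at exact half-percents.)"""
    return Counter(round(p * 100) for p in los_probs)

counts = bin_los([0.004, 0.496, 0.504, 0.996])
# bins: 0 -> 1 item, 50 -> 2 items, 100 -> 1 item
```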
If you take the output of the program that calculates LOS for two exactly equal opponents (so the score should be 50), and multiply the LOS scores by 100 to get bins from 0 to 100, you will see the following distribution:
Code:
pct,count(pct)
100,456
99,996
98,1056
97,990
96,961
95,1024
94,945
93,997
92,1016
91,900
90,970
89,1116
88,1004
87,931
86,1013
85,1000
84,931
83,1162
82,1001
81,1075
80,923
79,1164
78,980
77,993
76,958
75,1004
74,1012
73,970
72,1108
71,867
70,1067
69,849
68,1162
67,942
66,941
65,1169
64,980
63,931
62,988
61,970
60,1017
59,1248
58,1007
57,1022
56,1005
55,1001
54,959
53,1025
52,1006
51,961
50,809
49,973
48,999
47,970
46,977
45,1007
44,981
43,1014
42,959
41,1269
40,1018
39,972
38,972
37,932
36,951
35,1171
34,937
33,935
32,1124
31,861
30,1043
29,863
28,1068
27,1050
26,1089
25,993
24,930
23,990
22,984
21,1117
20,887
19,1021
18,896
17,1095
16,938
15,1040
14,961
13,913
12,1041
11,1032
10,961
9,935
8,1057
7,1058
6,1001
5,978
4,953
3,1028
2,1026
1,972
0,476
You will notice that there are about 1000 items in each bin, except for the first and last bins, which have about 500 each. (Those two bins are only half as wide, because the rounding maps [0, 0.5) to bin 0 and [99.5, 100] to bin 100.)
That means the LOS algorithm is about equally likely to give any possible answer between "I am absolutely sure A is stronger than B" and "I am absolutely sure that A is not stronger than B." We can also turn it around and say the same thing in the other direction: it is about equally likely to give any possible answer between "I am absolutely sure B is stronger than A" and "I am absolutely sure that B is not stronger than A."
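The uniform shape of the table is exactly what a statistic of this kind produces under the null hypothesis: for two equal engines, LOS behaves like a p-value, so every bin is equally likely. A small Monte Carlo sketch, assuming the usual wins/losses formula LOS = Phi((w - l) / sqrt(w + l)); the exact program that produced the table is not shown, so this is a reconstruction:

```python
import math
import random

def los(wins, losses):
    """Likelihood of superiority from decisive games only:
    LOS = Phi((wins - losses) / sqrt(wins + losses))."""
    return 0.5 * (1.0 + math.erf((wins - losses) / math.sqrt(2.0 * (wins + losses))))

rng = random.Random(42)

def self_play_los(games):
    """LOS after `games` decisive games between two identical engines
    (each side wins with probability 0.5; draws ignored for simplicity)."""
    wins = sum(1 for _ in range(games) if rng.random() < 0.5)
    return los(wins, games - wins)

trials = [self_play_los(400) for _ in range(4000)]
# Under the null, LOS is close to uniform on (0, 1): each quarter of the
# range collects about a quarter of the trials, just as in the table.
middle_half = sum(1 for p in trials if 0.25 <= p <= 0.75) / len(trials)
```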
I offered this data last week in table form and even as a relational database table, but nobody seemed very interested.
Anyway, you questioned why I would use a LOS algorithm to test superiority of an engine over itself. It does sound kind of silly, because we already know the answer: it is not superior to itself. But that is exactly why the experiment is important. If the algorithm claims that the program is superior to itself when it is not, that indicates a problem.
What we see here is that the LOS algorithm coughs up a bad answer most of the time.
That is because, as we run more and more trials, the win fraction of the same engine does move towards the mean. However, the raw difference between wins and losses increases in spread, growing roughly like the square root of the number of games (even though, on average, the fraction gives a better and better estimate of the mean). This destroys the LOS calculation.
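The widening-spread point can be checked numerically: for a fair coin, the win fraction converges to 0.5 while the absolute win/loss gap typically grows like the square root of the game count. A small sketch, assuming pure win/loss games:

```python
import random

rng = random.Random(7)

def gap_and_fraction(games):
    """Return (|wins - losses|, wins / games) for `games` fair coin-flip games."""
    wins = sum(1 for _ in range(games) if rng.random() < 0.5)
    return abs(2 * wins - games), wins / games

avg_gap, avg_frac = {}, {}
for n in (400, 40000):
    runs = [gap_and_fraction(n) for _ in range(30)]  # average a few runs
    avg_gap[n] = sum(g for g, _ in runs) / len(runs)
    avg_frac[n] = sum(f for _, f in runs) / len(runs)
# With 100x the games, the win fraction tightens around 0.5,
# but the raw win/loss gap widens by roughly a factor of 10.
```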
Now, the LOS calculation is not the worst calculation in the world. It tells us the same thing that common sense tells us: if engine A has more wins than engine B, it is probably stronger. But because it does not care about draws, it is missing important information (including the total number of games).
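To make the draw-blindness concrete, here is the wins/losses-only LOS formula commonly used in engine testing (whether the criticized implementations use exactly this normal-approximation variant is an assumption):

```python
import math

def los(wins, losses):
    """Likelihood of superiority from decisive games only; draws and the
    total game count drop out entirely, which is the criticism above.
    LOS = Phi((wins - losses) / sqrt(wins + losses))."""
    if wins + losses == 0:
        return 0.5  # no decisive games: no evidence either way
    return 0.5 * (1.0 + math.erf((wins - losses) / math.sqrt(2.0 * (wins + losses))))

# 60 wins vs 40 losses gives the same LOS whether there were
# 0 draws (100 games) or 900 draws (1000 games):
print(round(los(60, 40), 3))  # -> 0.977
```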
Another important reason for testing LOS with an engine against itself is that the only thing I have ever seen it used for is as a tiebreaker. For instance, CCRL uses it to tell differences between adjacent engines on their list. That makes them (by definition of the ordered list) fairly close in strength to each other.
It may be that with only a couple of thousand games the error spread has not grown large enough to dominate, and so the answers may be OK. But I also think it is faulty for throwing out the draw counts. Draws impart vital information about strength (as the Elo calculation demonstrates), and tossing out that information makes the algorithm more prone to bad guesses.
Taking ideas is not a vice, it is a virtue. We have another word for this. It is called learning.
But sharing ideas is an even greater virtue. We have another word for this. It is called teaching.