The IPON BayesElo mystery solved.


Engin
Posts: 987
Joined: Mon Jan 05, 2009 7:40 pm
Location: Germany
Full name: Engin Üstün

Re: The IPON BayesElo mystery solved.

Post by Engin »

How do you know this without having the played games?

I want to check it myself too, but it is not possible to download any games from the site.
IWB
Posts: 1539
Joined: Thu Mar 09, 2006 2:02 pm

Re: The IPON BayesElo mystery solved.

Post by IWB »

Hi Engin,
Engin wrote:How do you know this without having the played games?

I want to check it myself too, but it is not possible to download any games from the site.
We met in Thuringia, so you know me, and when I included Tornado you asked about the games. I explained in a personal mail why I do not provide them ...
Nonetheless, for all statistical purposes you will find a result.pgn in the individual.7z file. What is done here is done with exactly that file, as every piece of statistical information you need is in there.

Bye and have a nice weekend
Ingo
QED
Posts: 60
Joined: Thu Nov 05, 2009 9:53 pm

Re: The IPON BayesElo mystery solved.

Post by QED »

Ingo Bauer wrote:

 Default:
   4 Komodo 4 SSE42           2975 2500.0 (1892.5 : 607.5)									Perf.:
                                   100.0 ( 51.5 :  48.5) Houdini 2.0 STD          3016		3026
                                   100.0 ( 45.0 :  55.0) Critter 1.4 SSE42        2977		2942
                                   100.0 ( 51.5 :  48.5) Deep Rybka 4.1 SSE42     2956		2966
                                   100.0 ( 53.5 :  46.5) Critter 1.2              2952		2976
                                   100.0 ( 52.5 :  47.5) Stockfish 2.1.1 JA       2941		2958
                                   100.0 ( 65.5 :  34.5) Chiron 1.1a              2833		2944
                                   100.0 ( 69.5 :  30.5) Naum 4.2                 2827		2970
                                   100.0 ( 70.0 :  30.0) Fritz 13 32b             2819		2966	
                                   100.0 ( 68.0 :  32.0) Deep Shredder 12         2800		2930
                                   100.0 ( 75.0 :  25.0) Gull 1.2                 2795		2985	
                                   100.0 ( 79.0 :  21.0) Deep Sjeng c't 2010 32b  2788		3018
                                   100.0 ( 77.0 :  23.0) Spike 1.4 32b            2785		2994
                                   100.0 ( 78.5 :  21.5) Protector 1.4.0          2759		2983
                                   100.0 ( 80.0 :  20.0) Hannibal 1.1             2758		2998
                                   100.0 ( 85.0 :  15.0) spark-1.0 SSE42          2755		3056
                                   100.0 ( 87.5 :  12.5) HIARCS 13.2 MP 32b       2748		3086
                                   100.0 ( 83.5 :  16.5) Deep Junior 12.5         2731		3012
                                   100.0 ( 88.5 :  11.5) Zappa Mexico II          2716		3070
                                   100.0 ( 90.5 :   9.5) Deep Onno 1-2-70         2684		3075
                                   100.0 ( 90.5 :   9.5) Strelka 2.0 B            2671		3062
                                   100.0 ( 87.5 :  12.5) Umko 1.2 SSE42           2664		3002
                                   100.0 ( 88.0 :  12.0) Loop 2007                2621		2967
                                   100.0 ( 89.5 :  10.5) Jonny 4.00 32b           2614		2986
                                   100.0 ( 93.0 :   7.0) Tornado 4.80             2608		3057
                                   100.0 ( 92.5 :   7.5) Crafty 23.3 JA           2598		3034
																							
                                                                                          Aver. 3003

 DrawElo:
 4 Komodo 4 SSE42 mm01       2982 2500.0 (1892.5 : 607.5)									Perf.:
                                   100.0 ( 51.5 :  48.5) Houdini 2.0 STD          3023		3033
                                   100.0 ( 45.0 :  55.0) Critter 1.4 SSE42        2984		2949
                                   100.0 ( 51.5 :  48.5) Deep Rybka 4.1 SSE42     2962		2972
                                   100.0 ( 53.5 :  46.5) Critter 1.2              2958		2982
                                   100.0 ( 52.5 :  47.5) Stockfish 2.1.1 JA       2947		2964
                                   100.0 ( 65.5 :  34.5) Chiron 1.1a              2834		2945
                                   100.0 ( 69.5 :  30.5) Naum 4.2                 2828		2971
                                   100.0 ( 70.0 :  30.0) Fritz 13 32b             2820		2967
                                   100.0 ( 68.0 :  32.0) Deep Shredder 12         2800		2930
                                   100.0 ( 75.0 :  25.0) Gull 1.2                 2794		2984
                                   100.0 ( 79.0 :  21.0) Deep Sjeng c't 2010 32b  2787		3017
                                   100.0 ( 77.0 :  23.0) Spike 1.4 32b            2783		2992
                                   100.0 ( 78.5 :  21.5) Protector 1.4.0          2756		2980
                                   100.0 ( 80.0 :  20.0) Hannibal 1.1             2755		2995
                                   100.0 ( 85.0 :  15.0) spark-1.0 SSE42          2752		3053
                                   100.0 ( 87.5 :  12.5) HIARCS 13.2 MP 32b       2744		3082
                                   100.0 ( 83.5 :  16.5) Deep Junior 12.5         2726		3007
                                   100.0 ( 88.5 :  11.5) Zappa Mexico II          2711		3065
                                   100.0 ( 90.5 :   9.5) Deep Onno 1-2-70         2677		3068
                                   100.0 ( 90.5 :   9.5) Strelka 2.0 B            2663		3054
                                   100.0 ( 87.5 :  12.5) Umko 1.2 SSE42           2655		2993
                                   100.0 ( 88.0 :  12.0) Loop 2007                2610		2956
                                   100.0 ( 89.5 :  10.5) Jonny 4.00 32b           2603		2975
                                   100.0 ( 93.0 :   7.0) Tornado 4.80             2597		3046
                                   100.0 ( 92.5 :   7.5) Crafty 23.3 JA           2587		3023
								   
                                                                                          Aver. 3000
I was curious what my idea of a weighted average would say about this data. After some thinking, I decided that in this case a simple product of points (points won times points lost in each 100-game match) would be a good weight, so Houdini would weigh nearly four times as much as Crafty. After a quick computation in a spreadsheet, I got a weighted average of 2990.233 for the 'default' table and 2989.976 for the 'drawelo' table. These averages lie roughly between the BayesElo values and the simple averages.

So I went ahead and used the square of the product of points as a new weight. For the 'default' table I got a weighted average of 2981.444, and for 'drawelo' it was 2982.835. This works well for the 'drawelo' table (BayesElo rating 2982), and not so well for the 'default' table (BayesElo rating 2975).
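
The spreadsheet computation can be reproduced roughly as follows. This is only a sketch: the rows are copied from the 'Default' table above, and `weighted_average` and `DEFAULT_ROWS` are names made up for illustration. A power of 0 gives the simple average, 1 the product weight, and 2 the squared product.

```python
# (points won, points lost, performance) for Komodo 4 against each opponent,
# copied from the 'Default' table in this thread.
DEFAULT_ROWS = [
    (51.5, 48.5, 3026),  # Houdini 2.0 STD
    (45.0, 55.0, 2942),  # Critter 1.4 SSE42
    (51.5, 48.5, 2966),  # Deep Rybka 4.1 SSE42
    (53.5, 46.5, 2976),  # Critter 1.2
    (52.5, 47.5, 2958),  # Stockfish 2.1.1 JA
    (65.5, 34.5, 2944),  # Chiron 1.1a
    (69.5, 30.5, 2970),  # Naum 4.2
    (70.0, 30.0, 2966),  # Fritz 13 32b
    (68.0, 32.0, 2930),  # Deep Shredder 12
    (75.0, 25.0, 2985),  # Gull 1.2
    (79.0, 21.0, 3018),  # Deep Sjeng c't 2010 32b
    (77.0, 23.0, 2994),  # Spike 1.4 32b
    (78.5, 21.5, 2983),  # Protector 1.4.0
    (80.0, 20.0, 2998),  # Hannibal 1.1
    (85.0, 15.0, 3056),  # spark-1.0 SSE42
    (87.5, 12.5, 3086),  # HIARCS 13.2 MP 32b
    (83.5, 16.5, 3012),  # Deep Junior 12.5
    (88.5, 11.5, 3070),  # Zappa Mexico II
    (90.5, 9.5, 3075),   # Deep Onno 1-2-70
    (90.5, 9.5, 3062),   # Strelka 2.0 B
    (87.5, 12.5, 3002),  # Umko 1.2 SSE42
    (88.0, 12.0, 2967),  # Loop 2007
    (89.5, 10.5, 2986),  # Jonny 4.00 32b
    (93.0, 7.0, 3057),   # Tornado 4.80
    (92.5, 7.5, 3034),   # Crafty 23.3 JA
]


def weighted_average(rows, power):
    """Average the performances with weight (won * lost) ** power."""
    weights = [(won * lost) ** power for won, lost, _ in rows]
    total = sum(w * perf for w, (_, _, perf) in zip(weights, rows))
    return total / sum(weights)


if __name__ == "__main__":
    for power, label in [(0, "simple"), (1, "product"), (2, "squared product")]:
        print(f"{label:>16}: {weighted_average(DEFAULT_ROWS, power):.3f}")
```

With power 1 this should reproduce the 2990.233 quoted for the 'default' table; the power 2 result can be compared against the quoted 2981.444.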

It would be nice to have a post-hoc explanation of why the squared product is a better weight. The simple product would correspond to sigma (of the performance, from the binomial distribution) being proportional to 1/sqrt(n*p*(1-p)). But what is the reason for the additional weighting? All I came up with is the hypothesis that the rating difference between the (future) average and an opponent is less precise when there is a gap. So not only does the Crafty match carry four times less information about the performance, but the rating of Crafty is also four times less precise, relative to the 3000-ish future average, than the rating of Houdini.
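
To make the binomial argument concrete, here is a small sketch. It assumes the plain logistic Elo curve and ignores draws (which make the real variance smaller), and `perf_sigma` is a hypothetical helper name. The score fraction has sigma sqrt(p*(1-p)/n), and the slope of the logistic curve is C*p*(1-p), so the Elo sigma comes out as 1/(C*sqrt(n*p*(1-p))) and the inverse-variance weight is proportional to n*p*(1-p), i.e. to the product of points.

```python
import math

C = math.log(10) / 400  # slope constant of the logistic Elo curve


def perf_sigma(n, p):
    """Approximate standard error (in Elo) of a performance from n games
    with score fraction p, using sigma_p = sqrt(p*(1-p)/n) and
    dp/dElo = C*p*(1-p) (delta method, draws ignored)."""
    return 1.0 / (C * math.sqrt(n * p * (1 - p)))


houdini = perf_sigma(100, 0.515)  # Komodo scored 51.5 : 48.5
crafty = perf_sigma(100, 0.925)   # Komodo scored 92.5 : 7.5

# The ratio of variances equals the ratio of the products of points.
print(f"sigma vs Houdini: {houdini:.1f} Elo")
print(f"sigma vs Crafty:  {crafty:.1f} Elo")
print(f"variance ratio:   {(crafty / houdini) ** 2:.2f}")
```

The variance ratio is the "nearly four times" weight difference between the Houdini and Crafty matches mentioned above.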

But I am not convinced. It is possible that draws really do play a significant role and the squared weighted average works just by coincidence for this data.
QED
Posts: 60
Joined: Thu Nov 05, 2009 9:53 pm

Re: The IPON BayesElo mystery solved.

Post by QED »

H.G.Muller wrote:For one win and one loss the likelihood becomes

F(x-h) * (1 - F(x+h)) = (F - hF')* (1 - F - hF') + O(h^2) =
= F * (1-F) -hF' * (1-F) - hF' * F + O(h^2)
= F * (1-F) - hF' + O(h^2)

(all F and F' taken in x unless specified otherwise).

Now since F * (1-F) is proportional to F' and of O(1), this means that the shape of the likelihood distribution is F' up to an error of O(h^2). And with Bayes only the shape counts, not the normalization factor (which is indeed 1+O(h), but also O(h) for the draws).
Why stop at small values of h? Defining E = 10^(1/400), that is E = exp(ln(10)/400), we have F(x) = 1/(1+E^-x) = 1-F(-x). For BayesElo, with drawelo = h and the opponent being x behind, the likelihood of a win is F(x-h) and the likelihood of a loss is F(-x-h), so:

draw likelihood =  1 - F(-x-h) - F(x-h)
= 1 - F(-x-h) - F(x-h) + F(-x-h) * F(x-h) - F(-x-h) * F(x-h) // plusminus the same term
= (1-F(-x-h)) * (1-F(x-h)) - F(-x-h) * F(x-h)
= F(x+h) * F(-x+h) - F(-x-h) * F(x-h)
= 1 / ((1 + E^-x * E^-h) * (1 + E^x * E^-h)) - 1 / ((1 + E^x * E^h) * (1 + E^-x * E^h))
= 1 / (1 + E^-x * E^-h + E^x * E^-h + E^-2h) - 1 / (1 + E^x * E^h + E^-x * E^h + E^2h)
= E^h / (E^h + E^-x + E^x +E^-h) - E^-h / (E^-h + E^x + E^-x + E^h)
= (E^h - E^-h) / (E^h + E^-x + E^x + E^-h)
= (1 - E^-2h) * F(x+h) * F(-x+h) = (E^2h - 1) * F(-x-h) * F(x-h)
If BayesElo computes the maximum likelihood with a fixed drawelo (hard-coded or computed from the data), and the maximum sees only the shape and not the x-independent normalization, we can conclude from the last line of the computation that (unless I have an error somewhere) each draw really is equivalent to one loss against an opponent stronger by drawelo together with one win against an opponent weaker by drawelo (or, equivalently, a win against the stronger one and a loss against the weaker one).

So 'draw = win + loss' holds only when the (computed) rating difference is way above drawelo. Conversely, when drawelo is way above the rating difference, draws do not count at all (as their likelihood is not sensitive to such a difference).
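
The identity in the last line of the derivation can be checked numerically. The sketch below assumes E = 10^(1/400) so that F is the usual logistic Elo curve; the function names and the test values of x and h are arbitrary choices for illustration, not anything taken from BayesElo itself.

```python
import math

E = 10 ** (1 / 400)  # so that E**(-x) == 10**(-x/400)


def F(x):
    """Logistic Elo curve."""
    return 1.0 / (1.0 + E ** (-x))


def draw_direct(x, h):
    """Draw likelihood as 1 - P(loss) - P(win)."""
    return 1.0 - F(-x - h) - F(x - h)


def draw_factored(x, h):
    """Draw likelihood as (1 - E^-2h) * F(x+h) * F(-x+h)."""
    return (1.0 - E ** (-2 * h)) * F(x + h) * F(-x + h)


def draw_factored_alt(x, h):
    """Draw likelihood as (E^2h - 1) * F(-x-h) * F(x-h)."""
    return (E ** (2 * h) - 1.0) * F(-x - h) * F(x - h)


if __name__ == "__main__":
    for x in (-800, -50, 0, 50, 800):
        for h in (10, 100, 300):
            d = draw_direct(x, h)
            assert math.isclose(d, draw_factored(x, h), rel_tol=1e-9)
            assert math.isclose(d, draw_factored_alt(x, h), rel_tol=1e-9)
    # When drawelo is far above the rating difference, the draw likelihood
    # barely changes with x, so draws carry almost no information.
    print(draw_direct(10, 600) / draw_direct(0, 600))
```

The final ratio stays very close to 1, which illustrates the last remark: with h much larger than x, the draw likelihood is nearly flat in x.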