Wilo rating properties from FGRL rating lists

Laskos · Post by **Laskos** » Thu May 04, 2017 6:15 pm

lkaufman wrote:
Another issue is that the programming of engines, as well as the play of human grandmasters, is aimed to maximize score with draws counting as 1/2, rather than just number of wins (although wins are sometimes used as a tiebreak). WILO might be better mathematically, but it does not correspond to the actual scoring of tournaments. This is not a minor issue. Suppose Komodo (or even Carlsen) reaches a middlegame position with a half-pawn advantage or so. He has to decide between retaining queens with let's say a 60% winning chance, a 20% losing chance, and a 20% drawing chance. Or he can simplify to an endgame with a 24% winning chance, a 75% drawing chance, and a 1% losing chance (i.e. a gross blunder or flag fall). In any normal tournament or match, he should keep queens on (assuming a neutral tournament/match situation) to maximize his expected score. But to maximize WILO, he should trade queens. Komodo has code to try to avoid simplifying in such a situation (maybe not very effective, but that's irrelevant); if we wanted to maximize WILO we would have to make significant program changes. In my view, we would have to return to the old practice of replaying draws until someone wins to justify switching to WILO. Elimination tournaments with playoffs at faster time limits to break ties are a version of this, but then you are rating blitz games together with slow ones. This is also my objection to Bayes Elo; it also makes an assumption that does not correspond to normal match/tournament scoring.

Valid objection. It stems from the fact that ELO assumes a given number of games N, so one has to optimise W-L for fixed N. WILO does not assume a fixed number of games, so one has to optimise W/L every time. If you assume a fixed N in WILO, the W-L and W/L descriptions are equivalent. So rating lists built from games played according to ELO, which optimise W-L for a given number of games, shouldn't be used as WILO rating lists with the draws cleaned out, because optimising ELO with fixed N and WILO with floating N' leads to different playing goals. Only games played to optimise WILO should be used, and there are none.

A fixed N for WILO might be adopted (replaying drawn games, which would bring the goals back on track). Your point is valid: it's a slightly different game, and both humans and engines would have to adjust to a maybe better rating system for Chess like WILO. Also, the game of Chess already has different goals in many cases, depending on the format and situation: tournament, match, RR with 2 opponents, RR with 40 opponents, Swiss, knock-out, tie-breaks, ELO gap 200, ELO gap 1000, and so on. For each, the "evals" of both humans and engines have to adjust.
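lkaufman's queen-trade example can be checked with a few lines of arithmetic. This is only a sketch: the function names are made up for illustration, and the probabilities are the ones quoted in his post (60/20/20 with queens on, 24/75/1 after simplifying).

```python
def expected_score(win, draw, loss):
    # Normal tournament scoring: a draw counts as half a point.
    return win + draw / 2

def win_loss_ratio(win, draw, loss):
    # The quantity WILO rewards: draws are ignored entirely.
    return win / loss

# (win, draw, loss) probabilities taken from the quoted example
keep_queens = (0.60, 0.20, 0.20)
trade_queens = (0.24, 0.75, 0.01)

for name, option in (("keep queens", keep_queens), ("trade queens", trade_queens)):
    print(name, expected_score(*option), win_loss_ratio(*option))
```

Keeping queens gives the higher expected score (about 0.70 against 0.615), while trading gives the far higher W/L ratio (about 24 against 3), which is exactly the conflict of goals discussed here.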

clumma · Post by **clumma** » Fri May 05, 2017 1:21 am

Laskos wrote:That LOS with a uniform prior is independent of draws? The math is not that complicated, probably best presented here:
http://www.talkchess.com/forum/viewtopi ... 05&t=30624

OK sure. But that is hardly justification for a rating system that ignores drawn games.

kbhearn · Post by **kbhearn** » Fri May 05, 2017 4:02 am

It seems likely that WiLo is an ideal model for predicting the winner of a head to head match. (In the case of dann's extreme example of a trillion draws and 1 loss it should still produce an error margin so large that the answer 'hell if i know' should be the same as the elo model).

It seems less ideal for a tournament setting where how well you bash apart the field is more important than how you do against your rival (giving away draws against weaker opponents is a problem).

I'd be interested in seeing a system to optimise for 3-1-0 tournaments where W+L > 2 * D should lead to more interesting play in the optimised case.

Michel · Post by **Michel** » Wed May 10, 2017 8:19 pm

There was just an interesting test on fishtest.

SF8 and the latest dev version were pitted against each other at time controls 10+0.1, 60+0.6 and 180+1.8.

Despite the 18-fold increase in TC, the value of elo/sigma(elo)/sqrt(games) remained almost exactly the same (let's call it normalized elo).

So it seems that at least in this case elo/sigma(elo)/sqrt(games) is a good statistic to use.

The properties of elo/sigma(elo)/sqrt(games) are:

(1) Its expectation value is independent of the number of games.
(2) Its standard deviation is 1/sqrt(games), i.e. independent of the draw ratio.

normalized elo is probably related to wilo (if wilo is indeed TC independent) but it has a more solid theoretic foundation. It is a measure for the amount of games it takes to establish the superiority of one engine over another with a specified p-value (or LOS).

Here is some simple python code that computes the normalized elo with an error estimate (for small elo values)

Code: Select all

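from __future__ import division  # true division on Python 2; harmless on Python 3

def sens(W, D, L):
    """Return the 95% confidence interval (low, high) of the normalized elo."""
    N = W + D + L
    (w, d, l) = (W / N, D / N, L / N)  # observed outcome frequencies
    s = w + d / 2                      # mean score per game
    # per-game variance of the score around its mean
    var = w * (1 - s)**2 + d * (1/2 - s)**2 + l * (0 - s)**2
    sigma = var**.5
    # point estimate (s - 1/2)/sigma, plus or minus 1.96/sqrt(N)
    return ((s - 1/2) / sigma - 1.96 / N**.5, (s - 1/2) / sigma + 1.96 / N**.5)

if __name__ == '__main__':
    print('stc', sens(W=7460, L=5146, D=27394))
    print('ltc', sens(W=3777, L=2201, D=21308))
    print('vltc', sens(W=3919, L=2178, D=28614))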

Laskos · Post by **Laskos** » Thu May 11, 2017 4:41 am

Michel wrote:There was just an interesting test on fishtest.

SF8 and the latest dev version were pitted against each other at time controls 10+0.1, 60+0.6 and 180+1.8.

Despite the 18-fold increase in TC, the value of elo/sigma(elo)/sqrt(games) remained almost exactly the same (let's call it normalized elo).

So it seems that at least in this case elo/sigma(elo)/sqrt(games) is a good statistic to use.

The properties of elo/sigma(elo)/sqrt(games) are:

(1) Its expectation value is independent of the number of games.
(2) Its standard deviation is 1/sqrt(games), i.e. independent of the draw ratio.

normalized elo is probably related to wilo (if wilo is indeed TC independent) but it has a more solid theoretic foundation. It is a measure for the amount of games it takes to establish the superiority of one engine over another with a specified p-value (or LOS).

Here is some simple python code that computes the normalized elo with an error estimate (for small elo values)
Code: Select all
from __future__ import division
def sens(W=None,D=None,L=None):
    N=W+D+L
    (w,d,l)=(W/N,D/N,L/N)
    s=w+d/2
    var=w*(1-s)**2+d*(1/2-s)**2+l*(0-s)**2
    sigma=var**.5
    return ((s-1/2)/sigma-1.96/N**.5,((s-1/2)/sigma+1.96/N**.5))

if __name__=='__main__':
    print('stc',sens(W=7460,L=5146,D=27394))
    print('ltc',sens(W=3777,L=2201,D=21308))
    print('vltc',sens(W=3919,L=2178,D=28614))

Nice observation! For elo/sigma(elo) you can use the simplified expression for small Elo differences: (Wins-Losses)/sqrt(Wins+Losses). And we see that although this expression is independent of Draws, and is identical in ELO and WILO, the time control invariant for ELO elo/sigma(elo)/sqrt(games) is dependent on draw rate (but is independent of the total number of games).

Invariant ELO = (Wins-Losses)/sqrt(Wins+Losses)/sqrt(games)
It seems to not depend much on time control. It is independent of the number of games. It is dependent on draw ratio.
The 95% confidence interval for "Invariant ELO" is
+- 1.96/sqrt(games),
and is independent of the draw rate.

The "invariant Elo" can also be interpreted this way: invert it

sqrt(Wins+Losses) * sqrt(games) / (Wins-Losses)

The square of this, (Wins+Losses) * games / (Wins-Losses)**2, is the number of games required to get 1 standard deviation or LOS=84%, and is invariant with regard to time control according to your observation. It is also independent of the number of games. It depends on draw rate. To get X standard deviations (and the corresponding LOS), one needs X^2 times as many games.

WILO is identical if we keep Draws in the number of games. If WILO is applied to a "drawless Chess" (Draws do not occur), dealing only with Wins and Losses in "games", then the number of games needed for 1 SD in WILO should decrease with time control; this quantity is no longer time control independent in WILO.
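As a rough illustration of the two expressions above, here is a small sketch that evaluates the simplified "invariant Elo" and the number of games needed for a given number of standard deviations. The function names are hypothetical, and the W/L/D counts are simply the three fishtest samples from Michel's script.

```python
import math

def invariant_elo(wins, losses, draws):
    # Small-difference approximation: (W - L) / sqrt(W + L) / sqrt(games)
    games = wins + losses + draws
    return (wins - losses) / math.sqrt(wins + losses) / math.sqrt(games)

def games_for_sigmas(wins, losses, draws, sigmas=1.0):
    # (W + L) * games / (W - L)**2 games give 1 standard deviation;
    # X standard deviations need X**2 times as many.
    games = wins + losses + draws
    one_sigma = (wins + losses) * games / (wins - losses) ** 2
    return sigmas ** 2 * one_sigma

# (wins, losses, draws) as given in Michel's script
samples = {
    "stc": (7460, 5146, 27394),
    "ltc": (3777, 2201, 21308),
    "vltc": (3919, 2178, 28614),
}

for name, (w, l, d) in samples.items():
    print(name, round(invariant_elo(w, l, d), 4), round(games_for_sigmas(w, l, d), 1))
```

Because this uses the small-difference simplification, the values should only be expected to agree approximately with the exact (s-1/2)/sigma from the script.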

Michel · Post by **Michel** » Thu May 11, 2017 6:43 am

Thanks. Yes, to understand the difference between elo, wilo and normalized elo one can indeed look at small elo differences and do a Taylor series expansion. So we put

(w,d,l)=(a+eps,1-2*a,a-eps)

and we look at the dominant term in eps.

The result is as follows

elo is proportional to eps
normalized elo is proportional to eps/sqrt(a)
wilo is proportional to eps/a

In the fishtest experiment normalized elo seemed to remain constant across TC. So eps/sqrt(a) stayed constant. Since a goes down with TC
one expects:

elo goes down with TC
normalized elo stays constant (this was our hypothesis)
wilo goes up with TC

This was indeed what was observed.

Of course this is only a single data point and it is dangerous to draw serious conclusions from it, but it points the way to a more systematic analysis of "scaling".
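The three proportionalities can be checked numerically. The sketch below is not part of the fishtest analysis; it assumes the usual logistic formula elo = 400*log10(s/(1-s)), takes wilo = 400*log10(W/L), and uses the same normalized elo definition as the script above. All function names are illustrative.

```python
import math

def elo(w, d, l):
    # Assumed logistic Elo from the mean score s
    s = w + d / 2
    return 400 * math.log10(s / (1 - s))

def normalized_elo(w, d, l):
    # (s - 1/2) / sigma, with sigma the per-game standard deviation
    s = w + d / 2
    var = w * (1 - s) ** 2 + d * (0.5 - s) ** 2 + l * s ** 2
    return (s - 0.5) / math.sqrt(var)

def wilo(w, d, l):
    return 400 * math.log10(w / l)

def probs(a, eps):
    # Parametrisation used in the post: (w, d, l) = (a+eps, 1-2a, a-eps)
    return (a + eps, 1 - 2 * a, a - eps)

eps = 1e-3
for a in (0.30, 0.20, 0.10):
    p = probs(a, eps)
    # Each ratio should be nearly the same for every a if the scaling holds
    print(a,
          elo(*p) / eps,
          normalized_elo(*p) / (eps / math.sqrt(a)),
          wilo(*p) / (eps / a))
```

If the expansion is right, the three printed ratios stay essentially constant as a changes (close to 1600/ln 10, sqrt(2) and 800/ln 10 respectively).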

Laskos · Post by **Laskos** » Thu May 11, 2017 8:49 pm

Michel wrote:Thanks. Yes, to understand the difference between elo, wilo and normalized elo one can indeed look at small elo differences and do a Taylor series expansion. So we put

(w,d,l)=(a+eps,1-2*a,a-eps)

and we look at the dominant term in eps.

The result is as follows

elo is proportional to eps
normalized elo is proportional to eps/sqrt(a)
wilo is proportional to eps/a

In the fishtest experiment normalized elo seemed to remain constant across TC. So eps/sqrt(a) stayed constant. Since a goes down with TC
one expects:

elo goes down with TC
normalized elo stays constant (this was our hypothesis)
wilo goes up with TC

This was indeed what was observed.

Of course this is only a single data point and it is dangerous to draw serious conclusions from it, but it points the way to a more systematic analysis of "scaling".

Good, so Normalized ELO sits somewhere in the middle of the road between ELO and WILO, and has a very nice interpretation, being possibly a time control invariant.

I used FGRL data to compute the Normalized ELO rating lists and scaling globally. I used your python script, as ELO differences are not that small.

Code: Select all

60'' + 0.6''

#   PLAYER          Norm ELO         Error (95%)
================================================
                                      0.041
1.  Stockfish 8       0.904           
2.  Houdini 5         0.849 
3.  Komodo 10.4       0.745
4.  Shredder 13       0.011
5.  Fire 5           -0.073
6.  Fizbo 1.9        -0.177
7.  Gull 3           -0.289
8.  Andscacs 0.90    -0.501
9.  Fritz 15         -0.509
10. Chiron 4         -0.536

Code: Select all

60' + 15''

#   PLAYER          Norm ELO         Error (95%)
================================================
                                      0.053
1.  Stockfish 8       0.760           
2.  Komodo 10.4       0.700
3.  Houdini 5         0.605
4.  Shredder 13       0.056
5.  Fire 5           -0.026
6.  Gull 3           -0.242
7.  Fizbo 1.9        -0.263
8.  Andscacs 0.90    -0.264
9.  Chiron 4         -0.513
10. Fritz 15         -0.575

Code: Select all

Scaling:

#   PLAYER          Norm ELO         Error (95%)
================================================
                                      0.067
1.  Andscacs 0.90     0.237
2.  Fire 5            0.047
3.  Gull 3            0.047
4.  Shredder 13       0.045
5.  Chiron 4          0.023
6.  Komodo 10.4      -0.045
7.  Fritz 15         -0.066
8.  Fizbo 1.9        -0.086
9.  Stockfish 8      -0.144 
10. Houdini 5        -0.244

Just visually, it seems that Normalized ELO, like ELO, favors weaker engines in scaling. WILO didn't exhibit this behavior.

I also played two matches of 2000 games each between Stockfish dev and Komodo 10.4 at 10''+0.1'' and 60''+0.6'':

Code: Select all

Score of SF dev vs Komodo: 654 - 142 - 1204  [0.628] 2000
ELO difference: 90.97 +/- 9.38

Score of SF dev vs Komodo: 470 - 126 - 1404  [0.586] 2000
ELO difference: 60.36 +/- 8.12

The normalized ELO is in these two cases:

Code: Select all

STC:  0.444 +/- 0.044
LTC:  0.332 +/- 0.044

So, it seems Komodo does indeed scale better from ultra-fast to bullet than Stockfish (outside error margins).
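For reference, the two normalized ELO figures can be reproduced from the match scores with a short sketch along the lines of Michel's script. It assumes the score lines read wins - losses - draws from Stockfish's side (which is consistent with the 0.628 and 0.586 score fractions shown); the function name is made up for illustration.

```python
import math

def normalized_elo(wins, losses, draws):
    # Returns (point estimate, 95% half-width) of (s - 1/2) / sigma
    n = wins + losses + draws
    w, d, l = wins / n, draws / n, losses / n
    s = w + d / 2
    var = w * (1 - s) ** 2 + d * (0.5 - s) ** 2 + l * s ** 2
    return (s - 0.5) / math.sqrt(var), 1.96 / math.sqrt(n)

# Match scores from the post, read as wins - losses - draws (assumption)
stc = normalized_elo(654, 142, 1204)
ltc = normalized_elo(470, 126, 1404)

print("STC: %.3f +/- %.3f" % stc)
print("LTC: %.3f +/- %.3f" % ltc)
```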

lucasart · Post by **lucasart** » Fri May 12, 2017 2:16 am

I prefer to measure things in BayesElo. The point is that DrawElo absorbs the higher draw ratio between different tc. Constant BayesElo scaling is in fact the implicit assumption underpinning SF testing methodology.

Code: Select all

tc       W     L     D      DrawElo  BayesElo  Wilo
10+0.1   7460  5146  27394  294      38.2      64.5
60+0.6   5397  3150  30010  368      52.5      93.5
180+1.8  4318  2394  31584  414      56.0      102.5

So I would conclude that Stockfish has good scaling. Both BayesElo and Wilo are still increasing between LTC and VLTC. Scaling is even better from STC to LTC.

The BayesElo gain from LTC to VLTC is probably still within error bars. However, error bars are difficult to calculate for this model. Apart from running MC simulations, I can't see another way.
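The table values can be recomputed from the raw counts with a sketch like the following. It uses Bayeselo = 200*log10{(1 - L)*W/[(1 - W)*L]} and WiLo = 400*log10(W/L), with W and L as win and loss frequencies; the DrawElo expression 200*log10{(1 - W)*(1 - L)/(W*L)} is an assumption here (the closed form that follows from the same two-parameter logistic model), and the function name is hypothetical.

```python
import math

def bayes_row(wins, losses, draws):
    # Returns (DrawElo, BayesElo, Wilo) from raw game counts
    n = wins + losses + draws
    w, l = wins / n, losses / n
    bayeselo = 200 * math.log10(w * (1 - l) / ((1 - w) * l))
    # Assumed closed form for DrawElo under the same model
    drawelo = 200 * math.log10((1 - w) * (1 - l) / (w * l))
    wilo = 400 * math.log10(wins / losses)
    return drawelo, bayeselo, wilo

# (wins, losses, draws) per time control, as in the table
rows = {
    "10+0.1": (7460, 5146, 27394),
    "60+0.6": (5397, 3150, 30010),
    "180+1.8": (4318, 2394, 31584),
}

for tc, counts in rows.items():
    drawelo, bayeselo, wilo = bayes_row(*counts)
    print("%-8s %5.0f %6.1f %6.1f" % (tc, drawelo, bayeselo, wilo))
```

This only reproduces point estimates; it says nothing about the error bars, which remain the hard part.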

Ajedrecista · Post by **Ajedrecista** » Sat May 13, 2017 12:01 pm

Hello:

Just to complete the picture:

Michel wrote:Thanks. Yes, to understand the difference between elo, wilo and normalized elo one can indeed look at small elo differences and do a Taylor series expansion. So we put

(w,d,l)=(a+eps,1-2*a,a-eps)

and we look at the dominant term in eps.

The result is as follows

elo is proportional to eps
normalized elo is proportional to eps/sqrt(a)
wilo is proportional to eps/a

lucasart wrote:Both BayesElo and Wilo are still increasing between LTC and VLTC. Scaling is even better from STC to LTC.

For small differences, using Bayeselo = 200*log10{(1 - L)*W/[(1 - W)*L]} and Michel's notation, I get Bayeselo ~ 2*eps/[a*(1 - a)] ~ eps/[a*(1 - a)], which is somewhat similar to WiLo's behaviour for reasonable values of a (please remember that 0 <= a <= 1/2, then just compare 1/a and 1/[a*(1 - a)]).

In fact, not taking into account constants of 200 and 400:

Code: Select all

WiLo = 400*log10(W/L)
Bayeselo = 200*log10{(1 - L)*W/[(1 - W)*L]}

~ means 'proportional to' here.

WiLo ~ ln(W/L) ~ 2*eps/a ~ eps/a
Bayeselo ~ ln[(1 - L)/(1 - W)] + ln(W/L) ~ ln[(1 - L)/(1 - W)] + WiLo

Bayeselo ~ 2*eps/(1 - a) + 2*eps/a ~ eps*[1/(1 - a) + 1/a] ~ eps/[a*(1 - a)]

So I expect some similarities for WiLo and Bayeselo for small differences (|eps| << 1), unless I went wrong somewhere.

Corrections and other insights are welcome.

Regards from Spain.

Ajedrecista.
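A quick numerical check of the small-eps approximations above, dropping the constants 200 and 400 and working in natural logarithms as in the derivation. This is just a sketch with made-up function names; a and eps are sample values, not data from the thread.

```python
import math

def wilo_ln(a, eps):
    # ln(W/L) with W = a + eps, L = a - eps
    return math.log((a + eps) / (a - eps))

def bayeselo_ln(a, eps):
    # ln[(1 - L)/(1 - W)] + ln(W/L)
    w, l = a + eps, a - eps
    return math.log((1 - l) / (1 - w)) + math.log(w / l)

eps = 1e-3
for a in (0.10, 0.20, 0.40):
    # Exact value next to its leading-order approximation
    print(a,
          wilo_ln(a, eps), 2 * eps / a,
          bayeselo_ln(a, eps), 2 * eps / (a * (1 - a)))
```

For small a the two quantities behave almost identically, since 1/[a*(1 - a)] approaches 1/a; they differ by the factor 1/(1 - a), which is at most 2 on the allowed range of a.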

PCM72 · Post by **PCM72** » Sat May 13, 2017 6:36 pm

lkaufman wrote: Another issue is [...] not a minor issue [...] to maximize WILO, he should trade queens

That's the main issue, IMO.
The logic is quite simple: I'm playing an interesting, unclear game, but draws don't matter -> I manage to draw this game and I'll try to win the next one (no matter if I only win after a trillion games).

Chess' goal has always been to try to win while avoiding draws, since draws ARE time-consuming and often take real effort. Trying to avoid them by just ignoring them is a fruitless placebo, if it doesn't actually have the negative effect of wasting time (and games).

Laskos wrote:Also, there are many cases when the game of Chess has different goals [...]

OK, measuring engines' "strength" is already a different goal. In this case Wilo rating can be useful, as well as an unbalanced set of openings (which would reduce draws' weight as well, IMO in a more efficient way if the set is chosen accurately), but this strength should always have as reference the main goal above.
