H4 or S5 !?

Laskos · Post by **Laskos** » Tue Jun 03, 2014 11:38 pm

michiguel wrote:
There is only one correct answer, and that is SF5 should be #1 (by a very small margin, though). Why? This is a round robin, so everybody played each other under the same conditions, and the program that scores the most points overall should be #1. This is one of the cases in which there is no doubt about the relative order. As a reference, in the output of Ordo you can see the actual points (the others give %). Whatever program you use, the relative order should follow the number of points. Basically, SF won this gigantic RR tournament, and should be #1.

1 Stockfish 5 : 3115.1 2473.0 3300 74.9%
2 Houdini 4 : 3111.0 2458.5 3300 74.5%

Miguel

Laskos wrote:
Miguel, I am a bit tired and can't reason clearly. Can you prove that in the case:
Direct matches in RR
A>B
B>C
C>A

and total points in RR are A>B>C, then Elo ratings are always also A>B>C?

michiguel wrote:
It can be demonstrated as long as certain assumptions are respected, such as two draws being equal to one win plus one loss. But I am not sure I can do it elegantly or understandably by quickly typing here.

For instance, let's try a reductio ad absurdum. Assume that the Elo order is EloB > EloA > EloC. In that case, A faced a stronger schedule than B: both faced C, but the head-to-head match was tougher for A (because EloB > EloA). Consequently, A faced a tougher schedule and still got more points, which means it should have the higher Elo. But that contradicts the initial assumption EloB > EloA > EloC, disproving it. If you keep doing this analysis for the other orderings, you will see that the only consistent scenario is EloA > EloB > EloC.

Miguel

That still seems to assume transitivity in match results (which the Elo scale itself assumes). Maybe I will try to disprove your proof with 4 entities, when I get time to build a suitable PGN file for Ordo (I recommend Ordo to every tester).
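
For what it's worth, the three-player case can be checked numerically. The following is a minimal sketch, assuming a plain logistic Elo curve with no white advantage and no draw model, in which ratings are adjusted until every player's expected score equals the points actually scored. The function names, the step size and the 10-game mini-matches are made up for illustration; this is not Ordo's code.

```python
def expected(r_a, r_b):
    """Expected score of a player rated r_a against r_b on the logistic Elo curve."""
    return 1.0 / (1.0 + 10.0 ** ((r_b - r_a) / 400.0))

def fit_ratings(results, games_per_pair, iters=2000, step=10.0):
    """results[(x, y)] = points x scored against y over games_per_pair games.

    Repeatedly nudges each rating by (actual - expected) points until they agree.
    """
    players = sorted({p for pair in results for p in pair})
    actual = {p: 0.0 for p in players}
    for (x, y), pts in results.items():
        actual[x] += pts
        actual[y] += games_per_pair - pts
    rating = {p: 0.0 for p in players}
    for _ in range(iters):
        exp = {p: 0.0 for p in players}
        for (x, y) in results:
            e = expected(rating[x], rating[y])
            exp[x] += games_per_pair * e
            exp[y] += games_per_pair * (1.0 - e)
        for p in players:
            rating[p] += step * (actual[p] - exp[p])
    return actual, rating

# Hypothetical cycle: A beats B 7:3, B beats C 8:2, C beats A 5.5:4.5.
results = {("A", "B"): 7.0, ("B", "C"): 8.0, ("C", "A"): 5.5}
points, rating = fit_ratings(results, games_per_pair=10)
by_points = sorted(points, key=points.get, reverse=True)
by_rating = sorted(rating, key=rating.get, reverse=True)
print(by_points, by_rating)
```

In this made-up cycle the totals are A 11.5, B 11.0, C 7.5, and the fitted ratings come out in the same order even though C won the direct match against A. That only illustrates the claim for this kind of formula; it says nothing about formulas with extra parameters.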

syzygy · Post by **syzygy** » Tue Jun 03, 2014 11:41 pm

To be able to prove anything we must first define what we are talking about.

What is "Elo rating"?

Option 1: A number calculated using some formula (e.g. Elostat, BayesElo, Ordo) based on a bunch of game results.

In this case the answer to the question follows from the formula. Use a different formula and the answer may be different.

I would say that a "good" formula should have the property that, if the set of input games is full RR, then Elo(A) > Elo(B) iff A > B according to the RR (in points).

Clearly BayesElo does not have this property.

Option 2: Elo is a measure of a player's "true chess strength". All numbers that we calculate are estimations of this measure.

In this case it is clearly always possible that Elo(A) > Elo(B) even though in a particular RR B scores more points than A.

We're probably not talking about option 2, but about option 1. So it all depends on the formula used to calculate the Elo rating from game results.
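
One way to make the "good formula" property concrete, as a sketch: assume the formula picks ratings $R_i$ so that each player's expected score on the logistic curve equals the points $S_i$ actually scored, with $n$ games per pairing and no colour or draw parameters. These are assumptions for the sake of the argument, not a description of any particular program.

```latex
\[
S_i = n \sum_{j \neq i} f(R_i - R_j), \qquad f(d) = \frac{1}{1 + 10^{-d/400}} .
\]
Suppose $R_A > R_B$. Since $f$ is strictly increasing and $f(d) + f(-d) = 1$,
\[
S_A - S_B = n\,\bigl[f(R_A - R_B) - f(R_B - R_A)\bigr]
          + n \sum_{j \neq A,B} \bigl[f(R_A - R_j) - f(R_B - R_j)\bigr] > 0 .
\]
```

Every bracket is positive, so $R_A > R_B$ implies $S_A > S_B$. Swapping the roles of A and B gives the reverse case, and equal ratings give equal points, so under these assumptions the rating order and the points order coincide in a full RR.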

Laskos · Post by **Laskos** » Tue Jun 03, 2014 11:48 pm

Yes Ron, along the same lines, we have to define what we want. But intuitively a 99:1 result should carry less weight than a 52:48 one, because the error margins on the logistic curve are larger in the first case, and the statistical weight is 1/(error margin)^2. So not all points are equal if we want the smallest error margins, and that, I think, is the goal of all rating calculations (besides obeying the logistic curve, which BayesElo does not).
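
To put rough numbers on this, here is a minimal sketch, assuming 100 decisive games (no draws), a binomial error on the score, and first-order error propagation through the logistic curve. The function names are illustrative only.

```python
import math

def elo_diff(score):
    """Elo difference implied by a score fraction on the logistic curve."""
    return -400.0 * math.log10(1.0 / score - 1.0)

def elo_stderr(score, games):
    """Approximate 1-sigma error of elo_diff, via the slope of the curve (delta method)."""
    sigma_score = math.sqrt(score * (1.0 - score) / games)
    slope = 400.0 / (math.log(10.0) * score * (1.0 - score))
    return slope * sigma_score

for s in (0.99, 0.52):
    d, err = elo_diff(s), elo_stderr(s, 100)
    print(f"{s:.2f}: {d:6.1f} +/- {err:5.1f} Elo, weight {1.0 / err**2:.6f}")
```

Under these assumptions the 99:1 match has an error margin about five times larger than the 52:48 one, so its weight 1/(error margin)^2 is about 25 times smaller.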

Modern Times · Post by **Modern Times** » Wed Jun 04, 2014 6:47 am

Take one of Graham's amateur tourneys where he provides the PGN. Run it through BayesElo or your other tool of choice, and see if the ratings follow the rankings in the tournament. I'm certain I've done this in the past, and the Elo ratings do not always follow the tournament rankings.

lkaufman · Post by **lkaufman** » Wed Jun 04, 2014 6:54 am

It's quite clear that BayesElo does not have the property of ordering the ratings the same as the results in an RR, while both EloStat (which has worse problems) and Ordo will always do this. Anyone who wants ratings to be in the same order as RR scores should switch to Ordo. It also has the nice property that rating differences in a simple match will always be the same as the Elo system dictates. With BayesElo, you always have to specify the parameters when talking about a rating difference between two engines.

Modern Times · Post by **Modern Times** » Wed Jun 04, 2014 7:37 am

No, I would have been using EloStat back then.

And no, I don't want ratings to be in the same order as RR scores because I think that is a false premise. True in some cases, not in others.

carldaman · Post by **carldaman** » Wed Jun 04, 2014 8:07 am

IWB wrote:

IWB wrote: ..., as I am curious I will run the S5 match again with 4pc Syzygy bases....

Test is running. The original setup had this:

Code: Select all

     Stockfish 5              3080.0 (2297.0 : 783.0)
                              220.0 (127.5 :  92.5) Houdini 4           3111
                              220.0 (121.5 :  98.5) Komodo 7a           3088
                              220.0 (134.0 :  86.0) Gull 3              3057
                              220.0 (150.5 :  69.5) Critter 1.4a        2980
                              220.0 (149.0 :  71.0) Equinox 2.02        2975
                              220.0 (159.5 :  60.5) Deep Rybka 4.1      2959
                              220.0 (176.0 :  44.0) Deep Fritz 14       2894
                              220.0 (170.5 :  49.5) Chiron 2            2889
                              220.0 (181.0 :  39.0) Protector 1.6.0     2870
                              220.0 (168.0 :  52.0) Hannibal 1.4b       2870
                              220.0 (183.0 :  37.0) Naum 4.2            2838
                              220.0 (187.0 :  33.0) Texel 1.04          2838
                              220.0 (187.0 :  33.0) Senpai 1.0          2838
                              220.0 (188.0 :  32.0) HIARCS 14 WCSC 32b  2812
                              220.0 (190.5 :  29.5) Jonny 6.00          2798

So 74.58% have to be beaten!

Bye
Ingo

Thanks for the re-run, Ingo

CL
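
As a cross-check of the arithmetic in Ingo's table, here is a small sketch that re-adds the per-opponent scores as transcribed above. Only the numbers come from the post; the variable names are illustrative.

```python
# Stockfish 5's score against each opponent, copied from the table (220 games each).
scores = {
    "Houdini 4": 127.5, "Komodo 7a": 121.5, "Gull 3": 134.0,
    "Critter 1.4a": 150.5, "Equinox 2.02": 149.0, "Deep Rybka 4.1": 159.5,
    "Deep Fritz 14": 176.0, "Chiron 2": 170.5, "Protector 1.6.0": 181.0,
    "Hannibal 1.4b": 168.0, "Naum 4.2": 183.0, "Texel 1.04": 187.0,
    "Senpai 1.0": 187.0, "HIARCS 14 WCSC 32b": 188.0, "Jonny 6.00": 190.5,
}
games_per_opponent = 220

total_points = sum(scores.values())
total_games = games_per_opponent * len(scores)
print(f"rows:   {total_points} / {total_games} = {100 * total_points / total_games:.2f}%")

# Header line of the table as posted.
header_points, header_games = 2297.0, 3080.0
print(f"header: {header_points} / {header_games} = {100 * header_points / header_games:.2f}%")
```

The fifteen rows add up to 2473.0 out of 3300 (74.94%), which agrees with the Ordo line quoted at the top of the thread. The header's 2297.0 out of 3080 is the 74.58% mentioned in the post and corresponds to fourteen opponents; the gap of 176.0 points equals the Deep Fritz 14 row, though the post does not say why the header and rows differ.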

Laskos · Post by **Laskos** » Wed Jun 04, 2014 10:21 am

With a concocted PGN file using Ordo, I got

Code: Select all

   # PLAYER  RATING  ERROR   POINTS  PLAYED    (%)
   1 4    : 2361.7  100.0      8.0      15   53.3%
   2 3    : 2314.6   99.5      7.5      15   50.0%
   3 2    : 2303.0  102.2      8.0      15   53.3%
   4 1    : 2220.7   96.7      6.5      15   43.3%

We see an inversion between 2nd and 3rd places as far as the number of points goes: player 3 is rated above player 2 despite scoring 7.5 points against 8.0. My guess is that it has to do with the -W switch I used for white advantage.

Laskos · Post by **Laskos** » Wed Jun 04, 2014 11:01 am

The command line in Ordo v0.8 was:

Code: Select all

ordo -p order.pgn -o ratings.txt -s1000 -W

The artificial PGN has the following properties:

Code: Select all

Games        :     30 (finished)

White Wins   :     16 (53.3 %)
Black Wins   :     13 (43.3 %)
Draws        :      1 ( 3.3 %)
Unfinished   :      0

White Perf.  : 55.0 %
Black Perf.  : 45.0 %

and is listed here:

Code: Select all

[White "1"]
[Black "2"]
[Result "1-0"]
1-0

[White "1"]
[Black "2"]
[Result "1-0"]
1-0

[White "1"]
[Black "2"]
[Result "1-0"]
1-0


[White "1"]
[Black "2"]
[Result "0-1"]
0-1

[White "1"]
[Black "2"]
[Result "0-1"]
0-1

[White "1"]
[Black "3"]
[Result "1-0"]
1-0

[White "1"]
[Black "3"]
[Result "1-0"]
1-0

[White "1"]
[Black "3"]
[Result "1-0"]
1-0

[White "1"]
[Black "3"]
[Result "1/2-1/2"]
1/2-1/2

[White "1"]
[Black "3"]
[Result "0-1"]
0-1

[White "2"]
[Black "3"]
[Result "1-0"]
1-0

[White "2"]
[Black "3"]
[Result "1-0"]
1-0

[White "2"]
[Black "3"]
[Result "1-0"]
1-0

[White "2"]
[Black "3"]
[Result "0-1"]
0-1

[White "2"]
[Black "3"]
[Result "0-1"]
0-1

[White "2"]
[Black "4"]
[Result "1-0"]
1-0

[White "2"]
[Black "4"]
[Result "1-0"]
1-0

[White "2"]
[Black "4"]
[Result "1-0"]
1-0

[White "2"]
[Black "4"]
[Result "0-1"]
0-1

[White "2"]
[Black "4"]
[Result "0-1"]
0-1

[White "3"]
[Black "4"]
[Result "1-0"]
1-0

[White "3"]
[Black "4"]
[Result "1-0"]
1-0

[White "3"]
[Black "4"]
[Result "1-0"]
1-0

[White "3"]
[Black "4"]
[Result "1-0"]
1-0

[White "3"]
[Black "4"]
[Result "0-1"]
0-1

[White "1"]
[Black "4"]
[Result "0-1"]
0-1

[White "1"]
[Black "4"]
[Result "0-1"]
0-1

[White "1"]
[Black "4"]
[Result "0-1"]
0-1

[White "1"]
[Black "4"]
[Result "0-1"]
0-1

[White "1"]
[Black "4"]
[Result "0-1"]
0-1
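
The colour split in this PGN is worth a look in connection with the guess about -W. Here is a minimal sketch that tallies points and colours per player; the game list is transcribed from the PGN above using the [Result] tags, and the tuple layout is just an illustrative encoding.

```python
from collections import defaultdict

# (white, black, white wins, draws, black wins), five games per pairing.
pairings = [
    ("1", "2", 3, 0, 2),
    ("1", "3", 3, 1, 1),
    ("2", "3", 3, 0, 2),
    ("2", "4", 3, 0, 2),
    ("3", "4", 4, 0, 1),
    ("1", "4", 0, 0, 5),
]

points = defaultdict(float)
whites = defaultdict(int)
blacks = defaultdict(int)
for white, black, w, d, b in pairings:
    n = w + d + b
    whites[white] += n
    blacks[black] += n
    points[white] += w + 0.5 * d
    points[black] += b + 0.5 * d

for p in sorted(points):
    print(f"player {p}: {points[p]:4.1f} pts, {whites[p]:2d} as White, {blacks[p]:2d} as Black")
```

The tallies reproduce the points in the Ordo table (6.5, 8.0, 7.5, 8.0). Player 2 had White in 10 of its 15 games, player 3 in only 5, and player 4 in none, so once a white advantage is credited, player 3's 7.5 points can plausibly be rated above player 2's 8.0.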

Vinvin · Post by **Vinvin** » Wed Jun 04, 2014 11:08 am

I think 4pc Syzygy will change nothing in SF's strength. So for me, it's a second test with SF at exactly the same level, which is interesting too.

BTW, it would be interesting to find one (yes, only one!) game where the 4pc Syzygy bases change the final outcome.
