H4 or S5 !?

Discussion of computer chess matches and engine tournaments.

Moderator: Ras

IWB
Posts: 1539
Joined: Thu Mar 09, 2006 2:02 pm

Re: H4 or S5 !?

Post by IWB »

Uri Blass wrote:
I disagree that humans value a decisive game more than a tie
when the result is equal.

I feel that it is the opposite.

If A beats B 40-0 with 960 draws then
it shows that A is clearly the better player.

If A beats B 520-480 with no draws then I am not even sure that A is stronger than B, so my feeling is that if 2 programs have the same score, the program that got more draws should be number 1.

I see your point from a statistical perspective. My feeling however is that humans watching a human tourney tend to credit the tourney win to the player with more wins rather than to the one with more draws ...

As I know that your knowledge about all this is far superior to mine, what would be your recommendation for a statistical evaluation of my RRRL (particularly in this case)? What tool would be the best to use?
Some years ago I changed to Bayes because basically everyone agreed that this is "more precise" than Elostat. I am willing to change again if this decision was flawed!

Thx
Ingo
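The statistical side of the quoted argument can be checked with a quick sketch (my own illustration, using the hypothetical 40-0-960 and 520-480-0 records from the example above):

```python
import math

# Both hypothetical records score 520/1000 points, but the error bars differ.
def stderr_of_mean(wins, losses, draws):
    """Mean game score (win=1, draw=0.5, loss=0) and its standard error."""
    n = wins + losses + draws
    mean = (wins + 0.5 * draws) / n
    var = (wins * (1.0 - mean) ** 2
           + draws * (0.5 - mean) ** 2
           + losses * (0.0 - mean) ** 2) / n
    return mean, math.sqrt(var / n)

m1, se1 = stderr_of_mean(40, 0, 960)    # 40 wins, 960 draws
m2, se2 = stderr_of_mean(520, 480, 0)   # 520 wins, 480 losses
print(m1, se1)  # 0.52, ~0.003
print(m2, se2)  # 0.52, ~0.016
```

Same score, but the draw-heavy result pins the true strength down about five times more tightly, which is why the 40-0 record is much stronger evidence of superiority.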
Modern Times
Posts: 3703
Joined: Thu Jun 07, 2012 11:02 pm

Re: H4 or S5 !?

Post by Modern Times »

IWB wrote: Some years ago I changed to Bayes because basically everyone agreed that this is "more precise" than Elostat. I am willing to change again if this decision was flawed!

Thx
Ingo
I don't think that was a flawed decision.
lkaufman
Posts: 6227
Joined: Sun Jan 10, 2010 6:15 am
Location: Maryland USA
Full name: Larry Kaufman

Re: H4 or S5 !?

Post by lkaufman »

Modern Times wrote:
IWB wrote: Some years ago I changed to Bayes because basically everyone agreed that this is "more precise" than Elostat. I am willing to change again if this decision was flawed!

Thx
Ingo
I don't think that was a flawed decision.
I think that the decision was correct, but that now you (and CCRL and CEGT) should switch to Ordo. Elostat is seriously flawed in how it handles mismatches; in some cases a player can even lose points with a 100% score. So Bayes is better than Elostat. But Ordo has none of the flaws of Elostat, and will always give a higher rating to a higher score in a RR tournament. Bayes gives more weight to draws than the other two do, so a top program which draws more with weak programs will be punished. Basically, the choice comes down to whether you want to treat a win and a loss as the same as two draws (if so you should choose Ordo) or whether you accept the premise of Bayes, which is that one win and one loss equal ONE draw (this was explained here by HGM). This may have some theoretical justification, but Ordo corresponds to the way chess tournaments are actually scored.
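The "two draws equal a win plus a loss" premise can be sketched in a few lines (an illustration of the principle, not Ordo's actual fitting code): any model driven purely by total score maps different win/draw mixes with the same points to the same rating difference.

```python
import math

def elo_diff_from_score(score):
    """Invert the logistic expectancy E = 1 / (1 + 10^(-d/400))."""
    return -400.0 * math.log10(1.0 / score - 1.0)

# 3 wins + 1 loss vs 2 wins + 2 draws: both are 3.0 points out of 4.
decisive_mix = (3 * 1.0 + 0 * 0.5 + 1 * 0.0) / 4
drawish_mix  = (2 * 1.0 + 2 * 0.5 + 0 * 0.0) / 4
print(elo_diff_from_score(decisive_mix))  # ~190.8 Elo
print(elo_diff_from_score(drawish_mix))   # identical, ~190.8 Elo
```

A score-only model is blind to the win/draw mix; a Bayes-style draw model is not, which is exactly the choice being discussed.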
User avatar
michiguel
Posts: 6401
Joined: Thu Mar 09, 2006 8:30 pm
Location: Chicago, Illinois, USA

Re: H4 or SF5!?

Post by michiguel »

IWB wrote:
Ajedrecista wrote:
... Naum won fewer points than Texel but with a higher draw ratio, so I start to think that the draw ratio is not the most important factor...
Ahh, interesting. If it is not the draw rate, why is the ranking in that order with Bayes? Question to everyone who can explain in simple words ;-)

Thx for this
Ingo
It must be the draw rate, together with the fact that BE does not consider win + loss = 2 draws. In other words, they do not have the same weight (draws weigh more, IIRC). Then, in the bottom half, a higher draw rate might bump you up, but in the top half, it might bump you down.

Miguel
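A rough sketch of a BayesElo-style draw model (my reading of the idea, not the exact implementation; the 97.3 draw parameter is a commonly quoted default and is an assumption here) shows why a win plus a loss is not the same evidence as two draws:

```python
import math

# Draw-model sketch: a draw parameter elo_draw widens the draw zone.
def result_probs(delta, elo_draw=97.3):  # elo_draw=97.3: assumed default
    f = lambda x: 1.0 / (1.0 + 10.0 ** (x / 400.0))
    p_win  = f(elo_draw - delta)
    p_loss = f(elo_draw + delta)
    return p_win, 1.0 - p_win - p_loss, p_loss

p_win, p_draw, p_loss = result_probs(0.0)  # equal-strength opponents
loglik_win_plus_loss = math.log(p_win) + math.log(p_loss)
loglik_two_draws = 2.0 * math.log(p_draw)
# Different likelihoods: the two records pull the fitted ratings
# differently, unlike in a pure-score model.
print(loglik_win_plus_loss != loglik_two_draws)  # True
```

Because the likelihood of a draw is modeled separately from wins and losses, the fitted ratings depend on the draw rate and not only on the score, which matches Miguel's explanation.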
IWB
Posts: 1539
Joined: Thu Mar 09, 2006 2:02 pm

Re: H4 or S5 !?

Post by IWB »

lkaufman wrote:... I think that the decision was correct, but that now you (and CCRL and CEGT) should switch to Ordo. ...
My biggest problem with ORDO is the missing error bar. I like the concept of uncertainty, and these absolute values (with decimals?) somehow look too precise.

Is there a way to switch on an error bar in ORDO? I did not check so maybe ...

Bye
Ingo
Modern Times
Posts: 3703
Joined: Thu Jun 07, 2012 11:02 pm

Re: H4 or S5 !?

Post by Modern Times »

IWB wrote:
My biggest problem with ORDO is the missing error bar. I like the concept of uncertainty, and these absolute values (with decimals?) somehow look too precise.

Is there a way to switch on an error bar in ORDO? I did not check so maybe ...

Bye
Ingo
It doesn't have LOS either?
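For reference, LOS (likelihood of superiority) can be estimated directly from a head-to-head record with a common normal approximation (an illustration only, not a claim about what Ordo itself reports); draws cancel out and are ignored:

```python
import math

def los(wins, losses):
    """Probability that the winner of the match is genuinely stronger."""
    return 0.5 * (1.0 + math.erf((wins - losses) / math.sqrt(2.0 * (wins + losses))))

print(los(40, 0))     # ~1.0: decisive even with 960 draws
print(los(520, 480))  # ~0.90: far from conclusive
```

Note how this echoes the earlier 40-0-960 vs 520-480-0 example: the margin of decisive games, not the number of them, is what drives the confidence.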
Modern Times
Posts: 3703
Joined: Thu Jun 07, 2012 11:02 pm

Re: H4 or S5 !?

Post by Modern Times »

lkaufman wrote: I think that the decision was correct, but that now you (and CCRL and CEGT) should switch to Ordo.
I'm happy with BayesElo personally.
Lyudmil Tsvetkov
Posts: 6052
Joined: Tue Jun 12, 2012 12:41 pm

Re: H4 or S5 !?

Post by Lyudmil Tsvetkov »

IWB wrote:
michiguel wrote: I do not understand what you mean. What argument have I had before? I am confused about those numbers 6, 7, 3, 4, 13 and 14. What are those?
:-)

That was just an example: what is now obvious for No 1 and 2 might have been the case in the past for engines ranked 6 and 7, or 3 and 4, or whatever pair you like. Just examples where nobody cared ... and now it is suddenly important? (Because of 5 Elo, which are fully within one SD ... No! It is because of the number in front - whether it is a one or a two ;-) )

My problem is that people usually do not care about the conditions, just the rankings! Worse, they only look for No 1, 2 and maybe 3. That's it!

At least we agree that there is very little difference between the Tops :-)

Bye
Ingo
Hi Ingo.

Just say it: SF is the new number 1, even in your rating list.

That is the truth. Houdini would have scored even worse without contempt.
When you win against your direct opponents with 55-60% scores, there is no doubt about who the number 1 currently is.

And SF really plays much stronger than Houdini in purely chess terms. It is for Houdini and Komodo to catch up with SF now.
User avatar
michiguel
Posts: 6401
Joined: Thu Mar 09, 2006 8:30 pm
Location: Chicago, Illinois, USA

Re: H4 or S5 !?

Post by michiguel »

IWB wrote:
lkaufman wrote:... I think that the decision was correct, but that now you (and CCRL and CEGT) should switch to Ordo. ...
My biggest problem with ORDO is the missing error bar. I like the concept of uncertanty and these absolut values (with decimals?) somehow look too precise.

Is there a way to switch on an error bar in ORDO? I did not check so maybe ...

BYe
Ingo

ordo -W -p TOPRES.pgn -a2800 -s1000


where

-W automatic white advantage
-a2800 average set to 2800
-s1000 simulate ranking 1000 times to calc standard deviations

Each engine's error is the error relative to the average of the pool.

Code: Select all

   # PLAYER                : RATING  ERROR   POINTS  PLAYED    (%)
   1 Stockfish 5           : 2996.1    9.3   2473.0    3300   74.9%
   2 Houdini 4             : 2992.0    9.2   2458.5    3300   74.5%
   3 Komodo 7a             : 2970.4    9.3   2379.0    3300   72.1%
   4 Gull 3                : 2935.9    8.9   2245.5    3300   68.0%
   5 Critter 1.4a          : 2849.9    8.3   1882.0    3300   57.0%
   6 Equinox 2.02          : 2844.8    8.3   1859.5    3300   56.3%
   7 Deep Rybka 4.1        : 2826.6    8.2   1778.5    3300   53.9%
   8 Deep Fritz 14         : 2756.7    8.2   1464.5    3300   44.4%
   9 Chiron 2              : 2750.4    8.4   1436.5    3300   43.5%
  10 Protector 1.6.0       : 2731.1    8.3   1351.0    3300   40.9%
  11 Hannibal 1.4b         : 2729.3    8.1   1343.0    3300   40.7%
  12 Texel 1.04            : 2697.4    8.7   1204.5    3300   36.5%
  13 Naum 4.2              : 2696.5    8.6   1200.5    3300   36.4%
  14 Senpai 1.0            : 2695.9    8.6   1198.0    3300   36.3%
  15 HIARCS 14 WCSC 32b    : 2671.6    8.9   1096.0    3300   33.2%
  16 Jonny 6.00            : 2655.4    8.6   1030.0    3300   31.2%
or

ordo -p TOPRES.pgn -a3100 -A "Stockfish 5" -W -s1000


Where Stockfish is fixed to 3100, so it will have no error for that reason.
Then, each engine's error is the error relative to SF. Of course, the errors are bigger (they now implicitly include SF's error too). In the previous example, each engine's error was the error relative to the average of the pool.
This is better if you want to compare one engine to the rest.

Code: Select all

   # PLAYER                : RATING  ERROR   POINTS  PLAYED    (%)
   1 Stockfish 5           : 3100.0   ----   2473.0    3300   74.9%
   2 Houdini 4             : 3095.9   13.5   2458.5    3300   74.5%
   3 Komodo 7a             : 3074.3   13.0   2379.0    3300   72.1%
   4 Gull 3                : 3039.8   12.9   2245.5    3300   68.0%
   5 Critter 1.4a          : 2953.8   13.0   1882.0    3300   57.0%
   6 Equinox 2.02          : 2948.7   12.8   1859.5    3300   56.3%
   7 Deep Rybka 4.1        : 2930.5   12.6   1778.5    3300   53.9%
   8 Deep Fritz 14         : 2860.6   12.5   1464.5    3300   44.4%
   9 Chiron 2              : 2854.3   13.4   1436.5    3300   43.5%
  10 Protector 1.6.0       : 2835.1   13.1   1351.0    3300   40.9%
  11 Hannibal 1.4b         : 2833.2   12.6   1343.0    3300   40.7%
  12 Texel 1.04            : 2801.3   13.1   1204.5    3300   36.5%
  13 Naum 4.2              : 2800.4   13.5   1200.5    3300   36.4%
  14 Senpai 1.0            : 2799.8   13.4   1198.0    3300   36.3%
  15 HIARCS 14 WCSC 32b    : 2775.5   13.8   1096.0    3300   33.2%
  16 Jonny 6.00            : 2759.4   13.4   1030.0    3300   31.2%
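Miguel's note that the anchored errors implicitly include SF's error can be sanity-checked: independent uncertainties combine roughly in quadrature (a back-of-envelope check of my own, using the numbers from the two tables above):

```python
import math

# Pool-relative errors from the first (unanchored) table.
houdini_pool_err = 9.2     # Houdini 4
stockfish_pool_err = 9.3   # Stockfish 5 (the anchor in the second run)
combined = math.sqrt(houdini_pool_err ** 2 + stockfish_pool_err ** 2)
print(round(combined, 1))  # 13.1, close to Houdini's anchored error of 13.5
```

The quadrature estimate ignores the correlation between engines in the same pool, so the simulated 13.5 being slightly larger is unsurprising.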
You can also save a matrix of errors (each engine against each other) with the switch -e.

Help output of Ordo

Code: Select all

quick example: ordo -a 2500 -p input.pgn -o output.txt
  - Processes input.pgn (PGN file) to calculate ratings to output.txt.
  - The general pool will have an average of 2500

usage: ordo [-OPTION]
 -h          print this help
 -H          print just the switches
 -v          print version number and exit
 -L          display the license information
 -q          quiet mode (no screen progress updates)
 -a <avg>    set rating for the pool average
 -A <player> anchor: rating given by '-a' is fixed for <player>, if provided
 -m <file>   multiple anchors: file contains rows of "AnchorName",AnchorRating
 -w <value>  white advantage value (default=0.0)
 -W          white advantage, automatically adjusted
 -z <value>  scaling: set rating for winning expectancy of 76% (default=202)
 -T          display winning expectancy table
 -p <file>   input file in PGN format
 -c <file>   output file (comma separated value format)
 -o <file>   output file (text format), goes to the screen if not present
 -g <file>   output file with group connection info (no rating output on screen)
 -s  #       perform # simulations to calculate errors
 -e <file>   saves an error matrix, if -s was used
 -F <value>  confidence (%) to estimate error margins. Default is 95.0
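The -z switch above can be sketched as follows (assuming the standard logistic expectancy curve, which is my assumption about Ordo's internals): it sets the rating difference that corresponds to a 76% expected score.

```python
import math

def expectancy(diff, z=202.0):
    # Solve for the scale s such that 1/(1 + 10^(-z/s)) == 0.76.
    scale = z / -math.log10(1.0 / 0.76 - 1.0)
    return 1.0 / (1.0 + 10.0 ** (-diff / scale))

print(round(expectancy(0.0), 2))    # 0.5: equal ratings, even chances
print(round(expectancy(202.0), 2))  # 0.76 by construction
```

With the default z=202 the implied scale works out to about 403 Elo, i.e. essentially the familiar 400-point logistic curve.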
Lyudmil Tsvetkov
Posts: 6052
Joined: Tue Jun 12, 2012 12:41 pm

Re: H4 or S5 !?

Post by Lyudmil Tsvetkov »

IWB wrote:
Uri Blass wrote:
I disagree that humans value a decisive game more than a tie
when the result is equal.

I feel that it is the opposite.

If A beats B 40-0 with 960 draws then
it shows that A is clearly the better player.

If A beats B 520-480 with no draws then I am not even sure that A is stronger than B, so my feeling is that if 2 programs have the same score, the program that got more draws should be number 1.

I see your point from a statistical perspective. My feeling however is that humans watching a human tourney tend to credit the tourney win to the player with more wins rather than to the one with more draws ...

As I know that your knowledge about all this is far superior to mine, what would be your recommendation for a statistical evaluation of my RRRL (particularly in this case)? What tool would be the best to use?
Some years ago I changed to Bayes because basically everyone agreed that this is "more precise" than Elostat. I am willing to change again if this decision was flawed!

Thx
Ingo
In chess, whether you have more wins or more draws does not matter at all. It is only the score that matters.