program style, risk aversion

lucasart · Post by **lucasart** » Sat Dec 15, 2012 3:53 pm

Kempelen wrote:
jdart wrote:I think you can also measure this by looking at evals.

Some programs have high king safety scores. Scorpio is one example, Stockfish also I think. Houdini's scores in similar positions seem to be much lower, in my experience (this leads to a somewhat different conclusion than yours about Houdini's style: I think it is very good at finding winning shots but not quick to make a sacrifice or risky move that may not win).

--Jon
I read here: http://www.chessbase.com/newsdetail.asp?newsid=8591 that Houdini does not report evals as 'pawn counts' the way other engines do, but as a probability of winning the game:
For example, when Houdini 3 shows a +1.00 evaluation in the middle game it has an 80% chance to win the game against an equally strong opponent at blitz time controls. I believe this is a very useful aspect of the engine.
This may be one reason why Houdini has a different style.

Only Robert Houdart could clarify that, but I don't think it explains why Houdini has a different style. Most likely, Houdini internally has an eval like everyone else (probably a better one than most engines), and when it displays scores, it rescales them in this way. As to how the rescaling works, I don't know.

Don · Post by **Don** » Sat Dec 15, 2012 7:17 pm

Ajedrecista wrote:Hello Don:
Don wrote:I'm running another test where komodo is given a high contempt factor and Houdini's is set to zero.

I only have a few hundred games each, but this appears to upset the balance a bit. Houdini gets stronger because contempt 1 is ridiculous against evenly matched opponents and Komodo gets weaker for the same reason but they are all within about 15 ELO of each other. I used 23 contempt in Komodo and it appears to have a much smaller effect on the results than changing it does for Houdini, probably due to the king safety issue Richard mentioned.

It appears from the data so far that Houdini is not particularly dynamic - the draw aversion was primarily a result of the contempt factor. It did not change Komodo very much.
Code: Select all
 Percent   Percent   Percent   Percent      Risk 
Decisive      Wins    Losses     Draws     Style  Player
--------  --------  --------  --------  --------  -------------------
   60.56     29.47     31.09     39.44   3.07371  sf23
   61.49     28.70     32.80     38.51   3.07323  kdev-4518.00
   59.30     32.50     26.79     40.70   2.86312  hou3
Please take all of this with a grain of salt. I'm not sure of the significance of any of this. I do have a hypothesis though. The hypothesis is that no strong program is going to be particularly draw fearing. To play really exciting "go for broke" chess you have to have a somewhat unsound evaluation function and strong programs do not have that. Maybe you can do some things to make them more "fun" but if you want your program to play soundly you cannot just sacrifice material left and right.
I do not know how you computed the 'risk style' column this time. Does a higher number in this column mean more likelihood of drawing, or less? I think that a higher number in your 'risk style' column means less likelihood of a draw. Anyway, I get the following column c (rounding up to 0.0001):

I'm using the same exact formula that you are, the only difference is that I am "normalizing" them - comparing them to the total. Sum the whole column and then use the sum as the numerator to get my numbers.

By doing this the numbers are less arbitrary as they are compared to the sample.

Code: Select all


 Percent   Percent   Percent   Percent      
Decisive      Wins    Losses     Draws     c      Player
--------  --------  --------  --------  --------  -------------------
   60.56     29.47     31.09     39.44   0.2004   sf23
   61.49     28.70     32.80     38.51   0.2005   kdev-4518.00
   59.30     32.50     26.79     40.70   0.2151   hou3

SF: (0.5 + |µ - 0.5|)*D = (0.5 + |0.4919 - 0.5|)*0.3944 ~ 0.2004
Komodo: (0.5 + |µ - 0.5|)*D ~ (0.5 + |0.47955 - 0.5|)*0.3851 ~ 0.2005
Houdini: (0.5 + |µ - 0.5|)*D ~ (0.5 + |0.5285 - 0.5|)*0.407 ~ 0.2151

A higher number c means more likelihood to draw, and vice versa. In this case with three engines, Houdini is the least 'draw fearing' engine.
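A minimal Python sketch of the arithmetic above, for anyone who wants to re-run it. The function name `draw_trend` is only illustrative, and it assumes the score µ is wins plus half the draws, which matches the µ values shown in the three lines above.

```python
def draw_trend(win_pct, draw_pct):
    """Return (mu, c) for one engine from its win and draw percentages.

    mu = wins + draws / 2            (score as a fraction of 1)
    c  = (0.5 + |mu - 0.5|) * D      (the formula quoted in this post)
    """
    d = draw_pct / 100.0
    mu = win_pct / 100.0 + d / 2.0
    c = (0.5 + abs(mu - 0.5)) * d
    return mu, c


# Win and draw percentages copied from the table above.
results = {
    "sf23": (29.47, 39.44),
    "kdev-4518.00": (28.70, 38.51),
    "hou3": (32.50, 40.70),
}

for name, (wins, draws) in results.items():
    mu, c = draw_trend(wins, draws)
    print(f"{name:14s} mu = {mu:.5f}   c = {c:.5f}")
```

Note that plain rounding of the Komodo value may land on 0.2004 rather than the 0.2005 in the table, since the post rounds up.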

Regards from Spain.

Ajedrecista.

Ajedrecista · Post by **Ajedrecista** » Sun Dec 16, 2012 11:05 am

Hello:

Don wrote:I'm using the same exact formula that you are, the only difference is that I am "normalizing" them - comparing them to the total. Sum the whole column and then use the sum as the numerator to get my numbers.

By doing this the numbers are less arbitrary as they are compared to the sample.

I see and I think it is a good idea:

Code: Select all

0.2004 + 0.2005 + 0.2151 = 0.616

0.616/0.2004 ~ 3.0739
0.616/0.2005 ~ 3.0723
0.616/0.2151 ~ 2.8638

(My numbers are slightly different from yours due to roundings).

And then, by definition:

1/(3.07371) + 1/(3.07323) + 1/(2.86312) ~ 1

With this normalization, now higher numbers mean less risk aversion if we can define 'risk aversion' as 'being pleasant with draws', which is arguable.
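The normalization can be sketched in a few lines of Python. The helper name `normalize` is an assumption made for illustration; the input values are the rounded c values from the post above.

```python
def normalize(c_values):
    """Map each c_i to sum(c) / c_i, as described by Don."""
    total = sum(c_values.values())
    return {name: total / c for name, c in c_values.items()}


# Rounded c values from the earlier table.
c_values = {"sf23": 0.2004, "kdev-4518.00": 0.2005, "hou3": 0.2151}

risk_style = normalize(c_values)
for name, value in risk_style.items():
    print(f"{name:14s} {value:.4f}")

# The reciprocals are c_i / sum(c), so they add up to 1 by construction.
reciprocal_sum = sum(1.0 / value for value in risk_style.values())
print(f"sum of reciprocals = {reciprocal_sum:.6f}")
```

Because each value is divided into the sample total, the result depends on which engines are in the sample, so numbers from different runs are not directly comparable.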

Thanks again for your interest in my numerical idea! I stay tuned for further results and/or conclusions.

Regards from Spain.

Ajedrecista.

Laskos · Post by **Laskos** » Sun Dec 16, 2012 11:08 am

I took CCRL 40/4 and picked the engines around Rybka 1.0 strength. The strength must be accounted for, so I took engines of similar strength. For example, we know that stronger play (longer time control) increases the draw ratio, independent of the score ratio. A regression on strength (rating) can be made to account for it, but I am not ready to do it.

I divided the draw ratio by s(1-s), where s is the score ratio. Here is the draw averseness of the engines around Rybka 1.0 strength (the smaller the number, the more averse):

Code: Select all

Engine        d / (1-s)s

Fritz 9          1.17
Junior 10        1.33
Shredder 10      1.33
Hiarcs 11        1.33
Fritz 10         1.36
Shredder 11      1.38
Zappa Mexico II  1.40
Fruit 2.3        1.42
Junior 11        1.43
Rybka 1.0        1.48
Doch 1.2         1.48
Naum 3.1         1.52
Fritz 11         1.55

It seems the artistic impression about these engines has some grounds in this "numeric style". But more engines (and maybe a regression) must be included.

Kai

Don · Post by **Don** » Sun Dec 16, 2012 1:53 pm

There clearly needs to be a lot of variety to be able to see the bigger picture. But I see that the contempt factor obfuscates the issue because that seems to have a fairly large impact on how draw averse a program is. I am more interested in the base playing style and how it affects this, not how skilled a program is in avoiding draw by repetition.

Ajedrecista · Post by **Ajedrecista** » Mon Dec 17, 2012 3:42 pm

Hello:

Don has posted more results in Open Chess Forum:

Draw aversion

I take the liberty to copy the info here because I think that these tests must be in TalkChess too:

Don wrote:Here are the players in my study, the result of a match of a few thousand games. Note that I made a preliminary attempt to "time adjust" the ratings so that there would not be serious mismatches:
Code: Select all
    Rank    ELO     +/-    Games    Score  Player
    ---- ------- ------ -------- --------  ----------------------------
       1  3027.2   10.6     2938   52.280  spike14     
       2  3019.5   10.6     2940   50.901  kdev-4518.00
       3  3015.8   10.6     2938   50.221  c16         
       4  3010.2   10.6     2940   49.218  sf23         
       5  3000.0   10.6     2938   47.379  spark1-0     

    w/l/d: 2582 2332 2433    33.12 percent draws
In a previous 3-player run I was meticulous about adjusting the ratings, getting them within 5 ELO of each other. But a formula was suggested by Jesús Muñoz, which in his own words looks like this:
Code: Select all
    µ_i: score of the i-th engine.
    D_i: draw ratio of the i-th engine.

    c_i = (0.5 + |µ_i - 0.5|)*D_i
    (c')_i = (0.5 - |µ_i - 0.5|)*D_i
When one program is significantly stronger than another, the draw rate naturally goes way down, so you cannot simply observe the draw rate. In the tests I did, this formula appears to compensate for that: I could not make one program appear more drawish than another by manipulating the handicaps.

I "normalized" the values output by this formula by displaying each result as the denominator (and the sum as the numerator) to get this table. Risk Style is higher if the program wants to avoid draws:
Code: Select all
    Percent   Percent   Percent   Percent      Risk
    Decisive      Wins    Losses     Draws     Style  Player
    --------  --------  --------  --------  --------  -------------------
       68.90     31.82     37.08     31.10   5.19589  spark1-0
       66.52     33.48     33.04     33.48   5.05858  c16
       66.47     32.44     34.04     33.53   4.99435  sf23
       66.20     34.04     32.17     33.80   4.94098  kdev-4518.00
       66.28     35.42     30.86     33.72   4.82530  spike14
I am cautious about assigning any deeper meaning to this, partly due to the contempt factor issue. It is simply an attempt to measure the "draw aversion" of a program in relation to other programs. Komodo has a default contempt of 7, Stockfish 0 and I have not checked the others. Some programs do not allow you to change it. With a little experimentation you can usually figure out what the contempt factor of a program is simply by setting up positions where it can force a draw and should.

I would love to run this test with many programs with contempt factors of zero.

Thanks to Don for the credit given to me!

According to the numbers in the 'risk style' column, Spark is the most draw-fearing engine among these five while Spike is the most draw-friendly, on the assumption that these numbers are actually correlated with draw aversion, which is a huge supposition.

Sorry for the long cross-posting but I think that TalkChess is a reference site in computer chess and also deserves to have this info.

Regards from Spain.

Ajedrecista.

Laskos · Post by **Laskos** » Mon Dec 17, 2012 5:59 pm

I tested Houdini 3, Komodo 5 and Hiarcs 14 (all at default contempt), adjusted for strength:

Code: Select all

    Program                            Score              Elo     +   -    Draws

  1 Hiarcs 14                      : 1022.5/2001  51.1    3005   13  13   28.3 %
  2 Houdini 3                      : 1000.5/2002  50.0    3000   13  13   32.0 %
  3 Komodo 5                       :  978.0/1999  48.9    2995   13  13   31.6 %

The "draw averseness" is defined as draws over score*(1-score); the smaller the value, the more averse:

Code: Select all

Engine              d / s(1-s)

Hiarcs 14             1.13
Komodo 5              1.26
Houdini 3             1.28

Houdini and Komodo are almost equal, but Hiarcs is much more "draw averse". This is something of a confirmation of the artistic impression that the recent Rybkish/Fruitish engines play somewhat duller chess (but are much stronger).
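For readers who want to check the ratio, here is a short Python sketch. The function name `draw_averseness` is illustrative only; the score and draw fractions are taken from the results table in this post.

```python
def draw_averseness(score, draw_ratio):
    """Return d / (s * (1 - s)); a smaller value means more draw-averse."""
    return draw_ratio / (score * (1.0 - score))


# (score, draw ratio) as fractions, from the table above.
engines = {
    "Hiarcs 14": (0.511, 0.283),
    "Komodo 5": (0.489, 0.316),
    "Houdini 3": (0.500, 0.320),
}

# Sort so the most draw-averse engine (smallest value) comes first.
ranked = sorted(engines.items(), key=lambda item: draw_averseness(*item[1]))
for name, (score, draws) in ranked:
    print(f"{name:12s} {draw_averseness(score, draws):.2f}")
```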

Kai

Don · Post by **Don** » Mon Dec 17, 2012 6:18 pm

Please feel free to cross post my stuff - my intent is not to "punish" talkchess but to save myself some time and aggravation.

Adam Hair · Post by **Adam Hair** » Tue Dec 18, 2012 5:33 am

Ajedrecista wrote:
From the IPON data:

Code: Select all

  Name of the engine      µ    D   D_max    k     k*µ*(1 - µ)

Houdini 3 STD            82%  24%   36%  0.6667     0.0984
Komodo 5                 73%  34%   54%  0.6296     0.1241
Critter 1.4a             71%  37%   58%  0.6379     0.1314
Stockfish 2.2.2 JA       69%  40%   62%  0.6452     0.138
Deep Rybka 4.1           68%  40%   64%  0.625      0.136
Chiron 1.5               52%  42%   96%  0.4375     0.1092
Deep Fritz 13 32b        51%  40%   98%  0.4082     0.102
Naum 4.2                 50%  42%  100%  0.42       0.105
HIARCS 14 WCSC 32b       48%  40%   96%  0.4167     0.104
Hannibal 1.2             45%  40%   90%  0.4444     0.11
Gull 1.2                 45%  39%   90%  0.4333     0.1073
Deep Shredder 12         45%  40%   90%  0.4444     0.11
Deep Sjeng c't 2010 32b  43%  41%   86%  0.4767     0.1169
Spike 1.4 32b            42%  40%   84%  0.4762     0.116
spark-1.0                41%  39%   82%  0.4756     0.1151
Protector 1.4.0          39%  39%   78%  0.5        0.119
Deep Junior 13.3         39%  34%   78%  0.4359     0.1037
Quazar 0.4               36%  37%   72%  0.5139     0.1184
Zappa Mexico II          32%  35%   64%  0.5469     0.119
MinkoChess 1.3           31%  36%   62%  0.5806     0.1242

Just so I would not be left out of the fun, I have done some work on this also. I compared the draw rates for each match to an estimate of draw rate as a function of Elo difference and found the average difference for each engine (throwing out the Zappa vs Fritz results due to being an outlier). I then adjusted the average differences due to the positive correlation of draw rates to Elo ratings. The resulting percentages represent the deviation of the IPON draw rates for each engine from the expected draw rates, given the Elo difference of each match and the strength of each engine. Here is Jesús' table with my draw deviation column added:

Code: Select all

 Name of the engine      µ    D   D_max    k     k*µ*(1 - µ)    Draw deviation

Houdini 3 STD            82%  24%   36%  0.6667     0.0984      -2.05%
Komodo 5                 73%  34%   54%  0.6296     0.1241      -0.19%
Critter 1.4a             71%  37%   58%  0.6379     0.1314       0.10%
Stockfish 2.2.2 JA       69%  40%   62%  0.6452     0.138        1.68%
Deep Rybka 4.1           68%  40%   64%  0.625      0.136        2.41%
Chiron 1.5               52%  42%   96%  0.4375     0.1092       0.87%
Deep Fritz 13 32b        51%  40%   98%  0.4082     0.102       -0.07%
Naum 4.2                 50%  42%  100%  0.42       0.105        0.63%
HIARCS 14 WCSC 32b       48%  40%   96%  0.4167     0.104       -1.06%
Hannibal 1.2             45%  40%   90%  0.4444     0.11        -0.04%
Gull 1.2                 45%  39%   90%  0.4333     0.1073      -1.93%
Deep Shredder 12         45%  40%   90%  0.4444     0.11        -0.08%
Deep Sjeng c't 2010 32b  43%  41%   86%  0.4767     0.1169       0.45%
Spike 1.4 32b            42%  40%   84%  0.4762     0.116        0.24%
spark-1.0                41%  39%   82%  0.4756     0.1151      -0.03%
Protector 1.4.0          39%  39%   78%  0.5        0.119       -0.16%
Deep Junior 13.3         39%  34%   78%  0.4359     0.1037      -5.21%
Quazar 0.4               36%  37%   72%  0.5139     0.1184      -1.07%
Zappa Mexico II          32%  35%   64%  0.5469     0.119        0.00%
MinkoChess 1.3           31%  36%   62%  0.5806     0.1242       0.69%

From this, it appears that Junior 13.3 has the most draw aversion, followed by Houdini 3. Rybka 4.1 and Stockfish 2.2.2 have the least draw aversion.
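The D_max and k columns are not defined in this post. The sketch below assumes, based only on the numbers in the table, that D_max = 2*min(µ, 1 - µ) (the largest draw ratio compatible with the score) and k = D/D_max. The expected-draw-rate model behind the "Draw deviation" column is not given here, so it is not reproduced.

```python
def k_columns(mu, d):
    """Return (d_max, k, k*mu*(1-mu)) under the assumed definitions.

    Assumption inferred from the table, not stated in the thread:
      d_max = 2 * min(mu, 1 - mu)
      k     = d / d_max
    """
    d_max = 2.0 * min(mu, 1.0 - mu)
    k = d / d_max
    return d_max, k, k * mu * (1.0 - mu)


# (mu, D) as fractions for a few rows of the table.
rows = {
    "Houdini 3 STD": (0.82, 0.24),
    "Naum 4.2": (0.50, 0.42),
    "Deep Junior 13.3": (0.39, 0.34),
}

for name, (mu, d) in rows.items():
    d_max, k, product = k_columns(mu, d)
    print(f"{name:18s} D_max = {d_max:.2f}  k = {k:.4f}  k*mu*(1-mu) = {product:.4f}")
```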

Laskos · Post by **Laskos** » Wed Dec 19, 2012 7:06 pm

I didn't quite get this k factor, and I use only score*(1-score), or µ(1-µ) in the notation of Jesús, when comparing to the draw ratio. The problem with the assumption that k*score*(1-score) is somehow constant is shown by this:

Score = 0.5, Draw Ratio = d = 0.40
Then k=0.4
k*s*(1-s)=0.1 assumed to be a constant for an engine

Same engine:
Score = 0.9
k is smaller than 1 by definition
Then k*s*(1-s) is smaller than 0.09
There is no k to match the old k*s*(1-s) of 0.1, and even the maximum k=1 is unrealistic, as there would be lots of wins and draws, but no losses.

On the other hand, if the factor is d / (s*(1-s)) then

Score = 0.5, Draw Ratio = d = 0.40
d / (s*(1-s)) = 1.6

Same engine:
Score = 0.9
Requiring the same value of 1.6 gives d / (0.9*0.1) = 1.6
Then d = 0.09*1.6 = 0.144
So it predicts a result of 82.8% wins, 14.4% draws, and 2.8% losses, which is pretty realistic.

Therefore I think that d / (s*(1-s)) is a more useful quantity than k*s*(1-s).
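The worked example above can be sketched in Python as follows. The name `predict_wdl` is hypothetical; the only assumptions are that the ratio d / (s*(1-s)) stays constant for an engine and that score = wins + draws/2.

```python
def predict_wdl(score, ratio):
    """Predict (wins, draws, losses) fractions, assuming d / (s*(1-s)) == ratio."""
    draws = ratio * score * (1.0 - score)
    wins = score - draws / 2.0          # because score = wins + draws / 2
    losses = 1.0 - wins - draws
    if wins < 0.0 or losses < 0.0:
        raise ValueError("this ratio is not attainable at this score")
    return wins, draws, losses


# Ratio measured at an even score: d = 0.40, s = 0.5.
ratio = 0.40 / (0.5 * (1.0 - 0.5))

wins, draws, losses = predict_wdl(0.9, ratio)
print(f"wins = {wins:.3f}  draws = {draws:.3f}  losses = {losses:.3f}")

# For comparison: at s = 0.9 even k = 1 gives k*s*(1-s) of about 0.09,
# which cannot reach the 0.1 obtained at s = 0.5 with k = 0.4.
k_bound = 1.0 * 0.9 * (1.0 - 0.9)
print(f"upper bound of k*s*(1-s) at s = 0.9: {k_bound:.3f}")
```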

Anyway, I ran another test with engines adjusted for strength (not perfectly adjusted):

Code: Select all

    Program                            Score     %     Elo     Draws

  1 Stockfish 2.3.1                : 520.5/834  62.4   3073     32.5 %
  2 Komodo 5                       : 421.5/842  50.1   3000     35.0 %
  3 Rybka 4.1                      : 404.5/819  49.4   2996     34.8 %
  4 Hiarcs 14                      : 415.0/845  49.1   2995     29.1 %
  5 Houdini 3                      : 394.0/840  46.9   2982     33.6 %
  6 Junior 13                      : 353.5/838  42.2   2954     26.1 %

And the draw averseness (smaller means more averse) is:

Code: Select all

       Engine            d / (s*(1-s))

   Junior 13               1.07
   Hiarcs 14               1.16
   Houdini 3               1.35
   Stockfish 2.3.1         1.39
   Rybka 4.1               1.39
   Komodo 5                1.40

Again, "older style" engines seem more draw-averse.

Kai
