The Tennison's Deliria ...

Discussion of computer chess matches and engine tournaments.

Ajedrecista
Posts: 1968
Joined: Wed Jul 13, 2011 9:04 pm
Location: Madrid, Spain.

Post by Ajedrecista »

Hello:

I found this test between Deep Rybka 4.1 x64 and Deep Rybka 4.1 (960) x64 at Chess2U Forum:

http://www.chess2u.com/t5502-the-tennis ... iria#31681

I think that this Tennison is the same Ben Tennison of Talkchess. Here is the test:

Code:

Games Completed = 4000 of 4000 (Avg game length = 6.667 sec)
Settings = RR/32MB/Book/500ms+50ms/M 1000cp for 12 moves, D 150 moves/EPD:openings.epd(4000)

 1.   Deep Rybka 4.1             1953.5/4000   1062-1155-1783     (L: m=571 t=1 i=0 a=583)   (D: r=1312 i=281 f=133 s=10 a=47)   (tpm=52.2 d=8.8 nps=127344)

 2.   Deep Rybka 4.1 960         2046.5/4000   1155-1062-1783     (L: m=483 t=0 i=0 a=579)   (D: r=1312 i=281 f=133 s=10 a=47)   (tpm=52.3 d=8.8 nps=129394)
Deep Rybka 4.1 960 scores 51,16 %.

Is this only a statistical margin ?
Is this a strength difference ?
Is this a little error margin in the opening book (openings.epd) ?
...

Have a nice debate ...

First of all: I am not an expert in testing. The output is clearly from LittleBlitzer, and the EPD file used seems to be the '4000 openings' set by Bob Hyatt, IIRC. It would be good to ask how many cores/threads each engine used, and on what hardware.

Test 001 :

How is playing Deep Rybka 4.1 (x64) against Deep Rybka 4.1 960 (x64)? Is there a difference ?

AFAIK, the only difference between the standard version and the 960 one is that the latter is able to play Chess960, aka FRC (Fischer Random Chess), while the former cannot. So I am a bit surprised by the difference in speed:

Code:

127344 nps ~ 129394 nps - 1.58%.
129394 nps ~ 127344 nps + 1.61%.

These are small differences, but I expected even smaller ones. Anyway, they do not seem excessive. I will try to answer the questions asked by Tennison:

a) Is this only a statistical margin?

According to my math, the result is inside the statistical margin. Here are some rounded numbers, computed with many more decimals (hopefully with no typos; I did the calculations on a Casio calculator):

Code:

(Referred to non-960 version):
n = 4000 games (+1062 -1155 = 1783)

(Rating difference) = 400·log(1953.5/2046.5) ~ -8.08
(Standard deviation or sigma) = sqrt{(1/4000)·[(1953.5) · (2046.5)/(4000)² - (1783)/(4000 · 4)]} ~ 0.005883 ~ 0.5883%

2-sigma confidence ~ 95.45% confidence (a usual value):
2n·sigma ~ 2 · 4000 · 0.005883 ~ 47.0621

(Lower bound of the rating difference) = 400·log[(1953.5 - 47.0621)/(2046.5 + 47.0621)] ~ -16.27
(Upper bound of the rating difference) = 400·log[(1953.5 + 47.0621)/(2046.5 - 47.0621)] ~  +0.1

(2-sigma confidence interval for rating difference) ~ ]-16.27, +0.1[

So, pretty equal: with my results (~ 95.45% confidence), Deep Rybka 4.1 should score between ~ 47.66% (-16.27 Elo) and ~ 50.01% (+0.1 Elo) against Deep Rybka 4.1 (960) x64 under the conditions of the test. It looks normal to me.
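
The hand calculation above can be reproduced in a few lines of Python. This is only a minimal sketch under the same simplifications used in the post (400·log10 of the points ratio, sigma of the mean score from the win/loss/draw counts, bounds obtained by shifting the points by 2n·sigma). The function name `elo_interval` and its parameters are made up for illustration and do not come from any rating tool.

```python
import math

def elo_interval(wins, losses, draws, z=2.0):
    """Return (elo, lower, upper, sigma) from one engine's point of view."""
    n = wins + losses + draws
    points = wins + 0.5 * draws          # e.g. 1953.5
    opp_points = n - points              # e.g. 2046.5
    mu = points / n                      # mean score per game
    # Per-game variance of the score: mu*(1 - mu) - draws/(4n),
    # the same expression as in the post once divided by n below.
    var_game = mu * (1.0 - mu) - draws / (4.0 * n)
    sigma = math.sqrt(var_game / n)      # standard deviation of the mean score
    shift = z * n * sigma                # z-sigma expressed in points
    elo = 400 * math.log10(points / opp_points)
    lower = 400 * math.log10((points - shift) / (opp_points + shift))
    upper = 400 * math.log10((points + shift) / (opp_points - shift))
    return elo, lower, upper, sigma

# Non-960 version: +1062 -1155 =1783
elo, lo, hi, sigma = elo_interval(1062, 1155, 1783)
print(f"sigma = {sigma:.6f}")
print(f"Elo = {elo:+.2f}, 2-sigma interval = ]{lo:+.2f}, {hi:+.2f}[")
```

Swapping the win and loss counts gives the same interval mirrored around zero for the 960 version.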

b) Is this a strength difference?

With my limited knowledge of statistics, I would say that there is no easily measurable difference, even with 4000 games, assuming there is no bias of any kind in this test. If I had to choose, I would bet NO on a strength difference (beyond statistical uncertainty).

c) Is this a little error margin in the opening book (openings.epd)?

I do not fully understand the question, but I suspect that this EPD file is very balanced and therefore very trustworthy. Of course, people with more knowledge than me can answer this question better.

Other comments:

· The time control is very short from my POV, although I do not have any problem with it. One loss on time by the non-960 version and no losses by illegal moves... not bad.

· The number of losses by adjudication is very high (more than half of each engine's losses: 583 of 1155 and 579 of 1062). I say very high because I usually get 0 losses by adjudication in my few, short and clumsy tests, but I set 'M 777777 cp for 7 moves' instead of 'M 1000 cp for 12 moves' (which is the default setting). I do not know if changing this leads to more or fewer loss adjudications.

· The draw statistics seem very normal from my inexperienced POV. The split between threefold repetition, insufficient material, the fifty-move rule, stalemate and adjudication (with the default condition 'D 150 moves') looks very logical.
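
The loss and draw breakdowns discussed in these comments can be cross-checked against the reported totals. The short Python sketch below does that with the numbers copied from the LittleBlitzer output. The meaning of each key is inferred from the discussion (t = time, i = illegal move, a = adjudication, and so on), and reading m as "mate" is an assumption.

```python
# Breakdowns copied from the quoted LittleBlitzer output.
# Key meanings are inferred from the thread; "m" = mate is an assumption.
losses = {
    "Deep Rybka 4.1":     {"m": 571, "t": 1, "i": 0, "a": 583},
    "Deep Rybka 4.1 960": {"m": 483, "t": 0, "i": 0, "a": 579},
}
draws = {"r": 1312, "i": 281, "f": 133, "s": 10, "a": 47}
reported_losses = {"Deep Rybka 4.1": 1155, "Deep Rybka 4.1 960": 1062}
reported_draws = 1783
games = 4000

# The draw categories should add up to the reported number of draws.
assert sum(draws.values()) == reported_draws

shares = {}
for name, parts in losses.items():
    total = sum(parts.values())
    # Each engine's loss categories should add up to its reported losses.
    assert total == reported_losses[name]
    shares[name] = parts["a"] / total
    print(f"{name}: {parts['a']}/{total} losses by adjudication = {shares[name]:.1%}")

# Every adjudicated loss is one adjudicated game, plus the adjudicated draws.
adjudicated_games = sum(p["a"] for p in losses.values()) + draws["a"]
print(f"Adjudicated games: {adjudicated_games}/{games} = {adjudicated_games / games:.1%}")
```

All the totals are consistent, and the adjudicated share of losses is a little over half for both engines.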

Any comments, corrections... are welcome, as usual.

Regards from Spain.

Ajedrecista.
Adam Hair
Posts: 3226
Joined: Wed May 06, 2009 10:31 pm
Location: Fuquay-Varina, North Carolina

Re: The Tennison's Deliria ...

Post by Adam Hair »

I have a short response now and possibly a longer response when I can get to an actual computer.

Short response:

I believe that the reason why you have seen very few, if any, loss adjudications in your tests is the threshold you use. Most (all?) engines do not return a score with an absolute value that high. I would be surprised if any match is adjudicated with a limit that high.

Having said that, I recommend that most tests not adjudicate based on score. If there is a time constraint, or if the test consists of long time control games, then adjudication does save time and does not greatly affect the results. But some engines do have a hard time with some endings. I think that, when it is feasible, it is better to force engines to checkmate the opponent, if they can.
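
To make the threshold argument concrete, here is a rough Python sketch of a score-based adjudication rule of the kind that a setting like 'M 1000cp for 12 moves' describes. It is only one possible reading of that setting, not LittleBlitzer's actual logic: the function `adjudicated_at`, the use of a single list of evaluations in centipawns, and the sample scores are all invented for illustration.

```python
def adjudicated_at(scores_cp, threshold_cp, moves_required):
    """Return the index at which the game would be adjudicated, or None.

    scores_cp: evaluations in centipawns, one per move (hypothetical input).
    The game is stopped once |score| >= threshold_cp for moves_required
    consecutive entries that all favour the same side.
    """
    run = 0    # length of the current streak above the threshold
    sign = 0   # which side the streak favours (+1 or -1)
    for idx, score in enumerate(scores_cp):
        if abs(score) >= threshold_cp:
            s = 1 if score > 0 else -1
            run = run + 1 if s == sign else 1
            sign = s
            if run >= moves_required:
                return idx
        else:
            run, sign = 0, 0
    return None

# Hypothetical evaluations from a game that one side is steadily winning.
scores = [30, 120, 450, 900] + [1200 + 100 * k for k in range(15)]

default_like = adjudicated_at(scores, 1000, 12)    # threshold the engine can reach
very_high = adjudicated_at(scores, 777777, 7)      # threshold no score here reaches
print(default_like, very_high)
```

With these made-up scores the first call stops the game early, while the second never triggers, because no evaluation in the list comes close to 777777 cp.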


Aside - That was supposed to be a short response. I have a cramp in my fingers now :)

Adam