Jesús Muñoz
Joined: 13 Jul 2011 Posts: 1356 Location: Madrid, Spain.

Post subject: The Tennison's Deliria ... Posted: Thu Mar 01, 2012 5:16 pm 


Hello:
I found this test between Deep Rybka 4.1 x64 and Deep Rybka 4.1 (960) x64 on the Chess2U Forum:
http://www.chess2u.com/t5502thetennisonsdeliria#31681
I think that this Tennison is the same Ben Tennison from Talkchess. Here is the test:
Code: 
Games Completed = 4000 of 4000 (Avg game length = 6.667 sec)
Settings = RR/32MB/Book/500ms+50ms/M 1000cp for 12 moves, D 150 moves/EPD:openings.epd(4000)
1. Deep Rybka 4.1 1953.5/4000 1062-1155-1783 (L: m=571 t=1 i=0 a=583) (D: r=1312 i=281 f=133 s=10 a=47) (tpm=52.2 d=8.8 nps=127344)
2. Deep Rybka 4.1 960 2046.5/4000 1155-1062-1783 (L: m=483 t=0 i=0 a=579) (D: r=1312 i=281 f=133 s=10 a=47) (tpm=52.3 d=8.8 nps=129394)
Quote: 
Deep Rybka 4.1 960 scores 51,16 %.
Is this only a statistical margin ?
Is this a strength difference ?
Is this a little error margin in the opening book (openings.epd) ?
...
Have a nice debate ... 
First of all: I am not an expert in tests. The output is clearly from LittleBlitzer, and the EPD used seems to be '4000 openings' by Bob Hyatt, IIRC. A good question to ask is how many cores/threads each engine used, and also on what hardware.
Quote: 
Test 001 :
How is playing Deep Rybka 4.1 (x64) against Deep Rybka 4.1 960 (x64)? Is there a difference ? 
AFAIK, the only difference between the standard version and the 960 one is that the latter is able to play Chess960, aka FRC (Fischer Random Chess), while the former is not. So, I am a bit surprised about the speed:
Code: 
127344 nps ~ 129394 nps - 1.58%.
129394 nps ~ 127344 nps + 1.61%.
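For anyone who wants to recheck those two percentages, here is a minimal Python sketch. The nps values are copied from the LittleBlitzer output above; the function name `relative_difference_percent` is only my own label for illustration.

```python
# Average speeds reported in the test output (nodes per second).
NPS_STANDARD = 127344  # Deep Rybka 4.1
NPS_960 = 129394       # Deep Rybka 4.1 960


def relative_difference_percent(value: float, reference: float) -> float:
    """Return how much `value` differs from `reference`, in percent of `reference`."""
    return 100.0 * (value - reference) / reference


# Standard version measured against the 960 version, and the other way round.
slower = relative_difference_percent(NPS_STANDARD, NPS_960)
faster = relative_difference_percent(NPS_960, NPS_STANDARD)

print(f"standard vs 960: {slower:+.2f}%")
print(f"960 vs standard: {faster:+.2f}%")
```

The two figures differ slightly only because each one uses a different reference speed as the denominator.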
These are small differences, but I expected even less. Anyway, it does not seem exaggerated. I will try to answer some questions asked by Tennison:
a) Is this only a statistical margin?
According to my math, the results are inside the statistical margin. Here are some rounded numbers, obtained after working with many decimals (hoping there are no typos in my calculations, done with a Casio calculator):
Code: 
(Referred to the non-960 version):
n = 4000 games (+1062 -1155 =1783)
(Rating difference) = 400·log(1953.5/2046.5) ~ -8.08
(Standard deviation or sigma) = sqrt{(1/4000)·[(1953.5) · (2046.5)/(4000)² - (1783)/(4000 · 4)]} ~ 0.005883 ~ 0.5883%
2-sigma confidence ~ 95.45% confidence (a usual value):
2n·sigma ~ 2 · 4000 · 0.005883 ~ 47.0621
(Lower bound of the rating difference) = 400·log[(1953.5 - 47.0621)/(2046.5 + 47.0621)] ~ -16.27
(Upper bound of the rating difference) = 400·log[(1953.5 + 47.0621)/(2046.5 - 47.0621)] ~ +0.1
(2-sigma confidence interval for rating difference) ~ ]-16.27, +0.1[
So, pretty equal; with my results (~ 95.45% confidence) Deep Rybka 4.1 should score between ~ 47.66% (-16.27 Elo) and ~ 50.01% (+0.1 Elo) against Deep Rybka 4.1 (960) x64 under the conditions of the test. It looks normal to me.
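The calculation for question a) can be reproduced with a short Python sketch. It follows the same formulas I used by hand, taking log as the base-10 logarithm; the function names (`elo_from_score`, `score_sigma`, `elo_interval`) are my own labels and do not come from LittleBlitzer or any other tool.

```python
import math


def elo_from_score(points: float, games: int) -> float:
    """Elo difference implied by scoring `points` out of `games` (base-10 log)."""
    return 400.0 * math.log10(points / (games - points))


def score_sigma(wins: int, losses: int, draws: int) -> float:
    """Standard deviation of the mean score, taking the draw ratio into account."""
    n = wins + losses + draws
    mu = (wins + 0.5 * draws) / n
    # Per-game variance: mu*(1 - mu) - D/4, where D is the draw ratio.
    variance_per_game = mu * (1.0 - mu) - draws / (4.0 * n)
    return math.sqrt(variance_per_game / n)


def elo_interval(wins: int, losses: int, draws: int, k: float = 2.0):
    """Return (lower, upper) Elo bounds at k sigmas around the observed score."""
    n = wins + losses + draws
    points = wins + 0.5 * draws
    margin = k * n * score_sigma(wins, losses, draws)  # margin in points
    return elo_from_score(points - margin, n), elo_from_score(points + margin, n)


# Figures for the non-960 version: +1062 -1155 =1783.
low, high = elo_interval(1062, 1155, 1783)
print(f"rating difference: {elo_from_score(1953.5, 4000):+.2f} Elo")
print(f"sigma: {score_sigma(1062, 1155, 1783):.6f}")
print(f"2-sigma interval: ]{low:+.2f}, {high:+.2f}[")
```

Changing `k` gives other confidence levels under the same normal approximation, for example k = 1 for roughly 68.27%.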
b) Is this a strength difference?
With my limited knowledge of statistics, I would say that there is no easily measurable difference, even with 4000 games; I am assuming that there is no bias of any kind in this test. If I had to choose, I would bet NO regarding a strength difference (other than statistical uncertainties).
c) Is this a little error margin in the opening book (openings.epd)?
I do not fully understand the question, but I suspect that this EPD file is very balanced and therefore very trustworthy. Of course, people with more knowledge than me can answer this question better.
Other comments:
· The time control is very short from my POV, although I do not have any problem with it. One loss on time by the non-960 version and no losses by illegal moves... not bad.
· The number of losses by adjudication is very high (more than half for each engine). I say very high because I usually get 0 losses by adjudication in my few, short and clumsy tests, but I set 'M 777777 cp for 7 moves' instead of 'M 1000 cp for 12 moves' (which is the default setting). I do not know if changing this leads to more or fewer adjudicated losses.
· The draw statistics seem very normal from my inexperienced POV. The split between threefold repetition, insufficient material, the fifty-move rule, stalemate and adjudication (with the condition 'D 150 moves', the default one) looks very logical.
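As a quick consistency check of the loss and draw breakdowns, here is a small Python sketch. The numbers are copied from the LittleBlitzer output; reading the loss keys as m = mate, t = time, i = illegal move and a = adjudication is my assumption, and the function name `adjudicated_share` is mine.

```python
# Loss breakdown per engine, as printed in the "L:" field of the output.
# Assumed key meanings: m = mate, t = time, i = illegal move, a = adjudication.
losses = {
    "Deep Rybka 4.1": {"m": 571, "t": 1, "i": 0, "a": 583},
    "Deep Rybka 4.1 960": {"m": 483, "t": 0, "i": 0, "a": 579},
}

# Draw breakdown from the "D:" field: repetition, insufficient material,
# fifty-move rule, stalemate, adjudication.
draws = {"r": 1312, "i": 281, "f": 133, "s": 10, "a": 47}


def adjudicated_share(breakdown: dict) -> float:
    """Fraction of the games in `breakdown` that ended by adjudication."""
    return breakdown["a"] / sum(breakdown.values())


for engine, breakdown in losses.items():
    total = sum(breakdown.values())
    share = 100.0 * adjudicated_share(breakdown)
    print(f"{engine}: {total} losses, {share:.1f}% by adjudication")

print(f"draws: {sum(draws.values())}, {100.0 * adjudicated_share(draws):.1f}% by adjudication")
```

The totals should match the 1155 and 1062 losses and the 1783 draws of the score line, and both loss shares come out above one half.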
Any comments, corrections... are welcome, as usual.
Regards from Spain.
Ajedrecista.
