Some test results

Daniel Anulliero
Location: Nice

Hi all!
I have just finished some small tests of my engine, Isa.
My testing "protocol":
Each tested version plays 60 games against each of five engines:
Tscp 1.81
Jars 1.75 (my old engine)
FairyMax
Jabba 1.0
Clarabit 1.00
Each tested version also plays 60 games against each of the other versions.

Versions tested:
ISA 1.6.0.70: the version that played in H.G. Muller's tournaments (June, August, September, October 2015)
ISA 1.6.0.93: the version that played in H.G.'s tournament of November 2015
(notable modification: debugging of threefold-repetition detection, "3 reps")
ISA 1.6.0.158: too many evaluation modifications / corrections
ISA 1.6.0.95: a backup of version 1.6.0.93 with a fail-soft implementation

Games played by each version: 480
Time control: blitz, 5' + 1''

Results:

ISA 1.6.0.70    :  +127 ,  =160 ,  -193  =  207.0/480 (43.12%)
ISA 1.6.0.93    :  +201 ,  =95  ,  -184  =  248.5/480 (51.77%)
ISA 1.6.0.158   :  +189 ,  =99  ,  -192  =  238.5/480 (49.68%)
ISA 1.6.0.95    :  +195 ,  =113 ,  -172  =  251.5/480 (52.39%)

As you can see, there has been some improvement since the "first" version tested; it seems my threefold-repetition ("3 reps") detection is now well debugged.
I notice that with fail soft, the engine drew more games and lost fewer.
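
The score column in the results above is wins plus half the draws, divided by the number of games. A minimal Python sketch of that arithmetic follows; the function name `score` and the `results` dictionary are chosen here for illustration and are not part of any engine or tool mentioned in the thread.

```python
def score(wins, draws, losses):
    """Return (points, games, percentage) for a win/draw/loss record."""
    games = wins + draws + losses
    points = wins + 0.5 * draws          # a draw counts as half a point
    return points, games, 100.0 * points / games

# Win/draw/loss records copied from the results listed in the post.
results = {
    "ISA 1.6.0.70": (127, 160, 193),
    "ISA 1.6.0.93": (201, 95, 184),
    "ISA 1.6.0.158": (189, 99, 192),
    "ISA 1.6.0.95": (195, 113, 172),
}

for name, wdl in results.items():
    points, games, pct = score(*wdl)
    print(f"{name}: {points}/{games} ({pct:.2f}%)")
```

The percentages alone say nothing about how much of the gap between versions is noise; that needs an error estimate as well.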

My questions:
1. Is any version clearly "better"?
2. Can we say that version 1.6.0.95 is the best?
3. Must I play more games?

Thanks for your answers and your thoughts

Best

Dany
jdart
Location: http://www.arasanchess.org

60 games is not very many.

You can use BayesELO (http://www.remi-coulom.fr/Bayesian-Elo/) to calculate ratings and error bars for your matches. If the differences are within the error bounds then generally you can't consider them significant.
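
For a rough idea of the size of such error bars without installing anything, here is a minimal Python sketch. It is not BayesELO's method: it assumes a plain normal approximation on the score, treats all games as independent results against one pooled "opponent", and converts the score to Elo with the usual logistic formula. The function names are made up for this example.

```python
import math

def score_to_elo(p):
    """Convert a score fraction in (0, 1) to an Elo difference (logistic model)."""
    return -400.0 * math.log10(1.0 / p - 1.0)

def elo_with_margin(wins, draws, losses, z=1.96):
    """Return (elo, low, high) for a win/draw/loss record.

    z=1.96 gives an approximate 95% interval under the normal approximation.
    """
    games = wins + draws + losses
    s = (wins + 0.5 * draws) / games
    # Per-game variance of the result, which takes the values 1, 0.5 or 0.
    var = (wins * (1.0 - s) ** 2
           + draws * (0.5 - s) ** 2
           + losses * (0.0 - s) ** 2) / games
    margin = z * math.sqrt(var / games)
    low, high = s - margin, s + margin
    if low <= 0.0 or high >= 1.0:
        raise ValueError("too few games or too lopsided a score for this approximation")
    return score_to_elo(s), score_to_elo(low), score_to_elo(high)

# Record of ISA 1.6.0.95 from the first post: +195 =113 -172.
elo, low, high = elo_with_margin(195, 113, 172)
print(f"{elo:+.1f} Elo, approx. 95% interval {low:+.1f} to {high:+.1f}")
```

With 480 games per version the interval should come out roughly 25 to 30 Elo wide on each side, which is much larger than the gap between 1.6.0.93 and 1.6.0.95.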

(Most engine developers now use very large numbers of games at very fast time controls for testing. For example, my standard test setup runs 36000 games on 60 cores, which takes about 5 hours.)
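
To see why the game counts get so large, here is a hedged Python sketch of how the error margin shrinks with the number of games. It assumes two nearly equal engines, independent games, a normal approximation, and a draw ratio that you supply yourself; the 0.3 used below is purely an illustrative assumption, not a figure from this thread. The function names are hypothetical.

```python
import math

# Slope of the logistic Elo curve at a 50% score: Elo points per unit of score.
ELO_PER_SCORE = 400.0 / (math.log(10) * 0.25)

def elo_margin(games, draw_ratio, z=1.96):
    """Approximate 95% Elo margin after `games` games between near-equal engines."""
    var = (1.0 - draw_ratio) / 4.0       # per-game score variance at a 50% score
    return ELO_PER_SCORE * z * math.sqrt(var / games)

def games_needed(target_margin, draw_ratio, z=1.96):
    """Approximate number of games needed to reach a given Elo margin."""
    var = (1.0 - draw_ratio) / 4.0
    return math.ceil(var * (ELO_PER_SCORE * z / target_margin) ** 2)

for n in (480, 5000, 36000):
    print(f"{n:6d} games: about +/- {elo_margin(n, 0.3):.1f} Elo")
print("games for +/- 5 Elo:", games_needed(5.0, 0.3))
```

Because the margin falls only with the square root of the game count, halving the error bar takes four times as many games.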

--Jon