Some test results

Daniel Anulliero · Post by **Daniel Anulliero** » Sat Dec 05, 2015 6:51 pm

Hi all!
I just finished some small tests for my engine Isa.
My testing « protocol »:
Each version tested plays 60 games against each of five engines:
Tscp 1.81
Jars 1.75 (my old one)
FairyMax
Jabba 1.0
Clarabit 1.00
Each version tested also plays 60 games against each of the other versions.

Versions tested:
ISA 1.6.0.70: the version that played in the H.G. Muller tournaments (June, August, September, October 2015)
ISA 1.6.0.93: the version that played in HG's November 2015 tournament
(notable modification: debugging of the 3 reps handling)
ISA 1.6.0.158: a lot of evaluation modifications / corrections
ISA 1.6.0.95: backup of version 1.6.0.93 with a fail-soft implementation

Games played by each version: 480
Time control: Blitz 5' + 1''

Results :


ISA 1.6.0.70   :  +127 , =160 , -193  =  207.0/480 (43.12%)
ISA 1.6.0.93   :  +201 , =95  , -184  =  248.5/480 (51.77%)
ISA 1.6.0.158  :  +189 , =99  , -192  =  238.5/480 (49.68%)
ISA 1.6.0.95   :  +195 , =113 , -172  =  251.5/480 (52.39%)
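
For anyone who wants to recheck the totals, here is a minimal Python sketch of the arithmetic, assuming the usual scoring of 1 point per win and half a point per draw. The function name `score` and the `results` dictionary are only illustrative; they are not part of Isa or of any testing tool.

```python
def score(wins, draws, losses):
    """Return (points, games, percentage), counting a draw as half a point."""
    games = wins + draws + losses
    points = wins + 0.5 * draws
    return points, games, 100.0 * points / games


# Win / draw / loss counts copied from the results listed above.
results = {
    "ISA 1.6.0.70": (127, 160, 193),
    "ISA 1.6.0.93": (201, 95, 184),
    "ISA 1.6.0.158": (189, 99, 192),
    "ISA 1.6.0.95": (195, 113, 172),
}

for name, wdl in results.items():
    points, games, pct = score(*wdl)
    print(f"{name}: {points:.1f}/{games} ({pct:.2f}%)")
```

The last digit of a percentage can differ by one from the figures listed, depending on whether the value is rounded or truncated.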

As you can see, there are some improvements since the « first » version tested; it seems my « 3 reps » code is well debugged.
I notice that with fail soft, the engine drew more games and lost fewer games.

My questions:
1. Is there a « better » version?
2. Can we say « the 1.6.0.95 version is the best »?
3. Must I play more games?

Thanks for your answers and your thoughts

Best

Dany

jdart · Post by **jdart** » Sat Dec 05, 2015 7:54 pm

60 games is not very many.

You can use BayesELO (http://www.remi-coulom.fr/Bayesian-Elo/) to calculate ratings and error bars for your matches. If the differences are within the error bounds then generally you can't consider them significant.
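
To get a rough feel for the size of the error bars before running BayesELO, here is a minimal Python sketch. It is not what BayesELO computes: it pools all opponents together, assumes the games are independent, uses a normal approximation for the score, and converts score to Elo with the standard logistic formula. The function names `elo_from_score` and `elo_with_margin` are made up for this example.

```python
import math


def elo_from_score(p):
    """Convert a score fraction (0 < p < 1) into an Elo difference (logistic model)."""
    return -400.0 * math.log10(1.0 / p - 1.0)


def elo_with_margin(wins, draws, losses, z=1.96):
    """Return (elo, elo_low, elo_high) for a pooled win/draw/loss record.

    Assumption: games are independent and the mean score is roughly normal;
    z=1.96 corresponds to an approximate 95% interval.
    """
    n = wins + draws + losses
    p = (wins + 0.5 * draws) / n
    # Per-game variance of the result, which takes the values 1, 0.5 or 0.
    var = (wins * (1.0 - p) ** 2
           + draws * (0.5 - p) ** 2
           + losses * (0.0 - p) ** 2) / n
    se = math.sqrt(var / n)  # standard error of the mean score
    return (elo_from_score(p),
            elo_from_score(p - z * se),
            elo_from_score(p + z * se))


# Win / draw / loss counts taken from the first post of this thread.
for name, wdl in [("ISA 1.6.0.70", (127, 160, 193)),
                  ("ISA 1.6.0.93", (201, 95, 184)),
                  ("ISA 1.6.0.158", (189, 99, 192)),
                  ("ISA 1.6.0.95", (195, 113, 172))]:
    elo, low, high = elo_with_margin(*wdl)
    print(f"{name}: {elo:+.1f} Elo (about {low:+.1f} to {high:+.1f})")
```

Under these assumptions, 480 games per version gives a margin of a few tens of Elo points, so versions whose intervals overlap cannot be told apart on this evidence alone. The Elo values here are relative to the pooled opposition, not absolute ratings.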

(Most engine developers now use very large numbers of games at very fast time controls for testing. For example, my standard test setup runs 36000 games on 60 cores, which takes about 5 hours.)

--Jon
