About the credibility of the tests ... and the testers

Sylwy
Posts: 4435
Joined: Fri Apr 21, 2006 4:19 pm
Location: IASI - the historical capital of MOLDOVA
Full name: SilvianR

About the credibility of the tests ... and the testers

Post by Sylwy »

Just a case study with three tests involving the same two engines: Clover 3.1 64-bit and Uralochka 3.36c 64-bit.

Stefan Pohl's latest rating list:
=================================

22. Clover 3.1 64-bit ........ 3403 Elo points
23. Uralochka 3.36c 64-bit ... 3400 Elo points

+3 Elo points for Clover 3.1 after 8000 games

CCRL Blitz rating list:
=======================

31. Uralochka 3.36c 64-bit ... 3403 Elo points after 499 games
37. Clover 3.1 64-bit ........ 3368 Elo points after 1543 games

+35 Elo points for Uralochka 3.36c 64-bit

I dug a bit deeper into the CCRL result and got a big surprise: the result of the 50-game match between the two engines. So:

+15 -10 =25 / +35 Elo points for Uralochka 3.36c 64-bit
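
For anyone who wants to check the arithmetic, here is a minimal sketch of how a +35 figure follows from that score. It assumes the standard logistic Elo model applied to the match score; the function name `elo_from_wdl` is made up for illustration and is not taken from CCRL's tooling.

```python
import math


def elo_from_wdl(wins, losses, draws):
    """Elo difference implied by a match score under the logistic Elo model.

    Hypothetical helper for illustration only. It is undefined for a
    100% or 0% score, where the implied difference is infinite.
    """
    games = wins + losses + draws
    score = (wins + 0.5 * draws) / games  # fraction of points scored
    return 400 * math.log10(score / (1 - score))


# CCRL head-to-head seen from Uralochka's side: +15 -10 =25 (a 55% score)
print(round(elo_from_wdl(15, 10, 25), 1))  # prints 34.9
```

A 55% score over 50 games therefore corresponds to roughly +35 Elo, which matches the figure quoted above.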

Coincidentally, I had also run a 100-game match between the two engines, with this result:

My test:
========

+32 -21 =47 / +42 Elo points for Clover 3.1

Test conditions:

- TC: 4'+2" (4 minutes + 2 seconds per move)
- Hash: 256 MB
- GUI: Arena 3.5.1
- Book: SuperGM_4mvs.abk (8-ply book) for both engines
- Default settings for both engines
- 1 thread; CPU: Intel i5-7400, 3 GHz (Kaby Lake)
- TBs: 6-men Syzygy tablebases for both engines
- OS: Windows 10 Home

Why am I not surprised by the CCRL result? Maybe because I know who pulls the strings over there...
xr_a_y
Posts: 1871
Joined: Sat Nov 25, 2017 2:28 pm
Location: France

Re: About the credibility of the tests ... and the testers

Post by xr_a_y »

I think all of this is ok.

Depending on hardware, opening book and hash size, the engines will behave differently.
Moreover, it is not surprising that a one-on-one match between two engines gives the opposite result to a test against a more diverse set of opponents.
It happens quite often that some engines "dislike" each other, because the style of one makes it weak or strong against the other (aggressiveness, a specific weakness that gets exploited well, ...).

So no need to worry: all the tests are interesting, and for me the variety of rating-list conditions is more a strength and an opportunity than anything else.
Sylwy
Posts: 4435
Joined: Fri Apr 21, 2006 4:19 pm
Location: IASI - the historical capital of MOLDOVA
Full name: SilvianR

Re: About the credibility of the tests ... and the testers

Post by Sylwy »

xr_a_y wrote: Thu Jun 16, 2022 6:46 pm I think all of this is ok.

Depending on hardware, opening book and hash size, the engines will behave differently.
Moreover, it is not surprising that a one-on-one match between two engines gives the opposite result to a test against a more diverse set of opponents.
It happens quite often that some engines "dislike" each other, because the style of one makes it weak or strong against the other (aggressiveness, a specific weakness that gets exploited well, ...).

So no need to worry: all the tests are interesting, and for me the variety of rating-list conditions is more a strength and an opportunity than anything else.
My test conditions are not so different, yet the result of my 100-game match differs fundamentally from the result of CCRL's 50 games. The problem lies elsewhere (among other things): the compile that goes into testing is always the author's. I have my suspicions...

+15 -10 =25 / 50 games: +35 Elo points for Uralochka 3.36c 64-bit (CCRL Blitz)

+32 -21 =47 / 100 games: +42 Elo points for Clover 3.1 (my test)
xr_a_y
Posts: 1871
Joined: Sat Nov 25, 2017 2:28 pm
Location: France

Re: About the credibility of the tests ... and the testers

Post by xr_a_y »

Sylwy wrote: Thu Jun 16, 2022 6:53 pm
My test conditions are not so different, yet the result of my 100-game match differs fundamentally from the result of CCRL's 50 games. The problem lies elsewhere (among other things): the compile that goes into testing is always the author's. I have my suspicions...

+15 -10 =25 / 50 games: +35 Elo points for Uralochka 3.36c 64-bit (CCRL Blitz)

+32 -21 =47 / 100 games: +42 Elo points for Clover 3.1 (my test)
Well, here is mine then (with error margin included...): TC 10s+0.1s, Hash 256 MB, book: hert500, AVX2 hardware

Code:

Score of Clover.3.1 vs Uralochka3.36c-avx2: 40 - 24 - 37 [0.579]
...      Clover.3.1 playing White: 23 - 9 - 19  [0.637] 51
...      Clover.3.1 playing Black: 17 - 15 - 18  [0.520] 50
...      White vs Black: 38 - 26 - 37  [0.559] 101
Elo difference: 55.5 +/- 54.6, LOS: 97.7 %, DrawRatio: 36.6 %
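
The figures in this output can be reproduced, to within rounding, with a short sketch. It assumes the logistic Elo model, a normal approximation for the error margin at 95% confidence, and an LOS (likelihood of superiority) computed from decisive games only. The post does not say which tool produced the output, so these conventions and the name `match_stats` are assumptions made for illustration.

```python
import math
from statistics import NormalDist


def match_stats(wins, losses, draws, confidence=0.95):
    """Return (elo_diff, error_margin, los_percent) for a single match.

    Hypothetical helper: assumes the mean score plus or minus the margin
    stays strictly between 0 and 1, which holds for the matches here.
    """
    n = wins + losses + draws
    w, l, d = wins / n, losses / n, draws / n
    mu = w + d / 2  # mean score per game

    # Standard error of the mean score, from the per-game variance.
    var = w * (1 - mu) ** 2 + l * (0 - mu) ** 2 + d * (0.5 - mu) ** 2
    stderr = math.sqrt(var / n)

    def elo(p):
        # Logistic Elo model: expected score p -> rating difference.
        return -400 * math.log10(1 / p - 1)

    # Two-sided normal quantile (about 1.96 for 95% confidence).
    z = NormalDist().inv_cdf(0.5 + confidence / 2)
    margin = (elo(mu + z * stderr) - elo(mu - z * stderr)) / 2

    # LOS uses only decisive games; draws carry no information here.
    los = 0.5 + 0.5 * math.erf((wins - losses) / math.sqrt(2 * (wins + losses)))
    return elo(mu), margin, 100 * los


elo_diff, margin, los = match_stats(40, 24, 37)
print(f"Elo difference: {elo_diff:.1f} +/- {margin:.1f}, LOS: {los:.1f} %")
```

Under these assumptions the 40 - 24 - 37 score lands close to the posted values; a difference in the last digit of the margin can come from how the normal quantile is approximated. Applied to the 50-game and 100-game matches quoted earlier in the thread, the same sketch gives margins on the order of ±70 and ±50 Elo.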
Sylwy
Posts: 4435
Joined: Fri Apr 21, 2006 4:19 pm
Location: IASI - the historical capital of MOLDOVA
Full name: SilvianR

Re: About the credibility of the tests ... and the testers

Post by Sylwy »

xr_a_y wrote: Thu Jun 16, 2022 7:14 pm
Well, here is mine then (with error margin included...): TC 10s+0.1s, Hash 256 MB, book: hert500, AVX2 hardware

Code:

Score of Clover.3.1 vs Uralochka3.36c-avx2: 40 - 24 - 37 [0.579]
...      Clover.3.1 playing White: 23 - 9 - 19  [0.637] 51
...      Clover.3.1 playing Black: 17 - 15 - 18  [0.520] 50
...      White vs Black: 38 - 26 - 37  [0.559] 101
Elo difference: 55.5 +/- 54.6, LOS: 97.7 %, DrawRatio: 36.6 %
THANK YOU! ;)