testing

flok

testing

Post by flok »

Hi,

I'm trying to figure out whether a change gave an improvement or not.
To do that, I ran a version several times against other versions and other programs, in gauntlet mode with the program under test (Embla3949-fp) as the main engine.
This gave:

Rank Name           Elo    +    - games score oppo. draws
   1 dorpsgek       119   10   10  2900   69%   -37   12%
   2 Embla3949       33    8    8  2900   61%   -37   40%
   3 Embla3949-fp   -37    6    6  8700   44%    12   30%
   4 0.9.7         -115    9    9  2900   38%   -37   39%
To verify I also ran other combinations:

Rank Name           Elo    +    - games score oppo. draws
   1 dorpsgek        87    6    6  8700   65%   -29   13%
   2 Embla3949       64    9    9  2900   47%    87   13%
   3 0.9.7          -75    9   10  2900   29%    87   17%
   4 Embla3949-fp   -76    9   10  2900   30%    87   10%

Rank Name           Elo    +    - games score oppo. draws
   1 dorpsgek        66   10   10  2900   71%   -93   17%
   2 Embla3949       47    9    8  2900   71%   -93   36%
   3 Embla3949-fp   -20    9    9  2900   61%   -93   39%
   4 0.9.7          -93    6    6  8700   32%    31   31%

Non-gauntlet (round robin):
Rank Name           Elo    +    - games score oppo. draws
   1 dorpsgek        85    6    6  8700   64%   -28   13%
   2 Embla3949       46    6    6  8700   60%   -15   30%
   3 Embla3949-fp   -39    6    5  8700   43%    13   30%
   4 0.9.7          -92    6    5  8700   33%    31   30%
To my horror this all gave different results!
So I wonder: is there a correct result to find?




(This may have been discussed already, but I can't find that thread anymore.)
Sven
Posts: 4052
Joined: Thu May 15, 2008 9:57 pm
Location: Berlin, Germany
Full name: Sven Schüle

Re: testing

Post by Sven »

flok wrote:
To my horror this all gave different results!
So I wonder: is there a correct result to find?

Are you sure you are playing 8700 different and independent games?
Laskos
Posts: 10948
Joined: Wed Jul 26, 2006 10:21 pm
Full name: Kai Laskos

Re: testing

Post by Laskos »

Sven Schüle wrote:
flok wrote:To my horror this all gave different results!
So I wonder: is there a correct result to find?
Are you sure you are playing 8700 different and independent games?

Seems so; the gauntlets are against different engines. I think the gauntlet should be run systematically against the engine in development, to follow its Elo progress. The last test is a round robin (RR), which seems a waste of time if he is working on a single engine.
Robert Pope
Posts: 558
Joined: Sat Mar 25, 2006 8:27 pm

Re: testing

Post by Robert Pope »

I don't think you can compare results of different gauntlets like that. In each gauntlet, the Elo ratings of all the other engines are based only on how they perform against a single engine. So their Elo will shift around, depending on whether the main engine is one they do well against.
Dann Corbit
Posts: 12541
Joined: Wed Mar 08, 2006 8:57 pm
Location: Redmond, WA USA

Re: testing

Post by Dann Corbit »

The value of playing against other engines is that they will expose evaluation or search flaws, because they have different chess and search knowledge.

So they are not the best for measuring a rating increase (especially if you change them from round to round!).

But they are very good at showing you what you need to fix next, if you examine the losses carefully.

In particular, a tool like the analyzer that comes with Tom's Live Chess Viewer stuff can find a spot where your engine missed something very important. If you see a bunch of those, now all you have to do is find out what it is that you are overlooking.

OK, easier said than done sometimes.
Taking ideas is not a vice, it is a virtue. We have another word for this. It is called learning.
But sharing ideas is an even greater virtue. We have another word for this. It is called teaching.
flok

Re: testing

Post by flok »

I verified it and they're all different (I compared the original pgn with the output of pgn-extract -D -s).
And independent: every game was played on its own cpu-core.
Robert Pope
Posts: 558
Joined: Sat Mar 25, 2006 8:27 pm

Re: testing

Post by Robert Pope »

Simple example with 3 engines:

B has some fatal weakness that A knows how to exploit, but C doesn't. Otherwise, all three are of equal strength.

So we expect in head-to-head matches:
A scores 75% against B
A scores 50% against C
B scores 50% against C

A gauntlet run on engine A will show that it is better than B and equal to C:
A 63% +100
B 25% 0
C 50% +100

But a gauntlet run on engine B will show that it is worse than A but equal to C:
A 75% +100
B 37% 0
C 50% 0

And a gauntlet run on engine C will show that all three are equal:
A 50% 0
B 50% 0
C 50% 0
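
The score percentages in this example can be reproduced with a short sketch. It assumes equal numbers of games per pairing and takes the three head-to-head expectations above as given; the names EXPECTED, expected_score and gauntlet_scores are made up for illustration. It only computes the score columns, not the illustrative Elo numbers.

```python
# Hypothetical head-to-head expectations taken from the example above:
# EXPECTED[(x, y)] is the score fraction x is expected to get against y.
EXPECTED = {
    ("A", "B"): 0.75,
    ("A", "C"): 0.50,
    ("B", "C"): 0.50,
}


def expected_score(x, y):
    """Expected score of x against y, looking the pair up in either order."""
    if (x, y) in EXPECTED:
        return EXPECTED[(x, y)]
    return 1.0 - EXPECTED[(y, x)]


def gauntlet_scores(main, engines=("A", "B", "C")):
    """Score fraction of every engine when only `main` plays the others."""
    opponents = [e for e in engines if e != main]
    # Each opponent only ever plays the main engine.
    scores = {e: expected_score(e, main) for e in opponents}
    # The main engine's score is its average over all opponents
    # (equal number of games per pairing assumed).
    scores[main] = sum(expected_score(main, e) for e in opponents) / len(opponents)
    return scores


for main in ("A", "B", "C"):
    print(main, gauntlet_scores(main))
```

The same three engines get different score percentages in each gauntlet, because every opponent is measured against one engine only.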
Sven
Posts: 4052
Joined: Thu May 15, 2008 9:57 pm
Location: Berlin, Germany
Full name: Sven Schüle

Re: testing

Post by Sven »

flok wrote:I verified it and they're all different (I compared the original pgn with the output of pgn-extract -D -s).
And independent: every game was played on its own cpu-core.
Ok. If you extract the results from your three different gauntlets and ignore the second occurrence of each repeated match pairing (three of the six pairings occur twice, but the repeated results are almost identical to the corresponding first occurrence), you get the results below. The numbers are score percentages, which is valid in this case since you always played 2900 games per match pairing:

                1  2  3  4 avg
1 dorpsgek     xx 53 69 71 64.3
2 Embla3949    47 xx 61 71 59.7
3 Embla3949-fp 31 39 xx 62 44.0
4 0.9.7        29 29 38 xx 32.0
This is almost the same as your last "round robin" result. Of course leaving out a part of the result leads to a different overall outcome.
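
As a rough check, the averages in the cross table can be recomputed from the head-to-head percentages. This is only a sketch: the dictionary CROSS is copied by hand from the table, and row_averages is a made-up helper name.

```python
# Head-to-head score percentages copied from the cross table above
# (each row engine's score against each column engine).
CROSS = {
    "dorpsgek":     {"Embla3949": 53, "Embla3949-fp": 69, "0.9.7": 71},
    "Embla3949":    {"dorpsgek": 47, "Embla3949-fp": 61, "0.9.7": 71},
    "Embla3949-fp": {"dorpsgek": 31, "Embla3949": 39, "0.9.7": 62},
    "0.9.7":        {"dorpsgek": 29, "Embla3949": 29, "Embla3949-fp": 38},
}


def row_averages(cross):
    """Plain average of each engine's head-to-head percentages.

    A plain average is only meaningful because every pairing
    has the same number of games (2900 each).
    """
    return {name: sum(row.values()) / len(row) for name, row in cross.items()}


averages = row_averages(CROSS)
for name, avg in averages.items():
    print(f"{name:<13}{avg:5.1f}")
```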

So in summary Embla3949-fp seems to be roughly 50 Elo weaker than Embla3949.
Dirt
Posts: 2851
Joined: Wed Mar 08, 2006 10:01 pm
Location: Irvine, CA, USA

Re: testing

Post by Dirt »

flok wrote:I verified it and they're all different (I compared the original pgn with the output of pgn-extract -D -s).
And independent: every game was played on its own cpu-core.
Is each game from the starting position? Are you using a book?

The safest way is to use a set of starting positions.
Deasil is the right way to go.
Sven
Posts: 4052
Joined: Thu May 15, 2008 9:57 pm
Location: Berlin, Germany
Full name: Sven Schüle

Re: testing

Post by Sven »

Sven Schüle wrote:So in summary Embla3949-fp seems to be roughly 50 Elo weaker than Embla3949.
I meant more like 85 Elo, 50 was wrong.
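
For reference, a score percentage can be turned into an approximate Elo difference with the usual logistic formula. This is a minimal sketch, not necessarily what the rating tool used in this thread does; elo_diff is a made-up helper name. Applied to the 61% head-to-head score of Embla3949 against Embla3949-fp alone it gives somewhat less than 85; the 85 figure matches the gap between the two ratings in the round-robin list (46 and -39), which also reflects the games against the other engines.

```python
import math


def elo_diff(score):
    """Elo difference implied by a score fraction (0 < score < 1),
    assuming the standard logistic Elo model."""
    return 400.0 * math.log10(score / (1.0 - score))


# Head-to-head score of Embla3949 against Embla3949-fp (61%).
print(round(elo_diff(0.61), 1))

# Difference of the two ratings in the round-robin list.
print(46 - (-39))
```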