testing

flok · Post by **flok** » Fri Mar 10, 2017 4:02 pm

Hi,

I'm trying to figure out whether a change gave an improvement or not.
To do that, I ran one version several times against other versions and other programs, in gauntlet mode, with the program under test (Embla3949-fp) as the main program.
This gave:

Rank Name           Elo    +    - games score oppo. draws
   1 dorpsgek       119   10   10  2900   69%   -37   12%
   2 Embla3949       33    8    8  2900   61%   -37   40%
   3 Embla3949-fp   -37    6    6  8700   44%    12   30%
   4 0.9.7         -115    9    9  2900   38%   -37   39%

To verify I also ran other combinations:

Rank Name           Elo    +    - games score oppo. draws
   1 dorpsgek        87    6    6  8700   65%   -29   13%
   2 Embla3949       64    9    9  2900   47%    87   13%
   3 0.9.7          -75    9   10  2900   29%    87   17%
   4 Embla3949-fp   -76    9   10  2900   30%    87   10%

Rank Name           Elo    +    - games score oppo. draws
   1 dorpsgek        66   10   10  2900   71%   -93   17%
   2 Embla3949       47    9    8  2900   71%   -93   36%
   3 Embla3949-fp   -20    9    9  2900   61%   -93   39%
   4 0.9.7          -93    6    6  8700   32%    31   31%

Non-gauntlet (all engines play each other):
Rank Name           Elo    +    - games score oppo. draws
   1 dorpsgek        85    6    6  8700   64%   -28   13%
   2 Embla3949       46    6    6  8700   60%   -15   30%
   3 Embla3949-fp   -39    6    5  8700   43%    13   30%
   4 0.9.7          -92    6    5  8700   33%    31   30%

To my horror this all gave different results!
So I wonder: is there a correct result to find?

(This may have been discussed already, but I can't find that thread anymore.)

Sven · Post by **Sven** » Fri Mar 10, 2017 4:37 pm

flok wrote:To my horror this all gave different results!
So I wonder: is there a correct result to find?

Are you sure you are playing 8700 different and independent games?

Laskos · Post by **Laskos** » Fri Mar 10, 2017 5:35 pm

Sven Schüle wrote:
flok wrote:To my horror this all gave different results!
So I wonder: is there a correct result to find?
Are you sure you are playing 8700 different and independent games?

Seems so, the gauntlets are against different engines. I think the gauntlet should be run systematically with the engine in development as the main program, to follow its Elo progress. The last test is a round robin (RR), which seems a waste of time if he is working on a single engine.

Robert Pope · Post by **Robert Pope** » Fri Mar 10, 2017 5:49 pm

I don't think you can compare results of different gauntlets like that. In each gauntlet, all the others' ELO scores are relative to how they perform against a single engine. So their ELO will shift around, depending on whether the main engine is one they do well against.

Dann Corbit · Post by **Dann Corbit** » Fri Mar 10, 2017 7:15 pm

The value of playing against other engines is that they will show evaluation or search flaws, because they have different chess and search knowledge.

So they are not the best for calculation of increase (especially if you change them from round to round!).

But they are very good at showing you what you need to fix next, if you examine the losses carefully.

In particular, a tool like the analyzer that comes with Tom's Live Chess Viewer stuff can find a spot where your engine missed something very important. If you see a bunch of those, now all you have to do is find out what it is that you are overlooking.

OK, easier said than done sometimes.

flok · Post by **flok** » Fri Mar 10, 2017 9:07 pm

I verified it and they're all different (I compared the original pgn with the output of pgn-extract -D -s).
And independent: every game was played on its own cpu-core.

Robert Pope · Post by **Robert Pope** » Fri Mar 10, 2017 9:39 pm

Simple example with 3 engines

B has some fatal weakness that A knows how to exploit, but C doesn't. Otherwise, all three are equal strength.

So we expect in head-to-head matches:
A scores 75% against B
A scores 50% against C
B scores 50% against C

A gauntlet run on engine A will show that it is better than B and equal to C:
A 63% +100
B 25% 0
C 50% +100

But a gauntlet run on engine B will show that it is worse than A but equal to C:
A 75% +100
B 37% 0
C 50% 0

And a gauntlet run on engine C will show that all three are equal:
A 50% 0
B 50% 0
C 50% 0
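
The percentages in this example can be reproduced with a few lines of Python. This is only a sketch of the arithmetic: the `HEAD_TO_HEAD` table, `score` and `gauntlet` are names made up for illustration, and it works with expected scores rather than real game results.

```python
# Expected head-to-head scores from the example (first engine vs. second).
HEAD_TO_HEAD = {
    ("A", "B"): 0.75,
    ("A", "C"): 0.50,
    ("B", "C"): 0.50,
}


def score(x, y):
    """Expected score of x against y; the reverse pairing is the complement."""
    if (x, y) in HEAD_TO_HEAD:
        return HEAD_TO_HEAD[(x, y)]
    return 1.0 - HEAD_TO_HEAD[(y, x)]


def gauntlet(main, engines=("A", "B", "C")):
    """Overall score of each engine when only `main` plays all the others."""
    opponents = [e for e in engines if e != main]
    # The main engine's score is averaged over all of its opponents.
    result = {main: sum(score(main, o) for o in opponents) / len(opponents)}
    # Every other engine is measured only by its games against `main`.
    for o in opponents:
        result[o] = score(o, main)
    return result


for main in ("A", "B", "C"):
    print(main, gauntlet(main))
```

Each opponent's number comes only from its games against the main engine, which is why B and C look different depending on who runs the gauntlet.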

Sven · Post by **Sven** » Sat Mar 11, 2017 4:43 pm

flok wrote:I verified it and they're all different (I compared the original pgn with the output of pgn-extract -D -s).
And independent: every game was played on its own cpu-core.

Ok. If you extract the results from your three different gauntlets and ignore the second occurrence of each match pairing (three of the six pairings occur twice, but their results are almost identical to the corresponding first occurrence), you get the results below. The numbers are percentages, which is valid in this case since you always played 2900 games per match pairing:

                1  2  3  4 avg
1 dorpsgek     xx 53 69 71 64,3
2 Embla3949    47 xx 61 71 59,7
3 Embla3949-fp 31 39 xx 62 44,0
4 0.9.7        29 29 38 xx 32,0

This is almost the same as your last "round robin" result. Of course leaving out a part of the result leads to a different overall outcome.
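
For anyone who wants to recheck the averages, here is a small sketch using the percentages from the cross table. The `CROSS` dictionary and `averages` are illustrative names, and the plain mean is only appropriate because every pairing had the same number of games.

```python
# Score percentages from the cross table: row engine vs. column engine.
CROSS = {
    "dorpsgek":     {"Embla3949": 53, "Embla3949-fp": 69, "0.9.7": 71},
    "Embla3949":    {"dorpsgek": 47, "Embla3949-fp": 61, "0.9.7": 71},
    "Embla3949-fp": {"dorpsgek": 31, "Embla3949": 39, "0.9.7": 62},
    "0.9.7":        {"dorpsgek": 29, "Embla3949": 29, "Embla3949-fp": 38},
}

# Sanity check: the two sides of every pairing must add up to 100%.
for name, row in CROSS.items():
    for opponent, pct in row.items():
        assert pct + CROSS[opponent][name] == 100

# Unweighted mean per engine, rounded to one decimal like the table.
averages = {
    name: round(sum(row.values()) / len(row), 1)
    for name, row in CROSS.items()
}

for name, avg in averages.items():
    print(f"{name:<13} {avg}")
```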

So in summary Embla3949-fp seems to be roughly 50 Elo weaker than Embla3949.

Dirt · Post by **Dirt** » Sat Mar 11, 2017 6:06 pm

flok wrote:I verified it and they're all different (I compared the original pgn with the output of pgn-extract -D -s).
And independent: every game was played on its own cpu-core.

Is each game from the starting position? Are you using a book?

The safest way is to use a set of starting positions.

Sven · Post by **Sven** » Sat Mar 11, 2017 6:43 pm

Sven Schüle wrote:So in summary Embla3949-fp seems to be roughly 50 Elo weaker than Embla3949.

I meant more like 85 Elo, 50 was wrong.
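
As a rough cross-check, a score percentage can be turned into an Elo difference with the usual logistic Elo formula. This is a sketch under that assumption only: `elo_diff` is a made-up helper name, and the rating tool that produced the tables above may use a different model (draw handling, priors), so its numbers need not match exactly.

```python
import math


def elo_diff(score):
    """Elo difference implied by a score fraction (0 < score < 1),
    assuming the standard logistic model: score = 1 / (1 + 10**(-d / 400))."""
    if not 0.0 < score < 1.0:
        raise ValueError("score must be strictly between 0 and 1")
    return 400.0 * math.log10(score / (1.0 - score))


# Embla3949 scored 61% head-to-head against Embla3949-fp.
print(round(elo_diff(0.61), 1))
```

With the 61% head-to-head score this comes out at roughly 78 Elo, the same ballpark as the 85 Elo gap between the two engines in the non-gauntlet rating list.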
