I'm trying to figure out whether a change gave an improvement or not.
For that I ran a version a few times against other versions and other programs, in gauntlet mode with the main program being the one under test (Embla3949-fp).
This gave:
flok wrote:To my horror this all gave different results!
So I wonder: is there a correct result to find?
Are you sure you are playing 8700 different and independent games?
It seems so; the gauntlets are against different engines. I think the gauntlet should be run systematically against the engine in development to follow its Elo progress. The last test is a round-robin, which seems a waste of time if he is working on a single engine.
I don't think you can compare results of different gauntlets like that. In each gauntlet, the other engines' Elo scores are relative to how they perform against a single engine, so their Elo will shift around depending on whether the main engine is one they do well against.
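The sensitivity of gauntlet Elo to the reference opponent follows from the usual logistic rating model. A minimal sketch using the standard Elo expected-score formula (generic, not specific to any tool mentioned here):

```python
import math

def elo_diff_from_score(score: float) -> float:
    """Invert the Elo expected-score formula:
    score = 1 / (1 + 10**(-diff/400))
    =>  diff = -400 * log10(1/score - 1)."""
    return -400.0 * math.log10(1.0 / score - 1.0)

# The same engine scoring 55% against one main engine and 45%
# against another maps to roughly +35 and -35 Elo respectively,
# purely from the choice of reference opponent.
```

Because every rating in a gauntlet is anchored to the single main engine, swapping that engine shifts every other rating, which is why the gauntlets are not directly comparable.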
The value of playing against other engines is that they will expose evaluation or search flaws, because they have different chess and search knowledge.
So they are not the best for measuring improvement (especially if you change them from round to round!).
But they are very good at showing you what you need to fix next, if you examine the losses carefully.
In particular, a tool like the analyzer that comes with Tom's Live Chess Viewer stuff can find a spot where your engine missed something very important. If you see a bunch of those, now all you have to do is find out what it is that you are overlooking.
OK, easier said than done sometimes.
Taking ideas is not a vice, it is a virtue. We have another word for this. It is called learning.
But sharing ideas is an even greater virtue. We have another word for this. It is called teaching.
I verified it and they're all different (I compared the original pgn with the output of pgn-extract -D -s).
And independent: every game was played on its own cpu-core.
flok wrote:I verified it and they're all different (I compared the original pgn with the output of pgn-extract -D -s).
And independent: every game was played on its own cpu-core.
Ok. If you extract the results from your three different gauntlets and ignore all second occurrences of match pairings (there are three out of six, but their results are almost identical to the corresponding first occurrences), you get these results (numbers are percentages, which is a valid comparison in this case since you always played 2900 games per match pairing):
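Some spread between gauntlets is expected from sampling noise alone. A rough sketch of the two-sigma (about 95%) band for a 2900-game pairing, treating each game as an independent binomial trial (draws narrow the band somewhat, so this is a conservative bound):

```python
import math

def two_sigma_pct(p: float, n: int) -> float:
    """95% half-width, in percentage points, of the observed score
    fraction p over n independent games (binomial approximation)."""
    return 2.0 * math.sqrt(p * (1.0 - p) / n) * 100.0

# Around a 50% score over 2900 games the band is about +/- 1.9
# percentage points, so gauntlet-to-gauntlet differences of that
# size are indistinguishable from noise.
```

This is why identical configurations can still produce visibly different percentages between runs.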
flok wrote:I verified it and they're all different (I compared the original pgn with the output of pgn-extract -D -s).
And independent: every game was played on its own cpu-core.
Is each game from the starting position? Are you using a book?
The safest way is to use a fixed set of starting positions.
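A common way to use such a set (a generic sketch, not tied to any particular tournament tool; the function and engine names are illustrative) is to play every opening twice with colors reversed, so any bias in the position affects both engines equally:

```python
def schedule(openings, engine_a, engine_b):
    """Pair each opening position with both color assignments so
    that a lopsided opening hurts/helps both engines equally."""
    games = []
    for pos in openings:
        games.append((pos, engine_a, engine_b))  # engine_a plays White
        games.append((pos, engine_b, engine_a))  # colors reversed
    return games

pairs = schedule(["startpos", "opening-1", "opening-2"], "Embla", "Opponent")
# -> 6 games: each opening played once with each color assignment
```

This also makes runs reproducible in a way that an opening book with random move selection is not.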