Daniel Anulliero wrote: Personally I prefer playing gauntlets against other engines instead of self-tests.
But Stockfish is always tested against itself, right?
Not so bad then, it seems ...
Exactly.
It's odd to release a version that can't beat the previous one. A new version may score better in gauntlet X than the old one did, but that's no guarantee it will in gauntlet Y too.
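To see how easily that can happen, here is a minimal sketch (the game record is made up for illustration) of how wide the error bars on a single gauntlet result are under the usual logistic Elo model:

```python
import math

def elo(score):
    """Elo difference implied by a score fraction (logistic Elo model)."""
    return -400.0 * math.log10(1.0 / score - 1.0)

def elo_interval(wins, draws, losses, z=1.96):
    """Elo estimate with an approximate 95% confidence interval."""
    n = wins + draws + losses
    s = (wins + 0.5 * draws) / n             # score fraction
    var = (wins + 0.25 * draws) / n - s * s  # per-game score variance
    se = math.sqrt(var / n)                  # standard error of the mean
    return elo(s - z * se), elo(s), elo(s + z * se)

# A 1000-game gauntlet scoring 52% looks like +14 Elo, but the interval
# still spans zero, so a second gauntlet can easily point the other way:
print(elo_interval(wins=370, draws=300, losses=330))  # ~(-4, +14, +32)
```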
Maybe we should validate the new version twice? (gauntlets AND self-test)
Because, for example, my (weak) engine Isa always performs better against other engines than against its previous versions.
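If one did validate twice, the acceptance rule could be as simple as this hypothetical sketch (the names and numbers are mine, purely for illustration):

```python
def accept_release(self_gain, self_margin, gauntlet_gain, gauntlet_margin):
    """Accept only if BOTH tests show a gain beyond their error margins.

    All values are in Elo; the margins are half-widths of ~95% intervals.
    """
    return self_gain - self_margin > 0 and gauntlet_gain - gauntlet_margin > 0

# Self-play says +12 +/- 8, the gauntlet says +5 +/- 7: not proven yet.
print(accept_release(12, 8, 5, 7))  # False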
I prefer gauntlets for testing new versions; I have noticed they are more consistent than self-play.
I usually select five engines, including my previous version, so I test the new version against two slightly weaker engines, the previous version, and two slightly stronger engines.
Once my engine gets stronger after a lot of changes and testing, the stronger opponents become the weaker ones and I have to look for two new stronger ones. I wish I had a problem finding stronger engines.
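For what it's worth, with cutechess-cli as the tournament manager that opponent mix could look roughly like this (engine paths and names are placeholders to adjust):

```python
import subprocess

# Two slightly weaker engines, the previous version, two slightly stronger.
opponents = ["weak1", "weak2", "my_prev", "strong1", "strong2"]

cmd = ["cutechess-cli",
       "-tournament", "gauntlet",               # first engine plays all others
       "-engine", "cmd=./my_new", "name=my_new"]
for opp in opponents:
    cmd += ["-engine", f"cmd=./{opp}", f"name={opp}"]
cmd += ["-each", "proto=uci", "tc=10+0.1",
        "-rounds", "200", "-games", "2", "-repeat",  # each opening twice, colors swapped
        "-openings", "file=book.pgn", "format=pgn", "order=random",
        "-pgnout", "gauntlet.pgn",
        "-concurrency", "4"]
subprocess.run(cmd, check=True)
```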
Rebel wrote: Did you ever release a version of your engine that performed worse than the previous version in self-play, yet performed better 1) against a set of other opponents and 2) was later confirmed by the rating lists to be better as well?
IOW, are there known cases where self-play sucks?
Sometimes it is the only option. Take Embla, for example: I could not find any program of relatively close strength (that also runs on a Raspberry Pi with Linux). My own "POS" is > 500 Elo weaker so it loses everything, while tscp181 and fairymax are > 500 Elo stronger.
I'm using 5000 or 10000 games at 10s+0.1s. My results are not exactly the same as CCRL or the other rating lists, but that causes me no problems, and it's good enough to tell whether (and by how much) a new version is better than the previous one.
I would not release a version that plays worse in self-tests.
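Those game counts line up with a quick back-of-the-envelope calculation of the Elo resolution they buy (the draw ratio is an assumption, and the engines are taken to be evenly matched):

```python
import math

def elo_margin(games, draw_ratio=0.35, z=1.96):
    """Approximate 95% Elo half-width for an evenly matched pair."""
    var = 0.25 * (1.0 - draw_ratio)                # per-game score variance at 50%
    se = math.sqrt(var / games)                    # standard error of the score
    return z * se * 400.0 / (math.log(10) * 0.25)  # dElo/ds at s = 0.5

for n in (1000, 5000, 10000):
    print(n, "games -> +/-", round(elo_margin(n), 1), "Elo")
# 1000 games -> +/- ~17 Elo, 5000 -> ~7.8, 10000 -> ~5.5
```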
If your tournament manager program allows it, you could use time odds to reduce the relative strength of tscp and fairymax.
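cutechess-cli, for instance, accepts a per-engine tc= that overrides -each, so a time-odds gauntlet could be set up along these lines (the paths, protocols, and the 5:1 odds are assumptions to adjust):

```python
import subprocess

cmd = ["cutechess-cli",
       "-tournament", "gauntlet",
       # The engine under test gets the full time...
       "-engine", "cmd=./embla",    "name=embla",    "proto=uci",    "tc=10+0.1",
       # ...the much stronger opponents get a fifth of it.
       "-engine", "cmd=./tscp181",  "name=tscp181",  "proto=xboard", "tc=2+0.02",
       "-engine", "cmd=./fairymax", "name=fairymax", "proto=xboard", "tc=2+0.02",
       "-rounds", "500", "-games", "2", "-repeat",
       "-pgnout", "timeodds.pgn"]
subprocess.run(cmd, check=True)
```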
For Gaviota, Miguel uses self-testing to evaluate code and parameter changes. After accumulating some Elo in the self-tests, we confirm with a gauntlet.
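A common way to run that self-test stage is an SPRT, which stops as soon as the games favour either hypothesis. A generic cutechess-cli sketch follows; this is not Miguel's actual setup, and the paths and bounds are placeholders:

```python
import subprocess

cmd = ["cutechess-cli",
       "-engine", "cmd=./engine_new", "name=new",
       "-engine", "cmd=./engine_old", "name=old",
       "-each", "proto=uci", "tc=10+0.1",
       # Stop early once the data favour "no gain" (elo0) or "+5 Elo" (elo1).
       "-sprt", "elo0=0", "elo1=5", "alpha=0.05", "beta=0.05",
       "-rounds", "20000", "-games", "2", "-repeat",  # upper bound; SPRT usually stops sooner
       "-pgnout", "selftest.pgn"]
subprocess.run(cmd, check=True)
```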