An old dilemma

Discussion of chess software programming and technical issues.

Moderators: hgm, Rebel, chrisw

Rebel
Posts: 6994
Joined: Thu Aug 18, 2011 12:04 pm

Re: An old dilemma

Post by Rebel »

Daniel Anulliero wrote:Personally I prefer playing gauntlets against other engines instead of self-testing.
But Stockfish is always tested against itself, right?
So it can't be that bad ...
Exactly.

It's odd to release a version that can't beat the previous one. The new version may outperform the previous one in gauntlet X, but that's no guarantee it will in gauntlet Y too.
Daniel Anulliero
Posts: 759
Joined: Fri Jan 04, 2013 4:55 pm
Location: Nice

Re: An old dilemma

Post by Daniel Anulliero »

Rebel wrote:
Daniel Anulliero wrote:Personally I prefer playing gauntlets against other engines instead of self-testing.
But Stockfish is always tested against itself, right?
So it can't be that bad ...
Exactly.

It's odd to release a version that can't beat the previous one. The new version may outperform the previous one in gauntlet X, but that's no guarantee it will in gauntlet Y too.
Maybe we should validate the new version twice? (Gauntlets AND self-test.)
For example, my (weak) engine Isa always performs better against other engines than against its previous versions.
sedicla
Posts: 178
Joined: Sat Jan 08, 2011 12:51 am
Location: USA
Full name: Alcides Schulz

Re: An old dilemma

Post by sedicla »

I prefer a gauntlet to test new versions. I have noticed that it is more consistent than self-play.
I usually select 5 engines, including my previous version. So I test the new version against two slightly weaker engines, the previous version, and two slightly stronger engines.
Once my engine gets stronger after a lot of changes and testing, the stronger opponents become the weaker ones and I have to look for two new stronger ones. I wish I had a problem finding stronger engines :D
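Such a setup is easy to script. Below is a minimal sketch using cutechess-cli's gauntlet mode, where the first engine listed plays all the others; the engine names, paths, opening book, and time control are placeholders to adjust for your own setup:

Code: Select all

# Minimal sketch: 5-opponent gauntlet via cutechess-cli.
# Engine names/paths, book and time control are placeholders.
import subprocess

NEW = ("mynew", "./myengine-new")      # version under test
OPPONENTS = [                          # 2 weaker, previous, 2 stronger
    ("weak1", "./weak1"),
    ("weak2", "./weak2"),
    ("myprev", "./myengine-prev"),
    ("strong1", "./strong1"),
    ("strong2", "./strong2"),
]

cmd = ["cutechess-cli", "-tournament", "gauntlet",
       "-engine", f"name={NEW[0]}", f"cmd={NEW[1]}"]
for name, path in OPPONENTS:
    cmd += ["-engine", f"name={name}", f"cmd={path}"]
cmd += ["-each", "proto=uci", "tc=10+0.1",
        "-rounds", "500", "-games", "2", "-repeat",
        "-openings", "file=book.pgn", "format=pgn", "order=random",
        "-pgnout", "gauntlet.pgn", "-concurrency", "4"]
subprocess.run(cmd, check=True)

With -games 2 -repeat every opening is replayed with colours reversed, which removes opening bias from the score.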
Rebel
Posts: 6994
Joined: Thu Aug 18, 2011 12:04 pm

Re: An old dilemma

Post by Rebel »

Did a test. First step: a round-robin tournament of my last 4 versions.

Code: Select all

1 ProDeo 2.0      54.3%
2 ProDeo 1.87     49.3%
3 ProDeo 1.86     48.2%
4 ProDeo 1.85     48.1%
The versions line up nicely in order.

Then the same versions against an engine of about equal strength.

Code: Select all

1 ProDeo 1.86     56.3%
2 ProDeo 2.0      55.1%
3 ProDeo 1.87     53.0%
4 ProDeo 1.85     50.3%
And the 6.1% edge (roughly 43 Elo) that version 2.0 had over 1.86 has dropped to -1.2% (a swing of roughly 52 Elo) when playing another engine.

Bizarre....

But I solved the dilemma myself: my engine sucks in self-play :lol:
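For those who want to check the numbers: under the usual logistic model a score percentage maps to an Elo difference. A minimal sketch, assuming the 6.1% round-robin gap can be read as a 56.1% head-to-head score; it also shows where the roughly 52 Elo swing comes from:

Code: Select all

# Convert a score fraction into an Elo difference (logistic model).
import math

def elo(score: float) -> float:
    """Elo difference implied by a score fraction (0 < score < 1)."""
    return -400.0 * math.log10(1.0 / score - 1.0)

# Round robin: 2.0 scored 54.3%, 1.86 scored 48.2% -> 6.1% gap.
# Assumption: treat the gap as a 56.1% head-to-head score.
print(elo(0.561))                              # ~ +42.6, the "roughly 43 Elo"

# Versus the common opponent: 2.0 at 55.1%, 1.86 at 56.3%.
print(elo(0.551) - elo(0.563))                 # ~ -8.4 Elo relative to 1.86
print(elo(0.561) - (elo(0.551) - elo(0.563)))  # ~ 51 Elo swing

Note that -1.2% on its own is only about -8 Elo; the 52 is best read as the size of the swing from +43.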
flok

Re: An old dilemma

Post by flok »

Rebel wrote:Did you ever release a version of your engine that performed worse in self-play than the previous version, and yet performed better 1) against a set of other opponents and 2) was later confirmed on the rating list to be better as well?

IOW, are there known cases where self-play sucks?
Sometimes it is the only option. For example Embla: I could not find any program of relatively close strength (that also runs on a Raspberry Pi with Linux). My own "POS" is > 500 Elo weaker, so it loses everything, and tscp181 and fairymax are > 500 Elo stronger.
Patrice Duhamel
Posts: 193
Joined: Sat May 25, 2013 11:17 am
Location: France
Full name: Patrice Duhamel

Re: An old dilemma

Post by Patrice Duhamel »

I'm using 5000 or 10000 games at 10s+0.1s. My results are not exactly the same as CCRL or other rating lists, but I have no problems with that, and it's good enough to tell how much better a new version is than the previous one.

I would not release a version that plays worse in self-tests.
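As a sanity check on those sample sizes, here is a rough sketch of the 2-sigma error bar implied by n games, using a binomial approximation at a 50% score and ignoring draws (which in practice shrink the error somewhat):

Code: Select all

# Rough 2-sigma error bar (in Elo) for a match of n games.
# Binomial approximation at a 50% score; draws shrink this further.
import math

def elo_error_2sigma(n: int, p: float = 0.5) -> float:
    se_score = math.sqrt(p * (1.0 - p) / n)                   # std. error of the score
    elo_per_score = 400.0 / (math.log(10.0) * p * (1.0 - p))  # d(Elo)/d(score) at p
    return 2.0 * se_score * elo_per_score

print(elo_error_2sigma(5000))    # ~ +/- 9.8 Elo
print(elo_error_2sigma(10000))   # ~ +/- 7.0 Elo

So 10000 games resolve differences down to roughly 7 Elo, which fits the experience that it is enough to rank a new version against the previous one.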
Adam Hair
Posts: 3226
Joined: Wed May 06, 2009 10:31 pm
Location: Fuquay-Varina, North Carolina

Re: An old dilemma

Post by Adam Hair »

flok wrote:
Rebel wrote:Did you ever release a version of your engine that performed worse in self-play than the previous version, and yet performed better 1) against a set of other opponents and 2) was later confirmed on the rating list to be better as well?

IOW, are there known cases where self-play sucks?
Sometimes it is the only option. For example Embla: I could not find any program of relatively close strength (that also runs on a Raspberry Pi with Linux). My own "POS" is > 500 Elo weaker, so it loses everything, and tscp181 and fairymax are > 500 Elo stronger.
If your tournament manager program allows it, you could use time odds to reduce the relative strength of tscp and fairymax.
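In cutechess-cli that would look roughly like the sketch below: each engine can carry its own tc= setting, so the stronger engine simply gets less time. The paths, protocols, and the 6:1 odds ratio are placeholders to calibrate by trial:

Code: Select all

# Time-odds match sketch: per-engine tc= gives each side its own clock.
# Paths, protocols and the 6:1 ratio are placeholders to calibrate.
import subprocess

cmd = ["cutechess-cli",
       "-engine", "name=embla", "cmd=./embla", "proto=uci", "tc=60+0.6",
       "-engine", "name=tscp181", "cmd=./tscp181", "proto=xboard", "tc=10+0.1",
       "-rounds", "500", "-games", "2", "-repeat",
       "-pgnout", "timeodds.pgn"]
subprocess.run(cmd, check=True)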
Adam Hair
Posts: 3226
Joined: Wed May 06, 2009 10:31 pm
Location: Fuquay-Varina, North Carolina

Re: An old dilemma

Post by Adam Hair »

Daniel Anulliero wrote:
Rebel wrote:
Daniel Anulliero wrote:Personally I prefer playing gauntlets against other engines instead of self-testing.
But Stockfish is always tested against itself, right?
So it can't be that bad ...
Exactly.

It's odd to release a version that can't beat the previous one. The new version may outperform the previous one in gauntlet X, but that's no guarantee it will in gauntlet Y too.
Maybe we should validate the new version twice? (Gauntlets AND self-test.)
For example, my (weak) engine Isa always performs better against other engines than against its previous versions.
For Gaviota, Miguel uses self-testing to evaluate code and parameter changes. After accumulating some Elo in the self-tests, we confirm with a gauntlet.