An old dilemma

Discussion of chess software programming and technical issues.

Moderators: hgm, Rebel, chrisw

Rebel
Posts: 6994
Joined: Thu Aug 18, 2011 12:04 pm

Re: An old dilemma

Post by Rebel »

Daniel Anulliero wrote:Personally I prefer playing gauntlets against other engines instead of self-testing.
But Stockfish is always tested against itself, right?
So it can't be that bad ...
Exactly.

It's odd to release a version that can't beat the previous one. The new version may outperform the previous one in gauntlet X, but that's no guarantee it will in gauntlet Y too.
Daniel Anulliero
Posts: 759
Joined: Fri Jan 04, 2013 4:55 pm
Location: Nice

Re: An old dilemma

Post by Daniel Anulliero »

Rebel wrote:
Daniel Anulliero wrote:Personally I prefer playing gauntlets against other engines instead of self-testing.
But Stockfish is always tested against itself, right?
So it can't be that bad ...
Exactly.

It's odd to release a version that can't beat the previous one. The new version may outperform the previous one in gauntlet X, but that's no guarantee it will in gauntlet Y too.
Maybe we should validate the new version twice? (Gauntlets AND self-test.)
For example, my (weak) engine Isa always performs better against other engines than against its previous versions.
sedicla
Posts: 178
Joined: Sat Jan 08, 2011 12:51 am
Location: USA
Full name: Alcides Schulz

Re: An old dilemma

Post by sedicla »

I prefer a gauntlet to test new versions. I have noticed that it is more consistent than self-play.
I usually select 5 engines, including my previous version. So I test the new version against two slightly weaker engines, the previous version, and two slightly stronger engines.
Once my engine gets stronger after a lot of changes and testing, the stronger opponents become the weaker ones and I have to look for two new stronger ones. I wish I had a problem finding stronger engines :D
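Such a setup is easy to script. Below is a minimal sketch using cutechess-cli's gauntlet mode, where the first engine listed plays all the others; the engine names, paths, opening book, and time control are placeholders to adjust for your own setup:

Code: Select all

# Minimal sketch: 5-opponent gauntlet via cutechess-cli.
# Engine names/paths, book and time control are placeholders.
import subprocess

NEW = ("mynew", "./myengine-new")      # version under test
OPPONENTS = [                          # 2 weaker, previous, 2 stronger
    ("weak1", "./weak1"),
    ("weak2", "./weak2"),
    ("myprev", "./myengine-prev"),
    ("strong1", "./strong1"),
    ("strong2", "./strong2"),
]

cmd = ["cutechess-cli", "-tournament", "gauntlet",
       "-engine", f"name={NEW[0]}", f"cmd={NEW[1]}"]
for name, path in OPPONENTS:
    cmd += ["-engine", f"name={name}", f"cmd={path}"]
cmd += ["-each", "proto=uci", "tc=10+0.1",
        "-rounds", "500", "-games", "2", "-repeat",
        "-openings", "file=book.pgn", "format=pgn", "order=random",
        "-pgnout", "gauntlet.pgn", "-concurrency", "4"]
subprocess.run(cmd, check=True)

With -games 2 -repeat every opening is replayed with colours reversed, which removes opening bias from the score.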
Rebel
Posts: 6994
Joined: Thu Aug 18, 2011 12:04 pm

Re: An old dilemma

Post by Rebel »

Did a test. First step: a round-robin tournament of my last 4 versions.

Code: Select all

1 ProDeo 2.0      54.3%
2 ProDeo 1.87     49.3%
3 ProDeo 1.86     48.2%
4 ProDeo 1.85     48.1%
The versions line up nicely in order.

Then the same versions against an engine of about equal strength.

Code: Select all

1 ProDeo 1.86     56.3%
2 ProDeo 2.0      55.1%
3 ProDeo 1.87     53.0%
4 ProDeo 1.85     50.3%
And the 6.1% edge (roughly 43 Elo) that version 2.0 had over 1.86 has dropped to -1.2% (a swing of roughly 52 Elo) when playing another engine.

Bizarre....

But I solved the dilemma myself: my engine sucks in self-play :lol:
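For those who want to check the numbers: under the usual logistic model a score percentage maps to an Elo difference. A minimal sketch, assuming the 6.1% round-robin gap can be read as a 56.1% head-to-head score; it also shows where the roughly 52 Elo swing comes from:

Code: Select all

# Convert a score fraction into an Elo difference (logistic model).
import math

def elo(score: float) -> float:
    """Elo difference implied by a score fraction (0 < score < 1)."""
    return -400.0 * math.log10(1.0 / score - 1.0)

# Round robin: 2.0 scored 54.3%, 1.86 scored 48.2% -> 6.1% gap.
# Assumption: treat the gap as a 56.1% head-to-head score.
print(elo(0.561))                              # ~ +42.6, the "roughly 43 Elo"

# Versus the common opponent: 2.0 at 55.1%, 1.86 at 56.3%.
print(elo(0.551) - elo(0.563))                 # ~ -8.4 Elo relative to 1.86
print(elo(0.561) - (elo(0.551) - elo(0.563)))  # ~ 51 Elo swing

Note that -1.2% on its own is only about -8 Elo; the 52 is best read as the size of the swing from +43.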
flok

Re: An old dilemma

Post by flok »

Rebel wrote:Did you ever release a version of your engine that performed worse in self-play than the previous version, and yet performed better 1) against a set of other opponents and 2) was later confirmed on the rating list to be better as well?

IOW, are there known cases where self-play sucks?
Sometimes it is the only option. For example Embla: I could not find any program of relatively close strength (that also runs on a Raspberry Pi with Linux). My own "POS" is > 500 Elo weaker, so it loses everything, and tscp181 and fairymax are > 500 Elo stronger.
Patrice Duhamel
Posts: 193
Joined: Sat May 25, 2013 11:17 am
Location: France
Full name: Patrice Duhamel

Re: An old dilemma

Post by Patrice Duhamel »

I'm using 5000 or 10000 games at 10s+0.1s. My results are not exactly the same as CCRL or other rating lists, but I have no problems with that, and it's good enough to tell how much better a new version is than the previous one.

I would not release a version that plays worse in self-tests.
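As a sanity check on those sample sizes, here is a rough sketch of the 2-sigma error bar implied by n games, using a binomial approximation at a 50% score and ignoring draws (which in practice shrink the error somewhat):

Code: Select all

# Rough 2-sigma error bar (in Elo) for a match of n games.
# Binomial approximation at a 50% score; draws shrink this further.
import math

def elo_error_2sigma(n: int, p: float = 0.5) -> float:
    se_score = math.sqrt(p * (1.0 - p) / n)                   # std. error of the score
    elo_per_score = 400.0 / (math.log(10.0) * p * (1.0 - p))  # d(Elo)/d(score) at p
    return 2.0 * se_score * elo_per_score

print(elo_error_2sigma(5000))    # ~ +/- 9.8 Elo
print(elo_error_2sigma(10000))   # ~ +/- 7.0 Elo

So 10000 games resolve differences down to roughly 7 Elo, which fits the experience that it is enough to rank a new version against the previous one.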
Adam Hair
Posts: 3226
Joined: Wed May 06, 2009 10:31 pm
Location: Fuquay-Varina, North Carolina

Re: An old dilemma

Post by Adam Hair »

flok wrote:
Rebel wrote:Did you ever release a version of your engine that performed worse in self-play than the previous version, and yet performed better 1) against a set of other opponents and 2) was later confirmed on the rating list to be better as well?

IOW, are there known cases where self-play sucks?
Sometimes it is the only option. For example Embla: I could not find any program of relatively close strength (that also runs on a Raspberry Pi with Linux). My own "POS" is > 500 Elo weaker, so it loses everything, and tscp181 and fairymax are > 500 Elo stronger.
If your tournament manager program allows it, you could use time odds to reduce the relative strength of tscp and fairymax.
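In cutechess-cli that would look roughly like the sketch below: each engine can carry its own tc= setting, so the stronger engine simply gets less time. The paths, protocols, and the 6:1 odds ratio are placeholders to calibrate by trial:

Code: Select all

# Time-odds match sketch: per-engine tc= gives each side its own clock.
# Paths, protocols and the 6:1 ratio are placeholders to calibrate.
import subprocess

cmd = ["cutechess-cli",
       "-engine", "name=embla", "cmd=./embla", "proto=uci", "tc=60+0.6",
       "-engine", "name=tscp181", "cmd=./tscp181", "proto=xboard", "tc=10+0.1",
       "-rounds", "500", "-games", "2", "-repeat",
       "-pgnout", "timeodds.pgn"]
subprocess.run(cmd, check=True)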
Adam Hair
Posts: 3226
Joined: Wed May 06, 2009 10:31 pm
Location: Fuquay-Varina, North Carolina

Re: An old dilemma

Post by Adam Hair »

Daniel Anulliero wrote:
Rebel wrote:
Daniel Anulliero wrote:Personally I prefer playing gauntlets against other engines instead of self-testing.
But Stockfish is always tested against itself, right?
So it can't be that bad ...
Exactly.

It's odd to release a version that can't beat the previous one. The new version may outperform the previous one in gauntlet X, but that's no guarantee it will in gauntlet Y too.
Maybe we should validate the new version twice? (Gauntlets AND self-test.)
For example, my (weak) engine Isa always performs better against other engines than against its previous versions.
For Gaviota, Miguel uses self-testing to evaluate code and parameter changes. After accumulating some Elo in the self-tests, we confirm with a gauntlet.