Did you ever release a version of your engine that performed worse than the previous version in self-play, yet performed better 1) against a set of other opponents and 2) was later confirmed on the rating lists to be better as well?
IOW, are there known cases where self-play sucks?
An old dilemma
Moderator: Ras
-
- Posts: 2665
- Joined: Fri Nov 26, 2010 2:00 pm
- Location: Czech Republic
- Full name: Martin Sedlak
Re: An old dilemma
Rebel wrote:
Did you ever release a version of your engine that performed worse than the previous version in self-play, yet performed better 1) against a set of other opponents and 2) was later confirmed on the rating lists to be better as well?

No, but back when I didn't know how to test (I used to play 100 games!) I released something that wasn't improved at all.
Rebel wrote:
IOW, are there known cases where self-play sucks?

Well, speaking of CCRL 40/40, my last version did worse than expected: I measured a 60-70 Elo improvement in self-play, but CCRL 40/40 only shows 21.
(I always expect the real gain to be about half the gain in self-play, which also means my lower bound for a release is +60 in self-play, so this is probably still OK with respect to their error bars; overall, relative to self-play it's sometimes more but usually less, in my experience.)
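For reference, the standard logistic score-to-Elo conversion, plus the "real gain is about half of self-play" rule of thumb described above, as a small Python sketch (the 58.6% score is just an illustrative number, chosen because it corresponds to roughly +60 Elo):

```python
import math

def elo_diff(score: float) -> float:
    """Convert a match score fraction (0..1) to an Elo difference."""
    return -400.0 * math.log10(1.0 / score - 1.0)

# Hypothetical self-play result: the new version scores 58.6% vs the old,
# which is roughly the +60 Elo release threshold mentioned above.
self_play_elo = elo_diff(0.586)

# Rule of thumb from the post: expect roughly half of the self-play
# gain to show up against other opponents / on the rating lists.
expected_real_gain = self_play_elo / 2
```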
-
- Posts: 28381
- Joined: Fri Mar 10, 2006 10:06 am
- Location: Amsterdam
- Full name: H G Muller
Re: An old dilemma
Not exactly. But I did have the experience where my Chu Shogi engine could easily be defeated by humans on the 81Dojo server, due to an obvious strategic flaw: it allowed the opponent to sneak up with his weak steppers on the strong sliders (exposing its artillery to infantry attack, as it were).
So I repaired the flaw by encouraging the engine to better develop its steppers, to provide a protective wall around the sliders. This worked quite well, and the results on the server improved spectacularly. Even the strongest players now needed slower TC to be able to beat it.
But when I played this 'improved' version against the old one, it got crushed! Most of the time it gains material in the beginning of the middle game, building up a significant lead of 2-3 light pieces. (As you start with 32 (non-pawn) pieces, such an imbalance is not quickly decisive.) But then in the late middle-game the chances turn, all the advantage gets lost, and eventually that side loses.
I have not been able to explain this phenomenon yet.
-
- Posts: 551
- Joined: Tue Feb 04, 2014 12:25 pm
- Location: Gower, Wales
- Full name: Colin Jenkins
Re: An old dilemma
I see it when tuning LMR/LMP/NMP/Futility. It's not too hard to find a combination that beats the old version but is worse in a gauntlet against engines in the next CCRL division up; I always just use gauntlets now. I put it down not to self-testing but to single-opponent testing: the more opponents, the less the chance of it happening.
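The "more opponents" intuition can be illustrated with a toy binomial model (my own sketch, not anything from the post; it treats opponents as independent and ignores draws): a change that is actually neutral flukes a winning record against one opponent far more easily than against several at once.

```python
import math

def p_lucky_win(n_games: int, p: float = 0.5) -> float:
    """Probability that an engine whose true per-game score is p
    scores strictly more than half the points in an n-game match
    (draws ignored for simplicity)."""
    need = n_games // 2 + 1
    return sum(math.comb(n_games, k) * p**k * (1 - p)**(n_games - k)
               for k in range(need, n_games + 1))

# A truly equal engine "wins" a single 100-game match almost half the time...
single_opponent = p_lucky_win(100)

# ...but fluking a winning record against five independent opponents
# (20 games each, same 100-game budget) is far less likely.
five_opponents = p_lucky_win(20) ** 5
```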
-
- Posts: 7381
- Joined: Thu Aug 18, 2011 12:04 pm
- Full name: Ed Schröder
Re: An old dilemma
op12no2 wrote:
I see it when tuning LMR/LMP/NMP/Futility. It's not too hard to find a combination that beats the old version but is worse in a gauntlet against engines in the next CCRL division up; I always just use gauntlets now. I put it down not to self-testing but to single-opponent testing: the more opponents, the less the chance of it happening.

And what about the last condition of the OP, does it (also) do better on the rating lists?
So:
1. worse in self-play
2. better against a bunch of other engines
3. release
4. better on the rating lists.
-
- Posts: 551
- Joined: Tue Feb 04, 2014 12:25 pm
- Location: Gower, Wales
- Full name: Colin Jenkins
Re: An old dilemma
Rebel wrote:
And what about the last condition of the OP, does it (also) do better on the rating lists?

Sorry Ed, yes, better on CCRL also. Your points 1-4 ticked.
In fact I'm mooting ignoring the previous release in the gauntlets.
I should add that Lozza is a little weird in that it's JavaScript source, compiled just-in-time as it executes (in e.g. your browser), and it's not super strong. Dunno if that makes a difference (hard to see why) but probably worth mentioning.
http://op12no2.me/toys/lozza
-
- Posts: 20943
- Joined: Mon Feb 27, 2006 7:30 pm
- Location: Birmingham, AL
Re: An old dilemma
Rebel wrote:
Did you ever release a version of your engine that performed worse than the previous version in self-play, yet performed better 1) against a set of other opponents and 2) was later confirmed on the rating lists to be better as well?
IOW, are there known cases where self-play sucks?

I have personally seen several such cases, enough that I simply don't use self-play games at all except for debugging.
-
- Posts: 2094
- Joined: Mon Mar 13, 2006 2:31 am
- Location: North Carolina, USA
Re: An old dilemma
I've seen those cases. I have also seen cases in self play where the new version performs much better, then the rating lists show it no better or
slightly better. I don't completely rely on self play.
My method is this:
1) benchmarks - they are quick and identify gross errors
2) self play - can identify not so gross errors
3) gauntlets - best method for me.
However, I don't completely trust the rating lists. The main reason is the number of bad opening lines that I've found in their game databases. The
second reason is their reliance on out of date hardware. I've modified algorithms numerous times to find things that scale well to deep searches but
not to shallow searches and vice versa. Given that I care more about performance at long TC and serious HW, I prioritize scalability to the high
end and the future.
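On the subject of trusting (or not trusting) rating-list numbers: a small sketch of my own (hypothetical figures, not from any post here) of how an Elo estimate and a confidence interval can be derived from a gauntlet's win/loss/draw record, which is what the error bars on lists like CCRL represent:

```python
import math

def elo(score: float) -> float:
    """Score fraction (0..1) -> Elo difference."""
    return -400.0 * math.log10(1.0 / score - 1.0)

def elo_with_error(wins: int, losses: int, draws: int, z: float = 1.96):
    """Elo estimate with a ~95% confidence interval from a W/L/D
    record, using the normal approximation to the per-game score."""
    n = wins + losses + draws
    mean = (wins + 0.5 * draws) / n          # average score per game
    var = (wins + 0.25 * draws) / n - mean * mean
    stderr = math.sqrt(var / n)
    return elo(mean), elo(mean - z * stderr), elo(mean + z * stderr)

# Hypothetical 1000-game gauntlet record: +450 -390 =160
est, low, high = elo_with_error(450, 390, 160)
```

Even with 1000 games the interval spans tens of Elo, which is why a 20-30 point difference between a self-play measurement and a rating-list figure is often within the noise.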
-
- Posts: 64
- Joined: Fri Oct 18, 2013 11:40 pm
- Location: New York
Re: An old dilemma
I have seen this, too. So some time ago I switched from self-play to gauntlets as the primary way to evaluate engine strength. This approach should correlate well with CCRL rankings.
-
- Posts: 772
- Joined: Fri Jan 04, 2013 4:55 pm
- Location: Nice
Re: An old dilemma
Personally I prefer playing gauntlets against other engines instead of self-testing.
But Stockfish is always tested against itself, right?
It doesn't seem to be doing too badly ...