An old dilemma


Rebel
Posts: 7381
Joined: Thu Aug 18, 2011 12:04 pm
Full name: Ed Schröder

An old dilemma

Post by Rebel »

Did you ever release a version of your engine that performed worse than the previous version in self-play and yet performed better 1) against a set of other opponents and 2) was later confirmed on the rating lists to be better as well?

IOW, are there known cases where self-play sucks?
mar
Posts: 2665
Joined: Fri Nov 26, 2010 2:00 pm
Location: Czech Republic
Full name: Martin Sedlak

Re: An old dilemma

Post by mar »

Rebel wrote:Did you ever release a version of your engine that performed worse than the previous version in self-play and yet performed better 1) against a set of other opponents and 2) was later confirmed on the rating lists to be better as well?
No, but back when I didn't know how to test (used to play 100 games!) I released something that wasn't improved at all.
Rebel wrote:IOW, are there known cases where self-play sucks?
Well, speaking of CCRL 40/40, my last version did worse than expected: I measured a 60-70 Elo improvement in self-play, but in CCRL 40/40 it only shows 21.
(I always expect the real gain to be ~half the self-play gain, which also means my lower bound for a release is 60 in self-play, so it's probably still OK wrt their error bars.
Overall, with self-play: sometimes it's more, but usually less, in my experience.)
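
For readers who want to check the "100 games" and "error bars" remarks for themselves, here is a minimal sketch of turning a win/draw/loss count into an Elo difference with an approximate 95% interval. It assumes the usual logistic Elo model and a normal approximation for the score; the function names are made up for this illustration and do not come from any particular testing tool.

```python
import math


def elo_from_score(score: float) -> float:
    """Convert a score fraction to an Elo difference (logistic model)."""
    # Clamp so that a 0% or 100% score does not divide by zero.
    score = min(max(score, 1e-6), 1.0 - 1e-6)
    return -400.0 * math.log10(1.0 / score - 1.0)


def elo_with_margin(wins: int, draws: int, losses: int, z: float = 1.96):
    """Return (elo, low, high) for a match result.

    Assumption: games are independent and the mean score is roughly
    normally distributed, which is only reasonable for larger samples.
    """
    n = wins + draws + losses
    score = (wins + 0.5 * draws) / n
    # Per-game variance of the score (win = 1, draw = 0.5, loss = 0).
    var = (wins * (1.0 - score) ** 2
           + draws * (0.5 - score) ** 2
           + losses * (0.0 - score) ** 2) / n
    stderr = math.sqrt(var / n)
    return (elo_from_score(score),
            elo_from_score(score - z * stderr),
            elo_from_score(score + z * stderr))


if __name__ == "__main__":
    # Same score fraction, very different sample sizes.
    for result in [(40, 30, 30), (4000, 3000, 3000)]:
        elo, low, high = elo_with_margin(*result)
        print(result, f"{elo:+.1f} Elo, 95% interval [{low:+.1f}, {high:+.1f}]")
```

With only a hundred games the interval is wide enough to include zero even for a 55% score, which is one way to see why such a short match cannot show whether a version improved.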
hgm
Posts: 28381
Joined: Fri Mar 10, 2006 10:06 am
Location: Amsterdam
Full name: H G Muller

Re: An old dilemma

Post by hgm »

Not exactly. But I did have this experience where my Chu Shogi engine could be easily defeated by humans on the 81Dojo server, by an obvious strategic flaw of allowing the opponent to sneak up with his weak steppers on the strong sliders (expose its artillery to infantry attack, as it were).

So I repaired the flaw by encouraging the engine to better develop its steppers, to provide a protective wall around the sliders. This worked quite well, and the results on the server improved spectacularly. Even the strongest players now needed slower TC to be able to beat it.

But when I played this 'improved' version against the old one, it gets crushed! Most of the time it gains material in the beginning of the middle game, to build up a significant lead of 2-3 light pieces. (As you start with 32 (non-pawn) pieces, such an imbalance is not quickly decisive.) But then in the late middle-game the chances turn, all advantage gets lost, and eventually that side loses.

I have not been able to explain this phenomenon yet.
op12no2
Posts: 551
Joined: Tue Feb 04, 2014 12:25 pm
Location: Gower, Wales
Full name: Colin Jenkins

Re: An old dilemma

Post by op12no2 »

I see it when tuning LMR/LMP/NMP/futility. It's not too hard to find a combination that beats the old version, but in a gauntlet against engines in the next CCRL division up it's worse; I always just use a gauntlet now. I put it down not to self-testing as such but to single-opponent testing: the more opponents, the less the chance of it happening, kinda thing.
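
The single-opponent point can be illustrated with a small sketch that compares two versions over the same gauntlet, opponent by opponent and overall. This is only an illustration under assumptions: the result format (wins, draws, losses per opponent) and all function names are hypothetical, and each opponent is weighted equally, which is a design choice rather than anything stated in this thread.

```python
from typing import Dict, Tuple

Result = Tuple[int, int, int]  # (wins, draws, losses) against one opponent


def score(result: Result) -> float:
    """Score fraction for one match: win = 1, draw = 0.5, loss = 0."""
    wins, draws, losses = result
    return (wins + 0.5 * draws) / (wins + draws + losses)


def gauntlet_score(results: Dict[str, Result]) -> float:
    """Average the per-opponent scores so each opponent counts equally."""
    return sum(score(r) for r in results.values()) / len(results)


def compare(old: Dict[str, Result], new: Dict[str, Result]) -> Dict[str, float]:
    """Per-opponent score change (new minus old), plus the overall change."""
    delta = {name: score(new[name]) - score(old[name]) for name in old}
    delta["overall"] = gauntlet_score(new) - gauntlet_score(old)
    return delta
```

Calling `compare` with the old and new versions' results against the same opponents shows at a glance whether a gain against one engine is offset by losses against the others, which a match against a single opponent cannot reveal.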
Rebel
Posts: 7381
Joined: Thu Aug 18, 2011 12:04 pm
Full name: Ed Schröder

Re: An old dilemma

Post by Rebel »

op12no2 wrote:I see it when tuning LMR/LMP/NMP/futility. It's not too hard to find a combination that beats the old version, but in a gauntlet against engines in the next CCRL division up it's worse; I always just use a gauntlet now. I put it down not to self-testing as such but to single-opponent testing: the more opponents, the less the chance of it happening, kinda thing.
And what about the last condition of the OP, does it (also) do better on the rating lists?

So:

1. worse in self-play

2. better against a bunch of other engines

3. Release

4. better on the rating lists.
op12no2
Posts: 551
Joined: Tue Feb 04, 2014 12:25 pm
Location: Gower, Wales
Full name: Colin Jenkins

Re: An old dilemma

Post by op12no2 »

Rebel wrote: And what about the last condition of the OP, does it (also) do better on the rating lists?
Sorry Ed, yes, better on CCRL also. Your points 1. - 4. ticked.

In fact I'm mooting ignoring the previous release in the gauntlets.

I should add that Lozza is a little weird in that it's JavaScript source that is compiled in real time as it executes, running in e.g. your browser, and it's not super strong. Dunno if that makes a difference (hard to see why), but prob worth mentioning.

http://op12no2.me/toys/lozza
bob
Posts: 20943
Joined: Mon Feb 27, 2006 7:30 pm
Location: Birmingham, AL

Re: An old dilemma

Post by bob »

Rebel wrote:Did you ever release a version of your engine that performed worse than the previous version in self-play and yet performed better 1) against a set of other opponents and 2) was later confirmed on the rating lists to be better as well?

IOW, are there known cases where self-play sucks?
I have personally seen several such cases, enough that I simply don't use self-play games at all except for debugging.
CRoberson
Posts: 2094
Joined: Mon Mar 13, 2006 2:31 am
Location: North Carolina, USA

Re: An old dilemma

Post by CRoberson »

I've seen those cases. I have also seen cases where the new version performs much better in self-play, and then the rating lists show it to be no better or only slightly better. I don't completely rely on self-play.

My method is this:
1) benchmarks - they are quick and identify gross errors
2) self play - can identify not so gross errors
3) gauntlets - best method for me.

However, I don't completely trust the rating lists. The main reason is the number of bad opening lines that I've found in their game databases. The second reason is their reliance on out-of-date hardware. I've modified algorithms numerous times to find things that scale well to deep searches but not to shallow searches, and vice versa. Given that I care more about performance at long TC and serious HW, I prioritize scalability to the high end and the future.
ymatioun
Posts: 64
Joined: Fri Oct 18, 2013 11:40 pm
Location: New York

Re: An old dilemma

Post by ymatioun »

I have seen this, too. So some time ago I switched from self-play to gauntlets as the primary way to evaluate engine strength. This approach should correlate well with CCRL rankings.
Daniel Anulliero
Posts: 772
Joined: Fri Jan 04, 2013 4:55 pm
Location: Nice

Re: An old dilemma

Post by Daniel Anulliero »

Personally, I prefer playing gauntlets against other engines instead of self-testing.
But Stockfish is always tested against itself, right?
That doesn't seem to work out so badly ...