Did you ever release a version of your engine that performed worse than the previous version in self-play, yet performed better 1) against a set of other opponents and 2) was later confirmed on the rating lists to be better as well?
IOW, are there known cases where self-play sucks?
An old dilemma
Moderator: Ras
-
- Posts: 2665
- Joined: Fri Nov 26, 2010 2:00 pm
- Location: Czech Republic
- Full name: Martin Sedlak
Re: An old dilemma
Rebel wrote:
Did you ever release a version of your engine that performed worse than the previous version in self-play, yet performed better 1) against a set of other opponents and 2) was later confirmed on the rating lists to be better as well?

No, but back when I didn't know how to test (I used to play 100 games!) I released something that wasn't improved at all.
Rebel wrote:
IOW, are there known cases where self-play sucks?

Well, speaking of CCRL 40/40, my last version did worse than expected: I measured a 60-70 Elo improvement in self-play, but CCRL 40/40 only shows 21.
(I always expect the real gain to be about half the gain in self-play, which also means my lower bound for a release is +60 in self-play, so this is probably still OK with respect to their error bars; overall, relative to self-play it's sometimes more but usually less, in my experience.)
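For reference, the standard logistic score-to-Elo conversion, plus the "real gain is about half of self-play" rule of thumb described above, as a small Python sketch (the 58.6% score is just an illustrative number, chosen because it corresponds to roughly +60 Elo):

```python
import math

def elo_diff(score: float) -> float:
    """Convert a match score fraction (0..1) to an Elo difference."""
    return -400.0 * math.log10(1.0 / score - 1.0)

# Hypothetical self-play result: the new version scores 58.6% vs the old,
# which is roughly the +60 Elo release threshold mentioned above.
self_play_elo = elo_diff(0.586)

# Rule of thumb from the post: expect roughly half of the self-play
# gain to show up against other opponents / on the rating lists.
expected_real_gain = self_play_elo / 2
```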
-
- Posts: 28381
- Joined: Fri Mar 10, 2006 10:06 am
- Location: Amsterdam
- Full name: H G Muller
Re: An old dilemma
Not exactly. But I did have the experience where my Chu Shogi engine could easily be defeated by humans on the 81Dojo server, due to an obvious strategic flaw: it allowed the opponent to sneak up with his weak steppers on the strong sliders (exposing its artillery to infantry attack, as it were).
So I repaired the flaw by encouraging the engine to better develop its steppers, to provide a protective wall around the sliders. This worked quite well, and the results on the server improved spectacularly. Even the strongest players now needed slower TC to be able to beat it.
But when I played this 'improved' version against the old one, it got crushed! Most of the time it gains material in the beginning of the middle game, building up a significant lead of 2-3 light pieces. (As you start with 32 (non-pawn) pieces, such an imbalance is not quickly decisive.) But then in the late middle-game the chances turn, all the advantage gets lost, and eventually that side loses.
I have not been able to explain this phenomenon yet.
-
- Posts: 551
- Joined: Tue Feb 04, 2014 12:25 pm
- Location: Gower, Wales
- Full name: Colin Jenkins
Re: An old dilemma
I see it when tuning LMR/LMP/NMP/Futility. It's not too hard to find a combination that beats the old version but is worse in a gauntlet against engines in the next CCRL division up; I always just use gauntlets now. I put it down not to self-testing but to single-opponent testing: the more opponents, the less the chance of it happening.
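The "more opponents" intuition can be illustrated with a toy binomial model (my own sketch, not anything from the post; it treats opponents as independent and ignores draws): a change that is actually neutral flukes a winning record against one opponent far more easily than against several at once.

```python
import math

def p_lucky_win(n_games: int, p: float = 0.5) -> float:
    """Probability that an engine whose true per-game score is p
    scores strictly more than half the points in an n-game match
    (draws ignored for simplicity)."""
    need = n_games // 2 + 1
    return sum(math.comb(n_games, k) * p**k * (1 - p)**(n_games - k)
               for k in range(need, n_games + 1))

# A truly equal engine "wins" a single 100-game match almost half the time...
single_opponent = p_lucky_win(100)

# ...but fluking a winning record against five independent opponents
# (20 games each, same 100-game budget) is far less likely.
five_opponents = p_lucky_win(20) ** 5
```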
-
- Posts: 7381
- Joined: Thu Aug 18, 2011 12:04 pm
- Full name: Ed Schröder
Re: An old dilemma
op12no2 wrote:
I see it when tuning LMR/LMP/NMP/Futility. It's not too hard to find a combination that beats the old version but is worse in a gauntlet against engines in the next CCRL division up; I always just use gauntlets now. I put it down not to self-testing but to single-opponent testing: the more opponents, the less the chance of it happening.

And what about the last condition of the OP, does it (also) do better on the rating lists?
So:
1. worse in self-play
2. better against a bunch of other engines
3. release
4. better on the rating lists.
-
- Posts: 551
- Joined: Tue Feb 04, 2014 12:25 pm
- Location: Gower, Wales
- Full name: Colin Jenkins
Re: An old dilemma
Rebel wrote:
And what about the last condition of the OP, does it (also) do better on the rating lists?

Sorry Ed, yes, better on CCRL also. Your points 1-4 ticked.
In fact I'm mooting ignoring the previous release in the gauntlets.
I should add that Lozza is a little weird in that it's JavaScript source, compiled just-in-time as it executes (in e.g. your browser), and it's not super strong. Dunno if that makes a difference (hard to see why) but probably worth mentioning.
http://op12no2.me/toys/lozza
-
- Posts: 20943
- Joined: Mon Feb 27, 2006 7:30 pm
- Location: Birmingham, AL
Re: An old dilemma
Rebel wrote:
Did you ever release a version of your engine that performed worse than the previous version in self-play, yet performed better 1) against a set of other opponents and 2) was later confirmed on the rating lists to be better as well?
IOW, are there known cases where self-play sucks?

I have personally seen several such cases, enough that I simply don't use self-play games at all except for debugging.
-
- Posts: 2094
- Joined: Mon Mar 13, 2006 2:31 am
- Location: North Carolina, USA
Re: An old dilemma
I've seen those cases. I have also seen cases in self play where the new version performs much better, then the rating lists show it no better or
slightly better. I don't completely rely on self play.
My method is this:
1) benchmarks - they are quick and identify gross errors
2) self play - can identify not so gross errors
3) gauntlets - best method for me.
However, I don't completely trust the rating lists. The main reason is the number of bad opening lines that I've found in their game databases. The
second reason is their reliance on out of date hardware. I've modified algorithms numerous times to find things that scale well to deep searches but
not to shallow searches and vice versa. Given that I care more about performance at long TC and serious HW, I prioritize scalability to the high
end and the future.
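On the subject of trusting (or not trusting) rating-list numbers: a small sketch of my own (hypothetical figures, not from any post here) of how an Elo estimate and a confidence interval can be derived from a gauntlet's win/loss/draw record, which is what the error bars on lists like CCRL represent:

```python
import math

def elo(score: float) -> float:
    """Score fraction (0..1) -> Elo difference."""
    return -400.0 * math.log10(1.0 / score - 1.0)

def elo_with_error(wins: int, losses: int, draws: int, z: float = 1.96):
    """Elo estimate with a ~95% confidence interval from a W/L/D
    record, using the normal approximation to the per-game score."""
    n = wins + losses + draws
    mean = (wins + 0.5 * draws) / n          # average score per game
    var = (wins + 0.25 * draws) / n - mean * mean
    stderr = math.sqrt(var / n)
    return elo(mean), elo(mean - z * stderr), elo(mean + z * stderr)

# Hypothetical 1000-game gauntlet record: +450 -390 =160
est, low, high = elo_with_error(450, 390, 160)
```

Even with 1000 games the interval spans tens of Elo, which is why a 20-30 point difference between a self-play measurement and a rating-list figure is often within the noise.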
-
- Posts: 64
- Joined: Fri Oct 18, 2013 11:40 pm
- Location: New York
Re: An old dilemma
I have seen this, too. So some time ago I switched from self-play to gauntlets as the primary way to evaluate engine strength. This approach should correlate well with CCRL rankings.
-
- Posts: 772
- Joined: Fri Jan 04, 2013 4:55 pm
- Location: Nice
Re: An old dilemma
Personally I prefer playing gauntlets against other engines instead of self-testing.
But Stockfish is always tested against itself, right?
It doesn't seem to be doing too badly ...