Issue with self play testing

Discussion of chess software programming and technical issues.

Moderators: hgm, Rebel, chrisw

CRoberson
Posts: 2055
Joined: Mon Mar 13, 2006 2:31 am
Location: North Carolina, USA

Issue with self play testing

Post by CRoberson »

I have been testing a new Ares, based on an issue that came up in a game between Ares and Myrddin on Graham's site.
The issue pertains to king safety. The change makes Ares more aware of the potential for a certain type of king attack/defense.
After playing Ares-old vs Ares-new, I saw that the new version made the attacks the old version wasn't aware of, and the rating
gain was 28 Elo. Upon reflection, I see that the Elo gain is possibly 2x that: the old Ares never made such attacks, and thus
the ability to defend against them went untested and unmeasured.

Thus, self-play testing can lead to insufficient test cases, resulting in an underestimate of the rating gain.
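(For readers wondering where a figure like "28 Elo" comes from: it follows from the standard logistic Elo model applied to the match score. A minimal Python sketch, with hypothetical numbers rather than the actual Ares match record:)

```python
import math

def elo_diff(score):
    """Elo difference implied by a score fraction, per the standard logistic model."""
    return -400.0 * math.log10(1.0 / score - 1.0)

def score_from_wdl(wins, draws, losses):
    """Score fraction from a win/draw/loss record (draws count half)."""
    games = wins + draws + losses
    return (wins + 0.5 * draws) / games

# A hypothetical match where the new version scores 54% works out to about +28 Elo.
print(round(elo_diff(0.54)))  # prints 28
```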
MikeB
Posts: 4889
Joined: Thu Mar 09, 2006 6:34 am
Location: Pen Argyl, Pennsylvania

Re: Issue with self play testing

Post by MikeB »

Interesting. Typically, self-play (or very-similar-engine) testing overestimates the rating gain.
Evert
Posts: 2929
Joined: Sat Jan 22, 2011 12:42 am
Location: NL

Re: Issue with self play testing

Post by Evert »

CRoberson wrote: Fri May 18, 2018 4:31 am Upon reflection, I see that the Elo gain is possibly 2x that: the old Ares never made such attacks and thus
the ability to defend against them went untested and unmeasured.
Yes, and that's why the gain is typically less than it is in self-play: the other opponent may not have been so blind, so you gain less by playing against them.
Of course this very much depends on the opponent you measure against and the gaps in their evaluation.
Thus, self-play testing can lead to insufficient test cases, resulting in an underestimate of the rating gain.
Yes. Self-testing can make you blind to gaps in the evaluation function. It works fairly well for optimising the evaluation weights of features that you have, but you need to test against other engines to find out what your weaknesses are.
Or you need to try adding loads of different terms and see what sticks (which is sort of what SF does), or you need to extract evaluation features in addition to evaluation weights (neural nets).
cdani
Posts: 2204
Joined: Sat Jan 18, 2014 10:24 am
Location: Andorra

Re: Issue with self play testing

Post by cdani »

For the last month I've been testing every change against both the previous Andscacs version and Stockfish. It often happens that a change is good against one and bad against the other.
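(One way to judge whether such a split is a genuine opponent-dependent effect or just noise is to attach confidence intervals to each match result. A rough Python sketch with hypothetical W/D/L records and normal-approximation error bars, not cdani's actual tooling:)

```python
import math

def elo(score):
    """Elo difference implied by a score fraction (standard logistic model)."""
    return -400.0 * math.log10(1.0 / score - 1.0)

def elo_with_ci(wins, draws, losses):
    """Return (elo, upper, lower): the Elo estimate and an approximate 95%
    confidence interval, using a normal approximation of the score's
    standard error from the W/D/L record."""
    n = wins + draws + losses
    s = (wins + 0.5 * draws) / n
    # Per-game variance of the score around its mean.
    var = (wins * (1 - s) ** 2 + draws * (0.5 - s) ** 2 + losses * s ** 2) / n
    se = math.sqrt(var / n)
    return (elo(s),
            elo(min(s + 1.96 * se, 0.999)),
            elo(max(s - 1.96 * se, 0.001)))

# Hypothetical records for one patch against two different opponents:
for name, wdl in [("previous version", (120, 260, 100)),
                  ("other engine",     (90, 250, 140))]:
    mid, hi, lo = elo_with_ci(*wdl)
    print(f"vs {name}: {mid:+.1f} Elo (95% CI {lo:+.1f} .. {hi:+.1f})")
```

If the two intervals overlap heavily, the "good against one, bad against the other" pattern may just be statistical noise; if they are well separated, the change really does interact differently with the two opponents.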
CRoberson
Posts: 2055
Joined: Mon Mar 13, 2006 2:31 am
Location: North Carolina, USA

Re: Issue with self play testing

Post by CRoberson »

Thanks Evert. I do know all that. I was just posting about an interesting issue in self play testing.
I've used various other engines for gauntlets and such... and my published research from the mid-1990s is in neural nets.
Of course, most don't know that. I should apologize. I am sure you are trying to help.
I see you live in the Netherlands - neat. I was there for 2 weeks in 2002: Amsterdam, Utrecht, then Maastricht. A very nice country. I rather liked it.
Continuing to like the US is getting more difficult with all the Republican __((**E&&#@___
If you are up to date on the fairest opening books or positions to use for testing, I would be very interested in hearing about that.
Evert wrote: Fri May 18, 2018 7:56 am
CRoberson wrote: Fri May 18, 2018 4:31 am Upon reflection, I see that the Elo gain is possibly 2x that: the old Ares never made such attacks and thus
the ability to defend against them went untested and unmeasured.
Yes, and that's why the gain is typically less than it is in self-play: the other opponent may not have been so blind, so you gain less by playing against them.
Of course this very much depends on the opponent you measure against and the gaps in their evaluation.
Thus, self-play testing can lead to insufficient test cases, resulting in an underestimate of the rating gain.
Yes. Self-testing can make you blind to gaps in the evaluation function. It works fairly well for optimising the evaluation weights of features that you have, but you need to test against other engines to find out what your weaknesses are.
Or you need to try adding loads of different terms and see what sticks (which is sort of what SF does), or you need to extract evaluation features in addition to evaluation weights (neural nets).
Greg Strong
Posts: 388
Joined: Sun Dec 21, 2008 6:57 pm
Location: Washington, DC

Re: Issue with self play testing

Post by Greg Strong »

Nice to hear that you are working on a new version of Ares :)

I test almost exclusively against eight other engines. I rotate them from time to time, but Ares is one of the engines that I have used a lot.