Tord Romstad wrote:
I know it's generally considered a poor way of testing, but we actually rely almost entirely on self-play testing. We usually decide whether or not to keep some change based on a single long match between the original and the modified version. Self-play tends to exaggerate the difference in strength between two versions, but I consider that to be an advantage, because it makes it easier to measure tiny improvements with limited testing time.

Edsel Apostol wrote:
I've tested this way before, and I could sometimes find that version A wins against version B, but when pitted against other opponents, version B scored better. Have you noticed this also in your tests, or are the results consistent whether it is self-play or play against other engines?

It is possible that this happens occasionally for a single change, but the sum of numerous self-play improvements has so far always given a net improvement in play against other engines as well.
On engine testing again!
Moderators: hgm, Rebel, chrisw
-
- Posts: 1808
- Joined: Wed Mar 08, 2006 9:19 pm
- Location: Oslo, Norway
Re: On engine testing again!
-
- Posts: 5106
- Joined: Tue Apr 29, 2008 4:27 pm
Re: On engine testing again!
Tord Romstad wrote:
I know it's generally considered a poor way of testing, but we actually rely almost entirely on self-play testing. We usually decide whether or not to keep some change based on a single long match between the original and the modified version. Self-play tends to exaggerate the difference in strength between two versions, but I consider that to be an advantage, because it makes it easier to measure tiny improvements with limited testing time.

I agree with you on this. The prejudice against self-play is very strong, but I have not found it to be a major issue.
Stockfish is a pretty strong program, and if it uses self-play, that is some evidence that self-play cannot be too horrible. I heard that Rybka relies almost entirely on self-play too, because there are no good opponents for it. This is all pretty compelling evidence to me that it cannot be the worst thing in the world.
The other thing is that I hate to test other people's programs when it's my program I'm trying to improve. My resources are limited, and I don't want 90% of the CPU time to be used testing foreign programs and only 10% testing mine. If Rybka used other opponents it would have to give time odds, and it would be even worse.
Nevertheless, when I prepare release versions I move away somewhat from self-testing at short time controls and play longer time controls against some foreign opponents. But the bulk of my testing is done self-testing at fast time controls.
-
- Posts: 613
- Joined: Sun Jan 18, 2009 7:03 am
Re: On engine testing again!
Edsel Apostol wrote:
I've tested this way before, and I could sometimes find that version A wins against version B, but when pitted against other opponents, version B scored better. Have you noticed this also in your tests, or are the results consistent whether it is self-play or play against other engines?

I've noted that this happens quite often when changing the shape of the search tree, and rarely when modifying eval terms (with the possible exception of passed pawns and king safety).
However, as Tord pointed out, self-play exaggerates the Elo change, which is a good thing. Another point is that when you compare the results of gauntlets, the error bar grows by a factor of sqrt(2). These two observations have made me think that with limited testing resources, self-play is the most reliable way to proceed. It's a bit of a broken compass, but better than the other options.
Before release it's of course important to test against other engines.
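The sqrt(2) point can be checked with a few lines of arithmetic: comparing two independent gauntlet results adds their variances, so the standard error of the difference is sqrt(2) times that of a single direct match of the same length. A minimal sketch (the 32% draw ratio is an illustrative assumption, not a figure from the thread):

```python
import math

def se_score(n_games, draw_ratio=0.32):
    # Standard error of the score of one n-game match between roughly
    # equal opponents; per-game score variance is 0.25 * (1 - draw_ratio).
    return math.sqrt(0.25 * (1.0 - draw_ratio) / n_games)

n = 1000
se_selfplay = se_score(n)                  # A vs. B played directly
# A and B each play an independent n-game gauntlet; the errors of the
# two independent estimates add in quadrature when you take the difference.
se_gauntlet = math.sqrt(se_score(n) ** 2 + se_score(n) ** 2)
print(se_gauntlet / se_selfplay)           # → 1.414... = sqrt(2)
```

So for the same total number of games, the gauntlet comparison needs twice as many games to reach the same error bar on the version-to-version difference.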
Joona Kiiski
-
- Posts: 5106
- Joined: Tue Apr 29, 2008 4:27 pm
Re: On engine testing again!
zamar wrote:
I've noted that this happens quite often when changing the shape of the search tree, and rarely when modifying eval terms (with the possible exception of passed pawns and king safety). However, as Tord pointed out, self-play exaggerates the Elo change, which is a good thing. Another point is that when you compare the results of gauntlets, the error bar grows by a factor of sqrt(2). These two observations have made me think that with limited testing resources, self-play is the most reliable way to proceed. It's a bit of a broken compass, but better than the other options. Before release it's of course important to test against other engines.

I think it's possible that some change can make your program weaker (in general) and yet improve the results in self-play. But that is very rare. It's a matter of how paranoid you want to be; one could argue that testing against any computer player is incestuous too.
So the best (and imperfect) solution for me has been to use foreign players as a final sanity test once I "think" I have a valid improvement.
To me a far greater issue is that some improvements are really regressions at longer time controls, or vice versa. That has far more impact than self-play vs. foreign opponents.
Don
-
- Posts: 166
- Joined: Wed Mar 08, 2006 9:49 pm
- Location: S. New Jersey, USA
Re: On engine testing again!
Just to chime in:
I am currently using 1800 games against 4 opponents and 225 positions - but I can see where more opponents might be better.
The bit about 'thematic' positions is interesting to me. I had also thought of using several 'personalities' of a strong engine as more opponents.
I think that knowing the 'themes' of positions and the 'personality' of opponents might help discover particular areas of weakness.
I think 1800 games can differentiate changes of about +/- 14 Elo, but such changes are hard to come by, and the matches take 48 hours ...
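The +/- 14 Elo figure is roughly what the binomial arithmetic predicts. A quick sanity check, assuming an even match, a draw ratio around 32% (an assumption, not a figure from the thread), and the slope of the logistic Elo curve at 50% (about 7 Elo per percentage point of score):

```python
import math

def elo_margin(n_games, draw_ratio=0.32, z=1.96):
    # Per-game standard deviation of the score (win=1, draw=0.5, loss=0)
    # for an even match: variance = 0.25 * (1 - draw_ratio).
    se_score = math.sqrt(0.25 * (1.0 - draw_ratio) / n_games)
    # Slope of the logistic Elo curve at a 50% score: one unit of score
    # is worth 1600 / ln(10) ≈ 695 Elo, i.e. about 7 Elo per percent.
    elo_per_score = 1600.0 / math.log(10)
    return z * se_score * elo_per_score    # 95% margin in Elo

print(round(elo_margin(1800)))  # → 13, close to the ±14 quoted above
```

With a higher draw ratio the margin shrinks slightly, since draws reduce the per-game variance.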
-
- Posts: 4190
- Joined: Wed Nov 25, 2009 1:47 am
Re: On engine testing again!
Don wrote:
To me a far greater issue is that some improvements are really regressions at longer time controls, or vice versa. That has far more impact than self-play vs. foreign opponents.

Good point here. And it's more often short time control improvements that hurt long time controls, simply because it's so much faster to test them properly. But there, no valid advice is possible, except to try to develop an intuition about which changes will or won't hurt at long time controls.
-
- Posts: 5106
- Joined: Tue Apr 29, 2008 4:27 pm
Re: On engine testing again!
Milos wrote:
Good point here. And it's more often short time control improvements that hurt long time controls, simply because it's so much faster to test them properly. But there, no valid advice is possible, except to try to develop an intuition about which changes will or won't hurt at long time controls.

Exactly. Unless one has enormous testing resources, one must make judgment calls and accept compromises, or be satisfied with a very slow development process.
-
- Posts: 6401
- Joined: Thu Mar 09, 2006 8:30 pm
- Location: Chicago, Illinois, USA
Re: On engine testing again!
Kempelen wrote:
In my case I play 1250 games: 25 opponents and 50 games against each of them. I don't repeat positions; for each game I choose a random position from Bob's suite (3891 different positions). I don't know what others may think about this setup, but for me it works very well, and I have even noted more precise results.
Regards,
Fermin

IMHO, from the statistical point of view, your setup is *excellent*: as many opponents as you can, as much diversity of positions as you can.
Miguel
-
- Posts: 5106
- Joined: Tue Apr 29, 2008 4:27 pm
Re: On engine testing again!
michiguel wrote:
IMHO, from the statistical point of view, your setup is *excellent*: as many opponents as you can, as much diversity of positions as you can.

Miguel,
From an efficiency point of view, Kempelen is using half his resources testing everybody else's programs. One must ask whether it's worth doing this. The really strong programming teams are not doing it.
There is a way out if you do not have a really strong program. Find N opponents who are much stronger than your program and handicap them appropriately. To get the most "bang for the buck" you want your opponents to be close in ELO to you. By using stronger opponents you can cut down on their thinking time and thus use your resources more wisely.
I used this principle in my testing. Rybka is one of my sparring partners, but Rybka is so strong that I set it to play much faster than Doch, which means my tester spends most of its CPU time testing MY program, not Rybka.
If the opponent is weaker, you can give it more time to equalize. I argue that you shouldn't test against programs that are much weaker at all, but if you do, it's probably best to eat the cost and give the opponent more time so that the match evens out. It takes many more games to rate yourself accurately against an opponent 300 Elo weaker, for instance. I like to keep everyone within 100 Elo of my program.
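The handicap idea above can be sketched numerically. Assuming a rule of thumb of roughly 50-100 Elo gained per doubling of thinking time (the exact figure varies by engine and hardware; the 70 used here is an assumption), the time to give a stronger opponent follows directly from the Elo gap:

```python
def handicap_time(base_minutes, elo_gap, elo_per_doubling=70.0):
    # Time per game for an opponent who is elo_gap points stronger than
    # our engine, so the match comes out roughly even.  elo_per_doubling
    # is the assumed Elo gained per doubling of thinking time (a common
    # rule of thumb is 50-100; this value is an assumption).
    return base_minutes / (2.0 ** (elo_gap / elo_per_doubling))

# An opponent 140 Elo stronger at a 1-minute base gets a quarter of the time:
print(handicap_time(1.0, 140.0))  # → 0.25
```

The same formula works in reverse for a weaker opponent (a negative gap yields more time), which is the equalizing case described above.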
-
- Posts: 5106
- Joined: Tue Apr 29, 2008 4:27 pm
Re: On engine testing again!
One more comment on this. I am assuming that your tester, whether Arena or something else, permits time-handicap games. I'm not sure whether that is the case, however. If it's not, then you don't have the proper tools for testing.