On engine testing again!

Discussion of chess software programming and technical issues.

Moderators: hgm, Rebel, chrisw

Tord Romstad
Posts: 1808
Joined: Wed Mar 08, 2006 9:19 pm
Location: Oslo, Norway

Re: On engine testing again!

Post by Tord Romstad »

Edsel Apostol wrote:
Tord Romstad wrote:I know it's generally considered a poor way of testing, but we actually rely almost entirely on self-play testing. We usually decide whether or not to keep some change based on a single long match between the original and the modified version. Self-play tends to exaggerate the difference in strength between two versions, but I consider that to be an advantage, because it makes it easier to measure tiny improvements with limited testing time.
I've tested this way before, and I would sometimes find that version A wins against version B, but when pitted against other opponents, version B scored better. Have you also noticed this in your tests, or are your results consistent whether it's self-play or against other engines?
It is possible that this happens occasionally for a single change, but the sum of numerous self-play improvements has so far always given a net improvement in play against other engines as well.
Don
Posts: 5106
Joined: Tue Apr 29, 2008 4:27 pm

Re: On engine testing again!

Post by Don »

Tord Romstad wrote:I know it's generally considered a poor way of testing, but we actually rely almost entirely on self-play testing. We usually decide whether or not to keep some change based on a single long match between the original and the modified version. Self-play tends to exaggerate the difference in strength between two versions, but I consider that to be an advantage, because it makes it easier to measure tiny improvements with limited testing time.
I agree with you on this. The prejudice against self-play is very strong, but I have not found it to be a major issue.

Stockfish is a pretty strong program, and if it uses self-play, that is some evidence that self-play cannot be too horrible. I've heard Rybka relies almost entirely on self-play too, because there are no good opponents for it. This is all pretty compelling evidence to me that it cannot be the worst thing in the world.

The other thing about it is that I hate to test other people's programs when it's my program I'm trying to improve. My resources are limited and I don't want 90% of the CPU time to be used testing foreign programs and only 10% testing mine. If Rybka used other opponents, it would have to give time odds and it would be even worse.

Nevertheless, when I prepare release versions I move away somewhat from self-testing at short time controls and play longer time controls against some foreign opponents. But the bulk of my testing is done with fast time-control self-testing.
zamar
Posts: 613
Joined: Sun Jan 18, 2009 7:03 am

Re: On engine testing again!

Post by zamar »

Edsel Apostol wrote: I've tested this way before, and I would sometimes find that version A wins against version B, but when pitted against other opponents, version B scored better. Have you also noticed this in your tests, or are your results consistent whether it's self-play or against other engines?
I've noticed this happens quite often when changing the shape of the search tree and rarely when modifying eval terms (with the possible exception of passed pawns and king safety).

However, as Tord pointed out, self-play exaggerates the Elo change, which is a good thing. Another point is that when you compare the results of two gauntlets, the error bar grows by a factor of sqrt(2). These two observations have made me think that with limited testing resources, self-play is the most reliable way to proceed. It's a bit of a broken compass, but better than the other options.
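
To give a rough idea of what that sqrt(2) means in numbers, here is a quick back-of-the-envelope sketch in Python. The per-game score standard deviation of 0.4 and the near-50% score are my own ballpark assumptions (the real value depends on the draw rate), so treat the output as an illustration, not a measurement.

Code:
import math

def elo_error_95(games, per_game_sd=0.4):
    """Approximate 95% error bar, in Elo, on the score of a single match."""
    se_score = per_game_sd / math.sqrt(games)        # standard error of the mean score
    elo_per_score_point = 400 / math.log(10) / 0.25  # slope of the Elo curve near a 50% score
    return 1.96 * se_score * elo_per_score_point

n = 2000
direct = elo_error_95(n)                     # self-play: one match measures the difference directly
indirect = math.sqrt(2) * elo_error_95(n)    # two independent gauntlets, then subtract the scores
print("self-play match of %d games:         +/- %.1f Elo" % (n, direct))
print("gauntlet comparison, %d games each:  +/- %.1f Elo" % (n, indirect))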

Before release it's of course important to test against other engines.
Joona Kiiski
Don
Posts: 5106
Joined: Tue Apr 29, 2008 4:27 pm

Re: On engine testing again!

Post by Don »

zamar wrote:
Edsel Apostol wrote: I've tested this way before, and I would sometimes find that version A wins against version B, but when pitted against other opponents, version B scored better. Have you also noticed this in your tests, or are your results consistent whether it's self-play or against other engines?
I've noticed this happens quite often when changing the shape of the search tree and rarely when modifying eval terms (with the possible exception of passed pawns and king safety).

However, as Tord pointed out, self-play exaggerates the Elo change, which is a good thing. Another point is that when you compare the results of two gauntlets, the error bar grows by a factor of sqrt(2). These two observations have made me think that with limited testing resources, self-play is the most reliable way to proceed. It's a bit of a broken compass, but better than the other options.

Before release it's of course important to test against other engines.
I think it's possible that some change can make your program weaker (in general) and yet improve the results in self-play. But that is very rare. It's a matter of how paranoid you want to be. One could argue that testing against any computer player is incestuous too.

So the best (and imperfect) solution for me has been to use foreign players as a final sanity test once I "think" I have a valid improvement.

To me a far greater issue is that some improvements are really regressions at longer time controls, or vice versa. That has far more impact than self-play vs. foreign opponents.

Don
opraus
Posts: 166
Joined: Wed Mar 08, 2006 9:49 pm
Location: S. New Jersey, USA

Re: On engine testing again!

Post by opraus »

Just to chime in:

I am currently using 1800 games against 4 opponents and 225 positions - but I can see where more opponents might be better.

The bit about 'thematic' positions is interesting to me. I had also thought of using several 'personalities' of a strong engine as more opponents.

I think that knowing the 'themes' of positions and the 'personality' of opponents might help discover particular areas of weakness.

I think 1800 games can differentiate +/- 14 Elo changes, but these are hard to come by, and the matches take 48 hours ... :(
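
For what it's worth, that +/- 14 Elo figure for 1800 games is roughly what a simple normal-approximation estimate gives. The sketch below is mine, not a description of anyone's actual tester, and the per-game score standard deviation of 0.42 is an assumed ballpark that depends on the draw rate.

Code:
import math

def games_needed(target_elo_error, per_game_sd=0.42, z=1.96):
    """Games required so the 95% error bar on the measured Elo is about the target."""
    elo_per_score_point = 400 / math.log(10) / 0.25       # slope of the Elo curve near a 50% score
    target_score_error = target_elo_error / elo_per_score_point
    return int(math.ceil((z * per_game_sd / target_score_error) ** 2))

for elo in (14, 10, 5):
    print("+/- %2d Elo needs roughly %6d games" % (elo, games_needed(elo)))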
Milos
Posts: 4190
Joined: Wed Nov 25, 2009 1:47 am

Re: On engine testing again!

Post by Milos »

Don wrote: To me a far greater issue is that some improvements are really regressions at longer time controls, or vice versa. That has far more impact than self-play vs. foreign opponents.
Good point here. And it's more often short-time-control improvements that hurt at long time controls than the other way around, simply because they are so much faster to test properly.
But there, no solid advice is possible, except to try to improve one's hunch about which changes will or won't hurt at long time controls.
Don
Posts: 5106
Joined: Tue Apr 29, 2008 4:27 pm

Re: On engine testing again!

Post by Don »

Milos wrote:
Don wrote: To me a far greater issue is that some improvements are really regressions at longer time controls, or vice versa. That has far more impact than self-play vs. foreign opponents.
Good point here. And it's more often short-time-control improvements that hurt at long time controls than the other way around, simply because they are so much faster to test properly.
But there, no solid advice is possible, except to try to improve one's hunch about which changes will or won't hurt at long time controls.
Exactly: unless one has enormous testing resources, one must make judgment calls and accept compromises - or be satisfied with a very slow development process.
michiguel
Posts: 6401
Joined: Thu Mar 09, 2006 8:30 pm
Location: Chicago, Illinois, USA

Re: On engine testing again!

Post by michiguel »

Kempelen wrote: In my case I do 1250 games: 25 opponents and 50 games against each of them. I don't repeat positions, but I choose for each game a random position from Bob's suite (3891 different positions). I don't know what others may think about this setup, but for me it works very well; I have even noticed more precise results.

Regards,
Fermin
IMHO, from the statistical point of view, your setup is *excellent*: as many opponents as you can, and as much diversity of positions as you can.
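
Just to illustrate the kind of schedule Fermin describes, here is a rough Python sketch. The file name and the play_game() helper at the end are hypothetical placeholders; any tester that accepts start positions would do the actual game playing.

Code:
import random

OPPONENTS = ["opponent_%02d" % i for i in range(1, 26)]   # 25 sparring partners
GAMES_PER_OPPONENT = 50                                   # 25 * 50 = 1250 games

def load_positions(path):
    """Read one FEN/EPD start position per line."""
    with open(path) as f:
        return [line.strip() for line in f if line.strip()]

def schedule(positions):
    """Yield (opponent, start position, colour); every game gets a distinct random position."""
    total = len(OPPONENTS) * GAMES_PER_OPPONENT
    picks = iter(random.sample(positions, total))         # 3891 positions >= 1250 games, so no repeats
    for opp in OPPONENTS:
        for game in range(GAMES_PER_OPPONENT):
            colour = "white" if game % 2 == 0 else "black"
            yield opp, next(picks), colour

# positions = load_positions("opening_suite.epd")                             # hypothetical file name
# for opp, fen, colour in schedule(positions):
#     play_game(me="myengine", opponent=opp, start_fen=fen, my_colour=colour) # hypothetical helper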

Miguel
Don
Posts: 5106
Joined: Tue Apr 29, 2008 4:27 pm

Re: On engine testing again!

Post by Don »

michiguel wrote:
Kempelen wrote: In my case I do 1250 games: 25 opponents and 50 games against each of them. I don't repeat positions, but I choose for each game a random position from Bob's suite (3891 different positions). I don't know what others may think about this setup, but for me it works very well; I have even noticed more precise results.

Regards,
Fermin
IMHO, from the statistical point of view, your setup is *excellent*: as many opponents as you can, and as much diversity of positions as you can.

Miguel
Miguel,

From an efficiency point of view, Kempelen is using half his resources testing everybody else's programs. One must ask whether that's worth doing. The really strong programming teams are not doing this.

There is a way out if you do not have a really strong program: find N opponents who are much stronger than your program and handicap them appropriately. To get the most "bang for the buck" you want your opponents to be close to you in effective Elo, and by using stronger opponents you can cut down on their thinking time and thus use your resources more wisely.

I use this principle in my testing. Rybka is one of my sparring partners, but Rybka is so strong that I set it to play much faster than Doch, which means my tester spends most of its CPU time testing MY program, not Rybka.

If the opponent is weaker, you can give it more time to equalize. I argue that you shouldn't test against programs that are too much weaker, but if you do, it's probably best to eat the cost and give the opponent more time so that things even out. It requires a lot more games to accurately rate against an opponent 300 Elo weaker, for instance. I like to keep everyone within 100 Elo of my program.
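
To make the handicap idea concrete, here is a rough sketch of how one might pick the time odds. The "Elo per doubling of time" figure is only an assumption -- numbers in the 50-100 range are often quoted for fast time controls, and it varies a lot by engine and hardware, so calibrate it yourself.

Code:
ELO_PER_DOUBLING = 70.0   # assumed rule of thumb, not a measured value

def opponent_time_factor(elo_gap):
    """Factor to apply to the opponent's time control to roughly cancel an Elo gap.

    elo_gap > 0 means the opponent is stronger, so it gets proportionally less time.
    """
    return 2.0 ** (-elo_gap / ELO_PER_DOUBLING)

base_time = 60.0   # seconds of base time for my own engine
for gap in (300, 100, 0, -100):
    f = opponent_time_factor(gap)
    print("opponent %+4d Elo -> give it %6.1fs (factor %.2f)" % (gap, base_time * f, f))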
Don
Posts: 5106
Joined: Tue Apr 29, 2008 4:27 pm

Re: On engine testing again!

Post by Don »

One more comment on this. I am assuming that your tester, whether Arena or something else, permits time-handicap games. I'm not sure that this is the case, however. If it's not, then you don't have the proper tools for testing.
If the program is weaker, you can give it more time to equalize. I argue that you shouldn't test against programs that are too much weaker, but if you do it's probably best to eat the time and give the opponent more time so that there is an equalizing. It requires a lot more games to accurately rate against an opponent 300 ELO weaker for instance. I like to keep everyone within 100 ELO of my program.