On engine testing again!

Uri Blass · Post by **Uri Blass** » Fri Jan 01, 2010 10:22 am

swami wrote:
Dirt wrote:
Edsel Apostol wrote:Another question: what would be the best Elo range for the opponents if, for example, the engine is rated 2850 Elo, and why?
Uri had the interesting suggestion to play against the strongest engine you can find, but give your engine time odds so that it plays at a roughly equal strength. This can somewhat speed up testing. It might not be best to choose all your opponents this way.
There's no correct rating assigned to a top engine playing at time odds. One needs to know the rating of the opponent to see whether the changes made to the engine resulted in progress.

You compare version A and version B by the number of points each scores against the same fixed opponents.

You do not need the rating of the opponents.

Uri
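As an illustration of the point above, here is a minimal sketch of comparing two versions purely by points scored against the same fixed opponents, with no opponent ratings involved. The function name `compare_versions` and the example totals are hypothetical, and the variance term is a conservative assumption (it ignores draws, which only make the real variance smaller).

```python
import math

def compare_versions(points_a, points_b, games):
    """Compare two versions that each played `games` games against
    the same fixed opponents.

    Returns (score difference B - A as a fraction, approximate z-score).
    Assumption: the two runs are treated as independent, and the
    per-game variance is bounded by s * (1 - s), which is exact only
    when there are no draws.
    """
    score_a = points_a / games
    score_b = points_b / games
    # Variance of each mean score (conservative upper bound).
    var_a = score_a * (1.0 - score_a) / games
    var_b = score_b * (1.0 - score_b) / games
    diff = score_b - score_a
    total_var = var_a + var_b
    z = diff / math.sqrt(total_var) if total_var > 0 else 0.0
    return diff, z

# Hypothetical totals: A scored 600/1200, B scored 630/1200.
diff, z = compare_versions(600, 630, 1200)
print(f"score difference: {diff:+.3f}, z = {z:.2f}")
```

A z-score well below 2 suggests the difference could easily be noise at this sample size.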

bob · Post by **bob** » Fri Jan 01, 2010 6:24 pm

Edsel Apostol wrote:Let's say that due to the limited resources one can only play 1200 games per engine version/settings.

Which testing method is better and why?

A. 120 games against each of the 10 opponents
B. 240 games against each of the 5 opponents
C. 300 games against each of the 4 opponents
D. 400 games against each of the 3 opponents
E. 600 games against each of the 2 opponents
F. 1200 games against a single opponent

Difficult question.

First, one opponent will lead you to tuning against that opponent, which may well hurt you against others.

However, using many opponents forces you to reduce the number of starting test positions, which makes the choice of positions critical.

If you are trying to measure 10-20 Elo changes, 1200 games is really hopeless, however. This is a painful issue, no doubt...
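To put a rough number on that claim, the sketch below estimates the 95% error margin, in Elo, of a score measured over a given number of games. The function names, the 30% draw ratio, and the 50% score are assumptions chosen for illustration; the score-to-Elo conversion is the standard logistic formula.

```python
import math

def elo_from_score(score):
    """Convert a score fraction (0 < score < 1) to an Elo difference
    using the standard logistic model."""
    return -400.0 * math.log10(1.0 / score - 1.0)

def elo_margin(games, score=0.5, draw_ratio=0.3, z=1.96):
    """Approximate half-width, in Elo, of the confidence interval
    around a measured score.

    Assumptions: games are independent, and `draw_ratio` (the
    fraction of drawn games) is a guess, not a measured value.
    """
    win = score - draw_ratio / 2.0
    # Per-game variance of the result (win = 1, draw = 0.5, loss = 0).
    variance = win + 0.25 * draw_ratio - score ** 2
    stderr = math.sqrt(variance / games)
    upper = elo_from_score(score + z * stderr)
    lower = elo_from_score(score - z * stderr)
    return (upper - lower) / 2.0

for n in (1200, 5000, 20000):
    print(f"{n:6d} games: about +/- {elo_margin(n):.1f} Elo")
```

Under these assumptions the margin for 1200 games comes out at roughly 16 Elo for a single version, and comparing two versions each measured this way widens it further, which is consistent with 10-20 Elo changes being out of reach at this sample size.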

bob · Post by **bob** » Fri Jan 01, 2010 6:26 pm

Graham Banks wrote:I'd go for:

A. 120 games against each of the 10 opponents

I think that 120 games is a reasonable number against a given opponent, if you're having each play from the same opening line as both White and Black.
A good range of opponents is better than a limited range.

The only other option I'd consider would be:

B. 240 games against each of the 5 opponents

I'd go with A, period, but there is an issue. You need to carefully vet the positions. You do not want significantly biased positions. Yes, alternating colors will keep them from affecting the results, but it will also produce two games that tell you nothing, and with so few games, you can't afford to have any that are worthless.

Graham Banks · Post by **Graham Banks** » Fri Jan 01, 2010 8:25 pm

bob wrote:
Graham Banks wrote:I'd go for:

A. 120 games against each of the 10 opponents

I think that 120 games is a reasonable number against a given opponent, if you're having each play from the same opening line as both White and Black.
A good range of opponents is better than a limited range.

The only other option I'd consider would be:

B. 240 games against each of the 5 opponents
I'd go with A, period, but there is an issue. You need to carefully vet the positions. You do not want significantly biased positions. Yes, alternating colors will keep them from affecting the results, but it will also produce two games that tell you nothing, and with so few games, you can't afford to have any that are worthless.

Agreed.

Edsel Apostol · Post by **Edsel Apostol** » Fri Jan 01, 2010 9:26 pm

bob wrote:
Edsel Apostol wrote:Let's say that due to the limited resources one can only play 1200 games per engine version/settings.

Which testing method is better and why?

A. 120 games against each of the 10 opponents
B. 240 games against each of the 5 opponents
C. 300 games against each of the 4 opponents
D. 400 games against each of the 3 opponents
E. 600 games against each of the 2 opponents
F. 1200 games against a single opponent
Difficult question.

First, one opponent will lead you to tuning against that opponent, which may well hurt you against others.

However, using many opponents forces you to reduce the number of starting test positions, which makes the choice of positions critical.

If you are trying to measure 10-20 Elo changes, 1200 games is really hopeless, however. This is a painful issue, no doubt...

That's why I'm not targeting 10 to 20 Elo changes for now; that will be for tuning later, when the engine is already really strong. I'm focusing on trying out ideas that may gain or lose at least 30 Elo.

I agree about the few starting positions being critical.

Well, at least 1200 games is better than nothing. If a version/setting is good it will show in the rating list no matter how few the games.

Tord Romstad · Post by **Tord Romstad** » Fri Jan 01, 2010 9:31 pm

I know it's generally considered a poor way of testing, but we actually rely almost entirely on self-play testing. We usually decide whether or not to keep some change based on a single long match between the original and the modified version. Self-play tends to exaggerate the difference in strength between two versions, but I consider that to be an advantage, because it makes it easier to measure tiny improvements with limited testing time.

Edsel Apostol · Post by **Edsel Apostol** » Fri Jan 01, 2010 9:43 pm

Tord Romstad wrote:I know it's generally considered a poor way of testing, but we actually rely almost entirely on self-play testing. We usually decide whether or not to keep some change based on a single long match between the original and the modified version. Self-play tends to exaggerate the difference in strength between two versions, but I consider that to be an advantage, because it makes it easier to measure tiny improvements with limited testing time.

I've tested this way before, and I sometimes found that version A beat version B in self-play, yet version B scored better when both were pitted against other opponents. Have you noticed this in your tests too, or are your results consistent between self-play and games against other engines?

Don · Post by **Don** » Fri Jan 01, 2010 11:49 pm

Edsel Apostol wrote:Let's say that due to the limited resources one can only play 1200 games per engine version/settings.

Which testing method is better and why?

A. 120 games against each of the 10 opponents
B. 240 games against each of the 5 opponents
C. 300 games against each of the 4 opponents
D. 400 games against each of the 3 opponents
E. 600 games against each of the 2 opponents
F. 1200 games against a single opponent

I'm not sure what you are asking. Are you considering the choice of how many programs to test as part of the equation? Are you evaluating just a single version against N different version/settings?

Edsel Apostol · Post by **Edsel Apostol** » Sat Jan 02, 2010 6:24 am

Don wrote:
Edsel Apostol wrote:Let's say that due to the limited resources one can only play 1200 games per engine version/settings.

Which testing method is better and why?

A. 120 games against each of the 10 opponents
B. 240 games against each of the 5 opponents
C. 300 games against each of the 4 opponents
D. 400 games against each of the 3 opponents
E. 600 games against each of the 2 opponents
F. 1200 games against a single opponent
I'm not sure what you are asking. Are you considering the choice of how many programs to test as part of the equation? Are you evaluating just a single version against N different version/settings?

What I'm doing right now is playing, for example, version/setting A against 10 opponents at 120 games each. Version/setting B then plays the same opponents, with the same number of games and the same opening suite. That's how I compare which of the versions is better.

I'm thinking that 120 games per opponent might be too few, and that playing 240 games against only 5 opponents might be better. The constraint here is the limit of 1200 games per engine version/setting.

Kempelen · Post by **Kempelen** » Sat Jan 02, 2010 11:18 am

In my case I play 1250 games: 50 games against each of 25 opponents. I don't repeat positions; for each game I choose a random position from Bob's suite (3891 different positions). I don't know what others may think of this setup, but it works very well for me, and I have even noticed more precise results.

Regards,
Fermin
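A setup like the one described above could be sketched as follows: draw distinct random positions from a suite and hand each opponent its own share, so that no position is used twice. The function name `build_schedule` and the placeholder position and engine names are hypothetical; loading the real suite from a file is left out.

```python
import random

def build_schedule(positions, opponents, games_per_opponent, seed=None):
    """Assign distinct, randomly chosen start positions to each opponent.

    Returns a dict mapping opponent name -> list of positions.
    Raises ValueError if the suite is too small to avoid repeats.
    """
    needed = len(opponents) * games_per_opponent
    if needed > len(positions):
        raise ValueError("not enough positions to avoid repeats")
    rng = random.Random(seed)
    # sample() draws without replacement, so no position repeats.
    chosen = rng.sample(positions, needed)
    return {
        opponent: chosen[i * games_per_opponent:(i + 1) * games_per_opponent]
        for i, opponent in enumerate(opponents)
    }

# Placeholder data standing in for a 3891-position suite and 25 engines.
suite = [f"position_{i}" for i in range(3891)]
engines = [f"engine_{i}" for i in range(25)]
schedule = build_schedule(suite, engines, 50, seed=2010)
print(sum(len(v) for v in schedule.values()), "games scheduled")
```

Fixing the seed keeps the schedule reproducible, so two engine versions can be tested on the same set of positions.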
