On engine testing again!

Uri Blass
Posts: 10895
Joined: Thu Mar 09, 2006 12:37 am
Location: Tel-Aviv Israel

Re: On engine testing again!

Post by Uri Blass »

swami wrote:
Dirt wrote:
Edsel Apostol wrote:Another question: what would be the best Elo range for the opponents if, for example, the engine is rated 2850 Elo, and why?
Uri had the interesting suggestion to play against the strongest engine you can find, but give your engine time odds so that it plays at a roughly equal strength. This can somewhat speed up testing. It might not be best to choose all your opponents this way.
There's no correct rating assigned to a top engine playing with time odds. One needs to know the opponent's rating to see whether the changes made to the engine resulted in progress.
You compare version A and version B based on the number of points each scores against fixed opponents.

You do not need the rating of the opponents.

Uri
bob
Posts: 20943
Joined: Mon Feb 27, 2006 7:30 pm
Location: Birmingham, AL

Re: On engine testing again!

Post by bob »

Edsel Apostol wrote:Let's say that due to the limited resources one can only play 1200 games per engine version/settings.

Which testing method is better and why?

A. 120 games against each of the 10 opponents
B. 240 games against each of the 5 opponents
C. 300 games against each of the 4 opponents
D. 400 games against each of the 3 opponents
E. 600 games against each of the 2 opponents
F. 1200 games against one opponent
Difficult question.

First, one opponent will lead you to tuning against that opponent, which may well hurt you against others.

However, using many opponents forces you to reduce the number of starting test positions, which makes the choice of positions critical.

If you are trying to measure 10-20 Elo changes, 1200 games is really hopeless, however. This is a painful issue, no doubt...
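To put a rough number on that claim, here is a minimal sketch of the usual normal-approximation error bar on an Elo estimate. The 30% draw rate, the 50% expected score, and the function names are assumptions made for illustration; they are not figures from this thread.

```python
import math

def elo_from_score(score):
    """Convert a score fraction (0 < score < 1) to an Elo difference."""
    return -400.0 * math.log10(1.0 / score - 1.0)

def elo_margin(games, score=0.5, draw_ratio=0.3, z=1.96):
    """Approximate half-width (in Elo) of the confidence interval after `games` games.

    score and draw_ratio are assumed values; z=1.96 corresponds to about 95%.
    """
    win_ratio = score - draw_ratio / 2.0
    # Per-game variance of the result, counting win=1, draw=0.5, loss=0.
    variance = win_ratio * 1.0 + draw_ratio * 0.25 - score ** 2
    stderr = math.sqrt(variance / games)
    low = elo_from_score(score - z * stderr)
    high = elo_from_score(score + z * stderr)
    return (high - low) / 2.0

for n in (1200, 5000, 20000):
    print(n, "games: about +/-", round(elo_margin(n), 1), "Elo")
```

Under these assumptions the margin for 1200 games is roughly 16 Elo for a single version, and the uncertainty on the difference between two versions measured this way is larger still, which fits the point that 10-20 Elo changes cannot be resolved at this sample size.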
bob
Posts: 20943
Joined: Mon Feb 27, 2006 7:30 pm
Location: Birmingham, AL

Re: On engine testing again!

Post by bob »

Graham Banks wrote:I'd go for:

A. 120 games against each of the 10 opponents

I think that 120 games is a reasonable number against a given opponent, if you're having each play from the same opening line as both White and Black.
A good range of opponents is better than a limited range.

The only other option I'd consider would be:

B. 240 games against each of the 5 opponents
I'd go with A, period, but there is an issue. You need to carefully vet the positions. You do not want significantly biased positions. Yes, alternating colors will keep them from affecting the results, but it will also produce two games that tell you nothing, and with so few games, you can't afford to have any that are worthless.
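As an illustration of option A with color-reversed openings, the sketch below builds a schedule in which every opening is played twice against each opponent, once with each color. The opponent names and position identifiers are placeholders, and the helper is hypothetical rather than part of any testing tool mentioned here.

```python
from itertools import product

def build_schedule(opponents, positions):
    """Return a list of (opponent, position, our_color) game descriptions.

    Each position is played twice per opponent with colors reversed.
    """
    games = []
    for opponent, position in product(opponents, positions):
        games.append((opponent, position, "white"))
        games.append((opponent, position, "black"))
    return games

# 10 opponents x 60 vetted openings x 2 colors = 1200 games.
opponents = [f"opponent_{i}" for i in range(1, 11)]
positions = [f"opening_{i}" for i in range(1, 61)]
schedule = build_schedule(opponents, positions)
print(len(schedule))
```

With 120 games per opponent this leaves room for only 60 distinct openings, which is why a badly unbalanced opening costs two games out of an already small sample.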
Graham Banks
Posts: 44636
Joined: Sun Feb 26, 2006 10:52 am
Location: Auckland, NZ

Re: On engine testing again!

Post by Graham Banks »

bob wrote:
Graham Banks wrote:I'd go for:

A. 120 games against each of the 10 opponents

I think that 120 games is a reasonable number against a given opponent, if you're having each play from the same opening line as both White and Black.
A good range of opponents is better than a limited range.

The only other option I'd consider would be:

B. 240 games against each of the 5 opponents
I'd go with A, period, but there is an issue. You need to carefully vet the positions. You do not want significantly biased positions. Yes, alternating colors will keep them from affecting the results, but it will also produce two games that tell you nothing, and with so few games, you can't afford to have any that are worthless.
Agreed. :)
gbanksnz at gmail.com
Edsel Apostol
Posts: 803
Joined: Mon Jul 17, 2006 5:53 am
Full name: Edsel Apostol

Re: On engine testing again!

Post by Edsel Apostol »

bob wrote:
Edsel Apostol wrote:Let's say that due to the limited resources one can only play 1200 games per engine version/settings.

Which testing method is better and why?

A. 120 games against each of the 10 opponents
B. 240 games against each of the 5 opponents
C. 300 games against each of the 4 opponents
D. 400 games against each of the 3 opponents
E. 600 games against each of the 2 opponents
F. 1200 games against one opponent
Difficult question.

First, one opponent will lead you to tuning against that opponent, which may well hurt you against others.

However, using many opponents forces you to reduce the number of starting test positions, which makes the choice of positions critical.

If you are trying to measure 10-20 Elo changes, 1200 games is really hopeless, however. This is a painful issue, no doubt...
That's why I'm not targeting 10 to 20 Elo changes for now; that will be for tuning later, when the engine is already really strong. I'm focusing more on trying out ideas that may gain or lose at least 30 Elo.

I agree about the few starting positions being critical.

Well, at least 1200 games is better than nothing. If a version/setting is good it will show in the rating list no matter how few the games.
Tord Romstad
Posts: 1808
Joined: Wed Mar 08, 2006 9:19 pm
Location: Oslo, Norway

Re: On engine testing again!

Post by Tord Romstad »

I know it's generally considered a poor way of testing, but we actually rely almost entirely on self-play testing. We usually decide whether or not to keep some change based on a single long match between the original and the modified version. Self-play tends to exaggerate the difference in strength between two versions, but I consider that to be an advantage, because it makes it easier to measure tiny improvements with limited testing time.
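One common way to judge a single self-play match is the "likelihood of superiority" (LOS), a normal approximation that ignores draws. The sketch below is offered only as an illustration of that general technique; the post does not say which decision rule is actually used, and the function name is an assumption.

```python
import math

def likelihood_of_superiority(wins, losses):
    """Approximate probability that the modified version is the stronger one.

    Uses the normal approximation on decisive games only; draws carry
    no information about which side is better in this model.
    """
    decisive = wins + losses
    if decisive == 0:
        return 0.5
    return 0.5 * (1.0 + math.erf((wins - losses) / math.sqrt(2.0 * decisive)))

# Hypothetical match result, for illustration only.
print(round(likelihood_of_superiority(120, 100), 3))
```

A value near 0.5 means the match says nothing either way, while values close to 1.0 suggest the change is likely an improvement in self-play. It says nothing about how the change performs against other engines.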
Edsel Apostol
Posts: 803
Joined: Mon Jul 17, 2006 5:53 am
Full name: Edsel Apostol

Re: On engine testing again!

Post by Edsel Apostol »

Tord Romstad wrote:I know it's generally considered a poor way of testing, but we actually rely almost entirely on self-play testing. We usually decide whether or not to keep some change based on a single long match between the original and the modified version. Self-play tends to exaggerate the difference in strength between two versions, but I consider that to be an advantage, because it makes it easier to measure tiny improvements with limited testing time.
I've tested this way before, and I sometimes found that version A wins against version B, but when pitted against other opponents, version B scores better. Have you also noticed this in your tests, or are the results consistent whether in self-play or against other engines?
Don
Posts: 5106
Joined: Tue Apr 29, 2008 4:27 pm

Re: On engine testing again!

Post by Don »

Edsel Apostol wrote:Let's say that due to the limited resources one can only play 1200 games per engine version/settings.

Which testing method is better and why?

A. 120 games against each of the 10 opponents
B. 240 games against each of the 5 opponents
C. 300 games against each of the 4 opponents
D. 400 games against each of the 3 opponents
E. 600 games against each of the 2 opponents
F. 1200 games against one opponent
I'm not sure what you are asking. Are you considering the choice of how many programs to test as part of the equation? Are you evaluating just a single version against N different version/settings?
Edsel Apostol
Posts: 803
Joined: Mon Jul 17, 2006 5:53 am
Full name: Edsel Apostol

Re: On engine testing again!

Post by Edsel Apostol »

Don wrote:
Edsel Apostol wrote:Let's say that due to the limited resources one can only play 1200 games per engine version/settings.

Which testing method is better and why?

A. 120 games against each of the 10 opponents
B. 240 games against each of the 5 opponents
C. 300 games against each of the 4 opponents
D. 400 games against each of the 3 opponents
E. 600 games against each of the 2 opponents
F. 1200 games against one opponent
I'm not sure what you are asking. Are you considering the choice of how many programs to test as part of the equation? Are you evaluating just a single version against N different version/settings?
What I'm doing right now is that I play for example version/setting A against 10 opponents with 120 games each. Version/setting B would also play the same opponents with the same number of games with the same opening suite. That's how I compare which of the versions is better.

I'm thinking that 120 games per opponent might be too few, and that playing 240 games against only 5 opponents might be better. The constraint here is the limit of 1200 games per engine version/setting.
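The comparison described above can be sketched as a points tally over identical gauntlets. The result tuples are (wins, draws, losses) from the tested version's point of view; the opponent names and numbers are made up for illustration, and the helper names are hypothetical.

```python
def points(wins, draws, losses):
    """Score in points: 1 per win, 0.5 per draw."""
    return wins + 0.5 * draws

def compare_versions(results_a, results_b):
    """Return (total point gain of B over A, per-opponent gains).

    Both arguments map opponent name -> (wins, draws, losses).
    """
    if results_a.keys() != results_b.keys():
        raise ValueError("both versions must face the same opponents")
    per_opponent = {
        opp: points(*results_b[opp]) - points(*results_a[opp])
        for opp in results_a
    }
    return sum(per_opponent.values()), per_opponent

# Hypothetical results, 120 games per opponent.
version_a = {"opp1": (50, 40, 30), "opp2": (40, 40, 40)}
version_b = {"opp1": (55, 40, 25), "opp2": (38, 44, 38)}
total_gain, breakdown = compare_versions(version_a, version_b)
print(total_gain, breakdown)
```

Keeping the per-opponent breakdown makes it easier to spot a change that helps against one opponent while hurting against others. With only 120 games per opponent, each individual difference is very noisy.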
Kempelen
Posts: 620
Joined: Fri Feb 08, 2008 10:44 am
Location: Madrid - Spain

Re: On engine testing again!

Post by Kempelen »

In my case I play 1250 games: 50 games against each of 25 opponents. I don't repeat positions; for each game I choose a random position from Bob's suite (3891 different positions). I don't know what others may think about this setup, but it works very well for me, and I have even noticed more precise results.
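A setup like this can be sketched as sampling without replacement from the suite, so that no position is used twice in a run. Positions are represented here only by their index in the suite; the function name, the opponent labels, and the fixed seed are assumptions for illustration, and how colors are assigned is not stated in the post.

```python
import random

def assign_positions(num_positions, opponents, games_per_opponent, seed=None):
    """Give each opponent its own set of distinct position indices."""
    needed = len(opponents) * games_per_opponent
    if needed > num_positions:
        raise ValueError("not enough positions to avoid repeats")
    rng = random.Random(seed)
    # random.sample draws without replacement, so no index repeats.
    picks = rng.sample(range(num_positions), needed)
    return {
        opp: picks[i * games_per_opponent:(i + 1) * games_per_opponent]
        for i, opp in enumerate(opponents)
    }

opponents = [f"opponent_{i}" for i in range(1, 26)]
plan = assign_positions(3891, opponents, 50, seed=1)
print(sum(len(p) for p in plan.values()))
```

Using a fixed seed keeps the position set identical from one engine version to the next, which matters if two versions are to be compared on equal terms.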

Regards,
Fermin
Fermin Serrano
Author of 'Rodin' engine
http://sites.google.com/site/clonfsp/