Statistics from automated tests

Discussion of chess software programming and technical issues.

Moderators: hgm, Rebel, chrisw

micron
Posts: 155
Joined: Mon Feb 15, 2010 9:33 am
Location: New Zealand

Re: Statistics from automated tests

Post by micron »

jdm64 wrote:10K games! The computer that I'm using to run my tests isn't all that powerful. Currently, at low depth, it can finish a game in 2.5 minutes, so 10K games would take me about two weeks!
10K games is a counsel of perfection, needed only to detect tiny differences; 500 games will pick up the important ones.
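
For a sense of scale, a rough back-of-envelope sketch (the 0.4 per-game standard deviation assumes a typical draw rate, and near an even score one percent of score is worth about 7 Elo):

Code:

#include <cmath>
#include <cstdio>

// Rough 95% error bar, in Elo, for a match of n games.
// Assumes a per-game score stdev of ~0.4 (typical with a moderate
// draw rate) and a near-even match, where dElo/dscore is about
// 400 / (ln 10 * 0.25) ~ 695.
double eloErrorBar95(int games)
{
    double se = 0.4 / std::sqrt(static_cast<double>(games));
    return 1.96 * se * 400.0 / (std::log(10.0) * 0.25);
}

int main()
{
    std::printf("500 games:   +/- %.0f Elo\n", eloErrorBar95(500));   // ~24
    std::printf("10000 games: +/- %.0f Elo\n", eloErrorBar95(10000)); // ~5
}

So 500 games resolves differences of roughly 25 Elo; you only need 10K when chasing single-digit improvements.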

It's not clear how advanced your engine is in 'big-ticket' algorithmic features (transposition table, null-move pruning, ...). Without these and a decent move generator, the engine would be slow; testing would be limited to very shallow searches indeed. It's not worth tuning anything under these conditions, because the tuning would likely have to be redone later.
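
To make the NMP part concrete, here is a minimal sketch of null-move pruning grafted onto a bare alpha-beta skeleton. Board and its methods are stubs standing in for whatever your engine uses, not a real API; only the pruning logic is the point:

Code:

struct Board {
    bool inCheck() const { return false; }  // stub
    void makeNullMove() {}                  // stub: pass the turn
    void unmakeNullMove() {}                // stub
};

int quiescence(Board&, int alpha, int) { return alpha; }  // stub

int search(Board& board, int depth, int alpha, int beta)
{
    if (depth <= 0)
        return quiescence(board, alpha, beta);

    // Null-move pruning: give the opponent a free move. If a reduced
    // search still fails high, the real position is almost certainly
    // good enough to cut off. Skip it in check (the null move would
    // be illegal); real engines also skip zugzwang-prone endings.
    const int R = 2;  // typical reduction
    if (!board.inCheck() && depth > R) {
        board.makeNullMove();
        int score = -search(board, depth - 1 - R, -beta, -beta + 1);
        board.unmakeNullMove();
        if (score >= beta)
            return beta;
    }

    // ... generate and search moves as usual, raising alpha ...
    return alpha;
}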

Many people think it better to test an engine against a completely different one, rather than testing a base version against a slight modification. The reference engine should be better than yours. (If you and your opponent only have pikes as weapons, you won't learn how to defend against bows and arrows, let alone machine guns.) By trying various free engines, you can quickly find one that plays 150-250 Elo better than yours at a suitable time control. A counsel of perfection, irrelevant to most of us, is to test against a huge suite of engines.
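
For intuition on that 150-250 Elo gap, the standard logistic Elo model converts a rating difference into an expected score (a quick sketch, nothing engine-specific):

Code:

#include <cmath>
#include <cstdio>

// Expected score against an opponent rated `diff` Elo above you,
// per the standard logistic Elo model.
double expectedScore(double diff)
{
    return 1.0 / (1.0 + std::pow(10.0, diff / 400.0));
}

int main()
{
    std::printf("vs +150 Elo: %.2f\n", expectedScore(150.0)); // ~0.30
    std::printf("vs +250 Elo: %.2f\n", expectedScore(250.0)); // ~0.19
}

Against such an opponent you still score roughly 20-30%, which leaves plenty of decisive games to carry the signal.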

To organise test matches I recommend cutechess-cli, with opening positions taken from gaviota-starters.pgn. That file has more than enough positions for practical testing (a maximum of 2400 games, or 4800 with the cutechess-cli -repeat option).
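
A sketch of such an invocation (option names as in recent cutechess-cli releases; the engine commands and time control are placeholders, so check -help against your version):

Code:

cutechess-cli \
    -engine cmd=./myengine -engine cmd=./reference \
    -each proto=uci tc=40/60 \
    -openings file=gaviota-starters.pgn format=pgn order=random \
    -rounds 500 -repeat \
    -pgnout results.pgn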

Robert P.
jdm64
Posts: 41
Joined: Thu May 27, 2010 11:32 pm

Re: Statistics from automated tests

Post by jdm64 »

micron wrote:It's not clear how advanced your engine is in 'big-ticket' algorithmic features (transposition table, null-move pruning, ...). Without these and a decent move generator, the engine would be slow; testing would be limited to very shallow searches indeed. It's not worth tuning anything under these conditions, because the tuning would likely have to be redone later.
Currently, my engine is not all that advanced. I'm mainly asking these questions to make sure I'm at least on the right track. I recently added a simple function that has the engine play itself "internally" to see whether it plays equally well as white and black. That's why I was looking for a statistical measurement that accounts for the randomness of play. Using LOS, it looks like my engine favours white, but only by 3%.
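
In case it helps anyone else, the LOS for a given number of wins and losses (draws carry no information) reduces to a normal-distribution tail probability; a minimal sketch:

Code:

#include <cmath>
#include <cstdio>

// Likelihood of superiority: probability that engine A is genuinely
// stronger than B, given w wins and l losses for A (draws ignored).
// Normal approximation: LOS = Phi((w - l) / sqrt(w + l)).
double los(int wins, int losses)
{
    return 0.5 * (1.0 + std::erf((wins - losses) /
                                 std::sqrt(2.0 * (wins + losses))));
}

int main()
{
    std::printf("LOS = %.3f\n", los(60, 40)); // ~0.977
}
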
micron wrote:Many people think it better to test an engine against a completely different one, rather than testing a base version against a slight modification. The reference engine should be better than yours. (If you and your opponent only have pikes as weapons, you won't learn how to defend against bows and arrows, let alone machine guns.) By trying various free engines, you can quickly find one that plays 150-250 Elo better than yours at a suitable time control. A counsel of perfection, irrelevant to most of us, is to test against a huge suite of engines.

To organise test matches I recommend cutechess-cli, with opening positions taken from gaviota-starters.pgn. That file has more than enough positions for practical testing (a maximum of 2400 games, or 4800 with the cutechess-cli -repeat option).
That could pose a problem, because I'm writing a chess-variant engine for a game I created. So I would have to write the reference engine myself!