Testing a set of engines


konsolas
Posts: 182
Joined: Sun Jun 12, 2016 5:44 pm
Location: London
Full name: Vincent

Testing a set of engines

Post by konsolas »

Is there a standard way of determining how many matches are needed to rank a list of engines in order of strength?

I notice that CCRL, CEGT and other lists use hundreds upon hundreds of matches per engine, but is there a lower bound for the number of matches that need to be run?
AlvaroBegue
Posts: 931
Joined: Tue Mar 09, 2010 3:46 pm
Location: New York
Full name: Álvaro Begué (RuyDos)

Re: Testing a set of engines

Post by AlvaroBegue »

If you look at the CCRL table that lists engines in decreasing order of strength, you'll see a column "LOS" (likelihood of superiority). That is, roughly, the probability that the engine on one line is actually stronger than the one below it, given the game results; equivalently, something like one minus the p-value of the hypothesis that the two are equally strong. Running more games pushes those numbers closer to 100%, but there are still cases where the number stays near 50%, meaning we can't really tell which of the two engines is stronger.
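
For a concrete feel, here is a minimal sketch of the usual LOS formula, under the common normal approximation in which draws carry no information about who is stronger (my own illustration, not necessarily CCRL's exact computation):

    from math import erf, sqrt

    def los(wins, losses):
        # Likelihood of superiority under a normal approximation.
        # Draws say nothing about which engine is stronger, so only
        # decisive games enter the formula.
        if wins + losses == 0:
            return 0.5  # no decisive games: no evidence either way
        return 0.5 * (1.0 + erf((wins - losses) / sqrt(2.0 * (wins + losses))))

    print(los(60, 40))  # 60 wins, 40 losses -> ~0.977

With 60 wins to 40 losses you already get an LOS around 97.7%; at 52 wins to 48 it is only about 65%, which is why near-equal engines need so many games.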

Perhaps it would be better to think of the result of your matches not as a rank, but as estimates of Elo strength, which come with an error bar. Your error bars will be inversely proportional to the square root of the number of games you play. What error bars are acceptable is completely up to you.
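
To make that square-root rule concrete, here is a back-of-the-envelope sketch, assuming the standard logistic Elo curve and estimating the per-game variance from the observed win/draw/loss counts (again my own illustration, not what the rating lists actually run):

    from math import sqrt, log10

    def elo_and_error(wins, draws, losses, z=1.96):
        # Elo point estimate plus a ~95% interval from one match,
        # using the logistic Elo curve. Assumes the score is strictly
        # between 0% and 100%, otherwise the logarithm blows up.
        n = wins + draws + losses
        score = (wins + 0.5 * draws) / n           # scoring fraction
        elo = -400.0 * log10(1.0 / score - 1.0)    # invert the Elo curve
        var = (wins * (1.0 - score) ** 2           # sample variance of the
               + draws * (0.5 - score) ** 2        # per-game score (1, 0.5, 0)
               + losses * (0.0 - score) ** 2) / n
        se = sqrt(var / n)                         # shrinks as 1/sqrt(n)
        lo = -400.0 * log10(1.0 / (score - z * se) - 1.0)
        hi = -400.0 * log10(1.0 / (score + z * se) - 1.0)
        return elo, lo, hi

    print(elo_and_error(120, 160, 120))  # even score over 400 games:
                                         # 0 Elo, roughly -26 .. +26

So even 400 games still leave an error bar of roughly +/- 26 Elo; to get +/- 13 you would need four times as many games.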
hgm
Posts: 27787
Joined: Fri Mar 10, 2006 10:06 am
Location: Amsterdam
Full name: H G Muller

Re: Testing a set of engines

Post by hgm »

It depends on how large the strength difference is. If the list consists of, say, Stockfish, Fruit, Fairy-Max, Chad's Chess and POS, 10 games per pairing would be more than enough, as every match would end in 10-0. If the list consists of engines that to all intents and purposes are equally strong, you would need thousands of games to determine who gets the 50.5% result against whom.
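
To put rough numbers on that, you can invert the normal approximation and ask how many games it takes to resolve a given Elo gap at ~95% confidence. A back-of-the-envelope sketch (the per-game standard deviation of 0.4 is an assumption; it is fairly typical when a good fraction of games are drawn):

    def games_needed(elo_diff, z=1.96, sigma=0.4):
        # Rough number of games to detect an Elo gap at ~95% confidence.
        expected = 1.0 / (1.0 + 10.0 ** (-elo_diff / 400.0))  # expected score
        edge = expected - 0.5                                  # margin over 50%
        return (z * sigma / edge) ** 2

    for d in (200, 50, 10, 3.5):   # 3.5 Elo is about a 50.5% score
        print(d, round(games_needed(d)))

This prints roughly 9, 120, 3000 and 24000 games: a 200 Elo gap resolves in a 10-game match, while a 50.5% edge indeed takes tens of thousands of games.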
Robert Pope
Posts: 558
Joined: Sat Mar 25, 2006 8:27 pm

Re: Testing a set of engines

Post by Robert Pope »

Perhaps the practical answer is to run a smallish tournament and feed the games into bayeselo or ordo. That will give you the estimated relative ratings together with their error bars. If some engines overlap within their Elo +/- error bars and you want confidence in the ranking, increase the number of games, either for all of them or just for the overlapping ones. How many games you need is driven by the sqrt() rule: error bars shrink as one over the square root of the game count, so play until they are as small as you need them (see the sketch below).
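
As a small illustration of that sqrt() rule, here is a sketch (my own, with made-up numbers) of how much longer to run when two engines' intervals overlap:

    def extra_games_factor(elo_a, bar_a, elo_b, bar_b):
        # Error bars shrink as 1/sqrt(games), so to shrink the combined
        # bar down to the rating gap, scale the game count by
        # (combined / gap) squared. Assumes a nonzero gap.
        gap = abs(elo_a - elo_b)
        combined = bar_a + bar_b   # the intervals overlap iff gap < combined
        if gap >= combined:
            return 1.0             # already separated
        return (combined / gap) ** 2

    # two engines 6 Elo apart, each rated +/- 12 Elo:
    print(extra_games_factor(2406, 12, 2400, 12))  # -> 16.0 (16x the games)

Closely matched engines get expensive fast: that pair needs sixteen times the games before the error bars separate.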