Is there a standard way of determining how many matches are needed to rank a list of engines in order of strength?
I notice that CCRL, CEGT and other lists use hundreds upon hundreds of matches per engine, but is there a lower bound for the number of matches that need to be run?
Testing a set of engines
-
- Posts: 182
- Joined: Sun Jun 12, 2016 5:44 pm
- Location: London
- Full name: Vincent
-
- Posts: 931
- Joined: Tue Mar 09, 2010 3:46 pm
- Location: New York
- Full name: Álvaro Begué (RuyDos)
Re: Testing a set of engines
If you look at the CCRL table that lists engines in decreasing order of strength, you'll see a column "LOS" (likelihood of superiority). That is something like one minus the p-value for a test of whether the engine on one line is actually stronger than the one below it. Running more games pushes those numbers closer to 100%, but there will still be cases where the number stays near 50%, meaning we can't really tell which of the two engines is stronger.
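To make the LOS idea concrete, here is a minimal sketch of the common normal-approximation formula for a single head-to-head match. This is an assumption for illustration, not necessarily how CCRL fills in its column (their numbers come from a rating tool run over the whole pool of games); the function name `los` is made up.

```python
import math


def los(wins: int, losses: int) -> float:
    """Approximate likelihood of superiority from one head-to-head match.

    Draws are left out on purpose: in this approximation only decisive
    games carry information about which engine is stronger.
    """
    decisive = wins + losses
    if decisive == 0:
        return 0.5  # no decisive games, so no evidence either way
    return 0.5 * (1.0 + math.erf((wins - losses) / math.sqrt(2.0 * decisive)))


if __name__ == "__main__":
    # Hypothetical match results, purely for illustration.
    for w, l in [(10, 10), (55, 45), (60, 40)]:
        print(f"+{w} -{l}: LOS = {los(w, l):.3f}")
```

An even score gives exactly 50%, and the value only creeps towards 100% as the win/loss gap grows relative to the square root of the number of decisive games.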
Perhaps it would be better to think of the result of your matches not as a rank, but as estimates of Elo strength, which come with an error bar. Your error bars will be inversely proportional to the square root of the number of games you play. What error bars are acceptable is completely up to you.
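A rough sketch of what such an error bar looks like for a single match, assuming independent games and a normal approximation for the score. The function names and the 95% default (`z=1.96`) are my own choices for illustration; rating tools use more careful models.

```python
import math


def elo_from_score(score: float) -> float:
    """Convert an expected score into an Elo difference (logistic model)."""
    eps = 1e-9  # clamp so 0% and 100% scores do not divide by zero
    s = min(max(score, eps), 1.0 - eps)
    return -400.0 * math.log10(1.0 / s - 1.0)


def elo_with_error(wins: int, draws: int, losses: int, z: float = 1.96):
    """Return (elo, low, high) for a match result.

    The interval is built on the score and then mapped to Elo, so it is
    slightly asymmetric for lopsided results.
    """
    n = wins + draws + losses
    if n == 0:
        raise ValueError("need at least one game")
    score = (wins + 0.5 * draws) / n
    # Per-game variance of the result (1, 0.5 or 0) around the mean score.
    var = (wins * (1.0 - score) ** 2
           + draws * (0.5 - score) ** 2
           + losses * (0.0 - score) ** 2) / n
    se = math.sqrt(var / n)  # standard error shrinks like 1/sqrt(n)
    return (elo_from_score(score),
            elo_from_score(score - z * se),
            elo_from_score(score + z * se))


if __name__ == "__main__":
    # Same result ratio, four times as many games: the bar roughly halves.
    for result in [(100, 50, 100), (400, 200, 400)]:
        elo, low, high = elo_with_error(*result)
        print(result, f"Elo {elo:+.1f}  [{low:+.1f}, {high:+.1f}]")
```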
-
- Posts: 27793
- Joined: Fri Mar 10, 2006 10:06 am
- Location: Amsterdam
- Full name: H G Muller
Re: Testing a set of engines
It depends on how large the strength difference is. If the list consists of, say, Stockfish, Fruit, Fairy-Max, Chad's Chess and POS, 10 games per pairing would be more than enough, as every match would end in 10-0. If the list consists of engines that to all intents and purposes are equally strong, you would need thousands of games to determine who gets the 50.5% result against whom.
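A back-of-the-envelope sketch of that dependence, assuming independent games and a normal approximation: it estimates how many games are needed before a given expected score is distinguishable from 50% at about 95% confidence. The function name and the `draw_ratio` parameter are hypothetical, and the approximation is crude for very lopsided pairings, where the answer is simply "a handful".

```python
import math


def games_needed(expected_score: float, draw_ratio: float = 0.0,
                 z: float = 1.96) -> int:
    """Games needed so that z standard errors fit inside the score edge."""
    edge = abs(expected_score - 0.5)
    if edge == 0.0:
        raise ValueError("equal engines can never be separated")
    win = expected_score - draw_ratio / 2.0
    loss = 1.0 - win - draw_ratio
    if win < 0.0 or loss < 0.0:
        raise ValueError("draw_ratio is inconsistent with expected_score")
    # Per-game variance of the result (win=1, draw=0.5, loss=0).
    var = win + draw_ratio / 4.0 - expected_score ** 2
    return max(1, math.ceil(z * z * var / (edge * edge)))


if __name__ == "__main__":
    # Illustrative inputs only; the draw ratios are made-up values.
    for score, draws in [(0.505, 0.0), (0.505, 0.6), (0.60, 0.3), (0.95, 0.0)]:
        print(f"score {score:.3f}, draws {draws:.0%}: "
              f"~{games_needed(score, draws)} games")
```

A higher draw ratio lowers the per-game variance, so fewer games are needed for the same edge, but a 50.5% score still takes thousands of games either way.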
-
- Posts: 558
- Joined: Sat Mar 25, 2006 8:27 pm
Re: Testing a set of engines
Perhaps the practical answer is to run a smallish tournament and feed the games into bayeselo or ordo. That will give you the estimated relative ratings and the error bars. If there are engines whose Elo +/- error bars overlap and you want confidence in the ranking, increase the number of games, either for all engines or just for the ones that overlap. The number of games you need follows from the sqrt() rule: play enough to get the error bars as small as you need them.
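As a sketch of that workflow, the snippet below takes a rating list with error bars (typed in by hand here; it does not parse bayeselo or ordo output), finds neighbours in the ranking whose intervals overlap, and uses the sqrt() rule to guess by what factor the game count would have to grow to separate them. The names and numbers are made up, and the estimate assumes the rating gap itself stays put, which it will not exactly.

```python
import math


def overlapping_neighbours(ratings):
    """ratings: iterable of (name, elo, error) tuples.

    Returns a list of (stronger, weaker, games_factor) for adjacent engines
    whose error bars overlap. games_factor is the multiplier on the number
    of games that would shrink both bars enough to clear the current gap.
    """
    ranked = sorted(ratings, key=lambda r: r[1], reverse=True)
    unresolved = []
    for (name_a, elo_a, err_a), (name_b, elo_b, err_b) in zip(ranked, ranked[1:]):
        gap = elo_a - elo_b
        if gap < err_a + err_b:  # the two intervals overlap
            # Bars scale as 1/sqrt(k) when the game count is multiplied by k.
            factor = ((err_a + err_b) / gap) ** 2 if gap > 0 else math.inf
            unresolved.append((name_a, name_b, factor))
    return unresolved


if __name__ == "__main__":
    # Hypothetical rating list: (engine, Elo, +/- error).
    table = [("EngineA", 100, 20), ("EngineB", 90, 20), ("EngineC", 0, 20)]
    for a, b, k in overlapping_neighbours(table):
        print(f"{a} vs {b}: roughly {k:.0f}x the games needed")
```

Only the overlapping pairs need the extra games, which is usually much cheaper than rerunning the whole tournament.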