Testing a set of engines


konsolas
Posts: 182
Joined: Sun Jun 12, 2016 5:44 pm
Location: London
Full name: Vincent

Testing a set of engines

Post by konsolas »

Is there a standard way of determining how many matches are needed to rank a list of engines in order of strength?

I notice that CCRL, CEGT and other lists use hundreds upon hundreds of matches per engine, but is there a lower bound for the number of matches that need to be run?
AlvaroBegue
Posts: 931
Joined: Tue Mar 09, 2010 3:46 pm
Location: New York
Full name: Álvaro Begué (RuyDos)

Re: Testing a set of engines

Post by AlvaroBegue »

If you look at the CCRL table that lists engines in decreasing order of strength, you'll see a column "LOS" (likelihood of superiority). That is, roughly, the probability that the engine on one line is actually stronger than the one below it, given the game results; equivalently, something like one minus the p-value of the hypothesis that the two are equally strong. Running more games pushes those numbers closer to 100%, but there are still cases where the number stays near 50%, meaning we can't really tell which of the two engines is stronger.
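
For a concrete feel, here is a minimal sketch of the usual LOS formula, under the common normal approximation in which draws carry no information about who is stronger (my own illustration, not necessarily CCRL's exact computation):

    from math import erf, sqrt

    def los(wins, losses):
        # Likelihood of superiority under a normal approximation.
        # Draws say nothing about which engine is stronger, so only
        # decisive games enter the formula.
        if wins + losses == 0:
            return 0.5  # no decisive games: no evidence either way
        return 0.5 * (1.0 + erf((wins - losses) / sqrt(2.0 * (wins + losses))))

    print(los(60, 40))  # 60 wins, 40 losses -> ~0.977

With 60 wins to 40 losses you already get an LOS around 97.7%; at 52 wins to 48 it is only about 65%, which is why near-equal engines need so many games.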

Perhaps it would be better to think of the result of your matches not as a rank, but as estimates of Elo strength, which come with an error bar. Your error bars will be inversely proportional to the square root of the number of games you play. What error bars are acceptable is completely up to you.
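
To make that square-root rule concrete, here is a back-of-the-envelope sketch, assuming the standard logistic Elo curve and estimating the per-game variance from the observed win/draw/loss counts (again my own illustration, not what the rating lists actually run):

    from math import sqrt, log10

    def elo_and_error(wins, draws, losses, z=1.96):
        # Elo point estimate plus a ~95% interval from one match,
        # using the logistic Elo curve. Assumes the score is strictly
        # between 0% and 100%, otherwise the logarithm blows up.
        n = wins + draws + losses
        score = (wins + 0.5 * draws) / n           # scoring fraction
        elo = -400.0 * log10(1.0 / score - 1.0)    # invert the Elo curve
        var = (wins * (1.0 - score) ** 2           # sample variance of the
               + draws * (0.5 - score) ** 2        # per-game score (1, 0.5, 0)
               + losses * (0.0 - score) ** 2) / n
        se = sqrt(var / n)                         # shrinks as 1/sqrt(n)
        lo = -400.0 * log10(1.0 / (score - z * se) - 1.0)
        hi = -400.0 * log10(1.0 / (score + z * se) - 1.0)
        return elo, lo, hi

    print(elo_and_error(120, 160, 120))  # even score over 400 games:
                                         # 0 Elo, roughly -26 .. +26

So even 400 games still leave an error bar of roughly +/- 26 Elo; to get +/- 13 you would need four times as many games.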
hgm
Posts: 27787
Joined: Fri Mar 10, 2006 10:06 am
Location: Amsterdam
Full name: H G Muller

Re: Testing a set of engines

Post by hgm »

It depends on how large the strength difference is. If the list consists of, say, Stockfish, Fruit, Fairy-Max, Chad's Chess and POS, 10 games per pairing would be more than enough, as every match would end in 10-0. If the list consists of engines that to all intents and purposes are equally strong, you would need thousands of games to determine who gets the 50.5% result against whom.
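
To put rough numbers on that, you can invert the normal approximation and ask how many games it takes to resolve a given Elo gap at ~95% confidence. A back-of-the-envelope sketch (the per-game standard deviation of 0.4 is an assumption; it is fairly typical when a good fraction of games are drawn):

    def games_needed(elo_diff, z=1.96, sigma=0.4):
        # Rough number of games to detect an Elo gap at ~95% confidence.
        expected = 1.0 / (1.0 + 10.0 ** (-elo_diff / 400.0))  # expected score
        edge = expected - 0.5                                  # margin over 50%
        return (z * sigma / edge) ** 2

    for d in (200, 50, 10, 3.5):   # 3.5 Elo is about a 50.5% score
        print(d, round(games_needed(d)))

This prints roughly 9, 120, 3000 and 24000 games: a 200 Elo gap resolves in a 10-game match, while a 50.5% edge indeed takes tens of thousands of games.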
Robert Pope
Posts: 558
Joined: Sat Mar 25, 2006 8:27 pm

Re: Testing a set of engines

Post by Robert Pope »

Perhaps the practical answer is to run a smallish tournament and feed the games into bayeselo or ordo. That will give you the estimated relative ratings together with their error bars. If some engines overlap within their Elo +/- error bars and you want confidence in the ranking, increase the number of games, either for all of them or just for the overlapping ones. How many games you need is driven by the sqrt() rule: error bars shrink as one over the square root of the game count, so play until they are as small as you need them (see the sketch below).
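
As a small illustration of that sqrt() rule, here is a sketch (my own, with made-up numbers) of how much longer to run when two engines' intervals overlap:

    def extra_games_factor(elo_a, bar_a, elo_b, bar_b):
        # Error bars shrink as 1/sqrt(games), so to shrink the combined
        # bar down to the rating gap, scale the game count by
        # (combined / gap) squared. Assumes a nonzero gap.
        gap = abs(elo_a - elo_b)
        combined = bar_a + bar_b   # the intervals overlap iff gap < combined
        if gap >= combined:
            return 1.0             # already separated
        return (combined / gap) ** 2

    # two engines 6 Elo apart, each rated +/- 12 Elo:
    print(extra_games_factor(2406, 12, 2400, 12))  # -> 16.0 (16x the games)

Closely matched engines get expensive fast: that pair needs sixteen times the games before the error bars separate.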