Rating List

dark_wizzie · Post by **dark_wizzie** » Fri Apr 22, 2016 4:59 am

I am wondering which of the following would produce a more accurate rating list. Let's assume we have 3 engines (A, B, C). These engines behave a little atypically: we can have non-transitive results such as A > B > C > A.

Method One:
Engine A vs Engine B - 50 games
Engine A vs Engine C - 50 games
Engine B vs Engine C - 50 games
Engine B vs Engine A - 50 games
Engine C vs Engine A - 50 games
Engine C vs Engine B - 50 games
Total: 300 games.

Method Two:
Engine A vs Engine B - 50 games
Engine A vs Engine C - 500 games
Engine B vs Engine C - 80 games
Engine B vs Engine A - 50 games
Engine C vs Engine A - 500 games
Engine C vs Engine B - 80 games
Total: 1260 games.

In the second method, some pairings have far more games than others. The sample size is larger, and no pairing has fewer than 100 games in total (counting both orders). However, the testing is skewed toward A vs C.

Since A might beat B, which might beat C, which might beat A, I think having, say, too many C vs A games compared to C vs B games will skew the results. Am I totally off?
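
To make the concern concrete, here is a minimal sketch that compares the two schedules under a made-up cycle. The 60% expected scores are purely an assumption for illustration, and the raw percentage score is only a crude stand-in for what a real rating tool would fit, so treat the numbers as a toy model rather than a prediction.

```python
from collections import defaultdict

# Assumption for illustration only: each engine scores 60% against the
# next engine in the cycle (A beats B, B beats C, C beats A).
EXPECTED = {("A", "B"): 0.6, ("B", "C"): 0.6, ("C", "A"): 0.6}

# Games per ordered pairing, copied from the two methods in the post.
METHOD_ONE = {("A", "B"): 50, ("A", "C"): 50, ("B", "C"): 50,
              ("B", "A"): 50, ("C", "A"): 50, ("C", "B"): 50}
METHOD_TWO = {("A", "B"): 50, ("A", "C"): 500, ("B", "C"): 80,
              ("B", "A"): 50, ("C", "A"): 500, ("C", "B"): 80}


def expected_score(x, y):
    """Expected score of x against y under the assumed cycle."""
    if (x, y) in EXPECTED:
        return EXPECTED[(x, y)]
    return 1.0 - EXPECTED[(y, x)]


def overall_scores(schedule):
    """Expected overall score fraction per engine for a given schedule."""
    points = defaultdict(float)
    games = defaultdict(int)
    for (x, y), n in schedule.items():
        points[x] += n * expected_score(x, y)
        points[y] += n * expected_score(y, x)
        games[x] += n
        games[y] += n
    return {engine: points[engine] / games[engine] for engine in sorted(games)}


if __name__ == "__main__":
    print("Method One:", overall_scores(METHOD_ONE))
    print("Method Two:", overall_scores(METHOD_TWO))
```

Under these assumed figures, Method One leaves every engine at a 50% overall score, while Method Two lifts C and lowers A purely because most of the games are A vs C.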

lkaufman · Post by **lkaufman** » Fri Apr 22, 2016 6:15 am

dark_wizzie wrote: I am wondering which of the following would produce a more accurate rating list. Let's assume we have 3 engines (A, B, C). These engines behave a little atypically: we can have non-transitive results such as A > B > C > A.

Method One:
Engine A vs Engine B - 50 games
Engine A vs Engine C - 50 games
Engine B vs Engine C - 50 games
Engine B vs Engine A - 50 games
Engine C vs Engine A - 50 games
Engine C vs Engine B - 50 games
Total: 300 games.

Method Two:
Engine A vs Engine B - 50 games
Engine A vs Engine C - 500 games
Engine B vs Engine C - 80 games
Engine B vs Engine A - 50 games
Engine C vs Engine A - 500 games
Engine C vs Engine B - 80 games
Total: 1260 games.

In the second method, some pairings have far more games than others. The sample size is larger, and no pairing has fewer than 100 games in total (counting both orders). However, the testing is skewed toward A vs C.

Since A might beat B, which might beat C, which might beat A, I think having, say, too many C vs A games compared to C vs B games will skew the results. Am I totally off?

You are right in principle, although the circular effect you are assuming seems to be fairly rare or at least of small magnitude. I suspect that it might be true if A is latest Stockfish, B is latest Komodo, and C is Houdini 4 running with twice the time to make it competitive, at some bullet level. But I haven't actually run that test.

That is why the lists that run round robins of top N engines, such as IPON and CEGT 5' + 3", are probably a bit more reliable than those that just let testers run whatever they wish. But maybe they would lose testers if they did not give testers freedom of choice, so it's a tradeoff.
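
A balanced round robin of the kind described here can be sketched in a few lines. The function name `double_round_robin` is hypothetical, and reading each ordered pair as "first engine vs second engine" (for example, a colour assignment) is an assumption, not something the lists themselves specify.

```python
from itertools import permutations


def double_round_robin(engines, games_per_pairing):
    """Give every ordered pair of distinct engines the same number of games."""
    return {pair: games_per_pairing for pair in permutations(engines, 2)}


if __name__ == "__main__":
    schedule = double_round_robin(["A", "B", "C"], 50)
    for (first, second), games in schedule.items():
        print(f"Engine {first} vs Engine {second} - {games} games")
    print("Total:", sum(schedule.values()), "games")
```

With three engines and 50 games per ordered pairing this reproduces Method One from the first post.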

dark_wizzie · Post by **dark_wizzie** » Fri Apr 22, 2016 6:38 am

lkaufman wrote: You are right in principle, although the circular effect you are assuming seems to be fairly rare or at least of small magnitude. I suspect that it might be true if A is latest Stockfish, B is latest Komodo, and C is Houdini 4 running with twice the time to make it competitive, at some bullet level. But I haven't actually run that test.

That is why the lists that run round robins of top N engines, such as IPON and CEGT 5' + 3", are probably a bit more reliable than those that just let testers run whatever they wish. But maybe they would lose testers if they did not give testers freedom of choice, so it's a tradeoff.

Thanks for the reply, Larry. I know this is the chess engine subsection of a chess forum, but the rating list I am trying to make isn't actually for chess engines. I want to apply chess engine testing principles though, because it's the best I've got. In what I am testing, there are more instances of this circular effect than in chess. I don't know how to quantify just how large this effect is, though, and I don't think I can.

Interesting stuff, both for chess and non-chess situations.

Gurcan Uckardes · Post by **Gurcan Uckardes** » Fri Apr 22, 2016 11:41 pm

The first method is recommended unless there are significant Elo gaps, which surely does not apply to your circular case.
If the engine population is wide, we always try to:
1- maximize the overall draw ratio
2- maximize the number of opponents
3- minimize the average Elo difference vs opponents
Targets 1 and 3 can cooperate, but 2 is a real headache. You must optimize the target opponent range depending on the Elo distribution of the population.
Personally, I try to avoid pairings beyond +/- 100 Elo as long as there is enough variety of opponents, preferably more than 15% of the population.
Of course, all the numbers above may vary according to your taste. After all, the error margins will shrink more quickly if you follow the strategies above.
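
As a rough illustration of why the draw ratio and the number of games affect the error margin, here is a minimal sketch. It assumes independent games, a normal approximation, and the usual logistic Elo curve; the helper names `elo_from_score` and `elo_margin` are hypothetical and do not reflect how any particular rating list computes its margins.

```python
import math


def elo_from_score(score):
    """Elo difference implied by a score fraction under the logistic model."""
    return -400.0 * math.log10(1.0 / score - 1.0)


def elo_margin(games, draw_ratio, score=0.5, z=1.96):
    """Approximate half-width, in Elo, of the interval around a match result.

    Assumes independent games and a normal approximation (z=1.96 ~ 95%).
    """
    wins = score - draw_ratio / 2.0
    losses = 1.0 - draw_ratio - wins
    if wins < 0.0 or losses < 0.0:
        raise ValueError("score and draw_ratio are inconsistent")
    # Per-game variance of the result (win=1, draw=0.5, loss=0).
    variance = (wins * (1.0 - score) ** 2
                + draw_ratio * (0.5 - score) ** 2
                + losses * (0.0 - score) ** 2)
    std_error = math.sqrt(variance / games)
    # Clamp so the logistic conversion stays defined for tiny samples.
    upper = min(score + z * std_error, 1.0 - 1e-6)
    lower = max(score - z * std_error, 1e-6)
    return (elo_from_score(upper) - elo_from_score(lower)) / 2.0


if __name__ == "__main__":
    for games in (100, 1000):
        for draw_ratio in (0.3, 0.6):
            margin = elo_margin(games, draw_ratio)
            print(f"{games} games, {draw_ratio:.0%} draws: +/- {margin:.1f} Elo")
```

In this simplified model, more games and a higher draw ratio both reduce the per-match margin, which is consistent with target 1 above.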
