I am wondering which of the following would produce a more accurate rating list. Let's assume we have 3 engines (A, B, C). These engines behave a little atypically: results can be circular, with A beating B, B beating C, and C beating A (A>B>C>A).
Method One:
Engine A vs Engine B - 50 games
Engine A vs Engine C - 50 games
Engine B vs Engine C - 50 games
Engine B vs Engine A - 50 games
Engine C vs Engine A - 50 games
Engine C vs Engine B - 50 games
Total: 300 games.
Method Two:
Engine A vs Engine B - 50 games
Engine A vs Engine C - 500 games
Engine B vs Engine C - 80 games
Engine B vs Engine A - 50 games
Engine C vs Engine A - 500 games
Engine C vs Engine B - 80 games
Total: 1260 games.
In the second method, some pairings have far more games than others. The sample size is larger, and no pairing has fewer than 100 games in total (counting both directions). However, the testing is skewed toward A vs C.
Since A might beat B, which might beat C, which might beat A, I think having, say, far more C vs A games than C vs B games will skew the results. Am I totally off?
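A rough way to see the concern is to plug made-up numbers into both schedules. The sketch below assumes a hypothetical circular table in which each engine scores 60% against the next one in the cycle; that figure and the names `CIRCULAR`, `expected_score`, and `overall_scores` are invented for illustration. It only compares raw score percentages, which is a simplification of what a real rating tool does.

```python
# Hypothetical expected scores for a circular case (A>B>C>A). The 0.60
# figure is invented for illustration; it is not measured data.
CIRCULAR = {("A", "B"): 0.60, ("B", "C"): 0.60, ("C", "A"): 0.60}


def expected_score(x, y):
    """Expected score of x against y under the assumed circular table."""
    if (x, y) in CIRCULAR:
        return CIRCULAR[(x, y)]
    return 1.0 - CIRCULAR[(y, x)]


# Games per listed pairing, copied from the two methods in the post.
METHOD_ONE = {
    ("A", "B"): 50, ("A", "C"): 50, ("B", "C"): 50,
    ("B", "A"): 50, ("C", "A"): 50, ("C", "B"): 50,
}
METHOD_TWO = {
    ("A", "B"): 50, ("A", "C"): 500, ("B", "C"): 80,
    ("B", "A"): 50, ("C", "A"): 500, ("C", "B"): 80,
}


def overall_scores(schedule):
    """Return each engine's expected overall score fraction for a schedule."""
    points, games = {}, {}
    for (x, y), n in schedule.items():
        s = expected_score(x, y)
        # Credit both sides of the pairing with their expected points.
        for engine, pts in ((x, n * s), (y, n * (1.0 - s))):
            points[engine] = points.get(engine, 0.0) + pts
            games[engine] = games.get(engine, 0) + n
    return {e: points[e] / games[e] for e in sorted(points)}


for name, schedule in (("Method One", METHOD_ONE), ("Method Two", METHOD_TWO)):
    scores = {e: round(s, 3) for e, s in overall_scores(schedule).items()}
    print(name, sum(schedule.values()), "games", scores)
```

Under these assumed numbers, the balanced schedule leaves all three engines at 50%, while the unbalanced one puts C ahead of B ahead of A, purely because the A vs C pairing dominates the sample.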
Rating List
Moderator: Ras
-
dark_wizzie
- Posts: 79
- Joined: Mon Jun 30, 2014 3:48 pm
- Full name: Erica Lin
Rating List
Dark_wizzie, aka Eric Lin. ABK book tester, crappy OTB player. Ex Chess2u mod. Likes pizza.
-
lkaufman
- Posts: 6300
- Joined: Sun Jan 10, 2010 6:15 am
- Location: Maryland USA
- Full name: Larry Kaufman
Re: Rating List
You are right in principle, although the circular effect you are assuming seems to be fairly rare or at least of small magnitude. I suspect that it might be true if A is the latest Stockfish, B is the latest Komodo, and C is Houdini 4 running with twice the time to make it competitive, at some bullet level. But I haven't actually run that test.
That is why the lists that run round robins of top N engines, such as IPON and CEGT 5' + 3", are probably a bit more reliable than those that just let testers run whatever they wish. But maybe they would lose testers if they did not give testers freedom of choice, so it's a tradeoff.
Komodo rules!
-
dark_wizzie
- Posts: 79
- Joined: Mon Jun 30, 2014 3:48 pm
- Full name: Erica Lin
Re: Rating List
Thanks for the reply, Larry. I know this is the chess engine subsection of a chess forum, but the rating list I am trying to make isn't actually for chess engines. I want to apply chess engine testing principles though, because it's the best I've got. In what I am testing, this circular effect occurs more often than in chess. I don't know just how large the effect is, though, and I don't think I can quantify it.
Interesting stuff, both for chess and non-chess situations.
-
Gurcan Uckardes
- Posts: 196
- Joined: Wed Oct 29, 2014 12:42 am
Re: Rating List
The first method is recommended unless there are significant Elo gaps, which surely is not the case in your circular scenario.
If the engine population is wide, we always try to:
1- maximize the overall draw ratio
2- maximize the number of opponents
3- minimize the average Elo diff vs opponents
Targets 1 and 3 can cooperate, but 2 is a real headache. You must optimize the target opponent range depending on the Elo distribution of the population.
Personally, I try to avoid pairings beyond +/- 100 Elo, as long as there is enough variety of opponents, preferably more than 15% of the population.
Of course, all of the above numbers may vary according to your taste. In any case, the error margins will shrink more quickly if you follow the above strategies.
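To put rough numbers on targets 1 and 3, here is a minimal sketch of how the draw ratio and the Elo difference feed into the error margin of a single pairing. It assumes the standard logistic Elo expectation and a first-order (delta method) conversion from score error to Elo error; the function names and the 95% z-value are choices made for this illustration, not something taken from any particular rating tool.

```python
import math


def elo_expected_score(diff):
    """Expected score when rated `diff` Elo above the opponent (logistic Elo)."""
    return 1.0 / (1.0 + 10.0 ** (-diff / 400.0))


def elo_error_margin(diff, draw_ratio, games, z=1.96):
    """Approximate +/- Elo margin for one pairing (assumed delta-method sketch).

    diff       -- true Elo difference between the two engines
    draw_ratio -- fraction of games drawn (0.0 to 1.0)
    games      -- number of games played in the pairing
    """
    s = elo_expected_score(diff)
    win = s - draw_ratio / 2.0
    loss = 1.0 - win - draw_ratio
    if win < 0.0 or loss < 0.0:
        raise ValueError("draw ratio too high for this Elo difference")
    # Per-game variance of the score, with a draw counted as half a point.
    variance = win + draw_ratio / 4.0 - s * s
    score_se = math.sqrt(variance / games)
    # Slope of the logistic curve: score change per Elo point at this diff.
    slope = math.log(10.0) / 400.0 * s * (1.0 - s)
    return z * score_se / slope


for diff, draws in ((0, 0.0), (0, 0.5), (300, 0.0)):
    margin = elo_error_margin(diff, draws, games=100)
    print(f"diff={diff:4d} draws={draws:.0%} margin=+/-{margin:.1f} Elo")
```

In this approximation, a higher draw ratio lowers the per-game variance, and a smaller Elo difference keeps the pairing on the steep part of the curve, so both tighten the margin for the same number of games.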
My blog for Android users: http://chesstroid.blogspot.com