Repeating games with switched colors reduces Elo error. All matches should be done like this

mmt · Post by **mmt** » Tue Feb 25, 2020 1:42 pm

This only applies to tests with the same opening book for both sides. It makes intuitive sense that the results will be more accurate if player A and player B play both sides of all openings. But I couldn't find any empirical results so I wrote a utility to test it out myself.

First, I've compared predictions that can be made after the first n games (multiple runs ordered randomly for higher accuracy) about the rest of the match. The results of matches with switched colors give more accurate predictions about the rest of the match.

Then I used the bootstrap method of Elo error estimation. Playing with switched colors reduces the Elo error. Tools like Ordo do not take the pairing of switched-color games into account, and as a result the error they report is too large. Instead of taking a match with 100 games (50 pairs) and picking individual games, they should treat this match as 50 games, each having a result of 0, 0.5, 1, 1.5, or 2.

I can run this test for many matches if somebody wants to see the hard numbers. But it's clear that matches not using the switched color system are unnecessarily wasting CPU/GPU time by having to run more games to get the same accuracy as the matches with switched colors.
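The comparison described above can be sketched roughly as follows. This is not mmt's utility: the toy match generator, the opening bias of ±0.25, and every function name here are assumptions made up for illustration. It bootstraps the same simulated match twice, once resampling individual games and once resampling color-switched pairs scored 0 to 2, and compares the spread of the resulting Elo estimates.

```python
import math
import random

def elo_from_score(score):
    """Convert a score fraction to an Elo difference (logistic model)."""
    score = min(max(score, 1e-6), 1 - 1e-6)  # clamp to avoid log(0)
    return -400.0 * math.log10(1.0 / score - 1.0)

def bootstrap_elo_sd(units, games_per_unit, resamples=1000, seed=0):
    """Std. dev. of the Elo estimate when resampling `units` with replacement.

    A unit is either one game (0, 0.5, 1) or one color-switched pair
    (0, 0.5, 1, 1.5, 2), always scored from player A's point of view.
    """
    rng = random.Random(seed)
    n = len(units)
    elos = []
    for _ in range(resamples):
        sample = rng.choices(units, k=n)
        elos.append(elo_from_score(sum(sample) / (n * games_per_unit)))
    mean = sum(elos) / len(elos)
    return math.sqrt(sum((e - mean) ** 2 for e in elos) / (len(elos) - 1))

def play_game(rng, a_is_white, white_edge):
    """Hypothetical game model: the opening shifts win/loss odds toward one color."""
    edge = white_edge if a_is_white else -white_edge
    r = rng.random()
    if r < 0.35 + edge:   # A wins
        return 1.0
    if r < 0.75 + edge:   # draw (fixed 40% in this toy model)
        return 0.5
    return 0.0            # A loses

def simulate_pairs(n_pairs, seed=1):
    """Toy match: each opening is played twice with colors switched."""
    rng = random.Random(seed)
    pairs = []
    for _ in range(n_pairs):
        white_edge = rng.choice([-0.25, 0.25])  # made-up opening bias
        pairs.append((play_game(rng, True, white_edge),
                      play_game(rng, False, white_edge)))
    return pairs

pairs = simulate_pairs(500)
games = [g for pair in pairs for g in pair]
pair_totals = [a + b for a, b in pairs]

sd_games = bootstrap_elo_sd(games, games_per_unit=1)
sd_pairs = bootstrap_elo_sd(pair_totals, games_per_unit=2)
print(f"Elo SD, resampling games: {sd_games:.1f}")
print(f"Elo SD, resampling pairs: {sd_pairs:.1f}")
```

In this toy model the two games of a pair are negatively correlated because the opening favors one color, so resampling pairs should give the smaller spread. How large the effect is in real matches depends on how unbalanced the book is.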

Ovyron · Post by **Ovyron** » Tue Feb 25, 2020 7:43 pm

That's strange. If engine A plays the Sicilian really well and engine B plays it really badly, you would expect engine B to end up with a higher Elo if it never played it at all, instead of being forced to play it with switched colors. If engine B eventually gets the same Elo and gets there faster by playing it, that's mysterious, because one would expect engine B's Elo to be higher if it never had to play book positions that made it under-perform.

mwyoung · Post by **mwyoung** » Tue Feb 25, 2020 7:57 pm

mmt wrote: ↑Tue Feb 25, 2020 1:42 pm This only applies to tests with the same opening book for both sides. It makes intuitive sense that the results will be more accurate if player A and player B play both sides of all openings. But I couldn't find any empirical results so I wrote a utility to test it out myself.

First, I've compared predictions that can be made after the first n games (multiple runs ordered randomly for higher accuracy) about the rest of the match. The results of matches with switched colors give more accurate predictions about the rest of the match.

Then I used the bootstrap method of Elo error estimation. Playing with switched colors reduces the Elo error. Tools like Ordo do not take the pairing of switched-color games into account, and as a result the error they report is too large. Instead of taking a match with 100 games (50 pairs) and picking individual games, they should treat this match as 50 games, each having a result of 0, 0.5, 1, 1.5, or 2.

I can run this test for many matches if somebody wants to see the hard numbers. But it's clear that matches not using the switched color system are unnecessarily wasting CPU/GPU time by having to run more games to get the same accuracy as the matches with switched colors.

I find it better to use repeated openings with color switching. I do not use a set of positions but a good opening book, though it is still a book of games. And today's engines do, from time to time, find errors in it that lose straight out of the opening. Playing reverse colors cancels out that opening to a draw result.

Deberger · Post by **Deberger** » Tue Feb 25, 2020 7:58 pm

I agree:

https://pdfs.semanticscholar.org/d20a/9 ... 12c033.pdf

Ovyron · Post by **Ovyron** » Tue Feb 25, 2020 8:05 pm

mwyoung wrote: ↑Tue Feb 25, 2020 7:57 pm Playing reverse colors cancels out that opening to a draw result.

It's not cancelled, it's still there, contaminating the results you'd have gotten if a good book line had been played instead. Imagine you play Stockfish v Crafty and they draw that opening even though Stockfish could have easily won with a better line.

All books are meant to do is add variety, not contaminate the results with bad lines. That's why the only good way is to have a specific variety book for each engine, one that takes care not to leave it in out-of-book positions it doesn't like, and this will not require switched colors (because each engine has different preferences).

NOTE: This isn't specific to engines, that's why you don't have humans playing chess with switched colors.

Alayan · Post by **Alayan** » Tue Feb 25, 2020 8:46 pm

When an engine is rated, it should not be rated only on the tiny subset of chess positions it likes, which is what it would reach with an opening book meant to exclude most of the lines it does worse in.

An engine should instead be tested over a wide range of positions, in order to measure its general chess ability. This gives much better insight into the engine's ability to analyze general positions. The opening books used in most rating lists favor regular opening lines over offbeat lines, but that still gives much more diversity than playing from the start position.

Switching colors when forcing opening lines onto engines is standard procedure. This has been done at fishtest, TCEC, etc. for a long time.

mwyoung · Post by **mwyoung** » Tue Feb 25, 2020 9:07 pm

Ovyron wrote: ↑Tue Feb 25, 2020 8:05 pm
mwyoung wrote: ↑Tue Feb 25, 2020 7:57 pm Playing reverse colors cancels out that opening to a draw result.
It's not cancelled, it's still there, contaminating the results you'd have gotten if a good book line had been played instead. Imagine you play Stockfish v Crafty and they draw that opening even though Stockfish could have easily won with a better line.

All books are meant to do is add variety, not contaminate the results with bad lines. That's why the only good way is to have a specific variety book for each engine, one that takes care not to leave it in out-of-book positions it doesn't like, and this will not require switched colors (because each engine has different preferences).

NOTE: This isn't specific to engines, that's why you don't have humans playing chess with switched colors.

I think I said that. Not cancelled, cancels out to a draw result. 1-1 = draw.

Michel · Post by **Michel** » Tue Feb 25, 2020 10:12 pm

mmt wrote: ↑Tue Feb 25, 2020 1:42 pm This only applies to tests with the same opening book for both sides. It makes intuitive sense that the results will be more accurate if player A and player B play both sides of all openings. But I couldn't find any empirical results so I wrote a utility to test it out myself.

First, I've compared predictions that can be made after the first n games (multiple runs ordered randomly for higher accuracy) about the rest of the match. The results of matches with switched colors give more accurate predictions about the rest of the match.

Then I used the bootstrap method of Elo error estimation. Playing with switched colors reduces the Elo error. Tools like Ordo do not take the pairing of switched-color games into account, and as a result the error they report is too large. Instead of taking a match with 100 games (50 pairs) and picking individual games, they should treat this match as 50 games, each having a result of 0, 0.5, 1, 1.5, or 2.

I can run this test for many matches if somebody wants to see the hard numbers. But it's clear that matches not using the switched color system are unnecessarily wasting CPU/GPU time by having to run more games to get the same accuracy as the matches with switched colors.

Congratulations. You reinvented the pentanomial model

https://github.com/glinscott/fishtest/c ... def14a68f6
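For readers unfamiliar with the term, here is a minimal sketch of the idea behind a pentanomial error estimate. It is not the fishtest code, and the function name is made up for illustration. The trinomial view treats every game as an independent win/draw/loss sample, while the pentanomial view treats each color-switched pair as one sample whose per-game score is 0, 0.25, 0.5, 0.75, or 1.

```python
import math

def standard_errors(pairs):
    """Return (trinomial_se, pentanomial_se) of the mean per-game score.

    `pairs` is a list of (game1, game2) results for player A, each 0, 0.5
    or 1, where both games use the same opening with colors switched.
    """
    games = [g for pair in pairs for g in pair]
    mean = sum(games) / len(games)

    # Trinomial: every game counts as an independent sample.
    var_game = sum((g - mean) ** 2 for g in games) / len(games)
    trinomial_se = math.sqrt(var_game / len(games))

    # Pentanomial: the pair is the sample, so correlation between the two
    # games of a pair is reflected in the variance.
    pair_scores = [(a + b) / 2 for a, b in pairs]
    var_pair = sum((s - mean) ** 2 for s in pair_scores) / len(pairs)
    pentanomial_se = math.sqrt(var_pair / len(pairs))
    return trinomial_se, pentanomial_se

# Extreme case: ten openings, each decided by color alone, so every pair ends 1-1.
tri, penta = standard_errors([(1, 0)] * 10)
print(tri, penta)  # tri is about 0.112, penta is 0.0
```

In this extreme case the per-game view sees 20 decisive games and reports a large uncertainty, while the per-pair view sees ten identical 1-1 results and reports none.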

mmt · Post by **mmt** » Tue Feb 25, 2020 10:52 pm

Deberger wrote: ↑Tue Feb 25, 2020 7:58 pm I agree:

https://pdfs.semanticscholar.org/d20a/9 ... 12c033.pdf

Great!

mmt · Post by **mmt** » Tue Feb 25, 2020 10:57 pm

Michel wrote: ↑Tue Feb 25, 2020 10:12 pm Congratulations. You reinvented the pentanomial model https://github.com/glinscott/fishtest/c ... def14a68f6

I never claimed it's anything new. The point of the post was that I ran some tests to confirm that it works in practice, prompted by the disagreement in this post: http://talkchess.com/forum3/viewtopic.p ... 27#p830327. It's great that you got it into the code, though. My next step will be to test out my idea of getting additional info from draws (and to a lesser extent from wins and losses).
