I haven't been following this closely and I don't really want to get involved, but I can't help but point out that throwing out 8 decisive games because the same side happened to win is a fairly extreme thing to do. I don't know how much confidence can be had in the results as soon as you start being that hand wavy.
It has nothing to do with being hand wavy. On the contrary. The trinomial model is simply wrong in the case of unbalanced positions and to get accurate results one should use the pentanomial model instead.
However if there are no double wins (were there?) then the pentanomial model degenerates again to a trinomial model where reciprocal wins should be discarded (if one wants to reject the null hypothesis of equal strength).
It seems very naive to reject the results in an opening with double wins based on a sample size of a single 2-game minimatch. That, to me, is hand-wavy.
I will however concede that your point may be even stronger.
Jonathan

No. As Michel already wrote, the trinomial model for paired games is simply wrong when computing the variance, and the first approximation to the correct model is the trinomial model with paired wins (same color) taken as draws. The correct thing is to use the pentanomial variance. I can give the (correct) pentanomial error margins and LOS for the TCEC case, but I don't want to use such language here as "pentanomial" or "the stopping rule should be based on SPRT, which controls both Type I and Type II errors, with LLR computed using the pentanomial model for pairs of games". Nobody wants to read that. I cannot even convince some scientists to use SPRT (adapted to the problem at hand) when they stop at will based on the p-value, so I just suggest that, if they stop at will, they use a p-value threshold not of 0.05 or 0.01 (as is common in the life sciences, for example) but of 0.001. In fact, this is one of the main recommendations of many statisticians to curb the rising number of papers with faulty use of statistics. I am not a statistician, but I think I understood their reasoning.
I doubt the engines in a few years will lose the weaker side of those positions against current SF and Leela.
Perhaps if we ran this same set several times, the other minimatches that ended in one engine's favor would end up as two wins for the same color, and the drawless 11 matches would end in decisive minimatches or double draws.
I also don't understand why we are even discussing this. TCEC is not intending to do the impossible thing of determining the "best" engine under all circumstances. There are many factors and either side can argue TCEC is totally unfair for their side.
What we know is LC0 won the S15 TCEC SuFi with a score of 53.5-46.5 and it was great entertainment. I also felt it was a much better battle of ice and fire than the last GoT season.

Just trying to make sure I understood you correctly; I assume you mean the first order approximation is to use the pentanomial (not the trinomial) model with paired wins taken as draws?
Jonathan

No, the pentanomial model is theoretically not an approximation and it does not change any outcome (wins are wins, paired or not). The trinomial model with paired wins taken as draws (as in that +10 -3 result instead of +14 -7) is usually an approximation. But in the TCEC case, maybe (I have to check) there isn't any 2:0 result, and then +10 -3 and 8 draws instead of +14 -7 may give the exact variance and LOS. If not, it's a good approximation anyway, and it doesn't require technicalities.
No, the pentanomial model is theoretically not an approximation and it does not change any outcome (draws are draws). The trinomial model with paired games taken as draws (as in that +10 -3 result instead of +14 -7) is usually an approximation. But in the TCEC case, maybe (I have to check) there isn't any 2:0 result, and then +10 -3 and 8 draws instead of +14 -7 may give the exact variance and LOS.
Ah, that clears up the confusion a bit. There was a 2-0 opening (the Trompowsky) in favor of Leela, so +9 -3 would be the "corrected" result if I am understanding you correctly.
Jonathan

No, +9 -3 is wrong, I think. Use either the trinomial model with paired wins removed (taken as draws) as an approximation, or the exact pentanomial model.

So where I am struggling with this intuitively is that it seems like you should be looking at either game pairs or individual games. Turning only 11 minimatches into draws while counting 2-0 results as individual games seems like it is doing both. On the other hand, I suppose my intuition might be right, and that's why it's an approximation as opposed to exact.
Jonathan

Sure, taking 11 minimatches as draws inside the trinomial model is just a simple approximation for paired games. You might come up with a better approximation using the trinomial model, but I use either this simple approximation or the exact pentanomial results (variance, LOS) for paired games.
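
For readers who want to see the difference between the three approaches in numbers, here is a minimal Python sketch. It is not anyone's actual TCEC calculation. The function names are made up for illustration, the variance is taken about the null score of 0.5 (an assumption of this sketch, matching the "reject the null hypothesis of equal strength" framing), and the pair counts are hypothetical, chosen only to be consistent with the +14 -7 and +10 -3 tallies and the single 2-0 pair mentioned in this thread.

```python
import math

# Possible scores of a 2-game minimatch, from the point of view of one engine.
PAIR_SCORES = (0.0, 0.5, 1.0, 1.5, 2.0)


def trinomial_se(wins, draws, losses, center=0.5):
    """Standard error of the per-game score, treating every game as independent."""
    n = wins + draws + losses
    var_game = (wins * (1.0 - center) ** 2
                + draws * (0.5 - center) ** 2
                + losses * (0.0 - center) ** 2) / n
    return math.sqrt(var_game / n)


def pentanomial_se(pair_counts, center=0.5):
    """Standard error of the per-game score, treating each game pair as one sample.

    pair_counts holds the number of pairs scoring 0, 0.5, 1, 1.5 and 2 points.
    """
    pairs = sum(pair_counts)
    var_pair = sum(c * (s - 2.0 * center) ** 2
                   for s, c in zip(PAIR_SCORES, pair_counts)) / pairs
    # The per-game mean is half the per-pair mean, so the error is halved.
    return math.sqrt(var_pair / pairs) / 2.0


def los(score, se):
    """Likelihood of superiority under a normal approximation."""
    return 0.5 * (1.0 + math.erf((score - 0.5) / (se * math.sqrt(2.0))))


# Hypothetical 50-pair breakdown (NOT official TCEC data):
# 3 pairs at 0.5, 38 pairs at 1 (incl. 4 reciprocal-win pairs), 8 at 1.5, 1 at 2.
pair_counts = (0, 3, 38, 8, 1)
score = 53.5 / 100

se_naive = trinomial_se(14, 79, 7)     # every decisive game counted
se_reduced = trinomial_se(10, 87, 3)   # reciprocal wins taken as draws
se_penta = pentanomial_se(pair_counts)

for name, se in (("naive trinomial", se_naive),
                 ("reduced trinomial", se_reduced),
                 ("pentanomial", se_penta)):
    print(f"{name:18s} SE={se:.4f} LOS={los(score, se):.4f}")
```

With these assumed counts the naive trinomial error is the largest, the reduced trinomial error is the smallest, and the pentanomial error falls in between because of the one 2-0 pair. If there were no 2-0 or 0-2 pairs at all, the reduced trinomial and pentanomial errors computed this way would coincide.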

A new possible challenge for LC0.
In his latest book, Garry Kasparov still states that “Centaur mode” is the best expression of strength in chess games.
I am doubtful after witnessing how LC0 defeated Stockfish.
In order to test Kasparov's hypothesis, I would suggest the following challenge:
GM + the best collection of alpha-beta programs available at the time vs. the best neural network program (LC0 or AlphaZero)
Main additional rules:
 The GM in Centaur mode has access to the analysis screens of the AB programs, but may not expand the analysis tree by hand.
 On the other hand, the GM in Centaur mode may take back a move after LC0's reply and play a single substitute move, with a time penalty.
In your opinion, who would win a match of 16-24 games under these conditions?
The GM in Centaur mode or LC0?
Why wouldn't the GM be allowed to use the best NN engine?
Because Kasparov's hypothesis regarding the superiority of Centaur mode over computer-only play was expressed before December 2017, when the famous paper by Demis Hassabis came out.
His book "Deep Thinking: Where Machine Intelligence Ends and Human Creativity Begins" has no reference to neural networks.

Not to mention that the main question can't even begin to be answered without clarifying book conditions for both players.
1 OK to the alternative suggestion of a very good centaur player instead of a GM centaur.
2 I suggest that the centaur player should read the first lines of multiple engines but not explore them back and forth.
3 The possibility to take back a move after LC0's reply is another advantage for the human player.
Regardless of the rules, the aim of such experiments is as follows:
Despite having a huge advantage thanks to the best AB engines, human intuition would not be able to beat the positional insight of LC0, not even by taking back moves already played. If this is true, Kasparov's hypothesis should be rejected.