mwyoung wrote: ↑Fri Oct 09, 2020 12:18 am
Then I guess CCRL has a lot to explain...
We're looking into it for sure.
CCRL flawed testing : SF12 above SF12 8CPU
Moderators: hgm, Rebel, chrisw
-
- Posts: 41466
- Joined: Sun Feb 26, 2006 10:52 am
- Location: Auckland, NZ
Re: CCRL flawed testing : SF12 above SF12 8CPU
gbanksnz at gmail.com
-
- Posts: 1759
- Joined: Tue Apr 19, 2016 6:08 am
- Location: U.S.A
- Full name: Andrew Grant
Re: CCRL flawed testing : SF12 above SF12 8CPU
Has someone told KingCrusher? The Content has been getting dry ....
#WeAreAllDraude #JusticeForDraude #RememberDraude #LeptirBigUltra
"Those who can't do, clone instead" - Eduard ( A real life friend, not this forum's Eduard )
-
- Posts: 10948
- Joined: Wed Jul 26, 2006 10:21 pm
- Full name: Kai Laskos
Re: CCRL flawed testing: SF12 above SF12 8CPU.
Ajedrecista wrote: ↑Thu Oct 08, 2020 9:10 pm
Hello Kai and Alayan:
While we wait Mark's answer, I did some math. Based on my own post when SF 12 was released, the draw ratio is:
Code: Select all
W = K*L >= L // K >= 1
K*L + D + L = 1
(K + 1)*L = 1 - D
L = (1 - D)/(K + 1)
Elo_diff. = 400*log10{[2*K - (K - 1)*D]/[2 + (K - 1)*D]}
D = 2*[10^(Elo_diff./400) - K]/{(1 - K)*[10^(Elo_diff./400) + 1]}
I computed draw ratios for some K values just to get an idea:
Code: Select all
Elo difference = 72 Elo.
W = K*L
K W D L
2 40.86% 38.71% 20.43%
3 30.65% 59.14% 10.22%
4 27.24% 65.95% 6.81%
5 25.54% 69.35% 5.11%
6 24.52% 71.40% 4.09%
7 23.84% 72.76% 3.41%
8 23.35% 73.73% 2.92%
9 22.99% 74.46% 2.55%
10 22.70% 75.03% 2.27%
11 22.47% 75.48% 2.04%
12 22.29% 75.85% 1.86%
13 22.13% 76.16% 1.70%
14 22.00% 76.43% 1.57%
15 21.89% 76.65% 1.46%
16 21.79% 76.84% 1.36%
17 21.71% 77.01% 1.28%
18 21.63% 77.16% 1.20%
19 21.57% 77.30% 1.14%
20 21.51% 77.42% 1.08%
Of course, the extreme cases are:
Code: Select all
Assuming Elo_diff. >= 0 Elo:
Elo_diff. = 400*log10[score/(1 - score)]
score = 1/[1 + 10^(-Elo_diff./400)] = W + D = 1/2 + D_min.
D_min. = 1/[1 + 10^(-Elo_diff./400)] - 1/2
// Elo_diff. = 72 ==> D_min. ~ 10.22%
W + D_max. = 1 // L = 0
Elo_diff. = 400*log10[(W + D_max./2)/(D_max./2)]
Elo_diff. = 400*log10[(1 - D_max. + D_max./2)/(D_max./2)]
10^(Elo_diff./400) = (1 - D_max./2)/(D_max./2) = 2/D_max. - 1
D_max. = 2/[1 + 10^(Elo_diff./400)]
// Elo_diff. = 72 ==> D_max. ~ 79.57%
So 10.22% < D < 79.57% in this case, with great chances of being around 75%. If there were 200 games, then WDL figures must be in steps of 0.5% and some values of my table can be discarded like K = 7. K has low chances of being an integer, after all. I picked integer values for K just to get a rough idea of the draw ratio.
Regards from Spain.
Ajedrecista.
Very nice!
His result was 78% draws, very close to your likely 75%. And his wins : losses was 44 : 0, quite telling W/L and an associated Elo compression.
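As a quick sanity check of those figures, here is a minimal Python sketch that converts win/draw/loss counts into an Elo difference with the same logistic formula used in the quoted post. The 200-game total is an assumption (it is what 44 wins, 0 losses and 78% draws imply together), and the function name is made up for illustration.

```python
import math

def elo_from_counts(wins: int, draws: int, losses: int) -> float:
    """Elo difference implied by a match result under the logistic Elo model."""
    games = wins + draws + losses
    score = (wins + 0.5 * draws) / games           # draws count as half a point
    return 400.0 * math.log10(score / (1.0 - score))

# Assumed match: 200 games with 44 wins, 156 draws (78%) and 0 losses.
elo = elo_from_counts(44, 156, 0)
print(f"score = {(44 + 0.5 * 156) / 200:.3f}, Elo difference = {elo:.1f}")
```

Under that assumption the score is 61%, which corresponds to roughly 77.7 Elo rather than 72.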
-
- Posts: 2727
- Joined: Wed May 12, 2010 10:00 pm
Re: CCRL flawed testing : SF12 above SF12 8CPU
Graham Banks wrote: ↑Fri Oct 09, 2020 12:23 am
mwyoung wrote: ↑Fri Oct 09, 2020 12:18 am
Then I guess CCRL has a lot to explain...
We're looking into it for sure.
Thanks,
Keep us up to date. Interesting for testers.
"The worst thing that can happen to a forum is a running wild attacking moderator(HGM) who is not corrected by the community." - Ed Schröder
But my words like silent raindrops fell. And echoed in the wells of silence.
-
- Posts: 1971
- Joined: Wed Jul 13, 2011 9:04 pm
- Location: Madrid, Spain.
Re: CCRL flawed testing: SF12 above SF12 8CPU.
Hello Kai:
I found a typo in my formula for D_min. The true value is:
Furthermore, I did not realize that +72 Elo could not be a result after 200 games because W - L would not be a multiple of 1/200 = 0.005 = 0.5%:
Computing again with 200*(W - L) = 44 ==> Elo_diff. ~ 77.7:
I would have chosen around 72.5% or 74% of draw ratio in this case, falling even shorter than with the wrong Elo_diff. of 72 Elo. The stronger engine completed a perfect score regarding losses, which is something I value a lot in many games and sports: chess, checkers, goals against in football (soccer), games lost in football (Arsenal F.C. in the 2003-04 Premier League season)...
Code: Select all
/* WRONG IN MY PREVIOUS POST:
score = 1/[1 + 10^(-Elo_diff./400)] = W + D = 1/2 + D_min.
D_min. = 1/[1 + 10^(-Elo_diff./400)] - 1/2
// Elo_diff. = 72 ==> D_min. ~ 10.22%
*/
// IT IS THE DOUBLE:
score = 1/[1 + 10^(-Elo_diff./400)] = W + D = 1/2 + D_min./2 // D_min./2 instead of D_min.
D_min. = 2*score - 1 = 2/[1 + 10^(-Elo_diff./400)] - 1
// Elo_diff. = 72 ==> D_min. ~ 20.43%
Code: Select all
Elo_diff. = 400*log10{[1/2 + (W - L)/2]/[1/2 - (W - L)/2]}
10^(Elo_diff./400) = [1 + (W - L)]/[1 - (W - L)]
W - L = [10^(Elo_diff./400) - 1]/[10^(Elo_diff./400) + 1]
// Elo_diff. = 72 ==> W - L ~ 20.43%
// W - L = D_min.
// n*(W - L) = 200*(W - L) ~ 40.86 (quite far from the nearest integer, but it could be due to Elo_diff. rounding).
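The same check can be sketched in a few lines of Python. This is only an illustration of the formula above: the function name is hypothetical, and the 200-game match length is the assumption already made in this post.

```python
def win_minus_loss(elo_diff: float) -> float:
    """W - L implied by an Elo difference, from the formula above."""
    x = 10.0 ** (elo_diff / 400.0)
    return (x - 1.0) / (x + 1.0)

GAMES = 200  # assumed match length

for elo_diff in (72.0, 77.7):
    net_wins = GAMES * win_minus_loss(elo_diff)
    # With a whole number of games, wins minus losses must be an integer.
    print(f"Elo_diff. = {elo_diff:5.1f} ==> {GAMES}*(W - L) = {net_wins:.2f}")
```

With 72 Elo the product lands between 40 and 41, while 77.7 Elo gives a value that rounds to 44.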
Code: Select all
Elo difference = 77.7 Elo.
W = K*L
K W D L
2 44.00% 34.01% 22.00%
3 33.00% 56.00% 11.00%
4 29.33% 63.34% 7.33%
5 27.50% 67.00% 5.50%
6 26.40% 69.20% 4.40%
7 25.66% 70.67% 3.67%
8 25.14% 71.72% 3.14%
9 24.75% 72.50% 2.75%
10 24.44% 73.11% 2.44%
11 24.20% 73.60% 2.20%
12 24.00% 74.00% 2.00%
13 23.83% 74.34% 1.83%
14 23.69% 74.62% 1.69%
15 23.57% 74.86% 1.57%
16 23.46% 75.07% 1.47%
17 23.37% 75.25% 1.37%
18 23.29% 75.41% 1.29%
19 23.22% 75.56% 1.22%
20 23.16% 75.69% 1.16%
Code: Select all
Assuming Elo_diff. >= 0 Elo:
D_min. = 2/[1 + 10^(-Elo_diff./400)] - 1
// Elo_diff. = 77.7 ==> D_min. = 22%
D_max. = 2/[1 + 10^(Elo_diff./400)]
// Elo_diff. = 77.7 ==> D_max. = 78%
D_min. + D_max. = {2/[1 + 10^(-Elo_diff./400)] - 1} + 2/[1 + 10^(Elo_diff./400)]
D_min. + D_max. = 1
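For anyone who wants to reproduce the table and the bounds, a minimal Python sketch follows. It only re-implements the formulas as posted (W = K*L, and D_min. and D_max. as defined above); the function names are invented for this example.

```python
def wdl_for_ratio(elo_diff: float, k: float) -> tuple[float, float, float]:
    """W, D, L fractions for a given Elo difference when W = k*L (k > 1)."""
    x = 10.0 ** (elo_diff / 400.0)
    net = (x - 1.0) / (x + 1.0)    # W - L implied by the Elo difference
    loss = net / (k - 1.0)         # W - L = (k - 1)*L
    win = k * loss
    return win, 1.0 - win - loss, loss

def draw_bounds(elo_diff: float) -> tuple[float, float]:
    """D_min. and D_max. as defined in the post (Elo_diff. >= 0)."""
    x = 10.0 ** (elo_diff / 400.0)
    d_min = 2.0 / (1.0 + 1.0 / x) - 1.0
    d_max = 2.0 / (1.0 + x)        # the L = 0 case
    return d_min, d_max

for k in (2, 12, 20):
    w, d, l = wdl_for_ratio(77.7, k)
    print(f"{k:2d} {w:6.2%} {d:6.2%} {l:6.2%}")

d_min, d_max = draw_bounds(77.7)
print(f"D_min. = {d_min:.2%}, D_max. = {d_max:.2%}, sum = {d_min + d_max:.4f}")
```

The printed rows can be compared against the K = 2, 12 and 20 lines of the table, and the last line checks that D_min. + D_max. = 1.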
Regards from Spain.
Ajedrecista.
-
- Posts: 508
- Joined: Fri Jun 04, 2010 7:23 am
Re: CCRL flawed testing : SF12 above SF12 8CPU
Alayan wrote: ↑Tue Oct 06, 2020 6:00 pm
Both got a very different mix of opponents. Both don't have that much games so small sample size doesn't help, but :
Code: Select all
Stockfish 12 64-bit 3666 +22 −22 89.3% −325.7 21.0%
Stockfish 12 64-bit 8CPU 3658 +28 −28 63.4% −75.1 70.7%
Elo transitivity flat out doesn't work, and we can get absurd results like this if the opponent mix is different enough.
If a different mix makes a significant difference, then there is a flaw with the Elo rating system, period.
-
- Posts: 708
- Joined: Mon Jan 16, 2012 6:34 am
Re: CCRL flawed testing : SF12 above SF12 8CPU
Alayan wrote: ↑Tue Oct 06, 2020 6:00 pm
Both got a very different mix of opponents. Both don't have that much games so small sample size doesn't help, but :
Code: Select all
Stockfish 12 64-bit 3666 +22 −22 89.3% −325.7 21.0%
Stockfish 12 64-bit 8CPU 3658 +28 −28 63.4% −75.1 70.7%
Elo transitivity flat out doesn't work, and we can get absurd results like this if the opponent mix is different enough.
So, we don't need to use more than 1 core for SF 12.
Environmentally friendly, good job in NNUE revolution!
-
- Posts: 2727
- Joined: Wed May 12, 2010 10:00 pm
Re: CCRL flawed testing : SF12 above SF12 8CPU
Laskos's law:
What's not clear? 3 doublings in cores mean nowadays at least 2.5 real effective doublings in TC. Each effective doubling in TC in these blitz conditions means at very least 40 Elo points, therefore at very least 80 Elo points 1 core -> 8 cores. In fact more likely 120 - 140 Elo points. That result posted in OP and discrepancy beyond doubt break the Elo model.
It is clear to me that Stockfish NNUE does not obey Laskos's law as stated above. CCRL most likely does not have flawed testing; as suspected, the issue is with Stockfish NNUE. It took me many hours of testing to show this result, and the full results will be shown soon, when the testing is completed. The bottom line is that the issue is with Stockfish NNUE, and not with CCRL testing. As you know, testing can take days to answer this kind of anomaly, or false assumption.
"The worst thing that can happen to a forum is a running wild attacking moderator(HGM) who is not corrected by the community." - Ed Schröder
But my words like silent raindrops fell. And echoed in the wells of silence.
-
- Posts: 27811
- Joined: Fri Mar 10, 2006 10:06 am
- Location: Amsterdam
- Full name: H G Muller
Re: CCRL flawed testing : SF12 above SF12 8CPU
One thing should be clear: there is nothing flawed in CCRL testing. They just play games, and record the results.
What could be flawed is the Elo model used to analyze the testing data. This cannot be blamed on CCRL. It just means we have to develop better rating models, and use these instead to extract useful information (which might not be representable as a single rating number) from the data set.
I have suspected for a long time that conventional ratings are artifacts caused by 'incestuous testing': the players tested are too similar, so that the test only measures a single aspect of their performance, almost completely ignoring other aspects, as there is no opponent in the pool that would punish you for being bad at the others. So I have always wondered whether an A-B engine that tests (say) 100 Elo stronger than another one would also perform 100 Elo better in a gauntlet against a more varied mix of opponents. I never excluded the possibility that it might actually perform worse, because the 100-Elo advantage was only reached by specializing it more on the single aspect that A-B engines are good at, at the expense of other aspects.
If SF 8-core has 'underperformed' in terms of Elo because it had too many NN opponents, this can be considered evidence for the above hypothesis. Deeper search through more cores might not help against strategically superior opponents, because the tactical mistakes the latter make can already be seen at the lower depth. The only thing that would help is to not fall for their strategical traps. The idea that better search beyond some point makes the engine stronger might just be an artifact of the rating pool being dominated by A-B engines.
-
- Posts: 2661
- Joined: Wed Mar 10, 2010 10:18 pm
- Location: Hamburg, Germany
- Full name: Srdja Matovic
Re: CCRL flawed testing : SF12 above SF12 8CPU
hgm wrote: ↑Sat Oct 10, 2020 11:26 am
One thing should be clear: there is nothing flawed in CCRL testing. They just play games, and record the results.
What could be flawed is the Elo model used to analyze the testing data. This cannot be blamed on CCRL. It just means we have to develop better rating models, and use these instead to extract useful information (which might not be representable as a single rating number) from the data set.
I have suspected for a long time that conventional ratings are artifacts caused by 'incestuous testing': the players tested are too similar, so that the test only measures a single aspect of their performance, almost completely ignoring other aspects, as there is no opponent in the pool that would punish you for being bad at the others. So I have always wondered whether an A-B engine that tests (say) 100 Elo stronger than another one would also perform 100 Elo better in a gauntlet against a more varied mix of opponents. I never excluded the possibility that it might actually perform worse, because the 100-Elo advantage was only reached by specializing it more on the single aspect that A-B engines are good at, at the expense of other aspects.
If SF 8-core has 'underperformed' in terms of Elo because it had too many NN opponents, this can be considered evidence for the above hypothesis. Deeper search through more cores might not help against strategically superior opponents, because the tactical mistakes the latter make can already be seen at the lower depth. The only thing that would help is to not fall for their strategical traps. The idea that better search beyond some point makes the engine stronger might just be an artifact of the rating pool being dominated by A-B engines.
Just as a side remark, it is not clear to me whether SF's parallel search goes deeper with more cores or just widens the search. Maybe NNUE does not profit from more cores the same way classic SF does, and they will have to rework the parallel search again, or something like that...
--
Srdja