CCRL flawed testing : SF12 above SF12 8CPU

Alayan · Post by **Alayan** » Tue Oct 06, 2020 6:00 pm

Both got a very different mix of opponents. Neither has that many games, so the small sample size doesn't help, but:

Engine                     Rating	+	−	Score	Avg. opp.	Draws
Stockfish 12 64-bit        3666	+22	−22	89.3%	−325.7	21.0%
Stockfish 12 64-bit 8CPU   3658	+28	−28	63.4%	−75.1	70.7%

Elo transitivity flat out doesn't work, and we can get absurd results like this if the opponent mix is different enough.
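
For reference, a minimal sketch of the plain logistic Elo curve that transitivity relies on. It assumes the −325.7 and −75.1 figures are the average opponent rating relative to the engine's own rating, and it ignores the draw model that rating tools usually add on top, so the comparison with the listed scores is only indicative.

```python
def expected_score(elo_advantage: float) -> float:
    """Expected score under the logistic Elo model for a given rating advantage."""
    return 1.0 / (1.0 + 10.0 ** (-elo_advantage / 400.0))

# Assumption: the listed average-opponent figures are offsets from the
# engine's own rating, so the engine's advantage is the negated value.
entries = [
    ("Stockfish 12 64-bit", 325.7, 0.893),
    ("Stockfish 12 64-bit 8CPU", 75.1, 0.634),
]

for name, advantage, listed_score in entries:
    model_score = expected_score(advantage)
    print(f"{name}: model {model_score:.3f}, listed {listed_score:.3f}")
```

Using a single average opponent is itself an approximation, since the expected score is not linear in the rating difference.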

Laskos · Post by **Laskos** » Tue Oct 06, 2020 6:16 pm

Alayan wrote: ↑Tue Oct 06, 2020 6:00 pm Both got a very different mix of opponents. Neither has that many games, so the small sample size doesn't help, but:
Stockfish 12 64-bit        3666	+22	−22	89.3%	−325.7	21.0%
Stockfish 12 64-bit 8CPU   3658	+28	−28	63.4%	−75.1	70.7%
Elo transitivity flat out doesn't work, and we can get absurd results like this if the opponent mix is different enough.

5 out of 6 opponents of SF12 8CPU are Leela-like MCTS engines, which compress Elo differences when playing against AB engines (this was discussed here more than a year ago). Yes, underperformance of 8CPU SF12 is statistically significant, despite the fairly small number of games.

mwyoung · Post by **mwyoung** » Tue Oct 06, 2020 6:48 pm

Laskos wrote: ↑Tue Oct 06, 2020 6:16 pm
5 out of 6 opponents of SF12 8CPU are Leela-like MCTS engines, which compress Elo differences when playing against AB engines (this was discussed here more than a year ago). Yes, underperformance of 8CPU SF12 is statistically significant, despite the fairly small number of games.

"Yes, underperformance of 8CPU SF12 is statistically significant"

Why is this true?
What should the Elo difference be between SF 12 on 1 core and SF 12 on 8 cores at this fast TC?
By CCRL's own testing results, SF 12 on 8 cores could have a rating of 3686, and SF 12 on 1 core could have a rating of 3644.

This is not even considering the hardware CCRL is using, and whether it is configured correctly. As in: did they lock the CPU core speed, handle CPU ramping, and so on? This can have a big impact on performance at these TCs.
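
The 3686 and 3644 figures appear to be the edges of the listed error bars (3658 + 28 and 3666 − 22). A small sketch of that interval arithmetic, with the ratings and margins copied from the table in the opening post; the helper name `interval` is just for illustration:

```python
def interval(rating: int, plus: int, minus: int) -> tuple[int, int]:
    """Return the (low, high) range implied by a rating and its error margins."""
    return (rating - minus, rating + plus)

sf12_1cpu = interval(3666, 22, 22)
sf12_8cpu = interval(3658, 28, 28)

# The ranges overlap if the larger lower bound does not exceed the smaller upper bound.
overlap = max(sf12_1cpu[0], sf12_8cpu[0]) <= min(sf12_1cpu[1], sf12_8cpu[1])

# Most favourable case for 8 cores: its upper edge against the 1-core lower edge.
best_case_gap = sf12_8cpu[1] - sf12_1cpu[0]

print("1 core range:", sf12_1cpu)
print("8 core range:", sf12_8cpu)
print("ranges overlap:", overlap)
print("best-case 8-core lead:", best_case_gap)
```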

xr_a_y · Post by **xr_a_y** » Tue Oct 06, 2020 7:09 pm

For what it's worth, Minic NNUE seems to be scaling OK at short TC (not many games, though):

Rank Name                     Elo    +    -  games score  oppo. draws
   1 stockfish.11              90   50   49    96   54%    67   43%
   2 minic_2.50_uci_nnue_8t    67   18   18   767   62%    -8   47%
   3 stockfish.10              55   49   50    96   48%    67   44%
   4 minic_2.50_uci_nnue_4t    35   46   46    96   44%    67   69%
   5 stockfish.9                8   48   49    96   40%    67   49%
   6 stockfish.8              -20   50   51    96   36%    67   43%
   7 minic_2.50_uci_nnue_2t   -40   48   49    96   31%    67   56%
   8 stockfish.7              -72   52   55    96   29%    67   33%
   9 minic_2.50_uci_nnue     -123   52   56    95   21%    67   41%

Laskos · Post by **Laskos** » Tue Oct 06, 2020 7:37 pm

mwyoung wrote: ↑Tue Oct 06, 2020 6:48 pm
"Yes, underperformance of 8CPU SF12 is statistically significant"

Why is this true?
What should the Elo difference be between SF 12 on 1 core and SF 12 on 8 cores at this fast TC?
By CCRL's own testing results, SF 12 on 8 cores could have a rating of 3686, and SF 12 on 1 core could have a rating of 3644.

This is not even considering the hardware CCRL is using, and whether it is configured correctly. As in: did they lock the CPU core speed, handle CPU ramping, and so on? This can have a big impact on performance at these TCs.

The difference should be (much) in excess of 80 Elo points in these conditions. Here it is -8 +/- 36 Elo points (2 standard deviations), therefore the mismatch is highly statistically significant. The explanation is that Leela-like MCTS engines in a pool of AB engines don't obey the Elo model; this was discussed a while ago here.
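
The -8 +/- 36 figure can be reproduced roughly as follows. This is only a sketch under two assumptions not stated in the list itself: that the listed margins are 2-standard-deviation errors, and that the two ratings' errors are independent (in a shared rating pool they are not strictly so). The 80 Elo expectation is the lower bound claimed above, not a measured value.

```python
import math

def diff_with_error(r_a: float, err_a: float, r_b: float, err_b: float) -> tuple[float, float]:
    """Difference r_a - r_b and its error, combining independent errors in quadrature."""
    return r_a - r_b, math.hypot(err_a, err_b)

# Ratings and margins from the opening post: 8CPU first, 1CPU second.
diff, err_2sd = diff_with_error(3658, 28, 3666, 22)

expected_gap = 80        # claimed minimum 1 core -> 8 cores gain (assumption from the post)
one_sd = err_2sd / 2     # assumes the listed margins are 2 standard deviations
z = (expected_gap - diff) / one_sd

print(f"measured difference: {diff:+.0f} +/- {err_2sd:.0f} Elo (2 SD)")
print(f"shortfall versus {expected_gap} Elo: {z:.1f} standard deviations")
```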

RogerC · Post by **RogerC** » Tue Oct 06, 2020 7:38 pm

I'm not a fan of CCRL any more, just for that. They mix multicore and single-core results with completely different opponents. No doubt the results will necessarily be biased!

I only look at the CEGT 40/4 and 40/20 tournaments, which are more accurate in their Elo calculations for single and multicore engines (1, 4, 8 and 12 threads):

http://www.cegt.net/40_4_Ratinglist/40_ ... liste.html
http://www.cegt.net/40_40%20Rating%20Li ... liste.html

If you want to focus on the competition between SF and LC0 (the 2 best engines in the world for now), look at the Stefan Pohl Computer Chess tournament. There you will find tests of the best nets for LC0 and the results of LC0's best net vs the latest SF dev:

https://www.sp-cc.de/nn-vs-sf-testing.htm

mwyoung · Post by **mwyoung** » Tue Oct 06, 2020 7:43 pm

Laskos wrote: ↑Tue Oct 06, 2020 7:37 pm
The difference should be (much) in excess of 80 Elo points in these conditions. Here it is -8 +/- 36 Elo points (2 standard deviations), therefore the mismatch is highly statistically significant. The explanation is that Leela-like MCTS engines in a pool of AB engines don't obey the Elo model; this was discussed a while ago here.

"The difference should be (much) in excess of 80 Elo points in these conditions."

Why should it be over 80 Elo with SF 12, 1 core vs 8 cores? I have not tested this. What results are you looking at that do not agree with CCRL?

If you are correct, then why is it so off? As I said before...

This is not even considering the hardware CCRL is using, and whether it is configured correctly. As in: did they lock the CPU core speed, handle CPU ramping, and so on? This can have a big impact on performance at these TCs.

Laskos · Post by **Laskos** » Tue Oct 06, 2020 8:02 pm

mwyoung wrote: ↑Tue Oct 06, 2020 7:43 pm
"The difference should be (much) in excess of 80 Elo points in these conditions."

Why should it be over 80 Elo with SF 12, 1 core vs 8 cores? I have not tested this. What results are you looking at that do not agree with CCRL?

If you are correct, then why is it so off? As I said before...

This is not even considering the hardware CCRL is using, and whether it is configured correctly. As in: did they lock the CPU core speed, handle CPU ramping, and so on? This can have a big impact on performance at these TCs.

What's not clear? Three doublings in cores nowadays mean at least 2.5 real effective doublings in TC. Each effective doubling in TC in these blitz conditions means at the very least 40 Elo points, therefore at the very least 80 Elo points going from 1 core to 8 cores. In fact, 120 - 140 Elo points is more likely. The result posted in the OP, and the size of the discrepancy, break the Elo model beyond doubt.
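
The arithmetic behind that estimate, as a rough sketch. The 2.5 effective doublings and 40 Elo per doubling are the figures claimed in this post, not measurements; the variable names are only for illustration.

```python
import math

CORES = 8
core_doublings = math.log2(CORES)   # 1 -> 2 -> 4 -> 8 cores

effective_doublings = 2.5           # claimed effective TC doublings (assumption from the post)
min_elo_per_doubling = 40           # claimed minimum Elo per doubling at blitz (assumption from the post)

# Straight product of the two claimed figures; it sits above the stated 80 Elo floor.
estimate = effective_doublings * min_elo_per_doubling

# Per-doubling gain that the "more likely" 120 - 140 Elo range would imply.
implied_per_doubling = [gain / effective_doublings for gain in (120, 140)]

print(f"core doublings: {core_doublings:.0f}")
print(f"estimate from the claimed figures: {estimate:.0f} Elo")
print(f"implied Elo per doubling for 120 - 140: {implied_per_doubling}")
```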

mwyoung · Post by **mwyoung** » Tue Oct 06, 2020 8:03 pm

RogerC wrote: ↑Tue Oct 06, 2020 7:38 pm I'm not a fan of CCRL any more, just for that. They mix multicore and single-core results with completely different opponents. No doubt the results will necessarily be biased!

I only look at the CEGT 40/4 and 40/20 tournaments, which are more accurate in their Elo calculations for single and multicore engines (1, 4, 8 and 12 threads):

http://www.cegt.net/40_4_Ratinglist/40_ ... liste.html
http://www.cegt.net/40_40%20Rating%20Li ... liste.html

If you want to focus on the competition between SF and LC0 (the 2 best engines in the world for now), look at the Stefan Pohl Computer Chess tournament. There you will find tests of the best nets for LC0 and the results of LC0's best net vs the latest SF dev:

https://www.sp-cc.de/nn-vs-sf-testing.htm

"I'm not a fan of CCRL any more, just for that. They mix multicore and single-core results with completely different opponents. No doubt the results will necessarily be biased!"

I am not a fan either, but not for this reason. In theory there is nothing wrong with playing different opponents.

mwyoung · Post by **mwyoung** » Tue Oct 06, 2020 8:09 pm

Laskos wrote: ↑Tue Oct 06, 2020 8:02 pm
What's not clear? Three doublings in cores nowadays mean at least 2.5 real effective doublings in TC. Each effective doubling in TC in these blitz conditions means at the very least 40 Elo points, therefore at the very least 80 Elo points going from 1 core to 8 cores. In fact, 120 - 140 Elo points is more likely. The result posted in the OP, and the size of the discrepancy, break the Elo model beyond doubt.

Then you are assuming this is also true for STOCKFISH 12. So you have no data! This is why you always fall off the rails.
So this could be an issue with Stockfish 12 on 8 cores, and CCRL's testing could be correct.
