CCRL flawed testing : SF12 above SF12 8CPU

Alayan
Posts: 550
Joined: Tue Nov 19, 2019 7:48 pm
Full name: Alayan Feh

CCRL flawed testing : SF12 above SF12 8CPU

Post by Alayan » Tue Oct 06, 2020 4:00 pm

Both got a very different mix of opponents. Neither has that many games, so the small sample size doesn't help, but:

Stockfish 12 64-bit        3666	+22	−22	89.3%	−325.7	21.0%
Stockfish 12 64-bit 8CPU   3658	+28	−28	63.4%	−75.1	70.7%
Elo transitivity flat out doesn't work, and we can get absurd results like this if the opponent mix is different enough.
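
For reference, the prediction that is being tested here can be sketched with the standard logistic Elo formula. This is only a rough cross-check: it assumes the fourth column above is the score percentage and the fifth is the average opponent rating relative to the engine (so −325.7 would mean opponents averaging 325.7 Elo lower), and the function name is made up for illustration.

```python
def expected_score(elo_diff: float) -> float:
    """Expected score for a player rated `elo_diff` points above the opponent
    under the logistic Elo model."""
    return 1.0 / (1.0 + 10.0 ** (-elo_diff / 400.0))


# (name, actual score, average opponent rating minus own rating),
# copied from the two rows above; the column meaning is an assumption.
rows = [
    ("Stockfish 12 64-bit", 0.893, -325.7),
    ("Stockfish 12 64-bit 8CPU", 0.634, -75.1),
]

for name, actual, opp_diff in rows:
    # The engine's advantage is the negative of the opponent difference.
    predicted = expected_score(-opp_diff)
    print(f"{name}: actual {actual:.1%}, model {predicted:.1%}")
```

The model figures should come out at roughly 87% and 61%. Plugging in the averaged opponent rating is itself an approximation, since the formula is not linear and the real expectation is an average over the individual opponents.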

Laskos
Posts: 10949
Joined: Wed Jul 26, 2006 8:21 pm
Full name: Kai Laskos

Re: CCRL flawed testing : SF12 above SF12 8CPU

Post by Laskos » Tue Oct 06, 2020 4:16 pm

Alayan wrote:
Tue Oct 06, 2020 4:00 pm
Both got a very different mix of opponents. Neither has that many games, so the small sample size doesn't help, but:

Stockfish 12 64-bit        3666	+22	−22	89.3%	−325.7	21.0%
Stockfish 12 64-bit 8CPU   3658	+28	−28	63.4%	−75.1	70.7%
Elo transitivity flat out doesn't work, and we can get absurd results like this if the opponent mix is different enough.
5 out of 6 opponents of SF12 8CPU are Leela-like MCTS engines, which compress Elo differences when playing against AB engines (this was discussed here more than a year ago). Yes, underperformance of 8CPU SF12 is statistically significant, despite the not very large number of games.

mwyoung
Posts: 2727
Joined: Wed May 12, 2010 8:00 pm

Re: CCRL flawed testing : SF12 above SF12 8CPU

Post by mwyoung » Tue Oct 06, 2020 4:48 pm

Laskos wrote:
Tue Oct 06, 2020 4:16 pm
5 out of 6 opponents of SF12 8CPU are Leela-like MCTS engines, which compress Elo differences when playing against AB engines (this was discussed here more than a year ago). Yes, underperformance of 8CPU SF12 is statistically significant, despite the not very large number of games.
"Yes, underperformance of 8CPU SF12 is statistically significant"

Why is this true?
What should the Elo difference be between SF 12 on 1 core and SF 12 on 8 cores at this fast TC?
By CCRL's own testing results, SF 12 on 8 cores could have a rating of 3686 (3658 + 28), and SF 12 on 1 core could have a rating of 3644 (3666 − 22).

This is not even considering the hardware CCRL is using, and whether it is configured correctly: did they lock the core clock speed of the CPUs, deal with CPU ramping, and so on? This can have a big impact on performance at these TCs.
"The worst thing that can happen to a forum is a running wild attacking moderator(HGM) who is not corrected by the community." - Ed Schröder
But my words like silent raindrops fell. And echoed in the wells of silence.

xr_a_y
Posts: 1619
Joined: Sat Nov 25, 2017 1:28 pm
Location: France

Re: CCRL flawed testing : SF12 above SF12 8CPU

Post by xr_a_y » Tue Oct 06, 2020 5:09 pm

For what it's worth, Minic NNUE seems to be scaling OK at short TC (not many games, though):

Rank Name                     Elo    +    - games score oppo. draws
   1 stockfish.11              90   50   49    96   54%    67   43% 
   2 minic_2.50_uci_nnue_8t    67   18   18   767   62%    -8   47% 
   3 stockfish.10              55   49   50    96   48%    67   44% 
   4 minic_2.50_uci_nnue_4t    35   46   46    96   44%    67   69% 
   5 stockfish.9                8   48   49    96   40%    67   49% 
   6 stockfish.8              -20   50   51    96   36%    67   43% 
   7 minic_2.50_uci_nnue_2t   -40   48   49    96   31%    67   56% 
   8 stockfish.7              -72   52   55    96   29%    67   33% 
   9 minic_2.50_uci_nnue     -123   52   56    95   21%    67   41% 
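
To read the scaling off the table, the gain per thread doubling can be computed as below. This is a small sketch that assumes the first number after each name is the relative Elo and the next two are its error bars, and that the `_2t`, `_4t`, `_8t` suffixes are thread counts with no suffix meaning 1 thread. The helper name is illustrative.

```python
# threads -> (Elo, plus error, minus error), copied from the Minic rows above
minic = {
    1: (-123, 52, 56),
    2: (-40, 48, 49),
    4: (35, 46, 46),
    8: (67, 18, 18),
}


def doubling_gains(ratings):
    """Return (from_threads, to_threads, Elo gain) for each consecutive pair."""
    threads = sorted(ratings)
    return [
        (lo, hi, ratings[hi][0] - ratings[lo][0])
        for lo, hi in zip(threads, threads[1:])
    ]


for lo, hi, gain in doubling_gains(minic):
    print(f"{lo} -> {hi} threads: {gain:+d} Elo")
```

With only 96 games for most entries, the error bars are about as large as the individual steps, so only the overall 1 to 8 thread trend should be taken seriously.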

Laskos
Posts: 10949
Joined: Wed Jul 26, 2006 8:21 pm
Full name: Kai Laskos

Re: CCRL flawed testing : SF12 above SF12 8CPU

Post by Laskos » Tue Oct 06, 2020 5:37 pm

mwyoung wrote:
Tue Oct 06, 2020 4:48 pm
"Yes, underperformance of 8CPU SF12 is statistically significant"

Why is this true?
What should the Elo difference be between SF 12 on 1 core and SF 12 on 8 cores at this fast TC?
By CCRL's own testing results, SF 12 on 8 cores could have a rating of 3686 (3658 + 28), and SF 12 on 1 core could have a rating of 3644 (3666 − 22).

This is not even considering the hardware CCRL is using, and whether it is configured correctly: did they lock the core clock speed of the CPUs, deal with CPU ramping, and so on? This can have a big impact on performance at these TCs.
The difference should be (much) in excess of 80 Elo points in these conditions. Here it is -8 +/- 36 Elo points (2 standard deviations), so the mismatch is highly statistically significant. The explanation is that Leela-like MCTS engines in a pool of AB engines don't obey the Elo model, and this was discussed here a while ago.
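
The arithmetic behind those figures can be sketched as follows. It assumes the +22/−22 and +28/−28 columns in the OP are 2-standard-deviation error bars, that the two ratings are independent (not strictly true inside one rating pool), and it takes 80 Elo as the minimum expected gap. Names are illustrative.

```python
import math

# Ratings and (assumed) 2-sigma error bars from the OP
sf12_1cpu, err_1cpu = 3666, 22
sf12_8cpu, err_8cpu = 3658, 28


def rating_gap(r_a, err_a, r_b, err_b):
    """Return (r_a - r_b, combined error), adding independent errors in quadrature."""
    return r_a - r_b, math.sqrt(err_a ** 2 + err_b ** 2)


diff, err_2sigma = rating_gap(sf12_8cpu, err_8cpu, sf12_1cpu, err_1cpu)
sigma = err_2sigma / 2.0                 # one standard deviation
expected_min_gap = 80                    # lower bound claimed for 1 -> 8 cores
z = (expected_min_gap - diff) / sigma    # shortfall in standard deviations

print(f"8CPU minus 1CPU: {diff:+d} +/- {err_2sigma:.0f} Elo (2 sigma)")
print(f"shortfall versus {expected_min_gap} Elo: {z:.1f} sigma")
```

Under these assumptions the measured gap falls short of even the 80 Elo floor by close to 5 standard deviations.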

RogerC
Posts: 36
Joined: Tue Oct 29, 2019 7:33 pm
Location: French Polynesia
Full name: Roger C.

Re: CCRL flawed testing : SF12 above SF12 8CPU

Post by RogerC » Tue Oct 06, 2020 5:38 pm

I'm not a fan of CCRL any more, just for that. They mix multicore and single-core results with completely different opponents. No doubt the results will necessarily be biased!

I only look at the CEGT 40/4 and 40/20 rating lists, which are more accurate in their Elo calculations for single-core and multicore engines (1, 4, 8 and 12 threads):

http://www.cegt.net/40_4_Ratinglist/40_ ... liste.html
http://www.cegt.net/40_40%20Rating%20Li ... liste.html

If you want to focus on the competition between SF and LC0 (the 2 best engines in the world for now), look at the Stefan Pohl Computer Chess tests. There you will find tests of the best LC0 nets and the results of the best LC0 net vs the latest SF dev:

https://www.sp-cc.de/nn-vs-sf-testing.htm

mwyoung
Posts: 2727
Joined: Wed May 12, 2010 8:00 pm

Re: CCRL flawed testing : SF12 above SF12 8CPU

Post by mwyoung » Tue Oct 06, 2020 5:43 pm

Laskos wrote:
Tue Oct 06, 2020 5:37 pm
The difference should be (much) in excess of 80 Elo points in these conditions. Here it is -8 +/- 36 Elo points (2 standard deviations), so the mismatch is highly statistically significant. The explanation is that Leela-like MCTS engines in a pool of AB engines don't obey the Elo model, and this was discussed here a while ago.
"The difference should be (much) in excess of 80 Elo points in these conditions."

Why should it be over 80 Elo for SF 12, 1 core vs 8 cores? I have not tested this. What results are you looking at that do not agree with CCRL?

If you are correct, then why is it so off? As I said before...

This is not even considering the hardware CCRL is using, and whether it is configured correctly: did they lock the core clock speed of the CPUs, deal with CPU ramping, and so on? This can have a big impact on performance at these TCs.

Laskos
Posts: 10949
Joined: Wed Jul 26, 2006 8:21 pm
Full name: Kai Laskos

Re: CCRL flawed testing : SF12 above SF12 8CPU

Post by Laskos » Tue Oct 06, 2020 6:02 pm

mwyoung wrote:
Tue Oct 06, 2020 5:43 pm
"The difference should be (much) in excess of 80 Elo points in these conditions."

Why should it be over 80 Elo for SF 12, 1 core vs 8 cores? I have not tested this. What results are you looking at that do not agree with CCRL?

If you are correct, then why is it so off? As I said before...

This is not even considering the hardware CCRL is using, and whether it is configured correctly: did they lock the core clock speed of the CPUs, deal with CPU ramping, and so on? This can have a big impact on performance at these TCs.
What's not clear? 3 doublings in cores nowadays mean at least 2.5 real effective doublings in TC. Each effective doubling in TC in these blitz conditions is worth at the very least 40 Elo points, therefore at the very least 80 Elo points going from 1 core to 8 cores. In fact, more likely 120 - 140 Elo points. The result posted in the OP and this discrepancy break the Elo model beyond doubt.
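
As a back-of-the-envelope sketch of that estimate: the 2.5 effective doublings and the 40 Elo per doubling are the figures from the post itself, not measurements, and the function name is made up for illustration.

```python
import math


def elo_gain(effective_tc_doublings: float, elo_per_doubling: float) -> float:
    """Estimated Elo gain from a number of effective time-control doublings."""
    return effective_tc_doublings * elo_per_doubling


core_doublings = math.log2(8)   # 1 -> 8 cores is 3 doublings
effective = 2.5                 # effective TC doublings claimed for 3 core doublings

print(f"core doublings: {core_doublings:.0f}")
print(f"2.5 doublings at 40 Elo each: {elo_gain(effective, 40):.0f} Elo")
print(f"2.0 doublings at 40 Elo each: {elo_gain(2.0, 40):.0f} Elo")

# Elo per effective doubling implied by the 120 - 140 Elo estimate
for total in (120, 140):
    print(f"{total} Elo total -> {total / effective:.0f} Elo per doubling")
```

Note that 2.5 doublings at 40 Elo each already give 100 Elo, so the quoted 80 Elo floor is conservative even on the post's own numbers, and the 120 - 140 range corresponds to roughly 48 to 56 Elo per effective doubling.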

mwyoung
Posts: 2727
Joined: Wed May 12, 2010 8:00 pm

Re: CCRL flawed testing : SF12 above SF12 8CPU

Post by mwyoung » Tue Oct 06, 2020 6:03 pm

RogerC wrote:
Tue Oct 06, 2020 5:38 pm
I'm not a fan of CCRL any more, just for that. They mix multicore and single-core results with completely different opponents. No doubt the results will necessarily be biased!

I only look at the CEGT 40/4 and 40/20 rating lists, which are more accurate in their Elo calculations for single-core and multicore engines (1, 4, 8 and 12 threads):

http://www.cegt.net/40_4_Ratinglist/40_ ... liste.html
http://www.cegt.net/40_40%20Rating%20Li ... liste.html

If you want to focus on the competition between SF and LC0 (the 2 best engines in the world for now), look at the Stefan Pohl Computer Chess tests. There you will find tests of the best LC0 nets and the results of the best LC0 net vs the latest SF dev:

https://www.sp-cc.de/nn-vs-sf-testing.htm
"I'm not a fan of CCRL any more, just for that. They mix multicore and single-core results with completely different opponents. No doubt the results will necessarily be biased!"

I am not a fan either, but not for this reason. In theory there is nothing wrong with playing different opponents.

mwyoung
Posts: 2727
Joined: Wed May 12, 2010 8:00 pm

Re: CCRL flawed testing : SF12 above SF12 8CPU

Post by mwyoung » Tue Oct 06, 2020 6:09 pm

Laskos wrote:
Tue Oct 06, 2020 6:02 pm
What's not clear? 3 doublings in cores nowadays mean at least 2.5 real effective doublings in TC. Each effective doubling in TC in these blitz conditions is worth at the very least 40 Elo points, therefore at the very least 80 Elo points going from 1 core to 8 cores. In fact, more likely 120 - 140 Elo points. The result posted in the OP and this discrepancy break the Elo model beyond doubt.
Then you are assuming this also holds for STOCKFISH 12. So you have no data! This is why you always fall off the rails.
So this could be an issue with Stockfish 12 on 8 cores, and CCRL's testing could be correct. :shock:
