REAL ENGINES ELO COMPARED TO HUMANS?

AlexChess · Post by **AlexChess** » Wed May 26, 2021 8:55 am

towforce wrote: ↑Tue May 25, 2021 10:44 pm Chess computers have been improving for a long period of time. They first beat a GM at blitz in 1977, when Michael Stean famously called Chess 4.6 a "bloody iron monster" after it caught him in a tactical trap. 20 years later, Deeper Blue beat Garry Kasparov under tournament conditions.

Given that they have continued to improve throughout their entire history, 3600 elo seems reasonable to me.

OK, but they should have to beat GMs regularly, at long time controls and over many games. It would be nice if Magnus would accept a true match, like Kasparov - Karpov 1987.

jr66 · Post by **jr66** » Wed May 26, 2021 10:30 am

When you speak about engine ratings, what conclusions do you draw when you see, for example, these CCRL results?
https://ccrl.chessdom.com/ccrl/4040/cgi ... 4-bit_4CPU
Confirmation that FF2 is an SF 13 clone and that Dragon is perhaps less strong, but what else?
Do Carlsen and Caruana, for example, often play non-GM players in tournaments?
I have to say I really don't understand these engine rating lists, sorry....

MikeB · Post by **MikeB** » Thu May 27, 2021 12:47 am

jr66 wrote: ↑Wed May 26, 2021 10:30 am When you speak about engine ratings, what conclusions do you draw when you see, for example, these CCRL results?
https://ccrl.chessdom.com/ccrl/4040/cgi ... 4-bit_4CPU
Confirmation that FF2 is an SF 13 clone and that Dragon is perhaps less strong, but what else?
Do Carlsen and Caruana, for example, often play non-GM players in tournaments?
I have to say I really don't understand these engine rating lists, sorry....

Indeed, the rating pools are quite separate and distinct universes, which makes them virtually impossible to compare. Our best attempts compare new engines to old engines that played humans years ago. That does not make for sound or easy statistical analysis.

lkaufman · Post by **lkaufman** » Thu May 27, 2021 1:56 am

mehmet123 wrote: ↑Tue May 25, 2021 11:49 pm Let's look at the results of Pocket Fritz 4. The performance of Pocket Fritz 4 was 2398 elo at Argentina 2008.
This is the last engine which competed in a human tournament. The engine of Pocket Fritz 4 was Hiarcs 13.1.
In this tournament Hiarcs played at 20 kn/s. On my notebook Hiarcs 13 searches 20x more positions on 1 core than Pocket Fritz 4 did.
The Elo difference between Stockfish 13 and Hiarcs 13.1 is +828 Elo according to the CEGT 40/20 rating list.
https://en.chessbase.com/post/breakthro ... enos-aires

I think Rybka 1.2 ( released in 2006/06 ) was the first engine to reach the 3000 human elo.

You transposed digits above; the performance was 2938, not 2398! However, the margin of error for a 9.5 out of 10 score is huge: half a point less would have reduced the Elo performance by about 125 Elo. Even so, a 2800 performance on that hardware would be remarkable enough. This is indeed a strong argument that the engine rating lists are too low, at least in the range of the top human players.
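To illustrate how sensitive a performance rating is near a perfect score, here is a minimal sketch using the standard logistic Elo formula. This is an assumption for illustration: official FIDE performance ratings come from a lookup table, so exact figures differ somewhat. The function names are hypothetical. Under this model, dropping from 9.5/10 to 9/10 costs roughly 130 Elo, in line with the estimate above.

```python
import math

def elo_gap(score_fraction):
    """Rating gap implied by a score fraction under the logistic Elo model."""
    if not 0.0 < score_fraction < 1.0:
        raise ValueError("score fraction must be strictly between 0 and 1")
    return 400.0 * math.log10(score_fraction / (1.0 - score_fraction))

def performance_rating(avg_opponent_rating, points, games):
    """Performance rating = average opponent rating + implied gap."""
    return avg_opponent_rating + elo_gap(points / games)

# Compare 9.5/10 with 9/10 against the same (unspecified) opposition.
gap_at_9_5 = elo_gap(9.5 / 10)
gap_at_9_0 = elo_gap(9.0 / 10)
drop = gap_at_9_5 - gap_at_9_0

print(f"gap at 9.5/10: {gap_at_9_5:.0f}")
print(f"gap at 9.0/10: {gap_at_9_0:.0f}")
print(f"cost of half a point: {drop:.0f}")
```

The average opponent rating cancels out of the difference, so the sketch does not need the actual tournament field.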

lkaufman · Post by **lkaufman** » Thu May 27, 2021 2:05 am

MikeB wrote: ↑Wed May 26, 2021 12:47 am
Many people (and I am one of them) believe that Stockfish's Limit Strength rating of 2850 is probably pretty close to a human blitz rating of 2850 at game in 3 minutes plus 2 seconds increment. Full-strength SF will outperform SF 2850 by 700 or so, at least at 3' + 2". Now some of that difference is a testing artifact: when testing similar or identical engines, the engine that sees more will outperform the true difference. By how much, who knows, but even if you assume 100 Elo, you are still talking about Stockfish on a single core being probably at least 3400 human blitz. I would tend to agree that standard chess would drop that by 200 Elo or so, but the real answer is that we don't have the data, so who knows. Makes for good conversation anyway. Any rating scheme I have ever seen tends to inflate the ratings over time when you consistently have stronger engines (or human players) coming in. Most of the rating agencies will see that and try to correct for it at some point.
Editorial comment: I define "close" as within 100 Elo or so either way in this context.

I am not saying that you are wrong, but is there some data, perhaps on the servers, that shows that SF limit 2850 is an even match in blitz 3' + 2" for players with FIDE Blitz ratings of about 2850 (Carlsen, Nakamura, MVL, or Wesley So) or perhaps 2850 performances against strong GMs rated around 2700 or more FIDE blitz? If there is some solid evidence for the claim, I could determine blitz "human" ratings of other engines by matching against that SF level. Also, is this SF 13 with NNUE, or some non-NNUE SF? Have the levels been recalibrated to allow for the improvement due to NNUE?

MikeB · Post by **MikeB** » Thu May 27, 2021 2:26 am

lkaufman wrote: ↑Thu May 27, 2021 2:05 am
MikeB wrote: ↑Wed May 26, 2021 12:47 am
Many people (and I am one of them) believe that Stockfish's Limit Strength rating of 2850 is probably pretty close to a human blitz rating of 2850 at game in 3 minutes plus 2 seconds increment. Full-strength SF will outperform SF 2850 by 700 or so, at least at 3' + 2". Now some of that difference is a testing artifact: when testing similar or identical engines, the engine that sees more will outperform the true difference. By how much, who knows, but even if you assume 100 Elo, you are still talking about Stockfish on a single core being probably at least 3400 human blitz. I would tend to agree that standard chess would drop that by 200 Elo or so, but the real answer is that we don't have the data, so who knows. Makes for good conversation anyway. Any rating scheme I have ever seen tends to inflate the ratings over time when you consistently have stronger engines (or human players) coming in. Most of the rating agencies will see that and try to correct for it at some point.
Editorial comment: I define "close" as within 100 Elo or so either way in this context.
I am not saying that you are wrong, but is there some data, perhaps on the servers, that shows that SF limit 2850 is an even match in blitz 3' + 2" for players with FIDE Blitz ratings of about 2850 (Carlsen, Nakamura, MVL, or Wesley So) or perhaps 2850 performances against strong GMs rated around 2700 or more FIDE blitz? If there is some solid evidence for the claim, I could determine blitz "human" ratings of other engines by matching against that SF level. Also, is the SF 13 with NNUE, or some non-NNUE SF? Have the levels been recalibrated to allow for the improvement due to NNUE?

I wish I had that data. Everything I have is anecdotal, just based on my knowledge of the attempts to keep computer ratings in sync with human ratings over those 40-plus years. SSDF has been around for over 40 years, and certainly the late GM Tony Helund and one of the original founders of SSDF wanted SSDF ratings to be comparable to human ratings. So how good that correlation is in 2021 is the million-dollar question. One reason I like to say give or take 100 Elo ;>)

No, SF NNUE has not been recalibrated, and I strongly suspect it may now be "underrated" relative to the pre-NNUE SF ratings.

supersharp77 · Post by **supersharp77** » Thu May 27, 2021 2:50 am

mehmet123 wrote: ↑Tue May 25, 2021 11:49 pm Let's look at the results of Pocket Fritz 4. The performance of Pocket Fritz 4 was 2398 elo at Argentina 2008.
This is the last engine which competed in a human tournament. The engine of Pocket Fritz 4 was Hiarcs 13.1.
In this tournament Hiarcs played at 20 kn/s. On my notebook Hiarcs 13 searches 20x more positions on 1 core than Pocket Fritz 4 did.
The Elo difference between Stockfish 13 and Hiarcs 13.1 is +828 Elo according to the CEGT 40/20 rating list.
https://en.chessbase.com/post/breakthro ... enos-aires

I think Rybka 1.2 ( released in 2006/06 ) was the first engine to reach the 3000 human elo.

It is possible that back in the day (mid-2000s) one or more of the Rybka engines had a very, very high Elo. 3000+ in 2006? I can't recall Rybka 1.2 getting that high. In my stable of Rybka engines, Rybka v2.3.2a, Rybka 4.1, Deep Rybka 4.1 and Rybka Cluster 5 performed the strongest; the best performance Elo in one of my engine tests was probably around 2800-2850, tops. Rybka almost always came second best to Houdini, RobboLito, Komodo or Stockfish, and The King seemed to be able to play well against Rybka even years ago. Hiarcs, even though not the deepest, has remained strong to the current day as well (Elo 2700+).

Modern Times · Post by **Modern Times** » Thu May 27, 2021 3:00 am

MikeB wrote: ↑Thu May 27, 2021 2:26 am SSDF has been around for over 40 years, and certainly the late GM Tony Helund and one of the original founders of SSDF wanted SSDF ratings to be comparable to human ratings. So how good that correlation is in 2021 is the million-dollar question. One reason I like to say give or take 100 Elo ;>)

Do you know what rating tools SSDF has used in the past, and what they use today?

lkaufman · Post by **lkaufman** » Thu May 27, 2021 3:10 am

supersharp77 wrote: ↑Thu May 27, 2021 2:50 am
mehmet123 wrote: ↑Tue May 25, 2021 11:49 pm Let's look at the results of Pocket Fritz 4. The performance of Pocket Fritz 4 was 2398 elo at Argentina 2008.
This is the last engine which competed in a human tournament. The engine of Pocket Fritz 4 was Hiarcs 13.1.
In this tournament Hiarcs played at 20 kn/s. On my notebook Hiarcs 13 searches 20x more positions on 1 core than Pocket Fritz 4 did.
The Elo difference between Stockfish 13 and Hiarcs 13.1 is +828 Elo according to the CEGT 40/20 rating list.
https://en.chessbase.com/post/breakthro ... enos-aires

I think Rybka 1.2 ( released in 2006/06 ) was the first engine to reach the 3000 human elo.
It is possible that back in the day (mid-2000s) one or more of the Rybka engines had a very, very high Elo. 3000+ in 2006? I can't recall Rybka 1.2 getting that high. In my stable of Rybka engines, Rybka v2.3.2a, Rybka 4.1, Deep Rybka 4.1 and Rybka Cluster 5 performed the strongest; the best performance Elo in one of my engine tests was probably around 2800-2850, tops. Rybka almost always came second best to Houdini, RobboLito, Komodo or Stockfish, and The King seemed to be able to play well against Rybka even years ago. Hiarcs, even though not the deepest, has remained strong to the current day as well (Elo 2700+).

I can be pretty accurate with respect to rating Rybka 2.3.2a against humans, as I was deeply involved with it and we played several matches with GMs, some with pawn handicaps, but some were normal chess with special provisions like time odds, limited opening book, and GM always plays White. Time odds and the White pieces are pretty quantifiable, so I can estimate the rating of Rybka 2.3.2a pretty accurately. It ran on a quad initially, later an Octal, so maybe overall this is something like about 2 threads on the reference I7. On that hardware, under tournament time limits like 90' + 30" inc., it performed somewhat over 2900 FIDE with reasonable estimates for these conditions. The CCRL Rapid ratings of 2961 and 3017 on one and four threads are thus perhaps fifty elo too high for classical chess in human elo terms. I think this is the most solid info we have vs. humans on top engines in classical chess.

lkaufman · Post by **lkaufman** » Fri May 28, 2021 9:47 pm

MikeB wrote: ↑Wed May 26, 2021 12:47 am
Many people (and I am one of them) believe that Stockfish's Limit Strength rating of 2850 is probably pretty close to a human blitz rating of 2850 at game in 3 minutes plus 2 seconds increment. Full-strength SF will outperform SF 2850 by 700 or so, at least at 3' + 2". Now some of that difference is a testing artifact: when testing similar or identical engines, the engine that sees more will outperform the true difference. By how much, who knows, but even if you assume 100 Elo, you are still talking about Stockfish on a single core being probably at least 3400 human blitz. I would tend to agree that standard chess would drop that by 200 Elo or so, but the real answer is that we don't have the data, so who knows. Makes for good conversation anyway. Any rating scheme I have ever seen tends to inflate the ratings over time when you consistently have stronger engines (or human players) coming in. Most of the rating agencies will see that and try to correct for it at some point.
Editorial comment: I define "close" as within 100 Elo or so either way in this context.

Based on 500 blitz games I ran for SF13 with Limit Strength set to 2850 against Lc0cpu, and based on the results of that Lc0 vs. other engines, I can say that the above is not even close to accurate. My results would give it a CCRL blitz rating over 3200 and a CEGT blitz rating over 3100, but even these scales are too low compared to human FIDE blitz ratings. I am convinced that Nakamura or Carlsen wouldn't get more than a couple draws in ten games with SF13/2850 at 2' + 1". They would do a little better at 3' + 2", but not much better. I think it's off by at least 400 elo. Now this doesn't mean that the 2850 figure is wrong for standard chess (40 moves in 2 hours or equivalent), it may be about right in that case, I don't have data at that time control for it.
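For anyone who wants to repeat this kind of calibration, here is a rough sketch of turning a match result into an Elo gap with an approximate 95% error margin, using the logistic Elo model and a normal approximation. The win/draw/loss counts below are made up purely for illustration (the actual match score is not given above), and the function names are hypothetical.

```python
import math

def gap_from_score(p):
    """Elo gap implied by an expected score p under the logistic model."""
    return 400.0 * math.log10(p / (1.0 - p))

def elo_from_match(wins, draws, losses):
    """Return (gap, low, high): Elo gap with an approximate 95% interval."""
    games = wins + draws + losses
    score = (wins + 0.5 * draws) / games
    # Per-game variance of the result (1, 0.5 or 0) around the mean score.
    variance = (wins * (1.0 - score) ** 2
                + draws * (0.5 - score) ** 2
                + losses * (0.0 - score) ** 2) / games
    std_err = math.sqrt(variance / games)
    # Clamp so the logarithm stays defined for lopsided results.
    low_p = max(score - 1.96 * std_err, 1e-6)
    high_p = min(score + 1.96 * std_err, 1.0 - 1e-6)
    return gap_from_score(score), gap_from_score(low_p), gap_from_score(high_p)

# Hypothetical 500-game result, NOT the real data from the match above.
gap, low, high = elo_from_match(wins=100, draws=200, losses=200)
print(f"gap: {gap:+.0f} Elo, 95% interval about {low:+.0f} to {high:+.0f}")
```

Adding the resulting gap to the opponent's rating on a given list (CCRL, CEGT) places the tested engine on that list's scale. It says nothing by itself about how that scale maps to human FIDE ratings.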
