Real engine ELO - normalised to classic time controls

towforce · Post by **towforce** » Wed Jan 07, 2026 11:58 pm

Peter Berger wrote: ↑Wed Jan 07, 2026 7:55 pm The modern engines (I am mainly talking about Stockfish and lc0 here) clearly have a problem beating way, way weaker engines often enough at classical time controls from the standard opening position.

From time to time I download their latest and greatest versions and pit them against Crafty for fun. It simply never happens that Crafty doesn't get a draw in like 10 games. This should happen like never.

Some of this could be cured by simple opening books, but from what I have seen in my fun experiments, this simply hides the phenomenon, and it is still there.

Chess is drawish, but not +that+ drawish. If these engines were playing against weaker opponents more often in their testing procedure, I am convinced, that Crafty wouldn't get these draws.

Good thought provoking post.

Question to testers: do you test strong v weak engines at long time controls?

towforce · Post by **towforce** » Fri Jan 09, 2026 9:49 am

Would this be a correct simplification of this issue?

* Chess is a drawn game.

* Winning a chess game requires the opponent to make a mistake.

* The longer an engine thinks, the more likely it is to uncover something bad about the erroneous move it has chosen, and hence choose a different move.

* Hence at longer time controls, the probability of a weak engine making a fatal mistake falls dramatically due to search alone being able to uncover errors.
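To put rough numbers on the points above, here is a minimal sketch that converts a match result into an Elo difference using the standard logistic Elo formula. The function names and the 10-game examples are illustrative assumptions, not data from any rating list; the point is only that each draw the weaker engine salvages shrinks the measured gap considerably.

```python
import math


def match_score(wins: int, draws: int, losses: int) -> float:
    """Score fraction for the stronger side: a win counts 1, a draw 0.5."""
    games = wins + draws + losses
    if games == 0:
        raise ValueError("no games played")
    return (wins + 0.5 * draws) / games


def elo_diff_from_score(score: float) -> float:
    """Invert the logistic expectation score = 1 / (1 + 10 ** (-d / 400))."""
    if not 0.0 < score < 1.0:
        # A 100% or 0% score has no finite Elo difference under this model.
        raise ValueError("score must be strictly between 0 and 1")
    return 400.0 * math.log10(score / (1.0 - score))


if __name__ == "__main__":
    # Hypothetical 10-game matches in which the weaker engine never wins
    # but holds a varying number of draws.
    for draws in (1, 3, 5):
        score = match_score(10 - draws, draws, 0)
        print(f"{draws} draws in 10: score {score:.2f}, "
              f"implied gap {elo_diff_from_score(score):.0f} Elo")
```

Under this model a single draw in ten games already caps the implied gap at a little over 500 Elo, whatever the rating lists say at faster time controls. With only 10 games the error bars are of course very wide.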

ydebilloez · Post by **ydebilloez** » Fri Jan 09, 2026 10:54 am

I see a strong correlation between the two graphs, so I am reposting them next to each other.

jkominek wrote: ↑Tue Jan 06, 2026 2:42 am I would put it differently. Instead of saying that at least one of the (CCRL) lists is wrong, i.e. miscalibrated,
...
But I can give you an idea of how Stockfish scales with the "Chess 324" opening book. In the following plot I have established Stockfish 10 as the baseline against which future releases are compared.

...

jkominek wrote: ↑Tue Jan 06, 2026 9:11 pm ...

To create this plot I began with a fully connected round robin tournament of up to 2^18 fixed nodes per move, in node doubling steps, with the absolute scale anchored to Gaviota 1.0 on CCRL 40-15. Above that threshold each Stockfish version played only against itself at different node budget odds.
...

I am sorry to have used the word 'wrong' instead of 'miscalibrated'. You are absolutely right. At long time controls, which means high node counts, the relative difference between the engines is much smaller. It probably also means that in classical games, the Elo edge of engines over human players would be much smaller, due to the draw rate. Let's find a sponsor and a super GM to play 100 classic games against a SF 17....

jkominek · Post by **jkominek** » Fri Jan 09, 2026 1:11 pm

ydebilloez wrote: ↑Fri Jan 09, 2026 10:54 am I see a strong correlation in between two graphs, so I repost them again next to each other.

I am sorry to have used the word 'wrong' instead of 'miscalibrated'. You are absolutely right. At long time controls, which means high node counts, the relative difference between the engines is much smaller. It probably also means that in classical games, the Elo edge of engines over human players would be much smaller, due to the draw rate.

The bottom graph you quoted is wonky (biased) because I was selectively narrowing data to illustrate a curiosity. To your point here's a better one.

The hill-shaped graph is derived from this one by taking differences relative to Stockfish 10, with the mass of lower curves discarded. The older versions of Stockfish do not start at the very left-hand side because they do not have the granularity to measure low node counts per move. The very lowest curve, running from 16 to 25 on the x-axis, is Glaurung 1.01, the predecessor to the Stockfish family.

Let's find a sponsor and a super GM to play 100 classic games against a SF 17....

A long shot, but with sufficient money, maybe.

For a Man-Machine calibration event - and there hasn't been one of those in a long time - my proposal would be this. Set a family of engines in strength steps of 100 from 2400 to 3200, estimated as best can be done. Offer players a betting scenario with the leeway to choose their opponent. The stronger opponent they choose to play the more money they earn from a draw or a win. Players can choose a different opponent before the beginning of the next round. I'd propose starting with Blitz and Rapid events to see if the format catches on. Promote the event as a fun exhibition match -- Who can come out on top versus the machines? Motivate the players with a big money board. On account of larger winnings being possible by playing stronger engines, a come-from-behind victory is always possible. (In Yasser Seirawan's voice: "Oooh. Nakamura goes big in the final round! Can he pass Carlsen?") Additional prize money is awarded for finishing first through third among human players.
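As a rough illustration of the choice a player would face, the sketch below lists the proposed ladder (2400 to 3200 in steps of 100) and the expected score against each rung under the standard logistic Elo model. The 2800 player rating is a hypothetical example, and no payout scheme is modelled since none is specified above.

```python
# Engine ladder as proposed: 2400, 2500, ..., 3200.
ENGINE_LADDER = list(range(2400, 3201, 100))


def expected_score(player: float, opponent: float) -> float:
    """Standard logistic Elo expectation for `player` against `opponent`."""
    return 1.0 / (1.0 + 10.0 ** ((opponent - player) / 400.0))


def ladder_expectations(player_rating: float) -> dict[int, float]:
    """Expected score per game against every rung of the ladder."""
    return {engine: expected_score(player_rating, engine)
            for engine in ENGINE_LADDER}


if __name__ == "__main__":
    # Hypothetical 2800-rated player; the rating is an assumption for the example.
    for engine, score in ladder_expectations(2800).items():
        print(f"vs {engine}: expected score {score:.3f}")
```

Any payout schedule would need to grow faster than these expectations fall for picking a stronger opponent to be the better bet.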

I wouldn't use Stockfish either. Its UCI_Elo setting algorithm is, ah, somewhere between "nice try" and brain-dead. I'd train specialty Leela nets to play like Grandmasters at those specified levels.
