Engine vs. engine rating difference

Fritz 0 · Post by **Fritz 0** » Wed Apr 27, 2022 10:55 am

It has been said that the range of engine ratings is expanded compared to the range of human ratings. I understand this if we compare the same engine at different search depths. For example, Komodo 14 level 21 vs. Komodo 14 level 20 (8 ply vs. 7 ply) shows a 160-170 Elo difference, while the real difference (in human terms) is estimated to be 114. One more ply, with everything else being equal, means much more to the engine than to the human; that is understandable. But if we compare, for instance, Dragon and Stockfish, both at 12 ply, why would the difference be expanded? I'm not saying this is not true; I just don't get the reason for it and would appreciate it if someone explained it to me.

lkaufman · Post by **lkaufman** » Wed Apr 27, 2022 5:41 pm

If one engine is significantly stronger than the other at a fixed depth like 12 ply, and both are using NNUE, it is probably mostly due to some difference in pruning or reduction or extension. So the stronger engine may still be typically seeing one ply deeper in most lines, for example, even if they are both reporting "12 ply", which is really an iteration count.
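To illustrate the point that a reported "12 ply" is an iteration count rather than the depth actually reached in every line, here is a toy sketch. It is not how Komodo, Dragon, or Stockfish search; the function names, the fixed branching factor, and the single `late_move_reduction` parameter are all assumptions made up for the illustration. It only shows that two searches with the same nominal depth can stop at quite different plies in most lines.

```python
def leaf_plies(depth, ply=0, branching=3, late_move_reduction=0):
    """Return the plies at which a toy fixed-depth search stops.

    Hypothetical model: the first move at each node is searched at full
    depth, and every later move is searched with its remaining depth
    reduced by `late_move_reduction` plies.
    """
    if depth <= 0:
        return [ply]
    plies = []
    for move_index in range(branching):
        reduction = late_move_reduction if move_index > 0 else 0
        plies.extend(
            leaf_plies(depth - 1 - reduction, ply + 1, branching, late_move_reduction)
        )
    return plies


def average_leaf_ply(nominal_depth, late_move_reduction):
    """Average ply reached over all lines for a given nominal depth."""
    plies = leaf_plies(nominal_depth, late_move_reduction=late_move_reduction)
    return sum(plies) / len(plies)


if __name__ == "__main__":
    nominal_depth = 8  # both toy "engines" would report this same depth
    for reduction in (0, 1, 2):
        print(reduction, average_leaf_ply(nominal_depth, reduction))
```

In this toy model the first-move line always reaches the nominal depth, but the average line gets shallower as the reduction grows, so the "engine" that reduces less is effectively seeing deeper in most lines while reporting the same depth.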

Note that when the search depth drops to one ply, and the Elo differences are due to varying amounts of Variety or randomness at different levels, the opposite may be true: engine ratings may contract compared to human ratings. At least that's what it looks like to me now.

Fritz 0 · Post by **Fritz 0** » Wed Apr 27, 2022 8:33 pm

So it boils down to the search depth. Does that mean that all rating differences between engines are expanded (regardless of the reason for the difference) and are really smaller in human terms? If so, is there a universal formula to convert them to human rating differences?

lkaufman · Post by **lkaufman** » Wed Apr 27, 2022 8:58 pm

Well, if you are comparing NNUE engines with ancient engines, then eval is also very different, and the effect is not the same; for very weak engines it's also quite different. I used to say that Elo differences on engine lists should be contracted by 25% for human interpretation, and I still think that's reasonably close to the truth for most engines above 2000 Elo or so. But there is some evidence that better eval (as in NNUE) pays off more against humans than against dumber but deeper-searching engines, so it's not a perfect guide.
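As a rough sketch of the 25% rule of thumb above, the arithmetic could look like this. The function names and the `contraction` parameter are hypothetical, the 25% figure is only the approximation given in this post (and, as noted, not a perfect guide), and `expected_score` is the standard logistic Elo expectation, added here just to show what a contracted difference means in terms of game results.

```python
def human_equivalent_diff(engine_diff, contraction=0.25):
    """Contract an engine-list Elo difference for human interpretation.

    The 0.25 default is the rule of thumb from the post, suggested for
    most engines above roughly 2000 Elo; it is an approximation.
    """
    return engine_diff * (1.0 - contraction)


def expected_score(elo_diff):
    """Standard logistic Elo expected score for the higher-rated side."""
    return 1.0 / (1.0 + 10.0 ** (-elo_diff / 400.0))


if __name__ == "__main__":
    engine_diff = 200  # example gap taken from an engine rating list
    human_diff = human_equivalent_diff(engine_diff)
    print(engine_diff, expected_score(engine_diff))
    print(human_diff, expected_score(human_diff))
```

For comparison, the Komodo 14 figures in the first post (160-170 Elo between levels versus an estimated 114 in human terms) correspond to a ratio of roughly 0.67 to 0.71, a somewhat stronger contraction than 25%.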
