This is fascinating... the data I'm seeing experimenting with different training parameters for baking the NNUEs is producing initial results that look exactly like what you are talking about. As I bake the NNUEs more and more, the engine's tactical vision crystallizes, becomes sharper, but the lines become less interpretable by an LLM in terms of classical chess strategic themes. Elo goes up, intelligibility goes down. And the relationship is quite asymmetric: a few Elo lost buys a huge gain in intelligibility.

hgm wrote: ↑Mon Jan 05, 2026 9:56 am
Training a network to exactly reproduce win probability in principle destroys the engine's ability to convert a non-trivial win. E.g. a KBNK mate typically takes some 60 ply but, as long as no B or N is blundered away, is always a certain win. There is no way a search would reach 60 ply if it has no guidance for what to prune; without pruning you must be happy to reach 8 ply. And there is no guidance, neither for pruning nor for striving for intermediate goals (like driving the bare King into the corner), because all non-sacrificial end leaves will have an identical 100% score. This reduces the engine to a random mover biased against giving away any material. No way its random walk through state space will ever bring a checkmate within the horizon. At some point it will stumble into the 50-move barrier, and there is nothing it can do to avoid it.
A corollary is that when an engine is trained to reproduce the true win rate of the games played by its latest version, it can never achieve the exact win rate. While still ignorant, it will discover that it has a higher chance of winning in positions where the mate is close, simply because it is more likely to stumble on a position that has the mate within the search horizon. Those positions then get a higher score, which the search can see from further away, so the win rate there improves too. But it has to keep a score gradient large enough to guide play towards the mate from far away, and there is some minimum gradient for which this still works, because there will always be evaluation noise. So you get an equilibrium, where positions far from any mate only get, say, a 90% win rate. And that will indeed be their win rate, because the gradient from 90% to 100% is spread over so many moves that it becomes too small to follow in 10% of the cases.
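Both of hgm's points can be sketched with a toy Monte Carlo simulation (nothing here is from an actual engine; the 1-D "distance to mate" abstraction, the noise level, and the gradient values are all invented for illustration). A greedy player starts a fixed number of ply from mate under the 50-move barrier; its evaluation keeps a per-ply score gradient plus Gaussian noise. With zero gradient (a flat "certain win" score everywhere) play degenerates into a random walk; with a gradient well above the noise every win is converted; in between, an equilibrium win rate below 100% appears.

```python
import random

random.seed(0)

NOISE = 0.01        # assumed per-position evaluation noise (illustrative value)
DISTANCE = 40       # starting distance to mate, in ply (illustrative)
MOVE_LIMIT = 50     # the 50-move barrier

def win_rate(gradient, games=2000):
    """Fraction of games in which noisy greedy play converts the 'won' position.

    The position is abstracted to its distance-to-mate d; the two legal
    moves change d by -1 (toward mate) or +1 (away). Each candidate gets
    eval = -gradient * d + noise, and the player greedily picks the higher one.
    """
    wins = 0
    for _ in range(games):
        d = DISTANCE
        for _ in range(MOVE_LIMIT):
            if d == 0:
                break
            toward = -gradient * (d - 1) + random.gauss(0, NOISE)
            away = -gradient * (d + 1) + random.gauss(0, NOISE)
            d = d - 1 if toward > away else d + 1
        wins += (d == 0)
    return wins / games

for g in (0.0, 0.001, 0.01, 0.1):
    print(f"gradient {g:>5}: win rate {win_rate(g):.2f}")
```

With a flat evaluation the player is exactly the "random mover" described above: it would need 40 net correct steps out of 50, which essentially never happens. When the per-ply gradient is comparable to the noise, the simulated win rate settles at some equilibrium below 100%, and only a gradient well above the noise converts every game.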
It could be that the "eidos" of chess just isn't reducible to finite mathematical systems; at best you get fractured representations. If that's true, I wouldn't expect a maximally powerful chess engine to necessarily be easily intelligible, which is a difficulty pedagogically. It also suggests Stockfish might not be the optimal engine for evaluation; other, simpler or weaker engines might actually be superior.