I've been doing some research on chess engines, particularly the philosophical aspects touching on metaphysics (what is the essence of good chess?) and epistemology (how do we know what "good chess" is?). I'm just an amateur, a low-level player, but I've always been fascinated by chess engines and frustrated with Stockfish's evaluations. I think I've discovered some philosophical problems behind Stockfish's evaluation that speak to first principles, or the lack thereof, and I'm working on building my own neural network outside the Fishtest framework. So potentially, I'm doing a fork of Stockfish. We are using specially curated Lc0 training data, with no generation from Stockfish, because I don't trust Stockfish's default evaluation.
Other engines I've been impressed with are Obsidian and PlentyChess. My primary goal is a good analysis engine that delivers clear, interpretable insights (I've done amateur research on the AI "problem of interpretability" and internal representations, too). However, so far it looks like the engine we are working on with the new network is at least equal to Stockfish 17.1 in Cute Chess blitz games, and is a lot faster to train.
I tried talking about this on reddit, but folks there just said I was in a bongcloud haze. Nobody likes hearing that their favorite engine might have fundamental flaws, and philosophical critique is immediately dismissed... "but it wins tournaments, bro" and "what does metaphysics have to do with chess?" sort of thing. My primary idea is that there should be self-similarity in evaluation from ply to ply in chess search (which translates into evaluation stability), because that's the fundamental nature of good chess analysis (stability, positional judgement), and I figure that to get there, you need a clear internal representation of chess inside the network.
The reason I don't just use Lc0 is that I've had issues configuring it (less familiarity with the technology, perhaps?), and also that I think alpha-beta search might be fundamentally better than MCTS for catching tactics in an efficient way. Lc0 can get caught in traps sometimes. Our weights seem to play chess in a style similar to Tony Miles: prophylactic moves that are sometimes unorthodox but sound.
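If it helps make the stability idea concrete, here is a minimal Python sketch (my own illustration, not from any engine's code). Here `evals` stands for the centipawn scores a UCI engine reports at successive search depths, and the metric simply tracks how much the score wanders from one iteration to the next.

```python
def eval_stability(evals):
    """Measure ply-to-ply self-similarity of a search's evaluations.

    evals: centipawn scores reported at successive search depths
    (e.g. taken from a UCI engine's iterative-deepening "info" lines).
    Returns the mean absolute change and the worst single swing;
    lower values mean a more stable, self-similar evaluation.
    """
    if len(evals) < 2:
        return {"mean_abs_change": 0.0, "max_swing": 0.0}
    diffs = [abs(b - a) for a, b in zip(evals, evals[1:])]
    return {
        "mean_abs_change": sum(diffs) / len(diffs),
        "max_swing": float(max(diffs)),
    }
```

An engine with a self-similar evaluation should show a small mean change and no huge single swings as the depth grows.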
Hi... I'm working on engine research
Moderator: Ras
-
FireDragon761138
- Posts: 18
- Joined: Sun Dec 28, 2025 7:25 am
- Full name: Aaron Munn
-
towforce
- Posts: 12719
- Joined: Thu Mar 09, 2006 12:57 am
- Location: Birmingham UK
- Full name: Graham Laight
Re: Hi... I'm working on engine research
FireDragon761138 wrote: ↑Tue Dec 30, 2025 1:03 am ...there should be self-similarity in evaluation from ply to ply in chess search (which translates into evaluation stability), because that's the fundamental nature of good chess analysis (stability, positional judgement)...
Engines with HCE (hand-crafted evaluation) used to change their evaluations suddenly when search revealed something that the eval had missed. We know that HCEs missed a lot, because when they were replaced with NNs, the Elo rating jumped by hundreds of points, despite the eval taking a lot longer to run.
We also know that NNs miss a lot. They tend to encode a large number of simple "surface" features rather than encoding the deep underlying patterns of chess. We already knew this - but it was then demonstrated to us in a very stark way: in 2023 Kellin Pelrine, an amateur Go player, researched KataGo (link), found a fundamental weakness, and used it to beat that engine in a match - link. Basically, he demonstrated that, at that time, KataGo didn't understand the concept of a group of stones - which, from a human perspective, is a remarkable knowledge gap for an engine that had been thought to be better than any human!
Chess engine developers are heavily focused on incrementally improving the strength of engines right now. To my knowledge, nobody is looking into how to uncover the deep underlying patterns of the game.
Human chess is partly about tactics and strategy, but mostly about memory
-
syzygy
- Posts: 5824
- Joined: Tue Feb 28, 2012 11:56 pm
Re: Hi... I'm working on engine research
FireDragon761138 wrote: ↑Tue Dec 30, 2025 1:03 am I tried talking about this on reddit, but folks there just said I was in a bongcloud haze. Nobody likes hearing that their favorite engine might have fundamental flaws, and philosophical critique is immediately dismissed...
Well, an admitted newbie coming in to declare that there are "fundamental flaws" does reek of the Dunning-Kruger effect.
Why do you have to start off with such a dismissive tone?
Stockfish represents 80 years of accumulated engineering knowledge with contributions by 1000s of people. There is no credible sense in which Stockfish is "fundamentally flawed", even if you have come up with a new idea that could improve Stockfish by 100 Elo (unlikely).
Show us the code...
-
syzygy
- Posts: 5824
- Joined: Tue Feb 28, 2012 11:56 pm
Re: Hi... I'm working on engine research
Your reddit thread has nothing to do with what you are claiming here:
There you complain that Stockfish is not suitable for teaching you the principles of playing chess. Sure, you should get a book and/or a teacher. ChatGPT is probably better at teaching you to play chess at a beginner's level than Stockfish.
-
Frank Quisinsky
- Posts: 7205
- Joined: Wed Nov 18, 2009 7:16 pm
- Location: Gutweiler, Germany
- Full name: Frank Quisinsky
Re: Hi... I'm working on engine research
Hi syzygy,
really a strong answer!!
A while ago, around the start of 2025, I made an experiment with Stockfish (the last HCE version) and an NN version at the time control of 40 moves in 150 minutes (in a tourney with many other engines). I only have around 400 games per engine, but it seems that the difference between the two Stockfish versions is not more than 160 Elo. I know 400 games aren't enough, but I am really sure that with such long time controls the difference between a current Stockfish NN and the last Stockfish HCE version from the end of July 2020 is not more than around ~200 Elo (that is, with 1 core and such a long time control).
Such things don't show up in the rating lists, because they all use very fast time controls.
If I read "hundreds of Elo" ...
OK, 200 Elo is hundreds of Elo, but one could also interpret it differently.
I now dislike Elo. Everyone does their own calculations. For some, Stockfish is 4000, for others 3500.
There must be another way...
But don't ask me how, I've been racking my brains over it for years.
I wish you, and the other readers, a good start in 2026.
Best
Frank
Ah, the test results I have ...
40 in 150 ... 4.3 GHz on an AMD Ryzen 9 5950, without ponder, Windows 10, and of course my special "Normaly" FEOBOS opening book with 6-piece endgame tablebases.
Syzygy ... you know.

NNs are great, but in my humble opinion they are often grossly overrated by many others!!
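For anyone who wants to sanity-check numbers like these, the usual logistic Elo model converts a match score fraction into an Elo difference, and a rough margin of error shows why ~400 games still leaves a lot of uncertainty. A sketch (it ignores the variance reduction from draws, so the true margin is somewhat smaller than it reports):

```python
import math

def elo_from_score(score):
    """Elo difference implied by a score fraction (0 < score < 1),
    using the standard logistic Elo model."""
    return -400.0 * math.log10(1.0 / score - 1.0)

def elo_margin_95(score, n_games):
    """Rough 95% margin (in Elo) for a score measured over n_games.

    Treats each game as a Bernoulli trial, which overstates the
    variance when many games are drawn."""
    se = 1.96 * math.sqrt(score * (1.0 - score) / n_games)
    return elo_from_score(score + se) - elo_from_score(score)
```

For example, a 76% score corresponds to roughly 200 Elo, and at 400 games the margin is still on the order of 40 Elo either way.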
-
hgm
- Posts: 28433
- Joined: Fri Mar 10, 2006 10:06 am
- Location: Amsterdam
- Full name: H G Muller
Re: Hi... I'm working on engine research
syzygy wrote: ↑Wed Dec 31, 2025 12:21 am
FireDragon761138 wrote: ↑Tue Dec 30, 2025 1:03 am I tried talking about this on reddit, but folks there just said I was in a bongcloud haze. Nobody likes hearing that their favorite engine might have fundamental flaws, and philosophical critique is immediately dismissed...
Well, an admitted newbie coming in to declare that there are "fundamental flaws" does reek of the Dunning-Kruger effect.
Why do you have to start off with such a dismissive tone?
Stockfish represents 80 years of accumulated engineering knowledge with contributions by 1000s of people. There is no credible sense in which Stockfish is "fundamentally flawed", even if you have come up with a new idea that could improve Stockfish by 100 Elo (unlikely).
Show us the code...
Isn't it a fundamental flaw to go just for high Elo (which is an average property) rather than for reliability (which guarantees some minimum performance in every conceivable situation)?
"A statistician waded through a river that was on average only 1m deep. He drowned."
-
syzygy
- Posts: 5824
- Joined: Tue Feb 28, 2012 11:56 pm
Re: Hi... I'm working on engine research
hgm wrote: ↑Wed Dec 31, 2025 9:40 am Isn't it a fundamental flaw to go just for high Elo (which is an average property) rather than for reliability (which guarantees some minimum performance in every conceivable situation)?
Chess is a game where you win some and lose some. You have an unlimited number of lives. What counts is the average performance, i.e. maximum Elo. And since this is measured over whole games, a minimum reliability restriction is built in already. (On the other hand, if you are testing on test suites, which is what the OP seems to think is a good idea, at least judging from his Reddit posts, you will end up with an engine that can fumble a brilliant game with one blunder move.)
-
towforce
- Posts: 12719
- Joined: Thu Mar 09, 2006 12:57 am
- Location: Birmingham UK
- Full name: Graham Laight
Re: Hi... I'm working on engine research
Frank Quisinsky wrote: ↑Wed Dec 31, 2025 12:45 am A while ago, around the start of 2025, I made an experiment with Stockfish (the last HCE version) and an NN version at the time control of 40 moves in 150 minutes (in a tourney with many other engines)...
The experiment FireDragon wants is an investigation into eval variation from move to move, so one could measure the average eval change, the median eval change, and the average (or median) per-game maximum eval change between two consecutive moves, over 100 games or so.
FireDragon said he or she was doing some research: maybe we'll see the results of this - but I'm not going to hold my breath while I wait.
I'm not going to say that this information would be "philosophical" or that it would indicate a "fundamental flaw", but a tendency to large evaluation changes between moves would indicate weakness in the eval code, since it misses things that are then uncovered by search in the next move.
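A quick sketch of how those three statistics could be computed (illustrative only; `games` is assumed to hold one list of same-side centipawn evals per game):

```python
from statistics import mean, median

def eval_swing_stats(games):
    """Aggregate eval-change statistics over many games: the average
    change, the median change, and the average per-game maximum change
    between consecutive moves.

    games: list of games, each a list of centipawn evals (one per move,
    always from the same side's point of view).
    """
    all_diffs, per_game_max = [], []
    for evals in games:
        diffs = [abs(b - a) for a, b in zip(evals, evals[1:])]
        if diffs:
            all_diffs.extend(diffs)
            per_game_max.append(max(diffs))
    return {
        "avg_change": mean(all_diffs),
        "median_change": median(all_diffs),
        "avg_max_change_per_game": mean(per_game_max),
    }
```

A large gap between the median change and the average per-game maximum would be exactly the "search keeps uncovering what the eval missed" signature described above.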
Human chess is partly about tactics and strategy, but mostly about memory
-
syzygy
- Posts: 5824
- Joined: Tue Feb 28, 2012 11:56 pm
Re: Hi... I'm working on engine research
towforce wrote: ↑Wed Dec 31, 2025 12:09 pm I'm not going to say that this information would be "philosophical" or that it would indicate a "fundamental flaw", but a tendency to large evaluation changes between moves would indicate weakness in the eval code, since it misses things that are then uncovered by search in the next move.
But he then dismisses the idea of training an engine on its own evaluations, whereas that is exactly how one would go about reducing evaluation changes from ply to ply, namely by teaching the neural net to recognize the evaluation at ply=d+1 already at ply d or earlier.
Of course it is inevitable that some things at ply=d+1 will be missed at ply=d. Otherwise there would be no need to search.
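That distillation idea, teaching the evaluation at ply d to anticipate what search sees at ply d+1, can be shown in miniature. The sketch below is purely illustrative (a one-parameter linear stand-in for a network, nothing like Stockfish's actual trainer): it fits a single weight so that the "shallow" eval predicts the "deep" eval.

```python
def fit_shallow_to_deep(shallow, deep, lr=1e-5, steps=2000):
    """Fit w so that w * shallow[i] approximates deep[i] by gradient
    descent on mean squared error. The learning rate is tuned for the
    centipawn-scale inputs of this toy example."""
    w = 1.0
    n = len(shallow)
    for _ in range(steps):
        # gradient of the mean squared error with respect to w
        grad = sum(2.0 * (w * s - d) * s for s, d in zip(shallow, deep)) / n
        w -= lr * grad
    return w
```

A real trainer does the same thing with millions of positions and a full network in place of the single weight, with the deeper search score as the regression target.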
-
FireDragon761138
- Posts: 18
- Joined: Sun Dec 28, 2025 7:25 am
- Full name: Aaron Munn
Re: Hi... I'm working on engine research
The way Stockfish bakes depth-12 analysis at depth 0 into the neural network isn't the only way to design a network with a rich density of information. In our case, we are working with an Lc0 game dataset. I believe Monte Carlo Tree Search gives more trustworthy positional information than simply having the engine iterate on itself, and with less bias towards seeing chess as a series of forcing moves. I want to maintain strategic flexibility at lower depths of search.