Breakthrough in combining MoE into chess engine algorithm

Discussion of anything and everything relating to chess playing software and machines.

Moderator: Ras

Mark Tang
Posts: 26
Joined: Tue Feb 17, 2026 1:57 pm
Location: China
Full name: Mark Tang

Breakthrough in combining MoE into chess engine algorithm

Post by Mark Tang »

Our new engine Aether is the first chess engine to combine MoE with a chess engine's evaluation algorithm, and it has been a great success. Combining a Mixture of Experts (MoE) with the Efficiently Updatable Neural Network (NNUE) is an excellent way to break through the current computational bottleneck of chess engines and achieve a generational leap in algorithms. This combination integrates NNUE's extreme inference efficiency during search with MoE's enormous scalability in model capacity.
Specifically, the new algorithm can be designed along several dimensions:
1. Core architecture: NNUE as the "super-efficient expert" of MoE
A traditional NNUE (as used in Stockfish) usually consists of a large input layer (768- or 1024-dimensional features) followed by two to three small fully connected hidden layers. Its core advantage is the "incremental update": when only one piece moves on the board, the first layer's accumulated activations can be updated with a few additions and subtractions, compressing evaluation time to the nanosecond level.
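As a rough illustration of the incremental-update trick (a minimal toy sketch in Python, not Aether's or Stockfish's actual code; the class name, feature count, and hidden width are all made up):

```python
import numpy as np

HIDDEN = 256  # hidden width of the toy NNUE (real engines use 256 to 1024+)

class ToyAccumulator:
    """Toy NNUE accumulator: hidden pre-activations updated by add/sub."""

    def __init__(self, num_features=768):
        rng = np.random.default_rng(0)
        # First-layer weights: one column of HIDDEN values per input feature.
        self.w = rng.standard_normal((num_features, HIDDEN)).astype(np.float32)
        self.bias = np.zeros(HIDDEN, dtype=np.float32)
        self.acc = self.bias.copy()

    def refresh(self, active_features):
        """Full rebuild: sum the weight columns of all active features."""
        self.acc = self.bias + self.w[list(active_features)].sum(axis=0)

    def move_piece(self, removed_feature, added_feature):
        """Incremental update: one subtraction and one addition per move,
        instead of re-summing all ~30 active piece features."""
        self.acc -= self.w[removed_feature]
        self.acc += self.w[added_feature]
```

Moving a piece changes only the (piece, from-square) and (piece, to-square) features, so two column operations reproduce the result of a full refresh.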
In the new MoE architecture, each independent NNUE network can be set as an "expert":
Gate Network: receives macroscopic features of the current position (such as material balance, opening index, endgame stage), dynamically computes weights, and selects the Top-k NNUE experts best suited to the position.
Sparse Activation: suppose we have pre-trained 8 NNUE experts for different game situations; the gate network activates only 2 to 4 of them per evaluation. This means that while the model's total capacity (total parameter count) can be very large, the actual computational load per evaluation (the activated parameter count) stays low, meeting the strict speed requirements of alpha-beta search.
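The gate-plus-Top-k routing described above can be sketched as follows (illustrative only; the function names, feature vector, and expert interface are assumptions, not Aether's real implementation — a real engine would feed board features, not the gate's macro features, to each expert):

```python
import numpy as np

def softmax(x):
    e = np.exp(x - x.max())
    return e / e.sum()

def gate_and_evaluate(phase_features, gate_weights, experts, k=2):
    """Score every expert from macro features of the position, then run
    only the Top-k experts and blend their evaluations by gate weight."""
    logits = gate_weights @ phase_features          # one logit per expert
    probs = softmax(logits)
    top_k = np.argsort(probs)[-k:]                  # indices of the k best experts
    chosen = probs[top_k] / probs[top_k].sum()      # renormalise over chosen experts
    # Sparse activation: the remaining experts are never evaluated.
    return sum(w * experts[i](phase_features) for w, i in zip(chosen, top_k))
```

With 8 experts and k=2, only a quarter of the expert parameters are touched per evaluation, which is what keeps the per-node cost compatible with alpha-beta search.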
2. Expert Division: Evolution from "generalist" to "specialist"
A traditional single NNUE network must simultaneously learn to handle openings, complex middlegame tactics, and precise endgame evaluation. With MoE, a more refined division of labour becomes possible:
Opening/Middle Game Tactical Experts: Some NNUE experts are specifically trained through massive game databases and are good at evaluating complex piece entanglements and offensive situations.
Endgame Theory Experts: Another set of NNUE experts can be trained using endgame tables (EGTB) data (similar to the reverse learning paradigm of Seer-NNUE), specializing in precise evaluation of situations with few pieces.
Positional Experts: Specifically learning the advantages and disadvantages of long-term strategies and pawn structures.
The gate network will automatically weight and integrate the wisdom of these "specialists" based on the complexity of the current situation, thereby obtaining a more accurate situation score than a single network.
3. Solving the Computational Bottleneck: Breaking through the "memory wall" and improving inference speed
As chess AI develops, simply enlarging the hidden layers of a single NNUE runs into diminishing returns and increased computational latency. The MoE architecture brings significant engineering benefits:
Breaking Capacity Limitations: MoE allows us to expand the total parameter count of the model by tens or even hundreds of times (e.g., from tens of millions to billions), without linearly increasing the computational cost during inference. This enables the engine to "remember" and understand more profound chess principles.
Ultra-efficient Inference: thanks to NNUE's incremental updates combined with MoE's sparse computation (only a few small NNUE networks run per evaluation), the new engine can still maintain a blistering search speed of tens of millions of nodes per second (NPS) on consumer-level hardware (ordinary CPUs or even mobile devices).
4. Innovation in Training Paradigm
The training process for this algorithm is also novel. Expert pre-training: train each NNUE expert on a different dataset, drawing on human master games, self-play data, and endgame tablebases.
Gate network training: during self-play, use reinforcement learning to train the gate network so that it learns in which situations it should listen to which expert.
End-to-end fine-tuning: finally, train the gate network and the selected NNUE experts jointly, fine-tuning on a large amount of self-play data so that the evaluation scales of the experts align seamlessly.
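As a sketch of what the end-to-end fine-tuning objective might look like (an assumption, since the actual loss is not specified here), one can train the gate-blended evaluation to predict the self-play game result (+1 win, 0 draw, -1 loss):

```python
import numpy as np

def blended_eval(gate_probs, expert_evals):
    """End-to-end forward pass: gate-weighted sum of expert evaluations."""
    return float(np.dot(gate_probs, expert_evals))

def finetune_loss(gate_probs, expert_evals, game_result):
    """Fine-tuning target: squash the blended evaluation into [-1, 1] and
    penalise its squared error against the self-play game result."""
    pred = np.tanh(blended_eval(gate_probs, expert_evals))
    return (pred - game_result) ** 2
```

Because the gate weights and the expert outputs both appear in the forward pass, gradients of this loss flow into the gate and the experts at the same time, which is what "fine-tune them as a whole" amounts to.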
User avatar
towforce
Posts: 13036
Joined: Thu Mar 09, 2006 12:57 am
Location: Birmingham UK
Full name: Graham Laight

Re: Breakthrough in combining MoE into chess engine algorithm

Post by towforce »

This is a very good idea.

It reminds me of the article I wrote in 1997 for the "Selective Search" computer chess magazine proposing an engine with hundreds of evaluation functions, each suited to a particular type of position, and starting by selecting the most suitable one for the current position.

The article was well-received, but at that time everyone was removing knowledge from the eval, because deeper searching was removing the need for it: in that world, depth of search was everything - and would remain so until Google showed us differently 20 years later.
Human chess is partly about tactics and strategy, but mostly about memory
Mark Tang
Posts: 26
Joined: Tue Feb 17, 2026 1:57 pm
Location: China
Full name: Mark Tang

Re: Breakthrough in combining MoE into chess engine algorithm

Post by Mark Tang »

After AlphaZero was released to the world and NNUE was applied to chess programs, I kept wondering what the next big algorithmic change would be. Since LLMs seem to be a closer approach to AGI (as Demis Hassabis once said), I thought I might find inspiration there. In 2024, DeepMind already had a project called ChessBench that used a Transformer to play chess without any search or evaluation code; it reached almost 2895 Elo on Lichess and 95.4% accuracy in puzzle solving, better than Lc0 (but it needed 2.7 billion game data points and couldn't run on a normal computer). So I turned to MoE, and luckily it works quite well. Though the networks are much larger (about 1 GB in total), MoE gives the engine a more efficient and faster search. In our latest tests, without an opening book, Aether can defeat Stockfish by a slight margin in Bullet and Rapid, and it also solves puzzles better than Reckless and the Stockfish derivatives designed for analysis. Also, by training against opponents of different Elo, our experiments show a higher winning rate against weaker opponents compared to most top engines.
User avatar
Ajedrecista
Posts: 2231
Joined: Wed Jul 13, 2011 9:04 pm
Location: Madrid, Spain.

Re: Breakthrough in combining MoE into chess engine algorithm.

Post by Ajedrecista »

Hello:
towforce wrote: Thu Apr 30, 2026 9:57 am This is a very good idea.

It reminds me of the article I wrote in 1997 for the "Selective Search" computer chess magazine proposing an engine with hundreds of evaluation functions, each suited to a particular type of position, and starting by selecting the most suitable one for the current position.

The article was well-received, but at that time everyone was removing knowledge from the eval, because deeper searching was removing the need for it: in that world, depth of search was everything - and would remain so until Google showed us differently 20 years later.
Just for the record, the article was titled 'The FUTURE of Computer Chess' and was published at pages 29 and 30 of Selective Search issue 70 (June-July 1997):

http://www.chesscomputeruk.com/SS_70.pdf

The third-to-last paragraph of the article mentions '50,000 cases' and points to the book Chess Skill In Man and Machine. To add some context, that book was published circa 1983. The 50,000 figure can be read in Chapter 2 'Human chess skill', subchapter 'The road to mastery for man and machine', page 51:
[...]

Perception operates in two main areas: generating plausible moves and statically evaluating terminal positions. Not only will a plausible move be perceived more quickly by the master, but his long experience with many patterns on the board (50,000 or more) translates into the advantage of being able to evaluate the resulting position more readily.

[...]

Practice does not mean staring at and memorizing 50,000 patterns. It means learning to recognize types of positions and the plans or playing methods which go with them. [...]

[...]
Regards from Spain.

Ajedrecista.
User avatar
towforce
Posts: 13036
Joined: Thu Mar 09, 2006 12:57 am
Location: Birmingham UK
Full name: Graham Laight

Re: Breakthrough in combining MoE into chess engine algorithm.

Post by towforce »

Ajedrecista wrote: Thu Apr 30, 2026 7:27 pm Hello:
towforce wrote: Thu Apr 30, 2026 9:57 am This is a very good idea.

It reminds me of the article I wrote in 1997 for the "Selective Search" computer chess magazine proposing an engine with hundreds of evaluation functions, each suited to a particular type of position, and starting by selecting the most suitable one for the current position.

The article was well-received, but at that time everyone was removing knowledge from the eval, because deeper searching was removing the need for it: in that world, depth of search was everything - and would remain so until Google showed us differently 20 years later.
Just for the record, the article was titled 'The FUTURE of Computer Chess' and was published at pages 29 and 30 of Selective Search issue 70 (June-July 1997):

http://www.chesscomputeruk.com/SS_70.pdf

The third-to-last paragraph of the article mentions '50,000 cases' and points to the book Chess Skill In Man and Machine. To add some context, that book was published circa 1983. The 50,000 figure can be read in Chapter 2 'Human chess skill', subchapter 'The road to mastery for man and machine', page 51:
[...]

Perception operates in two main areas: generating plausible moves and statically evaluating terminal positions. Not only will a plausible move be perceived more quickly by the master, but his long experience with many patterns on the board (50,000 or more) translates into the advantage of being able to evaluate the resulting position more readily.

[...]

Practice does not mean staring at and memorizing 50,000 patterns. It means learning to recognize types of positions and the plans or playing methods which go with them. [...]

[...]
Regards from Spain.

Ajedrecista.
Wow! :D

Thank you for the link and the exact "Chess Skill In Man And Machine" book reference.
Human chess is partly about tactics and strategy, but mostly about memory
jorose
Posts: 390
Joined: Thu Jan 22, 2015 3:21 pm
Location: Zurich, Switzerland
Full name: Jonathan Rosenthal

Re: Breakthrough in combining MoE into chess engine algorithm

Post by jorose »

While I commend research being done into this space, I am currently not the biggest fan of how this is being done. This reads very LLM generated to me and very superficial while omitting key details.

First of all, your engine is not open source and we have no idea what is in it exactly. These results will differ wildly depending on your search code. Did you write it from scratch? Is it based on Stockfish? Leela? Something else? Is there a reason you have not shared your engine? I understand you may not want to share the code if you are working on a paper, but if you are ready to make big public claims I feel you should be at least sharing a binary with promise of a later source release.

Furthermore, could you clarify the similarities and differences between your network architecture and prior work? In Stockfish 14 a gating network was already used that could be argued to be a form of MoE, so could you clarify exactly how you differentiate yourself before declaring a "Breakthrough in combining MoE into chess engine algorithm"? What makes yours fundamentally different from, and an improvement over, Stockfish's network? Could you show experiments with proper statistical tests?

Next, based on your writing, it sounds like you are doing independent incremental updates per individual NNUE (if not, please clarify how your architecture is actually novel compared to, for example, the already 5 year old SF architecture). This means you have linear slowdown for your incremental updates. Could you clarify the exact model sizes we are comparing and where your computational bottleneck resides?

Finally, you say that "During self-play, [you] use reinforcement learning to train the gate network". Could you clarify what you mean by this? What algorithm are you using to optimize the gating network? Is this just your way of saying you are doing gradient descent on the whole system to predict the game outcome?
-Jonathan
Mark Tang
Posts: 26
Joined: Tue Feb 17, 2026 1:57 pm
Location: China
Full name: Mark Tang

Re: Breakthrough in combining MoE into chess engine algorithm

Post by Mark Tang »

Here is my response to your critique.
1. It's just an experimental chess engine and will never be released to the public. I just hope this idea can inspire other engine authors in their own work. No code is derived from Stockfish or Lc0. Evaluation: my own MoE network (some training data comes from TCEC games, Lc0's data, CCRL, and SPCC). Unlike traditional networks, Aether's networks are literally divided into 8 files, each of a different size. For example, the opening expert is the largest, at more than 200 MB (but it is not like a book bin: it has an opening preference, especially the Ruy Lopez when playing White, while still maintaining diversity). And according to my chess-master friend, Aether plays more interestingly before move 20 than Stockfish does.
2. ***Similarities and Differences to Prior Work (Especially Stockfish 14 Gating)***: Stockfish 14's "gating" is really a hard-coded two-branch gate that routes based only on king position. There is no dynamic selection, no load balancing, and no independent expert training, and both branches are always used, so it is effectively a dual-input symmetric network, not a true MoE.
My MoE architecture uses 8 independent expert networks with a dynamic soft gate and Top-k routing based on game phase, material, tactics, and king safety, not just king position. I also added a load-balancing loss to prevent expert collapse after seeing the initial test results. Though my networks have far more parameters in total, the engine actually runs faster and is stronger in specialized positions.
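The load-balancing loss is not specified here; a common choice in MoE systems (shown as an assumption, not necessarily what Aether uses) multiplies, for each expert, the fraction of positions routed to it by its mean gate probability and sums over experts:

```python
import numpy as np

def load_balancing_loss(gate_probs_batch, top1_choices, num_experts=8):
    """MoE auxiliary loss: for each expert, multiply the fraction of
    positions routed to it by its mean gate probability, then sum.
    The loss is minimised (value 1.0) when routing is uniform, so no
    expert collapses to handling everything or starves of training data."""
    n = len(top1_choices)
    frac_routed = np.bincount(top1_choices, minlength=num_experts) / n
    mean_prob = gate_probs_batch.mean(axis=0)
    return num_experts * float(np.dot(frac_routed, mean_prob))
```

Adding a small multiple of this term to the main training loss pushes the gate toward spreading positions across experts instead of routing everything to one favourite.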
3. For the gate training algorithm, I use phasic policy gradient plus supervised pre-training. The gate is not trained to predict the best move but to select which expert is best at judging the current position.
Stage 1: Supervised pre-training
Train on real-game positions from Lichess data, letting the gate predict the game outcome (win/draw/loss); this initializes the gate's routing.
Stage 2: Reinforcement Learning for the Gate
Bart Weststrate
Posts: 58
Joined: Fri Mar 18, 2016 3:34 pm

Re: Breakthrough in combining MoE into chess engine algorithm

Post by Bart Weststrate »

The OP is either very brilliant or complete bollocks. It smells like AI-generated rubbish, but let's see. :D
berttuyt
Posts: 2
Joined: Sun Nov 22, 2020 10:34 pm
Full name: Bert Tuyt

Re: Breakthrough in combining MoE into chess engine algorithm

Post by berttuyt »

It was Carl Sagan who popularized the phrase "extraordinary claims require extraordinary evidence". I think this wisdom also holds in computer chess.

The following text is completely, 100% written by ChatGPT, as an example.....

I’ve been experimenting with a different approach to move ordering and pruning in alpha-beta search, and wanted to share some early thoughts and results.

The core idea is to replace traditional history heuristics and Late Move Reductions (LMR) with a learned policy network that directly predicts move quality in a given position. Instead of relying on accumulated statistics (history tables, killer moves, etc.), the network outputs a probability distribution over legal moves, which is then used for:

Move ordering
Selective pruning / reductions
Potentially guiding extensions

Why this direction?
History + LMR work extremely well, but they’re still hand-crafted heuristics. A policy network could, in principle, learn richer positional patterns and long-term effects that aren’t easily captured by counters and decay schemes.

Setup (prototype level):

Classical alpha-beta framework
Policy network trained on self-play + high-depth engine data
Network used to:
Order moves
Modulate reductions instead of fixed LMR tables

Preliminary observations:

Move ordering quality is surprisingly competitive with tuned history heuristics.
The network tends to prioritize structurally meaningful moves (king safety, passed pawns, etc.) even in quieter positions.
Tuning is tricky—over-trusting the policy hurts tactical sharpness.

Results (very early, not conclusive):
In limited testing at short time controls, the approach is roughly in the same ballpark as a baseline engine with standard heuristics, but results vary significantly depending on tuning and position types.

Open questions:

Can a policy fully replace LMR, or is a hybrid approach inherently stronger?
How to best integrate uncertainty from the network into reduction decisions?
Is the added compute cost worth the gain versus highly optimized heuristics?

If anyone here has tried similar ideas (policy-guided alpha-beta without going full MCTS), I’d love to compare notes.
Mark Tang
Posts: 26
Joined: Tue Feb 17, 2026 1:57 pm
Location: China
Full name: Mark Tang

Re: Breakthrough in combining MoE into chess engine algorithm

Post by Mark Tang »

Yes, I have to admit that writing such a long text, which needs clear logic, is quite tiring for me. So I told an AI my idea and let it help me generate the whole passage, but the experiments and the idea behind it are all real and original.