Human playable levels

Discussion of chess software programming and technical issues.

Moderator: Ras

lucasart
Posts: 3241
Joined: Mon May 31, 2010 1:29 pm
Full name: lucasart

Human playable levels

Post by lucasart »

I have been experimenting lately on how to weaken engines to make them playable opponents for humans. Hopefully this will inspire more of you to implement it in your engines.

What's wrong with fixed depth?
The most basic setup: level L plays at depth L. This is problematic for a few reasons
  • Endgames: fewer pieces mean less calculation is needed. Pawn races in pawn endgames, in particular, are positions humans can calculate almost exhaustively. So we would need more depth as the Effective Branching Factor (EBF) reduces.
  • Search explosion: some positions have an explosive EBF, so you need to impose a node cap besides the depth limit, or the engine might get stuck and never reach the target depth. Less common in modern engines, but still a reality to contend with: time to depth, or nodes to depth, can vary widely.
How about fixed nodes?
Level L plays with nodes ~ b^L, for some b > 1 which controls how far apart the levels are. This is already better: it reduces the aforementioned endgame problem and eliminates the search-explosion problem.

However, the minimum search possible is still 1 ply plus quiescence search. Couple that with a strong evaluation, especially an NN-based one, and the lowest level will be stronger than most humans, and completely out of reach for beginners. Even if you win, it's not satisfying: you get asphyxiated by the computer until, out of nowhere, it blunders a simple tactic. So it feels as if you were losing until the engine hit a bug that let you win, undeservedly.

So we must dumb down the evaluation dramatically, not only the search.

Fixed nodes with noisy evaluation
Level L setup:
  • Nodes ~ b^L (b >1): exponential growth
  • Noise ~ d^L (d<1): exponential decay
eval(node) is modified by adding a random number drawn from a distribution scaled by Noise. Intuitively you want fat tails, so a logistic distribution should be good. It is also easy to implement, since its inverse CDF (probability-to-quantile function) is easy to write.

Additionally, make sure to re-seed the RNG randomly, so that when initial conditions are repeated (e.g. a new game), the computer plays differently. Otherwise humans keep replaying and memorising the same game until they win, by iteration. Plus you want your training partner to expose you to diverse play.

Note that noise has a side effect on search: nodes to depth increase with a noisy evaluation (it messes up alpha-beta, causes aspiration failures, etc.).

Endgame adjustments
Experience shows that you still need to increase nodes and reduce noise in endgames (e.g. enough to win KQK without making the lowest level too strong otherwise). For example, say you want to add one level's worth of nodes for every material halving:
  • r = piece material / initial piece material (counting piece values for both sides).
  • Nodes *= b^(-log2(r))
  • Perhaps flooring r to limit the node increase.
Same treatment can be applied for the Noise.

Disable polluting variables
Be sure to force Threads=1 in Level mode. Otherwise, SMP causes distortions: both search widening (more Elo per depth) and reduced search efficiency (more nodes to depth).

Same for Hash: make it fixed, e.g. 1MB, perhaps more for the highest levels if needed.

Is it a zero cost approach?
Almost, but not quite. It adds a branch to the evaluation function, which is on the critical execution path of the engine. But it's just a bool check like

Code:

if (noisy_eval) { ... }
Unlikely to cost a measurable amount of elo.

PS: I put some levels on lichess.
https://lichess.org/@/Demolito_L3 (up to level 6, after that it's too strong for humans).
Theory and practice sometimes clash. And when that happens, theory loses. Every single time.
lucasart
Posts: 3241
Joined: Mon May 31, 2010 1:29 pm
Full name: lucasart

Re: Human playable levels

Post by lucasart »

lucasart wrote: Mon Mar 31, 2025 10:32 am What's wrong with fixed depth?
The most basic setup: level L plays at depth L. This is problematic for a few reasons
An additional problem I forgot to mention is the depth parity effect. All else equal, going from depth L to depth L+1 gains more when L is odd.
Theory and practice sometimes clash. And when that happens, theory loses. Every single time.
User avatar
hgm
Posts: 28321
Joined: Fri Mar 10, 2006 10:06 am
Location: Amsterdam
Full name: H G Muller

Re: Human playable levels

Post by hgm »

In the days of 8-bit microprocessors, and in the Interactive Diagram, I use selective extension of moves that are a logical continuation of the previous ply. In particular these are moves that were not possible (or obviously different) on the ply before. E.g. because they go to or over, or leave from a square with recently modified occupancy. (So recapture, discovered attacks, pin enforcement, plunder.) Humans are much less likely to overlook such moves than moves that became good for more complex reasons. Even in the root you could prune moves that are not 'tactically connected' to the previous two game half-moves, to get a sub-1-ply strength. This is much more natural than randomization, which would sometimes suppress completely obvious moves.
lucasart
Posts: 3241
Joined: Mon May 31, 2010 1:29 pm
Full name: lucasart

Re: Human playable levels

Post by lucasart »

hgm wrote: Tue Apr 01, 2025 8:58 am In the days of 8-bit microprocessors, and in the Interactive Diagram, I use selective extension of moves that are a logical continuation of the previous ply. In particular these are moves that were not possible (or obviously different) on the ply before. E.g. because they go to or over, or leave from a square with recently modified occupancy. (So recapture, discovered attacks, pin enforcement, plunder.) Humans are much less likely to overlook such moves than moves that became good for more complex reasons. Even in the root you could prune moves that are not 'tactically connected' to the previous two game half-moves, to get a sub-1-ply strength. This is much more natural than randomization, which would sometimes suppress completely obvious moves.
Thanks. But I think what you are trying to do here is attack a much harder problem: how to play badly in a human way? Ultimately, any scalable solution to this (hard) problem would have to involve NN training on human games to predict human moves: see the Maia Chess project.

Indeed, random noise with search limits will not do that. It will dumb down the engine so games are littered with inaccuracies, mistakes, and blunders, in a statistically consistent way. But humans don't make random mistakes: they make mistakes because of time pressure, tactical pressure from the opponent (especially consecutive), or because the logical moves are sometimes wrong for unusual reasons (breaking our pattern recognition), etc.

It's just meant to be simple, easy to implement, and almost zero cost (in performance and code maintenance) for chess engines.
Theory and practice sometimes clash. And when that happens, theory loses. Every single time.
User avatar
hgm
Posts: 28321
Joined: Fri Mar 10, 2006 10:06 am
Location: Amsterdam
Full name: H G Muller

Re: Human playable levels

Post by hgm »

But it is simple, and low cost. Using an NN is overkill. Humans make two kinds of errors: tactical and strategic. The strategic part can be implemented by mistuning the evaluation: not by randomizing it on a move-to-move basis, but by consistently altering the weights of the various eval terms a lot for the entire game.

Tactical errors occur when an incomplete search fails to predict the best outcome. Humans tend to overlook moves, but in a consistent way. So not by randomly pruning some moves in each node, but by pruning the same move in every node of the tree. And the chance that a move will be overlooked is not the same for every move. It might depend on the type of the piece that moves (e.g. some weak players have trouble imagining Knight tours). But it definitely depends on how deep the move is buried in the search tree. Moves that are possible in the root, or are an obvious reply to a root move (e.g. a recapture), are not likely to be overlooked. A move that occurs for the first time at ply 8 will surprise people much more often.

So you could start with a shallow search and build a butterfly table recording the earliest time each move occurs in the search tree (either as a pseudo-legal move, or as a move with a reasonable score). Then decide, based on this table and some tactical-strength parameter, which moves to prune. Finally, you do the real search with the chosen moves pruned in every node of it, to decide on the move. Easy enough.
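A rough sketch of such an earliest-occurrence butterfly table, with a simple depth threshold standing in for the tactical-strength parameter (all names hypothetical):

```c
#include <limits.h>

// Butterfly table: earliest ply at which each (from, to) move was
// seen during the preliminary shallow search.
static int earliestPly[64][64];

static void reset_earliest(void)
{
    for (int from = 0; from < 64; from++)
        for (int to = 0; to < 64; to++)
            earliestPly[from][to] = INT_MAX;
}

// Call from move generation during the preliminary search.
static void record_move(int from, int to, int ply)
{
    if (ply < earliestPly[from][to])
        earliestPly[from][to] = ply;
}

// In the real search, prune moves first seen deeper than the level's
// tactical horizon; moves possible near the root are always kept.
static int is_pruned(int from, int to, int tacticalHorizon)
{
    return earliestPly[from][to] > tacticalHorizon;
}
```

Because the table is indexed by (from, to) only, a move pruned once stays pruned in every node of the real search, which gives the consistent blind spots described above.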

BTW, I think that trying to teach an NN to blunder by training on human blunders is a very poor method. Humans have different misconceptions and weaknesses: one player will lose because he overvalues passers, another because he undervalues them, and a third because he overlooked the recapture that created the passer in the first place. An NN would take the average of these as the globally best fit, which tends to cancel out the misconceptions.