Ways to avoid "Draw Death" in Computer Chess

Discussion of chess software programming and technical issues.

Moderators: hgm, Rebel, chrisw

User avatar
Laskos
Posts: 10948
Joined: Wed Jul 26, 2006 10:21 pm
Full name: Kai Laskos

Re: Ways to avoid "Draw Death" in Computer Chess

Post by Laskos »

AlvaroBegue wrote:
Laskos wrote:
AlvaroBegue wrote:H.G.'s criticism of Bayesian inference is very common among scientists and engineers: If we can't agree on what the prior distribution looks like, we cannot agree on the conclusions.

I tend to favor Bayesian statistics more than most, but in this particular case Kai seems to be using the prior to make it look like we have more information than we really do.

So we have a coin, flipped it 6 times and it came down heads 5 times and tails once. How certain are we that the coin is not fair? Well, for a Bayesian analysis you need to know something about the origin of the coin, so we can have some a priori distribution for the hidden parameter p, the true probability of getting heads. Unfortunately in the real world you don't know where the coin came from. So the best we can do is try to quantify the evidence that we got from the observations we have. For instance, you can compute how often you expect to get a result as lopsided as the one you observed, if the coin were fair. In this case, it's 11% of the time. That's easy to interpret and doesn't depend on a prior that we can't agree on. That's why it's valuable. Go LOS!
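
A quick check of that ~11% figure (a minimal sketch, not part of the original exchange, assuming scipy is available): the chance of at least 5 heads in 6 tosses of a fair coin.

Code: Select all

from scipy.stats import binom

# P(at least 5 heads in 6 tosses of a fair coin) = (C(6,5) + C(6,6)) / 2**6 = 7/64
p = binom.sf(4, 6, 0.5)
print(p)  # 0.109375, i.e. roughly the 11% quoted above
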
I don't know why you are so infatuated with the p-value and the frequentist approach. Yes, it gives scientific predictions. But a p-value stopping rule is more of an art, and is not even sound on theoretical grounds. The Type I error is unbounded, and for an infinite number of games it is 100%. The p-value, i.e. LOS with an uninformed prior, can clumsily and inefficiently be used as a stopping rule in our computer chess case, because the divergence is logarithmic, so over some range of data the Type I error can be controlled. Did you ever see what Type I error accumulates for a p-value 0.05 stopping rule over 1000 games? I could dig up my older posts here.
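
A minimal simulation sketch of that accumulation (not from the older posts; the 60% draw rate and the running z-test are assumptions): two equal engines, stopping whenever the two-sided p-value drops below 0.05.

Code: Select all

import numpy as np

rng = np.random.default_rng(1)
runs, max_games = 2000, 1000
stopped = 0
for _ in range(runs):
    # per-game score of engine A vs an equally strong engine B: loss, draw or win
    s = rng.choice([0.0, 0.5, 1.0], size=max_games, p=[0.2, 0.6, 0.2])
    n = np.arange(1, max_games + 1)
    mean = np.cumsum(s) / n
    var = np.maximum(np.cumsum(s * s) / n - mean * mean, 1e-12)
    z = (mean - 0.5) / np.sqrt(var / n)
    if np.any(np.abs(z[20:]) > 1.96):   # "p < 0.05" reached at some point after game 21
        stopped += 1
print(stopped / runs)   # far above 0.05: the nominal Type I error is not maintained
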
I am not proposing to use p-value as a stopping rule. If you were to use the t-value of the posterior distribution in your Bayesian setup as a stopping rule, that wouldn't fare too well either. This is not at all what we are talking about.
For high t-values (3.5 and above in the coin and chess cases) stopping rules can be safely, though not very efficiently, applied in both the Bayesian and the frequentist approach. I will re-post something I already posted here in the past, a bit modified:

================================================================================================
Suppose a man came to you with a coin, and said "whenever heads come up I win a dollar, whenever tails come up you win a dollar". You believe the coin is fair, and start the game. Your prior for the coin is the following:

[Image: prior distribution for the coin, symmetric around p = 0.5]

Based on that you estimate the a priori LOS of the coin at 50.0%, the game is fair.

After 5 tosses the result is unfavorable: 5 heads, 0 tails.
Based on that, you estimate the LOS of the coin with this prior at 55.3%.
With an "uninformed prior" the LOS is about 98.4% (t-value about 2.1), and you are not able to stop yet (a quick check of this number is sketched just after this re-posted text).

But after 5-0 you begin to suspect that something is amiss ("gut feeling"). You don't like how that man behaves, his facial expression. You see some anomalies in the density distribution of the coin. You decide to take another prior for the coin, one favoring heads (i.e. favoring the man proposing the game, who is by now, after 5-0, dubious):

[Image: new prior distribution for the coin, skewed toward heads]

Based on the new prior, you re-interpret the LOS of the coin after the same 5-0 (no further tosses) as a t-value of 3.7. The stop after 5 tosses is justified, and you come to the conclusion that the man is cheating you: the coin is not fair.
With the "uninformed prior", you would reach the same conclusion only after about 10 tosses.
The practical difference between the approaches in this case is that I lose 5 dollars, you lose 10. The Bayesian framework allows for the "gut feeling" of humans, and humans are often good at it. Humans are "holistic" in their approach: they look not only at a 5-0 binomial result, but they also know that charlatans exist, that the coin should have uniform shape and density, how to interpret dubious facial expressions, etc. The precise shape of the prior, or whether the t-value is 3 or 5, does not matter too much. The Bayesian approach seems to me to favor "qualitative" plausibility and "reasoning" about the real world more than precise quantities in some formalized framework. And in phenomenology that is often better than playing with a more rigorous but vague null hypothesis in a formalized and specialized domain.
================================================================================================
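
The 98.4% LOS quoted above can be checked with a minimal sketch (assuming the "uninformed prior" means a uniform Beta(1,1), and that scipy is available):

Code: Select all

from scipy.stats import beta, norm

# posterior after 5 heads, 0 tails with a uniform Beta(1,1) prior is Beta(6,1)
los = beta(6, 1).sf(0.5)      # P(p > 0.5 | data) = 1 - 0.5**6
print(los)                    # ~0.984, the 98.4% quoted
print(norm.ppf(los))          # ~2.15, roughly the "t-value about 2.1"
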

Maybe that was not such a good example. It shows the arbitrariness of stopping and of ad hoc decisions more than anything else. Say, in the Bayesian approach, if I am a stupid human and keep the initial prior (first plot), I would need some 150 tosses to reach a reasonable t-value for a stop. But I guess an average human is not that stupid, and I guess that an average human takes both a holistic and a Bayesian approach to a plethora of issues daily.

There are no formal problems with the cosmological constant in the Einstein field equations. The problem appears when physicists try to make "some sense" of it in the real, phenomenologically rich, physical world. Issues like "naturalness" and even the "anthropic principle" (derided in the past) come to prominence. And this is very much related to the Bayesian approach.

Anyway, let's return to chess engines:

Code: Select all

Score of Stockfish 8 vs Stockfish 7: 385 - 82 - 533  [0.651] 1000
ELO difference: 108.68 +/- 14.51
Finished match
At 20'' + 0.2''. In Graham's conditions the win/loss ratio is expected to go to 3-4, and the draw rate to be very high, similar to what is seen in the Komodo - Houdini match. For a +5 -1 =84 score, would you accept an unbalanced prior for this Stockfish 8 - Stockfish 7 match, or, even more, an unbalanced prior favoring Stockfish 8 (I was not going that far with my prior)? If yes, why? Because you "know" Stockfish 8 is much stronger than Stockfish 7? Another thing: if I "knew" this match (Komodo - Houdini) were between two consecutive Stockfish development versions, I would consider a very balanced prior like the first plot, and the result would be completely insignificant. But all the "chess engine statistics phenomenology" I know says that the Komodo - Houdini match in Graham's conditions is closer to the Stockfish 8 - Stockfish 7 match than to a match between two consecutive Stockfish dev versions.
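
For reference, a minimal sketch (not from the original post) of how the Elo difference and the error margin above follow from the 385 - 82 - 533 counts:

Code: Select all

import math

w, l, d = 385, 82, 533
n = w + l + d
score = (w + 0.5 * d) / n                    # 0.6515
elo = -400 * math.log10(1 / score - 1)       # logistic Elo model

# standard error of the per-game score, propagated to Elo as a 95% interval
var = (w * (1 - score) ** 2 + l * (0 - score) ** 2 + d * (0.5 - score) ** 2) / n
margin = 1.96 * (400 / math.log(10)) * math.sqrt(var / n) / (score * (1 - score))
print(round(elo, 2), "+/-", round(margin, 2))   # ~108.7 +/- ~14.5, matching the output above
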

Sorry for this long, rambling post.
User avatar
Laskos
Posts: 10948
Joined: Wed Jul 26, 2006 10:21 pm
Full name: Kai Laskos

Re: Ways to avoid "Draw Death" in Computer Chess

Post by Laskos »

hgm wrote:I have no 'grunt' about Bayesian statistics, but about the prior that you use. There is no limit to the silliness you can get out of a Bayesian analysis by taking a sufficiently crazy prior. Therefore it always has to be applied with caution.

Obviously the 'improvement' you seem to get here is simply what you put in through the prior, and not what is suggested by the data. Your prior is a self-fulfilling prophecy of your belief that all engines that draw a lot against each other must be very different in strength. You claim to have data to back that up.

Well, I have no data at all, but I don't need it to know when I am being conned. Surely, if I play a super-strong engine against itself at long TC in Balanced Chess it must have a high draw rate. (Does your data contradict that?) And that despite the fact that the strength difference is exactly zero. If the data you have collected supports your belief, you have just been biased in collecting that data. Engines of equal strength do exist at any strength level, because each engine is exactly as strong as itself. Modest patches of strong engine versions also produce engines that differ very little in strength; engine developers are creating and testing them all the time. Saying that strong engines cannot possibly be close in strength (which is exactly what your prior does) is just nonsense.
I will not insist on my prior; I brought it up speculatively, and almost arbitrarily. Let's return to "Unbalanced" versus "Balanced" (with the ordinary frequentist approach), which has a theoretical background and was on-topic. Ferdinand does not seem to be checking his PMs for now, but would you argue that it is incorrect to use Stockfish as a "perfect" engine for much weaker engines? I don't see any very different means of obtaining those error distributions. And even on theoretical grounds, a truly perfect engine would give only the counts of perfect and non-perfect moves.
User avatar
Laskos
Posts: 10948
Joined: Wed Jul 26, 2006 10:21 pm
Full name: Kai Laskos

Re: Ways to avoid "Draw Death" in Computer Chess

Post by Laskos »

hgm wrote:
Laskos wrote:"Doesn't have to" does not mean "are not".
Indeed. But it does mean that when you assume it is, you are just guessing, and that every conclusion based on that assumption is no better than guessing.
You are not backing that with any data or common sense. At least in my data, I have observed that a larger mean error means a longer tail. Take an engine (not an extremely strong one, so as to have a reference for "optimal") and a significantly weaker engine, and test that. Those two Gamma distribution curves were for two engines significantly differing in strength.
It seems to make the assumption that the distributions are always Gamma distributions, until very far into the tail, and dependent on only a single "strength" parameter. That you observe this on engines that are not extremely strong might also be a weak point. Even the strong engine of your example most likely gives up 10 cP per move, i.e. 6 Pawns per game (and on average much more!), which on an absolute scale doesn't sound very strong. It would only need ~10 moves to bungle an equal game against perfect play.
My indications are that "Unbalanced Chess" is related to "Balanced Chess", at least for now. To get to a win, an engine has to pass through "Unbalanced Chess" anyway, even when starting from "Balanced Chess".
Not necessarily. The score loss can be dominated by a single gross blunder (e.g. a strategically wrong choice that in the long run proves fatal). This is why the exact shape of the distribution is so important. I don't doubt you when you say a larger modal or median score loss also means longer tails. But it makes a lot of difference whether the asymptotic behavior is exponential or 1/x^2.
So, are you saying that most losses in "Balanced Chess" are due to blunders, and most losses in "Unbalanced Chess" are due to accumulation of tiny errors?
At a very high level of play, this seems likely. Obviously the accumulation of small errors will be able to decide a game in Unbalanced Chess, but not in Balanced Chess, once the typical error gets small enough. So the question is whether the large score loss required to decide a game of Balanced Chess typically comes about through a very large number of small score losses all being atypically large, or through most of them being typical, or perhaps only slightly above typical, with just a few being very large. This depends on the characteristics of the loss-per-move distribution. E.g. with a Gaussian distribution, a 3-sigma event in the sum would most likely be caused by all terms being shifted by the same amount, but scattering around that in the normal way. If the sum of a number of draws from a Cauchy distribution is a fluke, it is most likely due to one term being much larger than all the others.
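
A minimal simulation sketch of that contrast (purely illustrative, with unit-scale distributions): for the most extreme per-game sums, what share is carried by the single largest per-move term?

Code: Select all

import numpy as np

rng = np.random.default_rng(0)
games, moves = 100_000, 50

for name, sampler in (("gaussian", rng.standard_normal), ("cauchy", rng.standard_cauchy)):
    x = sampler((games, moves))                          # per-move "errors"
    s = x.sum(axis=1)                                    # per-game totals
    fluke = np.abs(s) > np.quantile(np.abs(s), 0.999)    # the 0.1% most extreme games
    share = np.abs(x[fluke]).max(axis=1) / np.abs(s[fluke])
    print(name, round(share.mean(), 2))
# Gaussian: the largest single term is only a small fraction of the fluke sum.
# Cauchy: the largest single term is roughly the whole sum (it can even exceed it
# when the other terms partly cancel).
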
That is in fact testable. I haven't observed that, but we are diverging into "opinions", "common sense" and such things. If you came up with some data backing you, it would be more interesting. I don't think most developers have much knowledge of blunder rates versus lesser-than-blunder rates, or prefer one much over the other.
It is not the developers I worry about. They can do whatever they want. If they want to design hammers, why not, if it makes them happy. But if the users need to fasten screws, they won't consider a more hammer-like hammer a real improvement.

I don't feel like spending any CPU power on this. But for testers that do spend huge amounts of CPU power, it would be good to know whether they are spending it to measure something of interest to the public, or whether they are just trying to measure with very high accuracy a quantity that is of no relevance, on some vague hope that it will correlate well enough with the quantity that is of interest not to make all the hard-gotten accuracy on the thing they actually measure completely meaningless.
Ferdinand wrote a tool for me to detail the points discussed a bit. This pretty much settles the question. The databases were:

1000 self-games of Fruit 2.1 (2700 ELO CCRL 40/4) at 10''+ 0.1'' (about 100,000 total moves)
1000 self-games of BikJump 2.01 (2100 ELO CCRL 40/4) at 10''+ 0.1'' (about 100,000 total moves)

Arbiter: Stockfish dev (3450 ELO CCRL 40/4) (250ms/position)


Fruit:
Erroneous moves: 45247 (48.4%)
Perfect moves: 48283 (51.6%)
Median of the error: 0.31

BikJump:
Erroneous moves: 53532 (64.1%)
Perfect moves: 29998 (35.9%)
Median of the error: 0.46
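
A minimal sketch of the bookkeeping behind such numbers (not Ferdinand's actual tool; it assumes the per-move list holds the arbiter's score loss in pawns, 0.0 for a perfect move, and that the median is taken over the erroneous moves only):

Code: Select all

import statistics

def summarize(errors):
    """errors: arbiter score loss per move in pawns, 0.0 when the played move matches the arbiter."""
    erroneous = [e for e in errors if e > 0.0]
    return {
        "erroneous": len(erroneous),
        "perfect": len(errors) - len(erroneous),
        "erroneous_pct": round(100.0 * len(erroneous) / len(errors), 1),
        "median_error": statistics.median(erroneous),
    }

# toy example, not real data:
print(summarize([0.0, 0.10, 0.0, 0.45, 0.31, 0.0, 1.70]))
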


The histogram of the frequency (unnormalized) of errors:

[Image: histogram of the error frequencies (unnormalized) for Fruit and BikJump]

The stronger engine has a lower median error and a shorter tail. This looks a lot like Gamma distributions with these parameters (normalized frequencies), k the shape parameter and theta the scale parameter:


[Image: fitted Gamma distributions (normalized frequencies) for Fruit and BikJump]


And with special attention to the tails of the frequencies:


[Image: tails of the error frequencies with the fitted Gamma distributions]

These again fit the same Gamma distributions well (basically exponential decay). So the tails don't decay pathologically, and the empirical data suggest that your pathological examples don't usually happen. The "reasonable" behavior seems the more likely one to expect, lending credence to the hypothesis that "unbalanced" openings are closely related to "balanced" openings as far as match outcomes are concerned.
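
A minimal sketch of how such a Gamma fit can be obtained (synthetic data stand in for the real error lists; k and theta as above):

Code: Select all

from scipy.stats import gamma

# hypothetical stand-in for the nonzero per-move errors of one engine
errors = gamma.rvs(a=0.8, scale=0.6, size=50_000, random_state=1)

# fit shape k and scale theta, with the location pinned at 0
k, loc, theta = gamma.fit(errors, floc=0)
print(round(k, 2), round(theta, 2))   # should roughly recover a = 0.8, scale = 0.6
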
User avatar
hgm
Posts: 27790
Joined: Fri Mar 10, 2006 10:06 am
Location: Amsterdam
Full name: H G Muller

Re: Ways to avoid "Draw Death" in Computer Chess

Post by hgm »

Well, in the first place I am surprised that Fruit still makes such large errors so frequently. A sizable fraction of the moves give up more than 1.5 Pawn, which in general is enough to be the sole cause for losing the game.

A second question that presents itself is whether the error is in any way correlated with the absolute score. For the outcome of a game, only errors committed close to equality are relevant. E.g. if most of the errors > 1 Pawn are caused by unnecessarily sacrificing a Pawn to speed up a promotion that would otherwise occur beyond the horizon, they will not have any impact.
User avatar
Laskos
Posts: 10948
Joined: Wed Jul 26, 2006 10:21 pm
Full name: Kai Laskos

Re: Ways to avoid "Draw Death" in Computer Chess

Post by Laskos »

hgm wrote:Well, in the first place I am surprised that Fruit still makes such large errors so frequently. A sizable fraction of the moves give up more than 1.5 Pawn, which in general is enough to be the sole cause for losing the game.

A second question that presents itself is whether the error is in any way correlated with the absolute score. For the outcome of a game, only errors committed close to equality are relevant. E.g. if most of the errors > 1 Pawn are caused by unnecessarily sacrificing a Pawn to speed up a promotion that would otherwise occur beyond the horizon, they will not have any impact.
The second point is somewhat avoided, because I adjudicated scores of more than 2 pawns as wins, so there are very few already-decided positions where engines give away pawns and still win anyway. I did not need mistakes occurring at a 6-pawn advantage.

The first issue occurs mainly because the games are ultra-fast (10''+ 0.1''). Give them 60 seconds and the distributions will narrow. But I am not willing to test this obvious thing, as building the databases and the arbiter moves would take several days.

Sure, you can come up with other pathological possibilities, however unlikely; I will not check all of them.
User avatar
Laskos
Posts: 10948
Joined: Wed Jul 26, 2006 10:21 pm
Full name: Kai Laskos

Re: Ways to avoid "Draw Death" in Computer Chess

Post by Laskos »

Laskos wrote:
hgm wrote:Well, in the first place I am surprised that Fruit still makes such large errors so frequently. A sizable fraction of the moves give up more than 1.5 Pawn, which in general is enough to be the sole cause for losing the game.

A second question that presents itself is whether the error is in any way correlated with the absolute score. For the outcome of a game, only errors committed close to equality are relevant. E.g. if most of the errors > 1 Pawn are caused by unnecessarily sacrificing a Pawn to speed up a promotion that would otherwise occur beyond the horizon, they will not have any impact.
The second point is somewhat avoided, because I adjudicated scores of more than 2 pawns as wins, so there are very few already-decided positions where engines give away pawns and still win anyway. I did not need mistakes occurring at a 6-pawn advantage.

The first issue occurs mainly because the games are ultra-fast (10''+ 0.1''). Give them 60 seconds and the distributions will narrow. But I am not willing to test this obvious thing, as building the databases and the arbiter moves would take several days.

Sure, you can come up with other pathological possibilities, however unlikely; I will not check all of them.
I took another two engines to check the validity of my conclusions, using Stockfish dev at a longer time control as arbiter (because the guinea-pig engine Senpai 1.0 is already strong):

PGN databases of 1,000 games each at 10''+ 0.1''

Senpai 1.0 (3000 ELO CCRL):
Erroneous moves: 53404 (49.3%)
Perfect moves: 54931 (50.7%)
Median of the error: 0.25

Zurichess Appenzeller (1800 ELO CCRL):
Erroneous moves: 52942 (58.1%)
Perfect moves: 38257 (41.9%)
Median of the error: 0.49


The histogram of errors (normalized probability density function):

[Image: histogram of errors (normalized PDF) for Senpai and Zurichess]


Fitted Gamma distributions with k the shape parameter and theta the scale parameter (normalized probability density function):

[Image: fitted Gamma distributions (normalized PDF) for Senpai and Zurichess]


The tails of frequencies (counts):

[Image: tails of the error frequencies (counts) for Senpai and Zurichess]


To see the deep tails and the scaling behavior, it is very helpful to use log frequencies, plotted here up to very large blunders of 12 pawns, where there is still enough data.

[Image: log-frequency plot of the errors up to 12 pawns, with linear tail fits]

We see that deep in the tails they decay exponentially, as the linear fits on a logarithmic scale show. The slopes are different for the two engines because the scale parameters are different. So, no pathological behavior for another two engines widely separated in strength.
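
A minimal sketch of that "linear fit on a logarithmic scale" tail check (again on synthetic, Gamma-distributed stand-in data):

Code: Select all

import numpy as np
from scipy.stats import gamma

# hypothetical stand-in for one engine's per-move errors (the real data came from the arbiter)
errors = gamma.rvs(a=0.7, scale=0.8, size=200_000, random_state=2)

counts, edges = np.histogram(errors, bins=48, range=(0.0, 12.0))
centers = 0.5 * (edges[:-1] + edges[1:])
tail = (centers > 3.0) & (counts > 0)          # deep tail only, where counts still exist
slope = np.polyfit(centers[tail], np.log(counts[tail]), 1)[0]
print(round(slope, 2))   # close to -1/scale for an exponentially decaying (Gamma-like) tail
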
User avatar
hgm
Posts: 27790
Joined: Fri Mar 10, 2006 10:06 am
Location: Amsterdam
Full name: H G Muller

Re: Ways to avoid "Draw Death" in Computer Chess

Post by hgm »

Interesting as this is, I still don't see what you want to prove with it. The example with the Cauchy distribution was only an example of a distribution with long tails, where even the average does not converge for an infinite number of draws. A chess game only contains a finite number of moves, though. The problem that a single error might dominate the sum can easily happen there for exponential tails, or even Gaussian tails. Suppose the score change after a full move is distributed as a mixture of two Gaussians: a narrow one, which carries 99.5% of the probability, and a very wide one, which carries 0.5%. So 1 in 200 moves is a gross blunder. After the typical 50 moves of a game the sum of the errors caused by the narrow distribution is sqrt(50) ~ 7 times wider, and carries about 78% of the probability. The probability of exactly one blunder during the game is ~19%, and the corresponding score distribution still has the width of that of the blunder, as it was significantly more than 7 times as wide as the narrow one.

If 7 times the width of the narrow one is significantly smaller than the draw margin, games without blunders are only decided when you get into the extreme tail of the blunder-free Gaussian, and become faster-than-exponentially rarer. The decided games are then almost all games that contain a blunder, as the wide Gaussian still protrudes far outside the draw zone.
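
A minimal simulation sketch of this two-Gaussian picture (the widths below are arbitrary; only their ratio, well above 7, matters):

Code: Select all

import numpy as np

rng = np.random.default_rng(0)
games, moves = 50_000, 50
narrow, wide = 1.0, 20.0                               # arbitrary widths, ratio well above 7

blunder = rng.random((games, moves)) < 0.005           # 1 in 200 moves is a gross blunder
err = np.where(blunder,
               rng.normal(0.0, wide, (games, moves)),
               rng.normal(0.0, narrow, (games, moves)))

n_blunders = blunder.sum(axis=1)
print((n_blunders == 0).mean())                        # ~0.78: blunder-free games
print((n_blunders == 1).mean())                        # ~0.20: exactly one blunder (the ~19%)
print(err[n_blunders == 0].sum(axis=1).std())          # ~7: sqrt(50) times the narrow width
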

It seems that what you want to prove is that there is a universal law that prescribes a relation between the width of the narrow (positional error) part of the distribution and the total magnitude and width of the blunder part, and that every engine must obey it. I don't see how you could ever prove that such a relation exists. It certainly would not follow from a handful of engines obeying such a relation. In fact I suspect that it should be easy to create engines with an increased blunder rate and the same quality of positional play, by removing some knowledge from the evaluation that is not often needed, but can prove fatal when it is (e.g. about Pawn structure, like outpost passers).

Balanced and Unbalanced Chess just measure different aspects of the error-per-move distribution. Assuming that the distributions must have the same shape for all engines, which is what one must do to relate the two, is very questionable.

The point is this: Balanced Chess at a high level of play starts to measure mainly the blunder rate: you need at least one value in the extreme tail of the per-move error distribution to get a game decided. But blunders are easy to recognize from any set of games; they are sort of self-detecting, just look at the change in the scores the engine assigns to its own position. What is difficult to measure is the average small positional score loss per move. Unbalanced Chess is much more sensitive to that, when the distributions get narrow and the average error small.
Uri Blass
Posts: 10279
Joined: Thu Mar 09, 2006 12:37 am
Location: Tel-Aviv Israel

Re: Ways to avoid "Draw Death" in Computer Chess

Post by Uri Blass »

hgm wrote:Interesting as this is, I still don't see what you want to prove with it. The example with the Cauchy distribution was only an example of a distribution with long tails, where even the average does not converge for an infinite number of draws. A chess game only contains a finite number of moves, though. The problem that a single error might dominate the sum can easily happen there for exponential tails, or even Gaussian tails. Suppose the score change after a full move is distributed as a mixture of two Gaussians: a narrow one, which carries 99.5% of the probability, and a very wide one, which carries 0.5%. So 1 in 200 moves is a gross blunder. After the typical 50 moves of a game the sum of the errors caused by the narrow distribution is sqrt(50) ~ 7 times wider, and carries about 78% of the probability. The probability of exactly one blunder during the game is ~19%, and the corresponding score distribution still has the width of that of the blunder, as it was significantly more than 7 times as wide as the narrow one.

If 7 times the width of the narrow one is significantly smaller than the draw margin, games without blunders are only decided when you get into the extreme tail of the blunder-free Gaussian, and become faster-than-exponentially rarer. The decided games are then almost all games that contain a blunder, as the wide Gaussian still protrudes far outside the draw zone.

It seems that what you want to prove is that there is a universal law that prescribes a relation between the width of the narrow (positional error) part of the distribution and the total magnitude and width of the blunder part, and that every engine must obey it. I don't see how you could ever prove that such a relation exists. It certainly would not follow from a handful of engines obeying such a relation. In fact I suspect that it should be easy to create engines with an increased blunder rate and the same quality of positional play, by removing some knowledge from the evaluation that is not often needed, but can prove fatal when it is (e.g. about Pawn structure, like outpost passers).

Balanced and Unbalanced Chess just measure different aspects of the error-per-move distribution. Assuming that the distributions must have the same shape for all engines, which is what one must do to relate the two, is very questionable.

The point is this: Balanced Chess at a high level of play starts to measure mainly the blunder rate: you need at least one value in the extreme tail of the per-move error distribution to get a game decided. But blunders are easy to recognize from any set of games; they are sort of self-detecting, just look at the change in the scores the engine assigns to its own position. What is difficult to measure is the average small positional score loss per move. Unbalanced Chess is much more sensitive to that, when the distributions get narrow and the average error small.
The question is how to define a blunder.
In theory you can say that a blunder is a move that changes the theoretical result of the game, but in this case you cannot check whether a move is a blunder.

Suppose for the discussion that your definition of a blunder is a move that loses at least 30 centipawns relative to the best move after analysis.

1) In many cases a blunder does not change the theoretical result of the game.
2) If an engine plays many moves that lose 10 centipawns, then I expect it to practically lose the game, so it is easy to make an engine that does not blunder and still loses almost every game against Stockfish, even at a long time control.

I guess that a simple rule of changing your evaluation to have only values of 0.3*n pawns, for integer n, may be enough to lose most games even at a long time control, because the program will not notice that it made a positional blunder at depth 1 and will not change its mind (and its evaluation will slowly go down).
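
A minimal sketch of such a quantized evaluation (one reading of the suggestion, in centipawns):

Code: Select all

def quantized_eval(score_cp: int, step: int = 30) -> int:
    """Round a centipawn evaluation to the nearest multiple of `step` (0.3 pawn)."""
    return step * round(score_cp / step)

print(quantized_eval(14))   # 0  : a ~10 cp positional loss is invisible at depth 1
print(quantized_eval(-9))   # 0
print(quantized_eval(47))   # 60
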
User avatar
hgm
Posts: 27790
Joined: Fri Mar 10, 2006 10:06 am
Location: Amsterdam
Full name: H G Muller

Re: Ways to avoid "Draw Death" in Computer Chess

Post by hgm »

For the purpose of this discussion a blunder is just a move in the far tail of the distribution. Classifying moves as blunders or not in the example, each class with a different distribution, was just a way to get an overall distribution that is not a pure Gaussian, but has longer tails.

It is true that score loss per move is a bit ill-defined, as game-theoretically the only scores are mate scores. So in this model we are committing ourselves to a heuristic evaluation. Large score losses will mostly be caused by tactical errors; at some point the engine will unwittingly play a move that unavoidably results in the loss of a piece, while another move could have saved it. A deeper search at that position would have solved the problem, and engines that are better because of deeper search thus make these errors less frequently, because distant unavoidable losses are rarer in the game tree than close ones. OTOH, in positionally poor positions (low mobility, poor center control) unavoidable material losses of any kind get more frequent in the tree.

Perhaps the model of adding independent per-move errors is entirely wrong, and most games are won by slowly outperforming the opponent positionally in small steps, steering him to a part of the tree where blunder possibilities are more frequent, so that he has a higher probability of falling for one. The density of blunders will not go completely to zero in equal or even better positions, though.
User avatar
Laskos
Posts: 10948
Joined: Wed Jul 26, 2006 10:21 pm
Full name: Kai Laskos

Re: Ways to avoid "Draw Death" in Computer Chess

Post by Laskos »

hgm wrote:Interesting as this is, I still don't see what you want to prove with it. The example with the Cauchy distribution was only an example of a distribution with long tails, where even the average does not converge for an infinite number of draws. A chess game only contains a finite number of moves, though. The problem that a single error might dominate the sum can easily happen there for exponential tails, or even Gaussian tails. Suppose the score change after a full move is distributed as a mixture of two Gaussians: a narrow one, which carries 99.5% of the probability, and a very wide one, which carries 0.5%. So 1 in 200 moves is a gross blunder. After the typical 50 moves of a game the sum of the errors caused by the narrow distribution is sqrt(50) ~ 7 times wider, and carries about 78% of the probability. The probability of exactly one blunder during the game is ~19%, and the corresponding score distribution still has the width of that of the blunder, as it was significantly more than 7 times as wide as the narrow one.

If 7 times the width of the narrow one is significantly smaller than the draw margin, games without blunders are only decided when you get into the extreme tail of the blunder-free Gaussian, and become faster-than-exponentially rarer. The decided games are then almost all games that contain a blunder, as the wide Gaussian still protrudes far outside the draw zone.

It seems that what you want to prove is that there is a universal law that prescribes a relation between the width of the narrow (positional error) part of the distribution and the total magnitude and width of the blunder part, and that every engine must obey it. I don't see how you could ever prove that such a relation exists. It certainly would not follow from a handful of engines obeying such a relation. In fact I suspect that it should be easy to create engines with an increased blunder rate and the same quality of positional play, by removing some knowledge from the evaluation that is not often needed, but can prove fatal when it is (e.g. about Pawn structure, like outpost passers).

Balanced and Unbalanced Chess just measure different aspects of the error-per-move distribution. Assuming that the distributions must have the same shape for all engines, which is what one must do to relate the two, is very questionable.

The point is this: Balanced Chess at a high level of play starts to measure mainly the blunder rate: you need at least one value in the extreme tail of the per-move error distribution to get a game decided. But blunders are easy to recognize from any set of games; they are sort of self-detecting, just look at the change in the scores the engine assigns to its own position. What is difficult to measure is the average small positional score loss per move. Unbalanced Chess is much more sensitive to that, when the distributions get narrow and the average error small.
Let's again bring your seemingly plausible example down to the real world. Say we have a hard-to-detect 0.5%-weight Gaussian with sigma = 5 pawns hovering over this 99.5% narrow distribution (median 0.25 pawns).

1/ The logarithmic plot of errors does not correspond to such a model for the 4 engines already tested (Fruit, BikJump, Senpai, Zurichess). That's why the logarithmic plot in my previous post was important: to see whether something is hovering over in the tails.

[Image: log-scale error frequencies compared with a hypothetical wide-Gaussian component]

The weight of this wide Gaussian must probably be below 0.2%, judging by the experimental data.

2/ Even with that, on a move-by-move basis we can take a narrow Gaussian (sigma = 0.25, matching the earlier median) with weight 99.5%, and a wide Gaussian (sigma = 5 as before) with weight 0.5% (although this is probably not allowed by the empirical data). If we take a game of 50 full moves, about 78% of the games are played solely according to the narrow Gaussian, and 19% have a single contact with the wide Gaussian. Say 2 pawns is the exact threshold at which the game is decided.

The log plot of the move-by-move deviation of the score, according to the two combined Gaussians (and two identical engines):

[Image: log plot of the per-move score deviation for the combined two-Gaussian model]
  • a) Balanced openings: initial score 0.00
    25% of the games played only according to the narrow Gaussian will be decided. A total of 0.25*0.78 = 19%
    69% of the games played according to the wide Gaussian are decided. A total of 0.69*0.19 = 13%

    b) Unbalanced openings: initial score 1.20 (that's the sort of imbalance necessary to maximize efficiency, even exaggerating the value a bit)
    36% of the games played only according to the narrow Gaussian will be decided. A total of 0.36*0.78 = 28%
    69% of the games played according to the wide Gaussian are decided. A total of 0.69*0.19 = 13%
The blunder rate remains almost the same; the additional decisive games in the unbalanced case come from the accumulation of small errors. Saying that Balanced Chess grossly exaggerates one aspect or the other compared to Unbalanced Chess is stretching reality with outlandish examples not supported by empirical evidence. I am not even sure which of Balanced or Unbalanced shows more relevant things about a chess engine.
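
A minimal closed-form check of the percentages above (50 moves, narrow sigma 0.25 with weight 99.5%, wide sigma 5 with weight 0.5%, 2-pawn decision threshold; assuming scipy):

Code: Select all

from math import sqrt
from scipy.stats import norm

moves, threshold = 50, 2.0
sigma_sum = 0.25 * sqrt(moves)                 # width of the summed narrow errors, ~1.77 pawns

def decided(start):
    """Probability that a blunder-free game drifts past +/- 2 pawns, starting from `start`."""
    return norm.sf((threshold - start) / sigma_sum) + norm.cdf((-threshold - start) / sigma_sum)

print(round(decided(0.0), 2))                  # ~0.26: balanced openings, narrow Gaussian only
print(round(decided(1.2), 2))                  # ~0.36: unbalanced openings, narrow Gaussian only
print(round(2 * norm.sf(threshold / 5.0), 2))  # ~0.69: games containing one wide-Gaussian blunder
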

Now, if you want to build an engine which blunders a bishop every 30 moves while otherwise playing like Stockfish, then surely all that I wrote and measured is useless. I toyed with 4 regular engines; I am not building an overarching theory of chess engines. You seem to dismiss data, preferring toy models and gedanken experiments, which are nice, but are not always relevant to the empirical world of chess engines.