Ways to avoid "Draw Death" in Computer Chess

Discussion of chess software programming and technical issues.

Moderators: hgm, Rebel, chrisw

Laskos
Posts: 10948
Joined: Wed Jul 26, 2006 10:21 pm
Full name: Kai Laskos

Re: Ways to avoid "Draw Death" in Computer Chess

Post by Laskos »

Pio wrote:I just love your posts HGM. You are of course correct that the engines will shift their distribution if they are tested very close to the edge. I do not understand why people do not use your time-odds idea, which is brilliant.

/Pio
Time odds were discussed earlier in this thread, starting from my opening post; you probably missed that. Check their efficiency too.
Laskos
Posts: 10948
Joined: Wed Jul 26, 2006 10:21 pm
Full name: Kai Laskos

Re: Ways to avoid "Draw Death" in Computer Chess

Post by Laskos »

hgm wrote:Indeed, it was an extreme example to illustrate the point, but then the consequences were also extreme, turning a 10-0 loss into a 75% victory.

The point is that if there is any relation between a match starting from deep within the draw zone, and one starting from the edge, it is critically dependent on that distribution, and in particular on the far tail of that distribution. You argue as if that distribution is a given certainty, but it is not. I doubt you have enough data to prove this distribution is anything more than an assumption on your part.

And even if you had a very accurate measurement of that distribution, there is no guarantee that any new engine you want to test would have the same distribution. The entire sport of 'improving' engines is shaping that distribution, by culling away knowledge for speed, and more speculatively following some lines at the expense of pruning others. All of this is expected to shift the peak of the distribution to the left because of better tactics, but to lengthen the tail because of unrecognized strategic blunders.

It is the characteristics of that distribution that in the end determines how useful the engine is for a particular purpose. And measuring from within the draw zone is not sensitive to the relevant characteristics. If you start assuming relations between those characteristics without measuring them, you might as well just assume which engine is better. Then you could do the test with zero games...
I had some old (weak) data on distributions. Even if it is mostly assumption, it is a plausible assumption. What you say is not backed by any data and seems to me not that plausible. Now, we can diverge on what is plausible. Then, let's call one "Balanced Chess" and the other "Unbalanced Chess". In "Unbalanced Chess", using the pentanomial model for side-reversed games, one would need an order of magnitude fewer games for the same statistical significance than in "Balanced Chess" at the 95%+ draw rates one gets with balanced openings. This is proven theoretically (in the Davidson and Rao-Kupper (Bayeselo) draw models) and shown practically. We can then diverge in opinions on the plausibility that "Balanced Chess" is correlated with "Unbalanced Chess". Maybe a better solution than having +4 -1 =73 after 2 weeks. A minimal sketch of the pentanomial gain is below.
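
To illustrate the pentanomial gain (a minimal sketch; the counts below are made up for illustration, not from any real match): with unbalanced openings almost every decisive pair splits 1-1 across the color reversal, so the paired score variance collapses relative to the trinomial model, which treats the games as independent.

```python
# Made-up counts, for illustration only: why the pentanomial model for
# side-reversed game pairs needs far fewer games in "Unbalanced Chess".

def trinomial_var(w, d, l):
    """Per-game score variance, treating all games as independent."""
    n = w + d + l
    mean = (w + 0.5 * d) / n
    return (w + 0.25 * d) / n - mean ** 2

def pentanomial_var(pairs):
    """Per-game-equivalent variance from pair scores (0, 0.5, 1, 1.5, 2)."""
    n = sum(pairs)
    scores = (0.0, 0.5, 1.0, 1.5, 2.0)
    mean = sum(c * s for c, s in zip(pairs, scores)) / n
    var_pair = sum(c * (s - mean) ** 2 for c, s in zip(pairs, scores)) / n
    return var_pair / 2  # a pair is two games

# 1000 games from unbalanced openings: +480 -470 =50, where the 455
# pairs scoring exactly 1 are mostly win-one-side/lose-the-other.
v_tri = trinomial_var(480, 50, 470)            # ~0.24
v_pent = pentanomial_var([5, 10, 455, 30, 0])  # ~0.015 for the same games
print(v_tri / v_pent)                          # ~16x fewer games needed
```

The ratio of the two variances is the factor by which the number of games to the same error bar shrinks; with these made-up counts it comes out around 16x, in line with the order-of-magnitude claim.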
Pio
Posts: 334
Joined: Sat Feb 25, 2012 10:42 pm
Location: Stockholm

Re: Ways to avoid "Draw Death" in Computer Chess

Post by Pio »

Hi Kai!

I have watched every post in this thread even though I cannot remember everything.

Of course you can adjust the time odds so that it will give you the most information (depending on your rating model).

Another gain from using time odds is that you will be able to inspect the games where the player with less time has won. Those games will be really interesting, since the player with more time has most certainly made a big strategical mistake. Hopefully you can fix that mistake in evaluation and/or in search.

/Pio
Laskos
Posts: 10948
Joined: Wed Jul 26, 2006 10:21 pm
Full name: Kai Laskos

Re: Ways to avoid "Draw Death" in Computer Chess

Post by Laskos »

Pio wrote:Hi Kai!

I have watched every post in this thread even though I cannot remember everything.

Of course you can adjust the time odds so that it will give you the most information (depending on your rating model).

Another gain from using time odds is that you will be able to inspect the games where the player with less time has won. Those games will be really interesting, since the player with more time has most certainly made a big strategical mistake. Hopefully you can fix that mistake in evaluation and/or in search.

/Pio
Hi,
I too was quite curious about time odds. In "Endgame Chess" with balanced openings ("Drawish Chess"), I even tried 16x time odds, as 4x seemed too low in this case. Even that did not have that large an effect, and the pentanomial for side-reversed games with White time odds helps only moderately. Maybe real Chess is different, but then, at those draw rates, to get any statistical significance one would have to sit and wait not 2-3 weeks, like Graham, but 2-3 months.
hgm
Posts: 27788
Joined: Fri Mar 10, 2006 10:06 am
Location: Amsterdam
Full name: H G Muller

Re: Ways to avoid "Draw Death" in Computer Chess

Post by hgm »

Laskos wrote:We can then diverge in opinions on the plausibility that "Balanced Chess" is correlated with "Unbalanced Chess".
The extreme example I gave shows that Balanced Chess doesn't really have to correlate with Unbalanced Chess. They are different tasks, and the optimal tools for them will be different, just as you need a different tool to fasten a screw than to fasten a nail.

So the important question really is: which is more relevant for the average user? Do they want an engine that almost always plays the better move, but once every 10,000 moves makes a gross blunder, or do they want one that gives away a little bit of advantage on every move, out of paranoia about blunders?

My feeling is that the typical use case would want to get the very best move most of the time, even if that would lead to an occasional gross blunder. Under the current paradigm of engine design such blunders are likely to be strategic, and so probably recognizable as such by the user. By testing the engines in Balanced Chess, which is rather insensitive to tiny differences in the typical quality of the move, and sensitive only to gross blunders, we might develop them into screwdrivers while the user actually wants to hammer in nails...
Laskos
Posts: 10948
Joined: Wed Jul 26, 2006 10:21 pm
Full name: Kai Laskos

Re: Ways to avoid "Draw Death" in Computer Chess

Post by Laskos »

hgm wrote:
Laskos wrote:We can then diverge in opinions on the plausibility that "Balanced Chess" is correlated with "Unbalanced Chess".
The extreme example I gave shows that Balanced Chess doesn't really have to correlate with Unbalanced Chess. They are different tasks, and the optimal tools for them will be different, just as you need a different tool to fasten a screw than to fasten a nail.
"Doesn't have to" does not mean "are not". You are not backing that by any data and common sense. At least from my data, I have observed that larger mean error means longer tail. Take an engine (not extremely strong one, to have a reference to "optimal") and a significantly weaker engine, and test that. Those two Gamma distribution curves were for two significantly differing in strength engines. My indications are that "Unbalanced Chess" is related to "Balanced Chess", at least for now. To pass to a win, an engine anyway has to pass trough "Unbalanced Chess" even from "Balanced Chess". This might change as people start to shift to systematic "Unbalanced Chess" (they will probably anyway do that in 10-20 years, this is simply the Checkers paradigm).
So the important question really is: which is more relevant for the average user? Do they want an engine that almost always plays the better move, but once every 10,000 moves makes a gross blunder, or do they want one that gives away a little bit of advantage on every move, out of paranoia about blunders?

My feeling is that the typical use case would want to get the very best move most of the time, even if that would lead to an occasional gross blunder. Under the current paradigm of engine design such blunders are likely to be strategic, and so probably recognizable as such by the user. By testing the engines in Balanced Chess, which is rather insensitive to tiny differences in the typical quality of the move, and sensitive only to gross blunders, we might develop them into screwdrivers while the user actually wants to hammer in nails...
So, are you saying that most losses in "Balanced Chess" are due to blunders, and most losses in "Unbalanced Chess" are due to accumulation of tiny errors? That is in fact testable. I haven't observed that, but let's diverge on "opinions", "common sense" and such things. If you come up with some data backing you, it would be more interesting. I don't think that most developers have much knowledge about blunder rates vs lesser-than-blunder rates, or prefer one much over the other.
hgm
Posts: 27788
Joined: Fri Mar 10, 2006 10:06 am
Location: Amsterdam
Full name: H G Muller

Re: Ways to avoid "Draw Death" in Computer Chess

Post by hgm »

Laskos wrote:"Doesn't have to" does not mean "are not".
Indeed. But it does mean that when you assume it is, you are just guessing, and that every conclusion based on that assumption is no better than guessing.
You are not backing that with any data or common sense. At least from my data, I have observed that a larger mean error means a longer tail. Take an engine (not an extremely strong one, to have a reference to "optimal") and a significantly weaker engine, and test that. Those two Gamma distribution curves were for two engines differing significantly in strength.
That seems to assume that the distributions are always Gamma distributions, even very far into the tail, and depend on only a single 'strength' parameter. That you observe this on engines that are not extremely strong might also be a weak point. Even the strong engine of your example most likely gives up 10 cP per move, i.e. 6 Pawns over a ~60-move game (and on average much more!), which on an absolute scale doesn't sound very strong. It would need only ~10 moves to bungle an equal game against perfect play.
My indications are that "Unbalanced Chess" is related to "Balanced Chess", at least for now. To get to a win, an engine has to pass through "Unbalanced Chess" anyway, even starting from "Balanced Chess".
Not necessarily. The score loss can be dominated by a single gross blunder (e.g. a strategically wrong choice that in the long run proves fatal). This is why the exact shape of the distribution is so important. I don't doubt you when you say a larger modal or median score loss also means longer tails. But it makes a lot of difference whether the asymptotic behavior is exponential or 1/x^2.
So, are you saying that most losses in "Balanced Chess" are due to blunders, and most losses in "Unbalanced Chess" are due to accumulation of tiny errors?
At a very high level of play, this seems likely. Obviously accumulation of small errors will be able to decide a game in Unbalanced Chess, but it would not in Balanced Chess, once the typical error gets small enough. So the question is whether the large score loss required to decide a game of Balanced Chess typically comes about by a very large number of small score losses being atypically large, or by most of them being typical or perhaps only slightly above typical, with just a few being very large. This depends on the characteristics of the loss-per-move distribution. E.g. with a Gaussian distribution a 3-sigma event in the sum would most likely be caused by all terms being shifted by the same amount, but scattering around that in the normal way. If the sum of a number of draws from a Cauchy distribution is a fluke, it is most likely due to one term being much larger than all the others.
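
To illustrate (a small abstract simulation of my own, with arbitrary game length and sample size):

```python
# Condition on the sum of per-move errors being a fluke, then ask how much
# of it the single largest term contributes.
import numpy as np

rng = np.random.default_rng(1)
N_MOVES, N_GAMES = 60, 100_000  # arbitrary game length and sample size

for name, sampler in [("Gaussian", rng.standard_normal),
                      ("Cauchy", rng.standard_cauchy)]:
    x = sampler((N_GAMES, N_MOVES))
    sums = x.sum(axis=1)
    flukes = x[sums > np.quantile(sums, 0.999)]  # largest 0.1% of sums
    share = (flukes.max(axis=1) / flukes.sum(axis=1)).mean()
    print(f"{name}: largest single term supplies ~{share:.0%} of a fluke")
```

Here the largest term typically supplies only around a tenth of a Gaussian fluke, but essentially all of a Cauchy one.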
That is in fact testable. I haven't observed that, but let's diverge on "opinions", "common sense" and such things. If you come up with some data backing you, it would be more interesting. I don't think that most developers have much knowledge about blunder rates vs lesser-than-blunder rates, or prefer one much over the other.
It is not the developers I worry about. They can do whatever they want. If they want to design hammers, why not, if it makes them happy. But if the users need to fasten screws, they won't consider a more hammer-like hammer a real improvement.

I don't feel like spending any CPU power on this. But for testers who do spend huge amounts of CPU power, it would be good to know whether they are spending it to measure something of interest to the public, or whether they are just trying to measure with very high accuracy a quantity that is of no relevance, just on some vague hope that it will correlate well enough with the quantity that is of interest to not make all the hard-gotten accuracy on the thing they actually measure completely meaningless.
Laskos
Posts: 10948
Joined: Wed Jul 26, 2006 10:21 pm
Full name: Kai Laskos

Re: Ways to avoid "Draw Death" in Computer Chess

Post by Laskos »

hgm wrote:Just on some vague hope that it will correlate well enough with the quantity that is of interest to not make all the hard-gotten accuracy on the thing they actually measure completely meaningless.
You exaggerate all the difficulties. Continuously giving examples of pathological distributions with undefined or weird expected values and variances is closer to paranoia. What is proposed is "Unbalanced Chess", a clearly defined game. Checkers people used "Unbalanced Checkers" for years. To me it is plausible that "Unbalanced Chess" is closely related to "Balanced Chess", and I am backing it with some data. I observed 4-5 years ago (?) that a larger mean (and mode) means longer tails. I partly derived that the tails decay exponentially, but this was by no means established (too little data).

I asked Ferdinand if he could write a tool to check this; if he finds time, I will be able to show some data soon. I plan to use Zurichess_00 (a stable UCI engine at 1800 CCRL) as the weak engine, Fruit 2.1 (2700 CCRL) as the strong engine, and Stockfish dev (3400 CCRL) as the "perfect" engine.
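
The core of such a tool can be sketched with python-chess (the engine binary, search depth and PGN path below are placeholders): replay a game, let the referee engine evaluate before and after each move of one side, and collect the per-move centipawn losses whose distribution tail is at issue.

```python
# A rough sketch, not Ferdinand's tool: per-move centipawn loss of one
# side in a PGN game, in the referee engine's opinion.
import chess
import chess.engine
import chess.pgn

REFEREE = "stockfish"  # assumed UCI binary on PATH
LIMIT = chess.engine.Limit(depth=18)

def per_move_losses(pgn_path, side=chess.WHITE):
    losses = []
    with open(pgn_path) as f, \
         chess.engine.SimpleEngine.popen_uci(REFEREE) as engine:
        game = chess.pgn.read_game(f)
        board = game.board()
        for move in game.mainline_moves():
            if board.turn == side:
                before = engine.analyse(board, LIMIT)["score"].pov(side)
                board.push(move)
                after = engine.analyse(board, LIMIT)["score"].pov(side)
                losses.append(max(0, before.score(mate_score=10000)
                                     - after.score(mate_score=10000)))
            else:
                board.push(move)
    return losses  # centipawns given up per own move

print(per_move_losses("game.pgn"))
```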

Until then you can see the +5 -1 =84 result of Graham from "Balanced Chess", which is completely unrelated to "Unbalanced Chess" because of a Cauchy distribution. Come on!
hgm
Posts: 27788
Joined: Fri Mar 10, 2006 10:06 am
Location: Amsterdam
Full name: H G Muller

Re: Ways to avoid "Draw Death" in Computer Chess

Post by hgm »

It might be hard to check this, for lack of a referee that plays/evaluates perfectly. You might think Stockfish is so much better than other engines that it can be used as such, but in fact it is strategically just as stupid as weaker engines, if not more so.

A clear example (yes, again an extreme one) is the 3Q vs 7N position. Stockfish lacks the strategic knowledge about the Scharnagl effect that makes this a lost position for the Queens, and evaluates it as +8 at low depth. And then the score drops in small steps to -16 as the game progresses, or, in any fixed position, as its look-ahead depth allows it to see deeper into the game. So to Stockfish it looks like it is making a relatively small error on every move, compared to its deeper searches. While in fact it didn't really make any error at all, and is just seeing the consequences of starting in a hopeless position. If it had been started in a position where it had the choice whether to convert to this hopeless position or not (e.g. where it can just force a Q-for-2N trade before the Knights get organized), it would not take the opportunity to save the game, and the entire error would be made in that one move, sealing its fate. This cannot be solved by analyzing the game backwards, because in any position there will be several moves that apparently look equally good if you don't have the required evaluation knowledge, but are in fact equally bad.

This is rather typical of strategical errors. The unrecognized disadvantage restricts your options, so that your best moves are just a bit worse than those of the opponent. The positional horizon effect will be used to smooth out material losses, making the score gradually drop even before an actual material loss occurs, as the engine plans to postpone the loss by making smaller positional sacrifices to push it over the horizon, before it can see that the loss is unavoidable.

BTW, what does sound like paranoia to me is the assumption that playing strength at Balanced Chess would correlate so poorly with that at Unbalanced Chess that one would ever need to live with the vastly larger statistical error that Balanced Chess gives you for the same number of games. It would almost require anti-correlation, or no correlation at all. And even if that were the case, it is still a good question which of the two abilities is more relevant to the typical user. It seems to me that Balanced Chess is a dead horse. Either it measures the same thing as Unbalanced Chess, but in an incredibly less efficient way, or it measures something different, which the average user is not much interested in. It cannot win. So there is little point in flogging it with advanced statistical methods.
Laskos
Posts: 10948
Joined: Wed Jul 26, 2006 10:21 pm
Full name: Kai Laskos

Re: Ways to avoid "Draw Death" in Computer Chess

Post by Laskos »

Laskos wrote: Until then you can see the +5 -1 =84 result of Graham from "Balanced Chess", which is completely unrelated to "Unbalanced Chess" because of a Cauchy distribution. Come on!
And even that +5 -1 =84 situation in "Balanced Chess" can be improved a bit with an "Unbalanced Prior" for "Balanced Chess". The unbalance in the prior comes from the Win/Loss ratio deviating strongly from 1. A somewhat exaggerated prior would be:

W**3 * L**(1/3)

which gives a t-value of 2.15.

For a uniform prior, the t-value is 1.63, so a factor of 1.32 improvement in t-value. But this is a bit of an exaggeration, because for that match the prior is more likely W**2 * L**(1/2).

With "Unbalanced Chess", at this draw rate, the expected improvement is a factor 2-3 in t-value, with square of that for the number of games to same significance.