Ways to avoid "Draw Death" in Computer Chess

Discussion of chess software programming and technical issues.

Moderators: hgm, Rebel, chrisw

Rebel
Posts: 6991
Joined: Thu Aug 18, 2011 12:04 pm

Re: Ways to avoid "Draw Death" in Computer Chess

Post by Rebel »

Laskos wrote:
Rebel wrote:
Laskos wrote: Unnormalized prior: W**2 * L**(1/2)
Yes, something like that. I leave that in the good hands of the mathematicians (like you) among us, I am certainly not.

I am not an expert in ELO calculation, but I suspect that with the Draw Death Doom Day [DDDD :lol: ] in mind, the number of games should start to play a prominent role. If we take my previous example (X vs Y) and multiply the number of games by 10, we get:

Code: Select all

Match-1 - 100   games - 50.5-49.5 - (50.5%) - draws 99
Match-2 - 1000  games - 505 - 495 - (50.5%) - draws 990
Match-3 - 10000 games - 5050-4950 - (50.5%) - draws 9900
Match-1 with only 1 win doesn't say much about ELO.
Match-2 with 10 wins already says a whole lot.
Match-3, with 100 wins and not a single loss in 10,000 games, means X is clearly superior to Y, and that should be made visible in ELO.

From a gut feeling I would say Match-1 +10 elo, Match-2 +50 elo and Match-3 +200-300 elo. Are there already formulas for that?
I am not a mathematician, actually a physicist, but I should know some shallow math, and I like to have some empirical grounds. There is a quantity to use for that, the t-value (easily related here to the p-value); it has standard deviation 1, and 1.96 corresponds to 95% confidence (as we always discuss).

Here the expression for it is: t = (score - 1/2) / sigma, where sigma is the standard error of the score.

So:
Match-1: t-value = 1.01, vs. the 1.96 threshold (undecided)
Match-2: t-value = 3.18, vs. the 1.96 threshold (significant)
Match-3: t-value = 10.1, vs. the 1.96 threshold (highly significant)
But in the current ELO models in use, the base is always the 50.5 percentage score, do I see that correctly? If so, the elo gain in matches 2 & 3 (3.5 elo) is not correct, wouldn't you say so?
AlvaroBegue
Posts: 931
Joined: Tue Mar 09, 2010 3:46 pm
Location: New York
Full name: Álvaro Begué (RuyDos)

Re: Ways to avoid "Draw Death" in Computer Chess

Post by AlvaroBegue »

Rebel wrote:
Laskos wrote:
Rebel wrote:
Laskos wrote: Unnormalized prior: W**2 * L**(1/2)
Yes, something like that. I leave that in the good hands of the mathematicians (like you) among us, I am certainly not.

I am not an expert in ELO calculation, but I suspect that with the Draw Death Doom Day [DDDD :lol: ] in mind, the number of games should start to play a prominent role. If we take my previous example (X vs Y) and multiply the number of games by 10, we get:

Code: Select all

Match-1 - 100   games - 50.5-49.5 - (50.5%) - draws 99
Match-2 - 1000  games - 505 - 495 - (50.5%) - draws 990
Match-3 - 10000 games - 5050-4950 - (50.5%) - draws 9900
Match-1 with only 1 win doesn't say much about ELO.
Match-2 with 10 wins already says a whole lot.
Match-3, with 100 wins and not a single loss in 10,000 games, means X is clearly superior to Y, and that should be made visible in ELO.

From a gut feeling I would say Match-1 +10 elo, Match-2 +50 elo and Match-3 +200-300 elo. Are there already formulas for that?
I am not a mathematician, actually a physicist, but I should know some shallow math, and I like to have some empirical grounds. There is a quantity to use for that, the t-value (easily related here to the p-value); it has standard deviation 1, and 1.96 corresponds to 95% confidence (as we always discuss).

Here the expression for it is: t = (score - 1/2) / sigma, where sigma is the standard error of the score.

So:
Match-1: t-value = 1.01, vs. the 1.96 threshold (undecided)
Match-2: t-value = 3.18, vs. the 1.96 threshold (significant)
Match-3: t-value = 10.1, vs. the 1.96 threshold (highly significant)
But in the current ELO models in use, the base is always the 50.5 percentage score, do I see that correctly? If so, the elo gain in matches 2 & 3 (3.5 elo) is not correct, wouldn't you say so?
I don't understand what you are saying. Elo difference tells you something about the probability that one engine will beat another engine in one game. If we estimate that probability as the sample mean of the games in a match, indeed those three matches have the same sample mean and therefore you'll get the same Elo difference.

I can see two ways to make mathematical sense of your feeling that these Elo differences shouldn't be equal.

1. Instead of using the sample mean to estimate the probability that one engine will beat the other one, be a good Bayesian. Start with some prior probability distribution for what you believe the true probability is likely to be, and then use Bayes' theorem to compute a posterior distribution, the average of which will be your estimate of the probability that one engine beats the other. In practice, and to keep things manageable, stay within the family of beta distributions. This is equivalent to starting with an initial number of wins and losses. For instance, I like to use 60 wins and 60 losses. So just add those to the observed numbers and you'll get a tiny Elo increase for match 1, but something more substantial for match 3.

2. If you are trying to measure statistical significance of the claim that one engine is superior to the other after some games have been observed, you shouldn't look at (W-L)/(W+L), but at (W-L)/sqrt(W+L), which (under the assumption that both engines are equally good) should approximate a normal distribution with mean 0 and standard deviation 1 fairly quickly. If you see a number like 10, you have extremely strong evidence that one engine is better than the other. If you see a number like 1, it is plausible that the engine that won more games is not actually better than the other.
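
For concreteness, here is a minimal Python sketch of both ideas applied to the three matches above. It is only a sketch: the Elo conversion uses the standard logistic formula, and counting the draws at half a point in the Bayesian estimate is just one possible reading of the 60/60 pseudo-count recipe.

Code: Select all

# Sketch of both suggestions for the three matches (wins, losses, draws).
import math

matches = [(1, 0, 99), (10, 0, 990), (100, 0, 9900)]

def elo(score):
    # standard logistic Elo formula for a score fraction
    return 400.0 * math.log10(score / (1.0 - score))

for w, l, d in matches:
    # point 1: add 60 pseudo-wins and 60 pseudo-losses, then take the mean score
    w2, l2 = w + 60, l + 60
    score = (w2 + 0.5 * d) / (w2 + l2 + d)
    # point 2: normalized win/loss excess, ~N(0,1) if the engines are equally good
    z = (w - l) / math.sqrt(w + l)
    print("Elo = %+5.2f   (W-L)/sqrt(W+L) = %5.2f" % (elo(score), z))

The (W-L)/sqrt(W+L) column comes out at 1.0, 3.16 and 10.0, essentially the t-values Kai quoted above.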
Laskos
Posts: 10948
Joined: Wed Jul 26, 2006 10:21 pm
Full name: Kai Laskos

Re: Ways to avoid "Draw Death" in Computer Chess

Post by Laskos »

AlvaroBegue wrote:
Rebel wrote:
Laskos wrote:
Rebel wrote:
Laskos wrote: Unnormalized prior: W**2 * L**(1/2)
Yes, something like that. I leave that in the good hands of the mathematicians (like you) among us, I am certainly not.

I am not an expert in ELO calculation, but I suspect that with the Draw Death Doom Day [DDDD :lol: ] in mind, the number of games should start to play a prominent role. If we take my previous example (X vs Y) and multiply the number of games by 10, we get:

Code: Select all

Match-1 - 100   games - 50.5-49.5 - (50.5%) - draws 99
Match-2 - 1000  games - 505 - 495 - (50.5%) - draws 990
Match-3 - 10000 games - 5050-4950 - (50.5%) - draws 9900
Match-1 with only 1 win doesn't say much about ELO.
Match-2 with 10 wins already says a whole lot.
Match-3, with 100 wins and not a single loss in 10,000 games, means X is clearly superior to Y, and that should be made visible in ELO.

From a gut feeling I would say Match-1 +10 elo, Match-2 +50 elo and Match-3 +200-300 elo. Are there already formulas for that?
I am not a mathematician, actually a physicist, but I should know some shallow math, and I like to have some empirical grounds. There is a quantity to use for that, the t-value (easily related here to the p-value); it has standard deviation 1, and 1.96 corresponds to 95% confidence (as we always discuss).

Here the expression for it is: t = (score - 1/2) / sigma, where sigma is the standard error of the score.

So:
Match-1: t-value = 1.01, vs. the 1.96 threshold (undecided)
Match-2: t-value = 3.18, vs. the 1.96 threshold (significant)
Match-3: t-value = 10.1, vs. the 1.96 threshold (highly significant)
But in the current ELO models in use, the base is always the 50.5 percentage score, do I see that correctly? If so, the elo gain in matches 2 & 3 (3.5 elo) is not correct, wouldn't you say so?
I don't understand what you are saying. Elo difference tells you something about the probability that one engine will beat another engine in one game. If we estimate that probability as the sample mean of the games in a match, indeed those three matches have the same sample mean and therefore you'll get the same Elo difference.

I can see two ways to make mathematical sense of your feeling that these Elo differences shouldn't be equal.

1. Instead of using the sample mean to estimate the probability that one engine will beat the other one, be a good Bayesian. Start with some prior probability distribution for what you believe the true probability is likely to be, and then use Bayes' theorem to compute a posterior distribution, the average of which will be your estimate of the probability that one engine beats the other. In practice, and to keep things manageable, stay within the family of beta distributions. This is equivalent to starting with an initial number of wins and losses. For instance, I like to use 60 wins and 60 losses. So just add those to the observed numbers and you'll get a tiny Elo increase for match 1, but something more substantial for match 3.

2. If you are trying to measure statistical significance of the claim that one engine is superior to the other after some games have been observed, you shouldn't look at (W-L)/(W+L), but at (W-L)/sqrt(W+L), which (under the assumption that both engines are equally good) should approximate a normal distribution with mean 0 and standard deviation 1 fairly quickly. If you see a number like 10, you have extremely strong evidence that one engine is better than the other. If you see a number like 1, it is plausible that the engine that won more games is not actually better than the other.
Ed, to explain a bit what Alvaro says:
Point 2 of what he wrote is what I was saying in the previous post: (W-L)/sqrt(W+L) is exactly the t-value, which I used to exemplify an increasing quantity for your matches, a quantity which shows statistical significance.

Point 1 is maybe confusing to you. If one uses the prior knowledge that the engines are close in strength, the a posteriori ELO difference increases across your matches.

Use again this "close in strength prior" of this form and shape:

Unnormalized prior: [4*score*(1-score)]**8000
It looks like this:
[image: plot of the prior, a narrow peak centered on a 50% score]

Now the a posteriori ELO difference increases across your three matches:

You write:
Match-1 - 100 games - 50.5-49.5 - (50.5%) - draws 99
Match-2 - 1000 games - 505 - 495 - (50.5%) - draws 990
Match-3 - 10000 games - 5050-4950 - (50.5%) - draws 9900


Match-1 with only 1 win doesn't say much about ELO.
Match-2 with 10 wins already says a whole lot.
Match-3, with 100 wins and not a single loss in 10,000 games, means X is clearly superior to Y, and that should be made visible in ELO.
With that pictured "close in strength prior",

Match-1: 0.83 ELO points difference
Match-2: 2.93 ELO points difference
Match-3: 3.42 ELO points difference

But all this is maybe useless; the statistical significance is given not by the ELO difference itself, but by ELO/sigma, i.e. t- and p-values. So basically, look at both the ELO difference and the error margins.
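
For anyone who wants to play with this, here is a rough numerical sketch of the calculation. It assumes a normal likelihood centered on the observed score, with the per-match standard error; that is only an approximation of the model I used, so it lands in the same ballpark as the numbers above but not exactly on them.

Code: Select all

# Rough sketch: posterior Elo under the prior [4*s*(1-s)]**8000, with an
# assumed normal likelihood for the observed score fraction.
import math

def posterior_elo(w, l, d):
    n = w + l + d
    obs = (w + 0.5 * d) / n                          # observed score fraction
    var = (w * (1 - obs) ** 2 + d * (0.5 - obs) ** 2 + l * obs ** 2) / n
    se = math.sqrt(var / n)                          # standard error of the score
    num = den = 0.0
    for i in range(1, 20000):                        # crude grid integration over s
        s = i / 20000.0
        logp = 8000.0 * math.log(4.0 * s * (1.0 - s))   # log prior
        logp -= 0.5 * ((s - obs) / se) ** 2             # log likelihood
        weight = math.exp(logp)
        num += s * weight
        den += weight
    s_post = num / den                               # posterior mean score
    return 400.0 * math.log10(s_post / (1.0 - s_post))

for w, l, d in [(1, 0, 99), (10, 0, 990), (100, 0, 9900)]:
    print("posterior Elo difference: about %.1f" % posterior_elo(w, l, d))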
hgm
Posts: 27789
Joined: Fri Mar 10, 2006 10:06 am
Location: Amsterdam
Full name: H G Muller

Re: Ways to avoid "Draw Death" in Computer Chess

Post by hgm »

Would it be a good idea to determine how well engines play Chess by letting them play games from a KRKR start position?
AlvaroBegue
Posts: 931
Joined: Tue Mar 09, 2010 3:46 pm
Location: New York
Full name: Álvaro Begué (RuyDos)

Re: Ways to avoid "Draw Death" in Computer Chess

Post by AlvaroBegue »

hgm wrote:Would it be a good idea to determine how well engines play Chess by letting them play games from a KRKR start position?
If that's an honest question, the answer is no. But I suspect you are trying to convey a message, not actually get an answer. I am just missing what the message is...
hgm
Posts: 27789
Joined: Fri Mar 10, 2006 10:06 am
Location: Amsterdam
Full name: H G Muller

Re: Ways to avoid "Draw Death" in Computer Chess

Post by hgm »

Indeed, it was a rhetorical question. Because this whole thread seems to be about something similar: you let players play a game that is too easy for them, starting deep in the draw zone, and then try to somehow limit the damage this does to the usefulness of the data you collect by fancy calculation. But you cannot recover what is not there.

Much better to collect meaningful data in the first place. Who wins most in a match of 1000 games where 995 games end in a draw does not have to be correlated at all with who on average plays better chess. Consider a player A that once every 10,000 moves blunders away a piece, but otherwise plays optimally, and one (B) that gives up a centi-Pawn every move, and never blunders. Since the average game is shorter than 100 moves, player B will never give up enough to lose (i.e. more than ~100cP), while A will lose about 1% of the games. But start them from the edge of the draw zone and A will still lose the 10 games where he blunders, but all the other games where he started nearly lost he will draw, and those he started nearly won he will win. So A will beat B by 742.5 - 257.5. So when it matters, A plays a lot better than B. But starting deep within the draw zone, it doesn't matter who plays better, but only who blunders more frequently. There is no way you are going to get a meaningful measure from that.
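
For clarity, here is one way to arrive at the 742.5 - 257.5 (a sketch assuming 1000 games, half of them started from positions nearly won for A and half nearly lost, with every game in which A blunders counted as a loss for A):

Code: Select all

# Sketch of the edge-of-the-draw-zone match between A and B.
games    = 1000
blunders = 10                     # A blunders in ~1% of the games
wins_a   = 500 - blunders // 2    # nearly-won games A converts
draws    = 500 - blunders // 2    # nearly-lost games A holds
losses_a = blunders               # blunder games, lost regardless of the start
score_a = wins_a + 0.5 * draws
score_b = games - score_a
print(score_a, "-", score_b)      # 742.5 - 257.5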
Ozymandias
Posts: 1532
Joined: Sun Oct 25, 2009 2:30 am

Re: Ways to avoid "Draw Death" in Computer Chess

Post by Ozymandias »

hgm wrote:A plays a lot better than B. But starting deep within the draw zone, it doesn't matter who plays better, but only who blunders more frequently. There is no way you are going to get a meaningful measure from that.
This is closely related to the problem we faced in the latest Infinity Chess Ultimate Challenge. Better players couldn't convert their advantage into wins, while worse players could profit from their opponents' blunders. Deep and wide books leave you in that draw zone you mention, and from there, there's no way the best player can win but through sheer luck.
Laskos
Posts: 10948
Joined: Wed Jul 26, 2006 10:21 pm
Full name: Kai Laskos

Re: Ways to avoid "Draw Death" in Computer Chess

Post by Laskos »

hgm wrote:Indeed, it was a rhetorical question. Because this whole thread seems to be about something similar: you let players play a game that is too easy for them, starting deep in the draw zone, and then try to somehow limit the damage this does to the usefulness of the data you collect by fancy calculation. But you cannot recover what is not there.

Much better to collect meaningful data in the first place. Who wins most in a match of 1000 games where 995 games end in a draw does not have to be correlated at all with who on average plays better chess. Consider a player A that once every 10,000 moves blunders away a piece, but otherwise plays optimally, and one (B) that gives up a centi-Pawn every move, and never blunders. Since the average game is shorter than 100 moves, player B will never give up enough to lose (i.e. more than ~100cP), while A will lose about 1% of the games. But start them from the edge of the draw zone and A will still lose the 10 games where he blunders, but all the other games where he started nearly lost he will draw, and those he started nearly won he will win. So A will beat B by 742.5 - 257.5. So when it matters, A plays a lot better than B. But starting deep within the draw zone, it doesn't matter who plays better, but only who blunders more frequently. There is no way you are going to get a meaningful measure from that.
I already mentioned that unbalanced openings, although they have significantly higher sensitivity, have the drawback that engines might forget how to play balanced ones. But your extreme example reminds me of Uri's often almost absurd arguments, absurd on the grounds of how implausible they are in the real world. What you exemplify can hardly happen. From my past real-world data, the deviation from the optimal move in engines of different strengths obeys a Gamma distribution, with not much different shape parameters, but with different scale parameters. Here is a plot of plausible probability density functions with the same shape parameter and different scale parameters for engines of pretty different strength:

[image: plausible probability density functions, same shape parameter, different scale parameters]
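
(In case the plot does not display: it shows two Gamma densities with the same shape parameter but different scale parameters. A small illustrative sketch, with made-up parameter values rather than the fitted ones:)

Code: Select all

# Illustrative only: two Gamma densities, same shape, different scales.
import math

def gamma_pdf(x, shape, scale):
    return (x ** (shape - 1) * math.exp(-x / scale)
            / (math.gamma(shape) * scale ** shape))

shape = 2.0                                     # assumed common shape parameter
for label, scale in [("strong engine", 0.05), ("weak engine", 0.15)]:
    pdf = [gamma_pdf(0.01 * i, shape, scale) for i in range(1, 101)]
    peak = 0.01 * (pdf.index(max(pdf)) + 1)
    print(label, "- density of deviation from the optimal move peaks near", peak)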

Under many plausible assumptions, the "strong engine" will beat the "weak engine" both inside the deep drawish zone and on the border zone of decisive advantage. So the results in these two cases are not that independent. It is likely in the real world that an engine stronger with unbalanced openings is also stronger with balanced ones, and if with balanced openings the significance has been squeezed out by too many draws, this gives us a solution to that problem. Strangely, in Checkers they reached this conclusion in the 1990s without any statistical background.

Look at Graham Banks' LTC match as of now:
http://www.talkchess.com/forum/viewtopic.php?t=64666
It has been running for more than two weeks already and hasn't yet reached a significant result. +4 -1 =69 is more than a 93% draw rate, and a LOS of 89% with a uniform prior, hardly decisive. Two weeks of an octal Xeon at continuous max power is a lot of resources. One can hardly continue like that in the coming years.

One solution is a bit artistic and based on belief and common sense: from Andreas' data, it seems that the Win/Loss ratio decays very slowly with deeper drawishness, and stays far from 1 for engines assumed to have different strengths. So, if we believe by common sense that the Win/Loss ratio in Graham's match is far from 1, then we can take this prior distribution:

prior: W**3 * L**(1/3)

LOS in this case for +4 -1 =69 is 97%, pretty decisive now and independent of draw rate too.
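
One way to arrive at these LOS numbers (a sketch: draws are ignored and the prior is placed on the probability that a decisive game is a win, which is also why the result does not depend on the draw rate):

Code: Select all

# LOS for +4 -1 (draws ignored): uniform prior vs. the W**3 * L**(1/3) prior,
# read here as a Beta(4, 4/3) prior on the win probability of a decisive game.
from scipy.stats import beta

wins, losses = 4, 1
for name, (a0, b0) in [("uniform prior        ", (1.0, 1.0)),
                       ("W**3 * L**(1/3) prior", (4.0, 4.0 / 3.0))]:
    los = beta.sf(0.5, a0 + wins, b0 + losses)    # P(win prob > 0.5 | data)
    print(name, "LOS = %.0f%%" % (100 * los))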


Or, we can adopt the theoretically and practically sounder "unbalanced openings" strategy, and for a 93% draw rate improve the sensitivity (t-value, or Normalized ELO, or Multiplicative Rating) by at least a factor of 2, meaning a LOS of above 99% (with a uniform prior). Graham's match would have been decided long ago with some +15 -5 =20 result. Well, in this case you can come back with your fairly absurd example quoted above, but again, that example has few (if any) embodiments in real chess engine life.
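
A sketch of that factor-of-2 argument, using the normal approximation for a one-sided LOS (which is slightly different from the Bayesian LOS above):

Code: Select all

# Double the normalized win/loss excess of the +4 -1 result and translate it
# into a one-sided LOS via the standard normal CDF.
import math

def phi(x):
    return 0.5 * (1.0 + math.erf(x / math.sqrt(2.0)))

t = (4 - 1) / math.sqrt(4 + 1)                        # about 1.34
print("LOS (normal approx), current result: %.1f%%" % (100 * phi(t)))
print("LOS with doubled sensitivity:         %.1f%%" % (100 * phi(2 * t)))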
hgm
Posts: 27789
Joined: Fri Mar 10, 2006 10:06 am
Location: Amsterdam
Full name: H G Muller

Re: Ways to avoid "Draw Death" in Computer Chess

Post by hgm »

Indeed, it was an extreme example to illustrate the point, but then the consequences were also extreme, turning a 10-0 loss into a 75% victory.

The point is that if there is any relation between a match starting from deep within the draw zone and one starting from the edge, it is critically dependent on that distribution, and in particular on the far tail of that distribution. You argue as if that distribution is a given certainty, but it is not. I doubt you have enough data to prove this distribution is anything more than an assumption on your part.

And even if you had a very accurate measurement of that distribution, there is no guarantee that any new engine you want to test would have the same distribution. The entire sport of 'improving' engines is about shaping that distribution: culling away knowledge for speed, and more speculatively following some lines at the expense of pruning others. All of this is expected to shift the peak of the distribution to the left because of better tactics, but to lengthen the tail because of unrecognized strategic blunders.

It is the characteristics of that distribution that in the end determine how useful the engine is for a particular purpose. And measuring from within the draw zone is not sensitive to the relevant characteristics. If you start assuming relations between those characteristics without measuring them, you might as well just assume which engine is better. Then you could do the test with zero games...
Pio
Posts: 334
Joined: Sat Feb 25, 2012 10:42 pm
Location: Stockholm

Re: Ways to avoid "Draw Death" in Computer Chess

Post by Pio »

I just love your posts, HGM. You are of course correct that the engines will shift their distribution if they are tested very close to the edge. I do not understand why people do not use your time-odds idea, which is brilliant.

/Pio