Towards a standard analysis output format

Discussion of chess software programming and technical issues.

Moderator: Ras

User avatar
Don
Posts: 5106
Joined: Tue Apr 29, 2008 4:27 pm

Re: Towards a standard analysis output format

Post by Don »

hgm wrote:That depends strongly on game phase. (Unless the engine does not report in centi-Pawn, but already scales it with game phase, as Rybka does.) In KPPK micro-Max would think itself at +2, but its winning chances would be close to 100% even when it faces the 1000-Elo stronger Houdini...
The inverse formula should be used for tuning the program - otherwise I think it's evidence that your evaluation is out of balance. I don't think it should really depend on game phase, but maybe, as you say, it really does.
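By "inverse formula" I mean something like the following - a minimal sketch, assuming the 1 / (1 + 10 ** (-eval/4.0)) mapping from the chess programming wiki that gets quoted later in this thread; the 4.0 scale is that formula's choice, not something I've tuned:

```c
#include <math.h>
#include <stdio.h>

/* Inverse of p = 1 / (1 + 10^(-eval/4)): given an observed score
   fraction from games, recover the eval (in pawns) that the mapping
   would assign to it.  Useful as a tuning target, not a measured fit. */
static double score_to_eval(double p)
{
    return 4.0 * log10(p / (1.0 - p));
}

int main(void)
{
    /* e.g. a position type that scores 72% would be shown as
       roughly +1.6 under this particular mapping */
    printf("72%% -> %+.2f pawns\n", score_to_eval(0.72));
    return 0;
}
```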
User avatar
Houdini
Posts: 1471
Joined: Tue Mar 16, 2010 12:00 am

Re: Towards a standard analysis output format

Post by Houdini »

hgm wrote:That seems too much. A +1 score should correspond to Pawn-odds, and Pawn-odds is supposed to produce around 72% score. That would only be 72% win in complete absence of draws, while it is in fact more likely that most of the points of the side that is behind come from draws.

Have you tried self-play of Houdini with Pawn odds?
If by "pawn-odds" you understand removing one pawn in the starting position (which pawn, by the way?), it doesn't create a "+1" advantage. Open lines and easier development create a lot of compensation for the pawn.
For example, if I remove the black d-pawn, Houdini scores the position at around +0.4 to +0.5 at low search depths, the eval rises to about +0.7 when the search goes deeper. It illustrates the difficulty of defining what exactly is a "+1" advantage.

Robert
User avatar
Don
Posts: 5106
Joined: Tue Apr 29, 2008 4:27 pm

Re: Towards a standard analysis output format

Post by Don »

Houdini wrote:
hgm wrote:That seems too much. A +1 score should correspond to Pawn-odds, and Pawn-odds is supposed to produce around 72% score. That would only be 72% win in complete absence of draws, while it is in fact more likely that most of the points of the side that is behind come from draws.

Have you tried self-play of Houdini with Pawn odds?
If by "pawn-odds" you understand removing one pawn in the starting position (which pawn, by the way?), it doesn't create a "+1" advantage. Open lines and easier development create a lot of compensation for the pawn.
For example, if I remove the black d-pawn, Houdini scores the position at around +0.4 to +0.5 at low search depths, the eval rises to about +0.7 when the search goes deeper. It illustrates the difficulty of defining what exactly is a "+1" advantage.

Robert
A general problem in trying to deduce certain things from games is similar. For example, one way to attempt to determine the value of the bishop pair is to see what kind of results are obtained in games where one side first gets the bishop pair (and perhaps keeps it for at least a couple of moves). HOWEVER, such a method does not consider that the player giving up the bishop pair probably had a good reason for doing so.

Don
User avatar
hgm
Posts: 28387
Joined: Fri Mar 10, 2006 10:06 am
Location: Amsterdam
Full name: H G Muller

Re: Towards a standard analysis output format

Post by hgm »

Standard pawn-odds removes the f-pawn, which gives as little compensation as possible.
User avatar
marcelk
Posts: 348
Joined: Sat Feb 27, 2010 12:21 am

Re: Towards a standard analysis output format

Post by marcelk »

Houdini wrote:
marcelk wrote:Suggestion 2:

Don't report in pawns but in winning chance using 1 / (1 + 10 ** (-eval/4.0)) from the chess programming wiki.

70.3% is equivalent to white being 1.5 pawns up
42.9% is equivalent to white being half a pawn down.

Such numbers are also something you can put on television, like is done in reporting poker matches.
Surely the coefficient 4.0 in the formula cannot be correct; it implies that an eval of +4 would only give a 90% win chance.

For Houdini, +1 eval corresponds to about 85% win, +3 eval is higher than 99% win.
In other words, the formula for Houdini is probably close to 1 / (1 + 10**(-eval*0.75)) .

Robert
Suggestion 3:

1-0 : Houdini says it will win as white
0-1: Houdini says it will win as black
* : didn't check with Houdini yet
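More seriously, here is a minimal sketch of the two mappings being compared above; the 0.75 scale is Robert's rough fit for Houdini as quoted, not something I have measured myself:

```c
#include <math.h>
#include <stdio.h>

/* Win expectancy from an eval given in pawns.
   scale = 1/4.0 is the chess programming wiki version,
   scale = 0.75  is the rough fit Robert quotes for Houdini. */
static double win_chance(double eval_pawns, double scale)
{
    return 1.0 / (1.0 + pow(10.0, -eval_pawns * scale));
}

int main(void)
{
    const double evals[] = { -0.5, 0.0, 0.5, 1.0, 1.5, 3.0 };

    for (int i = 0; i < 6; i++)
        printf("%+.1f  wiki: %5.1f%%   Houdini fit: %5.1f%%\n",
               evals[i],
               100.0 * win_chance(evals[i], 1.0 / 4.0),
               100.0 * win_chance(evals[i], 0.75));
    return 0;
}
```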
bob
Posts: 20943
Joined: Mon Feb 27, 2006 7:30 pm
Location: Birmingham, AL

Re: Towards a standard analysis output format

Post by bob »

Dann Corbit wrote:
bob wrote:
marcelk wrote:
sje wrote: The most important thing is to standardize for which side you are reporting the score.
Suggestion 1:

"1.5w" means white is 1.5 pawn up
"0.5b" black is half a pawn up
No ambiguity.

Suggestion 2:

Don't report in pawns but in winning chance using 1 / (1 + 10 ** (-eval/4.0)) from the chess programming wiki.

70.3% is equivalent to white being 1.5 pawns up
42.9% is equivalent to white being half a pawn down.

Such numbers are also something you can put on television, like is done in reporting poker matches.
How accurate is that? I could probably run (say) 100K games on the cluster, convert Crafty's scores to that, and create a large file that showed each value in the range of 0 to 100% and record the actual result. Then combine it to see if 70% really wins 70% of the games, or something significantly better or worse...

I was thinking about a large array inside Crafty, one entry per move in the game. I record that "percentage" by transforming the eval. Then, at the end of the game, dump each one, but with the real result paired with it. Would not be hard to combine all that data, run it thru a simple program and show for each "percentage of wins based on eval score" what the actual winning percentage was...
For winning percentage, the program could connect to a database of chess games and form the estimate from that (if and only if the position is in the database). I don't believe you can make an accurate winning percentage from a ce value alone unless it is a mate position.
Logically there must be _some_ correlation between CE and winning expectation, else the evaluation is broken, and badly. How strong that correlation is remains a question. I think I am going to tackle this when a student load ends on the cluster...
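The post-processing side could be as simple as the sketch below - purely illustrative, with a made-up input format of one "predicted_pct result" pair per recorded position; the logging inside Crafty would just have to dump those pairs at the end of each game:

```c
#include <stdio.h>

#define NBINS 20    /* 5%-wide bins: 0-5, 5-10, ... 95-100 */

/* Read "predicted_pct result" pairs (result = 1.0, 0.5 or 0.0 for the
   side the prediction was made for), bin by predicted percentage, and
   print the actual score in each bin for comparison. */
int main(void)
{
    double pred, result;
    double sum[NBINS] = { 0.0 };
    long   cnt[NBINS] = { 0 };

    while (scanf("%lf %lf", &pred, &result) == 2) {
        int b = (int)(pred / 100.0 * NBINS);
        if (b < 0)      b = 0;
        if (b >= NBINS) b = NBINS - 1;
        sum[b] += result;
        cnt[b]++;
    }

    for (int b = 0; b < NBINS; b++)
        if (cnt[b])
            printf("predicted %3d-%3d%%: actual %5.1f%%  (%ld positions)\n",
                   b * 100 / NBINS, (b + 1) * 100 / NBINS,
                   100.0 * sum[b] / cnt[b], cnt[b]);
    return 0;
}
```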
bob
Posts: 20943
Joined: Mon Feb 27, 2006 7:30 pm
Location: Birmingham, AL

Re: Towards a standard analysis output format

Post by bob »

Houdini wrote:
marcelk wrote:Suggestion 2:

Don't report in pawns but in winning chance using 1 / (1 + 10 ** (-eval/4.0)) from the chess programming wiki.

70.3% is equivalent to white being 1.5 pawns up
42.9% is equivalent to white being half a pawn down.

Such numbers are also something you can put on television, like is done in reporting poker matches.
Surely the coefficient 4.0 in the formula cannot be correct; it implies that an eval of +4 would only give a 90% win chance.

For Houdini, +1 eval corresponds to about 85% win, +3 eval is higher than 99% win.
In other words, the formula for Houdini is probably close to 1 / (1 + 10**(-eval*0.75)) .

Robert
How sure are you of that? I watched a Houdini vs Crafty game about 2 weeks ago where Houdini was +5 and it ended in a dead draw. I'll try to find it, but it is buried in a few thousand games...
bob
Posts: 20943
Joined: Mon Feb 27, 2006 7:30 pm
Location: Birmingham, AL

Re: Towards a standard analysis output format

Post by bob »

hgm wrote:
bob wrote:I wonder if I ought to actually do this formally as an experiment? I could certainly take Crafty and play it against the gauntlet, normally, then at -1.00, and then -2.00 to see just what effect removing one or two pawns does in terms of Elo...
I think such 'direct' piece-value measurements are quite interesting. I played many tens of thousands of such material-imbalance games, not only with Pawn odds, but also deleting Bishop vs Knight, Bishop vs Knight + Pawn, Queen vs 2 Bishops + 1 Knight, etc. (Mostly on 10x8 boards.) In principle you could measure the score of any advantage that way. E.g. if you want to know how much castling rights are worth, play one side without castling rights, and see by how much it loses, in comparison to the Pawn-odds score.

To make it work you need an engine that randomizes well (as there are no books for such imbalanced positions), or shuffle the initial positions (e.g. in Chess960 fashion). And deleting multiple Pawns sometimes gave inconsistent results (the second Pawn having negative effective value), presumably because you give that side a tremendous advantage in development, which can be very dangerous in 10x8 Chess.
I was thinking of taking my test set of starting positions, and simply removing a pawn in each, since I play without a book anyway.

But even more importantly, I might try to take the evaluation of Crafty, and "discretize" the numbers into blocks of say .25, and then count wins and losses to see how +0, +.25, +.5 and so forth compare to the final result...

The only issue I see is that a +.5 in the opening might be meaningless, while in an endgame it is possibly winning.
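The bookkeeping I have in mind is no more than the sketch below; the 0.25-pawn bucket width and the result encoding are just what I'd start with, not what Crafty currently does:

```c
/* Bin evals into 0.25-pawn buckets and accumulate game results, so each
   bucket ends up with "how often did +x actually score".  record_eval()
   would be called once per move during the game, flush_game() once the
   result is known.  Illustrative sketch only. */
#define MAXPLY    1024
#define NBUCKETS  400                           /* covers -50.00 .. +50.00 */
#define BUCKET(e) ((int)((e) * 4.0 + 200.0))    /* 0.25-pawn resolution */

static double game_eval[MAXPLY];
static int    game_len;
static double bucket_score[NBUCKETS];
static long   bucket_count[NBUCKETS];

void record_eval(double eval_pawns)
{
    if (game_len < MAXPLY)
        game_eval[game_len++] = eval_pawns;
}

void flush_game(double result)   /* 1.0, 0.5 or 0.0 for the side evaluated */
{
    for (int i = 0; i < game_len; i++) {
        int b = BUCKET(game_eval[i]);
        if (b >= 0 && b < NBUCKETS) {
            bucket_score[b] += result;
            bucket_count[b]++;
        }
    }
    game_len = 0;
}
```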
User avatar
Don
Posts: 5106
Joined: Tue Apr 29, 2008 4:27 pm

Re: Towards a standard analysis output format

Post by Don »

bob wrote:
hgm wrote:
bob wrote:I wonder if I ought to actually do this formally as an experiment? I could certainly take Crafty and play it against the gauntlet, normally, then at -1.00, and then -2.00 to see just what effect removing one or two pawns does in terms of Elo...
I think such 'direct' piece-value measurements are quite interesting. I played many tens of thousands of such material-imbalance games, not only with Pawn odds, but also deleting Bishop vs Knight, Bishop vs Knight + Pawn, Queen vs 2 Bishops + 1 Knight, etc. (Mostly on 10x8 boards.) In principle you could measure the score of any advantage that way. E.g. if you want to know how much castling rights are worth, play one side without castling rights, and see by how much it loses, in comparison to the Pawn-odds score.

To make it work you need an engine that randomizes well (as there are no books for such imbalanced positions), or shuffle the initial positions (e.g. in Chess960 fashion). And deleting multiple Pawns sometimes gave inconsistent results (the second Pawn having negative effective value), presumably because you give that side a tremendous advantage in development, which can be very dangerous in 10x8 Chess.
I was thinking of taking my test set of starting positions, and simply removing a pawn in each, since I play without a book anyway.

But even more importantly, I might try to take the evaluation of Crafty, and "discretize" the numbers into blocks of say .25, and then count wins and losses to see how +0, +.25, +.5 and so forth compare to the final result...

The only issue I see is that a +.5 in the opening might be meaningless, while in an endgame it is possibly winning.
In the opening a score close to zero is an accurate prediction. A score of 0.5 is a very good score in Komodo even right out of the opening. 0.5 is not a win - it's just a very good score. I don't believe a score of 0.5 is better in the ending than in the opening. I think it means you have good chances, but nothing like a certain win.

But if you are going to do the study, you can test this too. Don't just keep discrete buckets by score, but also keep the phase of the game (perhaps by summing all the non-pawn material using the classical values, for instance).
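Something along these lines is what I mean by keeping the phase - the classical values and the thresholds below are arbitrary, just enough to split the data into a few bins per eval bucket:

```c
/* Rough game-phase index from total non-pawn material, using the
   classical piece values (N = B = 3, R = 5, Q = 9, both sides summed;
   the full starting position comes to 62).  The cut-offs are arbitrary. */
enum phase { OPENING, MIDDLEGAME, ENDGAME };

static enum phase game_phase(int knights, int bishops, int rooks, int queens)
{
    int material = 3 * knights + 3 * bishops + 5 * rooks + 9 * queens;

    if (material >= 50) return OPENING;
    if (material >= 20) return MIDDLEGAME;
    return ENDGAME;
}
```

Then the win/draw/loss counts would be kept per (eval bucket, phase) pair instead of per eval bucket alone.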

I might do this study too, it sounds like fun.

Don
Dann Corbit
Posts: 12792
Joined: Wed Mar 08, 2006 8:57 pm
Location: Redmond, WA USA

Re: Towards a standard analysis output format

Post by Dann Corbit »

bob wrote:
Dann Corbit wrote:
bob wrote:
marcelk wrote:
sje wrote: The most important thing is to standardize for which side you are reporting the score.
Suggestion 1:

"1.5w" means white is 1.5 pawn up
"0.5b" black is half a pawn up
No ambiguity.

Suggestion 2:

Don't report in pawns but in winning chance using 1 / (1 + 10 ** (-eval/4.0)) from the chess programming wiki.

70.3% is equivalent to white being 1.5 pawns up
42.9% is equivalent to white being half a pawn down.

Such numbers are also something you can put on television, like is done in reporting poker matches.
How accurate is that? I could probably run (say) 100K games on the cluster, convert Crafty's scores to that, and create a large file that showed each value in the range of 0 to 100% and record the actual result. Then combine it to see if 70% really wins 70% of the games, or something significantly better or worse...

I was thinking about a large array inside Crafty, one entry per move in the game. I record that "percentage" by transforming the eval. Then, at the end of the game, dump each one, but with the real result paired with it. Would not be hard to combine all that data, run it thru a simple program and show for each "percentage of wins based on eval score" what the actual winning percentage was...
For winning percentage, the program could connect to a database of chess games and form the estimate from that (if and only if the position is in the database). I don't believe you can make an accurate winning percentage from a ce value alone unless it is a mate position.
Logically there must be _some_ correlation between CE and winning expectation, else the evaluation is broken, and badly. How strong that correlation is remains a question. I think I am going to tackle this when a student load ends on the cluster...
I am sure it is true on average, but for a single instance of a position, I suspect it is faulty. For instance, I have frequently seen strong programs think that they are a full pawn ahead, only to discover that they are really in trouble a few moves later. So if we were to correlate the previous ce to a winning percentage, it would give a bad answer.

It could be an interesting study to see what combination of ce and/or actual winning percentage (perhaps as a function of Elo expectation) works best.
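As a strawman, the kind of combination I have in mind looks like the sketch below - the logistic scale and the weighting constant are placeholders, not measured values:

```c
#include <math.h>

/* Blend the eval-derived expectation with a database frequency when the
   position is known, trusting the database more as its game count grows.
   Both the 4.0 logistic scale and the k weighting constant are made up. */
static double eval_expectation(double eval_pawns)
{
    return 1.0 / (1.0 + pow(10.0, -eval_pawns / 4.0));
}

static double combined_expectation(double eval_pawns,
                                   long   db_games,   /* 0 if position unknown */
                                   double db_score)   /* score fraction in the DB */
{
    const double k = 50.0;    /* games needed before the DB dominates */
    double w = (double)db_games / ((double)db_games + k);

    return w * db_score + (1.0 - w) * eval_expectation(eval_pawns);
}
```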