Towards a standard analysis output format


Laskos
Posts: 10948
Joined: Wed Jul 26, 2006 10:21 pm
Full name: Kai Laskos

Re: Towards a standard analysis output format

Post by Laskos »

bob wrote:
However, a pawn is one thing, but +1.00 is something else entirely. It might be a pawn advantage, or it might be a piece advantage where the king is horribly exposed. A speculative eval is probably going to produce a different "conversion factor" (to % win probability) than a more cautious evaluation...

But I can certainly produce a reasonable eval to % formula for Crafty, which I think I will try. And then give the user the option of displaying a normal eval, or a probability of winning eval.
Is +1.00 in the opening the same as +1.00 in the endgame as a winning %?

Kai
marcelk
Posts: 348
Joined: Sat Feb 27, 2010 12:21 am

Re: Towards a standard analysis output format

Post by marcelk »

bob wrote: But I can certainly produce a reasonable eval to % formula for Crafty, which I think I will try. And then give the user the option of displaying a normal eval, or a probability of winning eval.
That would be pretty nice. It sounds like the right way to me, from a higher point of view. Note that if the program has knowledge of the rating difference with its opponent, it can adjust its expectation accordingly. That is just the good old contempt factor, but expressed in the %-domain instead of the pawn-domain.
Don
Posts: 5106
Joined: Tue Apr 29, 2008 4:27 pm

Re: Towards a standard analysis output format

Post by Don »

Laskos wrote:
bob wrote:
However, a pawn is one thing, but +1.00 is something else entirely. It might be a pawn advantage, or it might be a piece advantage where the king is horribly exposed. A speculative eval is probably going to produce a different "conversion factor" (to % win probability) than a more cautious evaluation...

But I can certainly produce a reasonable eval to % formula for Crafty, which I think I will try. And then give the user the option of displaying a normal eval, or a probability of winning eval.
Is +1.00 in the opening the same as +1.00 in the endgame as a winning %?
That depends ...

The score as reported by the program is just an arbitrary value. In theory, being exactly 1 pawn up with no "other" compensation is exactly 1.0, but that of course depends entirely on the program and how it decides to score positions. In Komodo a pawn in the opening is about 0.8 and in the ending it's more than 1.0. But the value of a pawn is also meaningless without the context of the other pieces and the evaluation feature weights.

So it's not possible to say whether +1.0 means the same thing in the opening and the ending (in the general case) without a great deal more context about the specific program.

It might be useful, however, to tune the evaluation weights based on making this formula more predictive. Some authors fix the value of a pawn to 1.0, but if we instead "fixed" what our expectation is for being 1 pawn up, we might have a nicer evaluation function.

I think temporal difference learning of the evaluation weights tries to optimize exactly that.
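
A minimal sketch of that tuning idea, assuming results are scored 1/0.5/0 from the side to move's view; all names here are illustrative, not code from any engine in this thread:

Code:

# Sketch: tune evaluation weights so the logistic mapping from score
# to expected result fits actual game outcomes.

def win_prob(score_pawns, k=0.5):
    # Logistic mapping from a score in pawns to an expected result.
    return 1.0 / (1.0 + 10.0 ** (-k * score_pawns))

def tuning_error(weights, samples):
    # samples: list of (features, result) pairs, where result is
    # 1 for a win, 0.5 for a draw, 0 for a loss.
    err = 0.0
    for features, result in samples:
        score = sum(w * f for w, f in zip(weights, features))
        err += (win_prob(score) - result) ** 2
    return err / len(samples)

# An optimizer (gradient descent, or even hand tweaking) would then
# search for the weights that minimize tuning_error.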

bob
Posts: 20943
Joined: Mon Feb 27, 2006 7:30 pm
Location: Birmingham, AL

Re: Towards a standard analysis output format

Post by bob »

Dann Corbit wrote:
bob wrote:
Dann Corbit wrote:
bob wrote:
marcelk wrote:
sje wrote: The most important thing is to standardize for which side you are reporting the score.
Suggestion 1:

"1.5w" means white is 1.5 pawn up
"0.5b" black is half a pawn up
No ambiguity.

Suggestion 2:

Don't report in pawns but in winning chance using 1 / (1 + 10 ** (-eval/4.0)) from the chess programming wiki.

70.3% is equivalent to white being 1.5 pawns up
42.9% is equivalent to white being half a pawn down.

Such numbers are also something you can put on television, as is done when reporting poker matches.
How accurate is that? I could probably run (say) 100K games on the cluster, convert Crafty's scores to that, and create a large file that shows each value in the range of 0 to 100% with the actual result recorded alongside. Then combine it all to see if 70% really wins 70% of the games, or something significantly better or worse...

I was thinking about a large array inside Crafty, one entry per move in the game. I record that "percentage" by transforming the eval. Then, at the end of the game, dump each one, but with the real result paired with it. It would not be hard to combine all that data, run it through a simple program, and show for each "percentage of wins based on eval score" what the actual winning percentage was...
For winning percentage, the program could connect to a database of chess games and form the estimate from that (if and only if the position is in the database). I don't believe you can make an accurate winning percentage from a ce value alone unless it is a mate position.
Logically there must be _some_ correlation between CE and winning expectation, else the evaluation is broken, and badly. How strong the correlation is remains the question. I think I am going to tackle this when the student load on the cluster ends...
I am sure it is true on average, but for a single instance of a position, I suspect it is faulty. For instance, I have frequently seen strong programs think that they are a full pawn ahead only to discover that really they are in trouble a few moves later. So if we were to correlate the previous ce to a winning percentage it would give a bad answer.

It could be an interesting study to see what combination of ce and/or actual winning percentage (perhaps as a function of Elo expectation) works best.
However, +1, if it equates to 70% winning, is still a probability. 30% of the time you fail to win from +1, so it would be OK...
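
A rough sketch of the bookkeeping described above, assuming results scored 1/0.5/0 from White's point of view; the names are illustrative, not Crafty's actual code:

Code:

# Convert each eval to a percentage via marcelk's suggested mapping,
# pair it with the game's real result, and aggregate across games.

from collections import defaultdict

def win_prob(eval_pawns):
    # 1 / (1 + 10^(-eval/4)), from the chess programming wiki.
    return 1.0 / (1.0 + 10.0 ** (-eval_pawns / 4.0))

stats = defaultdict(lambda: [0, 0.0])   # pct -> [games, points]

def record_game(evals, result):
    # evals: one score (in pawns, White's view) per move of the game;
    # result: 1.0 White win, 0.5 draw, 0.0 White loss.
    for e in evals:
        pct = round(100.0 * win_prob(e))
        stats[pct][0] += 1
        stats[pct][1] += result

def report():
    # For each predicted percentage, show the percentage actually scored.
    for pct in sorted(stats):
        games, points = stats[pct]
        print("predicted %3d%%  actual %6.2f%%  n=%d"
              % (pct, 100.0 * points / games, games))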
bob
Posts: 20943
Joined: Mon Feb 27, 2006 7:30 pm
Location: Birmingham, AL

Re: Towards a standard analysis output format

Post by bob »

Laskos wrote:
bob wrote:
However, a pawn is one thing, but +1.00 is something else entirely. It might be a pawn advantage, or it might be a piece advantage where the king is horribly exposed. A speculative eval is probably going to produce a different "conversion factor" (to % win probability) than a more cautious evaluation...

But I can certainly produce a reasonable eval to % formula for Crafty, which I think I will try. And then give the user the option of displaying a normal eval, or a probability of winning eval.
Is +1.00 in the opening the same as +1.00 in the endgame as a winning %?

Kai
I don't see how. Perhaps Don's suggestion of some sort of "phase" could help, making it a two-dimensional process:

win_probability = conversion[total_material][score]

That is my feeling here from watching way more games than I should have over the years...
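
A sketch of that two-dimensional lookup; the bucket sizes and the material scale are arbitrary choices of mine, since the thread does not specify them:

Code:

# Win probability indexed by total material and score,
# win_probability = conversion[material][score], filled from game
# statistics exactly as in the one-dimensional experiment.

SCORE_STEPS = 1001            # -5.00 .. +5.00 pawns in 0.01 steps
MATERIAL_BUCKETS = 8          # coarse game-phase buckets

games = [[0] * SCORE_STEPS for _ in range(MATERIAL_BUCKETS)]
points = [[0.0] * SCORE_STEPS for _ in range(MATERIAL_BUCKETS)]

def cell(material, score):
    # material: assumed 0..63 scale of non-pawn material; any phase
    # measure would do. The score is clamped to the sampled range.
    m = min(material * MATERIAL_BUCKETS // 64, MATERIAL_BUCKETS - 1)
    s = int(round((max(-5.0, min(5.0, score)) + 5.0) * 100))
    return m, s

def record(material, score, result):
    m, s = cell(material, score)
    games[m][s] += 1
    points[m][s] += result

def win_probability(material, score):
    m, s = cell(material, score)
    return points[m][s] / games[m][s] if games[m][s] else 0.5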
bob
Posts: 20943
Joined: Mon Feb 27, 2006 7:30 pm
Location: Birmingham, AL

Re: Towards a standard analysis output format

Post by bob »

marcelk wrote:
bob wrote: But I can certainly produce a reasonable eval to % formula for Crafty, which I think I will try. And then give the user the option of displaying a normal eval, or a probability of winning eval.
That would be pretty nice. It sounds like the right way to me, from a higher point of view. Note that if the program has knowledge of the rating difference with its opponent, it can adjust its expectation accordingly. That is just the good old contempt factor, but expressed in the %-domain instead of the pawn-domain.
It is complicated by the rating issue. If we use the old rule that a +200 Elo advantage translates into roughly a 75% winning chance, one could correct for rating difference. I certainly know the ratings of both opponents (in Crafty) when playing on a server. When a user plays Crafty directly via xboard or whatever, there is an assumption about ratings that is probably way off (I think I assume they are equal, but am not sure).

However, I could probably test this, since in my testing the latest Stockfish is about +200 above Crafty. That corresponds to a 75% winning probability, which is roughly the winning percentage I see in my testing.

It will take a bit of thinking on how to produce this winning-percentage matrix in the most usable form, but I think I am going to do it just for fun, since one criticism of computer chess programs over the years has always been that evaluations are inconsistent between any two programs...
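
For the rating part, the standard Elo expectation formula gives the correction directly (+200 comes out at about 0.76, close to the old 75% rule of thumb). Combining the rating prior with the eval-based probability by adding them in logit space, as sketched below, is my own illustration, not something specified in the thread:

Code:

import math

def elo_expectation(rating_diff):
    # Standard Elo expectation: E = 1 / (1 + 10^(-diff/400)).
    return 1.0 / (1.0 + 10.0 ** (-rating_diff / 400.0))

def logit10(p):
    return math.log10(p / (1.0 - p))

def adjusted_win_prob(eval_pawns, rating_diff, k=0.25):
    # Shift the eval-based logistic curve by the rating prior.
    # k = 0.25 matches the 1/(1 + 10^(-eval/4)) mapping quoted earlier.
    z = k * eval_pawns + logit10(elo_expectation(rating_diff))
    return 1.0 / (1.0 + 10.0 ** (-z))

# adjusted_win_prob(0.0, 200) -> about 0.76: a level score against a
# 200-point weaker opponent is already a favourable expectation.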
Don
Posts: 5106
Joined: Tue Apr 29, 2008 4:27 pm

Re: Towards a standard analysis output format

Post by Don »

OK, I ran off several games and plotted the results using gnuplot to see if the scoring data matches the function HG (or someone) presented in this thread.

Here is what I get:


[image: gnuplot of win percentage vs. score, with the fitted logistic curve in blue]

The blue line is the pure formula that seems to best fit the data; I basically adjusted it visually, trying different values until the two lines seemed to match. I paid more attention to the values between -2 and +2 pawns.

I only sampled values between -5 and +5 and I did not bucket them; each score is a data point. I only sampled up to the 40th move.

I think this shows pretty convincingly that the logistic function is a very good statistical predictor of the final result.

What could be done next is to sample only positions AFTER the first 40 moves and plot this over the top of the other lines to see if they overlay properly or have some distortion.

Don
bob wrote:
marcelk wrote:
bob wrote: But I can certainly produce a reasonable eval to % formula for Crafty, which I think I will try. And then give the user the option of displaying a normal eval, or a probability of winning eval.
That would be pretty nice. It sounds like the right way to me, from a higher point of view. Note that if the program has knowledge of the rating difference with its opponent, it can adjust its expectation accordingly. That is just the good old contempt factor, but expressed in the %-domain instead of the pawn-domain.
It is complicated by the rating issue. If we use the old rule that a +200 Elo advantage translates into roughly a 75% winning chance, one could correct for rating difference. I certainly know the ratings of both opponents (in Crafty) when playing on a server. When a user plays Crafty directly via xboard or whatever, there is an assumption about ratings that is probably way off (I think I assume they are equal, but am not sure).

However, I could probably test this, since in my testing the latest Stockfish is about +200 above Crafty. That corresponds to a 75% winning probability, which is roughly the winning percentage I see in my testing.

It will take a bit of thinking on how to produce this winning-percentage matrix in the most usable form, but I think I am going to do it just for fun, since one criticism of computer chess programs over the years has always been that evaluations are inconsistent between any two programs...
Laskos
Posts: 10948
Joined: Wed Jul 26, 2006 10:21 pm
Full name: Kai Laskos

Re: Towards a standard analysis output format

Post by Laskos »

Don wrote:
The blue line is the pure formula that seems to best fit the data; I basically adjusted it visually, trying different values until the two lines seemed to match. I paid more attention to the values between -2 and +2 pawns.

I only sampled values between -5 and +5 and I did not bucket them; each score is a data point. I only sampled up to the 40th move.

I think this shows pretty convincingly that the logistic function is a very good statistical predictor of the final result.

What could be done next is to sample only positions AFTER the first 40 moves and plot this over the top of the other lines to see if they overlay properly or have some distortion.

Don
Could you help me understand better? Did you take the scores of all the moves from 1 to 40, and then take some sort of average over all games to derive that 23% win probability?
As Bob suggested having a 3D graph of (win %) against (score, material), is it hard just to compare results at exactly move 15 to results at exactly move 50? Also, the time control might be a factor, as it certainly is for the draw ratio. Some care might be required when generalizing the results.

Kai
Don
Posts: 5106
Joined: Tue Apr 29, 2008 4:27 pm

Re: Towards a standard analysis output format

Post by Don »

Laskos wrote:
Don wrote:
The blue line is the pure formula that seems to best fit the data; I basically adjusted it visually, trying different values until the two lines seemed to match. I paid more attention to the values between -2 and +2 pawns.

I only sampled values between -5 and +5 and I did not bucket them; each score is a data point. I only sampled up to the 40th move.

I think this shows pretty convincingly that the logistic function is a very good statistical predictor of the final result.

What could be done next is to sample only positions AFTER the first 40 moves and plot this over the top of the other lines to see if they overlay properly or have some distortion.

Don
Could you help me understand better? Did you take the scores of all the moves from 1 to 40, and then take some sort of average over all games to derive that 23% win probability?
As Bob suggested having a 3D graph of (win %) against (score, material), is it hard just to compare results at exactly move 15 to results at exactly move 50? Also, the time control might be a factor, as it certainly is for the draw ratio. Some care might be required when generalizing the results.

Kai
No, what I did was take the data from thousands of games. I tracked 1001 different discrete scores and their winning percentages. For example -5.00, -4.99, -4.98, -4.97 .... 4.98, 4.99, 5.00.

As the graph shows, when the score is 0.0 you win about half the games (0.5 on the y-axis). When the score is about 1.0 (a pawn up), the graph shows that you will win that game with about probability 0.76.

I ran the games to depth 6 in order to get a large sample very quickly. I'm now running 7 ply in order to compare the two curves.

Here are some sample statistics; the left column is the score, the right column the win percentage:


score win perc
------ -----------
-0.88 0.30456
-0.87 0.29916
-0.86 0.30151
-0.85 0.29757
-0.84 0.31164
-0.83 0.28086
-0.82 0.29561
-0.81 0.30065
-0.80 0.30260
-0.79 0.32027
-0.78 0.33167
-0.77 0.31045
-0.76 0.32959
-0.75 0.31176
-0.74 0.32135
-0.73 0.31250


If you look at the BLUE line, it is a plot of the formula: 1 / (1 + 10 ** (-x * 0.50))

The constant 0.50 was chosen to fit the data visually and depends on how aggressive your evaluation function is. Not all programs treat a score of 1.0 the same. Stockfish seems to give higher scores for smaller advantages than Komodo, but it's all relative and not important; it just affects the constant used in the formula if you want to "standardize."
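
As a quick sanity check of the fit, one can evaluate that formula at a few of the sampled scores from the table above and compare; a small sketch of mine:

Code:

# Compare the fitted curve 1/(1 + 10**(-x*0.50)) against a few of the
# measured points from the table above.

for score, measured in [(-0.88, 0.30456), (-0.80, 0.30260), (-0.73, 0.31250)]:
    predicted = 1.0 / (1.0 + 10.0 ** (-score * 0.50))
    print("%+.2f  predicted %.3f  measured %.3f" % (score, predicted, measured))

# Output is roughly: -0.88 -> 0.266, -0.80 -> 0.285, -0.73 -> 0.301,
# in the neighbourhood of the measured ~0.30 for this noisy a sample.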

By the way, I'm using bezier smoothing; otherwise the curve has the same basic shape but it's jagged. Presumably if I had tens of thousands of samples for each point it would be almost smooth. You will notice from the above data that not all win percentages increase with the score, but on average they do. Another way this could be done is to make each point the average of the point itself and N points on either side of it.
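
That alternative is just a moving average; a minimal sketch:

Code:

# Replace each point by the mean of itself and n points on either side.

def smooth(values, n=5):
    # values: win percentages indexed by score bucket.
    out = []
    for i in range(len(values)):
        lo = max(0, i - n)
        hi = min(len(values), i + n + 1)
        out.append(sum(values[lo:hi]) / (hi - lo))
    return out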

Don
Don
Posts: 5106
Joined: Tue Apr 29, 2008 4:27 pm

Re: Towards a standard analysis output format

Post by Don »

I repeated the experiment using 7 ply searches instead of 6 ply searches, and the curve is practically indistinguishable when I overlay them (plot them on the same canvas). I did not expect any differences, but I wanted to see if 150+ ELO made any difference, and the answer is no. Of course a 7 ply search plays stronger, but being a pawn up (against an equal opponent) appears to have the same meaning regardless of strength, at least to the extent I was able to measure.

At high depths, draws become much more likely, so it's possible that being up by 1.0 means something different. But Houdart reports that when Houdini is down by 5.0 it's not likely to win or draw any games, so I assume an advantage has the same meaning in any program (once standardized, that is).

I think the issue is that at very high levels you are not as likely to get a 1 pawn advantage or more, but if you do, it's just as lethal. You just make fewer errors.

Don wrote: OK, I ran off several games and plotted the results using gnuplot to see if the scoring data matches the function HG (or someone) presented in this thread.

Here is what I get:


[image: gnuplot of win percentage vs. score, with the fitted logistic curve in blue]

The blue line is the pure formula that seems to best fit the data; I basically adjusted it visually, trying different values until the two lines seemed to match. I paid more attention to the values between -2 and +2 pawns.

I only sampled values between -5 and +5 and I did not bucket them; each score is a data point. I only sampled up to the 40th move.

I think this shows pretty convincingly that the logistic function is a very good statistical predictor of the final result.

What could be done next is to sample only positions AFTER the first 40 moves and plot this over the top of the other lines to see if they overlay properly or have some distortion.

Don
bob wrote:
marcelk wrote:
bob wrote: But I can certainly produce a reasonable eval to % formula for Crafty, which I think I will try. And then give the user the option of displaying a normal eval, or a probability of winning eval.
That would be pretty nice. It sounds like the right way to me, from a higher point of view. Note that if the program has knowledge of the rating difference with its opponent, it can adjust its expectation accordingly. That is just the good old contempt factor, but expressed in the %-domain instead of the pawn-domain.
It is complicated by the rating issue. If we use the old rule that a +200 Elo advantage translates into roughly a 75% winning chance, one could correct for rating difference. I certainly know the ratings of both opponents (in Crafty) when playing on a server. When a user plays Crafty directly via xboard or whatever, there is an assumption about ratings that is probably way off (I think I assume they are equal, but am not sure).

However, I could probably test this, since in my testing the latest Stockfish is about +200 above Crafty. That corresponds to a 75% winning probability, which is roughly the winning percentage I see in my testing.

It will take a bit of thinking on how to produce this winning-percentage matrix in the most usable form, but I think I am going to do it just for fun, since one criticism of computer chess programs over the years has always been that evaluations are inconsistent between any two programs...