bob wrote: How sure are you of that? I watched a Houdini vs Crafty game about 2 weeks ago where Houdini was +5 and it ended in a dead draw. I'll try to find it, but it is buried in a few thousand games...

A single game can never disprove a statistical claim (unless he had claimed 100%). Btw, I know a position where Houdini is at +8 and loses.
Towards a standard analysis output format
-
- Posts: 28387
- Joined: Fri Mar 10, 2006 10:06 am
- Location: Amsterdam
- Full name: H G Muller
Re: Towards a standard analysis output format
bob wrote: I was thinking of taking my test set of starting positions, and simply removing a pawn in each, since I play without a book anyway.

This can be a bit tricky, because deleting a Pawn in a middle-game position can easily turn a tactically quiet situation into one that is immediately decided by hanging material. Deleting a Pawn from an opening position is reasonably safe; there are no pre-existing attacks, and there are two walls of Pawns between the pieces, so even when you punch a hole in one of those, the pieces remain well separated.
-
- Posts: 348
- Joined: Sat Feb 27, 2010 12:21 am
Re: Towards a standard analysis output format
bob wrote: But even more importantly, I might try to take the evaluation of Crafty, and "discretize" the numbers into blocks of say .25, and then count wins and losses to see how +0, +.25, +.5 and so forth compare to the final result...

I did that two years ago for my older program, using 14,660,393 evaluations taken from 534,097 games:
http://marcelk.net/2009-09-05/rookie-eval-vs-result/

There is a clear deviation (an offset shift) from the wiki formula, caused by the Elo difference between the players. It would be possible to remove this bias from the data, though.
My main point, going back to the original posting, is that this type of conversion, be it by wiki formula, calibrated formula or by lookup table, will give comparable results between programs in the same unit. A Houdini pawn is not the same as a Crafty pawn. Percentages mean the same thing.
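For what it's worth, here is a minimal Python sketch of that kind of tabulation (the data format and function names are assumptions for illustration, not anything from Rookie or Crafty): it buckets (eval, result) pairs into 0.25-pawn bins and prints the observed score next to a logistic prediction, with the scale constant k left as a parameter since it is engine-specific.

[code]
# Hypothetical sketch: tabulate game results per 0.25-pawn evaluation bucket
# and compare them with a logistic conversion 1 / (1 + 10^(-k * eval)),
# where k is the engine-specific scale discussed in this thread.
from collections import defaultdict

def expected_score(eval_pawns, k=1.0):
    return 1.0 / (1.0 + 10.0 ** (-k * eval_pawns))

def bucket_stats(samples, width=0.25):
    # samples: iterable of (eval_in_pawns, result) pairs, result in {0, 0.5, 1},
    # both taken from the point of view of the side the eval refers to.
    buckets = defaultdict(lambda: [0.0, 0])        # bucket centre -> [result sum, count]
    for ev, result in samples:
        centre = round(ev / width) * width
        buckets[centre][0] += result
        buckets[centre][1] += 1
    for centre in sorted(buckets):
        total, n = buckets[centre]
        print("%+5.2f: %5.1f%% observed  %5.1f%% predicted  (%d samples)"
              % (centre, 100.0 * total / n, 100.0 * expected_score(centre), n))
[/code]

Comparing the observed and predicted columns per bucket is what makes an offset like the one caused by the Elo difference visible.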
-
- Posts: 52
- Joined: Fri Jan 29, 2010 2:01 pm
- Location: Ivrea, Italy
Re: Towards a standard analysis output format
marcelk wrote: My main point, going back to the original posting, is that this type of conversion, be it by wiki formula, calibrated formula or by lookup table, will give comparable results between programs in the same unit. A Houdini pawn is not the same as a Crafty pawn. Percentages mean the same thing.

Well, I'm not sure I correctly understand you. For this to work you would need a _different_ formula for each engine. If you apply the very same conversion formula, starting from "a Houdini pawn that is not the same as a Crafty pawn", you will end up with a Houdini winning percentage that is not the same as a Crafty winning percentage.
Winning percentages may help human readability, but they do not help comparison between different engines.
- Giorgio
-
- Posts: 5106
- Joined: Tue Apr 29, 2008 4:27 pm
Re: Towards a standard analysis output format
marcelk wrote: My main point, going back to the original posting, is that this type of conversion, be it by wiki formula, calibrated formula or by lookup table, will give comparable results between programs in the same unit. A Houdini pawn is not the same as a Crafty pawn. Percentages mean the same thing.

Giorgio Medeot wrote: Well, I'm not sure I correctly understand you. For this to work you would need a _different_ formula for each engine. If you apply the very same conversion formula, starting from "a Houdini pawn that is not the same as a Crafty pawn", you will end up with a Houdini winning percentage that is not the same as a Crafty winning percentage.
Winning percentages may help human readability, but they do not help comparison between different engines.
- Giorgio

I doubt there is that much difference, but there will be some from program to program. Each program may scale a pawn a little differently, but (for instance) I doubt Houdini would be significantly different from a much weaker program (with similar scaling) in this regard. This function is not about your expectation of beating another program; it is more like your expectation of beating an equal opponent.
-
- Posts: 1471
- Joined: Tue Mar 16, 2010 12:00 am
Re: Towards a standard analysis output format
bob wrote: How sure are you of that? I watched a Houdini vs Crafty game about 2 weeks ago where Houdini was +5 and it ended in a dead draw. I'll try to find it, but it is buried in a few thousand games...

I'm not sure at all. I've tentatively injected a 0.75 coefficient into the formula given by Marcel to produce an approximation of Houdini's evaluation system.
For a +5 eval the 1 / (1 + 10**(-eval*0.75)) formula yields a 99.98% expectancy. In other words, at a +5 eval roughly one game in every 2,500 would end in a draw, which sounds reasonable for Houdini.
Robert
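As a quick numeric check of that formula (a throwaway Python sketch; the 0.75 coefficient is the tentative value from the post above, not an official Houdini constant):

[code]
def expectancy(eval_pawns, k=0.75):
    # logistic conversion: 1 / (1 + 10^(-k * eval))
    return 1.0 / (1.0 + 10.0 ** (-k * eval_pawns))

e = expectancy(5.0)
print("expectancy at +5: %.2f%%" % (100.0 * e))                      # 99.98%
# If the whole shortfall consists of draws (worth half a point each),
# the implied draw rate is twice the shortfall:
print("about one draw per %.0f games" % (1.0 / (2.0 * (1.0 - e))))   # roughly 2800
[/code]

That lands in the same ballpark as the one-in-2,500 figure quoted above.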
-
- Posts: 5106
- Joined: Tue Apr 29, 2008 4:27 pm
Re: Towards a standard analysis output format
bob wrote: How sure are you of that? I watched a Houdini vs Crafty game about 2 weeks ago where Houdini was +5 and it ended in a dead draw. I'll try to find it, but it is buried in a few thousand games...

Houdini wrote: I'm not sure at all. I've tentatively injected a 0.75 coefficient into the formula given by Marcel to produce an approximation of Houdini's evaluation system.
For a +5 eval the 1 / (1 + 10**(-eval*0.75)) formula yields a 99.98% expectancy. In other words, at a +5 eval roughly one game in every 2,500 would end in a draw, which sounds reasonable for Houdini.
Robert

Does this assume the use of endgame tables?

I think the stats have little meaning beyond 3 or 4 pawns up; at least, it is very difficult to measure, and it has more to do with how well your program identifies which exact endings are not wins. If the program is not good with opposite-colored bishops and all the exceptions in other simple endings, that one-draw-in-2,500-games figure might be much lower, or it could be a lot higher if the program is exceptionally good about those.
-
- Posts: 348
- Joined: Sat Feb 27, 2010 12:21 am
Re: Towards a standard analysis output format
Giorgio Medeot wrote: Well, I'm not sure I correctly understand you. For this to work you would need a _different_ formula for each engine.

Yes, of course.
-
- Posts: 20943
- Joined: Mon Feb 27, 2006 7:30 pm
- Location: Birmingham, AL
Re: Towards a standard analysis output format
bob wrote: I wonder if I ought to actually do this formally as an experiment? I could certainly take Crafty and play it against the gauntlet, normally, then at -1.00, and then -2.00, to see just what effect removing one or two pawns has in terms of Elo...

hgm wrote: I think such 'direct' piece-value measurements are quite interesting. I played many tens of thousands of such material-imbalance games, not only with Pawn odds, but also deleting Bishop vs Knight, Bishop vs Knight + Pawn, Queen vs 2 Bishops + 1 Knight, etc. (mostly on 10x8 boards). In principle you could measure the score of any advantage that way. E.g. if you want to know how much castling rights are worth, play one side without castling rights, and see by how much it loses in comparison to the Pawn-odds score.
To make it work you need an engine that randomizes well (as there are no books for such imbalanced positions), or shuffle the initial positions (e.g. in Chess960 fashion). And deleting multiple Pawns sometimes gave inconsistent results (the second Pawn having a negative effective value), presumably because you give that side a tremendous advantage in development, which can be very dangerous in 10x8 Chess.

bob wrote: I was thinking of taking my test set of starting positions, and simply removing a pawn in each, since I play without a book anyway.
But even more importantly, I might try to take the evaluation of Crafty, and "discretize" the numbers into blocks of say .25, and then count wins and losses to see how +0, +.25, +.5 and so forth compare to the final result...
The only issue I see is that a +.5 in the opening might be meaningless, while in an endgame it is possibly winning.

Don wrote: In the opening a score close to zero is an accurate prediction. A score of 0.5 is a very good score in Komodo even right out of the opening. 0.5 is not a win - it's just a very good score. I don't believe a score of 0.5 is better in the ending than in the opening. I think it means you have good chances, but nothing like a very certain win.
But if you are going to do the study, you can test this too. Don't just keep discrete buckets by score, but also keep the phase of the game (perhaps by summing all the non-pawn material using the classical values, for instance).
I might do this study too, it sounds like fun.
Don

In my case, I am pretty certain that EG scores are more accurate, because they are all about pawns and pawn structure. In the MG you have king safety, center control, mobility, etc., which can be real or fleeting advantages. But in the endgame, weak pawns are _really_ weak and strong pawns are _really_ strong, to the point that a +.5 is a serious advantage and a -.5 is a serious problem...

That's based on observations of who knows how many online games, though, not on any sort of scientific study or analysis.
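One way to attach the game phase suggested in the quoted post to each sample is sketched below; the FEN-based helper and the phase cut-offs are assumptions for illustration, not anything taken from Crafty or Komodo.

[code]
# Hypothetical helper: approximate the game phase by summing the non-pawn,
# non-king material of both sides with the classical values, as suggested above.
CLASSICAL = {'N': 3, 'B': 3, 'R': 5, 'Q': 9}

def nonpawn_material(fen):
    board = fen.split()[0]                       # board field of a FEN string
    return sum(CLASSICAL.get(c.upper(), 0) for c in board if c.isalpha())

def phase_label(fen):
    m = nonpawn_material(fen)                    # 62 in the initial position
    if m >= 44:
        return "opening"
    if m >= 20:
        return "middlegame"
    return "endgame"

# A result tally could then be keyed on (phase_label(fen), eval_bucket)
# instead of on the eval bucket alone.
[/code]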
-
- Posts: 20943
- Joined: Mon Feb 27, 2006 7:30 pm
- Location: Birmingham, AL
Re: Towards a standard analysis output format
bob wrote: But even more importantly, I might try to take the evaluation of Crafty, and "discretize" the numbers into blocks of say .25, and then count wins and losses to see how +0, +.25, +.5 and so forth compare to the final result...

marcelk wrote: I did that two years ago for my older program, using 14,660,393 evaluations taken from 534,097 games:
http://marcelk.net/2009-09-05/rookie-eval-vs-result/
There is a clear deviation (an offset shift) from the wiki formula, caused by the Elo difference between the players. It would be possible to remove this bias from the data, though.
My main point, going back to the original posting, is that this type of conversion, be it by wiki formula, calibrated formula or by lookup table, will give comparable results between programs in the same unit. A Houdini pawn is not the same as a Crafty pawn. Percentages mean the same thing.

However, a pawn is one thing, but +1.00 is something else entirely. It might be a pawn advantage, or it might be a piece advantage where the king is horribly exposed. A speculative eval is probably going to produce a different "conversion factor" (to % win probability) than a more cautious evaluation...

But I can certainly produce a reasonable eval-to-% formula for Crafty, which I think I will try, and then give the user the option of displaying a normal eval or a probability-of-winning eval.
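A sketch of how such a per-engine calibration might look (a plain grid search over the scale constant, nothing Crafty-specific; the data source, function names, and search range are assumptions):

[code]
# Fit the engine-specific scale k in  score = 1 / (1 + 10^(-k * eval))
# from (eval_in_pawns, result) pairs collected in test games.
def fit_scale(samples, k_values=None):
    data = list(samples)                                 # result in {0, 0.5, 1}
    if k_values is None:
        k_values = [i / 100.0 for i in range(25, 201)]   # try k = 0.25 .. 2.00
    def sq_error(k):
        return sum((1.0 / (1.0 + 10.0 ** (-k * ev)) - res) ** 2
                   for ev, res in data)
    return min(k_values, key=sq_error)

def win_probability(eval_pawns, k):
    return 1.0 / (1.0 + 10.0 ** (-k * eval_pawns))
[/code]

With a k fitted per engine, the same numeric +1.00 from a speculative evaluator and from a cautious one maps to different percentages, which is the point of calibrating each program separately.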