bob wrote: How sure are you of that? I watched a Houdini vs Crafty game about 2 weeks ago where Houdini was +5 and it ended in a dead draw. I'll try to find it, but it is buried in a few thousand games...

A single game can never disprove a statistical claim (unless he had claimed 100%). Btw, I know a position where Houdini is at +8 and loses.
Towards a standard analysis output format
-
- Posts: 28387
- Joined: Fri Mar 10, 2006 10:06 am
- Location: Amsterdam
- Full name: H G Muller
Re: Towards a standard analysis output format
bob wrote: I was thinking of taking my test set of starting positions, and simply removing a pawn in each, since I play without a book anyway.

This can be a bit tricky, because deleting a Pawn in a middle-game position can easily turn a tactically quiet situation into one that is immediately decided by hanging material. Deleting a Pawn from an opening position is reasonably safe; there are no pre-existing attacks, and there are two walls of Pawns between the pieces, so even when you punch a hole in one of those, the pieces remain well separated.
-
- Posts: 348
- Joined: Sat Feb 27, 2010 12:21 am
Re: Towards a standard analysis output format
bob wrote: But even more importantly, I might try to take the evaluation of Crafty, and "discretize" the numbers into blocks of say .25, and then count wins and losses to see how +0, +.25, +.5 and so forth compare to the final result...

I did that two years ago for my older program, using 14,660,393 evaluations taken from 534,097 games:
http://marcelk.net/2009-09-05/rookie-eval-vs-result/

There is a clear deviation (an offset shift) from the wiki formula, caused by the Elo difference between the players. It would be possible to remove this bias from the data, though.
My main point, going back to the original posting, is that this type of conversion, be it by wiki formula, calibrated formula or by lookup table, will give comparable results between programs in the same unit. A Houdini pawn is not the same as a Crafty pawn. Percentages mean the same thing.
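For what it's worth, here is a minimal Python sketch of that kind of tabulation (the data format and function names are assumptions for illustration, not anything from Rookie or Crafty): it buckets (eval, result) pairs into 0.25-pawn bins and prints the observed score next to a logistic prediction, with the scale constant k left as a parameter since it is engine-specific.

[code]
# Hypothetical sketch: tabulate game results per 0.25-pawn evaluation bucket
# and compare them with a logistic conversion 1 / (1 + 10^(-k * eval)),
# where k is the engine-specific scale discussed in this thread.
from collections import defaultdict

def expected_score(eval_pawns, k=1.0):
    return 1.0 / (1.0 + 10.0 ** (-k * eval_pawns))

def bucket_stats(samples, width=0.25):
    # samples: iterable of (eval_in_pawns, result) pairs, result in {0, 0.5, 1},
    # both taken from the point of view of the side the eval refers to.
    buckets = defaultdict(lambda: [0.0, 0])        # bucket centre -> [result sum, count]
    for ev, result in samples:
        centre = round(ev / width) * width
        buckets[centre][0] += result
        buckets[centre][1] += 1
    for centre in sorted(buckets):
        total, n = buckets[centre]
        print("%+5.2f: %5.1f%% observed  %5.1f%% predicted  (%d samples)"
              % (centre, 100.0 * total / n, 100.0 * expected_score(centre), n))
[/code]

Comparing the observed and predicted columns per bucket is what makes an offset like the one caused by the Elo difference visible.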
-
- Posts: 52
- Joined: Fri Jan 29, 2010 2:01 pm
- Location: Ivrea, Italy
Re: Towards a standard analysis output format
marcelk wrote: My main point, going back to the original posting, is that this type of conversion, be it by wiki formula, calibrated formula or by lookup table, will give comparable results between programs in the same unit. A Houdini pawn is not the same as a Crafty pawn. Percentages mean the same thing.

Well, I'm not sure I correctly understand you. For this to work you would need a _different_ formula for each engine. If you apply the very same conversion formula, starting from "a Houdini pawn that is not the same as a Crafty pawn", you will end up with a Houdini winning percentage that is not the same as a Crafty winning percentage.
Winning percentages may help human readability, but they do not help comparison between different engines.
- Giorgio
-
- Posts: 5106
- Joined: Tue Apr 29, 2008 4:27 pm
Re: Towards a standard analysis output format
marcelk wrote: My main point, going back to the original posting, is that this type of conversion, be it by wiki formula, calibrated formula or by lookup table, will give comparable results between programs in the same unit. A Houdini pawn is not the same as a Crafty pawn. Percentages mean the same thing.

Giorgio Medeot wrote: Well, I'm not sure I correctly understand you. For this to work you would need a _different_ formula for each engine. If you apply the very same conversion formula, starting from "a Houdini pawn that is not the same as a Crafty pawn", you will end up with a Houdini winning percentage that is not the same as a Crafty winning percentage.
Winning percentages may help human readability, but they do not help comparison between different engines.
- Giorgio

I doubt there is that much difference, but there will be some from program to program. Each program may scale a pawn a little differently, but (for instance) I doubt Houdini would be significantly different from a much weaker program (with similar scaling) in this regard. This function is not about your expectation of beating another program; it is more like your expectation of beating an equal opponent.
-
- Posts: 1471
- Joined: Tue Mar 16, 2010 12:00 am
Re: Towards a standard analysis output format
bob wrote: How sure are you of that? I watched a Houdini vs Crafty game about 2 weeks ago where Houdini was +5 and it ended in a dead draw. I'll try to find it, but it is buried in a few thousand games...

I'm not sure at all. I've tentatively injected a 0.75 coefficient into the formula given by Marcel to produce an approximation of Houdini's evaluation system.
For a +5 eval the 1 / (1 + 10**(-eval*0.75)) formula yields a 99.98% expectancy. In other words, at a +5 eval roughly one game in every 2,500 would end in a draw, which sounds reasonable for Houdini.
Robert
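As a quick numeric check of that formula (a throwaway Python sketch; the 0.75 coefficient is the tentative value from the post above, not an official Houdini constant):

[code]
def expectancy(eval_pawns, k=0.75):
    # logistic conversion: 1 / (1 + 10^(-k * eval))
    return 1.0 / (1.0 + 10.0 ** (-k * eval_pawns))

e = expectancy(5.0)
print("expectancy at +5: %.2f%%" % (100.0 * e))                      # 99.98%
# If the whole shortfall consists of draws (worth half a point each),
# the implied draw rate is twice the shortfall:
print("about one draw per %.0f games" % (1.0 / (2.0 * (1.0 - e))))   # roughly 2800
[/code]

That lands in the same ballpark as the one-in-2,500 figure quoted above.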
-
- Posts: 5106
- Joined: Tue Apr 29, 2008 4:27 pm
Re: Towards a standard analysis output format
bob wrote: How sure are you of that? I watched a Houdini vs Crafty game about 2 weeks ago where Houdini was +5 and it ended in a dead draw. I'll try to find it, but it is buried in a few thousand games...

Houdini wrote: I'm not sure at all. I've tentatively injected a 0.75 coefficient into the formula given by Marcel to produce an approximation of Houdini's evaluation system.
For a +5 eval the 1 / (1 + 10**(-eval*0.75)) formula yields a 99.98% expectancy. In other words, at a +5 eval roughly one game in every 2,500 would end in a draw, which sounds reasonable for Houdini.
Robert

Does this assume the use of endgame tables?

I think the stats have little meaning beyond 3 or 4 pawns up; at least, it is very difficult to measure, and it has more to do with how well your program identifies which exact endings are not wins. If the program is not good with opposite-colored bishops and all the exceptions in other simple endings, that one-draw-in-2,500-games figure might be much lower, or it could be a lot higher if the program is exceptionally good about those.
-
- Posts: 348
- Joined: Sat Feb 27, 2010 12:21 am
Re: Towards a standard analysis output format
Giorgio Medeot wrote: Well, I'm not sure I correctly understand you. For this to work you would need a _different_ formula for each engine.

Yes, of course.
-
- Posts: 20943
- Joined: Mon Feb 27, 2006 7:30 pm
- Location: Birmingham, AL
Re: Towards a standard analysis output format
bob wrote: I wonder if I ought to actually do this formally as an experiment? I could certainly take Crafty and play it against the gauntlet, normally, then at -1.00, and then -2.00, to see just what effect removing one or two pawns has in terms of Elo...

hgm wrote: I think such 'direct' piece-value measurements are quite interesting. I played many tens of thousands of such material-imbalance games, not only with Pawn odds, but also deleting Bishop vs Knight, Bishop vs Knight + Pawn, Queen vs 2 Bishops + 1 Knight, etc. (mostly on 10x8 boards). In principle you could measure the score of any advantage that way. E.g. if you want to know how much castling rights are worth, play one side without castling rights, and see by how much it loses in comparison to the Pawn-odds score.
To make it work you need an engine that randomizes well (as there are no books for such imbalanced positions), or shuffle the initial positions (e.g. in Chess960 fashion). And deleting multiple Pawns sometimes gave inconsistent results (the second Pawn having a negative effective value), presumably because you give that side a tremendous advantage in development, which can be very dangerous in 10x8 Chess.

bob wrote: I was thinking of taking my test set of starting positions, and simply removing a pawn in each, since I play without a book anyway.
But even more importantly, I might try to take the evaluation of Crafty, and "discretize" the numbers into blocks of say .25, and then count wins and losses to see how +0, +.25, +.5 and so forth compare to the final result...
The only issue I see is that a +.5 in the opening might be meaningless, while in an endgame it is possibly winning.

Don wrote: In the opening a score close to zero is an accurate prediction. A score of 0.5 is a very good score in Komodo even right out of the opening. 0.5 is not a win - it's just a very good score. I don't believe a score of 0.5 is better in the ending than in the opening. I think it means you have good chances, but nothing like a very certain win.
But if you are going to do the study, you can test this too. Don't just keep discrete buckets by score, but also keep the phase of the game (perhaps by summing all the non-pawn material using the classical values, for instance).
I might do this study too, it sounds like fun.
Don

In my case, I am pretty certain that EG scores are more accurate, because they are all about pawns and pawn structure. In the MG you have king safety, center control, mobility, etc., which can be real or fleeting advantages. But in the endgame, weak pawns are _really_ weak and strong pawns are _really_ strong, to the point that a +.5 is a serious advantage and a -.5 is a serious problem...

That's based on observations of who knows how many online games, though, not on any sort of scientific study or analysis.
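One way to attach the game phase suggested in the quoted post to each sample is sketched below; the FEN-based helper and the phase cut-offs are assumptions for illustration, not anything taken from Crafty or Komodo.

[code]
# Hypothetical helper: approximate the game phase by summing the non-pawn,
# non-king material of both sides with the classical values, as suggested above.
CLASSICAL = {'N': 3, 'B': 3, 'R': 5, 'Q': 9}

def nonpawn_material(fen):
    board = fen.split()[0]                       # board field of a FEN string
    return sum(CLASSICAL.get(c.upper(), 0) for c in board if c.isalpha())

def phase_label(fen):
    m = nonpawn_material(fen)                    # 62 in the initial position
    if m >= 44:
        return "opening"
    if m >= 20:
        return "middlegame"
    return "endgame"

# A result tally could then be keyed on (phase_label(fen), eval_bucket)
# instead of on the eval bucket alone.
[/code]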
-
- Posts: 20943
- Joined: Mon Feb 27, 2006 7:30 pm
- Location: Birmingham, AL
Re: Towards a standard analysis output format
bob wrote: But even more importantly, I might try to take the evaluation of Crafty, and "discretize" the numbers into blocks of say .25, and then count wins and losses to see how +0, +.25, +.5 and so forth compare to the final result...

marcelk wrote: I did that two years ago for my older program, using 14,660,393 evaluations taken from 534,097 games:
http://marcelk.net/2009-09-05/rookie-eval-vs-result/
There is a clear deviation (an offset shift) from the wiki formula, caused by the Elo difference between the players. It would be possible to remove this bias from the data, though.
My main point, going back to the original posting, is that this type of conversion, be it by wiki formula, calibrated formula or by lookup table, will give comparable results between programs in the same unit. A Houdini pawn is not the same as a Crafty pawn. Percentages mean the same thing.

However, a pawn is one thing, but +1.00 is something else entirely. It might be a pawn advantage, or it might be a piece advantage where the king is horribly exposed. A speculative eval is probably going to produce a different "conversion factor" (to % win probability) than a more cautious evaluation...

But I can certainly produce a reasonable eval-to-% formula for Crafty, which I think I will try, and then give the user the option of displaying a normal eval or a probability-of-winning eval.
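A sketch of how such a per-engine calibration might look (a plain grid search over the scale constant, nothing Crafty-specific; the data source, function names, and search range are assumptions):

[code]
# Fit the engine-specific scale k in  score = 1 / (1 + 10^(-k * eval))
# from (eval_in_pawns, result) pairs collected in test games.
def fit_scale(samples, k_values=None):
    data = list(samples)                                 # result in {0, 0.5, 1}
    if k_values is None:
        k_values = [i / 100.0 for i in range(25, 201)]   # try k = 0.25 .. 2.00
    def sq_error(k):
        return sum((1.0 / (1.0 + 10.0 ** (-k * ev)) - res) ** 2
                   for ev, res in data)
    return min(k_values, key=sq_error)

def win_probability(eval_pawns, k):
    return 1.0 / (1.0 + 10.0 ** (-k * eval_pawns))
[/code]

With a k fitted per engine, the same numeric +1.00 from a speculative evaluator and from a cautious one maps to different percentages, which is the point of calibrating each program separately.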