Crafty and Stockfish question

Discussion of chess software programming and technical issues.

Moderator: Ras

Michael Sherwin
Posts: 3196
Joined: Fri May 26, 2006 3:00 am
Location: WY, USA
Full name: Michael Sherwin

Re: Crafty and Stockfish question

Post by Michael Sherwin »

lkaufman wrote:I don't disagree that White is better after 1e4 c5, but to evaluate this position, especially after a 25 ply search or so, as +.60 or +.75 is ridiculous. White's edge should be something like a quarter pawn or so, without requiring really deep analysis to prove this.
Not much is decided in most Sicilians by move 13 (26 ply).

[d]r1b1k2r/2qn1ppp/p2pp3/1p3Pb1/3NP3/2N2Q2/PPP4P/2KR1B1R w kq - 0 14

This position is exactly 26 plies in, and it was unclear to humans for many years. Our current human evaluation of it rests on dozens if not hundreds of games played and studied by humans over those years. A computer given this position to compute on might well arrive at a correct evaluation of it, but how can a computer be expected to evaluate this position correctly from plies 2 and 3 and decide to head there? It can't be; that is ridiculous.

Further, this position is most likely never going to be the principal variation, and will never even be seen in a 26 ply search in the first place, due to all the reductions in a 26 ply search. As a matter of fact, most of the branches of the search tree are not going to be searched to anywhere near 26 ply, because there is only a very 'thin' path of what is considered best in an alpha-beta search. Even equal positions cause cutoffs that keep a large portion of the tree from ever being looked at. Are equal positions really equal? No, they are not: either the eval lacks the resolution to differentiate them, or even the best of evals is not perfect. The search and eval therefore approach the truth but leave room for plenty of changes of the principal variation, even in self play, as a game is played or while a single move is being computed.

Bottom line: in non-forced lines a computer does not know where it is going and only tries to maximize its potential. It has no idea that the initial position of the Sicilian is really only +.25 unless it is tuned that way. And if it is tuned that way, it might not be tuned correctly, which could lead to misevaluating positions further along in the game when the score really should be +.25. One more caveat: it does not matter what the initial evaluation of the Sicilian is, as that value will be built into all branches of the tree.
So creating a table of initial values keyed to the ECO codes, and having an engine normalize its evaluation against that table, may bring the scores returned into a perspective suited to human sensibilities.
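That ECO-table idea could be sketched roughly as follows. Everything here is hypothetical and illustrative — the `ECO_BOOK_EVAL` table, its values, and `normalized_score` are invented names and numbers, not any engine's actual interface:

```python
# Hypothetical sketch of the ECO-normalization idea: shift the engine's
# raw score by the difference between its own evaluation of the opening
# and a human "book" value for that ECO code.

# Human consensus evaluations (in pawns) keyed by ECO code -- example values.
ECO_BOOK_EVAL = {
    "B20": 0.25,   # Sicilian Defence
    "C60": 0.20,   # Ruy Lopez
}

def normalized_score(raw_score, engine_opening_eval, eco_code):
    """Shift a raw engine score so the opening itself reads as the human
    book value; every later score in the game inherits the same offset."""
    book = ECO_BOOK_EVAL.get(eco_code)
    if book is None:
        return raw_score                  # unknown opening: leave unchanged
    return raw_score - (engine_opening_eval - book)

# If the engine calls the Sicilian +0.60 and the book says +0.25,
# a later +0.75 is reported as +0.40.
print(round(normalized_score(0.75, 0.60, "B20"), 2))   # 0.4
```

Note this only rescales what is shown to the human; as pointed out later in the thread, a constant offset applied to every branch cannot change which move the engine picks.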
If you are on a sidewalk and the covid goes beep beep
Just step aside or you might have a bit of heat
Covid covid runs through the town all day
Can the people ever change their ways
Sherwin the covid's after you
Sherwin if it catches you you're through
lkaufman
Posts: 6259
Joined: Sun Jan 10, 2010 6:15 am
Location: Maryland USA
Full name: Larry Kaufman

Re: Crafty and Stockfish question

Post by lkaufman »

Here is how I evaluate the Sicilian. I value a tempo in the opening at about 0.4 pawns. White is up half a tempo at the start, so about 0.2 pawns. Now if Black wastes his first move, White should be better by 0.6. But it is obvious that ...c5 is very far from a wasted move. It may not be the best move, but it is clearly much closer to being the best move than to being a wasted move. So the eval should be much closer to 0.2 than to 0.6, so maybe 0.25 or perhaps 0.30.

A line like the one you cite may be relevant to the question of what is best play after 1e4 c5. But there are dozens of ways to play the Sicilian, and surely some are within 10 centipawns or so of whatever is the best one (probably the Najdorf). If all I want to do is to prove that the Sicilian is not worse than 0.4 minus, for example, I just need to pick one that is not obviously bad, quite an easy task.
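The tempo arithmetic in the post above can be written out directly. The 0.25 interpolation weight in the last line is my own illustrative assumption about where "much closer to 0.2 than 0.6" lands; the other numbers are from the post:

```python
# Values from the post: a full opening tempo is worth about 0.4 pawns,
# and White starts half a tempo up.
TEMPO = 0.4

white_start_edge = 0.5 * TEMPO                 # 0.2 pawns
wasted_black_move = white_start_edge + TEMPO   # 0.6: Black burns a full tempo

# 1...c5 is much closer to best than to wasted; the 0.25 weight here is
# an assumed interpolation, giving an eval near the post's 0.25-0.30.
sicilian_estimate = white_start_edge + 0.25 * TEMPO

print(round(white_start_edge, 2), round(wasted_black_move, 2),
      round(sicilian_estimate, 2))   # 0.2 0.6 0.3
```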
bob
Posts: 20943
Joined: Mon Feb 27, 2006 7:30 pm
Location: Birmingham, AL

Re: Crafty and Stockfish question

Post by bob »

lkaufman wrote:
bob wrote:
For crafty, it is pretty easy to grasp the score. For the position from the second post in this thread, you can discover via the "score" command that some of this is from development (knights on the edge, unconnected rooks, uncastled, etc.). Not much of that comes from mobility in our case; it is mainly the special-case "uncastled development scoring"...

Remember, it is not the score that counts, it is the move. I suppose everyone could just add in a -50 constant to their scores and make them appear more conservative, but it would not change the move at all...
Changing the scores by a constant would solve nothing, because they are interpreted relative to material and to static factors. The issue is about the relative weighting of static vs. dynamic factors (leaving out king safety as it has elements of both). Perhaps I am mistaken about Crafty overweighting dynamics; I have spent far more time with Stockfish which displays similar behavior in the opening. For me (and surely many others) what I want most from an engine is to get an accurate evaluation of an opening line (which may extend all the way to the endgame!). I put the scores in an IDeA tree using Aquarium and research openings this way. If the evals systematically overrate positions where White has more mobility, it will be "recommending" the wrong lines. So for me, a correct eval of the end node is more important than the rating of an engine.
I spent literally months of time on "centralizing" the evaluation, and outside of the development issues, most scores are pretty well centered around zero. This may have changed during testing, since anything is possible there, but we always wanted "equal" positions to be somewhere near zero. Development is different and trickier: if you score it completely symmetrically, then developing one of your own pieces and preventing your opponent from developing one count the same. This can backfire, given the importance of tempo.

There are certainly issues between "real" and "imagined" positional advantages that we do not handle very well (nor does any other program I have seen so far.)
bob
Posts: 20943
Joined: Mon Feb 27, 2006 7:30 pm
Location: Birmingham, AL

Re: Crafty and Stockfish question

Post by bob »

lkaufman wrote:My "proof" that my human evaluation (and that of all GMs) is the right one is that even in engine databases, the openings score more or less similarly to the results in human play (in most cases). Thus you would never find 75% scores for White in any database of Sicilian games, which might be expected from the huge evals in Crafty and Stockfish. I also try randomized playouts with Rybka from major openings and usually get similar results to human databases, i.e. scores around 55% for White.

As for earlier starting positions, what do you think about randomized moves for the first N ply, filtering out all positions unbalanced by more than X?
Aha, now we get to the bottom of the "semantic" war here. :)

What you are doing, and incorrectly, is to assume that the scores match up proportionally to winning or losing. But if you think about the score as something different entirely, it becomes less muddy. A program thinks in terms of "best move" and has to assign some numeric value so that minimax works. But nothing says that +1.5 means "winning advantage". To understand, what if I multiplied all scores by 100, including material. Now you would see truly huge scores with no greater probability of winning than before. The scores help the program choose between moves. I'd like to have scores between -1.0 and +1.0, where -1.0 is absolutely lost and +1.0 is absolutely won. But that would not necessarily help the thing play one bit better. Would be a more useful number for humans to see, I agree. But that would be all.

Comparing scores between engines is very much like comparing depths, or branching factors, etc. And in the end, the only thing that really matters is who wins and who loses...
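Bob's multiply-by-100 point can be demonstrated on a toy tree: applying any positive scaling to the leaf evaluations changes the numbers but never the move negamax picks, because the scaling is monotone. The tree and its values below are invented for illustration:

```python
# Toy demonstration: multiply every evaluation by a positive constant and
# negamax still chooses the same root move.

def negamax(node, evaluate):
    """node is a leaf value or a list of child subtrees."""
    if not isinstance(node, list):                 # leaf: static eval
        return evaluate(node)
    return max(-negamax(child, evaluate) for child in node)

def best_move(children, evaluate):
    """Index of the root move with the best negamax score."""
    scores = [-negamax(child, evaluate) for child in children]
    return max(range(len(scores)), key=scores.__getitem__)

# Three root moves, two replies each; leaves are raw evals.
tree = [[0.30, 0.75], [0.10, 0.20], [0.50, 0.90]]

plain = lambda leaf: leaf
scaled = lambda leaf: 100 * leaf    # "multiply all scores by 100"

print(best_move(tree, plain), best_move(tree, scaled))   # 2 2
```

The scores the two searches report differ by a factor of 100, yet the chosen move is identical — which is exactly why comparing raw scores between engines says little.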
Michael Sherwin
Posts: 3196
Joined: Fri May 26, 2006 3:00 am
Location: WY, USA
Full name: Michael Sherwin

Re: Crafty and Stockfish question

Post by Michael Sherwin »

lkaufman wrote:Here is how I evaluate the Sicilian. I value a tempo in the opening at about 0.4 pawns. White is up half a tempo at the start, so about 0.2 pawns. Now if Black wastes his first move, White should be better by 0.6. But it is obvious that ...c5 is very far from a wasted move. It may not be the best move, but it is clearly much closer to being the best move than to being a wasted move. So the eval should be much closer to 0.2 than to 0.6, so maybe 0.25 or perhaps 0.30.

A line like the one you cite may be relevant to the question of what is best play after 1e4 c5. But there are dozens of ways to play the Sicilian, and surely some are within 10 centipawns or so of whatever is the best one (probably the Najdorf). If all I want to do is to prove that the Sicilian is not worse than 0.4 minus, for example, I just need to pick one that is not obviously bad, quite an easy task.
A computer only 'exactly' evaluates, through search, the 'best' move. Even if the position it is searching at the root is the initial Sicilian position, it merely tries to maximize the next move. It does not understand the lines that have been developed over a couple of centuries. It has no way to adjust its evaluation to +.25 or +.30 and to determine that c7c5 is not a wasted move. The principal variation will not be best play by human standards and will contain more non-optimal moves for Black than for White. Therefore, the Sicilian does not get evaluated correctly.
lkaufman
Posts: 6259
Joined: Sun Jan 10, 2010 6:15 am
Location: Maryland USA
Full name: Larry Kaufman

Re: Crafty and Stockfish question

Post by lkaufman »

bob wrote:
lkaufman wrote:My "proof" that my human evaluation (and that of all GMs) is the right one is that even in engine databases, the openings score more or less similarly to the results in human play (in most cases). Thus you would never find 75% scores for White in any database of Sicilian games, which might be expected from the huge evals in Crafty and Stockfish. I also try randomized playouts with Rybka from major openings and usually get similar results to human databases, i.e. scores around 55% for White.

As for earlier starting positions, what do you think about randomized moves for the first N ply, filtering out all positions unbalanced by more than X?
Aha, now we get to the bottom of the "semantic" war here. :)

What you are doing, and incorrectly, is to assume that the scores match up proportionally to winning or losing. But if you think about the score as something different entirely, it becomes less muddy. A program thinks in terms of "best move" and has to assign some numeric value so that minimax works. But nothing says that +1.5 means "winning advantage". To understand, what if I multiplied all scores by 100, including material. Now you would see truly huge scores with no greater probability of winning than before. The scores help the program choose between moves. I'd like to have scores between -1.0 and +1.0, where -1.0 is absolutely lost and +1.0 is absolutely won. But that would not necessarily help the thing play one bit better. Would be a more useful number for humans to see, I agree. But that would be all.

Comparing scores between engines is very much like comparing depths, or branching factors, etc. And in the end, the only thing that really matters is who wins and who loses...

I think we are still miscommunicating. My point is that a clean pawn up translates to a certain winning percentage, maybe 75% or so (it varies a bit with time limit and engine). So if you report a score of 0.75, this implies a winning percentage that can be calculated from the Elo tables based on the winning percentage for plus one pawn. Multiplying the scores by a constant would change nothing. So if you report a score of +.75 for the position after 1e4 c5, something is very wrong if you also report a score of +1.00 for an extra pawn in the opening with no compensation, as I assume is roughly true.
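The pawn-to-winning-percentage relationship Kaufman appeals to is commonly modeled with a logistic curve. Here is one possible calibration — the constant is an assumption, chosen so that +1.00 pawn maps to exactly the 75% figure from the post:

```python
import math

# Logistic mapping from an evaluation in pawns to an expected game score.
# K is an assumed calibration: +1.00 pawn -> exactly 75%.
K = math.log10(3.0)

def expected_score(pawns):
    """Evaluation in pawns -> expected score in [0, 1] for the side ahead."""
    return 1.0 / (1.0 + 10.0 ** (-K * pawns))

print(round(expected_score(1.00), 3))   # 0.75 by construction
print(round(expected_score(0.75), 3))   # about 0.70: what a +.75 eval implies
print(round(expected_score(0.20), 3))   # about 0.55: the human database result
```

Under this model a reported +.75 for the opening Sicilian would imply White scoring near 70%, far above the roughly 55% seen in practice — which is the inconsistency the post is pointing at.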
lkaufman
Posts: 6259
Joined: Sun Jan 10, 2010 6:15 am
Location: Maryland USA
Full name: Larry Kaufman

Re: Crafty and Stockfish question

Post by lkaufman »

Michael Sherwin wrote:

A computer only 'exactly' evaluates, through search, the 'best' move. Even if the position it is searching at the root is the initial Sicilian position, it merely tries to maximize the next move. It does not understand the lines that have been developed over a couple of centuries. It has no way to adjust its evaluation to +.25 or +.30 and to determine that c7c5 is not a wasted move. The principal variation will not be best play by human standards and will contain more non-optimal moves for Black than for White. Therefore, the Sicilian does not get evaluated correctly.

I know that the computer does not evaluate lines the way I do as a human. My point is that a good eval should produce scores in the ballpark of 0.25 or 0.30 for most positions that would be reached by reasonable moves, so that minimax should return a score in that ballpark. It fails to do so in the engines in question because it overvalues certain transient elements of the positions reached. If only static factors were evaluated, I think the score returned would be near zero.
Michael Sherwin
Posts: 3196
Joined: Fri May 26, 2006 3:00 am
Location: WY, USA
Full name: Michael Sherwin

Re: Crafty and Stockfish question

Post by Michael Sherwin »

lkaufman wrote:
Michael Sherwin wrote:
A computer only 'exactly' evaluates, through search, the 'best' move. Even if the position it is searching at the root is the initial Sicilian position, it merely tries to maximize the next move. It does not understand the lines that have been developed over a couple of centuries. It has no way to adjust its evaluation to +.25 or +.30 and to determine that c7c5 is not a wasted move. The principal variation will not be best play by human standards and will contain more non-optimal moves for Black than for White. Therefore, the Sicilian does not get evaluated correctly.
I know that the computer does not evaluate lines the way I do as a human. My point is that a good eval should produce scores in the ballpark of 0.25 or 0.30 for most positions that would be reached by reasonable moves, so that minimax should return a score in that ballpark. It fails to do so in the engines in question because it overvalues certain transient elements of the positions reached. If only static factors were evaluated, I think the score returned would be near zero.
The score comes from the end of the principal variation. So, for the engines in question, set up the final position and see what evaluation it deserves. If the engine's eval is wrong for that position, then it could be just a scaling problem or a bad eval. If it is correct, then the computer simply does not comprehend the initial position very well. And if the eval is then tuned to understand the Sicilian well, we might very well end up having this same conversation about double pawn openings.
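The first sentence above can be seen on a toy minimax: in a plain search the root score is exactly the static evaluation of the position at the end of the principal variation (with signs cancelling over an even number of plies), so that leaf position is the one whose eval to question. The two-ply tree below is invented for illustration:

```python
# Toy negamax that returns both the root score and the principal
# variation, showing the root score is just the PV-leaf static eval.

def negamax_pv(node, path=()):
    """Toy tree: leaves are static evals (floats, from White's view at
    even depth), interior nodes are dicts of move name -> child."""
    if not isinstance(node, dict):            # leaf: its static eval
        return node, path
    best_score, best_pv = None, None
    for move, child in node.items():
        child_score, child_pv = negamax_pv(child, path + (move,))
        child_score = -child_score            # negamax sign flip
        if best_score is None or child_score > best_score:
            best_score, best_pv = child_score, child_pv
    return best_score, best_pv

tree = {
    "e4": {"c5": 0.60, "e5": 0.20},
    "d4": {"d5": 0.25, "Nf6": 0.30},
}

score, pv = negamax_pv(tree)
print(score, pv)   # 0.25 ('d4', 'd5')
```

The root score 0.25 is precisely the leaf eval stored at `tree["d4"]["d5"]` — so whether 0.25 is "right" is entirely a question about that final position.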
Chan Rasjid
Posts: 588
Joined: Thu Mar 09, 2006 4:47 pm
Location: Singapore

Re: Crafty and Stockfish question

Post by Chan Rasjid »

Evaluation scores should represent the probability of winning in order for alpha-beta to work: it selects the move that has the highest score, i.e. the highest probability of winning. "0" should mean an evenly balanced position, and I don't think it is good if this reference point is somehow shifted.

Chess programming aims for an "ideal" evaluator that consistently returns a certain level of score advantage when it is about to win a game, other things being equal.

I cannot see any alternative way to interpret scores or to assign scores to evaluation factors.

Rasjid
lkaufman
Posts: 6259
Joined: Sun Jan 10, 2010 6:15 am
Location: Maryland USA
Full name: Larry Kaufman

Re: Crafty and Stockfish question

Post by lkaufman »

Here is a good example of what I'm talking about. The following opening line is one of the most common positions after 11 moves in GM chess; it is a very typical Sicilian. Both sides have castled so that's not an issue.

1. e4 c5 2. Nf3 d6 3. d4 cxd4 4. Nxd4 Nf6 5. Nc3 a6 6. Be2 e6 7. O-O Be7 8. f4 Qc7 9. Be3 Nc6 10. Kh1 O-O 11. a4 Re8

There have been more than two thousand games at IM/GM level from this position, with White scoring the usual 55%. So a proper eval would be about 1/5 of the value of a pawn in the opening position: about 0.2 for most programs, about 0.15 for Rybka.

I did a two-ply search from here on a lot of programs. Stockfish 1.8 evaluated it at +.56, Rybka 4 at +.57, Crafty 23 at +.60, Deep Shredder 12 at +.69, Fritz 12 at +.79, and Naum 4 at +.89. All ridiculously optimistic for White. The only program that was even close to reasonable, much to my surprise, was Komodo 1.2 at +.31. Does anyone know of any other program that evaluates this at two ply at around 0.3 or less?

I'll now run a randomized playout with Rybka to see how the engine actually performs from here.
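The 55%-to-roughly-0.2-pawns conversion used in this post can be reproduced by inverting a logistic score model. The calibration (+1.00 pawn corresponding to a 75% expected score) is an assumption consistent with the figures quoted earlier in the thread:

```python
import math

# Inverse of a logistic score model: turn a database percentage back
# into a pawn-equivalent evaluation. K is an assumed calibration
# making +1.00 pawn correspond to a 75% expected score.
K = math.log10(3.0)

def pawns_from_percentage(p):
    """Expected score p in (0, 1) -> evaluation in pawns."""
    return -math.log10(1.0 / p - 1.0) / K

print(round(pawns_from_percentage(0.55), 2))   # about 0.18: near the 0.2 above
print(round(pawns_from_percentage(0.75), 2))   # 1.0 by construction
```

By the same inversion, the two-ply evals of +.56 to +.89 listed above would require White to score well over 65% from this position, which no database supports.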