Crafty and Stockfish question

lkaufman · Post by **lkaufman** » Tue Jul 20, 2010 4:29 am

bob wrote: You/Larry want the evaluation to match a human perception about the advantage and who has it. I see nothing at all requiring such a thing to play good chess. We simply need an evaluation that causes the program to play the best move each time. And tuning leads to this. Whether a program thinks white is +1.5 before making a single move or +0.0 has no bearing on how well the engine will play the game. What is important is that better moves raise the score higher than worse moves.

This fetish about wanting the eval to match a human's perception makes no sense. What human examines 100M nodes per second to choose which move to make? What human does alpha/beta, minmax, LMR, null-move and all the other things a computer does. So why, since a computer plays the game so differently, is there a sense of "wrongness" when the computer's evaluation seems out of touch with a human's evaluation?

Now if the goal is to somehow map computer evaluations into something more palatable for humans, that can be done. But there is absolutely no reason to physically change a computer's evaluation to make it match a human's perception better, any more than it would be useful to use a more human-like approach to playing the game...

Bob, I think you misunderstand. Neither I nor Sam is especially concerned about a human's perception of the score here. We are both saying (I think) that current engine evals do not relate well to the probability of the engine winning from the position in question. We both believe that a position with a higher probability of winning than another position should have a higher score, and it is clear that this is not so in the opening. I have found that my subjective opinion of an opening postion (based heavily on GM practical results) is a much better predictor of engine vs. engine playout results than is the score shown by that engine. So I believe that making an eval come closer to GM opinion will also make it a better predictor of playout results, which if we are correct should also make it a stronger engine. However I have some doubt about this last point, in part due to comments by Tord here and to other comments about LMR changing the situation from pure alpha-beta. I am pretty sure that Komodo is a better predictor of what playout results (self-play) from a given opening position would be than is Stockfish, but (so far) Stockfish is stronger, so perhaps this assumption is in error. Of course there are likely other explanations for the relative ratings.

bob · Post by **bob** » Tue Jul 20, 2010 6:54 am

lkaufman wrote:
bob wrote: You/Larry want the evaluation to match a human perception about the advantage and who has it. I see nothing at all requiring such a thing to play good chess. We simply need an evaluation that causes the program to play the best move each time. And tuning leads to this. Whether a program thinks white is +1.5 before making a single move or +0.0 has no bearing on how well the engine will play the game. What is important is that better moves raise the score higher than worse moves.

This fetish about wanting the eval to match a human's perception makes no sense. What human examines 100M nodes per second to choose which move to make? What human does alpha/beta, minmax, LMR, null-move and all the other things a computer does. So why, since a computer plays the game so differently, is there a sense of "wrongness" when the computer's evaluation seems out of touch with a human's evaluation?

Now if the goal is to somehow map computer evaluations into something more palatable for humans, that can be done. But there is absolutely no reason to physically change a computer's evaluation to make it match a human's perception better, any more than it would be useful to use a more human-like approach to playing the game...
Bob, I think you misunderstand. Neither I nor Sam is especially concerned about a human's perception of the score here. We are both saying (I think) that current engine evals do not relate well to the probability of the engine winning from the position in question. We both believe that a position with a higher probability of winning than another position should have a higher score,

OK, there is our primary point of disagreement. I believe that the score should represent a number such that if the score is higher after move A, than after move B, then move A is a better move. And that's the way I have been tuning for 30 years now.

I can think of lots of examples where a score of x.xx means one thing at one point and something else at another point. For example, playing white in the King's Gambit accepted, you are a pawn down. losing? winning? but then you get to a simple endgame, where black has a/b pawns, you have an a pawn, you both have a g-pawn. You are a pawn down. losing? winning? So even a pawn's score varies wildly depending on middlegame or endgame. Then we get to positional scores.

I see nothing to suggest that the score should be proportional to winning or losing. Is a piece enough to win the game? what about KRN vs KR? So if material values can mean such different things at different points, why should positional scores be much more static? All I want is to be sure that my program plays the best move, I don't care whether the score indicates anything at all. I prefer to see +=good, -=bad, but that's only a human preference and hardly necessary. I suspect it would be possible to map scores to winning probability with some degree of accuracy. But I have seen Crafty lose or draw from +6.00 and win from -6.00, so I don't expect a high correlation between score and winning probability. I simply hope for high correlation between score and quality of move, unrelated to quality of position.

and it is clear that this is not so in the opening. I have found that my subjective opinion of an opening postion (based heavily on GM practical results) is a much better predictor of engine vs. engine playout results than is the score shown by that engine. So I believe that making an eval come closer to GM opinion will also make it a better predictor of playout results, which if we are correct should also make it a stronger engine. However I have some doubt about this last point, in part due to comments by Tord here and to other comments about LMR changing the situation from pure alpha-beta. I am pretty sure that Komodo is a better predictor of what playout results (self-play) from a given opening position would be than is Stockfish, but (so far) Stockfish is stronger, so perhaps this assumption is in error. Of course there are likely other explanations for the relative ratings.

I think the basic idea is flawed. As a human, you do much more than just add up positional features. While you might do that to an extent (I'm putting myself in the role of player here) there are a lot of modifications caused by experience, intuition, etc. Is that isolated pawn weak or strong. If I rip this pawn over here, my kingside will get ripped. Will it get ripped enough that I get mated or can I survive and have a pawn + and a majority on the queenside?

Computers do less of that "second-guessing" stuff, which is what leads to inflated (or deflated) scores. But then they search hellishly deep. The trees they now search are essentially incomprehensible, both due to their incredible size, as well as incredibly variable depth.

While someone _could_ write an evaluation that does what you suggest, I am not sure the computational cost would be justified, however. The question is, do we want a program that gives an estimate of win/lose probability, or a program that plays the best move. The fact that for any given position, programs can give significantly different scores illustrates the problem well.

I personally reached this conclusion back in the 80's with Cray Blitz. Back then, when scores got to roughly +.3 to +.4, we were on the edge of winning material and the game. But as we tuned to deal with certain IM/GM issues such as locking pawns up (ala David Levy) we noticed the scores climbing without a necessarily matching increase in win/lose probability, although the program was playing the moves we wanted so that it avoided the locked up pawns that led to problems...

BubbaTough · Post by **BubbaTough** » Tue Jul 20, 2010 7:13 am

I believe that the score should represent a number such that if the score is higher after move A, than after move B, then move A is a better move.

I translate "better move" to mean higher probability to win. I assume you do to. I think the only difference in opinion here is that you think that can/should only be locally defined, while I think that the more globally true that is, the better chess programs will be. And the larger search depths get, the more true this is.

Your examples about the changing value of a pawn or a knight depending on situation are neither here nor there. I completely agree the whole idea of a "centipawn" as a measure is odd and artificial. The only question here is whether considering a position "better" than another position (as in its a higher probability to win from that position) is only relevant in a strict localized area of the search (such as picking between two moves from the same position) or whether programs would be better if the concept was more global.

-Sam

michiguel · Post by **michiguel** » Tue Jul 20, 2010 7:36 am

BubbaTough wrote:
I believe that the score should represent a number such that if the score is higher after move A, than after move B, then move A is a better move.
I translate "better move" to mean higher probability to win. I assume you do to. I think the only difference in opinion here is that you think that can/should only be locally defined, while I think that the more globally true that is, the better chess programs will be. And the larger search depths get, the more true this is.

I meant to write something like this, but I could not have put it more nicely. Of course, maybe in practice this cannot be noticed because searches are not deep enough, but that is another issue.

Miguel

Your examples about the changing value of a pawn or a knight depending on situation are neither here nor there. I completely agree the whole idea of a "centipawn" as a measure is odd and artificial. The only question here is whether considering a position "better" than another position (as in its a higher probability to win from that position) is only relevant in a strict localized area of the search (such as picking between two moves from the same position) or whether programs would be better if the concept was more global.

-Sam

yoshiharu · Post by **yoshiharu** » Tue Jul 20, 2010 1:36 pm

BubbaTough wrote: I translate "better move" to mean higher probability to win. I assume you do to. I think the only difference in opinion here is that you think that can/should only be locally defined, while I think that the more globally true that is, the better chess programs will be. And the larger search depths get, the more true this is.

I think this summarises the whole discussion. But how do you connect all the "local" parts of the tree seems to me to be an almost impossible task.
After all, most games (well, at a sufficiently high level) finish in the endgame, and you must be able to predict the possible endgames back from the middlegame (possibly earlier), if you want the perfect evaluation. To reach that point you have to search at incredible depth, or maybe to find a different way to handle the complexity of the tree.
I think an analogy can be found in the computer Go: alpha-beta and such cannot handle those huge trees, so someone came out with new ideas; as far as I know the principal one was applying Montecarlo methods: what is the Montecarlo-equivalent for Computer Chess?
I mean: the problem of reaching a good endgame cannot be solved by computation (at the moment), and this is quite the same issue they have with Go, that strategy dominates (even deep) tactics (maybe to a different extent). I don't know if in the literature of Computer Chess this has been addressed in some way, but it sounds attractive, doesn't it?

Cheers, Mauro

yoshiharu · Post by **yoshiharu** » Tue Jul 20, 2010 1:45 pm

BubbaTough wrote: People spend so much time interacting with computer programs that it has fundamentally changed how they talk about the size of advantages.

This probably solves the problem

Joking aside, I think that people doesn't usually realise how often it occurs the phenomenon I call "evaporation of search value", i.e. the effect that (using the engine as an analysis mule) as long as you stay in a position the score after some time converges to something, but as soon as you reach some critical point along the PV this score tapers (well, sometimes flips too

).
I think (speaking as a user) the best way to use an engine for this purpose is to employ a lot of dialectics (in the original meaning, as in http://en.wikipedia.org/wiki/Dialectics), between you and the engine, as I think they do in Advanced Chess tournaments.
Just my two cents

Cheers, Mauro

bob · Post by **bob** » Tue Jul 20, 2010 3:36 pm

BubbaTough wrote:
I believe that the score should represent a number such that if the score is higher after move A, than after move B, then move A is a better move.
I translate "better move" to mean higher probability to win. I assume you do to. I think the only difference in opinion here is that you think that can/should only be locally defined, while I think that the more globally true that is, the better chess programs will be. And the larger search depths get, the more true this is.

Your examples about the changing value of a pawn or a knight depending on situation are neither here nor there. I completely agree the whole idea of a "centipawn" as a measure is odd and artificial. The only question here is whether considering a position "better" than another position (as in its a higher probability to win from that position) is only relevant in a strict localized area of the search (such as picking between two moves from the same position) or whether programs would be better if the concept was more global.

-Sam

I am not coming close to saying "this can't be done." What I am saying is that

(1) I believe it might be too expensive computationally

(2) I am not convinced a program would play better.

For example, suppose that we take _every_ starting position and assign it a score of zero, and then do a search to maximize that score. Would that play worse than what we have today? I don't see how.

I believe that too much is being attributed to the "score", taking it way "out of context" with how it is generated. As yet another example, if you only do a 2-ply search and find that move X is best with a score of +.3, or you do a 20 ply search and find the same move X is best with a score of +4.5, how would that translate to winning chances, as opposed to just choosing the best move correctly? Or vice-versa, a 2 ply search says +4, while a 20 ply search says +.1.

I believe that scores are relative anyway, trying to turn them into some probability of winning is going to be interesting, expensive, and very likely worth nothing except to have a different "relative" number to display to satisfy a human's desire to see something he understands better. I'm going to try my "log eater" idea when I have time, because it ought to produce some interesting data.

(a) what sort of "buckets" fit best (what score range gives a probability of +.6, or +.7, etc.

(b) for each "bucket, how often is it wrong as compared to how often is it correct?

I suspect the idea is going to do nothing more than scale the eval numbers into some smaller range, without changing their actual meaning at all.

zamar · Post by **zamar** » Tue Jul 20, 2010 7:43 pm

lkaufman wrote:So I believe that making an eval come closer to GM opinion will also make it a better predictor of playout results, which if we are correct should also make it a stronger engine. However I have some doubt about this last point, in part due to comments by Tord here and to other comments about LMR changing the situation from pure alpha-beta..

I had a dream that tuning evaluation to better match statistical winning percentage of the position would be a way to improve. I was wrong! I Implemented various algorithms which converged beautifully to certain point. Resulting values were always a disaster in practical tests.

Unfortunately the fact seems to be that there is no direct connection between evaluation and winning percentage. It seems that heavily optimized search should not head for a single position which has high winning percentage, but instead head for a strong position with a lot of different possibilities to play on.

jhaglund · Post by **jhaglund** » Tue Jul 20, 2010 8:06 pm

bob wrote:
jhaglund wrote:
bob wrote:
lkaufman wrote:My "proof" that my human evaluation (and that of all GMs) is the right one is that even in engine databases, the openings score more or less similarly to the results in human play (in most cases). Thus you would never find 75% scores for White in any database of Sicilian games, which might be expected from the huge evals in Crafty and Stockfish. I also try randomized playouts with Rybka from major openings and usually get similar results to human databases, i.e. scores around 55% for White.

As for earlier starting positions, what do you think about randomized moves for the first N ply, filtering out all positions unbalanced by more than X?
Aha, now we get to the bottom of the "symantic" war here.

What you are doing, and incorrectly, is to assume that the scores match up proportionally to winning or losing. But if you think about the score as something different entirely, it becomes less muddy. A program thinks in terms of "best move" and has to assign some numeric value so that minimax works. But nothing says that +1.5 means "winning advantage". To understand, what if I multiplied all scores by 100, including material. Now you would see truly huge scores with no greater probability of winning than before. The scores help the program choose between moves. I'd like to have scores between -1.0 and +1.0, where -1.0 is absolutely lost and +1.0 is absolutely won. But that would not necessarily help the thing play one bit better. Would be a more useful number for humans to see, I agree. But that would be all.

Comparing scores between engines is very much like comparing depths, or branching factors, etc. And in the end, the only thing that really matters is who wins and who loses...

What I think could be useful here:

Implementing a concurrent search (SFSB)...
http://talkchess.com/forum/viewtopic.ph ... 70&t=35066

Playing out the moves to a complete game or just to x plies + y and generating statistics of each evaluation +/-... and using that to chose your move or for move ordering.

1.e4 ...
... e5 (34%W)(32%L)(34%D)
... c5 (33%W)(30%L)(37%D)
... c6 (31%W)(28%L)(41%D)
...Nc6(29%W)(30%L)(41%D)
...
or...

Average eval at end of x plies...

1.e4 ...
... e5 (+.34)
... c5 (+.15)
... c6 (+.11)
...Nc6(+.09)
...

or...

Order move after Eval pv[x]+pv[ply[y]]

1.e4 ...
... e5 (+.4)
... c5 (+.3)
... c6 (+.2)
...Nc6(+.1)
...

I'm not a fan. The problem I see is that while we are perfectly willing to use an N-1 ply search to order moves for the N ply search, I would not be willing to use an N-15 ply search to order moves for an N-ply. And that is what you would be doing because those "games" would have to be so shallow, in order to complete them in reasonable time, they would represent really minimal searches.

I more prefer the idea I proposed in another thread, that is to simply play a million games or so, then go in and create buckets for each eval "range" (say 0 to .25, .25 to .50, etc) and then look thru each log file and for each search that had an eval in one of the buckets and to that bucket add the result (0, .5 or 1.0). Once going thru a million games, compute the average for each bucket, which would convert an eval of 0 to .25 into a winning probability. Ditto for .25 to .5. Then the eval could be a pure number between 0.0 and 1.0. Or perhaps even better, double the number and subtract 1.0, so that the numbers fall in the range of -1.0 (absolutely lost) to 0.0 (drawish) to 1.0 (absolutely won). Or they could be scaled in whatever way someone wants, perhaps even via an option. I think I might play around with this, just for fun, to see what the winning probability looks like for each possible scoring range.

I agree, with the aspect of finishing out games quickly to calculate statistics could bring bad results.

On the other hand, why wouldn't you want to make your moves (mentally) first that were choosen at, say at Depth 15?
... play them out,

...then evaluate the position concurrently.

Ordering the moves based on looking x plies ahead of what you were going play in the first place.

It would discover any problems with your N-1 before you go down the PV in the actual game.

If a GM got to play out 15 moves ... then got to undo all or some of them because he/she saw after 15 + x (x =12 plies) that it was a bad line to follow. The GM would goto plan B and look at his/her next variation(s). The GM would be looking 27 plies instead of 15.

Generating statistics would be useful if you had it in-house within the search... The move played has more influence on the outcome of the game then what the position evaluates.

Taking the moves from root, then evaluating with search windows and sorting them based on that value would be efficient then searching log files for strings. You wouldn't need to play any games. It would be like analyze(), but sorting each PV line. The longer you search the more PVs you'll have.

This is what I was doing in Crafty, but writting each PV line into a PGN, then creating a book for crafty with it's own evaluations. Having a storage eval window so you don't get bad moves in the book...

Otherwise, database program already do calculate the W/L/D percentages of games & for book moves... Having it with-in, as a sub-search probability generator for ordering moves would be nice out of opening theory... That is where my middle game books idea can replace this...

lkaufman · Post by **lkaufman** » Tue Jul 20, 2010 8:07 pm

Based on this comment by Joona and the similar one by Tord, I must accept that using realistic evals that predict winning percentages and maximizing Elo are incompatible goals, unless we can find a way to "have my cake and eat it too". So I guess Komodo won't catch Stockfish in Elo as long as we insist on realistic evaluation. I suppose we could have two options, does the user want reasonable eval or max Elo?

Crafty and Stockfish question

Re: Crafty and Stockfish question... Larry????

Re: Crafty and Stockfish question... Larry????

Re: Crafty and Stockfish question... Larry????

Re: Crafty and Stockfish question... Larry????

Re: Crafty and Stockfish question... Larry????

Re: Crafty and Stockfish question... Larry????

Re: Crafty and Stockfish question... Larry????

Re: Crafty and Stockfish question... Larry????

Re: Crafty and Stockfish question

Re: Crafty and Stockfish question... Larry????