Crafty and Stockfish question

Discussion of chess software programming and technical issues.

Moderator: Ras

BubbaTough
Posts: 1154
Joined: Fri Jun 23, 2006 5:18 am

Re: Crafty and Stockfish question... Larry????

Post by BubbaTough »

lkaufman wrote:Based on this comment by Joona and the similar one by Tord, I must accept that using realistic evals that predict winning percentages and maximizing Elo are incompatible goals, unless we can find a way to "have my cake and eat it too". So I guess Komodo won't catch Stockfish in Elo as long as we insist on realistic evaluation. I suppose we could have two options, does the user want reasonable eval or max Elo?
I have tried to have my cake and eat it by having 2 functions, an evaluation function, and a "tree-shaper" function, which gives extra weights to certain dynamic factors (such as those that shape opening evaluations so unrealistically) so that while the terminal nodes have realistic evaluation weights, the search tree acts as if a more traditional evaluation function was at play.
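A minimal sketch of that split, with invented names and weights (this is not Sam's or Komodo's actual code): the leaves return the realistic evaluation, while move ordering and pruning decisions see that value plus a dynamic shaping bonus.

Code: Select all

#include <stdio.h>

/* Toy illustration of the two-function idea: terminal nodes get the
   realistic evaluation, while ordering/pruning decisions see that value
   plus a dynamic shaping bonus.  Position, the weights and the bonus
   terms are all placeholders. */

typedef struct { int material; int mobility; int king_pressure; } Position;

/* realistic eval: calibrated so the score tracks the expected result */
static int evaluate(const Position *p) {
    return p->material + p->mobility / 4;
}

/* extra dynamic weight that only shapes the tree, never reaches the GUI */
static int shaping_bonus(const Position *p) {
    return p->mobility / 2 + p->king_pressure;
}

static int leaf_score(const Position *p)     { return evaluate(p); }
static int ordering_score(const Position *p) { return evaluate(p) + shaping_bonus(p); }

int main(void) {
    Position pos = { 20, 40, 15 };
    printf("reported score %d, tree-shaping score %d\n",
           leaf_score(&pos), ordering_score(&pos));
    return 0;
}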

I have tried variants of this idea over the years, but with no luck. So while I remain convinced that ultimately it is possible to build a stronger engine with a realistic evaluation function, without losing out to the effect on what is searched (which is what I think Joona and Tord are referring to), I have no evidence of this :(.

-Sam
jwes
Posts: 778
Joined: Sat Jul 01, 2006 7:11 am

Re: Crafty and Stockfish question... Larry????

Post by jwes »

zamar wrote:
lkaufman wrote:So I believe that making an eval come closer to GM opinion will also make it a better predictor of playout results, which if we are correct should also make it a stronger engine. However I have some doubt about this last point, in part due to comments by Tord here and to other comments about LMR changing the situation from pure alpha-beta..
I had a dream that tuning the evaluation to better match the statistical winning percentage of the position would be a way to improve. I was wrong! I implemented various algorithms which converged beautifully to a certain point. The resulting values were always a disaster in practical tests.

Unfortunately the fact seems to be that there is no direct connection between evaluation and winning percentage. It seems that a heavily optimized search should not head for a single position which has a high winning percentage, but instead head for a strong position with a lot of different possibilities to play on.
How do you recognize "a strong position with a lot of different possibilities to play on"? Not with alpha-beta (unless you evaluate for this and give such positions a higher score!)
bob
Posts: 20943
Joined: Mon Feb 27, 2006 7:30 pm
Location: Birmingham, AL

Re: Crafty and Stockfish question... Larry????

Post by bob »

jwes wrote:
zamar wrote:
lkaufman wrote:So I believe that making an eval come closer to GM opinion will also make it a better predictor of playout results, which if we are correct should also make it a stronger engine. However I have some doubt about this last point, in part due to comments by Tord here and to other comments about LMR changing the situation from pure alpha-beta..
I had a dream that tuning the evaluation to better match the statistical winning percentage of the position would be a way to improve. I was wrong! I implemented various algorithms which converged beautifully to a certain point. The resulting values were always a disaster in practical tests.

Unfortunately the fact seems to be that there is no direct connection between evaluation and winning percentage. It seems that a heavily optimized search should not head for a single position which has a high winning percentage, but instead head for a strong position with a lot of different possibilities to play on.
How do you recognize "a strong position with a lot of different possibilities to play on"? Not with alpha-beta (unless you evaluate for this and give such positions a higher score!)
alpha/beta actually takes care of this pretty well on its own. See the Beal experiment with random eval to understand, but the more options you have at any point in the tree (which is a poor-man's mobility estimation) the more likely you can reach an endpoint with a more favourable score, just based on statistical sampling...
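The statistical core of that effect can be seen without a chess engine at all. In this rough sketch (the sample counts are arbitrary and the "leaf evals" are plain random numbers), the expected best of k random scores grows with k, which is why the side with more options at a node tends to back up a better value even from a random eval:

Code: Select all

#include <stdio.h>
#include <stdlib.h>

/* Statistical core of the Beal random-eval observation: the expected
   maximum of k random leaf scores grows with k, so the side with more
   legal moves (larger k) tends to back up a better value even when the
   evaluation itself is pure noise. */

int main(void) {
    srand(12345);
    for (int k = 1; k <= 40; k += 3) {              /* k = number of options */
        double sum = 0.0;
        for (int trial = 0; trial < 10000; trial++) {
            int best = 0;
            for (int i = 0; i < k; i++) {
                int score = rand() % 1000;          /* random "leaf eval"    */
                if (score > best) best = score;
            }
            sum += best;
        }
        printf("%2d options -> average best leaf %.1f\n", k, sum / 10000.0);
    }
    return 0;
}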
bob
Posts: 20943
Joined: Mon Feb 27, 2006 7:30 pm
Location: Birmingham, AL

Re: Crafty and Stockfish question

Post by bob »

jhaglund wrote:
bob wrote:
jhaglund wrote:
bob wrote:
lkaufman wrote:My "proof" that my human evaluation (and that of all GMs) is the right one is that even in engine databases, the openings score more or less similarly to the results in human play (in most cases). Thus you would never find 75% scores for White in any database of Sicilian games, which might be expected from the huge evals in Crafty and Stockfish. I also try randomized playouts with Rybka from major openings and usually get similar results to human databases, i.e. scores around 55% for White.

As for earlier starting positions, what do you think about randomized moves for the first N ply, filtering out all positions unbalanced by more than X?
Aha, now we get to the bottom of the "semantic" war here. :)

What you are doing, and incorrectly, is to assume that the scores match up proportionally to winning or losing. But if you think about the score as something different entirely, it becomes less muddy. A program thinks in terms of "best move" and has to assign some numeric value so that minimax works. But nothing says that +1.5 means "winning advantage". To understand, what if I multiplied all scores by 100, including material. Now you would see truly huge scores with no greater probability of winning than before. The scores help the program choose between moves. I'd like to have scores between -1.0 and +1.0, where -1.0 is absolutely lost and +1.0 is absolutely won. But that would not necessarily help the thing play one bit better. Would be a more useful number for humans to see, I agree. But that would be all.

Comparing scores between engines is very much like comparing depths, or branching factors, etc. And in the end, the only thing that really matters is who wins and who loses...

What I think could be useful here:

Implementing a concurrent search (SFSB)...
http://talkchess.com/forum/viewtopic.ph ... 70&t=35066

Playing out the moves to a complete game, or just to x plies + y, and generating statistics of each evaluation +/-... and using that to choose your move or for move ordering.

1.e4 ...
... e5 (34%W)(32%L)(34%D)
... c5 (33%W)(30%L)(37%D)
... c6 (31%W)(28%L)(41%D)
...Nc6(29%W)(30%L)(41%D)
...
or...

Average eval at end of x plies...

1.e4 ...
... e5 (+.34)
... c5 (+.15)
... c6 (+.11)
...Nc6(+.09)
...

or...


Order move after Eval pv[x]+pv[ply[y]]

1.e4 ...
... e5 (+.4)
... c5 (+.3)
... c6 (+.2)
...Nc6(+.1)
...

:wink:
I'm not a fan. The problem I see is that while we are perfectly willing to use an N-1 ply search to order moves for the N ply search, I would not be willing to use an N-15 ply search to order moves for an N-ply search. And that is what you would be doing, because those "games" would have to be so shallow, in order to complete them in reasonable time, that they would represent really minimal searches.

I more prefer the idea I proposed in another thread, which is to simply play a million games or so, then go in and create buckets for each eval "range" (say 0 to .25, .25 to .50, etc), then look thru each log file and, for each search that had an eval in one of the buckets, add the result (0, .5 or 1.0) to that bucket. Once through a million games, compute the average for each bucket, which would convert an eval of 0 to .25 into a winning probability. Ditto for .25 to .5. Then the eval could be a pure number between 0.0 and 1.0. Or perhaps even better, double the number and subtract 1.0, so that the numbers fall in the range of -1.0 (absolutely lost) to 0.0 (drawish) to 1.0 (absolutely won). Or they could be scaled in whatever way someone wants, perhaps even via an option. I think I might play around with this, just for fun, to see what the winning probability looks like for each possible scoring range.
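A minimal sketch of that bucketing arithmetic, using a handful of invented (eval, result) pairs in place of real game logs:

Code: Select all

#include <stdio.h>

/* Bucket (eval, result) pairs from finished games in 0.25-pawn ranges and
   average the results per bucket to get an empirical winning probability.
   The sample data below is invented; real input would come from game logs. */

#define NBUCKETS 8                       /* covers 0.00 .. 2.00 pawns */
#define BUCKET_WIDTH 0.25

int main(void) {
    /* eval in pawns (side to move) and game result (1, .5, 0) */
    double evals[]   = { 0.10, 0.20, 0.30, 0.40, 0.60, 0.80, 1.10, 1.40, 1.60, 1.90 };
    double results[] = { 0.5,  0.5,  0.5,  1.0,  0.5,  1.0,  1.0,  1.0,  0.5,  1.0 };
    int n = sizeof(evals) / sizeof(evals[0]);

    double sum[NBUCKETS] = {0};
    int    cnt[NBUCKETS] = {0};

    for (int i = 0; i < n; i++) {
        int b = (int)(evals[i] / BUCKET_WIDTH);
        if (b >= NBUCKETS) b = NBUCKETS - 1;
        sum[b] += results[i];
        cnt[b]++;
    }
    for (int b = 0; b < NBUCKETS; b++) {
        if (cnt[b] == 0) continue;
        double p = sum[b] / cnt[b];              /* empirical winning probability */
        printf("eval %.2f-%.2f: p(win) = %.2f, rescaled score = %+.2f\n",
               b * BUCKET_WIDTH, (b + 1) * BUCKET_WIDTH, p, 2.0 * p - 1.0);
    }
    return 0;
}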
I agree that finishing out games quickly to calculate statistics could bring bad results.

On the other hand, why wouldn't you want to first make (mentally) the moves that were chosen at, say, depth 15?
... play them out,

...then evaluate the position concurrently.

Ordering the moves based on looking x plies ahead of what you were going to play in the first place.

It would discover any problems with your N-1 search before you go down the PV in the actual game.

If a GM got to play out 15 moves ... then got to undo all or some of them because he/she saw after 15 + x (x = 12 plies) that it was a bad line to follow, the GM would go to plan B and look at his/her next variation(s). The GM would be looking 27 plies ahead instead of 15.

Generating statistics would be useful if you had it in-house within the search... The move played has more influence on the outcome of the game than what the position evaluates to.

Taking the moves from the root, then evaluating with search windows and sorting them based on that value, would be more efficient than searching log files for strings. You wouldn't need to play any games. It would be like analyze(), but sorting each PV line. The longer you search, the more PVs you'll have.

This is what I was doing in Crafty, but writing each PV line into a PGN, then creating a book for Crafty with its own evaluations, with a storage eval window so you don't get bad moves in the book...

Otherwise, database programs already calculate the W/L/D percentages of games and book moves... Having it within, as a sub-search probability generator for ordering moves, would be nice once out of opening theory... That is where my middle-game books idea can replace this...
This is similar to the old perpetual-motion machine. You can't cheat in physics, nor in tree searching. If you stop to do something unusual, then there is a cost for doing that; in our case the cost can be directly measured in terms of reduced search depth... How can you, at any point in the tree, stop and play out several games? How much could you have searched normally while that "delay" was in progress???

Beware of your book idea as well. We did that in Cray Blitz. We searched from _every_ position in the book and then minimaxed the entire thing back to the root so that when you played a book move you had an idea of where you were heading. .... or not. What about the book moves you _don't_ have, the refutations to a line that used to look good but are now proven to be bad? What about the gambits that you are gleefully going to accept since they appear to win material, yet you end up losing? This is not quite as easy as it seems. We gave up on the search/minimax idea for our book way early in the Crafty development.

We started doing that idea to get rid of blunders and catch the many lines in the various MCO/ECO books that say +/- but where black is actually winning (or vice-versa). It was good for those cases, but we had to look at them manually to do something about them in our book. I eventually decided that book learning was the only real solution to help with the book.
yoshiharu
Posts: 56
Joined: Sat Nov 11, 2006 11:14 pm

Re: Crafty and Stockfish question

Post by yoshiharu »

bob wrote: Beware of your book idea as well. We did that in Cray Blitz. We searched from _every_ position in the book and then minimaxed the entire thing back to the root so that when you played a book move you had an idea of where you were heading. .... or not.


I am interested in this since it sounds like an idea I had myself one year ago or so; just to make sure I understood correctly: did you minimax using as "leaf scores" some statistical indicator extracted from the DB (maybe expected result, maybe whatever else)? Sort of searching only along the tree extracted from your DB, with the stats as static eval? I think it is what you did, inferring it from what you wrote below, but I want to be sure.
bob wrote: What about the book moves you _don't_ have, the refutations to a line that used to look good but are now proven to be bad? What about the gambits that you are gleefully going to accept since they appear to win material, yet you end up losing? This is not quite as easy as it seems. We gave up on the search/minimax idea for our book way early in the Crafty development.
(Note: In what follows I assume the answer to my question was 'yes').

But isn't this a problem arising also when you play the moves from your book with no search at all? Actually this is a problem for human chess as well, when a line was the mainline until, say, 2003, when some very smart GM refuted it, but then the stats are poorer since his novelty is too "new" and is not represented enough in your DB (sort of, I think you get the point).

Cheers, Mauro
bob
Posts: 20943
Joined: Mon Feb 27, 2006 7:30 pm
Location: Birmingham, AL

Re: Crafty and Stockfish question

Post by bob »

yoshiharu wrote:
bob wrote: Beware of your book idea as well. We did that in Cray Blitz. We searched from _every_ position in the book and then minimaxed the entire thing back to the root so that when you played a book move you had an idea of where you were heading. .... or not.


I am interested in this since it sounds like an idea I had myself one year ago or so; just to make sure I understood correctly: did you minimax using as "leaf scores" some statistical indicator extracted from the DB (maybe expected result, maybe whatever else)? Sort of searching only along the tree extracted from your DB, with the stats as static eval? I think it is what you did, inferring it from what you wrote below, but I want to be sure.
No. What we did was to first go thru the book and, for each position in there, do a deep search and save that score in the book position. Kind of like the learn value in Crafty today. Then we took the entire book and minimaxed those scores back to the root, so that when we played white and looked up the positions after we had played 1. e4, 1. d4, etc, we had scores that were based on deep searches from both book line endpoints _and_ all intervening positions (we did the internal positions, from before the end of a book line, to make sure we were not missing some killer move a book editor had overlooked).
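For anyone trying to picture that, here is a rough sketch (not the Cray Blitz source; the tiny tree and scores are invented, and unlike Cray Blitz it only backs up leaf scores): each stored position carries the score of an offline deep search, from the point of view of the side to move, and a negamax over the book tree tells you where each line is really heading.

Code: Select all

#include <stdio.h>

#define MAXCHILD 4

typedef struct BookNode {
    const char *move;                  /* move that led to this position    */
    int searched_score;                /* offline deep search, side to move */
    int nchild;
    struct BookNode *child[MAXCHILD];
} BookNode;

static int backed_up_score(const BookNode *n) {
    if (n->nchild == 0)
        return n->searched_score;      /* end of the book line */
    int best = -100000;
    for (int i = 0; i < n->nchild; i++) {
        int s = -backed_up_score(n->child[i]);
        if (s > best) best = s;
    }
    return best;
}

int main(void) {
    /* after 1.e4 e5 and 1.e4 c5 it is White to move, so those scores are
       from White's point of view; the node after 1.e4 (Black to move)
       negates them when backing up */
    BookNode e5   = { "e5", 15, 0, {0} };
    BookNode c5   = { "c5", 40, 0, {0} };          /* search likes White here */
    BookNode e4   = { "e4",  0, 2, { &e5, &c5 } };
    BookNode root = { "start", 0, 1, { &e4 } };
    printf("backed-up score of 1.e4 for White: %d\n", backed_up_score(&root));
    return 0;
}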

bob wrote: What about the book moves you _don't_ have, the refutations to a line that used to look good but are now proven to be bad? What about the gambits that you are gleefully going to accept since they appear to win material, yet you end up losing? This is not quite as easy as it seems. We gave up on the search/minimax idea for our book way early in the Crafty development.
(Note: In what follows I assume the answer to my question was 'yes').

But isn't this a problem arising also when you play the moves from your book with no search at all? Actually this is a problem for human chess as well, when a line was the mainline until, say, 2003, when some very smart GM refuted it, but then the stats are poorer since his novelty is too "new" and is not represented enough in your DB (sort of, I think you get the point).

Cheers, Mauro
At present, we base our move selection mainly on how often the book move was played when the book was built from a GM pgn game collection. Then we use book learning to fine-tune that for the moves that appear to be bad after we actually play them in a game and then do real searches and discover that the move led us to a bad position or result.
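A rough sketch of that kind of selection (not Crafty's book code; the fields and numbers are invented): weight each book move by how often it was played in the GM collection, and scale the weight by a learn factor that drops after the move leads to bad games.

Code: Select all

#include <stdio.h>
#include <stdlib.h>

typedef struct {
    const char *move;
    int played;          /* times the move appeared in the GM collection */
    double learn;        /* 1.0 = neutral; < 1.0 after bad experiences   */
} BookMove;

/* pick a move with probability proportional to played * learn
   (assumes no more than 16 book moves per position) */
static const char *pick_book_move(const BookMove *mv, int n) {
    double total = 0.0, weight[16];
    for (int i = 0; i < n; i++) {
        weight[i] = mv[i].played * mv[i].learn;
        total += weight[i];
    }
    double r = (double)rand() / RAND_MAX * total;
    for (int i = 0; i < n; i++) {
        if (r < weight[i]) return mv[i].move;
        r -= weight[i];
    }
    return mv[n - 1].move;
}

int main(void) {
    BookMove after_e4[] = {
        { "c5", 5200, 1.00 },
        { "e5", 4100, 1.00 },
        { "e6", 2300, 0.40 },        /* book learning has punished this line */
    };
    srand(7);
    for (int i = 0; i < 5; i++)
        printf("book move: %s\n", pick_book_move(after_e4, 3));
    return 0;
}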
Michael Sherwin
Posts: 3196
Joined: Fri May 26, 2006 3:00 am
Location: WY, USA
Full name: Michael Sherwin

Re: Crafty and Stockfish question... Larry????

Post by Michael Sherwin »

lkaufman wrote:Based on this comment by Joona and the similar one by Tord, I must accept that using realistic evals that predict winning percentages and maximizing Elo are incompatible goals, unless we can find a way to "have my cake and eat it too". So I guess Komodo won't catch Stockfish in Elo as long as we insist on realistic evaluation. I suppose we could have two options, does the user want reasonable eval or max Elo?
There may be a way! You could try to 'Normalize' the piece evals in some way that keeps the importance of mobility without letting it skew the score returned so much. I am trying this now in RomiChess. Except for the end-leaf pawn eval, all eval terms are summed in piece-square tables, so the normalize function in RomiChess looks like this.

Code: Select all

void Normalize() {
  s32 wpt, bpt, wnt, bnt, wbt, bbt, wrt, brt, wqt, bqt, sq;
  wpt = bpt = wnt = bnt = wbt = bbt = wrt = brt = wqt = bqt = -INFINITY;
  /* first pass: find the largest entry in each piece-square table */
  for(sq = A1; sq <= H8; sq++) {
    if(wPawnTbl[sq]   > wpt) wpt = wPawnTbl[sq];
    if(bPawnTbl[sq]   > bpt) bpt = bPawnTbl[sq];
    if(wKnightTbl[sq] > wnt) wnt = wKnightTbl[sq];
    if(bKnightTbl[sq] > bnt) bnt = bKnightTbl[sq];
    if(wBishopTbl[sq] > wbt) wbt = wBishopTbl[sq];
    if(bBishopTbl[sq] > bbt) bbt = bBishopTbl[sq];
    if(wRookTbl[sq]   > wrt) wrt = wRookTbl[sq];
    if(bRookTbl[sq]   > brt) brt = bRookTbl[sq];
    if(wQueenTbl[sq]  > wqt) wqt = wQueenTbl[sq];
    if(bQueenTbl[sq]  > bqt) bqt = bQueenTbl[sq];
  }
  /* second pass: rescale each table so its maximum becomes the chosen cap */
  for(sq = A1; sq <= H8; sq++) {
    wPawnTbl[sq]   = wPawnTbl[sq]   * 100 / wpt;
    bPawnTbl[sq]   = bPawnTbl[sq]   * 100 / bpt;
    wKnightTbl[sq] = wKnightTbl[sq] *  40 / wnt;
    bKnightTbl[sq] = bKnightTbl[sq] *  40 / bnt;
    wBishopTbl[sq] = wBishopTbl[sq] *  60 / wbt;
    bBishopTbl[sq] = bBishopTbl[sq] *  60 / bbt;
    wRookTbl[sq]   = wRookTbl[sq]   *  30 / wrt;
    bRookTbl[sq]   = bRookTbl[sq]   *  30 / brt;
    wQueenTbl[sq]  = wQueenTbl[sq]  *  20 / wqt;
    bQueenTbl[sq]  = bQueenTbl[sq]  *  20 / bqt;
  }
}
I have no idea what static modifiers I should use; however, the idea is to squeeze the values into a smaller range while not changing the internal relationship of each piece's terms. In theory you could then use dynamic modifiers instead of static ones, based on the opening, in order to scale the eval for more Elo and for human familiarity.
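To make that last point concrete, a small sketch (invented table contents and caps, not the RomiChess source) of the same two-pass rescaling with the cap passed in as a parameter chosen from the opening instead of hard-coded:

Code: Select all

#include <stdio.h>

/* same idea as Normalize() above, but for a single table and with the
   cap supplied by the caller instead of being a fixed constant */
static void rescale_table(int tbl[64], int cap) {
    int max = tbl[0];
    for (int sq = 1; sq < 64; sq++)
        if (tbl[sq] > max) max = tbl[sq];        /* pass 1: find maximum   */
    if (max <= 0) return;
    for (int sq = 0; sq < 64; sq++)
        tbl[sq] = tbl[sq] * cap / max;           /* pass 2: squeeze range  */
}

int main(void) {
    int knightTbl[64];
    for (int sq = 0; sq < 64; sq++) knightTbl[sq] = sq;      /* dummy values */

    int open_game = 1;                       /* e.g. decided from the opening */
    int knight_cap = open_game ? 30 : 40;    /* dynamic instead of a fixed 40 */
    rescale_table(knightTbl, knight_cap);
    printf("rescaled max knight value: %d\n", knightTbl[63]);
    return 0;
}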
gaard
Posts: 463
Joined: Mon Jun 07, 2010 3:13 am
Location: Holland, MI
Full name: Martin W

Re: Crafty and Stockfish question... Larry????

Post by gaard »

What about maximizing Elo but reporting scores using an MC approach? For assigning probabilities/centipawn scores you could take a parameter = Elo, and for larger Elo you could increase depth (accuracy?). This way, for 1500 Elo play a pawn up would not be nearly as much of an advantage as it would be if both players were 2700+. I imagine this approach would work better in different stages of the game, but I don't think you'd need a very large sample to report something that looked at least as reasonable as a conventional "score" in centipawns.
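One rough way to report something like that, sketched here with an assumed logistic form and made-up constants (not a published model): map the centipawn score to a winning probability with a curve whose steepness grows with the assumed rating, so the same +100 cp means more between 2700s than between 1500s.

Code: Select all

#include <stdio.h>
#include <math.h>

/* illustrative only: the 0.004 slope and the 1500 reference are guesses */
static double win_probability(double centipawns, double elo) {
    double k = 0.004 * (elo / 1500.0);      /* steeper curve for stronger play */
    return 1.0 / (1.0 + exp(-k * centipawns));
}

int main(void) {
    double elos[] = { 1500.0, 2200.0, 2800.0 };
    for (int i = 0; i < 3; i++)
        printf("+100 cp at Elo %4.0f -> p(win) = %.2f\n",
               elos[i], win_probability(100.0, elos[i]));
    return 0;
}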
jhaglund
Posts: 173
Joined: Sun May 11, 2008 7:43 am

Re: Crafty and Stockfish question

Post by jhaglund »

bob wrote: How can you, at any point in the tree, stop and play out several games? How much could you have searched normally while that "delay" was in progress???

This is where you would dedicate a thread to a concurrent search at the _end_ leaf, so you wouldn't stop anywhere. You would only lose 1 thread, or however many you'd dedicate to the concurrent search. If you can get to depth 15 easily with 2 threads, why not dedicate the other 6+ to the other search, to test whether the line was any good... etc.

It could even be like a queue of moves... take the best move... do x games, then go to the next move(s), repeat. The main search would feed the concurrent search, etc...
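A bare-bones sketch of that feeding arrangement with POSIX threads (the queue size, line format and "verification work" are all stand-ins): the main search pushes candidate lines onto a queue, and a dedicated thread pops them off and examines them further.

Code: Select all

#include <pthread.h>
#include <stdio.h>
#include <unistd.h>

#define QSIZE 8

static char workq[QSIZE][64];
static int head = 0, tail = 0, count = 0, done = 0;
static pthread_mutex_t lock = PTHREAD_MUTEX_INITIALIZER;
static pthread_cond_t  cond = PTHREAD_COND_INITIALIZER;

/* called by the "main search": queue a candidate line for deeper checking */
static void push_line(const char *line) {
    pthread_mutex_lock(&lock);
    while (count == QSIZE) pthread_cond_wait(&cond, &lock);
    snprintf(workq[tail], sizeof workq[tail], "%s", line);
    tail = (tail + 1) % QSIZE; count++;
    pthread_cond_broadcast(&cond);
    pthread_mutex_unlock(&lock);
}

/* the dedicated verification thread: pop lines and "search" them further */
static void *verifier(void *arg) {
    (void)arg;
    for (;;) {
        pthread_mutex_lock(&lock);
        while (count == 0 && !done) pthread_cond_wait(&cond, &lock);
        if (count == 0 && done) { pthread_mutex_unlock(&lock); break; }
        char line[64];
        snprintf(line, sizeof line, "%s", workq[head]);
        head = (head + 1) % QSIZE; count--;
        pthread_cond_broadcast(&cond);
        pthread_mutex_unlock(&lock);
        usleep(10000);                        /* stand-in for a deeper search */
        printf("verified line: %s\n", line);
    }
    return NULL;
}

int main(void) {
    pthread_t t;
    pthread_create(&t, NULL, verifier, NULL);
    const char *lines[] = { "e4 e5 Nf3", "e4 c5 Nf3", "d4 d5 c4" };
    for (int i = 0; i < 3; i++) push_line(lines[i]);     /* "main search" side */
    pthread_mutex_lock(&lock); done = 1;
    pthread_cond_broadcast(&cond); pthread_mutex_unlock(&lock);
    pthread_join(t, NULL);
    return 0;
}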
bob wrote: I eventually decided that book learning was the only real solution to help with the book.
Why not have a PGN parser to generate book learn output to book.lrn... as well as the old position.lrn ;) ?
You have most of the code already...
Last edited by jhaglund on Wed Jul 21, 2010 6:37 pm, edited 1 time in total.
jhaglund
Posts: 173
Joined: Sun May 11, 2008 7:43 am

Re: Crafty and Stockfish question

Post by jhaglund »

bob wrote: At present, we base our move selection mainly on how often the book move was played when the book was built from a GM pgn game collection. Then we use book learning to fine-tune that for the moves that appear to be bad after we actually play them in a game and then do real searches and discover that the move led us to a bad position or result.
This is where I think having a parser, or a function that just writes all PV lines within a given eval window (+/-) to a file in PGN format as it searches from the root, would help... I would prefer Crafty's own evaluation over a GM PGN method, because it can cover millions of moves to analyze, quicker.