Lazy evaluation

Discussion of chess software programming and technical issues.

Moderators: hgm, Rebel, chrisw

bob
Posts: 20943
Joined: Mon Feb 27, 2006 7:30 pm
Location: Birmingham, AL

Lazy evaluation

Post by bob »

This topic comes up from time to time, and I was curious about it, so I ran some tests, which showed a significant Elo drop when I removed lazy evaluation. I was also curious about the error rate, so I decided to "instrument" Crafty's evaluation.

case 1. Near the top of the eval, I take the material score plus a pretty big margin (knight value in the middlegame, rook value in the endgame), and if the current material score + margin < alpha, or the current material score - margin > beta, I simply return the current material score and get out. The intent here is to not do positional evaluation when down major material, which is quite common (more on how common in a bit).

case 2. Near the end of the eval, after I have done the pawn + passed pawn stuff, I test the current score +/- a margin (which is dynamic, roughly between one and two pawns), and if the current score + margin is < alpha, I avoid doing the piece scoring, or if the current score - margin is > beta, ditto.
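A rough sketch of how these two exits might look is below; the names (MaterialScore, DynamicMargin, and so on) are illustrative placeholders, not Crafty's actual code:

Code:

int Evaluate(int alpha, int beta) {
  int score = MaterialScore();                /* material, from side to move's view */
  int margin1 = EndGame() ? ROOK_VALUE : KNIGHT_VALUE;

  /* case 1: hopelessly outside the window on material alone */
  if (score + margin1 < alpha || score - margin1 > beta)
    return score;                             /* skip all positional evaluation */

  score += PawnScore() + PassedPawnScore();   /* pawn structure, passed pawns */

  /* case 2: still far outside a smaller, dynamic margin after pawn scoring */
  int margin2 = DynamicMargin();              /* roughly one to two pawns */
  if (score + margin2 < alpha || score - margin2 > beta)
    return score;                             /* skip the expensive piece scoring */

  score += PieceScore();                      /* the expensive part */
  return score;
}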

I decided to test these to see how often each one is tried, how often it succeeds, and how often it succeeds and is wrong. What I chose to do was to eliminate the lazy cutoff at the top, but set a flag that tells me whether the alpha or the beta test triggered, and then keep right on going in the eval. At the point where I might skip evaluating pieces, I did the same and set another flag telling me whether the alpha or beta test suggested the piece scoring should be skipped.

Finally, at the bottom of Evaluate(), just before returning the score, I checked each of the flags and compared the actual score to the corresponding bound. If the first lazy test had succeeded (material + margin < alpha), I verified that the actual score was indeed < alpha. Ditto for material - margin > beta, and the same two tests for the second lazy eval test as well. After searching a position, I dumped the counters. Here is a sample:

Code:

lazy cutoffs:
#1:  tried=94,909,941  <alpha_done=21,047,333  <alpha_wrong=117
#1:  tried=94,909,941  >beta_done=17,374,265  >beta_wrong=107
#2:  tried=94,909,941  <alpha_done=36,546,071  <alpha_wrong=33,954
#2:  tried=94,909,941  >beta_done=27,702,218  >beta_wrong=20,568
To read that, #1 is the first lazy eval exit, #2 is the test that skips piece scoring.

tried is obviously the number of times Evaluate() was actually called. There is one way the tried count for #2 won't match: when mate scoring is done, the rest of the piece evaluation stuff gets skipped completely. But in the above this did not happen. The "beta_done" value means the material - margin > beta test was triggered and we would have returned Material (or beta, same thing) there. The "beta_wrong" count means that when I compared the final score to beta, it was actually <= beta when the lazy cutoff test had predicted it would be > beta. Ditto for alpha, but using the alpha bound.
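In outline, the instrumented version might look something like this (again, made-up names rather than the real source; the lazy tests only set flags, the eval runs to completion, and the flags are checked against the final score at the bottom of Evaluate()):

Code:

/* the lazy tests set flags instead of returning early */
enum { NONE, BELOW_ALPHA, ABOVE_BETA };
int lazy1 = NONE;                               /* prediction from test #1 */

if (material + margin1 < alpha)      lazy1 = BELOW_ALPHA;
else if (material - margin1 > beta)  lazy1 = ABOVE_BETA;
/* ... keep evaluating; a second flag, lazy2, is set the same way ... */

/* at the bottom, just before returning score */
tried++;
if (lazy1 == BELOW_ALPHA) { alpha_done1++; if (score >= alpha) alpha_wrong1++; }
if (lazy1 == ABOVE_BETA)  { beta_done1++;  if (score <= beta)  beta_wrong1++;  }
/* the same four counters are kept for lazy2 (the piece-scoring test) */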

Now for some actual data. The first position is Kopec #22 (Bxe4), searched to depth=28 with one cpu to avoid worrying about locking the counters above and such, not to mention the non-determinism.

The second position is Win at Chess #2, the Rxb2 position. It ends up in an endgame and even gets EGTB hits. Lots of checks and such, too.

The third position is the initial starting position, searched to depth 24. It took a good bit longer than the others (bad depth choice) on my MacBook.

The fourth position is the classic Fine #70, the Kb1 position. It was searched for 2 minutes (I always forget that Crafty searches this to a forced mate after 1 minute, but I decided not to run it again and just left these numbers as is).


Code:

lazy cutoffs:
#1:  tried=232,765,181  <alpha_done=50,602,410  <alpha_wrong=485
#1:  tried=232,765,181  >beta_done=44,333,281  >beta_wrong=348
#2:  tried=232,765,181  <alpha_done=88,039,569  <alpha_wrong=86,823
#2:  tried=232,765,181  >beta_done=71,048,470  >beta_wrong=55,624


#1:  tried=216,176,727  <alpha_done=27,849,350  <alpha_wrong=1,429,110
#1:  tried=216,176,727  >beta_done=14,305,040  >beta_wrong=2,048,679
#2:  tried=215,925,197  <alpha_done=88,389,502  <alpha_wrong=328,371
#2:  tried=215,925,197  >beta_done=53,822,321  >beta_wrong=373,325


lazy cutoffs:
#1:  tried=274,505,203  <alpha_done=43,680,268  <alpha_wrong=4,364
#1:  tried=274,505,203  >beta_done=30,890,091  >beta_wrong=2,532
#2:  tried=274,505,203  <alpha_done=82,748,008  <alpha_wrong=160,751
#2:  tried=274,505,203  >beta_done=59,762,057  >beta_wrong=180,485


lazy cutoffs:
#1:  tried=137,919,807  <alpha_done=73,605,381  <alpha_wrong=549,500
#1:  tried=137,919,807  >beta_done=57,146,523  >beta_wrong=587,823
#2:  tried=132,816,452  <alpha_done=73,650,079  <alpha_wrong=14,119
#2:  tried=132,816,452  >beta_done=56,001,456  >beta_wrong=17,247
Observations:

Position number 2 suggests that the "pieceless" margin might be improved a bit, something I am going to test. Otherwise the number of failures is surprisingly low.

For performance, the difference is pretty clear. With lazy eval, 5.5M nodes per second on my MacBook (one cpu); with lazy eval disabled, 4.1M nodes per second, or about 75% as fast as with lazy eval.

Has anybody else done any similar testing?

edit:

Adjusting the margin for no pieces reduced the percentage of wrong cutoffs (#1) to something reasonable. More tuning is scheduled. It didn't seem to affect the NPS on WAC #2 at all, which was nice.
lucasart
Posts: 3232
Joined: Mon May 31, 2010 1:29 pm
Full name: lucasart

Re: Lazy evaluation

Post by lucasart »

Interesting analysis. But conclusions based on Crafty cannot be considered universal. In my engine lazy eval never worked. In Stockfish it never worked either.

Here's a guess as to why it can work in Crafty and not in DiscoCheck nor in Stockfish:
* Crafty uses fail hard, while DC and SF use fail soft. With fail soft, it's not just a problem of whether lazy eval correctly predicts a fail low or a fail high, but by how much you fail low or fail high. Fail soft is a clear gain in my experience, if you gamble correctly with fail soft scores.
* SF and DC have a relatively light eval, it's not the number one item of profiling, contrary to what's often assumed. Perhaps Crafty has too much knowledge and there's indeed a lot of speedup in lazy eval in Crafty, but relatively little in DC or SF?
Theory and practice sometimes clash. And when that happens, theory loses. Every single time.
cetormenter
Posts: 170
Joined: Sun Oct 28, 2012 9:46 pm

Re: Lazy evaluation

Post by cetormenter »

Coincidentally, I was just about to test adding a lazy evaluation to Nirvanachess, so I have taken the liberty of doing the tests you outlined. I used margins similar to yours for the cutoffs (bishop value in the opening and rook value in the ending for #1, and 2 pawns for #2).

Searching with my bench command at depth 23 (just to get a similar evaluation call count), I get these results:

Code:

#1: tried=88,242,903   <alpha_done=24,619,032   <alpha_wrong=25,950
#1: tried=88,242,903   >beta_done=6,787,080   >beta_wrong=20,705
#2: tried=88,242,903   <alpha_done=38,940,052   <alpha_wrong=592,506
#2: tried=88,242,903   >beta_done=12,304,782   >beta_wrong=442,884

For the four positions you posted, at the same depths/times (I took a guess for the second position and went to depth 26):

Code:

#1: tried=170,711,291   <alpha_done=38,206,145   <alpha_wrong=15,576
#1: tried=170,711,291   >beta_done=11,278,016   >beta_wrong=7,838
#2: tried=170,711,291   <alpha_done=68,382,007   <alpha_wrong=153,729
#2: tried=170,711,291   >beta_done=20,231,631   >beta_wrong=89,470

#1: tried=197,620,833   <alpha_done=33,253,047   <alpha_wrong=70,847
#1: tried=197,620,833   >beta_done=23,272,585   >beta_wrong=138,394
#2: tried=197,620,833   <alpha_done=55,856,917   <alpha_wrong=98,617
#2: tried=197,620,833   >beta_done=45,074,660   >beta_wrong=91,849

#1: tried=30,857,926   <alpha_done=3,401,300   <alpha_wrong=611
#1: tried=30,857,926   >beta_done=717,292   >beta_wrong=219
#2: tried=30,857,926   <alpha_done=6,956,249   <alpha_wrong=13,962
#2: tried=30,857,926   >beta_done=1,449,360   >beta_wrong=6,491

#1: tried=70,485,087   <alpha_done=17,544,993   <alpha_wrong=52,297
#1: tried=70,485,087   >beta_done=11,089,800   >beta_wrong=285,304
#2: tried=70,485,087   <alpha_done=24,560,726   <alpha_wrong=31,226
#2: tried=70,485,087   >beta_done=13,694,424   >beta_wrong=54,889
Overall the results seem to be somewhat consistent with yours. I do not get quite the increase in inaccuracy in position 2 that you do, and for some reason in Fine #70 only the beta cutoff seems to blow up.
hgm
Posts: 27795
Joined: Fri Mar 10, 2006 10:06 am
Location: Amsterdam
Full name: H G Muller

Re: Lazy evaluation

Post by hgm »

bob wrote:case 1. Near the top of the eval, I take the material score plus a pretty big margin (knight value in middle game, rook value in endgame) and if the current material score + margin < alpha, or current material score - margin > beta, I simply return the current material score and get out.
This is conceptually wrong. In fail soft you would have to return score +/- margin, and not score itself, as the latter is not a guaranteed bound. In fail hard you would return alpha or beta. In principle wrong bounds could find their way into the hash table, and be used later after the search window was changed because of an altered root score. (I don't know if that would also be the case for Crafty; when QS nodes are not hashed it might save the day.)
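In code, the distinction might look like this (a generic illustration, not any particular engine's source):

Code:

/* what a lazy exit may safely return under each convention */
if (score - margin > beta) {
    /* fail soft: score - margin is the guaranteed lower bound on the true eval */
    return score - margin;
    /* fail hard: return beta; */
}
if (score + margin < alpha) {
    /* fail soft: score + margin is the guaranteed upper bound */
    return score + margin;
    /* fail hard: return alpha; */
}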
bob
Posts: 20943
Joined: Mon Feb 27, 2006 7:30 pm
Location: Birmingham, AL

Re: Lazy evaluation

Post by bob »

lucasart wrote:Interesting analysis. But conclusions based on Crafty cannot be considered universal. In my engine lazy eval never worked. In Stockfish it never worked either.

Here's a guess as to why it can work in Crafty and not in DiscoCheck nor in Stockfish:
* Crafty uses fail hard, while DC and SF use fail soft. With fail soft, it's not just a problem of whether lazy eval correctly predicts a fail low or a fail high, but by how much you fail low or fail high. Fail soft is a clear gain in my experience, if you gamble correctly with fail soft scores.
* SF and DC have a relatively light eval, it's not the number one item of profiling, contrary to what's often assumed. Perhaps Crafty has too much knowledge and there's indeed a lot of speedup in lazy eval in Crafty, but relatively little in DC or SF?
Where do you see the benefit from fail-soft? I've played with it multiple times. The ONLY time I saw any actual benefit was when I was experimenting with mtd(f). I've never had any success trying to use the backed-up value to adjust the aspiration window on a fail high or fail low, either.

When you say "clear gain" that implies a measurable Elo gain. I don't see where it comes from, particularly when I notice that Stockfish seems to use the traditional "I failed high, increment beta by a small amount; I failed high again, increment beta by 2x the last amount; etc." approach. That's not using the fail-high / fail-low scores effectively.
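That widening scheme looks roughly like this (a generic sketch of the common pattern, not Stockfish's actual code):

Code:

/* aspiration re-search loop with exponential widening */
int delta = 16;                                 /* initial half-width, in centipawns */
int alpha = prev_score - delta;
int beta  = prev_score + delta;
while (1) {
    int score = Search(alpha, beta, depth);
    if (score <= alpha)                         /* fail low: widen downward */
        alpha -= delta;
    else if (score >= beta)                     /* fail high: widen upward */
        beta += delta;
    else
        break;                                  /* score inside the window, accept it */
    delta *= 2;                                 /* double the step each re-search */
}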
bob
Posts: 20943
Joined: Mon Feb 27, 2006 7:30 pm
Location: Birmingham, AL

Re: Lazy evaluation

Post by bob »

hgm wrote:
bob wrote:case 1. Near the top of the eval, I take the material score plus a pretty big margin (knight value in middle game, rook value in endgame) and if the current material score + margin < alpha, or current material score - margin > beta, I simply return the current material score and get out.
This is conceptually wrong. In fail soft you would have to return score +/- margin, and not score itself, as the latter is not a guaranteed bound. In fail hard you would return alpha or beta. In principle wrong bounds could find their way into the hash table, and be used later after the search window was changed because of an altered root score. (I don't know if that would also be the case for Crafty; when QS nodes are not hashed it might save the day.)
Note that I don't use fail-soft, so whether I return beta or beta + infinity, the result is the same; I do not store any value < alpha or > beta in the ttable, for example.
eric.oldre
Posts: 18
Joined: Thu Jun 19, 2014 5:07 am
Location: Minnesota, USA

Re: Lazy evaluation

Post by eric.oldre »

In the implementation I've been playing with, I assume that the majority of eval terms are relatively similar to what they were a move or two ago, when I was able to do a full evaluation. I adjust that for changes in material, pawn structure, and pcsq, because those are fast, and then compare the result against a much smaller margin (based on how many plies it has been since the last full evaluation):

fuzzyScore = currentMaterial
           + currentPawn
           + incrementallyUpdatedPcSq
           + /* (score from last full eval higher in the tree, minus material, pawn, pcsq) */;

margin = 50 * plyCountSinceLastFullEval;

if (/* passed pawns have not changed */) {
    if (fuzzyScore + margin < alpha) { return alpha; }
    if (fuzzyScore - margin > beta)  { return beta; }
}
/* do full evaluation */

So basically I'm assuming that my non-material / non-pawn / non-pcsq evaluation terms will not change by more than 50 centipawns per move if there have been no changes to passed pawns.

I'm certain this is not optimal, but I have tried with and without the above, and it did have a dramatic effect on Elo (I don't remember exactly, but IIRC it was around 50 Elo). I would be interested to hear others' thoughts.
Henk
Posts: 7218
Joined: Mon May 27, 2013 10:31 am

Re: Lazy evaluation

Post by Henk »

My normal evaluation is very lazy. It's almost impossible to make it any lazier. Maybe that's why Skipper plays badly. But the more terms you add to the evaluation, the slower the alpha-beta search (I mean Principal Variation Search) gets. Or does it generate more cutoffs?
jdart
Posts: 4366
Joined: Fri Mar 10, 2006 5:23 am
Location: http://www.arasanchess.org

Re: Lazy evaluation

Post by jdart »

I use the non-lazy evaluated score for other things besides alpha-beta. I cache the eval so that as I make my way through the search code I don't have to evaluate it more than once.

The problem with lazy eval is that it works if you always pass in a margin, but I need the score evaluated against several different margins during the search. For example, I vary the futility margin based on move count, so that margin changes during the search. You can pass in the right margin(s) and evaluate lazily each time, but IMO it is probably at least as good not to do lazy eval and to cache the eval instead. Then if you need the eval a second time, you execute no code except a memory fetch.
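A minimal sketch of the caching idea (hypothetical names, not Arasan's actual code):

Code:

/* cache the full evaluation per node so later uses (futility tests with
   different margins, pruning decisions, ...) are just a memory fetch */
#define EVAL_NONE (-32768)                  /* sentinel: "not evaluated yet" */

typedef struct Node {
    /* ... position data ... */
    int cached_eval;                        /* set to EVAL_NONE when the node is entered */
} Node;

static int CachedEvaluate(Node *node) {
    if (node->cached_eval == EVAL_NONE)
        node->cached_eval = FullEvaluate(node);   /* full, non-lazy eval, done once */
    return node->cached_eval;
}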

--Jon
mvk
Posts: 589
Joined: Tue Jun 04, 2013 10:15 pm

Re: Lazy evaluation

Post by mvk »

lucasart wrote:Interesting analysis. But conclusions based on Crafty cannot be considered universal. In my engine lazy eval never worked. In Stockfish it never worked either.
The very small number of errors in case 2 suggests to me that Crafty's piece evaluation doesn't have enough bandwidth. 1p or 2p is a very narrow band. From the description it is not clear if king safety is supposed to fall in it (I suppose not), and the same for mobility/board control (I suppose it is, but then 2p is certainly not enough). The way I see it: NPS is the wrong focus if the evaluation doesn't have room to breathe. It just results in bigger trees.