Engine testing: search vs eval

Discussion of chess software programming and technical issues.

Moderators: hgm, Dann Corbit, Harvey Williamson

Uri Blass
Posts: 10102
Joined: Thu Mar 09, 2006 12:37 am
Location: Tel-Aviv Israel

Re: Engine testing: search vs eval

Post by Uri Blass »

Richard Allbert wrote:Hi Don,

Can you give some pointers to this?

I've finally decided to start taking a disciplined approach to testing and have written a program to set up testing games using Cutechess.

The problem is knowing where to start.

For example, clean out the Eval function and tune just one parameter.

When the next parameter is tuned, it will be tuned against the first value, and so on.

The values you end up with then depend on the order in which they are introduced? This doesn't seem a good way to do things.

If you get a result where VersionA has 2000 Elo +/-10 and VersionB has 2020 +/-10, then is that effectively equal? It's unclear, as both fall within the error margin.

Can you give some tips on starting the testing, please :) ?

I've tested a group of opponents at 20s+0.2s for stability, all was ok, and I've used Bob Hyatt's openings.epd as starting positions.

Any help is appreciated. I don't have a huge cluster, unfortunately, rather 4 cores spare :)

Regards

Richard
I agree with Don, and I think that it is better to have weights that are too small rather than weights that are too big.

I even think that settling on weights that are too small, even when you know the optimal weights are bigger, may be productive for later improvement.

For example, initially, when you do not have passed pawn evaluation and you tune your piece square table, you may need a big bonus for pawns on the 6th or 7th rank.

Later, with a bonus for passed pawns (dependent on the rank of the pawn and on whether the pawn is blocked), you need a smaller bonus for pushing passed pawns in the piece square table.

If you start with the "right" piece square table, then you may have problems improving by adding other terms, because you need to reduce your initial bonus. Instead of increasing a bonus only to reduce it later, it seems to me faster to start with a bonus that is too small, in the hope that later it will no longer be too small.
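For illustration, here is a minimal C++ sketch of that interaction; the tables, names, and numbers are invented for the example, not values from any real engine.

Code:

// Before a passed pawn term exists, the pawn piece-square table has to carry
// the whole "advanced pawn" bonus, so the 6th/7th-rank entries are large.
const int pawnPstAlone[8]      = { 0, 0, 2, 5, 10, 35, 70, 0 };   // indexed by rank

// Once a separate passed pawn bonus is added (by rank, reduced if blocked),
// the table entries for advanced pawns should shrink, otherwise the same
// feature is rewarded twice.
const int pawnPstWithPasser[8] = { 0, 0, 2, 5,  8, 15, 25, 0 };
const int passedBonus[8]       = { 0, 0, 5, 10, 20, 40, 70, 0 };

int pawnScore(int rank, bool isPassed, bool isBlocked) {
    int score = pawnPstWithPasser[rank];
    if (isPassed)
        score += passedBonus[rank] / (isBlocked ? 2 : 1);
    return score;
}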
Don
Posts: 5106
Joined: Tue Apr 29, 2008 4:27 pm

Re: Engine testing: search vs eval

Post by Don »

Uri Blass wrote:
Richard Allbert wrote:Hi Don,

Can you give some pointers to this?

I've finally decided to start taking a disciplined approach to testing and have written a program to set up testing games using Cutechess.

The problem is knowing where to start.

For example, clean out the Eval function and tune just one parameter.

When the next parameter is tuned, it will be tuned against the first value, and so on.

The values you end up with then depend on the order in which they are introduced? This doesn't seem a good way to do things.

If you get a result where VersionA has 2000 Elo +/-10 and VersionB has 2020 +/-10, then is that effectively equal? It's unclear, as both fall within the error margin.

Can you give some tips on starting the testing, please :) ?

I've tested a group of opponents at 20s+0.2s for stability, all was ok, and I've used Bob Hyatt's openings.epd as starting positions.

Any help is appreciated. I don't have a huge cluster, unfortunately, rather 4 cores spare :)

Regards

Richard
I agree with Don, and I think that it is better to have weights that are too small rather than weights that are too big.

I even think that settling on weights that are too small, even when you know the optimal weights are bigger, may be productive for later improvement.

For example, initially, when you do not have passed pawn evaluation and you tune your piece square table, you may need a big bonus for pawns on the 6th or 7th rank.

Later, with a bonus for passed pawns (dependent on the rank of the pawn and on whether the pawn is blocked), you need a smaller bonus for pushing passed pawns in the piece square table.

If you start with the "right" piece square table, then you may have problems improving by adding other terms, because you need to reduce your initial bonus. Instead of increasing a bonus only to reduce it later, it seems to me faster to start with a bonus that is too small, in the hope that later it will no longer be too small.
In Komodo development, if we add something that adds further distinctions, we will generally make an offsetting adjustment. So in your example we might start with a high passed pawn bonus, but if we add blocked pawns, squares controlling the path, etc., then we will lower the passed pawn bonus so that the "average" passed pawn comes out the same, since a single passed pawn term is no longer carrying as much of the load.
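For illustration only, an offsetting adjustment of that kind might look roughly like this in C++; the numbers and term names are made up for the example, not Komodo's actual terms.

Code:

// Old: one flat term carried the whole load for a typical passed pawn.
const int OLD_PASSER_BONUS = 60;

// New: the flat part is lowered so that, with the extra distinctions added on
// top, an "average" passed pawn still scores about the same as before; only
// the extremes (blocked vs. free path) spread out around that average.
const int NEW_PASSER_BASE = 40;
const int FREE_PATH_BONUS = 30;   // promotion path not controlled by the enemy
const int BLOCKED_PENALTY = 20;

int passerScore(bool pathFree, bool blocked) {
    int s = NEW_PASSER_BASE;
    if (pathFree) s += FREE_PATH_BONUS;
    if (blocked)  s -= BLOCKED_PENALTY;
    return s;   // the average case stays near the old flat value
}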
Capital punishment would be more effective as a preventive measure if it were administered prior to the crime.
Richard Allbert
Posts: 792
Joined: Wed Jul 19, 2006 9:58 am

Re: Engine testing: search vs eval

Post by Richard Allbert »

Hi Uri

So it's better to start with a reasonable set of features implemented, but all with small weights?

Your example with the passed pawn - does that mean once you had introduced the passed pawn and tuned, you would then go back and retune the piece tables? Once this has been done, the PassedPawn would need tuning again, would it not, vs the "retuned" piece table?

This is the problem I have in deciding the best way to go about things - but starting small makes sense.

Thanks

Richard
Don
Posts: 5106
Joined: Tue Apr 29, 2008 4:27 pm

Re: Engine testing: search vs eval

Post by Don »

Richard Allbert wrote:Hi Uri

So it's better to start with a reasonable set of features implemented, but all with small weights?

Your example with the passed pawn - does that mean once you had introduced the passed pawn and tuned, you would then go back and retune the piece tables? Once this has been done, the PassedPawn would need tuning again, would it not, vs the "retuned" piece table?

This is the problem I have in deciding the best way to go about things - but starting small makes sense.

Thanks

Richard
It's impossible to retest every feature every time you make a change. What you have to do is get the big things right and revisit terms (usually in logical groups) periodically as you add more and more features.

Evaluation is a bit of a "black art" in the sense that a lot of it is trial and error and intuition driven. There is no easy way. Along the way we have been surprised a few times by something that worked much better than expected, and by some major missing feature that did not work at all. A lot of evaluation features have some redundancy too, and they interact with other features. For example, a doubled pawn is bad, right? Not necessarily, because it may also open a file for a powerful rook.

So focus on the really big things and try to get those right before you get into the fine details. Getting the value of the pieces correct (with interactions), pawn structure, and mobility - all are super critical.

These days people just copy the evaluation of some other program and work from there, but unfortunately you end up with a program that has the same strengths, weaknesses, and local optima as that other program. You just don't advance the state of the art by doing that. We really need diversity, so I applaud your efforts to engineer your own evaluation function.
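As a rough picture of what "the big things first" can mean in code, here is a bare-bones C++ evaluation skeleton - a sketch only; the helper functions and weights are placeholders you would have to write and tune yourself.

Code:

struct Position;                               // your board representation
int material(const Position&, int side);       // piece values (the largest term)
int pawnStructure(const Position&, int side);  // doubled/isolated/backward/passed pawns
int mobility(const Position&, int side);       // counts of safe squares per piece

int evaluate(const Position& pos, int sideToMove) {
    int score = 0;
    for (int side = 0; side < 2; ++side) {
        int sign = (side == sideToMove) ? 1 : -1;
        score += sign * (material(pos, side)
                       + pawnStructure(pos, side)
                       + mobility(pos, side));
    }
    return score;   // from the side to move's point of view
}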
Last edited by Don on Sun Jul 15, 2012 3:11 pm, edited 1 time in total.
Capital punishment would be more effective as a preventive measure if it were administered prior to the crime.
Richard Allbert
Posts: 792
Joined: Wed Jul 19, 2006 9:58 am

Re: Engine testing: search vs eval

Post by Richard Allbert »

The other thing I've noticed is the weights have a huge effect on the search tree - so you could end up with a suboptimal weight as far as estimating the position is concerned, but it causes a strength gain, as the search goes deeper.

Thanks again for the replies, I'll start with the three areas you recommend. I might also try CLOP for this with Cutechess.

Ciao

Richard
Don
Posts: 5106
Joined: Tue Apr 29, 2008 4:27 pm

Re: Engine testing: search vs eval

Post by Don »

Richard Allbert wrote:The other thing I've noticed is the weights have a huge effect on the search tree - so you could end up with a suboptimal weight as far as estimating the position is concerned, but it causes a strength gain, as the search goes deeper.

Thanks again for the replies, I'll start with the three areas you recommend. I might also try CLOP for this with Cutechess.

Ciao

Richard
I personally believe that for LMR and even null move pruning to work really well you need to have a very good evaluation function. And I agree with you that the tree tends to be reduced with better evaluation - it may be because good evaluation tends to improve the move ordering for LMR - not as many re-searches in LMR and so on.

Imagine not understanding weak pawns when one is about to fall. Until it does, your program may put a weak move near the front of the list, only to disrupt the move ordering later on. So my theory on this is that weaker evaluation eventually shows up in the form of bad move ordering, where a move's weakness is eventually exposed by search when it could have been known much sooner by evaluation.
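To make the re-search point concrete, here is a generic C++ sketch of how LMR typically re-searches - not any particular engine's code; Position and search() are assumed to exist elsewhere in the engine.

Code:

struct Position;
int search(Position& pos, int depth, int alpha, int beta);   // your main search

// Late, quiet moves get searched at reduced depth first. If the reduced
// search unexpectedly beats alpha, the move has to be re-searched at full
// depth. Poor move ordering - often a symptom of evaluation gaps - means
// more of these expensive re-searches.
int searchMove(Position& pos, int moveNumber, int depth,
               int alpha, int beta, bool isQuiet) {
    int score;
    if (moveNumber >= 4 && isQuiet && depth >= 3) {
        int reduction = 1;                                    // placeholder formula
        score = -search(pos, depth - 1 - reduction, -alpha - 1, -alpha);
        if (score > alpha)                                    // reduced search failed high:
            score = -search(pos, depth - 1, -beta, -alpha);   // re-search at full depth
    } else {
        score = -search(pos, depth - 1, -beta, -alpha);
    }
    return score;
}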

Sometimes people ask, "should I work on search or evaluation?" The answer is usually going to be evaluation - it's very difficult to recover from a bad evaluation with extra search. Turn off everything except material and you will notice that no depth you can reach in a reasonable amount of time will cause you to find good moves unless the position is purely tactical. Try this from the opening position, for example, and your program will probably play the first move it generates, e.g. a3, a4, h3, h4 or perhaps Nh3 - even on a 30 ply search.

Then if you add mobility, so that you have only head count (material) and mobility, the moves will drastically improve in quality, but the program will still look like a patzer - pawn weaknesses all over the place and so on.

In fact every program in existence, even the ones with really good evaluation functions, has characteristic strengths and weaknesses (personalities) which cannot be hidden by search. If your program doesn't understand some important concept, it will forever be a problem. Komodo has some bad moves that we cannot easily fix without weakening the program - a sure sign there is something "else" a bit out of balance. It's often the case that you cannot solve a problem by addressing it directly, because the reason it is doing the bad thing is not the reason you think; it's something else that is out of balance. For a trivial example, you can make a program less willing to move a certain piece to a certain square by decreasing the piece value table score for that square, but the problem might be that you undervalue the square that the piece is on now! The interactions can be a lot more complicated than this, but it illustrates the point.

Unfortunately, there is no magic here, it's just hard work. Ideally it would be nice if you could just code up evaluation features and in some automated way (such as CLOP) produce ideal weights without having to think much about it, but the problem is far more difficult than it appears to be. It seems to require a huge amount of human guidance and intelligence.
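The shape of an automated tuning loop itself is simple; the hard part is the statistics and the human judgement around it. Below is a naive C++ sketch of such a loop - playMatch is a hypothetical placeholder you would implement around a match runner such as cutechess-cli, and this is plain hill climbing, not what CLOP actually does.

Code:

#include <cstdlib>
#include <vector>

// Placeholder: play a match between the current and candidate weights and
// return the candidate's score in [0, 1]. In practice this would wrap a
// match runner such as cutechess-cli.
double playMatch(const std::vector<int>& current,
                 const std::vector<int>& candidate, int games);

// Naive one-weight-at-a-time hill climbing. Real tuners (CLOP, SPSA-style
// methods) are far more sample-efficient, but even they need large numbers
// of games and a sensible choice of which parameters to expose.
std::vector<int> tune(std::vector<int> weights, int iterations, int gamesPerTrial) {
    for (int i = 0; i < iterations; ++i) {
        std::vector<int> candidate = weights;
        std::size_t p = std::rand() % candidate.size();
        candidate[p] += (std::rand() % 2 ? 1 : -1) * 5;   // small step on one weight
        if (playMatch(weights, candidate, gamesPerTrial) > 0.5)
            weights = candidate;                           // keep the apparent winner
    }
    return weights;
}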
Capital punishment would be more effective as a preventive measure if it were administered prior to the crime.
sedicla
Posts: 178
Joined: Sat Jan 08, 2011 12:51 am
Location: USA
Full name: Alcides Schulz

Re: Engine testing: search vs eval

Post by sedicla »

Thank you all, especially Don. That answers my question, and I'll be waiting for your guide. I appreciate it.

I always have the impression that there is something wrong with my eval; as you mentioned, it has a big influence on the selected moves. I'm happy now with my search, so I'll leave it basic for now and dedicate my time to eval tuning. Eventually I'll rewrite it from scratch and methodically add components.

Thanks all again..
Richard Allbert
Posts: 792
Joined: Wed Jul 19, 2006 9:58 am

Re: Engine testing: search vs eval

Post by Richard Allbert »

Hi Lucas,

Normally one needs about 6000 games to get +/- 10 Elo - so with 5 parameters and a wide window, you'd expect well over 10,000 games as a requirement, wouldn't you?

Without CLOP you'd need, say, 6000 games for each value change, then for each piece - just four different values for each piece would quickly rise to 120k games. I understood that CLOP reduces this, but you'd still expect a lot more than 10k. Or have I misunderstood? :)
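For what it's worth, that 6000-game rule of thumb can be reproduced with a back-of-the-envelope calculation. The C++ sketch below assumes a per-game score standard deviation of about 0.4 (a typical draw rate) and uses the usual logistic Elo model; both are assumptions, not measured values.

Code:

#include <cmath>
#include <cstdio>
#include <initializer_list>

// Approximate 95% Elo error margin for a match of n games.
double eloMargin95(int games, double sigmaPerGame = 0.4) {
    double scoreMargin = 1.96 * sigmaPerGame / std::sqrt((double)games);
    double eloPerScore = 400.0 / std::log(10.0) / 0.25;   // ~695 Elo per unit of score near 50%
    return scoreMargin * eloPerScore;
}

int main() {
    // 6000 games lands in the same ballpark as the +/- 10 Elo rule of thumb above;
    // note that halving the error margin requires four times as many games.
    for (int n : { 1000, 6000, 20000, 100000 })
        std::printf("%6d games -> +/- %.1f Elo\n", n, eloMargin95(n));
    return 0;
}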

Did you just try piece values? Or anything else?

Oh, another general question - is using something like the Crafty benchmark a reliable way of adjusting the time control for different hardware?

Ciao

Richard
Don
Posts: 5106
Joined: Tue Apr 29, 2008 4:27 pm

Re: Engine testing: search vs eval

Post by Don »

Richard Allbert wrote:Hi Lucas,

Normally one needs about 6000 games to get +/- 10 Elo - so with 5 parameters and a wide window, you'd expect well over 10,000 games as a requirement, wouldn't you?

Without CLOP you'd need, say, 6000 games for each value change, then for each piece - just four different values for each piece would quickly rise to 120k games. I understood that CLOP reduces this, but you'd still expect a lot more than 10k. Or have I misunderstood? :)

Did you just try piece values? Or anything else?

Oh, another general question - is using something like the Crafty benchmark a reliable way of adjusting the time control for different hardware?

Ciao

Richard
The number of games you need is a function of how much error you are willing to accept - there is no way around the fact that you will occasionally make wrong decisions, so this is about hedging your bets: when you are wrong you don't want to be "too wrong", and you don't want to throw out too many good changes due to sample noise either.

Don't forget that, error margins notwithstanding, there is a bell-shaped curve which describes your likely error - in other words, regardless of the number of games, you are more likely to be off a little than off a lot. So part of the picture is, again, how many small regressions are you willing to accept in order to make more rapid progress? If you can make 7 improvements for every 3 regressions and they are all of equal magnitude, then you win (for instance, if each change is worth about 3 Elo either way, 7 gains and 3 regressions out of 10 changes still net roughly +12 Elo). Some common sense applies here, of course, as you don't want to keep too many regressions that might interfere with other improvements.

Sometimes Larry and I get hasty with a change that looks good but is really bad, and it shows up later when even minor changes cannot match our previous best results - making it obvious that a recent version introduced an Elo regression. That happens enough that we know we must also be accepting other minor regressions that are less obvious. There is not much you can do about that other than slowing development down to a crawl - 1 change per week or month, for example, so that each change can be super-tested right down to 1 or 2 Elo points. You won't make much progress if you are so meticulous that you are crippled by the process.
Capital punishment would be more effective as a preventive measure if it were administered prior to the crime.
Uri Blass
Posts: 10102
Joined: Thu Mar 09, 2006 12:37 am
Location: Tel-Aviv Israel

Re: Engine testing: search vs eval

Post by Uri Blass »

Don wrote:
Uri Blass wrote:
Richard Allbert wrote:Hi Don,

Can you give some pointers to this?

I've finally decided to start taking a disciplined approach to testing and have written a program to set up testing games using Cutechess.

The problem is knowing where to start.

For example, clean out the Eval function and tune just one parameter.

When the next parameter is tuned, it will be tuned against the first value, and so on.

The values you end up with then depend on the order in which they are introduced? This doesn't seem a good way to do things.

If you get a result where VersionA has 2000 Elo +/-10 and VersionB has 2020 +/-10, then is that effectively equal? It's unclear, as both fall within the error margin.

Can you give some tips on starting the testing, please :) ?

I've tested a group of opponents at 20s+0.2s for stability, all was ok, and I've used Bob Hyatt's openings.epd as starting positions.

Any help is appreciated. I don't have a huge cluster, unfortunately, rather 4 cores spare :)

Regards

Richard
I agree with Don, and I think that it is better to have weights that are too small rather than weights that are too big.

I even think that settling on weights that are too small, even when you know the optimal weights are bigger, may be productive for later improvement.

For example, initially, when you do not have passed pawn evaluation and you tune your piece square table, you may need a big bonus for pawns on the 6th or 7th rank.

Later, with a bonus for passed pawns (dependent on the rank of the pawn and on whether the pawn is blocked), you need a smaller bonus for pushing passed pawns in the piece square table.

If you start with the "right" piece square table, then you may have problems improving by adding other terms, because you need to reduce your initial bonus. Instead of increasing a bonus only to reduce it later, it seems to me faster to start with a bonus that is too small, in the hope that later it will no longer be too small.
In Komodo development, if we add something that adds further distinctions, we will generally make an offsetting adjustment. So in your example we might start with a high passed pawn bonus, but if we add blocked pawns, squares controlling the path, etc., then we will lower the passed pawn bonus so that the "average" passed pawn comes out the same, since a single passed pawn term is no longer carrying as much of the load.
There may be cases when the interaction between evaluation terms is less obvious, so I still think that it is better to start with evaluation weights that you know are too small (since the optimal value for them will probably become smaller later anyway).

For example, mobility and pawn structure.

I have not done research and I may be wrong here, but it is possible that a better pawn structure helps to increase the mobility of your pieces in most cases, because the opponent needs to defend his weak pawns.