Eval development: is it better to tune or add new terms?

Don · Post by **Don** » Mon Mar 18, 2013 1:18 pm

sedicla wrote: My engine has an Elo of about 2500. I was wondering whether I should focus more on tuning the current evaluation terms or on adding new ideas.
I am going to start a new development cycle, and I'm not sure how much I can improve by tuning current items, or when I should stop. I have a feeling that I can improve, but maybe I will waste time and not improve that much. On the other hand, if I introduce new items, I will add more variables to the process. Anyway, I would appreciate any comments on that...

What you want is not just a quantity of terms, but you need to cover any chess principles that are not already covered.

I consider too much knowledge, i.e. the "Diep" approach, a different kind of brute force - not smart. More is not "better"; quality counts more than quantity. A good chess evaluation function does require significant quantity though. When possible, try to put your knowledge in a form where more can be added without a big performance hit. For example, pawn structure is almost free, and any single new pawn feature is virtually free, because the result is cached in pawn structure hash tables. You can also do a lot with material signature hash tables.
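
To illustrate why pawn terms are nearly free, here is a minimal sketch of a pawn hash table. It assumes squares numbered 0..63 with a1 = 0, and every name in it (`pawn_key`, `PawnHashTable`, `doubled_pawn_score`, the 10-point penalty) is hypothetical and not taken from any particular engine. The expensive pawn evaluation runs only on a miss; a repeated pawn structure costs one lookup.

```python
import random

_rng = random.Random(2013)  # fixed seed so the keys are reproducible
# One random 64-bit number per (colour, square); hypothetical layout.
ZOBRIST = [[_rng.getrandbits(64) for _ in range(64)] for _ in range(2)]


def pawn_key(white_pawns, black_pawns):
    """XOR together the random numbers of all pawns on the board."""
    key = 0
    for sq in white_pawns:
        key ^= ZOBRIST[0][sq]
    for sq in black_pawns:
        key ^= ZOBRIST[1][sq]
    return key


def doubled_pawn_score(white_pawns, black_pawns, penalty=10):
    """Toy pawn evaluation from White's point of view (assumed penalty)."""
    def doubled(pawns):
        files = [sq % 8 for sq in pawns]
        return sum(files.count(f) - 1 for f in set(files))
    return penalty * (doubled(black_pawns) - doubled(white_pawns))


class PawnHashTable:
    def __init__(self, size_bits=14):
        self.mask = (1 << size_bits) - 1
        self.slots = [None] * (1 << size_bits)
        self.hits = 0
        self.misses = 0

    def probe(self, white_pawns, black_pawns, evaluate):
        """Return the cached pawn score, computing it only on a miss."""
        key = pawn_key(white_pawns, black_pawns)
        entry = self.slots[key & self.mask]
        if entry is not None and entry[0] == key:
            self.hits += 1
            return entry[1]
        self.misses += 1
        score = evaluate(white_pawns, black_pawns)
        self.slots[key & self.mask] = (key, score)  # always replace
        return score
```

A real engine would update the key incrementally as pawns move rather than recomputing it, but the caching idea is the same.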

Piece-square tables are limited because they are not dynamic. You can add knowledge for free through them, but anything beyond the most general things is better handled in a more dynamic way.

You really must cover all important chess principles to have a really strong evaluation function. Some of them (such as king safety) encompass a lot and are a black art; they can be improved forever and you will never feel that you have them right.

The tuning is critical. The conventional wisdom used to be that the weights did not matter that much as long as you were in the general ballpark - but we have found that is just not true. That wisdom came from an era when massive automated testing did not exist or was much more limited for the few that had it. We have gotten tons of Elo improvement over the years from tuning weights. It's at a point now that we cannot change ANY value without noticing a small Elo loss. There is the issue of whether you have found some local optimum or not - I won't address that here, but try to get the really big terms right. By "big" I do not necessarily mean the heaviest weights, but the terms that define the skeleton of your evaluation function: the weights of the pieces, how they change in different phases of the game and how they interact with other pieces (such as bishop pair and other things), as well as the basic pawn structure terms and mobility. Really get those right and in balance before tuning the secondary terms. By "big" I mean terms that affect every single game - the ubiquitous terms you might say.
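
One common way to let piece weights change with the phase of the game is a tapered evaluation: each term gets a middlegame and an endgame value, and the two are blended by how much material is left. The sketch below is an illustration under assumptions, not Komodo's method; the phase weights and the example values are hypothetical.

```python
# Hypothetical phase contribution of each piece type; pawns and kings count 0.
PHASE_WEIGHT = {"N": 1, "B": 1, "R": 2, "Q": 4}
MAX_PHASE = 24  # 4 knights + 4 bishops + 4 rooks * 2 + 2 queens * 4


def game_phase(piece_counts):
    """Return 24 with all pieces on the board, falling to 0 as they come off."""
    phase = sum(PHASE_WEIGHT.get(p, 0) * n for p, n in piece_counts.items())
    return min(phase, MAX_PHASE)  # promotions could push it above the maximum


def tapered(mg_value, eg_value, phase):
    """Blend a middlegame and an endgame value according to the phase."""
    return (mg_value * phase + eg_value * (MAX_PHASE - phase)) // MAX_PHASE


# Example with assumed values: a bishop worth 330 in the middlegame, 350 late.
phase = game_phase({"N": 2, "B": 2, "R": 4, "Q": 0, "P": 10})
bishop_value = tapered(330, 350, phase)
```

With this layout each tunable weight becomes a pair of numbers, which is one reason the "skeleton" terms take so much testing to get into balance.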

Clop is a pretty good tool for getting things in the right ballpark. Clop is no good if you don't have the patience to run a LOT of games - it is subject to the same rules of statistics as playing matches; you need tens of thousands of games to converge on reasonable values. We don't make very heavy use of Clop as we are good at manual tuning, but we have found it useful. When we add a new evaluation feature, it is a good tool for finding good starting values if you don't trust your own guess. We sometimes start from that and then tune manually from there, primarily with massive testing of various weights.
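
The need for tens of thousands of games follows from ordinary match statistics. The sketch below uses the standard logistic Elo formula and a normal approximation for the error margin; the function names and the win/draw/loss counts are made up for illustration.

```python
import math


def elo_from_score(score):
    """Elo difference implied by a score fraction strictly between 0 and 1."""
    return -400.0 * math.log10(1.0 / score - 1.0)


def elo_with_margin(wins, draws, losses, z=1.96):
    """Return (elo, margin), where margin is about a 95% half-width in Elo.

    Assumes independent games and a score far enough from 0 and 1 that
    score +/- z * stderr stays inside (0, 1).
    """
    n = wins + draws + losses
    s = (wins + 0.5 * draws) / n
    # Per-game variance of the result (1, 0.5 or 0) around the mean score.
    var = (wins * (1 - s) ** 2 + draws * (0.5 - s) ** 2 + losses * s ** 2) / n
    stderr = math.sqrt(var / n)
    low = elo_from_score(s - z * stderr)
    high = elo_from_score(s + z * stderr)
    return elo_from_score(s), (high - low) / 2


# Same 35% / 30% / 35% split at two sample sizes.
_, margin_1k = elo_with_margin(350, 300, 350)
_, margin_40k = elo_with_margin(14000, 12000, 14000)
```

Under these assumptions the margin is roughly 18 Elo after 1,000 games and roughly 3 Elo after 40,000, so a weight change worth a few Elo cannot be distinguished from noise in a short match.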

pilgrimdan · Post by **pilgrimdan** » Mon Mar 18, 2013 1:31 pm

good stuff Don...

as the universe is finely tuned... so is chess programming...

sedicla · Post by **sedicla** » Mon Mar 18, 2013 3:22 pm

I will release a new version soon...

velmarin wrote: I have not looked at your code enough yet, but I will.

One way to apply those ideas is to do it indirectly: instead of scoring directly (opening, endgame), accumulate an intermediate score in another variable (good_attack, for example), and then, as certain goals are reached, add that score to the (opening, endgame) sections. It's nothing new, but it seems very effective, and the settings need not be so drastic.

It is clear that mass testing of games with different parameters can refine things, but besides being expensive (in money and boredom) it is very professional, and I do not like it. It takes away the charm.

rbarreira · Post by **rbarreira** » Mon Mar 18, 2013 3:40 pm

Adding features to an evaluation function which is very mistuned has a big problem - the new feature might perform badly in testing, even if it's a good one (due to not playing well with the mistuned ones).

Don · Post by **Don** » Mon Mar 18, 2013 4:14 pm

rbarreira wrote: Adding features to an evaluation function which is very mistuned has a big problem - the new feature might perform badly in testing, even if it's a good one (due to not playing well with the mistuned ones).

Yes, evaluation in general is very hard to do well. That is why I strongly recommend getting the ubiquitous features right FIRST. The ubiquitous features are ones that occur on almost every search and have a huge impact on everything, such as the value of the pieces, mobility and common bad pawn features. If you don't have any idea about a weight, you should set it to a very conservative value. If a feature is a good one, even a value well under what it should be will help.

In most cases you can assume that feature weights do not interact, even though that is not really true. If you obsess over the interaction you will spend several orders of magnitude more energy adjusting them than is necessary. Instead, you can use an iterative approach - make a small adjustment that helps, then adjust something else, and come back later to refine it.
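
The iterative approach can be sketched as a one-weight-at-a-time search. Everything here is a stand-in: `strength` represents a match result measured over many games (which would be noisy in practice), and `toy_strength`, with its hidden optimum and mild interaction term, is a made-up function used only so the example runs.

```python
def tune_one_at_a_time(weights, strength, step=5, passes=10):
    """Nudge each weight by one small step per pass, keeping only improvements."""
    weights = dict(weights)
    best = strength(weights)
    for _ in range(passes):
        for name in list(weights):
            for delta in (step, -step):
                trial = dict(weights)
                trial[name] += delta
                score = strength(trial)
                if score > best:  # keep the small adjustment that helped
                    weights, best = trial, score
                    break
    return weights


def toy_strength(w):
    """Hypothetical objective: best at bishop=330, knight=320, mild interaction."""
    x = w["bishop"] - 330
    y = w["knight"] - 320
    return -(x * x + y * y + 0.5 * x * y)


tuned = tune_one_at_a_time({"bishop": 300, "knight": 300}, toy_strength)
```

Even though the two toy weights interact, returning to each weight on later passes is enough to settle them here; with real match data each call to `strength` would cost thousands of games.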

In some cases you KNOW for sure there is a pretty strong interaction. In those cases you want to work carefully. Here are 4 terms that need to be tuned together, but still can be tuned one at a time if you stay on the low side:

1. Bishop value
2. Knight value
3. Bishop pair
4. Doubled pawn

Komodo has more than one kind of doubled pawn but you get the picture. A very common theme in chess is BxN creating a doubled pawn and of course you definitely want to make reasonable decisions any time a bishop and knight are traded off.

You also want the pawn value to be right in comparison to the minor pieces so if that is out of balance you want to either adjust ONLY the pawns or else adjust all the pieces in proportion - especially if you went to a lot of trouble getting all that right.

Even though the bishop and knight are the most critical, you also realize that the exchange is a common theme in chess, so you want the rook value to be right in relation to the minor pieces - and of course the pawns are always a factor. So it comes down to really paying a lot of attention to the values of ALL the tokens on the board.

When you start adding evaluation you can easily screw this up. For example if you have all the pieces "just right" and then add mobility you are effectively raising the value of the mobile pieces and throwing everything out of whack. So you want to make sure that new terms of any consequence such as mobility are "centered." In Komodo, if a piece has "average" mobility it gets zero points. If it is less than average it is penalized and if it's more then it gets a bonus. You can simply subtract some normalizing offset from the mobility score so that it is not always positive.
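
A minimal sketch of such a centered term follows. The per-piece averages and weights are hypothetical placeholders, not Komodo's numbers; the point is only that the offset makes an average piece score zero, so the tuned piece values are not shifted.

```python
# Hypothetical "average" move counts and centipawns per extra move.
AVERAGE_MOBILITY = {"N": 4, "B": 6, "R": 7, "Q": 13}
MOBILITY_WEIGHT = {"N": 4, "B": 3, "R": 2, "Q": 1}


def mobility_score(piece, move_count):
    """Bonus above average mobility, penalty below it, zero at the average."""
    return MOBILITY_WEIGHT[piece] * (move_count - AVERAGE_MOBILITY[piece])
```

For example, under these assumed numbers a knight with 4 moves scores 0, one with 8 moves scores +16, and a bishop with only 2 moves scores -12.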

bob · Post by **bob** » Mon Mar 18, 2013 4:52 pm

rbarreira wrote: Adding features to an evaluation function which is very mistuned has a big problem - the new feature might perform badly in testing, even if it's a good one (due to not playing well with the mistuned ones).

There are three kinds of terms:

(1) horribly tuned. These CAN be worse than not having such a positional term at all. But not always.

(2) poorly tuned. These are generally better than not having the term at all; how much better is more or less a matter of luck.

(3) well tuned. These are what you WANT to have. But with so many terms in an eval, this is not easy to reach.

From a ton of experience, I generally discovered that it is better to add a term with a reasonable score (a reasonable guess) than not to add it at all. For example, it is better to have ANY type of bonus for a distant passer, as opposed to none. Unless you make it so big it completely breaks your eval (say a distant passer score of 2 queens, so that the engine will never actually promote the pawn since it would be worth more unpromoted).
