Question for Bob Hyatt

lkaufman · Post by **lkaufman** » Thu Jan 14, 2010 3:42 pm

First, as this is my first post here, let me introduce myself to those who don't know me. I am a chess Grandmaster, World Senior Champion in 2008, and although not a programmer myself I have been a consultant on chess programs (testing, writing opening books, assigning parameter values, proposing evaluation terms, etc.) from 1967 (MacHack) to the present (Rybka 3 and now Doch with Don Dailey).
Now for my question. I understand that you have tuned the evaluation function of Crafty very finely by playing perhaps millions of games at fast time limits. What I would like to know is to what extent do the optimum values depend on the time limit (or depth)? If there is measurable dependence, which values (individual piece values, mobility, king safety, pawn structure, etc.) go up with increased depth and which ones go down, and roughly by how much per ply? Do the values approach an asymptote at some modest level or do they appear to continue to change up to the maximum depth at which you test?
Thanks in advance for your answer.

hgm · Post by **hgm** » Thu Jan 14, 2010 5:42 pm

I would also be curious how much a gross mis-tuning (e.g. Q=850 instead of 950) would cost you in terms of Elo.

bob · Post by **bob** » Thu Jan 14, 2010 6:29 pm

lkaufman wrote:First, as this is my first post here, let me introduce myself to those who don't know me. I am a chess Grandmaster, World Senior Champion in 2008, and although not a programmer myself I have been a consultant on chess programs (testing, writing opening books, assigning parameter values, proposing evaluation terms, etc.) from 1967 (MacHack) to the present (Rybka 3 and now Doch with Don Dailey).
Now for my question. I understand that you have tuned the evaluation function of Crafty very finely by playing perhaps millions of games at fast time limits. What I would like to know is to what extent do the optimum values depend on the time limit (or depth)? If there is measurable dependence, which values (individual piece values, mobility, king safety, pawn structure, etc.) go up with increased depth and which ones go down, and roughly by how much per ply? Do the values approach an asymptote at some modest level or do they appear to continue to change up to the maximum depth at which you test?
Thanks in advance for your answer.

Good question, Larry.

And unfortunately, there is no exact answer. But here is what I have found after a ton of testing.

(1) Most evaluation terms seem to work well regardless of time control. The only thing I have found is that, if you are not careful, you can ramp some of them up too far, so that they help in very fast games but hurt in slower games. One common case is king safety. If you ramp up the scores, then you will try to drive your opponent's king into an exposed position where, at fast time controls, mistakes are made and the attack works. But when the attack fails, what did you lose to get it going? Material? Wrecked pawns? To protect against this, I try to occasionally run longer games to make sure nothing has been broken.

(2) Search can be a different animal. Extensions, reductions, etc. can behave differently at different depths, so any search changes get run thru a suite of time controls. The most common time control I use is 10 seconds + 0.1 seconds increment. On our small cluster, with 256 processors, that will play a 40,000 game match in 60-90 minutes and give us quick feedback. If you double the time control, then that goes up accordingly. I've played some 60+60 matches, but those take on the order of 2 to 4 weeks (memory fails me here for exact numbers), as these are not very common.

(3) All of my "tuning" is of a manual nature. I can automate the cluster runs so that a parameter (or parameters) vary over some set of values, but I get a 40K match for each value or combination of values. I then plot these and am always looking for something like a normal-distribution curve where the Elo peaks at one setting for a value, and then drops off on either side. I pick the value that represents that "peak". Unfortunately, not all values behave that way. Some start off low, rise to a peak, but then the Elo stays there as values go further in that direction. This is a bit trickier, and I usually re-run this test, with values at the highest point and on either side of that point, using a longer time control to see if I have found one of those "depth-sensitive" values.
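As a rough illustration of the peak-versus-plateau distinction described in (3), here is a minimal Python sketch. The function name `classify_sweep`, the `tolerance` parameter, and the sample numbers are all hypothetical and are not taken from Crafty's test scripts; the tolerance would normally be chosen to match the error bars of the matches.

```python
def classify_sweep(results, tolerance=2.0):
    """Pick the best setting from a parameter sweep and describe the curve shape.

    results:   list of (parameter_value, elo) pairs, one per match.
    tolerance: Elo difference treated as "the same" (an assumed threshold).
    """
    ordered = sorted(results)
    best_value, best_elo = max(ordered, key=lambda pair: pair[1])

    # All settings whose Elo is within `tolerance` of the best one.
    near_best = [value for value, elo in ordered if best_elo - elo <= tolerance]

    # If the Elo is still near its maximum at either end of the tested range,
    # the curve has not dropped off there: treat it as a plateau, which is the
    # case that calls for a re-test at a longer time control.
    lowest, highest = ordered[0][0], ordered[-1][0]
    shape = "plateau" if lowest in near_best or highest in near_best else "peak"
    return best_value, shape


# Hypothetical sweeps, for illustration only.
peaked = [(10, 2590), (20, 2600), (30, 2607), (40, 2601), (50, 2592)]
plateau = [(10, 2590), (20, 2600), (30, 2606), (40, 2607), (50, 2606)]

print(classify_sweep(peaked))
print(classify_sweep(plateau))
```

With the first sweep the Elo falls off on both sides of the best value; with the second it stays flat up to the largest value tested, so the setting would be a candidate for the longer-time-control check.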

I guess, in summary, that my answer will not be very precise, in that while most values seem to be depth-insensitive, some are not. And I generally rely on intuition to decide when to test at longer time controls.

If you have any specifics, feel free to ask. It might be easier if you have something specific in mind, for me to just run a test changing that particular kind of evaluation term and reporting the results...

bob · Post by **bob** » Thu Jan 14, 2010 6:30 pm

hgm wrote:I would also be curious how much a gross mis-tuning (e.g. Q=850 instead of 950) would cost you in terms of Elo.

I am not sure that is so "gross". I'll try to look back thru my test results, as we ran a lot of "material value change" experiments to settle on our current values... I think I still have that buried in a huge file of BayesElo output.

Kempelen · Post by **Kempelen** » Thu Jan 14, 2010 6:45 pm

I think move-sorting parameters and routines are very depth-sensitive, as most of the variables (killers, history, ...) are based on results and statistics of other nodes at that depth. I.e. at depth 3 the tree is very small, while at depth 12 it is very wide, and the information collected may not be useful for other nodes at that depth: the positions may have nothing in common.

In fact, it is very easy to see this: just modify the search routine to report move-sorting hit counts and you will see how they get worse at higher depths.
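One way such a report could be collected is sketched below in Python. This is only an assumption about what "hits" means: it counts, per depth, how often a beta cutoff was produced by the first move tried. The class `OrderingStats` and its methods are hypothetical names, not part of any engine mentioned in this thread.

```python
from collections import defaultdict


class OrderingStats:
    """Per-depth counters for how often the first move tried caused the cutoff."""

    def __init__(self):
        self.cutoffs = defaultdict(int)             # depth -> all beta cutoffs
        self.first_move_cutoffs = defaultdict(int)  # depth -> cutoffs on move 0

    def record_cutoff(self, depth, move_index):
        """Call from the search whenever a move at `depth` fails high."""
        self.cutoffs[depth] += 1
        if move_index == 0:
            self.first_move_cutoffs[depth] += 1

    def hit_rate(self, depth):
        """Fraction of cutoffs at `depth` that came from the first move, or None."""
        total = self.cutoffs.get(depth, 0)
        if total == 0:
            return None
        return self.first_move_cutoffs.get(depth, 0) / total


# Usage with made-up events: three cutoffs on the first move, one on the third.
stats = OrderingStats()
for index in (0, 0, 0, 2):
    stats.record_cutoff(depth=3, move_index=index)
print(stats.hit_rate(3))
```

Printing `hit_rate` for each depth after a search would give the kind of per-depth comparison described above.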

bob · Post by **bob** » Thu Jan 14, 2010 7:17 pm

hgm wrote:I would also be curious how much a gross mis-tuning (e.g. Q=850 instead of 950) would cost you in terms of Elo.

Here's my data. It is obviously for a non-current version, but it covers a range of queen values. The -nnnn in the program names is the value used for a queen.

Rank Name                  Elo    +    - games score oppo. draws
   3 Crafty-23.1R10-1050  2610    4    4 31128   53%  2586   22%
   4 Crafty-23.1R10-1075  2608    5    5 31128   53%  2586   22%
   5 Crafty-23.1R10-1100  2608    4    5 31128   53%  2586   22%
   7 Crafty-23.1R10-1025  2606    5    4 31128   53%  2586   23%
   8 Crafty-23.1R10-1150  2606    5    5 31128   53%  2586   22%
   9 Crafty-23.1R10-1125  2606    5    5 31128   53%  2586   22%
  10 Crafty-23.1R10-975   2605    4    4 31128   52%  2586   22%
  11 Crafty-23.1R10-1000  2604    4    4 31128   52%  2586   22%
  13 Crafty-23.1R10-950   2603    5    5 31128   52%  2586   22%
  17 Crafty-23.1R10-900   2597    5    5 31128   51%  2586   21%

This was an early version of 23.1. It was also an earlier version of the test with just 31128 games per match, so a little less accuracy.

I did rerun this a couple of times to see if the results were consistent, which they were. Note that varying the queen value from 900 to 1100 gives a 10 Elo swing. Changing it by just 100 is not a big deal: you are really talking about Q plus a pawn or two, or just pure queen, and that doesn't make much difference in the games. How often do you see KRR vs KQP (with other material)? This would affect which side you think is better...
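For readers who want to check the table above, here is a small Python sketch that reads the swing off the posted numbers and asks which settings fall clearly below the best one. Comparing the "+" and "-" columns directly like this is only a rough overlap check and an assumption of this sketch, not a proper significance test.

```python
# queen value -> (Elo, "+" margin, "-" margin), copied from the table above.
results = {
    1050: (2610, 4, 4),
    1075: (2608, 5, 5),
    1100: (2608, 4, 5),
    1025: (2606, 5, 4),
    1150: (2606, 5, 5),
    1125: (2606, 5, 5),
    975: (2605, 4, 4),
    1000: (2604, 4, 4),
    950: (2603, 5, 5),
    900: (2597, 5, 5),
}

# Best setting and the total Elo spread across all tested queen values.
best_value = max(results, key=lambda q: results[q][0])
best_elo, _, best_minus = results[best_value]
worst_elo = min(elo for elo, _, _ in results.values())
spread = best_elo - worst_elo

# A setting counts as "separable" from the best one if its upper bound
# (Elo + margin) is still below the best setting's lower bound (Elo - margin).
separable = sorted(
    q for q, (elo, plus, _) in results.items()
    if elo + plus < best_elo - best_minus
)

print(best_value, spread, separable)
```

On these numbers only Q=900 falls outside the error bars of the best setting; everything from 950 to 1150 overlaps with it.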

bob · Post by **bob** » Thu Jan 14, 2010 7:20 pm

Kempelen wrote:I think move-sorting parameters and routines are very depth-sensitive, as most of the variables (killers, history, ...) are based on results and statistics of other nodes at that depth. I.e. at depth 3 the tree is very small, while at depth 12 it is very wide, and the information collected may not be useful for other nodes at that depth: the positions may have nothing in common.

In fact, it is very easy to see this: just modify the search routine to report move-sorting hit counts and you will see how they get worse at higher depths.

That's why search changes are harder for me to evaluate, since they need to be tested at both fast and long time controls. But evaluation changes are, for the most part, less sensitive to depth, based on the testing I have done. Not 100% are depth-insensitive, but the majority are. And so long as you are not changing values that tend to modify tactics (king safety, passed pawn scores) the depth seems to have little effect.

hgm · Post by **hgm** » Thu Jan 14, 2010 8:34 pm

bob wrote:I did rerun this a couple of times to see if the results were consistent, which they were. Note that varying the queen value from 900 to 1100 gives a 10 Elo swing. Changing it by just 100 is not a big deal: you are really talking about Q plus a pawn or two, or just pure queen, and that doesn't make much difference in the games. How often do you see KRR vs KQP (with other material)? This would affect which side you think is better...

OK, thanks very much. This is very revealing data. I suspected that the effect would not be big, but I never imagined that it would be this small: being a full Pawn off, and still only losing 10 Elo points. If this is a normal parabolic optimum, that would mean that being off by 0.1 Pawn would cost you 100 times less, i.e. only about 0.1 Elo.

I guess some other piece values (e.g. the Knight-Bishop difference) have a much higher impact, if only because a 1 Pawn difference is a much larger fraction of the value there.
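The parabolic estimate can be written out as a tiny Python sketch. It assumes, as the post does, that the Elo loss grows with the square of the distance from the optimum and that a full Pawn of error costs about 10 Elo; the function name `parabolic_loss` is made up for this illustration.

```python
def parabolic_loss(error_in_pawns, loss_at_one_pawn=10.0):
    """Elo lost at a given distance from the optimum, assuming loss = k * error**2."""
    return loss_at_one_pawn * error_in_pawns ** 2


# One Pawn off costs 10 Elo by construction; a tenth of a Pawn costs
# (0.1)**2 = 1/100 of that.
print(parabolic_loss(1.0))
print(parabolic_loss(0.1))
```

Under that assumption a tenth of the error gives a hundredth of the loss, which is where the figure of about 0.1 Elo comes from.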

diep · Post by **diep** » Thu Jan 14, 2010 8:41 pm

lkaufman wrote:First, as this is my first post here, let me introduce myself to those who don't know me. I am a chess Grandmaster, World Senior Champion in 2008, and although not a programmer myself I have been a consultant on chess programs (testing, writing opening books, assigning parameter values, proposing evaluation terms, etc.) from 1967 (MacHack) to the present (Rybka 3 and now Doch with Don Dailey).
Now for my question. I understand that you have tuned the evaluation function of Crafty very finely by playing perhaps millions of games at fast time limits. What I would like to know is to what extent do the optimum values depend on the time limit (or depth)? If there is measurable dependence, which values (individual piece values, mobility, king safety, pawn structure, etc.) go up with increased depth and which ones go down, and roughly by how much per ply? Do the values approach an asymptote at some modest level or do they appear to continue to change up to the maximum depth at which you test?
Thanks in advance for your answer.

As chess programs have grown from their 1960s/70s incarnations to today's level of knowledge, the piece values seem to have gone up.

One of the first to discover this was Chrilly Donninger. Back in 1998 he had therefore already set the piece values at:

pawn = 100
knight = 420
bishop = 420
rook = 620
queen = 1250 - 1300

Note that in those days not a single engine used the Max Euwe values anymore, let alone the small modification Fischer later made to those values. Those had already been refuted long before that.

A queen in general is stronger than 2 rooks, and definitely not weaker.
Only in a few exceptions are 2 rooks stronger;
by accident, chess literature describes only those exceptions.

There is a similar refutation of queen+knight being stronger than queen+bishop. Another bad factorisation by human chess players.

I guess it was around 2004 that I noticed that Fruit's biggest problem was badly wrong tuning, especially of its material values, and reported that to Fabien.

Fabien then started some sort of massive tuning project and produced Fruit 2.1, which has the values:

pawn 100
knight 406
bishop 406
rook 625
queen 1250

In today's chess software we still find the Donninger values.

The latest Stockfish, 1.6.2, for example has the values below (each comment gives the decimal value and the value in pawns):

const Value PawnValueMidgame = Value(0x0C6);   // 198  = 1.00 pawns
const Value KnightValueMidgame = Value(0x331); // 817  = 4.126 pawns
const Value BishopValueMidgame = Value(0x344); // 836  = 4.22 pawns
const Value RookValueMidgame = Value(0x4F6);   // 1270 = 6.41 pawns
const Value QueenValueMidgame = Value(0x9D9);  // 2521 = 12.73 pawns

So that's very close again to the Donninger values.
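The comparison is easy to reproduce. The following Python sketch normalizes both sets of values quoted in this post to pawn = 1.00; for the Donninger queen value it uses 1250, the lower end of the 1250 - 1300 range given above, which is a choice made for this sketch. The helper name `in_pawns` is hypothetical.

```python
# Midgame values as quoted above for Stockfish 1.6.2 (hex constants).
stockfish_midgame = {
    "pawn": 0x0C6, "knight": 0x331, "bishop": 0x344, "rook": 0x4F6, "queen": 0x9D9,
}

# Donninger values as quoted above; queen taken at the low end of 1250 - 1300.
donninger = {
    "pawn": 100, "knight": 420, "bishop": 420, "rook": 620, "queen": 1250,
}


def in_pawns(values):
    """Express every piece value as a multiple of the pawn value, to 2 decimals."""
    pawn = values["pawn"]
    return {piece: round(value / pawn, 2) for piece, value in values.items()}


print(in_pawns(stockfish_midgame))
print(in_pawns(donninger))
```

Once both are on the same scale, the piece-by-piece differences (for example the bishop-knight gap discussed next) can be read off directly.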

The interesting thing in the Stockfish values is that a bishop is valued higher than a knight. This can probably be explained by Stockfish not having much chess knowledge about bishops and knights. In general, chess players overvalue the bishop. For me, as someone who has played his entire life for owning the bishops, it was of course a big shock to find out that most of today's programs, in case of doubt, always prefer a knight;
this can be explained not by objective analysis but by subjective ones.
edit: the Stockfish difference between bishop and knight is really small, so it definitely picks a centralized knight over a bishop.

Chess players are really good at knowing when a bishop is stronger and when a knight is doing fine. Chess programs are of course a lot more stupid; they have no clue, so it is safer for them to pick the knight, since for a chess program evaluating whether a knight is strong is a lot simpler than evaluating whether a bishop is bad.

Kasparov's games in the 80s already clearly show that Kasparov in a lot of cases prefers a knight over a bishop in positions where the 'average western' chess player prefers a bishop. A good example is Kasparov-Anand.

So I would argue that the chess knowledge a chess program possesses has a far bigger impact on the values than the search depth or other factors.

But generally speaking, the Donninger values still hold true for most of today's engines.

In Diep of course material is a lot more complex, so it is not easy to give the values; they're a lot lower than all this, thanks to all kinds of material rules that I introduced at the start of the 21st century, not long after the world champs 2000. Recently those needed a lot of bugfixing and still do.

But grosso modo here are the values:

{ 1000, 3875, 3875, 6175, 12350 }, /* 0 */

Please realize that over the past years those values only got *higher*.

I started at piece = 3.5 somewhere in the 90s.

That doesn't mean that in the 90s the 3.5 was correct; on the contrary, it just means we're now learning better what the values ought to be.

Because of other rules, a piece is effectively worth 5.175 pawns as of now.

Thanks,
Vincent

bob · Post by **bob** » Thu Jan 14, 2010 8:59 pm

hgm wrote:
bob wrote:I did rerun this a couple of times to see if the results were consistent, which they were. Note that varying the queen value from 900 to 1100 gives a 10 Elo swing. Changing it by just 100 is not a big deal: you are really talking about Q plus a pawn or two, or just pure queen, and that doesn't make much difference in the games. How often do you see KRR vs KQP (with other material)? This would affect which side you think is better...
OK, thanks very much. This is very revealing data. I suspected that the effect would not be big, but I never imagined that it would be this small: being a full Pawn off, and still only losing 10 Elo points. If this is a normal parabolic optimum, that would mean that being off by 0.1 Pawn would cost you 100 times less, i.e. only about 0.1 Elo.

I guess some other piece values (e.g. the Knight-Bishop difference) have a much higher impact, if only because a 1 Pawn difference is a much larger fraction of the value there.

I can probably dig up some of those as well. Will try to find a knight or bishop change and post that...

BTW, anyone that takes these results at face value is an idiot.

Remember, I have the infamous "bad trade" code in Crafty to avoid dealing with giving up two minors for a rook and pawn, or a minor for 3 pawns. I suspect the optimal value for pieces will end up being unique for each program, and that they will likely change over time as other parts of the program are modified.
