Lyudmil Tsvetkov wrote:for pawns:
- 3rd rank gets minimally higher values than 2nd rank
- 4th rank gets minimally higher values than 3rd rank
- 5th rank gets much higher values than 4th rank
- 6th rank gets significantly higher values than 5th rank

Here is a nice and nasty side effect. PST's scores interact with other eval terms, for instance mobilty, consider the diagram.

[d]rnbqkbnr/pppppppp/8/8/P7/8/1PPPPPPP/RNBQKBNR b KQkq a3[/d]

The move 1.a4 gets a too high score because mobility gains 2 squares (a2 and a3) for the rook. Things get worse if we follow your advice for 2.a5 (a much higher value than a4 you say). I remember my program favoring the a4 move for the wrong reason (mobility) and my fix was to favor the a2/a3 squares above a4 and a5 during the opening phase.

Ed,

One idea I've been toying around with is to use the PSQT values to help score the mobility. Currently I calculate the "safe" mobility for each piece and use that to access a non-linear penalty/bonus array. I define safe mobility as any square that a piece can move to that is not attacked by a pawn or piece of lesser value, (enemy pieces that are attacked of equal or greater value are also considered safe). At the moment I give a large penalty for zero or low mobility and an increasing bonus once the mobility passes a predetermined threshold. The idea I've been considering is to base the bonus on the value of the PSQT (appropriately scaled) for each of the safe destination squares. The idea would be that all safe squares are not equal and we should score them differently. For example, a knight on A1 with a safe mobility of 1 or 2 should probably get a worst score then a knight on D4 with a safe mobility of 1 or 2.

Here's a great rule of thumb for improving yor engines: *never* listen to lyudmil. It's like these "what would Jesus do?" ads on americain TV, but reversed. What would Lyudmil not do? Just by asking myself this question I find regular improvements in DiscoCheck

Theory and practice sometimes clash. And when that happens, theory loses. Every single time.

Lyudmil Tsvetkov wrote:for pawns:
- 3rd rank gets minimally higher values than 2nd rank
- 4th rank gets minimally higher values than 3rd rank
- 5th rank gets much higher values than 4th rank
- 6th rank gets significantly higher values than 5th rank

Here is a nice and nasty side effect. PST's scores interact with other eval terms, for instance mobilty, consider the diagram.

[d]rnbqkbnr/pppppppp/8/8/P7/8/1PPPPPPP/RNBQKBNR b KQkq a3[/d]

The move 1.a4 gets a too high score because mobility gains 2 squares (a2 and a3) for the rook. Things get worse if we follow your advice for 2.a5 (a much higher value than a4 you say). I remember my program favoring the a4 move for the wrong reason (mobility) and my fix was to favor the a2/a3 squares above a4 and a5 during the opening phase.

Ed,

One idea I've been toying around with is to use the PSQT values to help score the mobility. Currently I calculate the "safe" mobility for each piece and use that to access a non-linear penalty/bonus array. I define safe mobility as any square that a piece can move to that is not attacked by a pawn or piece of lesser value, (enemy pieces that are attacked of equal or greater value are also considered safe). At the moment I give a large penalty for zero or low mobility and an increasing bonus once the mobility passes a predetermined threshold. The idea I've been considering is to base the bonus on the value of the PSQT (appropriately scaled) for each of the safe destination squares. The idea would be that all safe squares are not equal and we should score them differently. For example, a knight on A1 with a safe mobility of 1 or 2 should probably get a worst score then a knight on D4 with a safe mobility of 1 or 2.

--tom

Yes, I do exactly the same except that I have 2 separate PST's (w/b) for that purpose. Especially the knight is sensitive to safe squares.

lucasart wrote:Here's a great rule of thumb for improving yor engines: *never* listen to lyudmil. It's like these "what would Jesus do?" ads on americain TV, but reversed. What would Lyudmil not do? Just by asking myself this question I find regular improvements in DiscoCheck

I have had relatively poor results from CLOP. I think the idea is sound and I am tuning a few parameters at a time and have run a lot of CLOP test games per test (40k or more). But I always verify any changes suggested by CLOP by running a version of my program with the final result values through my standard test gauntlet (35k games or so currently). And fairly often the changes fail this verification test.

jdart wrote:I have had relatively poor results from CLOP. I think the idea is sound and I am tuning a few parameters at a time and have run a lot of CLOP test games per test (40k or more). But I always verify any changes suggested by CLOP by running a version of my program with the final result values through my standard test gauntlet (35k games or so currently). And fairly often the changes fail this verification test.

--Jon

Those 40k games should not go to waste easily, if the values suggested by CLOP interface does not perform well
after verification in your normal test environment, you may try to extract some information in xxx.dat file
generated from CLOP run. This file contains w/d/l results of every possible combination of parameter values
that are tried by CLOP. You can even verify if CLOP suggested parameter value combination is the best in terms
of score percentage and number of games. For games below 65k I use excel to process this data.

AlvaroBegue wrote:A different approach is to try to make the evaluation function be a good predictor of game result, without playing games. (Of course I would play many games after tuning to verify the new weights are better than the previous ones.)

The basic mechanism would be something like this:
(1) Get a database of [~1 million?] positions with associated results.
(2) From each position, run quiescence search and extract the position that the evaluation ultimately came from; in the process, write a new database of quiescent positions with associated results.
(3) Define the probability of winning as sigmoid(C*evaluation), where sigmoid(x):=1/(1+exp(-x)) and the constant C is chosen so the evaluation has the usual scale in pawns (I got C=0.58 or something like that, but I am quoting from memory).
(4) Use non-linear regression to estimate the parameters of the evaluation function that maximize the [log-]likelihood of the results.

One needs to do something to handle draws, but probably treating them as half a victory and half a loss would be fine.

Notice that if your evaluation function is a linear combination of terms and you are trying to figure out the coefficients, step (4) is logistic regression.

I have only done small-scale tests with this idea, but the Junior team seems to have used it extensively, as described in this paper: http://www.ratio.huji.ac.il/node/2362 . They seem to handle draws in a complicated way, but other than that I think their ideas are similar to mine (I haven't read the paper in a while).

Since about one month ago, I have been using something similar to tune the evaluation function in my chess engine texel. Here is a description of the method.

Method

Take 64000 games played at a fast time control (such as 1s+0.08s/move) between the current and/or previous versions of the engine. Extract all positions from those games, except positions within the opening book and positions where the engine found a mate score during the game. This typically gives about 8.8 million positions which are stored as FEN strings in a text file.

E = 1/N * sum(i=1,N, (result[i] - Sigmoid(qScore(pos[i])))^2)

where:

N is the number of test positions.

result is the result of the game corresponding to position i. 0 for black win, 0.5 for draw and 1 for white win.

qScore(pos) is the value returned by the chess engine quiescence function. The algorithm assumes the qScore function is deterministic. If transposition tables or the history heuristic is used in the qScore function this may not be the case.

Sigmoid(s) = 1 / (1 + 10^(-K*s/400))

K is a scaling constant.

Compute the K that minimizes E. K is never changed again by the algorithm.

If needed, refactor the source code so that the qScore function depends on a set of evaluation function parameters w[j], 1<=j<=M.

The average error E is now a function of the M parameters. Find parameter values such that E is a local minimum in parameter space. The exact optimization method is not that important. It could for example use local search varying one parameter at a time, the Gauss-Newton method, the conjugate gradient method, or a hybrid approach mixing those methods.

If the evaluation function parameters are all of integer type with finite range (which they typically are in a chess engine), the local search method is guaranteed to eventually terminate since there is only a finite number of elements in the parameter space. Also in this case the conventional gradient isn't mathematically defined, but you can use difference quotients instead of the "real" gradient in the GN and CG methods.

Advantages

One big advantage of this method is that it can simultaneously optimize several hundreds of evaluation function parameters in a reasonable amount of time. The first step that collects a set of games does not take any extra time, because those games have already been played when previous changes to the engine were tested. The step that converts from PGN to FEN only takes a few minutes. The time consuming step is to compute the local minimum. In my engine M is currently around 400 and computing the gradient takes around 25 minutes on a 16-core Dell T620 computer. A local minimum can usually be computed within 6 hours, faster if only small changes to the parameter values are needed.

While 6 hours may seem like a lot, consider how long it would take CLOP to simultaneously optimize 400 parameters (assuming you have enough memory to actually do that). I have not worked out the math but I guess it would take at least ten years to get reliable results.

Another advantage is that no external source of knowledge is needed. You don't need to find a large set of games played by engines or humans that are stronger than the engine you want to improve. While this may not be a practical advantage for the current strength of my engine, I still find this property theoretically pleasing, somewhat similar to the concept of self calibration.

A third advantage is that the need for different evaluation terms to be "orthogonal" disappears, since the optimization will automatically deal with dependencies between evaluation terms.

A fourth advantage, compared to the method used by Amir Ban to tune the Junior evaluation function, is that my method does not need the implementation of a "drawishness" function in the chess engine.

Concerns

My biggest concern about this method is what is called Correlation does not imply causation in statistics. I don't have a specific example for chess, but the Wikipedia example about ice cream sales and drowning accidents is quite illuminating. It is a known fact that when ice cream sales increase, so does drowning accidents, but that does not imply that stopping ice cream sales would reduce the number of drowning accidents. It is conceivable that the chess engine would learn to appreciate positional features that are correlated to increased winning chances, even though actively striving to reach such positions does not increase winning chances. Anyway the method does work in my engine, possibly because of the counter argument that "Correlation is not causation but it sure is a hint".

Another concern is that the chess engine will not be able to learn things that it does not understand at all. For example, assume the chess engine did not know how to win KBNK endings, so all games leading to this endgame would end in a draw. Because of this E would tend to be smaller if the evaluation functions scored all KBNK endings as draws, so there is a risk the evaluation actually gets worse after optimization.

On the other hand, the algorithm can be expected to improve the engine's knowledge about things that it partially already knows, or knowledge about things that are good even if you don't know it. For example, assume the engine does not know that the bishop pair deserves a bonus. However, if the engine has general knowledge about mobility it is likely that it will win more than it loses anyway when it has the bishop pair advantage. Therefore the engine will be able to learn that the bishop pair deserves a bonus. If the algorithm is then repeated with a new set of games, the engine might learn that the bishop pair is even better when it knows that it should not unnecessarily trade a bishop.

It may also be argued that since more than one position is picked from each game (in fact about 140 positions per game on average), the result is invalid because there is a dependence between terms in the summation that defines E, but least squares theory in general assumes that different data points are independent. The way I see it, E has contributions from 64000 independent events and the fact that there are 140 times more terms in the summation is similar to what would happen if you solved a weighted least squares problem where the weights are determined by how often a particular type of position is present in typical games.

While it would probably be better (or at least equally good) to take one position each from 8.8 million different games, obtaining that many games would require more computation time than I am willing to spend. I believe 64000 independent events is more than enough to estimate 400 parameters anyway.

Results

When I started using this method, I calculated that K=1.13 best matched the then current evaluation function in my engine. I have not changed the value of K since then and do not intend to change it in the future either, since it is just an arbitrary scaling constant.

Since 2014-01-01 when the first evaluation change based on this algorithm was included in my engine, the elo improvements caused by evaluation weight tuning (measured using 32000 games at 1+0.08) have been:

The 24.6 improvement was when I made most of the evaluation parameters accessible to the optimization algorithm. The next three improvements came when I in three stages exposed more evaluation parameters to the algorithm. The 12.8 improvement came when I re-created the set of test positions based on the most recently played games.

The 39.4 improvement came when I changed the criteria for which positions to include in the test set. Initially I removed also all positions where the q-search score deviated too much from the search score in the actual game (which conveniently is saved by cutechess-cli in PGN comments). I believed that including those positions would just raise the "noise level" of the data and cause a worse solution to be found. Apparently this is not the case. I now believe that even though including these positions causes noise, the q-search function has to deal with them all the time in real games, so trying to learn how those positions should be evaluated on average is still beneficial.

The 10.2 improvement came when I added separate queen piece square tables for middle game and end game. Previously texel only had one queen PST, which had horizontal, vertical and diagonal symmetry.

The last improvement was made after the texel 1.03 release. The others are included in the 1.03 release and I believe they are responsible for most of the strength increase in texel 1.03.

I don't know if the large strength increase was made possible because the method is good or because the texel evaluation function was particularly bad before I started using this algorithm. Anyway I don't expect to be able to get equally big improvements in the future as I got in texel 1.03, but I hope the algorithm will nevertheless help finding a fair amount of smaller evaluation improvements.

Future improvements

Currently local search is used to find a local optimum. I believe it would be faster to initially use a few Gauss-Newton iterations and then switch to local search when the remaining corrections are small.

A number of evaluation parameters have values that appear quite unintuitive. For example, the value of a white queen on b7/g7 (and a black queen on b2/g2 because of symmetry) in the middle game has a value of -128cp. I have not investigated the cause yet, but I believe the explanation is that almost always when a white/black queen is on b7/b2 in the early middle game, it is because it has just eaten the "poisoned pawn", and the opponent is about to win back a pawn by playing Rb1 and Rxb7. If placing the queen on b7/g7/b2/g2 for other reasons is much less common, the optimization algorithm will assign a large penalty for a queen on these squares. While this improves the evaluation function in the average case, it will also cause large evaluation errors for the uncommon cases that do not fall under the "poisoned pawn" category. Implementing a more specific rule about poisoned pawns could be profitable.

Other examples of odd parameter values include a white king on b8 which in the middle game gives white a 200cp bonus, a white knight on a8 in the middle game which gives white a 223cp penalty, the pawn race bonus which is only 157cp even though it is supposed to trigger only when one side can promote a queen and the opponent can not.

Investigating the cause for these and other odd looking parameter values may suggest further improvements to the evaluation function.

Another interesting approach is to apply data mining techniques on the residuals to try to discover missing knowledge in the evaluation function. I don't know how successful this approach will be. It may be more profitable to test different variants of already known chess evaluation principles instead.

AlvaroBegue wrote:A different approach is to try to make the evaluation function be a good predictor of game result, without playing games. (Of course I would play many games after tuning to verify the new weights are better than the previous ones.)

The basic mechanism would be something like this:
(1) Get a database of [~1 million?] positions with associated results.
(2) From each position, run quiescence search and extract the position that the evaluation ultimately came from; in the process, write a new database of quiescent positions with associated results.
(3) Define the probability of winning as sigmoid(C*evaluation), where sigmoid(x):=1/(1+exp(-x)) and the constant C is chosen so the evaluation has the usual scale in pawns (I got C=0.58 or something like that, but I am quoting from memory).
(4) Use non-linear regression to estimate the parameters of the evaluation function that maximize the [log-]likelihood of the results.

One needs to do something to handle draws, but probably treating them as half a victory and half a loss would be fine.

Notice that if your evaluation function is a linear combination of terms and you are trying to figure out the coefficients, step (4) is logistic regression.

I have only done small-scale tests with this idea, but the Junior team seems to have used it extensively, as described in this paper: http://www.ratio.huji.ac.il/node/2362 . They seem to handle draws in a complicated way, but other than that I think their ideas are similar to mine (I haven't read the paper in a while).

Since about one month ago, I have been using something similar to tune the evaluation function in my chess engine texel. Here is a description of the method.

Method

Take 64000 games played at a fast time control (such as 1s+0.08s/move) between the current and/or previous versions of the engine. Extract all positions from those games, except positions within the opening book and positions where the engine found a mate score during the game. This typically gives about 8.8 million positions which are stored as FEN strings in a text file.

E = 1/N * sum(i=1,N, (result[i] - Sigmoid(qScore(pos[i])))^2)

where:

N is the number of test positions.

result is the result of the game corresponding to position i. 0 for black win, 0.5 for draw and 1 for white win.

qScore(pos) is the value returned by the chess engine quiescence function. The algorithm assumes the qScore function is deterministic. If transposition tables or the history heuristic is used in the qScore function this may not be the case.

Sigmoid(s) = 1 / (1 + 10^(-K*s/400))

K is a scaling constant.

Compute the K that minimizes E. K is never changed again by the algorithm.

If needed, refactor the source code so that the qScore function depends on a set of evaluation function parameters w[j], 1<=j<=M.

The average error E is now a function of the M parameters. Find parameter values such that E is a local minimum in parameter space. The exact optimization method is not that important. It could for example use local search varying one parameter at a time, the Gauss-Newton method, the conjugate gradient method, or a hybrid approach mixing those methods.

If the evaluation function parameters are all of integer type with finite range (which they typically are in a chess engine), the local search method is guaranteed to eventually terminate since there is only a finite number of elements in the parameter space. Also in this case the conventional gradient isn't mathematically defined, but you can use difference quotients instead of the "real" gradient in the GN and CG methods.

Advantages

One big advantage of this method is that it can simultaneously optimize several hundreds of evaluation function parameters in a reasonable amount of time. The first step that collects a set of games does not take any extra time, because those games have already been played when previous changes to the engine were tested. The step that converts from PGN to FEN only takes a few minutes. The time consuming step is to compute the local minimum. In my engine M is currently around 400 and computing the gradient takes around 25 minutes on a 16-core Dell T620 computer. A local minimum can usually be computed within 6 hours, faster if only small changes to the parameter values are needed.

While 6 hours may seem like a lot, consider how long it would take CLOP to simultaneously optimize 400 parameters (assuming you have enough memory to actually do that). I have not worked out the math but I guess it would take at least ten years to get reliable results.

Another advantage is that no external source of knowledge is needed. You don't need to find a large set of games played by engines or humans that are stronger than the engine you want to improve. While this may not be a practical advantage for the current strength of my engine, I still find this property theoretically pleasing, somewhat similar to the concept of self calibration.

A third advantage is that the need for different evaluation terms to be "orthogonal" disappears, since the optimization will automatically deal with dependencies between evaluation terms.

A fourth advantage, compared to the method used by Amir Ban to tune the Junior evaluation function, is that my method does not need the implementation of a "drawishness" function in the chess engine.

Concerns

My biggest concern about this method is what is called Correlation does not imply causation in statistics. I don't have a specific example for chess, but the Wikipedia example about ice cream sales and drowning accidents is quite illuminating. It is a known fact that when ice cream sales increase, so does drowning accidents, but that does not imply that stopping ice cream sales would reduce the number of drowning accidents. It is conceivable that the chess engine would learn to appreciate positional features that are correlated to increased winning chances, even though actively striving to reach such positions does not increase winning chances. Anyway the method does work in my engine, possibly because of the counter argument that "Correlation is not causation but it sure is a hint".

Another concern is that the chess engine will not be able to learn things that it does not understand at all. For example, assume the chess engine did not know how to win KBNK endings, so all games leading to this endgame would end in a draw. Because of this E would tend to be smaller if the evaluation functions scored all KBNK endings as draws, so there is a risk the evaluation actually gets worse after optimization.

On the other hand, the algorithm can be expected to improve the engine's knowledge about things that it partially already knows, or knowledge about things that are good even if you don't know it. For example, assume the engine does not know that the bishop pair deserves a bonus. However, if the engine has general knowledge about mobility it is likely that it will win more than it loses anyway when it has the bishop pair advantage. Therefore the engine will be able to learn that the bishop pair deserves a bonus. If the algorithm is then repeated with a new set of games, the engine might learn that the bishop pair is even better when it knows that it should not unnecessarily trade a bishop.

It may also be argued that since more than one position is picked from each game (in fact about 140 positions per game on average), the result is invalid because there is a dependence between terms in the summation that defines E, but least squares theory in general assumes that different data points are independent. The way I see it, E has contributions from 64000 independent events and the fact that there are 140 times more terms in the summation is similar to what would happen if you solved a weighted least squares problem where the weights are determined by how often a particular type of position is present in typical games.

While it would probably be better (or at least equally good) to take one position each from 8.8 million different games, obtaining that many games would require more computation time than I am willing to spend. I believe 64000 independent events is more than enough to estimate 400 parameters anyway.

Results

When I started using this method, I calculated that K=1.13 best matched the then current evaluation function in my engine. I have not changed the value of K since then and do not intend to change it in the future either, since it is just an arbitrary scaling constant.

Since 2014-01-01 when the first evaluation change based on this algorithm was included in my engine, the elo improvements caused by evaluation weight tuning (measured using 32000 games at 1+0.08) have been:

The 24.6 improvement was when I made most of the evaluation parameters accessible to the optimization algorithm. The next three improvements came when I in three stages exposed more evaluation parameters to the algorithm. The 12.8 improvement came when I re-created the set of test positions based on the most recently played games.

The 39.4 improvement came when I changed the criteria for which positions to include in the test set. Initially I removed also all positions where the q-search score deviated too much from the search score in the actual game (which conveniently is saved by cutechess-cli in PGN comments). I believed that including those positions would just raise the "noise level" of the data and cause a worse solution to be found. Apparently this is not the case. I now believe that even though including these positions causes noise, the q-search function has to deal with them all the time in real games, so trying to learn how those positions should be evaluated on average is still beneficial.

The 10.2 improvement came when I added separate queen piece square tables for middle game and end game. Previously texel only had one queen PST, which had horizontal, vertical and diagonal symmetry.

The last improvement was made after the texel 1.03 release. The others are included in the 1.03 release and I believe they are responsible for most of the strength increase in texel 1.03.

I don't know if the large strength increase was made possible because the method is good or because the texel evaluation function was particularly bad before I started using this algorithm. Anyway I don't expect to be able to get equally big improvements in the future as I got in texel 1.03, but I hope the algorithm will nevertheless help finding a fair amount of smaller evaluation improvements.

Future improvements

Currently local search is used to find a local optimum. I believe it would be faster to initially use a few Gauss-Newton iterations and then switch to local search when the remaining corrections are small.

A number of evaluation parameters have values that appear quite unintuitive. For example, the value of a white queen on b7/g7 (and a black queen on b2/g2 because of symmetry) in the middle game has a value of -128cp. I have not investigated the cause yet, but I believe the explanation is that almost always when a white/black queen is on b7/b2 in the early middle game, it is because it has just eaten the "poisoned pawn", and the opponent is about to win back a pawn by playing Rb1 and Rxb7. If placing the queen on b7/g7/b2/g2 for other reasons is much less common, the optimization algorithm will assign a large penalty for a queen on these squares. While this improves the evaluation function in the average case, it will also cause large evaluation errors for the uncommon cases that do not fall under the "poisoned pawn" category. Implementing a more specific rule about poisoned pawns could be profitable.

Other examples of odd parameter values include a white king on b8 which in the middle game gives white a 200cp bonus, a white knight on a8 in the middle game which gives white a 223cp penalty, the pawn race bonus which is only 157cp even though it is supposed to trigger only when one side can promote a queen and the opponent can not.

Investigating the cause for these and other odd looking parameter values may suggest further improvements to the evaluation function.

Another interesting approach is to apply data mining techniques on the residuals to try to discover missing knowledge in the evaluation function. I don't know how successful this approach will be. It may be more profitable to test different variants of already known chess evaluation principles instead.

The main difference is that I do not use quies search values, but the score of eval. For that reason, I use position that are already quiescent. Also, I do not use games played by Gaviota, but games from the computer databases, which allow me to use a handful of millions of positions. I am no sure if those many are needed.

The technique has some other advantages. For instance, you can tell what parameters are really not doing anything and become candidates to be removed.

thanks for sharing this. I will definitely give it a try in combination with my genetic algorithm framework that I already have.

The GA will just do the optimization of the parameters but I will use your method as an alternative fitness function. Probably as Miguel I will use quiet positions as I then only have a fast eval call and not a slower qsearch call. So its a bit more pre processing to save time later.

My current algorithm uses game play in combination with solving a test set of easy positional problems. A parameter set that solves a significant amount less positions than the winner of the last tournament is not allowed to play, it will not win the tournament anyway. This saves a bit of time in the early generations, in the later (where the entropy has already dropped) all parameters sets unfortunately jump over that threshold.

This makes the fitness function very expensive and so the algorithm takes a lot of time (2-3 weeks for a reasonable convergence of about 500 bits in my evaluation weights).

When I first optimized my evaluation between iCE 0.3 and iCE 1.0 I gained more than 100 ELO, the gains by subsequent runs after eval modification now are much smaller, so it is almost not worth to effort. Your approach might give it another boost.

My game play fitness function is still beneficial for tuning the parameters that control the search (futility margins, LMR, null move margins etc.). I can't image an alternative to game play in that area.