Desperado wrote: If i come close with my understanding which would mean a static score from the pv leaf, i bet it provides the same noise as any other static score you compute in any other situation.
hgm wrote: What you seem to miss is that the tuning method doesn't use only the score, but also the gradient of the score, i.e. the direction in parameter space in which the score increases fastest. It needs this to know the relative magnitude of the changes it has to apply to all the parameters to get closer to the optimum.
To know this, you have to know how the score of the individual test cases will change as a result of a change in each parameter. But you cannot see how much each parameter contributes to the score in the root. E.g. the root could have a Rook for the opponent, and none for you. Then the Root score would get worse for you if the Rook value increases. But the PV might be NxR, PxN, so that the position at the end of the PV does not contain any Rooks at all, and the static eval of that position (and thus the root score) would be insensitive to the Rook value.
Isn't this still a matter of the raw data, as i tried to point out in my last post for you?
Texel tuning method question
Re: Texel tuning method question
This post http://talkchess.com/forum/viewtopic.php?t=61427 might help. It links to the EPD of 730k quiet positions Alexandru Mosoi, author of Zurichess, used in his Texel tuning.

 Posts: 3956
 Joined: Fri Mar 10, 2006 4:23 am
 Location: http://www.arasanchess.org
Re: Texel tuning method question
I think several posters here misunderstand the process.
You are trying to minimize the error between estimated game result (from position score) and the actual game results, over the training set.
If you have only quiescent positions, such as might occur at the end of the qsearch, in your training set, then you don't need to do any searching. You can use the static eval directly.
If your position is not quiescent then in general you cannot obtain an even grossly approximate value for the position from the static value. At least, you need to call the qsearch to get a value.
I do not assume the training set has only quiescent positions, so a search is necessary. Once this search is done you basically have a set of derived positions that are quiescent and from then on the static eval and its gradient can be computed for those positions. However, note the game result, which you are actually using in the gradient computation, belongs to the root (original) training position, not the end-of-PV position. So you are assuming that the quiescent position's eval is a proxy for evaluating the root position, which cannot be directly evaluated statically.
Jon
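As a rough illustration of the error being minimized, here is a minimal Python sketch. It assumes the logistic mapping from centipawn score to expected result that is commonly used with Texel tuning. The scaling constant `k`, the function names, and the sample scores are made up for illustration and do not come from any engine discussed in this thread.

```python
def expected_result(score_cp, k=1.0):
    """Map a centipawn score (white's view) to an expected result in [0, 1]."""
    return 1.0 / (1.0 + 10.0 ** (-k * score_cp / 400.0))


def mean_squared_error(scores_cp, results, k=1.0):
    """Average squared difference between predicted and actual game results.

    results use 1.0 = white win, 0.5 = draw, 0.0 = white loss.
    """
    total = 0.0
    for score, result in zip(scores_cp, results):
        total += (result - expected_result(score, k)) ** 2
    return total / len(scores_cp)


# Hypothetical data: qsearch scores for three training positions
# and the results of the games they were taken from.
scores = [120, -35, 0]
results = [1.0, 0.0, 0.5]
error = mean_squared_error(scores, results)
```

For quiet positions `scores` could come straight from the static eval. Otherwise they would come from the qsearch, while `results` still belong to the root positions.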
 hgm
 Posts: 24657
 Joined: Fri Mar 10, 2006 9:06 am
 Location: Amsterdam
 Full name: H G Muller
Re: Texel tuning method question
Desperado wrote: Isn't this still a matter of the raw data, as i tried to point out in my last post for you?
I don't know what you mean by 'raw data'. If it is the position at the end of the PV, then yes, the gradient follows from that. The gradient cannot be calculated from the position at the root. It could be calculated from as many root scores with different parameters as there are parameters (+1).
Re: Texel tuning method question
jdart wrote: I think several posters here misunderstand the process. [...]
The positions linked in the post above seek to address some of these issues by providing quiet positions (according to the readme, "From the set were removed all positions on which quiescence search found a wining capture. The remaining positions were stored in quiet.epd."). There's also a slightly smaller labeled set (725k) where the game result comes from Stockfish 080916 in selfplay starting from the quiet position rather than something earlier.
Re: Texel tuning method question
Desperado wrote: Maybe i should think about it twice, but the pv eval should be passed to the root as search result. So at first glance i don't know in what way the "eval at the end of the pv" is different to the search result score. :?: :!:
AlvaroBegue wrote: The trick is doing the gradient descent. While it would be possible to do it on the search function itself, it would be hard to make that efficient. So instead, you need to recover what position gave the eval that was propagated to the root, and then compute the gradient of the evaluation function at that node.
I will add some comments here that will hopefully help clear up some of the confusion.
The basic texel tuning method treats the evaluation function and the qsearch function as black boxes. You put in a position and a set of parameter values, and you get out an evaluation score. How the score is computed is completely irrelevant for the tuning algorithm.
Without any assumptions about how the evaluation function works internally, you are restricted to quite primitive algorithms for finding a minimum in parameter space. The pseudo code on the CPW for example varies one parameter at a time, following the downwards direction. It stops when no smaller value can be found in any direction.
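A minimal Python sketch of this kind of one-parameter-at-a-time search follows. It is not the CPW pseudo code itself, just the same idea under simple assumptions: integer parameters, a fixed step of 1, and a hypothetical `toy_loss` standing in for the real tuning error, which would need a full pass over the training set per call.

```python
def local_search(loss, params, step=1):
    """Adjust one parameter at a time for as long as the loss keeps improving."""
    best_params = list(params)
    best_loss = loss(best_params)
    improved = True
    while improved:
        improved = False
        for i in range(len(best_params)):
            for delta in (step, -step):
                trial = list(best_params)
                trial[i] += delta
                trial_loss = loss(trial)
                if trial_loss < best_loss:
                    best_params, best_loss = trial, trial_loss
                    improved = True
                    break  # keep this change and move on to the next parameter
    return best_params, best_loss


# Toy stand-in for the tuning error, with its minimum at (3, -2).
def toy_loss(p):
    return (p[0] - 3) ** 2 + (p[1] + 2) ** 2


tuned, final_loss = local_search(toy_loss, [0, 0])
```

The loop stops once a full sweep over all parameters finds no improvement in either direction, so it ends in a local minimum and not necessarily the global one.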
If we assume that the function to minimize is differentiable almost everywhere but still treat the function as a black box, we could use various gradient based optimization methods to speed up the search for a local minimum. Since the function is a black box it would not be possible to directly compute the required partial derivatives, so they would have to be approximated using finite differences instead. Typically something like
    dE/dPi ~= E(pi+1) - E(pi)
or
    dE/dPi ~= (E(pi+1) - E(pi-1))/2
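To make the finite-difference idea concrete, here is a small Python sketch of the central-difference variant. The function names and `toy_loss` are placeholders. In a real tuner `loss` would be the black-box error over the whole training set, so every gradient costs two full passes per parameter.

```python
def finite_difference_gradient(loss, params, h=1.0):
    """Approximate dE/dp_i for every parameter with central differences."""
    gradient = []
    for i in range(len(params)):
        up = list(params)
        down = list(params)
        up[i] += h
        down[i] -= h
        gradient.append((loss(up) - loss(down)) / (2.0 * h))
    return gradient


# Toy stand-in for the tuning error.
def toy_loss(p):
    return (p[0] - 3.0) ** 2 + 0.5 * p[1] ** 2


grad = finite_difference_gradient(toy_loss, [0.0, 4.0])
```

For this quadratic toy function the central difference happens to be exact. For a real evaluation function it is only an approximation whose quality depends on the step `h`.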
If we further assume that the evaluation function has a certain structure, so that the evaluation score is computed from the position and parameters using only a well-defined set of operations, and assume that the evaluation function is written in a language that supports generic types and operator overloading, it is possible to implement a framework that automatically computes the partial derivatives at the same time as the evaluation score is computed. See for example this article for an explanation of how this can be done.
Álvaro has implemented such a framework, which is called ruy_tune. It is written in C++ and a requirement for it to work is that the evaluation function is converted to a template, where the score type is a template parameter. With such a modified evaluation function, the gradient can be computed much faster than if it were computed using finite differences. (At least I think it will be much faster, I have not actually tested this.)
However, for this to work you would have to find the position at the end of the PV and use that position to compute the evaluation score and the corresponding gradient. If you wanted to apply the automatic gradient computation technique to the qsearch function, the qsearch function would also have to be converted to a template, and the framework would have to be extended to overload also comparison operators in order to make the minimaxing work.
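The operator-overloading idea can be sketched in a few lines of Python. This is not ruy_tune's API (that is a C++ template framework). It is a toy forward-mode version using a hypothetical `Dual` class and a made-up material-only `evaluate` function, shown only to illustrate how an evaluation written against a generic score type can yield a derivative as a side effect.

```python
class Dual:
    """A value together with its derivative with respect to one chosen parameter."""

    def __init__(self, value, deriv=0.0):
        self.value = value
        self.deriv = deriv

    @staticmethod
    def lift(x):
        """Wrap plain numbers as constants (derivative 0)."""
        return x if isinstance(x, Dual) else Dual(float(x))

    def __add__(self, other):
        other = Dual.lift(other)
        return Dual(self.value + other.value, self.deriv + other.deriv)

    __radd__ = __add__

    def __mul__(self, other):
        other = Dual.lift(other)
        # Product rule: (u * v)' = u' * v + u * v'
        return Dual(self.value * other.value,
                    self.deriv * other.value + self.value * other.deriv)

    __rmul__ = __mul__


def evaluate(pawn_diff, rook_diff, pawn_value, rook_value):
    """Toy material-only evaluation that works for floats and for Duals."""
    return pawn_diff * pawn_value + rook_diff * rook_value


# Two pawns up, one rook down. Seeding rook_value with derivative 1
# makes score.deriv equal to d(score)/d(rook_value).
score = evaluate(2, -1, Dual(100.0), Dual(500.0, 1.0))
```

Here `score.deriv` comes out negative, matching the earlier remark that the score gets worse as the Rook value increases when only the opponent has a Rook. If the end-of-PV position had no Rooks, `rook_diff` would be 0 and the derivative would vanish.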

 Posts: 925
 Joined: Tue Mar 09, 2010 2:46 pm
 Location: New York
 Full name: Álvaro Begué (RuyDos)
Re: Texel tuning method question
petero2 wrote: [...]
Álvaro has implemented such a framework, which is called ruy_tune. It is written in C++ and a requirement for it to work is that the evaluation function is converted to a template, where the score type is a template parameter. With such a modified evaluation function, the gradient can be computed much faster than if it were computed using finite differences. (At least I think it will be much faster, I have not actually tested this.)
However, for this to work you would have to find the position at the end of the PV and use that position to compute the evaluation score and the corresponding gradient. If you wanted to apply the automatic gradient computation technique to the qsearch function, the qsearch function would also have to be converted to a template, and the framework would have to be extended to overload also comparison operators in order to make the minimaxing work.
Correct. Computing the gradient on the QS directly is a colossal waste of time, at least with the method I implemented. It is much faster to run the QS saving the PV and then compute the gradient using the end-of-PV position.
What I did with RuyTune is turn the original positions into quiet positions with this end-of-PV method, using my existing evaluation function, and then not worry too much about the fact that tweaking the evaluation function could result in a different position being picked. I could rerun this periodically (as someone else has suggested in this thread), but I think it would make very little difference in practice.

 Posts: 3956
 Joined: Fri Mar 10, 2006 4:23 am
 Location: http://www.arasanchess.org
Re: Texel tuning method question
Because my eval function is relatively simple I actually do a closed-form gradient computation. For me this entails some duplication of logic between the actual eval function and the tuning code. I also have some code that does the finite difference calculation and compares with the gradient computation and verifies that they are giving the same result within some small error margin.
Computing the gradient is fairly straightforward but it is important to take account of material value scaling, if that is used. And I have a few bits, notably king safety computation, that are nonlinear and for which gradient computation is nontrivial, but still doable.
This is quite a bit more dev work than finite differences though.
Jon
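The kind of check described above (closed-form gradient versus finite differences) might look roughly like the Python sketch below. It assumes a purely linear evaluation (score = sum of weight times feature), a squared-error loss on a logistic result prediction, and a scaling constant `K`. None of this is taken from Arasan's actual tuning code, and the weights and features are invented.

```python
import math

K = 1.0  # assumed sigmoid scaling constant


def sigmoid(score):
    return 1.0 / (1.0 + 10.0 ** (-K * score / 400.0))


def loss(weights, features, result):
    """Squared error for one position with a linear evaluation."""
    score = sum(w * f for w, f in zip(weights, features))
    return (result - sigmoid(score)) ** 2


def analytic_gradient(weights, features, result):
    """Closed-form gradient of the loss above, obtained with the chain rule."""
    score = sum(w * f for w, f in zip(weights, features))
    s = sigmoid(score)
    common = -2.0 * (result - s) * s * (1.0 - s) * K * math.log(10.0) / 400.0
    return [common * f for f in features]


def numeric_gradient(weights, features, result, h=1e-3):
    """Central finite differences, used only to cross-check the closed form."""
    grad = []
    for i in range(len(weights)):
        up, down = list(weights), list(weights)
        up[i] += h
        down[i] -= h
        diff = loss(up, features, result) - loss(down, features, result)
        grad.append(diff / (2.0 * h))
    return grad


weights = [100.0, 500.0]  # hypothetical pawn and rook values
features = [1.0, -1.0]    # a pawn up, a rook down
exact = analytic_gradient(weights, features, 0.0)
approx = numeric_gradient(weights, features, 0.0)
max_diff = max(abs(a - b) for a, b in zip(exact, approx))
```

If `max_diff` is not tiny, the closed-form code and the eval have drifted apart. Material scaling or nonlinear terms such as king safety would add extra factors to `analytic_gradient`.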
Re: Texel tuning method question
Hi,
I have a question at this point in regards to speed and method for testing various parameters.
I am just testing the mechanics and speed of the tuning method on a single thread and get about 1M positions tested in just under 20 seconds. I am OK with this for now but as I think about it, and I see this has been discussed in other posts, if I expose a few dozen parameters to tune, this could take weeks or longer, right?
What I envision is this, for example:
 I have 10 parameters that I want to tune.
 I want to use a delta for each parameter to test, let's just say delta +/- 10. For each parameter, this is 21 different values from value-10 to value+10.
That is a lot of combinations for ~20 seconds per test.
I do not know if this is called a "local search" or not, but the more parameters I want to expose the more time that is needed.
Am I seeing this wrong? Maybe this is the simplest brute force and there are better ways. Can someone help clear this idea up and give me a nudge in the right direction?
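To put rough numbers on the question, here is a back-of-the-envelope calculation in Python using the figures from the post (10 parameters, 21 values each, about 20 seconds per pass). The comparison with a single one-parameter-at-a-time sweep is an added illustration and not something the poster proposed. One such sweep is not a complete optimization either.

```python
SECONDS_PER_PASS = 20   # from the post: ~1M positions in just under 20 s
PARAMS = 10
VALUES_PER_PARAM = 21   # value-10 .. value+10

# Trying every combination of values across all parameters.
grid_passes = VALUES_PER_PARAM ** PARAMS

# Trying each value of one parameter while holding the others fixed.
one_at_a_time_passes = VALUES_PER_PARAM * PARAMS

grid_years = grid_passes * SECONDS_PER_PASS / (3600 * 24 * 365)
one_at_a_time_minutes = one_at_a_time_passes * SECONDS_PER_PASS / 60
```

The exhaustive grid grows exponentially with the number of parameters, while a per-parameter sweep grows only linearly.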

 Posts: 925
 Joined: Tue Mar 09, 2010 2:46 pm
 Location: New York
 Full name: Álvaro Begué (RuyDos)
Re: Texel tuning method question
Cheney wrote: I have a question at this point in regards to speed and method for testing various parameters. [...]
Let me see if I understand what you are saying. If we only had one parameter to tune, you could imagine computing the derivative of your loss function (the thing you are minimizing) with respect to your parameter by setting the parameter 10 points higher, then 10 points lower, and approximating the derivative like this:
(d Loss) / (d Param) ~= (Loss(Param+10) - Loss(Param-10)) / 20
If you want to do this with P parameters, you would need 2P evaluations of the loss function (i.e., 2P passes through all the data), which gets expensive quickly.
Enter automatic differentiation. You can actually compute all those derivatives in 2 or 3 times the cost of computing the loss function once, regardless of P. The method is called "reverse-mode automatic differentiation". Neural nets people call it "backpropagation". And people that want to point out how obvious it all is in retrospect call it "the chain rule".
Last year I made RuyTune available so people could do this kind of thing on their engines. Unfortunately, I don't think I managed to make it user-friendly enough. But if you are interested, I can try to help you to make use of it, or at least the automatic-differentiation piece of it. See here: https://bitbucket.org/alonamaloh/ruy_tune
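For readers who have not seen reverse-mode automatic differentiation before, here is a deliberately tiny Python sketch of the idea. It is not RuyTune's implementation or interface. The `Var` class, the `backward` function, and the toy expression are all invented for illustration, and only addition and multiplication are supported.

```python
class Var:
    """Node in a computation graph that remembers its inputs and local derivatives."""

    def __init__(self, value, parents=()):
        self.value = value
        self.parents = parents  # sequence of (input_node, local_derivative)
        self.grad = 0.0

    def __add__(self, other):
        return Var(self.value + other.value, ((self, 1.0), (other, 1.0)))

    def __mul__(self, other):
        return Var(self.value * other.value,
                   ((self, other.value), (other, self.value)))


def backward(output):
    """Fill in d(output)/d(node) for every node with a single reverse sweep."""
    order, seen = [], set()

    def visit(node):
        # Post-order DFS: a node is listed only after all of its inputs.
        # (Recursive, so only suitable for small graphs like this one.)
        if id(node) not in seen:
            seen.add(id(node))
            for parent, _ in node.parents:
                visit(parent)
            order.append(node)

    visit(output)
    output.grad = 1.0
    for node in reversed(order):
        # Chain rule: pass this node's gradient on to each of its inputs.
        for parent, local in node.parents:
            parent.grad += node.grad * local


# Toy "loss" with three parameters: out = (p0 * p1 + p2) ** 2
p0, p1, p2 = Var(2.0), Var(3.0), Var(1.0)
t = p0 * p1 + p2
out = t * t
backward(out)
gradient = [p0.grad, p1.grad, p2.grad]
```

One forward evaluation plus one call to `backward` yields the derivative with respect to every parameter, which is why the cost does not grow with P the way the 2P finite-difference passes do.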