Texel tuning method question

Desperado · Post by **Desperado** » Tue Jun 06, 2017 8:55 pm

AlvaroBegue wrote:
Desperado wrote:
Ferdy wrote:
jdart wrote:Right. 20 seconds is fast. Takes me maybe 10 minutes (on a big 24-core machine) but I am using a 2-ply search. I calculate the PV, and then do gradient descent based on the end-of-PV evals. Then periodically I re-calculate the PV as the parameters are tuned.

--Jon
Interesting, by "end-of-PV evals", did you use your static evaluation function to get the eval at end of the pv position?
Maybe i should think about it twice, but the pv eval should be passed to the root as search result. So at first glance i don't know in what way the "eval at the end of the pv" is different to the search result score.

And isn't any line (including the pv of course) computed by the static evaluation at the final node ?!

So, what do i miss ?
The trick is doing the gradient descent. While it would be possible to do it on the search function itself, it would be hard to make that efficient. So instead, you need to recover what position gave the eval that was propagated to the root, and then compute the gradient of the evaluation function at that node.

It is still confusing me because i see the following when i am interested in position x:

1.e4 x (e5 2.Nf3 Nc6 d4)

case 1: you have a static eval after 1.e4 that gives +20 points.
case 2: you have a static eval after d4 that gives +35 points
case 3: you map the 35 points on the position after e4 (simple search result for e4)

a.
case 1+2 are identical in the sense that you map the static score directly on the position that the score was computed for. I don't see a gain in just changing the position to get another position/score pair. 1.(e4,20) 2.(d4,35)

b.
case 3 is different to 1+2 because the score that is mapped to the position , is a search result.

c.
The only thing that might make a difference in case "a." is the choice of the position/score pair, so it is of some higher quality because it has been choosen as pv (d4,35 compared to e4,20).
Well, a good reason to check if this improves the tuning algorithm, but the meaning would just be to get a better selection of data input.

Sven · Post by **Sven** » Wed Jun 07, 2017 1:30 am

Desperado wrote:
AlvaroBegue wrote:
Desperado wrote:
Ferdy wrote:
jdart wrote:Right. 20 seconds is fast. Takes me maybe 10 minutes (on a big 24-core machine) but I am using a 2-ply search. I calculate the PV, and then do gradient descent based on the end-of-PV evals. Then periodically I re-calculate the PV as the parameters are tuned.

--Jon
Interesting, by "end-of-PV evals", did you use your static evaluation function to get the eval at end of the pv position?
Maybe i should think about it twice, but the pv eval should be passed to the root as search result. So at first glance i don't know in what way the "eval at the end of the pv" is different to the search result score.

And isn't any line (including the pv of course) computed by the static evaluation at the final node ?!

So, what do i miss ?
The trick is doing the gradient descent. While it would be possible to do it on the search function itself, it would be hard to make that efficient. So instead, you need to recover what position gave the eval that was propagated to the root, and then compute the gradient of the evaluation function at that node.
It is still confusing me because i see the following when i am interested in position x:

1.e4 x (e5 2.Nf3 Nc6 d4)

case 1: you have a static eval after 1.e4 that gives +20 points.
case 2: you have a static eval after d4 that gives +35 points
case 3: you map the 35 points on the position after e4 (simple search result for e4)

a.
case 1+2 are identical in the sense that you map the static score directly on the position that the score was computed for. I don't see a gain in just changing the position to get another position/score pair. 1.(e4,20) 2.(d4,35)

b.
case 3 is different to 1+2 because the score that is mapped to the position , is a search result.

c.
The only thing that might make a difference in case "a." is the choice of the position/score pair, so it is of some higher quality because it has been choosen as pv (d4,35 compared to e4,20).
Well, a good reason to check if this improves the tuning algorithm, but the meaning would just be to get a better selection of data input.

Same here, it confuses me as well. I thought the idea were to tune the static evaluation (parameters) such that the error of the qsearch result for a training set of positions shall be minimized, using game results or qsearch results from a strong engine as a reference. And qsearch (it could as well be a 2-ply search or something similar) would be applied to the training positions itself, not to some next-to-leaf positions of their PVs (that's what I understood so far at least). So shouldn't the training positions themselves be subject of error calculation? That would be case 3.

AlvaroBegue · Post by **AlvaroBegue** » Wed Jun 07, 2017 2:14 am

I don't understand what you guys are confused about. Yes, you want to make the qsearch value as close to some reference as possible. In order to do that, you need to compute the gradient of the error with respect to the evaluation parameters. Since the score came from a leaf of a little tree, you recover what that leaf was, then compute the gradient of the error function for that position.

So making the qsearch approximate the reference value is the same thing as making the evaluation function at the leaf whose score was propagated to the root approximate the reference value.

mar · Post by **mar** » Wed Jun 07, 2017 9:13 am

sandermvdb wrote:II am using the q-search score but the problem with this is that calculating the error for my testset (4 million positions) takes about 20 seconds.

That's not bad if it's single threaded, one iteration for 17.6M positions (4 threads) takes ~22 seconds for me.

Sven · Post by **Sven** » Wed Jun 07, 2017 11:34 am

AlvaroBegue wrote:I don't understand what you guys are confused about. Yes, you want to make the qsearch value as close to some reference as possible. In order to do that, you need to compute the gradient of the error with respect to the evaluation parameters. Since the score came from a leaf of a little tree, you recover what that leaf was, then compute the gradient of the error function for that position.

So making the qsearch approximate the reference value is the same thing as making the evaluation function at the leaf whose score was propagated to the root approximate the reference value.

But why should the gradient of the error function for the PV leaf Y be different from the gradient of the error function for the qsearch root X, if the score is the same? Why is there anything to "recover"? It seems the plan is to only call static eval during the tuning process, instead of always calling qsearch which may be inefficient. But then, why bothering with qsearch at all?

hgm · Post by **hgm** » Wed Jun 07, 2017 12:06 pm

In general the score in the root is a non-differentiable function of the eval parameters. It will be the static eval of the position at the end of the PV, but if you change the parameters enough the PV will change, and the root score will start to follow the evaluation of a different position. E.g. imagine a position where a white Bishop attacks a protected black Knight. As long as the Knight value is below the Bishop value the position will have an advantage for white equal to the B-N difference. But when you pass the point where the Knight value is equal to the Bishop value, and starts to exceed it, white will make the trade, and the advantage stays at 0. The graph of root score vs Knight value will turn a sharp corner there.

The larger the tree, the more such 'crossover points' there will be in the root score. Which makes the gradient poorly defined. The gradient of the static eval at the end of the PV is continuous and well defined, but it is only valid for the root a small neighborhood of the current parameter setting, upto the nearest cross-over point.

Now in QS the PV is not very sensitive to exact parameter values. There usually is only one way to gain the material you are going to gain, and any other line would score so much less that you need a huge change in eval to make it unattractive. For deeper searches this is an issue, though.

AlvaroBegue · Post by **AlvaroBegue** » Wed Jun 07, 2017 12:34 pm

hgm wrote:In general the score in the root is a non-differentiable function of the eval parameters. It will be the static eval of the position at the end of the PV, but if you change the parameters enough the PV will change, and the root score will start to follow the evaluation of a different position. E.g. imagine a position where a white Bishop attacks a protected black Knight. As long as the Knight value is below the Bishop value the position will have an advantage for white equal to the B-N difference. But when you pass the point where the Knight value is equal to the Bishop value, and starts to exceed it, white will make the trade, and the advantage stays at 0. The graph of root score vs Knight value will turn a sharp corner there.

The larger the tree, the more such 'crossover points' there will be in the root score. Which makes the gradient poorly defined. The gradient of the static eval at the end of the PV is continuous and well defined, but it is only valid for the root a small neighborhood of the current parameter setting, upto the nearest cross-over point.

Now in QS the PV is not very sensitive to exact parameter values. There usually is only one way to gain the material you are going to gain, and any other line would score so much less that you need a huge change in eval to make it unattractive. For deeper searches this is an issue, though.

I like your description of the situation, except I don't agree that the corners introduced by the minimax procedure are likely to ever be an issue. Many modern neural networks use rectified linear units (ReLU), which introduce precisely this type of discontinuity in the derivative, and in a deep neural network these corners are everywhere (although it's still a set of measure 0). And yet gradient descent works beautifully if you just ignore the problem. I expect tuning with deeper searches will behave similarly.

Sven · Post by **Sven** » Wed Jun 07, 2017 1:16 pm

hgm wrote:In general the score in the root is a non-differentiable function of the eval parameters. It will be the static eval of the position at the end of the PV, but if you change the parameters enough the PV will change, and the root score will start to follow the evaluation of a different position. E.g. imagine a position where a white Bishop attacks a protected black Knight. As long as the Knight value is below the Bishop value the position will have an advantage for white equal to the B-N difference. But when you pass the point where the Knight value is equal to the Bishop value, and starts to exceed it, white will make the trade, and the advantage stays at 0. The graph of root score vs Knight value will turn a sharp corner there.

The larger the tree, the more such 'crossover points' there will be in the root score. Which makes the gradient poorly defined. The gradient of the static eval at the end of the PV is continuous and well defined, but it is only valid for the root a small neighborhood of the current parameter setting, upto the nearest cross-over point.

Now in QS the PV is not very sensitive to exact parameter values. There usually is only one way to gain the material you are going to gain, and any other line would score so much less that you need a huge change in eval to make it unattractive. For deeper searches this is an issue, though.

I understand that. But what is the advantage of doing a search, then, instead of directly calling static eval for each (root) position and calculating the gradient of the corresponding error function, based on initially gathering reference eval scores?

Desperado · Post by **Desperado** » Wed Jun 07, 2017 1:59 pm

Hello Alvaro, hello HG,

i don't want to be unfriendly now, especially because i simply do not understand what you are talking of, but it reads for me like "blablabla"

I think the Texel method is operating on a set of positions with a corresponding score n x (p,s) , and it does not matter how the scores are retrieved.
Based on that data, the error function is minimized. That is is all about the basic algorithm.

Now, if the error is computed in another way, for example with sth like temporal differences for the pv, that is a completely different topic.

The best thing to avoid misunderstandigs or to enlighten someone like me would be to give a simple example, so thank you in advance.

Best

Desperado · Post by **Desperado** » Wed Jun 07, 2017 2:37 pm

Desperado wrote:Hello Alvaro, hello HG,

i don't want to be unfriendly now, especially because i simply do not understand what you are talking of, but it reads for me like "blablabla"

I think the Texel method is operating on a set of positions with a corresponding score n x (p,s) , and it does not matter how the scores are retrieved.
Based on that data, the error function is minimized. That is is all about the basic algorithm.

Now, if the error is computed in another way, for example with sth like temporal differences for the pv, that is a completely different topic.

The best thing to avoid misunderstandigs or to enlighten someone like me would be to give a simple example, so thank you in advance.

Best

Sorry, i think this time i was a little bit unclear too.
Of course it is relevant how the scores are produced if you are interested in the quality of the data and the performance of the algorithm.
But computing some gradient/error for one data point is not part of the Texel method. We have n (independent) data points which are used to ccompute the error... still confused.

example still appreciated, thx

Texel tuning method question

Re: Texel tuning method question

Re: Texel tuning method question

Re: Texel tuning method question

Re: Texel tuning method question

Re: Texel tuning method question

Re: Texel tuning method question

Re: Texel tuning method question

Re: Texel tuning method question

Re: Texel tuning method question

Re: Texel tuning method question