I'm going through my second attempt at implementing TD-leaf. I've hopefully got it coded right and am just starting optimization runs.
One thing that seems really odd in my implementation is setting the scaling parameter alpha. The only guidance I found was to scale it down over time. I had assumed this meant start around alpha = 1.0 and decrease over time. But the "learn vector" that it gets multiplied by is indicating parameter changes on the order of 0.000001 centipawn. If I used alpha = 1.0, my eval would never change, since every adjustment rounds to zero.
What I've done in the meantime is solve for alpha after each game so that the biggest weight change is between 0.1 cp and 25 cp. But that means alpha must be in the millions. Is this consistent with other people's experience?
TD-leaf(lambda)
Moderators: hgm, Rebel, chrisw
-
- Posts: 793
- Joined: Sun Aug 03, 2014 4:48 am
- Location: London, UK
Re: TD-leaf(lambda)
Robert Pope wrote:I'm going through my second attempt at implementing TD-leaf. I've hopefully got it coded right and am just starting optimization runs.
One thing that seems really odd in my implementation is setting the scaling parameter alpha. The only guidance I found was to scale it down over time. I had assumed this meant start around alpha = 1.0 and decrease over time. But the "learn vector" that it gets multiplied by is indicating parameter changes on the order of 0.000001 centipawn. If I used alpha = 1.0, my eval would never change, since every adjustment rounds to zero.
What I've done in the meantime is solve for alpha after each game so that the biggest weight change is between 0.1 cp and 25 cp. But that means alpha must be in the millions. Is this consistent with other people's experience?
What is your exact update rule? Something doesn't sound right.
alpha is a learning rate. Usual values (assuming you are not using fancy optimization algorithms) are between 0.1 and 0.001.
alpha shouldn't require solving.
Disclosure: I work for DeepMind on the AlphaZero project, but everything I say here is personal opinion and does not reflect the views of DeepMind / Alphabet.
-
- Posts: 558
- Joined: Sat Mar 25, 2006 8:27 pm
Re: TD-leaf(lambda)
matthewlai wrote:What is your exact update rule? Something doesn't sound right.
alpha is a learning rate. Usual values (assuming you are not using fancy optimization algorithms) are between 0.1 and 0.001.
alpha shouldn't require solving.
I'm intending to use the exact update rule in the KnightCap paper:
w := w + alpha * sum[t=1 to N-1]( gradient J(position t) * sum[j=t to N-1]( lambda^(j-t) * d(j) ) )
I do the v=tanh(Beta*J(t)) substitution in the above, like KnightCap did.
Maybe I missed a conversion somewhere - now that I know something isn't right, I'll trace through an example.
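For what it's worth, that update rule can be sketched in a few lines of Python. This is only a minimal illustration of the double sum, not the KnightCap code; the function name `td_leaf_update` and the array layout are my own:

```python
def td_leaf_update(w, gradients, diffs, alpha=0.01, lam=0.7):
    """KnightCap-style TD-leaf(lambda) update.

    w            : weight vector (list of floats)
    gradients[t] : gradient of the squashed leaf eval at move t w.r.t. each weight
    diffs[t]     : temporal difference d(t) = v(t+1) - v(t)
    """
    n = len(diffs)
    for k in range(len(w)):
        total = 0.0
        for t in range(n):
            # discounted sum of the current and future temporal differences
            inner = sum(lam ** (j - t) * diffs[j] for j in range(t, n))
            total += gradients[t][k] * inner
        w[k] += alpha * total
    return w
```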
-
- Posts: 558
- Joined: Sat Mar 25, 2006 8:27 pm
Re: TD-leaf(lambda)
Okay, I've worked through an example. I think it shows there is a problem with my indexes, though I don't see where. But it also shows that I still get ridiculously small adjustments:
The game played (with learning by white) is:
1. Na3 e5 2. Nf3 e4 3. Ng1 e3 4. dxe3 h6 5. Nf3 a6 6. Nb1
I stripped out all evaluation terms except material. I'm just going to show the term for "Has8pawns" in the weight vector calculations.
Step 1: Select Beta to convert scores from (-inf,inf) to (-1,1).
We want a 1 pawn advantage to be around 0.25. In my evaluation function, a pawn is worth 10,000, so I selected Beta=0.0000258725, so tanh(Beta*10000)=0.2531.
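As a quick numeric check, Beta can be derived by inverting the tanh (using Python's standard math module; the variable names are my own):

```python
import math

one_pawn = 10000                         # pawn value in the poster's eval scale
beta = math.atanh(0.2531026) / one_pawn  # solve tanh(beta * one_pawn) = 0.2531026
print(beta)                              # ~0.0000258725
print(math.tanh(beta * one_pawn))        # 0.2531026
```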
Step 2: Leaf Scores returned by search, and converted values
We did N=6 searches and found we could capture a pawn on move 4:
i  score(i)  v(i)
1  0         0.0000000
2  0         0.0000000
3  0         0.0000000
4  10000     0.2531026
5  10000     0.2531026
6  10000     0.2531026
Step 3: Calculate temporal differences d(t) = v(t+1) - v(t)
i  d(i)
1  0
2  0
3  0.2531026
4  0
5  0
Step 4: Select lambda
I picked lambda =0.7.
Step 5: Calculate gradients/partial derivatives at each PV-leaf
The position is symmetric at the first 3 nodes, so the gradient is 0.
In the search of move 4, white has 8 pawns and black only 7, so the leaf evaluation is 10000. To get the gradient of the Has8Pawns term, I added 100 to the weight, which changed the evaluation to 10100.
Gradient(4)=(tanh(Beta*10100)-tanh(Beta*10000))/100=0.0000242
i  gradient(i)
1  0
2  0
3  0
4  0.0000242
5  0.0000242
Step 6: Calculate the TD-Leaf(lambda) update
w = w + alpha * Sum[t=1 to 5]( gradient(t) * Sum[j=t to 5]( lambda^(j-t) * d(j) ) )
Since gradient(t) is 0 for t = 1-3, the double sum expands to:
gradient(3)*(d(3) + 0.7*d(4) + 0.49*d(5))
+ gradient(4)*(d(4) + 0.7*d(5))
+ gradient(5)*d(5)
= 0*(0.2531 + 0.7*0 + 0.49*0) + 0.0000242*(0 + 0.7*0) + 0.0000242*0
= 0
So no update occurred, since in every term of the sum either the gradient or the temporal difference is 0.
But even if they did overlap, I would get an extremely small weight adjustment, i.e. 0.0000242*(0.253+0.7*0.253+0.49*0.253) = 0.0000134
And even at alpha=1.0, the weight for Has8Pawns would change from 80000 to 80000.0000134, which rounds back to 80000.
Can anybody tell what I am missing?
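The whole worked example fits in a short Python script, which reproduces both the zero update and the tiny hypothetical adjustment (a sketch of the poster's numbers, not engine code):

```python
import math

beta = 0.0000258725
lam = 0.7

scores = [0, 0, 0, 10000, 10000, 10000]          # leaf scores for moves 1..6
v = [math.tanh(beta * s) for s in scores]        # squashed to (-1, 1)
d = [v[i + 1] - v[i] for i in range(5)]          # temporal differences

# finite-difference gradient of the squashed eval w.r.t. the Has8Pawns weight
g_leaf = (math.tanh(beta * 10100) - math.tanh(beta * 10000)) / 100
grad = [0.0, 0.0, 0.0, g_leaf, g_leaf]           # first three positions symmetric

update = sum(grad[t] * sum(lam ** (j - t) * d[j] for j in range(t, 5))
             for t in range(5))
print(update)   # 0.0 -- the nonzero gradient and the nonzero d(t) never overlap
```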
-
- Posts: 216
- Joined: Sun Apr 13, 2014 5:19 pm
Re: TD-leaf(lambda)
You apply tanh() too early. See Matthew Lai's dissertation on Giraffe (p. 23) for the correct implementation of tdleaf(lambda).
With the usual pawn = 100, your gradient example should be 100. This multiplied by 0.7^3 gives 34.3.
tanh(34.3) = 1.0 and so on.
You are even not obliged to use tanh() especially if you use the absolutes of gradients (L1) instead of their squares (L2).
-
- Posts: 216
- Joined: Sun Apr 13, 2014 5:19 pm
Re: TD-leaf(lambda)
Sorry, not dissertation but M.Sc. thesis:
http://arxiv.org/abs/1509.01549v1
-
- Posts: 558
- Joined: Sat Mar 25, 2006 8:27 pm
Re: TD-leaf(lambda)
lantonov wrote:You apply tanh() too early. See Matthew Lai's dissertation on Giraffe (p. 23) for the correct implementation of tdleaf(lambda).
With the usual pawn = 100, your gradient example should be 100. This multiplied by 0.7^3 gives 34.3.
tanh(34.3) = 1.0 and so on.
You are even not obliged to use tanh() especially if you use the absolutes of gradients (L1) instead of their squares (L2).
Thanks for the comments. I can definitely try without the tanh and see what happens. Unfortunately, neither the Giraffe nor the KnightCap paper was very clear to me on applying the tanh. I'll have to re-read KnightCap with this in mind.
Your last sentence makes me wonder if there is something additional I am missing as well -- I'm not calculating an L1 or L2 distance anywhere in my algorithm. Am I missing something?
-
- Posts: 216
- Joined: Sun Apr 13, 2014 5:19 pm
Re: TD-leaf(lambda)
When you obtain a gradient, it can be positive or negative (or, rarely, 0). On the other hand, most programs that subsequently work with this gradient as an objective function, such as AdaGrad, AdaDelta, DE etc., try to minimize that objective. If the gradient is negative, minimizing it would mean driving it to ever lower negative values (increasing absolute value), which is not what we want.
We are comparing our estimates with some reference estimates, possibly obtained by a stronger engine (or deeper leaves), and we want the differences to be as small as possible in absolute value, regardless of their direction (+ or -). In other words, we are trying to obtain a distribution with mean = 0 and variance as small as possible.
There are two ways to transform the gradient so that its minimal value is 0: (i) taking the absolute value (less sensitive) and (ii) squaring it (more prone to outliers). In cases with high variation and many outliers, such as chess engine tuning, one generally prefers (i).
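The L1-vs-L2 sensitivity can be illustrated with a toy example (the error values below are made up for illustration):

```python
def l1_loss(errors):
    # mean absolute error: each error counts in proportion to its size
    return sum(abs(e) for e in errors) / len(errors)

def l2_loss(errors):
    # mean squared error: large errors dominate quadratically
    return sum(e * e for e in errors) / len(errors)

errors = [0.1, -0.1, 0.05, 2.0]   # three small errors plus one outlier
print(l1_loss(errors))   # 0.5625   -- outlier contributes ~89% of the total
print(l2_loss(errors))   # 1.005625 -- outlier contributes ~99% of the total
```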
-
- Posts: 931
- Joined: Tue Mar 09, 2010 3:46 pm
- Location: New York
- Full name: Álvaro Begué (RuyDos)
Re: TD-leaf(lambda)
Without looking at all the details of what you did or how KnightCap worked, I have two comments:
(1) Gradient-descent algorithms are generally not scale invariant, so having a pawn value of 10000 might be part of the problem. Try with a more reasonable scale, like 1.
(2) You can use automatic differentiation to compute the gradient with respect to all parameters in one go, and with no numerical issues with the size of the constant (100) you add.
It's hard to understand why so few people know about automatic differentiation.
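To illustrate the automatic differentiation point, here is a minimal forward-mode sketch using a hand-rolled dual-number class (not a real AD library), applied to the tanh-squashed material eval from the example in this thread. It gives the exact derivative with no finite-difference step to choose:

```python
import math

class Dual:
    """Minimal forward-mode dual number: a value plus its derivative."""
    def __init__(self, val, dot=0.0):
        self.val, self.dot = val, dot

    def __add__(self, other):
        other = other if isinstance(other, Dual) else Dual(other)
        return Dual(self.val + other.val, self.dot + other.dot)
    __radd__ = __add__

    def __mul__(self, other):
        other = other if isinstance(other, Dual) else Dual(other)
        return Dual(self.val * other.val,
                    self.val * other.dot + self.dot * other.val)
    __rmul__ = __mul__

    def tanh(self):
        t = math.tanh(self.val)
        return Dual(t, (1.0 - t * t) * self.dot)   # d/dx tanh(x) = 1 - tanh(x)^2

beta = 0.0000258725
w = Dual(10000.0, 1.0)      # seed dot = 1 to differentiate w.r.t. this weight
v = (beta * w).tanh()       # squashed eval, material term only
print(v.val, v.dot)         # value ~0.2531, exact gradient ~0.0000242
```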
-
- Posts: 793
- Joined: Sun Aug 03, 2014 4:48 am
- Location: London, UK
Re: TD-leaf(lambda)
Robert Pope wrote:Thanks for the comments. I can definitely try without the tanh and see what happens. Unfortunately, neither the Giraffe nor the KnightCap paper was very clear to me on applying the tanh. I'll have to re-read KnightCap with this in mind.
Your last sentence makes me wonder if there is something additional I am missing as well -- I'm not calculating an L1 or L2 distance anywhere in my algorithm. Am I missing something?
That's because tanh isn't really part of TD-leaf(lambda). It's part of the evaluation function, to transform a linear advantage into a probability of winning.
Intuitively, if you are 4 queens up, you'll have twice the linear advantage of being 2 queens up, even though the probability of winning is almost the same (almost 1 for both).
TD-leaf(lambda) doesn't really care about that. You just need to take it into account when computing the gradient of your evaluation function.
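The queens intuition checks out numerically on the scale used earlier in the thread (the queen value of ~9 pawns below is my assumption, not from the posts):

```python
import math

beta = 0.0000258725     # squashing scale from the earlier example (pawn = 10000)
queen = 90000           # assumed queen value of ~9 pawns on that scale

two_queens_up = math.tanh(beta * 2 * queen)
four_queens_up = math.tanh(beta * 4 * queen)
print(two_queens_up, four_queens_up)   # both ~1.0 despite 2x the linear advantage
```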
Disclosure: I work for DeepMind on the AlphaZero project, but everything I say here is personal opinion and does not reflect the views of DeepMind / Alphabet.