TD-leaf(lambda)

Discussion of chess software programming and technical issues.

Moderators: hgm, Rebel, chrisw

Robert Pope
Posts: 558
Joined: Sat Mar 25, 2006 8:27 pm

TD-leaf(lambda)

Post by Robert Pope »

I'm going through my second attempt at implementing TD-leaf. I've hopefully got it coded right and am just starting optimization runs.

One thing that seems really odd in my implementation is setting the scaling parameter alpha. The only guidance I found was to scale it down over time. I had assumed this meant starting around alpha = 1.0 and decreasing from there. But the "learn vector" it gets multiplied by indicates parameter changes on the order of 0.000001 centipawns. If I used alpha = 1.0, my eval would never change, since every adjustment rounds to zero.

What I've done in the meantime is solve for alpha after each game so that the biggest weight change is between 0.1 cp and 25 cp. But that means alpha must be in the millions. Is this consistent with other people's experience?
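For reference, this is roughly what I'm doing at the moment (just a sketch, not my actual code; names like learnVector and targetMaxChange are made up for illustration):

Code: Select all

#include <algorithm>
#include <cmath>
#include <cstddef>
#include <vector>

// Rescale alpha each game so the largest single weight change lands in a
// fixed centipawn window, then apply the update.
void applyScaledUpdate(std::vector<double>& weights,
                       const std::vector<double>& learnVector,
                       double targetMaxChange)   // e.g. 25 cp in internal units
{
    double maxChange = 0.0;
    for (double g : learnVector)
        maxChange = std::max(maxChange, std::abs(g));

    if (maxChange == 0.0)
        return;                                  // nothing to learn from this game

    double alpha = targetMaxChange / maxChange;  // ends up in the millions for me
    for (std::size_t i = 0; i < weights.size(); ++i)
        weights[i] += alpha * learnVector[i];
}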
matthewlai
Posts: 793
Joined: Sun Aug 03, 2014 4:48 am
Location: London, UK

Re: TD-leaf(lambda)

Post by matthewlai »

Robert Pope wrote:I'm going through my second attempt at implementing TD-leaf. I've hopefully got it coded right and am just starting optimization runs.

One thing that seems really odd in my implementation is setting the scaling parameter alpha. The only guidance I found was to scale it down over time. I had assumed this meant starting around alpha = 1.0 and decreasing from there. But the "learn vector" it gets multiplied by indicates parameter changes on the order of 0.000001 centipawns. If I used alpha = 1.0, my eval would never change, since every adjustment rounds to zero.

What I've done in the meantime is solve for alpha after each game so that the biggest weight change is between 0.1 cp and 25 cp. But that means alpha must be in the millions. Is this consistent with other people's experience?
What is your exact update rule? Something doesn't sound right.

alpha is a learning rate. Usual values (assuming you are not using fancy optimization algorithms) are between 0.1 and 0.001.

alpha shouldn't require solving.
Disclosure: I work for DeepMind on the AlphaZero project, but everything I say here is personal opinion and does not reflect the views of DeepMind / Alphabet.
Robert Pope
Posts: 558
Joined: Sat Mar 25, 2006 8:27 pm

Re: TD-leaf(lambda)

Post by Robert Pope »

matthewlai wrote: What is your exact update rule? Something doesn't sound right.

alpha is a learning rate. Usual values (assuming you are not using fancy optimization algorithms) are between 0.1 and 0.001.

alpha shouldn't require solving.
I'm intending to use the exact update rule in the KnightCap paper:

w := w + alpha * sum[t=1 to N-1]( gradJ(position t) * sum[j=t to N-1]( lambda^(j-t) * d(j) ) )
I do the v=tanh(Beta*J(t)) substitution in the above, like KnightCap did.
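In code I read that as something like the following (just a sketch with made-up names; leafGrad[t] is the gradient of the converted leaf score at position t, and d holds the temporal differences):

Code: Select all

#include <cmath>
#include <cstddef>
#include <vector>

// TD-leaf(lambda) update as I read it from the KnightCap paper:
// w += alpha * sum_t grad(t) * sum_{j>=t} lambda^(j-t) * d(j)
void tdLeafUpdate(std::vector<double>& w,
                  const std::vector<std::vector<double>>& leafGrad,
                  const std::vector<double>& d,
                  double alpha, double lambda)
{
    const std::size_t N1 = d.size();             // N-1 temporal differences
    for (std::size_t t = 0; t < N1; ++t) {
        double tdSum = 0.0;
        for (std::size_t j = t; j < N1; ++j)
            tdSum += std::pow(lambda, double(j - t)) * d[j];
        for (std::size_t k = 0; k < w.size(); ++k)
            w[k] += alpha * leafGrad[t][k] * tdSum;
    }
}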

Maybe I missed a conversion somewhere - now that I know something isn't right, I'll trace through an example.
Robert Pope
Posts: 558
Joined: Sat Mar 25, 2006 8:27 pm

Re: TD-leaf(lambda)

Post by Robert Pope »

Okay, I've worked through an example. I think it shows there is a problem with my indexes, though I don't see where. But it also shows that I still get ridiculously small adjustments:

The game played (with learning by white) is:
1. Na3 e5 2. Nf3 e4 3. Ng1 e3 4. dxe3 h6 5. Nf3 a6 6. Nb1

I stripped out all evaluation terms except material. I'm just going to show the term for "Has8pawns" in the weight vector calculations.

Step 1: Select Beta to convert scores from (-inf,inf) to (-1,1).
We want a 1 pawn advantage to be around 0.25. In my evaluation function, a pawn is worth 10,000, so I selected Beta=0.0000258725, so tanh(Beta*10000)=0.2531.
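In effect (sketch; PAWN_VALUE stands for my internal pawn score):

Code: Select all

#include <cmath>

// Pick Beta so a one-pawn advantage maps to roughly 0.25 after tanh.
// With PAWN_VALUE = 10000 this gives Beta of about 2.55e-5.
const double PAWN_VALUE = 10000.0;
const double Beta = std::atanh(0.25) / PAWN_VALUE;

double toConvertedScore(double score) { return std::tanh(Beta * score); }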

Step 2: Leaf Scores returned by search, and converted values
We did N=6 searches and found we could capture a pawn on move 4:

Code: Select all

i    score(i)    v(i)
1           0    0.0000000
2           0    0.0000000
3           0    0.0000000
4       10000    0.2531026
5       10000    0.2531026
6       10000    0.2531026
Step 3: Calculate temporal differences d(t)=v(t+1)-v(t)
Code: Select all

i    d(i)
1    0
2    0
3    0.2531026
4    0
5    0
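That's just a straight loop over the converted scores (sketch):

Code: Select all

#include <cstddef>
#include <vector>

// d(t) = v(t+1) - v(t) over the N converted leaf scores.
std::vector<double> temporalDifferences(const std::vector<double>& v)
{
    std::vector<double> d;
    for (std::size_t t = 0; t + 1 < v.size(); ++t)
        d.push_back(v[t + 1] - v[t]);
    return d;
}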

Step 4: Select lambda
I picked lambda =0.7.

Step 5: Calculate gradients/partial derivatives at each PV-leaf
The position is symmetric at the first 3 nodes, so the gradient is 0.
In the search of move 4, white has 8 pawns and black only 7, so the leaf evaluation is 10000. To get the gradient of the Has8Pawns term, I added 100 to the weight, which changed the evaluation to 10100.

Gradient(4)=(tanh(Beta*10100)-tanh(Beta*10000))/100=0.0000242

Code: Select all

i   gradient(i)
1     0
2     0
3     0
4     0.0000242
5     0.0000242
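The finite-difference step itself is just (sketch; the two scores are the leaf evaluation before and after bumping the weight by h = 100):

Code: Select all

#include <cmath>

// Numerical partial derivative of tanh(Beta * J) with respect to one weight,
// via a forward difference with step h.
double numericalGradient(double leafScore, double perturbedLeafScore,
                         double Beta, double h)
{
    return (std::tanh(Beta * perturbedLeafScore)
          - std::tanh(Beta * leafScore)) / h;
}

// numericalGradient(10000, 10100, 0.0000258725, 100) is about 0.0000242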
Step 6: Calculate the Td-Leaf(lambda) update

w = w + alpha * Sum[t=1 to 5]( gradient(t) * Sum[j=t to 5]( lambda^(j-t) * d(t) ) )
Since gradient(i) is 0 for 1-3, the double sum expands to:
gradient(3)*(d(3)+0.7*d(3)+0.49*d(3))
+gradient(4)*(d(4)+0.7*d(4))
+gradient(5)*d(5)
=0*(0.253+0.7*0.253+0.49*0.253) + 0.0000242*(0+0.7*0) + 0.0000242*(0)
=0
So no update occurred, since at every part of the sum either the gradient is 0 or the score difference is 0.

But even if they did overlap, I would get an extremely small weight adjustment, i.e. 0.0000242*(0.253+0.7*0.253+0.49*0.253) = 0.0000134
And even at alpha=1.0, the weight for Has8Pawns would change from 80000 to 80000.0000134, which rounds back to 80000.

Can anybody tell what I am missing?
lantonov
Posts: 216
Joined: Sun Apr 13, 2014 5:19 pm

Re: TD-leaf(lambda)

Post by lantonov »

You apply tanh() too early. See Matthew Lai's dissertation on Giraffe (p. 23) for the correct implementation of tdleaf(lambda).
With the usual pawn = 100, your gradient example should be 100. This multiplied by 0.7^3 gives 34.3.
tanh(34.3) = 1.0 and so on.
You are not even obliged to use tanh(), especially if you use the absolute values of the gradients (L1) instead of their squares (L2).
lantonov
Posts: 216
Joined: Sun Apr 13, 2014 5:19 pm

Re: TD-leaf(lambda)

Post by lantonov »

Sorry, not dissertation but M.Sc. thesis:
http://arxiv.org/abs/1509.01549v1
Robert Pope
Posts: 558
Joined: Sat Mar 25, 2006 8:27 pm

Re: TD-leaf(lambda)

Post by Robert Pope »

lantonov wrote:You apply tanh() too early. See Matthew Lai's dissertation on Giraffe (p. 23) for the correct implementation of tdleaf(lambda).
With the usual pawn = 100, your gradient example should be 100. This multiplied by 0.7^3 gives 34.3.
tanh(34.3) = 1.0 and so on.
You are not even obliged to use tanh(), especially if you use the absolute values of the gradients (L1) instead of their squares (L2).
Thanks for the comments. I can definitely try it without the tanh and see what happens. Unfortunately, neither the Giraffe nor the KnightCap paper was very clear to me on applying the tanh. I'll have to re-read KnightCap with this in mind.

Your last sentence makes me wonder if there is something additional I am missing as well -- I'm not calculating an L1 or L2 distance anywhere in my algorithm. Am I missing something?
lantonov
Posts: 216
Joined: Sun Apr 13, 2014 5:19 pm

Re: TD-leaf(lambda)

Post by lantonov »

When you obtain a gradient, it can be positive or negative (or, rarely, 0). On the other hand, most programs that subsequently work with this gradient as an objective function, such as AdaGrad, AdaDelta, DE, etc., try to minimize that objective. If the gradient is negative, minimizing it would mean driving it to even lower negative values (increasing its absolute value), which is not what we want. We are comparing our estimates with some reference estimates, possibly obtained by a stronger engine (or deeper leaves), and we want the differences to be as small as possible in absolute value, regardless of their sign. In other words, we are trying to obtain a distribution with mean 0 and as little variance as possible. There are two ways to transform the gradient so that its minimum value is 0: (i) taking the absolute value (less sensitive) and (ii) squaring it (more prone to outliers). In cases with high variance and many outliers, such as chess engine tuning, one generally prefers (i).
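In code, the two choices over a vector of differences look like this (sketch):

Code: Select all

#include <cmath>
#include <vector>

// L1 objective: sum of absolute differences (less sensitive to outliers).
double l1Error(const std::vector<double>& diff)
{
    double s = 0.0;
    for (double d : diff) s += std::abs(d);
    return s;
}

// L2 objective: sum of squared differences (penalises outliers heavily).
double l2Error(const std::vector<double>& diff)
{
    double s = 0.0;
    for (double d : diff) s += d * d;
    return s;
}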
AlvaroBegue
Posts: 931
Joined: Tue Mar 09, 2010 3:46 pm
Location: New York
Full name: Álvaro Begué (RuyDos)

Re: TD-leaf(lambda)

Post by AlvaroBegue »

Without looking at all the details of what you did or how KnightCap worked, I have two comments:
(1) Gradient-descent algorithms are generally not scale invariant, so having a pawn value of 10000 might be part of the problem. Try with a more reasonable scale, like 1.
(2) You can use automatic differentiation to compute the gradient with respect to all parameters in one go, and with no numerical issues with the size of the constant (100) you add.

It's hard to understand why so few people know about automatic differentiation.
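To illustrate the idea, here is a toy forward-mode example with dual numbers (just a sketch, nothing engine-specific):

Code: Select all

#include <cmath>
#include <iostream>

// Forward-mode automatic differentiation with dual numbers: the value and the
// derivative are propagated together, so the gradient is exact and there is
// no finite-difference step size to pick.
struct Dual {
    double v;   // value
    double d;   // derivative with respect to the chosen parameter
};

Dual operator+(Dual a, Dual b) { return {a.v + b.v, a.d + b.d}; }
Dual operator*(Dual a, Dual b) { return {a.v * b.v, a.v * b.d + a.d * b.v}; }
Dual tanhD(Dual a) { double t = std::tanh(a.v); return {t, (1 - t * t) * a.d}; }

int main()
{
    // Differentiate tanh(beta * w * x) with respect to the weight w.
    Dual w    = {80000.0, 1.0};       // seed the derivative of the parameter
    Dual beta = {0.0000258725, 0.0};  // constants carry zero derivative
    Dual x    = {0.125, 0.0};         // feature value (made up for the example)

    Dual y = tanhD(beta * w * x);
    std::cout << "value " << y.v << "  d/dw " << y.d << "\n";
}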
matthewlai
Posts: 793
Joined: Sun Aug 03, 2014 4:48 am
Location: London, UK

Re: TD-leaf(lambda)

Post by matthewlai »

Robert Pope wrote:
lantonov wrote:You apply tanh() too early. See Matthew Lai's dissertation on Giraffe (p. 23) for the correct implementation of tdleaf(lambda).
With the usual pawn = 100, your gradient example should be 100. This multiplied by 0.7^3 gives 34.3.
tanh(34.3) = 1.0 and so on.
You are not even obliged to use tanh(), especially if you use the absolute values of the gradients (L1) instead of their squares (L2).
Thanks for the comments. I can definitely try it without the tanh and see what happens. Unfortunately, neither the Giraffe nor the KnightCap paper was very clear to me on applying the tanh. I'll have to re-read KnightCap with this in mind.

Your last sentence makes me wonder if there is something additional I am missing as well -- I'm not calculating an L1 or L2 distance anywhere in my algorithm. Am I missing something?
That's because tanh isn't really part of TD-leaf(lambda). It's part of the evaluation function, to transform a linear advantage into a probability of winning.

Intuitively, if you are 4 queens up, you have twice the linear advantage of being 2 queens up, even though the probability of winning is almost the same (almost 1 for both).

TDL(lambda) doesn't really care about that. You just need to take it into account when computing the gradient of your evaluation function.
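For a linear eval that just means the chain rule. A minimal sketch (made-up names; assumes J = sum of weight[k] * feature[k]):

Code: Select all

#include <cmath>
#include <cstddef>
#include <vector>

// If v = tanh(Beta * J) and J is linear in the weights, then by the chain rule
//   dv/dw[k] = Beta * (1 - v*v) * feature[k].
std::vector<double> evalGradient(const std::vector<double>& feature,
                                 double J, double Beta)
{
    double v = std::tanh(Beta * J);
    std::vector<double> grad(feature.size());
    for (std::size_t k = 0; k < feature.size(); ++k)
        grad[k] = Beta * (1.0 - v * v) * feature[k];
    return grad;
}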
Disclosure: I work for DeepMind on the AlphaZero project, but everything I say here is personal opinion and does not reflect the views of DeepMind / Alphabet.