I'm going through my second attempt at implementing TD-leaf. I've hopefully got it coded right and am just starting optimization runs.
One thing that seems really odd in my implementation is setting the scaling parameter alpha. The only guidance I found was to scale it down over time. I had assumed this meant start around alpha = 1.0 and decrease over time. But the "learn vector" that it gets multiplied by is indicating parameter changes on the order of 0.000001 centipawn. If I used alpha = 1.0, my eval would never change, since every adjustment rounds to zero.
What I've done in the meantime is solve for alpha after each game so that the biggest weight change is between 0.1 cp and 25 cp. But that means alpha must be in the millions. Is this consistent with other people's experience?
TD-leaf(lambda)
Moderators: hgm, Rebel, chrisw
-
- Posts: 793
- Joined: Sun Aug 03, 2014 4:48 am
- Location: London, UK
Re: TD-leaf(lambda)
Robert Pope wrote:I'm going through my second attempt at implementing TD-leaf. I've hopefully got it coded right and am just starting optimization runs.
One thing that seems really odd in my implementation is setting the scaling parameter alpha. The only guidance I found was to scale it down over time. I had assumed this meant start around alpha = 1.0 and decrease over time. But the "learn vector" that it gets multiplied by is indicating parameter changes on the order of 0.000001 centipawn. If I used alpha = 1.0, my eval would never change, since every adjustment rounds to zero.
What I've done in the meantime is solve for alpha after each game so that the biggest weight change is between 0.1 cp and 25 cp. But that means alpha must be in the millions. Is this consistent with other people's experience?
What is your exact update rule? Something doesn't sound right.
alpha is a learning rate. Usual values (assuming you are not using fancy optimization algorithms) are between 0.1 and 0.001.
alpha shouldn't require solving.
Disclosure: I work for DeepMind on the AlphaZero project, but everything I say here is personal opinion and does not reflect the views of DeepMind / Alphabet.
-
- Posts: 558
- Joined: Sat Mar 25, 2006 8:27 pm
Re: TD-leaf(lambda)
matthewlai wrote:What is your exact update rule? Something doesn't sound right.
alpha is a learning rate. Usual values (assuming you are not using fancy optimization algorithms) are between 0.1 and 0.001.
alpha shouldn't require solving.
I'm intending to use the exact update rule in the KnightCap paper:
w := w + alpha * sum[t=1 to N-1]( gradient J(position t) * sum[j=t to N-1]( lambda^(j-t) * d(j) ) )
I do the v=tanh(Beta*J(t)) substitution in the above, like KnightCap did.
Maybe I missed a conversion somewhere - now that I know something isn't right, I'll trace through an example.
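For what it's worth, that update rule can be sketched in a few lines of Python. This is only a minimal illustration of the double sum, not the KnightCap code; the function name `td_leaf_update` and the array layout are my own:

```python
def td_leaf_update(w, gradients, diffs, alpha=0.01, lam=0.7):
    """KnightCap-style TD-leaf(lambda) update.

    w            : weight vector (list of floats)
    gradients[t] : gradient of the squashed leaf eval at move t w.r.t. each weight
    diffs[t]     : temporal difference d(t) = v(t+1) - v(t)
    """
    n = len(diffs)
    for k in range(len(w)):
        total = 0.0
        for t in range(n):
            # discounted sum of the current and future temporal differences
            inner = sum(lam ** (j - t) * diffs[j] for j in range(t, n))
            total += gradients[t][k] * inner
        w[k] += alpha * total
    return w
```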
-
- Posts: 558
- Joined: Sat Mar 25, 2006 8:27 pm
Re: TD-leaf(lambda)
Okay, I've worked through an example. I think it shows there is a problem with my indexes, though I don't see where. But it also shows that I still get ridiculously small adjustments:
The game played (with learning by white) is:
1. Na3 e5 2. Nf3 e4 3. Ng1 e3 4. dxe3 h6 5. Nf3 a6 6. Nb1
I stripped out all evaluation terms except material. I'm just going to show the term for "Has8pawns" in the weight vector calculations.
Step 1: Select Beta to convert scores from (-inf,inf) to (-1,1).
We want a 1 pawn advantage to be around 0.25. In my evaluation function, a pawn is worth 10,000, so I selected Beta=0.0000258725, so tanh(Beta*10000)=0.2531.
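As a quick numeric check, Beta can be derived by inverting the tanh (using Python's standard math module; the variable names are my own):

```python
import math

one_pawn = 10000                         # pawn value in the poster's eval scale
beta = math.atanh(0.2531026) / one_pawn  # solve tanh(beta * one_pawn) = 0.2531026
print(beta)                              # ~0.0000258725
print(math.tanh(beta * one_pawn))        # 0.2531026
```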
Step 2: Leaf Scores returned by search, and converted values
We did N=6 searches and found we could capture a pawn on move 4:
i  score(i)  v(i)
1  0         0.0000000
2  0         0.0000000
3  0         0.0000000
4  10000     0.2531026
5  10000     0.2531026
6  10000     0.2531026
Step 3: Calculate temporal differences d(t) = v(t+1) - v(t)
i  d(i)
1  0
2  0
3  0.2531026
4  0
5  0
Step 4: Select lambda
I picked lambda =0.7.
Step 5: Calculate gradients/partial derivatives at each PV-leaf
The position is symmetric at the first 3 nodes, so the gradient is 0.
In the search of move 4, white has 8 pawns and black only 7, so the leaf evaluation is 10000. To get the gradient of the Has8Pawns term, I added 100 to the weight, which changed the evaluation to 10100.
Gradient(4)=(tanh(Beta*10100)-tanh(Beta*10000))/100=0.0000242
i  gradient(i)
1  0
2  0
3  0
4  0.0000242
5  0.0000242
Step 6: Calculate the TD-Leaf(lambda) update
w = w + alpha * Sum[t=1 to 5]( gradient(t) * Sum[j=t to 5]( lambda^(j-t) * d(j) ) )
Since gradient(t) is 0 for t = 1-3, the double sum expands to:
gradient(3)*(d(3) + 0.7*d(4) + 0.49*d(5))
+ gradient(4)*(d(4) + 0.7*d(5))
+ gradient(5)*d(5)
= 0*(0.2531 + 0.7*0 + 0.49*0) + 0.0000242*(0 + 0.7*0) + 0.0000242*0
= 0
So no update occurred, since in every term of the sum either the gradient or the temporal difference is 0.
But even if they did overlap, I would get an extremely small weight adjustment, i.e. 0.0000242*(0.253+0.7*0.253+0.49*0.253) = 0.0000134
And even at alpha=1.0, the weight for Has8Pawns would change from 80000 to 80000.0000134, which rounds back to 80000.
Can anybody tell what I am missing?
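The whole worked example fits in a short Python script, which reproduces both the zero update and the tiny hypothetical adjustment (a sketch of the poster's numbers, not engine code):

```python
import math

beta = 0.0000258725
lam = 0.7

scores = [0, 0, 0, 10000, 10000, 10000]          # leaf scores for moves 1..6
v = [math.tanh(beta * s) for s in scores]        # squashed to (-1, 1)
d = [v[i + 1] - v[i] for i in range(5)]          # temporal differences

# finite-difference gradient of the squashed eval w.r.t. the Has8Pawns weight
g_leaf = (math.tanh(beta * 10100) - math.tanh(beta * 10000)) / 100
grad = [0.0, 0.0, 0.0, g_leaf, g_leaf]           # first three positions symmetric

update = sum(grad[t] * sum(lam ** (j - t) * d[j] for j in range(t, 5))
             for t in range(5))
print(update)   # 0.0 -- the nonzero gradient and the nonzero d(t) never overlap
```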
-
- Posts: 216
- Joined: Sun Apr 13, 2014 5:19 pm
Re: TD-leaf(lambda)
You apply tanh() too early. See Matthew Lai's dissertation on Giraffe (p. 23) for the correct implementation of tdleaf(lambda).
With the usual pawn = 100, your gradient example should be 100. This multiplied by 0.7^3 gives 34.3.
tanh(34.3) = 1.0 and so on.
You are even not obliged to use tanh() especially if you use the absolutes of gradients (L1) instead of their squares (L2).
-
- Posts: 216
- Joined: Sun Apr 13, 2014 5:19 pm
Re: TD-leaf(lambda)
Sorry, not dissertation but M.Sc. thesis:
http://arxiv.org/abs/1509.01549v1
-
- Posts: 558
- Joined: Sat Mar 25, 2006 8:27 pm
Re: TD-leaf(lambda)
lantonov wrote:You apply tanh() too early. See Matthew Lai's dissertation on Giraffe (p. 23) for the correct implementation of tdleaf(lambda).
With the usual pawn = 100, your gradient example should be 100. This multiplied by 0.7^3 gives 34.3.
tanh(34.3) = 1.0 and so on.
You are even not obliged to use tanh() especially if you use the absolutes of gradients (L1) instead of their squares (L2).
Thanks for the comments. I can definitely try without the tanh and see what happens. Unfortunately, neither the Giraffe nor the KnightCap paper was very clear to me on applying the tanh. I'll have to re-read KnightCap with this in mind.
Your last sentence makes me wonder if there is something additional I am missing as well -- I'm not calculating an L1 or L2 distance anywhere in my algorithm. Am I missing something?
-
- Posts: 216
- Joined: Sun Apr 13, 2014 5:19 pm
Re: TD-leaf(lambda)
When you obtain a gradient, it can be positive or negative (or, rarely, 0). On the other hand, most programs that subsequently work with this gradient as an objective function, such as AdaGrad, AdaDelta, DE etc., try to minimize that objective. If the gradient is negative, minimizing it would mean driving it to ever lower negative values (increasing absolute value), which is not what we want.
We are comparing our estimates with some reference estimates, possibly obtained by a stronger engine (or deeper leaves), and we want the differences to be as small as possible in absolute value, regardless of their direction (+ or -). In other words, we are trying to obtain a distribution with mean = 0 and variance as small as possible.
There are two ways to transform the gradient so that its minimal value is 0: (i) taking the absolute value (less sensitive) and (ii) squaring it (more prone to outliers). In cases with high variation and many outliers, such as chess engine tuning, one generally prefers (i).
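The L1-vs-L2 sensitivity can be illustrated with a toy example (the error values below are made up for illustration):

```python
def l1_loss(errors):
    # mean absolute error: each error counts in proportion to its size
    return sum(abs(e) for e in errors) / len(errors)

def l2_loss(errors):
    # mean squared error: large errors dominate quadratically
    return sum(e * e for e in errors) / len(errors)

errors = [0.1, -0.1, 0.05, 2.0]   # three small errors plus one outlier
print(l1_loss(errors))   # 0.5625   -- outlier contributes ~89% of the total
print(l2_loss(errors))   # 1.005625 -- outlier contributes ~99% of the total
```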
-
- Posts: 931
- Joined: Tue Mar 09, 2010 3:46 pm
- Location: New York
- Full name: Álvaro Begué (RuyDos)
Re: TD-leaf(lambda)
Without looking at all the details of what you did or how KnightCap worked, I have two comments:
(1) Gradient-descent algorithms are generally not scale invariant, so having a pawn value of 10000 might be part of the problem. Try with a more reasonable scale, like 1.
(2) You can use automatic differentiation to compute the gradient with respect to all parameters in one go, and with no numerical issues with the size of the constant (100) you add.
It's hard to understand why so few people know about automatic differentiation.
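To illustrate the automatic differentiation point, here is a minimal forward-mode sketch using a hand-rolled dual-number class (not a real AD library), applied to the tanh-squashed material eval from the example in this thread. It gives the exact derivative with no finite-difference step to choose:

```python
import math

class Dual:
    """Minimal forward-mode dual number: a value plus its derivative."""
    def __init__(self, val, dot=0.0):
        self.val, self.dot = val, dot

    def __add__(self, other):
        other = other if isinstance(other, Dual) else Dual(other)
        return Dual(self.val + other.val, self.dot + other.dot)
    __radd__ = __add__

    def __mul__(self, other):
        other = other if isinstance(other, Dual) else Dual(other)
        return Dual(self.val * other.val,
                    self.val * other.dot + self.dot * other.val)
    __rmul__ = __mul__

    def tanh(self):
        t = math.tanh(self.val)
        return Dual(t, (1.0 - t * t) * self.dot)   # d/dx tanh(x) = 1 - tanh(x)^2

beta = 0.0000258725
w = Dual(10000.0, 1.0)      # seed dot = 1 to differentiate w.r.t. this weight
v = (beta * w).tanh()       # squashed eval, material term only
print(v.val, v.dot)         # value ~0.2531, exact gradient ~0.0000242
```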
-
- Posts: 793
- Joined: Sun Aug 03, 2014 4:48 am
- Location: London, UK
Re: TD-leaf(lambda)
Robert Pope wrote:Thanks for the comments. I can definitely try without the tanh and see what happens. Unfortunately, neither the Giraffe nor the KnightCap paper was very clear to me on applying the tanh. I'll have to re-read KnightCap with this in mind.
Your last sentence makes me wonder if there is something additional I am missing as well -- I'm not calculating an L1 or L2 distance anywhere in my algorithm. Am I missing something?
That's because tanh isn't really part of TD-leaf(lambda). It's part of the evaluation function, to transform a linear advantage into a probability of winning.
Intuitively, if you are 4 queens up, you'll have twice the linear advantage of being 2 queens up, even though the probability of winning is almost the same (almost 1 for both).
TD-leaf(lambda) doesn't really care about that. You just need to take it into account when computing the gradient of your evaluation function.
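The queens intuition checks out numerically on the scale used earlier in the thread (the queen value of ~9 pawns below is my assumption, not from the posts):

```python
import math

beta = 0.0000258725     # squashing scale from the earlier example (pawn = 10000)
queen = 90000           # assumed queen value of ~9 pawns on that scale

two_queens_up = math.tanh(beta * 2 * queen)
four_queens_up = math.tanh(beta * 4 * queen)
print(two_queens_up, four_queens_up)   # both ~1.0 despite 2x the linear advantage
```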
Disclosure: I work for DeepMind on the AlphaZero project, but everything I say here is personal opinion and does not reflect the views of DeepMind / Alphabet.