Deep misery

Henk
Posts: 5083
Joined: Mon May 27, 2013 8:31 am

Deep misery

Post by Henk » Wed Feb 07, 2018 8:34 pm

I made some changes to my network and was not sure everything would still work. Suddenly, in debug mode, an error appeared indicating that the gradients were exploding, so I started debugging. After a few hours I gave up because I could not find what was wrong, so I reverted the changes. But ... the bug was still there.

Then I saw that I had changed my learning parameter too. So the bug was not in the code but in a wrong value of the learning parameter. Now I can start typing again, because the redo button doesn't redo today's changes.

AlvaroBegue
Posts: 881
Joined: Tue Mar 09, 2010 2:46 pm
Location: New York

Re: Deep misery

Post by AlvaroBegue » Wed Feb 07, 2018 8:48 pm

Why don't you use something like Mercurial or Git?

Exploding gradients usually mean your weights were initialized too large. There are additional techniques that ameliorate the problem, like ResNet, batch normalization and weight normalization.

Henk
Posts: 5083
Joined: Mon May 27, 2013 8:31 am

Re: Deep misery

Post by Henk » Wed Feb 07, 2018 8:55 pm

I looked at batch normalization. But my network does not use mini-batches; it computes the gradient of the loss over a single training example.

I read that ELU or SELU might help and that they are faster than batch normalization. If not, I will switch over to mini-batches.
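
For reference, a minimal sketch of the ELU and SELU activations and their derivatives, which is what a hand-written backpropagation needs. The function names are illustrative, not taken from the poster's code; the two SELU constants are the standard published values.

```python
import math

# Standard SELU constants (Klambauer et al., 2017).
SELU_ALPHA = 1.6732632423543772
SELU_SCALE = 1.0507009873554805


def elu(x: float, alpha: float = 1.0) -> float:
    """ELU: identity for positive inputs, alpha * (exp(x) - 1) otherwise."""
    return x if x > 0.0 else alpha * math.expm1(x)


def elu_grad(x: float, alpha: float = 1.0) -> float:
    """Derivative of ELU with respect to its input x."""
    return 1.0 if x > 0.0 else alpha * math.exp(x)


def selu(x: float) -> float:
    """SELU is a scaled ELU with fixed alpha and scale."""
    return SELU_SCALE * elu(x, SELU_ALPHA)


def selu_grad(x: float) -> float:
    """Derivative of SELU with respect to its input x."""
    return SELU_SCALE * elu_grad(x, SELU_ALPHA)
```

A parametric ELU would additionally need the derivative with respect to alpha, which for x <= 0 is exp(x) - 1 and for x > 0 is zero.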

AlvaroBegue
Posts: 881
Joined: Tue Mar 09, 2010 2:46 pm
Location: New York

Re: Deep misery

Post by AlvaroBegue » Wed Feb 07, 2018 9:08 pm

Henk wrote:I looked at batch normalization. But my network does not use mini-batches; it computes the gradient of the loss over a single training example.

I read that ELU or SELU might help and that they are faster than batch normalization. If not, I will switch over to mini-batches.
There is a fairly obvious modification of batch normalization, where you use an average of recent values (like a momentum term) to compute the mean and the standard deviation, instead of doing it over a minibatch.
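
A minimal sketch of that running-average idea for a single scalar activation, assuming exponential moving averages of the value and its square. The class name, the momentum default and the epsilon are illustrative choices, and the statistics are treated as constants in the backward pass, which is a simplification.

```python
import math


class RunningNorm:
    """Normalizes one scalar activation using running mean and variance."""

    def __init__(self, momentum: float = 0.99, eps: float = 1e-5) -> None:
        self.momentum = momentum  # weight given to the old statistics
        self.eps = eps            # avoids division by zero
        self.mean = 0.0           # running average of x
        self.mean_sq = 1.0        # running average of x * x

    def update(self, x: float) -> None:
        """Fold one new activation value into the running statistics."""
        m = self.momentum
        self.mean = m * self.mean + (1.0 - m) * x
        self.mean_sq = m * self.mean_sq + (1.0 - m) * x * x

    def variance(self) -> float:
        """Running variance, clamped at zero to absorb rounding error."""
        return max(0.0, self.mean_sq - self.mean * self.mean)

    def normalize(self, x: float) -> float:
        """Shift and scale x with the current running statistics."""
        return (x - self.mean) / math.sqrt(self.variance() + self.eps)
```

With one training example per step, `update` would be called once per forward pass for each normalized unit, followed by `normalize`.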

However, if you use ResNet, you may not need to do anything else.

Henk
Posts: 5083
Joined: Mon May 27, 2013 8:31 am

Re: Deep misery

Post by Henk » Thu Feb 08, 2018 4:42 pm

ResNet still uses batch normalization.

But I just found another tough bug in my backpropagation, which I had modified to implement parametric ELUs. So maybe these exploding deltas were caused by this bug too. I have to test to be sure.

My network is still shallow, so a ResNet solution is a bit too early.

I still have problems with computing gradients. So simple steps first.

AlvaroBegue
Posts: 881
Joined: Tue Mar 09, 2010 2:46 pm
Location: New York

Re: Deep misery

Post by AlvaroBegue » Thu Feb 08, 2018 4:56 pm

Henk wrote:ResNet still uses batch normalization.
People generally use "ResNet" to refer to the use of skip connections every two layers. That's what I meant by it here. Even without BN this is a very good idea that makes learning more robust.
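
A minimal sketch of one such block, assuming plain fully connected layers with a ReLU in between (the thread does not say which layer types are used). All names are illustrative.

```python
from typing import List

Vector = List[float]
Matrix = List[List[float]]


def matvec(w: Matrix, x: Vector) -> Vector:
    """Multiply matrix w (a list of rows) by vector x."""
    return [sum(wij * xj for wij, xj in zip(row, x)) for row in w]


def relu(x: Vector) -> Vector:
    """Element-wise max(0, x)."""
    return [max(0.0, v) for v in x]


def residual_block(x: Vector, w1: Matrix, w2: Matrix) -> Vector:
    """Two layers plus a skip connection: y = x + W2 * relu(W1 * x)."""
    hidden = relu(matvec(w1, x))
    update = matvec(w2, hidden)
    # The skip connection adds the block input back onto the output.
    return [xi + ui for xi, ui in zip(x, update)]
```

Because y = x + f(x), the derivative of y with respect to x is the identity plus the derivative of f, so the gradient always has a direct path through the block.
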
Henk wrote:But I just found another tough bug in my backpropagation, which I had modified to implement parametric ELUs. So maybe these exploding deltas were caused by this bug too. I have to test to be sure.
Get rid of bugs. You should use automatic differentiation or symbolic differentiation, so you never have to worry about this again.
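
To illustrate what automatic differentiation means here, a minimal sketch of scalar reverse-mode differentiation. The `Var` class and the helper names are hypothetical, written for this example only; a real network would use vectorized operations or an existing library.

```python
from typing import List, Set, Tuple


class Var:
    """A scalar node that remembers how it was computed."""

    def __init__(self, value: float,
                 parents: Tuple[Tuple["Var", float], ...] = ()) -> None:
        self.value = value
        self.parents = parents  # pairs of (input node, local derivative)
        self.grad = 0.0         # d(output)/d(this node), set by backward()

    def __add__(self, other: "Var") -> "Var":
        return Var(self.value + other.value, ((self, 1.0), (other, 1.0)))

    def __mul__(self, other: "Var") -> "Var":
        return Var(self.value * other.value,
                   ((self, other.value), (other, self.value)))


def leaky_relu(x: Var, slope: float = 0.01) -> Var:
    """Leaky ReLU with its local derivative recorded on the node."""
    if x.value > 0.0:
        return Var(x.value, ((x, 1.0),))
    return Var(slope * x.value, ((x, slope),))


def backward(output: Var) -> None:
    """Apply the chain rule from the output back to every input."""
    order: List[Var] = []
    seen: Set[int] = set()

    def visit(node: Var) -> None:
        # Depth-first post-order: inputs are listed before their consumers.
        if id(node) in seen:
            return
        seen.add(id(node))
        for parent, _ in node.parents:
            visit(parent)
        order.append(node)

    visit(output)
    output.grad = 1.0
    for node in reversed(order):
        for parent, local in node.parents:
            parent.grad += local * node.grad


# Usage: loss = leaky_relu(w * x + b) squared.
w, x, b = Var(2.0), Var(3.0), Var(-10.0)
y = leaky_relu(w * x + b)
loss = y * y
backward(loss)  # w.grad, x.grad and b.grad now hold the derivatives
```

Adding a new activation only requires stating its local derivative once; the chain rule bookkeeping is shared by every operation.
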
Henk wrote:My network is still shallow, so a ResNet solution is a bit too early.

I still have problems with computing gradients. So simple steps first.
Of course. I wouldn't even do any learning until your gradients are computed correctly.

You can compute the derivative of the loss function with respect to individual weights by using the formula (F(x+h)-F(x))/h, for some small h. Compare that with the gradient you got from backpropagation.
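
A minimal sketch of that check. The tiny linear model used as the loss is a stand-in chosen for this example, and every function name is illustrative; in practice `example_gradient` would be replaced by the gradient coming out of backpropagation.

```python
from typing import Callable, List


def numerical_gradient(loss: Callable[[List[float]], float],
                       weights: List[float],
                       h: float = 1e-6) -> List[float]:
    """Approximate d(loss)/d(w_i) by (F(w + h*e_i) - F(w)) / h for each i."""
    base = loss(weights)
    grads = []
    for i in range(len(weights)):
        bumped = list(weights)
        bumped[i] += h  # perturb one weight at a time
        grads.append((loss(bumped) - base) / h)
    return grads


def example_loss(weights: List[float]) -> float:
    """Squared error of a one-input linear model on a single example."""
    x, target = 3.0, 1.0
    prediction = weights[0] * x + weights[1]
    return (prediction - target) ** 2


def example_gradient(weights: List[float]) -> List[float]:
    """Hand-derived gradient of example_loss (stand-in for backprop)."""
    x, target = 3.0, 1.0
    error = weights[0] * x + weights[1] - target
    return [2.0 * error * x, 2.0 * error]


def max_abs_difference(a: List[float], b: List[float]) -> float:
    """Largest component-wise gap between two gradients."""
    return max(abs(u - v) for u, v in zip(a, b))


weights = [0.5, -0.2]
gap = max_abs_difference(numerical_gradient(example_loss, weights),
                         example_gradient(weights))
print("largest gap between numerical and analytic gradient:", gap)
```

The two gradients will not match exactly, so the comparison needs a tolerance rather than an equality test.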

Henk
Posts: 5083
Joined: Mon May 27, 2013 8:31 am

Re: Deep misery

Post by Henk » Thu Feb 08, 2018 5:17 pm

AlvaroBegue wrote:You can compute the derivative of the loss function with respect to individual weights by using the formula (F(x+h)-F(x))/h, for some small h. Compare that with the gradient you got from backpropagation.
Yes, that's how I found these bugs. But they are still difficult to fix if you are overlooking something completely.

ResNet will give other gradients. So I expect more problems and tough bugs.

Maybe I'm too stupid to apply the chain rule. Should be glad I did not study mathematics.

Yes, a generic symbolic solution would be fine. But it will probably also be slow because it needs an interpreter.

AlvaroBegue
Posts: 881
Joined: Tue Mar 09, 2010 2:46 pm
Location: New York

Re: Deep misery

Post by AlvaroBegue » Thu Feb 08, 2018 6:03 pm

Henk wrote:
AlvaroBegue wrote:You can compute the derivative of the loss function with respect to individual weights by using the formula (F(x+h)-F(x))/h, for some small h. Compare that with the gradient you got from backpropagation.
Yes, that's how I found these bugs. But they are still difficult to fix if you are overlooking something completely.

ResNet will give other gradients. So I expect more problems and tough bugs.

Maybe I'm too stupid to apply the chain rule. Should be glad I did not study mathematics.

Yes, a generic symbolic solution would be fine. But it will probably also be slow because it needs an interpreter.
Can you post what your forward computation looks like? I might be able to suggest non-buggy ways to compute the gradient.

Henk
Posts: 5083
Joined: Mon May 27, 2013 8:31 am

Re: Deep misery

Post by Henk » Fri Feb 09, 2018 11:19 am

The test (F(x+h)-F(x))/h does not work at a kink of a function, where the derivative is discontinuous. For instance, if f(u) = u > 0 ? u : 0.01 * u and F(x + h) > 0 but F(x) < 0.

AlvaroBegue
Posts: 881
Joined: Tue Mar 09, 2010 2:46 pm
Location: New York

Re: Deep misery

Post by AlvaroBegue » Fri Feb 09, 2018 12:38 pm

Henk wrote:The test (F(x+h)-F(x))/h does not work at a kink of a function, where the derivative is discontinuous. For instance, if f(u) = u > 0 ? u : 0.01 * u and F(x + h) > 0 but F(x) < 0.
You can't expect equality, but you should get pretty close in most cases. If your h is judiciously chosen, the probability of hitting a corner like that will be very small.
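
A minimal sketch of the corner case, using the leaky ReLU from the quoted post with slope 0.01. The helper `straddles_kink` is a hypothetical addition: one possible way to skip the rare check where x and x + h lie on opposite sides of zero.

```python
from typing import Callable


def leaky_relu(u: float, slope: float = 0.01) -> float:
    """f(u) = u > 0 ? u : slope * u"""
    return u if u > 0.0 else slope * u


def leaky_relu_grad(u: float, slope: float = 0.01) -> float:
    """Derivative of leaky_relu away from u = 0."""
    return 1.0 if u > 0.0 else slope


def forward_difference(f: Callable[[float], float], x: float, h: float) -> float:
    """The test from the thread: (F(x + h) - F(x)) / h."""
    return (f(x + h) - f(x)) / h


def straddles_kink(x: float, h: float) -> bool:
    """True when x and x + h are on opposite sides of the kink at zero."""
    return (x > 0.0) != (x + h > 0.0)


h = 1e-6
for x in (-0.3, 0.7, -0.5e-6):
    estimate = forward_difference(leaky_relu, x, h)
    exact = leaky_relu_grad(x)
    print(x, estimate, exact, "straddles kink:", straddles_kink(x, h))
```

Only the last input, which lies within h of zero, produces a clear mismatch; for inputs drawn from ordinary training data that situation should be rare when h is small.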
