I made some changes to my network and was not sure everything would still work. Suddenly, in debug mode, an error appeared indicating that the gradients were exploding, so I started debugging. After a few hours I gave up because I could not find what was wrong, and I reverted the changes. But ... the bug was still there.
Then I saw that I had changed my learning parameter too. So the bug was not in the code but in a wrong value of the learning parameter. Now I can start typing again, because the redo button doesn't bring back today's changes.
Deep misery
-
- Posts: 931
- Joined: Tue Mar 09, 2010 3:46 pm
- Location: New York
- Full name: Álvaro Begué (RuyDos)
Re: Deep misery
Why don't you use something like Mercurial or Git?
Exploding gradients usually mean your weights were initialized too large. There are additional techniques that ameliorate the problem, like ResNet, batch normalization and weight normalization.
-
- Posts: 7220
- Joined: Mon May 27, 2013 10:31 am
Re: Deep misery
I looked at batch normalization, but my network does not use mini-batches; it computes the gradient of the loss over one training example at a time.
I read that ELU or SELU might help and is faster than batch normalization. If not, I will switch over to mini-batches.
-
- Posts: 931
- Joined: Tue Mar 09, 2010 3:46 pm
- Location: New York
- Full name: Álvaro Begué (RuyDos)
Re: Deep misery
Henk wrote: I looked at batch normalization. But my network does not use mini-batches and computes gradient for loss over one training example. I read that probably ELU or SELU might help and is faster than using batch normalization. If not then I switch over to mini batches.
There is a fairly obvious modification of batch normalization, where you use an average of recent values (like a momentum term) to compute the mean and the standard deviation, instead of doing it over a minibatch.
However, if you use ResNet, you may not need to do anything else.
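A minimal sketch of the running-average idea described above, assuming one scalar activation at a time and an exponential moving average. The class name `RunningNorm` and the `momentum` and `eps` defaults are made up for illustration and do not come from any library. The sketch only covers the forward pass; how to treat the running statistics during backpropagation is left open.

```python
import math


class RunningNorm:
    """Normalize single activations with exponential moving averages.

    Hypothetical helper: it stands in for batch normalization when the
    network is trained on one example at a time.
    """

    def __init__(self, momentum=0.99, eps=1e-5):
        self.momentum = momentum  # weight given to the old estimate
        self.eps = eps            # guards against division by zero
        self.mean = 0.0
        self.var = 1.0

    def __call__(self, x):
        m = self.momentum
        # Blend the newest value into the running mean and variance.
        self.mean = m * self.mean + (1.0 - m) * x
        self.var = m * self.var + (1.0 - m) * (x - self.mean) ** 2
        # Normalize with the running statistics instead of minibatch ones.
        return (x - self.mean) / math.sqrt(self.var + self.eps)


# Usage: feed activations one by one, as in single-example training.
norm = RunningNorm(momentum=0.9)
outputs = [norm(x) for x in [5.0, 5.2, 4.8, 5.1, 4.9] * 40]
```

A smaller `momentum` makes the statistics follow recent values more closely, at the cost of noisier estimates.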
-
- Posts: 7220
- Joined: Mon May 27, 2013 10:31 am
Re: Deep misery
ResNet still uses batch normalization.
But I just found another tough bug in my backpropagation, which I had modified to implement parametric ELUs. So maybe these exploding deltas were caused by this bug too. I have to test to be sure.
My network is still shallow, so a ResNet solution is a bit premature.
I still have problems with computing gradients. So simple steps first.
-
- Posts: 931
- Joined: Tue Mar 09, 2010 3:46 pm
- Location: New York
- Full name: Álvaro Begué (RuyDos)
Re: Deep misery
Henk wrote: ResNet still uses batch normalization.
People generally use "ResNet" to refer to the use of skip connections every two layers. That's what I meant by it here. Even without BN this is a very good idea that makes learning more robust.
Henk wrote: But I just found another tough bug in my backpropagation which I modified for implementing parametric ELU's. So maybe these exploding deltas were caused by this bug too. Have to test to be sure.
Get rid of bugs. You should use automatic differentiation or symbolic differentiation, so you never have to worry about this again.
Henk wrote: My network was still shallow so a ResNet solution is a bit too early. I still have problems with computing gradients. So simple steps first.
Of course. I wouldn't even do any learning until your gradients are computed correctly.
You can compute the derivative of the loss function with respect to individual weights by using the formula (F(x+h)-F(x))/h, for some small h. Compare that with the gradient you got from backpropagation.
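The check described here can be sketched as follows. The tiny one-hidden-unit model and the names `loss`, `backprop_gradient`, `numeric_gradient` and `max_relative_error` are assumptions made for illustration; they are not the network discussed in this thread. The step size `h=1e-6` is also just a plausible starting value.

```python
import math


def loss(w, x, y):
    """Loss of a tiny illustrative model: 0.5 * (w2 * tanh(w0*x + w1) - y)^2."""
    hidden = math.tanh(w[0] * x + w[1])
    return 0.5 * (w[2] * hidden - y) ** 2


def backprop_gradient(w, x, y):
    """Hand-written chain-rule gradient (the code being checked)."""
    hidden = math.tanh(w[0] * x + w[1])
    err = w[2] * hidden - y                     # dLoss/dOutput
    d_pre = err * w[2] * (1.0 - hidden ** 2)    # back through tanh
    return [d_pre * x, d_pre, err * hidden]


def numeric_gradient(f, w, h=1e-6):
    """Forward-difference estimate (F(w+h) - F(w)) / h, one weight at a time."""
    base = f(w)
    grad = []
    for i in range(len(w)):
        bumped = list(w)
        bumped[i] += h
        grad.append((f(bumped) - base) / h)
    return grad


def max_relative_error(a, b):
    """Largest relative difference between two gradient vectors."""
    return max(abs(p - q) / max(abs(p), abs(q), 1e-12) for p, q in zip(a, b))


w = [0.3, -0.2, 0.7]
x, y = 1.5, 0.4
analytic = backprop_gradient(w, x, y)
numeric = numeric_gradient(lambda v: loss(v, x, y), w)
rel_err = max_relative_error(analytic, numeric)
print("max relative error:", rel_err)
```

A small relative error suggests the backpropagation code agrees with the finite-difference estimate; a large one on a particular weight points at the layer where the chain rule went wrong.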
-
- Posts: 7220
- Joined: Mon May 27, 2013 10:31 am
Re: Deep misery
AlvaroBegue wrote: You can compute the derivative of the loss function with respect to individual weights by using the formula (F(x+h)-F(x))/h, for some small h. Compare that with the gradient you got from backpropagation.
Yes, that's how I found these bugs. But they are still difficult to fix if you are overlooking something completely.
ResNet will give other gradients, so I expect more problems and tough bugs.
Maybe I'm too stupid to apply the chain rule. I should be glad I did not study mathematics.
Yes, a generic symbolic solution would be fine. But it will probably also be slow because it needs an interpreter.
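For reference, the automatic differentiation suggested earlier in the thread does not have to be symbolic. Below is a minimal forward-mode sketch using dual numbers, where each value carries its derivative along with it. The `Dual` class and the one-weight `loss` are invented for illustration. Forward mode yields one derivative per pass, so for a network with many weights reverse mode would be the usual choice; this sketch only shows the principle.

```python
class Dual:
    """A value paired with its derivative with respect to one chosen input."""

    def __init__(self, value, deriv=0.0):
        self.value = value
        self.deriv = deriv

    @staticmethod
    def lift(other):
        # Plain numbers are constants: their derivative is zero.
        return other if isinstance(other, Dual) else Dual(other)

    def __add__(self, other):
        other = Dual.lift(other)
        return Dual(self.value + other.value, self.deriv + other.deriv)

    __radd__ = __add__

    def __sub__(self, other):
        other = Dual.lift(other)
        return Dual(self.value - other.value, self.deriv - other.deriv)

    def __rsub__(self, other):
        return Dual.lift(other) - self

    def __mul__(self, other):
        other = Dual.lift(other)
        # Product rule.
        return Dual(self.value * other.value,
                    self.deriv * other.value + self.value * other.deriv)

    __rmul__ = __mul__


def leaky_relu(u, slope=0.01):
    u = Dual.lift(u)
    return u if u.value > 0 else u * slope


def loss(w, b, x, y):
    """Squared error of a single leaky-ReLU unit (illustrative only)."""
    err = leaky_relu(w * x + b) - y
    return err * err


def d_loss_d_w(w, b, x, y):
    # Seed w with derivative 1 to obtain dLoss/dw.
    return loss(Dual(w, 1.0), b, x, y).deriv


grad_pos = d_loss_d_w(0.5, 0.1, 2.0, 3.0)    # pre-activation is positive
grad_neg = d_loss_d_w(-0.5, 0.1, 2.0, 3.0)   # pre-activation is negative
```

The forward computation is written once and the derivative follows from the operator overloads, so there is no separate backpropagation code to get wrong.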
-
- Posts: 931
- Joined: Tue Mar 09, 2010 3:46 pm
- Location: New York
- Full name: Álvaro Begué (RuyDos)
Re: Deep misery
AlvaroBegue wrote: You can compute the derivative of the loss function with respect to individual weights by using the formula (F(x+h)-F(x))/h, for some small h. Compare that with the gradient you got from backpropagation.
Henk wrote: Yes that's how I found these bugs. But they are still difficult to fix if you are overlooking something completely. ResNet will give other gradients. So I expect more problems and tough bugs. Maybe I'm too stupid to apply a chain rule. Should be glad I did not study mathematics. Yes generic symbolic solution would be fine. But will probably also be slow because of using interpreter.
Can you post what your forward computation looks like? I might be able to suggest non-buggy ways to compute the gradient.
-
- Posts: 7220
- Joined: Mon May 27, 2013 10:31 am
Re: Deep misery
The test (F(x+h)-F(x))/h does not work in a discontinuous part of a function. For instance, if f(u) = u > 0 ? u : 0.01 * u and F(x + h) > 0 but F(x) < 0.
-
- Posts: 931
- Joined: Tue Mar 09, 2010 3:46 pm
- Location: New York
- Full name: Álvaro Begué (RuyDos)
Re: Deep misery
Henk wrote: Test (F(x+h)-F(x))/h does not work in discontinuous part of a function. For instance if f(u) = u > 0 ? u: 0.01 * u and F(x + h) > 0 but F(x) < 0
You can't expect equality, but you should get pretty close in most cases. If your h is judiciously chosen, the probability of hitting a corner like that will be very small.
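To make the corner case concrete, here is a small sketch with the leaky ReLU from the quoted post. The helper `straddles_kink` and the chosen values of `h`, `far` and `near` are assumptions for illustration. It shows the forward difference agreeing with the analytic slope away from zero, and landing between the two slopes when x and x + h sit on opposite sides of the kink.

```python
def leaky_relu(u, slope=0.01):
    return u if u > 0 else slope * u


def leaky_relu_grad(u, slope=0.01):
    """Analytic slope, as a backpropagation routine would use it."""
    return 1.0 if u > 0 else slope


def forward_difference(f, x, h):
    return (f(x + h) - f(x)) / h


def straddles_kink(x, h):
    """True when x and x + h lie on opposite sides of the kink at 0."""
    return (x > 0) != (x + h > 0)


h = 1e-4
far = 0.5           # well away from the kink
near = -0.25 * h    # x < 0 but x + h > 0

est_far = forward_difference(leaky_relu, far, h)
est_near = forward_difference(leaky_relu, near, h)

for x, est in ((far, est_far), (near, est_near)):
    if straddles_kink(x, h):
        print(f"x={x}: step crosses the kink, skip this comparison")
    else:
        print(f"x={x}: numeric {est:.6f} vs analytic {leaky_relu_grad(x)}")
```

With a small h the straddling interval is tiny, which is why such mismatches should be rare; a check like `straddles_kink` is one way to tell a genuine bug from a corner hit.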