Deep misery

Henk · Post by **Henk** » Wed Feb 07, 2018 9:34 pm

I made some changes to my network and wasn't sure everything would still work. Suddenly, in debug mode, an error appeared indicating that the gradients were exploding, so I started debugging. After a few hours I gave up because I could not find what was wrong, so I reverted the changes. But ... the bug was still there.

Then I saw that I had changed my learning parameter too. So the bug was not in the code but in a wrong value of the learning parameter. Now I can start typing again, because the redo button doesn't redo today's changes.

AlvaroBegue · Post by **AlvaroBegue** » Wed Feb 07, 2018 9:48 pm

Why don't you use something like Mercurial or Git?

Exploding gradients usually mean your weights were initialized with values that are too large. There are additional techniques that ameliorate the problem, like ResNet, batch normalization and weight normalization.

Henk · Post by **Henk** » Wed Feb 07, 2018 9:55 pm

I looked at batch normalization, but my network does not use mini-batches; it computes the gradient of the loss over one training example.

I read that ELU or SELU might help and is probably faster than using batch normalization. If not, I will switch over to mini-batches.

AlvaroBegue · Post by **AlvaroBegue** » Wed Feb 07, 2018 10:08 pm

Henk wrote:I looked at batch normalization, but my network does not use mini-batches; it computes the gradient of the loss over one training example.

I read that ELU or SELU might help and is probably faster than using batch normalization. If not, I will switch over to mini-batches.

There is a fairly obvious modification of batch normalization, where you use an average of recent values (like a momentum term) to compute the mean and the standard deviation, instead of doing it over a minibatch.
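One way to read that suggestion, as a minimal sketch for a single scalar activation: keep exponential moving averages of the mean and variance and normalize each new value with them. The class name `RunningNorm` and the parameters `momentum` and `eps` are assumptions made for illustration, and only the forward pass is shown.

```python
import math


class RunningNorm:
    """Normalizes one scalar activation with running statistics.

    Hypothetical helper: `momentum` and `eps` are assumed names and values.
    """

    def __init__(self, momentum=0.99, eps=1e-5):
        self.momentum = momentum
        self.eps = eps
        self.mean = 0.0
        self.var = 1.0

    def __call__(self, x):
        # Blend the newest value into the running mean and variance.
        m = self.momentum
        self.mean = m * self.mean + (1.0 - m) * x
        self.var = m * self.var + (1.0 - m) * (x - self.mean) ** 2
        # Normalize with the running statistics instead of minibatch statistics.
        return (x - self.mean) / math.sqrt(self.var + self.eps)


norm = RunningNorm()
for i in range(2000):
    # Toy stream of activations alternating between 4 and 6.
    y = norm(4.0 if i % 2 == 0 else 6.0)
```

After enough samples the running mean settles near the average of the stream, so the normalized output stays in a fixed range even though only one example is processed at a time.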

However, if you use ResNet, you may not need to do anything else.

Henk · Post by **Henk** » Thu Feb 08, 2018 5:42 pm

ResNet still uses batch normalization.

But I just found another tough bug in my backpropagation, which I had modified to implement parametric ELUs. So maybe these exploding deltas were caused by this bug too. I have to test to be sure.

My network was still shallow so a ResNet solution is a bit too early.

I still have problems with computing gradients. So simple steps first.

AlvaroBegue · Post by **AlvaroBegue** » Thu Feb 08, 2018 5:56 pm

Henk wrote:ResNet still uses batch normalization.

People generally use "ResNet" to refer to the use of skip connections every two layers. That's what I meant by it here. Even without BN this is a very good idea that makes learning more robust.
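A skip connection around two layers can be sketched as follows. This is an illustrative block on plain Python lists, not code from either poster; the function names and the placement of the final ReLU after the addition are assumptions.

```python
def dense(weights, bias, x):
    """Affine layer: `weights` is a list of rows, `x` a list of floats."""
    return [sum(w * xi for w, xi in zip(row, x)) + b
            for row, b in zip(weights, bias)]


def relu(v):
    return [max(0.0, a) for a in v]


def residual_block(x, w1, b1, w2, b2):
    """Computes relu(x + W2 * relu(W1 * x + b1) + b2)."""
    h = relu(dense(w1, b1, x))
    f = dense(w2, b2, h)
    # Skip connection: add the block input to the output of the two layers.
    return relu([xi + fi for xi, fi in zip(x, f)])


zeros = [[0.0, 0.0], [0.0, 0.0]]
out = residual_block([1.0, 2.0], zeros, [0.0, 0.0], zeros, [0.0, 0.0])
```

With all weights at zero the block passes a non-negative input through unchanged, so the two layers only have to learn a correction to the identity.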

Henk wrote:But I just found another tough bug in my backpropagation, which I had modified to implement parametric ELUs. So maybe these exploding deltas were caused by this bug too. I have to test to be sure.

Get rid of bugs. You should use automatic differentiation or symbolic differentiation, so you never have to worry about this again.
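To give a feel for automatic differentiation, here is a minimal forward-mode sketch using dual numbers. It is a toy, not a real library, and frameworks usually use reverse mode for networks with many weights. The `Dual` class, the `tanh` wrapper and the one-weight `loss` are all hypothetical.

```python
import math


class Dual:
    """Number of the form value + deriv * eps, where eps**2 == 0."""

    def __init__(self, value, deriv=0.0):
        self.value = value
        self.deriv = deriv

    @staticmethod
    def lift(other):
        return other if isinstance(other, Dual) else Dual(other)

    def __add__(self, other):
        other = Dual.lift(other)
        return Dual(self.value + other.value, self.deriv + other.deriv)

    __radd__ = __add__

    def __sub__(self, other):
        other = Dual.lift(other)
        return Dual(self.value - other.value, self.deriv - other.deriv)

    def __rsub__(self, other):
        return Dual.lift(other) - self

    def __mul__(self, other):
        other = Dual.lift(other)
        # Product rule.
        return Dual(self.value * other.value,
                    self.deriv * other.value + self.value * other.deriv)

    __rmul__ = __mul__


def tanh(d):
    t = math.tanh(d.value)
    # Chain rule: d/dx tanh(u) = (1 - tanh(u)**2) * du/dx.
    return Dual(t, (1.0 - t * t) * d.deriv)


def loss(w, x=0.5, target=0.2):
    """Hypothetical one-weight model: squared error of tanh(w * x)."""
    err = tanh(w * x) - target
    return err * err


def dloss_dw(w):
    # Seeding the derivative of w with 1 yields d(loss)/dw.
    return loss(Dual(w, 1.0)).deriv
```

The derivative falls out of the same code that computes the loss, so there is no separate hand-written backward pass to get wrong.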

Henk wrote:My network was still shallow so a ResNet solution is a bit too early.

I still have problems with computing gradients. So simple steps first.

Of course. I wouldn't even do any learning until your gradients are computed correctly.

You can compute the derivative of the loss function with respect to individual weights by using the formula (F(x+h)-F(x))/h, for some small h. Compare that with the gradient you got from backpropagation.
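The check described above can be sketched like this. The tiny `loss` and its hand-written `loss_gradient` are hypothetical stand-ins for a real network and its backpropagation, and the step `h=1e-6` is an assumed value.

```python
def numerical_gradient(f, weights, h=1e-6):
    """Approximates df/dw_i with (f(w + h * e_i) - f(w)) / h for every weight."""
    base = f(weights)
    grad = []
    for i in range(len(weights)):
        shifted = list(weights)
        shifted[i] += h
        grad.append((f(shifted) - base) / h)
    return grad


def max_abs_difference(a, b):
    return max(abs(x - y) for x, y in zip(a, b))


def loss(w):
    # Hypothetical loss: squared error of a two-weight linear model.
    return (2.0 * w[0] - 1.0 * w[1] - 3.0) ** 2


def loss_gradient(w):
    # Hand-derived gradient, playing the role of backpropagation.
    err = 2.0 * w[0] - 1.0 * w[1] - 3.0
    return [2.0 * err * 2.0, 2.0 * err * -1.0]


weights = [0.5, 1.5]
gap = max_abs_difference(numerical_gradient(loss, weights),
                         loss_gradient(weights))
```

A small `gap` suggests the analytic gradient is right; a large one on a particular weight points at the layer where the bug lives.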

Henk · Post by **Henk** » Thu Feb 08, 2018 6:17 pm

AlvaroBegue wrote:You can compute the derivative of the loss function with respect to individual weights by using the formula (F(x+h)-F(x))/h, for some small h. Compare that with the gradient you got from backpropagation.

Yes, that's how I found these bugs. But they are still difficult to fix if you are overlooking something completely.

ResNet will give other gradients. So I expect more problems and tough bugs.

Maybe I'm too stupid to apply the chain rule. I should be glad I did not study mathematics.

Yes, a generic symbolic solution would be fine. But it will probably also be slow because it uses an interpreter.

AlvaroBegue · Post by **AlvaroBegue** » Thu Feb 08, 2018 7:03 pm

Henk wrote:
AlvaroBegue wrote:You can compute the derivative of the loss function with respect to individual weights by using the formula (F(x+h)-F(x))/h, for some small h. Compare that with the gradient you got from backpropagation.
Yes, that's how I found these bugs. But they are still difficult to fix if you are overlooking something completely.

ResNet will give other gradients. So I expect more problems and tough bugs.

Maybe I'm too stupid to apply the chain rule. I should be glad I did not study mathematics.

Yes, a generic symbolic solution would be fine. But it will probably also be slow because it uses an interpreter.

Can you post what your forward computation looks like? I might be able to suggest non-buggy ways to compute the gradient.

Henk · Post by **Henk** » Fri Feb 09, 2018 12:19 pm

The test (F(x+h)-F(x))/h does not work at a point where the derivative of a function is discontinuous. For instance, when f(u) = u > 0 ? u : 0.01 * u, and F(x + h) > 0 but F(x) < 0.

AlvaroBegue · Post by **AlvaroBegue** » Fri Feb 09, 2018 1:38 pm

Henk wrote:The test (F(x+h)-F(x))/h does not work at a point where the derivative of a function is discontinuous. For instance, when f(u) = u > 0 ? u : 0.01 * u, and F(x + h) > 0 but F(x) < 0.

You can't expect equality, but you should get pretty close in most cases. If your h is judiciously chosen, the probability of hitting a corner like that will be very small.
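A small sketch of how rare that corner is, using the leaky activation from the quoted post. The sample points and `h = 1e-6` are assumed values chosen for illustration.

```python
def leaky_relu(u):
    return u if u > 0 else 0.01 * u


def leaky_relu_derivative(u):
    return 1.0 if u > 0 else 0.01


def forward_difference(f, x, h):
    return (f(x + h) - f(x)) / h


h = 1e-6
errors = {}
for x in (-0.5, -1e-7, 0.5):
    # Only x = -1e-7 has x < 0 < x + h, so only it straddles the kink.
    estimate = forward_difference(leaky_relu, x, h)
    errors[x] = abs(estimate - leaky_relu_derivative(x))
```

Only the point whose interval from x to x + h contains 0 disagrees with the exact derivative. That interval has width h, so an input has to land within h of the kink for the test to be misleading.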
