How to make a double-sized net as good as SF NNUE in a few easy steps


hgm
Posts: 27870
Joined: Fri Mar 10, 2006 10:06 am
Location: Amsterdam
Full name: H G Muller

Re: How to make a double-sized net as good as SF NNUE in a few easy steps

Post by hgm »

connor_mcmonigle wrote: Sun Feb 28, 2021 10:29 pm
Also, zero initialization would work totally fine, as the gradient of the loss function w.r.t. the weight matrix of a given affine transform layer is the outer product of the input to that layer and the gradient of the loss function w.r.t. the output of the given layer. Therefore, weights == zero does not imply gradient == zero.
It does when two successive layers are zero. Any infinitesimal change of a weight in the first layer would then not affect the network output, because the second layer would not transmit it. And any such change in the second layer would have no effect, because the cell it comes from gets zero activation. So the gradient is zero. In essence you are seeing the product rule for differentiation here: (f*g)' = f'*g + f*g', where f is the weight of one layer and g that of the other. If f = g = 0, it doesn't matter that f' and g' are nonzero.

1/3 and 2/3 would also not work. (And beware that you add two layers of weights; if you make the weights in both layers 3 times smaller, the contribution of that part of the net gets 9 times smaller.) If the nets are just multiples of each other, their weights will affect the result in the same proportion, so they will also be modified in the same proportion. They would keep doing the same thing forever. You have to make sure they do something essentially different from the start.

Randomizing one layer and zeroing the other would work, though! This would not affect the initial net output.
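
To make the product-rule point concrete, here is a minimal toy sketch (PyTorch, made-up sizes, no activation, nothing Stockfish-specific) of the three cases: both new weight layers zero, only the first zero, only the second zero.

import torch

def grad_norms(zero_first, zero_second):
    x = torch.rand(8)                                               # stand-in input features
    w1 = torch.zeros(4, 8) if zero_first else torch.randn(4, 8)     # input -> new cells
    w2 = torch.zeros(1, 4) if zero_second else torch.randn(1, 4)    # new cells -> output
    w1.requires_grad_(True)
    w2.requires_grad_(True)
    loss = (w2 @ (w1 @ x)).sum()    # toy loss through the two weight layers
    loss.backward()
    return w1.grad.norm().item(), w2.grad.norm().item()

print(grad_norms(True, True))    # (0.0, 0.0): both layers zero, the path is dead
print(grad_norms(True, False))   # (>0, 0.0): the zeroed first layer still gets a gradient
print(grad_norms(False, True))   # (0.0, >0): the zeroed second layer still gets a gradient

In the mixed cases the zeroed layer moves off zero after one update, after which the other layer starts receiving gradient as well; in the all-zero case neither ever does.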
connor_mcmonigle
Posts: 543
Joined: Sun Sep 06, 2020 4:40 am
Full name: Connor McMonigle

Re: How to make a double-sized net as good as SF NNUE in a few easy steps

Post by connor_mcmonigle »

hgm wrote: Sun Feb 28, 2021 11:25 pm
connor_mcmonigle wrote: Sun Feb 28, 2021 10:29 pm
Also, zero initialization would work totally fine, as the gradient of the loss function w.r.t. the weight matrix of a given affine transform layer is the outer product of the input to that layer and the gradient of the loss function w.r.t. the output of the given layer. Therefore, weights == zero does not imply gradient == zero.
It does when two successive layers are zero. Any infinitesimal change of a weight in the first layer would then not affect the network output, because the second layer would not transmit it. And any such change in the second layer would have no effect, because the cell it comes from gets zero activation. So the gradient is zero. In essence you are seeing the product rule for differentiation here: (f*g)' = f'*g + f*g', where f is the weight of one layer and g that of the other. If f = g = 0, it doesn't matter that f' and g' are nonzero.

1/3 and 2/3 would also not work. (And beware that you add two layers of weights; if you make the weights in both layers 3 times smaller, the contribution of that part of the net gets 9 times smaller.) If the nets are just multiples of each other, their weights will affect the result in the same proportion, so they will also be modified in the same proportion. They would keep doing the same thing forever. You have to make sure they do something essentially different from the start.
Yes, if two successive layers in a simple fully connected network are zero-initialized, the gradient will be zero for both (higher-order methods can overcome this). However, that's not what chrisw was suggesting: in his scheme, only part of the first layer would be zero-initialized, and the successive layers would still have nonzero weights.
hgm
Posts: 27870
Joined: Fri Mar 10, 2006 10:06 am
Location: Amsterdam
Full name: H G Muller

Re: How to make a double-sized net as good as SF NNUE in a few easy steps

Post by hgm »

Where do you read that? I just see:
Step 1. Initialise our new 1024 neuron network with all weights = 0.0

Step 2. Unload all SF NNUE weights (from topography 512) and use them to fill the left hand side of the 1024 topography net.
The new part of the network has connections from the 512 new cells both to the input and to the 32-cell layer; these had no counterpart in the 512-wide NNUE. Those cells will always stay completely dead, unless you use second-order methods, as you say. But even that would not solve the problem that they all contribute to the second order in exactly the same way.
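
For readers following along, a rough numpy sketch of the layout being described (widths 512/1024/32 as in the thread; the feature count is shrunk and the per-perspective split of the accumulator is ignored, so this only illustrates where the two new blocks of weights sit):

import numpy as np

N_FEAT = 4096                                # stand-in; real halfKP has 41024 features per perspective

old_w_in  = np.random.randn(512, N_FEAT)     # placeholder for the trained SF NNUE weights
old_w_out = np.random.randn(32, 512)

new_w_in  = np.zeros((1024, N_FEAT))         # step 1: everything zero
new_w_out = np.zeros((32, 1024))
new_w_in[:512, :]  = old_w_in                # step 2: the old net fills the left-hand side
new_w_out[:, :512] = old_w_out

# new_w_in[512:, :] and new_w_out[:, 512:] are the two NEW blocks of weights.
# With both left at zero, the 512 added cells neither see the input nor reach
# the 32-cell layer: this is the isolated part of the net under discussion.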
connor_mcmonigle
Posts: 543
Joined: Sun Sep 06, 2020 4:40 am
Full name: Connor McMonigle

Re: How to make a double-sized net as good as SF NNUE in a few easy steps

Post by connor_mcmonigle »

Read step 2 :wink:
hgm
Posts: 27870
Joined: Fri Mar 10, 2006 10:06 am
Location: Amsterdam
Full name: H G Muller

Re: How to make a double-sized net as good as SF NNUE in a few easy steps

Post by hgm »

Sorry, I was still editing my post to copy step 2 into it as well. It won't help: the added cells ("the right-hand side") have zero input and output in this prescription.
connor_mcmonigle
Posts: 543
Joined: Sun Sep 06, 2020 4:40 am
Full name: Connor McMonigle

Re: How to make a double-sized net as good as SF NNUE in a few easy steps

Post by connor_mcmonigle »

hgm wrote: Sun Feb 28, 2021 11:47 pm Sorry, I was still editing my post to copy step 2 into it as well. It won't help: the added cells have zero input and output in this prescription.
No worries. The output of a given layer isn't immediately relevant to calculating the gradient w.r.t. its weight matrix (we need the gradient w.r.t. the output, not the output itself). The layer's input will also be nonzero, as the layer is given the board (halfKP) features as input. Since both the gradient w.r.t. the first layer's output and the input vector are nonzero, the gradient w.r.t. the first layer's weight matrix is also nonzero.
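
The identity being used here, as a tiny sketch (numpy, made-up sizes): for an affine layer y = W x + b, the gradient of the loss w.r.t. W is the outer product of the gradient w.r.t. y with the input x.

import numpy as np

x  = np.random.randn(8)      # input to the layer (the halfKP features, in this argument)
dy = np.random.randn(4)      # gradient of the loss w.r.t. the layer's output y
dW = np.outer(dy, x)         # gradient of the loss w.r.t. the weight matrix W
db = dy                      # gradient of the loss w.r.t. the bias b

So dW is nonzero whenever both x and dy are nonzero; the disagreement in the thread is over whether dy for the newly added cells is nonzero when their outgoing weights also start at zero.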
chrisw
Posts: 4346
Joined: Tue Apr 03, 2012 4:28 pm

Re: How to make a double-sized net as good as SF NNUE in a few easy steps

Post by chrisw »

connor_mcmonigle wrote: Sun Feb 28, 2021 11:52 pm
hgm wrote: Sun Feb 28, 2021 11:47 pm Sorry, I was still editing my post to copy step 2 into it as well. It won't help: the added cells have zero input and output in this prescription.
No worries. The output of a given layer isn't immediately relevant to calculating the gradient w.r.t. its weight matrix (we need the gradient w.r.t. the output, not the output itself). The layer's input will also be nonzero, as the layer is given the board (halfKP) features as input. Since both the gradient w.r.t. the first layer's output and the input vector are nonzero, the gradient w.r.t. the first layer's weight matrix is also nonzero.
One could, instead of zeroing the right-hand side, initialise it with what would have gone into the left-hand side (the weights from the factoriser). The RHS weights to the hidden layer would then be set to duplicate the already existing ones from the LHS. Would that work?
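
For concreteness, a sketch of that initialisation (numpy, same toy shapes as earlier in the thread; factoriser_w is only a placeholder for whatever the factoriser would supply, not a real variable from any trainer):

import numpy as np

N_FEAT = 4096                                    # stand-in; real halfKP has 41024 features per perspective
old_w_in  = np.random.randn(512, N_FEAT)         # placeholder for the existing LHS weights
old_w_out = np.random.randn(32, 512)
factoriser_w = np.random.randn(512, N_FEAT)      # placeholder for the factoriser weights

new_w_in  = np.zeros((1024, N_FEAT))
new_w_out = np.zeros((32, 1024))
new_w_in[:512, :]  = old_w_in                    # left-hand side: the existing net
new_w_out[:, :512] = old_w_out
new_w_in[512:, :]  = factoriser_w                # right-hand side fed from the factoriser
new_w_out[:, 512:] = old_w_out                   # RHS-to-32 weights duplicate the LHS ones

Unlike the all-zero scheme, the new cells here start with nonzero activations and nonzero outgoing weights, though the initial output of the enlarged net no longer matches the old one exactly.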
hgm
Posts: 27870
Joined: Fri Mar 10, 2006 10:06 am
Location: Amsterdam
Full name: H G Muller

Re: How to make a double-sized net as good as SF NNUE in a few easy steps

Post by hgm »

Not sure anymore what you call a 'layer' here. Does that include the 512 cells initialized as the old network? We are dealing with 2 new layers of weights here, even if there is only one layer of new cells. It doesn't help you that the complete layers are nonzero, because a nonzero gradient doesn't help if it has a zero component along an isolated part of the network.

If you don't believe it, try it.
connor_mcmonigle
Posts: 543
Joined: Sun Sep 06, 2020 4:40 am
Full name: Connor McMonigle

Re: How to make a double-sized net as good as SF NNUE in a few easy steps

Post by connor_mcmonigle »

chrisw wrote: Mon Mar 01, 2021 12:05 am
connor_mcmonigle wrote: Sun Feb 28, 2021 11:52 pm
hgm wrote: Sun Feb 28, 2021 11:47 pm Sorry, I was still editing my post to copy step 2 into it as well. It won't help: the added cells have zero input and output in this prescription.
No worries. The output of a given layer isn't immediately relevant to calculating the gradient w.r.t. its weight matrix (we need the gradient w.r.t. the output, not the output itself). The layer's input will also be nonzero, as the layer is given the board (halfKP) features as input. Since both the gradient w.r.t. the first layer's output and the input vector are nonzero, the gradient w.r.t. the first layer's weight matrix is also nonzero.
One could, instead of zeroing the right-hand side, initialise it with what would have gone into the left-hand side (the weights from the factoriser). The RHS weights to the hidden layer would then be set to duplicate the already existing ones from the LHS. Would that work?
Just zero-initializing the new weights would work perfectly fine. It might be inferior to initializing with some normal noise, though. Of course, you then lose the property that the outputs exactly match, but I don't think this is a big deal in practice.
connor_mcmonigle
Posts: 543
Joined: Sun Sep 06, 2020 4:40 am
Full name: Connor McMonigle

Re: How to make a double-sized net as good as SF NNUE in a few easy steps

Post by connor_mcmonigle »

hgm wrote: Mon Mar 01, 2021 12:08 am Not sure anymore what you call a 'layer' here. Does that include the 512 cells initialized as the old network? We are dealing with 2 new layers of weights here, even if there is only one layer of new cells. It doesn't help you that the complete layers are nonzero, because a nonzero gradient doesn't help if it has a zero component along an isolated part of the network.

If you don't believe it, try it.
I don't actually quite know what you mean by 'cells'. Do you mean matrix entries? If so, far more than 512 get added with chrisw's scheme...
Last edited by connor_mcmonigle on Mon Mar 01, 2021 12:11 am, edited 1 time in total.