None-GPL NNUE probing code.

Discussion of chess software programming and technical issues.

Moderator: Ras

Daniel Shawul
Posts: 4186
Joined: Tue Mar 14, 2006 11:34 am
Location: Ethiopia

Re: None-GPL NNUE probing code.

Post by Daniel Shawul »

hgm wrote: Tue Feb 02, 2021 3:05 pm
Daniel Shawul wrote: Tue Feb 02, 2021 12:54 am
The NNUE architecture learns king safety and attack very quickly, but something as simple as piece values may be hard without a factorizer, or without having it as an input directly.
Isn't that just a matter of offering it the right training material? It seems that people have the inclination to train on positions from games between very strong players. But these usually contain zero information on the opening values of pieces, as they stay nearly balanced for a long time. So the only thing the net can learn from it is that positions with a lot of material are always balanced. And it would quite happily set all piece values to 0 in order to guarantee that. To make it understand that trading a Queen for two Pawns in the opening causes a certain loss, you would have to show it enough positions where a Queen was traded for two Pawns.
Yes, that is one factor, i.e. having "good quality data" that contains unbalanced positions. The NNUE architecture also makes it harder to learn piece values, because a piece's merit is based on its position relative to a king. The input is simply too sparse, and learning piece values becomes 64x harder because of it. I tried zero reinforcement learning for NNUE, where having bad positions is not a problem, but it failed to learn piece values even though you can see it had learned more advanced features. Supervised learning with depth=18 or so data had shown similar problems, but that one is for the reason you mentioned.
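To make the 64x point concrete, here is a simplified sketch of HalfKP-style feature indexing. The layout and constants below are only illustrative (not the exact Stockfish encoding): every (piece, square) pair is replicated once per own-king square, so the net effectively has to rediscover each piece's value 64 times over.

[code]
# Simplified sketch of HalfKP-style feature indexing (illustrative layout,
# not the exact Stockfish encoding).
NUM_PIECE_TYPES = 10   # non-king pieces of both colors: P, N, B, R, Q x 2
NUM_SQUARES = 64

def halfkp_feature_index(king_sq, piece_type, piece_sq):
    # One sparse slot per (own king square, piece type, piece square).
    return (king_sq * NUM_PIECE_TYPES + piece_type) * NUM_SQUARES + piece_sq

def dense_feature_index(piece_type, piece_sq):
    # One dense slot per (piece type, piece square): 10 * 64 = 640 features.
    return piece_type * NUM_SQUARES + piece_sq

# 64 * 10 * 64 = 40960 sparse features versus 640 dense ones: the same
# piece/square information is spread over 64x as many first-layer weights.
[/code]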
connor_mcmonigle
Posts: 544
Joined: Sun Sep 06, 2020 4:40 am
Full name: Connor McMonigle

Re: None-GPL NNUE probing code.

Post by connor_mcmonigle »

Daniel Shawul wrote: Tue Feb 02, 2021 4:19 pm
Yes, that is one factor, i.e. having "good quality data" that contains unbalanced positions. The NNUE architecture also makes it harder to learn piece values, because a piece's merit is based on its position relative to a king. The input is simply too sparse, and learning piece values becomes 64x harder because of it. I tried zero reinforcement learning for NNUE, where having bad positions is not a problem, but it failed to learn piece values even though you can see it had learned more advanced features. Supervised learning with depth=18 or so data had shown similar problems, but that one is for the reason you mentioned.
It's possible you're already aware, but the "factorization trick" massively reduces the amount of data/time required to learn piece values. With factorization, I've found that networks with halfkp/halfka input features learn piece values nearly as easily as networks with 768 piece position input features. In practice, it's still a little more difficult to learn piece values with the sparse halfkp/halfka input features, as the random initialization of the first layer will leave some noise persisting for rarely observed input features (perhaps a different weight initialization scheme could resolve this, though).

The factorization trick relies on the fact that the sparse halfka or halfkp features contain complete information about the dense piece position features (768 or 640 of them, respectively). Therefore, it's possible to construct a non-square, sparse, permutation-like matrix M such that, for the sparse input feature vector v and the dense input feature vector u, M @ v = u. During training we can then add a separate dense input layer, using M to convert the sparse input features into the dense features it consumes. When we export the model, we use M to fuse the dense input layer with the sparse input layer into a single, equivalent, sparse input layer (for a dense input weight matrix W, W @ M is an equivalent sparse input weight matrix that can be added to the sparse layer's weights).
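Here's a minimal NumPy sketch of that construction and of the export-time fusion. The index layout, layer width, and random weights are placeholders (real trainers use their own encodings), but the M @ v = u relationship and the W @ M fusion are the same idea.

[code]
# Minimal sketch of the factorization matrix M and the export-time fusion
# (NumPy; the HalfKP-style index layout and sizes are illustrative only).
import numpy as np

NUM_SPARSE = 64 * 10 * 64   # HalfKP-style (king sq, piece type, piece sq)
NUM_DENSE = 10 * 64         # dense (piece type, piece sq) features
HIDDEN = 8                  # tiny accumulator width, just for the demo

# M[dense_idx, sparse_idx] = 1 whenever the sparse feature corresponds to
# that dense feature, so that M @ v = u for sparse input v and dense input u.
M = np.zeros((NUM_DENSE, NUM_SPARSE), dtype=np.float32)
for king_sq in range(64):
    for piece_type in range(10):
        for piece_sq in range(64):
            sparse_idx = (king_sq * 10 + piece_type) * 64 + piece_sq
            dense_idx = piece_type * 64 + piece_sq
            M[dense_idx, sparse_idx] = 1.0

# During training the first layer computes W_sparse @ v + W_dense @ (M @ v).
W_sparse = 0.01 * np.random.randn(HIDDEN, NUM_SPARSE).astype(np.float32)
W_dense = 0.01 * np.random.randn(HIDDEN, NUM_DENSE).astype(np.float32)

# At export time the dense layer is fused away, leaving one sparse layer.
W_fused = W_sparse + W_dense @ M

# Sanity check on an arbitrary sparse activation pattern.
v = np.zeros(NUM_SPARSE, dtype=np.float32)
v[[123, 4567, 30001]] = 1.0
assert np.allclose(W_fused @ v, W_sparse @ v + W_dense @ (M @ v), atol=1e-5)
[/code]

In practice M would be stored sparsely (or applied per-feature-index rather than materialized), but the dense version makes the fusion explicit.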

There exist many other possible "factorizations", such as material count, rank, file, diagonal, etc. In general, for any function mapping some subset of the sparse input feature indices to some other set of feature indices, there exists a corresponding M.
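For instance, a (piece type, file) factorization is built the same way; only the index mapping changes (again using the illustrative layout from the earlier sketch, not any particular trainer's encoding).

[code]
# Sketch of an additional factorization: bucket each sparse feature by
# (piece type, file of the piece square). Same construction, different map.
import numpy as np

NUM_SPARSE = 64 * 10 * 64
NUM_FILE_FEATURES = 10 * 8   # (piece type, file)

M_file = np.zeros((NUM_FILE_FEATURES, NUM_SPARSE), dtype=np.float32)
for king_sq in range(64):
    for piece_type in range(10):
        for piece_sq in range(64):
            sparse_idx = (king_sq * 10 + piece_type) * 64 + piece_sq
            M_file[piece_type * 8 + piece_sq % 8, sparse_idx] = 1.0

# The auxiliary layer trained on M_file @ v is fused at export exactly as
# before: W_fused = W_sparse + W_file @ M_file.
[/code]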