Thoughts on "Zero" training of NNUEs

Discussion of chess software programming and technical issues.

Moderators: hgm, chrisw, Rebel

User avatar
johnhamlen65
Posts: 31
Joined: Fri May 12, 2023 10:15 am
Location: Melton Mowbray, England
Full name: John Hamlen

Re: Thoughts on "Zero" training of NNUEs

Post by johnhamlen65 »

op12no2 wrote: Wed Nov 06, 2024 10:00 am
johnhamlen65 wrote: Tue Nov 05, 2024 5:24 pm How are you getting along with your single-accumulator model?
All the best
John
Hi John,

It's a lot of fun and I'm learning a lot. For a Javascript engine, performance is key, and I have been surprised to discover that 128 is viable as a hidden-layer size; I'm training 256 as I type. There is lots more to experiment with: multiple accumulators, quantisation, output buckets, alternative activations, deferred UE, more training data, different optimisers, different sigmoid scaling, etc. I find it far more fun than playing around with an HCE, and it's relatively straightforward to get a basic implementation working; you don't need UE to get going, for example.

https://github.com/op12no2/lozza
That's great news Colin! Thanks for the Github link. Looking forward to checking it out more fully.
Never mind the NNUE, I'm really impressed with the UI at https://op12no2.github.io/lozza-ui/
What sort of horsepower are you using to do the training? (Said John, looking for an excuse to upgrade his PC! :lol: :lol: )
All the best and good luck with Lozza!
John
User avatar
hgm
Posts: 28216
Joined: Fri Mar 10, 2006 10:06 am
Location: Amsterdam
Full name: H G Muller

Re: Thoughts on "Zero" training of NNUEs

Post by hgm »

johnhamlen65 wrote: Wed Nov 06, 2024 5:55 pm I get the part about the attacking and defending players in a position having two separate perspectives, but I don't understand how this leads to the need for two separate nets or PSQTs. Chess being a zero-sum game, I would expect that a good placement of the defending pieces will be exactly as bad for the attacker as it is good for the defender, and vice versa for the attacking pieces.
For a given player on move that is indeed true. So the score for black in a position with white on move is indeed the opposite of the score for white with white on move. But it does not have to be the same as (or the opposite of) the score of that same piece constellation with black on move. The side not on move can have different priorities, requiring different placement of his pieces to get a good score.
Having said that, I do get that being the side to move is (usually) an advantage. In the past (with Woodpusher) I added some unscientific bonus for this to the evaluation, but never quantified how much difference it made.
This is a step in the right direction. But the problem is that in some (wild) positions the advantage of having the move can be huge, if not decisive, while in tame positions there is hardly any advantage at all. So a fixed bonus does not help very much. Making it dependent on your King safety (the less safe your King, the worse it is if the opponent has the move) would be a primitive way to account for this in a hand-crafted evaluation.
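
As a sketch of what such a king-safety-dependent tempo term might look like in a hand-crafted evaluation (all names and constants below are made up for illustration, not from any particular engine):

BASE_TEMPO = 10         # centipawns for simply having the move (hypothetical)
DANGER_SCALE = 0.5      # extra centipawns per unit of king danger (hypothetical)

def tempo_bonus(opponent_king_danger):
    # The less safe the opponent's king, the worse it is for him that we
    # have the move, so the tempo term grows with his king danger.
    return BASE_TEMPO + DANGER_SCALE * opponent_king_danger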
With NNUE - and I think this is what Chesskobra is suggesting(?) - can we "just" train the network so that the evaluation E of every position is from the perspective of the STM (or White), and the evaluation from the perspective of the other side (or Black) is always -E?
That is not so much a matter of training the NNUE as of designing its topology. You cannot easily train an NN to reach a state that performs worse than the best solution its topology allows. If you allow it to use asymmetric evaluation, and the training set shows that this performs better, it will learn to evaluate asymmetrically. So you simply must not allow it, but enforce the symmetry. In this case you could just use a large number of conventional PST, used for both the stm's and his opponent's pieces, and feed their (incrementally updated) PST sums into a fully connected NN. You could even combine this with tables that depend on the King location, or that use the location relative to the King.
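
A rough Python sketch of that symmetric layout, just to make it concrete (table count, layer sizes and names are illustrative; a real engine would update the sums incrementally and use integer SIMD):

N_TABLES = 16                      # hypothetical number of piece-square tables

def accumulate(position, pst):
    # position: list of (piece, square); pst[t][piece][square] -> weight.
    # The SAME tables are used whichever side is to move, which is what
    # enforces the symmetric evaluation.
    sums = [0.0] * N_TABLES
    for piece, square in position:
        for t in range(N_TABLES):
            sums[t] += pst[t][piece][square]
    return sums

def evaluate(sums, hidden_w, hidden_b, out_w, out_b):
    # Feed the table sums through one ReLU layer to a single output score.
    hidden = [max(0.0, b + sum(wi * s for wi, s in zip(w, sums)))
              for w, b in zip(hidden_w, hidden_b)]
    return out_b + sum(wi * h for wi, h in zip(out_w, hidden))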

The original presentation of NNUE was really a combination of three independent ideas: using a different PST for each King location, using different PST for the side to move and the side not to move, and using multiple tables and combining their outputs through an NN. Strictly speaking only the latter idea is NNUE; the other two are refinements that could also be used without an NN.
Engin
Posts: 978
Joined: Mon Jan 05, 2009 7:40 pm
Location: Germany
Full name: Engin Üstün

Re: Thoughts on "Zero" training of NNUEs

Post by Engin »

hgm wrote: Wed Nov 06, 2024 8:13 pm The original presentation of NNUE was really a combination of three independent ideas: using a different PST for each King location, using different PST for the side to move and the side not to move, and using multiple tables and combining their outputs through an NN. Strictly speaking only the latter idea is NNUE; the other two are refinements that could also be used without an NN.
The King-square PST is the part I don't do; it would explode the number of input neurons too much. So I only distinguish side to move and side not to move, including the kings as well (instead of 10x64). That's enough, I guess.

So 2x12x64 = 1536 inputs are enough: the first 768 neurons are for white to move and the other 768 for black to move.

The reason is speed later on: small neural networks work faster; bigger ones are more accurate but slow down the search badly.
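
For what it's worth, one way to index such a 1536-input layout (my own sketch, not necessarily Engin's exact encoding):

def input_index(white_to_move, piece, square):
    # piece: 0..11 (6 piece types x 2 colours), square: 0..63 (a1 = 0)
    stm_offset = 0 if white_to_move else 768
    return stm_offset + piece * 64 + square   # result in 0..1535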
Last edited by Engin on Thu Nov 07, 2024 12:02 am, edited 3 times in total.
Engin
Posts: 978
Joined: Mon Jan 05, 2009 7:40 pm
Location: Germany
Full name: Engin Üstün

Re: Thoughts on "Zero" training of NNUEs

Post by Engin »

Engin wrote: Tue Nov 05, 2024 6:11 pm
johnhamlen65 wrote: Tue Nov 05, 2024 9:54 am
Engin wrote: Mon Nov 04, 2024 5:22 pm
Does anybody know the formula to convert a WDL winning percentage to centipawns, and back again?
Hi Engin,
You can reverse the Lichess Accuracy Metric to get this:
centipawns = log((2 / ((win_probability - 0.5) * 2 + 1)) - 1) / -0.00368208
Hope this helps,
John
Ah, that looks pretty good, thanks John!
Hmmm... I found a better solution on this page: https://www.chessprogramming.org/Pawn_A ... e,_and_Elo
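
For reference, a small Python sketch of both conversions mentioned here: the logistic model from the chessprogramming page (if I recall it correctly) and the inverse of the Lichess scaling John quoted, with the constant as given in the thread.

import math

def cp_to_winprob(cp):
    # Logistic model: win probability = 1 / (1 + 10 ** (-pawn_advantage / 4)),
    # with pawn_advantage = cp / 100.
    return 1.0 / (1.0 + 10.0 ** (-cp / 400.0))

def winprob_to_cp(w):
    return 400.0 * math.log10(w / (1.0 - w))   # inverse of the above

def lichess_winprob_to_cp(w):
    # Inverse of the Lichess scaling quoted above:
    # winning_chances = 2 / (1 + exp(-0.00368208 * cp)) - 1
    wc = (w - 0.5) * 2.0                       # map win probability [0,1] -> [-1,1]
    return math.log(2.0 / (wc + 1.0) - 1.0) / -0.00368208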
User avatar
hgm
Posts: 28216
Joined: Fri Mar 10, 2006 10:06 am
Location: Amsterdam
Full name: H G Muller

Re: Thoughts on "Zero" training of NNUEs

Post by hgm »

Engin wrote: Wed Nov 06, 2024 11:51 pm The reason is speed later on: small neural networks work faster; bigger ones are more accurate but slow down the search badly.
An NNUE using KPST is not any slower than one using normal PST, except in the case of a King move, where complete recalculation is the fastest option. But since King moves are only a modest fraction of all moves, the slowdown should not be that dramatic.

The slowdown can be ameliorated by 'bucketing' a set of adjacent squares, making them use the same PST, so that no update is needed at all when the King moves between squares within the bucket. This of course loses the precision with which the tables can reward aiming pieces exactly at the King, but I can imagine that there still is an advantage in diversifying the tables depending on which side the Kings have castled to. (E.g. two buckets: right and left board half. Or four buckets: a1-c2, f1-h2, d1-e2 and a3-h8.) That should really reduce the number of King moves that cross bucket boundaries.

Also note that this is not really an all-or-nothing choice. If you use, say, 256 tables for each colored piece type and side to move, you could make a fraction of those KPST, also indexed by the King location, and the remainder ordinary PST, independent of the King location. On a King move you then only have to recalculate the PST sums for that fraction.
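
As an illustration of the four-bucket example above, a king square (0..63, a1 = 0, h8 = 63) could be mapped to a bucket like this (my own sketch):

def king_bucket(square):
    file, rank = square % 8, square // 8
    if rank >= 2:
        return 3        # a3-h8: king has left the back area
    if file <= 2:
        return 0        # a1-c2: queenside
    if file >= 5:
        return 1        # f1-h2: kingside
    return 2            # d1-e2: king still in the centre

A refresh of the KPST part of the accumulator is then only needed when a King move changes king_bucket(square).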
User avatar
johnhamlen65
Posts: 31
Joined: Fri May 12, 2023 10:15 am
Location: Melton Mowbray, England
Full name: John Hamlen

Re: Thoughts on "Zero" training of NNUEs

Post by johnhamlen65 »

JacquesRW wrote: Wed Nov 06, 2024 3:34 am
johnhamlen65 wrote: Tue Nov 05, 2024 5:24 pm Thanks Colin. That diagram and link make it much clearer to me what is going on.
I'm sure the 2-accumulator system must work better, as it seems to be universally adopted. However, being a bear of very little brain, I still can't see why. It seems to me that, once trained, the weights for the top and bottom halves will be simply (sort of) mirror images of each other, and not bring anything else to the party.
In terms of motivation - if you want STM-relative inputs (which you do, it's worth a hell of a lot of Elo) you'll find yourself needing to keep track of an accumulator from each perspective anyway, so why not use both available accumulators in eval?
It is equivalent to doubling the hidden layer size with a (rather strong) constraint on the second half of the weights in that first layer, namely that they are a vertical mirror of the weights in the first half. This does not mean that the two accumulators will look remotely alike, nor that the weights in the subsequent layers will show the same symmetry.
The question is whether this gains enough accuracy to outweigh the more expensive inference from the two accumulators -> output, rather than just the stm accumulator -> output. The answer to this is a firm yes.
hgm wrote: Tue Nov 05, 2024 10:10 pm A third ingredient is the introduction of a (multi-layer) NN. This is really an independent technique; it could also have been done using the same PST for both the stm and the opponent. Instead of one PST (or pair of KPST) you use a large number of them (typically 256).
256 is on the tiny side by today's standards. For SF an L1 size of 256 was only seen in the very first SFNNv1 architecture, and the current size is 3072.
Thanks Jamie, this is really helpful.
I checked out your Akimbo on Github. I don't "speak" Rust, but even I can tell how neat and compact your code is! I will try to get my head around what is happening on the NNUE side of things.
All the best
John
User avatar
hgm
Posts: 28216
Joined: Fri Mar 10, 2006 10:06 am
Location: Amsterdam
Full name: H G Muller

Re: Thoughts on "Zero" training of NNUEs

Post by hgm »

JacquesRW wrote: Wed Nov 06, 2024 3:34 am In terms of motivation - if you want STM-relative inputs (which you do, it's worth a hell of a lot of Elo) you'll find yourself needing to keep track of an accumulator from each perspective anyway, so why not use both available accumulators in eval?
It is equivalent to doubling the hidden layer size with a (rather strong) constraint on the second half of the weights in that first layer, namely that they are a vertical mirror of the weights in the first half. This does not mean that the two accumulators will look remotely alike, nor that the weights in the subsequent layers will show the same symmetry.
The question is whether this gains enough accuracy to outweigh the more expensive inference from the two accumulators -> output, rather than just the stm accumulator -> output. The answer to this is a firm yes.
There is something I don't understand here. As you remark, using both accumulators is like using a single accumulator of twice the size, with a symmetry constraint on the weights. And a special case of this would be to make half the weights zero, so that effectively you would not use half of this double-size accumulator, depending on who has the move. Which is the original NNUE design.

So is the Elo increase you get from the dual-accumulator design just a result of doubling the (combined) accumulator size, and would you have had the same (or better) results from sticking to a single-accumulator design with twice the size? It seems strange that putting restrictions on the weights would give you a stronger network; the unrestricted network could have learned it should be symmetric, or that some weights had better be zero.

Is this a consequence of a wrong training method for a single accumulator design?
JacquesRW
Posts: 113
Joined: Sat Jul 30, 2022 12:12 pm
Full name: Jamie Whiting

Re: Thoughts on "Zero" training of NNUEs

Post by JacquesRW »

hgm wrote: Fri Nov 08, 2024 2:37 pm So is the Elo increase you get from the dual-accumulator design just a result of doubling the (combined) accumulator size, and would you have had the same (or better) results from sticking to a single-accumulator design with twice the size?
If speed weren't a factor, you would expect a single accumulator of size 2N to beat dual accumulators of size N. However, with a single 2N accumulator you still need to keep track of both a black and a white accumulator, each of size 2N, and then pick which one to use at eval time.
Whereas in the dual-accumulator setup you only need to keep track of black and white accumulators of size N, and at eval time you concatenate them - semantically; the actual implementation elides this for speed - with the side-to-move accumulator going first and the not-side-to-move accumulator going second.

So basically you get a network whose raw strength is somewhere between that of the single-accumulator designs of size N and 2N, but for only a small additional cost over the size-N single-accumulator design (feeding 2N rather than N values into the subsequent layers), which is small relative to the cost of efficiently updating accumulators of size 2N.
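
To make the concatenation concrete, a semantic sketch in plain Python (the names and the clipped-ReLU activation are illustrative; real implementations avoid the copy and use SIMD):

def evaluate(acc_white, acc_black, white_to_move, out_w, out_b):
    # Two perspective accumulators of size N are kept up to date incrementally;
    # at eval time they are concatenated with the side-to-move half first.
    stm, ntm = (acc_white, acc_black) if white_to_move else (acc_black, acc_white)
    x = stm + ntm                              # concatenation, length 2N
    x = [min(max(v, 0.0), 1.0) for v in x]     # activation (clipped ReLU here)
    return out_b + sum(w * v for w, v in zip(out_w, x))   # 2N -> 1 output layer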

EDIT: I'm not sure what was originally done in Shogi engines, but SF has always used the dual-accumulator architecture that I have described. You can see it in SFNNv1 here: https://github.com/official-stockfish/n ... chitecture
User avatar
hgm
Posts: 28216
Joined: Fri Mar 10, 2006 10:06 am
Location: Amsterdam
Full name: H G Muller

Re: Thoughts on "Zero" training of NNUEs

Post by hgm »

OK, I see. Setting weights to 0 is not the same as omitting the corresponding links, because multiplying by zero takes time too.
Engin
Posts: 978
Joined: Mon Jan 05, 2009 7:40 pm
Location: Germany
Full name: Engin Üstün

Re: Thoughts on "Zero" training of NNUEs

Post by Engin »

There is just one thing I don't understand: how can a neural network work with integer values converted from floating-point numbers? Yes, it runs faster, but do you get the same result at the outputs after quantizing the whole network to integers? And would the sigmoid activation function in the forward pass work with integers too? Because with a sigmoid you always get floating-point numbers between 0.0 and 1.0.
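
For illustration only, a minimal sketch of one common quantisation scheme (not taken from any particular engine): train in floating point, scale the weights to integers, run the forward pass with integer arithmetic and a clipped ReLU inside the net, and only rescale (and, if needed for a WDL output, apply the sigmoid in floating point) at the very end.

SCALE = 64    # hypothetical quantisation scale

def quantise(weights_float):
    # Round the scaled floating-point weights to integers after training.
    return [round(w * SCALE) for w in weights_float]

def forward_int(acc_int, out_w_int, out_b_int):
    # Clipped ReLU works fine on integers; no sigmoid is needed inside the net.
    x = [min(max(v, 0), SCALE) for v in acc_int]
    raw = out_b_int + sum(w * v for w, v in zip(out_w_int, x))
    return raw / (SCALE * SCALE)   # rescale back to a floating-point score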