First success with neural nets
Posted: Thu Sep 24, 2020 12:16 am
I've been playing around with neural nets recently, which I was totally new to, and managed the first result that I'd consider a success.
I trained a rook-and-pawns endgame neural net, used by the dev version, that is +3 elo versus the non-neural-net evaluation in rook endgames. However, it should be much stronger once I do some of the possible speed optimizations, as it could probably be around 3 times faster than it is now.
Initially I was surprised at how simple the concept of using a neural net to compute a value is once you understand how they're usually structured, and the first actual implementation was pretty quick too. It really is just some weights and biases, with multiplications and sums (and usually densely connected, with a weight from each input to each output.)
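To make that concrete, here's a minimal sketch of a densely connected forward pass in numpy. The layer sizes are just illustrative, not the net the engine actually uses:

```python
import numpy as np

def dense_forward(x, layers):
    """Evaluate a fully connected net. Each layer is (weights, biases);
    weights has shape (n_out, n_in). ReLU on hidden layers only."""
    for i, (w, b) in enumerate(layers):
        x = w @ x + b                      # every input feeds every output
        if i < len(layers) - 1:            # no activation on the final scalar
            x = np.maximum(x, 0.0)
    return x

# Tiny illustrative net: 4 inputs -> 3 hidden -> 1 output
rng = np.random.default_rng(0)
layers = [(rng.normal(size=(3, 4)), np.zeros(3)),
          (rng.normal(size=(1, 3)), np.zeros(1))]
score = dense_forward(np.array([1.0, 0.0, 1.0, 0.0]), layers)
```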
Training a neural net proved a far tougher task. I started with my own training code, and after fixing some issues (one example: don't initialize all the weights with the same value) it worked to some extent. I could match piece-square-table values in a test. But training was slow, and with more layers and more complicated target values it didn't reduce error the way it should have. So at that point I downloaded TensorFlow, exported the training data as an inputs file and a targets file, and imported the trained weights back. It turned out to be easier to use training code that has already been debugged and optimized. Out of curiosity I may go back and try to make my own work and use it, but it seems likely that TensorFlow would just be better in every way except the import/export workflow step.
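The identical-weights pitfall is easy to demonstrate: if every hidden unit starts with the same weights, they all compute the same value and receive the same gradient, so they never differentiate. A toy sketch (a hypothetical two-layer net with hand-written gradients, not my actual training code):

```python
import numpy as np

rng = np.random.default_rng(1)
x = rng.normal(size=(5,))          # one training input
target = 1.0

def step(w1, w2, lr=0.1):
    """One gradient step on squared error for a 5 -> 3 -> 1 ReLU net."""
    h = np.maximum(w1 @ x, 0.0)    # hidden layer
    y = w2 @ h
    err = y - target
    g2 = err * h                            # dLoss/dw2
    g1 = np.outer(err * w2 * (h > 0), x)    # dLoss/dw1
    return w1 - lr * g1, w2 - lr * g2

# Identical initialization: all three hidden units start the same...
w1 = np.full((3, 5), 0.5)
w2 = np.full(3, 0.5)
for _ in range(10):
    w1, w2 = step(w1, w2)
# ...and after training they are still identical, wasting the capacity.
```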
Once I made a generalized neural network class, I set up a net structure that uses the pieces on the board directly as input features, with 4 layers for rook endgames. As inputs I started with 64 rook squares, 64 king squares, and 48 pawn squares, each 0 or 1 depending on whether the square is occupied.
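A sketch of how one side's occupancy features could be packed into a 0/1 vector (the square numbering and helper name are just illustrative, not the engine's actual layout):

```python
import numpy as np

# One side's features: 64 rook squares + 64 king squares + 48 pawn squares.
# Pawns can never stand on the 1st or 8th rank, hence 48 pawn squares.
N_SIDE = 64 + 64 + 48  # 176; both sides together would be 352

def encode_side(rook_sqs, king_sq, pawn_sqs):
    """0/1 feature vector for one side. Squares are 0..63, a1=0 .. h8=63."""
    f = np.zeros(N_SIDE, dtype=np.float32)
    for sq in rook_sqs:
        f[sq] = 1.0                # rooks occupy the first 64 slots
    f[64 + king_sq] = 1.0          # king occupies the next 64
    for sq in pawn_sqs:
        f[128 + (sq - 8)] = 1.0    # drop rank 1: pawn squares are 8..55
    return f

white = encode_side(rook_sqs=[0], king_sq=4, pawn_sqs=[8, 9])
```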
I filtered some games to get an EPD of about 10,000 rook-endgame starting positions, ran some matches of Slow 2.3 against itself, exported the training data scored by game result, and trained a net of 176 inputs x 256 x 64 x 32 x 1.
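For a net of these sizes, one plausible way to handle the weight round trip between trainer and engine is a flat float32 file both sides understand. This format is purely illustrative, not the one I actually use:

```python
import numpy as np

# Layer sizes from the net above: 176 inputs -> 256 -> 64 -> 32 -> 1
SIZES = [176, 256, 64, 32, 1]

def export_weights(layers, path):
    """Write all (weights, biases) pairs in layer order as raw float32."""
    with open(path, "wb") as f:
        for w, b in layers:
            w.astype(np.float32).tofile(f)
            b.astype(np.float32).tofile(f)

def import_weights(path):
    """Read the same flat format back into (weights, biases) pairs."""
    data = np.fromfile(path, dtype=np.float32)
    layers, off = [], 0
    for n_in, n_out in zip(SIZES, SIZES[1:]):
        w = data[off:off + n_in * n_out].reshape(n_out, n_in)
        off += n_in * n_out
        b = data[off:off + n_out]
        off += n_out
        layers.append((w, b))
    return layers

rng = np.random.default_rng(2)
layers = [(rng.normal(size=(o, i)).astype(np.float32), np.zeros(o, np.float32))
          for i, o in zip(SIZES, SIZES[1:])]
export_weights(layers, "net.bin")
restored = import_weights("net.bin")
```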
I was surprised at how much error was reduced.
The first try was -458 elo versus the base. Obviously massively overfit, with too little data. I added the data from the games played and retrained, and got -425 elo, then -401, then -358. As expected, training error was going up as the amount of training data increased.
It was at this point that I realized I had a very bad bug... I was actually setting the total inputs to 176 instead of 352, so the net was only seeing the white side of the board this whole time.
I corrected this and was at -170 elo. Then I optimized the first layer to only do the calculation when an input is non-zero, which for rook endgames was a 2.5x speed increase, and was at -111 elo. More training plus horizontal symmetry (32 squares for the white king, with the rest of the board flipped to match): -88 elo.
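The non-zero-input optimization works because with 0/1 features, the first layer's matrix-vector product collapses to a sum of the weight columns for the active features, and endgames have very few of those. A numpy sketch:

```python
import numpy as np

rng = np.random.default_rng(3)
n_in, n_out = 352, 256
w1 = rng.normal(size=(n_out, n_in)).astype(np.float32)
b1 = np.zeros(n_out, dtype=np.float32)

x = np.zeros(n_in, dtype=np.float32)
x[[3, 60, 64 + 20, 200]] = 1.0     # only a handful of pieces on the board

# Dense first layer: touches all 352 columns.
dense = w1 @ x + b1

# Sparse version: with 0/1 inputs, the product is just the sum of the
# columns for the active features, so far fewer multiply-adds in endgames.
active = np.flatnonzero(x)
sparse = b1 + w1[:, active].sum(axis=1)
```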
More training and a side-to-move input: -70 elo. A reduced net size and a faster memory layout for inputs: -55 elo. After this it was basically just more training, and increasing the positions sampled per game seemed to help too; it was better to have more data and worry less about positions being too similar.
-43, -38, -28, -20, -14, -7, and finally +3. I'm up to 1.5 million training positions at this point.
Why rook endgames: I wanted to experiment in a smaller setting, since I'm new to neural nets and starting from scratch in all aspects (except the use of TensorFlow.)
Rook endgames are also the most common endgames, and I figured starting from the endgame and working back might be a reasonable method.
Right now I'm using floating point with some SIMD. I estimate it could be about 3 times faster if I change to 16-bit fixed point with SIMD and add incremental updates for changed inputs on the first layer. After optimization I might play with net size and inputs to see what works best.
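Incremental updates exploit the same sparsity as before: when a move toggles one input off and another on, the first layer's pre-activation accumulator can be patched with one column subtraction and one addition instead of recomputing the full matrix-vector product. A sketch with illustrative sizes:

```python
import numpy as np

rng = np.random.default_rng(4)
n_in, n_out = 352, 256
w1 = rng.normal(size=(n_out, n_in)).astype(np.float32)

x = np.zeros(n_in, dtype=np.float32)
x[[4, 70, 200]] = 1.0
acc = w1 @ x        # first-layer accumulator, before activation

def make_move(acc, x, frm, to):
    """Hypothetical move: feature `frm` turns off, `to` turns on.
    Patch the accumulator in O(n_out) instead of O(n_out * n_in)."""
    x[frm], x[to] = 0.0, 1.0
    return acc - w1[:, frm] + w1[:, to]

acc = make_move(acc, x, frm=70, to=71)
```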
This is just rook-and-pawns endgames, without enough testing, and with a workflow that needs improvement, so don't expect any new version of Slow Chess any time soon. It also might still be overfit and not do as well with other positions and opponents. But it seems likely that the method could yield significant gains eventually.