You've trained a brilliant NN(UE) King-Piece Network. Now what?


AndrewGrant
Posts: 1754
Joined: Tue Apr 19, 2016 6:08 am
Location: U.S.A
Full name: Andrew Grant

You've trained a brilliant NN(UE) King-Piece Network. Now what?

Post by AndrewGrant »

Skip to the bottom if you don't want to humor me....

Monologue

::::

So I wrote an NN training tool in C, made it very well threaded, cranked up the optimizations, wrote AVX2 code for the hot portions to perform weight updates faster, and came up with a novel addition to Adam that limits the work done on sparse inputs by batching in fixed sizes and refusing to shuffle intra-batch, only inter-batch.
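In loose C, the sparse part of that Adam tweak boils down to something like the following sketch -- not my actual trainer code, just the idea, with bias correction left out and all the names made up:

Code: Select all

/* Loose sketch of a sparsity-aware Adam step: with HalfKP-style inputs only a
   handful of first-layer columns are active per batch, so the update only
   touches weights whose index appears in the active list.
   (Bias correction omitted for brevity; names are placeholders.) */

#include <math.h>

typedef struct {
    float m, v;   /* Adam first and second moment estimates */
} AdamState;

void adam_update_sparse(float *weights, AdamState *state, const float *grad,
                        const int *active, int nactive,
                        float lr, float beta1, float beta2, float eps)
{
    for (int k = 0; k < nactive; k++) {
        int i = active[k];                 /* only inputs seen in this batch */
        AdamState *s = &state[i];
        s->m = beta1 * s->m + (1.0f - beta1) * grad[i];
        s->v = beta2 * s->v + (1.0f - beta2) * grad[i] * grad[i];
        weights[i] -= lr * s->m / (sqrtf(s->v) + eps);
    }
}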

The result? An NNUE 2x[40960x256] => 512x32x32x1 that beats Ethereal master in Fischer Random Chess by +200 elo in even-nodes gameplay. That was trained with 300M positions, which is far from the "standard", and was also trained on Ethereal HCE. There is no reason to believe that additional games, and a regeneration of the tuning data, using the new NNUE evaluation to label, will net even more elo.

So... what to do with it. That is the question. I tried, and failed, to port it to SF's structure. SF's optimization requires playing games with the values of the weights and biases, something that is not a given for floating-point-trained networks. I've not coded the actual "Incremental" part yet, but from some measurements I can show that my net will run 400+ knps slower than "Etherlito" even under best-case assumptions. That is quite damning. Ethereal was already somewhat "Fast and Dumb", so the reduced speed is painful.

So let's test the speed theory further. I trained another network, this time 2x[40960x64] => 128x32x32x1, which has 1/4 of the input weights, resulting in about 1/3 of the total FMAs needed (trust the math). What does that do (in FRC)? It beats Ethereal master by about 14 elo, despite not doing incremental updates, i.e. recomputing the first layer every time.

So you might say: Andrew, you've shown that you _can_ beat master with an NNUE, so why don't you man up, code the incremental update, and then come cry on the forums? Well, that is a fair point. I'll get around to it, at the expense of other things, soon enough.
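For what it's worth, the incremental update itself is conceptually simple; here is a loose sketch in floats, which is neither my code nor SF's -- HIDDEN and the feature indices are just placeholders. On a quiet move you subtract the column for the old (king, piece, square) feature and add the column for the new one; a capture subtracts one more column, and a king move forces a full refresh because every feature index changes.

Code: Select all

/* Minimal sketch of the incremental first-layer update, in floats.
   HIDDEN and the feature indexing are assumptions, not any engine's real code. */

#define HIDDEN 256

void accumulator_move_piece(float *acc,            /* [HIDDEN] running first-layer sums         */
                            const float *weights,  /* [40960][HIDDEN] input weights             */
                            int from_feature,      /* index of the (king, piece, from-sq) input */
                            int to_feature)        /* index of the (king, piece, to-sq) input   */
{
    const float *sub = weights + (size_t)from_feature * HIDDEN;
    const float *add = weights + (size_t)to_feature   * HIDDEN;

    for (int i = 0; i < HIDDEN; i++)
        acc[i] += add[i] - sub[i];
}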

::::

So floating point NN implementations on the CPU have some downsides and some upsides. A downside is that you will not get deterministic behavior across platforms, which is an issue since I rely on determinism. As an upside, it's trivial. Multiplying and adding floats together takes very few brain-cells to program. Since a 32-bit float times a 32-bit float is, once again, 32 bits, everything works out nicely. When you deal with int16_t, you don't get that luxury.
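Concretely, that just means widening: the product of two int16_t values needs 32 bits, so the accumulator has to be an int32_t. Illustration only, assuming the sums stay small enough not to overflow it:

Code: Select all

#include <stdint.h>

int32_t dot_i16(int n, const int16_t *w, const int16_t *x)
{
    int32_t acc = 0;
    for (int i = 0; i < n; i++)
        acc += (int32_t)w[i] * (int32_t)x[i];   /* widen before multiplying */
    return acc;
}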

So, my real question, I suppose: quantizing down to int16_t does not appear, to me, to be worth very much. You can perform some extra ops -- speed up the ReLUs and such -- but the multiplication is not a perfect 2x gain, I believe. Am I wrong? Can you line up AVX instructions such that this is not a concern?
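For concreteness, the kind of thing I have in mind is vpmaddwd (_mm256_madd_epi16), which does sixteen int16 multiplies and the first level of adds in a single instruction -- but the result widens to int32 halfway through, which is why I doubt a clean 2x. An untested sketch, assuming the count is a multiple of 16:

Code: Select all

/* Untested sketch of an int16 dot product with AVX2. _mm256_madd_epi16 multiplies
   sixteen int16 pairs and pairwise-adds them into eight int32 lanes. */

#include <immintrin.h>
#include <stdint.h>

int32_t dot_i16_avx2(int n, const int16_t *w, const int16_t *x)
{
    __m256i acc = _mm256_setzero_si256();

    for (int i = 0; i < n; i += 16) {
        __m256i vw = _mm256_loadu_si256((const __m256i *)(w + i));
        __m256i vx = _mm256_loadu_si256((const __m256i *)(x + i));
        acc = _mm256_add_epi32(acc, _mm256_madd_epi16(vw, vx));
    }

    /* horizontal sum of the eight int32 lanes */
    __m128i lo = _mm256_castsi256_si128(acc);
    __m128i hi = _mm256_extracti128_si256(acc, 1);
    __m128i s  = _mm_add_epi32(lo, hi);
    s = _mm_add_epi32(s, _mm_shuffle_epi32(s, _MM_SHUFFLE(1, 0, 3, 2)));
    s = _mm_add_epi32(s, _mm_shuffle_epi32(s, _MM_SHUFFLE(2, 3, 0, 1)));
    return _mm_cvtsi128_si32(s);
}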

Assuming not, this means you have to dive down into the world of int8_t. That is a scary world. At that point, it appears to me, you have to take special action during training to ensure that the weights stay within some smallish range. Gary pointed this out in his pytorch thread with regard to "effectively (my quotes)" multiplying the NN output by 600 at the end, to keep the final set of weights from having to grow very large.
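The picture I have of that quantization step is roughly the following -- the scale of 64 is an arbitrary example of mine, not anything SF or Gary actually uses. You pick a fixed-point scale, round, and clamp, and anything outside the int8 range just gets clipped, which is exactly why training has to keep the weights small:

Code: Select all

/* Rough sketch of a post-training int8 quantization pass.
   The scale is a made-up example; real engines pick it to match their
   fixed-point arithmetic downstream. */

#include <math.h>
#include <stdint.h>

static int8_t quantize_weight(float w, float scale)
{
    float q = roundf(w * scale);
    if (q >  127.0f) q =  127.0f;   /* weights outside the range are clipped */
    if (q < -128.0f) q = -128.0f;
    return (int8_t)q;
}

void quantize_layer(const float *w, int8_t *qw, int count)
{
    const float scale = 64.0f;      /* hypothetical fixed-point scale */
    for (int i = 0; i < count; i++)
        qw[i] = quantize_weight(w[i], scale);
}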

::::

One thing that has made me quite upset is that there appears to be very little literature online discussing the nuances of these things, and how to actually implement them in C. It looks like reading CFish is the best crash course you could get, but I'm not too keen, since I'm not a fan of the programming style and I also don't want to just duplicate that effort. I wrote the trainer with no direct contact with the SF code, so I'd like to see if I can do the implementation on my own as well.

I suppose the end result might be that I don't care to do the implementation. In which case it's weird that I have a big function of 20 million weights sitting on my desktop, one that would gain 100+ elo if implemented well, but that no one will ever see.
#WeAreAllDraude #JusticeForDraude #RememberDraude #LeptirBigUltra
"Those who can't do, clone instead" - Eduard ( A real life friend, not this forum's Eduard )
Madeleine Birchfield
Posts: 512
Joined: Tue Sep 29, 2020 4:29 pm
Location: Dublin, Ireland
Full name: Madeleine Birchfield

Re: You've trained a brilliant NN(UE) King-Piece Network. Now what?

Post by Madeleine Birchfield »

If there is little literature online discussing the nuances of implementing the neural network inference and training code, then I think it would be a good idea to write your discoveries up in a paper and publish it, as you did with your tuning paper, so that there is more literature online for future chess engine developers.
smatovic
Posts: 2645
Joined: Wed Mar 10, 2010 10:18 pm
Location: Hamburg, Germany
Full name: Srdja Matovic

Re: You've trained a brilliant NN(UE) King-Piece Network. Now what?

Post by smatovic »

Dude, you are funny, "Don't copy SF NNUE, you are lame!!!111", "Where are the papers for NN implementations???", "I got +100 Elo sitting in here but won't publish it!!!111", what shall 'we' make out of this?

Maybe you should go commercial? Seriously, that could relax some things.

--
Srdja
AndrewGrant
Posts: 1754
Joined: Tue Apr 19, 2016 6:08 am
Location: U.S.A
Full name: Andrew Grant

Re: You've trained a brilliant NN(UE) King-Piece Network. Now what?

Post by AndrewGrant »

smatovic wrote: Thu Nov 19, 2020 9:53 am Dude, you are funny, "Don't copy SF NNUE, you are lame!!!111", "Where are the papers for NN implementations???", "I got +100 Elo sitting in here but won't publish it!!!111", what shall 'we' make out of this?

Maybe you should go commercial? Seriously, that could relax some things.
?
#WeAreAllDraude #JusticeForDraude #RememberDraude #LeptirBigUltra
"Those who can't do, clone instead" - Eduard ( A real life friend, not this forum's Eduard )
smatovic
Posts: 2645
Joined: Wed Mar 10, 2010 10:18 pm
Location: Hamburg, Germany
Full name: Srdja Matovic

Re: You've trained a brilliant NN(UE) King-Piece Network. Now what?

Post by smatovic »

You've trained a brilliant NN(UE) King-Piece Network. Now what?
Maybe you should go commercial?

--
Srdja
Madeleine Birchfield
Posts: 512
Joined: Tue Sep 29, 2020 4:29 pm
Location: Dublin, Ireland
Full name: Madeleine Birchfield

Re: You've trained a brilliant NN(UE) King-Piece Network. Now what?

Post by Madeleine Birchfield »

The other thing to remember is that this technology (NN on CPU) is still relatively new, having only been introduced to the wider computer chess community a few months ago from the shogi world, so I would not expect there to be many papers on this subject that are not in Japanese, especially since most NNs in the real world require GPUs to operate.
AndrewGrant
Posts: 1754
Joined: Tue Apr 19, 2016 6:08 am
Location: U.S.A
Full name: Andrew Grant

Re: You've trained a brilliant NN(UE) King-Piece Network. Now what?

Post by AndrewGrant »

Madeleine Birchfield wrote: Thu Nov 19, 2020 9:59 am The other thing to remember is that this technology (NN on CPU) is still relatively new, having only been introduced to the wider computer chess community a few months ago from the shogi world, so I would not expect there to be many papers on this subject that are not in Japanese, especially since most NNs in the real world require GPUs to operate.
Well, this feedforward MLP business has been around for decades. I figured at some point there would be some working implementations with small integer types, especially since floats have only recently become comparable in speed to int32 operations. Maybe I'm wrong, but this NN stuff for chess is not novel -- it's 30 years old -- and has only now resurfaced. But what I want to read about is not even specific to chess, or even to NNs. I want to read about optimized matrix operations on various datatypes -- the stuff that sits inside the many libraries.
#WeAreAllDraude #JusticeForDraude #RememberDraude #LeptirBigUltra
"Those who can't do, clone instead" - Eduard ( A real life friend, not this forum's Eduard )
AndrewGrant
Posts: 1754
Joined: Tue Apr 19, 2016 6:08 am
Location: U.S.A
Full name: Andrew Grant

Re: You've trained a brilliant NN(UE) King-Piece Network. Now what?

Post by AndrewGrant »

There is no reason to believe that additional games, and a regeneration of the tuning data, using the new NNUE evaluation to label, will net even more elo.
Typo there -- it should have said "there is no reason NOT to believe".
#WeAreAllDraude #JusticeForDraude #RememberDraude #LeptirBigUltra
"Those who can't do, clone instead" - Eduard ( A real life friend, not this forum's Eduard )
mar
Posts: 2555
Joined: Fri Nov 26, 2010 2:00 pm
Location: Czech Republic
Full name: Martin Sedlak

Re: You've trained a brilliant NN(UE) King-Piece Network. Now what?

Post by mar »

for floating point, it should be possible to rely on the compiler

Code: Select all

/* turn off precise FP so the compiler is free to reorder and vectorize the reduction */
#pragma float_control(precise, off, push)

float dot_product(int count, const float *a, const float *b)
{
	float result = 0;

	for (int i = 0; i < count; i++)
		result += a[i] * b[i];

	return result;
}

#pragma float_control(pop)
clang -O3 -march=core-avx2 actually produces a nice inner loop on Godbolt

for integers, this looks very much like the old fixed-point problem: multiplication gives you 32 bits that you need to renormalize (at some point) to get back into range. A right shift (>>) might work well, except that for signed quantities you have to be careful, because -1 >> n is still -1 -- arithmetic shift right simply duplicates the most significant bit (talking two's complement)

relying on SIMD, perhaps it's possible to multiply and keep only the most significant 16 bits of each result, but then you'd need to shift the result left by 1 to renormalize, losing 1 bit of precision along the way (assuming your weights are -32k..32k)
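something like this, untested, assuming Q15 values and a count that is a multiple of 16:

Code: Select all

/* Untested sketch of the idea above: treat int16 values as Q15 fixed point,
   take the high 16 bits of each 32-bit product with vpmulhw, then shift left
   by 1 to get back to Q15, dropping 1 bit of precision. */

#include <immintrin.h>
#include <stdint.h>

void mul_q15(int n, const int16_t *a, const int16_t *b, int16_t *out)
{
    for (int i = 0; i < n; i += 16) {
        __m256i va = _mm256_loadu_si256((const __m256i *)(a + i));
        __m256i vb = _mm256_loadu_si256((const __m256i *)(b + i));
        __m256i hi = _mm256_mulhi_epi16(va, vb);   /* (a*b) >> 16, per lane */
        __m256i r  = _mm256_slli_epi16(hi, 1);     /* renormalize to Q15    */
        _mm256_storeu_si256((__m256i *)(out + i), r);
    }
}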
Martin Sedlak
Madeleine Birchfield
Posts: 512
Joined: Tue Sep 29, 2020 4:29 pm
Location: Dublin, Ireland
Full name: Madeleine Birchfield

Re: You've trained a brilliant NN(UE) King-Piece Network. Now what?

Post by Madeleine Birchfield »

AndrewGrant wrote: Thu Nov 19, 2020 10:03 am Well, this feedforward MLP business has been around for decades. I figured at some point there would be some working implementations with small integer types, especially since floats have only recently become comparable in speed to int32 operations. Maybe I'm wrong, but this NN stuff for chess is not novel -- it's 30 years old -- and has only now resurfaced.
On the other hand, much of the old theory of neural networks relied on real numbers/floating point, and only in the past 5 years, after hardware improved enough to run neural networks in real time, did people turn towards studying weight quantisation to speed them up. So there is some work in the machine learning community regarding small integer types, but I would expect most of the literature to date from after 2015 or so. That goes doubly for the computer chess community as a whole, which is still somewhat stuck in 2005 with poorly-tuned handcrafted evaluations, so everything here has been relatively new to all of us.