You've trained a brilliant NN(UE) King-Piece Network. Now what?

gladius · Post by **gladius** » Thu Nov 19, 2020 10:39 am

AndrewGrant wrote: ↑Thu Nov 19, 2020 8:57 am So I wrote an NNTraining tool in C, made it very well threaded, cranked up optimizations, wrote AVX2 code for portions to perform weight updates faster, came up with a novel addition to Adam to limit throughput when dealing with sparse inputs by batching in fixed sizes, and refusing to shuffle intra-batch, only inter-batch.

Novel addition to Adam is worth a write-up/paper in itself if you are interested in that.

AndrewGrant wrote: ↑Thu Nov 19, 2020 8:57 am One thing that has made me quite upset is that there appears to be very little literature online discussing the nuances of these things, and how to actually implement such things in C. It looks like looking at CFish is the best crash-course you could get, but I'm not too keen since I'm not a fan of the programming style, and I also don't want to just duplicate that effort. I wrote the trainer with no direct contact with the SF code, so I'd like to see if I could do the implementation on my own as well.

There is literature on this; it's just pretty specialized. See https://engineering.fb.com/2018/11/07/m ... ns/fbgemm/ (and the open-source implementation), as well as this paper https://arxiv.org/abs/1712.05877 for the PyTorch methodology.

The SF quantization approach is unusual, agreed. It's not anywhere close to theoretically optimal, so there is a ton of room to improve/experiment here.
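
For readers who have not seen it, here is a minimal sketch of the affine scheme from the arXiv paper linked above, where a real value r is approximated as scale * (q - zero_point). It is not the Stockfish scheme. The int8 range, the `QuantParams` struct, and all function names are assumptions made for illustration.

```c
#include <stdint.h>

/* Hypothetical parameter pair: real ~= scale * (q - zero_point). */
typedef struct {
    float scale;
    int32_t zero_point;
} QuantParams;

/* Round half away from zero without needing <math.h>. */
static int32_t round_nearest(float x)
{
    return (int32_t)(x >= 0.0f ? x + 0.5f : x - 0.5f);
}

/* Map [min_val, max_val] onto the int8 range [-128, 127].
   The range is widened to contain 0 so that 0.0 is exactly representable. */
static QuantParams choose_params(float min_val, float max_val)
{
    QuantParams p;
    if (min_val > 0.0f) min_val = 0.0f;
    if (max_val < 0.0f) max_val = 0.0f;
    p.scale = (max_val - min_val) / 255.0f;
    if (p.scale == 0.0f) p.scale = 1.0f;   /* degenerate all-zero range */
    p.zero_point = round_nearest(-128.0f - min_val / p.scale);
    return p;
}

/* Quantize with saturation; clamping happens in the float domain so that
   very large inputs cannot overflow the integer conversion. */
static int8_t quantize(float r, QuantParams p)
{
    float x = r / p.scale + (float)p.zero_point;
    if (x < -128.0f) x = -128.0f;
    if (x > 127.0f) x = 127.0f;
    return (int8_t)round_nearest(x);
}

static float dequantize(int8_t q, QuantParams p)
{
    return p.scale * (float)(q - p.zero_point);
}
```

With `choose_params(-1.0f, 3.0f)` the round-trip error for in-range values stays within about half a quantization step, and out-of-range inputs saturate rather than wrap.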

Madeleine Birchfield · Thu Nov 19, 2020 10:42 am

gladius wrote: ↑Thu Nov 19, 2020 10:39 am
AndrewGrant wrote: ↑Thu Nov 19, 2020 8:57 am So I wrote an NNTraining tool in C, made it very well threaded, cranked up optimizations, wrote AVX2 code for portions to perform weight updates faster, came up with a novel addition to Adam to limit throughput when dealing with sparse inputs by batching in fixed sizes, and refusing to shuffle intra-batch, only inter-batch.
Novel addition to Adam is worth a write-up/paper in itself if you are interested in that.

AndrewGrant wrote: ↑Thu Nov 19, 2020 8:57 am One thing that has made me quite upset is that there appears to be very little literature online discussing the nuances of these things, and how to actually implement such things in C. It looks like looking at CFish is the best crash-course you could get, but I'm not too keen since I'm not a fan of the programming style, and I also don't want to just duplicate that effort. I wrote the trainer with no direct contact with the SF code, so I'd like to see if I could do the implementation on my own as well.
There is literature on this; it's just pretty specialized. See https://engineering.fb.com/2018/11/07/m ... ns/fbgemm/ (and the open-source implementation), as well as this paper https://arxiv.org/abs/1712.05877 for the PyTorch methodology.

The SF quantization approach is unusual, agreed. It's not anywhere close to theoretically optimal, so there is a ton of room to improve/experiment here.

There is also this paper: https://arxiv.org/pdf/1906.00532v2.pdf and this article with sources at the end: https://www.mathworks.com/company/newsl ... works.html

smatovic · Post by **smatovic** » Thu Nov 19, 2020 10:57 am

AndrewGrant wrote: ↑Thu Nov 19, 2020 9:55 am ...
?

Okay, sorry for jumping on you, got it the wrong way...

AFAIK Google used bfloat16 on TPU gen2 for training A0 and INT8 on TPU gen1 for inference:

http://talkchess.com/forum3/viewtopic.p ... 55#p796255

My gut feeling tells me that bfloat16 for the weights with upcoming support on vector units is the way to go.

--
Srdja
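
As a rough illustration of what storing weights as bfloat16 would involve: a bfloat16 value is the upper 16 bits of an IEEE-754 float32, so conversion is a rounding truncation. The sketch below rounds to nearest-even; the function names are hypothetical and no particular hardware instruction is assumed.

```c
#include <stdint.h>
#include <string.h>

/* Convert float32 to bfloat16 (stored in a uint16_t), round-to-nearest-even. */
static uint16_t float_to_bf16(float f)
{
    uint32_t bits;
    memcpy(&bits, &f, sizeof bits);          /* type-pun safely */

    /* NaN: keep it a NaN instead of letting rounding carry into infinity. */
    if ((bits & 0x7FFFFFFFu) > 0x7F800000u)
        return (uint16_t)((bits >> 16) | 0x0040u);

    uint32_t lsb = (bits >> 16) & 1u;        /* lowest bit that survives */
    bits += 0x7FFFu + lsb;                   /* ties go to the even result */
    return (uint16_t)(bits >> 16);
}

/* Widen bfloat16 back to float32; the low 16 mantissa bits become zero. */
static float bf16_to_float(uint16_t h)
{
    uint32_t bits = (uint32_t)h << 16;
    float f;
    memcpy(&f, &bits, sizeof f);
    return f;
}
```

The format keeps the full float32 exponent range and only about three significant decimal digits, which is the trade-off under discussion.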

Henk · Post by **Henk** » Thu Nov 19, 2020 1:53 pm

Delete it. Too complicated. Fuzzy logic in a network structure. Nobody understands. Statistical lies.
Turtle will win

Daniel Shawul · Post by **Daniel Shawul** » Thu Nov 19, 2020 4:55 pm

mar wrote: ↑Thu Nov 19, 2020 10:26 am for floating point, it should be possible to rely on the compiler
Code:
#pragma float_control(precise, off, push)

float dot_product(int count, const float *a, const float *b)
{
	float result = 0;

	for (int i=0; i<count; i++)
		result += a[i] * b[i];

	return result;
}

#pragma float_control(pop)
clang -O3 -march=core-avx2 actually produces a nice inner loop in GodBolt

for integers, this looks very much like the old fixed point problem, so multiplication would give you 32 bits that you need to renormalize (at some point) to get back into range, >> might work well, except that for signed quantities one has to be careful because -1 >> n is still -1, because arithmetic shift right simply duplicates the most significant bit (talking 2's complement)

relying on SIMD, perhaps it's possible to multiply and keep the most significant 16 bits of the result, but then you'd need to shift the result left by 1 to renormalize, losing 1 bit of precision along the way (assuming your weights are -32k..32k)
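
To make the signed-shift pitfall above concrete, here is a small scalar sketch of renormalizing a 16x16-bit fixed-point product. The function names and the choice of round-half-away-from-zero are assumptions for illustration, not taken from any engine.

```c
#include <stdint.h>

/* Naive renormalization. With an arithmetic shift this rounds toward
   negative infinity, so any small negative product becomes -1, never 0.
   (Right-shifting a negative value is implementation-defined in C;
   mainstream compilers on two's-complement targets shift arithmetically.) */
static int32_t renorm_shift(int32_t prod, int shift)
{
    return prod >> shift;
}

/* Symmetric alternative: round to nearest, ties away from zero, using
   division so the result does not depend on how negative shifts behave.
   Assumes 1 <= shift <= 30 and |prod| <= 2^30, which always holds for
   the product of two int16_t values. */
static int32_t renorm_round(int32_t prod, int shift)
{
    int32_t half = (int32_t)1 << (shift - 1);
    int32_t div = (int32_t)1 << shift;
    return prod >= 0 ? (prod + half) / div : -((half - prod) / div);
}

/* Multiply two fixed-point values that share `shift` fractional bits. */
static int32_t fixed_mul(int16_t a, int16_t b, int shift)
{
    return renorm_round((int32_t)a * (int32_t)b, shift);
}
```

With 6 fractional bits, `fixed_mul(64, 64, 6)` gives 64 (1.0 times 1.0), and `fixed_mul(-1, 1, 6)` gives 0 where the plain shift would typically give -1.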

Yeah, clang has pretty good auto-vectorization. With GCC and MSVC, on the other hand, you have to drag their asses to get them to recognize even a straightforward vector add with no reduction.
Computing the dot product is one routine that really needed to be written using SIMD intrinsics for me. But I ended up writing SIMD code for add/subtract and clampVectors as well, for the sake of MSVC/GCC.

D Sceviour · Post by **D Sceviour** » Thu Nov 19, 2020 5:04 pm

AndrewGrant wrote: ↑Thu Nov 19, 2020 10:23 am
There is no reason to believe that additional games, and a regeneration of the tuning data, using the new NNUE evaluation to label, will net even more elo.
Typo there -- should have said "there is no reason NOT to believe --

Wrong again. Double negatives are bad grammatical form. It should have said "there is reason to believe..."

Madeleine Birchfield · Thu Nov 19, 2020 7:05 pm

Henk wrote: ↑Thu Nov 19, 2020 1:53 pm Delete it. Too complicated. Fuzzy logic in a network structure. Nobody understands. Statistical lies.
Turtle will win

Long after the hare has died, the turtle is still marching slowly towards the goal. But most chess engine developers are not turtles; they behave like very slow hares. They come into the community, work on an engine for a few months to a few years, and then leave for a decade before returning with an update worth 100 elo, most of which comes from search improvements rather than improvements to handcrafted evaluation (the true gain from handcrafted evaluation might be 10 elo or so), or they decide to start over from scratch. The time scale for improvements to handcrafted evaluations is on the order of decades, and unless you have a lot of computing power to accelerate development or a dedicated community of developers willing to continue your work after you retire (Stockfish has both), you will hardly get anywhere before you decide to retire or die. And unlike humans and hares, turtles do not die in 60 years. The fact that no independent developer was able to surpass Houdini 6 until the adoption of neural networks speaks volumes about how much harder it is to make improvements solely with handcrafted evaluations once a certain elo threshold is reached. This attitude is why the computer chess community is primarily stuck in 2005, with 10 elo gains in every update to their poorly tuned handcrafted-eval engines.

Furthermore, just because you do not understand how neural networks work and want to press on with your handcrafted evaluation doesn't mean that you should force others to burn all literature regarding neural networks.

jdart · Post by **jdart** » Thu Nov 19, 2020 7:36 pm

AndrewGrant wrote: ↑Thu Nov 19, 2020 8:57 am I suppose the end result might be that I don't care to do the implementation. In which case it's weird that I have a big function of 20 million weights sitting on my desktop, which would gain 100+ elo if implemented well, but no one will ever see it.

I have a somewhat similar issue: my engine is not GPL and so grafting Stockfish/Cfish code into it is not an option.

I'm interested in NNUE but doing a clean room version of the code seems to be a big project.

Madeleine Birchfield · Thu Nov 19, 2020 7:50 pm

jdart wrote: ↑Thu Nov 19, 2020 7:36 pm
AndrewGrant wrote: ↑Thu Nov 19, 2020 8:57 am I suppose the end result might be that I don't care to do the implementation. In which case it's weird that I have a big function of 20 million weights sitting on my desktop, which would gain 100+ elo if implemented well, but no one will ever see it.
I have a somewhat similar issue: my engine is not GPL and so grafting Stockfish/Cfish code into it is not an option.

I'm interested in NNUE but doing a clean room version of the code seems to be a big project.

The NNUE inference code is the one thing that cannot be copied from Stockfish/Cfish, and it is probably the easiest of all the NNUE pieces to write from scratch. Seer has a good and simple independent implementation of the NNUE inference code (though since it is also GPL it cannot be used in Arasan either). On the other hand, the training code looks fairly difficult, with even the Stockfish team themselves having trouble with it, but the GPL for the engine code doesn't apply to the training code, as only the result of the training (which is in the public domain) is embedded in the engine. I think it would be perfectly fine for Arasan to use a net trained by Andrew Grant's net trainer.

David Carteau · Post by **David Carteau** » Thu Nov 19, 2020 8:24 pm

The inference code is quite simple once you understand what you have to do. It took me only a few days (back in August) to have my own C implementation. In contrast, the training part is much more difficult to write from scratch... I finally managed to do it a few days ago, with great results (I'm very happy about that: I learned something new and very challenging!).

I'm not familiar at all with licenses (that's one of the reasons why Orion's source code is not available, the others being that the engine is too weak to be helpful, and the code is maybe not so "elegant"), but if people are interested and if it is possible to find the most "public domain" license possible, I would be happy to share my work (both inference code in C, with incremental updates, and training code in Python, using PyTorch).

Two points to note:
* the training code produces networks that are very similar to SF's, but not compatible (strength seems very good, but I haven't tested intensively so far);
* the networks store their weights as floats, which means 40 MB for the "standard" NNUE architecture. But this leaves a lot of interesting enhancement possibilities: quantization, smaller architectures, etc.
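
The 40 MB figure can be checked with a little arithmetic. The sketch below assumes that the "standard" architecture means the Stockfish HalfKP layout of the time (41024 inputs per perspective, a shared 256-wide first layer, then 512 -> 32 -> 32 -> 1); the constant and function names are made up for illustration.

```c
#include <stddef.h>

/* Assumed layer sizes for the HalfKP 256x2-32-32 network. */
enum { HALFKP_INPUTS = 41024, L1_SIZE = 256, L2_SIZE = 32, L3_SIZE = 32 };

/* Total number of weights and biases across all layers. */
static size_t nnue_param_count(void)
{
    /* First layer is shared by both perspectives, so it is counted once. */
    size_t feature = (size_t)HALFKP_INPUTS * L1_SIZE + L1_SIZE;
    /* Second layer sees both 256-wide accumulators concatenated. */
    size_t hidden1 = (size_t)(2 * L1_SIZE) * L2_SIZE + L2_SIZE;
    size_t hidden2 = (size_t)L2_SIZE * L3_SIZE + L3_SIZE;
    size_t output = (size_t)L3_SIZE + 1;
    return feature + hidden1 + hidden2 + output;
}

/* Storage needed when every parameter uses the same width. */
static size_t nnue_bytes(size_t bytes_per_param)
{
    return nnue_param_count() * bytes_per_param;
}
```

Under these assumptions there are 10,519,905 parameters, so `nnue_bytes(sizeof(float))` is 42,079,620 bytes, or about 40 MiB. Nearly all of that sits in the first layer, which is why quantizing it matters most for file size.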

Finally, I wonder what the community would think about releasing an "official" version of Orion, using my own implementation of the inference code and a net trained with my own (home-made) trainer, but... using SF's eval to train the whole thing.

I think some people would highly "disapprove", but on the other hand, a lot of work has been done and, as humans, to learn, we need a teacher. I think it's the same for engines: they need to be trained by the best experts.
