How much work is it to train an NNUE?

Discussion of anything and everything relating to chess playing software and machines.

Moderator: Ras

dkappe
Posts: 1632
Joined: Tue Aug 21, 2018 7:52 pm
Full name: Dietrich Kappe

Re: How much work is it to train an NNUE?

Post by dkappe »

Milos wrote: Thu Feb 11, 2021 3:44 pm You see, the amount of electricity you burn and the time you spend tweaking parameters is not proportional to the end result you get, i.e. the strength of the net. First, no one has made you pursue that hopeless direction of using human data. That's just a monumental waste of time and resources. Second, poking different parameters randomly using "intuition" is very far from an optimal way of doing things. I know the people training both Leela and NNUE nets are mostly hobbyists and at best master's students at the start of their ML careers. But there are certainly better ways, like AutoAI.

Finally, I agree it is very hard to equal SF nets, but that is exactly my point. Alberto Plata (if you believe his claims) basically wasted a humongous amount of resources just to end up with a result that is obviously subpar to current SFdev, even using a larger net. And his "contribution" of changing the net architecture is totally trivial.
Angry man, you don’t disappoint. :lol:

I assume you haven’t done much machine learning? AutoAI isn’t really that exciting; it’s mostly a service offering by IBM. Algorithmic model discovery is a bit more promising but very resource intensive. Maybe a good place to start, if you’re interested, is the recent academic literature on image classification.
Fat Titz by Stockfish, the engine with the bodaciously big net. Remember: size matters. If you want to learn more about this engine just google for "Fat Titz".
Modern Times
Posts: 3771
Joined: Thu Jun 07, 2012 11:02 pm

Re: How much work is it to train an NNUE?

Post by Modern Times »

Milos wrote: Thu Feb 11, 2021 3:44 pm Alberto Plata (if you believe his claims) basically wasted a humongous amount of resources just to end up with a result that is obviously subpar to current SFdev, even using a larger net. And his "contribution" of changing the net architecture is totally trivial.
Slightly sub-par on the rating lists, yes. But I'd like to see how it performs in test suites focused on solving positions, though. People use engines for analysis, by and large. Perhaps that double-size net will make it perform better in those more real-world situations. And who knows, maybe that double-size net gives it a far better basis to work from in the future compared to the smaller SF net. All conjecture, of course, but maybe people are far too quick to dismiss it as a failure.
Milos
Posts: 4190
Joined: Wed Nov 25, 2009 1:47 am

Re: How much work is it to train an NNUE?

Post by Milos »

dkappe wrote: Thu Feb 11, 2021 3:55 pm
Milos wrote: Thu Feb 11, 2021 3:44 pm You see, the amount of electricity you burn and the time you spend tweaking parameters is not proportional to the end result you get, i.e. the strength of the net. First, no one has made you pursue that hopeless direction of using human data. That's just a monumental waste of time and resources. Second, poking different parameters randomly using "intuition" is very far from an optimal way of doing things. I know the people training both Leela and NNUE nets are mostly hobbyists and at best master's students at the start of their ML careers. But there are certainly better ways, like AutoAI.

Finally, I agree it is very hard to equal SF nets, but that is exactly my point. Alberto Plata (if you believe his claims) basically wasted a humongous amount of resources just to end up with a result that is obviously subpar to current SFdev, even using a larger net. And his "contribution" of changing the net architecture is totally trivial.
Angry man, you don’t disappoint. :lol:

I assume you haven’t done much machine learning? AutoAI isn’t really that exciting; it’s mostly a service offering by IBM. Algorithmic model discovery is a bit more promising but very resource intensive. Maybe a good place to start, if you’re interested, is the recent academic literature on image classification.
Lol, your success in training different nets, in terms of performance, speaks for itself; no need to stress your competence further. I was doing ML when you were probably not even in elementary school, but maybe instead of endlessly bragging about your trivial mapping of the Google paper to PyTorch in a0lite, you could point me to some of your ML/AI papers. They don't have to be in Science or Nature, or even NIPS or ICCV. ;)
User avatar
Ozymandias
Posts: 1537
Joined: Sun Oct 25, 2009 2:30 am

Re: How much work is it to train an NNUE?

Post by Ozymandias »

Modern Times wrote: Thu Feb 11, 2021 4:42 pm
Milos wrote: Thu Feb 11, 2021 3:44 pm Alberto Plata (if you believe his claims) basically wasted a humongous amount of resources just to end up with a result that is obviously subpar to current SFdev, even using a larger net. And his "contribution" of changing the net architecture is totally trivial.
Slightly sub-par on the rating lists, yes. But I'd like to see how it performs in test suites focused on solving positions, though. People use engines for analysis, by and large. Perhaps that double-size net will make it perform better in those more real-world situations. And who knows, maybe that double-size net gives it a far better basis to work from in the future compared to the smaller SF net. All conjecture, of course, but maybe people are far too quick to dismiss it as a failure.
Being close to the top isn't a failure. It all depends on how you portray your product.

For example, Dragon doesn't advertise itself as the new #1. ChessBase is doing exactly that at the very top of their website, in a hard-to-miss banner:

[Image: ChessBase website banner]

Now, is it? Doesn't look like it, so they're setting themselves up to fail.
dkappe
Posts: 1632
Joined: Tue Aug 21, 2018 7:52 pm
Full name: Dietrich Kappe

Re: How much work is it to train an NNUE?

Post by dkappe »

Milos wrote: Thu Feb 11, 2021 5:01 pm
dkappe wrote: Thu Feb 11, 2021 3:55 pm
Milos wrote: Thu Feb 11, 2021 3:44 pm You see, the amount of electricity you burn and the time you spend tweaking parameters is not proportional to the end result you get, i.e. the strength of the net. First, no one has made you pursue that hopeless direction of using human data. That's just a monumental waste of time and resources. Second, poking different parameters randomly using "intuition" is very far from an optimal way of doing things. I know the people training both Leela and NNUE nets are mostly hobbyists and at best master's students at the start of their ML careers. But there are certainly better ways, like AutoAI.

Finally, I agree it is very hard to equal SF nets, but that is exactly my point. Alberto Plata (if you believe his claims) basically wasted a humongous amount of resources just to end up with a result that is obviously subpar to current SFdev, even using a larger net. And his "contribution" of changing the net architecture is totally trivial.
Angry man, you don’t disappoint. :lol:

I assume you haven’t done much machine learning? AutoAI isn’t really that exciting; it’s mostly a service offering by IBM. Algorithmic model discovery is a bit more promising but very resource intensive. Maybe a good place to start, if you’re interested, is the recent academic literature on image classification.
Lol, your success in training different nets, in terms of performance, speaks for itself; no need to stress your competence further. I was doing ML when you were probably not even in elementary school, but maybe instead of endlessly bragging about your trivial mapping of the Google paper to PyTorch in a0lite, you could point me to some of your ML/AI papers. They don't have to be in Science or Nature, or even NIPS or ICCV. ;)

Angry man, wow. Your FORTRAN work in expert systems in the early ’70s must truly have been astounding.

I, unfortunately, only worked in academe in the very early part of my career, and even then only applying early AI techniques to 19th-century medical data. You can check with the NBER for some of my publications. I’m sure they’ll disappoint you for a whole host of reasons, not the least of which is that they don’t concern NNs.

So, which is your engine? I’m eager to match a0lite up against it so you can show me how it’s done. :-)
Fat Titz by Stockfish, the engine with the bodaciously big net. Remember: size matters. If you want to learn more about this engine just google for "Fat Titz".
connor_mcmonigle
Posts: 544
Joined: Sun Sep 06, 2020 4:40 am
Full name: Connor McMonigle

Re: How much work is it to train an NNUE?

Post by connor_mcmonigle »

Gabor Szots wrote: Thu Feb 11, 2021 8:58 am To develop a chess engine usually takes several months, years or even a lifetime. But how much work is it to take an existing engine and replace its NNUE with a different one?
In my naive view, to make an NNUE you collect a huge number of games, determine which features of positions you want to analyze, then let your computer do the rest while you are away on holiday. When you return, a new NNUE is waiting for you to use.
Which means, at least for me, that FF2 has taken years of Stockfish's development work and put in a couple of days' work of its own. That works out to roughly 99% Stockfish, 1% ChessBase.

What is the reality?
I'm mostly just commenting here to second what Vivien has written above. Training networks using tools written by other people is "computer work".

As someone who has written a reasonably strong engine, written tools for training networks for said engine and invested a decent amount of time using said tools to train strong networks for said engine, I'd argue that the 99% Stockfish, 1% ChessBase portrayal is accurate.

The culmination of all your efforts implementing the underlying tools to train a network is the ability to "press go" and then sit back and watch as your training tools produce networks for you. Pressing go is the easiest part in all of this! Sure, you'll end up having to fiddle/experiment with some hyperparameters and likely "press go" several times to produce a strong network. This requires some perseverance, but using the tools is the easy part and that's all Albert has done to the extent that I am aware.

The effort invested by Albert in training a network using tools written by other people is not remotely comparable to the effort invested by those having written the training tools, let alone the effort invested by those developing an engine such as Stockfish.

As such, I'm firmly of the opinion that CCRL should not list FF2 as a separate engine. It should be listed as what it is: a version of Stockfish with the parameters of the evaluation function changed. Listing FF2 as a separate engine misinforms users of CCRL's data, falsely leading them to believe that FF2 is a separate engine.
dkappe
Posts: 1632
Joined: Tue Aug 21, 2018 7:52 pm
Full name: Dietrich Kappe

Re: How much work is it to train an NNUE?

Post by dkappe »

connor_mcmonigle wrote: Thu Feb 11, 2021 6:45 pm
The culmination of all your efforts implementing the underlying tools to train a network is the ability to "press go" and then sit back and watch as your training tools produce networks for you. Pressing go is the easiest part in all of this! Sure, you'll end up having to fiddle/experiment with some hyperparameters and likely "press go" several times to produce a strong network. This requires some perseverance, but using the tools is the easy part and that's all Albert has done to the extent that I am aware.

The effort invested by Albert in training a network using tools written by other people is not remotely comparable to the effort invested by those having written the training tools, let alone the effort invested by those developing an engine such as Stockfish.

As such, I'm firmly of the opinion that CCRL should not list FF2 as a separate engine. It should be listed as what it is: a version of Stockfish with the parameters of the evaluation function changed. Listing FF2 as a separate engine misinforms users of CCRL's data, falsely leading them to believe that FF2 is a separate engine.
I have a somewhat different perspective.

Tools:
I’ve written tools to train networks, generate data, etc., in chess and other domains. My latest experiment for bad gyal took me about 30 minutes to write using pytorch lightning. Most of the effort was writing a dataset class to transform the raw data into tensors.
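To give a concrete sense of what that dataset-class work looks like, here is a minimal sketch, not the actual bad gyal code: the on-disk format (a .npz file with "features" and "scores" arrays), the layer sizes and the hyperparameters are all placeholders chosen purely for illustration.

```python
# Minimal sketch of the "dataset class plus Lightning" setup described above.
# The .npz format, net size and hyperparameters are illustrative assumptions.
import numpy as np
import torch
from torch import nn
from torch.utils.data import Dataset, DataLoader
import pytorch_lightning as pl

class PositionDataset(Dataset):
    """Turns raw (feature, score) pairs into tensors for training."""
    def __init__(self, path):
        data = np.load(path)
        self.features = data["features"].astype(np.float32)  # e.g. one-hot piece planes
        self.scores = data["scores"].astype(np.float32)      # e.g. centipawn eval / 400

    def __len__(self):
        return len(self.scores)

    def __getitem__(self, idx):
        return torch.from_numpy(self.features[idx]), torch.tensor(self.scores[idx])

class EvalNet(pl.LightningModule):
    """Tiny fully connected evaluation net; layer sizes are placeholders."""
    def __init__(self, n_inputs=768):
        super().__init__()
        self.net = nn.Sequential(nn.Linear(n_inputs, 256), nn.ReLU(), nn.Linear(256, 1))

    def forward(self, x):
        return self.net(x).squeeze(-1)

    def training_step(self, batch, _):
        x, y = batch
        loss = nn.functional.mse_loss(self(x), y)
        self.log("train_loss", loss)
        return loss

    def configure_optimizers(self):
        return torch.optim.Adam(self.parameters(), lr=1e-3)

if __name__ == "__main__":
    loader = DataLoader(PositionDataset("positions.npz"), batch_size=1024, shuffle=True)
    pl.Trainer(max_epochs=1).fit(EvalNet(), loader)
```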

The lc0 and sf teams had a harder time of things, as they were trying to reimplement existing work (AlphaZero and nodchip's shogi NNUE, respectively) rather than create something new from scratch.

Data:
Generating data is expensive, and if you don’t capture all the data you want, you might have to go back and regenerate. That’s why lots of people use existing data, like t60, sfen datasets from the stockfish project, ccrl game data, etc. If you want new training targets and/or inputs, or different types of data (like from an mcts/nn rather than an ab engine), you can’t get around generating your own.
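As a rough illustration of what generating your own data involves, here is a sketch of fixed-depth self-play labeling. It assumes python-chess and a UCI engine binary named "stockfish" on the PATH, and the CSV output format is made up for the example; real pipelines typically use more compact binary formats.

```python
# Sketch of fixed-depth self-play data generation: play games at a fixed depth
# and record (fen, score, game result) for each position along the way.
# Assumes python-chess and a UCI engine called "stockfish" on the PATH.
import csv
import chess
import chess.engine

DEPTH = 9          # fixed search depth per move
NUM_GAMES = 100    # scale up by orders of magnitude for real training data

engine = chess.engine.SimpleEngine.popen_uci("stockfish")  # any UCI engine works
with open("training_data.csv", "w", newline="") as f:
    writer = csv.writer(f)
    for _ in range(NUM_GAMES):
        board = chess.Board()
        rows = []
        while not board.is_game_over() and board.fullmove_number < 200:
            info = engine.analyse(board, chess.engine.Limit(depth=DEPTH))
            score = info["score"].pov(board.turn).score(mate_score=32000)
            rows.append([board.fen(), score])
            # A second search just to pick the move; a real generator would reuse the first.
            best = engine.play(board, chess.engine.Limit(depth=DEPTH))
            board.push(best.move)
        outcome = board.result(claim_draw=True)   # "1-0", "0-1", "1/2-1/2" (or "*" if cut off)
        writer.writerows(row + [outcome] for row in rows)
engine.quit()
```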

Engines:
It’s got to be fast enough to do reasonable tests, but flexible enough to be able to try new net types. Lc0 supports many backends optimized for speed, and changing them to support a new net type has been somewhat expensive. SF seems a little more flexible in this regard, so it may have an advantage. I have moved to torchscript for my backend. It’s reasonably fast, and the networks can be a black box aside from their inputs and outputs.
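For readers unfamiliar with the torchscript approach, here is a generic sketch (not the actual backend described above) of why it lets the network stay a black box: the engine side only loads a scripted file and calls it, knowing nothing but the input and output shapes.

```python
# Minimal sketch of a TorchScript "black box" backend. Generic example only;
# the architecture and feature size here are placeholders.
import torch
from torch import nn

# Training side: script and save whatever architecture you trained.
model = nn.Sequential(nn.Linear(768, 256), nn.ReLU(), nn.Linear(256, 1))
scripted = torch.jit.script(model)
scripted.save("net.pt")

# Engine side: load and evaluate without any knowledge of the net internals.
net = torch.jit.load("net.pt")
net.eval()
with torch.no_grad():
    features = torch.zeros(1, 768)          # the encoded position goes here
    value = net(features).item()
print("eval:", value)
```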

A simple mcts is easier to write than a simple ab, in my experience, given the search tricks and techniques you have to layer on in the ab case. Most of the complexity in the mcts case has to do with the gpu.
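To illustrate the point that a bare MCTS is short, here is a toy UCT searcher over a single-pile Nim game, with random rollouts standing in for a neural net. It is purely illustrative: an engine-grade MCTS adds PUCT priors, batched GPU evaluation, transpositions, and so on, which is where the real complexity lives.

```python
# Bare UCT MCTS over single-pile Nim (take 1-3 stones, taking the last stone wins).
# Random rollouts replace the NN; this is a toy, not an engine search.
import math
import random

class Node:
    def __init__(self, pile, parent=None):
        self.pile = pile            # game state: stones left, side to move alternates by ply
        self.parent = parent
        self.children = {}          # move -> Node
        self.visits = 0
        self.wins = 0.0             # value from the perspective of the side to move at the parent

    def ucb(self, c=1.4):
        return self.wins / self.visits + c * math.sqrt(math.log(self.parent.visits) / self.visits)

def rollout(pile):
    """Random playout; returns 1 if the side to move at `pile` eventually wins."""
    to_move = 0
    while pile > 0:
        pile -= random.randint(1, min(3, pile))
        to_move ^= 1
    return 1 if to_move == 1 else 0   # whoever took the last stone wins

def mcts(root_pile, iterations=5000):
    root = Node(root_pile)
    for _ in range(iterations):
        node = root
        # Selection: descend while the node is fully expanded and not terminal.
        while node.pile > 0 and len(node.children) == min(3, node.pile):
            node = max(node.children.values(), key=Node.ucb)
        # Expansion: add one untried move.
        if node.pile > 0:
            move = random.choice([m for m in range(1, min(3, node.pile) + 1)
                                  if m not in node.children])
            node.children[move] = Node(node.pile - move, parent=node)
            node = node.children[move]
        # Simulation.
        result = rollout(node.pile)   # 1 = side to move at `node` wins
        # Backpropagation: flip the perspective at every ply.
        while node is not None:
            node.visits += 1
            node.wins += 1 - result   # stored from the parent's point of view
            result = 1 - result
            node = node.parent
    return max(root.children.items(), key=lambda kv: kv[1].visits)[0]

print("best move from a pile of 10:", mcts(10))  # optimal is to take 2, leaving a multiple of 4
```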

Training:
Training NNs continues to be more of an engineering discipline than a science. Tweaking hyperparameters and trying new RL data generations are based on experience, guesswork, and expensive, time-consuming experiments.

Conclusion:
I’ve found training — the tweaking and the failures — to be the hardest and most time-consuming part. For evidence, just go through the last few years of the leela chess discord.
Fat Titz by Stockfish, the engine with the bodaciously big net. Remember: size matters. If you want to learn more about this engine just google for "Fat Titz".
AndrewGrant
Posts: 1960
Joined: Tue Apr 19, 2016 6:08 am
Location: U.S.A
Full name: Andrew Grant

Re: How much work is it to train an NNUE?

Post by AndrewGrant »

I've trained nets that are the same size as Stockfish's current nets. Data generation is the easiest and least nuanced part so far in my efforts. I was able to generate 24,000,000 games at depth 9 in about two days on a single machine. This produced similar results to 24,000,000 games played at depth 12, which took almost two weeks to finish. Once the data is generated, the difficult part comes from the implementation of the training interface, and perhaps from how you filter down some of the data.
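For a sense of what filtering the data down can look like, here is a hedged sketch: the specific rules (skip positions in check, skip positions whose best move is a capture, drop lopsided scores, skip the opening) are common choices in NNUE pipelines rather than necessarily the ones used in this post's pipeline, and the fen,score,bestmove,result CSV format is invented for the example. It assumes python-chess.

```python
# Sketch of typical data filtering. The rules and the CSV layout are
# illustrative assumptions, not the actual pipeline described above.
import csv
import chess

MAX_ABS_SCORE = 1000   # drop completely won/lost positions
MIN_PLY = 16           # skip book-ish opening positions

def keep(fen, score, best_move):
    board = chess.Board(fen)
    if board.is_check():
        return False                   # eval labels in check are noisy
    if abs(score) > MAX_ABS_SCORE:
        return False
    if board.is_capture(chess.Move.from_uci(best_move)):
        return False                   # position is not "quiet"
    if board.ply() < MIN_PLY:
        return False
    return True

with open("raw.csv") as src, open("filtered.csv", "w", newline="") as dst:
    writer = csv.writer(dst)
    for fen, score, best_move, result in csv.reader(src):
        if keep(fen, int(score), best_move):
            writer.writerow([fen, score, best_move, result])
```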

These nets tend to be very picky, and without any particular trend. Some of my smaller nets (2x128) regressed almost instantly at the initial learning rate, but then carried on for many, many epochs at a 10x lower learning rate, gaining elo. Other times, a "Stockfish-sized" net (2x256) was able to gain elo for many epochs at the initial learning rate, and then, upon applying learning rate drops, only squeaked out a few more elo within 5 epochs each time before regressing. As of a few months ago, I'm within 30 elo of Stockfish's nets. Much of that gap can simply be chalked up to Ethereal being weaker, not the process being weaker.
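For concreteness, in PyTorch-style trainers learning rate drops like these are usually expressed with a scheduler; the sketch below shows the two common options (fixed 10x drops at chosen epochs versus dropping when validation stalls). This is a generic illustration with placeholder numbers, not the custom trainer described in this post.

```python
# Two standard ways to apply "learning rate drops": scheduled 10x drops, or
# dropping only when validation stops improving. Placeholders throughout.
import torch
from torch import nn, optim

model = nn.Sequential(nn.Linear(768, 256), nn.ReLU(), nn.Linear(256, 1))
optimizer = optim.Adam(model.parameters(), lr=1e-3)

# Option 1: drop the LR by 10x at predetermined epochs.
scheduler = optim.lr_scheduler.MultiStepLR(optimizer, milestones=[30, 60], gamma=0.1)

# Option 2: drop by 10x only when the validation loss stops improving.
# scheduler = optim.lr_scheduler.ReduceLROnPlateau(optimizer, factor=0.1, patience=3)

for epoch in range(90):
    # ... one pass over the training data goes here ...
    scheduler.step()                      # MultiStepLR: step once per epoch
    # scheduler.step(val_loss)            # ReduceLROnPlateau: pass the metric instead
```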

Training an NNUE is not a task that requires a massive team, nor does it require a massive array of hardware, nor does it require thousands of hours of human time. Effort is spent in getting the initial process to work, something which the Stockfish and Shogi teams already did. Very little effort is needed to train a net through Stockfish at this point. The code exists, all the tools exist, and there are documents describing the exact process to produce high-quality nets. Likewise, it takes me very little effort to train _additional_ nets using my setup. I've put in three months of work to get a stable system down. The rest is cake. If I released my tool, everyone's uncle could build a competitive net in a week.

The trainer I have now matches the speed of SF's pytorch code running on a single V100, and yet runs on a CPU. If you reuse Stockfish's trainer, there is very little work required to produce a reasonable-quality network -- all the work has been done for you; all you have to do is set aside a few days to play games.

People who are putting in serious effort to train networks are few and far between: the Koivisto guys, Seer's author Connor, Halogen's author Kieren, and Ethereal's author (me) are the people I know of who are going their own way and doing 100% original work. With that comes tackling the problems that the Shogi people tackled, but from a new perspective. Stockfish's trainer is mathematically flawed, and its factorizer for the King PSQT has no basis in reality. It's surprising it works at all, and perhaps it would be much better if done correctly.
dkappe
Posts: 1632
Joined: Tue Aug 21, 2018 7:52 pm
Full name: Dietrich Kappe

Re: How much work is it to train an NNUE?

Post by dkappe »

AndrewGrant wrote: Thu Feb 11, 2021 7:52 pm Stockfish's trainer is mathematically flawed, and its factorizer for the King PSQT has no basis in reality. It's surprising it works at all, and perhaps it would be much better if done correctly.
Well put. The architecture oddities carried over from shogi (rotated rather than mirrored positions, for example, or useless inputs corresponding to a shogi piece type) are a head scratcher. The fact that it works well is an ongoing surprise.
Fat Titz by Stockfish, the engine with the bodaciously big net. Remember: size matters. If you want to learn more about this engine just google for "Fat Titz".
AndrewGrant
Posts: 1960
Joined: Tue Apr 19, 2016 6:08 am
Location: U.S.A
Full name: Andrew Grant

Re: How much work is it to train an NNUE?

Post by AndrewGrant »

dkappe wrote: Thu Feb 11, 2021 7:58 pm
AndrewGrant wrote: Thu Feb 11, 2021 7:52 pm Stockfish's trainer is mathematically flawed, and its factorizer for the King PSQT has no basis in reality. It's surprising it works at all, and perhaps it would be much better if done correctly.
Well put. The architecture oddities carried over from shogi (rotated rather than mirrored positions, for example, or useless inputs corresponding to a shogi piece type) are a head scratcher. The fact that it works well is an ongoing surprise.
Not to dive into a conversation on Stockfish, since I only have surface level info anyway, but:

1. I'm not convinced the rotating / mirroring thing matters. It seems to bake in a sort of queenside-kingside understanding reasonably well. But it's unfounded, to say the least (see the sketch after this list).
2. The useless inputs are whatever. They should never be activated? So it's just wasted compute time in training, with no impact on the output weights?
3. The King PSQT stuff means the network is _different_ during training than _after_ training. I.e. I train something, and then right before I hand it to you, I change a bunch of numbers and smile and say "It's still roughly the same thing I trained :)"
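For readers who haven't looked at the index code, here is a small sketch of what the rotate-versus-mirror question in item 1 amounts to. The square numbering assumed here is the usual 0..63 mapping with a1 = 0 and h8 = 63; whether rotation or mirroring is the better choice is exactly the open question above.

```python
# Rotate vs. mirror for the black perspective, on 0..63 squares (a1=0, h8=63).
def rotate(sq):
    return sq ^ 63    # 180-degree rotation: a1 <-> h8 (the shogi-derived choice)

def mirror(sq):
    return sq ^ 56    # vertical flip: a1 <-> a8 (preserves the king's file)

G8 = 62  # a black king castled short
print(rotate(G8))  # 1 -> b1: under rotation, black's kingside king lands on the queenside
print(mirror(G8))  # 6 -> g1: under mirroring, it stays on the kingside as you'd expect
```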