The next revolution in computer chess?

Discussion of anything and everything relating to chess playing software and machines.

Moderator: Ras

Milos
Posts: 4190
Joined: Wed Nov 25, 2009 1:47 am

Re: The next revolution in computer chess?

Post by Milos »

dkappe wrote: Mon Jul 27, 2020 12:45 pm
Milos wrote: Mon Jul 27, 2020 12:27 pm The biggest source of the "jumpy" score is the fixed depth used to generate it. Once an actual TC is used with SF's time management, this will also disappear. And I still fail to grasp how a score returned by a higher fixed-depth search would be worse for training the net.
I hope you’ll provide some evidence for this smoothness at higher depth.
That is self-evident if you know the basics of A/B search. Smoothness is inversely proportional to the number of fail lows/highs, and the number of fail lows/highs decreases with increasing depth, as can be checked with the most basic empirical test: run SF on various (non-quiet) positions. Q.E.D.
Now, what is your "proof" for expecting the opposite?
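For readers unfamiliar with the mechanism being argued about: below is a toy sketch of iterative deepening with an aspiration window, where a fail low/high forces a full-window re-search. The game tree is random (not chess), and the window size and score range are made up purely for illustration.

```python
import random

def leaf_value(path):
    # Deterministic pseudo-random score keyed on the move path, so every
    # search of the same node sees the same value.
    return random.Random(hash(path)).randint(-300, 300)

def negamax(path, depth, alpha, beta, branching=4):
    # Plain fail-soft alpha-beta in negamax form.
    if depth == 0:
        return leaf_value(path)
    best = -10_000
    for move in range(branching):
        score = -negamax(path + (move,), depth - 1, -beta, -alpha)
        best = max(best, score)
        alpha = max(alpha, score)
        if alpha >= beta:
            break  # cutoff
    return best

def aspiration_id(max_depth, window=50):
    # Iterative deepening with an aspiration window centred on the previous
    # iteration's score; count how often the search fails low or high.
    score, fails = 0, 0
    for depth in range(1, max_depth + 1):
        s = negamax((), depth, score - window, score + window)
        if s <= score - window or s >= score + window:
            fails += 1                               # fail low / fail high
            s = negamax((), depth, -10_000, 10_000)  # re-search, full window
        score = s
    return score, fails
```

The final score always equals a full-window search; the `fails` counter is the quantity the smoothness argument above is about.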
dkappe
Posts: 1632
Joined: Tue Aug 21, 2018 7:52 pm
Full name: Dietrich Kappe

Re: The next revolution in computer chess?

Post by dkappe »

Milos wrote: Mon Jul 27, 2020 12:50 pm
dkappe wrote: Mon Jul 27, 2020 12:45 pm
Milos wrote: Mon Jul 27, 2020 12:27 pm The biggest source of the "jumpy" score is the fixed depth used to generate it. Once an actual TC is used with SF's time management, this will also disappear. And I still fail to grasp how a score returned by a higher fixed-depth search would be worse for training the net.
I hope you’ll provide some evidence for this smoothness at higher depth.
That is self-evident if you know the basics of A/B search. Smoothness is inversely proportional to the number of fail lows/highs, and the number of fail lows/highs decreases with increasing depth, as can be checked with the most basic empirical test: run SF on various (non-quiet) positions. Q.E.D.
Now, what is your "proof" for expecting the opposite?
If the domain is chess positions and the range is the score from white’s perspective, and proximity in the domain is defined by the number of changes in the feature vector (that’s the function we’re approximating), then you’re telling me this function is smooth? Ok, I’ll bite. Let’s see the proof.
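To make the domain metric concrete, here is a toy sketch of "proximity as feature-vector changes" (the encoding is a drastic simplification of NNUE's real HalfKP inputs): a quiet move flips only two features and a capture three, yet nothing forces an engine's scores at such neighbours to be close. That gap is exactly what the smoothness question is about.

```python
# Toy encoding of positions as sets of (square, piece) features, loosely in
# the spirit of NNUE input planes (no king buckets or perspectives here).
def hamming(pos1, pos2):
    # Number of differing features = size of the symmetric difference.
    return len(set(pos1.items()) ^ set(pos2.items()))

start   = {"e4": "wP", "d5": "bP", "e1": "wK", "e8": "bK"}
push    = {"e5": "wP", "d5": "bP", "e1": "wK", "e8": "bK"}  # quiet pawn push e4-e5
capture = {"d5": "wP", "e1": "wK", "e8": "bK"}              # exd5: pawn takes pawn

print(hamming(start, push))     # 2 features change on a quiet move
print(hamming(start, capture))  # 3 features change on a capture
```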
Fat Titz by Stockfish, the engine with the bodaciously big net. Remember: size matters. If you want to learn more about this engine just google for "Fat Titz".
Milos
Posts: 4190
Joined: Wed Nov 25, 2009 1:47 am

Re: The next revolution in computer chess?

Post by Milos »

dkappe wrote: Mon Jul 27, 2020 1:01 pm
Milos wrote: Mon Jul 27, 2020 12:50 pm
dkappe wrote: Mon Jul 27, 2020 12:45 pm
Milos wrote: Mon Jul 27, 2020 12:27 pm The biggest source of the "jumpy" score is the fixed depth used to generate it. Once an actual TC is used with SF's time management, this will also disappear. And I still fail to grasp how a score returned by a higher fixed-depth search would be worse for training the net.
I hope you’ll provide some evidence for this smoothness at higher depth.
That is self-evident if you know the basics of A/B search. Smoothness is inversely proportional to the number of fail lows/highs, and the number of fail lows/highs decreases with increasing depth, as can be checked with the most basic empirical test: run SF on various (non-quiet) positions. Q.E.D.
Now, what is your "proof" for expecting the opposite?
If the domain is chess positions and the range is the score from white’s perspective, and proximity in the domain is defined by the number of changes in the feature vector (that’s the function we’re approximating), then you’re telling me this function is smooth? Ok, I’ll bite. Let’s see the proof.
Nope, I am not telling you that, but you somehow fail to grasp what I am telling you: the average number of changes in the feature vector is smaller if the score is obtained from a search at fixed depth N+1 than at depth N, for any practical N. The only exceptions might be very late endgames or positions where a mate is in sight.
chrisw
Posts: 4636
Joined: Tue Apr 03, 2012 4:28 pm
Location: Midi-Pyrénées
Full name: Christopher Whittington

Re: The next revolution in computer chess?

Post by chrisw »

dkappe wrote: Mon Jul 27, 2020 5:50 am
Ovyron wrote: Mon Jul 27, 2020 5:04 am
corres wrote: Sat Jul 25, 2020 4:59 pm It is a pity, but the NN of NNUE grows faster than the speed of (default) Stockfish.
This can't go on forever, not with the exponential nature of chess. NNUE plateaus, and it needs better eval to improve, it's going to come from someone that translates the concepts of NNUE into stockfish dev's eval (if the former scores a position at 1.50 while the latter says it's 0.00, and NNUE wins, you've gotta figure out the cause of the discrepancy), then a new NNUE based on this eval can arise.

Luckily NNUE has a lot of fuel, as any position can get better eval by using more depth. Perhaps what we need is a method to differentiate positions where the eval is already good from those that are still bad, and increase the depth of the bad ones. Otherwise, more depth will be wasted on positions where it doesn't help to get better eval (if eval is 0.15 at depth 8 and 0.15 at depth 9, you just wasted time getting there).
NNUE is a relatively simple beast whose main innovation is that it’s efficiently computable when there’s a minor change in the inputs. So efficient, in fact, that you don’t need a GPU to run it quickly.

Now anyone that’s worked with shallow, fully connected networks knows that they’re relatively limited. Approximating a real valued function over a domain of bit strings is a great use. In fact it’s been a lively topic of research for over a decade. But how good the approximation is depends very much on the function to be approximated. Eval at depth 8 may be a good candidate, but a higher depth search may become increasingly difficult, especially if it has bigger and more frequent discontinuities. An educated guess is that for the 256 and 384 networks there is an N beyond which it no longer improves.
Let's disentangle this a bit. You're actually training on a meld of the game result and the SF d8 eval. E.g. if SF says 0.6 and the game result is 1.0, the training target is c. 0.8; if the result is 0.5, the target is c. 0.55.

But let's say you train only on the SF eval from depth D (no results data). The NN is slower and it's an approximation, so if D=1, it's worse.
As D increases, you'll gain from the forward knowledge contained in the improved eval, but you still have the NN "approximation" error. At a certain D the search knowledge gain overcomes the approximation error and the overall evaluation "improves".

However, as above, you're not training on the SF eval only, you're training on eval plus result, so the training targets already contain forward knowledge. In our little example, we'll just be altering the targets from 0.55 / 0.8 to something a little either side: noisy/steppy for low D, smoother/more consistent for high D. Which all points to more D being good. And there's the paradox that if game results were "perfect", even infinite D would not be helpful, because your training targets would already know everything anyway.
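A minimal sketch of that eval/result meld, which lands close to the 0.8 / 0.55 targets in the example above. The logistic centipawn-to-score scale of 400 and the 50/50 mix are assumptions for illustration, not anyone's published recipe.

```python
def cp_to_winprob(cp, scale=400.0):
    # Logistic centipawns -> expected score in [0, 1]; the 400 scale is an
    # Elo-style convention assumed here, not taken from any NNUE trainer.
    return 1.0 / (1.0 + 10.0 ** (-cp / scale))

def train_target(cp, result, lam=0.5):
    # lam weights the game result: 0 = pure eval, 1 = pure result.
    return (1.0 - lam) * cp_to_winprob(cp) + lam * result

print(round(train_target(60, 1.0), 2))  # eval +0.60, win  -> 0.79
print(round(train_target(60, 0.5), 2))  # eval +0.60, draw -> 0.54
```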

So, what's really being trained on is a meld of imperfect game results and imperfect evaluation (less imperfect, hopefully, with increasing D), and you end up with an NN approximation of the two. Then that NN approximation is inserted into another search, and hopefully that search copes with it all. Apparently it does.

Inputs are:
Relatively poor beancount SF eval plus SF's dubiously pruning search produce a noisy, non-optimal position eval.

Noisy, non-optimal position eval plus imperfect results data and a relatively small and primitive NN produce a (slightly less?) non-optimal NNUE eval.

Noisy, non-optimal NNUE eval plus an even more dubious, non-tuned, imperfect SF search produce something supposedly improved on the original.

Fun!
Milos
Posts: 4190
Joined: Wed Nov 25, 2009 1:47 am

Re: The next revolution in computer chess?

Post by Milos »

chrisw wrote: Mon Jul 27, 2020 1:24 pm Inputs are:
Relatively poor beancount SF eval plus SF's dubiously pruning search produce a noisy, non-optimal position eval.

Noisy, non-optimal position eval plus imperfect results data and a relatively small and primitive NN produce a (slightly less?) non-optimal NNUE eval.

Noisy, non-optimal NNUE eval plus an even more dubious, non-tuned, imperfect SF search produce something supposedly improved on the original.

Fun!
All true, but the solution is quite straightforward: train your net on the highest-quality scored positions you can get (and the more of them the merrier). Once network performance starts to flatten out, re-tune the search to the new eval. Then repeat the whole process a couple more times.
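As a skeleton, that loop might look like the sketch below. Everything in it is a toy stand-in: the real training and tuning steps cost enormous compute, and the Elo numbers are invented purely to show the flatten-then-stop shape of the procedure.

```python
def training_cycle(rounds=5, plateau=2.0):
    """Alternate net training and search re-tuning until gains flatten.

    Toy stand-in: the gains are made-up, diminishing Elo increments, not
    measurements of any real engine.
    """
    elo, history = 0.0, []
    for r in range(rounds):
        train_gain = 40.0 / (r + 1)  # train net on freshly scored positions
        tune_gain = 15.0 / (r + 1)   # then re-tune search to the new eval
        elo += train_gain + tune_gain
        history.append(elo)
        if train_gain + tune_gain < plateau:  # a full pass barely helped: stop
            break
    return history

print(training_cycle())  # rising, gradually flattening Elo trajectory
```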
dkappe
Posts: 1632
Joined: Tue Aug 21, 2018 7:52 pm
Full name: Dietrich Kappe

Re: The next revolution in computer chess?

Post by dkappe »

chrisw wrote: Mon Jul 27, 2020 1:24 pm
dkappe wrote: Mon Jul 27, 2020 5:50 am
Ovyron wrote: Mon Jul 27, 2020 5:04 am
corres wrote: Sat Jul 25, 2020 4:59 pm It is a pity, but the NN of NNUE grows faster than the speed of (default) Stockfish.
This can't go on forever, not with the exponential nature of chess. NNUE plateaus, and it needs better eval to improve, it's going to come from someone that translates the concepts of NNUE into stockfish dev's eval (if the former scores a position at 1.50 while the latter says it's 0.00, and NNUE wins, you've gotta figure out the cause of the discrepancy), then a new NNUE based on this eval can arise.

Luckily NNUE has a lot of fuel, as any position can get better eval by using more depth. Perhaps what we need is a method to differentiate positions where the eval is already good from those that are still bad, and increase the depth of the bad ones. Otherwise, more depth will be wasted on positions where it doesn't help to get better eval (if eval is 0.15 at depth 8 and 0.15 at depth 9, you just wasted time getting there).
NNUE is a relatively simple beast whose main innovation is that it’s efficiently computable when there’s a minor change in the inputs. So efficient, in fact, that you don’t need a GPU to run it quickly.

Now anyone that’s worked with shallow, fully connected networks knows that they’re relatively limited. Approximating a real valued function over a domain of bit strings is a great use. In fact it’s been a lively topic of research for over a decade. But how good the approximation is depends very much on the function to be approximated. Eval at depth 8 may be a good candidate, but a higher depth search may become increasingly difficult, especially if it has bigger and more frequent discontinuities. An educated guess is that for the 256 and 384 networks there is an N beyond which it no longer improves.
Let's disentangle this a bit. You're actually training on a meld of the game result and the SF d8 eval. E.g. if SF says 0.6 and the game result is 1.0, the training target is c. 0.8; if the result is 0.5, the target is c. 0.55.

But let's say you train only on the SF eval from depth D (no results data). The NN is slower and it's an approximation, so if D=1, it's worse.
As D increases, you'll gain from the forward knowledge contained in the improved eval, but you still have the NN "approximation" error. At a certain D the search knowledge gain overcomes the approximation error and the overall evaluation "improves".

However, as above, you're not training on the SF eval only, you're training on eval plus result, so the training targets already contain forward knowledge. In our little example, we'll just be altering the targets from 0.55 / 0.8 to something a little either side: noisy/steppy for low D, smoother/more consistent for high D. Which all points to more D being good. And there's the paradox that if game results were "perfect", even infinite D would not be helpful, because your training targets would already know everything anyway.

So, what's really being trained on is a meld of imperfect game results and imperfect evaluation (less imperfect, hopefully, with increasing D), and you end up with an NN approximation of the two. Then that NN approximation is inserted into another search, and hopefully that search copes with it all. Apparently it does.

Inputs are:
Relatively poor beancount SF eval plus SF's dubiously pruning search produce a noisy, non-optimal position eval.

Noisy, non-optimal position eval plus imperfect results data and a relatively small and primitive NN produce a (slightly less?) non-optimal NNUE eval.

Noisy, non-optimal NNUE eval plus an even more dubious, non-tuned, imperfect SF search produce something supposedly improved on the original.

Fun!
For initial training on d8, most people are ignoring the result. Only later, as a sort of 'sharpening' step, are they mixing in the result from d12. The shogi cats have recommended 30% result and a much lower learning rate. Like much NN work, there is engineering wisdom and folklore at work here.
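A sketch of that two-stage schedule. Only the 30% result mix comes from the post above; the logistic scale of 400 and reading "much lower learning rate" as a 10x drop are assumptions for illustration.

```python
def cp_to_winprob(cp, scale=400.0):
    # Logistic centipawns -> expected score; the 400 scale is an Elo-style
    # convention assumed here, not taken from any trainer's source.
    return 1.0 / (1.0 + 10.0 ** (-cp / scale))

def make_target(cp, result, stage):
    # Stage 1: pure search score. Stage 2 ("sharpening"): 30% game result,
    # per the mix mentioned above.
    lam = 0.0 if stage == 1 else 0.3
    return (1.0 - lam) * cp_to_winprob(cp) + lam * result

def learning_rate(stage, base=1e-3):
    # "Much lower" LR for stage 2 is assumed to mean a 10x drop here.
    return base if stage == 1 else base * 0.1

t1 = make_target(60, 1.0, stage=1)  # ~0.59: eval only
t2 = make_target(60, 1.0, stage=2)  # ~0.71: 70/30 meld pulls toward the win
```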
dkappe
Posts: 1632
Joined: Tue Aug 21, 2018 7:52 pm
Full name: Dietrich Kappe

Re: The next revolution in computer chess?

Post by dkappe »

Milos wrote: Mon Jul 27, 2020 1:08 pm
dkappe wrote: Mon Jul 27, 2020 1:01 pm
Milos wrote: Mon Jul 27, 2020 12:50 pm
dkappe wrote: Mon Jul 27, 2020 12:45 pm
Milos wrote: Mon Jul 27, 2020 12:27 pm The biggest source of "jumpy" score is fixed depth used to generate it. Once actual TC is used with SFs TM this will also disappear. And I still fail to grasp how would score returned with higher fixed depth search be worse for training the net.
I hope you’ll provide some evidence for this smoothness at higher depth.
That is self-evident if you know the basics of A/B search. Smoothness is inversely proportional to the number of fail lows/highs, and the number of fail lows/highs decreases with increasing depth, as can be checked with the most basic empirical test: run SF on various (non-quiet) positions. Q.E.D.
Now, what is your "proof" for expecting the opposite?
If the domain is chess positions and the range is the score from white’s perspective, and proximity in the domain is defined by the number of changes in the feature vector (that’s the function we’re approximating), then you’re telling me this function is smooth? Ok, I’ll bite. Let’s see the proof.
Nope, I am not telling you that, but you somehow fail to grasp what I am telling you: the average number of changes in the feature vector is smaller if the score is obtained from a search at fixed depth N+1 than at depth N, for any practical N. The only exceptions might be very late endgames or positions where a mate is in sight.
The changes in the feature vector are exactly the same. It's the changes in the range we're interested in. A function is smooth or it isn't; it doesn't care what you "feel."
corres
Posts: 3657
Joined: Wed Nov 18, 2015 11:41 am
Location: hungary

Re: The next revolution in computer chess?

Post by corres »

dkappe wrote: Mon Jul 27, 2020 5:50 am ...
An educated guess is that for the 256 and 384 networks there is an N beyond which it no longer improves.
So you think the 256 and 384 nets are the biggest that are still useful? Is this practical experience, or does it have a basis in theory?
Raphexon
Posts: 476
Joined: Sun Mar 17, 2019 12:00 pm
Full name: Henk Drost

Re: The next revolution in computer chess?

Post by Raphexon »

corres wrote: Tue Jul 28, 2020 11:03 am
dkappe wrote: Mon Jul 27, 2020 5:50 am ...
An educated guess is that for the 256 and 384 networks there is an N beyond which it no longer improves.
So you think the 256 and 384 nets are the biggest that are still useful? Is this practical experience, or does it have a basis in theory?
Until incremental updates no longer fit inside the CPU's cache, I think.
Shogi uses 40 MB nets, but shogi is a lot more volatile than chess.
So maybe 784 or 1024 will be the upper limit of practicality for chess.

A halfkp512 net with castling rights encoded manages 395-400 knps from the start position, compared to 690 knps for the regular halfkp256x2x32x32, on my i5-4460 with 1 core.
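Some back-of-envelope arithmetic on why layer width, rather than total weight count, drives the cache pressure of incremental updates. The 41024 figure is the standard HalfKP feature count per perspective (64 king squares x 641); int16 weights and the MiB conversions are assumptions for illustration.

```python
INT16 = 2         # bytes per weight; NNUE's first layer uses int16
FEATURES = 41024  # HalfKP features per perspective (64 king squares x 641)

def working_set(width):
    row = width * INT16               # one weight row, read per changed feature
    accumulator = 2 * width * INT16   # accumulated activations, both perspectives
    full_matrix = FEATURES * width * INT16
    return row, accumulator, full_matrix

for width in (256, 512, 1024):
    row, acc, full = working_set(width)
    # A move touches a few rows plus the accumulator (KBs), while the full
    # matrix is tens of MiB and never fits in L1/L2.
    print(f"{width}: row {row} B, accumulator {acc} B, matrix {full / 2**20:.0f} MiB")
```

The per-update working set grows linearly with width, which is consistent with the knps drop reported above when moving from 256 to 512.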
dkappe
Posts: 1632
Joined: Tue Aug 21, 2018 7:52 pm
Full name: Dietrich Kappe

Re: The next revolution in computer chess?

Post by dkappe »

corres wrote: Tue Jul 28, 2020 11:03 am
dkappe wrote: Mon Jul 27, 2020 5:50 am ...
An educated guess is that for the 256 and 384 networks there is an N beyond which it no longer improves.
So you think the 256 and 384 nets are the biggest that are still useful? Is this practical experience, or does it have a basis in theory?
It's based on my reading and experience in training fully connected, shallow networks to approximate real-valued functions in a variety of domains. Note that you can go even bigger at the cost of speed. There's often a sweet spot that can't be calculated but is determined empirically. The addition of features seems the most exciting direction, though getting a bigger net to be efficiently updatable, instead of just an NN, seems a bit tricky.