Poll: How Many "Weights" Needed To Play "Known" Chess Very Well?

Discussion of anything and everything relating to chess playing software and machines.

Moderators: hgm, Rebel, chrisw

How many weights needed to play known chess very well using definitions in first post?

Fewer than 10 million
0
No votes
10 - 50 million
0
No votes
51-100 million
0
No votes
101 million - 1 billion
0
No votes
2 billion - 100 billion
0
No votes
More than 100 billion
0
No votes
 
Total votes: 0

User avatar
hgm
Posts: 27787
Joined: Fri Mar 10, 2006 10:06 am
Location: Amsterdam
Full name: H G Muller

Re: Poll: How Many "Weights" Needed To Play "Known" Chess Very Well?

Post by hgm »

Stockfish does use a raw board representation as input for the evaluation, doesn't it?
User avatar
towforce
Posts: 11542
Joined: Thu Mar 09, 2006 12:57 am
Location: Birmingham UK

Re: Poll: How Many "Weights" Needed To Play "Known" Chess Very Well?

Post by towforce »

hgm wrote: Mon Oct 21, 2019 12:06 pm Stockfish does use a raw board representation as input for the evaluation, doesn't it?

I think what Fabian was saying is that, in order to get a NN to evaluate a chess position reasonably well with just a few hundred weights, you're proposing that, in your words, "The weights are piece (or piece-square) values, passer bonuses etc", which Fabian points out are pre-processed features, not something the NN has learned for itself by training against a provided set of position/evaluation pairs.

However, you have given us an idea that a "reasonable" level of play can be achieved with a smaller number of weights than one would expect, which is valuable information.
Writing is the antidote to confusion.
It's not "how smart you are", it's "how are you smart".
Your brain doesn't work the way you want, so train it!
dkappe
Posts: 1631
Joined: Tue Aug 21, 2018 7:52 pm
Full name: Dietrich Kappe

Re: Poll: How Many "Weights" Needed To Play "Known" Chess Very Well?

Post by dkappe »

towforce wrote: Mon Oct 21, 2019 11:06 am
hgm wrote: Mon Oct 21, 2019 9:37 am It depends on how you define "very well". If you are happy with Stockfish, the answer is "a couple of hundred". Conventional evaluations can be implemented as neural networks with one or two layers. The weights are piece (or piece-square) values, passer bonuses etc.

If you are happy with micro-Max, you could do with a single layer and a few dozen weights.

Thank you - that's very interesting!

If this is right, it supports my view that a NN to evaluate a single chess position with good accuracy (and play a game at ply 1 search) would probably require fewer neurons and weights than most people think.
The 11258 distilled networks run all the way from 16x2, 24x3, 32x4, 48x5, etc., and will run reasonably well on CPU. You can find them here: https://github.com/dkappe/leela-chess-w ... d-Networks

Try out the various sizes on lc0 and judge for yourself.

You can also try this BOT https://github.com/dkappe/leela-chess-w ... -style-net

It’s a 32x4 looking at ~25 moves on a raspberry pi 3. Because of its source material, it plays objectively weaker moves than SF or leela, but is very effective against humans.
Fat Titz by Stockfish, the engine with the bodaciously big net. Remember: size matters. If you want to learn more about this engine just google for "Fat Titz".
fabianVDW
Posts: 146
Joined: Fri Mar 15, 2019 8:46 pm
Location: Germany
Full name: Fabian von der Warth

Re: Poll: How Many "Weights" Needed To Play "Known" Chess Very Well?

Post by fabianVDW »

towforce wrote: Mon Oct 21, 2019 12:35 pm
hgm wrote: Mon Oct 21, 2019 12:06 pm Stockfish does use a raw board representation as input for the evaluation, doesn't it?

I think what Fabian was saying is that, in order to get a NN to evaluate a chess position reasonably well with just a few hundred weights, you're proposing that, in your words, "The weights are piece (or piece-square) values, passer bonuses etc", which Fabian points out are pre-processed features, not something the NN has learned for itself by training against a provided set of position/evaluation pairs.

However, you have given us an idea that a "reasonable" level of play can be achieved with a smaller number of weights than one would expect, which is valuable information.
Yes, this is what I was getting at. Let's say a NN gets all 12 bitboards fed in (this disregards history and special flags, but okay); then we have 768 inputs. Feeding in the bitboards is, I think, usual practice, because they represent the one-hot encoding of the pieces. This NN will need a lot more weights (at least I would think so, though apparently, as dkappe pointed out, there are fairly small Lc0 nets) to do feature extraction on the 768 inputs than, say, a NN which has the direct features of Stockfish fed in, like the number of pawns, number of knights, number of bishops and so on...
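The 768-input scheme described above can be sketched in a few lines. This is a hypothetical illustration: the `encode` helper and the square/piece numbering conventions are my own assumptions, not code from any engine mentioned in the thread.

```python
def encode(board):
    """One-hot encode a position as 12 planes of 64 squares (768 inputs).

    `board` maps a square index (0..63) to (piece, color), where
    piece is 0..5 (pawn..king) and color is 0 (white) or 1 (black).
    These conventions are assumed purely for illustration.
    """
    x = [0.0] * 768
    for sq, (piece, color) in board.items():
        plane = color * 6 + piece      # which of the 12 bitboards
        x[plane * 64 + sq] = 1.0       # one-hot: this piece sits on this square
    return x

# white pawn on square 0, black king on square 63
inputs = encode({0: (0, 0), 63: (5, 1)})
```

Exactly one input per occupied (piece, square) pair is set, so the vector stays sparse: a full starting position lights up only 32 of the 768 entries.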
Author of FabChess: https://github.com/fabianvdW/FabChess
A UCI compliant chess engine written in Rust.
FabChessWiki: https://github.com/fabianvdW/FabChess/wiki
fabianvonderwarth@gmail.com
User avatar
hgm
Posts: 27787
Joined: Fri Mar 10, 2006 10:06 am
Location: Amsterdam
Full name: H G Muller

Re: Poll: How Many "Weights" Needed To Play "Known" Chess Very Well?

Post by hgm »

Ah, now you don't want just a NN, but one that learns.

Well, the weights ('eval parameters') in a conventional evaluation are usually tuned, e.g. by Texel tuning. In fact the back-propagation used in a NN is nothing but Texel tuning of the NN. So my answer still stands.
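The parallel between Texel-style tuning and back-propagation can be made concrete with a toy sketch: minimise the squared error between a sigmoid of a linear eval and game results by gradient descent. The feature set, learning rate, and data below are invented for illustration only.

```python
import math

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

# Toy training set (invented): features = (pawn, knight, rook) material
# difference; label = game result scaled to 0..1 from White's perspective.
data = [
    ([ 1,  0,  0], 0.70),
    ([-1,  0,  0], 0.30),
    ([ 0,  1,  0], 0.90),
    ([ 0, -1,  0], 0.10),
    ([ 0,  0,  1], 0.95),
]

weights = [0.0, 0.0, 0.0]
lr = 0.5
for _ in range(2000):
    for feats, result in data:
        ev = sum(w * f for w, f in zip(weights, feats))
        p = sigmoid(ev)
        # gradient of (p - result)^2 with respect to each weight
        grad = (p - result) * p * (1.0 - p)
        for i, f in enumerate(feats):
            weights[i] -= lr * grad * f
```

After training, the learned "piece values" come out ordered as expected (rook > knight > pawn), which is all this sketch is meant to show: the eval parameters fall out of the same gradient machinery a NN uses.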

Of course no one would claim that a conventional eval would "play Chess very well" just on its own (i.e. at 1 ply). Stockfish and micro-Max play well because they search very deep. The requirement to play very well just by eval alone was not in your original question.

Chess is mostly tactics, i.e. doing the correct move in non-quiet positions. Conventional evals need to work only on quiet positions, because they are used in combination with Quiescence Search. As soon as you do away with QS, the overwhelmingly important thing becomes to statically evaluate tactical exchanges, and almost everything conventional evaluations calculate becomes insignificant noise compared to that. Getting the tactics reasonably right will probably require thousands of neurons and tens of thousands of weights. Just calculating a SEE for every square would already be pretty complex, but totally insufficient: you would get play similar to that of N.E.G., which loses to almost any searching engine. You will have to be able to recognize 'multi-square tactics'.
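The "SEE for every square" idea can be illustrated with the classic swap algorithm, here stripped down to bare piece-value lists. This is a sketch under my own simplifications (attackers given as sorted value lists, pins and batteries ignored), not how any particular engine implements it.

```python
def see(target, attackers, defenders):
    """Static exchange evaluation on one square via the swap algorithm.

    `target` is the value of the piece on the square; `attackers` and
    `defenders` are each side's attacker values, cheapest first.
    Returns the material gain for the attacking side, assuming either
    side may stop capturing when continuing no longer pays.
    """
    gains, sides = [], [list(attackers), list(defenders)]
    on_square, side = target, 0
    while sides[side]:
        gains.append(on_square)          # what this capture wins
        on_square = sides[side].pop(0)   # the capturer becomes the new target
        side ^= 1
    balance = 0
    for g in reversed(gains):            # each side may decline to recapture
        balance = max(0, g - balance)
    return balance

# pawn takes queen, pawn recaptures: 9 - 1 = 8
print(see(9, [1], [1]))  # → 8
```

Even this toy version shows why SEE alone is insufficient: it looks at one square in isolation, so discovered attacks, pins, and the multi-square tactics mentioned above are invisible to it.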

Btw, counting Knights and such is not really complex at all. It requires a single neuron and 64 connections all with the same weight. You connect those to the Knights bitboard. I am not sure what your definition of 'weight' is, but I would count that as a single weight. (A convolutional NN also applies a single weight to many, translationally equivalent connections.)
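The single-shared-weight counting neuron described above reduces to a popcount times one weight. A minimal sketch (the weight value is an arbitrary placeholder):

```python
KNIGHT_WEIGHT = 3.0  # one shared weight (placeholder value)

def knight_material(knight_bb: int) -> float:
    """64 connections into one neuron, all sharing a single weight,
    collapse to: weight * number of set bits in the knight bitboard."""
    return KNIGHT_WEIGHT * bin(knight_bb).count("1")

# two knights anywhere on the board
print(knight_material(0b101))  # → 6.0
```

Whether this counts as 1 weight or 64 is exactly the bookkeeping question raised in the thread: the network has 64 connections, but only one independent trainable parameter.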
User avatar
towforce
Posts: 11542
Joined: Thu Mar 09, 2006 12:57 am
Location: Birmingham UK

Re: Poll: How Many "Weights" Needed To Play "Known" Chess Very Well?

Post by towforce »

hgm wrote: Mon Oct 21, 2019 3:08 pm Ah, now you don't want just a NN, but one that learns.

Well, the weights ('eval parameters') in a conventional evaluation are usually tuned, e.g. by Texel tuning. In fact the back-propagation used in a NN is nothing but Texel tuning of the NN. So my answer still stands.

Of course no one would claim that a conventional eval would "play Chess very well" just on its own (i.e. at 1 ply). Stockfish and micro-Max play well because they search very deep. The requirement to play very well just by eval alone was not in your original question.

Chess is mostly tactics, i.e. doing the correct move in non-quiet positions. Conventional evals need to work only on quiet positions, because they are used in combination with Quiescence Search. As soon as you do away with QS, the overwhelmingly important thing becomes to statically evaluate tactical exchanges, and almost everything conventional evaluations calculate becomes insignificant noise compared to that. Getting the tactics reasonably right will probably require thousands of neurons and tens of thousands of weights. Just calculating a SEE for every square would already be pretty complex, but totally insufficient. You would get play similar to that of N.E.G., which loses to almost any searching engine. You will have to be able to recognize 'multi-square tactics'.

Btw, counting Knights and such is not really complex at all. It requires a single neuron and 64 connections all with the same weight. You connect those to the Knights bitboard. I am not sure what your definition of 'weight' is, but I would count that as a single weight. (A convolutional NN also applies a single weight to many, translationally equivalent connections.)

Thank you for that valuable insight. This is a very good description of how today's top engines work, but the way you've described it throws up lots of problems. Here's an obvious one:

The position is quiet, but 10 ply from now, some sharp tactics arise. Your choice is:

1. spend a lot of time calculating through all the tactical lines to quiescence

2. keep on selecting moves on the basis of good position

Choice 1 is a profligate use of time, and there's no guarantee that the sharp sequences will actually happen.

Choice 2 means that you're relying just on your positional knowledge, when you've stated that tactics trump positional evaluation.

It's an obvious truth that if you are genuinely in a good position ahead of a passage of sharp tactical play then, unless you make a mistake in the tactical phase, you'll still be in a good position when the tactics have quietened down.

The bigger point is that human players have found large numbers of "rules" for evaluating relatively simple positions (or patterns within positions), so it would be surprising if a system optimised for finding such rules in more complex positions, rules more generally applicable than the ones we all know and love, was unable to find any. These more sophisticated rules may make the simpler rules redundant, just as chess programmers find that they can drop a lot of knowledge from evaluation as their engine's search gets deeper.

So what we seem to be saying here is that a non-searching chess engine is going to need "quite a lot" of weights. AlphaZero has 50 million weights. IIRC somebody said that it can play at around 2400 without search.

IMO, the reason why DeepMind cannot easily make it even stronger without search is that training NNs to solve complex problems is difficult: chess has an awfully big "shape" to fit to. This is one of the two reasons why I would prefer a set of linear expressions; there's every reason to suppose that the optimisation of the whole system would be massively better. The other reason is that I think it would also be easier to optimise, by which I mean maximising the number of zero weights while getting an equally good result; expressions whose weights are all zero could then be removed from the set.
Writing is the antidote to confusion.
It's not "how smart you are", it's "how are you smart".
Your brain doesn't work the way you want, so train it!
User avatar
hgm
Posts: 27787
Joined: Fri Mar 10, 2006 10:06 am
Location: Amsterdam
Full name: H G Muller

Re: Poll: How Many "Weights" Needed To Play "Known" Chess Very Well?

Post by hgm »

I don't see what point you want to make with your example. You can play 10 positionally good moves based on perfect positional knowledge, so you start the tactical phase with a guarantee that you can force a path through it that makes you come out ahead. Unfortunately you cannot find that path because you suck at tactics, and in 9 out of 10 cases will come out a Pawn or piece down.

Also note that a purely linear response is no good in NNs, because it means that no matter how many layers you have, the eventual outputs will always be a linear combination of the inputs. It is like having only a single cell (per output), connected to all inputs. Any weights outnumbering the inputs are simply redundant. And even the simple human rules for Chess evaluation, e.g. to decide whether a Pawn is a passer or not, are not linear. So the non-linear response of the cells to the sum of their weighted inputs is essential. Otherwise you never get beyond the level of piece-square tables.
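The point about linear layers collapsing can be checked numerically: two stacked linear layers (no activation between them) give exactly the same outputs as the single layer formed by multiplying their weight matrices. A small self-contained demonstration:

```python
import random

def matmul(A, B):
    """Plain matrix product of A (m x k) and B (k x n)."""
    return [[sum(A[i][k] * B[k][j] for k in range(len(B)))
             for j in range(len(B[0]))] for i in range(len(A))]

def apply_layer(W, x):
    """One linear layer: W x, with no activation function."""
    return [sum(W[i][j] * x[j] for j in range(len(x))) for i in range(len(W))]

random.seed(1)
W1 = [[random.uniform(-1, 1) for _ in range(4)] for _ in range(3)]  # 4 -> 3
W2 = [[random.uniform(-1, 1) for _ in range(3)] for _ in range(2)]  # 3 -> 2
x = [random.uniform(-1, 1) for _ in range(4)]

two_layers = apply_layer(W2, apply_layer(W1, x))  # depth 2, purely linear
collapsed = apply_layer(matmul(W2, W1), x)        # one equivalent layer
```

The 3x4 and 2x3 matrices (18 weights) collapse into a single 2x4 matrix (8 weights), which is exactly the redundancy hgm describes.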
fabianVDW
Posts: 146
Joined: Fri Mar 15, 2019 8:46 pm
Location: Germany
Full name: Fabian von der Warth

Re: Poll: How Many "Weights" Needed To Play "Known" Chess Very Well?

Post by fabianVDW »

hgm wrote: Mon Oct 21, 2019 3:08 pm Well, the weights ('eval parameters') in a conventional evaluation are usually tuned, e.g. by Texel tuning. In fact the back-propagation used in a NN is nothing but Texel tuning of the NN. So my answer still stands.
I like that you bring this up, because most are not aware of the similarity of the approaches. IMO the main thing separating an engine like Lc0 from "normal" alpha-beta engines is the type of search and raw evaluation speed.
hgm wrote: Mon Oct 21, 2019 3:08 pm Btw, counting Knights and such is not really complex at all. It requires a single neuron and 64 connections all with the same weight. You connect those to the Knights bitboard. I am not sure what your defenition of 'weight' is, but I would count that as a single weight. (A convolutional NN also applies a single weight to many, translationally equivalent connections.)
You are right for counting, but I would not immediately see such neurons for determining, for instance, passers. It would make sense that fewer weights are needed the higher-level your features are. The higher-level the feature, though, the less perfect it may be (because of missing complexity). Giving a neural network our predefined features (such as SF's) possibly limits the playing strength of a perfectly trained network to a local optimum, and this is why I thought the board has to be fed into the NN "as raw as possible" (so as not to limit the strength to something sub-optimal beforehand). Although one can argue that the information needed for piece-square-table evaluation also fully contains a raw board representation, as stated above, if it does not abuse symmetry.

About the definition of a weight, I think that for a DNN this would be counted as 64 different connections, but correct me if I am wrong. At least I think this is what Tensorflow interprets as different weights and also prints out as the `amount of trainable parameters`.
Author of FabChess: https://github.com/fabianvdW/FabChess
A UCI compliant chess engine written in Rust.
FabChessWiki: https://github.com/fabianvdW/FabChess/wiki
fabianvonderwarth@gmail.com
User avatar
towforce
Posts: 11542
Joined: Thu Mar 09, 2006 12:57 am
Location: Birmingham UK

Re: Poll: How Many "Weights" Needed To Play "Known" Chess Very Well?

Post by towforce »

hgm wrote: Mon Oct 21, 2019 9:47 pm I don't see what point you want to make with your example. You can play 10 positionally good moves based on perfect positional knowledge, so you start the tactical phase with a guarantee that you can force a path through it that makes you come out ahead. Unfortunately you cannot find that path because you suck at tactics, and in 9 out of 10 cases will come out a Pawn or piece down.
Human GMs only rarely make tactical mistakes - they have mainly LEARNED them, and 'automatically' know them without having to count.

Also note that a purely linear response is no good in NNs, because it means that no matter how many layers you have, the eventual outputs will always be a linear combination of the inputs. It is like having only a single cell (per output), connected to all inputs. Any weights outnumbering the inputs are simply redundant. And even the simple human rules for Chess evaluation, e.g. to decide whether a Pawn is a passer or not, are not linear. So the non-linear response of the cells to the sum of their weighted inputs is essential. Otherwise you never get beyond the level of piece-square tables.

The method of divided differences allows you to have polynomial expressions built from linear expressions (this would absolutely require more than one column of expressions).
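Newton's method of divided differences, which towforce alludes to, does build a polynomial out of repeated linear difference steps. A standard textbook sketch:

```python
def divided_differences(xs, ys):
    """Newton divided-difference coefficients, computed in place."""
    coef = list(ys)
    n = len(xs)
    for j in range(1, n):
        for i in range(n - 1, j - 1, -1):
            coef[i] = (coef[i] - coef[i - 1]) / (xs[i] - xs[i - j])
    return coef

def newton_eval(coef, xs, x):
    """Evaluate the Newton-form polynomial at x, Horner-style."""
    result = coef[-1]
    for i in range(len(coef) - 2, -1, -1):
        result = result * (x - xs[i]) + coef[i]
    return result

# interpolate y = x^2 from four samples, then extrapolate
xs, ys = [0, 1, 2, 3], [0, 1, 4, 9]
coef = divided_differences(xs, ys)
print(newton_eval(coef, xs, 5))  # → 25.0
```

Note this still needs the products (x - xs[i]) at evaluation time, so the "more than one column" caveat matters: a single column of linear expressions could never produce the quadratic fit shown here.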
Writing is the antidote to confusion.
It's not "how smart you are", it's "how are you smart".
Your brain doesn't work the way you want, so train it!
Uri Blass
Posts: 10267
Joined: Thu Mar 09, 2006 12:37 am
Location: Tel-Aviv Israel

Re: Poll: How Many "Weights" Needed To Play "Known" Chess Very Well?

Post by Uri Blass »

towforce wrote: Mon Oct 21, 2019 11:54 pm Human GMs only rarely make tactical mistakes - they have mainly LEARNED them, and 'automatically' know them without having to count.
Human GMs rarely make tactical mistakes because they calculate.

Humans calculate even at 1 minute, when GMs clearly make tactical mistakes more than just rarely.

They can think that the first move they consider is bad because of some opponent reply, and play a different move, with the whole process taking no more than 1 second.

I doubt whether the positional evaluation of top GMs is worth more than 1600 FIDE Elo at long time control.
They may play better than that at 1-minute bullet, but only because they calculate even at bullet and do not always play the first move that they consider.
Calculation at bullet may be something like
"if I play move A the opponent plays move B and I do not like the position, so A is not good, so I play move C that seems better"
but it is not one node per move, and it is not only a comparison of static evaluations at depth 1, because the comparison is not directly between move A and move C but between lines that include the opponent's reply.