An AlphaZero inspired project

Discussion of chess software programming and technical issues.

Moderators: hgm, Rebel, chrisw

trulses
Posts: 39
Joined: Wed Dec 06, 2017 5:34 pm

Re: An AlphaZero inspired project

Post by trulses »

AlvaroBegue wrote:I don't see the interest in training tablebases for endgames that are already covered by EGTBs. Doesn't it make more sense to start with 6-men EGTBs and get the neural network to learn everything else?
To answer your question directly, yes it would. Just so that's clear, I have no intention of using this technique to create tablebases or anything like that; they would just be much weaker and much less efficient.

However, to get the technique to work on the full game of chess while running on "regular hardware", I might have to resort to all sorts of tricks to speed up convergence. These endgames are just a test-bed for me to try out different sets of hyper-parameters and ideas for exactly this purpose. Different types of endgames offer a nice range of complexity, and they all come with a perfect expert (EGTBs); it's like the perfect lab for chess mini-games.

So under these conditions, if I can't get it to work on endgames then I certainly won't be able to get it to work on the full game of chess. I only have one machine with a pretty old 4-core CPU and a fairly old GPU, so efficiency is extremely important; that's the goal here.
User avatar
Evert
Posts: 2929
Joined: Sat Jan 22, 2011 12:42 am
Location: NL

Re: An AlphaZero inspired project

Post by Evert »

AlvaroBegue wrote:I don't see the interest in training tablebases for endgames that are already covered by EGTBs. Doesn't it make more sense to start with 6-men EGTBs and get the neural network to learn everything else?
Depends on your goal. In this case, very clearly not the ability to play the end game, but validation of the method in a situation that is/could be difficult for it.
brianr
Posts: 536
Joined: Thu Mar 09, 2006 3:01 pm

Re: An AlphaZero inspired project

Post by brianr »

trulses wrote:
Thanks for sharing; indeed the iteration parameter helps. It's difficult to balance the nodes per search against having enough games played per training epoch. Have you tried varying the exploration constant in the node selection? In KQvK, how far away was the mate spotted by the 10k-simulation search?
8/8/8/4k3/8/8/8/4KQ2 w - - 0 1
Shortest mate in 8 moves
CheckersGuy
Posts: 273
Joined: Wed Aug 24, 2016 9:49 pm

Re: An AlphaZero inspired project

Post by CheckersGuy »

Evert wrote:
AlvaroBegue wrote:I don't see the interest in training tablebases for endgames that are already covered by EGTBs. Doesn't it make more sense to start with 6-men EGTBs and get the neural network to learn everything else?
Depends on your goal. In this case, very clearly not the ability to play the end game, but validation of the method in a situation that is/could be difficult for it.
If the method has a very high success rate it does come with a benefit: the NN uses much less memory compared to, let's say, a 7-piece endgame database. For that reason alone I would say it's pretty interesting, as it's basically a "compression". :lol:
User avatar
hgm
Posts: 27859
Joined: Fri Mar 10, 2006 10:06 am
Location: Amsterdam
Full name: H G Muller

Re: An AlphaZero inspired project

Post by hgm »

Evert wrote:I think a more interesting test would be KBNK, because that requires some fairly specific knowledge (it would be interesting to know whether A0 can handle it, though I suspect not).
Training just on KBNK seems a bad idea; it will be very hard to get any mate by accident, which could start off the learning. To get a good chance for a mate, it should already know it has to drive the King to a corner. And it is much easier to learn that with a more powerful piece, where you have a reasonable chance for accidental mates.

This shows that training for a very narrow task is a bad idea. If you train for KQK, KRK, KBBK and KBNK mates all at the same time, it will learn KBNK much faster, as KQK and KRK will teach it to involve the King, and to drive the bare King to edge or corner. With that knowledge KBBK should already be easy, and KBNK only requires learning about the good corner.
Pio
Posts: 335
Joined: Sat Feb 25, 2012 10:42 pm
Location: Stockholm

Re: An AlphaZero inspired project

Post by Pio »

Hi!

I agree with H.G.M.

One thing that should also improve the learning is to score the solutions with the smallest proof trees higher.

If you also make the network understand symmetries it will speed things up as well. For example, when there are no pawns you could fix the side-to-move king to one of the eight triangles (including the diagonal border squares).
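As a rough sketch (hypothetical Python, not code from this project), the side-to-move king could be mapped into the a1-d1-d4 triangle and the same reflection applied to every other piece:

def canonical_squares(stm_king_sq, piece_squares):
    # squares are 0..63 with a1 = 0, h8 = 63
    kf, kr = stm_king_sq % 8, stm_king_sq // 8
    flip_file = kf > 3                     # king on files e-h: mirror left/right
    flip_rank = kr > 3                     # king on ranks 5-8: mirror top/bottom
    kf = 7 - kf if flip_file else kf
    kr = 7 - kr if flip_rank else kr
    flip_diag = kr > kf                    # above the a1-h8 diagonal: transpose

    def t(sq):
        f, r = sq % 8, sq // 8
        if flip_file:
            f = 7 - f
        if flip_rank:
            r = 7 - r
        if flip_diag:
            f, r = r, f
        return r * 8 + f

    return t(stm_king_sq), [t(sq) for sq in piece_squares]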

BR
Pio
trulses
Posts: 39
Joined: Wed Dec 06, 2017 5:34 pm

Re: An AlphaZero inspired project

Post by trulses »

hgm wrote:
Evert wrote:I think a more interesting test would be KBNK, because that requires some fairly specific knowledge (it would be interesting to know whether A0 can handle it, though I suspect not).
Training just on KBNK seems a bad idea; it will be very hard to get any mate by accident, which could start off the learning. To get a good chance for a mate, it should already know it has to drive the King to a corner. And it is much easier to learn that with a more powerful piece, where you have a reasonable chance for accidental mates.

This shows that training for a very narrow task is a bad idea. If you train for KQK, KRK, KBBK and KBNK mates all at the same time, it will learn KBNK much faster, as KQK and KRK will teach it to involve the King, and to drive the bare King to edge or corner. With that knowledge KBBK should already be easy, and KBNK only requires learning about the good corner.
I'm a bit interested in the solo KBNK problem simply because of how difficult it is in terms of sparsity. I would like to know just how much or how little information it takes, how it can be influenced by an expert, etc.

You bring up some good points. It's going to be interesting to see the difference in convergence between agents trained on many endgame problems vs. just one, and also whether an agent trained on one problem can transfer its knowledge to another.

----------
Pio wrote:One thing that should also improve the learning is to score the solutions with the smallest proof trees higher.
This typically happens automatically: when we visit a node that hasn't had all its children expanded yet, we always expand those unvisited nodes first. This happens because the value of a child with no statistics is presumed infinite, so the PUCT algorithm always selects those nodes first. Since the game-theoretic values we are interested in (mate) are typically much stronger than the "heuristic" values from our neural net, and we average those values over the entire subtree, the algorithm ends up preferring not just the shortest path to mate but also the one with the smallest branching factor. Quite similar to M. Lai's probability-based AB search, come to think of it.
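For illustration, a minimal PUCT selection sketch (hypothetical node fields, not my actual code) showing how children with no statistics are always taken first:

import math

C_PUCT = 1.5  # exploration constant; the value here is just an assumption

def select_child(node):
    best_child, best_score = None, -math.inf
    sqrt_parent = math.sqrt(max(1, node.visit_count))
    for child in node.children:
        if child.visit_count == 0:
            return child                                # "infinite" value: expand first
        q = child.value_sum / child.visit_count
        u = C_PUCT * child.prior * sqrt_parent / (1 + child.visit_count)
        if q + u > best_score:
            best_child, best_score = child, q + u
    return best_child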

That said, propagating proven values up the tree is definitely something I'm looking into; it seems pretty mad to me that nodes with proven values (like a mate in 1) have to be visited over and over when we already know the signal to be true. Of course it makes a lot of sense in the bandit framework, where we want to constantly maximize our rewards, but it doesn't make much sense for a search tree. Proven winning/losing values should also influence the policy label to an infinite extent. Something similar could probably be said for actions with little to no variance in the reward distribution.

Apparently propagating game theoretic values has been tried here http://ieeexplore.ieee.org/document/5523941/
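As a rough sketch, backing up proven values in the spirit of that paper could look something like this (the node fields here are hypothetical, not my actual structures; values are from the perspective of the side to move at each node):

WIN, LOSS, UNKNOWN = 1, -1, 0   # node.proven defaults to UNKNOWN

def backpropagate_proof(node):
    # climb toward the root, marking nodes whose children force a result
    while node is not None:
        if any(c.proven == LOSS for c in node.children):
            node.proven = WIN      # some move reaches a position lost for the opponent
        elif node.children and all(c.proven == WIN for c in node.children):
            node.proven = LOSS     # every move reaches a position won for the opponent
        else:
            break                  # nothing new proven here, so stop
        node = node.parent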
Pio wrote:If you also make the network to understand symmetries it will also speed it up. For example when there are no pawns you could fix the side to move king to one of the eight triangles (including the diagonal border squares)
Excellent point. I'm looking to stack these types of optimizations on when time and effort allow. Taking advantage of these symmetries can also be useful directly in the tree, like they did in the AG0 paper. It can also be applied in training to get more mileage out of the data.
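As an illustration (not my actual pipeline), augmenting pawnless training samples with the eight board symmetries might look like this; `planes` is assumed to be a (channels, 8, 8) input array, and the policy target would need the same remapping, which is omitted here:

import numpy as np

def augment_pawnless(planes, value):
    samples = []
    for k in range(4):                                  # four rotations
        rot = np.rot90(planes, k, axes=(1, 2))
        samples.append((rot, value))
        samples.append((np.flip(rot, axis=2), value))   # plus a mirror of each
    return samples                                      # 8 symmetric copies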

----------

A short update on KQK. I tried a linear schedule for the number of simulations per search used in self-play games, starting at just 16 and increasing by 16 every 10 minutes. The rationale behind this is that with a randomly initialized network the search trees are pretty weak anyway, so before I start training I'd like to just find mates in one and play those. This allows me to generate a large number of games with mating positions. Since vague comparisons to the way humans approach things are popular, I guess this is similar to how an amateur at first looks just one or two moves ahead while a more accomplished player looks further ahead.
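For concreteness, the schedule is nothing more than this (illustrative names only):

import time

START_SIMS, STEP_SIMS, STEP_SECONDS = 16, 16, 600   # +16 sims every 10 minutes

def sims_per_search(run_start_time):
    elapsed = time.time() - run_start_time
    return START_SIMS + STEP_SIMS * int(elapsed // STEP_SECONDS)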

I also now generate random end-game scenarios by randomly placing pieces on unoccupied squares (re-doing the placement if the result is an invalid position). This means the data becomes a lot more diverse and the network has to generalize better to work well.
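A rough sketch of such a generator, assuming python-chess (which the project may or may not actually use):

import random
import chess   # python-chess; an assumption, the project may use its own board code

def random_endgame(piece_symbols=("K", "Q", "k")):
    while True:
        board = chess.Board(None)                       # start from an empty board
        board.turn = chess.WHITE
        for sq, sym in zip(random.sample(chess.SQUARES, len(piece_symbols)),
                           piece_symbols):
            board.set_piece_at(sq, chess.Piece.from_symbol(sym))
        if board.is_valid() and not board.is_game_over():
            return board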

I also wanted my self-play games to mirror, to a greater extent, what an evaluation game would look like, so I picked a different policy than before. The new policy I use in self-play is argmax_a(N(s, a) + (1 + Q(s, a))/2). Since most of my starting games will just be the root node's children expanded once, it's important to differentiate between moves with the same number of visits but different action-value estimates. Since the range of the Q-values is [-1, 1], I just shift and scale to [0, 1]. A caveat here is that I still use the old policy label N(s, a)/N(s); since most of the trees are tiny, these labels are mostly flat. This all seems pretty patchy to me, and with proper propagation of GT values it could probably be solved in a more elegant manner.
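A minimal sketch of this selection rule (with hypothetical node fields):

def pick_move(root):
    # visit counts dominate; the (1+Q)/2 term in [0, 1] breaks ties between
    # children with equal visits but different action-value estimates
    def score(child):
        q = child.value_sum / child.visit_count if child.visit_count else -1.0
        return child.visit_count + (1.0 + q) / 2.0
    return max(root.children, key=score).move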

Using this new setup I got this result,
Image

Trying out all this stuff has left the code in an unreadable, unworkable mess, so my main priority right now is re-writing the project so that these types of experiments have a proper framework. I'll then work on a guide on how to re-create the results so far.