Daniel Shawul wrote: ↑Fri Mar 15, 2019 3:24 pm
I have had a similar frustrating experience with reinforcement learning before I deemed it too big a project for one man.
What input planes do you use besides piece placement? I use attack tables and piece counts as additional inputs and they seem to help accelerate learning.
Do you use a replay buffer in your training, and how much does it help to stabilize? I suspect this was one of the key things I missed when I was doing RL. I was also using a value net only (no policy) at the time, which makes matters worse. The 6x64 value net was not improving after 300k games or so.

It uses attack tables as well. My initial attempt used 12 planes for the pieces, 12 for the squares attacked by each piece, and additional planes for side to move, castling, and en passant. This worked up to a point: it was reaching around 2300-2400 and won some games against 2500-level engines, but then I noticed it started falling for some very easy tactics, and falling for them even more as training progressed. In particular, tactics involving pinned pieces. For example, if a piece is protected and the protecting piece is pinned, then the first piece is actually hanging, but the (policy) net wasn't able to 'see' this (since the attack planes were 'saying' it was protected).
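Roughly, the encoding looks like this. This is a minimal sketch using the python-chess and numpy libraries; the plane ordering, the function name, and the exact castling/en-passant encoding here are illustrative rather than the exact layout I used:

[code]
import numpy as np
import chess

def encode_planes(board: chess.Board) -> np.ndarray:
    """Hypothetical encoder: 12 piece planes, 12 attack planes,
    plus side-to-move, castling and en-passant planes (28 total)."""
    planes = np.zeros((28, 8, 8), dtype=np.float32)
    for sq in chess.SQUARES:
        piece = board.piece_at(sq)
        if piece is None:
            continue
        # Plane index: piece type 0-5, offset by 6 for black.
        idx = (piece.piece_type - 1) + (0 if piece.color == chess.WHITE else 6)
        planes[idx, chess.square_rank(sq), chess.square_file(sq)] = 1.0
        # Attack plane for the same piece type/colour. Note that
        # board.attacks() ignores absolute pins -- the blind spot below.
        for att in board.attacks(sq):
            planes[12 + idx, chess.square_rank(att), chess.square_file(att)] = 1.0
    planes[24, :, :] = 1.0 if board.turn == chess.WHITE else 0.0
    # One castling plane per side for brevity; one per castling right
    # is also common.
    planes[25, :, :] = 1.0 if board.has_castling_rights(chess.WHITE) else 0.0
    planes[26, :, :] = 1.0 if board.has_castling_rights(chess.BLACK) else 0.0
    if board.ep_square is not None:
        planes[27, chess.square_rank(board.ep_square),
                   chess.square_file(board.ep_square)] = 1.0
    return planes
[/code]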
You might think that with more training it would learn that a pinned piece's attack squares are nullified, but on the contrary: as training progresses the net simply learns to trust whatever custom inputs you give it more and more.
I don't want to say exactly what changes I made, but you will definitely need to think carefully about what additional data you give the net, and in particular whether there are any exceptional situations where that data becomes misleading.
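To make the pin problem concrete, here is a tiny demonstration, again assuming python-chess (the position is just an illustration):

[code]
import chess

# White: Ke1, Nc3, Pe4.  Black: Ke8, Bb4, Nf6.  Bb4 pins Nc3 to the king.
board = chess.Board("4k3/8/5n2/8/1b2P3/2N5/8/4K3 b - - 0 1")

# The attack plane 'says' e4 is defended by the knight on c3...
assert chess.E4 in board.attacks(chess.C3)
# ...but the knight is absolutely pinned, so the pawn is really hanging:
assert board.is_pinned(chess.WHITE, chess.C3)
board.push_san("Nxe4")                                        # Black wins the pawn
assert chess.Move.from_uci("c3e4") not in board.legal_moves   # no recapture
[/code]

Masking out the attacks of pinned pieces looks like an obvious patch, but it has exceptional cases of its own (a pinned piece still attacks along the pin line, and its attack squares still keep the enemy king out), which is exactly the kind of misleading edge case I mean.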
I'm curious what you meant by 'piece counts' as one of your inputs. Does the main convolutional body receive them as a plane, or do you feed them in afterwards as a single scalar input?
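For reference, the two usual options look like this in a PyTorch sketch (the layer sizes and names are made up for illustration):

[code]
import torch
import torch.nn as nn

class TinyNet(nn.Module):
    """Illustrative only: shows both ways of feeding a scalar input."""
    def __init__(self):
        super().__init__()
        self.conv = nn.Conv2d(12 + 1, 32, kernel_size=3, padding=1)  # +1: count plane
        self.fc = nn.Linear(32 * 8 * 8 + 1, 1)                       # +1: raw scalar

    def forward(self, planes: torch.Tensor, piece_count: torch.Tensor):
        # Option A: broadcast the scalar to a constant 8x8 plane so the
        # convolutional body sees it at every board square.
        count_plane = piece_count.view(-1, 1, 1, 1).expand(-1, 1, 8, 8)
        x = torch.relu(self.conv(torch.cat([planes, count_plane], dim=1)))
        # Option B: concatenate the raw scalar after the conv body,
        # just before the fully connected head.
        x = torch.cat([x.flatten(1), piece_count.view(-1, 1)], dim=1)
        return self.fc(x)

net = TinyNet()
out = net(torch.zeros(4, 12, 8, 8), torch.full((4,), 30.0))  # batch of 4
[/code]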
I do use a replay buffer. Each generation consists of 2048 positions (approx 19-20 games). For training I select random positions from the previous 64 generations. In other words I select from the most recent 1200-1300 games (this contrasts with Leela / AlphaZero where they select from the previous million or so games).
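In code, the scheme is roughly this (a sketch of my description above; the class and constant names are illustrative):

[code]
import random
from collections import deque

GEN_SIZE = 2048   # positions per generation (roughly 19-20 games)
WINDOW = 64       # generations kept, i.e. the most recent ~1,200-1,300 games

class ReplayBuffer:
    """Generation-windowed replay buffer (names are mine)."""
    def __init__(self):
        self.generations = deque(maxlen=WINDOW)  # oldest generation falls off

    def add_generation(self, positions):
        assert len(positions) == GEN_SIZE
        self.generations.append(positions)

    def sample(self, batch_size):
        # Uniform over every position currently inside the window.
        pool = [p for gen in self.generations for p in gen]
        return random.sample(pool, batch_size)
[/code]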
I've only ever used a replay buffer. If you didn't, what were you using? What instability issues were you facing?
I would recommend that you try to get something working with only the raw board as input. Once you've got the framework and main techniques sorted, you can experiment with extra inputs.
Incidentally, on one of my training runs I accidentally deleted the call to fill the attack planes and only realised it much later, and it had no issues learning. So even if it's a one-man job, using the board input only should still work.