SL vs RL

chrisw
Posts: 1415
Joined: Tue Apr 03, 2012 2:28 pm

SL vs RL

Post by chrisw » Sun Apr 28, 2019 1:57 pm

I found my little test MCTS SL program screwing up in game play because of this sort of situation:

Black Q gives check. The checking piece is en prise. The king has some moves.

The SL-trained policy gave the king-move evasions normal probabilities, 0.25, 0.12 sort of values, but the obvious Q capture (which would be outright winning) got a probability of 0.005 or so. This happened often enough to warrant an investigation.

I think it’s because SL games are sort of sensible, comp-comp or human: there are virtually no positions generated where a checking queen is simply left en prise, so no sensible game contains the continuation where xQ would get reinforced in training. Hence xQ is not favored in the policy.

The result in search is that the little program plays the idiotic check with the Q en prise and is very happy, because PUCT puts the recapture so far down its list, or misses it entirely, that the en-prise Q never gets captured.
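
For illustration, a minimal AlphaZero-style PUCT selection sketch in Python (constants, priors and values below are made up, not the actual program) shows how a ~0.005 prior starves the winning recapture of visits:

Code: Select all

import math

C_PUCT = 1.5  # illustrative exploration constant

def run(capture_prior, sims=100):
    # One node, defender to move: two evasions and the recapture.
    # Values are from the side to move; +0.9 stands in for "wins a queen".
    children = [
        {"move": "Kg1", "p": 0.25,          "n": 0, "w": 0.0, "v": 0.0},
        {"move": "Kf1", "p": 0.12,          "n": 0, "w": 0.0, "v": 0.0},
        {"move": "QxQ", "p": capture_prior, "n": 0, "w": 0.0, "v": 0.9},
    ]
    first_try = None
    for i in range(1, sims + 1):
        def score(c):
            q = c["w"] / c["n"] if c["n"] else 0.0             # FPU = 0 for simplicity
            u = C_PUCT * c["p"] * math.sqrt(i) / (1 + c["n"])  # i = parent visits so far
            return q + u
        best = max(children, key=score)
        best["n"] += 1
        best["w"] += best["v"]                                 # stand-in for the value backup
        if first_try is None and best["move"] == "QxQ":
            first_try = i
    return first_try, {c["move"]: c["n"] for c in children}

print(run(capture_prior=0.005))  # QxQ is not tried until dozens of simulations in
print(run(capture_prior=0.30))   # with a sane prior it is tried immediately

With the small per-node budgets you get deep in the tree, the 0.005 move simply never receives a visit, which is exactly the "never gets captured" behaviour.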

AZ is supposed to generate “weak” games to give experience of stupid situations; is that the solution? And weak enough to contain outright blunder positions?

This is an extreme example, but I’m seeing variants of it all the time, all rooted in the fact that SL only sees sensible examples.
Another failure is SEE-favored captures. Sensible games have many positions where the side that just moved is material ahead, but the material isn’t kept, because the move was only the first in a SEE exchange sequence. The problem is that the value head interprets “material down but on the move” as equal, because, well, it only got to see examples where that is true.
I think I found a way round this, but hack by hack usually leaves holes. Is RL just better? Maybe.

Rémi Coulom
Posts: 426
Joined: Mon Apr 24, 2006 6:06 pm

Re: SL vs RL

Post by Rémi Coulom » Sun Apr 28, 2019 2:19 pm

RL has the same problem. Weak moves of early self-play games are rapidly forgotten.

In Go, the alphazero method has a very severe problem with ladders. It is similar to what you describe.

A fundamental flaw of the Alpha Zero approach is that it learns only from games between strong players. When the neural network is used inside the search tree, it often has to find refutations to bad moves that would have never been played between strong players.

I have not yet found a good way to overcome this problem. But I will try some ideas to include good refutations to bad moves into the training set.

hgm
Posts: 23150
Joined: Fri Mar 10, 2006 9:06 am
Location: Amsterdam
Full name: H G Muller

Re: SL vs RL

Post by hgm » Mon Apr 29, 2019 10:41 am

The solution seems obvious: make sure the training set contains sufficiently many games where a strong player crushes a patzer in the most efficient way. Or games between strong players where you make one of the moves of one player a random one.
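
A rough sketch of that game-generation tweak in Python, using python-chess for the board plumbing; search_best_move is a stand-in for whatever strong engine or net produces the training games, so treat names and constants as assumptions:

Code: Select all

import random
import chess  # python-chess, used only for board/move plumbing

def generate_game_with_injected_blunder(search_best_move, max_plies=200):
    board = chess.Board()
    blunder_ply = random.randrange(10, 60)   # somewhere early-ish, if the game lasts that long
    records = []                             # (fen, ply) pairs for later labelling
    for ply in range(max_plies):
        if board.is_game_over():
            break
        records.append((board.fen(), ply))
        if ply == blunder_ply:
            move = random.choice(list(board.legal_moves))  # the injected patzer move
        else:
            move = search_best_move(board)                 # normal strong play
        board.push(move)
    result = board.result(claim_draw=True)   # "1-0", "0-1", "1/2-1/2" or "*"
    # What result label the pre-blunder positions should get is the open
    # question raised in the next post; one option is to train only on the
    # positions from blunder_ply onwards.
    return records, result, blunder_ply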

chrisw
Posts: 1415
Joined: Tue Apr 03, 2012 2:28 pm

Re: SL vs RL

Post by chrisw » Mon Apr 29, 2019 11:30 am

hgm wrote:
Mon Apr 29, 2019 10:41 am
The solution seems obvious: make sure the training set contains sufficiently many games where a strong player crushes a patzer in the most efficient way. Or games between strong players where you make one of the moves of one player a random one.
Then what do you use as the game result for this diversionary sequence? And is one random move in, say, 80 enough to create the necessary bad-game segment?

chrisw
Posts: 1415
Joined: Tue Apr 03, 2012 2:28 pm

Re: SL vs RL

Post by chrisw » Mon Apr 29, 2019 11:49 am

Rémi Coulom wrote:
Sun Apr 28, 2019 2:19 pm
RL has the same problem. Weak moves of early self-play games are rapidly forgotten.

In Go, the alphazero method has a very severe problem with ladders. It is similar to what you describe.

A fundamental flaw of the Alpha Zero approach is that it learns only from games between strong players. When the neural network is used inside the search tree, it often has to find refutations to bad moves that would have never been played between strong players.

I have not yet found a good way to overcome this problem. But I will try some ideas to include good refutations to bad moves into the training set.
Lc0 at nodes=0 (or any low value) versus itself, SF10 at limited depth, or one’s own simple net shows up many, many blundering cases. Like watching games from the 1980s. Lc0’s policy is good, but in many cases it isn’t. Significantly, it can have no idea when it is either strongly ahead or strongly behind; I guess these sorts of games just don’t appear in training.
If one had a net trained also on many bad positions, then there’s the Homer Simpson brain-capacity problem: every time I hear something new, a bit of the old drops out the other side.

trulses
Posts: 39
Joined: Wed Dec 06, 2017 4:34 pm

Re: SL vs RL

Post by trulses » Thu May 09, 2019 2:36 pm

chrisw wrote:
Sun Apr 28, 2019 1:57 pm
...

The SL-trained policy gave the king-move evasions normal probabilities, 0.25, 0.12 sort of values, but the obvious Q capture (which would be outright winning) got a probability of 0.005 or so. This happened often enough to warrant an investigation.

I think it’s because SL games are sort of sensible, comp-comp or human: there are virtually no positions generated where a checking queen is simply left en prise, so no sensible game contains the continuation where xQ would get reinforced in training. Hence xQ is not favored in the policy.
Seems like the entropy of your policy is too low; try training with entropy regularization. If you have the test position, an immediate band-aid fix is to find which softmax temperature helps PUCT try this move earlier in your search. Your value net should enjoy the resulting position when up a queen, so this move should eventually prove very convincing to PUCT, independent of your policy net.
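
For concreteness, a sketch of both suggestions in Python/PyTorch (framework choice and hyper-parameters are assumptions, nothing from the thread):

Code: Select all

import torch
import torch.nn.functional as F

# 1) Entropy regularization: add an entropy bonus to the policy loss so the
#    trained distribution does not collapse onto the "sensible" moves only.
def policy_loss_with_entropy(logits, target_probs, beta=0.01):
    log_p = F.log_softmax(logits, dim=-1)
    cross_entropy = -(target_probs * log_p).sum(dim=-1).mean()
    entropy = -(log_p.exp() * log_p).sum(dim=-1).mean()
    return cross_entropy - beta * entropy   # beta tunes how much spread to keep

# 2) The search-time band-aid: soften the priors with a temperature before
#    handing them to PUCT. Raising probabilities to 1/T and renormalizing is
#    the same as softmax(logits / T); T = 1 reproduces the raw policy.
def soften_priors(priors, temperature=1.5):
    p = torch.as_tensor(priors, dtype=torch.float32)
    p = p.clamp_min(1e-12) ** (1.0 / temperature)
    return (p / p.sum()).tolist()

# soften_priors([0.25, 0.12, 0.005], temperature=1.5) -> roughly [0.59, 0.36, 0.04]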

trulses
Posts: 39
Joined: Wed Dec 06, 2017 4:34 pm

Re: SL vs RL

Post by trulses » Thu May 09, 2019 2:43 pm

Rémi Coulom wrote:
Sun Apr 28, 2019 2:19 pm
RL has the same problem. Weak moves of early self-play games are rapidly forgotten.

In Go, the alphazero method has a very severe problem with ladders. It is similar to what you describe.

A fundamental flaw of the Alpha Zero approach is that it learns only from games between strong players. When the neural network is used inside the search tree, it often has to find refutations to bad moves that would have never been played between strong players.

I have not yet found a good way to overcome this problem. But I will try some ideas to include good refutations to bad moves into the training set.
What have you tried so far to address this? Do you use a large replay buffer? Have you tried playing against older agents? RL should give you a nice spectrum from weak to strong agents to play against.

chrisw
Posts: 1415
Joined: Tue Apr 03, 2012 2:28 pm

Re: SL vs RL

Post by chrisw » Thu May 09, 2019 3:16 pm

trulses wrote:
Thu May 09, 2019 2:36 pm
chrisw wrote:
Sun Apr 28, 2019 1:57 pm
...

The SL-trained policy gave the king-move evasions normal probabilities, 0.25, 0.12 sort of values, but the obvious Q capture (which would be outright winning) got a probability of 0.005 or so. This happened often enough to warrant an investigation.

I think it’s because SL games are sort of sensible, comp-comp or human: there are virtually no positions generated where a checking queen is simply left en prise, so no sensible game contains the continuation where xQ would get reinforced in training. Hence xQ is not favored in the policy.
Seems like the entropy of your policy is too low; try training with entropy regularization. If you have the test position, an immediate band-aid fix is to find which softmax temperature helps PUCT try this move earlier in your search. Your value net should enjoy the resulting position when up a queen, so this move should eventually prove very convincing to PUCT, independent of your policy net.
I have so many things to fix that I applied a quick kludge to this and moved on. The basis of the kludge is that the policy score is inaccurate anyway, but inaccuracies where p is close to zero are potentially catastrophic, since zero is not a good multiplier. The kludge was to add an absolute 0.005 to p before using it in (my version of) PUCT; that way moves don’t get completely overlooked.
I also have ideas to handcraft move selection while waiting for a policy batch, and/or a handcrafted term to add to p that decreases with visits. All on the todo list.
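
Read literally, the two kludges look roughly like this (a Python sketch with illustrative constants, not the actual code):

Code: Select all

import math

PRIOR_FLOOR = 0.005   # the absolute bump added to every prior

def u_term(p, parent_visits, child_visits, c_puct=1.5):
    # Floor the prior before it multiplies the exploration term, so no legal
    # move's U term can collapse all the way to zero.
    return c_puct * (p + PRIOR_FLOOR) * math.sqrt(parent_visits) / (1 + child_visits)

def handcrafted_bonus(is_winning_capture, child_visits, bonus=0.02):
    # A handcrafted nudge for moves a simple rule likes (e.g. a SEE-winning
    # capture), fading as real visits accumulate.
    return (bonus if is_winning_capture else 0.0) / (1 + child_visits)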
