Learning to favor legal moves

Hanan

May 16, 2018, 9:55:53 AM
to LCZero
If I understand correctly, the neural net outputs probabilities for all possible piece moves on the board. The application then zeros out the illegal moves and (optionally) renormalizes over the remaining legal moves to choose from.
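For concreteness, a minimal numpy sketch of that play-time filtering (an illustration of the idea only, not lc0's actual code; the 0/1 mask layout is an assumption):

import numpy as np

# policy: NN output probabilities over all encoded moves
# legal_mask: 1.0 where a move is legal in this position, 0.0 elsewhere
def filter_to_legal(policy, legal_mask):
    p = policy * legal_mask  # zero out the illegal moves
    return p / p.sum()       # renormalize over the legal moves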

My question is: does the network get better at choosing legal moves, i.e. does it learn specifically to play chess, or is it just that, among the remaining legal moves, the good ones are ranked higher?
In other words, as learning progresses, do the legal moves' probabilities get higher relative to all moves, or are we just seeing an improved ordering of the legal moves?

Alexander Lyashuk

May 16, 2018, 10:02:36 AM
to Hanan Rosemarin, LCZero
During training, we train legal moves toward the probabilities computed by MCTS in the training games, and train illegal moves toward zero probability.
And indeed, as a result the NN returns very low probabilities for illegal moves, so most of the time they would probably not be considered for play even if we didn't zero out those probabilities during play.
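As a concrete sketch of such a training target (the 1858-entry encoding matches lc0's policy head, but the indices and counts below are made up for illustration):

import numpy as np

NUM_MOVES = 1858                            # size of lc0's policy output
visit_counts = np.zeros(NUM_MOVES)          # MCTS visit counts per move
legal_indices = [100, 205, 871]             # hypothetical legal moves
visit_counts[legal_indices] = [40, 10, 50]  # hypothetical search counts

# Target: visit counts normalized over the legal moves,
# exactly zero everywhere else, i.e. at every illegal move.
policy_target = visit_counts / visit_counts.sum()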


Harry M

May 16, 2018, 8:52:01 PM
to LCZero
Isn't that very, very wasteful?

The neural network should only have to consider legal moves, not waste time and resources thinking about illegal moves (during training or play).

From reading the AlphaZero paper, and from talking to one of the people at DeepMind (whose team is now working on a StarCraft engine using the same hyperparameters and techniques as AlphaZero), I believe that the neural network should NOT be considering illegal moves AT ALL.

Am I missing something?

Trevor G

May 16, 2018, 10:09:20 PM
to Harry M, LCZero
The NN doesn't waste time thinking about illegal moves.

You have to understand a little bit about how the operations in the network work, specifically vector/tensor arithmetic. I’ll show some of the math here:

Policy = softmax(PolicyActivation)
PolicyLoss = -log(Policy)*PolicyTarget

You don't really need to understand what this means. The point is, this is what currently determines how the network is trained from a policy target. This is 100% what A0 does too; search the paper for a "log" term and you'll find it.
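In code, that loss looks roughly like this (a numpy sketch, not the actual training code; computing log-softmax directly keeps it stable even for moves whose probability underflows to zero):

import numpy as np

def log_softmax(x):
    z = x - x.max()                      # shift for numerical stability
    return z - np.log(np.exp(z).sum())

# policy_activation: raw NN outputs (logits)
# policy_target: MCTS-derived probabilities, zero at illegal moves
def policy_loss(policy_activation, policy_target):
    return -(log_softmax(policy_activation) * policy_target).sum()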

If we wanted to train it to completely ignore illegal moves during training (ie *not* push them down), then the equations might look something like this:

PolicyMask = vector that is large at illegal moves and zero at legal ones
MaskedPolicyActivation = PolicyActivation - PolicyMask
Policy = softmax(MaskedPolicyActivation)
PolicyLoss = -log(Policy)*PolicyTarget

(I think there may be other ways to mask the policy so training won't affect illegal-move weights, but this is the easiest way I could think of to do it in a numerically stable way.)
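A numpy sketch of that masked version (the constant BIG and the 0/1 mask layout are assumptions for illustration):

import numpy as np

def log_softmax(x):                      # same helper as the sketch above
    z = x - x.max()
    return z - np.log(np.exp(z).sum())

BIG = 1e9  # large enough that masked logits get ~zero softmax mass

# legal_mask: 1.0 at legal moves, 0.0 at illegal ones
def masked_policy_loss(policy_activation, policy_target, legal_mask):
    masked = policy_activation - BIG * (1.0 - legal_mask)
    # Illegal moves now carry ~zero softmax mass and a zero target, so
    # they add nothing to the loss and pass ~no gradient to their weights.
    return -(log_softmax(masked) * policy_target).sum()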

Long story short... it would take *more* math to ignore illegal moves during training than it does to train them down to zero.

Hanamuke

May 17, 2018, 3:41:05 AM
to LCZero
I think the idea was that instead of setting the policy value to 0 for illegal moves (training them down), we would set the training policy gradient to zero for them (i.e., not care about their predicted values).
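A hand-written numpy sketch of that idea (not Leela's code): for softmax cross-entropy with a target that sums to one, the gradient with respect to the logits is simply softmax(x) - target, so masking the gradient amounts to:

import numpy as np

def softmax(x):
    e = np.exp(x - x.max())  # shift for numerical stability
    return e / e.sum()

# legal_mask: 1.0 at legal moves, 0.0 at illegal ones
def masked_policy_gradient(policy_activation, policy_target, legal_mask):
    grad = softmax(policy_activation) - policy_target  # d(loss)/d(logits)
    return grad * legal_mask  # illegal entries pass no gradient back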

Hanamuke

May 17, 2018, 3:42:06 AM
to LCZero
What is wasted is not computation time; it is learning capacity of the network.

Harry M

May 17, 2018, 10:23:12 AM
to LCZero
@Trevor, @Hanamuke, Thanks for the explanation. What is the problem with doing a bit more math to ignore illegal moves?

Trevor G

May 17, 2018, 11:40:11 AM
to Harry M, LCZero
Nothing inherently wrong with it... the extra calculations are pretty negligible, really. Though another thing that would need to change is that the client would need to send back training data that includes information about legal/illegal moves (changing the training data format is not a small change).

Anyway, I was mostly showing that when Alexander mentioned illegal moves being trained to zero, it wasn't actually extra computation he was referring to.

That said, there's no saying for sure what would happen if the masking mechanism I mentioned were implemented. One potential problem is that if a move is rarely legal, then all the training steps that happen while it's illegal might leave that activation noisy in the positions where it actually is legal. There's also regularization: if you're blocking illegal moves from passing gradients, then it probably makes sense to remove the regularization for those activations when the moves are illegal. This part isn't as trivial, but it's certainly doable.
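If the regularization in question were an L2-style penalty touching the policy activations (an assumption here; in practice it may live on the weights instead), masking it the same way might look like:

# legal_mask: 1.0 at legal moves, 0.0 at illegal ones
def masked_activation_penalty(policy_activation, legal_mask, coeff=1e-4):
    # only legal-move activations pay the penalty in this position
    return coeff * ((policy_activation * legal_mask) ** 2).sum()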

It could be that training illegal moves down to zero helps the network understand the board position better. Or it could be that masking illegal moves helps to remove an "illegal move bias". I did some experimentation a while ago with Othello and Connect Four in a much simpler framework, and found some evidence of both. But what I found there could be completely meaningless for Leela (which generates its training games with MCTS, which I was not doing).


