NN doesn’t waste time thinking about illegal moves.
To follow this you need to understand a little about how the operations in the network work, specifically vector/tensor arithmetic. I'll show some of the math here:
Policy = softmax(PolicyActivation)
PolicyLoss = -log(Policy)*PolicyTarget
You don't really need to understand what this means. The point is, this is what currently determines how the network is trained from a policy target. This is 100% what A0 does too. Search the paper for a "log" term and you'll find it.
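The two equations above can be sketched in a few lines of NumPy. This is a minimal illustration, not the actual training code; the 4-move position and the target values are made up for the example:

```python
import numpy as np

def softmax(x):
    # Shift by the max before exponentiating for numerical stability.
    e = np.exp(x - np.max(x))
    return e / e.sum()

# Hypothetical 4-move position: raw policy-head outputs and a
# visit-count-derived target (the last move got no visits).
policy_activation = np.array([2.0, 1.0, 0.5, -1.0])
policy_target = np.array([0.7, 0.2, 0.1, 0.0])

policy = softmax(policy_activation)
# Cross-entropy: PolicyLoss = -log(Policy) * PolicyTarget, summed over moves.
policy_loss = -np.sum(policy_target * np.log(policy))
```

Note that the loss only "sees" moves where the target is nonzero, but the gradient of the softmax still pushes every other activation (including illegal moves) down toward zero probability.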
If we wanted to train it to completely ignore illegal moves during training (ie *not* push them down), then the equations might look something like this:
PolicyMask = vector that is a large constant at illegal-move indices and zero elsewhere.
MaskedPolicyActivation = PolicyActivation - PolicyMask
Policy = softmax(MaskedPolicyActivation)
PolicyLoss = -log(Policy)*PolicyTarget
(I think there may be other ways to mask the policy so training won't affect illegal move weights, but this is the easiest way I could think of to do it in a numerically stable way.)
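The masked version might look like the sketch below. Again, the position and mask values are invented for illustration; note that the log of the softmax has to be computed via log-sum-exp rather than `log(softmax(...))`, since the masked probabilities underflow to exactly zero in floating point:

```python
import numpy as np

def log_softmax(x):
    # log(softmax(x)) computed stably via the log-sum-exp trick.
    x = x - np.max(x)
    return x - np.log(np.sum(np.exp(x)))

# Same hypothetical 4-move position; suppose the last two moves are illegal.
policy_activation = np.array([2.0, 1.0, 0.5, -1.0])
policy_target = np.array([0.7, 0.3, 0.0, 0.0])

# PolicyMask: large constant at illegal moves, zero elsewhere.
policy_mask = np.array([0.0, 0.0, 1e9, 1e9])
masked_policy_activation = policy_activation - policy_mask

log_policy = log_softmax(masked_policy_activation)
policy_loss = -np.sum(policy_target * log_policy)
```

With the mask in place the illegal moves get probability ~0 before the loss is computed, so the gradient through the softmax at those indices (probability minus target, i.e. ~0 - 0) vanishes and training leaves their weights alone.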
Long story short... it would take *more* math to ignore illegal moves during training than it does to train them down to zero.