While this is self-evidently fine when it comes to choosing the best move, I don't think the same can be said with regard to the effectiveness of training. Training is done with back-propagation: the larger the discrepancy between Leela's value and the actual outcome, the larger the adjustment to her network weights. The basic principle is, "the worse the network did at understanding the position, the more its weights should be nudged". Following that line of reasoning, it seems there might be a powerful optimization to be made: instead of having the value head return a single number, have it return three numbers: the probability of winning (Pw), the probability of a draw (Pd), and the probability of losing (Pl). Or, alternatively (keeping with the architecture of 1 value head = 1 output value), have 3 independent value heads: one to assess the probability of a win, one the probability of a draw, and one the probability of a loss (with a wrapper function to normalize the 3 outputs such that Pw + Pd + Pl = 1.00).
The three values would be used separately only during training; during play (when all that matters is the tree search to find the best move), they would be combined into a single value V between 0 and 1 using the formula:
V=(Pw * 1) + (Pd * 0.5) + (Pl * 0)
So from a "choose best move" perspective, a node evaluated as [win=0.2, draw = 0.5, loss = 0.3] would have the same value as [win=0.45, draw = 0.0, loss = 0.55] (both would be valued at 0.45)
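As a rough illustration of the wrapper and the formula above, here is a minimal Python sketch. The function names `normalize_wdl` and `expected_value` are made up for this example, and using a softmax for the normalization step is my assumption; the proposal only requires that the three outputs sum to 1.

```python
import math


def normalize_wdl(raw_win, raw_draw, raw_loss):
    """Map three raw outputs to probabilities that sum to 1.

    Softmax is an assumption here; any normalization with Pw+Pd+Pl = 1 would do.
    """
    raw = (raw_win, raw_draw, raw_loss)
    peak = max(raw)  # subtract the max before exponentiating, for numerical stability
    exps = [math.exp(x - peak) for x in raw]
    total = sum(exps)
    return tuple(e / total for e in exps)


def expected_value(p_win, p_draw, p_loss):
    """Collapse (Pw, Pd, Pl) into the single value V used during search."""
    return p_win * 1.0 + p_draw * 0.5 + p_loss * 0.0


# The two example nodes collapse to the same V.
v_mixed = expected_value(0.2, 0.5, 0.3)
v_sharp = expected_value(0.45, 0.0, 0.55)
print(round(v_mixed, 2), round(v_sharp, 2))
```

Both calls give 0.45 after rounding, which is the point of the example: the search cannot tell the two nodes apart once they are reduced to V.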
However, from a training perspective, there would be a huge difference in the accuracy of [win=0.2, draw = 0.5, loss = 0.3] vs [win=0.45, draw = 0.0, loss = 0.55] for each outcome (where the overall MSE is the mean of the squared errors of the 3 separate w/d/l predictions):
If Win occurs, result treated as [win=1,draw=0,loss=0]:
MSE of [win=0.2, draw = 0.5, loss = 0.3] = ((1-0.2)^2 + (0-0.5)^2 + (0-0.3)^2) / 3 = 0.33
MSE of [win=0.45, draw = 0.0, loss = 0.55] = ((1-0.45)^2 + (0-0.0)^2 + (0-0.55)^2) / 3 = 0.20
If Draw occurs, result treated as [win=0,draw=1,loss=0]:
MSE of [win=0.2, draw = 0.5, loss = 0.3] = ((0-0.2)^2 + (1-0.5)^2 + (0-0.3)^2) / 3 = 0.13
MSE of [win=0.45, draw = 0.0, loss = 0.55] = ((0-0.45)^2 + (1-0.0)^2 + (0-0.55)^2) / 3 = 0.50 (horrible prediction!)
If Loss occurs, result treated as [win=0,draw=0,loss=1]:
MSE of [win=0.2, draw = 0.5, loss = 0.3] = ((0-0.2)^2 + (0-0.5)^2 + (1-0.3)^2) / 3 = 0.26
MSE of [win=0.45, draw = 0.0, loss = 0.55] = ((0-0.45)^2 + (0-0.0)^2 + (1-0.55)^2) / 3 = 0.14
With the single-value perspective, by contrast, both predictions collapse to V = 0.45, so they are scored as exactly as accurate as each other, whether the outcome is a win, a draw, or a loss.
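The arithmetic above can be reproduced with a few lines of Python. This is only a sketch for checking the numbers: `wdl_mse` and `scalar_se` are hypothetical helper names, and scoring the single-value head by squared error against targets of 1, 0.5, and 0 is my assumption about how the comparison would be set up.

```python
# One-hot targets for the three possible game results.
WIN, DRAW, LOSS = (1, 0, 0), (0, 1, 0), (0, 0, 1)
# Targets a single-value head would be trained against (assumed: 1 / 0.5 / 0).
SCALAR_TARGET = {WIN: 1.0, DRAW: 0.5, LOSS: 0.0}


def wdl_mse(pred, outcome):
    """Mean of the three squared errors between a (Pw, Pd, Pl) prediction and the result."""
    return sum((o - p) ** 2 for p, o in zip(pred, outcome)) / 3


def scalar_se(pred, outcome):
    """Squared error when the prediction is first collapsed to V = Pw + 0.5 * Pd."""
    p_win, p_draw, _ = pred
    v = p_win + 0.5 * p_draw
    return (SCALAR_TARGET[outcome] - v) ** 2


mixed = (0.2, 0.5, 0.3)    # [win=0.2, draw=0.5, loss=0.3]
sharp = (0.45, 0.0, 0.55)  # [win=0.45, draw=0.0, loss=0.55]

for name, outcome in (("win", WIN), ("draw", DRAW), ("loss", LOSS)):
    print(
        name,
        "wdl:", round(wdl_mse(mixed, outcome), 3), round(wdl_mse(sharp, outcome), 3),
        "scalar:", round(scalar_se(mixed, outcome), 4), round(scalar_se(sharp, outcome), 4),
    )
```

In every row the two "scalar" numbers are identical, while the two "wdl" numbers differ, most sharply for the draw.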
Conceptually, this ties to the fact that high-skill players recognize a difference between a "roughly equal" game and a "drawn game". By being limited to picking a single number from 0 to 1, Leela's neural network cannot be assessed for its ability to make such a distinction; she's only trained to recognize that neither white nor black has any real advantage. I'd argue that, all things being equal, a NN that can predict win/loss but can't tell the 50/50 of the starting position apart from the 50/50 of K+pawn vs K+pawn simply doesn't have the same depth of understanding of the game as a NN that can do both; and the deeper the overall understanding of the game, the better the NN will be at finding the best moves.
I also think it's interesting to consider how such values could be incorporated into play. For example, Leela could be fed a parameter that tells her to play aggressively (choosing moves with slightly lower value but higher chance of win vs draw) or to play for a draw (treating win & draw as equal and just minimizing loss-probability). She could also have a parameter that lets her make (and accept or decline) draw offers intelligently. She could even provide more interesting assessments of human-human games by returning not just "who is winning" but also evaluating the overall sharpness of the position (e.g. equal likelihood of black or white winning but with low probability of draw).
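To make the "play aggressively" vs "play for a draw" idea a bit more concrete, here is one way such a parameter could look. Everything in it is hypothetical: the `draw_weight` parameter, the `move_score` and `sharpness` helpers, and the choice of `1 - Pd` as a sharpness measure are illustrations of the idea, not anything Leela actually has.

```python
def move_score(p_win, p_draw, p_loss, draw_weight=0.5):
    """Score a node from (Pw, Pd, Pl) with an adjustable value for a draw.

    draw_weight = 0.5 reproduces V = Pw + 0.5 * Pd.
    draw_weight = 1.0 treats win and draw as equal, i.e. minimizes Pl.
    draw_weight < 0.5 prefers winning chances over drawing chances.
    p_loss is implied by the other two (they sum to 1), so it is not used directly.
    """
    return p_win + draw_weight * p_draw


def sharpness(p_win, p_draw, p_loss):
    """Crude sharpness measure (an assumption): probability the game is decisive."""
    return 1.0 - p_draw


mixed = (0.2, 0.5, 0.3)
sharp = (0.45, 0.0, 0.55)

for weight in (0.25, 0.5, 1.0):
    print(weight, round(move_score(*mixed, draw_weight=weight), 3),
          round(move_score(*sharp, draw_weight=weight), 3))
```

With the default weight the two nodes tie; playing for a draw (weight 1.0) prefers the first node, and playing aggressively (weight 0.25) prefers the second, even though a single-value head would rate them identically.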