Policy head / Value head


Huragan

Mar 24, 2018, 8:38:29 AM
to LCZero
Hello, I still cannot understand the reason for using both a Policy and a Value head for position evaluation in the A0 / LC0 projects. I am sure there is a good reason, but I still cannot see why the Policy head is necessary.
If you can evaluate a position P by its Value and you know which moves are legal, you can simply select the best move based on the Values alone. So couldn't the Policy head be replaced by just choosing the next position with the relatively best Value?

jkiliani

Mar 24, 2018, 11:14:34 AM
to LCZero
You need both, because with only the value head you would have a node evaluation but no idea which nodes to expand next. Without knowing which nodes to search (i.e. some degree of node pruning) you won't get any reasonable search depth. That's why the dual head is important: the policy head directs the search while the value head gives the evaluations.
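
For concreteness, a rough Python sketch of how the two heads feed a PUCT-style selection step (the node/child fields here are made-up names, not LC0's actual code): the policy head supplies the prior for each move, while the value head supplies the Q that gets backed up from evaluated positions.

import math

def puct_select(node, c_puct=1.5):
    # Pick the child maximizing Q + U: Q is the average of backed-up
    # value-head evaluations, U is an exploration bonus weighted by the
    # policy-head prior, so unvisited-but-promising moves still get tried.
    total_visits = sum(child.visits for child in node.children)
    def score(child):
        q = child.value_sum / child.visits if child.visits else 0.0
        u = c_puct * child.prior * math.sqrt(total_visits) / (1 + child.visits)
        return q + u
    return max(node.children, key=score)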

luis....@gmail.com

Mar 24, 2018, 11:50:55 AM
to LCZero
You need both, because with only the value head you would have a node evaluation but no idea which nodes to expand next.

Well, as he said, you can just expand the nodes that the value network evaluates as best, right?

jkiliani

Mar 24, 2018, 11:54:49 AM
to LCZero
You misunderstand me. The output of the value head is not a distribution over nodes, but a winrate. I.e. the value head will give you an output of 0.7, which could mean roughly a 50% chance of winning, 40% of drawing and 10% of losing from that position. The value head does NOT say which candidate moves are a good idea to play next. That is the job of the policy head.
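
(Quick check of those numbers, assuming a win scores 1, a draw 0.5 and a loss 0: 0.5·1 + 0.4·0.5 + 0.1·0 = 0.7.)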

Huragan

Mar 24, 2018, 1:38:14 PM
to LCZero


On Saturday, March 24, 2018 at 4:14:34 PM UTC+1, jkiliani wrote:
You need both, because with only the value head you would have a node evaluation but no idea which nodes to expand next. Without knowing which nodes to search (i.e. some degree of node pruning) you won't get any reasonable search depth. That's why the dual head is important: the policy head directs the search while the value head gives the evaluations.

Still don't understand why we need both. The Policy head tells which nodes should be expanded. But aren't those just the next moves with a relatively high winrate (Value head)? Say we have a position with winrate 0.5 and there are 5 legal moves from this position whose resulting positions have winrates 0.51, 0.55, 0.65, 0.4 and 0.49. Isn't just selecting the node with winrate 0.65 a good pruning policy (without any need for a Policy head)?

Alexander Lyashuk

Mar 24, 2018, 3:53:26 PM
to zdene...@gmail.com, LCZero
That's correct, just having one value network and no policy network is kind of enough (the policy network could be "emulated" by expanding all possible moves).
But a policy network with many outputs is used instead, for performance reasons (it means far fewer network evaluations).

Also, in the original AlphaGo the policy network was a separate net (unlike now), and it was much smaller (for performance).
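
To illustrate what that "emulation" would look like (hypothetical position/value_net API, just for the sketch):

def emulated_policy(position, value_net):
    # Score each legal move by evaluating the resulting child position with
    # the value net: one network evaluation per legal move (~30-40 in a
    # typical chess position), whereas a real policy head returns priors for
    # all moves in the same single evaluation that produces the value.
    scores = {}
    for move in position.legal_moves():
        child = position.play(move)
        scores[move] = 1.0 - value_net.evaluate(child)  # flip to mover's view
    return scores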



Huragan

Mar 24, 2018, 4:52:03 PM
to LCZero
Thanks for the answer, makes sense.

Huragan

Mar 24, 2018, 7:34:20 PM
to LCZero
Hm, now this came to my mind - do we really need the Value head if we have the Policy head? Suppose the Policy head of a given position is a vector where, for each move, the resulting position is evaluated. Then the Value head would not be needed, as it is equivalent to the maximal number in the Policy head vector. It seems a little weird to me that the net is trained both to evaluate the current position and to evaluate subsequent moves / subsequent positions (= moves applied to the current position).


On Saturday, March 24, 2018 at 8:53:26 PM UTC+1, Alexander Lyashuk wrote:

Andy Olsen

Mar 24, 2018, 10:42:42 PM
to LCZero
Policy: From this position I think moves A, B, and C look promising.
Value: I think this position as a whole is +0.34 for White.

Policy can look at every legal move at once and tell you that A, B, and C are promising. But it cannot tell you how much the position eval will change after making a move.
Value can focus on a single position and tell you, with (hopefully) some precision, what the eval is.

These are very different things. 


On Saturday, March 24, 2018 at 6:34:20 PM UTC-5, Huragan wrote:

Folkert Huizinga

Mar 25, 2018, 2:01:22 AM
to LCZero
Adding to that: Combining them as two heads into a single network provides a strong regularizer during training.
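
For reference, the joint objective in the AlphaZero paper is value MSE plus policy cross-entropy plus L2 weight decay; a small PyTorch-style sketch (illustrative names, not LC0's actual training code):

import torch.nn.functional as F

def combined_loss(value_out, policy_logits, z, pi, params, c=1e-4):
    # (z - v)^2: value head regressed towards the game outcome z.
    value_loss = F.mse_loss(value_out, z)
    # -pi^T log p: policy head cross-entropy against the search probabilities pi.
    policy_loss = -(pi * F.log_softmax(policy_logits, dim=1)).sum(dim=1).mean()
    # c * ||theta||^2: L2 regularization of the shared parameters.
    l2 = c * sum((p ** 2).sum() for p in params)
    # Both heads share one trunk, so each loss term regularizes the other.
    return value_loss + policy_loss + l2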

Huragan

Mar 25, 2018, 4:56:22 AM
to LCZero
While the Value head is a single number, how exactly is the Policy head represented in LC0? I assume it is an n-tuple of numbers evaluating each move. A very naive representation would be a 64×63-tuple (defining the start and end square of a move, covering all legal moves, but with the vast majority of moves defined this way being illegal). I expect you use something more sophisticated.

Alexander Lyashuk

Mar 25, 2018, 5:25:56 AM
to Zdenek Herman, LCZero
It outputs a probability for all possible from-to pairs, i.e. for each of the 64 squares there is a probability for every geometrically possible rook-like, bishop-like and knight-like move (the same from-to pairs are reused for king/queen/pawn moves and castling).
Plus, for every pawn underpromotion there is an additional probability ([underpromotion to knight/bishop/rook] × [8 pawn moves without capture + 7 captures to the left + 7 captures to the right]).
That's 4672 possible moves in total.

The policy network outputs weights for all those moves, and then the values for all invalid moves are zeroed outside of the neural net.

That's what the AlphaZero paper says. I'm not sure that lc0 has exactly the same representation, but it probably does.
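
A small sketch of that masking step, treating the policy output as a flat vector of length 4672 (64 from-squares × 73 move types in the AlphaZero encoding); the move-to-index mapping is assumed and not shown:

import numpy as np

def masked_policy(policy_logits, legal_indices):
    # Zero the outputs for moves that are illegal in this position and
    # renormalize over the legal ones, outside of the neural net.
    probs = np.zeros_like(policy_logits)
    legal = np.asarray(legal_indices)
    logits = policy_logits[legal]
    exp = np.exp(logits - logits.max())   # softmax over the legal moves only
    probs[legal] = exp / exp.sum()
    return probs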


On Sun, Mar 25, 2018 at 10:56 AM Huragan <zdene...@gmail.com> wrote:
While the Value head is a single number, how exactly is the Policy head represented in LC0? I assume it is an n-tuple of numbers evaluating each move. A very naive representation would be a 64×63-tuple (defining the start and end square of a move, covering all legal moves, but with the vast majority of moves defined this way being illegal). I expect you use something more sophisticated.


Chris Whittington

Mar 25, 2018, 9:18:05 AM
to LCZero
An important advantage of having a Policy Head and an Eval Head is time.

At an expansion node with Policy and Eval, you basically expend two computational time units, one for the Eval and the other for the Policy.

You could, I guess, get a Policy by asking the NN for an evaluation for each child, and then expand on maxeval. But, since move width in chess is around 30 to 40, this would require 30 to 40 time units to get a Policy, and your node rate would fall from, say 80K nodes per second to 2K nodes per second (using Google TPUs). Using a PC plus GPU, well, you get the idea ....
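
(For the numbers above: 80,000 nodes per second ÷ 40 evaluations per expansion = 2,000 nodes per second.)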

Gian-Carlo Pascutto

Mar 26, 2018, 7:37:59 AM
to lcz...@googlegroups.com
On 24/03/2018 20:52, Alexander Lyashuk wrote:

> Still don't understand why we need both. The Policy head tells which
> nodes should be expanded. But aren't those just the next moves with a
> relatively high winrate (Value head)? Say we have a position with
> winrate 0.5 and there are 5 legal moves from this position whose
> resulting positions have winrates 0.51, 0.55, 0.65, 0.4 and 0.49.
> Isn't just selecting the node with winrate 0.65 a good pruning policy
> (without any need for a Policy head)?

Yes.

Your idea is correct, i.e. that instead of search probabilities, the
policy head could output estimated winrates for each move.

It's easy to see if you consider that the probabilities are currently
a normalized mapping onto the range 0 to 1.0 - but there's no need for
the network output to be this. Nothing prevents you from multiplying it
by the E(v) and rescaling it.

As to why this isn't done:

a) The 2 "heads" are essentially free since computing them is a tiny
fraction of the cost of evaluating the network itself.
b) The formulation of the search algorithm requires a probability (not
an evaluation) in its current form.

Now, that all being said, one big problem of the current algorithm is
what to use as the FPU (first-play urgency). And what you proposed is
exactly a way to calculate and train this.

Thus, one could use this design:

a) policy head producing search probabilities for every move
b) value head producing move evaluations for every move

Now E(v) is indeed max(children), and we can immediately initialize the
FPU of every move, as that is exactly what the value head produces.

Also consider that FPU reductions are there exactly to solve the problem
that we have only one E(v) for the root and a bunch of probabilities for
the moves, which we then have to turn into an E(v) for every move!
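
A minimal sketch of that design with a hypothetical node API (not LC0 code): the value head's per-move outputs directly initialize each child's Q, so no FPU reduction heuristic is needed.

def expand_node(node, policy_priors, move_values):
    # policy_priors: search probabilities from head (a).
    # move_values: per-move evaluations from the proposed head (b).
    for move in node.legal_moves:
        child = node.add_child(move)
        child.prior = policy_priors[move]
        child.q = move_values[move]       # FPU initialized from the value head
    # The parent's E(v) is then simply the best of its children's estimates.
    node.value_estimate = max(move_values[m] for m in node.legal_moves)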

--
GCP

luis....@gmail.com

Mar 28, 2018, 4:12:20 AM
to LCZero
This sounds promising! Any plans to implement that on LCZero?

Huragan

Mar 28, 2018, 2:39:22 PM
to LCZero
I doubt that it will be implemented in the current version of LC0, which is designed to mimic the A0 implementation as closely as possible. Nevertheless, I would be delighted if it were implemented, so that I could contribute to the project at least with that idea, since I have no GPU and cannot contribute to the net training.
BTW, it would be interesting to measure whether the most probable moves also have the highest winrates in the current algorithm implementation. This could be measured for various net generations, and hopefully the difference (inner inconsistency) will decrease over time.
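
One simple way to measure that (a sketch, assuming you already have the policy priors and the value-head winrates of the child positions): rank-correlate the two, e.g. with a Spearman coefficient.

from scipy.stats import spearmanr

def policy_value_consistency(priors, child_winrates):
    # priors: policy-head probabilities for the legal moves of a position.
    # child_winrates: value-head winrates of the positions those moves reach.
    # A coefficient near 1.0 means the most probable moves are also the ones
    # leading to the highest winrates; tracking this per net generation would
    # show whether the inconsistency shrinks over time.
    rho, _ = spearmanr(priors, child_winrates)
    return rho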