[Computer-go] AlphaGo Zero self-play temperature


Imran Hendley

Nov 7, 2017, 1:26:56 PM
to computer-go
Hi, I might be having trouble understanding the self-play policy for AlphaGo Zero. Can someone let me know if I'm on the right track here?

The paper states:

In each position s, an MCTS search is executed, guided by the neural network f_θ . The
MCTS search outputs probabilities π of playing each move.

This wasn't clear at first since MCTS outputs wins and visits, but later the paper explains further:

MCTS may be viewed as a self-play algorithm that, given neural
network parameters θ and a root position s, computes a vector of search
probabilities recommending moves to play, π = α_θ(s), proportional to
the exponentiated visit count for each move, π_a ∝ N(s, a)^(1/τ), where τ is
a temperature parameter.

So this makes sense, but when I looked for the schedule for decaying the temperature all I found was the following in the Self-play section of Methods:

For the first 30 moves of each game, the temperature is set to τ = 1; this
selects moves proportionally to their visit count in MCTS, and ensures a diverse
set of positions are encountered. For the remainder of the game, an infinitesimal
temperature is used, τ → 0.

This sounds like they sample moves in proportion to visit counts for the first 30 moves, since τ = 1 makes the exponent go away. After that they play the move with the most visits: in the expression π(a | s_0) = N(s_0, a)^(1/τ) / ∑_b N(s_0, b)^(1/τ), the probability of the most-visited move goes to 1 and the probability of every other move goes to 0 as τ goes to 0 from the right.

Am I understanding this correctly? I am confused because it seems a little convoluted to define this simple policy in terms of a temperature. When they mentioned temperature I was expecting something that slowly decays over time rather than only taking two trivial values. 

Thanks!
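
The two regimes described in the post above can be sketched in a few lines of Python. This is a minimal illustration, not code from the paper: the function names, the 0-based `move_number`, the even split on ties at τ = 0, and the use of the standard-library `random` module are all assumptions made for the example.

```python
import math
import random


def search_probabilities(visit_counts, tau):
    """Return pi with pi_a proportional to N(s, a)^(1/tau).

    tau == 0 stands for the limit tau -> 0+: all mass goes to the
    most-visited move, split evenly on ties (an assumption of this sketch).
    """
    counts = [float(n) for n in visit_counts]
    top = max(counts)
    if top <= 0:
        raise ValueError("at least one move must have been visited")
    if tau == 0:
        best = [1.0 if n == top else 0.0 for n in counts]
        total = sum(best)
        return [b / total for b in best]
    # N^(1/tau) = exp(log(N) / tau); subtracting log(top) keeps every
    # exponent <= 0, so a small tau cannot overflow.
    log_top = math.log(top)
    weights = [
        math.exp((math.log(n) - log_top) / tau) if n > 0 else 0.0
        for n in counts
    ]
    total = sum(weights)
    return [w / total for w in weights]


def select_move(visit_counts, move_number, rng, exploration_moves=30):
    """Sample with tau = 1 for the first 30 moves, then play the most-visited move."""
    tau = 1.0 if move_number < exploration_moves else 0.0
    pi = search_probabilities(visit_counts, tau)
    return rng.choices(range(len(pi)), weights=pi, k=1)[0]


if __name__ == "__main__":
    rng = random.Random(0)
    counts = [10, 30, 60]
    print(search_probabilities(counts, 1.0))   # proportional to visit counts
    print(search_probabilities(counts, 0))     # all mass on the most-visited move
    print(select_move(counts, move_number=45, rng=rng))
```

With τ = 1 the result is just the normalized visit counts, and with τ = 0 it is a one-hot vector on the most-visited move, which matches the reading given in the post.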

Álvaro Begué

Nov 7, 2017, 2:19:02 PM
to computer-go
Your understanding matches mine. My guess is that they had a temperature parameter in the code that would allow for things like slowly transitioning from random sampling to deterministically picking the maximum, but they ended up using only those particular values.

Álvaro.
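
To make the guess above concrete, here is a sketch of what a temperature parameter could allow. Only `step_temperature` corresponds to the schedule quoted from the paper in this thread; `annealed_temperature` is purely hypothetical (a linear decay chosen for illustration), and both function names and the 0-based `move_number` are assumptions.

```python
def step_temperature(move_number, exploration_moves=30):
    """Schedule quoted from the paper: tau = 1 for the first 30 moves, then tau -> 0."""
    return 1.0 if move_number < exploration_moves else 0.0


def annealed_temperature(move_number, start=1.0, end=0.0, decay_moves=30):
    """Hypothetical alternative: decay tau linearly from `start` to `end`.

    Not described in the quoted paper text; shown only to illustrate the
    kind of gradual transition a temperature parameter would make possible.
    """
    if move_number >= decay_moves:
        return end
    fraction = move_number / decay_moves
    return start + (end - start) * fraction


if __name__ == "__main__":
    for move in (0, 15, 29, 30, 60):
        print(move, step_temperature(move), annealed_temperature(move))
```

Either function could feed the exponent 1/τ in π_a ∝ N(s, a)^(1/τ), with τ = 0 handled as "pick the most-visited move".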





uurtamo .

Nov 7, 2017, 2:39:46 PM
to computer-go
If I understand your question correctly, "goes to 1" can happen as quickly or slowly as you'd like. Yes?

Gian-Carlo Pascutto

Nov 7, 2017, 3:14:15 PM
to compu...@computer-go.org
On 7/11/2017 19:07, Imran Hendley wrote:
> Am I understanding this correctly?

Yes.

It's possible they had in-betweens or experimented with variations at
some point, then settled on the simplest case. You can vary the
randomness if you define it as a softmax with varying temperature;
that's harder if you only define the policy as "select best" or "select
proportionally".

--
GCP
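
The softmax view mentioned above can be written out explicitly. The following is a short derivation from the formula quoted earlier in the thread, assuming N(s, a) > 0 for the moves considered (a move with zero visits gets probability 0 for any τ > 0):

```latex
\[
\pi_a
  = \frac{N(s,a)^{1/\tau}}{\sum_b N(s,b)^{1/\tau}}
  = \frac{\exp\bigl(\log N(s,a)/\tau\bigr)}{\sum_b \exp\bigl(\log N(s,b)/\tau\bigr)}
  = \operatorname{softmax}\!\left(\frac{\log N(s,\cdot)}{\tau}\right)_{a}
\]
\[
\tau = 1:\quad \pi_a = \frac{N(s,a)}{\sum_b N(s,b)},
\qquad
\tau \to 0^{+}:\quad
\pi_a \to
\begin{cases}
1 & \text{if } a = \arg\max_b N(s,b) \text{ (unique maximum)},\\
0 & \text{otherwise.}
\end{cases}
\]
```

So the two values used in the paper are the τ = 1 case and the τ → 0⁺ limit of a softmax over log visit counts, and any τ in between interpolates between proportional sampling and picking the most-visited move.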

Imran Hendley

Nov 7, 2017, 4:54:11 PM
to computer-go
Great, thanks guys!

uurtamo .

Nov 7, 2017, 5:49:39 PM
to computer-go
It's interesting to leave unused parameters or unnecessary parameterizations in the paper. It telegraphs what was being tried as opposed to simply writing something more concise and leaving the reader to wonder why and how those decisions were made.

s.