In each position s, an MCTS search is executed, guided by the neural network f_θ. The MCTS search outputs probabilities π of playing each move.

MCTS may be viewed as a self-play algorithm that, given neural network parameters θ and a root position s, computes a vector of search probabilities recommending moves to play, π = α_θ(s), proportional to the exponentiated visit count for each move, π_a ∝ N(s, a)^(1/τ), where τ is a temperature parameter.
This sounds like they are sampling proportionally to visit counts for the first 30 moves, since τ = 1 makes the exponent go away, and after that they are playing the move with the most visits: in the expression π(a | s_0) = N(s_0, a)^(1/τ) / ∑_b N(s_0, b)^(1/τ), the probability of the most-visited move goes to 1 and the probability of every other move goes to 0 as τ goes to 0 from the right.

For the first 30 moves of each game, the temperature is set to τ = 1; this selects moves proportionally to their visit count in MCTS, and ensures a diverse set of positions are encountered. For the remainder of the game, an infinitesimal temperature is used, τ → 0.
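A minimal sketch of that selection rule, using made-up visit counts (this is just an illustration of the formula, not AlphaGo Zero's actual code):

```python
import numpy as np

def search_probabilities(visit_counts, tau):
    """pi_a proportional to N(s, a)^(1/tau); tau=0 means the tau -> 0 limit."""
    counts = np.asarray(visit_counts, dtype=float)
    if tau == 0.0:
        # Limit as tau -> 0+: all probability mass on the most-visited move.
        pi = np.zeros_like(counts)
        pi[np.argmax(counts)] = 1.0
        return pi
    weights = counts ** (1.0 / tau)
    return weights / weights.sum()

counts = [10, 30, 60]  # hypothetical MCTS visit counts at the root
print(search_probabilities(counts, tau=1.0))  # proportional to visits: [0.1 0.3 0.6]
print(search_probabilities(counts, tau=0.0))  # greedy: [0. 0. 1.]
```

With τ = 1 the exponent is 1, so the probabilities are exactly the normalized visit counts; at τ = 0 the rule degenerates to picking the most-visited move.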
_______________________________________________
Computer-go mailing list
Compu...@computer-go.org
http://computer-go.org/mailman/listinfo/computer-go
Yes.
It's possible they had in-betweens or experimented with variations at some point, then settled on the simplest case. You can vary the randomness if you define the policy as a softmax with varying temperature; that's harder if you only define it as "select best" or "select proportionally".
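The point about in-betweens can be illustrated with the same exponentiated-visit-count rule (hypothetical counts again): intermediate temperatures interpolate smoothly between proportional sampling and greedy selection, which is not possible if the policy is defined only as those two special cases.

```python
import numpy as np

def pi(visit_counts, tau):
    """Search probabilities pi_a proportional to N^(1/tau)."""
    weights = np.asarray(visit_counts, dtype=float) ** (1.0 / tau)
    return weights / weights.sum()

counts = [10, 30, 60]  # hypothetical visit counts
for tau in (1.0, 0.5, 0.25):
    # As tau shrinks, probability mass concentrates on the most-visited move.
    print(tau, pi(counts, tau))
```

At τ = 1 the most-visited move gets probability 0.6; at τ = 0.5 roughly 0.78; at τ = 0.25 roughly 0.94, approaching the greedy limit.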
--
GCP