[Computer-go] Monte-Carlo Tree Search as Regularized Policy Optimization


Ray Tayek

Jul 16, 2020, 3:49:00 AM
to compu...@computer-go.org
https://old.reddit.com/r/MachineLearning/comments/hrzooh/r_montecarlo_tree_search_as_regularized_policy/


--
Honesty is a very expensive gift. So, don't expect it from cheap people - Warren Buffett
http://tayek.com/

_______________________________________________
Computer-go mailing list
Compu...@computer-go.org
http://computer-go.org/mailman/listinfo/computer-go

Rémi Coulom

Jul 16, 2020, 1:47:21 PM
to computer-go
This looks very interesting.

From a quick glance, it seems the improvement comes mainly when the number of playouts is small. Also, they don't test on the game of Go. Has anybody tried it?

I will take a deeper look later.

Kensuke Matsuzaki

Jul 19, 2020, 5:39:21 AM
to compu...@computer-go.org
Hi,

I couldn't improve Leela Zero's strength by implementing SEARCH and ACT.
https://github.com/zakki/leela-zero/commits/regularized_policy

On Fri, Jul 17, 2020 at 2:47, Rémi Coulom <remi....@gmail.com> wrote:

--
Kensuke Matsuzaki

David Wu

Jul 19, 2020, 10:10:44 AM
to compu...@computer-go.org
I imagine that, at low visits at least, "ACT" behaves similarly to Leela Zero's "LCB" move selection, which also sometimes selects a move other than the max-visits move when its value estimate has recently been found to be sufficiently higher to outweigh its lower prior and lower visit count (typically the reason it wasn't the max-visits move in the first place). It also scales in an interesting way with the empirically observed playout-by-playout variance of each move, but I think by far the most important part is that a sufficiently confident high value can override max-visits.

As I recall, the gain from "LCB" in match play is on the very rough order of 100 Elo, though it could be more or less depending on match conditions, which neural net is used, and other factors. So for LZ at least, "ACT"-like behavior at low visits is not new.
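
For readers who haven't seen it, LCB selection is roughly the following (an illustrative sketch, not Leela Zero's actual code; the visit threshold and z value are made-up placeholders):

import math

def select_move_lcb(children, min_visits=10, z=1.96):
    # children: list of dicts with 'move', 'visits', 'value' (mean value
    # estimate) and 'value_var' (empirical per-playout variance).
    # Pick the child whose lower confidence bound on the value is highest;
    # fall back to the max-visits child when nothing has enough visits.
    best, best_lcb = None, -float("inf")
    for c in children:
        if c["visits"] < min_visits:
            continue
        stderr = math.sqrt(c["value_var"] / c["visits"])
        lcb = c["value"] - z * stderr
        if lcb > best_lcb:
            best, best_lcb = c, lcb
    if best is None:
        best = max(children, key=lambda c: c["visits"])
    return best["move"]

The point of the threshold and the fallback is that a lower confidence bound computed from only a handful of playouts is too noisy to trust.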

Daniel

Jul 19, 2020, 1:51:35 PM
to compu...@computer-go.org
@Kensuke I suppose all of the proposed algorithms (ACT, SEARCH and LEARN) are meant to be used during training, no?
I think I understand ACT and LEARN, but I am not sure about SEARCH, for which they say this:

> During search, we propose to stochastically sample actions according to π¯ instead of the deterministic action selection rule of Eq. 1.

This sounds much like the random selection with a temperature that is done at the root, but applied this time at internal nodes.
Does that mean the PUCT formula is not used at all? Why does the selection have to be stochastic now?
On each selection you compute π¯ from (q, π_θ, n_visits), so I suppose π¯ already has everything it needs to balance exploration and exploitation.
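
Concretely, my understanding of the closed form in Section 3 is π¯(a) = λ_N · π_θ(a) / (α − q(a)), with α the unique scalar making it sum to 1, something like the sketch below (my own reading, not code from the paper; the λ_N schedule and the constant names are assumptions):

import numpy as np

def compute_pi_bar(q, prior, total_visits, c_puct=1.25, iters=50):
    # pi_bar(a) = lam * prior(a) / (alpha - q(a)), where alpha is chosen so
    # that pi_bar sums to 1.  lam shrinks like c * sqrt(N) / (N + |A|), so
    # pi_bar starts near the prior and concentrates on high-q actions as the
    # visit count N grows.
    q = np.asarray(q, dtype=float)
    prior = np.asarray(prior, dtype=float)
    lam = c_puct * np.sqrt(max(total_visits, 1)) / (total_visits + len(q))
    # alpha lies between max(q + lam * prior) (where the sum is >= 1) and
    # max(q) + lam (where the sum is <= 1); the sum is decreasing in alpha,
    # so a simple binary search finds it.
    lo, hi = np.max(q + lam * prior), np.max(q) + lam
    for _ in range(iters):
        alpha = 0.5 * (lo + hi)
        pi_bar = lam * prior / (alpha - q)
        if pi_bar.sum() > 1.0:
            lo = alpha
        else:
            hi = alpha
    return pi_bar / pi_bar.sum()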

Kensuke Matsuzaki

Jul 19, 2020, 8:44:30 PM
to compu...@computer-go.org
I used stochastic sampling at internal nodes because of this:
> During the forward simulation phase of SEARCH, the action at each node x is selected by sampling a ∼ π¯(·|x).
> As a result, the full imaginary trajectory is generated consistently according to policy π¯.

> In this section, we establish our main claim namely that AlphaZero’s action selection criteria can be interpreted as approximating the solution to a regularized policy-optimization objective.

I think they are saying that UCT and PUCT are approximations of sampling directly from π¯,
but I haven't understood Section 3 well.
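
What I did at each internal node was roughly the following (a simplified sketch, not my actual leela-zero patch; it reuses a compute_pi_bar helper like the one Daniel describes, and the node attribute names are made up for illustration):

import numpy as np

def select_child_search(node, rng):
    # Forward-simulation step of SEARCH: sample the child from pi_bar
    # instead of taking the argmax of a PUCT score.
    pi_bar = compute_pi_bar(node.q_values, node.priors, node.total_visits)
    return int(rng.choice(len(pi_bar), p=pi_bar))

def select_child_puct(node, c_puct=1.25):
    # For comparison: the usual deterministic AlphaZero selection rule
    # (Eq. 1 in the paper, as I read it).
    q = np.asarray(node.q_values, dtype=float)
    prior = np.asarray(node.priors, dtype=float)
    u = c_puct * prior * np.sqrt(node.total_visits) / (1.0 + np.asarray(node.child_visits))
    return int(np.argmax(q + u))

Here rng would be something like np.random.default_rng(); the selection itself is the only thing I changed, the backup and expansion stayed the same.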

On Mon, Jul 20, 2020 at 2:51, Daniel <dsh...@gmail.com> wrote:
