HiraBot's author reported a minimax search with Policy and Value networks.
It does not use Monte Carlo.
Only the top 8 moves from the Policy network are searched at the root node.
At other depths, the top 4 moves are searched.
Game results against the Policy network's best move (without search):

             (Win-Loss)  Winrate
MaxDepth=1,  (558- 442)   0.558    +40 Elo
MaxDepth=2,  (351- 150)   0.701   +148 Elo
MaxDepth=3,  (406- 116)   0.778   +218 Elo
MaxDepth=4,  (670-  78)   0.896   +374 Elo
MaxDepth=5,  (490-  57)   0.896   +374 Elo
MaxDepth=6,  (520-  20)   0.963   +556 Elo
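(Elo here looks like the usual logistic conversion from winrate; for example:)

import math

def elo_from_winrate(p):
    return 400.0 * math.log10(p / (1.0 - p))

print(round(elo_from_winrate(0.558)), round(elo_from_winrate(0.896)))   # -> 40 374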
The search is simple alpha-beta. There is a modification so that moves with a
high Policy-network probability tend to be selected.
MaxDepth=6 takes one second per move on an i7-4790k + GTX 1060.
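Roughly, the scheme is something like the following sketch (illustration only,
not his actual code - see the negamax link below; policy_net, value_net,
legal_moves and play are hypothetical helpers, and the extra bias toward
high-probability Policy moves is omitted):

def negamax(pos, depth, alpha, beta, max_depth):
    # Alpha-beta negamax over only the top Policy moves; Value net at the leaves.
    if depth == 0 or pos.is_terminal():
        return value_net(pos)                      # value for the side to move, in [-1, 1]
    width = 8 if depth == max_depth else 4         # top 8 at the root, top 4 elsewhere
    probs = policy_net(pos)                        # hypothetical: move -> prior probability
    moves = sorted(legal_moves(pos), key=lambda m: probs.get(m, 0.0), reverse=True)[:width]
    best = -1.0
    for mv in moves:
        score = -negamax(play(pos, mv), depth - 1, -beta, -alpha, max_depth)
        if score > best:
            best = score
        if best > alpha:
            alpha = best
        if alpha >= beta:                          # simple alpha-beta cutoff
            break
    return best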
His negamax code:
http://kiyoshifk.dip.jp/kiyoshifk/apk/negamax.zip
CGOS result, MaxDepth=6:
http://www.yss-aya.com/cgos/19x19/cross/minimax-depth6.html
His Policy network (without search) is probably:
http://www.yss-aya.com/cgos/19x19/cross/DCNN-No336-tygem.html
His Policy and Value network (MCTS) is probably:
http://www.yss-aya.com/cgos/19x19/cross/Hiratuka10_38B100.html
Thanks,
Hiroshi Yamashita
The question I always ask is: what's the real difference between MCTS
with a small UCT constant and an alpha-beta search with heavy Late Move
Reductions? Are the explored trees really so different?
In any case, in my experiments Monte Carlo still gives a strong benefit,
even with a not-so-strong Monte Carlo part. IIRC it was the case for
AlphaGo too (and they used more training data for the value network than
is publicly available), and Zen reported the same: Monte Carlo is important.
The main problem is the "only top x moves" part. Late Move Reductions
are very nice because there is never a full pruning. This heavy pruning
by the policy network, OTOH, seems to be an issue for me. My program has
big tactical holes.
--
GCP
> ... This heavy pruning
> by the policy network OTOH seems to be an issue for me. My program has
> big tactical holes.
Yes, there is a big difference between LMR in Go and LMR in chess: Go tactics take many moves to play out, whereas chess tactics are often pretty immediate. So LMR could hurt Go tactics much more than it hurts chess tactics. Compare the benefit of forcing the playout to the end of the game.
Best,
Brian
There are 2 aspects to my answer:
1) Unless you've made a breakthrough with value nets, there appears to
be a benefit to keeping the Monte Carlo simulations.
2) I am not sure the practical implementations of both algorithms end up
searching in a different manner.
(1) is an argument against using alpha-beta. If we want to get rid of
the MC simulations - for whatever reason - it disappears. (2) isn't an
argument against. Stating the algorithm in a different manner may make
some heuristics or optimizations more obvious.
> Yes, there is a big difference between LMR in Go and LMR in chess: Go
> tactics take many moves to play out, whereas chess tactics are often
> pretty immediate.
Not sure I agree with the basic premise here.
> So LMR could hurt Go tactics much more than it hurts chess tactics.
> Compare the benefit of forcing the playout to the end of the game.
LMR doesn't prune anything, it just reduces the remaining search depth
for non-highly rated moves. So it's certainly not going to make
something tactically weaker than hard pruning? If you're talking about
not pruning or reducing at all, you get the issue of the branching
factor again.
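For reference, a bare-bones LMR looks something like this (a sketch only,
with made-up reduction thresholds; evaluate, ordered_moves and play are
stand-ins):

def lmr_search(pos, depth, alpha, beta):
    # Alpha-beta where late, low-rated moves get a reduced depth - never a hard prune.
    if depth <= 0:
        return evaluate(pos)                       # static eval / quiescence would go here
    for i, mv in enumerate(ordered_moves(pos)):    # best-rated moves first
        reduction = 1 if (i >= 4 and depth >= 3) else 0
        score = -lmr_search(play(pos, mv), depth - 1 - reduction, -beta, -alpha)
        if reduction and score > alpha:            # reduced search beat expectations:
            score = -lmr_search(play(pos, mv), depth - 1, -beta, -alpha)   # re-search at full depth
        if score > alpha:
            alpha = score
        if alpha >= beta:
            break
    return alpha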
In chess you have quiescent search to filter out the simpler tactics. I
guess Monte Carlo simulations may act similarly in that they're going to
raise/lower the score if tactical shenanigans happen in some simulations.
The obvious defense (when looking at it in alpha-beta formulation) would
be to cap the depth reduction, and (in MCTS/UCT formulation) to cap the
minimum probability. I had no success with this in Go so far.
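Concretely, the caps would be something like this (numbers made up purely
for illustration):

prior = max(policy_prior, 0.005)            # UCT view: never let a prior drop below 0.5%
reduction = min(prior_based_reduction, 3)   # alpha-beta view: reduce by at most 3 plies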
On 22-05-17 11:27, Erik van der Werf wrote:
> On Mon, May 22, 2017 at 10:08 AM, Gian-Carlo Pascutto <g...@sjeng.org
> <mailto:g...@sjeng.org>> wrote:
>
> ... This heavy pruning
> by the policy network OTOH seems to be an issue for me. My program has
> big tactical holes.
>
>
> Do you do any hard pruning? My engines (Steenvreter,Magog) always had a
> move predictor (a.k.a. policy net), but I never saw the need to do hard
> pruning. Steenvreter uses the predictions to set priors, and it is very
> selective, but with infinite simulations eventually all potentially
> relevant moves will get sampled.
With infinite simulations everything is easy :-)
In practice moves with, say, a prior below 0.1% aren't going to get
searched, and I still regularly see positions where they're the winning
move, especially with tactics on the board.
Enforcing the search to be wider without losing playing strength appears
to be hard.
(1) To solve L&D, some search is necessary in practice. So the
value net alone cannot solve some of them.
(2) The number of possible positions (inputs to the value net) in
real games is at least 10^30 (10^170 in theory). Can the value
net recognize them all? L&D depends on very small differences in
the placement of stones or liberties. Can we provide the necessary
amount of training data? Does the network have enough capacity?
The answer is almost obvious from the theory of function
approximation. (An ANN is just a non-linear function
approximator.)
(3) A CNN cannot learn the exclusive-or function due to the ReLU
activation function (instead of the traditional sigmoid /
hyperbolic tangent). A CNN is good at approximating continuous
(analog) functions but not Boolean (digital) ones.
Hideki
--
Hideki Kato <mailto:hideki...@ybb.ne.jp>
Leela's Monte Carlo playouts were designed and implemented in 2007,
before most of the current literature around them was public. Back then,
they were very "thick" and good enough to make the program one of the
strongest around. Needless to say, in the ~9 years or so when I was
absent from go programming, others made substantial progress in that
area, especially as before value nets this was clearly one of the most
important components of strength. Leela's Monte Carlo playouts for sure
are weaker than those of Crazy Stone and Zen, and even pachi. I have
done work on this in the last year, but a more complete overhaul isn't
in 0.10.0 yet.
Nevertheless (as you also observe below) they still contribute a benefit
to the strength of the engine. That's why I've been consistently saying
dropping them doesn't seem to be good, and why I like the orthogonality
they provide with the value net (and am generally wary of methods that
tune the playouts with or towards the value net).
> So clearly what's going on is that the playouts allow suicide,
I'll need to reconstruct the position you set up, but this is something
that shouldn't happen. Thank you for pointing it out, I'll try to
confirm on my side.
> Now I'm just speculating. My guess is that somehow 3% of the time, the
> game is scored without black having captured white's group. As in -
> black passes, white passes, white's dead group is still on the board, so
> white wins. The guess would be that liberties and putting it in atari
> increases the likelihood that the playouts kill the group before having
> both players pass and score. But that's just a guess, maybe there's also
> more black magic involving adjusting the "value" of a win depending on
> unknown factors beyond just having a "big win". Would need Gian-Carlo to
> actually confirm or refute this guess though.
Leela allows passes with a very low probability, so your analysis is
probably right.
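Schematically (this is not Leela's actual playout code, just a sketch of
the mechanism; the pass probability and the position helpers are made up):

import random

PASS_PROB = 0.001                       # made-up "very low" pass probability

def playout(pos):
    # If both sides happen to pass early, the position is scored as it stands,
    # so an uncaptured dead group can still be counted for its owner.
    passes = 0
    while passes < 2:
        if random.random() < PASS_PROB:
            mv = "pass"
        else:
            mv = pos.random_policy_move()   # hypothetical playout policy
        passes = passes + 1 if mv == "pass" else 0
        pos = pos.play(mv)
    return pos.tromp_taylor_score()         # counts whatever is left on the board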
> given that they're a significant weight in the evaluation
> alongside the value net, they're probably one of the major things
> holding Leela back at this point.
I assume that as well, which is why I've been doing some work on them,
but I'm also prepared to be disappointed. Note that I didn't put the
significant weighting arbitrarily: it's set to what gave the maximum
playing strength.
I suspect that when there are multiple options that seem objectively
equally good (from the value net), the playouts also help play towards
the option where it is harder to mess up. In this case, a larger amount
of stochasticity is not a bad thing.
--
GCP
On 22-05-17 15:46, Erik van der Werf wrote:
> Anyway, LMR seems like a good idea, but last time I tried it (in Migos)
> it did not help. In Magog I had some good results with fractional depth
> reductions (like in Realization Probability Search), but it's a long
> time ago and the engines were much weaker then...
What was generating your probabilities, though? A strong policy DCNN or
something weaker?
ERPS (LMR with fractional reductions based on move probabilities) with
alpha-beta seems very similar to having MCTS with the policy prior being
a factor in the UCT formula.
This is what AlphaGo did according to their
2015 paper, so it can't be terrible, but it does mean that you are 100%
blind to something the policy network doesn't see, which seems
worrisome. I think I asked Aja once about what they do with first play
urgency given that the paper doesn't address it - he politely ignored
the question :-)
Note that this is completely irrelevant for the discussion about
tactical holes and the position I posted. You could literally plug any
evaluation into it (save for a static oracle, in which case why search
at all...) and it would still have the tactical blindness being discussed.
It's an issue of limitations of the policy network, combined with the
way one uses the UCT formula. I'll use the one from the original AlphaGo
paper here, because it's public and should behave even worse:
u(s, a) = c_puct * P(s, a) * sqrt(total_visits / (1 + child_visits))
Note that P(s, a) is a direct factor here, which means that for a move
ignored by the policy network, the UCT term will almost vanish. In other
words, unless the win is immediately visible (and for tactics it won't),
you're not going to find it. Also note that this is a deviation from
regular UCT or PUCT, which do not have such a direct term and hence only
have a disappearing prior, making the search eventually more exploratory.
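To put numbers on it (toy illustration; the c_puct value is arbitrary here):

import math

def u_term(prior, total_visits, child_visits, c_puct=5.0):
    # The selection bonus as written above: the prior multiplies the whole term.
    return c_puct * prior * math.sqrt(total_visits / (1 + child_visits))

for prior in (0.3, 0.00001):                # a favoured move vs. a policy-net blind spot
    print(prior, u_term(prior, total_visits=100000, child_visits=0))
# 0.3   -> bonus of roughly 474: tried almost immediately
# 1e-05 -> bonus of roughly 0.016: effectively never selected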
Now, even the original AlphaGo played moves that surprised human pros
and were contrary to established sequences. So where did those come
from? Enough computation power to overcome the low probability?
Synthesized by inference from the (much larger than mine) policy network?
--
GCP
DCNNs clearly have some ability to generalize from learned data and
perform OK even on unseen examples. So I don't find this a very
compelling argument. It's not like Monte Carlo playouts are going to
handle all sequences correctly either.
Evaluations are heuristic guidance for the search, and a help when the
search terminates in an unresolved position. Having multiple independent
ones improves the accuracy of the heuristic - a basic ensemble.
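At a leaf that is just something like the following (the 0.5 mixing weight
is a placeholder, not a tuned value):

def leaf_eval(value_net_winrate, playout_winrate, mix=0.5):
    # Blend two independent heuristic evaluations of the same leaf position.
    return (1.0 - mix) * value_net_winrate + mix * playout_winrate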
> (3) CNN cannot learn exclusive-or function due to the ReLU
> activation function, instead of traditional sigmoid (tangent
> hyperbolic). CNN is good at approximating continuous (analog)
> functions but Boolean (digital) ones.
Are you sure this is correct? Especially if we allow leaky ReLU?
--
GCP
Agree.
> (1) To solve L&D, some search is necessary in practice. So, the
> value net cannot solve some of them.
> (2) The number of possible positions (input of the value net) in
> real games is at least 10^30 (10^170 in theory). If the value
> net can recognize all? L&Ds depend on very small difference of
> the placement of stones or liberties. Can we provide necessary
> amount of training data? Have the network enough capacity?
> The answer is almost obvious by the theory of function
> approximation. (ANN is just a non-linear function
> approximator.)
> (3) CNN cannot learn exclusive-or function due to the ReLU
> activation function, instead of traditional sigmoid (tangent
> hyperbolic). CNN is good at approximating continuous (analog)
> functions but Boolean (digital) ones.
My argument is about a "stand-alone" DCNN. Adding some (top-down?)
control to DCNNs could solve this (like the human brain). #I'm not
sure about recurrence, but it is maybe necessary.
>>(3) CNN cannot learn exclusive-or function due to the ReLU
>> activation function, instead of traditional sigmoid (tangent
>> hyperbolic). CNN is good at approximating continuous (analog)
>> functions but Boolean (digital) ones.
>>
>
>
>Are you sure about that? I can imagine using two ReLU units to construct a
>sigmoid-like step function, so I'd think a multi-layer net should be fine
>(just like with ordinary perceptrons).
Even using many layers, it's hard to represent sharp edges by
combining ReLUs. (Not impossible, but the chances are small, probably
due to so many local traps.)
Best, Hideki
>Now, even the original AlphaGo played moves that surprised human pros
>and were contrary to established sequences. So where did those come
>from? Enough computation power to overcome the low probability?
>Synthesized by inference from the (much larger than mine) policy network?
Demis Hassabis said in a talk:
After the match with Sedol, the team used "adversarial learning" in
order to fill the holes in the policy net (such as Sedol's winning
move in game 4).
Hideki
--
Hideki Kato <mailto:hideki...@ybb.ne.jp>
A CNN can generalize if global shapes can be built from smaller
local shapes. L&D of a large group is an exception because it's
too sensitive to the details of the position (i.e., it can be very
global). We can't expect much from such generalization in L&D.
In our experiments, the value net thinks a group is alive if it has
a large enough space. That's all.
#Actually, it is the opposite: the value net thinks a group is dead
if and only if it is short of liberties. Some nakade shapes can be
solved if the outer liberties are almost filled.
Additionally, the value net frequently treats false eyes as real ones,
especially on the first line. (This problem can also be very
global and very hard to solve with no search.)
The value net itself cannot handle L&D correctly, but it allows so
much deeper search that this problem is hidden (i.e., hard to notice).
>Evaluations are heuristic guidance for the search, and a help when the
>search terminates in an unresolved position. Having multiple independent
>ones improves the accuracy of the heuristic - a basic ensemble.
The value net approximates the "true" value function of Go very
coarsely. Rollouts (MC simulations) fill in the detail. This could
be the best ensemble.
>>(3) CNN cannot learn exclusive-or function due to the ReLU
>>activation function, instead of traditional sigmoid (tangent
>> hyperbolic). CNN is good at approximating continuous (analog)
>> functions but Boolean (digital) ones.
>
>Are you sure this is correct? Especially if we allow leaky ReLU?
Do you know that the success of "deep" CNNs comes from the use of
ReLU? Sigmoid easily makes the gradient vanish, while ReLU does not.
However, ReLU cannot represent sharp edges while sigmoid can. A DCNN
(with ReLU) approximates functions in a piecewise-linear style.
Hideki
--
Hideki Kato <mailto:hideki...@ybb.ne.jp>
> (3) CNN cannot learn exclusive-or function due to the ReLU
> activation function, instead of traditional sigmoid (tangent
> hyperbolic). CNN is good at approximating continuous (analog)
> functions but Boolean (digital) ones.
No, this is incorrect. A perceptron (a single-layer neural network)
cannot do XOR.
The whole point of 2+ layer networks was to overcome this basic
weakness. A two-layer network with an infinite number of neurons in
the layers can approximate any function.
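For what it's worth, here is a hand-wired two-layer ReLU net that represents
XOR exactly (illustrative only; whether training finds such weights is a
separate question):

import numpy as np

def relu(x):
    return np.maximum(x, 0.0)

W1 = np.array([[1.0, 1.0],
               [1.0, 1.0]])        # both hidden units see a + b
b1 = np.array([0.0, -1.0])         # the second unit is shifted down by 1
w2 = np.array([1.0, -2.0])         # output = h1 - 2*h2

for a, b in [(0, 0), (0, 1), (1, 0), (1, 1)]:
    h = relu(np.array([a, b]) @ W1 + b1)
    print(a, b, int(h @ w2))       # prints 0, 1, 1, 0 -> exactly XOR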
But early on it turned out that learning was unstable and/or extremely
slow for multilayer networks, so the theoretical capacity was not
practical.
Now with deep learning we know that with correct training, a lot of data
and hardware (or patience) neural networks can learn almost anything.
It is probably correct that smooth functions are easier to approximate
with a neural network than high-dimensional non-continuous functions.
I am training my networks on a single CPU thread so I have the benefit
of following the learning process of NNOdin slowly. I have seen a lot of
problems with the network but after some weeks of training they go away.
It is interesting to see how its playing style changes. For a while it
would rigidly play very local shapes, but now it seems to start to take
life and death of large groups into account. Or maybe it lets the MC
playouts have more impact on the decisions made, by searching more
effectively. Some weeks ago it would barely win against gnugo, and it
won by just playing standard shapes until it got lucky. In the last
couple of days it seems to surround and cut off gnugo's groups and kill
them big, as a strong player would.
So what do I want to say? So far I have learned that the policy network
will blindly play whatever shapes it finds good and ignore most alternative
moves. So there is indeed a huge problem of "holes" in the policy
function. But for Odin at least I do not know which holes will be a
problem as the network matures with more learning. My plan is then to
fix holes by making the MC evaluation strong.
Best
Magnus
Such an NN has no "sharp" edges. With a sigmoid (hyperbolic tangent)
activation function, changing the weights can change the sharpness
of the edges of the approximated function. With ReLU, changing the
weights only changes the slope.
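A quick illustration of that (toy example, hand-picked scales):

import numpy as np

x = np.linspace(-1.0, 1.0, 5)                  # a few sample inputs

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

for w in (1.0, 10.0, 100.0):                   # scale the input weight
    print("sigmoid", w, np.round(sigmoid(w * x), 3))   # transition gets ever sharper
    print("relu   ", w, np.maximum(w * x, 0.0))        # still one straight ramp, just steeper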
Hideki
"Gian-Carlo Pascutto" <g...@sjeng.org> wrote:
> In the attached SGF, AlphaGo played P10, which was considered a very
> surprising move by all commentators...
> I can sort-of confirm this:
>
> 0.295057654 (E13)
> ...(60 more moves follow)...
> 0.000011952 (P10)
>
> So, 0.001% probability. Demis commented that Lee Sedol's winning move in
> game 4 was a one in 10 000 move. This is a 1 in 100 000 move.
In summer 2016 I checked the games of AlphaGo vs Lee Sedol
with repeated runs of CrazyStone DL:
In 3 of 20 runs the program selected P10. It
turned out that a rather early "switch" in the search was
necessary to arrive at P10. But once CS did that, it
stayed with this candidate.
Ingo.
I guess it's possible this move is selected by a policy other than the
neural network. Or perhaps the probability can be much higher with a
differently trained policy net.
--
GCP