HiraBot's author reported a minimax search with Policy and Value networks.
It does not use Monte Carlo.
Only the top 8 moves from the Policy network are searched at the root node.
At other depths, the top 4 moves are searched.
Game results against the Policy network's best move (without search):

             (Win-Loss)  Winrate
MaxDepth=1,  (558- 442)   0.558    +40 Elo
MaxDepth=2,  (351- 150)   0.701   +148 Elo
MaxDepth=3,  (406- 116)   0.778   +218 Elo
MaxDepth=4,  (670-  78)   0.896   +374 Elo
MaxDepth=5,  (490-  57)   0.896   +374 Elo
MaxDepth=6,  (520-  20)   0.963   +556 Elo
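(Elo here looks like the usual logistic conversion from winrate; for example:)

import math

def elo_from_winrate(p):
    return 400.0 * math.log10(p / (1.0 - p))

print(round(elo_from_winrate(0.558)), round(elo_from_winrate(0.896)))   # -> 40 374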
The search is simple alpha-beta. There is a modification so that moves with a
high Policy-network probability tend to be selected.
MaxDepth=6 takes one second per move on an i7-4790k + GTX 1060.
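Roughly, the scheme is something like the following sketch (illustration only,
not his actual code - see the negamax link below; policy_net, value_net,
legal_moves and play are hypothetical helpers, and the extra bias toward
high-probability Policy moves is omitted):

def negamax(pos, depth, alpha, beta, max_depth):
    # Alpha-beta negamax over only the top Policy moves; Value net at the leaves.
    if depth == 0 or pos.is_terminal():
        return value_net(pos)                      # value for the side to move, in [-1, 1]
    width = 8 if depth == max_depth else 4         # top 8 at the root, top 4 elsewhere
    probs = policy_net(pos)                        # hypothetical: move -> prior probability
    moves = sorted(legal_moves(pos), key=lambda m: probs.get(m, 0.0), reverse=True)[:width]
    best = -1.0
    for mv in moves:
        score = -negamax(play(pos, mv), depth - 1, -beta, -alpha, max_depth)
        if score > best:
            best = score
        if best > alpha:
            alpha = best
        if alpha >= beta:                          # simple alpha-beta cutoff
            break
    return best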
His negamax code:
http://kiyoshifk.dip.jp/kiyoshifk/apk/negamax.zip
CGOS result, MaxDepth=6:
http://www.yss-aya.com/cgos/19x19/cross/minimax-depth6.html
His Policy network (without search) is probably:
http://www.yss-aya.com/cgos/19x19/cross/DCNN-No336-tygem.html
His Policy and Value network (MCTS) is probably:
http://www.yss-aya.com/cgos/19x19/cross/Hiratuka10_38B100.html
Thanks,
Hiroshi Yamashita
The question I always ask is: what's the real difference between MCTS
with a small UCT constant and an alpha-beta search with heavy Late Move
Reductions? Are the explored trees really so different?
In any case, in my experiments Monte Carlo still gives a strong benefit,
even with a not-so-strong Monte Carlo part. IIRC it was the case for
AlphaGo too (and they used more training data for the value network than
is publicly available), and Zen reported the same: Monte Carlo is important.
The main problem is the "only top x moves" part. Late Move Reductions
are very nice because there is never a full pruning. This heavy pruning
by the policy network, OTOH, seems to be an issue for me. My program has
big tactical holes.
--
GCP
> ... This heavy pruning
> by the policy network OTOH seems to be an issue for me. My program has
> big tactical holes.
Yes, there is a big difference between LMR in Go and LMR in chess: Go tactics take many moves to play out, whereas chess tactics are often pretty immediate. So LMR could hurt Go tactics much more than it hurts chess tactics. Compare the benefit of forcing the playout to the end of the game.
Best,
Brian
There are 2 aspects to my answer:
1) Unless you've made a breakthrough with value nets, there appears to
be a benefit to keeping the Monte Carlo simulations.
2) I am not sure the practical implementations of both algorithms end up
searching in a different manner.
(1) is an argument against using alpha-beta. If we want to get rid of
the MC simulations - for whatever reason - it disappears. (2) isn't an
argument against. Stating the algorithm in a different manner may make
some heuristics or optimizations more obvious.
> Yes, there is a big difference between LMR in Go and LMR in chess: Go
> tactics take many moves to play out, whereas chess tactics are often
> pretty immediate.
Not sure I agree with the basic premise here.
> So LMR could hurt Go tactics much more than it hurts chess tactics.
> Compare the benefit of forcing the playout to the end of the game.
LMR doesn't prune anything, it just reduces the remaining search depth
for non-highly rated moves. So it's certainly not going to make
something tactically weaker than hard pruning? If you're talking about
not pruning or reducing at all, you get the issue of the branching
factor again.
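For reference, a bare-bones LMR looks something like this (a sketch only,
with made-up reduction thresholds; evaluate, ordered_moves and play are
stand-ins):

def lmr_search(pos, depth, alpha, beta):
    # Alpha-beta where late, low-rated moves get a reduced depth - never a hard prune.
    if depth <= 0:
        return evaluate(pos)                       # static eval / quiescence would go here
    for i, mv in enumerate(ordered_moves(pos)):    # best-rated moves first
        reduction = 1 if (i >= 4 and depth >= 3) else 0
        score = -lmr_search(play(pos, mv), depth - 1 - reduction, -beta, -alpha)
        if reduction and score > alpha:            # reduced search beat expectations:
            score = -lmr_search(play(pos, mv), depth - 1, -beta, -alpha)   # re-search at full depth
        if score > alpha:
            alpha = score
        if alpha >= beta:
            break
    return alpha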
In chess you have quiescent search to filter out the simpler tactics. I
guess Monte Carlo simulations may act similarly in that they're going to
raise/lower the score if tactical shenanigans happen in some simulations.
The obvious defense (when looking at it in alpha-beta formulation) would
be to cap the depth reduction, and (in MCTS/UCT formulation) to cap the
minimum probability. I had no success with this in Go so far.
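Concretely, the caps would be something like this (numbers made up purely
for illustration):

prior = max(policy_prior, 0.005)            # UCT view: never let a prior drop below 0.5%
reduction = min(prior_based_reduction, 3)   # alpha-beta view: reduce by at most 3 plies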
On 22-05-17 11:27, Erik van der Werf wrote:
> On Mon, May 22, 2017 at 10:08 AM, Gian-Carlo Pascutto <g...@sjeng.org
> <mailto:g...@sjeng.org>> wrote:
>
> ... This heavy pruning
> by the policy network OTOH seems to be an issue for me. My program has
> big tactical holes.
>
>
> Do you do any hard pruning? My engines (Steenvreter,Magog) always had a
> move predictor (a.k.a. policy net), but I never saw the need to do hard
> pruning. Steenvreter uses the predictions to set priors, and it is very
> selective, but with infinite simulations eventually all potentially
> relevant moves will get sampled.
With infinite simulations everything is easy :-)
In practice moves with, say, a prior below 0.1% aren't going to get
searched, and I still regularly see positions where they're the winning
move, especially with tactics on the board.
Enforcing the search to be wider without losing playing strength appears
to be hard.
(1) To solve L&D, some search is necessary in practice. So the
value net alone cannot solve some of them.
(2) The number of possible positions (inputs to the value net) in
real games is at least 10^30 (10^170 in theory). Can the value
net recognize them all? L&D depends on very small differences in
the placement of stones or liberties. Can we provide the necessary
amount of training data? Does the network have enough capacity?
The answer is almost obvious from the theory of function
approximation. (An ANN is just a non-linear function
approximator.)
(3) A CNN cannot learn the exclusive-or function due to the ReLU
activation function (instead of the traditional sigmoid /
hyperbolic tangent). A CNN is good at approximating continuous
(analog) functions but not Boolean (digital) ones.
Hideki
--
Hideki Kato <mailto:hideki...@ybb.ne.jp>
Leela's Monte Carlo playouts were designed and implemented in 2007,
before most of the current literature around them was public. Back then,
they were very "thick" and good enough to make the program one of the
strongest around. Needless to say, in the ~9 years or so when I was
absent from go programming, others made substantial progress in that
area, especially as before value nets this was clearly one of the most
important components of strength. Leela's Monte Carlo playouts for sure
are weaker than those of Crazy Stone and Zen, and even pachi. I have
done work on this in the last year, but a more complete overhaul isn't
in 0.10.0 yet.
Nevertheless (as you also observe below) they still contribute a benefit
to the strength of the engine. That's why I've been consistently saying
dropping them doesn't seem to be good, and why I like the orthogonality
they provide with the value net (and am generally wary of methods that
tune the playouts with or towards the value net).
> So clearly what's going on is that the playouts allow suicide,
I'll need to reconstruct the position you set up, but this is something
that shouldn't happen. Thank you for pointing it out, I'll try to
confirm on my side.
> Now I'm just speculating. My guess is that somehow 3% of the time, the
> game is scored without black having captured white's group. As in -
> black passes, white passes, white's dead group is still on the board, so
> white wins. The guess would be that liberties and putting it in atari
> increases the likelihood that the playouts kill the group before having
> both players pass and score. But that's just a guess, maybe there's also
> more black magic involving adjusting the "value" of a win depending on
> unknown factors beyond just having a "big win". Would need Gian-Carlo to
> actually confirm or refute this guess though.
Leela allows passes with a very low probability, so your analysis is
probably right.
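Schematically (this is not Leela's actual playout code, just a sketch of
the mechanism; the pass probability and the position helpers are made up):

import random

PASS_PROB = 0.001                       # made-up "very low" pass probability

def playout(pos):
    # If both sides happen to pass early, the position is scored as it stands,
    # so an uncaptured dead group can still be counted for its owner.
    passes = 0
    while passes < 2:
        if random.random() < PASS_PROB:
            mv = "pass"
        else:
            mv = pos.random_policy_move()   # hypothetical playout policy
        passes = passes + 1 if mv == "pass" else 0
        pos = pos.play(mv)
    return pos.tromp_taylor_score()         # counts whatever is left on the board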
> given that they're a significant weight in the evaluation
> alongside the value net, they're probably one of the major things
> holding Leela back at this point.
I assume that as well, which is why I've been doing some work on them,
but I'm also prepared to be disappointed. Note that I didn't put the
significant weighting arbitrarily: it's set to what gave the maximum
playing strength.
I suspect that when there are multiple options that seem objectively
equally good (from the value net), the playouts also help play towards
the option where it is harder to mess up. In this case, a larger amount
of stochasticity is not a bad thing.
--
GCP
On 22-05-17 15:46, Erik van der Werf wrote:
> Anyway, LMR seems like a good idea, but last time I tried it (in Migos)
> it did not help. In Magog I had some good results with fractional depth
> reductions (like in Realization Probability Search), but it's a long
> time ago and the engines were much weaker then...
What was generating your probabilities, though? A strong policy DCNN or
something weaker?
ERPS (LMR with fractional reductions based on move probabilities) with
alpha-beta seems very similar to having MCTS with the policy prior being
a factor in the UCT formula.
This is what AlphaGo did according to their
2015 paper, so it can't be terrible, but it does mean that you are 100%
blind to something the policy network doesn't see, which seems
worrisome. I think I asked Aja once about what they do with first play
urgency given that the paper doesn't address it - he politely ignored
the question :-)
Note that this is completely irrelevant for the discussion about
tactical holes and the position I posted. You could literally plug any
evaluation into it (save for a static oracle, in which case why search
at all...) and it would still have the tactical blindness being discussed.
It's an issue of limitations of the policy network, combined with the
way one uses the UCT formula. I'll use the one from the original AlphaGo
paper here, because it's public and should behave even worse:
u(s, a) = c_puct * P(s, a) * sqrt(total_visits / (1 + child_visits))
Note that P(s, a) is a direct factor here, which means that for a move
ignored by the policy network, the UCT term will almost vanish. In other
words, unless the win is immediately visible (and for tactics it won't),
you're not going to find it. Also note that this is a deviation from
regular UCT or PUCT, which do not have such a direct term and hence only
have a disappearing prior, making the search eventually more exploratory.
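To put numbers on it (toy illustration; the c_puct value is arbitrary here):

import math

def u_term(prior, total_visits, child_visits, c_puct=5.0):
    # The selection bonus as written above: the prior multiplies the whole term.
    return c_puct * prior * math.sqrt(total_visits / (1 + child_visits))

for prior in (0.3, 0.00001):                # a favoured move vs. a policy-net blind spot
    print(prior, u_term(prior, total_visits=100000, child_visits=0))
# 0.3   -> bonus of roughly 474: tried almost immediately
# 1e-05 -> bonus of roughly 0.016: effectively never selected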
Now, even the original AlphaGo played moves that surprised human pros
and were contrary to established sequences. So where did those come
from? Enough computation power to overcome the low probability?
Synthesized by inference from the (much larger than mine) policy network?
--
GCP
DCNNs clearly have some ability to generalize from learned data and
perform OK even on unseen examples. So I don't find this a very
compelling argument. It's not like Monte Carlo playouts are going to
handle all sequences correctly either.
Evaluations are heuristic guidance for the search, and a help when the
search terminates in an unresolved position. Having multiple independent
ones improves the accuracy of the heuristic - a basic ensemble.
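At a leaf that is just something like the following (the 0.5 mixing weight
is a placeholder, not a tuned value):

def leaf_eval(value_net_winrate, playout_winrate, mix=0.5):
    # Blend two independent heuristic evaluations of the same leaf position.
    return (1.0 - mix) * value_net_winrate + mix * playout_winrate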
> (3) CNN cannot learn exclusive-or function due to the ReLU
> activation function, instead of traditional sigmoid (tangent
> hyperbolic). CNN is good at approximating continuous (analog)
> functions but Boolean (digital) ones.
Are you sure this is correct? Especially if we allow leaky ReLU?
--
GCP
Agree.
> (1) To solve L&D, some search is necessary in practice. So, the
> value net cannot solve some of them.
> (2) The number of possible positions (input of the value net) in
> real games is at least 10^30 (10^170 in theory). If the value
> net can recognize all? L&Ds depend on very small difference of
> the placement of stones or liberties. Can we provide necessary
> amount of training data? Have the network enough capacity?
> The answer is almost obvious by the theory of function
> approximation. (ANN is just a non-linear function
> approximator.)
> (3) CNN cannot learn exclusive-or function due to the ReLU
> activation function, instead of traditional sigmoid (tangent
> hyperbolic). CNN is good at approximating continuous (analog)
> functions but Boolean (digital) ones.
My argument is about a "stand-alone" DCNN. Adding some (top-down?)
control to DCNNs could solve this (like the human brain). #I'm not
sure about recurrence, but it is maybe necessary.
>>(3) CNN cannot learn exclusive-or function due to the ReLU
>> activation function, instead of traditional sigmoid (tangent
>> hyperbolic). CNN is good at approximating continuous (analog)
>> functions but Boolean (digital) ones.
>>
>
>
>Are you sure about that? I can imagine using two ReLU units to construct a
>sigmoid-like step function, so I'd think a multi-layer net should be fine
>(just like with ordinary perceptrons).
Even using many layers, it's hard to represent sharp edges by
combining ReLUs. (Not impossible, but the chances are small, probably
due to so many local traps.)
Best, Hideki
>Now, even the original AlphaGo played moves that surprised human pros
>and were contrary to established sequences. So where did those come
>from? Enough computation power to overcome the low probability?
>Synthesized by inference from the (much larger than mine) policy network?
Demis Hassabis said in a talk:
After the match with Sedol, the team used "adversarial learning" in
order to fill the holes in the policy net (such as Sedol's winning
move in game 4).
Hideki
--
Hideki Kato <mailto:hideki...@ybb.ne.jp>
A CNN can generalize if global shapes can be built from smaller
local shapes. L&D of a large group is an exception because it's
too sensitive to the details of the position (i.e., it can be very
global). We can't expect much from such generalization in L&D.
In our experiments, the value net thinks a group is alive if it has
a large enough space. That's all.
#Actually, it is the opposite: the value net thinks a group is dead
if and only if it is short of liberties. Some nakade shapes can be
solved if the outer liberties are almost filled.
Additionally, the value net frequently treats false eyes as real ones,
especially on the first line. (This problem can also be very
global and very hard to solve with no search.)
The value net itself cannot handle L&D correctly, but it allows so
much deeper search that this problem is hidden (i.e., hard to notice).
>Evaluations are heuristic guidance for the search, and a help when the
>search terminates in an unresolved position. Having multiple independent
>ones improves the accuracy of the heuristic - a basic ensemble.
The value net approximates the "true" value function of Go very
coarsely. Rollouts (MC simulations) fill in the detail. This could
be the best ensemble.
>>(3) CNN cannot learn exclusive-or function due to the ReLU
>>activation function, instead of traditional sigmoid (tangent
>> hyperbolic). CNN is good at approximating continuous (analog)
>> functions but Boolean (digital) ones.
>
>Are you sure this is correct? Especially if we allow leaky ReLU?
Do you know that the success of "deep" CNNs comes from the use of
ReLU? Sigmoid easily makes the gradient vanish, while ReLU does not.
However, ReLU cannot represent sharp edges while sigmoid can. A DCNN
(with ReLU) approximates functions in a piecewise-linear style.
Hideki
--
Hideki Kato <mailto:hideki...@ybb.ne.jp>
> (3) CNN cannot learn exclusive-or function due to the ReLU
> activation function, instead of traditional sigmoid (tangent
> hyperbolic). CNN is good at approximating continuous (analog)
> functions but Boolean (digital) ones.
No, this is incorrect. A perceptron (a single-layer neural network)
cannot do XOR.
The whole point of 2+ layer networks was to overcome this basic
weakness. A two-layer network with an infinite number of neurons in
the layers can approximate any function.
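For what it's worth, here is a hand-wired two-layer ReLU net that represents
XOR exactly (illustrative only; whether training finds such weights is a
separate question):

import numpy as np

def relu(x):
    return np.maximum(x, 0.0)

W1 = np.array([[1.0, 1.0],
               [1.0, 1.0]])        # both hidden units see a + b
b1 = np.array([0.0, -1.0])         # the second unit is shifted down by 1
w2 = np.array([1.0, -2.0])         # output = h1 - 2*h2

for a, b in [(0, 0), (0, 1), (1, 0), (1, 1)]:
    h = relu(np.array([a, b]) @ W1 + b1)
    print(a, b, int(h @ w2))       # prints 0, 1, 1, 0 -> exactly XOR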
But early on it turned out that learning was unstable and/or extremely
slow for multilayer networks, so the theoretical capacity was not
practical.
Now with deep learning we know that with correct training, a lot of data
and hardware (or patience) neural networks can learn almost anything.
It is probably correct that smooth functions are easier to approximate
with a neural network than high-dimensional non-continuous functions.
I am training my networks on a single CPU thread so I have the benefit
of following the learning process of NNOdin slowly. I have seen a lot of
problems with the network but after some weeks of training they go away.
It is interesting to see how its playing style changes. For a while it
would rigidly play very local shapes, but now it seems to start to take
life and death of large groups into account. Or maybe it lets the MC
playouts have more impact on the decisions made, by searching more
effectively. Some weeks ago it would barely win against gnugo, and it
won by just playing standard shapes until it got lucky. In the last
couple of days it seems to surround and cut off gnugo's groups and kill
them big, as a strong player would.
So what do I want to say? So far I have learned that the policy network
will blindly play whatever shapes it finds good and ignore most alternative
moves. So there is indeed a huge problem of "holes" in the policy
function. But for Odin at least I do not know which holes will be a
problem as the network matures with more learning. My plan is then to
fix holes by making the MC evaluation strong.
Best
Magnus
Such an NN has no "sharp" edges. With a sigmoid (hyperbolic tangent)
activation function, changing the weights can change the sharpness
of the edges of the approximated function. With ReLU, changing the
weights only changes the slope.
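A quick illustration of that (toy example, hand-picked scales):

import numpy as np

x = np.linspace(-1.0, 1.0, 5)                  # a few sample inputs

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

for w in (1.0, 10.0, 100.0):                   # scale the input weight
    print("sigmoid", w, np.round(sigmoid(w * x), 3))   # transition gets ever sharper
    print("relu   ", w, np.maximum(w * x, 0.0))        # still one straight ramp, just steeper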
Hideki
"Gian-Carlo Pascutto" <g...@sjeng.org> wrote:
> In the attached SGF, AlphaGo played P10, which was considered a very
> surprising move by all commentators...
> I can sort-of confirm this:
>
> 0.295057654 (E13)
> ...(60 more moves follow)...
> 0.000011952 (P10)
>
> So, 0.001% probability. Demis commented that Lee Sedol's winning move in
> game 4 was a one in 10 000 move. This is a 1 in 100 000 move.
In summer 2016 I checked the games of AlphaGo vs Lee Sedol
with repeated runs of CrazyStone DL:
In 3 of 20 runs the program selected P10. It
turned out that a rather early "switch" in the search was
necessary to arrive at P10. But once CS did that, it
stayed with this candidate.
Ingo.
I guess it's possible this move is selected by a policy other than the
neural network. Or perhaps the probability can be much higher with a
differently trained policy net.
--
GCP