[Computer-go] Move Evaluation in Go Using Deep Convolutional Neural Networks


Aja Huang

Dec 19, 2014, 6:17:30 PM
to compu...@computer-go.org
Hi all,

We've just submitted our paper to ICLR. We made the draft available at
http://www.cs.toronto.edu/~cmaddis/pubs/deepgo.pdf
I hope you enjoy our work. Comments and questions are welcome.

Regards,
Aja

Kahn Jonas

Dec 19, 2014, 6:34:45 PM
to compu...@computer-go.org
Hi Aja

> We've just submitted our paper to ICLR. We made the draft available at
> http://www.cs.toronto.edu/~cmaddis/pubs/deepgo.pdf
>
> I hope you enjoy our work. Comments and questions are welcome.

I did not look at the go content, on which I'm no expert.
But for the network training, you might be interested in these articles:
«Riemannian metrics for neural networks I and II» by Yann Ollivier:
http://www.yann-ollivier.org/rech/publs/gradnn.pdf
http://www.yann-ollivier.org/rech/publs/pcnn.pdf

He defines invariant metrics on the parameters, much lighter to compute
than the natural gradient, and usually gets (very very) much faster
convergence, in a much more robust way, since it does not depend on
the parametrisation or the activation function.

Jonas
_______________________________________________
Computer-go mailing list
Compu...@computer-go.org
http://computer-go.org/mailman/listinfo/computer-go

Erik van der Werf

Dec 19, 2014, 9:06:37 PM
to computer-go
On Sat, Dec 20, 2014 at 12:17 AM, Aja Huang <ajah...@google.com> wrote:
> We've just submitted our paper to ICLR. We made the draft available at
> http://www.cs.toronto.edu/~cmaddis/pubs/deepgo.pdf

Hi Aja,

Wow, very impressive. In fact so impressive, it seems a bit
suspicious(*)... If this is real then one might wonder what it means
about the game. Can it really be that simple? I always thought that
even with perfect knowledge there should usually still be a fairly
broad set of equal-valued moves. Are we perhaps seeing that most of
the time players just reproduce the same patterns over and over again?

Do I understand correctly that your representation encodes the
location of the last 5 moves? If so, do you have an idea how much
extra performance that provides compared to only the last or not using
it at all?

Thanks for sharing the paper!

Best,
Erik


* I'd really like to see some test results on pro games that are newer
than any of your training data.

Hiroshi Yamashita

Dec 19, 2014, 10:33:26 PM
to compu...@computer-go.org
Hi Aja,

> We've just submitted our paper to ICLR. We made the draft available at
> http://www.cs.toronto.edu/~cmaddis/pubs/deepgo.pdf

97.2% against GNU Go?! Accuracy is 55%?! Incredible!
Thanks for the paper!

But it looks like the playing strength is similar to Clark's CNN.
MCTS with CNN is interesting. But 67% for 10000 playouts vs the CNN alone
(0 playouts) seems small. Maybe the playouts are weak? I'm curious whether the playouts use the CNN.

In page 6, "approximately 5,000 rollouts per move"
Christopher Clark's CNN used Fuego with 10 seconds a move, 2 threads on
an Intel i7-4702MQ. So maybe it is about 40,000 rollouts per move.

Regards,
Hiroshi Yamashita

Hugh Perkins

Dec 20, 2014, 3:37:27 AM
to compu...@computer-go.org
On Fri Dec 19 23:17:23 UTC 2014, Aja Huang wrote:
> We've just submitted our paper to ICLR. We made the draft available at
> http://www.cs.toronto.edu/~cmaddis/pubs/deepgo.pdf

Cool... just out of curiosity, I did a back-of-the-envelope estimate of the
cost of training your network and Clark and Storkey's, renting time
on AWS GPU instances, and came up with:
- Clark and Storkey: 125 usd (4 days * 2 instances * 0.65 usd/hour)
- Yours: 2025 usd (cost of Clark and Storkey * 25/7 epochs *
29.4/14.7 state-action pairs * 12/8 layers)

Probably a bit high for me personally to spend on one weekend just for fun,
but not outrageous at all in fact, if the same technique were being used by
an organization.

Stefan Kaitschick

Dec 20, 2014, 3:44:51 AM
to compu...@computer-go.org
Great work. Looks like the age of nn is here.
How does this compare in computation time to a heavy MC move generator?

One very minor quibble, I feel like a nag for even mentioning it:  You write
"The most frequently cited reason for the difficulty of Go, compared to games such as Chess, Scrabble
or Shogi, is the difficulty of constructing an evaluation function that can differentiate good moves
from bad in a given position."

If MC has shown anything, it's that, computationally, it's much easier to suggest a good move than to evaluate the position.
This is still true with your paper; it's just that the move suggestion has become even better.

Stefan


Robert Jasiek

Dec 20, 2014, 4:23:15 AM
to compu...@computer-go.org
On 20.12.2014 09:43, Stefan Kaitschick wrote:
> If MC has shown anything, it's that computationally, it's much easier to
> suggest a good move, than to evaluate the position.

Such can only mean an improper understanding of positional judgement.
Positional judgement depends on reading (or MC simulation of reading)
but the reading has a much smaller computational complexity because
localisation and quiescence apply.

The major aspects of positional judgement are territory and influence.
Evaluating influence is much easier than evaluating territory if one
uses a partial influence concept: influence stone difference. Its major
difficulty is the knowledge of which stones are alive or not, however,
MC simulations applied to outside stones should be able to assess such
with reasonable certainty fairly quickly. Hence, the major work of
positional judgement is assessment of territory. See my book Positional
Judgement 1 - Territory for that. By designing (heuristically or using a
low level expert system) MC for its methods, territorial positional
judgement by MC should be much faster than ordinary MC because much
fewer simulations should do. However, it is not as elegant as ordinary
MC because some expert knowledge is necessary or must be approximated
heuristically. Needless to say, keep the computational complexity of
this expert knowledge low.

--
robert jasiek

Detlef Schmicker

Dec 20, 2014, 5:21:55 AM
to compu...@computer-go.org

It is, but I do not think that this is necessarily a feature of NNs.
NNs might be good evaluators, but it is much easier to train them as
move predictors, since it is not easy to get training data sets for an
evaluation function?!

Detlef

P.S.: As we all might be trying to start incorporating NNs into our
engines, we might pool our resources, at least for the first start?!
Maybe exchange open-source software links for NNs. I personally would
have started trying NNs some time ago if iOS had OpenCL support, as my
aim is to get a strong iPad go program....


Hiroshi Yamashita

Dec 20, 2014, 6:34:40 AM
to compu...@computer-go.org
Hi Aja,

> I hope you enjoy our work. Comments and questions are welcome.

I have three questions.

I don't understand minibatch.
Does the CNN need 0.15 sec for a position, or 0.15 sec for 128 positions?

ABCDEFGHJ
9......... White(O) to move.
8...OO.... Previous Black move is H5(X)
7..XXXOO..
6.....XXO.
5.......X.
4.........
3....XO...
2....OX...
1.........
ABCDEFGHJ

"Liberties after move" means
H7(O) is 5, F8(O) is 6.
"Liberties" means
H5(X) is 3, H6(O) is 2.
"Ladder move" means
G2(O), not E6(O).

Is this correct?

Is "KGS rank" set 9 dan when it plays against Fuego?

Regards,
Hiroshi Yamashita

Robert Jasiek

Dec 20, 2014, 7:59:55 AM
to compu...@computer-go.org
On 20.12.2014 11:21, Detlef Schmicker wrote:
> it is not easy to get training data sets for an evaluation function?!

You seem to be asking for abundant data sets, e.g., with triples Position,
Territory, Influence. Indeed, only dozens are available in the
literature and need a bit of extra work. Hundreds of available local
joseki positions do not fit your purpose, e.g., because also the Stone
Difference matters there. However, I suggest a different approach:

1) One strong player (strong enough to be accurate +-1 point of
territory when using his known judgement methods) creates a few
examples, e.g., by taking the existing examples for territory and adding
the influence stone difference. It should be only one player so that the
values are created consistently. (If several players are involved, they
should discuss and agree on their application of known methods.)

2) Code is implemented and produces sample data sets.

3) The same player judges how far off the sample data are from his own
judgement.

Thereby, training does not require many thousands of data sets. Instead
it requires much of a strong player's time to accurately judge dozens of
data sets. In theory, the player could be replaced by program judgement,
but I wish happy development of the then necessary additional theory and
algorithms! ;)

As you see, I suggest human/program collaboration to accelerate program
playing strength. Maybe 9p programs can be created without strong
players' help, but then we will not understand much in terms of go
theory why the programs will excel. For getting much understanding of go
theory from programs, human/program collaboration will be necessary anyway.

--
robert jasiek

Aja Huang

Dec 20, 2014, 10:53:52 AM
to compu...@computer-go.org
2014-12-20 11:33 GMT+00:00 Hiroshi Yamashita <y...@bd.mbn.or.jp>:
I don't understand minibatch.
Does the CNN need 0.15 sec for a position, or 0.15 sec for 128 positions?

0.15 sec for 128 positions.
 
 ABCDEFGHJ
9.........   White(O) to move.
8...OO....   Previous Black move is H5(X)
7..XXXOO..
6.....XXO.
5.......X.
4.........
3....XO...
2....OX...
1.........
 ABCDEFGHJ

"Liberties after move" means  H7(O) is 5, F8(O) is 6.
"Liberties" means
 H5(X) is 3, H6(O) is 2.
"Ladder move" means
 G2(O), not E6(O).

Is this correct?

Yes, all correct.
 
Is "KGS rank" set 9 dan when it plays against Fuego?

Yes. 
 
Aja

Aja Huang

Dec 20, 2014, 11:03:51 AM
to compu...@computer-go.org
Hi Hiroshi,

2014-12-20 3:31 GMT+00:00 Hiroshi Yamashita <y...@bd.mbn.or.jp>:
But it looks playing strength is similar to Clark's CNN.

Against GnuGo our 12-layer CNN is about 300 Elo stronger (97% winning rate against 86%, based on the same KGS games). Against Fuego using their time setting (10 sec per move on 2 threads) our CNN scored about 30%. To compare precisely with their results, we also ran 6 sec per move (since our CPU is faster) and got 20-25% (against their 12%). So, our network is clearly much stronger.
 
MCTS with CNN is interesting. But CNN (0 playout) vs 10000 playout is 67%
seems small. Maybe playout is weak? I'm curious if playout uses CNN.

MCTS + CNN (10k playouts) scored 67% against CNN alone. Yes the playout is still very simple.
 
In page 6, "approximately 5,000 rollouts per move"
Christopher Clark's CNN used Fuego with 10 seconds a move, 2 threads on
an Intel i7-4702MQ. So maybe it is about 40,000 rollouts per move.

Thanks for the information. I'll verify that again.

Aja

Detlef Schmicker

Dec 20, 2014, 12:01:41 PM
to compu...@computer-go.org
Hi,

I am still fighting with the NN slang, but why do you zero-pad the
output (page 3: 4 Architecture & Training)?

From all I read up to now, most are zero-padding the input to make the
output fit 19x19?!

Thanks for the great work

Detlef

Álvaro Begué

Dec 20, 2014, 1:44:11 PM
to computer-go
If you start with a 19x19 grid and you take convolutional filters of size 5x5 (as an example), you'll end up with a board of size 15x15, because a 5x5 box can be placed inside a 19x19 board in 15x15 different locations. We can get 19x19 outputs if we allow the 5x5 box to be centered on any point, but then you need to multiply by values outside of the original 19x19 board. Zero-padding just means you'll use 0 as the value coming from outside the board. You can either prepare a 23x23 matrix with two rows of zeros along the edges, or you can just keep the 19x19 input and do your math carefully so terms outside the board are ignored.
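The arithmetic above is easy to verify with a toy example. This sketch (my own illustration, using NumPy and a naive "valid" convolution, not anyone's engine code) shows the 19x19 to 15x15 shrinkage and how two rows/columns of zero padding restore a 19x19 output:

```python
import numpy as np

def conv2d_valid(x, k):
    """Naive 'valid' 2-D convolution: the kernel never leaves the input."""
    kh, kw = k.shape
    oh, ow = x.shape[0] - kh + 1, x.shape[1] - kw + 1
    out = np.empty((oh, ow))
    for i in range(oh):
        for j in range(ow):
            out[i, j] = np.sum(x[i:i + kh, j:j + kw] * k)
    return out

board = np.random.rand(19, 19)   # stand-in for one input plane
kernel = np.ones((5, 5))

print(conv2d_valid(board, kernel).shape)     # (15, 15): 19 - 5 + 1
padded = np.pad(board, 2, mode="constant")   # two rows/cols of zeros per side
print(padded.shape)                          # (23, 23)
print(conv2d_valid(padded, kernel).shape)    # (19, 19) again
```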


Mark Wagner

Dec 20, 2014, 2:18:27 PM
to compu...@computer-go.org
Thanks for sharing. I'm intrigued by your strategy for integrating
with MCTS. It's clear that latency is a challenge for integration. Do
you have any statistics on how many searches new nodes had been
through by the time the predictor comes back with an estimation? Did
you try any prefetching techniques? Because the CNN will guide much of
the search at the frontier of the tree, prefetching should be
tractable.

Did you do any comparisons between your MCTS with and w/o CNN? That's
the direction that many of us will be attempting over the next few
months it seems :)

- Mark

David Fotland

Dec 20, 2014, 3:19:25 PM
to compu...@computer-go.org
This would be very similar to the integration I do in Many Faces of Go. The old engine provides a bias to move selection in the tree, but the old engine is single threaded and only does a few hundred evaluations per second. I typically get between 40 and 200 playouts through a node before Old Many Faces adjusts the biases.

David


Erik van der Werf

Dec 20, 2014, 5:59:45 PM
to computer-go
On Sat, Dec 20, 2014 at 2:57 PM, Aja Huang <ajah...@gmail.com> wrote:
>> If so, do you have an idea how much
>> extra performance that provides compared to only the last or not using
>> it at all?
>
>
> We haven't measured that but I think "move history" is an important feature
> since Go is very much about answering the opponent's last move locally
> (that's also why in Go we have the term "tenuki" for not answering the last
> move).

That's pretty much how I looked at it as well. For getting a high
prediction rate it is indeed a very useful feature, but it is unclear
to me how important that really is. Some increased tenuki power may
also have its merits. Perhaps I'll just run some experiments with
Steenvreter to see what happens.

I wonder how much worse a 6d human predictor would do without move history ;-)

Erik

Martin Mueller

Dec 20, 2014, 6:06:21 PM
to compu...@computer-go.org
I think many of the programs have a mechanism for dealing with “slow” knowledge. For example in Fuego, you can call a knowledge function for each node that reaches some threshold T of playouts. The new technical challenge is dealing with the GPU. I know nothing about it myself, but from what I read it seems to work best in batch mode - you don’t want to send single positions for GPU evaluation back and forth.

My impression is that we will see a combination of both in the future - “normal”, fast knowledge which can be called as initialization in every node, and can be learned by Remi Coulom’s method (e.g. Crazy Stone, Aya, Erica) or by Wistuba’s (e.g. Fuego). And then, on top of that a mechanism to improve the bias using the slower deep networks running on the GPU.
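The combination Martin describes (a playout threshold per node plus batch-mode GPU evaluation) can be sketched in a few lines. This is my own toy illustration of the pattern, not code from Fuego or any engine mentioned here; the names (`BatchedEvaluator`, `THRESHOLD_T`, etc.) are hypothetical:

```python
from collections import deque
from dataclasses import dataclass

BATCH_SIZE = 4        # a real engine would use something like 128
THRESHOLD_T = 40      # playouts before a node requests slow knowledge

@dataclass
class Node:
    position: object
    playouts: int = 0
    priors: object = None   # filled in by the slow (CNN) evaluator

class BatchedEvaluator:
    """Queue nodes until a full minibatch is ready, then evaluate at once."""
    def __init__(self, net_forward):
        self.net_forward = net_forward   # batch evaluator: [pos] -> [priors]
        self.pending = deque()

    def visit(self, node):
        node.playouts += 1
        # request slow knowledge exactly once, when the threshold is crossed
        if node.playouts == THRESHOLD_T:
            self.pending.append(node)
        self.flush_if_full()

    def flush_if_full(self):
        while len(self.pending) >= BATCH_SIZE:
            batch = [self.pending.popleft() for _ in range(BATCH_SIZE)]
            priors = self.net_forward([n.position for n in batch])
            for n, p in zip(batch, priors):
                n.priors = p   # tree code would mix this into the move bias

# toy "network": uniform priors over 361 moves
uniform = lambda positions: [[1 / 361] * 361 for _ in positions]
ev = BatchedEvaluator(uniform)
nodes = [Node(position=i) for i in range(4)]
for _ in range(THRESHOLD_T):
    for n in nodes:
        ev.visit(n)
print(all(n.priors is not None for n in nodes))   # True
```

Until the batch fills, the tree keeps searching with its fast knowledge, which is exactly the latency gap David's numbers (40 to 200 playouts through a node) quantify.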

It would be wonderful if some of us could work on an open source network evaluator to integrate with Fuego (or pachi or oakfoam). I know that Clark and Storkey are planning to open source theirs, but not in the very near future. I do not know about the plans of the Google DeepMind group, but they do mention something about a strong Go program in their paper :)

Martin

Aja Huang

Dec 20, 2014, 6:16:48 PM
to compu...@computer-go.org
Hi Hiroshi,

On Sat, Dec 20, 2014 at 3:31 AM, Hiroshi Yamashita <y...@bd.mbn.or.jp> wrote:
In page 6, "approximately 5,000 rollouts per move"
Christopher Clark's CNN used Fuego with 10 seconds a move, 2 threads on
an Intel i7-4702MQ. So maybe it is about 40,000 rollouts per move.

I ran Fuego (latest svn version) on our machine (Intel(R) Xeon(R) CPU E5-2687W 0 @ 3.10GHz) for 10 secs with the following config in the empty position (I actually had to manually play 4 moves at the corners to take Fuego out of book, but the speed should be similar anyway):

boardsize 19
uct_param_search number_threads 2
uct_param_search lock_free 1
uct_max_memory 1000000000
uct_param_player reuse_subtree 0
uct_param_player ponder 0
uct_param_player resign_threshold 0.1
go_param_rules capture_dead 1
go_rules kgs

Count          16648
GamesPlayed    16648
Nodes          4989633
Time           8
GameLen        435.4 dev=24.7 min=361.0 max=543.0
InTree         6.9 dev=1.9 min=0.0 max=18.0
Aborted        0%
Games/s        2089.3

It's more than 5000 playouts but less than 20k. Which version of Fuego did you run? Did I set anything wrong? I would appreciate it if you could help me run Fuego with the strongest and fastest settings.

Regards,
Aja

hughperkins2

Dec 20, 2014, 6:25:13 PM
to compu...@computer-go.org
Aja wrote:
> We haven't measured that but I think "move history" is an important feature since Go is very much about answering the opponent's last move locally (that's also why in Go we have the term "tenuki" for not answering the last move).

I guess you could get some measure of the importance by looking at the weights? 

Aja Huang

Dec 20, 2014, 7:11:56 PM
to compu...@computer-go.org
Hi Mark,

2014-12-20 19:17 GMT+00:00 Mark Wagner <wagner...@gmail.com>:
Thanks for sharing. I'm intrigued by your strategy for integrating
with MCTS. It's clear that latency is a challenge for integration. Do
you have any statistics on how many searches new nodes had been
through by the time the predictor comes back with an estimation? Did
you try any prefetching techniques? Because the CNN will guide much of
the search at the frontier of the tree, prefetching should be
tractable. 
Did you do any comparisons between your MCTS with and w/o CNN? That's
the direction that many of us will be attempting over the next few
months it seems :)

I'm glad you like the paper and are considering an attempt. :)
Thanks for the interesting suggestions.

Regards,
Aja

Hiroshi Yamashita

Dec 20, 2014, 10:04:01 PM
to compu...@computer-go.org
Hi Aja,

> It's more than 5000 playouts but less than 20k. Which version

I tried Fuego 1.1(2011, Windows version) on Intel Core i3 540,
2 cores 4 thread. 3.07GHz.
I played first 4 moves randomly, and next 4 moves are

GamesPlayed 28952, Time 6.8, Games/s 4249.8
GamesPlayed 28750, Time 6.8, Games/s 4249.4
GamesPlayed 42853, Time 9.7, Games/s 4416.4
GamesPlayed 32541, Time 7.5, Games/s 4357.0

Average is 33000 playout/move.

config.txt
-----------------------------------


uct_param_search number_threads 2
uct_param_search lock_free 1

uct_param_player reuse_subtree 1
uct_param_player ponder 0
-----------------------------------
"C:\Program Files\Fuego\fuego.exe" --config config.txt

I did not add "go_param timelimit 10", because 10sec is default.

i7-4702MQ is 4 cores, 8 threads. 2.2 GHz, 3.2 GHz(Turbo Boost)
I'm not sure its speed and whether it used 3.2GHz, but I think
turbo boost is on when 2 threads.

I summed Fuego's cpu time from their first 10 sgfs.
Total is 2861 moves, cpu time 21443 sec.
21443 / (2861/2) = 15.0 sec / move
It is over 10 sec. A bit strange.

Regards,
Hiroshi Yamashita



Aja Huang

Dec 21, 2014, 3:26:47 AM
to compu...@computer-go.org
2014-12-21 3:02 GMT+00:00 Hiroshi Yamashita <y...@bd.mbn.or.jp>:
I tried Fuego 1.1(2011, Windows version) on Intel Core i3 540,
2 cores 4 thread. 3.07GHz.

Thanks. You remind me we should write Fuego's version as "1.1.SVN" rather than "1.1". In Clark's paper they tested against Fuego 1.1. So the reason why our 12-layer CNN is about 300 Elo stronger than their best CNN when playing against GnuGo but only 100+ Elo stronger when playing against Fuego, is that we tested against different versions of Fuego.

I'm going to test our CNN against Fuego 1.1 for more precise comparison.

Aja

David Silver

Dec 22, 2014, 7:38:39 AM
to compu...@computer-go.org
Hi Martin

- Would you be willing to share some of the sgf game records played by your network with the community? I tried to replay the game record in your paper, but got stuck since it does not show any of the moves that got captured.

Sorry about that, we will correct the figure and repost. In the meanwhile Aja will post the .sgf for that game. Also, thanks for noticing that we tested against a stronger version of Fuego than Clark and Storkey, we'll evaluate against Fuego 1.1 and post the results. Unfortunately, we only have approval to release the material in the paper, so we can't really give any further data :-(
 
One more thing, Aja said he was tired when he measured his own performance on KGS predictions (working too hard on this paper!) So it would be great to get better statistics on how humans really do at predicting the next move. Does anyone want to measure their own performance, say on 200 randomly chosen positions from the KGS data?

- Do you know how large is the effect from using the extra features that are not in the paper by Clark and Storkey, i.e. the last move info and the extra tactics? As a related question, would you get an OK result if you just zeroed out some inputs in the existing net, or would you need to re-train a new network from fewer inputs?

We trained our networks before we knew about Clark and Storkey's results, so we haven't had a chance to evaluate the differences between the approaches. But it's well known that last move info makes a big difference to predictive performance, so I'd guess they would already be close to 50% predictive accuracy if they included those features.
 
- Is there a way to simplify the final network so that it is faster to compute and/or easier to understand? Is there something computed, maybe on an intermediate layer, that would be usable as a new feature in itself?

This is an interesting idea, but so far we only focused on building a large and deep enough network to represent Go knowledge at all.

Cheers
Dave

Stefan Kaitschick

Dec 22, 2014, 9:46:26 AM
to compu...@computer-go.org
Last move info is a strange beast, isn't it? I mean, except for ko captures, it doesn't really add information to the position. The correct prediction rate is such an obvious metric, but maybe prediction shouldn't be improved at any price. To a certain degree, last move info is a kind of self-delusion. A predictor that does well without it should be a lot more robust, even if the percentages are poorer.

Stefan

Thomas Wolf

Dec 22, 2014, 10:16:04 AM
to compu...@computer-go.org
Last move info is a cheap hint for an unstable area (unless it is a defense
move).

Thomas

Brian Sheppard

Dec 22, 2014, 10:23:56 AM
to compu...@computer-go.org

I wondered that too, because the search tree frequently reaches positions by transpositions.

 

Only testing would say for sure. And even then, YMMV.

 


 

Petr Baudis

Dec 22, 2014, 3:04:17 PM
to compu...@computer-go.org

Let's be pragmatic - humans heavily use the information about the last
move too. If they take a while, they don't need to know the opponent's
last move when reviewing a position, but when reading a tactical
sequence, the previous move in the sequence is an essential piece of
information.

--
Petr Baudis
If you do not work on an important problem, it's unlikely
you'll do important work. -- R. Hamming
http://www.cs.virginia.edu/~robins/YouAndYourResearch.html

Aja Huang

Dec 23, 2014, 9:47:16 AM
to compu...@computer-go.org
On Mon, Dec 22, 2014 at 12:38 PM, David Silver <davidst...@gmail.com> wrote:
we'll evaluate against Fuego 1.1 and post the results. 

I quickly tested our 12-layer CNN against Fuego 1.1 with 5 secs and 10 secs per move, 2 threads. The hardware is Intel(R) Xeon(R) CPU E5-2687W 0 @ 3.10GHz.

5 secs per move     12-layer CNN scored 55.8% ±5.4
10 secs per move   12-layer CNN scored 32.9% ±3.8

Fuego 1.1 is clearly much weaker than the latest svn release. And interestingly, the network is actually as strong as Fuego 1.1 at 5 secs per move.

Since Clark and Storkey's CNN scored 12% against Fuego 1.1 running on weaker hardware, our best CNN is about 220 to 310 Elo stronger, which is consistent with the results against GnuGo.
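The lower end of that Elo range can be checked with the standard logistic Elo model (my own back-of-the-envelope arithmetic, not from the paper): convert each win rate against Fuego 1.1 into an implied Elo difference and subtract.

```python
import math

def elo_diff(p):
    """Elo advantage implied by a win rate p under the logistic model."""
    return 400 * math.log10(p / (1 - p))

clark_storkey = elo_diff(0.12)   # about -346 Elo vs Fuego 1.1
deepmind_10s = elo_diff(0.329)   # about -124 Elo vs Fuego 1.1 (10 s/move)
print(round(deepmind_10s - clark_storkey))   # 222, matching "about 220"
```

The 310 end of the range presumably folds in the hardware and time-setting differences, which this simple model does not capture.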

Regards,
Aja

Hiroshi Yamashita

Dec 23, 2014, 10:24:48 AM
to compu...@computer-go.org
Hi Aja,

Thanks for the game and report.
I saw the sgf; the CNN can play a ko fight. Great.

> our best CNN is about 220 to 310 Elo stronger which is consistent

A deeper network and rich info make +300 Elo? Impressive.
Aja, if your CNN+MCTS used Erica's playouts, how strong would it be?
I think it would be a contender for strongest program.

I also wonder whether Fuego could release the latest version as 1.2,
and use odd numbers 1.3.x for development.

Regards,
Hiroshi Yamashita

Hideki Kato

Dec 23, 2014, 7:10:51 PM
to compu...@computer-go.org
Hiroshi Yamashita: <37E4294EAD9142EA84D1031F3E1E9C7C@x60>:

>Hi Aja,
>
>Thanks for a game and report.
>I saw sgf, CNN can play ko fight. great.
>
>> our best CNN is about 220 to 310 Elo stronger which is consistent
>
>Deeper network and rich info makes +300 Elo? impressive.
>Aja, if your CNN+MCTS use Erica's playout, how strong will it be?
>I think it will be contender for strongest program.

The playing strength of an MCTS program is dominated by the
correctness of the simulations, especially of L&D. Prior knowledge
helps a little. David pointed out after the first Densei-sen (almost
three years ago):
>All mcts programs have trouble with the positions near the end. The group
>in the center has miai for two eyes. Same for the group at the top. The
>upper left side group has one big eye shape. For all three groups the
>playouts sometimes kill them. The black stones are pretty solid, so the
>playouts let them survive. So even at the end, Zen has a 50% win rate,
>MFGO has 60%, and Pachi has a 70% win rate for Black.

Without improving the correctness of the simulations, MCTS programs
can't move up to the next stage.

Hideki
--
Hideki Kato <mailto:hideki...@ybb.ne.jp>

uurtamo .

Dec 23, 2014, 7:33:54 PM
to computer-go

I thought that any layers beyond 3 were irrelevant. Probably I'm subsuming your nn into what I learned about nn's and didn't read anything carefully enough.

Can you help correct me?

s.

Brian Sheppard

Dec 23, 2014, 8:50:26 PM
to compu...@computer-go.org

A 3-layer network (input, hidden, output) is sufficient to be a universal function approximator, so from a theoretical perspective only 3 layers are necessary. But the gap between theoretical and practical is quite large.

 

The CNN architecture builds in translation invariance and sensitivity to local  phenomena. That gives it a big advantage (on a per distinct weight basis) over the flat architecture.

 

Additionally, the input layers of these CNN designs are very important. Compared to a stone-by-stone representation, the use of high level concepts in the input layer allows the network to devote its capacity to advanced concepts rather than synthesizing basic concepts.

 


hughperkins2

Dec 23, 2014, 11:14:18 PM
to compu...@computer-go.org
Whilst it's technically true that you can use an NN with one hidden layer to learn the same function as a deeper net, you might need a combinatorially large number of nodes :-)

"Scaling learning algorithms towards AI", by Bengio and LeCun, 2007, makes a convincing case along these lines.

Detlef Schmicker

Dec 25, 2014, 5:00:22 AM
to compu...@computer-go.org
Hi,

as I want to buy a graphics card for CNNs: do I need double precision
performance? I'm giving caffe (http://caffe.berkeleyvision.org/) a try, and
as far as I understood, most is done in single precision?!

You can get comparable single-precision performance from NVIDIA (as caffe
uses CUDA, I'm looking at NVIDIA) for about $340, but the double-precision
performance is 10x smaller than on the $1000 cards.

thanks a lot

Detlef

Álvaro Begué

Dec 25, 2014, 5:17:00 AM
to computer-go
No, you don't need double precision at all.

Álvaro.

hughperkins2

Dec 25, 2014, 5:45:23 AM
to compu...@computer-go.org
> as I want to buy a graphics card for CNN: do I need double precision
> performance?

Personally, i was thinking of experimenting with ints, bytes, and shorts, even less precise than singles :-)

Álvaro Begué

Dec 25, 2014, 7:41:57 AM
to computer-go
You are going to be computing gradients of functions, and most people find it easier to think about these things using a type that roughly corresponds to the notion of real number. You can use a fixed-point representation of reals, which uses ints in the end, but then you have to worry about what scale to use, so you get enough precision but you don't run the risk of overflowing.

The only reason I might consider a fixed-point representation is to achieve reproducibility of results.
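Álvaro's scale/overflow point is easy to make concrete. A tiny illustration (my own sketch, not his code) of a Q8 fixed-point representation: reals are stored as integers scaled by 2**8, which gives exactly reproducible arithmetic at the cost of range and precision.

```python
FRAC_BITS = 8
SCALE = 1 << FRAC_BITS          # 256 steps per unit

def to_fixed(x):
    # real -> Q8 integer (rounding to the nearest representable value)
    return round(x * SCALE)

def fixed_mul(a, b):
    # the product of two Q8 numbers carries 16 fractional bits;
    # shift right to rescale back to Q8
    return (a * b) >> FRAC_BITS

def to_float(a):
    return a / SCALE

w, x = to_fixed(0.5), to_fixed(-1.25)
print(to_float(fixed_mul(w, x)))   # -0.625, exactly representable in Q8
```

Picking FRAC_BITS is exactly the trade-off Álvaro describes: more fractional bits means more precision but a smaller range before the integer overflows.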




David Fotland

Dec 25, 2014, 5:48:20 PM
to compu...@computer-go.org
You can do some GPU experiments on Amazon AWS before you buy. 65 cents per hour

David

http://aws.amazon.com/ec2/instance-types/

G2
This family includes G2 instances intended for graphics and general purpose GPU compute applications.
Features:

High Frequency Intel Xeon E5-2670 (Sandy Bridge) Processors
High-performance NVIDIA GPU with 1,536 CUDA cores and 4GB of video memory

GPU Instances - Current Generation
g2.2xlarge $0.650 per Hour


Hugh Perkins

Dec 25, 2014, 8:50:08 PM
to compu...@computer-go.org
Hi Aja,

Couple of questions:

1. connectivity, number of parameters

Just to check: each filter connects to all the feature maps below it,
is that right? I tried to check that by ball-park estimating the number
of parameters in that case, and comparing it to the relevant paragraph in
your section 4. That seems to support the hypothesis. But my estimate
somehow under-estimates the number of parameters, by about 20%:

Estimated total number of parameters
approx = 12 layers * 128 filters * 128 previous featuremaps * 3 * 3 filtersize
= 1.8 million

But you say 2.3 million. It's similar, so it seems the feature maps are
fully connected to the lower-level feature maps, but I'm not sure where
the extra 500,000 parameters come from?

2. Symmetry

Aja, you say in section 5.1 that adding symmetry does not modify the
accuracy, neither higher nor lower. Since adding symmetry presumably
reduces the number of weights, and therefore increases learning speed,
why did you decide not to implement symmetry?

Hugh

Álvaro Begué

unread,
Dec 25, 2014, 9:41:48 PM12/25/14
to computer-go
This is my guess as to what the number of parameters actually is:
First layer: 128 * (5*5*36 + 19*19) (128 filters of size 5x5 on 36 layers of input, position-dependent biases)
11 hidden layers: 11 * 128 * (3*3*128 + 19*19) (128 filters of size 3x3 on 128 layers of input, position-dependent biases)
Final layer: 2 *(3*3*128 + 19*19) (2 filters of size 3x3 on 128 layers of input, position-dependent biases)

Total number of parameters: 2294738

Did I get that right?

I have the same question about the use of symmetry as Hugh.

Álvaro.

Aja Huang

unread,
Dec 26, 2014, 7:40:09 PM12/26/14
to compu...@computer-go.org
Hi Hugh,

On Fri, Dec 26, 2014 at 9:49 AM, Hugh Perkins <hughp...@gmail.com> wrote:
Estimated total number of parameters
approx = 12 layers * 128 filters * 128 previous featuremaps * 3 * 3 filtersize
= 1.8 million

But you say 2.3 million. It's similar, so it seems the feature maps are
fully connected to the lower-level feature maps, but I'm not sure where
the extra 500,000 parameters come from?

You may have forgotten to include the position dependent biases. This is how I computed the number of parameters

1st layer + 11*middle layers + final layer + 12*middle layer bias + output bias

5*5*36*128 + 3*3*128*128*11 + 3*3*128*2 + 128*19*19*12 + 2*19*19 = 2,294,738
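For anyone checking the arithmetic, the same tally as a few lines of Python (layer sizes exactly as given in the thread):

```python
# Parameter count for the 12-layer network described in the thread.
first  = 5 * 5 * 36 * 128          # 1st layer: 5x5 filters over 36 input planes
middle = 3 * 3 * 128 * 128 * 11    # 11 middle layers: 3x3 filters, 128 -> 128 maps
final  = 3 * 3 * 128 * 2           # final layer: 2 filters of 3x3 over 128 maps
biases = 128 * 19 * 19 * 12 + 2 * 19 * 19  # position-dependent biases
total = first + middle + final + biases
print(total)  # 2294738
```

The 128*19*19*12 bias term is the missing ~500,000 parameters in Hugh's estimate.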
 
2. Symmetry

Aja, you say in section 5.1 that adding symmetry does not modify the
accuracy, neither higher or lower.  Since adding symmetry presumably
reduces the number of weights, and therefore increases learning speed,
why did you thus decide not to implement symmetry?

We were doing exploratory work that optimized performance, not training time, so we don't know how symmetry affects training time. In terms of performance it seems not to have an effect.

Aja

Kahn Jonas

unread,
Jan 8, 2015, 6:18:32 PM1/8/15
to compu...@computer-go.org
The discussion on move evaluation via CNNs got me wondering: has anyone
tried to make an evaluation function with CNNs ?

I mean, it's hard to really combine a CNN move estimator with a tree
search: you still need something to tell you what the best leaf is. Given
the state of the art, the reflex is to use it for move ordering in the
tree for MCTS.
But given how strong the no-look-ahead player is, it might be
interesting to have a CNN generate an evaluation instead of a move, and
then use alpha-beta and its refinements.

We probably don't want to train on the final score, even if the full
probability distribution is interesting; in particular, since many games
end with resignation, we have missing data, and it's certainly not
independent of the resignation itself.

Rather, take a leaf from MCTS and just predict one or zero, the
evaluation function being the probability assigned to the result.
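Training against a one-or-zero result amounts to a sigmoid output trained with cross-entropy; a minimal sketch (Python, hypothetical function names):

```python
import math

def win_probability(logit):
    # Squash the network's final scalar output into a probability of winning.
    return 1.0 / (1.0 + math.exp(-logit))

def training_loss(logit, won):
    # Cross-entropy against the 1/0 game result. Minimizing this makes the
    # sigmoid output a calibrated win probability, directly usable as the
    # evaluation at a leaf of the search tree.
    p = win_probability(logit)
    return -(won * math.log(p) + (1 - won) * math.log(1.0 - p))
```

The loss pushes the logit up after wins and down after losses, so at convergence the output estimates P(win | position), which is exactly the quantity an alpha-beta or MCTS backup wants.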

Maybe a system could be found to guarantee that the move predicted by
the move predictor (with the rank input set to 9d, as in Aja's technique)
gets the highest probability of winning. (Perhaps by training on the
boards with all the alternative moves?)

OK, food for thought.

Jonas

Darren Cook

unread,
Jan 9, 2015, 6:00:53 PM1/9/15
to compu...@computer-go.org
> The discussion on move evaluation via CNNs got me wondering: has anyone
> tried to make an evaluation function with CNNs ?

My first thought was a human can find good moves with a glance at a
board position, but even the best pros need to both count and use search
to work out the score. So NNs good for move candidate generation, MCTS
good for scoring?

Darren


--
Darren Cook, Software Researcher/Developer
My new book: Data Push Apps with HTML5 SSE
Published by O'Reilly: (ask me for a discount code!)
http://shop.oreilly.com/product/0636920030928.do
Also on Amazon and at all good booksellers!

Darren Cook

unread,
Jan 9, 2015, 6:05:13 PM1/9/15
to compu...@computer-go.org
Aja wrote:
>> I hope you enjoy our work. Comments and questions are welcome.

I've just been catching up on the last few weeks, and their papers. Very
interesting :-)

I think Hiroshi's questions got missed?

Hiroshi Yamashita asked on 2014-12-20:
> I have three questions.
>
> I don't understand minibatch. Does the CNN need 0.15 sec for one position, or
> 0.15 sec for 128 positions?

I also wasn't sure what "minibatch" meant. Why not just say "batch"?

> Is "KGS rank" set 9 dan when it plays against Fuego?

For me, the improvement from just using a subset of the training data
was one of the most surprising results.

Kahn Jonas

unread,
Jan 9, 2015, 6:12:00 PM1/9/15
to compu...@computer-go.org
>> Is "KGS rank" set 9 dan when it plays against Fuego?
>
> For me, the improvement from just using a subset of the training data
> was one of the most surprising results.

As far as I can tell, they use ALL the training data. That's the point.
They filter by dan, so the CNN must then have less confidence in a 1-dan
game than in a 9-dan game when predicting a 9-dan game, but the
information is still used.
The correlation will be nonzero, and will depend on the situation, too.
The CNN sees that.

Jonas

Álvaro Begué

unread,
Jan 9, 2015, 10:33:32 PM1/9/15
to computer-go
Yes, it's 0.15 seconds for 128 positions.

A minibatch is a small set of samples that is used to compute an approximation to the gradient before you take a step of gradient descent. I think it's not simply called a "batch" because "batch training" refers to computing the full gradient with all the samples before you take a step of gradient descent. "Minibatch" is standard terminology in the NN community.
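The distinction can be sketched in a few lines of Python (illustrative only; the data, shapes, and learning rate are made up):

```python
import numpy as np

# Toy regression data standing in for training examples.
rng = np.random.default_rng(0)
X = rng.normal(size=(1024, 8))
true_w = rng.normal(size=8)
y = X @ true_w + 0.1 * rng.normal(size=1024)

w = np.zeros(8)
lr, batch_size = 0.01, 128          # one "minibatch" = 128 samples

for epoch in range(10):
    idx = rng.permutation(len(X))   # fresh shuffle each epoch
    for start in range(0, len(X), batch_size):
        b = idx[start:start + batch_size]
        # Gradient of mean squared error estimated on the minibatch only;
        # "batch" training would instead use all 1024 samples per step.
        grad = 2.0 * X[b].T @ (X[b] @ w - y[b]) / len(b)
        w -= lr * grad
```

So 0.15 s per minibatch of 128 positions means one gradient step, not one position, per forward/backward pass.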

Álvaro.




Stefan Kaitschick

unread,
Jan 10, 2015, 4:08:17 AM1/10/15
to compu...@computer-go.org
To me, that's the core lesson of MCTS - take your hands off that evaluation button.

Stefan Kaitschick

unread,
Jan 10, 2015, 4:23:25 AM1/10/15
to compu...@computer-go.org

Let's be pragmatic - humans heavily use the information about the last
move too.  If they take a while, they don't need to know the last move
of the opponent when reviewing a position, but when reading a tactical
sequence the previous move in the sequence is an essential piece of
information.

--
                                Petr Baudis    

So maybe there is a greater justification for using last move info in the playout than in the tree?

Stefan

Hugh Perkins

unread,
Jan 10, 2015, 10:00:41 PM1/10/15
to compu...@computer-go.org
On 12/27/14, Aja Huang <ajah...@google.com> wrote:
> We were doing exploratory work that optimized performance, not training
> time, so we don't know how symmetry affects training time. In terms of
> performance it seems not to have an effect.

You are using 3x3. Clark and Storkey are using 5x5 (section 4.2,
first sentence). So each of your 3x3 filters contains about the same
amount of information (9 weights) as Clark and Storkey's symmetrical
5x5 filters (triangle 3+2+1 = 6 weights). If you made your 3x3
filters symmetrical, each would have only 3 weights, which is perhaps a
bit small?

I think an interesting question could be: is it better to have a
symmetrical 5x5 (6 weights) or an unconstrained 3x3 (9 weights)?
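The weight counts quoted above are just the number of orbits of filter positions under the 8 symmetries of the square; a quick check (sketch, hypothetical helper name):

```python
def symmetric_weight_count(n):
    """Free weights in an n x n filter constrained to be invariant under
    the 8 symmetries of the square: one weight per orbit of positions."""
    def orbit(i, j):
        pts = set()
        for a, b in ((i, j), (j, i)):          # identity and transpose
            for p in (a, n - 1 - a):           # vertical flip
                for q in (b, n - 1 - b):       # horizontal flip
                    pts.add((p, q))
        return frozenset(pts)
    return len({orbit(i, j) for i in range(n) for j in range(n)})

print(symmetric_weight_count(3), symmetric_weight_count(5))  # 3 6
```

For 3x3 the orbits are center, edge-middles, and corners (3 weights); for 5x5 they form the triangle 3+2+1 = 6 that Hugh mentions.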

Aja Huang

unread,
Jan 11, 2015, 4:53:47 PM1/11/15
to compu...@computer-go.org
2015-01-09 23:04 GMT+00:00 Darren Cook <dar...@dcook.org>:
Aja wrote:
>> I hope you enjoy our work. Comments and questions are welcome.

I've just been catching up on the last few weeks, and their papers. Very
interesting :-)

I think Hiroshi's questions got missed?

Darren Cook

unread,
Jan 11, 2015, 5:33:40 PM1/11/15
to compu...@computer-go.org
> Is "KGS rank" set 9 dan when it plays against Fuego?

Aja replied:
> Yes.

I'm wondering if I've misunderstood, but does this mean it is the same
as just training your CNN on the 9-dan games, and ignoring all the 8-dan
and weaker games? (Surely the benefit of seeing more positions outweighs
the relatively minor difference in pro player strength??)

Darren

P.S.

Thanks Aja! It seems you wrote three in a row, and I only got the first
one. I did a side-by-side check from Dec 15 to Dec 31, and I got every
other message. So perhaps it was just a problem on my side, for those
two messages.

Hugh Perkins

unread,
Jan 11, 2015, 6:46:23 PM1/11/15
to compu...@computer-go.org
Darren wrote:
> I'm wondering if I've misunderstood, but does this mean it is the same
as just training your CNN on the 9-dan games, and ignoring all the 8-dan
and weaker games? (Surely the benefit of seeing more positions outweighs
the relatively minor difference in pro player strength??)

It's just additional data fed into the neural net (via 9 full
layers, in fact :-O), so the net can decide to what extent the data it
saw in 2-dan or 1-dan games is useful for predicting the next move
in 9-dan games.

Petr Baudis

unread,
Jan 11, 2015, 8:41:39 PM1/11/15
to compu...@computer-go.org
Hi!

It turns out that due to mail server misconfiguration, three of Aja
Huang's emails on Dec 20 were not delivered to most or all subscribers:

http://computer-go.org/pipermail/computer-go/2014-December/007061.html
http://computer-go.org/pipermail/computer-go/2014-December/007062.html
http://computer-go.org/pipermail/computer-go/2014-December/007063.html

Please read them via the web archive, and my sincere apologies.


Thanks to Darren Cook + Aja Huang for noticing:

On Sun, Jan 11, 2015 at 10:32:53PM +0000, Darren Cook wrote:
> P.S.
>
> > I did answer Hiroshi's questions.
> >
> > http://computer-go.org/pipermail/computer-go/2014-December/007063.html
>
> Thanks Aja! It seems you wrote three in a row, and I only got the first
> one. I did a side-by-side check from Dec 15 to Dec 31, and I got every
> other message. So perhaps it was just a problem on my side, for those
> two messages.


P.S.: What happened? My home server pasky.or.cz was offline on Dec 20
between 13:57 and ~15:30 UTC for some hardware upgrades - related to my
other project https://github.com/brmson/yodaqa ;-). Unfortunately, the
computer-go.org mail server did not have a proper reverse DNS record
for its IP address configured early on, so to enable reliable delivery
I had to configure relaying of all email via my server pasky.or.cz;
I used the `relayhost = pasky.or.cz` postfix directive.
Unfortunately, that turns out not to configure relaying via pasky.or.cz,
but via pasky.or.cz's MX - which is typically pasky.or.cz again so it
would appear to work, except when pasky.or.cz was down at that time.
The backup MX engine.or.cz didn't know anything about the relay
arrangement and so obviously refused to relay any of those mailing list
emails and they were discarded with a permanent delivery error (except
the first one for at least some people, since pasky.or.cz was actually
in the middle of shutdown when this one was being relayed).
I have now fixed the error, the lesson is to use `relayhost
= [pasky.or.cz]` to really relay to a host instead of its MX records.
No other emails were lost due to this problem, as far as I can grep.

P.P.S.: It seems that computer-go.org's reverse DNS record actually
did get fixed by now, so I should be able to remove the relay hack when
time permits.

--
Petr Baudis
If you do not work on an important problem, it's unlikely
you'll do important work. -- R. Hamming
http://www.cs.virginia.edu/~robins/YouAndYourResearch.html
