> We've just submitted our paper to ICLR. We made the draft available at
> http://www.cs.toronto.edu/~cmaddis/pubs/deepgo.pdf
>
> I hope you enjoy our work. Comments and questions are welcome.
I did not look at the go content, on which I'm no expert.
But for the network training, you might be interested in these articles:
«Riemannian metrics for neural networks I and II» by Yann Ollivier:
http://www.yann-ollivier.org/rech/publs/gradnn.pdf
http://www.yann-ollivier.org/rech/publs/pcnn.pdf
He defines invariant metrics on the parameters that are much lighter to
compute than the natural gradient, and usually gets (very, very) much
faster convergence, in a much more robust way, since the method does not
depend on the parametrisation or the activation function.
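For readers new to the idea, here is the generic shape of such a preconditioned update (this is not Ollivier's construction, just a toy diagonal metric on a badly scaled quadratic, to show why a metric-aware step beats a plain one):

```python
import numpy as np

# Badly scaled quadratic: f(w) = 0.5 * sum_i scales[i] * w[i]^2
scales = np.array([1.0, 100.0, 0.01])

def grad(w):
    return scales * w

w_plain = np.ones(3)
w_metric = np.ones(3)
metric_inv = 1.0 / scales   # made-up diagonal "metric" matching the curvature
for _ in range(50):
    # plain gradient step: learning rate is capped by the stiffest direction
    w_plain -= 0.009 * grad(w_plain)
    # preconditioned step: one learning rate works for all directions
    w_metric -= 0.5 * metric_inv * grad(w_metric)

print(np.linalg.norm(w_plain), np.linalg.norm(w_metric))
```

The preconditioned iterate is essentially at the optimum after 50 steps, while the plain one has barely moved along the flat direction.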
Jonas
_______________________________________________
Computer-go mailing list
Compu...@computer-go.org
http://computer-go.org/mailman/listinfo/computer-go
Hi Aja,
Wow, very impressive. In fact so impressive, it seems a bit
suspicious(*)... If this is real then one might wonder what it means
about the game. Can it really be that simple? I always thought that
even with perfect knowledge there should usually still be a fairly
broad set of equal-valued moves. Are we perhaps seeing that most of
the time players just reproduce the same patterns over and over again?
Do I understand correctly that your representation encodes the
location of the last 5 moves? If so, do you have an idea how much
extra performance that provides, compared to using only the last move
or not using it at all?
Thanks for sharing the paper!
Best,
Erik
* I'd really like to see some test results on pro games that are newer
than any of your training data.
> We've just submitted our paper to ICLR. We made the draft available at
> http://www.cs.toronto.edu/~cmaddis/pubs/deepgo.pdf
97.2% against GNU Go?! Accuracy is 55%?! Incredible!
Thanks for the paper!
But it looks like the playing strength is similar to Clark's CNN.
MCTS with a CNN is interesting. But 67% for the CNN (0 playouts) vs. 10,000
playouts seems small. Maybe the playouts are weak? I'm curious whether the
playouts use the CNN.
On page 6: "approximately 5,000 rollouts per move".
Christopher Clark's CNN used Fuego with 10 seconds per move, 2 threads on
an Intel i7-4702MQ. So maybe that is about 40,000 rollouts per move.
Regards,
Hiroshi Yamashita
Cool... just out of curiosity, I did a back-of-the-envelope estimate of the
cost of training your network and Clark and Storkey's, if renting time
on AWS GPU instances, and came up with:
- Clark and Storkey: 125 USD (4 days * 2 instances * 0.65 USD/hour)
- Yours: 2025 USD (cost of Clark and Storkey * 25/7 epochs *
29.4/14.7 action pairs * 12/8 layers)
Probably a bit high for me personally to just spend on one weekend of fun,
but not outrageous at all, in fact, if the same technique were being used by
an organization.
Such can only mean an improper understanding of positional judgement.
Positional judgement depends on reading (or MC simulation of reading),
but the reading has a much smaller computational complexity because
localisation and quiescence apply.
The major aspects of positional judgement are territory and influence.
Evaluating influence is much easier than evaluating territory if one
uses a partial influence concept: the influence stone difference. Its major
difficulty is knowing which stones are alive or not; however,
MC simulations applied to the outside stones should be able to assess that
with reasonable certainty fairly quickly. Hence, the major work of
positional judgement is the assessment of territory. See my book Positional
Judgement 1 - Territory for that. By designing MC for its methods
(heuristically or using a low-level expert system), territorial positional
judgement by MC should be much faster than ordinary MC, because far
fewer simulations should suffice. However, it is not as elegant as ordinary
MC, because some expert knowledge is necessary or must be approximated
heuristically. Needless to say, keep the computational complexity of
this expert knowledge low.
--
robert jasiek
It is, but I do not think that this is necessarily a feature of NNs.
NNs might be good evaluators, but it is much easier to train them as
a move predictor, since it is not easy to get training data sets for an
evaluation function?!
Detlef
P.S.: As we all might be trying to start incorporating NNs into our
engines, we might bundle our resources, at least for the first start?!
Maybe exchange links to open-source software for NNs. I personally would
have started trying NNs some time ago if iOS had OpenCL support, as my
aim is to get a strong iPad go program....
> I hope you enjoy our work. Comments and questions are welcome.
I have three questions.
I don't understand the minibatch.
Does the CNN need 0.15 sec for one position, or 0.15 sec for 128 positions?
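For what it's worth, the two readings differ a lot (a back-of-envelope sketch; which reading is right is exactly the question):

```python
# The 0.15 s figure is the one from the question above.
batch_size = 128
batch_time = 0.15                      # seconds per forward pass

# Reading 1: 0.15 s for a minibatch of 128 -> high throughput
per_position_amortized = batch_time / batch_size
# Reading 2: 0.15 s per position -> a 128-position batch would be slow
batch_if_per_position = batch_time * batch_size

print(f"{per_position_amortized * 1000:.2f} ms/position amortized")  # 1.17
print(f"{batch_if_per_position:.1f} s for 128 positions")            # 19.2
```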
ABCDEFGHJ
9......... White(O) to move.
8...OO.... Previous Black move is H5(X)
7..XXXOO..
6.....XXO.
5.......X.
4.........
3....XO...
2....OX...
1.........
ABCDEFGHJ
"Liberties after move" means
H7(O) is 5, F8(O) is 6.
"Liberties" means
H5(X) is 3, H6(O) is 2.
"Ladder move" means
G2(O), not E6(O).
Is this correct?
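To make the feature definitions concrete, here is a small liberty counter (my own sketch, not code from the paper) applied to the position above; it reproduces the "Liberties" counts for H5 and H6, and the "liberties after move" count for H7:

```python
def liberties(board, row, col):
    """board: list of strings of '.', 'X', 'O'. Returns the number of
    liberties of the group containing (row, col), via flood fill."""
    colour = board[row][col]
    assert colour in "XO"
    rows, cols = len(board), len(board[0])
    group, libs, stack = set(), set(), [(row, col)]
    while stack:
        r, c = stack.pop()
        if (r, c) in group:
            continue
        group.add((r, c))
        for dr, dc in ((1, 0), (-1, 0), (0, 1), (0, -1)):
            nr, nc = r + dr, c + dc
            if 0 <= nr < rows and 0 <= nc < cols:
                if board[nr][nc] == '.':
                    libs.add((nr, nc))
                elif board[nr][nc] == colour:
                    stack.append((nr, nc))
    return len(libs)

# The position from the mail, row 9 first (columns A..J, no I):
board = [
    ".........",  # 9
    "...OO....",  # 8
    "..XXXOO..",  # 7
    ".....XXO.",  # 6
    ".......X.",  # 5
    ".........",  # 4
    "....XO...",  # 3
    "....OX...",  # 2
    ".........",  # 1
]
print(liberties(board, 4, 7))  # H5(X): 3
print(liberties(board, 3, 7))  # H6(O): 2

# "Liberties after move" for H7(O): place the stone, then count.
b = [list(r) for r in board]
b[2][7] = 'O'                  # H7
b = ["".join(r) for r in b]
print(liberties(b, 2, 7))      # 5
```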
Is "KGS rank" set to 9 dan when it plays against Fuego?
Regards,
Hiroshi Yamashita
You seem to be asking for abundant data sets, e.g., with triples (Position,
Territory, Influence). Indeed, only dozens are available in the
literature, and they need a bit of extra work. Hundreds of available local
joseki positions do not fit your purpose, e.g., because the Stone
Difference also matters there. However, I suggest a different approach:
1) One strong player (strong enough to be accurate to within +-1 point of
territory when using his known judgement methods) creates a few
examples, e.g., by taking the existing examples for territory and adding
the influence stone difference. It should be only one player so that the
values are created consistently. (If several players are involved, they
should discuss and agree on their application of known methods.)
2) Code is implemented and produces sample data sets.
3) The same player judges how far off the sample data are from his own
judgement.
Thereby, training does not require many thousands of data sets. Instead
it requires much of a strong player's time to accurately judge dozens of
data sets. In theory, the player could be replaced by program judgement,
but I wish happy development of the then necessary additional theory and
algorithms! ;)
As you see, I suggest human/program collaboration to accelerate program
playing strength. Maybe 9p programs can be created without strong
players' help, but then we will not understand much in terms of go
theory why the programs will excel. For getting much understanding of go
theory from programs, human/program collaboration will be necessary anyway.
--
robert jasiek
I am still fighting with the NN slang, but why do you zero-pad the
output (page 3: 4 Architecture & Training)?
From all I have read up to now, most people zero-pad the input to make the
output fit 19x19?!
Thanks for the great work
Detlef
David
> -----Original Message-----
> From: Computer-go [mailto:computer-...@computer-go.org] On Behalf
> Of Mark Wagner
> Sent: Saturday, December 20, 2014 11:18 AM
> To: compu...@computer-go.org
> Subject: Re: [Computer-go] Move Evaluation in Go Using Deep Convolutional
> Neural Networks
>
> Thanks for sharing. I'm intrigued by your strategy for integrating with
> MCTS. It's clear that latency is a challenge for integration. Do you have
> any statistics on how many searches new nodes had been through by the time
> the predictor comes back with an estimation? Did you try any prefetching
> techniques? Because the CNN will guide much of the search at the frontier
> of the tree, prefetching should be tractable.
>
> Did you do any comparisons between your MCTS with and w/o CNN? That's the
> direction that many of us will be attempting over the next few months it
> seems :)
>
> - Mark
>
> On Sat, Dec 20, 2014 at 10:43 AM, Álvaro Begué <alvaro...@gmail.com>
That's pretty much how I looked at it as well. For getting a high
prediction rate it is indeed a very useful feature, but it is unclear
to me how important that really is. Some increased tenuki power may
also have its merits. Perhaps I'll just run some experiments with
Steenvreter to see what happens.
I wonder how much worse a 6d human predictor would do without move history ;-)
Erik
> It's more than 5000 playouts but less than 20k. Which version
I tried Fuego 1.1 (the 2011 Windows version) on an Intel Core i3 540,
2 cores, 4 threads, 3.07 GHz.
I played the first 4 moves randomly, and the next 4 moves gave:
GamesPlayed 28952, Time 6.8, Games/s 4249.8
GamesPlayed 28750, Time 6.8, Games/s 4249.4
GamesPlayed 42853, Time 9.7, Games/s 4416.4
GamesPlayed 32541, Time 7.5, Games/s 4357.0
The average is about 33,000 playouts/move.
config.txt
-----------------------------------
uct_param_search number_threads 2
uct_param_search lock_free 1
uct_param_player reuse_subtree 1
uct_param_player ponder 0
-----------------------------------
"C:\Program Files\Fuego\fuego.exe" --config config.txt
I did not add "go_param timelimit 10", because 10 sec is the default.
The i7-4702MQ has 4 cores, 8 threads, 2.2 GHz (3.2 GHz with Turbo Boost).
I'm not sure of its speed and whether it used 3.2 GHz, but I think
Turbo Boost is on when running 2 threads.
I summed Fuego's CPU time from their first 10 sgfs.
The total is 2861 moves, 21443 sec of CPU time.
21443 / (2861/2) = 15.0 sec/move
That is over 10 sec; a bit strange.
Regards,
Hiroshi Yamashita
----- Original Message -----
From: "Aja Huang" <ajah...@google.com>
To: <compu...@computer-go.org>
Cc: <y...@bd.mbn.or.jp>
Sent: Sunday, December 21, 2014 8:16 AM
Subject: Re: [Computer-go] Move Evaluation in Go Using Deep Convolutional NeuralNetworks
- Would you be willing to share some of the sgf game records played by your network with the community? I tried to replay the game record in your paper, but got stuck since it does not show any of the moves that got captured.
- Do you know how large the effect is from using the extra features that are not in the paper by Clark and Storkey, i.e. the last-move info and the extra tactics? As a related question, would you get an OK result if you just zeroed out some inputs in the existing net, or would you need to re-train a new network from fewer inputs?
- Is there a way to simplify the final network so that it is faster to compute and/or easier to understand? Is there something computed, maybe on an intermediate layer, that would be usable as a new feature in itself?
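The zeroing experiment could be sketched like this (all names here are hypothetical; `predict` stands in for a trained network's forward pass, and the model and data are dummies just to show the call shape):

```python
import numpy as np

def ablation_accuracy(predict, positions, moves, plane):
    """Move-prediction accuracy of `predict` when input feature plane
    `plane` is zeroed out (no retraining)."""
    x = positions.copy()
    x[:, plane, :, :] = 0.0
    return float(np.mean(predict(x).argmax(axis=1) == moves))

# Tiny fake model and data:
rng = np.random.default_rng(1)
positions = rng.random((8, 4, 19, 19))     # (N, planes, 19, 19)
moves = rng.integers(0, 361, size=8)       # index of the correct move
predict = lambda x: x.reshape(len(x), 4, 361).sum(axis=1)  # stand-in "network"

baseline = float(np.mean(predict(positions).argmax(axis=1) == moves))
print(baseline, ablation_accuracy(predict, positions, moves, plane=0))
```

Comparing the two numbers per plane gives a cheap, if rough, importance estimate without retraining from fewer inputs.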
Thomas
I wondered that too, because the search tree frequently reaches positions via transpositions.
Only testing would tell for sure. And even then, YMMV.
From: Computer-go [mailto:computer-...@computer-go.org] On Behalf Of Stefan Kaitschick
Sent: Monday, December 22, 2014 9:46 AM
To: compu...@computer-go.org
Subject: Re: [Computer-go] Move Evaluation in Go Using Deep Convolutional Neural Networks
Last move info is a strange beast, isn't it? I mean, except for ko captures, it doesn't really add information to the position. The correct prediction rate is such an obvious metric, but maybe prediction shouldn't be improved at any price. To a certain degree, last move info is a kind of self-delusion. A predictor that does well without it should be a lot more robust, even if the percentages are poorer.
Stefan
Let's be pragmatic - humans heavily use the information about the last
move too. If they take a while, they don't need to know the last move
of the opponent when reviewing a position, but when reading out a tactical
sequence, the previous move in the sequence is an essential piece of
information.
--
Petr Baudis
If you do not work on an important problem, it's unlikely
you'll do important work. -- R. Hamming
http://www.cs.virginia.edu/~robins/YouAndYourResearch.html
> we'll evaluate against Fuego 1.1 and post the results.
Thanks for the game and report.
I saw the sgf; the CNN can play a ko fight. Great.
> our best CNN is about 220 to 310 Elo stronger which is consistent
A deeper network and richer info make +300 Elo? Impressive.
Aja, if your CNN+MCTS used Erica's playouts, how strong would it be?
I think it would be a contender for strongest program.
I also wonder whether Fuego could release the latest version as 1.2, and use
odd numbers 1.3.x for development.
Regards,
Hiroshi Yamashita
The playing strength of an MCTS program is dominated by the
correctness of the simulations, especially of L&D. Prior knowledge
helps a little. David pointed out after the first Densei-sen (almost
three years ago):
>All MCTS programs have trouble with the positions near the end. The group
>in the center has miai for two eyes. Same for the group at the top. The
>upper left side group has one big eye shape. For all three groups the
>playouts sometimes kill them. The black stones are pretty solid, so the
>playouts let them survive. So even at the end, Zen has a 50% win rate, MFGO
>has 60%, and Pachi has a 70% win rate for Black.
Without improving the correctness of the simulations, MCTS programs
can't move up to the next stage.
Hideki
--
Hideki Kato <mailto:hideki...@ybb.ne.jp>
I thought that any layers beyond 3 were irrelevant. Probably I'm subsuming your NN into what I learned about NNs, and didn't read anything carefully enough.
Can you help correct me?
s.
A 3-layer network (input, hidden, output) is sufficient to be a universal function approximator, so from a theoretical perspective only 3 layers are necessary. But the gap between theoretical and practical is quite large.
The CNN architecture builds in translation invariance and sensitivity to local phenomena. That gives it a big advantage (on a per distinct weight basis) over the flat architecture.
Additionally, the input layers of these CNN designs are very important. Compared to a stone-by-stone representation, the use of high level concepts in the input layer allows the network to devote its capacity to advanced concepts rather than synthesizing basic concepts.
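The translation invariance can be seen directly (a sketch with a hand-rolled "valid" convolution; nothing here is from the paper): shifting the input shifts the output correspondingly, so one filter's 9 weights serve every board location.

```python
import numpy as np

def conv2d_valid(x, k):
    """Plain 2-D 'valid' convolution (really cross-correlation), loop form."""
    kh, kw = k.shape
    out = np.zeros((x.shape[0] - kh + 1, x.shape[1] - kw + 1))
    for i in range(out.shape[0]):
        for j in range(out.shape[1]):
            out[i, j] = np.sum(x[i:i + kh, j:j + kw] * k)
    return out

x = np.zeros((19, 19))
x[5, 5] = 1.0                            # a lone "stone"
k = np.arange(9.0).reshape(3, 3)         # arbitrary 3x3 filter

a = conv2d_valid(x, k)
x2 = np.roll(x, (2, 3), axis=(0, 1))     # shift the stone by (2, 3)
b = conv2d_valid(x2, k)

# Away from the board edges, output of shifted input == shifted output:
print(np.allclose(np.roll(a, (2, 3), axis=(0, 1)), b))  # True
```

A flat (fully connected) layer has no such weight sharing, which is the "per distinct weight" disadvantage mentioned above.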
From: Computer-go [mailto:computer-...@computer-go.org] On Behalf Of uurtamo .
Sent: Tuesday, December 23, 2014 7:34 PM
To: computer-go
Subject: Re: [Computer-go] Move Evaluation in Go Using Deep Convolutional Neural Networks
> I thought that any layers beyond 3 were irrelevant. Probably I'm subsuming your NN into what I learned about NNs, and didn't read anything carefully enough.
As I want to buy a graphics card for CNNs: do I need double-precision
performance? I am giving Caffe (http://caffe.berkeleyvision.org/) a try, and
as far as I understand, most of it is done in single precision?!
You can get comparable single-precision performance from NVIDIA (as Caffe
uses CUDA, I am looking at NVIDIA) for about $340, but the double-precision
performance is 10x lower than on the $1000 cards.
thanks a lot
Detlef
David
http://aws.amazon.com/ec2/instance-types/
G2
This family includes G2 instances intended for graphics and general purpose GPU compute applications.
Features:
High Frequency Intel Xeon E5-2670 (Sandy Bridge) Processors
High-performance NVIDIA GPU with 1,536 CUDA cores and 4GB of video memory
GPU Instances - Current Generation
g2.2xlarge $0.650 per Hour
> -----Original Message-----
> From: Computer-go [mailto:computer-...@computer-go.org] On Behalf
> Of Detlef Schmicker
> Sent: Thursday, December 25, 2014 2:00 AM
> To: compu...@computer-go.org
> Subject: Re: [Computer-go] Move Evaluation in Go Using Deep Convolutional
> Neural Networks
>
Couple of questions:
1. Connectivity, number of parameters
Just to check: each filter connects to all the feature maps below it,
is that right? I tried to check that by ball-park estimating the number
of parameters in that case, and comparing to the relevant paragraph in
your section 4. That seems to support the hypothesis, but
my estimate is for some reason under-estimating the number of
parameters, by about 20%:
Estimated total number of parameters
approx = 12 layers * 128 filters * 128 previous feature maps * 3 * 3 filter size
= 1.8 million
But you say 2.3 million. It's similar, so it seems feature maps are
fully connected to the lower-level feature maps, but I'm not sure where
the extra 500,000 parameters come from.
2. Symmetry
Aja, you say in section 5.1 that adding symmetry does not modify the
accuracy, neither higher nor lower. Since adding symmetry presumably
reduces the number of weights, and therefore increases learning speed,
why did you decide not to implement symmetry?
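The ball-park in question 1 can be reproduced in a couple of lines (a sketch: it counts weights only, assuming full connectivity between feature maps, and ignores biases, the number of input planes, and any larger first-layer kernel; any of those could be where the missing ~0.5M hides):

```python
def conv_weights(in_maps, out_maps, k):
    """Weights of one conv layer, fully connected across maps, no biases."""
    return in_maps * out_maps * k * k

# 12 layers of 128 -> 128 maps with 3x3 filters:
estimate = 12 * conv_weights(128, 128, 3)
print(estimate)   # 1769472, i.e. roughly 1.8 million
```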
Hugh
I mean, it's hard to really combine a CNN move estimator with tree
search: you still need something to tell you what the best leaf is. Given
the state of the art, the reflex is to use it for move ordering in the
tree for MCTS.
But given how strong the no-lookahead player is, it might be
interesting to have a CNN generate an evaluation instead of a move, and
then use alpha-beta and refinements.
We probably don't want to train on the final score, even if the full
probability distribution is interesting; in particular, since many games
end with resignation, we have missing data, and it's certainly not
independent of the resignation itself.
Rather, take a leaf from MCTS and just predict one or zero, the
evaluation function being the probability assigned to the result.
Maybe a system could be found to guarantee that the move predicted by
the move predictor (on the 9d setting in Aja's technique) gets the highest
probability of winning. (Training on the boards with all alternative moves,
maybe?)
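The "predict one or zero" idea is just binary classification with log-loss, where the model's output is read as a win probability. A toy sketch (my made-up data and winning rule, a linear model instead of a CNN, none of it from the paper):

```python
import numpy as np

rng = np.random.default_rng(0)

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

# Toy data: 200 random "positions" of 361 binary features, with a made-up
# winning rule (it depends only on the first 10 features) so that there is
# something to learn. Features are centered and a bias column is appended.
X = rng.integers(0, 2, size=(200, 361)).astype(float)
y = (X[:, :10].sum(axis=1) > 5).astype(float)       # 1 = side to move won
Xb = np.hstack([X - 0.5, np.ones((len(X), 1))])

w = np.zeros(Xb.shape[1])
for _ in range(2000):                # full-batch gradient descent on log-loss
    p = sigmoid(Xb @ w)
    w -= 0.5 * Xb.T @ (p - y) / len(y)

p = sigmoid(Xb @ w)
loss = -np.mean(y * np.log(p) + (1 - y) * np.log(1 - p))
print(f"final log-loss: {loss:.3f}")  # well below log(2) ~ 0.693
```

The trained output `p` is exactly the "probability assigned to the result", usable as an evaluation function in alpha-beta.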
OK, food for thought.
Jonas
My first thought was: a human can find good moves with a glance at a
board position, but even the best pros need to both count and use search
to work out the score. So NNs are good for move candidate generation, and
MCTS is good for scoring?
Darren
--
Darren Cook, Software Researcher/Developer
My new book: Data Push Apps with HTML5 SSE
Published by O'Reilly: (ask me for a discount code!)
http://shop.oreilly.com/product/0636920030928.do
Also on Amazon and at all good booksellers!
I've just been catching up on the last few weeks and their papers. Very
interesting :-)
I think Hiroshi's questions got missed?
Hiroshi Yamashita asked on 2014-12-20:
> I have three questions.
>
> I don't understand minibatch. Does CNN need 0.15sec for a positon, or
> 0.15sec for 128 positions?
I also wasn't sure what "minibatch" meant. Why not just say "batch"?
> Is "KGS rank" set 9 dan when it plays against Fuego?
For me, the improvement from just using a subset of the training data
was one of the most surprising results.
As far as I can tell, they use ALL the training data. That's the point.
They filter by dan, and the CNN must then have less confidence in a 1-dan
game than in a 9-dan game when predicting a 9-dan game, but the
information is still used in some way.
The correlation will be nonzero, and will depend on the situation, too.
The CNN sees that.
Jonas
You are using 3x3. Clark and Storkey are using 5x5 (section 4.2,
first sentence). So each of your 3x3 filters contains about the same
amount of information (9 weights) as Clark and Storkey's symmetrical
5x5 filters (triangle 3+2+1 = 6 weights). If you made your 3x3
filters symmetrical, they would each only have 3 weights, which is
perhaps a bit small?
I think an interesting question could be: is it better to have symmetrical
5x5 (6 weights) or no-symmetries 3x3 (9 weights)?
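Those weight counts are easy to check mechanically (my sketch): count the orbits of filter positions under the 8 symmetries of the square (4 rotations plus reflections).

```python
def unique_symmetric_weights(k):
    """Distinct weights in a k x k filter constrained to be symmetric
    under the dihedral group of the square (rotations + reflections)."""
    seen = set()
    for i in range(k):
        for j in range(k):
            orbit = set()
            a, b = i, j
            for _ in range(4):
                a, b = b, k - 1 - a      # rotate 90 degrees
                orbit.add((a, b))
                orbit.add((b, a))        # reflection across the diagonal
            seen.add(min(orbit))         # canonical representative
    return len(seen)

print(unique_symmetric_weights(3))  # 3
print(unique_symmetric_weights(5))  # 6
```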
Aja wrote:
>> I hope you enjoy our work. Comments and questions are welcome.
I've just been catching up on the last few weeks and their papers. Very
interesting :-)
I think Hiroshi's questions got missed?
Aja replied:
> Yes.
I'm wondering if I've misunderstood, but does this mean it is the same
as just training your CNN on the 9-dan games, and ignoring all the 8-dan
and weaker games? (Surely the benefit of seeing more positions outweighs
the relatively minor difference in pro player strength??)
Darren
P.S.
> I did answer Hiroshi's questions.
>
> http://computer-go.org/pipermail/computer-go/2014-December/007063.html
Thanks Aja! It seems you wrote three in a row, and I only got the first
one. I did a side-by-side check from Dec 15 to Dec 31, and I got every
other message. So perhaps it was just a problem on my side, for those
two messages.
It's just additional data fed into the neural net (via 9 full
layers, in fact :-O), so the net can decide to what extent the data it
saw from 2-dan or 1-dan games is useful for predicting the next move
in 9-dan games.
It turns out that due to mail server misconfiguration, three of Aja
Huang's emails on Dec 20 were not delivered to most or all subscribers:
http://computer-go.org/pipermail/computer-go/2014-December/007061.html
http://computer-go.org/pipermail/computer-go/2014-December/007062.html
http://computer-go.org/pipermail/computer-go/2014-December/007063.html
Please read them via the web archive, and my sincere apologies.
Thanks to Darren Cook + Aja Huang for noticing:
On Sun, Jan 11, 2015 at 10:32:53PM +0000, Darren Cook wrote:
> P.S.
>
> > I did answer Hiroshi's questions.
> >
> > http://computer-go.org/pipermail/computer-go/2014-December/007063.html
>
> Thanks Aja! It seems you wrote three in a row, and I only got the first
> one. I did a side-by-side check from Dec 15 to Dec 31, and I got every
> other message. So perhaps it was just a problem on my side, for those
> two messages.
P.S.: What happened? My home server pasky.or.cz was offline on Dec 20
between 13:57 and ~15:30 UTC for some hardware upgrades - related to my
other project https://github.com/brmson/yodaqa ;-). Unfortunately, the
computer-go.org mail server did not have a proper reverse DNS record
for its IP address configured early on so to enable reliable delivery,
I had to configure relaying all email via my server pasky.or.cz;
I used the `relayhost = pasky.or.cz` postfix directive.
Unfortunately, that turns out not to configure relaying via pasky.or.cz,
but via pasky.or.cz's MX record - which is typically pasky.or.cz again, so it
would appear to work, except when pasky.or.cz was down, as it was at that time.
The backup MX engine.or.cz didn't know anything about the relay
arrangement and so obviously refused to relay any of those mailing-list
emails; they were discarded with a permanent delivery error (except
the first one, for at least some people, since pasky.or.cz was actually
in the middle of shutting down when that one was being relayed).
I have now fixed the error, the lesson is to use `relayhost
= [pasky.or.cz]` to really relay to a host instead of its MX records.
No other emails were lost due to this problem, as far as I can grep.
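For reference, the fix as a main.cf fragment (just the directive discussed above):

```
# /etc/postfix/main.cf
# the brackets mean "relay to this host directly", skipping the MX lookup:
relayhost = [pasky.or.cz]
```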
P.P.S.: It seems that computer-go.org's reverse DNS record actually
did get fixed by now, so I should be able to remove the relay hack when
time permits.
--
Petr Baudis
If you do not work on an important problem, it's unlikely
you'll do important work. -- R. Hamming
http://www.cs.virginia.edu/~robins/YouAndYourResearch.html