[Computer-go] Zero performance


Gian-Carlo Pascutto

Oct 20, 2017, 3:36:12 PM
to compu...@computer-go.org
I reconstructed the full AlphaGo Zero network in Caffe:
https://sjeng.org/dl/zero.prototxt

I did some performance measurements, with what should be
state-of-the-art on consumer hardware:

GTX 1080 Ti
NVIDIA-Caffe + CUDA 9 + cuDNN 7
batch size = 8

Memory use is about 2 GB. (It's much more for training; the original
minibatch size of 32 wouldn't fit on this card!)

Running 2000 iterations takes 93 seconds.

In the AlphaGo Zero paper, they claim 0.4 seconds per move to do 1600
MCTS simulations, and they expand 1 node per visit (if I got it right),
so that would be 1600 network evaluations as well, or 200 of my
batch-8 iterations.

So it would take me ~9.3s to produce a self-play move, compared to 0.4s
for them.

I would like to extrapolate how long it will take to reproduce the
research, but I think I'm missing how many GPUs each self-play worker
uses (4 TPUs? 64 GPUs?), and perhaps the average length of the games.

Let's say the latter is around 200 moves. They generated 29 million
games for the final result, which means it's going to take me about 1700
years to replicate this. I initially estimated 7 years based on the
reported 64 GPU vs 1 GPU, but this seems far worse. Did I miss anything
in the calculations above, or was it really a *pile* of those 64 GPU
machines?
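
For reference, the arithmetic as a quick Python sketch (200 moves per
game is an assumption; the rest is measured above or from the paper):

iter_time = 93.0 / 2000     # seconds per batch-8 iteration (measured)
evals_per_move = 1600       # MCTS simulations = network evaluations
batch = 8

move_time = (evals_per_move / batch) * iter_time    # ~9.3 s
games = 29_000_000          # self-play games for the final result
moves_per_game = 200        # assumption

total_seconds = games * moves_per_game * move_time
print(total_seconds / (365.25 * 24 * 3600))         # ~1700 years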

Because the playing performance seems reasonable (you would be able to
actually run the MCTS on a consumer machine, and hence end up with a
strong program), I would be interested in setting up a distributed
effort for this. But realistically maybe 10 people will join, 80 if
we're very lucky (going by Stockfish's numbers). That means it'd still
take 20 to 170 years.

Someone please tell me I missed a factor of 100 or more somewhere. I'd
love to be wrong here.

--
GCP
_______________________________________________
Computer-go mailing list
Compu...@computer-go.org
http://computer-go.org/mailman/listinfo/computer-go

Gian-Carlo Pascutto

Oct 20, 2017, 4:59:47 PM
to compu...@computer-go.org
On 20-10-17 19:44, Gian-Carlo Pascutto wrote:
> Memory use is about 2 GB. (It's much more for training; the original
> minibatch size of 32 wouldn't fit on this card!)

Whoops, this is not true.

It fits! Barely: 10307 MiB / 11171 MiB.

Álvaro Begué

Oct 20, 2017, 5:41:01 PM
to computer-go
I suggest scaling down the problem until some experience is gained.

You don't need the full-fledged 40-block network to get started. You
can probably get away with only 20 blocks and maybe 128 features (down
from 256). That should save you about a factor of 8, and you can use
larger mini-batches too.
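
A rough sanity check of that factor of 8, assuming the convolution
cost scales linearly in block count and quadratically in channel
count (in_channels x out_channels):

blocks_ratio = 20 / 40
channels_ratio = (128 / 256) ** 2
print(blocks_ratio * channels_ratio)    # 0.125, i.e. ~8x cheaper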

You can also start with 9x9 go. That way games are shorter, and you probably don't need 1600 network evaluations per move to do well.

Álvaro.

Sorin Gherman

Oct 20, 2017, 7:08:20 PM
to compu...@computer-go.org
Training of AlphaGo Zero has been done on thousands of TPUs, according
to this source:
https://www.reddit.com/r/baduk/comments/777ym4/alphago_zero_learning_from_scratch_deepmind/dokj1uz/?context=3

Maybe that should explain the difference in orders of magnitude that
you noticed?

fot...@smart-games.com

Oct 20, 2017, 7:28:51 PM
to compu...@computer-go.org
The paper describes 20 and 40 block networks, but the section on comparison says AlphaGo Zero uses 20 blocks. I think your protobuf describes a 40 block network. That's a factor of two 😊

If you only want pro strength rather than superhuman, you can train for half their time.

Your estimate looks reasonable for the time to generate the 29M games
at about 10 seconds per move. But that's only the time to generate the
input data. Do you have an estimate of the additional time it takes to
do the training? It's probably small in comparison, but it might not be.

My plan is to start out with a little supervised learning, since I'm
not trying to prove a breakthrough. I experimented for a few months
last year with res-nets for a policy network, and some of the things I
discovered there probably apply to this network; they should give
perhaps a factor of 5 to 10 speedup. For a commercial program I'll be
happy with 7-dan amateur strength after about 6 months of training on
my two GPUs and sixteen i7 cores.

David

John Tromp

Oct 20, 2017, 7:49:17 PM
to computer-go
> You can also start with 9x9 go. That way games are shorter, and you probably
> don't need 1600 network evaluations per move to do well.

Bonus points if you can have it play on goquest where many
of us can enjoy watching its progress, or even challenge it...

regards,
-John

Gian-Carlo Pascutto

Oct 20, 2017, 8:09:44 PM
to compu...@computer-go.org

I agree. Even on 19x19 you can use smaller searches. 400 MCTS
iterations is probably already a lot stronger than the raw network,
especially if you are expanding every node (very different from a
normal program at 400 playouts!). Some tuning of these mini-searches
is important. Surely you don't want to explore every child node for
first-play urgency... I remember this little algorithmic detail was
missing from the first paper as well.
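
To illustrate, a minimal sketch of PUCT selection with a
first-play-urgency reduction (the constants are made up, not from the
paper):

import math
from dataclasses import dataclass, field
from typing import List

@dataclass
class Node:
    prior: float = 1.0      # policy network probability for the move
    visits: int = 0
    value: float = 0.0      # mean value of this node's subtree
    children: List["Node"] = field(default_factory=list)

def select_child(node, c_puct=1.5, fpu_reduction=0.2):
    # Unvisited children get the parent's value minus a penalty,
    # rather than an optimistic default, so the search doesn't
    # insist on visiting every child once.
    fpu_value = node.value - fpu_reduction
    sqrt_visits = math.sqrt(max(1, node.visits))

    def score(child):
        q = child.value if child.visits > 0 else fpu_value
        u = c_puct * child.prior * sqrt_visits / (1 + child.visits)
        return q + u

    return max(node.children, key=score)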

So that's a factor of 32 gain (Álvaro's factor of 8 from the smaller
network, times 4 from the smaller searches). Because the network is
smaller, it should learn much faster too. Someone on reddit posted a
comparison of 20 blocks vs 40 blocks.

With 10 people you can probably get some results in a few months. The question is, how much Elo have we lost on the way...

Another advantage would be that, as long as you keep all the SGF, you
can bootstrap a bigger network from the data! So nothing is lost by
starting small. You can "upgrade" if the improvements start to plateau.

--

GCP

Gian-Carlo Pascutto

Oct 21, 2017, 3:36:29 AM
to compu...@computer-go.org
On 20/10/2017 22:41, Sorin Gherman wrote:
> Training of AlphaGo Zero has been done on thousands of TPUs,
> according to this source:
> https://www.reddit.com/r/baduk/comments/777ym4/alphago_zero_learning_from_scratch_deepmind/dokj1uz/?context=3
>
> Maybe that should explain the difference in orders of magnitude that
> you noticed?

That would make a lot more sense, for sure. It would also explain the
25M USD number from Hassabis. That would be a lot of money to spend on
"only" 64 GPUs, or 4 TPUs (each supposedly comparable to ~1 GPU).

There's no explanation of where the number came from, but it seems he
did similar math to that in the original post here.

Gian-Carlo Pascutto

Oct 21, 2017, 5:08:40 AM
to compu...@computer-go.org
On 20/10/2017 22:48, fot...@smart-games.com wrote:
> The paper describes 20 and 40 block networks, but the section on
> comparison says AlphaGo Zero uses 20 blocks. I think your protobuf
> describes a 40 block network. That's a factor of two 😊

They compared with both, the final 5180 Elo number is for the 40 block
one. For the 20 block one, the numbers stop around 4300 Elo.
See for example:

https://www.reddit.com/r/baduk/comments/77hr3b/elo_table_of_alphago_zero_selfplay_games/

A factor of 2 isn't much, but sure, it seems sensible to start with the
smaller one, given how intractable the problem looks right now.

> Your estimate looks reasonable for the time to generate the 29M
> games at about 10 seconds per move. But that's only the time to
> generate the input data. Do you have an estimate of the additional
> time it takes to do the training? It's probably small in comparison,
> but it might not be.

So far I've assumed that it's zero, because training can happen in
parallel and the time to generate the self-play games dominates. From
the revised hardware estimates, we can also see that the training
machines used 64 GPUs, far less hardware than the 1500+ TPU estimate
for the self-play machines.

Training on the GTX 1080 Ti does 4 batches of 32 positions per second.
They use 2048-position batches and train for 1000 batches before
checkpointing, so the GTX can produce a checkpoint every ~4.5 hours [1].
Testing that checkpoint over 400 games takes ~8.6 days (400 x 200 x
9.3 s).
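
The same numbers as a quick script:

positions_per_sec = 4 * 32            # measured on the 1080 Ti
checkpoint_positions = 1000 * 2048    # batches x batch size (paper)
print(checkpoint_positions / positions_per_sec / 3600)  # ~4.4 hours

match_seconds = 400 * 200 * 9.3       # games x moves x seconds/move
print(match_seconds / 86400)          # ~8.6 days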

So again, it totally bottlenecks on playing games, not on training. At
least, if the improvement is big, one needn't play all 400 games out;
SPRT termination can be used instead.
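
For the SPRT, something like the simplified binomial test used in
Stockfish testing would do; a sketch (the Elo bounds are illustrative
choices, not from the paper):

import math

def sprt_llr(wins, losses, elo0=0.0, elo1=35.0):
    # Log-likelihood ratio for H1 "new net is elo1 stronger" against
    # H0 "it is only elo0 stronger". Draws are ignored since they
    # barely occur at 7.5 komi.
    if wins == 0 or losses == 0:
        return 0.0
    def win_prob(elo):
        return 1.0 / (1.0 + 10.0 ** (-elo / 400.0))
    p0, p1 = win_prob(elo0), win_prob(elo1)
    return (wins * math.log(p1 / p0)
            + losses * math.log((1 - p1) / (1 - p0)))

# With alpha = beta = 0.05 the stopping bounds are about +/-2.94
# (log 19); a clearly better net terminates long before 400 games.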

[1] To be honest, this seems very fast: even starting from zero, such a
big network barely advances in 1000 iterations (or I misinterpreted a
training parameter). But I guess it's important to have a very fast
learn-knowledge, use-new-knowledge feedback cycle.
