[computer-go] Parallel algorithms

Zach Wegner

unread,

May 24, 2008, 4:41:47 PM5/24/08

to computer-go

On Sat, May 24, 2008 at 6:48 AM, Olivier Teytaud <Olivier...@lri.fr> wrote:
>
> By the way, parallelizations (both multi-core and MPI) are *very* efficient.
> The use of huge clusters could probably give much better
> results than the current limit of mogo (between 1k and 1d on KGS with
> 64 quad-cores and myrinet switch).

This is rather off-topic, but I am very interested in this field of
research. I had read previously that Mogo used MPI, but I didn't know
if could work on clusters without sharing the whole tree. I have been
formulating such an algorithm for the past week or so, so I would like
to make sure that it hasn't already been done. Has anything been
written about Mogo's parallel algorithms? Or could you give a brief
run-down? How does it scale on such large numbers of processors?
_______________________________________________
computer-go mailing list
compu...@computer-go.org
http://www.computer-go.org/mailman/listinfo/computer-go/

Olivier Teytaud

unread,

May 25, 2008, 10:01:54 AM5/25/08

to computer-go

> I had read previously that Mogo used MPI, but I didn't know
> if could work on clusters without sharing the whole tree. I have been
> formulating such an algorithm for the past week or so, so I would like
> to make sure that it hasn't already been done. Has anything been
> written about Mogo's parallel algorithms? Or could you give a brief
> run-down? How does it scale on such large numbers of processors?

MoGo's parallelization has been published, but with
moderate results because at that time we were only using ethernet. The
same algorithm with good networks provide good results in 9x9 and very
good results in 19x19. The trick is just that sharing only nodes of the
tree with at least 5% of the total number of simulations, say 20 times
per second, is enough for a nice speed-up. This is not possible with
ethernet, but it is possible with myrinet, infiniband, or things like
taht. With
32 nodes, you get
95% winning rate against the one-node baseline, and this can be
implemented using MPI within just a few lines (recursive sharing of the
nodes in the tree). The reference (conference icinco2008)
is on my web page (http://www.lri.fr/~teytaud/cv/cv.html), but the results
in the paper are disappointing
as the results on this paper were with ethernet only. We'll publish
the new speed-up curves soon (preliminary results below). In 9x9, it's far
less impressive, unfortunately; but perhaps improvements are possible.
Note that these results are with very good networks, but better hardware
exists... if you have plenty of money :-)

Also I think that just combining the improvements in various current best
implementations could provide something much stronger. For example, mogo
is still below the state of the art in terms of patterns in the tree (but,
this is increasing :-) ). Some other implementations have no MPI
parallelization, or no RAVE values - combining all that in one code could
be quite nice. It's just a huge work :-)
Olivier
======================== MPI success rate ===============
19x19 with 32 machines
against 1 machines:.95102040816326530612+-.01378857726307400759\\
against 2 machines:.82383419689119170984+-.02742218504725679587\\
against 4 machines:.73493975903614457831+-.03425658934275012162\\
against 8 machines:.63076923076923076923+-.04232651544690520753\\
against 16 machines:.63157894736842105263+-.05533236663556282245\\
against 32 machines:.48000000000000000000+-.09991996797437437129\\

9x9
against 1 machines:.72542372881355932203+- .02598462012440167557
against 2 machines:.63082437275985663082+- .02889140360869044365
against 4 machines:.61231884057971014492+- .02932726868615559501
against 8 machines:.53526220614828209764+- .02120922113498994710
against 16 machines:.50541516245487364620+- .02124171853985291158

Zach Wegner

unread,

May 26, 2008, 12:19:45 AM5/26/08

to computer-go

> MoGo's parallelization has been published, but with moderate results because
> at that time we were only using ethernet. The
> same algorithm with good networks provide good results in 9x9 and very
> good results in 19x19. The trick is just that sharing only nodes of the tree
> with at least 5% of the total number of simulations, say 20 times
> per second, is enough for a nice speed-up. This is not possible with
> ethernet, but it is possible with myrinet, infiniband, or things like taht.

Interesting. It's fairly similar to what I was envisioning, but relies
much more on the network. Mine was oriented more towards a distributed
cluster, connected over the internet. Given this information, my
approach is most likely useless, at least on a setup like yours. I'll
tell you sort of what I was thinking about though:

Each node in the network communicates with just a few others in the
network. These relationships are arranged hierarchically, but they can
adjust dynamically. Each node has a parent and some siblings. Each
node sets its own root node somewhere in the tree below the root of
its parent (in most cases one move after). The parent and the children
communicate in the following way: every X nodes, the parent sends the
move list of its root with the number of playouts, the number of wins,
and the number of children searching below each. The children send
back which move they choose to search (based both on UCT or whatever
tree search algorithm you use, and what the other children are doing)
and the results along with a PV. Sometimes children will dissociate
with a parent and try to find a better place in the tree to search.
The strength of this approach is a self-regulating tree of nodes that
will automatically find the most lucrative places to search without
needing to communicate with a master server. The tree of nodes will
end up looking somewhat like a UCT tree.

Add in the possibility of the network regulating its connections as
well based on lag time, and it gets pretty complicated. It might be
the only viable approach when running on a cluster though... What do
you think?

> With 32 nodes, you get
> 95% winning rate against the one-node baseline, and this can be implemented
> using MPI within just a few lines (recursive sharing of the nodes in the
> tree). The reference (conference icinco2008)
> is on my web page (http://www.lri.fr/~teytaud/cv/cv.html), but the results
> in the paper are disappointing
> as the results on this paper were with ethernet only. We'll publish
> the new speed-up curves soon (preliminary results below). In 9x9, it's far
> less impressive, unfortunately; but perhaps improvements are possible.
> Note that these results are with very good networks, but better hardware
> exists... if you have plenty of money :-)

And 64 quad cores are cheap? ;)

Seriously though, thanks for the information. I guess the paper is not
available on the web anywhere? I think I understand your basic idea
pretty well though.

One interesting trend that I see is that the uncertainty increases for
19x19 when the matches get closer, while that doesn't happen in 9x9.
Maybe 9x9 is just shallow enough that the matches are more decisive.
Or is it just based on how many games were used to get this result?

Reply all

Reply to author

Forward