Integrating PyMC with Apache Spark

mertte...@gmail.com

May 20, 2014, 4:34:37 PM

to py...@googlegroups.com

Hi,

We on the Cloudera Data Science team are thinking about starting a project that integrates PyMC with Apache Spark, so that PyMC can run in a distributed fashion on Spark clusters. Is there any existing effort to make PyMC parallel? Also, since PyMC 3 is still in development, would you recommend that we use PyMC 2.3 or PyMC 3 for this project?

Thanks!

Best,

Mert

Thomas Wiecki

May 20, 2014, 8:33:49 PM

to py...@googlegroups.com

Hi,

I think that's a great idea. Note, however, that the MCMC algorithms currently implemented cannot be parallelized (except trivially, by running multiple chains). There has been a slew of recent papers on running MCMC in parallel on subsets of the data (see e.g. http://arxiv.org/abs/1311.4780 or http://arxiv.org/abs/1402.4102 or Max Welling's papers), which seems more appropriate for Spark.
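To illustrate the trivial case of running multiple chains: each chain is sequential internally, but chains started with different seeds are independent, so they can be handed to separate workers. The sketch below is plain Python with a toy random-walk Metropolis sampler rather than PyMC. The names `log_target` and `run_chain` and the seed list are hypothetical, and the built-in `map` only stands in for a distributed map (in PySpark, something along the lines of `sc.parallelize(seeds).map(run_chain).collect()`).

```python
import math
import random


def log_target(x):
    # Unnormalized log-density of a standard normal (toy stand-in for a posterior).
    return -0.5 * x * x


def run_chain(seed, n_steps=5000, step_size=1.0):
    """Random-walk Metropolis. Every step depends on the previous one,
    so a single chain cannot be split across workers."""
    rng = random.Random(seed)
    x = 0.0
    samples = []
    for _ in range(n_steps):
        proposal = x + rng.gauss(0.0, step_size)
        # Accept with probability min(1, p(proposal) / p(x)).
        # 1 - random() lies in (0, 1], so the log is always defined.
        if math.log(1.0 - rng.random()) < log_target(proposal) - log_target(x):
            x = proposal
        samples.append(x)
    return samples


# Independent chains: the only parallelism available without changing the sampler.
seeds = [1, 2, 3, 4]
chains = list(map(run_chain, seeds))  # stand-in for a distributed map
chain_means = [sum(chain) / len(chain) for chain in chains]
```

This speeds up collecting samples, but every worker still needs the full data set, which is why the subset-based methods in the papers above are a better match for data that is already partitioned.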

I'm not sure what the best way to interface with PyMC3 would be. Would you want to replace all the likelihoods? Perhaps there is even a way to interface at the Theano level which might make PyMC run more seamlessly.

If anything, I would imagine that you would want to go with PyMC 3 rather than 2, as it's a clean-slate rewrite with a more thought-out architecture. I think at this point the core is pretty solid, but documentation is missing.

Thomas

--

Thomas Wiecki

PhD candidate, Brown University

Quantitative Researcher, Quantopian Inc, Boston

John Salvatier

May 21, 2014, 2:48:12 AM

to py...@googlegroups.com

I echo Thomas' comments; it's a very cool idea. I've even done a bit of experimentation with doing HMC in Spark. Unfortunately, as Thomas mentions, MCMC algorithms are rather inherently sequential. Likelihoods are often pretty parallelizable, but that varies from problem to problem.
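As a rough sketch of why likelihoods are often parallelizable: for conditionally independent observations the log-likelihood is a sum over data points, so each partition can compute its own partial sum and the results can be added up (a map over partitions followed by a reduce). The example below is plain Python, not PyMC, Theano, or Spark; the normal model, the toy data, and the helper names `normal_logpdf`, `partition_loglik`, and `split` are assumptions made for illustration.

```python
import math


def normal_logpdf(x, mu, sigma):
    # Log-density of a single observation under Normal(mu, sigma).
    return -0.5 * math.log(2 * math.pi * sigma ** 2) - (x - mu) ** 2 / (2 * sigma ** 2)


def partition_loglik(partition, mu, sigma):
    # "Map" step: each worker sums the log-density over its own slice of the data.
    return sum(normal_logpdf(x, mu, sigma) for x in partition)


def split(data, n_parts):
    # Toy partitioner standing in for however the cluster distributes the data.
    return [data[i::n_parts] for i in range(n_parts)]


data = [0.5, -1.2, 0.3, 2.1, -0.7, 1.4, 0.0, -0.4]
partitions = split(data, 3)

# "Reduce" step: add the per-partition partial sums.
total = sum(partition_loglik(p, 0.0, 1.0) for p in partitions)

# Same quantity computed in one pass over all the data, for comparison.
full = partition_loglik(data, 0.0, 1.0)
```

The sampler itself stays sequential: each proposal still needs one such distributed sum, so the gain depends on how expensive the likelihood is relative to the communication cost. Models whose observations are not conditionally independent do not decompose this cleanly.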

For parallelizing likelihood calculations, I have sort of always imagined integrating Spark into Theano. I agree that your best bet is probably PyMC 3. As I work with both PyMC and Spark, I'd be interested in seeing your work in progress and I'm more than happy to answer questions.

John

Frédéric Bastien

May 21, 2014, 9:20:01 AM

to pymc

Hi,

There is a thread on the theano-dev mailing list: "[theano-dev] Using Theano with PySpark".

A user had a problem caused by Spark, but someone just posted a URL to a workaround.

I don't have the time to look into making Theano work with Spark, but we can discuss this if you are interested. Maybe continuing on theano-dev would be better, as I think this is Theano-specific.

Frédéric

mertte...@gmail.com

May 23, 2014, 4:10:08 PM

to py...@googlegroups.com, no...@nouiz.org

Thanks for the great pointers. In particular, the papers "Asymptotically Exact, Embarrassingly Parallel MCMC" and "Bayes and Big Data: The Consensus Monte Carlo Algorithm" offer some insight into parallelizing MCMC methods using the MapReduce paradigm.

As far as I can see, PyMC only supports reading data from a local file, via the data.py module. Is there any ongoing effort to add a feature to PyMC 3 for writing traces to a database or file system, like the database module in PyMC 2.3?
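The MapReduce shape of the consensus approach can be sketched as follows: each worker samples from the posterior given only its shard (map), and the g-th draws from all shards are combined by a weighted average, with weights equal to the inverse subposterior variance (reduce). This is a toy sketch under strong assumptions, not the papers' reference code: a normal mean with known noise `SIGMA` and a flat prior, so each subposterior can be sampled directly without MCMC and the prior does not need to be split across shards. All names (`sample_subposterior`, `consensus_combine`, the shard layout) are hypothetical.

```python
import random
import statistics

SIGMA = 1.0  # known observation noise (assumption of this toy model)


def sample_subposterior(shard, n_draws, seed):
    """Map step: draws of the mean given one shard only.
    With a flat prior this subposterior is Normal(mean(shard), SIGMA^2 / n)."""
    rng = random.Random(seed)
    mean = statistics.fmean(shard)
    sd = SIGMA / len(shard) ** 0.5
    return [rng.gauss(mean, sd) for _ in range(n_draws)]


def consensus_combine(draws_per_shard):
    """Reduce step: weighted average of the g-th draw from every shard,
    weighting each shard by the inverse of its sample variance."""
    weights = [1.0 / statistics.variance(d) for d in draws_per_shard]
    total = sum(weights)
    return [
        sum(w * d for w, d in zip(weights, draws)) / total
        for draws in zip(*draws_per_shard)
    ]


# Simulated data, split into three shards as a cluster might hold it.
data_rng = random.Random(0)
data = [data_rng.gauss(2.0, SIGMA) for _ in range(3000)]
shards = [data[i::3] for i in range(3)]

sub_draws = [sample_subposterior(s, 2000, seed=i) for i, s in enumerate(shards)]
consensus = consensus_combine(sub_draws)
```

In this Gaussian case the combined draws match the full-data posterior, Normal(mean(data), SIGMA^2 / N), up to Monte Carlo error. For non-Gaussian posteriors the weighted average is only an approximation.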

Thanks!

Mert

Thomas Wiecki

May 23, 2014, 4:45:39 PM

to py...@googlegroups.com, no...@nouiz.org

On Fri, May 23, 2014 at 4:10 PM, <mertte...@gmail.com> wrote:

> Is there any ongoing effort to add a feature to PyMC 3 for writing traces to a database or file system, like the database module in PyMC 2.3?

Yes: https://github.com/pymc-devs/pymc/pull/527

Should get merged soon I think.

Thomas

mertte...@gmail.com

May 23, 2014, 5:17:17 PM

to py...@googlegroups.com, no...@nouiz.org

Thanks Thomas, I'm looking forward to this merge!

Mert

Thomas Wiecki

May 25, 2014, 10:17:06 AM

to py...@googlegroups.com, Frédéric Bastien

On Fri, May 23, 2014 at 5:17 PM, <mertte...@gmail.com> wrote:

> Thanks Thomas, I'm looking forward to this merge!

FYI, this has now been merged: https://github.com/pymc-devs/pymc/pull/527

Thomas
