Hi,

As I mentioned in one of my previous posts, I'm currently developing an interface for PyMC to use Apache Spark as a backend. I forked PyMC at the PyMC-Spark repository. Since PyMC v3 is still under development, I've decided to go with v2.3.
I'd like to share the current status with you. So far, I've completed the following items:

- Implemented an HDFS backend for PyMC v3 (hdfs.py).
- Implemented an HDFS db backend for PyMC v2.3 (hdfs.py). Both implementations use PyWebHdfs to communicate with HDFS, so sending data over the network can be somewhat slow.
- Added the MCMCSpark module, which gives the user an interface to access Spark RDDs (Spark wraps data inside so-called RDDs) with the regular MCMC class methods (MCMCSpark.py). With this class, traces can easily be saved to and loaded from HDFS without any extra network allocation. By calling the 'sample' method implemented in MCMCSpark, one can run MCMC's sample method in parallel on a Spark cluster. However, Spark jobs cannot return the MCMC instances they create, because PySpark can't serialize MCMC objects due to the underlying Fortran code. Instead, each Spark job returns a tuple of an integer (the chain number) and a dictionary (the trace data).
- Implemented a Spark db backend. This module is simply the database backend for Spark RDDs. Like the other database modules, it contains a Trace and a Database class (spark.py).
- Added some example code showing how to use the HDFS backend (disaster_model_hdfs.py) and MCMCSpark (disaster_model_spark.py).

I'd like to hear your opinions and ideas, and I'd appreciate any feedback.
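To make the tuple-return pattern concrete, here is a toy sketch of the data flow described above. A simple Metropolis sampler stands in for PyMC's MCMC.sample (the names and model are illustrative, not PyMC's or MCMCSpark's actual API); the point is that each job returns only plain, picklable data rather than the sampler object itself:

```python
import math
import random

def run_chain(chain, iters=1000, seed=None):
    """Toy stand-in for one Spark job: run a Metropolis chain and
    return a picklable (chain_number, trace_dict) tuple."""
    rng = random.Random(seed)
    x, trace = 0.0, []
    for _ in range(iters):
        proposal = x + rng.gauss(0, 1)
        # Metropolis accept/reject step for a standard-normal target
        if rng.random() < min(1.0, math.exp(-(proposal**2 - x**2) / 2)):
            x = proposal
        trace.append(x)
    # Only plain, picklable data crosses the process boundary
    return (chain, {"x": trace})

# On a cluster this map would run via Spark, e.g.:
#   sc.parallelize(range(nchains), nchains).map(run_chain).collect()
results = [run_chain(c, seed=c) for c in range(4)]
# results is a list of (chain_number, {variable_name: samples}) tuples
```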
--
You received this message because you are subscribed to the Google Groups "PyMC" group.
To unsubscribe from this group and stop receiving emails from it, send an email to pymc+uns...@googlegroups.com.
To post to this group, send email to py...@googlegroups.com.
Visit this group at http://groups.google.com/group/pymc.
For more options, visit https://groups.google.com/d/optout.
This looks really interesting.
While I can understand that decision, it is a little unfortunate to have such an exciting feature not work with the up-and-coming PyMC 3. The code base is much lighter, and we'd certainly be happy to help if questions arise.
This seems to be the same link as below for v2.3?
Why don't all instances simply save their traces directly to HDFS, and you just read them out on the master thread?
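A minimal sketch of that idea, assuming pywebhdfs and a reachable namenode (the host, port, user, and paths below are hypothetical). The serialization helpers are runnable as-is; the HDFS calls are commented out since they need a live cluster:

```python
import json

def trace_to_bytes(chain, traces):
    # Plain JSON keeps the payload portable across workers
    return json.dumps({"chain": chain, "traces": traces}).encode("utf-8")

def trace_from_bytes(data):
    obj = json.loads(data.decode("utf-8"))
    return obj["chain"], obj["traces"]

# On each worker (hypothetical host/port/paths):
# from pywebhdfs.webhdfs import PyWebHdfsClient
# hdfs = PyWebHdfsClient(host="namenode", port="50070", user_name="pymc")
# hdfs.create_file("traces/chain_0.json", trace_to_bytes(0, {"x": [0.1, 0.2]}))
#
# On the master, read each chain back:
# raw = hdfs.read_file("traces/chain_0.json")
# chain, traces = trace_from_bytes(raw)

# Round-trip check without a cluster:
payload = trace_to_bytes(0, {"x": [0.1, 0.2]})
chain, traces = trace_from_bytes(payload)
```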
Once this works more robustly, it would be great to get the map-reducible samplers. Do you plan on adding those? I think Kai laid some interesting groundwork here: https://github.com/pymc-devs/pymc/pull/547 (pymc3 though).
First of all, thank you for the reply and the feedback.

On Sat, Jun 21, 2014 at 2:36 AM, Thomas Wiecki <thomas...@gmail.com> wrote:

> This looks really interesting. While I can understand that decision, it is a little unfortunate to have such an exciting feature not work with the up-and-coming PyMC 3. The code base is much lighter, and we'd certainly be happy to help if questions arise.
One of the reasons to develop this project for PyMC v2.3 is that PyMC 3 is still in the alpha stage, whereas PyMC 2.3 is more mature. However, once PyMC 3 reaches enough maturity, we can apply the same ideas to PyMC 3 to obtain distributed processing for PyMC.
> Once this works more robustly, it would be great to get the map-reducible samplers. Do you plan on adding those? I think Kai laid some interesting groundwork here: https://github.com/pymc-devs/pymc/pull/547 (pymc3 though).
Implementing one of the distributed samplers (e.g., Asymptotically Exact, Embarrassingly Parallel MCMC) would be a great future step. However, we are more interested in distributing the data (assuming the observations are large), rather than parallelizing the number of iterations in a sampler as described in the paper. We can work on this once we are done with the features currently in focus.

To provide a framework for distributing the data across the machines in a cluster, I've been implementing DistributedMCMC.py. It differs from MCMCSpark.py in that MCMCSpark runs the same model on the same data across the cluster, whereas DistributedMCMC distributes the data, runs a model that uses the local data on each machine, and synchronizes the samplers via a global update function. As an introductory example, I have implemented a simple LDA model, DistributedLDA.py.
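To illustrate the distinction, here is a toy, pure-Python sketch of the data-partitioned scheme (not the actual DistributedMCMC API; all names are illustrative): each shard runs a local step against its own data, and a global update function merges the per-partition states between rounds.

```python
import random

def local_step(shard, global_state, rng):
    # Local "sampler": a noisy estimate of the shard mean, anchored to
    # the current global state (stands in for sampling local variables
    # conditioned on the shared globals).
    est = sum(shard) / len(shard)
    return 0.5 * global_state + 0.5 * est + rng.gauss(0, 0.01)

def global_update(local_states):
    # Synchronization step: merge per-partition states into one
    # (on a cluster this would be a Spark reduce).
    return sum(local_states) / len(local_states)

rng = random.Random(0)
data = [float(i) for i in range(100)]        # the observations
shards = [data[i::4] for i in range(4)]      # partitioned across 4 workers
state = 0.0
for _ in range(50):                          # alternating sampling rounds
    locals_ = [local_step(s, state, rng) for s in shards]
    state = global_update(locals_)
# state settles near the overall mean of the data (49.5)
```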
I would also like to contribute this ongoing work to the actual PyMC repository. Would it be possible to send a pull request? We could either integrate the work into the 2.3 branch or create another branch containing the modules that provide distributed sampling. So far, both HDFS backends (for v2.3 and v3) have passed all the unit tests and are ready to use. I will test MCMCSpark and the Spark backend, and they should be ready in a couple of days. There are still some missing pieces in DistributedMCMC; however, it will be a complete working module in a few days, too.
Thank you.

Mert

--
Mert Terzihan
M.Sc. Student in Computer Science at Brown University
Brown University, Computer Science Dept.
115 Waterman Street, 4th Floor
Providence, RI 02912-1910
On Thu, Jul 10, 2014 at 7:57 PM, Mert Terzihan <mertte...@gmail.com> wrote:
> First of all, thank you for the reply and the feedback.
>
> On Sat, Jun 21, 2014 at 2:36 AM, Thomas Wiecki <thomas...@gmail.com> wrote:
>
>> This looks really interesting. While I can understand that decision, it is a little unfortunate to have such an exciting feature not work with the up-and-coming PyMC 3. The code base is much lighter, and we'd certainly be happy to help if questions arise.
> One of the reasons to develop this project for PyMC v2.3 is that PyMC 3 is still in the alpha stage, whereas PyMC 2.3 is more mature. However, once PyMC 3 reaches enough maturity, we can apply the same ideas to PyMC 3 to obtain distributed processing for PyMC.

That certainly makes sense, and it's true this could be implemented much more easily for PyMC 3 then. The main reason we consider pymc3 to be alpha is the lack of documentation, but since the code is quite readable, I don't think it would cause you issues for this purpose.
>> Once this works more robustly, it would be great to get the map-reducible samplers. Do you plan on adding those? I think Kai laid some interesting groundwork here: https://github.com/pymc-devs/pymc/pull/547 (pymc3 though).
> Implementing one of the distributed samplers (e.g., Asymptotically Exact, Embarrassingly Parallel MCMC) would be a great future step. However, we are more interested in distributing the data (assuming the observations are large), rather than parallelizing the number of iterations in a sampler as described in the paper. We can work on this once we are done with the features currently in focus. To provide a framework for distributing the data across the machines in a cluster, I've been implementing DistributedMCMC.py. It differs from MCMCSpark.py in that MCMCSpark runs the same model on the same data across the cluster, whereas DistributedMCMC distributes the data, runs a model that uses the local data on each machine, and synchronizes the samplers via a global update function. As an introductory example, I have implemented a simple LDA model, DistributedLDA.py.

Not sure I understand. My reading of the paper you cite was that it allows you to do exactly that: combine traces from different parts of the data into one posterior over all the data. How do you combine the samples now?
> I would also like to contribute this ongoing work to the actual PyMC repository. Would it be possible to send a pull request? We could either integrate the work into the 2.3 branch or create another branch containing the modules that provide distributed sampling. So far, both HDFS backends (for v2.3 and v3) have passed all the unit tests and are ready to use. I will test MCMCSpark and the Spark backend, and they should be ready in a couple of days. There are still some missing pieces in DistributedMCMC; however, it will be a complete working module in a few days, too.

That would definitely be of interest. The main concerns would be added dependencies (these could be optional), unit tests (it sounds like you have those already), and documentation/tutorials.
In any case, very interesting and exciting work.
Thomas
--
Thomas Wiecki
PhD candidate, Brown University
Quantitative Researcher, Quantopian Inc, Boston