Using multiprocessing to speed up the pymc code

Zahra Sheikh

Jul 23, 2014, 10:54:27
to py...@googlegroups.com
I am a newbie with pymc and trying to learn how to make my MCMC code work. So far I have written the code below, where I want to find posteriors for the parameters Xpos, Ypos, MASS = M0*ma, and concentration. I used uniform priors for Xpos and Ypos, while ma should have an exponential prior within the given boundaries, which I enforce with a potential decorator. The concentration is related to MASS via
cExpected = 5.26/(1+z) * (MASS/1e14)^(-0.1)
and should follow a lognormal distribution around that value. Finally, the log-likelihood is computed in the reduced_shear function. I have tested the sampling of the priors and they seem to be on the right track.

import pymc as pm
import nfwmodeltools
import numpy as np
import math
import matplotlib.pyplot as plt
from astropy import units as u
cosmo=nfwmodeltools.global_cosmology()
from astropy.cosmology import FlatLambdaCDM
cos = FlatLambdaCDM(H0=70, Om0=0.3)
print "Reading redshift distributions of galaxies which we will marginalize over this parameter when we compute beta and rho_c_over_sigma_c....."
redshift_pdf=np.loadtxt('/vol/pdfMockFidSurvey.dat')
#Normalize the probability
zpdf=np.array([redshift_pdf[j,:]/sum(redshift_pdf[j,:]) for j in xrange(gal_pos.shape[0])] )
#The whole range of redshift
z=np.arange(0,1.5,0.001)
z_halo=0.15
M0=1e15
x=1*u.Mpc
print "convert radian to degree for angular separation in the redshift of cluster and compute 1 Mpc seperation in degree:"angular_separation=(x*180)/np.pi/ cos.angular_diameter_distance(z_halo) 
#preparing the required inputs for the mcmc model
rho_crit=cosmo.cosmology.rho_crit(z_halo)
print "We multiply the probability with the computed function for the given array z"
rho_c_over_sigma_c=np.dot(zpdf,cosmo.cosmology.RhoCrit_over_SigmaC(z, z_halo))
print rho_c_over_sigma_c.shape
beta=np.dot(zpdf, cosmo.cosmology.beta_s(z, z_halo))       
print "Defining the model as nfw model for MCMC..."
def nfw(gal_pos,observed_g,g_err,rho_crit,rho_c_over_sigma_c,beta):
    """
    gal_pos: background galaxy positions
    observed_g: observed reduced shear components (value of the likelihood node)
    g_err: shear error
    rho_crit: critical density
    rho_c_over_sigma_c: critical density over critical surface mass density
    beta: Dds/Ds
    """
    @pm.stochastic(dtype=np.float, observed=False, trace=True)
    def Xpos(value=25.4,x_l=25.25,x_h=25.55):
        """The probable region of the position of halo centre"""
        if ((value>x_h) or (value<x_l)):
            return -np.inf
        else:
            return -np.log(x_h-x_l)  # log of the uniform density on [x_l, x_h]
    #------------------------------------------------------------
    @pm.stochastic(dtype=np.float, observed=False, trace=True)
    def Ypos(value=-10.01,y_l=-10.25,y_h=-9.9):
        """The probable region of the position of halo centre"""
        if ((value>y_h) or (value<y_l)):
            return -np.inf
        else:
            return -np.log(y_h-y_l)  # log of the uniform density on [y_l, y_h]
    #------------------------------------------------------------
    # Based on a simulated or observed mass function we adopt a simple exponential prior  
    @pm.stochastic(dtype=np.float, observed=False, trace=True)
    def ma(value=0.1, rate=1):
        """mass is a stochastic parameter with exponential distribution.p(M)~exp(-M/10^15)"""
        return pm.exponential_like(value, rate)
    @pm.potential
    def ma_bound(ma=ma):
        if ((ma >= 0.01) and (ma < 10)):
           return 0.0
        else:
           return -np.inf
    #------------------------------------------------------------
    #uncertainty in the concentration parameter    
    @pm.deterministic(trace=True, plot=True)
    def sigma(M=ma):
       if M < 1:
           return .09
       else:
           return .06
    #------------------------------------------------------------
    #concentration: expected value from the Neto et al. 2007 mass-concentration relation,
    #wrapped in a deterministic so that it tracks ma during sampling (rather than being
    #evaluated only once at model-construction time)
    @pm.deterministic
    def log_cExpected(M=ma):
        return math.log(5.26/(1+z_halo)*((M*M0)/1e14)**(-.1))
    concentration = pm.Lognormal("concentration", log_cExpected, 1/sigma**2)

    @pm.potential
    def concentration_bound(concentration=concentration):
        if ((concentration > 1.5) and (concentration <= 20)):
           return 0.0
        else:
           return -np.inf
    model_pars=[Xpos, Ypos, ma, concentration]
    @pm.stochastic( name='reduced_shear', dtype=float,observed=True, trace = True )
    def reduced_shear(value=observed_g, model_pars=model_pars):
       Xpos=model_pars[0]
       Ypos=model_pars[1]
       mass=model_pars[2]*M0
       conc=model_pars[3]
       #separation in the scale of Mpc
       dist=np.sqrt((gal_pos[:,0]-Xpos)**2+(gal_pos[:,1]-Ypos)**2)*angular_separation
       phi=np.arctan2((gal_pos[:,1]-Ypos), (gal_pos[:,0]-Xpos))
       gtan=-(value[:,0]*np.cos(2*phi)+value[:,1]*np.sin(2*phi))
       gcros=-value[:,0]*np.sin(2*phi)+value[:,1]*np.cos(2*phi)
       d=np.vstack((dist,gtan,g_err,beta,rho_c_over_sigma_c))
       #concatenate input data and sort data based on their distance from cluster center
       ndata=np.array(sorted(d.T, key=lambda  l:l[0]))
       #choose objects closer than 3.0 Mpc to the cluster center
       ndata=ndata[ndata[:,0]<=3.0*angular_separation]
       #binning data again based on their distance
       bins = np.linspace(ndata[0,0], ndata[-1,0], 10)
       #compute the mean in each bin for different input parameters
       digitized = np.digitize(ndata[:,0], bins)
       bin_r_mpc= np.array([ndata[digitized == i,0].mean() for i in range(1, len(bins))])
       bin_shear= np.array([ndata[digitized == i,1].mean() for i in range(1, len(bins))])
       bin_shearerr= np.array([ndata[digitized == i,2].mean() for i in range(1, len(bins))])
       avebeta= ndata[:,3].mean() 
       avebeta2=(ndata[:,3]**2).mean()
       ave_rho_c_over_sigma_c=ndata[:,4].mean() 
       loglikelihood = nfwmodeltools.shearprofile_like(mass,
                                                       conc,
                                                       bin_r_mpc,
                                                       bin_shear,
                                                       bin_shearerr,
                                                       avebeta,
                                                       avebeta2,
                                                       rho_crit,
                                                       ave_rho_c_over_sigma_c)
       return loglikelihood
    return locals()

if __name__ == '__main__':

    M = pm.MCMC(nfw(gal_pos,observed_g,g_err,rho_crit,rho_c_over_sigma_c,beta),db='pickle',dbname='NFWTracer.pickle')
    M.use_step_method(pm.AdaptiveMetropolis, M.model_pars ,verbose=1)
    M.isample(40000,8000,50)
    # plot the traces (mass in physical units is ma*M0)
    [pm.Matplot.plot(s) for s in [M.ma, M.Xpos, M.Ypos, M.concentration]]


The problems:

The sampling is very slow, and my question is whether it is possible to use multiprocessing to speed up the code. My worry is that, for the chain to move in the right direction in parameter space, it needs to know about the previous steps, and that seems to conflict with parallelizing the code.
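To make the question concrete, what I have in mind is something like the rough sketch below (untested; it just reruns the nfw model above as several independent chains in separate processes, each writing its own trace file, with the chain count and file names as placeholders):

import multiprocessing as mp
import pymc as pm

def run_chain(chain_id):
    # each process builds its own copy of the model and writes its own database
    chain = pm.MCMC(nfw(gal_pos, observed_g, g_err, rho_crit,
                        rho_c_over_sigma_c, beta),
                    db='pickle', dbname='NFWTracer_%d.pickle' % chain_id)
    chain.sample(40000, burn=8000, thin=50)
    chain.db.close()

if __name__ == '__main__':
    # four independent chains; this parallelizes across chains, not within a single chain
    jobs = [mp.Process(target=run_chain, args=(i,)) for i in range(4)]
    for job in jobs:
        job.start()
    for job in jobs:
        job.join()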


Next point: I was reading this thread, and somewhere it was mentioned that AdaptiveMetropolis is not a good step method if you do not know where to start sampling, and that you also need to make sure the sampling stays on the right track. On the other hand, it might work better for correlated parameters (I am not sure). So my question is whether AdaptiveMetropolis is a good choice of step method in my case.
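For reference (continuing from the __main__ block above, just as a sketch), the two alternatives I am weighing would look roughly like this:

# one joint AdaptiveMetropolis step method over all parameters: the proposal
# covariance adapts during sampling, which may help if mass and concentration are correlated
M.use_step_method(pm.AdaptiveMetropolis, [M.Xpos, M.Ypos, M.ma, M.concentration])

# versus a separate plain Metropolis sampler for each parameter
for par in [M.Xpos, M.Ypos, M.ma, M.concentration]:
    M.use_step_method(pm.Metropolis, par)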

Thanks  in advance,
Zahra. 
This message has been deleted

Zahra Sheikh

Jul 28, 2014, 14:20:48
to py...@googlegroups.com

Hi,
Regarding my question, which unfortunately nobody answered, I figured the previous post in this thread might help me solve my problem. I ran the raftery_lewis function to estimate how many steps I need for my chain to converge, and what I got was roughly 1,700,000 iterations. I have been running my MCMC code for 3 days and have only reached 135,000 iterations, so I am afraid it might take a month to reach the end of the chain. I have used Cython to optimize the performance of the log-likelihood as much as possible; the only bottleneck is the MCMC itself. According to the previous post, the new version 2.3.2 can use PySpark to parallelize the MCMC chain. If I understand it correctly, this might solve my problem, but could you please provide some small documentation on how it works and how I could implement it in my code?
I also tried to set up my problem with this pymc version with Spark as follows:

# (imports omitted in the original post; presumably something like
#  "from pyspark import SparkContext" and pymc's MCMCSpark class)
if __name__ == '__main__':
    sc = SparkContext('local', 'HaloMCMC')
    input = nfw(gal_pos, observed_g, g_err, rho_crit, rho_c_over_sigma_c, beta)
    M = MCMCSpark(input=input, db='spark', nJobs=10, spark_context=sc)
    M.sample(1500000, burn=10000, thin=300)
    pm.Matplot.plot(M, format='pdf', suffix='-his')
    scores = pm.geweke(M, intervals=20)
    pm.Matplot.geweke_plot(scores, format='pdf', suffix='-diag')
    #pm.raftery_lewis(M, q=0.025, r=0.005, s=0.98, epsilon=0.001, verbose=2)
    pm.utils.coda(M)
    pm.Matplot.autocorrelation(M, maxlag=200, format='pdf', suffix='-autocorr', verbose=1)

I got the following error message:

ERROR: PicklingError: Can't pickle 'fortran' object: <fortran object> [pickle]
Traceback (most recent call last):
  File "HALO_NFW_MCMCSpark.py", line 158, in <module>
    M.sample(1500000 ,burn=10000,thin=300)
  File "/vol/anaconda/lib/python2.7/site-packages/pymc/MCMCSpark.py", line 99, in sample
    rdd = self.sc.parallelize(xrange(self.nJobs)).map(sample_on_spark).cache()
  File "/vol/software/spark-1.0.1/python/pyspark/rdd.py", line 199, in cache
    self._jrdd.cache()
  File "/vol/anaconda/lib/python2.7/pickle.py", line 636, in _batch_appends
    save(tmp[0])
  File "/vol/anaconda/lib/python2.7/pickle.py", line 313, in save
    (t.__name__, obj))
 pickle.PicklingError: Can't pickle 'fortran' object: <fortran object>

I don't understand why this error occurs.
I would appreciate it if somebody could kindly answer my questions.
Thanks in advance.
Zahra

Thomas Wiecki

Jul 28, 2014, 14:31:47
to py...@googlegroups.com
Hi Zahra,

If you're just getting started, I would shy away from multiprocessing. In general, MCMC is not easily parallelizable.

I would see if you can get by with fewer samples, or subsample your data if there are many data points, or use a simpler model if that is the cause of the long running time. In addition, you can try the slice sampler from pymc 2.3.3, which often has better convergence.

Thomas



--
Thomas Wiecki
PhD candidate, Brown University
Quantitative Researcher, Quantopian Inc, Boston

Zahra Sheikh

Jul 30, 2014, 01:04:05
to py...@googlegroups.com
Hi Thomas,

So your point is that multiprocessing support in pymc is not yet mature enough for beginners to apply it to their own problems just by reading the documentation, right?

On the other point, about the slice sampler: I couldn't find any proper documentation in pymc explaining how it should be used. Could you please explain how to use it and which input parameters it needs?

Regards,
Zahra

Thomas Wiecki

Jul 30, 2014, 03:41:55
to py...@googlegroups.com
See here for changing step methods: http://pymc-devs.github.io/pymc/modelfitting.html#step-methods

You can just use the slice sampler as a drop-in replacement for Metropolis-Hastings.
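For example, something along these lines (a rough sketch, assuming the slice step method class pm.Slicer from 2.3.x, and reusing the model and parameter names from your first post):

import pymc as pm

M = pm.MCMC(nfw(gal_pos, observed_g, g_err, rho_crit,
                rho_c_over_sigma_c, beta),
            db='pickle', dbname='NFWTracer_slice.pickle')

# assign a slice sampler to each continuous parameter instead of Metropolis
for par in [M.Xpos, M.Ypos, M.ma, M.concentration]:
    M.use_step_method(pm.Slicer, par)

M.sample(40000, burn=8000, thin=50)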

Kai Londenberg

Jul 30, 2014, 04:17:55
to py...@googlegroups.com
Hi,

Just a short thought: if neither Metropolis-Hastings nor the slice sampler leads to good results, you might try an emcee / PyMC combination, just like Thomas described here: http://twiecki.github.io/blog/2013/09/23/emcee-pymc/

This should also give you multiprocessing capability as a by-product. I would not choose the step method just by looking at whether it supports multiprocessing, though...
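Roughly, the pattern from that post looks like the sketch below (untested against your model; it wraps the joint log-probability of the PyMC model from your first post in a plain function that emcee can call, and emcee's threads argument then gives you the parallel evaluation; the walker and step counts are just placeholders):

import numpy as np
import pymc as pm
import emcee

# build the PyMC model once; emcee will sample its joint log-probability
model = pm.Model(nfw(gal_pos, observed_g, g_err, rho_crit,
                     rho_c_over_sigma_c, beta))
free = [model.Xpos, model.Ypos, model.ma, model.concentration]

def lnprob(theta):
    # push the walker position into the PyMC stochastics and return the joint logp
    try:
        for var, val in zip(free, theta):
            var.value = val
        return model.logp
    except pm.ZeroProbability:
        return -np.inf

ndim, nwalkers = len(free), 50
# start the walkers in a small ball around the current parameter values
start = np.array([var.value for var in free])
p0 = [start * (1 + 1e-4 * np.random.randn(ndim)) for _ in range(nwalkers)]

# threads > 1 evaluates the walkers' log-probabilities in parallel processes
sampler = emcee.EnsembleSampler(nwalkers, ndim, lnprob, threads=4)
sampler.run_mcmc(p0, 1000)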

best,

Kai