emcee parallelization


Peter Metz

Dec 19, 2017, 4:42:58 PM
to lmfit-py
Hi all,

I'm trying to run my emcee sampling via lmfit on a computing cluster setup with SLURM.

Everything seems to work well on up to 99 CPUs, but 120 and 160 CPUs (at 40 CPUs/node) raise errors. The Python error complains that my lnprob function has issues evaluating (not true), but it seems to stem from computing resources being unavailable, which leads me to believe this is some sort of MPI pool / resource-distribution issue.

SO- I have relatively little experience with MPI. Can anyone comment on whether this is a limitation coded into emcee or lmfit (e.g., is there a hardcoded maximum number of CPUs)? Any insight from anyone currently running parallel emcee jobs?

Thanks!

-Peter

Matt Newville

Dec 19, 2017, 5:11:27 PM
to lmfit-py
Peter,

It is always helpful to provide a small example that shows the problem. We have no idea what you did unless you tell us.

Like, what did you set 'workers' to be? Since you decided not to tell us, may I guess that you set it to 99?

There is not a hard-coded maximum for Ncpu in lmfit, but there are defaults for many arguments that might affect how the problem is run.

--Matt

Peter Metz

Dec 20, 2017, 10:18:50 AM
to lmfit-py
Hi Matt,

That is correct. `workers` was set to 99, and the job was set to requisition 4 nodes * 40 CPUs on the cluster. Cases where `workers` was set to 120 and 160 respectively aborted with errors like:

[...]
OpenBLAS blas_thread_init: pthread_create: Resource temporarily unavailable
OpenBLAS blas_thread_init: RLIMIT_NPROC 4096 current, 515238 max
Traceback (most recent call last):
  File "/cm/shared/apps/anaconda2/lib/python2.7/site-packages/emcee/ensemble.py", line 519, in __call__
    return self.f(x, *self.args, **self.kwargs)
  File "/cm/shared/apps/anaconda2/lib/python2.7/site-packages/lmfit/minimizer.py", line 1739, in _lnpost
    out = userfcn(params, *userargs, **userkwargs)
  File "/home/metz/graphene/mc3.py", line 36, in pdf_lnprob
    resid = ref.objective_function(p) # get residual / recalc G_r
  File "/home/metz/graphene/model.py", line 168, in objective_function
    self.phase_composite(self.phases, dat, recalc=True)  # get new G(r)
  File "/cm/shared/apps/anaconda2/pkgs/mstack_0.12.1/mstack/pairdistributionfunction.py", line 735, in phase_composite
    gr = self.model_composite(phase, data)  # <--- weighted phase PDF
  File "/cm/shared/apps/anaconda2/pkgs/mstack_0.12.1/mstack/pairdistributionfunction.py", line 684, in model_composite
    phase.to_models()
  File "/cm/shared/apps/anaconda2/pkgs/mstack_0.12.1/mstack/pairdistributionfunction.py", line 371, in to_models
    self.inter = Interface(self.phase, self.mno)
  File "/cm/shared/apps/anaconda2/pkgs/mstack_0.12.1/mstack/interface.py", line 251, in __init__
    self.update_phases(phases)
  File "/cm/shared/apps/anaconda2/pkgs/mstack_0.12.1/mstack/interface.py", line 228, in update_phases
    self._get_lattices()    # reconstruct lattice dict with new phase
  File "/cm/shared/apps/anaconda2/pkgs/mstack_0.12.1/mstack/interface.py", line 50, in _get_lattices
    getattr(self.phases[p], s)))})
  File "/cm/shared/apps/anaconda2/lib/python2.7/site-packages/diffpy.Structure-1.3.5-py2.7.egg/diffpy/Structure/lattice.py", line 143, in __init__
    self.setLatPar(a, b, c, alpha, beta, gamma, baserot=baserot)
  File "/cm/shared/apps/anaconda2/lib/python2.7/site-packages/diffpy.Structure-1.3.5-py2.7.egg/diffpy/Structure/lattice.py", line 201, in setLatPar
    self.recbase = numalg.inv(self.base)
  File "/cm/shared/apps/anaconda2/lib/python2.7/site-packages/numpy/linalg/linalg.py", line 513, in inv
    ainv = _umath_linalg.inv(a, signature=signature, extobj=extobj)
KeyboardInterrupt
Process PoolWorker-102:
Traceback (most recent call last):
  File "/cm/shared/apps/anaconda2/lib/python2.7/multiprocessing/process.py", line 258, in _bootstrap
    self.run()
  File "/cm/shared/apps/anaconda2/lib/python2.7/multiprocessing/process.py", line 114, in run
    self._target(*self._args, **self._kwargs)
  File "/cm/shared/apps/anaconda2/lib/python2.7/multiprocessing/pool.py", line 113, in worker
    result = (True, func(*args, **kwds))
  File "/cm/shared/apps/anaconda2/lib/python2.7/multiprocessing/pool.py", line 65, in mapstar
    return map(*args)
  File "/cm/shared/apps/anaconda2/lib/python2.7/site-packages/emcee/ensemble.py", line 519, in __call__
    return self.f(x, *self.args, **self.kwargs)
  File "/cm/shared/apps/anaconda2/lib/python2.7/site-packages/lmfit/minimizer.py", line 1739, in _lnpost
    out = userfcn(params, *userargs, **userkwargs)
  File "/home/metz/graphene/mc3.py", line 36, in pdf_lnprob
    resid = ref.objective_function(p) # get residual / recalc G_r
  File "/home/metz/graphene/model.py", line 168, in objective_function
    self.phase_composite(self.phases, dat, recalc=True)  # get new G(r)
  File "/cm/shared/apps/anaconda2/pkgs/mstack_0.12.1/mstack/pairdistributionfunction.py", line 735, in phase_composite
    gr = self.model_composite(phase, data)  # <--- weighted phase PDF
  File "/cm/shared/apps/anaconda2/pkgs/mstack_0.12.1/mstack/pairdistributionfunction.py", line 684, in model_composite
    phase.to_models()
  File "/cm/shared/apps/anaconda2/pkgs/mstack_0.12.1/mstack/pairdistributionfunction.py", line 371, in to_models
    self.inter = Interface(self.phase, self.mno)
  File "/cm/shared/apps/anaconda2/pkgs/mstack_0.12.1/mstack/interface.py", line 251, in __init__
    self.update_phases(phases)
  File "/cm/shared/apps/anaconda2/pkgs/mstack_0.12.1/mstack/interface.py", line 228, in update_phases
    self._get_lattices()    # reconstruct lattice dict with new phase
  File "/cm/shared/apps/anaconda2/pkgs/mstack_0.12.1/mstack/interface.py", line 50, in _get_lattices
    getattr(self.phases[p], s)))})
  File "/cm/shared/apps/anaconda2/lib/python2.7/site-packages/diffpy.Structure-1.3.5-py2.7.egg/diffpy/Structure/lattice.py", line 143, in __init__
    self.setLatPar(a, b, c, alpha, beta, gamma, baserot=baserot)
  File "/cm/shared/apps/anaconda2/lib/python2.7/site-packages/diffpy.Structure-1.3.5-py2.7.egg/diffpy/Structure/lattice.py", line 201, in setLatPar
    self.recbase = numalg.inv(self.base)
  File "/cm/shared/apps/anaconda2/lib/python2.7/site-packages/numpy/linalg/linalg.py", line 513, in inv
    ainv = _umath_linalg.inv(a, signature=signature, extobj=extobj)


The log probability function I've defined (essentially following emcee's documentation):

import numpy as np

def pdf_lnprob(p, ref, dat):

    # get model components
    resid = ref.objective_function(p)  # get residual / recalc G_r
    model = ref.composite[dat.name]    # get computed G_r
    # evaluate log-likelihood function
    lnf = ref.params['lnf'].value
    inv_sigma2 = 1. / (model[:, 1] ** 2 * np.exp(2 * lnf))  # weighting
    rv = -1. / 2. * np.sum(resid ** 2 * inv_sigma2 - np.log(inv_sigma2))

    return rv
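For reference, the weighting above follows the "underestimated variance" likelihood in emcee's fitting tutorial. Here is a self-contained sketch of the same arithmetic with synthetic stand-ins for the quantities `pdf_lnprob` pulls out of `ref` and `dat` (all shapes and values below are illustrative assumptions, not the real refinement):

```python
import numpy as np

# synthetic stand-ins (illustrative only)
rng = np.random.default_rng(0)
model = np.column_stack([np.linspace(1, 10, 50),           # r grid
                         rng.normal(1.0, 0.2, size=50)])   # computed G(r)
resid = rng.normal(scale=0.1, size=50)                     # residual array
lnf = np.log(1.8)                                          # fitted log-fractional error

inv_sigma2 = 1.0 / (model[:, 1] ** 2 * np.exp(2 * lnf))    # per-point weights
lnprob = -0.5 * np.sum(resid ** 2 * inv_sigma2 - np.log(inv_sigma2))
```

Each walker evaluates this scalar once per step, so in a parallel run it is this function that every pool worker calls.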

The piece of code calling emcee from lmfit, including the passed-in parameters. `workers` here is set to 99; this job does not fail, whereas jobs set to 120 and 160 respectively did fail. The object `ref` below is a refinement class instance whose parameters are stored in an lmfit.Parameters instance.

# emcee configuration # ##################### #
ref.params.add('f', vary=True, value=1.8, min=0.001, max=3.0)  # estimation on the est. st. dev.
ref.params.add('lnf', expr='log(f)')

# !! sampling intentionally tiny to speed up completion for testing !! #
steps = 10     # n steps
nwalkers = 80  # population size
burn = 0       # discard burn-in period
workers = 99   # on 4 nodes * 40 cores
# ####################################### #

params = lmfit.Parameters()

for k in [k for k, v in ref.params.items() if v.vary is True]:
    params.add(ref.params[k])  # MCMC for varied parameters only

mini = lmfit.Minimizer(pdf_lnprob, params=ref.params,
                       fcn_args=(ref, dat), iter_cb=None)

res = mini.emcee(params=params,
                 steps=steps,
                 nwalkers=nwalkers,
                 burn=burn,
                 workers=workers)


SBATCH file for SLURM:

#!/bin/sh
#SBATCH -N 4
#SBATCH -n 40
#SBATCH -t 24:00:00
#SBATCH -J pmetz%j
#SBATCH -o pmetz100.out
#SBATCH -e pmetz100.err
#       mpiexec ipython disordered_constrained_fit.py  # <-- this was a mistake, calling mpiexec with a python MPI pool messes things up
python mc100core.py   # <--- script containing the above code snippets
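If the OpenBLAS thread limit in the traceback turns out to be the culprit, one common mitigation (a suggestion, not something confirmed in this thread) is to cap the implicit BLAS/OpenMP thread pools before Python starts, so each emcee pool worker stays single-threaded instead of spawning one thread per core:

```shell
# cap the implicit BLAS/OpenMP thread pools so each emcee pool worker
# stays single-threaded instead of spawning one thread per core
export OPENBLAS_NUM_THREADS=1
export OMP_NUM_THREADS=1
# then launch the sampler script as before:
#   python mc100core.py
echo "OPENBLAS_NUM_THREADS=$OPENBLAS_NUM_THREADS"
```

These exports would go near the top of the SBATCH file, before the `python` line.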

Matt Newville

Dec 20, 2017, 10:53:45 AM
to lmfit-py
Hi Peter,


Hm, sorry, but I don't know why that is happening. Unfortunately, I don't have ready access to a 100+ node machine, and I have only played around with MPI, so I have no insight on using it.

The error message looks like OpenBLAS has hit some limit, causing the linear algebra libraries to fail. Is that how you read it too? I'm not sure why setting workers > 99 would trigger that, unless that is coincidentally a limit for OpenBLAS.

Anyone else have any insight on this?

--Matt

Peter Metz

Dec 20, 2017, 1:03:30 PM
to lmfit-py
Hi Matt,

That's how I'm reading it as well.

I've run another test case with a simpler objective function and noted that I can extend beyond 99 CPUs.

I think perhaps this is a memory-allocation issue? Either way, I think you're correct that it is not a restriction on the emcee/lmfit side, which you have helpfully confirmed.
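For what it's worth, a back-of-envelope using only numbers from the traceback is consistent with the thread-limit reading (a hypothesis, not a confirmed diagnosis): OpenBLAS starts one thread per core by default (40 on these nodes), so each pool worker can try to create ~40 BLAS threads against the logged soft limit of RLIMIT_NPROC = 4096:

```python
# numbers taken from the traceback above: "RLIMIT_NPROC 4096 current"
nproc_soft_limit = 4096
blas_threads_per_worker = 40  # OpenBLAS default: one thread per core

for workers in (99, 120, 160):
    threads = workers * blas_threads_per_worker
    status = "under" if threads < nproc_soft_limit else "OVER"
    print(f"{workers} workers -> ~{threads} threads ({status} the limit)")
# 99 -> 3960 (under); 120 -> 4800 and 160 -> 6400 (both over 4096)
```

99 workers stay just under the limit while 120 and 160 blow past it, matching the observed failures.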

I'll summarize the solution here for posterity if I come up with something.

Thanks,
Peter

Matt Newville

Dec 20, 2017, 6:15:17 PM
to lmfit-py
Hi Peter,



On Wed, Dec 20, 2017 at 12:03 PM, Peter Metz <pet...@gmail.com> wrote:
Hi Matt,

That's how I'm reading it as well.

I've run another test case with a simpler objective function and noted that I can extend beyond 99 cpu.

I think perhaps this is a memory allocation issue? Either way, I think you're correct and it is not a restriction on the emcee / lmfit side, which you have helpfully confirmed.

I'll summarize the solution here for posterity if I come up with something.


OK, then we'll assume this isn't exactly an lmfit (or emcee) issue. But yes, if you can summarize the problem and solution, I'm sure that would be helpful for others.

Thanks,

--Matt
