emcee parallelization


Peter Metz

Dec 19, 2017, 4:42:58 PM
to lmfit-py
Hi all,

I'm trying to run my emcee sampling via lmfit on a computing cluster setup with SLURM.

Everything seems to work well on up to 99 CPUs, but 120 and 160 CPUs (at 40 CPUs/node) raise errors. The Python error complains that my lnprob function has issues evaluating (not true), but it seems to stem from computing resources being unavailable, which leads me to believe this is some sort of MPI pool / resource-distribution issue.

SO- I have relatively little experience with MPI. Can anyone comment on whether this is a limitation coded into emcee or lmfit (e.g., is there a hardcoded maximum number of CPUs)? Any insight from anyone currently running parallel emcee jobs?

Thanks!

-Peter

Matt Newville

Dec 19, 2017, 5:11:27 PM
to lmfit-py
Peter,

It is always helpful to provide a small example that shows the problem. We have no idea what you did unless you tell us.

Like, what did you set 'workers' to be? Since you decided not to tell us, may I guess that you set it to 99?

There is not a hard-coded maximum for Ncpu in lmfit, but there are defaults for many arguments that might affect how the problem is run.

--Matt

Peter Metz

Dec 20, 2017, 10:18:50 AM
to lmfit-py
Hi Matt,

That is correct. `workers` was set to 99, and the job was set to requisition 4 nodes * 40 CPUs on the cluster. Cases where `workers` was set to 120 and 160 respectively aborted with errors like:

[...]
OpenBLAS blas_thread_init: pthread_create: Resource temporarily unavailable
OpenBLAS blas_thread_init: RLIMIT_NPROC 4096 current, 515238 max
Traceback (most recent call last):
  File "/cm/shared/apps/anaconda2/lib/python2.7/site-packages/emcee/ensemble.py", line 519, in __call__
    return self.f(x, *self.args, **self.kwargs)
  File "/cm/shared/apps/anaconda2/lib/python2.7/site-packages/lmfit/minimizer.py", line 1739, in _lnpost
    out = userfcn(params, *userargs, **userkwargs)
  File "/home/metz/graphene/mc3.py", line 36, in pdf_lnprob
    resid = ref.objective_function(p) # get residual / recalc G_r
  File "/home/metz/graphene/model.py", line 168, in objective_function
    self.phase_composite(self.phases, dat, recalc=True)  # get new G(r)
  File "/cm/shared/apps/anaconda2/pkgs/mstack_0.12.1/mstack/pairdistributionfunction.py", line 735, in phase_composite
    gr = self.model_composite(phase, data)  # <--- weighted phase PDF
  File "/cm/shared/apps/anaconda2/pkgs/mstack_0.12.1/mstack/pairdistributionfunction.py", line 684, in model_composite
    phase.to_models()
  File "/cm/shared/apps/anaconda2/pkgs/mstack_0.12.1/mstack/pairdistributionfunction.py", line 371, in to_models
    self.inter = Interface(self.phase, self.mno)
  File "/cm/shared/apps/anaconda2/pkgs/mstack_0.12.1/mstack/interface.py", line 251, in __init__
    self.update_phases(phases)
  File "/cm/shared/apps/anaconda2/pkgs/mstack_0.12.1/mstack/interface.py", line 228, in update_phases
    self._get_lattices()    # reconstruct lattice dict with new phase
  File "/cm/shared/apps/anaconda2/pkgs/mstack_0.12.1/mstack/interface.py", line 50, in _get_lattices
    getattr(self.phases[p], s)))})
  File "/cm/shared/apps/anaconda2/lib/python2.7/site-packages/diffpy.Structure-1.3.5-py2.7.egg/diffpy/Structure/lattice.py", line 143, in __init__
    self.setLatPar(a, b, c, alpha, beta, gamma, baserot=baserot)
  File "/cm/shared/apps/anaconda2/lib/python2.7/site-packages/diffpy.Structure-1.3.5-py2.7.egg/diffpy/Structure/lattice.py", line 201, in setLatPar
    self.recbase = numalg.inv(self.base)
  File "/cm/shared/apps/anaconda2/lib/python2.7/site-packages/numpy/linalg/linalg.py", line 513, in inv
    ainv = _umath_linalg.inv(a, signature=signature, extobj=extobj)
KeyboardInterrupt
Process PoolWorker-102:
Traceback (most recent call last):
  File "/cm/shared/apps/anaconda2/lib/python2.7/multiprocessing/process.py", line 258, in _bootstrap
    self.run()
  File "/cm/shared/apps/anaconda2/lib/python2.7/multiprocessing/process.py", line 114, in run
    self._target(*self._args, **self._kwargs)
  File "/cm/shared/apps/anaconda2/lib/python2.7/multiprocessing/pool.py", line 113, in worker
    result = (True, func(*args, **kwds))
  File "/cm/shared/apps/anaconda2/lib/python2.7/multiprocessing/pool.py", line 65, in mapstar
    return map(*args)
  File "/cm/shared/apps/anaconda2/lib/python2.7/site-packages/emcee/ensemble.py", line 519, in __call__
    return self.f(x, *self.args, **self.kwargs)
  File "/cm/shared/apps/anaconda2/lib/python2.7/site-packages/lmfit/minimizer.py", line 1739, in _lnpost
    out = userfcn(params, *userargs, **userkwargs)
  File "/home/metz/graphene/mc3.py", line 36, in pdf_lnprob
    resid = ref.objective_function(p) # get residual / recalc G_r
  File "/home/metz/graphene/model.py", line 168, in objective_function
    self.phase_composite(self.phases, dat, recalc=True)  # get new G(r)
  File "/cm/shared/apps/anaconda2/pkgs/mstack_0.12.1/mstack/pairdistributionfunction.py", line 735, in phase_composite
    gr = self.model_composite(phase, data)  # <--- weighted phase PDF
  File "/cm/shared/apps/anaconda2/pkgs/mstack_0.12.1/mstack/pairdistributionfunction.py", line 684, in model_composite
    phase.to_models()
  File "/cm/shared/apps/anaconda2/pkgs/mstack_0.12.1/mstack/pairdistributionfunction.py", line 371, in to_models
    self.inter = Interface(self.phase, self.mno)
  File "/cm/shared/apps/anaconda2/pkgs/mstack_0.12.1/mstack/interface.py", line 251, in __init__
    self.update_phases(phases)
  File "/cm/shared/apps/anaconda2/pkgs/mstack_0.12.1/mstack/interface.py", line 228, in update_phases
    self._get_lattices()    # reconstruct lattice dict with new phase
  File "/cm/shared/apps/anaconda2/pkgs/mstack_0.12.1/mstack/interface.py", line 50, in _get_lattices
    getattr(self.phases[p], s)))})
  File "/cm/shared/apps/anaconda2/lib/python2.7/site-packages/diffpy.Structure-1.3.5-py2.7.egg/diffpy/Structure/lattice.py", line 143, in __init__
    self.setLatPar(a, b, c, alpha, beta, gamma, baserot=baserot)
  File "/cm/shared/apps/anaconda2/lib/python2.7/site-packages/diffpy.Structure-1.3.5-py2.7.egg/diffpy/Structure/lattice.py", line 201, in setLatPar
    self.recbase = numalg.inv(self.base)
  File "/cm/shared/apps/anaconda2/lib/python2.7/site-packages/numpy/linalg/linalg.py", line 513, in inv
    ainv = _umath_linalg.inv(a, signature=signature, extobj=extobj)


The log probability function I've defined (essentially following emcee's documentation):

import numpy as np

def pdf_lnprob(p, ref, dat):

    # get model components
    resid = ref.objective_function(p)  # get residual / recalc G_r
    model = ref.composite[dat.name]    # get computed G_r
    # evaluate log-likelihood function
    lnf = ref.params['lnf'].value
    inv_sigma2 = 1. / (model[:, 1] ** 2 * np.exp(2 * lnf))  # weighting
    rv = -1. / 2. * np.sum(resid ** 2 * inv_sigma2 - np.log(inv_sigma2))

    return rv
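For reference, the weighting above follows the "underestimated variance" likelihood in emcee's fitting tutorial. Here is a self-contained sketch of the same arithmetic with synthetic stand-ins for the quantities `pdf_lnprob` pulls out of `ref` and `dat` (all shapes and values below are illustrative assumptions, not the real refinement):

```python
import numpy as np

# synthetic stand-ins (illustrative only)
rng = np.random.default_rng(0)
model = np.column_stack([np.linspace(1, 10, 50),           # r grid
                         rng.normal(1.0, 0.2, size=50)])   # computed G(r)
resid = rng.normal(scale=0.1, size=50)                     # residual array
lnf = np.log(1.8)                                          # fitted log-fractional error

inv_sigma2 = 1.0 / (model[:, 1] ** 2 * np.exp(2 * lnf))    # per-point weights
lnprob = -0.5 * np.sum(resid ** 2 * inv_sigma2 - np.log(inv_sigma2))
```

Each walker evaluates this scalar once per step, so in a parallel run it is this function that every pool worker calls.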

The piece of code calling emcee from lmfit, including the passed-in parameters. `workers` here is set to 99; this job does not fail, whereas jobs set to 120 and 160 respectively did fail. The object `ref` below is a refinement class instance whose parameters are stored in an lmfit.Parameters instance.

# emcee configuration # ##################### #
ref.params.add('f', vary=True, value=1.8, min=0.001, max=3.0)  # estimation on the est. st. dev.
ref.params.add('lnf', expr='log(f)')

# !! sampling intentionally tiny to speed up completion for testing !! #
steps = 10     # n steps
nwalkers = 80  # population size
burn = 0       # discard burn-in period
workers = 99   # on 4 nodes * 40 cores
# ####################################### #

params = lmfit.Parameters()

for k in [k for k, v in ref.params.items() if v.vary is True]:
    params.add(ref.params[k])  # MCMC for varied parameters only

mini = lmfit.Minimizer(pdf_lnprob, params=ref.params,
                       fcn_args=(ref, dat), iter_cb=None)

res = mini.emcee(params=params,
                 steps=steps,
                 nwalkers=nwalkers,
                 burn=burn,
                 workers=workers)


SBATCH file for SLURM:

#!/bin/sh
#SBATCH -N 4
#SBATCH -n 40
#SBATCH -t 24:00:00
#SBATCH -J pmetz%j
#SBATCH -o pmetz100.out
#SBATCH -e pmetz100.err
#       mpiexec ipython disordered_constrained_fit.py  # <-- this was a mistake, calling mpiexec with a python MPI pool messes things up
python mc100core.py   # <--- script containing the above code snippets
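If the OpenBLAS thread limit in the traceback turns out to be the culprit, one common mitigation (a suggestion, not something confirmed in this thread) is to cap the implicit BLAS/OpenMP thread pools before Python starts, so each emcee pool worker stays single-threaded instead of spawning one thread per core:

```shell
# cap the implicit BLAS/OpenMP thread pools so each emcee pool worker
# stays single-threaded instead of spawning one thread per core
export OPENBLAS_NUM_THREADS=1
export OMP_NUM_THREADS=1
# then launch the sampler script as before:
#   python mc100core.py
echo "OPENBLAS_NUM_THREADS=$OPENBLAS_NUM_THREADS"
```

These exports would go near the top of the SBATCH file, before the `python` line.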

Matt Newville

Dec 20, 2017, 10:53:45 AM
to lmfit-py
Hi Peter,


Hm, sorry, but I don't know why that is happening. Unfortunately, I don't have ready access to a 100+ node machine, and I have only played around with MPI, so I have no insight on using it.

The error message looks like OpenBLAS has hit some limit, causing the linear algebra libraries to fail. Is that how you read it too? I'm not sure why setting workers > 99 would trigger that, unless that is coincidentally a limit for OpenBLAS.

Anyone else have any insight on this?

--Matt

Peter Metz

Dec 20, 2017, 1:03:30 PM
to lmfit-py
Hi Matt,

That's how I'm reading it as well.

I've run another test case with a simpler objective function and noted that I can extend beyond 99 CPUs.

I think perhaps this is a memory-allocation issue? Either way, I think you're correct that it is not a restriction on the emcee/lmfit side, which you have helpfully confirmed.
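For what it's worth, a back-of-envelope using only numbers from the traceback is consistent with the thread-limit reading (a hypothesis, not a confirmed diagnosis): OpenBLAS starts one thread per core by default (40 on these nodes), so each pool worker can try to create ~40 BLAS threads against the logged soft limit of RLIMIT_NPROC = 4096:

```python
# numbers taken from the traceback above: "RLIMIT_NPROC 4096 current"
nproc_soft_limit = 4096
blas_threads_per_worker = 40  # OpenBLAS default: one thread per core

for workers in (99, 120, 160):
    threads = workers * blas_threads_per_worker
    status = "under" if threads < nproc_soft_limit else "OVER"
    print(f"{workers} workers -> ~{threads} threads ({status} the limit)")
# 99 -> 3960 (under); 120 -> 4800 and 160 -> 6400 (both over 4096)
```

99 workers stay just under the limit while 120 and 160 blow past it, matching the observed failures.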

I'll summarize the solution here for posterity if I come up with something.

Thanks,
Peter

Matt Newville

Dec 20, 2017, 6:15:17 PM
to lmfit-py
Hi Peter,



On Wed, Dec 20, 2017 at 12:03 PM, Peter Metz <pet...@gmail.com> wrote:
Hi Matt,

That's how I'm reading it as well.

I've run another test case with a simpler objective function and noted that I can extend beyond 99 cpu.

I think perhaps this is a memory allocation issue? Either way, I think you're correct and it is not a restriction on the emcee / lmfit side, which you have helpfully confirmed.

I'll summarize the solution here for posterity if I come up with something.


OK, then we'll assume this isn't exactly an lmfit (or emcee) issue. But yes, if you can summarize the problem and solution, I'm sure that would be helpful for others.

Thanks,

--Matt
