Westpa hangs when importing mpi4py


jcla...@hawk.iit.edu

Nov 28, 2017, 6:17:06 PM
to westpa-users
Hello all,

I am trying to run westpa on a single node on a cluster, but I keep getting the following mpi error:

mpiexec noticed that process rank 1 with PID 187654 on node xs-0005 exited on signal 11 (Segmentation fault).

I stepped through with a Python debugger and found that Python hangs when mpi.py imports mpi4py while setting up the work managers. I wrote a quick hello-world script to test mpi4py and was able to execute it without issues, so I'm fairly certain it isn't an environment issue. I've also tried different modes for w_run (threads, processes, etc.) but haven't had any luck. Thoughts?
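For reference, a minimal mpi4py hello-world of the kind described (not the original script; the names and messages here are illustrative):

```python
# Minimal MPI sanity check: each rank reports in.
# Run under the same interpreter that WESTPA uses, e.g.:
#   mpiexec -n 2 python hello_mpi.py
try:
    from mpi4py import MPI
    comm = MPI.COMM_WORLD
    report = "rank %d of %d on %s" % (
        comm.Get_rank(), comm.Get_size(), MPI.Get_processor_name())
except ImportError as exc:  # the same failure WESTPA would hit on import
    report = "mpi4py is not importable: %s" % exc
print(report)
```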

Thanks,
Joseph 

Matthew Zwier

Nov 28, 2017, 6:38:51 PM
to westpa...@googlegroups.com
Hi Joseph,

This sounds like an environment mismatch of some sort, like the Python interpreter running WESTPA picking up a bad version of the mpi4py library, or (more likely) the mpi4py library picking up a bad version of the underlying MPI system. Can you establish if the Python interpreter you're using to test mpi4py is the same as the Python that WESTPA is using to run w_run?
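One way to check, for instance, is to run a sketch like the following under both the interactive environment and the submission-script environment and compare the output (the printed paths and vendor tuple are illustrative, not from this thread):

```python
# Report which interpreter and which mpi4py this environment resolves,
# so the result can be compared against what w_run picks up.
import sys

print("python interpreter:", sys.executable)
try:
    import mpi4py
    from mpi4py import MPI
    print("mpi4py location   :", mpi4py.__file__)
    print("underlying MPI    :", MPI.get_vendor())
except ImportError as exc:
    print("mpi4py not importable:", exc)
```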

Cheers,
Matt Z.

--
You received this message because you are subscribed to the Google Groups "westpa-users" group.
To unsubscribe from this group and stop receiving emails from it, send an email to westpa-users+unsubscribe@googlegroups.com.
To post to this group, send email to westpa...@googlegroups.com.
For more options, visit https://groups.google.com/d/optout.

Adam

Nov 28, 2017, 10:57:53 PM
to westpa...@googlegroups.com
Hi Joseph,

In addition to what Matt said, do you get the same errors (i.e., MPI errors) when you run in the threads, processes, etc., modes, or are they different errors?

Best,
Adam

---
Adam Pratt
Graduate Student in Chemistry
Chong Lab, Room 338, Eberly Hall
University of Pittsburgh
Pittsburgh, PA 15260

jcla...@hawk.iit.edu

Nov 29, 2017, 10:53:17 AM
to westpa-users
Thank you both for your quick responses.

I ran my test script just before w_run inside the same submission script; the test script runs as expected, while w_run still produces the mpiexec error as before. When setting up the WESTPA environment I set WEST_PYTHON to the output of `which python`, so I'm fairly confident both are using the same Python installation.

As far as different modes, I receive the same error for each--even when setting the mode to serial. 

Thanks,
Joseph

Adam

Nov 29, 2017, 11:18:59 AM
to westpa...@googlegroups.com
Hi Joseph,

That is weird. Are you using Anaconda Python 2? In addition, can you tell us the output of

python --version

?

---
Adam Pratt
Graduate Student in Chemistry
Chong Lab, Room 338, Eberly Hall
University of Pittsburgh
Pittsburgh, PA 15260

jcla...@hawk.iit.edu

Nov 29, 2017, 11:27:29 AM
to westpa-users
The Python version I'm using is 2.7.10, but I'm not using Anaconda (it's not available on the cluster I'm using). I did have to install h5py (against the HDF5/1.8.15 module that is available) and pyyaml with pip install --user; I've tested both and they seem to be working.

Adam

Nov 29, 2017, 4:53:54 PM
to westpa...@googlegroups.com
Hi Joseph,

It does sound like there's some sort of mismatch between the import that WESTPA is doing and the MPI library your cluster has. I'd recommend using Anaconda Python, which is easily installed onto any cluster without needing admin privileges.

https://www.anaconda.com/download/#linux

The installer will ask for a location, and ask if it can be added to your .bashrc.  If you're comfortable doing so (i.e., you don't want to micromanage your environment), let it.  You'll need to rebuild westpa.sh against the python version in Anaconda, so log out, log back in (or source ~/.bashrc), then re-run setup.sh in your westpa install directory.

Best,
Adam

---
Adam Pratt
Graduate Student in Chemistry
Chong Lab, Room 338, Eberly Hall
University of Pittsburgh
Pittsburgh, PA 15260

Matthew Zwier

Nov 30, 2017, 10:16:22 AM
to westpa...@googlegroups.com
Joseph,

Did you happen to install h5py on top of a parallel version of HDF5?

~Matt Z.

Joseph Clayton

Nov 30, 2017, 11:33:03 AM
to westpa...@googlegroups.com
The HDF5 module I built against requires OpenMPI, but it may be built against a different version of OpenMPI than the one I built mpi4py against. I plan on installing Anaconda under my user account and then pointing WESTPA at that installation; from what I can tell from Python's pdb tool, the error is coming from the packages, so hopefully having Anaconda install the packages will help. I'll let you know if I continue to have issues.

Thanks for the help!

Joseph


Matthew Zwier

Nov 30, 2017, 11:51:32 AM
to westpa...@googlegroups.com
This sounds like a plan. I strongly suspect there is an MPI conflict between mpi4py and h5py/HDF5. I would recommend that you do not link h5py against a parallel version of HDF5, but rather a serial version. WESTPA currently doesn't use any parallel HDF5 features, and there hasn't been much experimentation with parallel HDF5 in conjunction with the MPI work manager, so I'm not surprised this is unearthing some conflicts.
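A quick way to check which kind of HDF5 an h5py build links against, using h5py's own introspection API (a sketch; run it in the interpreter in question):

```python
# Check how this h5py was built: which HDF5 it links and whether it was
# compiled against a parallel (MPI) HDF5. WESTPA only needs the serial build.
mpi_build = None  # becomes True/False once h5py is importable
try:
    import h5py
    mpi_build = h5py.get_config().mpi
    print("h5py %s | HDF5 %s | MPI build: %s"
          % (h5py.version.version, h5py.version.hdf5_version, mpi_build))
except ImportError as exc:
    print("h5py not importable:", exc)
```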

Cheers,
Matt Z.

jcla...@hawk.iit.edu

Dec 7, 2017, 9:44:06 PM
to westpa-users
Thanks to both of you for your help; installing Anaconda solved my import errors, and I've been able to get WESTPA to run with one worker. When I attempt to use multiple workers, I receive an error from h5py saying that the west.h5 file could not be locked:

IOError: Unable to open file (Unable to lock file, errno = 11, error message = 'resource temporarily unavailable')

From what I've read, this is a known problem for several h5py users on Lustre file systems (which is what this cluster uses), since locking is expensive/slow on such systems and is often disabled by admins. I'm currently looking into other workarounds (writing to a scratch directory, etc.), but I'm curious--have you heard of similar issues?

Thanks,
Joseph

Matthew Zwier

Dec 8, 2017, 4:19:41 PM
to westpa...@googlegroups.com
Joseph,

This is new to me. WESTPA currently only uses one process to read/write HDF5 files, so OS-level locking is unnecessary. If a workaround isn't feasible, then there might be a way to disable locking by h5py, though that would entail a quick edit to the WESTPA source code. Did you come across such a workaround in your research into this problem from the h5py side?

Cheers,
Matt Z.

Joseph Clayton

Dec 8, 2017, 4:36:54 PM
to westpa...@googlegroups.com
I did find an environment flag that disables locking, and it seemed to work; but I later discovered I was trying to run two instances of westpa in the same run (oops), which probably triggered the error.
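The flag isn't named in the thread; presumably it was HDF5's HDF5_USE_FILE_LOCKING environment variable (HDF5 1.10 and later), which must take effect before h5py first opens the file:

```python
# Disable HDF5's POSIX file locking (available in HDF5 >= 1.10).
# The variable must be set before h5py/HDF5 first opens a file, so put
# this at the very top of the entry script, or export the variable in
# the shell before launching w_run.
import os
os.environ["HDF5_USE_FILE_LOCKING"] = "FALSE"
# ...only after this point: import h5py and open west.h5
```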

Joseph

Matthew Zwier

Dec 8, 2017, 4:37:57 PM
to westpa...@googlegroups.com
That, too, would do it. Glad to hear that h5py's file locking kept you from clobbering your HDF5 file; in older versions that wasn't the case.

Cheers,
Matt Z.