Inflate() error with h5py


Alex DeGrave

Oct 26, 2015, 3:56:02 PM
to westpa-users
Hello,

I am having trouble getting a WESTPA simulation of mine to run. On my first submission to the queue system, the simulation ran without any issues and exited normally when the queue time expired. On resubmitting the simulation, I get the following error:

exception caught; shutting down
-- ERROR     2015-10-26 15:22:08,434 PID 19211        TID 47435852512768
   from logger "w_run"
   at location /gscratch3/lchong/ajd98/apps/westpa_8.25.15/westpa/lib/cmds/w_run.py:73 [<module>()]
   ::
   Traceback (most recent call last):
  File "/gscratch3/lchong/ajd98/apps/westpa_8.25.15/westpa/lib/cmds/w_run.py", line 65, in <module>
    sim_manager.run()
  File "/gscratch3/lchong/ajd98/apps/westpa_8.25.15/westpa/src/west/sim_manager.py", line 643, in run
    self.propagate()
  File "/gscratch3/lchong/ajd98/apps/westpa_8.25.15/westpa/src/west/sim_manager.py", line 501, in propagate
    self.data_manager.update_segments(self.n_iter, incoming)
  File "/gscratch3/lchong/ajd98/apps/westpa_8.25.15/westpa/src/west/data_manager.py", line 915, in update_segments
    dset.id.write(source_sel, dest_sel, auxdataset)
  File "h5d.pyx", line 219, in h5py.h5d.DatasetID.write (h5py/h5d.c:2936)
  File "_proxy.pyx", line 132, in h5py._proxy.dset_rw (h5py/_proxy.c:1585)
  File "_proxy.pyx", line 93, in h5py._proxy.H5PY_H5Dwrite (h5py/_proxy.c:1334)
IOError: Can't write data (Inflate() failed)

The simulation runs long enough for one segment to complete; after the data is returned, it looks like WESTPA tries to write it to disk using h5py but fails. Interestingly, the error does not occur with the serial work manager (I let WESTPA run about 25 segments; finishing the whole iteration in serial mode would take quite a long time). However, it does occur with both the ZMQ and "processes" work managers. I have attached a log file from a WESTPA run in debug mode using the processes work manager. My WESTPA install includes the ZMQ work manager rewrite; it reports its version as "WEST version 1.0.0 beta," and I downloaded it on 8/25/15.

Does anyone have insight on why I am getting this error, or how I can fix it?

Here are some links I have been reading through, where other people describe similar problems (not related to WESTPA):
2) http://stackoverflow.com/questions/20551899/h5py-sporadic-writing-errors (perhaps related to parallelized code. The error reported there is different, but according to the traceback it occurs on the same line of the same file)

Some of these sources lead me to believe the error could be related to a bug in h5py. In any case, I will keep reading about how I might solve this. Any help is appreciated.

Thanks!

debug.log

Matthew Zwier

Oct 26, 2015, 4:07:00 PM
to westpa...@googlegroups.com
Hi Alex,

The HDF5 file is always written to by only one thread, so if this is a parallelization bug, it's raising its head despite only having one reader/writer thread working with the HDF5 file at a time (or at least that's how WESTPA should be working).

One possibility is that this is related to a failure of the parallel filesystem you're writing to. It may only appear under load (which would explain why serial doesn't trigger it but processes and ZMQ do). Do your sysadmins have any log entries for the shared filesystem at these times?

Another possibility is that you have a zombie WESTPA process running, and two master processes are both trying to write to the file at the same time. Unlikely, but possible. At least once I saw similarly strange behavior with overlapping WESTPA processes.

And finally, as you note, it's possible that this is a problem inside the HDF5 library itself. The newest version of Anaconda may fix that, if HDF5 has been fixed since May (when the last "it's still broken" report on the h5py GitHub site was posted).

MZ

--
You received this message because you are subscribed to the Google Groups "westpa-users" group.
To unsubscribe from this group and stop receiving emails from it, send an email to westpa-users...@googlegroups.com.
To post to this group, send email to westpa...@googlegroups.com.
For more options, visit https://groups.google.com/d/optout.

Alex DeGrave

Oct 26, 2015, 5:22:15 PM
to westpa...@googlegroups.com
Hi Matt,

Thanks for the quick response! 

To explore the possibility that the issue arises from a zombie WESTPA process, I copied the simulation to a new directory and tried to run it; since an old zombie process wouldn't even know the new files exist, it should not be able to interfere with writing. However, I'm getting the same error, so I think we can rule that option out.

Also, a fresh install of Anaconda gives the same error; if the error is internal to h5py, it has not yet been fixed.

When I first ran the simulation, I was using 288 cores without errors, while I now get this error even with a single process on the "processes" work manager (setting --n-workers=1). That leads me to believe the issue is not from overloading the filesystem. Of course, I may not be understanding you correctly; let me know if you still think the filesystem logs would be useful, and I will look into getting them from a sysadmin.

One of the sites I posted suggests a patch of some sort for H5py.  I think I'll try that next and see if it helps.

Thanks again for your help!
-Alex





Matthew Zwier

Oct 26, 2015, 6:14:28 PM
to westpa...@googlegroups.com
Nice work running down the options. This does sound like a problem in HDF5 or h5py.

One more thing to try would be to eliminate all of the filters from the auxiliary datasets (or temporarily turn off the auxiliary datasets entirely). It sounds like a problem in the filter pipeline, which WESTPA uses only for aux datasets, and only when configured to do so. Treating the aux datasets as non-chunked, non-filtered (i.e. vanilla, fixed-size) datasets may work around the problem until HDF5 and/or h5py are fixed. This might come at the cost of a substantially larger HDF5 file, but oh well.
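To illustrate the difference, here is a generic h5py sketch (not WESTPA code; the file and dataset names are made up) contrasting a filtered, chunked dataset with a vanilla fixed-size one:

```python
import numpy as np
import h5py

data = np.random.random((10, 50, 3))

with h5py.File("filters_demo.h5", "w") as f:
    # Filtered pipeline: chunked + shuffle + gzip. Reads of this dataset
    # pass through zlib's Inflate(), where the reported error surfaced.
    f.create_dataset("aux_filtered", data=data,
                     chunks=(1, 50, 3), shuffle=True, compression="gzip")
    # Vanilla dataset: contiguous, fixed-size, no filter pipeline at all.
    f.create_dataset("aux_plain", data=data)

with h5py.File("filters_demo.h5", "r") as f:
    # Both round-trip the same values; only the storage differs.
    assert np.allclose(f["aux_filtered"][...], f["aux_plain"][...])
```

Only the filtered dataset's reads go through the gzip decompression step, so a plain dataset sidesteps that code path entirely.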

One last question: there's no chance you're on a disk that's almost full, is there?

Cheers,
MZ

Alex DeGrave

Oct 27, 2015, 1:13:27 AM
to westpa...@googlegroups.com
The filesystem I'm using has plenty of space left, so that shouldn't be the problem.

The simulation runs after turning off the aux datasets. Since the progress coordinate can be stored without issues, it seems I should be able to keep storing the aux data (for example, using the same options as the progress coordinate). While my west.cfg has no explicit options for chunking, it does set options for the scaleoffset filter; looking through data_manager.py, that automatically enables chunking. Disabling those options alone does not resolve the issue.
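For reference, the west.cfg block in question looks something like the following (the dataset names and some keys here are illustrative rather than copied from a real config, so double-check against the WESTPA docs and data_manager.py):

```yaml
west:
  data:
    west_data_file: west.h5
    datasets:
      - name: pcoord           # progress coordinate: stored without issue
      - name: my_aux_dataset   # hypothetical aux dataset name
        scaleoffset: 4         # enabling this filter also forces chunking
        compression: false     # candidate knob for disabling the gzip filter
```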

I also tried hardcoding options into data_manager.py. For example, near line 1455 I made the following changes:

...    
    if log.isEnabledFor(logging.DEBUG):
        log.debug('requiring aux dataset {!r}, shape={!r}, opts={!r}'
                  .format(h5_dsname, shape, opts))
    opts['compression'] = None ## Added 10/27/15 AJD 
    opts['chunks'] = None ## Added 10/27/15 AJD 
    opts['shuffle'] = None ## Added 10/27/15 AJD 

    dset = containing_group.require_dataset(h5_dsname, **opts)

    if data is not None:
        dset[...] = data
...

Skimming through the data manager code, it seems like this would globally disable compression, chunking, and "shuffle" (a filter I'm not familiar with). However, the same error still occurs.

Is there a straightforward way to store auxdata in the same manner as the progress coordinate, preferably from within west.cfg or without altering my WESTPA installation?

Thanks!
-Alex

Alex DeGrave

Oct 28, 2015, 9:37:08 AM
to westpa...@googlegroups.com
Hello again,

On second thought, it's not surprising that my attempts to disable chunking, compression, and shuffle seemed to have no effect: I was attempting to restart the simulation mid-iteration, so the datasets would already exist.

I was able to get the simulation back up and running. In case another user has a similar problem, here is what I did. First, I truncated an iteration to force the datasets to be created fresh. Mysteriously, the previous iteration would also not run correctly. Examining my west.h5 in an IPython session, I was unable to open one of the aux datasets; attempting to read it threw the same error (IOError: Can't write data (Inflate() failed)). After deleting and remaking that dataset, as well as disabling compression in my west.cfg, the simulation ran normally.
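For anyone in a similar spot, the delete-and-remake step can be sketched with h5py roughly as follows (the file, group, and dataset names here are hypothetical; on a truly unreadable dataset, the read-out line is where the Inflate() error would appear, in which case that dataset's contents are lost):

```python
import numpy as np
import h5py

# Build a stand-in west.h5 with one compressed aux dataset; real WESTPA
# files use a similar group layout (iterations/iter_NNNNNNNN/auxdata).
data = np.arange(24.0).reshape(4, 3, 2)
with h5py.File("west_repair_demo.h5", "w") as f:
    aux = f.create_group("iterations/iter_00000007/auxdata")
    aux.create_dataset("dihedrals", data=data, chunks=True, compression="gzip")

# Recreate the dataset without the filter pipeline, preserving its contents.
with h5py.File("west_repair_demo.h5", "a") as f:
    aux = f["iterations/iter_00000007/auxdata"]
    saved = aux["dihedrals"][...]   # on a corrupt dataset, this read fails
    del aux["dihedrals"]
    aux.create_dataset("dihedrals", data=saved)  # plain: no chunks, no filters
```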

Disappointingly, I don't think that gives me enough information to conclude what went wrong, nor does it offer insight into why the simulation would still run in serial mode but not in parallel. I have a backup of the simulation from when it was broken, so I may go back to it at some point and investigate further.

I really appreciate your help, Matt, so thanks again. If there are other tests you would like me to run, please let me know.

Best,
Alex