h5py and multiprocessing

Filipe Maia

Jul 31, 2013, 9:53:18 AM
to h5...@googlegroups.com
Hi,

If a file is opened read-only in a parent process and the same file is then opened in a child process, this seems to lead to undefined behaviour.
It would be nice if this were clearly documented, as I think it's quite an important detail for people working with multiprocessing.

The following script reproduces the issue:

#!/usr/bin/env python

import h5py
import numpy as np
from multiprocessing import Pool

def find_bug(i):
    # Each worker re-opens the file read-only and reads one element.
    f = h5py.File("mp_data.h5", 'r')
    v = f['/data'][i]
    f.close()
    return v

ndata = 2

# Create the test file.
f = h5py.File("mp_data.h5", 'w')
f['/data'] = list(range(ndata))
f.close()

# If the Pool is created while the file is open, bad things happen.
f = h5py.File("mp_data.h5", 'r')

pool = Pool(2)
golden = sum(np.array(pool.map(find_bug, range(ndata))))

match = True
for i in range(100):
    iterate = sum(np.array(pool.map(find_bug, range(ndata))))
    if golden != iterate:
        match = False
        break

if match:
    print("%s == %s" % (golden, iterate))
    print("OK: Result is reproduced")
else:
    print("%s != %s" % (golden, iterate))
    print("Error: Results don't match after %d loop(s)!" % (i + 1))


Cheers,
Filipe

Andrew Collette

Jul 31, 2013, 12:43:56 PM
to h5...@googlegroups.com
Hi Filipe,

> If a file is open as read-only and the same file is open in a child process
> this seems to lead to undefined behaviour.
> It would be nice if this would be clearly documented as I think it's quite
> an important detail for people working with multiprocessing.

Yes, this is a known effect of how multiprocessing works (it's fork()-based).

I am happy to improve the documentation on this. Can you tell me
where you looked for information? In other words, where would the
added documentation provide the most benefit (FAQ entries, the docs at
h5py.org, etc.)?

Btw, there is a multiprocessing demo in the source distribution.

Andrew

Filipe Maia

Jul 31, 2013, 4:33:01 PM
to h5...@googlegroups.com
At first I was not aware of this, so it took me quite a while to track down the problem.
My initial assumption was that the forked processes would behave like two completely independent processes reading a file, which should be no problem.
I tried searching on h5py.org but I didn't find anything; I would suggest adding it to the FAQ, for example.
It was only when searching on GitHub that I found the multiprocessing example, with a comment saying:
"Trying to interact with the same file on disk from multiple processes results in undefined behavior."
which sounds overly pessimistic.
It would be great to have a more precise explanation.

Cheers,
Filipe




--
You received this message because you are subscribed to the Google Groups "h5py" group.
To unsubscribe from this group and stop receiving emails from it, send an email to h5py+uns...@googlegroups.com.
For more options, visit https://groups.google.com/groups/opt_out.



Matthew Zwier

Jul 31, 2013, 5:32:42 PM
to h5...@googlegroups.com
Yeah, it's the fork() itself that's the problem. Simultaneous reads work just fine from independent processes, as long as you don't open the file before you fork() [i.e. create a Pool].
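A minimal sketch of the safe ordering (the file and dataset names here are made up):

```python
import h5py
from multiprocessing import Pool

def read_item(i):
    # Opened inside the worker, i.e. after fork(): each child gets
    # its own HDF5 handle and its own OS file descriptor.
    with h5py.File("mp_data.h5", "r") as f:
        return int(f["/data"][i])

if __name__ == "__main__":
    with h5py.File("mp_data.h5", "w") as f:
        f["/data"] = list(range(4))
    # Create the pool while no HDF5 handles are open in the parent.
    pool = Pool(2)
    print(pool.map(read_item, range(4)))
    pool.close()
    pool.join()
```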

MZ

Filipe Maia

Jul 31, 2013, 5:51:00 PM
to h5...@googlegroups.com
Also, what's the reason for this behaviour?
Could it be changed so that the file could be opened in a child process even if it's already open in the parent?

Matthew Zwier

Jul 31, 2013, 8:30:24 PM
to h5...@googlegroups.com
It smells like something that's happening at the HDF5 level, rather than the h5py level. I can't back that up (and it won't be anytime soon that I can investigate it), but the behavior is exactly what you would expect for a memory page somewhere in the library that's not marked copy-on-write but should be. It's been a pain to work around, but not so much of a pain that I've had to go digging (yet).

MZ

Andrew Collette

Jul 31, 2013, 10:34:46 PM
to h5...@googlegroups.com
Hi all,

> It smells like something that's happening at the HDF5 level, rather than the
> h5py level. I can't back that up (and it won't be anytime soon that I can
> investigate it), but the behavior is exactly what you would expect for a
> memory page somewhere in the library that's not marked copy-on-write but
> should be. It's been a pain to work around, but not so much of a pain that
> I've had to go digging (yet).

My impression is that the HDF5 library is not routinely tested against
fork()-based programs. It's disappointing, but likewise I was never
able to figure out exactly what caused this behavior. If anyone does
manage to track it down I'm sure the HDF Group would also like to
know. Until then the best approach is to not open your file(s) until
the child processes start.

Because of the headaches of using HDF5 with the multiprocessing
package, we added a new capability in h5py 2.2 which allows access to
Parallel HDF5 using mpi4py:

http://www.h5py.org/docs/topics/mpi.html

MPI is very different from the "multiprocessing" model, but among
other things it lets you access (read & write) the same file from
multiple parallel processes at the same time. MPI is the parallelism
flavor officially supported by the HDF Group, so it's also very well
tested.
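For a flavour of the API, collective access looks roughly like this (a sketch based on the linked docs; it assumes h5py was built with MPI support, and must be launched with mpiexec rather than run directly):

```python
from mpi4py import MPI
import h5py

comm = MPI.COMM_WORLD

# Every rank opens the same file collectively with the MPI-IO driver.
f = h5py.File("parallel.h5", "w", driver="mpio", comm=comm)
dset = f.create_dataset("data", (comm.size,), dtype="i")
dset[comm.rank] = comm.rank   # each process writes its own element
f.close()
```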

Andrew

Filipe Maia

Aug 1, 2013, 4:09:04 AM
to h5...@googlegroups.com
After a bit of searching, I think I found the reason for this.
The documentation of H5Fopen() has a special section on multiple opens of the same file.
HDF5 tries to detect such occurrences (this only works in certain cases) and uses the same file descriptor for both opens.
As the file descriptors in the children of the fork also share the same underlying structures, you get a race condition between the seek()s and the read()s when multiple children try to access different areas of the file.
HDF5 also requires that the flags used during open be the same every time; the idea is that it can keep a consistent state for both hid_ts.
This behaviour is a bit surprising, as it's not the same as open(), but I can see that it might help in some cases.

But in the specific case where a file is opened read-only I don't see any advantage, as every other H5Fopen() also has to be read-only, so there's nothing to keep consistent. So I think read-only opens should always use their own independent file descriptor.
Maybe one could ask the HDF Group for a change in behaviour?
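The shared-offset effect is easy to see without HDF5 at all (a sketch using plain os calls; Unix-only, file name made up):

```python
import os

# A descriptor carried across fork() shares its file offset with the
# parent, because both point at the same open file description.
fd = os.open("offset_demo.bin", os.O_CREAT | os.O_TRUNC | os.O_RDWR)
os.write(fd, b"abcdef")
os.lseek(fd, 0, os.SEEK_SET)

pid = os.fork()
if pid == 0:
    os.read(fd, 2)            # child advances the shared offset to 2
    os._exit(0)
os.waitpid(pid, 0)

data = os.read(fd, 2)         # parent now reads from offset 2, not 0
os.close(fd)
print(data)                   # b"cd", not b"ab"
```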

Cheers,
Filipe




Andrew Collette

Aug 2, 2013, 12:28:35 PM
to h5...@googlegroups.com
Hi Filipe,

> But in the specific case where a file is open read-only I don't see any
> advantage, as every other H5Fopen also has to be read only and so there's
> nothing to keep consistent. So I think that read-only opens should always
> use their own independent file-descriptor.
> Maybe one could ask the HDF5 group for a change in behaviour?

Sounds like a promising line of investigation. I will forward this
discussion to them.

Andrew

Andrew Collette

Aug 26, 2013, 1:36:55 PM
to h5...@googlegroups.com
Just following up on the multiprocessing discussion.  I got in touch with the HDF Group and they confirmed that the fork() issues are known to them.  They are looking into what modifications might be necessary to support fork() on read-only files, and whether it's feasible/appropriate to implement.  AFAIK there's no timetable.

Andrew

ma...@m00s3jaw.net

Dec 13, 2013, 12:57:50 AM
to h5...@googlegroups.com
Hello everyone,

Is there any workaround or fix?  This "feature" is killing me.

Matt

Matthew Zwier

Dec 13, 2013, 11:04:13 AM
to h5...@googlegroups.com
Hi Matt,

I've had good luck by very carefully closing files before fork() and opening them again after. It's a pain, but only a mild one. The good news is that I've not yet observed a failure when open HDF5 files really, truly aren't carried across a fork().
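In code, the dance looks something like this (a sketch; file and dataset names are made up, and it relies on fork()-based process creation as on Linux):

```python
import h5py
from multiprocessing import Pool

def work(i):
    # Workers open the file themselves, after the fork.
    with h5py.File("data.h5", "r") as f:
        return int(f["x"][i])

with h5py.File("data.h5", "w") as f:
    f["x"] = list(range(3))

f = h5py.File("data.h5", "r")
# ... parent-side reads go here ...
f.close()                      # close BEFORE creating the pool (fork)

pool = Pool(2)
results = pool.map(work, range(3))
pool.close()
pool.join()

f = h5py.File("data.h5", "r")  # safe to reopen AFTER the fork happened
f.close()
print(results)
```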

Cheers,
Matt Z.


Matthew Vincent

Dec 13, 2013, 11:12:11 AM
to h5...@googlegroups.com
Hi Matt,

So... you are right.  Comparing PyTables and h5py, the h5py version does not crash or hang, while the PyTables version hangs or crashes.

I whipped up an example with Flask, h5py, and PyTables, and I can reproduce the error no matter which OS.  It must be something in the PyTables code.

Thanks!

Matt



Andrew Collette

Dec 13, 2013, 1:40:17 PM
to h5...@googlegroups.com
Hi Matt(s),

> So... you are right. It appears that while comparing PyTables and h5py,
> h5py does not crash or hang. The PyTables version hangs or crashes.

Might be worth reporting to the PyTables people. The HDF Group issue
number is HDFFV-8496, if that helps.

Andrew