h5py Parallel and performance


Stuart Mumford

Nov 18, 2013, 11:07:36 AM
to h5...@googlegroups.com
Hello,

Firstly, I want to say how absolutely awesome the parallel write API in h5py 2.2.0 is; it is so easy it almost made me dance around my office. However, as always, I seem to have pushed it too far, and it is much slower than I would like. The purpose of this email is to work out where the bottleneck is.

My setup:
I have a FORTRAN code that saves out one horrible, horrible 'unformatted' binary file per processor, which is basically unusable! My Python script uses h5py to translate this data into an HDF5 file laid out like this: https://bitbucket.org/yt_analysis/grid_data_format/. I have written an MPI code that uses the same number of processors as there are data files; each process reads a file and saves its chunk to the HDF5 file. My grid size is 128^3 with 16 processors and 13 fields (I am taking one 13x128x128x128 array and writing 13 128x128x128 datasets to the HDF5 file).

This translation code takes ~7 hours to run for all 585 timesteps of my simulation, which works out at around 40 seconds per step; seeing as each resulting HDF5 file is 273 MB, that seems quite slow. I did some profiling of this code with various grid sizes and numbers of processors; I attach an HTML output of profiling 10 steps of the 128^3, 16-processor script. Other profiling gave me 1.8 seconds for 60^3 on one or two processors, indicating that it is not MPI, although that was on a different system from the large write.

Questions:
Looking at the attached HTML file, the routine spends basically all of its time in the '<method 'write' of 'h5py.h5d.DatasetID' objects>' call, which kind of makes sense if, as my untrained eye tells me, this is basically I/O-limited?
Could I speed this up by a) using the low level interface? b) doing anything to my numpy arrays? (I found a np.ascontiguousarray call before the ds[slice] = array call made a lot of difference.)

a note on file systems:
I ran h5perf on the cluster where I am doing this write and the results do indicate that I should be getting faster write speed, and also that larger transfer buffer sizes (logically) speed things up. Is there a way to tweak the transfer buffer size or is this just set by the size of the slab being written?

code:
The script is here:
https://bitbucket.org/smumford/period-paper/src/7acc77e17ba94efaae9a3766538094b0a56592fa/sac/out2gdf_pure.py?at=master

which calls functions from here:
https://bitbucket.org/swatsheffield/pysac/src/5fc267727e8a8a65a409de3d05120fdf08b00c1b/pysac/io/gdf_writer.py?at=master

Thanks a lot for a fantastic package
Stuart

Stuart Mumford

Nov 18, 2013, 11:16:44 AM
to h5...@googlegroups.com
Oh, here is the attachment!
index.html

Andrew Collette

Nov 18, 2013, 12:21:52 PM
to h5...@googlegroups.com
Hi Stuart,

> Firstly, I wish to say how absolutely awesome the parallel write API in h5py
> 2.2.0 is, I mean it is just so easy it almost made me literally dance around
> my office. However, as always, I seemed to have pushed it too far, and it is
> much slower than I would like. This email's purpose is trying to work out
> where the bottleneck is.

Thanks for the feedback! I don't regularly use the MPI platform so we
rely on feedback from users to fix things.

By the way, are you using a parallel filesystem (and if so, which)?

> Could I speed this up by a) using the low level interface? b) doing anything
> to my numpy arrays? (I found a np.ascontiguousarray call before the
> ds[slice] = array call made a lot of difference.)

The first thing we could try is doing your writes using only the
low-level interface, to see if there's something about the slicing
that h5py is getting wrong from a performance standpoint. This is
pretty simple, since it looks to me like you're creating a dataset and
then writing the entire array to it in one go.

Try replacing your array writes with this:

from h5py import h5s
dset = f.create_dataset(NAME, SHAPE, dtype=DTYPE)

# Ensure numpy_array is C contiguous before calling!
dset.id.write(h5s.ALL, h5s.ALL, numpy_array)

> a note on file systems:
> I ran h5perf on the cluster where I am doing this write and the results do
> indicate that I should be getting faster write speed, and also that larger
> transfer buffer sizes (logically) speed things up. Is there a way to tweak
> the transfer buffer size or is this just set by the size of the slab being
> written?

Actually, if the above tweak doesn't work, the next thing that comes
to mind is writing your datasets with collective I/O (HDF5 defaults to
independent unless a flag is set in the dataset transfer property
list). This was a feature slated for 2.2, but unfortunately didn't
make it in due to time constraints. If you're available to help test
this feature, I could create a branch for you off master with
collective writes enabled.

I'm also happy to include buffer-size tweaks, but will need some
information from you (or our other MPI-aware colleagues) of which HDF5
C functions to wrap.

Andrew

Stuart Mumford

Nov 18, 2013, 12:27:44 PM
to h5...@googlegroups.com
Hello,

> Thanks for the feedback!  I don't regularly use the MPI platform so we
> rely on feedback from users to fix things.
>
> By the way, are you using a parallel filesystem (and if so, which)?

Lustre I believe.
 

>> Could I speed this up by a) using the low level interface? b) doing anything
>> to my numpy arrays? (I found a np.ascontiguousarray call before the
>> ds[slice] = array call made a lot of difference.)

> The first thing we could try is doing your writes using only the
> low-level interface, to see if there's something about the slicing
> that h5py is getting wrong from a performance standpoint.  This is
> pretty simple, since it looks to me like you're creating a dataset and
> then writing the entire array to it in one go.

Well, each MPI process is writing a sub-slice of the whole array, so I believe your code snippet below will not work?

 

> Try replacing your array writes with this:
>
> from h5py import h5s
> dset = f.create_dataset(NAME, SHAPE, dtype=DTYPE)
>
> # Ensure numpy_array is C contiguous before calling!
> dset.id.write(h5s.ALL, h5s.ALL, numpy_array)

>> a note on file systems:
>> I ran h5perf on the cluster where I am doing this write and the results do
>> indicate that I should be getting faster write speed, and also that larger
>> transfer buffer sizes (logically) speed things up. Is there a way to tweak
>> the transfer buffer size or is this just set by the size of the slab being
>> written?

> Actually, if the above tweak doesn't work, the next thing that comes
> to mind is writing your datasets with collective I/O (HDF5 defaults to
> independent unless a flag is set in the dataset transfer property
> list).  This was a feature slated for 2.2, but unfortunately didn't
> make it in due to time constraints.  If you're available to help test
> this feature, I could create a branch for you off master with
> collective writes enabled.

Looking over the HDF5 documentation I was wondering about this; I would be more than happy to help test it.
What does the collective write flag actually do for the parallel I/O?
 

> I'm also happy to include buffer-size tweaks, but will need some
> information from you (or our other MPI-aware colleagues) of which HDF5
> C functions to wrap.

On that I have no clue!!

Thanks for the prompt reply
Stuart

Andrew Collette

Nov 18, 2013, 12:53:55 PM
to h5...@googlegroups.com
> Well each MPI process is writing a sub-slice of the whole array, so I
> believe your code snippet below will not work?
>
> I create the np slice for each process here:
> https://bitbucket.org/smumford/period-paper/src/master/sac/out2gdf_pure.py?at=master#cl-119

Yes, of course; that makes more sense!

The equivalent to slicing at the HDF5 level is hyperslab selection.
It's easy to convert from numpy slices. Say numpy_array is the data
you want to write; first create dataspace objects which tell HDF5
about the data in memory and on the disk:

>>> memory_space = h5s.create_simple(numpy_array.shape)
>>> file_space = dset.id.get_space() # copy of the dataset's dataspace

Specify the selection to write:

>>> file_space.select_hyperslab(START, COUNT)

where START and COUNT provide the selection. For example:

something[0:50,10:30] -> START (0, 10) COUNT (50, 20)

Then you call the low-level write:

>>> dset.id.write(memory_space, file_space, numpy_array)

Of course, the total number of data elements has to be the same in
your memory array and in the selection you make to the file.
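To save converting by hand, the numpy-slice-to-hyperslab mapping above can be wrapped in a small helper (a hypothetical convenience function, not part of h5py):

```python
def slices_to_hyperslab(slices):
    """Convert a tuple of Python slices (with explicit, non-negative
    start/stop and no step) into the START and COUNT tuples expected
    by select_hyperslab."""
    start = tuple(s.start or 0 for s in slices)
    count = tuple(s.stop - (s.start or 0) for s in slices)
    return start, count
```

For example, slices_to_hyperslab((slice(0, 50), slice(10, 30))) gives ((0, 10), (50, 20)), matching the worked example above.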

> Looking over the HDF5 documentation I was wondering about this, I would be
> more than happy to help test this.
> What does the collective write flag actually do for the parallel IO?

From reading the HDF5 documentation, it looks like it explicitly uses
the parallel features of the filesystem, in addition to aggregating
writes in a way that avoids thrashing the disk/buffers.

I'll likely have a chance to work on the collective stuff in the next
day or so. I'll post here when it's ready.

Andrew

Stuart Mumford

Nov 18, 2013, 3:32:02 PM
to h5...@googlegroups.com
Hello,

>> Well each MPI process is writing a sub-slice of the whole array, so I
>> believe your code snippet below will not work?
>>
>> I create the np slice for each process here:
>> https://bitbucket.org/smumford/period-paper/src/master/sac/out2gdf_pure.py?at=master#cl-119

> Yes, of course; that makes more sense!
>
> The equivalent to slicing at the HDF5 level is hyperslab selection.
> It's easy to convert from numpy slices.

I vaguely remember this from writing a FORTRAN HDF5 code...
 
> Say numpy_array is the data
> you want to write; first create dataspace objects which tell HDF5
> about the data in memory and on the disk:
>
> >>> memory_space = h5s.create_simple(numpy_array.shape)
> >>> file_space = dset.id.get_space()   # copy of the dataset's dataspace
>
> Specify the selection to write:
>
> >>> file_space.select_hyperslab(START, COUNT)
>
> where START and COUNT provide the selection.  For example:
>
> something[0:50,10:30] -> START (0, 10)  COUNT (50, 20)
>
> Then you call the low-level write:
>
> >>> dset.id.write(memory_space, file_space, numpy_array)

Cool, I will give that a go tomorrow and report benchmarks. I presume I can mix and match high level and low level API.
 

> Of course, the total number of data elements has to be the same in
> your memory array and in the selection you make to the file.
>
>> Looking over the HDF5 documentation I was wondering about this, I would be
>> more than happy to help test this.
>> What does the collective write flag actually do for the parallel IO?
>
> From reading the HDF5 documentation, it looks like it explicitly uses
> the parallel features of the filesystem, in addition to aggregating
> writes in a way that avoids thrashing the disk/buffers.

Interesting.
 

> I'll likely have a chance to work on the collective stuff in the next
> day or so.  I'll post here when it's ready.

Sweet, let me know what I can do to help.

Stuart

Andrew Collette

Nov 18, 2013, 6:41:57 PM
to h5...@googlegroups.com
Hi,

> Cool, I will give that a go tomorrow and report benchmarks. I presume I can
> mix and match high level and low level API.

Yes, absolutely; this is an intentional part of h5py's design.

>> I'll likely have a chance to work on the collective stuff in the next
>> day or so. I'll post here when it's ready.
>
> Sweet, let me know what I can do to help.

Let's get the low-level method working first. Then, if you have time
once I've implemented the collective bits, I'd love to see benchmarks
(total run time) for the following configurations:

1. Present performance with high-level code
2. Present performance with low-level code
3. Collective performance with low-level code
4. Collective performance with high-level code (likely via a
"collective" context manager)

Andrew

Stuart Mumford

Nov 19, 2013, 4:20:19 AM
to h5...@googlegroups.com
Good Morning,

>> Cool, I will give that a go tomorrow and report benchmarks. I presume I can
>> mix and match high level and low level API.

> Yes, absolutely; this is an intentional part of h5py's design.

I have re-written a version that uses the low-level interface: https://bitbucket.org/swatsheffield/pysac/src/fdc6f316d1f5bb25cc79ba14c5ff803ccdfb86c7/pysac/io/gdf_writer.py?at=master#cl-378
It seems to make no difference to the performance.
 

>> I'll likely have a chance to work on the collective stuff in the next
>> day or so.  I'll post here when it's ready.
>
> Sweet, let me know what I can do to help.

> Let's get the low-level method working first.  Then, if you have time
> once I've implemented the collective bits, I'd love to see benchmarks
> (total run time) for the following configurations:
>
> 1. Present performance with high-level code
~43 sec per timestep
> 2. Present performance with low-level code
~44 sec per timestep
> 3. Collective performance with low-level code
> 4. Collective performance with high-level code (likely via a
> "collective" context manager)

Stuart

Andrew Collette

Nov 19, 2013, 10:39:06 AM
to h5...@googlegroups.com
Hi,

>> 1. Present performance with high-level code
>
> ~43 sec per timestep
>>
>> 2. Present performance with low-level code
>
> ~44 sec per timestep

Well, I suppose that's both good and bad. At least we know it isn't a
slicing-related problem. :)

I've pushed a new branch "mpi_collective" to my clone of h5py:

https://github.com/andrewcollette/h5py/

Build and install should be the same as for the release version. When
you've got it installed, modify your write statement like this:

from h5py import h5p, h5fd

dxpl = h5p.create(h5p.DATASET_XFER)
dxpl.set_dxpl_mpio(h5fd.MPIO_COLLECTIVE)

dset.id.write(memory_space, file_space, data, dxpl=dxpl)

At the high-level interface, there's a context manager:

with dset.collective:
    dset[SLICE] = data

Andrew

Stuart Mumford

Nov 20, 2013, 7:03:09 AM
to h5...@googlegroups.com
Hello!!

So an update:


1. Present performance with high-level code
10 timesteps in 425s = 42.5 s
2. Present performance with low-level code
10 timesteps in 440s = 44.0 s
3. Collective performance with high-level code (likely via a
"collective" context manager)
10 timesteps in 442s = 44.2 s
4. Collective performance with low-level code
10 timesteps in 47.2s = 4.7 s

So you just made my week!

However, it does seem that the high-level collective interface is not working. Also, none of these profiles have been repeated; I will give more info as I continue to tinker.

Stuart




--
You received this message because you are subscribed to a topic in the Google Groups "h5py" group.
To unsubscribe from this topic, visit https://groups.google.com/d/topic/h5py/wDgTjGho0dY/unsubscribe.
To unsubscribe from this group and all its topics, send an email to h5py+uns...@googlegroups.com.
For more options, visit https://groups.google.com/groups/opt_out.

Andrew Collette

Nov 20, 2013, 9:15:31 AM
to h5...@googlegroups.com
Hi,

> 4. Collective performance with low-level code
> 10 timesteps in 47.2s = 4.7 s
>
> So you just made my week!

Great! Looks like collective writes are the way to go, at least in this case.

> However it does seem that the high level collective interface is not
> working. Also all these profiles have not been repeated, I will give more
> info as I continue to tinker.

That's because I forgot to actually use the property list for
high-level writes. :/ I've pushed another commit to the mpi_collective
branch which should fix this.

Ultimately if you have some sense of how these rates compare to h5perf
or other MPI benchmarks, that would also be interesting.

Andrew

Stuart Mumford

Nov 20, 2013, 9:21:33 AM
to h5...@googlegroups.com
>> 4. Collective performance with low-level code
>> 10 timesteps in 47.2s = 4.7 s


> Great!  Looks like collective writes are the way to go, at least in this case.

It would seem so, though it does appear to be going slower on the full dataset; more investigation required.
 

>> However it does seem that the high level collective interface is not
>> working. Also all these profiles have not been repeated, I will give more
>> info as I continue to tinker.

> That's because I forgot to actually use the property list for
> high-level writes. :/ I've pushed another commit to the mpi_collective
> branch which should fix this.

I will re-run the benchmark as soon as the cluster grid engine lets me in ;)
 

> Ultimately if you have some sense of how these rates compare to h5perf
> or other MPI benchmarks, that would also be interesting.

I am not so sure how to do this. I could write a known amount of data and time it, but I do not know how to measure the transfer rate in a reliable way. Suggestions?

Any ideas on the transfer buffer size? (http://www.hdfgroup.org/HDF5/doc/RM/H5P/H5Pset_buffer.htm)

Stuart

Andrew Collette

Nov 21, 2013, 1:01:58 PM
to h5...@googlegroups.com
Hi Stuart,

>> Ultimately if you have some sense of how these rates compare to h5perf
>> or other MPI benchmarks, that would also be interesting.
>
>
> I am not so sure about how to do this. I could write a known amount of data
> and time it but I do not know how to measure transfer rate in a reliable
> way? Suggestions?

I was thinking in a very general sense, just estimating MB/s by
writing a known amount of data. I'm mostly interested if the h5py
performance is 50% of the estimates from h5perf, 10%, 1%, etc.
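For a very rough number of that kind, you can time writing a known amount of data; a sketch using a plain binary file as a stand-in (swap the timed write for the actual h5py dataset write to benchmark the real path; the 16 MB default is an arbitrary choice):

```python
import os
import tempfile
import time

def estimate_write_mbps(nbytes=16 * 1024 * 1024):
    """Rough throughput estimate in MB/s: write nbytes of zeros to a
    temporary file, fsync, and time it.  Replace the timed write with
    an h5py dataset write to measure the path under test."""
    data = b"\0" * nbytes
    fd, path = tempfile.mkstemp()
    try:
        t0 = time.perf_counter()
        with os.fdopen(fd, "wb") as f:
            f.write(data)
            f.flush()
            os.fsync(f.fileno())  # make sure the data actually hits the disk
        elapsed = time.perf_counter() - t0
    finally:
        os.remove(path)
    return nbytes / (1024 * 1024) / elapsed
```

A single run is noisy (page cache, other jobs on the filesystem), so repeating it a few times and taking the median gives a steadier figure to compare against h5perf.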

> Any ideas on the transfer buffer size?
> (http://www.hdfgroup.org/HDF5/doc/RM/H5P/H5Pset_buffer.htm)

This looks like it's involved with type conversion, but let's wrap it
anyway and see what effect it has. There's also H5Pset_alignment,
which evidently might also help. I can get to these in a day or two.

Andrew

Andrew Collette

Nov 22, 2013, 3:42:38 PM
to h5...@googlegroups.com
Hi,

> This looks like it's involved with type conversion, but let's wrap it
> anyway and see what effect it has. There's also H5Pset_alignment,
> which evidently might also help. I can get to these in a day or two.

Just following up; I added H5Pset_buffer and H5Pset_alignment.

H5Pset_alignment, should you choose to use it, must be applied to the
file access property list, which is used when you open/create a file.
To create a new MPIO file at the low-level, do something like:

fapl = h5p.create(h5p.FILE_ACCESS)
fapl.set_fapl_mpio(MPI_COMM, MPI_INFO)
fapl.set_alignment(THRESHOLD, ALIGNMENT)

fid = h5f.create(b"name.hdf5", h5f.ACC_TRUNC, fapl=fapl)

Then, if desired, you can bind the FileID to a high-level File object
and use it normally:

f = h5py.File(fid)

If it produces useful results I'll consider adding it to the File
constructor directly.

Btw, I am going out of town next week so don't be alarmed if it takes
me a while to respond.

Andrew

Christoph Paulik

Jun 4, 2014, 1:40:12 PM
to h5...@googlegroups.com
Dear Andrew,

I was trying to set the size of the temporary buffer manually, but could not find the function H5Pset_buffer in the current version of the h5py package or on GitHub. Was it removed?

If you think I should open a new topic with a more detailed explanation of why I would like to have this functionality then please tell me and I will gladly do so.

Best Regards,
Christoph

Andrew Collette

Jun 4, 2014, 3:25:10 PM
to h5...@googlegroups.com
Hi Christoph,

> I was trying to set the size of the temporary buffer manually but could not
> find the function H5Pset_buffer in the current version of the h5py package
> or on github. Was it removed?
>
> If you think I should open a new topic with a more detailed explanation of
> why I would like to have this functionality then please tell me and I will
> gladly do so.

I don't have any memory of this function being in h5py... I suspect it
may simply never have been wrapped. Looking at the HDF Group docs for
H5Pset_buffer, I think there would be no conflict with adding it to
h5py. I would suggest, though, that we limit ourselves to just the
*size* argument.

I'm happy to accept a PR which implements this.

Andrew

vol...@gmail.com

Jun 18, 2014, 1:35:19 PM
to h5...@googlegroups.com
Hello Andrew,
I've been using h5py 2.3.0 with parallel HDF5 (MPI-IO). Is collective write available in this version? I'd like to be able to do this:
plist_id = H5Pcreate (H5P_DATASET_XFER);
H5Pset_dxpl_mpio (plist_id, H5FD_MPIO_COLLECTIVE);
 
From the discussion thread above it seemed to me like this was going to be supported in h5py?
 
Thanks,
Alex

 

Andrew Collette

Jun 23, 2014, 12:00:02 PM
to h5...@googlegroups.com
Hi Alex,
There was an experimental version of this (you can find it on github
in andrewcollette/mpi_collective). It's not something I personally
have time to work on at the moment, but a well-tested pull request
would be welcome.

Andrew

Stuart Mumford

Mar 5, 2015, 10:51:25 AM
to h5...@googlegroups.com
Hello,

It's been nearly a year; I thought I would try to raise this from the dead.


I have been looking at this recently, trying to get it to compile correctly with the intention of tidying it up and submitting a PR. I have merged master into Andrew's mpi_collective branch; the result is here: https://github.com/Cadair/h5py/tree/mpi_collective_24
There were some conflicts in the dataset.py file, which I think I have resolved properly.

However, when I compile this branch and run import h5py I get this:

(h5py)smq11sjm@node011 ~$ mpirun -n 1 python test_h5py.py
Traceback (most recent call last):
  File "test_h5py.py", line 1, in <module>
    import h5py
  File "build/bdist.linux-x86_64/egg/h5py/__init__.py", line 15, in <module>
  File "build/bdist.linux-x86_64/egg/h5py/_conv.py", line 7, in <module>
  File "build/bdist.linux-x86_64/egg/h5py/_conv.py", line 6, in __bootstrap__
  File "h5py/h5r.pxd", line 21, in init h5py._conv (/home/smq11sjm/GitHub/h5py/h5py/_conv.c:7271)
  File "build/bdist.linux-x86_64/egg/h5py/h5r.py", line 7, in <module>
  File "build/bdist.linux-x86_64/egg/h5py/h5r.py", line 6, in __bootstrap__
  File "h5py/_objects.pxd", line 12, in init h5py.h5r (/home/smq11sjm/GitHub/h5py/h5py/h5r.c:3181)
  File "build/bdist.linux-x86_64/egg/h5py/_objects.py", line 7, in <module>
  File "build/bdist.linux-x86_64/egg/h5py/_objects.py", line 6, in __bootstrap__
  File "h5py/_objects.pyx", line 1, in init h5py._objects (/home/smq11sjm/GitHub/h5py/h5py/_objects.c:7282)
  File "build/bdist.linux-x86_64/egg/h5py/defs.py", line 7, in <module>
  File "build/bdist.linux-x86_64/egg/h5py/defs.py", line 6, in __bootstrap__
ImportError: /home/smq11sjm/.python-eggs/h5py-2.5.0a0-py2.7-linux-x86_64.egg-tmp/h5py/defs.so: undefined symbol: H5Pset_dxpl_mpio


For some reason it doesn't seem to be compiling properly. I have made some changes to the api_functions.txt file compared against the original mpi_collective branch; the diff to master is:

diff --git a/h5py/api_functions.txt b/h5py/api_functions.txt
index 014aac7..b970651 100644
--- a/h5py/api_functions.txt
+++ b/h5py/api_functions.txt
@@ -301,6 +301,8 @@ hdf5:
   H5Z_EDC_t H5Pget_edc_check(hid_t plist)
   herr_t    H5Pset_chunk_cache( hid_t dapl_id, size_t rdcc_nslots, size_t rdcc_nbytes, double rdcc_w0 )
   herr_t    H5Pget_chunk_cache( hid_t dapl_id, size_t *rdcc_nslots, size_t *rdcc_nbytes, double *rdcc_w0 )
+  herr_t    H5Pset_buffer(hid_t plist, hsize_t size, void *tconv, void *bkg )
+  hsize_t   H5Pget_buffer(hid_t plist, void **tconv, void **bkg )
 
   # Other properties
   herr_t    H5Pset_sieve_buf_size(hid_t fapl_id, size_t size)
@@ -344,6 +346,9 @@ hdf5:
   MPI herr_t H5Pset_fapl_mpio(hid_t fapl_id, MPI_Comm comm, MPI_Info info)
   MPI herr_t H5Pget_fapl_mpio(hid_t fapl_id, MPI_Comm *comm, MPI_Info *info)
 
+  MPI herr_t H5Pset_dxpl_mpio(hid_t dxpl_id, H5FD_mpio_xfer_t xfer_mode)
+  MPI herr_t H5Pget_dxpl_mpio(hid_t dxpl_id, H5FD_mpio_xfer_t* xfer_mode)
+
   # === H5R - Reference API ===================================================
 
   herr_t    H5Rcreate(void *ref, hid_t loc_id, char *name, H5R_type_t ref_type,  hid_t space_id)


The H5Pset_dxpl_mpio function definition is in there, so I do not understand why it is not being compiled through to defs.so. I removed the MPI conditional statement to test it, and got a very similar error.

Someone with a better understanding of the API gen / build process might be able to see a way out of this mess.


Thanks in advance
Stuart

--

Andrew Collette

Mar 5, 2015, 12:17:41 PM
to h5...@googlegroups.com
Hi Stuart,

> I have been looking at this recently, trying to get it to compile correctly
> with the intention of tidying it up and PRing it. I have merged master into
> Andrews mpi_collective branch and the result is here:
> https://github.com/Cadair/h5py/tree/mpi_collective_24 there were some
> conflicts in the dataset.py file, which I think I have resolved properly.
>
> However, when I compile this branch and run import h5py I get this:
>
> undefined symbol: H5Pset_dxpl_mpio

I suspect there is a problem with the HDF5 library itself... that's
the kind of error you get when trying to load a non-parallel version
of HDF5 from a module using parallel features. Double-check which
library you're using (and you might also check that the runtime
library path from setup.py is being set correctly).

Andrew

Stuart Mumford

Mar 6, 2015, 6:03:12 AM
to h5...@googlegroups.com
Hi,

Interesting; manually setting the HDF5 path fixed it. I wonder what version it was detecting! Thanks though.


How should I approach writing tests for this branch, seeing as most of the functionality needs an MPI setup and a parallel filesystem?

Stuart


Andrew

Andrew Collette

Mar 6, 2015, 12:16:51 PM
to h5...@googlegroups.com
Hi Stuart,

> How should I approach writing tests for this branch seeing how most of the
> functionality needs a MPI setup and a parallel filesystem?

That's a great question. I suppose you would have to add
functionality to the test suite that would check (e.g. with mpi4py)
for an MPI environment, and take appropriate action. Running it might
be as simple as 'mpirun -n X python setup.py test'.
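One way to sketch that check, assuming mpi4py as the detection mechanism (names here are illustrative, not from the actual h5py test suite):

```python
import unittest

# Detect an MPI environment by attempting to import mpi4py; if it is
# missing, MPI-dependent tests are skipped rather than failing.
try:
    from mpi4py import MPI
    HAVE_MPI = True
except ImportError:
    HAVE_MPI = False

@unittest.skipUnless(HAVE_MPI, "requires an MPI environment (mpi4py)")
class TestCollectiveWrite(unittest.TestCase):
    def test_comm_size(self):
        # Runs only under MPI, e.g. via: mpirun -n 4 python -m unittest
        comm = MPI.COMM_WORLD
        self.assertGreaterEqual(comm.Get_size(), 1)
```

A fuller version would also need to coordinate the ranks (all processes opening the same file with the mpio driver), but the skip logic itself stays this simple.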

Andrew

kirchen...@googlemail.com

May 20, 2015, 9:42:08 AM
to h5...@googlegroups.com
Hi Stuart and Andrew,

I am highly interested in using the collective I/O feature with h5py, but I got stuck at the following error:

mpicc -fno-strict-aliasing -I /usr/include/ncurses -DNDEBUG -g -O3 -Wall -Wstrict-prototypes -fPIC -DH5_USE_16_API -I/usr/local/hdf5/v1.8.12/include -I/lustre/jhome16/hhh20/hhh208/h5py-mpi_collective_24-2/lzf -I/opt/local/include -I/usr/local/include -I/usr/local/Python/2.7.2/lib/python2.7/site-packages/numpy/core/include -I/lustre/jhome16/hhh20/hhh208/local/lib/python2.7/site-packages/mpi4py-1.3.1-py2.7-linux-x86_64.egg/mpi4py/include -I/usr/local/Python/2.7.2/include/python2.7 -c /lustre/jhome16/hhh20/hhh208/h5py-mpi_collective_24-2/h5py/defs.c -o build/temp.linux-x86_64-2.7/lustre/jhome16/hhh20/hhh208/h5py-mpi_collective_24-2/h5py/defs.o
/lustre/jhome16/hhh20/hhh208/h5py-mpi_collective_24-2/h5py/defs.c:21955: error: redefinition of '__pyx_f_4h5py_4defs_H5Pset_alignment'
/lustre/jhome16/hhh20/hhh208/h5py-mpi_collective_24-2/h5py/defs.c:19195: error: previous definition of '__pyx_f_4h5py_4defs_H5Pset_alignment' was here
/lustre/jhome16/hhh20/hhh208/h5py-mpi_collective_24-2/h5py/defs.c:22047: error: redefinition of '__pyx_f_4h5py_4defs_H5Pget_alignment'
/lustre/jhome16/hhh20/hhh208/h5py-mpi_collective_24-2/h5py/defs.c:19287: error: previous definition of '__pyx_f_4h5py_4defs_H5Pget_alignment' was here
error: command 'mpicc' failed with exit status 1

Do you have an idea what went wrong there? I would be very happy to succeed at compiling this. Thanks!

Cheers,
Manuel

Pierre Complex

May 21, 2015, 4:03:16 AM
to h5...@googlegroups.com
The compiler rightly complains about doubly-defined functions. That most likely comes from a bad merge.

I could build and test successfully with this https://github.com/Cadair/h5py/pull/1

Regards,

Pierre

kirchen...@googlemail.com

Jun 10, 2015, 6:46:29 PM
to h5...@googlegroups.com
Thanks a lot Pierre, 

This should fix my bug (I have not tried it yet).

Meanwhile I succeeded in compiling an older version of h5py that supports collective I/O, and I ran some tests comparing the performance of:

- gathering with mpi4py on rank 0 and writing to a single non-parallel file
- writing data in parallel without collective I/O
- writing data in parallel with collective I/O

In the case of collective I/O, I implemented the low-level and high-level approach.

The code that I am working on just writes a scattered 2D array to a single file and dataset.

dset[local_lower_bound_x:local_upper_bound_x, local_lower_bound_y:local_upper_bound_y] = local_data[:,:]
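The local bounds come from splitting each axis across ranks, along these lines (an illustrative helper, not my actual code):

```python
def block_bounds(n, nranks, rank):
    """Lower/upper bounds of `rank`'s block of an axis of length n.

    Splits the axis as evenly as possible; the first n % nranks ranks
    take one extra element.  Returns a half-open interval [lo, hi).
    """
    base, extra = divmod(n, nranks)
    lo = rank * base + min(rank, extra)
    hi = lo + base + (1 if rank < extra else 0)
    return lo, hi
```

Each rank then uses its (lo, hi) pair as the slice bounds into the shared dataset, and the blocks tile the axis with no gaps or overlaps.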


I did not see any performance difference between the collective I/O case and the one without, nor between the low-level and high-level approaches.
The main problem is that the parallel writing is always a lot (really a lot!) slower than gathering the data on rank 0 and writing it to a single non-parallel file.

Any idea, why this is the case?

Thanks,
Manuel

David

Jun 10, 2015, 10:34:16 PM
to h5...@googlegroups.com
I have been trying to get good performance with parallel independent I/O on our filesystem, and have found that one big roadblock is older versions of Lustre. While our filesystem supports 2 GB/sec output to multiple files, I can't get more than 300 MB/sec to a single file.

Our filesystem uses Lustre, and I have learned that Lustre versions < 2.7 have a bug where each Lustre client locks the output file during writing. I've seen this in some tests where I have each MPI rank or thread write one big block of data: by printing timing information, I can see that the calls get serialized when they come from threads on the same host. Here is part of the thread where I learned this: https://www.mail-archive.com/search?l=lustre-...@lists.lustre.org&q=subject:%22Re\%3A+\[lustre\-discuss\]+problem+getting+high+performance+output+to+single+file%22&o=newest&f=1

I see the same behavior at NERSC, where the Lustre clients are 2.5.3, I believe.

I'll try again when/if we can upgrade to Lustre 2.7. I'd be curious to hear from people getting good performance with parallel I/O: what software stack they are on, what MPI library they use, how they built it, and whether they pass any special options or hints to MPI. For instance, out of the box Open MPI will use a basic IP protocol for communication, which may be slower than what your network supports. If your system has InfiniBand, you should look into building the MPI library to use it.

best,

David



kirchen...@googlemail.com

Jun 11, 2015, 2:10:45 AM
to h5...@googlegroups.com
Hey David,

thanks for these insights. I just ran a "cat /proc/fs/lustre/version" on the cluster we use and got:

lustre: 1.8.4
kernel: patchless_client
build:  1.8.4-20100724012708-PRISTINE-2.6.32.59-0.3-default

which indicates that this version is probably too old to perform well.

I will try to find another cluster with a newer version of lustre in order to test this.

On Wednesday, June 10, 2015 at 19:34:16 UTC-7, David Schneider wrote:
> I have been trying to get good performance with parallel independent I/O on our filesystem and have found that one big roadblock is older versions of lustre. [...]

Yang Gao

unread,
Jun 13, 2015, 9:27:08 PM6/13/15
to h5...@googlegroups.com
Hi Stuart,

Glad that h5py is working so well for you! I tried installing HDF5 on my Ubuntu machine and got an error during "make check". I reported it in this post:


Do you know how to handle this?

Thanks,

yang



On Monday, 18 November 2013 08:07:36 UTC-8, Stuart Mumford wrote:
Hello,


Firstly, I wish to say how absolutely awesome the parallel write API in h5py 2.2.0 is, I mean it is just so easy it almost made me literally dance around my office. However, as always, I seemed to have pushed it too far, and it is much slower than I would like. This email's purpose is trying to work out where the bottleneck is.

My setup:
I have a FORTRAN code that saves one horrible, horrible 'unformatted' binary file per processor, which is basically unusable! My Python script uses h5py to translate this data into an HDF5 file laid out like this: https://bitbucket.org/yt_analysis/grid_data_format/. What I have done is write an MPI code that uses the same number of processors as there are data files; each process reads a file and saves its chunk to the HDF5 file. My grid size is 128^3 with 16 processors and 13 fields (I am taking one 13x128x128x128 array and writing 13 128x128x128 datasets to the HDF5 file).

This translation code takes ~7 hours to run for all 585 timesteps of my simulation, which works out to ~40 seconds per step; seeing as each resulting HDF5 file is 273 MB, that seems quite slow. I did some profiling of this code with various grid sizes and numbers of processors; I attach an HTML output of profiling 10 steps of the 128^3, 16-processor script. Other profiling gave me 1.8 seconds for 60^3 on one or two processors, indicating that it is not MPI, though that was on a different system to the large write.

Questions:
Looking at the attached HTML file, the routine spends basically all of its time in the '<method 'write' of 'h5py.h5d.DatasetID' objects>' call, which kind of makes sense if, as my untrained eye tells me, this is basically I/O limited?

Could I speed this up by a) using the low-level interface, or b) doing anything to my numpy arrays? (I found that an np.ascontiguousarray call before the ds[slice] = array call made a lot of difference.)

A note on file systems:
I ran h5perf on the cluster where I am doing this write, and the results do indicate that I should be getting faster write speeds, and also that larger transfer buffer sizes (logically) speed things up. Is there a way to tweak the transfer buffer size, or is it just set by the size of the slab being written?
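On question b) above: a C-contiguous array can be handed to the HDF5 write in one pass, while a strided view forces extra gathering, which is consistent with np.ascontiguousarray helping before the ds[slice] = array call. A small numpy-only sketch (the array shape mirrors the post's fields-first layout, scaled down; the variable names are illustrative):

```python
import numpy as np

# One (fields, x, y, z) array like the 13x128^3 one in the post (smaller here).
data = np.zeros((13, 32, 32, 32))

field = data[3]        # slicing off the leading axis stays C-contiguous
strided = data[3, ::2] # a strided view is NOT contiguous
print(field.flags["C_CONTIGUOUS"], strided.flags["C_CONTIGUOUS"])  # True False

# Copying once up front lets the subsequent dataset write stream one buffer:
buf = np.ascontiguousarray(strided)
print(buf.flags["C_CONTIGUOUS"])  # True
```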

jal...@lbl.gov

unread,
Oct 22, 2015, 9:03:15 PM10/22/15
to h5py
Hi, 
I installed this collective I/O branch and ran the default "python setup.py test" with 2 processes, but got an error.
The command I used: aprun -n 2 python-mpi setup.py test
and the output is:

...................................................................................................x............x............................s............s....................................................................x...........x................................................................s.......s.............EEEE.....................................................................................................................................................................................................sss.....sss..........................................................................................................................x..x......x.x...x.x.............................x....x....x.x.............................................................

======================================================================
ERROR: test_mpi_atomic (h5py.tests.old.test_file.TestDrivers)
Enable atomic mode for MPIO driver
----------------------------------------------------------------------
Traceback (most recent call last):
  File "/global/u1/j/jialin/h5py-burst/h5py-mpi_collective_24/build/lib.linux-x86_64-2.7/h5py/tests/old/test_file.py", line 253, in test_mpi_atomic
    with File(fname, 'w', driver='mpio', comm=MPI.COMM_WORLD) as f:
  File "/global/u1/j/jialin/h5py-burst/h5py-mpi_collective_24/build/lib.linux-x86_64-2.7/h5py/_hl/files.py", line 235, in __init__
    fid = make_fid(name, mode, userblock_size, fapl)
  File "/global/u1/j/jialin/h5py-burst/h5py-mpi_collective_24/build/lib.linux-x86_64-2.7/h5py/_hl/files.py", line 88, in make_fid
    fid = h5f.create(name, h5f.ACC_TRUNC, fapl=fapl, fcpl=fcpl)
  File "h5py/_objects.pyx", line 54, in h5py._objects.with_phil.wrapper (/global/u1/j/jialin/h5py-burst/h5py-mpi_collective_24/h5py/_objects.c:2718)
    with _phil:
  File "h5py/_objects.pyx", line 55, in h5py._objects.with_phil.wrapper (/global/u1/j/jialin/h5py-burst/h5py-mpi_collective_24/h5py/_objects.c:2675)
    return func(*args, **kwds)
  File "h5py/h5f.pyx", line 92, in h5py.h5f.create (/global/u1/j/jialin/h5py-burst/h5py-mpi_collective_24/h5py/h5f.c:2232)
    return FileID(H5Fcreate(name, flags, pdefault(fcpl), pdefault(fapl)))
IOError: Unable to create file (Other i/o error , error stack:
adio_cray_adio_open(1444): open failed on a remote node)


If I just run with 1 process, it passes the test.

Should I write my own code to test this collective I/O branch?


Best,

Jialin
