Thanks for the feedback! I don't regularly use the MPI platform so we
rely on feedback from users to fix things.
By the way, are you using a parallel filesystem (and if so, which)?
> Could I speed this up by a) using the low level interface? b) doing anything
> to my numpy arrays? (I found a np.ascontiguousarray call before the
> ds[slice] = array call made a lot of difference.)
The first thing we could try is doing your writes using only the
low-level interface, to see if there's something about the slicing
that h5py is getting wrong from a performance standpoint. This is
pretty simple, since it looks to me like you're creating a dataset and
then writing the entire array to it in one go.
Try replacing your array writes with this:
from h5py import h5s
dset = f.create_dataset(NAME, SHAPE, dtype=DTYPE)
# Ensure numpy_array is C contiguous before calling!
dset.id.write(h5s.ALL, h5s.ALL, numpy_array)
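To make the comparison concrete, here is a small self-contained sketch of both write paths side by side. It uses an in-memory file (the 'core' driver, with no file written to disk) and made-up dataset names purely for illustration; in the real script, f would be the MPI-opened file:

```python
import numpy as np
import h5py
from h5py import h5s

# In-memory file just to demonstrate the call pattern
f = h5py.File("demo.h5", "w", driver="core", backing_store=False)

data = np.ascontiguousarray(np.arange(24.0).reshape(4, 6))

# High-level write, for comparison
ds_hl = f.create_dataset("hl", data.shape, dtype=data.dtype)
ds_hl[...] = data

# Low-level write: h5s.ALL selections write the whole array in one
# go, bypassing h5py's slicing machinery entirely
ds_ll = f.create_dataset("ll", data.shape, dtype=data.dtype)
ds_ll.id.write(h5s.ALL, h5s.ALL, data)

assert np.array_equal(ds_hl[...], ds_ll[...])
```

Both datasets end up identical; the question is only whether the low-level path shaves measurable time off your large writes.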
> a note on file systems:
> I ran h5perf on the cluster where I am doing this write and the results do
> indicate that I should be getting faster write speed, and also that larger
> transfer buffer sizes (logically) speed things up. Is there a way to tweak
> the transfer buffer size or is this just set by the size of the slab being
> written?
Actually, if the above tweak doesn't work, the next thing that comes
to mind is writing your datasets with collective I/O (HDF5 defaults to
independent unless a flag is set in the dataset transfer property
list). This was a feature slated for 2.2, but unfortunately didn't
make it in due to time constraints. If you're available to help test
this feature, I could create a branch for you off master with
collective writes enabled.
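For reference, a low-level collective write would look roughly like the sketch below. This only runs under mpiexec with an MPI-enabled h5py build, and the file/dataset names are illustrative, not from your script; the key piece is the dataset transfer property list with MPIO set to collective:

```python
# Run with e.g.: mpiexec -n 4 python this_script.py
from mpi4py import MPI
import numpy as np
import h5py
from h5py import h5p, h5fd, h5s

comm = MPI.COMM_WORLD
f = h5py.File("parallel.h5", "w", driver="mpio", comm=comm)
dset = f.create_dataset("data", (comm.size, 128), dtype="f8")

# Dataset transfer property list with collective I/O enabled
dxpl = h5p.create(h5p.DATASET_XFER)
dxpl.set_dxpl_mpio(h5fd.MPIO_COLLECTIVE)

# Each rank writes its own row of the file
block = np.full((1, 128), comm.rank, dtype="f8")
file_space = dset.id.get_space()
file_space.select_hyperslab((comm.rank, 0), (1, 128))
mem_space = h5s.create_simple(block.shape)

dset.id.write(mem_space, file_space, block, dxpl=dxpl)
f.close()
```

Because the write is collective, every rank must participate in the call, even if a rank has nothing to write.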
I'm also happy to include buffer-size tweaks, but will need some
information from you (or our other MPI-aware colleagues) about which
HDF5 C functions to wrap.
> believe your code snippet below will not work?
Yes, of course; that makes more sense!
>
> I create the np slice for each process here:
> https://bitbucket.org/smumford/period-paper/src/master/sac/out2gdf_pure.py?at=master#cl-119
The equivalent to slicing at the HDF5 level is hyperslab selection.
It's easy to convert from numpy slices.
Say numpy_array is the data
you want to write; first create dataspace objects which tell HDF5
about the data in memory and on the disk:
>>> memory_space = h5s.create_simple(numpy_array.shape)
>>> file_space = dset.id.get_space() # copy of the dataset's dataspace
Specify the selection to write:
>>> file_space.select_hyperslab(START, COUNT)
where START and COUNT provide the selection. For example:
something[0:50,10:30] -> START (0, 10) COUNT (50, 20)
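The conversion is mechanical enough to wrap in a small helper. The function below is purely illustrative (it is not part of h5py) and assumes fully specified slices with non-negative bounds and unit step:

```python
def slices_to_hyperslab(slices):
    """Convert a tuple of fully-specified slices (explicit start and
    stop, unit step) to the (START, COUNT) pair that
    select_hyperslab() expects."""
    start = tuple(s.start for s in slices)
    count = tuple(s.stop - s.start for s in slices)
    return start, count

# The example above: something[0:50, 10:30]
start, count = slices_to_hyperslab((slice(0, 50), slice(10, 30)))
assert start == (0, 10)
assert count == (50, 20)
```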
Then you call the low-level write:
>>> dset.id.write(memory_space, file_space, numpy_array)
Of course, the total number of data elements has to be the same in
your memory array and in the selection you make to the file.
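Putting the steps above together, here is a serial, in-memory sketch of the whole hyperslab recipe (again using the 'core' driver and made-up names; in your MPI script each process would select its own region of the shared file):

```python
import numpy as np
import h5py
from h5py import h5s

f = h5py.File("hyperslab.h5", "w", driver="core", backing_store=False)
dset = f.create_dataset("d", (100, 40), dtype="f8")

# Data this "process" will write; ensure it is C contiguous
chunk = np.ascontiguousarray(np.ones((50, 20)))

mem_space = h5s.create_simple(chunk.shape)      # layout in memory
file_space = dset.id.get_space()                # copy of dataset dataspace
file_space.select_hyperslab((0, 10), (50, 20))  # i.e. dset[0:50, 10:30]

dset.id.write(mem_space, file_space, chunk)

assert np.array_equal(dset[0:50, 10:30], chunk)
```

Note the element counts match: the memory dataspace describes 50x20 elements, and the hyperslab selection covers exactly 50x20 elements of the file.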
> Looking over the HDF5 documentation I was wondering about this, I would be
> more than happy to help test this.
> What does the collective write flag actually do for the parallel IO?
From reading the HDF5 documentation, it looks like it explicitly uses
the parallel features of the filesystem, in addition to aggregating
writes in a way that avoids thrashing the disk/buffers.
I'll likely have a chance to work on the collective stuff in the next
day or so. I'll post here when it's ready.
> Cool, I will give that a go tomorrow and report benchmarks. I presume I can
> mix and match high level and low level API.
Yes, absolutely; this is an intentional part of h5py's design.
>> I'll likely have a chance to work on the collective stuff in the next
>> day or so. I'll post here when it's ready.
>
> Sweet, let me know what I can do to help.
Let's get the low-level method working first. Then, if you have time
once I've implemented the collective bits, I'd love to see benchmarks
(total run time) for the following configurations:
1. Present performance with high-level code
2. Present performance with low-level code
3. Collective performance with low-level code
4. Collective performance with high-level code (likely via a
"collective" context manager)
Andrew
--
You received this message because you are subscribed to a topic in the Google Groups "h5py" group.
To unsubscribe from this topic, visit https://groups.google.com/d/topic/h5py/wDgTjGho0dY/unsubscribe.
To unsubscribe from this group and all its topics, send an email to h5py+uns...@googlegroups.com.
For more options, visit https://groups.google.com/groups/opt_out.
> 4. Collective performance with low-level code
> 10 timesteps in 47.2s = 4.7 s
Great! Looks like collective writes are the way to go, at least in this case.
> However it does seem that the high level collective interface is not
> working. Also all these profiles have not been repeated, I will give more
> info as I continue to tinker.
That's because I forgot to actually use the property list for
high-level writes. :/ I've pushed another commit to the mpi_collective
branch which should fix this.
Ultimately if you have some sense of how these rates compare to h5perf
or other MPI benchmarks, that would also be interesting.
Andrew
I have been trying to get good performance with parallel independent I/O on our Lustre filesystem, and have found that one big roadblock is older versions of Lustre. While the filesystem supports 2 GB/s output to multiple files, I can't get more than 300 MB/s to a single file. I have learned that Lustre versions < 2.7 have a bug where each Lustre client locks the output file during writing. I've seen this in tests where each MPI rank or thread writes one big block of data: by printing timing information, I can see that the calls get serialized when they come from threads on the same host. Here is part of the thread where I learned this: https://www.mail-archive.com/search?l=lustre-discuss@lists.lustre.org&q=subject:%22Re\%3A+\[lustre\-discuss\]+problem+getting+high+performance+output+to+single+file%22&o=newest&f=1
Hello,
Firstly, I want to say how absolutely awesome the parallel write API in h5py 2.2.0 is; it is so easy it almost made me dance around my office. However, as always, I seem to have pushed it too far, and it is much slower than I would like. The purpose of this email is to work out where the bottleneck is.
My setup:
I have a FORTRAN code that saves out one horrible, horrible 'unformatted' binary file per processor, which is basically unusable! My Python script uses h5py to translate this data into an HDF5 file laid out like this: https://bitbucket.org/yt_analysis/grid_data_format/. I have written an MPI code which uses the same number of processors as there are data files; each process reads a file and saves its chunk to the HDF5 file. My grid size is 128^3 with 16 processors and 13 fields (I am taking one 13x128x128x128 array and writing 13 128x128x128 datasets to the HDF5 file).
This translation code takes ~7 hours to run for all 585 timesteps of my simulation, which works out at around 40 seconds per step; seeing as each resulting HDF5 file is 273 MB, that seems quite slow. I did some profiling of this code with various grid sizes and numbers of processors; I attach an HTML output of profiling 10 steps of the 128^3, 16-processor run. Other profiling gave me 1.8 seconds for 60^3 on one or two processors, indicating that it is not MPI, though that was on a different system to the large write.
Questions:
Looking at the attached HTML file, the routine spends basically all of its time in the '<method 'write' of 'h5py.h5d.DatasetID' objects>' call, which kind of makes sense if, as my untrained eye tells me, this is basically I/O limited??
Could I speed this up by a) using the low level interface? b) doing anything to my numpy arrays? (I found a np.ascontiguousarray call before the ds[slice] = array call made a lot of difference.)
a note on file systems:
I ran h5perf on the cluster where I am doing this write and the results do indicate that I should be getting faster write speed, and also that larger transfer buffer sizes (logically) speed things up. Is there a way to tweak the transfer buffer size or is this just set by the size of the slab being written?
code:
The script is here:
https://bitbucket.org/smumford/period-paper/src/7acc77e17ba94efaae9a3766538094b0a56592fa/sac/out2gdf_pure.py?at=master
which calls functions from here:
https://bitbucket.org/swatsheffield/pysac/src/5fc267727e8a8a65a409de3d05120fdf08b00c1b/pysac/io/gdf_writer.py?at=master
Thanks a lot for a fantastic package
Stuart
...................................................................................................x............x............................s............s....................................................................x...........x................................................................s.......s.............EEEE.....................................................................................................................................................................................................sss.....sss..........................................................................................................................x..x......x.x...x.x.............................x....x....x.x.............................................................
======================================================================
ERROR: test_mpi_atomic (h5py.tests.old.test_file.TestDrivers)
Enable atomic mode for MPIO driver
----------------------------------------------------------------------
Traceback (most recent call last):
File "/global/u1/j/jialin/h5py-burst/h5py-mpi_collective_24/build/lib.linux-x86_64-2.7/h5py/tests/old/test_file.py", line 253, in test_mpi_atomic
with File(fname, 'w', driver='mpio', comm=MPI.COMM_WORLD) as f:
File "/global/u1/j/jialin/h5py-burst/h5py-mpi_collective_24/build/lib.linux-x86_64-2.7/h5py/_hl/files.py", line 235, in __init__
fid = make_fid(name, mode, userblock_size, fapl)
File "/global/u1/j/jialin/h5py-burst/h5py-mpi_collective_24/build/lib.linux-x86_64-2.7/h5py/_hl/files.py", line 88, in make_fid
fid = h5f.create(name, h5f.ACC_TRUNC, fapl=fapl, fcpl=fcpl)
File "h5py/_objects.pyx", line 54, in h5py._objects.with_phil.wrapper (/global/u1/j/jialin/h5py-burst/h5py-mpi_collective_24/h5py/_objects.c:2718)
with _phil:
File "h5py/_objects.pyx", line 55, in h5py._objects.with_phil.wrapper (/global/u1/j/jialin/h5py-burst/h5py-mpi_collective_24/h5py/_objects.c:2675)
return func(*args, **kwds)
File "h5py/h5f.pyx", line 92, in h5py.h5f.create (/global/u1/j/jialin/h5py-burst/h5py-mpi_collective_24/h5py/h5f.c:2232)
return FileID(H5Fcreate(name, flags, pdefault(fcpl), pdefault(fapl)))
IOError: Unable to create file (Other i/o error , error stack:
adio_cray_adio_open(1444): open failed on a remote node)
If I just run with 1 process, it passes the test.
Should I write my own code to test this collective I/O branch?
Best,
Jialin