Frequent Dataset append: how to do it fast?

nils

Dec 14, 2010, 3:58:26 AM
to h5py
Hi,

I am starting to use h5py to store simulation data on the fly. I have many
Groups, each with one Dataset that accumulates the generated data. So far
it seems to be working :)

Since there are no variable-length Datasets in h5py, I keep a running count
of the len() of the Dataset array and grow it in blocks (of 5 rows, but
that's variable) by calling Dataset.resize() whenever the Dataset overflows.
Data are 'appended' by setting dataset[running_row_index] = new_data_row.
This happens very frequently.
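
For reference, a stripped-down sketch of what I mean (the file/group names,
row shape and numbers are just made up for illustration):

    import h5py
    import numpy as np

    BLOCK = 5  # rows added per resize; variable in my real code

    f = h5py.File('simulation.h5', 'w')          # made-up file name
    grp = f.create_group('run_001')              # one of many Groups
    # maxshape=(None, 3) makes axis 0 resizable; chunks=True lets h5py
    # pick a chunk shape automatically
    dset = grp.create_dataset('data', shape=(BLOCK, 3), maxshape=(None, 3),
                              dtype='f8', chunks=True)

    running_row_index = 0

    def append_row(new_data_row):
        global running_row_index
        if running_row_index == dset.shape[0]:   # Dataset overflows
            dset.resize(dset.shape[0] + BLOCK, axis=0)
        dset[running_row_index] = new_data_row   # the frequent write
        running_row_index += 1

    for step in range(1000):                     # stand-in for the simulation loop
        append_row(np.random.rand(3))

    f.close()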

Now the question is: is this an efficient way of doing things? In
particular, is the frequent write access to the dataset buffered, and do I
need to keep the number of resize events small? Or do I need to manually
buffer the simulation output in an in-memory numpy.array and write it in
chunks myself?


Nils

Andrew Collette

Dec 15, 2010, 4:41:36 PM
to h5...@googlegroups.com
Hi Nils,

> Now the question is: is this an efficient way of doing things? In
> particular, is the frequent write access to the dataset buffered, and do I
> need to keep the number of resize events small? Or do I need to manually
> buffer the simulation output in an in-memory numpy.array and write it in
> chunks myself?

To be honest, I'm not sure. I think you will have to benchmark this
to be certain of the result. My impression from working with HDF5 is
that you want to keep the number of resize events small, but I'm not
sure what effect this has on performance. It may be that it doesn't
matter.

Another approach would be to create a large array and then trim it
when you're done. However, you would want to benchmark this to make
sure HDF5 doesn't end up wasting file space, which can happen
sometimes.
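
Roughly something like this, I mean (untested sketch; the initial size and
shapes are made up and you would pick your own):

    import h5py
    import numpy as np

    N_GUESS = 100000                 # generous guess at the final row count

    f = h5py.File('simulation.h5', 'w')
    dset = f.create_dataset('data', shape=(N_GUESS, 3), maxshape=(None, 3),
                            dtype='f8', chunks=True)

    n = 0
    for step in range(1234):         # however many rows you actually produce
        dset[n] = np.random.rand(3)
        n += 1

    dset.resize(n, axis=0)           # trim the unused tail when you're done
    f.close()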

I am at a conference this week but will get to your other email this
weekend. I think there are some ways to make your tree traversal
faster.

Andrew

Kevin Jacobs <jacobs@bioinformed.com>

Dec 15, 2010, 6:27:42 PM
to h5...@googlegroups.com
On Wed, Dec 15, 2010 at 4:41 PM, Andrew Collette <andrew....@gmail.com> wrote:
>> Now the question is: is this an efficient way of doing things? In
>> particular, is the frequent write access to the dataset buffered, and do I
>> need to keep the number of resize events small? Or do I need to manually
>> buffer the simulation output in an in-memory numpy.array and write it in
>> chunks myself?
>
> To be honest, I'm not sure.  I think you will have to benchmark this
> to be certain of the result.  My impression from working with HDF5 is
> that you want to keep the number of resize events small, but I'm not
> sure what effect this has on performance.  It may be that it doesn't
> matter.


My experience is that picking the right chunk size is the most critical element.  Otherwise, resize events aren't too bad in terms of performance for the datasets that I've used.

I've benchmarked the difference between resizing each time to append rows versus collecting appends in a pre-allocated buffer and resizing/writing data in larger chunks.  The performance was slightly better with the latter approach, but didn't seem worth the effort in retrospect.  The majority of the time spent in h5py/hdf5 code was on data compression, as far as I could tell from comparing the cost of creating an HDF5 file with a gzip filter against the time taken to write an uncompressed HDF5 file and then gzip-compress that file.
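
Both the chunk shape and the compression filter are fixed when the dataset is created, so that is where the tuning happens.  A sketch of the buffered-append pattern I described (the shapes, chunk size and gzip level below are placeholders, not recommendations):

    import h5py
    import numpy as np

    f = h5py.File('data.h5', 'w')                # made-up file name
    # Explicit chunk shape plus a gzip filter; the right chunk shape depends
    # on how the data will be read back, so benchmark with realistic access
    # patterns.
    dset = f.create_dataset('table', shape=(0, 64), maxshape=(None, 64),
                            dtype='f4', chunks=(1024, 64),
                            compression='gzip', compression_opts=4)

    buf = np.random.rand(8192, 64).astype('f4')  # pre-allocated append buffer
    start = dset.shape[0]
    dset.resize(start + buf.shape[0], axis=0)    # one resize per buffered block
    dset[start:] = buf                           # one write per buffered block
    f.close()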

Careful benchmarking turned out to be critical, since my initial guesses at chunk sizes turned out to be far from optimal for the types of access patterns my application used.

I'll also note that wasted space was a bigger issue for datasets without compression filters, since the over-allocation is physically represented in the file.  The overhead for compressed tables was very minor.  I did notice that repacking the resulting files decreased their size, but not by a terribly significant factor for my application.

(In case you're interested, I'm using h5py/pytables and HDF5 to store genetic probe intensities and computed genetic variation data for millions of hybridized oligonucleotide probes measured on tens of thousands of individuals (humans).)

-Kevin

nils

Dec 20, 2010, 4:33:59 AM
to h5py
Kevin,

thanks for your reply.

> My experience is that picking the right chunk size is the most critical
> element.  Otherwise, resize events aren't too bad in terms of performance
> for the datasets that I've used.  I've benchmarked the difference between
> resizing each time to append rows versus collecting appends in a
> pre-allocated buffer and resizing/writing data in larger chunks.  The
> performance was slightly better with the latter approach, but didn't seem
> worth the effort in retrospect.  The majority of the time spent in h5py/hdf5
> code was on data compression

In my case, there are many, deeply nested nodes with tiny amounts of data
each. So I figure I cannot profit much from chunked access, since chunking
is always per-dataset (right?). In fact, the major bottleneck in my code
during data generation is currently traversal of the HDF document
structure. I'm going to start another thread once I can figure out a
precise question.

Nils