How to write chunks of data to a H5py dataset?


Luiz Vitor Martinez Cardoso

Oct 17, 2013, 7:19:21 PM
to h5...@googlegroups.com
Dear,

I'm struggling to find a solution where I could write an np.array object into a dataset inside a loop.

What I need to do, exactly, is to save N tuples into a dataset incrementally:

import h5py
import numpy as np

f = h5py.File('PriceCheckWrite.h5', 'w')

dgroup = f.create_group('PCAnalytics')

dset = dgroup.create_dataset('PCWrangler',
                             shape=(100, ),
                             dtype=price_check_wrangler_type,
                             compression='gzip',
                             compression_opts=9)

raw_data = ('192.168.13.248', '#live', 18874368, '192.168.13.44', 1377186528)

for i in range(0, 1000):
  dset[...] = np.array(raw_data,
                       dtype=price_check_wrangler_type)


How can I create an infinite dataset? How can I write to a dataset like I write to a simple Python file? Do I need to flush() on every write?

Best regards,
Luiz Vitor.

Andrew Collette

Oct 18, 2013, 12:38:42 PM
to h5...@googlegroups.com
Hi,

> I'm struggling to find a solution where I could write an np.array object
> into a dataset inside a loop.

The simplest solution for your example is to use indexing on the dataset:

for i in xrange(0, 1000):
    dset[i] = <data>

Dataset objects support the same kinds of indexing & slicing as real
NumPy arrays.
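
For instance, here is a minimal sketch (the file name, dataset name, and dtype are just illustrative) of NumPy-style indexing and slicing on a dataset:

import numpy as np
import h5py

with h5py.File('indexing_demo.h5', 'w') as f:
    dset = f.create_dataset('values', shape=(1000,), dtype='f8')

    dset[0] = 3.14                  # single element
    dset[10:20] = np.arange(10)     # slice assignment
    dset[::100] = 1.0               # strided slice, scalar broadcast

    block = dset[10:20]             # reading a slice returns a NumPy array
    print(block)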

> How can I create an infinite dataset?

Dataset axes need to be of a finite size. But you can declare certain
axes expandable by using the "maxshape" keyword:

create_dataset("name", (1000,), dtype, maxshape=(None,))

Then you can use the "resize" method to change the shape:

http://www.h5py.org/docs/high/dataset.html#h5py.Dataset.resize
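
For example, a minimal sketch (the file and dataset names, dtype, and sizes here are just illustrative) of an expandable dataset:

import numpy as np
import h5py

with h5py.File('expandable_demo.h5', 'w') as f:
    # The first axis starts at 1000 rows but may grow without bound.
    dset = f.create_dataset('name', shape=(1000,), dtype='i4',
                            maxshape=(None,))
    dset[:] = np.arange(1000)

    # Grow the dataset and append another block of rows.
    dset.resize((2000,))
    dset[1000:2000] = np.arange(1000, 2000)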

> Do I need to flush() on every write?

No; if Python exits cleanly (even in response to an exception), the
file will be properly closed.

Andrew

Luiz Vitor Martinez Cardoso

Oct 21, 2013, 3:29:58 PM
to h5...@googlegroups.com
Andrew,

Thank you, H5py is awesome!

I tried what you said, but now I've run into two new problems.

Do you know how costly a resize operation is?

Do you think applying a resize() call on every loop iteration is a bad idea?

for i in xrange(0, 1000):
    dset.resize((i + 1,))
    dset[i] = <data>

The reason for doing this is that in my scenario I can't predict how many rows I'm going to need until I've processed all the data to be stored with h5py.

If I simply define a big dataset size and the processing results in a smaller dataset than that, I'll end up with several "blank" rows.

Best regards,

Luiz Vitor.

Luiz Vitor Martinez Cardoso

Oct 21, 2013, 7:02:35 PM
to h5...@googlegroups.com
Andrew,

I searched a bit more and found a post of yours giving some suggestions for the same problem; I finally came up with the following code:
 
import h5py
import numpy as np

price_check_wrangler_type = np.dtype({
    'names': ['dst_ip', 'src_ip', 'payload_type', 'payload', 'timestamp'],
    'formats': ['|S15', '|S15', 'i4', '|S100', 'i4']
})

f = h5py.File('/tmp/PriceCheckWrite.h5', 'w')

dgroup = f.create_group('PCAnalytics')

CHUNCK = 100

dset = dgroup.create_dataset('PCWrangler',
                             shape=(CHUNCK,),
                             dtype=price_check_wrangler_type,
                             maxshape=(None,),
                             compression='gzip',
                             compression_opts=9)

write_count = 0
for i in [(...), (...), ...]:
    # grow the dataset by another CHUNCK rows whenever it fills up
    if not (write_count % CHUNCK):
        dset.resize((write_count + CHUNCK,))

    dset[write_count] = np.array(i, dtype=price_check_wrangler_type)
    write_count += 1

# trim the unused tail rows
dset.resize((write_count,))

But to my surprise it is ~20x slower than a similar implementation that writes directly to a CSV file.

Most of the time is spent on writing "dset[write_count] = ...", and I have already tried disabling compression.

Do you have any insight on it?

Best regards,
Luiz Vitor.


Matthew Zwier

Oct 21, 2013, 10:33:16 PM
to h5...@googlegroups.com
Hi Luiz,

Your chunk size is only about 14k. That will trigger a ton of writes to disk regardless of compression. Try defining CHUNCK=2000 (or even 2500) and see if that improves the situation.

Cheers,
Matt Z.



Andrew Collette

Oct 22, 2013, 12:33:52 PM
to h5...@googlegroups.com
Hi,

>> dset[write_count] = np.array(i, dtype=price_check_wrangler_type)

> But to my surprise it is ~20x slower than a similar implementation that writes directly to a CSV file.
> Most of the time is spent on writing "dset[write_count] = ...", and I have already tried disabling compression.

The biggest issue is that you are writing to the dataset one element
at a time. There's a certain amount of overhead involved in making a
write, so you will get much better performance if you write multiple
entries at once, for example:

dset[0:100] = np.ones((100,), dtype=mydtype)

Chunk size is also important, as Matt pointed out, although 14k is
technically OK (we recommend an absolute minimum of 10k).
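
For example, a buffered version of the loop from the earlier post might look like the sketch below. This is only a sketch: BLOCK, buf, n_buf, n_total, and the rows list are illustrative names standing in for your own data source, not part of the original code.

import numpy as np
import h5py

price_check_wrangler_type = np.dtype({
    'names': ['dst_ip', 'src_ip', 'payload_type', 'payload', 'timestamp'],
    'formats': ['|S15', '|S15', 'i4', '|S100', 'i4']
})

BLOCK = 2000  # rows accumulated in memory before each write

with h5py.File('/tmp/PriceCheckWrite.h5', 'w') as f:
    dset = f.create_group('PCAnalytics').create_dataset(
        'PCWrangler', shape=(BLOCK,), dtype=price_check_wrangler_type,
        maxshape=(None,), chunks=(BLOCK,),
        compression='gzip', compression_opts=9)

    buf = np.empty((BLOCK,), dtype=price_check_wrangler_type)
    n_buf = 0    # rows currently held in the in-memory buffer
    n_total = 0  # rows already written to the dataset

    # Example data only; replace with whatever produces your tuples.
    rows = [('192.168.13.248', '#live', 18874368, '192.168.13.44', 1377186528)] * 5000

    for row in rows:
        buf[n_buf] = row
        n_buf += 1
        if n_buf == BLOCK:
            # Buffer is full: grow the dataset if needed and write one block.
            if n_total + n_buf > dset.shape[0]:
                dset.resize((n_total + n_buf,))
            dset[n_total:n_total + n_buf] = buf
            n_total += n_buf
            n_buf = 0

    if n_buf:
        # Flush whatever is left over after the loop.
        dset.resize((n_total + n_buf,))
        dset[n_total:n_total + n_buf] = buf[:n_buf]
        n_total += n_buf

    # Keep exactly n_total rows (trims the initial allocation if fewer
    # than BLOCK rows were ever written).
    dset.resize((n_total,))

With this dtype, a block of 2000 rows is roughly 276 KB per write, which addresses both the per-element overhead and the chunk-size concern.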

Andrew

Luiz Vitor Martinez Cardoso

Oct 24, 2013, 10:21:08 PM
to h5...@googlegroups.com
Now I'm able to store my data as fast as when I was writing a simple CSV file and with the benefit of having a 100x smaller file size.

Thank you guys!

sergey...@gmail.com

Jun 19, 2015, 1:58:59 AM
to h5...@googlegroups.com
Hi Andrew,

I have recently started using h5py and am confused about the overhead of saving data to disk. This thread is relevant, so I hope you can clarify the following points for me.

1. Originally I thought that every time data is assigned to elements of a dataset, e.g.:

dset[0:100] = np.ones((100,), dtype=mydtype)
it gets stored to disk. But the Python and HDF5 book also mentions flushing the buffers. It is unclear to me under which circumstances I would need to call flush(), since I currently never use it and the data still gets saved.

2. Also when using chunked storage as follows:
dset = f.create_dataset("big dataset", (1024**2, ), dtype=np.int32, chunks=True)
the data should be stored in chunks. Why, then, is there still overhead when writing individual elements to a dataset?

Jérôme Kieffer

Jun 19, 2015, 2:33:53 AM
to h5...@googlegroups.com
On Thu, 18 Jun 2015 22:58:58 -0700 (PDT)
sergey...@gmail.com wrote:


> 2. Also when using chunked storage as follows:
> dset = f.create_dataset("big dataset", (1024**2, ), dtype=np.int32,
> chunks=True)
> the data should be stored in chunks. Why, then, is there still
> overhead when writing individual elements to a dataset?

The "chunks" argument should specify the chunk shape; chunks somewhere in the tens of KB up to roughly 1 MB are a reasonable target. Here chunks=True only asks h5py to guess a chunk shape for you, which can end up much smaller than that and may not match your access pattern, so it is usually better to set it explicitly:

http://docs.h5py.org/en/latest/high/dataset.html#chunked-storage
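
For instance, a rough sketch (the file name and chunk size below are only illustrative) of setting the chunk shape explicitly instead of relying on chunks=True:

import numpy as np
import h5py

with h5py.File('chunks_demo.h5', 'w') as f:
    # 256*1024 int32 elements per chunk = 1 MiB per chunk.
    dset = f.create_dataset('big dataset', (1024**2,), dtype=np.int32,
                            chunks=(256 * 1024,))
    dset[:] = np.arange(1024**2, dtype=np.int32)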

Andrew Collette

Jun 19, 2015, 12:13:32 PM
to h5...@googlegroups.com
Hi,

> 1. Originally I thought that every time data is assigned to elements of a dataset, e.g.:
> dset[0:100] = np.ones((100,), dtype=mydtype)
> it gets stored to disk. But the Python and HDF5 book also mentions flushing the buffers. It is unclear to me under which circumstances I would need to call flush(), since I currently never use it and the data still gets saved.

You don't need to manually flush to disk. The example you have here is fine.

> 2. Also when using chunked storage as follows:
> dset = f.create_dataset("big dataset", (1024**2, ), dtype=np.int32, chunks=True)
> the data should be stored in chunks. Why, then, is there still overhead when writing individual elements to a dataset?

There is still slicing, type conversion, etc. In general, writing or reading large blocks of data (within reason) will minimize the overhead.
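
As a rough illustration (the file and dataset names are only illustrative, and the actual numbers will vary by machine), the cost of per-element access can be seen by comparing element-wise and block writes:

import time
import numpy as np
import h5py

N = 10000
data = np.arange(N, dtype=np.int32)

with h5py.File('overhead_demo.h5', 'w') as f:
    a = f.create_dataset('elementwise', (N,), dtype=np.int32)
    b = f.create_dataset('blockwise', (N,), dtype=np.int32)

    t0 = time.time()
    for i in range(N):
        a[i] = data[i]   # one selection + write per element
    t1 = time.time()

    b[:] = data          # a single write for the whole array
    t2 = time.time()

print('element-wise: %.3f s, block: %.3f s' % (t1 - t0, t2 - t1))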

Andrew
