Compressing an existing HDF5 file with LZF

Paul

May 12, 2009, 8:46:19 AM5/12/09
to h5py
Hello.

I have a 5 GB HDF5 file with one large dataset created using h5py:

In [37]: h5
Out[37]: <HDF5 file "data.h5" (mode r, 1 root members)>

In [39]: dset
Out[39]: <HDF5 dataset "m": shape (15000, 460, 620), type "|u1">

When I called create_dataset, I used compression="lzf" and wrote to the file
in chunks:

dset[i:j,:] = m

where j - i = 1000 and m.shape = (1000, 460, 620).

But the data written to the file was not compressed, maybe because I
wrote in chunks.

Is there a way now to open the dataset and save it in compressed form?
The data are frames to a movie and I imagine a good compression scheme
would work very well. Is there an optimal compression to use? I use
these HDF5 files from Matlab and Mathematica as well. Is there a
compression scheme that those programs can understand?

Thanks very much for making this excellent code publicly available.

Andrew Collette

May 12, 2009, 1:49:50 PM5/12/09
to h5...@googlegroups.com
Hi,

> When I called create_dataset, I used compression="lzf" and wrote to the file
> in chunks:
>
> dset[i:j,:] = m
>
> where j - i = 1000 and m.shape = (1000, 460, 620).

This is fine; you can write to the file in any order and the
compression should still work. It's possible that your movie
information is just very hard to compress. Keep in mind LZF works
like GZIP; it's lossless and not very effective on streams of random
or near-random data. If you're on a UNIX system, you can check if the
compression is actually making it into the file by doing "h5ls -vlr
<myfile.hdf5>".
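
The same check can also be made from Python through the dataset's
properties. A small sketch, assuming the file and dataset names from your
first post:

import h5py

h5 = h5py.File("data.h5", "r")
dset = h5["m"]

# 'lzf' (or 'gzip') if a compression filter is attached, None otherwise
print(dset.compression, dset.chunks)

# bytes actually allocated on disk vs. the uncompressed (logical) size
print(dset.id.get_storage_size(), dset.size * dset.dtype.itemsize)

h5.close()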

> Is there a way now to open the dataset and save it in compressed form?
> The data are frames to a movie and I imagine a good compression scheme
> would work very well. Is there an optimal compression to use? I use
> these HDF5 files from Matlab and Mathematica as well. Is there a
> compression scheme that those programs can understand?

With h5py there is not currently a programmatic way of processing a
file like this. You'd have to create a new dataset and copy the
information over. However, since lzf is h5py-specific, I'd recommend
using the standard gzip compressor (compression='gzip' or
compression=<integer>) when you create the dataset, as it is supported
by all or nearly all HDF5-aware systems.
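
The copy could look roughly like this (a minimal sketch; the output file
name, gzip level and 1000-frame block size are arbitrary choices, not
recommendations):

import h5py

src = h5py.File("data.h5", "r")
dst = h5py.File("data_gzip.h5", "w")

m = src["m"]
out = dst.create_dataset("m", shape=m.shape, dtype=m.dtype,
                         compression="gzip", compression_opts=4,
                         shuffle=True)

step = 1000  # copy a block of frames at a time to keep memory use bounded
for i in range(0, m.shape[0], step):
    out[i:i+step] = m[i:i+step]

dst.close()
src.close()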

Hope this helps,
Andrew Collette

Paul

May 12, 2009, 6:44:41 PM5/12/09
to h5py
Hello,

On May 12, 7:49 pm, Andrew Collette <andrew.colle...@gmail.com> wrote:
> Hi,
>
> > When I called create_dataset, I used compression="lzf" and wrote to the file
> > in chunks:
>
> > dset[i:j,:] = m
>
> > where j - i = 1000 and m.shape = (1000, 460, 620).
>
> This is fine; you can write to the file in any order and the
> compression should still work.  It's possible that your movie
> information is just very hard to compress.  Keep in mind LZF works
> like GZIP; it's lossless and not very effective on streams of random
> or near-random data.  If you're on a UNIX system, you can check if the
> compression is actually making it into the file by doing "h5ls -vlr
> <myfile.hdf5>".

I wasn't expecting it to get lossy mp4 compression or anything, but
thought naively that nearby pixels in a row, for example, are
correlated and perhaps compressible. Regardless, I must have been
doing something wrong, because using gzip I do get a very significant
compression. (How does it figure out how to compress the numpy array
if it is given the information piecemeal? I can see it writing to disk
each time I assign something to the dataset.)

This is the output of h5ls on the LZF- and the smaller gzip-compressed
file, respectively (h5py 1.1, HDF5 1.8.2). The LZF file was given
chunks that were 10 times smaller than the gzip one:

centaur:Build paul$ h5ls -vlr data1.h5
Opened "data1.h5" with sec2 driver.
/                        Group
    Location:  1:96
    Links:     1
/m                       Dataset {9000/9000, 380/380, 620/620}
    Location:  1:800
    Links:     1
    Chunks:    {282, 12, 39} 131976 bytes
    Storage:   2120400000 logical bytes, 2060046599 allocated bytes, 102.93% utilization
    Filter-0:  shuffle-2 OPT {1}
    Filter-1:  lzf-32000 OPT {1, 261, 131976}
    Type:      native unsigned char

centaur:Build paul$ h5ls -vlr data2.h5
Opened "data2.h5" with sec2 driver.
/                        Group
    Location:  1:96
    Links:     1
/m                       Dataset {9000/9000, 380/380, 620/620}
    Location:  1:800
    Links:     1
    Chunks:    {282, 12, 39} 131976 bytes
    Storage:   2120400000 logical bytes, 1379102006 allocated bytes, 153.75% utilization
    Filter-0:  shuffle-2 OPT {1}
    Filter-1:  deflate-1 OPT {4}
    Type:      native unsigned char

> > Is there a way now to open the dataset and save it in compressed form?
> > The data are frames to a movie and I imagine a good compression scheme
> > would work very well. Is there an optimal compression to use? I use
> > these HDF5 files from Matlab and Mathematica as well. Is there a
> > compression scheme that those programs can understand?
>
> With h5py there is not currently a programmatic way of processing a
> file like this.  You'd have to create a new dataset and copy the
> information over.  However, since lzf is h5py-specific, I'd recommend
> using the standard gzip compressor (compression='gzip' or
> compression=<integer>) when you create the dataset, as it is supported
> by all or nearly all HDF5-aware systems.

I confirmed that the HDF5 file produced by the gzip compressor is readable
by Mathematica. I created a new dataset and copied the data over as
you suggested. I need to access a random sequence of frames in this
movie (the random sequence is sorted). It is pretty fast even with
compression.
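
For reference, reading a sorted set of frames looks roughly like this (a
sketch; the file and dataset names are taken from the h5ls listing above,
and the 50 frames are arbitrary):

import numpy as np
import h5py

h5 = h5py.File("data2.h5", "r")
dset = h5["m"]

# h5py's fancy indexing along an axis expects increasing indices,
# so the randomly chosen frame numbers are sorted first
idx = np.sort(np.random.choice(dset.shape[0], size=50, replace=False))
frames = dset[idx, :, :]  # only the chunks containing these frames are read

h5.close()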

Thanks for h5py. I use it in every new python program I write that
reads and writes data. For my needs, I much prefer it over pytables,
which was also quite nice.

Pål.

Andrew Collette

May 12, 2009, 8:09:50 PM5/12/09
to h5...@googlegroups.com
Hi,

> I wasn't expecting it to get lossy mp4 compression or anything, but
> thought naively that nearby pixels in a row, for example, are
> correlated and perhaps compressible. Regardless, I must have been
> doing something wrong, because using gzip I do get a very significant
> compression. (How does it figure out how to compress the numpy array
> if it is given the information piecemeal? I can see it writing to disk
> each time I assign something to the dataset.)

HDF5 does some caching of chunks, but eventually it will need to read
the entire chunk into memory to recompress it. If you want to
fine-tune the write performance, you could try manually setting the
chunk shape when you create the dataset, and relating it to your I/O
slice size. From experience, we found that chunks in the 100kB -
300kB range work best for compression. However, the default chunk
shape is just a guess.
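
A hypothetical example of tying the chunk shape to the write pattern from
the first post (the file name and chunk choice are my assumptions, not a
tested recommendation):

import h5py

f = h5py.File("data3.h5", "w")

# One whole 460x620 frame per chunk is ~285 kB of uint8 data, within the
# 100-300 kB range mentioned above, and a write of dset[i:j] = m then
# touches only whole chunks.
dset = f.create_dataset("m", shape=(15000, 460, 620), dtype="u1",
                        chunks=(1, 460, 620),
                        compression="gzip", shuffle=True)

f.close()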

>    Chunks:    {282, 12, 39} 131976 bytes
>    Storage:   2120400000 logical bytes, 2060046599 allocated bytes,
> 102.93% utilization
>    Filter-0:  shuffle-2 OPT {1}
>    Filter-1:  lzf-32000 OPT {1, 261, 131976}
>    Type:      native unsigned char

It looks like the filter is applied but not doing much. :) To be
honest I haven't tested LZF extensively with anything but multibyte
float data. It looks like this is one area where it's not effective.
Even GZIP in this case only managed a ~30% reduction in dataset size.
I think this is an area of performance which should be better measured
and/or documented.

> Thanks for h5py. I use it in every new python program I write that
> reads and writes data. For my needs, I much prefer it over pytables,
> which was also quite nice.

You're welcome! To give proper credit, the h5py LZF filter was
inspired by the excellent LZO filter in PyTables. However, I was not
comfortable using LZO due to the licensing issue.

Andrew Collette
