> When I called create_dataset, I used compression="lzf" and wrote to the file
> in chunks:
>
> dset[i:j,:] = m
>
> where j-i = 1000 and m.shape = (1000, 460, 620).
This is fine; you can write to the file in any order and the
compression should still work. It's possible that your movie
information is just very hard to compress. Keep in mind LZF works
like GZIP; it's lossless and not very effective on streams of random
or near-random data. If you're on a UNIX system, you can check if the
compression is actually making it into the file by doing "h5ls -vlr
<myfile.hdf5>".
> Is there a way now to open the dataset and save it in compressed form?
> The data are frames to a movie and I imagine a good compression scheme
> would work very well. Is there an optimal compression to use? I use
> these HDF5 files from Matlab and Mathematica as well. Is there a
> compression scheme that those programs can understand?
With h5py there is currently no programmatic way of recompressing a
file like this in place. You'd have to create a new dataset and copy the
information over. However, since lzf is h5py-specific, I'd recommend
using the standard gzip compressor (compression='gzip' or
compression=<integer>) when you create the dataset, as it is supported
by all or nearly all HDF5-aware systems.
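Something along these lines should work for the copy (a rough sketch;
the dataset names, gzip level, and step size are just placeholders):

    import h5py

    with h5py.File("movie.hdf5", "r") as src, h5py.File("movie_gzip.hdf5", "w") as dst:
        old = src["movie"]
        new = dst.create_dataset("movie", shape=old.shape, dtype=old.dtype,
                                 compression="gzip", compression_opts=4)
        step = 1000  # copy a block of frames at a time instead of loading everything
        for i in range(0, old.shape[0], step):
            new[i:i+step] = old[i:i+step]

Gzip level 4 is a reasonable middle ground; higher levels are slower
for usually modest gains in size.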
Hope this helps,
Andrew Collette
> I wasn't expecting it to get lossy mp4 compression or anything, but
> thought naively that nearby pixels in a row, for example, are
> correlated and perhaps compressible. Regardless, I must have been
> doing something wrong as using gzip, I do get a very significant
> compression. (How does it figure out how to compress the numpy array
> if it is given the information piecemeal? I can see it writing to disk
> each time I assign something to the dataset.)
HDF5 does some caching of chunks, but eventually it will need to read
the entire chunk into memory to recompress it. If you want to
fine-tune write performance, you can try setting the chunk shape
manually when you create the dataset and matching it to the size of
the slices you write. From experience, we found that chunks in the
100 kB - 300 kB range work best for compression; the default chunk
shape is just a guess.
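For example (an untested sketch; the number of frames is a
placeholder): with 8-bit frames of 460 x 620 pixels, one frame per
chunk is about 285 kB, which falls in that range and lines up with
writing whole frames at a time:

    import h5py
    import numpy as np

    n_frames = 10000  # placeholder for however many frames the movie has
    with h5py.File("movie.hdf5", "w") as f:
        dset = f.create_dataset("movie", shape=(n_frames, 460, 620),
                                dtype=np.uint8,
                                chunks=(1, 460, 620),   # one frame per chunk, ~285 kB
                                compression="gzip", compression_opts=4)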
> Chunks: {282, 12, 39} 131976 bytes
> Storage: 2120400000 logical bytes, 2060046599 allocated bytes, 102.93% utilization
> Filter-0: shuffle-2 OPT {1}
> Filter-1: lzf-32000 OPT {1, 261, 131976}
> Type: native unsigned char
It looks like the filter is applied but not doing much. :) To be
honest I haven't tested LZF extensively with anything but multibyte
float data. It looks like this is one area where it's not effective.
Even GZIP in this case only managed a ~30% reduction in dataset size.
I think this is an area of performance which should be better measured
and/or documented.
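One quick way to measure it from Python (a sketch; assumes the dataset
is called "movie"):

    import h5py

    with h5py.File("movie.hdf5", "r") as f:
        dset = f["movie"]
        logical = dset.size * dset.dtype.itemsize  # uncompressed size in bytes
        on_disk = dset.id.get_storage_size()       # bytes actually allocated in the file
        print("%.1f%% of logical size on disk" % (100.0 * on_disk / logical))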
> Thanks for h5py. I use it in every new python program I write that
> reads and writes data. For my needs, I much prefer it over pytables,
> which was also quite nice.
You're welcome! To give proper credit, the h5py LZF filter was
inspired by the excellent LZO filter in PyTables. However, I was not
comfortable using LZO due to the licensing issue.
Andrew Collette