Hi,
We have some data we are compressing using a scheme a little like the scale-offset filter, and I'd like to ask whether anyone can recommend a better way to do this.
The data is radio-astronomy baseband data streams, so more or less Gaussian noise with a slowly-varying amplitude; we are going to feed it through some signal-processing code and then measure the amplitude. The data is natively 32-bit floating-point (output from other signal-processing routines) and accumulates at terabytes per hour. We have a script which can reduce the data rate by a factor of four with minimal impact on the data: we break the data stream into blocks, compute the RMS amplitude of each block, clip all values that are more than 5 sigma from zero, scale the block and then quantize its values to 8 bits. We record the scales in an auxiliary table so that the data can be approximately reconstructed. We're using few-second blocks (~32 MB), so the auxiliary table is of modest size. The amplitudes change somewhat within a file, but not usually by more than a factor of a few. (This kind of compression was traditionally used in radio astronomy at the initial quantization stage, often going as far as using a 3-level analog-to-digital converter to keep data rates down.)
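For concreteness, here is roughly what our script does, sketched in NumPy (block length, function names and dtypes are illustrative, not our actual code):

```python
import numpy as np

def quantize_blocks(x, block_len, nsigma=5.0):
    """Quantize a float32 stream to int8, one scale per block.

    Returns (int8 samples, per-block scales); the stream can be
    approximately reconstructed as sample * scale.
    """
    nblocks = len(x) // block_len
    q = np.empty(nblocks * block_len, dtype=np.int8)
    scales = np.empty(nblocks, dtype=np.float32)
    for i in range(nblocks):
        block = x[i * block_len:(i + 1) * block_len]
        rms = np.sqrt(np.mean(block.astype(np.float64) ** 2))
        clip = nsigma * rms            # clip at +/- 5 sigma
        scale = clip / 127.0           # map [-clip, clip] onto the int8 range
        scales[i] = scale
        clipped = np.clip(block, -clip, clip)
        q[i * block_len:(i + 1) * block_len] = \
            np.round(clipped / scale).astype(np.int8)
    return q, scales

def reconstruct(q, scales, block_len):
    """Inverse: expand each block by its recorded scale."""
    blocks = q.astype(np.float32).reshape(-1, block_len)
    return (blocks * scales[:, None]).ravel()
```

With 8 bits and 5-sigma clipping the quantization step is about 0.04 sigma, so the added noise is around 1% of the signal RMS, which is what we mean by "minimal impact".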
As it stands, the output file can actually be processed directly by simple tools, but they see only the normalized (integer) data. Sometimes this is fine; other times we care about the slow variations in amplitude that are suppressed by the normalization. In these latter cases we must convert the file back to floating point. It would be nice if the rescaling could be done on the fly.

The scale-offset filter almost does what we want: you specify a scaling, and it quantizes the scaled values to integers and stores them, packed per chunk into the smallest number of bits that will hold them. Unfortunately, you get only one scaling for the entire file, which would at least mean scanning the whole file to choose a scale, and would in any case introduce more quantization noise than necessary in the lower-amplitude portions of the file. Worse, the scaling must for some reason be a power of ten, which means the quantization noise will typically be about three, and as much as ten, times higher than it needs to be. Perhaps one could accept a space penalty and use scale-offset with a pessimistic scale, so that the quantization noise stayed below acceptable levels everywhere; gzip rather than n-bit packing might even make up the space penalty.
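As a back-of-envelope illustration of that factor (not measured on our data): if the quantization step is forced to a power of ten, the mismatch between the step you want and the step you get is a ratio somewhere between 1 and 10, and for scales spread uniformly in the log its geometric mean is sqrt(10) ≈ 3.2, which is where the "typically 3" comes from:

```python
import math

def step_mismatch(desired_step):
    """Factor between the desired quantization step and the nearest
    power of ten at or above it, i.e. how much coarser (and noisier)
    the power-of-ten step is than the step you actually wanted.
    """
    coarser = 10.0 ** math.ceil(math.log10(desired_step))
    return coarser / desired_step
```

For example, a desired step of 0.003 gets rounded to 0.01, inflating the quantization noise by a factor of about 3.3; a desired step of 0.2 gets rounded to 1, a factor of 5.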
The easiest way I can see to achieve what I want would require me to implement a custom HDF5 compression filter; for each chunk, it would compute a scale and offset, then quantize to the requested number of bits and pass the resulting integers to the n-bit filter. It's not clear to me how I would then make the resulting (presumably C code, dynamically linked) filter available to applications that wanted to use it. Complexities also arise when one considers modifying data files (which we normally don't do).
Most code that accesses our data files uses a C++ wrapper around HDF5, to which I could add code that applies the rescaling on the fly. Adding such code would be clumsy, though, and it wouldn't help applications that use h5py to access the data.
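The read-path logic itself is simple; the awkwardness is purely in where it has to live. A minimal sketch of what the wrapper would do (names hypothetical, and with a plain array standing in for the HDF5 dataset):

```python
import numpy as np

class RescalingReader:
    """Serve float32 views of int8 block-quantized data by applying
    the per-block scale on the fly. `data` stands in for the HDF5
    dataset; `scales` for the auxiliary scale table.
    """

    def __init__(self, data, scales, block_len):
        self.data = data
        self.scales = scales
        self.block_len = block_len

    def read(self, start, count):
        idx = np.arange(start, start + count)
        blocks = idx // self.block_len   # which scale applies to each sample
        raw = np.asarray(self.data[start:start + count], dtype=np.float32)
        return raw * self.scales[blocks]
```

Reimplementing the same ten lines in every wrapper and every h5py script is exactly what I'd like the file format to spare us.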
Is there a better way to get this kind of compression?
Thanks,
Anne