Hi,
We have some data we are compressing using a scheme a little like the scale-offset filter, and I'd like to ask whether anyone can recommend a better way to do this.
The data is radio-astronomy baseband data streams, so more or less Gaussian noise with a slowly-varying amplitude; we are going to feed it through some signal-processing code and then measure the amplitude. The data is natively 32-bit floating-point (output from other signal-processing routines) and accumulates at terabytes per hour. We have a script which can reduce the data rate by a factor of four with minimal impact on the data: we break the data stream into blocks, compute the RMS amplitude of each block, clip all values that are more than 5 sigma from zero, scale the block and then quantize its values to 8 bits. We record the scales in an auxiliary table so that the data can be approximately reconstructed. We're using few-second blocks (~32 MB), so the auxiliary table is of modest size. The amplitudes change somewhat within a file, but not usually by more than a factor of a few. (This kind of compression was traditionally used in radio astronomy at the initial quantization stage, often going as far as using a 3-level analog-to-digital converter to keep data rates down.)
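For concreteness, here is roughly what our script does, sketched in NumPy (block length, function names and dtypes are illustrative, not our actual code):

```python
import numpy as np

def quantize_blocks(x, block_len, nsigma=5.0):
    """Quantize a float32 stream to int8, one scale per block.

    Returns (int8 samples, per-block scales); the stream can be
    approximately reconstructed as sample * scale.
    """
    nblocks = len(x) // block_len
    q = np.empty(nblocks * block_len, dtype=np.int8)
    scales = np.empty(nblocks, dtype=np.float32)
    for i in range(nblocks):
        block = x[i * block_len:(i + 1) * block_len]
        rms = np.sqrt(np.mean(block.astype(np.float64) ** 2))
        clip = nsigma * rms            # clip at +/- 5 sigma
        scale = clip / 127.0           # map [-clip, clip] onto the int8 range
        scales[i] = scale
        clipped = np.clip(block, -clip, clip)
        q[i * block_len:(i + 1) * block_len] = \
            np.round(clipped / scale).astype(np.int8)
    return q, scales

def reconstruct(q, scales, block_len):
    """Inverse: expand each block by its recorded scale."""
    blocks = q.astype(np.float32).reshape(-1, block_len)
    return (blocks * scales[:, None]).ravel()
```

With 8 bits and 5-sigma clipping the quantization step is about 0.04 sigma, so the added noise is around 1% of the signal RMS, which is what we mean by "minimal impact".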
As it stands, the output file can actually be processed directly by simple tools, but they see only the normalized (integer) data. Sometimes this is fine; other times we care about the slow variations in amplitude that are suppressed by the normalization. In these latter cases we must convert the file back to floating point. It would be nice if the rescaling could be done on the fly.

The scale-offset filter almost does what we want: you specify a scaling, and it quantizes the scaled values to integers and stores them, packed per chunk into the smallest number of bits that will hold them. Unfortunately, you get only one scaling for the entire file, which would at least mean scanning the whole file to choose a scale, and would in any case introduce more quantization noise than necessary in the lower-amplitude portions of the file. Worse, the scaling must for some reason be a power of ten, which means the quantization noise will typically be about three, and as much as ten, times higher than it needs to be. Perhaps one could accept a space penalty and use scale-offset with a pessimistic scale, so that the quantization noise stayed below acceptable levels everywhere; gzip rather than n-bit packing might even make up the space penalty.
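As a back-of-envelope illustration of that factor (not measured on our data): if the quantization step is forced to a power of ten, the mismatch between the step you want and the step you get is a ratio somewhere between 1 and 10, and for scales spread uniformly in the log its geometric mean is sqrt(10) ≈ 3.2, which is where the "typically 3" comes from:

```python
import math

def step_mismatch(desired_step):
    """Factor between the desired quantization step and the nearest
    power of ten at or above it, i.e. how much coarser (and noisier)
    the power-of-ten step is than the step you actually wanted.
    """
    coarser = 10.0 ** math.ceil(math.log10(desired_step))
    return coarser / desired_step
```

For example, a desired step of 0.003 gets rounded to 0.01, inflating the quantization noise by a factor of about 3.3; a desired step of 0.2 gets rounded to 1, a factor of 5.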
The easiest way I can see to achieve what I want would require me to implement a custom HDF5 compression filter; for each chunk, it would compute a scale and offset, then quantize to the requested number of bits and pass the resulting integers to the n-bit filter. It's not clear to me how I would then make the resulting (presumably C code, dynamically linked) filter available to applications that wanted to use it. Complexities also arise when one considers modifying data files (which we normally don't do).
Most code that accesses our data files uses a C++ wrapper around HDF5, to which I could add code that applies the rescaling on the fly. Adding such code would be clumsy, though, and it wouldn't help applications that use h5py to access the data.
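The read-path logic itself is simple; the awkwardness is purely in where it has to live. A minimal sketch of what the wrapper would do (names hypothetical, and with a plain array standing in for the HDF5 dataset):

```python
import numpy as np

class RescalingReader:
    """Serve float32 views of int8 block-quantized data by applying
    the per-block scale on the fly. `data` stands in for the HDF5
    dataset; `scales` for the auxiliary scale table.
    """

    def __init__(self, data, scales, block_len):
        self.data = data
        self.scales = scales
        self.block_len = block_len

    def read(self, start, count):
        idx = np.arange(start, start + count)
        blocks = idx // self.block_len   # which scale applies to each sample
        raw = np.asarray(self.data[start:start + count], dtype=np.float32)
        return raw * self.scales[blocks]
```

Reimplementing the same ten lines in every wrapper and every h5py script is exactly what I'd like the file format to spare us.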
Is there a better way to get this kind of compression?
Thanks,
Anne