- Potentially >100.000 numbers
- Each number has a fixed span (I.e. can only be between -50.0 and 50.0)
My initial though is to take advantage of the fact that each number has
a span of 1001 values. This fits nicely within 12bit. The list of
numbers (each 12 bit) could then be in a byte array which is thrown
though a GZIPOutputStream only to be attached binary (or Base64 encoded
and included as an XML string element for simplicity).
Does this sound like a good idea or could I somehow improve the
throughput? Particulary I am wondering about compression schemes that
would work well on numbers rather than the very generic LZ* algorithm of
GZIP.
Thanks in advance,
Casper
Ask in comp.compression for more details, but almost all compression
schemes are based around the idea of taking advantages of expected
patterns in the output. And what patterns you'll expect is entirely domain
specific. For example, in compressing English text, you can almost always
assume that if you see a character 'q', the next character will be 'u',
but you may not be able to make such an assumption for other languages, or
compressing things other than natural-language texts.
- Oliver
other things you will need to know;
what is the distribution of these numbers? (random, smooth &
contiguous, can you describe most of the numbers in a stream using a
function)
How much do they change (variance)? Standard deviation?
Is there a rule (function) or process on how these are generated? How
many variables are involved.
Does a typical number series have a period? Do they repeat?
Does the number stream have some finite period where there is a
pattern and how often does it occur?
You will probably need some mathmatical help. Ask a physics,
statistics, math guy for some help and be prepared to compensate them
for it ($$$).
If it is a complex problem, assembler or C may be required.
>
> > My initial though is to take advantage of the fact that each number has
> > a span of 1001 values.
so you really have 1010 symbols. -50.9 to +50.9 ?
> > This fits nicely within 12bit. The list of
> > numbers (each 12 bit) could then be in a byte array which is thrown
> > though a GZIPOutputStream only to be attached binary (or Base64 encoded
> > and included as an XML string element for simplicity).
>
> > Does this sound like a good idea or could I somehow improve the
> > throughput? Particulary I am wondering about compression schemes that
> > would work well on numbers rather than the very generic LZ* algorithm of
> > GZIP.
that compression algorithm is copyrighted.
There are some variants and they are also copyrighted.
Certainly you can come up with your own, but it may hit a legal snarl
if you application ever earns any real money.
Keep good notes!
>
> Ask in comp.compression for more details, but almost all compression
> schemes are based around the idea of taking advantages of expected
> patterns in the output. And what patterns you'll expect is entirely domain
> specific. For example, in compressing English text, you can almost always
> assume that if you see a character 'q', the next character will be 'u',
> but you may not be able to make such an assumption for other languages, or
> compressing things other than natural-language texts.
>
> - Oliver- Hide quoted text -
>
I agree.
Distribution is unknown, hence not mentioned.
> How many variables are involved.
As I mentioned, potentially more then 100K.
> Does a typical number series have a period? Do they repeat?
> Does the number stream have some finite period where there is a
> pattern and how often does it occur?
As I mentioned, from -50.0 and 50.0 meaning two digits with a separator
and a decimal. A span that would fit within 10bit if the decimal is
factored away by multiplying with 10.
> You will probably need some mathmatical help. Ask a physics,
> statistics, math guy for some help and be prepared to compensate them
> for it ($$$).
This is not really the issue, I am simply asking the community for
advice during my optimization pass - which is what a newsgroup is about.
> If it is a complex problem, assembler or C may be required.
It is not more complex than stated and certainly mixing assembler with a
java web service is nasty and unnecessary in any case, the Java hotspot
is very good.
/Casper
You raise a fair point there. The use case is a rich client application
so data transfer must be be as seamless as possible without having to go
out of my way in complexity.
The protocol I loosely described in my original post amounts to some
90Kb of data for 100K values - an improvement of about 70-80 times over
the old naive web service approach.
I think I could even improve it by using Huffman rather than GZIP but
intend to prototype these things over the weekend, hence just wanted
some ideas.
Thanks,
Casper
The loose description amounted to encoding as a twelve-bit
quantity (not sure why; ten bits would suffice), packed into a
byte array in some unspecified way (two bytes per value? Less?)
and then compressed with GZIP (compression ratio unknown). I
don't know how you get from that loose description to the fairly
specific value of 90 kilobits ~= 11.3 kilobytes, and I don't see
how you can characterize the reduction from 500KB to 11.3KB as
"70-80 times" better: it seems like maybe one forty-fourth. (Or
about two elevenths, if "90Kb" actually means "90KB.")
> I think I could even improve it by using Huffman rather than GZIP but
> intend to prototype these things over the weekend, hence just wanted
> some ideas.
GZIP will almost certainly do better. Huffman encoding is
only an encoding, and does not in itself embody a probability
model, a.k.a. a "predictor" of the future stream. If you couple
it with a static model of symbol probabilities it will usually
"predict away" less of the entropy than will an adaptive method.
Simple thought experiment: Consider a data stream of one million
zero bits followed by one million ones. Each value occurs with
probability 50% so static-probability Huffman encoding will assign
one encoded bit per original bit. An adaptive method like GZIP
will "learn" the pattern of the one million zeroes and encode them
very compactly, then will be "surprised" at the transition and
lose efficiency for a while before it syncs up to the new reality.
Once it does, you'll find it once again encoding hundreds or even
thousands of original bits per compressed bit.
There are newsgroups devoted to compression topics where you
might find more information. It's been years since I followed
them, but I recall their FAQs as being highly informative.
--
Eric Sosman
eso...@acm-dot-org.invalid
The 12bit was a typo, I mean of course 10bit since 2^10 > 1001. Also, I
am talking about size and not bandwidth so forget the kbaud/kilobit talk.
The rationale behind my estimate goes as:
1) I have inspected the SOAP telegram and can see that a single number
(with namespace, element, whitespace), requires 70 bytes to transfer.
That means, ~7MB for 100.000 values.
2) The 10 bit protocol represents an improvement of a factor of 56
(70bytes * 8bit/10bit), down to roughly 121KB.
3) A simple test shows that the kind of data that would be in the buffer
can be GZIP compressed down to little more than half the original, now
roughly 67KB.
4) I need to carry it over a character based protocol, so I BASE64
encode it, adding circa 34% to the size caused by the overhead. The 67KB
goes up to 90KB.
It is not a mathematical proof, I can not vouch for how accurately this
would scale in real life but it would appear to solve the problem and
this is where I get the 70-80 factor from. If there is something wrong
with this rationale, by all means enlighten me.
> GZIP will almost certainly do better. Huffman encoding is
> only an encoding, and does not in itself embody a probability
> model, a.k.a. a "predictor" of the future stream. If you couple
> it with a static model of symbol probabilities it will usually
> "predict away" less of the entropy than will an adaptive method.
> Simple thought experiment: Consider a data stream of one million
> zero bits followed by one million ones. Each value occurs with
> probability 50% so static-probability Huffman encoding will assign
> one encoded bit per original bit. An adaptive method like GZIP
> will "learn" the pattern of the one million zeroes and encode them
> very compactly, then will be "surprised" at the transition and
> lose efficiency for a while before it syncs up to the new reality.
> Once it does, you'll find it once again encoding hundreds or even
> thousands of original bits per compressed bit.
Interesting. I hardly remember the Huffman algorithm but it was
suggested to me on com.compression (after making this initial post).
> There are newsgroups devoted to compression topics where you
> might find more information. It's been years since I followed
> them, but I recall their FAQs as being highly informative.
Yes I have already been directed there, but wavelet, furrier etc. is too
complicated for me. All I really wanted was to hear whether the Java
community knew of alternative compression mechanisms readily at hand.
Thanks,
/Casper
>
>My initial though is to take advantage of the fact that each number has
>a span of 1001 values. This fits nicely within 12bit. The list of
>numbers (each 12 bit) could then be in a byte array which is thrown
>though a GZIPOutputStream only to be attached binary (or Base64 encoded
>and included as an XML string element for simplicity).
Here is something to try. Encode the stream as deltas, the difference
from the previous number. Then turn that result loose in a number of
compression utilities including GZIP to see which squeezes best. See
http://mindprod.com/jgloss/compressionutilities.html If any of them
do much better than GZIP it might be worth your while to implement
that algorithm or to spawn that utility to do the work. see
http://mindprod.com/jgloss/exec.html
Huffman was one of a couple of suggestions you got there. I think the
Huffman recommendation was meant as a starting point from which you could
further build your own scheme.
After reading your description of the data (temperature readings over
a 24-hour period, and thus like a sine-wave, with occasional spikes), I
think an effective compression scheme would be to encode the sine wave
itself (just encode the amplitude, x-offset and y-offset; no need ot
encode frequency, as it can be assumed to be 1/24 hours), and then encode
any deviations that the real data has, as compared to this ideal sine
wave.
A simple scheme: for encoding the deviations, you'd have your
predictor such that 0 is the most likely deviation (and thus takes the
least bits to encode), and as you go further away from 0, these results
are less likely, and thus take more bits to encode.
A more complex scheme: You can take advantage of the assumption that
the temperatures will never fall outside the [-50,50] range. If your sine
wave predicts a certain value to be at 49.8, you need not worry about
encoding a deviation 0.3, because that would push you outside the allowed
range. How exactly to assign your bits to the remaining, I'll leave as an
exercise to the reader.
>> There are newsgroups devoted to compression topics where you
>> might find more information. It's been years since I followed
>> them, but I recall their FAQs as being highly informative.
>
> Yes I have already been directed there, but wavelet, furrier etc. is too
> complicated for me. All I really wanted was to hear whether the Java
> community knew of alternative compression mechanisms readily at hand.
Ah, in that case, comp.compression isn't the right place. I was under
the impression you wanted to design your own custom compression scheme.
There's no reason to suspect the regulars there are familiar with Java,
and they'd probably be more interested in the puzzle-like fun of coming up
with new compressions schemes than the boring task finding existing
libraries to suit a particular task.
- Oliver
Indeed.
> After reading your description of the data (temperature readings over
> a 24-hour period, and thus like a sine-wave, with occasional spikes), I
> think an effective compression scheme would be to encode the sine wave
> itself (just encode the amplitude, x-offset and y-offset; no need ot
> encode frequency, as it can be assumed to be 1/24 hours), and then encode
> any deviations that the real data has, as compared to this ideal sine
> wave.
>
> A simple scheme: for encoding the deviations, you'd have your
> predictor such that 0 is the most likely deviation (and thus takes the
> least bits to encode), and as you go further away from 0, these results
> are less likely, and thus take more bits to encode.
>
> A more complex scheme: You can take advantage of the assumption that
> the temperatures will never fall outside the [-50,50] range. If your sine
> wave predicts a certain value to be at 49.8, you need not worry about
> encoding a deviation 0.3, because that would push you outside the allowed
> range. How exactly to assign your bits to the remaining, I'll leave as an
> exercise to the reader.
Also very interesting, I will give this a try. :)
> Ah, in that case, comp.compression isn't the right place. I was under
> the impression you wanted to design your own custom compression scheme.
> There's no reason to suspect the regulars there are familiar with Java,
> and they'd probably be more interested in the puzzle-like fun of coming up
> with new compressions schemes than the boring task finding existing
> libraries to suit a particular task.
I am not sure myself what I want, I guess I was just sucked into the
classic tempting optimization trap. In any case, I think I have enough
to move on to actual implementation.
Thanks again,
Casper
Okay, let's take a step back. Suppose you do something
really straightforward and uncomplicated, like sending each
number as a five-byte string (sign, two digits, decimal
point, fraction digit). How much data must you transmit to
send 100K of such numbers?
Half a megabyte.
Even with a dial-up connection, less than two minutes.
Get a little fancier, and encode each of the 1001 possible
values as a two-character pair (A-Z and 0-9 gives 36 "digits,"
and 36*36 > 1001). Wow! You've saved sixty percent!
Which is 300KB.
Not quite one dial-up minute.
What I'm getting at here is that the potential gain from
exotic compression schemes seems fairly limited, unless your
circumstances are more exigent than you've let on so far.
It simply may not be worth your while to indulge in trickiness.