Huffman & Entropy Question

JasonFX

unread,

Oct 6, 2009, 12:05:25 PM10/6/09

to

Hi,

If I have an information source x bits long with an entropy of n bits,
Is it possible to mathematically calculate the maximum potential
compression ratio that Huffman Coding could yield on the information
source with respect to its entropy?

Thanks

Jason

Thomas Richter

unread,

Oct 6, 2009, 12:20:22 PM10/6/09

to

JasonFX schrieb:

What you're saying is somewhat self-contradictory. Lemme explain:

"Entropy" is something that works with respect to a model, i.e. you have
(conditioned or unconditioned) probabilities for the symbols emitted
from a source. From this, one can compute the entropy. A limited size
message allows you only to estimate the entropy by using relative
frequencies.

However, leaving this aside: Let's suppose we have a random source R
with entropy E, then one can compute how much a *typical* string from
the random source can be compressed. Depending on whether your input
string is typical or not, it may or may not be compressed to this ratio.
Clearly, if you have a proper model, typical strings are the most likely
strings generated by the source, and thus you have *usually* the
compression ratio Huffman grants you, but in the unlikely event of an
atypical string you might have no compression at all, and rather an
expansion.

So long,
Thomas

sebastian

unread,

Oct 14, 2009, 11:39:35 AM10/14/09

to

>> Clearly, if you have a proper model, typical strings are the most likely strings generated by the source, and thus you have *usually* the compression ratio Huffman grants you, but in the unlikely event of an atypical string you might have no compression at all, and rather an expansion.

Of course, expansion only occurs from to the overhead of transmitting
to model. In compressing the symbols alone, Huffman encoding
guarantees an output less than or equal to the input size.

Thomas Richter

unread,

Oct 14, 2009, 11:50:10 AM10/14/09

to

sebastian schrieb:

That wasn't my point, though. My point was rather: "NO, IT DOESN'T".

Example: use the following Huffman code:

A -> 1
B -> 01
C -> 000
D -> 001

Does this Huffman code ensure "compression"?
Not at all. If I feed it by *any* message I like to, *most* of the
messages will be expanded, so it clearly doesn't ensure it.

The reason is that this works only well if we really have a random
source with p(A) = 1/2, p(B) = 1/4, p(C)=p(D)=1/8, at least
approximately. If the source derives from that, compression will be
lower, or even an expansion.

However, *even for such a source*, messages like

DDDDDDDDDDDDDDDDDDDDDDDDDDDDDDD

are *not* impossible. Just unlikely. So you do get, at best,
"compression on average" over a huge number of input messages. You do
not get a compression "for a message generated by the random source".
You only get that for "typical messages". The above message would be
rather a-typical, but not impossible. And hence, even for a random
source that *fits* the Huffman code, Huffman doesn't *ensure*
compression. It just makes *compression likely to happen*.

That was the point I was trying to make.

People here often mix the concepts of probabilities, relative
frequencies, models, random sources, specific sequences, random
sequences, etc. and hence create an immense noise level.

So long,
Thomas

glen herrmannsfeldt

unread,

Oct 14, 2009, 1:38:38 PM10/14/09

to

Thomas Richter <th...@math.tu-berlin.de> wrote:
< sebastian schrieb:
(snip)

<> Of course, expansion only occurs from to the overhead of transmitting
<> to model. In compressing the symbols alone, Huffman encoding
<> guarantees an output less than or equal to the input size.

< That wasn't my point, though. My point was rather: "NO, IT DOESN'T".

< Example: use the following Huffman code:

< A -> 1
< B -> 01
< C -> 000
< D -> 001

< Does this Huffman code ensure "compression"?
< Not at all. If I feed it by *any* message I like to, *most* of the
< messages will be expanded, so it clearly doesn't ensure it.

As far as I know, there are two ways to use Huffman coding.
One is to use a static table based on expected statisical
distribution for the data. Group 3 FAX does that. Morse code
does that.

The other way is to generate one based on the specific data set
to be compressed. In that case, and not including the bits needed
to code the table, it would seem that output should not be greater,
though could be equal. If all symbols have equal probability
then the result will be equal.

I like the way Ultrium (LTO) tapes handle compression. There is
a bit on each tape block indicating compressed or not. When
writing, the drive compares the size of the compressed block
to the source block, writes whichever is smaller and sets the
bit accordingly. If the bit is part of the block header, which
is necessary anyway, compressed data will never be bigger.
Otherwise, you could say it is one bit bigger. I have seen
DLT increase the size by about 10% with compression on.

-- glen

Thomas Richter

unread,

Oct 15, 2009, 3:06:18 AM10/15/09

to

glen herrmannsfeldt schrieb:

> Thomas Richter <th...@math.tu-berlin.de> wrote:
> < sebastian schrieb:
> (snip)
> <> Of course, expansion only occurs from to the overhead of transmitting
> <> to model. In compressing the symbols alone, Huffman encoding
> <> guarantees an output less than or equal to the input size.
>
> < That wasn't my point, though. My point was rather: "NO, IT DOESN'T".
>
> < Example: use the following Huffman code:
>
> < A -> 1
> < B -> 01
> < C -> 000
> < D -> 001
>
> < Does this Huffman code ensure "compression"?
> < Not at all. If I feed it by *any* message I like to, *most* of the
> < messages will be expanded, so it clearly doesn't ensure it.
>
> As far as I know, there are two ways to use Huffman coding.
> One is to use a static table based on expected statisical
> distribution for the data. Group 3 FAX does that. Morse code
> does that.
>
> The other way is to generate one based on the specific data set
> to be compressed. In that case, and not including the bits needed
> to code the table, it would seem that output should not be greater,
> though could be equal. If all symbols have equal probability
> then the result will be equal.

No, and no. (-: Yes, I'm picky, and I'm of course making some fun of it.

But, to be precise:

There is only "one type of Huffman coding", and that is based on
probabilities. Where you get the probabilities from is another issue,
and not part of the code. Thus, for "static coding", you take them for
granted, hopefully matching the source, where the tables have been
derived by observing typical sources. For "dynamic coding", you create
tables by assuming the relative frequencies found in the source are
equal to the probabilities of a random source creating this string.

Note, however, that both are assumptions. The second assumption does
indeed give you the shortest possible prefix code (so, yes!) for that
specific string, but again, relative frequencies != probabilities in
general. Probability is something of a model you cannot observe
directly, relative frequency is just a matter of counting.

Back to the OP: There, we had the question what we can say about Huffman
compression if we know something about the entropy of the source.
Entropy, again, is a term computed from probabilities. Not from relative
frequencies. Mixing the two is a common error, but that's not the same
thing. Probabilities can be assumed and are part of the model. Relative
frequencies are measured on an input string. If we only know the
entropy, we still do not know anything about the string. Only how it
would typically look like, but not how it does look like.

So long,
Thomas

glen herrmannsfeldt

unread,

Oct 15, 2009, 4:38:02 AM10/15/09

to

Thomas Richter <th...@math.tu-berlin.de> wrote:
< glen herrmannsfeldt schrieb:
(snip, someone wrote)

<> < Does this Huffman code ensure "compression"?
<> < Not at all. If I feed it by *any* message I like to, *most* of the
<> < messages will be expanded, so it clearly doesn't ensure it.

<> As far as I know, there are two ways to use Huffman coding.
<> One is to use a static table based on expected statisical
<> distribution for the data. Group 3 FAX does that. Morse code
<> does that.

<> The other way is to generate one based on the specific data set
<> to be compressed. In that case, and not including the bits needed
<> to code the table, it would seem that output should not be greater,
<> though could be equal. If all symbols have equal probability
<> then the result will be equal.

< No, and no. (-: Yes, I'm picky, and I'm of course making some fun of it.

< But, to be precise:

< There is only "one type of Huffman coding", and that is based on
< probabilities. Where you get the probabilities from is another issue,
< and not part of the code.

Fortunately I said "use the huffman code", use being important.

< Thus, for "static coding", you take them for
< granted, hopefully matching the source, where the tables have been
< derived by observing typical sources. For "dynamic coding", you create
< tables by assuming the relative frequencies found in the source are
< equal to the probabilities of a random source creating this string.

< Note, however, that both are assumptions. The second assumption does
< indeed give you the shortest possible prefix code (so, yes!) for that
< specific string, but again, relative frequencies != probabilities in
< general. Probability is something of a model you cannot observe
< directly, relative frequency is just a matter of counting.

For some reason this reminds me of a current discussion in comp.dsp
on the meaning of causal. If you are compressing a whole file,
yes, just count and generate a frequency table. In a causal system,
I can only use the frequencies of previous characters to predict
the probability. That isn't required, though, for file compression.
I can read the whole file, generate the table of frequencies.
With the small assumption of a finite length file (true of all
current computer systems) I don't see any problem calling
those probabilities, but I will accept that you do.

(In the DSP case, the FFT is often used as part of a filter
algorithm, the result being non-causal, but often useful.)

< Back to the OP: There, we had the question what we can say about Huffman
< compression if we know something about the entropy of the source.
< Entropy, again, is a term computed from probabilities. Not from relative
< frequencies. Mixing the two is a common error, but that's not the same
< thing. Probabilities can be assumed and are part of the model. Relative
< frequencies are measured on an input string. If we only know the
< entropy, we still do not know anything about the string. Only how it
< would typically look like, but not how it does look like.

But if we do know the frequencies and want to generate a huffman
code from those frequencies...

I thought unix used to have a program to do huffman code file
compression but I don't see it now. It is fairly rare to have
a source with statistically independent characters, though.

-- glen

Mark Adler

unread,

Oct 15, 2009, 10:29:41 AM10/15/09

to

On 2009-10-15 01:38:02 -0700, glen herrmannsfeldt <g...@ugcs.caltech.edu> said:
> I thought unix used to have a program to do huffman code file
> compression but I don't see it now.

It was called "pack". It slowly disappeared after the introduction of
the generally more effective compress (LZW). gzip is able to unpack
compressed files created by pack, but does not compress to that format.

zlib provides the Z_HUFFMAN_ONLY compression strategy.

Mark

glen herrmannsfeldt

unread,

Oct 15, 2009, 3:07:22 PM10/15/09

to

Mark Adler <mad...@alumni.caltech.edu> wrote:
< On 2009-10-15 01:38:02 -0700, glen herrmannsfeldt <g...@ugcs.caltech.edu> said:
<> I thought unix used to have a program to do huffman code file
<> compression but I don't see it now.

< It was called "pack". It slowly disappeared after the introduction of
< the generally more effective compress (LZW). gzip is able to unpack
< compressed files created by pack, but does not compress to that format.

Yes. I thought unix utilities stayed around forever, but it
seems that pack is gone.

The man page for compress on my system says, near the end:

"Compression is generally much better than that achieved by
Huffman coding (as used in the historical command pack),
or adaptive Huffman coding (as used in the historical command
compact), and takes less time to compute."

That seems as close as we get to a description of pack and compact.

-- glen

Niels Fröhling

unread,

Oct 15, 2009, 3:28:42 PM10/15/09

to

Thomas Richter wrote:

> Back to the OP: There, we had the question what we can say about Huffman
> compression if we know something about the entropy of the source.
> Entropy, again, is a term computed from probabilities. Not from relative
> frequencies. Mixing the two is a common error, but that's not the same
> thing. Probabilities can be assumed and are part of the model. Relative
> frequencies are measured on an input string. If we only know the
> entropy, we still do not know anything about the string. Only how it
> would typically look like, but not how it does look like.

Hey I have a question which sits in the back of my head, since you start to be
"so picky" :^)

Do you actually know a single example, where you do know the real-entropy^tm
of a source? You make the impression that real-entropy^tm is a deterministic /
calculateable term, and not an extrapolation of an observation.

I would agree with that theoretically the sources real-entropy^tm is distinct
to observed entropy, but both (in general) are respective a given model, and
because of that the real-entropy^tm is of limited use (you know your observed
entropy is never absolute correct not even in respect to a specific model, and
it's only x% efficient) given the fact that it is very difficult to get exactly.

Well actually I do can imagine a source with observation==real, that is when
you say something for example can only and only create "Hamlet", not a byte
more, no variation, nothing different.
Buuut then real-entropy^tm is identical to the entropy of the coded stream
(let's say LZ, 123k, the compressed file can't give you anything else than
"Hamlet").
This of course (for me) can't be true (it means claiming universal==specific,
contradiction), and is of doubtful use, because you don't want to create a
specific model for each finite message. I'm referring to programs here, you
don't want to program a specific model for a specific single message.

That's the point of my example, either you have infinite source and you try to
model fragments of it by approximating the source's real-entropy^tm though
piecewise observation (human-brain => "Hamlet", but also others, infinite
others), or you model context-free atomic universally independent finite sources
which makes no sense whatsoever.

In the end I suspect, expect that real-entropy^tm can't be quantified in
absolutes, and the best approximation would come from KC (then we get model
independent entropy, if you don't count KC as a model as well).
And simply waiting for the computer with infinite computational abilities, we
just accept our little poor custom-model dependent entropy as liable
approximation, which in most cases because of convenience is plain order-0
entropy (of a little piece of an infinite string: that are all the other "files"
you want to compress and haven't yet).

So, I hope I'm not completely out of my mind. And to ask the question again,
maybe lost in my discourse:

Do you actually know a single example, where you do know the real-entropy^tm
of a source?

Oh, I don't allow to count artificial constructed sources according an
entropy. My emphasis is more about an equivalent to "know real-Pi^tm", which was
observed, then approximated, then modelled, then absolutely determined, and now
the verify a circle in reality though the model Pi, instead of verifying Pi as
before. Taking the number "5" and saying "this constant is '5', exactly and
always", is not the sort of answer I thought of. :)

Ciao
Niels

Thomas Richter

unread,

Oct 16, 2009, 3:22:44 AM10/16/09

to

Niels Fr�hling schrieb:

>> Back to the OP: There, we had the question what we can say about
>> Huffman compression if we know something about the entropy of the
>> source. Entropy, again, is a term computed from probabilities. Not
>> from relative frequencies. Mixing the two is a common error, but
>> that's not the same thing. Probabilities can be assumed and are part
>> of the model. Relative frequencies are measured on an input string. If
>> we only know the entropy, we still do not know anything about the
>> string. Only how it would typically look like, but not how it does
>> look like.
>
> Hey I have a question which sits in the back of my head, since you
> start to be "so picky" :^)
>
> Do you actually know a single example, where you do know the
> real-entropy^tm of a source? You make the impression that
> real-entropy^tm is a deterministic / calculateable term, and not an
> extrapolation of an observation.

That's just backwards from what I'd say. Entropy and probabilities are
nothing you "know by observation", i.e. it is not a measurable quantity.
Probabilities and entropies are *assumed*. They are part of a model. You
cannot compute them. You can assume them.

Some background: The following "definition" of probability seems plausibe:

Let "A" be an event from a set of events X, and let #A the number of
cases in an experiment where the event A was observed. And let N the
number of experiments. Then:

p(A) := \lim_{N \rightarrow \infty} #A/N

Seems plausible? Certainly. Now proof that this limit exists. Just take
a dice, and let A be the event of the dice showing "6".

--

Result is: You *can't*. There is no mathematical proof that limit
converges, and it won't necessarily converge at all. One cannot show
that or proof that.

At this point, you need to take a step back and think about what you
actually *want*. You want a mathematical model, an axiomatic system,
that describes the properties of probabilities, and from *that* axioms,
derive properties of the system by deduction, then again on firm ground.
You always take for granted that the axioms are right, but you can never
show them to be right.

Take a good(!) math book about probability theory, open the first pages,
and read what you find there: Nothing about "limits", and "relative
frequencies". Instead, you typically find a set of axioms that defines a
system called a "probability space". In some books even completely
without a motivation. "That's it, take it or leave it." This probability
space encodes the properties you expect to find for probabilities, and
you never actually *take* any limit to infinity. You assume that the
system behaves like the axioms, but you never show it - you cannot show
it. In the same sense, you cannot "proof" (mathematically) that Newton's
axioms of classical mechanics are correct. Instead, axioms are the
starting point of a mathematical analysis of the problem: "Suppose our
system would behave like that, what could we say about it?".

Sounds strange if you think about it this way, but that's how
mathematics works.

> I would agree with that theoretically the sources real-entropy^tm is
> distinct to observed entropy, but both (in general) are respective a
> given model, and because of that the real-entropy^tm is of limited use
> (you know your observed entropy is never absolute correct not even in
> respect to a specific model, and it's only x% efficient) given the fact
> that it is very difficult to get exactly.

Can you show that this limit exists? First of all, there is no limit,
since the files you look at finite (finite hard disk, finite storage,
finite world), so there is no lim -> \infty. So the whole question is
mood. There is nothing like "the entropy of a file" you can measure. You
can measure something like relative frequencies, and from that assume(!)
a probability model (say, simply p(a) = relative frequency of a, though
this is not necessarily the best), and from that compute the entropy.
But there's always one step of "assumption" in here.

> Well actually I do can imagine a source with observation==real, that is
> when you say something for example can only and only create "Hamlet",
> not a byte more, no variation, nothing different.

I can't. I can never take the limit to have a probability.

> Buuut then real-entropy^tm is identical to the entropy of the coded
> stream (let's say LZ, 123k, the compressed file can't give you anything
> else than "Hamlet").

You "forget" that you already made an assumption here.

> This of course (for me) can't be true (it means claiming
> universal==specific, contradiction), and is of doubtful use, because you
> don't want to create a specific model for each finite message. I'm
> referring to programs here, you don't want to program a specific model
> for a specific single message.

Exactly. Instead, you assume that your message is a "typical message"
generated by a larger, more generic class of models, and you assume(!)
that the relative frequencies in the file derive only little from the
postulated(!) probabilities of your model.

A typical math problem: Allow some "schizophrenia". Somehow, we know
what probabilities are. But let's do as if we forget about all this for
a moment, step back and let's only work with the axioms.

> That's the point of my example, either you have infinite source and you
> try to model fragments of it by approximating the source's
> real-entropy^tm though piecewise observation (human-brain => "Hamlet",
> but also others, infinite others), or you model context-free atomic
> universally independent finite sources which makes no sense whatsoever.

Ha! There you go! Of course. But, you never have infinite sources to
begin with. Otherwise, no matter matter what its entropy is, it wouldn't
fit on my hard disk (not in the Shanon model, at least. For the
Kolmogorov model, there are infinite sources with finite
representations, but see above...).

The point is: Even though every model is incomplete, and ill-justified
(from a mathematical p.o.v.) by incomplete induction on the observation,
the resulting *abstract model* that states properties on infinite
sequences works typically (but not always) well enough to compress the
given sequence - or a larger set of sequences.

> In the end I suspect, expect that real-entropy^tm can't be quantified
> in absolutes, and the best approximation would come from KC (then we get
> model independent entropy, if you don't count KC as a model as well).

Well, that's not "entropy", but a complexity measure, a different beast.
It comes from a different branch of mathematics, and I don't see why the
two should be equal (in general). They have some interesting properties
in common, of course.

Entropy is defined(!) on a probability space. Full stop. Nothing else.
No probability space, no entropy. Entropy is not a property of a finite
string. KC is a property of a pair of a universal Turing machine and a
finite string of symbols from an alphabet. Different input, different
things.

> And simply waiting for the computer with infinite computational
> abilities, we just accept our little poor custom-model dependent entropy
> as liable approximation, which in most cases because of convenience is
> plain order-0 entropy (of a little piece of an infinite string: that are
> all the other "files" you want to compress and haven't yet).

Again, assumptions made, you just don't spell them out explicitly:
"Assume our string would be generated by a zero order probability
source, then its entropy would be...". This is "zero-order entropy". Of
course, nobody except mathematicians (brain-damaged people like me) say
things like that, and most people somehow accept this for granted,
probably without even thinking about it. However, it *does* cause some
confusion if you don't make your words very clear, and this unfortunate
gap is then filled by the con-artists we see here over and over again.

> So, I hope I'm not completely out of my mind. And to ask the question
> again, maybe lost in my discourse:
>
> Do you actually know a single example, where you do know the
> real-entropy^tm of a source?

YES! Those models I make up! And that's all I can do, make up models!
You cannot find them, or proof them. You *make them*.
Entropy is a man-made "mock-up" term, if you like - to put it this
extreme. (Of course, overdoing again to get the point clear!). Given
zero-order i.i.d probability source with p(A) = p(B) = 0.5, *of course*
I KNOW the entropy. What I *do not know* is whether my input is
generated from such a source. I can't. I never do. I assume. I form by
analysis on the basis that it would. If it doesn't, my conclusions are mood.

I assume that you mean the latter question, not the former (i.e. what do
you mean by "know"?). And of course I misunderstood you on purpose.
That's one of my hobbies (drives some of my students nuts, I know.).

> Oh, I don't allow to count artificial constructed sources according an
> entropy. My emphasis is more about an equivalent to "know real-Pi^tm",
> which was observed, then approximated, then modelled, then absolutely
> determined, and now the verify a circle in reality though the model Pi,
> instead of verifying Pi as before. Taking the number "5" and saying
> "this constant is '5', exactly and always", is not the sort of answer I
> thought of. :)

Sure, I know what you mean. Probably not what you said to begin with,
but clearly, I understand perfectly. And you have my answer. No, I
don't. But that is what a model is: It isn't proven. It is made.

The fun part is: It nevertheless works, because my (or rather, those of
other smart people designing compressors) guesses are often "good enough
to make it work".

So long,
Thomas

PS: Yes, it seems strange indeed how mathematicians think. But again,
step back for a while - abstract models like probability are so in our
minds we no longer see that they are abstract. Here's an even simpler
example:

Ever seen a two?

I mean the number, not the digit.

glen herrmannsfeldt

unread,

Oct 16, 2009, 6:39:40 AM10/16/09

to

Thomas Richter <th...@math.tu-berlin.de> wrote:
< Niels Fr?hling schrieb:

<>> Back to the OP: There, we had the question what we can say about
<>> Huffman compression if we know something about the entropy of the
<>> source. Entropy, again, is a term computed from probabilities. Not
<>> from relative frequencies. Mixing the two is a common error, but
<>> that's not the same thing. Probabilities can be assumed and are part
<>> of the model. Relative frequencies are measured on an input string. If
<>> we only know the entropy, we still do not know anything about the
<>> string. Only how it would typically look like, but not how it does
<>> look like.

Lets get away from the math and look at actual real compression
problems. If I have a file I want to compress, already stored
in disk, I can count the number of each character in that file.
I then don't care about 'typical' but just this one file.

Many compression algorithms allow for stream compression,
where output can be generated before the end of the input
string is known. That is certainly nice, but is not always
required. I was going to look at the man page for pack and compact,
but they don't seem to exist on the systems I use.

<> Hey I have a question which sits in the back of my head, since you
<> start to be "so picky" :^)

<> Do you actually know a single example, where you do know the
<> real-entropy^tm of a source? You make the impression that
<> real-entropy^tm is a deterministic / calculateable term, and not an
<> extrapolation of an observation.

< That's just backwards from what I'd say. Entropy and probabilities are
< nothing you "know by observation", i.e. it is not a measurable quantity.
< Probabilities and entropies are *assumed*. They are part of a model. You
< cannot compute them. You can assume them.

If the requirement is 'compress file X, as stored on disk' then yes
I do know the probabilities. The system is not causal. I know
the probability of a character before I see it, but that is fine.

For stream compression, say writing data to a tape while it is
being generated, then I don't know the frequencies in advance.
I might have one buffer full, but not the whole data stream.
In that case, the algorithm has to rely on the observations of
previous data, that is, it is causal.

Some people need one type of algorithm, others can use either.

-- glen

Thomas Richter

unread,

Oct 16, 2009, 7:21:14 AM10/16/09

to

glen herrmannsfeldt schrieb:

> <>> Back to the OP: There, we had the question what we can say about
> <>> Huffman compression if we know something about the entropy of the
> <>> source. Entropy, again, is a term computed from probabilities. Not
> <>> from relative frequencies. Mixing the two is a common error, but
> <>> that's not the same thing. Probabilities can be assumed and are part
> <>> of the model. Relative frequencies are measured on an input string. If
> <>> we only know the entropy, we still do not know anything about the
> <>> string. Only how it would typically look like, but not how it does
> <>> look like.
>
> Lets get away from the math and look at actual real compression
> problems. If I have a file I want to compress, already stored
> in disk, I can count the number of each character in that file.
> I then don't care about 'typical' but just this one file.

Sure enough, you do what all other people also do: You built up a
probability model from the observed characters.

> Many compression algorithms allow for stream compression,
> where output can be generated before the end of the input
> string is known. That is certainly nice, but is not always
> required. I was going to look at the man page for pack and compact,
> but they don't seem to exist on the systems I use.

Such stream compressors are the back-end of many compression standards.
LZ and its variants are stream compressors. The MQ coder in JPEG 2000
and JBIG is a stream compressor.

Nevertheless, they all have a model that *guesses* probabilities.

> < That's just backwards from what I'd say. Entropy and probabilities are
> < nothing you "know by observation", i.e. it is not a measurable quantity.
> < Probabilities and entropies are *assumed*. They are part of a model. You
> < cannot compute them. You can assume them.
>
> If the requirement is 'compress file X, as stored on disk' then yes
> I do know the probabilities.

No, you don't. Again, you mix probability with relative frequency. A
probability is a measure on a probability space - you cannot "measure"
such a beast. You can assume - or rather choose - a model whose
probabilities are equal to the relative frequencies. But this is a
*choice*, not a *measurement*. It might not even be the best possible
choice for a model - it is rather a naive model that can be improved
considerably.

> For stream compression, say writing data to a tape while it is
> being generated, then I don't know the frequencies in advance.

There you go: "Frequency". Yes, exactly that. Not "Probability"!

> I might have one buffer full, but not the whole data stream.
> In that case, the algorithm has to rely on the observations of
> previous data, that is, it is causal.
>
> Some people need one type of algorithm, others can use either.

All true, but in neither case you "know" the probability from the file.

So lnog,
Thomas

biject

unread,

Oct 16, 2009, 10:32:28 AM10/16/09

to

On Oct 16, 5:21 am, Thomas Richter <t...@math.tu-berlin.de> wrote:
> glen herrmannsfeldt schrieb:
....

> > Lets get away from the math and look at actual real compression
> > problems. If I have a file I want to compress, already stored
> > in disk, I can count the number of each character in that file.
> > I then don't care about 'typical' but just this one file.
>
> Sure enough, you do what all other people also do: You built up a
> probability model from the observed characters.
>
> > Many compression algorithms allow for stream compression,
> > where output can be generated before the end of the input
> > string is known. That is certainly nice, but is not always
> > required. I was going to look at the man page for pack and compact,
> > but they don't seem to exist on the systems I use.
>
> Such stream compressors are the back-end of many compression standards.
> LZ and its variants are stream compressors. The MQ coder in JPEG 2000
> and JBIG is a stream compressor.
>
> Nevertheless, they all have a model that *guesses* probabilities.

If you have the whole file of concern there are no
"guesses" you can calculate the probabilities of each
symbol exactly.

However if you compress that file you may miss higher
order dependencies between the characters.

Example if file is long but the characters are only
"ABCD" if you assign two bits to each character that's
the best you can do based only on the individual
probabilities if each occurs the same number of times
in the file. However if occurs and a repeated string
of exactly ABCD then using the relative frequencies
a mistake if compression is your goal.

So when you write a compressor you have to assume
in advance if your only using the individual counts
of each character or higher order terms.

The problem with compression is the more special
cases you want to allow for the longer the final
compressed file will be since if you allow for more
cases the compressed data file has to communicate
that to the decompressor.

If you want the entropy of the data then sadly
you have to base that on the model you choose for
compression. You can fool yourself into thinking
you have a super compressor since you get close
to what you calculate as the entropy. But someone
else can come along and see relationships you mess
and get far better compression than what you think
the entropy is.

Take a given file. Look at the set of all files
made of a permutation of such a file. If you goal
is to write the best compressor based on this class
of files then you can do no better that writing
a compressor that just uses the probabilities of
each symbol. In this case an arithmetic would be
the best period if the counts not part of the file.
Note at points where it looks nice for huffman the
arithmetic exactly matches in this case where counts
not used as part of output.

When you have to include the table data then
the situations changes. It's cheaper to send info
on the tree with huffman that it is with arithmetic.

In this case there are times when the huffman
best and times when arithmetic best. However the
arithmetic to me appears more beautiful Since
for one thing to get an optimal compression you
don't need to use two passes through the file you
can compress on one pass. A proper adaptive
arithmetic is the same an equivalent a static
arithmetic with a table.

Of course you can compress with an adaptive
huffman but sadly they don't match with a static
huffman. The adaptive arithmetic would compress
all files in your special set the same length give
or take a byte. This could be called the entropy
of the files in the set. However if you use an
adaptive vitter huffman you seldom get the same
lenght for permutations of the file. Sometimes
longer sometimes shorter.

Of course I am only talking about bijective
file compressors in the above like arb255.zip

David A. Scott
--
My Crypto code
http://bijective.dogma.net/crypto/scott19u.zip
http://www.jim.com/jamesd/Kong/scott19u.zip old version
My Compression code http://bijective.dogma.net/
**TO EMAIL ME drop the roman "five" **
Disclaimer:I am in no way responsible for any of the statements
made in the above text. For all I know I might be drugged.
As a famous person once said "any cryptograhic
system is only as strong as its weakest link"

Thomas Richter

unread,

Oct 16, 2009, 12:15:07 PM10/16/09

to

biject schrieb:

> If you have the whole file of concern there are no
> "guesses" you can calculate the probabilities of each
> symbol exactly.

NO! Damnit, no! You also confuse probabilities with relative frequencies.
You can make up a model that fits to those frequencies, but these *aren't*
probabilities.

So long,
Thomas

glen herrmannsfeldt

unread,

Oct 16, 2009, 4:18:07 PM10/16/09

to

That is true if you require causality (cause comes before effect).

Without causality, how do you tell the difference between frequency
and probability? Assume that you have all the possible samples
for the given system.

In the signal processing world, one can design causal filters,
where the output only depends on the current and previous values
for a signal, or acausal ones where future values are also available.
In the case of a signal on disk, (such as a CD) one can easily
generate non-causal filters.

-- glen

biject

unread,

Oct 16, 2009, 4:41:47 PM10/16/09

to

If you bothered actually reading what I wrote. You
would have noticed I mentioned the set of files that
are all possible permutations of a given file. So for
files of that set it is like knowing the individual
probabilities.

Since any permutation is as likely as any other.
The counts would give you the probabilities for the
best arithmetic compressor for the set.

Again that doesn't mean some of the files can't be
made smaller using another method. But what ever
method you choose for the set as a whole any other
method will increase the lengths of other files.
Since a bijective arithmetic will be the best.

Niels Fröhling

unread,

Oct 16, 2009, 11:49:37 PM10/16/09

to

Thomas Richter wrote:

>> Oh, I don't allow to count artificial constructed sources according
>> an entropy. My emphasis is more about an equivalent to "know
>> real-Pi^tm", which was observed, then approximated, then modelled,
>> then absolutely determined, and now the verify a circle in reality
>> though the model Pi, instead of verifying Pi as before. Taking the
>> number "5" and saying "this constant is '5', exactly and always", is
>> not the sort of answer I thought of. :)
>
> Sure, I know what you mean. Probably not what you said to begin with,
> but clearly, I understand perfectly. And you have my answer. No, I
> don't. But that is what a model is: It isn't proven. It is made.

Hehe. :)
I thought it's of utter importance that you say what you've said. So nobody
get's the impression entropy is some god-defined measurable constant. It's a
quantifier of (one of) our specific assumptions in a specific probabilistic context.

> The fun part is: It nevertheless works, because my (or rather, those of
> other smart people designing compressors) guesses are often "good enough
> to make it work".

Yes, of course. As with everything: limit awareness is essencial to true
understanding and that includes the limits of "understandable" itself too,
ignorance leads to dogmatism.

> So long,
> Thomas
>
> PS: Yes, it seems strange indeed how mathematicians think. But again,
> step back for a while - abstract models like probability are so in our
> minds we no longer see that they are abstract. Here's an even simpler
> example:
>
> Ever seen a two?
>
> I mean the number, not the digit.

Sure, we had a nice coffee yesterday ... but then, I get also visits from
Infinity. :D

Ciao
Niels

Thomas Richter

unread,

Oct 18, 2009, 11:38:51 AM10/18/09

to

biject schrieb:

> On Oct 16, 10:15 am, Thomas Richter <t...@math.tu-berlin.de> wrote:
>> biject schrieb:
>>
>>> If you have the whole file of concern there are no
>>> "guesses" you can calculate the probabilities of each
>>> symbol exactly.
>> NO! Damnit, no! You also confuse probabilities with relative frequencies.
>> You can make up a model that fits to those frequencies, but these *aren't*
>> probabilities.
>>
>> So long,
>> Thomas
>
> If you bothered actually reading what I wrote. You
> would have noticed I mentioned the set of files that
> are all possible permutations of a given file. So for
> files of that set it is like knowing the individual
> probabilities.

Again, sorry to object, but no.

Here's a more concrete example to show you what I mean.

Consider a fair coin, and we can assume(!) that p(A) = p(B) = 0.5.
Of course, not exactly, but close to that. Now, I threw that coin four times, and I got:

B B A B

We now know everything. This is the file: B B A B.
This is what I want to compress. Does this mean that p(B) = 3/4 and p(A) = 1/4 ??

So long,
Thomas

Willem

unread,

Oct 18, 2009, 12:07:06 PM10/18/09

to

Thomas Richter wrote:
) Again, sorry to object, but no.
)
) Here's a more concrete example to show you what I mean.
)
) Consider a fair coin, and we can assume(!) that p(A) = p(B) = 0.5.
) Of course, not exactly, but close to that. Now, I threw that coin four times, and I got:
)
) B B A B
)
) We now know everything. This is the file: B B A B.
) This is what I want to compress. Does this mean that p(B) = 3/4 and p(A) = 1/4 ??

No, but it *does* mean that p'(A) = 1/4 and p'(B) = 3/4
p'(A) = The probability that a symbol in that specific file is A.
(Same for p'(B) )

SaSW, Willem
--

Disclaimer: I am in no way responsible for any of the statements
made in the above text. For all I know I might be

drugged or something..
No I'm not paranoid. You all think I'm paranoid, don't you !
#EOT

Thomas Richter

unread,

Oct 19, 2009, 2:44:16 AM10/19/09

to

Willem schrieb:

> Thomas Richter wrote:
> ) Again, sorry to object, but no.
> )
> ) Here's a more concrete example to show you what I mean.
> )
> ) Consider a fair coin, and we can assume(!) that p(A) = p(B) = 0.5.
> ) Of course, not exactly, but close to that. Now, I threw that coin four times, and I got:
> )
> ) B B A B
> )
> ) We now know everything. This is the file: B B A B.
> ) This is what I want to compress. Does this mean that p(B) = 3/4 and p(A) = 1/4 ??
>
> No, but it *does* mean that p'(A) = 1/4 and p'(B) = 3/4
> p'(A) = The probability that a symbol in that specific file is A.
> (Same for p'(B) )

No, that kind of thing is a relative frequency, not a probability. You
would have a probability if you ask for the probability of the event
getting an A or a B from the file when picking a symbol from it "at
random", i.e. you throw a dice, then pick the item at the indicated
position. However, at this point, you have again(!) assumed (and not
measured) probabilities, namely that your dice is a fair dice, i.e. the
probability of picking the symbol at position i does not depend on i and
is 1/4.

Thus, again, probabilities aren't measured, but assumed. You have a
model from the process how you pick a symbol from the file.

So long,
Thomas

sebastian

unread,

Nov 8, 2009, 11:02:50 AM11/8/09

to

I apologize for reviving an old discussion, but I just wanted to set
the record straight. The Huffman algorithm is, in fact, guaranteed to
produce an output LESS THAN or EQUAL TO the original size. As an
example (sorry, I'm not very good at formulating mathematical proofs),
if you generate a 256 byte file with every possible permutation just
once, every symbol gets assigned an 8 bit code. Here's the output of a
program given that particular input set:

Huffman Test
File: '256.txt'
...reading file...
Uncompressed size: 256 bytes
...compressing...
*** Begin Debugging Log ***
0xd7 : 10100000
0xd8 : 10100001
'b' : 10100010
'Z' : 10100011
'(' : 10100100
0xf3 : 10100101
0xa0 : 10100110
0xf5 : 10100111
',' : 10101000
0x12 : 10101001
0xc1 : 10101010
0x83 : 10101011
0x1e : 10101100
0xaf : 10101101
0x15 : 10101110
0xb7 : 10101111
'j' : 10110000
'R' : 10110001
'4' : 10110010
0xf0 : 10110011
'c' : 10110100
'Y' : 10110101
0xe0 : 10110110
0xe1 : 10110111
0xc2 : 10111000
0x87 : 10111001
0 : 10111010
0xd9 : 10111011
'!' : 10111100
0x9e : 10111101
'z' : 10111110
'B' : 10111111
0xad : 10000000
0x8 : 10000001
0x4 : 10000010
0x1 : 10000011
'v' : 10000100
'F' : 10000101
'i' : 10000110
'S' : 10000111
0xb8 : 10001000
0xd : 10001001
'8' : 10001010
0xac : 10001011
0x99 : 10001100
0xf7 : 10001101
'u' : 10001110
'G' : 10001111
'=' : 10010000
0xbe : 10010001
0x8d : 10010010
0xfa : 10010011
'p' : 10010100
'L' : 10010101
'x' : 10010110
'D' : 10010111
'2' : 10011000
0x11 : 10011001
0xda : 10011010
0xdb : 10011011
'q' : 10011100
'K' : 10011101
0x91 : 10011110
0xf9 : 10011111
0x9c : 11100000
0xdf : 11100001
0xdd : 11100010
0xde : 11100011
0x3 : 11100100
0xd0 : 11100101
'y' : 11100110
'C' : 11100111
0x10 : 11101000
0xb2 : 11101001
'a' : 11101010
'[' : 11101011
0xe5 : 11101100
0xe2 : 11101101
0x94 : 11101110
0x98 : 11101111
0xbc : 11110000
0x19 : 11110001
0x82 : 11110010
'~' : 11110011
0x7f : 11110100
0x80 : 11110101
't' : 11110110
'H' : 11110111
'#' : 11111000
0xb6 : 11111001
'_' : 11111010
'^' : 11111011
0xe6 : 11111100
0xe7 : 11111101
0xc : 11111110
0xee : 11111111
'9' : 11000000
0xa9 : 11000001
0x5 : 11000010
0x6 : 11000011
0x81 : 11000100
0xfd : 11000101
'w' : 11000110
'E' : 11000111
'?' : 11001000
0xdc : 11001001
0x84 : 11001010
0x88 : 11001011
0xcd : 11001100
0xa6 : 11001101
0xc3 : 11001110
0x8b : 11001111
'}' : 11010000
'7' : 11010001
0xab : 11010010
0xd6 : 11010011
0xb9 : 11010100
0x16 : 11010101
'e' : 11010110
'W' : 11010111
0x92 : 11011000
0x8e : 11011001
0xc8 : 11011010
0x9f : 11011011
0xa1 : 11011100
''' : 11011101
's' : 11011110
'I' : 11011111
'1' : 00100000
'-' : 00100001
0xc5 : 00100010
0x93 : 00100011
0x95 : 00100100
0xf8 : 00100101
'{' : 00100110
'A' : 00100111
0x8c : 00101000
0x90 : 00101001
'h' : 00101010
'T' : 00101011
0xe : 00101100
0xb0 : 00101101
0xd4 : 00101110
0xd5 : 00101111
0xeb : 00110000
0xe8 : 00110001
'k' : 00110010
'Q' : 00110011
'5' : 00110100
0xb3 : 00110101
0xe3 : 00110110
0xe4 : 00110111
'r' : 00111000
'J' : 00111001
'm' : 00111010
'O' : 00111011
0x20 : 00111100
0x14 : 00111101
0xcc : 00111110
'.' : 00111111
0xa8 : 00000000
0xef : 00000001
'/' : 00000010
0xb4 : 00000011
0x9 : 00000100
0xae : 00000101
'>' : 00000110
';' : 00000111
0xc0 : 00001000
0xbf : 00001001
0xca : 00001010
0xa2 : 00001011
0x1f : 00001100
']' : 00001101
'g' : 00001110
'U' : 00001111
0x17 : 00010000
0xbb : 00010001
0xff : 00010010
0xfe : 00010011
'&' : 00010100
0x13 : 00010101
0x89 : 00010110
0xfb : 00010111
0x9a : 00011000
0x96 : 00011001
'o' : 00011010
'M' : 00011011
0xb : 00011100
0x1c : 00011101
'<' : 00011110
0x1a : 00011111
0xec : 01100000
0xed : 01100001
0x2 : 01100010
0xd3 : 01100011
0xa4 : 01100100
0xf2 : 01100101
0xcf : 01100110
':' : 01100111
0xcb : 01101000
'*' : 01101001
'n' : 01101010
'N' : 01101011
'|' : 01101100
'@' : 01101101
')' : 01101110
0xb5 : 01101111
'+' : 01110000
0xa5 : 01110001
0xf : 01110010
0xb1 : 01110011
'3' : 01110100
0xa7 : 01110101
0x85 : 01110110
0xfc : 01110111
0x8a : 01111000
0x86 : 01111001
0xa3 : 01111010
'%' : 01111011
0xa : 01111100
0x7 : 01111101
0xce : 01111110
'6' : 01111111
0xc4 : 01000000
0x8f : 01000001
'd' : 01000010
'X' : 01000011
0xd1 : 01000100
0xd2 : 01000101
'$' : 01000110
0xf4 : 01000111
0xe9 : 01001000
0xea : 01001001
0x18 : 01001010
0xba : 01001011
0xc6 : 01001100
0x97 : 01001101
'`' : 01001110
'\' : 01001111
'f' : 01010000
'V' : 01010001
0x1b : 01010010
0xbd : 01010011
'0' : 01010100
0xf1 : 01010101
0x9d : 01010110
0xf6 : 01010111
0xc7 : 01011000
0x9b : 01011001
0xaa : 01011010
0x1d : 01011011
0xc9 : 01011100
'"' : 01011101
'l' : 01011110
'P' : 01011111
Model: 63.875 bytes (511 bits)
Symbols: 256 bytes
Accumulator: 4 bytes
Total overhead: 323.875 bytes
*** End of Debugging Log ***
256 bytes processed in 0.36 seconds
Compressed size: 580 bytes (2.3e+02%)
...decompressing...
580 bytes processed in 0 seconds
...verifying...
Result: pass

In fact, the algorithm automatically adjusts itself in such a way that
no matter what the frequency of any given symbol is, a minimal code is
chosen such that the size of the input is never exceeded. It's an
interesting property that sets it apart from all others (that I know
of). If you still have any doubt, just imagine a 'two-bit' machine and
visualize every possible arrangement of the Huffman tree for a given
input, which really isn't too difficult.

And just to touch on the subject of entropy, I'm going to reiterate
what Thomas has said (although somewhat differently): Entropy has no
real bearing on the 'reality' of the state of a system. It is simply a
measure derived from whatever analysis is being applied to it, eg: the
'relative' order of the system as viewed by the observer. In other
words: Entropy is not 'absolute'.

biject

unread,

Nov 8, 2009, 3:28:25 PM11/8/09

to

On Nov 8, 9:02 am, sebastian <sebastianga...@gmail.com> wrote:

> I apologize for reviving an old discussion, but I just wanted to set
> the record straight. The Huffman algorithm is, in fact, guaranteed to
> produce an output LESS THAN or EQUAL TO the original size.

That only true if you don't include the information in
the table that tells which tree your using.

In fact if you don't include the stats simple arithmetic
coding will always match or bet huffman.

The main reason for this is that several files will
compress to the same exact file. So you need to know
which tree or stats to use so the file will uncompress
to the file you want.

Its when you include this data which is needed for
a general decompressor that you assumtions fall apart.

David A. Scott
--
My Crypto code
http://bijective.dogma.net/crypto/scott19u.zip
http://www.jim.com/jamesd/Kong/scott19u.zip old version
My Compression code http://bijective.dogma.net/
**TO EMAIL ME drop the roman "five" **

Disclaimer:I am in no way responsible for any of the statements

glen herrmannsfeldt

unread,

Nov 8, 2009, 5:27:51 PM11/8/09

to

sebastian <sebasti...@gmail.com> wrote:
(snip)

> I apologize for reviving an old discussion, but I just wanted to set
> the record straight. The Huffman algorithm is, in fact, guaranteed to
> produce an output LESS THAN or EQUAL TO the original size. As an
> example (sorry, I'm not very good at formulating mathematical proofs),
> if you generate a 256 byte file with every possible permutation just
> once, every symbol gets assigned an 8 bit code.

Yes, but as was previously noted if you add the table to the
compressed file then it can get larger. That is more significant
for smaller source files.

I still like the compression system used by Ultrium tape drives.
There is one bit per block, probably in the block header, indicating
that the block is compressed. If, during writing, the drive finds
that the compressed data is larger it writes the uncompressed data
instead and indicates that using just one bit. So, compression will
never increase the size, but might decrease it.

-- glen

sebastian

unread,

Nov 9, 2009, 10:49:20 AM11/9/09

to

I was just elaborating on my earlier statement:

"Of course, expansion only occurs from to the overhead of transmitting
to model. In compressing the symbols alone, Huffman encoding
guarantees an output less than or equal to the input size."

But yes, for smaller files, even an extra 300 or so bytes for the
model itself may be too expensive. Arithemetic is probably one of the
most ideal encoding methods, but unfortunately the overhead of the
statistics can be much larger still (an adaptive approach could omit
that, but at the cost of overall compression, though I'm not sure what
this works out to in practice). Algorithms such as LZW don't even
require transmission of the model, being adaptive in nature, but in
any case neither LZW nor arithmetic guarantee the level of non-
expansion of input that Huffman does (making it ideal for limited
resource environments). Probably the best kind of compressor then
would be a combination of special-purpose encoders (in conjunction
with the Ultrium marking technique, say), to achieve the best overall
performance.

biject

unread,

Nov 9, 2009, 3:00:26 PM11/9/09

to

On Nov 9, 8:49 am, sebastian <sebastianga...@gmail.com> wrote:

...

Not sure where to start. But the nice thing
about adaptive arithmetic is that can always
compress to the same size as a static arithmetic
if coded correctly. So there is seldom a need for
a static arithmetic unless the table is known
separate from the file.

Again this precious so called no expansion of
huffman with no table. Is not as good
as the arithmetic with no table. Since one
should view huffman as a subset of arithmetic
compression and not really something as different.

Yes there are cases but mostly likely fewer
than you could think of where a static huffman
with a table beats the arithmetic with a table
but it a rare thing since even if the table data
for an arithmetic is slightly longer the better
compression often more than makes up for it.

Especially if one realizes that it easier
to make a bijective adaptive arithmetic. Than
a bijective static huffman with a table, though
it can be done its not pretty code.

nightlight

unread,

Nov 9, 2009, 3:13:39 PM11/9/09

to

On Nov 8, 5:27 pm, glen herrmannsfeldt <g...@ugcs.caltech.edu> wrote:

>
> I still like the compression system used by Ultrium tape drives.
> There is one bit per block, probably in the block header, indicating
> that the block is compressed. If, during writing, the drive finds
> that the compressed data is larger it writes the uncompressed data
> instead and indicates that using just one bit. So, compression will
> never increase the size, but might decrease it.
>

There is a missing qualifier in that conclusion "compression will
never increase the size" which is: "... beyond the 1 bit increase paid
on every block every time."

sebastian

unread,

Nov 10, 2009, 11:31:48 AM11/10/09

to

>> Not sure where to start. But the nice thing about adaptive arithmetic is that can always
compress to the same size as a static arithmetic if coded correctly.
So there is seldom a need for
a static arithmetic unless the table is known separate from the file.

Interesting. I had always thought there would be some sort of
performance hit involved in that case. Can you elaborate on how this a
bit more?

>> Again this precious so called no expansion of huffman with no table. Is not as good
as the arithmetic with no table. Since one should view huffman as a
subset of arithmetic
compression and not really something as different.

I see, so the difference isn't much. But still, in an embedded
environment, say, where you have a fixed size block to work with,
Huffman is nice since it carries a guarantee in output size, not to
mention it being a very fast algorithm. But for the general case, yes,
arithmetic is definitely superior.

>> Especially if one realizes that it easier to make a bijective adaptive arithmetic. Than
a bijective static huffman with a table, though it can be done its not
pretty code.

I'm not sure what exactly a "bijective" Huffman encoder would be, but
as far as I know, the original algorithm can't be improved upon, that
is, Huffman's original design is essentially "perfect" for the type of
compression that it performs (which was confirmed by Shannon, I
believe). At any rate, it would be interesting to hear what the
difference between standard and bijective Huffman would be.

Arithmetic coding is pretty cool, anyway, and it's definitely next on
my list of compression algorithms to implement (though I'm going to
use the "range coding" approach to avoid any patent infringement). Any
words of advice for someone about to embark on such a task?

biject

unread,

Nov 10, 2009, 12:54:21 PM11/10/09

to

On Nov 10, 9:31 am, sebastian <sebastianga...@gmail.com> wrote:
> >> Not sure where to start. But the nice thing about adaptive arithmetic is that can always
>
> compress to the same size as a static arithmetic if coded correctly.
> So there is seldom a need for
> a static arithmetic unless the table is known separate from the file.
>
> Interesting. I had always thought there would be some sort of
> performance hit involved in that case. Can you elaborate on how this a
> bit more?
>

I would suggest something like

ftp://ftp.cs.brown.edu/pub/techreports/91/cs91-03.pdf

for a start read this document one of the best
on arithmetic on net

> >> Again this precious so called no expansion of huffman with no table. Is not as good
>
> as the arithmetic with no table. Since one should view huffman as a
> subset of arithmetic
> compression and not really something as different.
>
> I see, so the difference isn't much. But still, in an embedded
> environment, say, where you have a fixed size block to work with,
> Huffman is nice since it carries a guarantee in output size, not to
> mention it being a very fast algorithm. But for the general case, yes,
> arithmetic is definitely superior.
>
> >> Especially if one realizes that it easier to make a bijective adaptive arithmetic. Than
>
> a bijective static huffman with a table, though it can be done its not
> pretty code.
>
> I'm not sure what exactly a "bijective" Huffman encoder would be, but
> as far as I know, the original algorithm can't be improved upon, that
> is, Huffman's original design is essentially "perfect" for the type of
> compression that it performs (which was confirmed by Shannon, I
> believe). At any rate, it would be interesting to hear what the
> difference between standard and bijective Huffman would be.
>

Bijective huffman is in the case of file compression
just a way of encoding it so that every file would be
a valid file that could be uncompressed and when you
huffman compress it goes back. Its a more perfect than
normal huffman because it handles how you end the file.
and makes full use of the file space. Example in normal
huffman you may end not on a byte boudary so you need to
waste space to communicate that. In bijective huffman it
does not have these kinds of problems.

> Arithmetic coding is pretty cool, anyway, and it's definitely next on
> my list of compression algorithms to implement (though I'm going to
> use the "range coding" approach to avoid any patent infringement). Any
> words of advice for someone about to embark on such a task?

To me range coding is not that difference you just update by waiting
till more than one bit is ready.

sebastian

unread,

Nov 10, 2009, 1:51:38 PM11/10/09

to

> My Compression codehttp://bijective.dogma.net/

> **TO EMAIL ME drop the roman "five" **
> Disclaimer:I am in no way responsible for any of the statements
> made in the above text. For all I know I might be drugged.
> As a famous person once said "any cryptograhic
> system is only as strong as its weakest link"

Okay, well thanks so much for the input, David. I'll be sure to look
into that PDF before I go much farther, as well. Cheers.

glen herrmannsfeldt

unread,

Nov 10, 2009, 4:06:26 PM11/10/09

to

sebastian <sebasti...@gmail.com> wrote:
(snip)

> Arithmetic coding is pretty cool, anyway, and it's definitely next on
> my list of compression algorithms to implement (though I'm going to
> use the "range coding" approach to avoid any patent infringement). Any
> words of advice for someone about to embark on such a task?

I implemented the PDF JBIG2 generic encoder not so long ago.
It is convenient that one can test it by decoding with Acrobat reader
to verify that it is working. While debugging it I first encoded
a file will of 0 bits and go that working, then a file full of 1's,
and then some more complicated patterns. Generic is pretty much
a pure arithmetic coder on the bit image.

I believe there are no patent problems with the generic encoder.
There might be for the other ones, but many have expired by now.
I actually asked IBM by fax and never got a reply.

-- glen

Thomas Richter

unread,

Nov 12, 2009, 6:01:35 PM11/12/09

to

AFAIK the MQ coder patents should be run out by now, i.e. specifically
the conditional exchange and the bitstuffing patents.

So long,
Thomas

Yann

unread,

Dec 14, 2009, 6:53:59 PM12/14/09

to

Quite funnily, i was writing some pure entropy coders as a way to
learn them while this thread was going.
This was completely unrelated, as i was not even aware that this
discussion occurs.

But then, as it seems to be pretty close to this discussion,
you can find them at this web page :
http://phantasie.tonempire.net/pc-compression-f2/entropy-coders-t97.htm#152

There is one Huffman coder, and one Range coder.
Both are semi-static (they calculate statistics and transmit tables
with compressed data).

Interestingly, Huffman coder gets better results on non-sorted
distribution, such as a text file.
That's mostly because its table is much smaller, allowing updates to
happen very often.

This is not true however for sorted distribution, something you
typically get with a serie of Match Lengthes (smaller match lengthes
are much much more likely than larger ones).
In this case, the need of Huffman coder to approximate probabilities
with 1/2^n introduces an inefficiency which is beaten by the range
coder better precision.

Regards