
huffman code length


Alex Vinokur

Jun 10, 1999, 3:00:00 AM
to
In article <uo9vhcw...@netcom2.netcom.com>,
Tom Lane <t...@netcom.com> wrote:
> mcl...@nospam.pizza.demon.co.uk (M Clarke) writes:
> > Namely what is the maximum possible code size for any given data
> > set..
> > (in this case the number of unique symbols is 256 and the data set
> > will typically be many thousands of bytes)
>
> It's theoretically possible for a Huffman code generator to
> produce the worst-case code set wherein the k'th symbol has
> code length k,
> 0
> 10
> 110
> 1110
> ...
> 111111111111111111...10
> This happens when the k'th symbol has probability 2^-k.
>
[snip]

For more details about worst-case code see
the message titled "Huffman codes and Fibonacci numbers"
in comp.compression

http://www.deja.com/getdoc.xp?AN=471802979&fmt=text

Alex


>


Sent via Deja.com http://www.deja.com/
Share what you know. Learn what you don't.

Mok-Kong Shen

Jun 10, 1999, 3:00:00 AM
to Alex Vinokur
Alex Vinokur wrote:
>

> For more details about worst-case code see
> the message titled "Huffman codes and Fibonacci numbers"
> in comp.compression
>
> http://www.deja.com/getdoc.xp?AN=471802979&fmt=text
>


Recently I posed a question elsewhere concerning Huffman encoding.
Since I don't yet have any appropriate answer, I would like to take this
opportunity to repost it here:

The Huffman encoding is such that a uniform frequency distribution
gives a balanced tree while extremely non-uniform distribution
gives a tree very remote from being balanced. For a balanced tree
of 4 symbols, for example, a frequency distribution of 0.20, 0.20,
0.21, 0.39 is possible but much more extreme distributions are not
possible.

Question: Given an arbitrary Huffman tree, how can we say something
useful in quantitative terms concerning the possible frequency
distributions that correspond to it?

M. K. Shen

Mr. X

Jun 10, 1999, 3:00:00 AM
to
Mok-Kong Shen wrote:
> Recently I posed a question elsewhere concerning Huffman encoding.
> Since I don't yet have any appropriate answer, I would like to take this
> opportunity to repost it here:

I actually saw and responded to your earlier post, but since
comp.compression.research is moderated, my post was never actually
posted. This may be because of my anti-spam return address; in any case,
I've tacked my original reply onto this thread.


> Question: Given an arbitrary Huffman tree, how can we say something
> useful in quantitative terms concerning the possible frequency
> distributions that correspond to it?


From: "Mr. X" <fa...@spam.free.net> 5/26/99 3:21 PM
Subject: Frequency distribution of Huffman encoding tree
Newsgroups: comp.compression.research

Mok-Kong Shen wrote:
>d_c...@my-dejanews.com wrote:
>>Given that Huffman has assigned a length of Hi bits to some symbol i,
>>we can say that symbol definitely had a frequency in the range of
>>
>> 2^(-Hi) <= fi <= 2^(1-Hi)
>> where
>> fi = ni/L
>>
>>For example, if Huffman assigns a length of 1 bit to some symbol 73,
>>that symbol definitely occurs with some frequency in the range
>> 2^(-1) = 1/2 <= f73 <= 2^(1-1) = 1.
>>
>> (This seems wrong for the special case of 2 symbols ...)
>
>My example given for a balanced tree of 4 symbols appears to
>contradict your formula. (Your range is [0.25, 0.5]).

That's because the formula isn't correct. :)

While you may want to believe:

2^(-BitLength-1) <= Freq <= 2^(-BitLength+1)

And this works for even distribution sets of [1], [.5], [.33] and [.25]
- unfortunately, it's not that simple, as [.1,.9] violates it.

What Huffman guarantees is that shorter codes go to equal or higher
frequency symbols, but not what those frequencies actually are. You
might have a file with 2^64-1 symbols A and just 1 symbol B. Huffman
will map these both to codes of length 1, which is 'supposed' to mean freq
.5 but wait, there's more - a file with ABC will have 3 symbols with the
exact same frequency mapped to 2 different code lengths, neither of
which actually corresponds to the original freq. This is why you can't
work backwards to get an absolute freq range. You can get relative freq
ranges, and you know they total 1, but that's it without information
from the actual encoded data.

Some possible symbol sets for code set [0,1]
[.1,.9] [.5,.5] [.9,.1]

Some possible symbol sets for code set [0,10,11]
[.33,.33,.33] [.34,.33,.32] [.8,.1,.1]
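
The many-to-one point above is easy to check by computing code lengths
directly. Below is a rough Python sketch (mine, not part of the original
post) using heapq; the last line also reproduces the worst-case "chain"
code Tom Lane described, using Fibonacci-like weights.

import heapq

def code_lengths(weights):
    # Plain Huffman: repeatedly merge the two lightest subtrees; every merge
    # adds one bit to the code of each leaf inside the merged subtrees.
    heap = [(w, i, [i]) for i, w in enumerate(weights)]
    heapq.heapify(heap)
    depth = [0] * len(weights)
    uid = len(weights)
    while len(heap) > 1:
        w1, _, a = heapq.heappop(heap)
        w2, _, b = heapq.heappop(heap)
        for leaf in a + b:
            depth[leaf] += 1
        heapq.heappush(heap, (w1 + w2, uid, a + b))
        uid += 1
    return depth

print(code_lengths([.1, .9]), code_lengths([.5, .5]))   # both [1, 1]
print(code_lengths([.8, .1, .1]))                       # [1, 2, 2]
print(code_lengths([1, 1, 2, 3, 5, 8]))                 # [5, 5, 4, 3, 2, 1] - the chain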


X

Mok-Kong Shen

Jun 11, 1999, 3:00:00 AM
to
Mr. X wrote:
>
>
> Some possible symbol sets for code set [0,1]
> [.1,.9] [.5,.5] [.9,.1]
>
> Some possible symbol sets for code set [0,10,11]
> [.33,.33,.33] [.34,.33,.32] [.8,.1,.1]

Maybe I misunderstood you, but what I think is preferable is a method
to compute, given a Huffman tree, the extreme frequency distributions
within whose range any valid distribution must lie.

M. K. Shen

Andras Erdei

Jun 11, 1999, 3:00:00 AM
to
Mok-Kong Shen <mok-ko...@stud.uni-muenchen.de> wrote:

>The Huffman encoding is such that a uniform frequency distribution
>gives a balanced tree while extremely non-uniform distribution
>gives a tree very remote from being balanced.

p0 = epsilon , p1 = 1.0 - epsilon seems to be non-uniform enough,
and still the tree ((0)(1)) seems to be "balanced" enough :)

>Question: Given an arbitrary Huffman tree, how can we say something
>useful in quantitative terms concerning the possible frequency
>distributions that correspond to it?

(1) You can say something about the frequency distribution
without using *any* information other than that you have a
frequency distribution; if you have reordered your alphabet
so that the symbol probabilities are in non-descending order

0 < P0 <= P1 <= ... <= Pn-1 <= 1

then

Pi <= 1 / ( n - i )

<<hint: sum Pi = 1 thus (n-i)*Pi <= 1>>

(2) We don't know the probabilities, just the Huffman
code-lengths. AFAIK all you can say is that a longer code
belongs to a *less or equally* probable symbol:

Hi > Hj --> pi <= pj

<<Hi is the length, pi is the probability of the i-th Huffman
code>>
<<hint: indirect proof>>

Let's reorder the alphabet (creating subsets):

H0 = H1 = ... = Hi(1) > Hi(1)+1 = .... > Hi(k) + 1 = ... = Hn-1

<<the j-th set is from Hi(j-1)+1 to Hi(j),
let p-1 = 0 and i(0) = -1; i(k+1) = n, pn = 1>>
<<warning: it does *not* follow that pi(j) < pi(j)+1,
just that pi(j) <= pi(j)+1, and we know nothing
about the order of the symbols inside a set>>

Now we can prove that for all q: i(j-1) < q <= i(j) there
exist at least q - i(j-1) symbols i(j-1) < m <= i(j) such that

(i.a) pm <= Pq <= 1 / ( n - q )

<<hint: from (1)>>

Thus in the j-th set all elements have
probability p such that

(i.b) p <= Pi(j) <= 1 / ( n - i(j) )

<<also Pi(j-1) <= p, but we have only an upper bound
on Pi(j-1), so it is not of much use>>

and

(ii) max{ pm : m <= i(j-1) } <= p

<<hint: same proof as at the beginning of (2)>>

<<spec: also pn-1 = 1 - (p0 + ... + pn-2)>>

(i.a) and (i.b) is a restricted version of (1)
for the case when you don't know the probs (only
the Huffman code-lengths), and want to give a bound
for a specific symbol. (ii) seems to be pretty
useless to me -- but I don't know what the
intent behind your question was, so maybe you can find
some use for it.

(3) I don't have the time, but I guess you
can derive a lower bound for the frequency of the
(most probable) symbol(s) based on the Fibonacci
series; my guess would be something like

pn-1 > F(k-1) / N

where N is the length of the message.
(So e.g. if you have four different code-lengths
then the most frequent symbol has to occur at least
F(3)+1=4 times.)

EXAMPLES:

>For a balanced tree
>of 4 symbols, for example, a frequency distribution of 0.20, 0.20,
>0.21, 0.39 is possible but much more extreme distributions are not
>possible.

In the example above: H0 = H1 = H2 = H3
k=1, i(1) = 3, thus (i.b) for all h=0..3: Ph < 1
(or (i.a) one symbol is in (0,1/4], two are in
(0,1/3], three are in (0,1/2] and all four are in (0,1])
which does not say much. :(

"Mr. X" <fa...@spam.free.net> wrote:
> Some possible symbol sets for code set [0,10,11]
> [.33,.33,.33] [.34,.33,.32] [.8,.1,.1]

H0 = H1 > H2
k=2, i(1)=1, i(2)=2

j=1: at least one of P0 or P1 is in (0..1/3] , (i.a)
both P0 and P1 are in (0..1/2] ; (i.b)
j=2: P2 = 1 - (P0+P1) is in [max(P0,P1)..1) (ii)
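
A small Python sketch (mine, not from the post) that computes the (i.b)
upper bound 1/(n - i(j)) for each symbol from its Huffman code length; for
the code set [0, 10, 11] it reproduces the bounds in the j=1/j=2 lines
above (the two 2-bit symbols are bounded by 1/2, the 1-bit symbol only by 1).

def upper_bounds(lengths):
    # symbols ordered longest code first (least probable first); every symbol
    # in a group of equal lengths shares the bound of the last index (i(j))
    # in that group
    order = sorted(range(len(lengths)), key=lambda s: -lengths[s])
    n = len(lengths)
    bounds = {}
    i = 0
    while i < n:
        j = i
        while j + 1 < n and lengths[order[j + 1]] == lengths[order[i]]:
            j += 1                          # j is now i(group) for this run
        for k in range(i, j + 1):
            bounds[order[k]] = 1.0 / (n - j)
        i = j + 1
    return bounds

# code set [0, 10, 11] -> lengths 1, 2, 2 (symbol 0 has the 1-bit code)
print(upper_bounds([1, 2, 2]))              # {1: 0.5, 2: 0.5, 0: 1.0}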

Br,

Andras Erdei
c...@freemail.c3.hu


Antaeus Feldspar

Jun 11, 1999, 3:00:00 AM
to
Mok-Kong Shen <mok-ko...@stud.uni-muenchen.de> wrote:

> Mr. X wrote:
> >
> >
> > Some possible symbol sets for code set [0,1]
> > [.1,.9] [.5,.5] [.9,.1]


> >
> > Some possible symbol sets for code set [0,10,11]
> > [.33,.33,.33] [.34,.33,.32] [.8,.1,.1]
>

> Maybe I misunderstood you, but what I think is preferable is a method
> to compute, given a Huffman tree, the extreme frequency distributions
> within whose range any valid distribution must lie.
>
> M. K. Shen

You mean something like: "Ah, symbols A, B, and C have codes 0,
10, and 11 respectively; therefore, if the tree was created by the
Huffman algorithm, the lowest frequency A could have is *just* above
.33..., and the highest frequency for both of B and C is .33..."?

-jc

--
* -jc IS *NOW* feld...@cryogen.com
* Home page: http://members.tripod.com/~afeldspar/index.html
* The home of >>Failed Pilots Playhouse<<
* "Better you hold me close than understand..." Thomas Dolby

Mr. X

Jun 13, 1999, 3:00:00 AM
to
Mok-Kong Shen wrote:
> Maybe I misunderstood you, but what I think is preferable is a method
> to compute, given a Huffman tree, the extreme frequency distributions
> within whose range any valid distribution must lie.

OK, let's kick the dead horse again....

Huffman lengths fit:

1 <= L(1) <= L(2) <= L(3) ... <= L(n) <= (TotalNumberOfSymbols-1)

Huffman frequencies fit:

1 >= F(1) >= F(2) >= F(3) ... >= F(n) >= (1/TotalLengthOfSource)

In addition:

F(1) >= (1/TotalNumberOfSymbols)

And:

F(2) + F(3) ... + F(n) = 1-(1/TotalNumberOfSymbols)

So yes, you can give some more constraints, but for the reasons I listed
before, in all but the most trivial of cases, the actual original
frequencies can not be calculated from the tree alone.


X

Andras Erdei

Jun 14, 1999, 3:00:00 AM
to
"Mr. X" <fa...@spam.free.net> wrote:

>Huffman lengths fit:
>
>1 <= L(1) <= L(2) <= L(3) ... <= L(n) <= (TotalNumberOfSymbols-1)
>
>Huffman frequencies fit:
>
>1 >= F(1) >= F(2) >= F(3) ... >= F(n) >= (1/TotalLengthOfSource)

If you meant that from (i) follows (ii), then counterexample:
L(1) = L(2) = 1 ; F(1) = 0.01 ; F(2)= 0.99 ;

L(i) <= L(j) gives *no* hint on the relation of F(i) and F(j).

OTOH from L(i) < L(j) follows that F(i) >= F(j).

>And:
>
>F(2) + F(3) ... + F(n) = 1-(1/TotalNumberOfSymbols)

No. e.g F(1) = F(2) = 50 / 100 and F(2) != 1 - 1/100.

> the actual original
>frequencies can not be calculated from the tree alone.

True.


Mok-Kong Shen

Jun 15, 1999, 3:00:00 AM
to
Andras Erdei wrote:
>

> p0 = epsilon , p1 = 1.0 - epsilon seems to be non-uniform enough,
> and still the tree ((0)(1)) seems to be "balanced" enough :)

The issue certainly depends on the number of nodes. As my example
shows, with four nodes the possible frequency distributions for
a balanced tree are already fairly limited. I conjecture that
for a relatively large number of nodes the possible frequency
distributions (frequencies sorted according to magnitude) for an
arbitrarily given Huffman tree are quite limited, so that some useful
information can be gained. Of course the Huffman tree can't uniquely
determine the frequencies, as the examples clearly indicate.

M. K. Shen

Alex Vinokur

Jun 15, 1999, 3:00:00 AM
to
In article <375F9FC8...@stud.uni-muenchen.de>,

Mok-Kong Shen <mok-ko...@stud.uni-muenchen.de> wrote:
> Alex Vinokur wrote:
> >
>
> > For more details about worst-case code see
> > the message titled "Huffman codes and Fibonacci numbers"
> > in comp.compression
> >
> > http://www.deja.com/getdoc.xp?AN=471802979&fmt=text
> >
>
> Recently I posed a question elsewhere concerning Huffman encoding.
> Since I don't yet have any appropriate answer, I would like to take this
> opportunity to repost it here:
>
> The Huffman encoding is such that a uniform frequency distribution
> gives a balanced tree while extremely non-uniform distribution
> gives a tree very remote from being balanced. For a balanced tree

> of 4 symbols, for example, a frequency distribution of 0.20, 0.20,
> 0.21, 0.39 is possible but much more extreme distributions are not
> possible.
>
> Question: Given an arbitrary Huffman tree, how can we say something
> useful in quantitative terms concerning the possible frequency
> distributions that correspond to it?
>
> M. K. Shen
>

Hi,

Here is something concerning weights distribution in Huffman trees.

Thanks in advance,
Alex

=================================================================

Let W = {w1, w2, ..., w#n} be
a non-decreasing sequence of positive numbers:
w1 <= w2 <= w3 <= ... <= w#n.
The W sequence is called normalized
if w1 + w2 + ... + w#n = 1.
The W sequence of integers is called 100-normalized
if w1 + w2 + ... + w#n = 100.

Let T be a Huffman tree for the W sequence.

          *                    Level#0
         / \
        /   \
       *     *                 Level#1
    ............

   *   *   *   *  ...  *       Level#k

    ............

   *   *   *  ...  *           Level#n (Last)


(I think) The following conditions must hold for a Huffman tree:


---------------------------------------------
A-conditions : for each level of Huffman tree
---------------------------------------------
Let a1, a2, a3, ..., a#m be nodes of Level#k
(a1 is the leftmost node on this level,
a#m is the rightmost node on this level).
---> A-Conditions are : a1 <= a2 <= a3 <= ... <= a#m.
---------------------------------------------


---------------------------------------------
B-conditions : for each level of Huffman tree
---------------------------------------------
Let Q be such a piece of the Huffman tree :

* Level#(k-2)
/\
/ \
* * Level#(k-1)
/\ /\
/ \ / \
a b c d Level#k

---> B-Condition is : (a + b) >= d.
---------------------------------------------
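
For experimenting with these conditions, here is a rough Python checker
(my sketch, not from the post); a tree is written as nested pairs, so the
100-normalized sequence of Example 1 below becomes ((20, 20), (21, 39)).

def weight(node):
    return node if not isinstance(node, tuple) else weight(node[0]) + weight(node[1])

def levels(node, depth=0, acc=None):
    # collect subtree weights level by level, left to right
    if acc is None:
        acc = []
    while len(acc) <= depth:
        acc.append([])
    acc[depth].append(weight(node))
    if isinstance(node, tuple):
        levels(node[0], depth + 1, acc)
        levels(node[1], depth + 1, acc)
    return acc

def a_conditions_hold(tree):
    return all(level == sorted(level) for level in levels(tree))

def b_conditions_hold(tree):
    # for every node whose two children are both internal:
    # (a + b) >= d, with a, b the left child's children and d the
    # right child's right child
    if not isinstance(tree, tuple):
        return True
    left, right = tree
    ok = True
    if isinstance(left, tuple) and isinstance(right, tuple):
        a, b = left
        _, d = right
        ok = (weight(a) + weight(b)) >= weight(d)
    return ok and b_conditions_hold(left) and b_conditions_hold(right)

t1 = ((20, 20), (21, 39))      # Example 1 below, 100-normalized
print(a_conditions_hold(t1), b_conditions_hold(t1))   # True True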


=================================================================

-----------------------------------------------------------------
----------------- Example 1 (BEGIN) -----------------------------
----------------- Balanced Huffman binary tree ------------------
----------------- (4 terminal nodes) ----------------------------
-----------------------------------------------------------------

Weights : w1, w2, w3, w4 w1 <= w2 <= w3 <= w4


        s3             Level#0
       /  \
      /    \
    s1      s2         Level#1
    /\      /\
   /  \    /  \
  w1  w2  w3  w4       Level#2


Tree 1 - Balanced binary Huffman tree
s1 = w1 + w2
s2 = w3 + w4
s3 = s1 + s2 = w1 + w2 + w3 + w4

-------------------------------

(Non-trivial) A-Condition is :
Level#1 :
w1 + w2 <= w3 + w4 (s1 <= s2)

B-Conditions :
Level#2 :
w1 + w2 >= w4

-------------------------------
Example of sequences for this Huffman tree:
10, 10, 10, 19 (non-normalized sequence)
20, 20, 21, 39 (100-normalized sequence)
24, 25, 25, 26 (100-normalized sequence)


-----------------------------------------------------------------
----------------- Example 1 (END) -------------------------------
----------------- Balanced Huffman binary tree ------------------
----------------- (4 terminal nodes) ----------------------------
-----------------------------------------------------------------

-----------------------------------------------------------------
----------------- Example 2 (BEGIN) -----------------------------
----------------- Left-sided Huffman binary tree ----------------
----------------- (4 terminal nodes) ----------------------------
-----------------------------------------------------------------

Weights : w1, w2, w3, w4 w1 <= w2 <= w3 <= w4


          s3           Level#0
         /  \
        /    \
      s2      w4       Level#1
      /\
     /  \
   s1    w3            Level#2
   /\
  /  \
 w1   w2               Level#3 (last)

Tree 2 - Left-sided binary Huffman tree
s1 = w1 + w2
s2 = s1 + w3 = w1 + w2 + w3
s3 = s2 + w4 = w1 + w2 + w3 + w4

-------------------------------

(Non-trivial) A-Conditions are :
Level#2 :
w1 + w2 <= w3 (s1 <= w3)
Level#1 :
w1 + w2 + w3 <= w4 (s2 <= w4)

B-Conditions : None

-------------------------------
Example of sequences for this Huffman tree:
10, 10, 21, 42 (non-normalized sequence)
12, 12, 25, 51 (100-normalized sequence)


-----------------------------------------------------------------
----------------- Example 2 (END) -------------------------------
----------------- Left-sided Huffman binary tree ----------------
----------------- (4 terminal nodes) ----------------------------
-----------------------------------------------------------------

-----------------------------------------------------------------
----------------- Example 3 (BEGIN) -----------------------------
----------------- Huffman binary tree ---------------------------
----------------- (4 terminal nodes) ----------------------------
-----------------------------------------------------------------

Weights : w1, w2, w3, w4 w1 <= w2 <= w3 <= w4


          s3           Level#0
         /  \
        /    \
      s2      w4       Level#1
      /\
     /  \
   w3    s1            Level#2
         /\
        /  \
      w1    w2         Level#3 (last)

Tree 3 - binary Huffman tree
s1 = w1 + w2
s2 = s1 + w3 = w1 + w2 + w3
s3 = s2 + w4 = w1 + w2 + w3 + w4

-------------------------------

(Non-trivial) A-Conditions are :
Level#2 :
w3 <= w1 + w2 (w3 <= s1)
Level#1 :
w1 + w2 + w3 <= w4 (s2 <= w4)

B-Conditions : None

-------------------------------
Example of sequences for this Huffman tree:
10, 10, 19, 40 (non-normalized sequence)
12, 13, 24, 51 (100-normalized sequence)
16, 16, 17, 51 (100-normalized sequence)


-----------------------------------------------------------------
----------------- Example 3 (END) -------------------------------
----------------- Huffman binary tree ---------------------------
----------------- (4 terminal nodes) ----------------------------
-----------------------------------------------------------------


-----------------------------------------------------------------
----------------- Example 4 (BEGIN) -----------------------------
----------------- Huffman binary tree ---------------------------
----------------- (4 terminal nodes) ----------------------------
-----------------------------------------------------------------

Weights : w1, w2, w3, w4 w1 <= w2 <= w3 <= w4


      s3               Level#0
     /  \
    /    \
  w4      s2           Level#1
          /\
         /  \
       s1    w3        Level#2
       /\
      /  \
    w1    w2           Level#3 (last)

Tree 4 - binary Huffman tree
s1 = w1 + w2
s2 = s1 + w3 = w1 + w2 + w3
s3 = s2 + w4 = w1 + w2 + w3 + w4

-------------------------------

(Non-trivial) A-Conditions are :
Level#2 :
w1 + w2 <= w3 (s1 <= w3)
Level#1 :
w4 <= w1 + w2 + w3 (w4 <= s2)

B-Conditions : None

-------------------------------
Example of sequences for this Huffman tree:
10, 10, 21, 40 (non-normalized sequence)
12, 13, 26, 49 (100-normalized sequence)
16, 16, 33, 35 (100-normalized sequence)


-----------------------------------------------------------------
----------------- Example 4 (END) -------------------------------
----------------- Huffman binary tree ---------------------------
----------------- (4 terminal nodes) ----------------------------
-----------------------------------------------------------------


-----------------------------------------------------------------
----------------- Example 5 (BEGIN) -----------------------------
----------------- Right-sided Huffman binary tree ---------------
----------------- (4 terminal nodes) ----------------------------
-----------------------------------------------------------------

Weights : w1, w2, w3, w4 w1 <= w2 <= w3 <= w4


      s3               Level#0
     /  \
    /    \
  w4      s2           Level#1
          /\
         /  \
       w3    s1        Level#2
             /\
            /  \
          w1    w2     Level#3 (last)

Tree 5 - Right-sided binary Huffman tree
s1 = w1 + w2
s2 = s1 + w3 = w1 + w2 + w3
s3 = s2 + w4 = w1 + w2 + w3 + w4

-------------------------------

(Non-trivial) A-Conditions are :
Level#2 :
w3 <= w1 + w2 (w3 <= s1)
Level#1 :
w4 <= w1 + w2 + w3 (w4 <= s2)

B-Conditions : None

-------------------------------
Example of sequences for this Huffman tree:
10, 10, 19, 38 (non-normalized sequence)
13, 13, 25, 49 (100-normalized sequence)
16, 17, 32, 35 (100-normalized sequence)


-----------------------------------------------------------------
----------------- Example 5 (END) -------------------------------
----------------- Right-sided Huffman binary tree ---------------
----------------- (4 terminal nodes) ----------------------------
-----------------------------------------------------------------

-----------------------------------------------------------------
----------------- Example 6 (BEGIN) -----------------------------
----------------- Huffman binary tree ---------------------------
----------------- (9 terminal nodes) ----------------------------
-----------------------------------------------------------------

Weights : w1 <= w2 <= w3 <= w4 <= w5 <= w6 <= w7 <= w8 <= w9

                        s8                      Level#0
                      /    \
                    /        \
                  /            \
                s6              s7              Level#1
               /  \            /  \
             w8    s4        s5    w9           Level#2
                  /  \      /  \
                s2    s3  w6    w7              Level#3
               /  \   /\
             w3   w4 s1  w5                     Level#4
                     /\
                   w1  w2                       Level#5


Tree 6 - binary Huffman tree
s1 = w1 + w2
s2 = w3 + w4
s3 = s1 + w5 = w1 + w2 + w5
s4 = s2 + s3 = w1 + w2 + w3 + w4 + w5
s5 = w6 + w7
s6 = w8 + s4 = w1 + w2 + w3 + w4 + w5 + w8
s7 = s5 + w9 = w6 + w7 + w9
s8 = s6 + s7 = w1 + w2 + w3 + w4 + w5 + w6 + w7 + w8 + w9

-------------------------------

(Non-trivial) A-Condition is :
Level#4 :
w4 <= w1 + w2 (w4 <= s1)
w1 + w2 <= w5 (s1 <= w5)
Level#3 :
w3 + w4 <= w1 + w2 + w5 (s2 <= s3)
w1 + w2 + w5 <= w6 (s3 <= w6)
Level#2 :
w8 <= w1 + w2 + w3 + w4 + w5 (w8 <= s4)
w1 + w2 + w3 + w4 + w5 <= w6 + w7 (s4 <= s5)
w6 + w7 <= w9 (s5 <= w9)
Level#1 :
w1 + w2 + w3 + w4 + w5 + w8 <= w6 + w7 + w9 (s6 <= s7)

B-Conditions :
Level#4 :
w3 + w4 >= w5
Level#3 :
None
Level#2 :
w1 + w2 + w3 + w4 + w5 + w8 >= w9 (w8 + s4 >= w9)
Level#1 :
None


-------------------------------
Example of sequences for this Huffman tree:
10, 10, 10, 10, 19, 40, 40, 58, 81 (non-normalized sequence)
3, 3, 4, 4, 7, 14, 14, 20, 31 (100-normalized sequence)


-----------------------------------------------------------------
----------------- Example 6 (END) -------------------------------
----------------- Huffman binary tree ---------------------------
----------------- (9 terminal nodes) ----------------------------
-----------------------------------------------------------------

Mr. X

Jun 16, 1999, 3:00:00 AM
to
Andras Erdei wrote:
> "Mr. X" <fa...@spam.free.net> wrote:
> >Huffman lengths fit:
> >1 <= L(1) <= L(2) <= L(3) ... <= L(n) <= (TotalNumberOfSymbols-1)
> >Huffman frequencies fit:
> >1 >= F(1) >= F(2) >= F(3) ... >= F(n) >= (1/TotalLengthOfSource)
>
> If you meant that from (i) follows (ii), then counterexample:
> L(1) = L(2) = 1 ; F(1) = 0.01 ; F(2)= 0.99 ;

No, I did not say that. I said the reverse. (i) cannot be resolved
back into (ii) in all but the most trivial of cases. I have already
stated this in a couple of ways. I can only assume that you
misunderstood my post, because you obviously agree with my earlier
post. Just on Thursday in this very thread I gave [.9,.1] as an example
to somebody as to why they can't derive (ii) from (i).


> L(i) <= L(j) gives *no* hint on the relation of F(i) and F(j).
>
> OTOH from L(i) < L(j) follows that F(i) >= F(j).

Which is what I said several days ago in this very thread.


> >F(2) + F(3) ... + F(n) = 1-(1/TotalNumberOfSymbols)
>
> No. e.g F(1) = F(2) = 50 / 100 and F(2) != 1 - 1/100.

There is a misunderstanding here again, but I did make an error in
leaving out a "<". I wrote:

F(1) >= (1/TotalNumberOfSymbols)


And:
F(2) + F(3) ... + F(n) = 1-(1/TotalNumberOfSymbols)

Which obviously *should* have been

F(1) >= (1/TotalNumberOfSymbols)


And:
F(2) + F(3) ... + F(n) <= 1-(1/TotalNumberOfSymbols)


As for the misunderstanding:

TotalNumberOfSymbols is not TotalLengthOfSource

If F(1)=50/100=.5
And F(2)=50/100=.5
Then TotalNumberOfSymbols = 2

And it is most certainly true that:

F(2) <= 1-(1/2)

And given the set [.8,.1,.1]

F(2)+F(3) <= 1-(1/3)

This is true because F(1) can never be less than 1/TotalNumberOfSymbols
because F is sorted. If F(1) was less, then something else would have
to be higher for the total to = 1 and then F wouldn't be sorted.

We can say that *all* F(x) >= 1/TotalLengthOfSource because the only way
for it to be less is if it didn't appear at all in the Source, and if it
didn't, it wouldn't get a symbol and wouldn't be in F.
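
A tiny Python check (mine, not from the post) of the two constraints argued
above, on an arbitrary example string:

from collections import Counter

source = "ebeaabeb"                          # any source file would do
counts = Counter(source)
freqs = sorted((c / len(source) for c in counts.values()), reverse=True)
assert freqs[0] >= 1 / len(counts)               # F(1) >= 1/TotalNumberOfSymbols
assert all(f >= 1 / len(source) for f in freqs)  # F(x) >= 1/TotalLengthOfSource
print(freqs)                                 # [0.375, 0.375, 0.25]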


> > the actual original
> >frequencies can not be calculated from the tree alone.
>
> True.

Which is the whole point of this.

I think we agree, but there is a language problem between us.

X

Mok-Kong Shen

Jun 17, 1999, 3:00:00 AM
to Alex Vinokur
Alex Vinokur wrote:

For a Huffman tree that is built for a W-sequence of your definition,
your A-Condition holds (by definition, if the nodes are
ordered that way). But how about an arbitrarily given Huffman tree?

M. K. Shen



Alex Vinokur

Jun 22, 1999, 3:00:00 AM
to
In article <3768E873...@stud.uni-muenchen.de>,
> For a Huffman tree that is for a W-sequence of your definition
> than your A-Condition holds (via definition if the nodes are
> ordered that way). But how about an arbitrarily given Huffman tree?
>
> M. K. Shen
>
> M. K. Shen
>

Hi,

Let W be the following sequence : 10 11 15 20

Here are Huffman algorithm stages :

(I think) If the sequences are sorted on each stage
of the Huffman algorithm
then the A-Conditions hold.

=======================================================
Example 1.
=========
Stage#0 : 10 11 15 20
Stage#1 : 15 20 (21)
Stage#2 : (21) (35)
Stage#3 : (56)
Note! The sequences are sorted on each stage.
So, the A-Conditions hold for this tree.

        56
       /  \
      /    \
    21      35
    /\      /\
   /  \    /  \
  10  11  15  20
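
A short Python sketch (mine, not from the post) of the stages listed above:
keep the working sequence sorted after every merge and print it at each stage.

def huffman_stages(weights):
    queue = sorted(weights)
    stage = 0
    while len(queue) > 1:
        print("Stage#%d :" % stage, queue)
        merged = queue[0] + queue[1]          # merge the two smallest weights
        queue = sorted(queue[2:] + [merged])  # re-sort after each merge
        stage += 1
    print("Stage#%d :" % stage, queue)

huffman_stages([10, 11, 15, 20])
# Stage#0 : [10, 11, 15, 20]
# Stage#1 : [15, 20, 21]
# Stage#2 : [21, 35]
# Stage#3 : [56]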

###############################################
We can "desort" first two weight
on some (or all) stages of Huffman algorithm.
In this case A-conditions don't take place for Huffman tree.
###############################################
=======================================================
Example 2.
=========
Stage#0 : 11 10 15 20 The sequence isn't sorted
Stage#1 : 20 15 (21) The sequence isn't sorted
Stage#2 : (21) (35) The sequence is sorted
Stage#3 : (56)
Note! There are "desorted" sequences.
So, the A-Conditions don't hold for this tree.

        56
       /  \
      /    \
    21      35
    /\      /\
   /  \    /  \
  11  10  20  15


###############################################
(I think) If we use the Huffman algorithm with "desorting",
apparently we need to use the weight permutations
on each level to build conditions like the A-Conditions.

By the way, is there any reason to use the Huffman algorithm with
"desorting"?

Alex

Atsushi Marui

Jun 22, 1999, 3:00:00 AM
to
Alex Vinokur <alexande...@telrad.co.il> writes:

Hi,


100-normalized and sorted in the first place.


        100
        / \
       /   \
     51     \     <-- but not sorted at this level
     /\      \
    3  \      \
   /\   \      \
  /  \   \      \
 1    2   48     49


Will this be the failing case? or am I missing the point?

--
MARUI Atsushi
___________________________________________________________
Computer Science & Engineering / PennState Univ
mailto: ma...@cse.psu.edu
http://www.cse.psu.edu/~marui

SCOTT19U.ZIP_GUY

Jun 22, 1999, 3:00:00 AM
to
In article <7ko7fi$ajm$1...@nnrp1.deja.com>, Alex Vinokur <alexande...@telrad.co.il> wrote:
>In article <3768E873...@stud.uni-muenchen.de>,
>> For a Huffman tree that is for a W-sequence of your definition
>> than your A-Condition holds (via definition if the nodes are
>> ordered that way). But how about an arbitrarily given Huffman tree?
>>
>> M. K. Shen
>>
>> M. K. Shen
>>
>
>By the way, is there any reason to use Huffman algorithm with
>"desorting"?
>

Yes, I think there are reasons to use Huffman trees that are not quite sorted
the way it usually is done. The main reason would be to create a headerless
dynamic Huffman compression. Since the last byte compressed seldom
ends up on a byte boundary, something special must be done. See what I
did at my web page on compression.

David A. Scott
--
SCOTT19U.ZIP NOW AVAILABLE WORLD WIDE
http://www.jim.com/jamesd/Kong/scott19u.zip
http://members.xoom.com/ecil/index.htm
NOTE EMAIL address is for SPAMERS

Alex Vinokur

Jun 23, 1999, 3:00:00 AM
to
In article <vzdpv2o...@lawrence.cse.psu.edu>,
Atsushi Marui <ma...@lawrence.cse.psu.edu> wrote:
> Hi,
>
> 100-normalized and sorted at the first place.
>
>         100
>         / \
>        /   \
>      51     \     <-- but not sorted at this level
>      /\      \
>     3  \      \
>    /\   \      \
>   /  \   \      \
>  1    2   48     49
>
> Will this be the failing case? or am I missing the point?
>
> --
> MARUI Atsushi
> ___________________________________________________________
> Computer Science & Engineering / PennState Univ
> mailto: ma...@cse.psu.edu
> http://www.cse.psu.edu/~marui
>

Hi,

Here are Huffman algorithm stages for sequence 1, 2, 48, 49.

=======================================================
L means left arc of binary tree, R means right arc.
=======================================================

----------------------------
Stage#0 : 1 2 48 49
          L R

   3
  / \
 1   2

----------------------------
Stage#1 : (3) 48 49
           L  R

     51
    /  \
   3    48
  / \
 1   2

----------------------------

Stage#2 : 49 (51)
          L   R

      100
     /   \
   49     51
         /  \
        3    48
       / \
      1   2

=======================================================

Regards,
Alex

Mok-Kong Shen

Jun 23, 1999, 3:00:00 AM
to Alex Vinokur
Alex Vinokur wrote:
>

> ----------------------------
>
> Stage#2 : 49 (51)
>           L   R
>
>       100
>      /   \
>    49     51
>          /  \
>         3    48
>        / \
>       1   2
>

At each level the nodes are sorted. But the terminal nodes are
not sorted. Would this matter for you in proceeding further to obtain
the range of variation of the frequency distributions of (terminal)
nodes where the frequencies are sorted (the problem I originally
asked)?

M. K. Shen

Atsushi Marui

Jun 23, 1999, 3:00:00 AM
to
Alex Vinokur <alexande...@telrad.co.il> writes:

> > Hi,
> >

>      51
>     /  \
>    3    48
>   / \
>  1   2
>
> ----------------------------
>
> Stage#2 : 49 (51)
>           L   R
>
>       100
>      /   \
>    49     51
>          /  \
>         3    48
>        / \
>       1   2
>

> =======================================================

Ah, I got it. It makes sense. Thank you!

Alex Vinokur

Jun 24, 1999, 3:00:00 AM
to
In article <3770DD09...@stud.uni-muenchen.de>,

Mok-Kong Shen <mok-ko...@stud.uni-muenchen.de> wrote:
> Alex Vinokur wrote:
> >
>
> > ----------------------------
> >
> > Stage#2 : 49 (51)
> >           L   R
> >
> >       100
> >      /   \
> >    49     51
> >          /  \
> >         3    48
> >        / \
> >       1   2
> >
>
> At each level the nodes are sorted. But the terminal nodes are
> not sorted.

The terminal nodes are sorted on stage#0 of the Huffman algorithm. As far as
the terminal nodes are concerned, that is enough. See the example in my message:
http://x21.deja.com/getdoc.xp?AN=492905747&CONTEXT=930200759.400228377&hitnum=3
http://www.deja.com/getdoc.xp?AN=492905747&fmt=text

> Would this matter for you in proceeding further to obtain
> the range of variation of the frequency distributions of (terminal)
> nodes where the frequncies are sorted (the problem I originally
> asked)?
>

A-conditions and B-conditions require that the (internal and terminal) nodes
be sorted at each level. See my original message:
http://x29.deja.com/getdoc.xp?AN=489813355&CONTEXT=930200784.2102591510&hitnum=15
http://www.deja.com/getdoc.xp?AN=489813355&fmt=text
A-conditions and B-conditions require no other sorting. So (I think) we can
obtain the range of variation of the frequency distributions of terminal nodes
(the very interesting problem you originally asked). But (I am sure) this
problem requires additional research.

> M. K. Shen
>

Regards,
Alex

d.c...@ieee.org

Jul 15, 1999, 3:00:00 AM
to
I've been thinking about a different paradigm to "explain" the Huffman
algorithm. This one has no trees. And it creates an easy-to-understand
concept of the "canonical" Huffman code.

Starting with the file you want to compress, count how many times each
symbol occurs (histogram). Then sort the file,
not in alphabetical order, but in order of least-frequently-occurring-
symbol first.

For example, the file "ebeaabeb" is sorted

000 a
001 a
010 b
011 b
100 b
101 e
110 e
111 e

Then we could encode the original file by just transmitting the offset
of each letter in this sorted file. This leads to 3 bits/symbol
encoding. (I think that arithmetic coding is very similar to this idea).
However, notice that all the symbols in this sorted file that start
with "00" all lead to `a'. This means that we could encode all `a's in
the original file with only those 2 bits, leading to 2 bits for a (00),
2 bits for the b (01) and 2 bits for the e (11). Note that the "10"
code is never used, since it only codes for b and e, which are already
covered.
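
A tiny Python sketch (mine, not from the post) of the construction just
described: sort the file by least-frequent symbol first and read off each
character's 3-bit offset into that sorted file.

from collections import Counter

data = "ebeaabeb"
counts = Counter(data)
sorted_file = "".join(sorted(data, key=lambda ch: (counts[ch], ch)))
print(sorted_file)                       # aabbbeee
offset = {ch: format(sorted_file.index(ch), "03b") for ch in counts}
print(offset)                            # {'e': '101', 'b': '010', 'a': '000'}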

The canonical Huffman algorithm works its magic on this sorted file,
adding and deleting a few letters, creating

000 a
001 a
010 b
011 b // deleted one `b' !
100 e
101 e
110 e
111 e // added 1 `e' !

The length of a Huffman sorted file is always a power of 2.
Each letter occurs exactly some power-of-2
times; letters that occur a smaller power-of-2 number of times come first. In
a "canonical" Huffman sorted file, letters that occur the same power-of-
2 times are sorted in alphabetical order. This means that given that
power-of-2 ("the code length") of each symbol, we can easily regenerate
this canonical Huffman sorted file.

As before, we could encode the original file by encoding each character
by its offset in this sorted file. This leads to 3 bits per symbol.
At decoding, we take the code lengths of each symbol, and regenerate
the canonical sorted file. Then we decode by looking up bits in the
compressed bit stream as offsets to this file. Note that if the first
codeword you're starting to decode starts with "1", you already know
that must have been a `e' in the original file. So the remaining bits
in the offset don't need to be stored in the file. This means the
canonical Huffman code for this file is 2 bits for a (00), 2 bits for b
(01), and 1 bit for e (1).
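
As a concrete illustration, here is a rough Python sketch (mine, not part of
the post) of rebuilding the codes from the bit lengths alone, in the
convention used here (the all-zeros code is among the longest, the all-ones
code is the shortest); it reproduces a -> 00, b -> 01, e -> 1.

def canonical_codes(bit_lengths):
    # bit_lengths: {symbol: code length}; longest codes first, ties broken
    # alphabetically, consecutive codes counting up from all zeros
    items = sorted(bit_lengths.items(), key=lambda kv: (-kv[1], kv[0]))
    codes = {}
    code = 0
    prev_len = items[0][1]
    for sym, length in items:
        code >>= (prev_len - length)     # moving to a shorter length: drop bits
        codes[sym] = format(code, "0%db" % length)
        code += 1
        prev_len = length
    return codes

print(canonical_codes({"a": 2, "b": 2, "e": 1}))   # {'a': '00', 'b': '01', 'e': '1'}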

This immediately leads to all the standard properties for canonical
Huffman codes, including "the all-zeros code is the longest code", "the
all-ones code is the shortest code", etc.

In a few initial data files (for example, aabbeeee), letters are
already in a power-of-2 frequency distribution, and the total file
length is already a power-of-2. The Huffman "magic" never makes any
changes to these files. For those files,
Lo = Ls
bl(a) = log2(Lo/f(a)), so
f(a) = Lo / 2^(bl(a)).
where
bl(a) = bit length of the symbol `a'
f(a) = how many times 'a' occurs in the original source file
Ls = length of this hypothetical sorted file modified by Huffman,
Lo = length of original file.
Any possible bitlength distribution *could* have been generated by a
file of this type.

For other initial data files (for example, aabeeeb), Huffman only adds
letters (to round their frequencies up to the next power of 2), never
subtracts letters. (Is this right ? Would Huffman ever round up to more
than the next power of 2 ?)
For these files,
Lo < Ls
bl(a) = log2( Ls/f(a) )
bl(b) = floor(log2( Ls/f(b) ))
where
`a' is any letter that was unchanged
`b' is any letter that had extra copies added.
Since
x-1 < floor(x) <= x (Perhaps a tighter bound can be found here)
we can derive
log2( Lo/f(b) ) < bl(b)
Ls / 2^(1+bl(c)) < f(c) < Ls / 2^(bl(c)).
where
`c' is any type `a' or type `b' symbol in the original file.

( I thought I was going to get
Lo / 2^(bl(b)) < f(b) < 2*Lo / 2^(bl(b))
but I think this is incorrect).

Unfortunately, there are lots of tricky cases where Huffman deletes
some duplicate letters (and possibly adds other duplicate letters, in
the general case). For those files, Ls could be bigger or smaller
or the same as Lo. For any letter `d' that had duplicate copies deleted,
floor(log2(Ls/f(d))) < bl(d).
The remaining letters still get
bl(b) = floor(log2( Ls/f(b) ))
Sometimes some of the remaining letters are compressed even better than
Arithmetic coding (it appears to violate entropy); i.e., sometimes
bl(c) < log2( Lo/f(c) ).
However, total entropy is not violated since the worse compression on
letters who had some copies deleted more than make up for
this "improvement".
In particular, for the extremely skew case where Lo/f(d) = 1.00001
(or, in general, where Lo/f(d) < 2 approximately),
floor( log2( Ls/f(d) ) ) = 0 < bl(d) = 1.
So for files where Huffman deletes some characters, we get
Letters with unchanged frequency
bl(a) = log2( Ls/f(a) )
f(a) = Ls / 2^(bl(a)).
Letters with copies added
bl(b) = floor(log2( Ls/f(b) ))
Ls / 2^(1+bl(b)) < f(b) < Ls / 2^(bl(b)).
Letters with copies deleted
floor(log2(Ls/f(d))) < bl(d)
Ls / ( 2^(bl(d)+1) ) < f(d).

Of course, for every symbol in every kind of file that has Q unique
symbols,
0 < bl(c) < Q
0 < bl(d) < Q.
(except for the special case of Q=1 (Ls = 1) or Q=0 (Ls = 0), where the
compressed data has length zero bits).

Maybe if I could get a better handle on when exactly these copies are
deleted, I could answer the original question.

I suspect that if you look at every symbol that has the same bit length,
only the one closest to the end of the alphabet could possibly have
copies deleted. If that were true, then for all symbols of the same bit
length except for the last,
Ls / 2^(1+bl(c)) < f(c) < Ls / 2^(bl(c)).
where
Ls = sum{i=each unique symbol}( 2^(max - bl(i)) )
max = the longest bit length.
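
A quick Python sketch (mine) of the quantities just defined: Ls computed
from the bit lengths, and the open interval (Ls/2^(1+bl), Ls/2^bl) proposed
for each symbol's count; whether those intervals actually hold in every case
is exactly the open question being discussed. The lengths used below are a
hypothetical example.

def ls_and_bounds(bit_lengths):
    longest = max(bit_lengths.values())
    ls = sum(2 ** (longest - bl) for bl in bit_lengths.values())
    return ls, {sym: (ls / 2 ** (bl + 1), ls / 2 ** bl)
                for sym, bl in bit_lengths.items()}

print(ls_and_bounds({"a": 1, "b": 2, "c": 3, "d": 3}))
# (8, {'a': (2.0, 4.0), 'b': (1.0, 2.0), 'c': (0.5, 1.0), 'd': (0.5, 1.0)})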

d_c...@my-deja.com

Jul 16, 1999, 3:00:00 AM
to
At
http://www.compressconsult.com/huffman/#maxlength
I learned that the Fibonacci sequence of frequencies
gives the maximum possible code length.

length of file (# of samples)    maximum code length
            2                            1
         3...4                           2
         5...7                           3
        8...12                           4
       13...20                           5
       21...33                           6
        ......                          ...
    1597...2583                         15
    2584...4180                         16
        4181...                         17

which gives a 16 bit code length
after only 2584 samples of a 17 symbol file
and a 17 bit code length
after only 4181 samples of an 18 symbol file
:-(.
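
The table can be regenerated from the Fibonacci numbers directly; a rough
Python sketch (mine, not from the linked page), under the same worst-case
assumption as the table above:

def fib():
    a, b = 1, 1
    while True:
        yield a
        a, b = b, a + b

f = fib()
fibs = [next(f) for _ in range(20)]          # 1, 1, 2, 3, 5, 8, 13, ...
for k in range(1, 18):
    lo, hi = fibs[k + 1], fibs[k + 2] - 1    # range [F(k+2), F(k+3)-1]
    print("%5d...%-5d -> max code length %d" % (lo, hi, k))
# (the original table leaves the last range open-ended: 4181...)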

It appears that Tom Lane made the same mistake I did when I figured
that if I used 15 bit counters to count the symbol frequencies, I
could be guaranteed a minimum frequency of 1 out of 2^15 and therefore
(erroneously) that the maximum symbol length would be about 16 bits. The
maximum symbol length could be far longer than that, so I'm thinking about
using the depth-limited near-Huffman compression Tom Lane mentioned.

DI Michael Schindler

Jul 16, 1999, 3:00:00 AM
to
hi!

the formula below is right only if you take the longest subtrees
in case of equality of weights. If you prefer shorter trees,
the # of samples rises faster.
The reason is that in the case of this worst-case sample
distribution you add just about half the number of samples you
already have but need an additional bit.

Note that the code length itself is not important; what matters for coding
and decoding is the part of the code that is nonconstant.
This can be limited to ceil(log2(ALPHABETSIZE)) bits when using a
canonical code (which you should, for decoding speed reasons).

Michael


d_c...@my-deja.com wrote in article <7mmh3c$67k$1...@nnrp1.deja.com>...

Tom Lane

Jul 17, 1999, 3:00:00 AM
to
d_c...@my-deja.com writes:
> It appears that Tom Lane made the same mistake I did when I figured
> that if I used 15 bit counters to count the symbol frequencies, that I
> could be guaranteed a minimum frequency
> of 1 out of 2^15 and therefore (erroneously) the maximum symbol length
> would be about 16 bits. The maximum symbol length could be far longer
> than that, so I'm thinking about using the depth-limited near-Huffman
> compression Tom Lane mentioned.

I did learn better some years ago ;-). One thing to keep in mind
here is that this stuff is not really significant unless you are
building a Huffman coder or decoder that has an implementation
limit on the maximum code length it can cope with. Rejiggering
the lengths of bottom-frequency symbols may make the code not
strictly optimal per Huffman's proof, but it obviously cannot
have much impact on the overall compression rate --- if those
symbols occurred often enough to be significant, they'd have
gotten shorter codes. (I learned *that* after implementing
an extremely complicated algorithm for computing an optimal
limited-length code tree, only to find that it outperformed
the approximate method recommended in the JPEG spec by a
spectacular 0.001% on real-world images...)

Now there are a lot of fast Huffman bit-pushing methods that
do want a short limit on the max code length, so these length-
limited almost-Huffman codes do have considerable interest in
practice. Just don't get carried overboard about the difference
between optimal and almost-optimal codes.

regards, tom lane
organizer, Independent JPEG Group

d_c...@my-deja.com

Jul 19, 1999, 3:00:00 AM
to
I already knew about preferring shorter trees (easy to do by preferring
the least-frequently-merged nodes when building the tree during
compression), but somehow I neglected to take that into account when
building that table. Thanks for pointing out that flaw. Using
standard (prefer shorter trees) encoding,

length of file (# of samples)    maximum code length (standard encoding)
          0..1                           0
             2                           1
          3..4                           2
          5..8                           3
         9..14                           4
        15..24                           5
        25..40                           6
        41..66                           7
       67..108                           8
           ...                         ...
    1973..3192                          15
    3193..5166                          16
    5167..8360                          17

The width of each range is the lower end of the previous range.
(Did I get it right this time ?)
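
For checking numbers like these, here is a rough Python sketch (mine, not
from the post) that computes the maximum code length the Huffman algorithm
produces for a given list of counts; breaking weight ties in favour of the
shallower subtree is one simple way to bias toward shorter trees, though
borderline cases still depend on exactly how ties are broken.

import heapq

def max_code_length(counts):
    # heap entries are (weight, subtree height, unique id); on equal weights
    # the shallower subtree is merged first
    heap = [(c, 0, i) for i, c in enumerate(counts)]
    heapq.heapify(heap)
    uid = len(counts)
    while len(heap) > 1:
        w1, h1, _ = heapq.heappop(heap)
        w2, h2, _ = heapq.heappop(heap)
        heapq.heappush(heap, (w1 + w2, max(h1, h2) + 1, uid))
        uid += 1
    return heap[0][1]

print(max_code_length([1, 1, 1, 3, 4, 7, 11, 18]))   # 7, for the 46-symbol example below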

Unfortunately, this doesn't seem to make a significant difference (5166
characters vs. 4180 characters).

Using standard (prefer shorter trees) encoding,
the worst (pathological) case for 8 unique symbols is a depth of 7,
(((((((1+1)+1)+3)+4)+7)+11)+18) = 46 symbols
which is almost, but not quite, the standard Fibonacci sequence.
Each frequency is the sum of the 2 previous frequencies,
but this sequence starts out "1 1 1 3" rather than "1 1".

But a depth of 7 can occur with only 41 symbols,
albeit with 9 unique symbols:
(((((((1+1)+(1+1))+1)+4)+6)+10)+16) = 41
This is even less like the standard Fibonacci sequence.
Each frequency is the sum of the 2 previous frequencies,
but this sequence starts out
"1 1 1 1 1 4 6" rather than "1 1".

Thanks for telling me that, after the initial zeros, even the worst-
case canonical codes are relatively short. I didn't know that. I can
see that that would definitely simplify Huffman compressors and
decompressors. If I have time later, I will see if that really makes my
software (which already uses canonical codes) go any faster (or slower).

"DI Michael Schindler" <mic...@compressconsult.com>
mentioned:


> the formula below is right only if you take the longest subtrees
> in case of equality of weights. If you prefer shorter trees
> the # of samples raises faster.
> The reason is that in the case of this worst-case sample
> distribution you add just about half the number of samples you
> already have but need an additional bit.
>
> Note the codelength is not important; important is the part of
> the code that is nonconstant for coding and decoding issues.
> This can be limited to ceil(log2(ALPHABETSIZE)) bits when using a
> canonical code (which you should for decoding speed reasons).

...
> d_c...@my-deja.com wrote
...


> > the Fibonacci sequence of frequencies
> > gives the maximum possible code length.
> >
> > length of file (# of samples)
> > maximum code length
> > 2 1
> > 3...4 2
> > 5...7 3
> > 8...12 4
> > 13...20 5
> > 21...33 6
> > ...... ...
> > 1597...2583 15
> > 2584...4180 16
> > 4181... 17
