Canonical huffman coder implementation

Gareth

unread,

Mar 4, 2003, 4:29:21 PM3/4/03

to

My cannonical huffman coder started out as a plain huffman coder that
I wrote and then modified to generated huffman codes. With my plain
huffman coder a branch generated a '1' or a '0' depending on the
weight of the other branch connected to the same parent node. With a
huffman coder, shorter codes are allocated '1's.

To change this so that canonical codes were generated I added another
field to each node, the "branch depth". Each leaf node had it's branch
depth set to one. Then working up the tree, each parent node had it's
branch depth set to the sum of the two children's branch depths so
that the longest distance to a leaf is recorded for each node. Now the
branch is set to generate a '0' or '1' depending on this branch depth,
ie, the smaller branch depths generate '1's.

Now passing a contrived file to my coder I get the following results :

symbol: 67 length: 3 code: 111
symbol: 68 length: 3 code: 101
symbol: 64 length: 4 code: 0001
symbol: 66 length: 4 code: 0110
symbol: 69 length: 4 code: 1101
symbol: 6a length: 4 code: 0010
symbol: 6b length: 4 code: 0111
symbol: 6c length: 4 code: 1001
symbol: 73 length: 4 code: 0011
symbol: 75 length: 4 code: 0101
symbol: 62 length: 5 code: 11001
symbol: 65 length: 5 code: 10000
symbol: 6f length: 5 code: 11000
symbol: 72 length: 5 code: 01001
symbol: 77 length: 5 code: 10001
symbol: 79 length: 5 code: 01000
symbol: 74 length: 6 code: 000001
symbol: 61 length: 7 code: 0000111
symbol: 63 length: 7 code: 0000100
symbol: 6e length: 7 code: 0000001
symbol: 71 length: 7 code: 0000110
symbol: 76 length: 7 code: 0000101
symbol: 78 length: 8 code: 00000001
symbol: 6d length: 9 code: 000000001
symbol: 70 length: 9 code: 000000000

I think something is wrong because nothing but the codes with the
shortest length (3 in this case) should start with '1', but some do.
If the above is correct, how would for example the code '1101' be
decoded?. Once the first two '1's have been read in by the decoder how
will it know if these are referring to codes of length 3,4 or 5?

Dale King

unread,

Mar 4, 2003, 5:07:25 PM3/4/03

to

"Gareth" <the_a...@hotmail.com> wrote in message
news:370afe3d.03030...@posting.google.com...

> My cannonical huffman coder started out as a plain huffman coder that
> I wrote and then modified to generated huffman codes. With my plain
> huffman coder a branch generated a '1' or a '0' depending on the
> weight of the other branch connected to the same parent node. With a
> huffman coder, shorter codes are allocated '1's.
>
> To change this so that canonical codes were generated I added another
> field to each node, the "branch depth". Each leaf node had it's branch
> depth set to one. Then working up the tree, each parent node had it's
> branch depth set to the sum of the two children's branch depths so
> that the longest distance to a leaf is recorded for each node. Now the
> branch is set to generate a '0' or '1' depending on this branch depth,
> ie, the smaller branch depths generate '1's.

I think the problem comes from how you handle a branch node that has
different branch depths for its two children.

But it is much easier than that. Once you know the depth of leaf nodes (i.e.
the length of the code) that is all you care about. You can throw the tree
away at that point. For some very efficient code that takes a sorted list of
frequencies and returns in the same array the code lengths for each see the
source code for this paper:

http://www.cs.mu.oz.au/~alistair/abstracts/mk95:wads.html

To understand what it is doing you'll have to read the paper.

Once you have the lengths of the codes it is a simple matter to turn that
into the codes. Here is my Java implementation that takes an array of code
lengths (sorted from longest to shortest which is what the algorithm cited
above generates) and creates and array of the codes for those code lengths.

private int[] generateCodes(int[] codeLengths)
{
int[] codes = new int[codeLengths.length];
int length = codeLengths[0];
int code = 0;
for (int i = 0; i < codeLengths.length; i++)
{
while( length > codeLengths[i] )
{
code >>>= 1;
length--;
}
codes[i] = code++;
}
return codes;
}

This works from the bottom of the tree up. The nodes at the greatest depth
will start at zero and increase by one. When you go up a level you simply
divide the code by 2.

That definitely is not canonical. The codes for all nodes at a given depth
should increment by 1.

Andrey Savov

unread,

Mar 5, 2003, 1:02:22 AM3/5/03

to

"Dale King" <Dale[dot]Ki...@thomson.net> wrote in message
news:3e65...@news.tce.com...

> That definitely is not canonical. The codes for all nodes at a given depth
> should increment by 1.

Do you mean the lengths or what?

Dale King

unread,

Mar 5, 2003, 12:29:29 PM3/5/03

to

"Andrey Savov" <sa...@hotmail.com> wrote in message
news:Opg9a.2733$5x4...@news1.west.cox.net...

No, the codes for all symbols of the same length should form a linear
sequence changin by one. Taking the bottom of Gareth's example:

> symbol: 74 length: 6 code: 000001
> symbol: 61 length: 7 code: 0000111
> symbol: 63 length: 7 code: 0000100
> symbol: 6e length: 7 code: 0000001
> symbol: 71 length: 7 code: 0000110
> symbol: 76 length: 7 code: 0000101
> symbol: 78 length: 8 code: 00000001
> symbol: 6d length: 9 code: 000000001
> symbol: 70 length: 9 code: 000000000

Notice that at depth 7 the code 0000010 is not present and that is used for
the length 6 code. Thereforre this tree cannot have the Canonical shape of
all branch nodes to the left side and all leaf nodes to the right. Those
numbers mean that the leaf node for symbol 74 is in the middle of the ones
for length 7 symbols.

A canonical code would be

symbol: 74 length: 6 code: 000011
symbol: 61 length: 7 code: 0000001
symbol: 63 length: 7 code: 0000010
symbol: 6e length: 7 code: 0000011
symbol: 71 length: 7 code: 0000100

symbol: 76 length: 7 code: 0000101
symbol: 78 length: 8 code: 00000001

symbol: 6d length: 9 code: 000000000
symbol: 70 length: 9 code: 000000001

It is this simple pattern to the codes that makes it so easy to translate
code lengths. The pattern is much easier to see if sort by length ascending
and symbol value descending.

symbol: 74 length: 6 code: 000011

symbol: 76 length: 7 code: 0000101

symbol: 71 length: 7 code: 0000100
symbol: 6e length: 7 code: 0000011
symbol: 63 length: 7 code: 0000010
symbol: 61 length: 7 code: 0000001

symbol: 78 length: 8 code: 00000001

symbol: 70 length: 9 code: 000000001
symbol: 6d length: 9 code: 000000000

Basically each code is "one" greater than the one below it. The meaning of
"one" changes as you change levels.

For more info see http://www.compressconsult.com/huffman/

Gareth

unread,

Mar 5, 2003, 5:14:25 PM3/5/03

to

> No, the codes for all symbols of the same length should form a linear
> sequence changin by one. Taking the bottom of Gareth's example:

> Basically each code is "one" greater than the one below it. The meaning of

> "one" changes as you change levels.

Right, I think that's clearer in my mind now. The tree is only used to
find the length of the codes. Once this is known, all symbols that
have codes of the same length are sorted alphabeticaly, and then
incremented according to their order.
So for example symbols A, B, C and D have code lengths of 4, their
codes will be

A: 0000
B: 0001
C: 0010
D: 0011

And say symbols E, F, G, and H have code lengths of 3, their codes wil
be :

E: 000
F: 001
G: 010
H: 011

I still don't see how to decode this though. If the encoded bit stream
starts with '000', how does the decoder know if this is E, or if it is
part of A?

Andrey Savov

unread,

Mar 5, 2003, 11:14:20 PM3/5/03

to

"Dale King" <Dale[dot]Ki...@thomson.net> wrote in message

news:3e66...@news.tce.com...

> > > That definitely is not canonical. The codes for all nodes at a given
> depth
> > > should increment by 1.
> >
> > Do you mean the lengths or what?
>
> No, the codes for all symbols of the same length should form a linear
> sequence changin by one. Taking the bottom of Gareth's example:

I don't think that's the definition of canonical. Canonical comes from

Schwartz, E. S. and Kallick, B. 1964. Generating a canonical prefix
encoding. Communications of the ACM 7, 166--169.

and has nothing to do with the actual code values, but rather the code tree,
which only determines the code-lengths. The code assignments are certainly
not unique within a code tree.

I understand that there is a canonical code assignment, which is probably
what you refer to, but that doesn't mean his codes are not canonical. After
all code assignments really don't matter once you've fixed the lengths.

Willem

unread,

Mar 6, 2003, 4:58:31 AM3/6/03

to

Gareth wrote:
)> No, the codes for all symbols of the same length should form a linear
)> sequence changin by one. Taking the bottom of Gareth's example:
)
)> Basically each code is "one" greater than the one below it. The meaning of
)> "one" changes as you change levels.
)
) Right, I think that's clearer in my mind now. The tree is only used to
) find the length of the codes. Once this is known, all symbols that
) have codes of the same length are sorted alphabeticaly, and then
) incremented according to their order.
) So for example symbols A, B, C and D have code lengths of 4, their
) codes will be
)
) A: 0000
) B: 0001
) C: 0010
) D: 0011
)
) And say symbols E, F, G, and H have code lengths of 3, their codes wil
) be :
)
) E: 000
) F: 001
) G: 010
) H: 011

You don't reset the counter, you just chop the last bit off it.
This means that after D, the counter would be 0100, and you get:

E: 010
F: 011
G: 100
H: 101

And if you have I with codelength 2, you finish with:

I: 11

The tree was (fixed font):

I
/ /H
x-x
/ \G
/
x /F
\ x
\ / \E
x /D
\ x
\ / \C
x
\ /B
x
\A

P.S. The 'last' bits you chop off should always be 0, otherwise you don't
have a valid tree.

SaSW, Willem (at stack dot nl)
--
Disclaimer: I am in no way responsible for any of the statements
made in the above text. For all I know I might be
drugged or something..
No I'm not paranoid. You all think I'm paranoid, don't you !
#EOT

Dale King

unread,

Mar 6, 2003, 3:09:58 PM3/6/03

to

"Andrey Savov" <sa...@hotmail.com> wrote in message

news:wWz9a.363$hW6...@news1.west.cox.net...

> "Dale King" <Dale[dot]Ki...@thomson.net> wrote in message
> news:3e66...@news.tce.com...
> > > > That definitely is not canonical. The codes for all nodes at a given
> > depth
> > > > should increment by 1.
> > >
> > > Do you mean the lengths or what?
> >
> > No, the codes for all symbols of the same length should form a linear
> > sequence changin by one. Taking the bottom of Gareth's example:
>
> I don't think that's the definition of canonical. Canonical comes from

It is not the definition of canonical, it is a direct consequence of the
definition of canonical.

> Schwartz, E. S. and Kallick, B. 1964. Generating a canonical prefix
> encoding. Communications of the ACM 7, 166--169.
>
>
> and has nothing to do with the actual code values, but rather the code
tree,
> which only determines the code-lengths. The code assignments are
certainly
> not unique within a code tree.

I don't have access to that specific paper, but I have read other papers
that cite that paper. I certainly agree that the assignment of codes does
not have to be unique. The most important characteristic of a canonical
Huffman tree is its shape. You can reorder the leaf nodes on any level and
still have a canonically shaped tree. Although it is usual to also order the
leaves on a given level sorted by symbol so that all you need to know to
recreate the tree is the length of the code for each symbol.

If you look at this paper which cites the one you mentioned (under the
section labeled Method B):

D. S. Hirschberg and D. A. Lelewer. Efficient Decoding of Prefix Codes.
Communications of the ACM, 33(4): 449--459, 1990.
http://citeseer.nj.nec.com/hirschberg90efficient.html

You will see that they reiterate what I said:

"The canonical code posesses some nice mathematical properties. The
codewords of a given length are consecutive binary numbers .... In other
words, the first codeword of length l is obtained from the last codeword of
length l - 1 by adding 1 to the binary number represented by d(l-1) and
shifting that binary number left once... <lengthy explanation of what to do
when a length is unused snipped>... We say that a canonical code has the
numerical sequence property"

> I understand that there is a canonical code assignment, which is probably
> what you refer to, but that doesn't mean his codes are not canonical.
After
> all code assignments really don't matter once you've fixed the lengths.

The issue is not how he assigned symbols to codes, but how codes were
assigned to each length. In order to be canonical it must have the numerical
sequence property. If it does not then it cannot be a canonically shaped
tree.

Look at the portion of the codes I referred to (I eliminated the symbol
values as the issue is not which symbol goes with which code):

> length: 6 code: 000001
> length: 7 code: 0000111
> length: 7 code: 0000100
> length: 7 code: 0000001
> length: 7 code: 0000110
> length: 7 code: 0000101

Let's shorten this tree as all start with 0000 and just look at that
subtree.:

> length: 2 code: 01
> length: 3 code: 111
> length: 3 code: 100
> length: 3 code: 001
> length: 3 code: 110
> length: 3 code: 101

If you view the following tree representation with a fixed width font you
will see that it does not have a canonical shape. That length 6 (or 2) code
is stuck in between nodes of greater depth. That was why I said it was not
canonical.

O
/ \
O O
/ \ / \
O X O O
/ \ / \ / \
O X X X X X

The fact that it is not canonical can be seen from the fact that it does not
have the numerical sequence property. There is a length 7 code of 0000001
and one of 0000100 but the values in-between are not assigned to a length 7
code. Therefore it cannot have a canonical shape.

Gareth

unread,

Mar 6, 2003, 6:15:05 PM3/6/03

to

> > A: 0000
> > B: 0001
> > C: 0010
> > D: 0011

> You don't reset the counter, you just chop the last bit off it.
> This means that after D, the counter would be 0100, and you get:
>
> E: 010
> F: 011
> G: 100

You say that the chopped off bit should always be '0'? What if there
were only 2 symbols with length 3, then F would be the last symbol of
length 3 and '1' would be stripped off. So where has it gone wrong?

Willem

unread,

Mar 7, 2003, 6:19:41 AM3/7/03

to

Gareth wrote:
)> > A: 0000
)> > B: 0001
)> > C: 0010
)> > D: 0011

)> You don't reset the counter, you just chop the last bit off it.
)> This means that after D, the counter would be 0100, and you get:
)>
)> E: 010
)> F: 011
)> G: 100
)
) You say that the chopped off bit should always be '0'? What if there
) were only 2 symbols with length 3, then F would be the last symbol of
) length 3 and '1' would be stripped off. So where has it gone wrong?

If there were 2 symbols of length 3, then the counter would be 100, which
becomes 10.
I assume you mean if there were 3 symbols of length 3. Well there can't be.
There's always an even number of symbols of any length, except the shortest
length. If there isn't, you're wasting encoding space.

Dale King

unread,

Mar 7, 2003, 11:53:47 AM3/7/03

to

"Willem" <n...@sp.am.invalid> wrote in message
news:slrnb6h02...@toad.stack.nl...

> Gareth wrote:
> )> > A: 0000
> )> > B: 0001
> )> > C: 0010
> )> > D: 0011
> )> You don't reset the counter, you just chop the last bit off it.
> )> This means that after D, the counter would be 0100, and you get:
> )>
> )> E: 010
> )> F: 011
> )> G: 100
> )
> ) You say that the chopped off bit should always be '0'? What if there
> ) were only 2 symbols with length 3, then F would be the last symbol of
> ) length 3 and '1' would be stripped off. So where has it gone wrong?
>
> If there were 2 symbols of length 3, then the counter would be 100, which
> becomes 10.
> I assume you mean if there were 3 symbols of length 3. Well there can't
be.
> There's always an even number of symbols of any length, except the
shortest
> length. If there isn't, you're wasting encoding space.

Huh? There's always an even number of symbols of the longest length, but
there is no restriction on other lengths. Other lengths can be odd or even.
Consider the worst case for a Huffman tree when the frequencies are a
Fibonacci sequence:

1 1 2 3 5 8 13

Draw the Huffman tree and you will see you get 2 symbols of the longest
length and only one of every other length.

Or cosider the frequencies:

1 1 1 1 3 3

That gives you 4 symbols of the longest length and 2 of the shortest.

Dale King

unread,

Mar 7, 2003, 12:11:05 PM3/7/03

to

"Gareth" <the_a...@hotmail.com> wrote in message
news:370afe3d.03030...@posting.google.com...

> > > A: 0000
> > > B: 0001
> > > C: 0010
> > > D: 0011
> > You don't reset the counter, you just chop the last bit off it.
> > This means that after D, the counter would be 0100, and you get:
> >
> > E: 010
> > F: 011
> > G: 100
>
> You say that the chopped off bit should always be '0'? What if there
> were only 2 symbols with length 3, then F would be the last symbol of
> length 3 and '1' would be stripped off. So where has it gone wrong?

You forgot to increment it first. It works fine you increment from F's
symbol 011 and get 100 then strip off the 0 to get 10, just like we did to
go from D to E

To answer your next question about what would happen if you had E, F, and G
were three bits and you had H as 2 bits, the answer is that it cannot happen
as that would not be a complete binary tree (it would have an internal node
with only 1 child). There is no way you strip off a 1 if the tree is a
complete binary tree, which is a necesary condition for a tree to be a
Huffman tree.

Gareth

unread,

Mar 7, 2003, 12:28:04 PM3/7/03

to

> I assume you mean if there were 3 symbols of length 3. Well there can't be.
> There's always an even number of symbols of any length, except the shortest
> length. If there isn't, you're wasting encoding space.

OK, the fog is lifting slowly. The most important thing is to make
sure that the tree is built properly. So say there are 6 symbols with
frequencies 1, 2, 4, 5, 6, and 8. Building a huffman tree by the
normal rules, ie, adding the 2 lowest frequencies together to form a
new node, I get the following tree :

8 1
/ /
/ 3
15 / \
/ \ / 2
/ 7
/ \
26 4
\
\ 5
\ /
11
\
6

This tree gives codes of length :

Symbol Length
1 4
2 4
4 3
5 2
6 2
8 2

This is wrong because there are an uneven number of codes of length
two and three. So what is the correct way to build this tree?

Andrey Savov

unread,

Mar 7, 2003, 2:32:43 PM3/7/03

to

"Dale King" <Dale[dot]Ki...@thomson.net> wrote in message news:<3e67...@news.tce.com>...

[explanation snipped]

Ah, yes -- the numerical sequence property! I stand corrected.
Thanks for pointing that out Dale.

Federico

unread,

Mar 7, 2003, 3:03:34 PM3/7/03

to

I've utilized a code similar to followin to generate de bit
configuration for a canonical huffman, gived the size of
each codes:

/////////////////////////////////////////////////////////////
int codeSize[] = { 2, 2, 2, 3, 4, 4 };
#define srcCount (sizeof(codeSize) / sizeof(int))
int codeConf[srcCount];

...

int i, actualCodeSize=0, actualCodeConf=0;
for(i=0; i<srcCount; i++)
{
while(actualCodeSize < codeSize[i])
{
actualCodeConf *= 2;
actualCodeSize++;
}
codeConf[i] = actualCodeConf++;
}
/////////////////////////////////////////////////////////////

In this code, 'codeSize' array contain the lenght of the
code of each symbol. At output, codeConf array contain an
integer that represent the bit configuration. As an example,
to printout this bit configuration i've utilized following
code:

/////////////////////////////////////////////////////////////
// Printout bit configuration
char res[20];
for(i=0; i<srcCount; i++)
{
int useSize = codeSize[i], useConf = codeConf[i];
res[useSize] = 0;
for(int j=useSize; j; j--)
{
res[j - 1] = (useConf & 1)? '1': '0';
useConf /= 2;
}
printf("%d=%s\t", codeSize[i], res);
}
/////////////////////////////////////////////////////////////
Hope this usefull.

Federico

Gareth

unread,

Mar 9, 2003, 6:11:52 PM3/9/03

to

> int codeSize[] = { 2, 2, 2, 3, 4, 4 };
> #define srcCount (sizeof(codeSize) / sizeof(int))
> int codeConf[srcCount];
>
> ...
>
> int i, actualCodeSize=0, actualCodeConf=0;
> for(i=0; i<srcCount; i++)
> {
> while(actualCodeSize < codeSize[i])
> {
> actualCodeConf *= 2;
> actualCodeSize++;
> }
> codeConf[i] = actualCodeConf++;
> }

I can't see what this code does. I assume a huffman tree has already
been used to fill the codeSize array. In the while loop is the line

actualCodeConf *= 2;

Since actualCodeConf is initialised to zero the above line does
nothing so no matter how many times you go round the loop
actulaCodeConf will always be zero.

> char res[20];
> for(i=0; i<srcCount; i++)
> {
> int useSize = codeSize[i], useConf = codeConf[i];
> res[useSize] = 0;
> for(int j=useSize; j; j--)
> {
> res[j - 1] = (useConf & 1)? '1': '0';
> useConf /= 2;
> }
> printf("%d=%s\t", codeSize[i], res);
> }

Could you tell me what the line

res[j - 1] = (useConf & 1)? '1': '0';

does. It is only the bit after the = that I don't understand.

> Hope this usefull.
>
> Federico

Glenn Gordon

unread,

Mar 9, 2003, 10:37:10 PM3/9/03

to

"Gareth" <the_a...@hotmail.com> wrote in message
news:370afe3d.03030...@posting.google.com...

> > int codeSize[] = { 2, 2, 2, 3, 4, 4 };
> > #define srcCount (sizeof(codeSize) / sizeof(int))
> > int codeConf[srcCount];
> >
> > ...
> >
> > int i, actualCodeSize=0, actualCodeConf=0;
> > for(i=0; i<srcCount; i++)
> > {
> > while(actualCodeSize < codeSize[i])
> > {
> > actualCodeConf *= 2;
> > actualCodeSize++;
> > }
> > codeConf[i] = actualCodeConf++;
> > }
>
> I can't see what this code does. I assume a huffman tree has already
> been used to fill the codeSize array. In the while loop is the line
>
> actualCodeConf *= 2;
>
> Since actualCodeConf is initialised to zero the above line does
> nothing so no matter how many times you go round the loop
> actulaCodeConf will always be zero.
>

...

There are two loops, the outer for and the inner while. actualCodeConf was
initialized to zero prior to the loops, and at the end of the for loop is:

codeConf[i] = actualCodeConf++;

which will use the current value of actualCodeConf in the assignment to
codeConf[i] and then increment it by one.

Willem

unread,

Mar 10, 2003, 6:29:59 AM3/10/03

to

Dale wrote:
) Huh? There's always an even number of symbols of the longest length, but
) there is no restriction on other lengths. Other lengths can be odd or even.
) Consider the worst case for a Huffman tree when the frequencies are a
) Fibonacci sequence:

Oops, my bad. I didnt't really think this through.
What I meant is that the total number of nodes plus leaves at each depth is
always even. Obviously there can be an odd number of nodes at any but the
deepest level, so there can also be an odd number of leaves.

Federico

unread,

Mar 10, 2003, 6:47:31 AM3/10/03

to

> I can't see what this code does. I assume a huffman tree has already

> been used to fill the codeSize array. In the while loop is the line...

Yes. The **lenghts** of each code are into 'codeSize' array as output of
a huffman tree construction stage.

And, as Glenn say, the while loop make nothing over 'actualCodeConf', but
only in first iteration of 'for' loop. After first, the while loop 'make
room' for extended bits in 'actualCodeConf'.

> Could you tell me what the line
>
> res[j - 1] = (useConf & 1)? '1': '0';

> does. It is only the bit after the = that I don't understand.

I want to make a printable version of each configuration for print it.
One method is to make the string of 0's and 1's from right to left,
'printing' the value of bit 0 (useConf & 1) and eliminating bit 0 after,
with (useConf/2).

Federico

Dale King

unread,

Mar 10, 2003, 4:39:39 PM3/10/03

to

"Gareth" <the_a...@hotmail.com> wrote in message
news:370afe3d.03030...@posting.google.com...

As I already established there is nothing wrong with that. The longest
length must have an even number of nodes, but you can't tell anything by
whether the number of nodes at other levels is odd or even.

> So what is the correct way to build this tree?

There's nothing wrong with your tree. Following the procedure for assigning
codes to give a Canonical code would give you codes of:

Symbol Length
1 0000
2 0001
4 001
5 01
6 10
8 11

Dale King

unread,

Mar 10, 2003, 4:44:52 PM3/10/03

to

"Willem" <n...@sp.am.invalid> wrote in message

news:slrnb6otp...@toad.stack.nl...

> Dale wrote:
> ) Huh? There's always an even number of symbols of the longest length, but
> ) there is no restriction on other lengths. Other lengths can be odd or
even.
> ) Consider the worst case for a Huffman tree when the frequencies are a
> ) Fibonacci sequence:
>
> Oops, my bad. I didnt't really think this through.
> What I meant is that the total number of nodes plus leaves at each depth
is
> always even. Obviously there can be an odd number of nodes at any but the
> deepest level, so there can also be an odd number of leaves.

Which is easy to see because that number must be 2 times the number of
internal nodes at the previous level. Not sure that is really helpful in
this analysis.

Gareth

unread,

Mar 10, 2003, 5:20:45 PM3/10/03

to

Thanks for your explantions, I understand what it is doing now. I have
modified Frederico's code so that it works with mine. The output I get
for the code lengths in Fredrico's example is :

00 length : 2
01 length : 2
10 length : 2
110 length : 3
1110 length : 4
1111 length : 4

This looks correct to me, the symbols with the same code length
increment for each instance.

Now for a longer example :-) Again, for a given code length, the code
increments for each instance :

000 length : 3
001 length : 3
0100 length : 4
0101 length : 4
0110 length : 4
0111 length : 4
1000 length : 4
1001 length : 4
1010 length : 4
1011 length : 4
11000 length : 5
11001 length : 5
11010 length : 5
11011 length : 5
11100 length : 5
11101 length : 5
111100 length : 6
1111010 length : 7
1111011 length : 7
1111100 length : 7
1111101 length : 7
1111110 length : 7
11111110 length : 8
111111110 length : 9
111111111 length : 9

Now the decoder only knows how long each symbol's code is. I thought
the leading '1's would give a good idea of which code length the
current code is going to be, but this doesn't always work. Codes of
length 5 start with 2 leading '1's but then after a while this becomes
3. I'm still struggling to see how the decoding works.

> > res[j - 1] = (useConf & 1)? '1': '0';
>

> I want to make a printable version of each configuration for print it.
> One method is to make the string of 0's and 1's from right to left,
> 'printing' the value of bit 0 (useConf & 1) and eliminating bit 0 after,
> with (useConf/2).

Thanks, I understand what (useConf & 1) and (useConf/2) are doing, it
was the

? '1': '0';

that I didn't get. I assume it is doing somthing like "if what's in
the brackets is true, then '1', else, '0' "

I've just never seen this syntax in C before.

Federico

unread,

Mar 11, 2003, 9:39:48 AM3/11/03

to

Gareth,

I don't think that the start 1 or 0 must, by need, follow any rule. It
depend strictly of your codelen configuration...

If you need a fast decoder, the best way is construct a struct's tree,
where each struct have a pointer to one's and zero's branch. A flag
can indicate if you're at a leaf and force stop tree descending.
Something as:

///////////////////////////////////////////////////////////////
struct CodeSymbol
{
CodeSymbol *oneBranch, *zeroBranch;
int actualSymbol; // -1 flag not leaf, >=0 is leaf
}
///////////////////////////////////////////////////////////////

So, gived a root, you walk too fast throug tree, decoding input:

///////////////////////////////////////////////////////////////
CodeSymbol *actual = treeRoot;
while(actual->actualSymbol < 0)
{
bool bit = GetInputBit(); // Read only 1 bit
actual = bit? actual->oneBranch: actual->zeroBranch;
}
// here, actual->actualSymbol is the decoded symbol
///////////////////////////////////////////////////////////////

Another:
The ? and : operator came from first hour! It was defined by
Kerningan and Ritchie in their first C's implementation. And, yes
it can be traduced to something as:

x = (a)? (b): (c)
if(a) x=b; else x=c;

Federico

Dale King

unread,

Mar 11, 2003, 11:35:36 AM3/11/03

to

"Federico" <fvi...@alberdijujuy.com.ar> wrote in message
news:a88326b0.0303...@posting.google.com...

> Gareth,
>
> I don't think that the start 1 or 0 must, by need, follow any rule. It
> depend strictly of your codelen configuration...
>
> If you need a fast decoder, the best way is construct a struct's tree,

No, no, no! If you want a fast decoder you don't want a tree at all. You
want look up tables. Anything that looks one bit at a time and decides to
branch based on that single bit is dead slow.

For some papers on this subject see:

http://www.cs.mu.oz.au/~alistair/abstracts/mt97:ieeecomm.html
http://citeseer.nj.nec.com/hirschberg90efficient.html
http://citeseer.nj.nec.com/brodnik98sublinear.html

Gareth

unread,

Mar 11, 2003, 5:35:54 PM3/11/03

to

> If you need a fast decoder, the best way is construct a struct's tree,
> where each struct have a pointer to one's and zero's branch. A flag
> can indicate if you're at a leaf and force stop tree descending.

Yes that's the way i got my plain huffman coder working. The reason I
am going for cannonical huffman coding is that my decoder must use
minimal memory. Speed is not an issue. Rebuilding the tree takes up
ram so i want to avoid having to do that.

Dale King

unread,

Mar 13, 2003, 1:35:32 PM3/13/03

to

"Gareth" <the_a...@hotmail.com> wrote in message

news:370afe3d.03031...@posting.google.com...

Then you also want a look up table approach. I too had the limitation of
minimal memory. What I do is store the an array of all nodes ordered from
top to bottom, left-to-right. Then all you need is another array with one
element per depth of the tree that gives how many symbols are at each depth.

The code I generate is the mirror image of yours: So taking your longer
example and using the mirror image:

110 length : 3
111 length : 3

0100 length : 4
0101 length : 4
0110 length : 4
0111 length : 4
1000 length : 4
1001 length : 4
1010 length : 4
1011 length : 4

00010 length : 5
00011 length : 5
00100 length : 5
00101 length : 5
00110 length : 5
00111 length : 5
000011 length : 6
0000001 length : 7
0000010 length : 7
0000011 length : 7
0000100 length : 7
0000101 length : 7
00000001 length : 8
000000000 length : 9
000000001 length : 9

You would have an array of the nodes in order from top to bottom in the
above list. In addition you would have another array that is the number of
symbols on each level:

unsigned tree[] = { 0, 0, 2, 8, 6, 1, 5, 1, 2 }

Here is some code that should extract the next symbol. This probably has
errors as I had to strip out a bunch of other things that only matter to my
application (I have two different sizes of "symbols" and I store the tree
within the array of nodes instead of separately):

int nodes= 0;
int index = 0;
int treeIndex = 0;
int code = 0;
int base = 1;

do
{
code <<= 1;

code |= nextBitOfData();

index += nodes;
nodes = tree[ treeIndex++ ];
base <<= 1;
base -= nodes;
} while( code < base );

index += code - base;

// index is the index of the next symbol.

The code is easier to follow if you step through it looking at an actual
tree. base is the code for the first leaf node on the given level. nodes is
the number of symbols on the level. For the above tree the values of base at
the end of each iteration will be:

10
100
110
100
10
11
1
1
0

Which corresponds to the code for the first symbol at each depth. The first
two will fail at the end of the loop as they have more bits than extracted.

To make this faster you would use multiple bits at once to index a table at
the expense of more memory.

Gareth

unread,

Mar 16, 2003, 4:07:25 PM3/16/03

to

> The code I generate is the mirror image of yours: So taking your longer
> example and using the mirror image:
>
> 110 length : 3
> 111 length : 3

>.
>.
>.

> 000000000 length : 9
> 000000001 length : 9
>
> You would have an array of the nodes in order from top to bottom in the
> above list. In addition you would have another array that is the number of
> symbols on each level:
>
> unsigned tree[] = { 0, 0, 2, 8, 6, 1, 5, 1, 2 }

You start with the code and thier lengths, and then another array with
the number of symbols on each level. But where do theses come from?
Are thay sent in the compressed file? I thought the point of canonical
huffman coding was that you could decompress knowing only the length
of the code for each symbol. passing arrays along with the compressed
data is going to greatly reduce the compression ratio of smaller
files.

> Here is some code that should extract the next symbol. This probably has
> errors as I had to strip out a bunch of other things that only matter to my
> application (I have two different sizes of "symbols" and I store the tree
> within the array of nodes instead of separately):

Here you mention a tree, but where has it come from? Plain huffman
codeing would be easy to decode and use minimal memory if you sent the
whole tree with the compressed data but the tree will greatly increase
the size of the compressed data.

The light at the end of the tunnel that is my complete understanding
of canonical huffman coding is getting brighter! Don't give up on me!!

Thanks,
Gareth