14 views

Skip to first unread message

Mar 4, 2003, 4:29:21 PM3/4/03

to

My cannonical huffman coder started out as a plain huffman coder that

I wrote and then modified to generated huffman codes. With my plain

huffman coder a branch generated a '1' or a '0' depending on the

weight of the other branch connected to the same parent node. With a

huffman coder, shorter codes are allocated '1's.

I wrote and then modified to generated huffman codes. With my plain

huffman coder a branch generated a '1' or a '0' depending on the

weight of the other branch connected to the same parent node. With a

huffman coder, shorter codes are allocated '1's.

To change this so that canonical codes were generated I added another

field to each node, the "branch depth". Each leaf node had it's branch

depth set to one. Then working up the tree, each parent node had it's

branch depth set to the sum of the two children's branch depths so

that the longest distance to a leaf is recorded for each node. Now the

branch is set to generate a '0' or '1' depending on this branch depth,

ie, the smaller branch depths generate '1's.

Now passing a contrived file to my coder I get the following results :

symbol: 67 length: 3 code: 111

symbol: 68 length: 3 code: 101

symbol: 64 length: 4 code: 0001

symbol: 66 length: 4 code: 0110

symbol: 69 length: 4 code: 1101

symbol: 6a length: 4 code: 0010

symbol: 6b length: 4 code: 0111

symbol: 6c length: 4 code: 1001

symbol: 73 length: 4 code: 0011

symbol: 75 length: 4 code: 0101

symbol: 62 length: 5 code: 11001

symbol: 65 length: 5 code: 10000

symbol: 6f length: 5 code: 11000

symbol: 72 length: 5 code: 01001

symbol: 77 length: 5 code: 10001

symbol: 79 length: 5 code: 01000

symbol: 74 length: 6 code: 000001

symbol: 61 length: 7 code: 0000111

symbol: 63 length: 7 code: 0000100

symbol: 6e length: 7 code: 0000001

symbol: 71 length: 7 code: 0000110

symbol: 76 length: 7 code: 0000101

symbol: 78 length: 8 code: 00000001

symbol: 6d length: 9 code: 000000001

symbol: 70 length: 9 code: 000000000

I think something is wrong because nothing but the codes with the

shortest length (3 in this case) should start with '1', but some do.

If the above is correct, how would for example the code '1101' be

decoded?. Once the first two '1's have been read in by the decoder how

will it know if these are referring to codes of length 3,4 or 5?

Mar 4, 2003, 5:07:25 PM3/4/03

to

"Gareth" <the_a...@hotmail.com> wrote in message

news:370afe3d.03030...@posting.google.com...

> My cannonical huffman coder started out as a plain huffman coder that

> I wrote and then modified to generated huffman codes. With my plain

> huffman coder a branch generated a '1' or a '0' depending on the

> weight of the other branch connected to the same parent node. With a

> huffman coder, shorter codes are allocated '1's.

>

> To change this so that canonical codes were generated I added another

> field to each node, the "branch depth". Each leaf node had it's branch

> depth set to one. Then working up the tree, each parent node had it's

> branch depth set to the sum of the two children's branch depths so

> that the longest distance to a leaf is recorded for each node. Now the

> branch is set to generate a '0' or '1' depending on this branch depth,

> ie, the smaller branch depths generate '1's.

news:370afe3d.03030...@posting.google.com...

> My cannonical huffman coder started out as a plain huffman coder that

> I wrote and then modified to generated huffman codes. With my plain

> huffman coder a branch generated a '1' or a '0' depending on the

> weight of the other branch connected to the same parent node. With a

> huffman coder, shorter codes are allocated '1's.

>

> To change this so that canonical codes were generated I added another

> field to each node, the "branch depth". Each leaf node had it's branch

> depth set to one. Then working up the tree, each parent node had it's

> branch depth set to the sum of the two children's branch depths so

> that the longest distance to a leaf is recorded for each node. Now the

> branch is set to generate a '0' or '1' depending on this branch depth,

> ie, the smaller branch depths generate '1's.

I think the problem comes from how you handle a branch node that has

different branch depths for its two children.

But it is much easier than that. Once you know the depth of leaf nodes (i.e.

the length of the code) that is all you care about. You can throw the tree

away at that point. For some very efficient code that takes a sorted list of

frequencies and returns in the same array the code lengths for each see the

source code for this paper:

http://www.cs.mu.oz.au/~alistair/abstracts/mk95:wads.html

To understand what it is doing you'll have to read the paper.

Once you have the lengths of the codes it is a simple matter to turn that

into the codes. Here is my Java implementation that takes an array of code

lengths (sorted from longest to shortest which is what the algorithm cited

above generates) and creates and array of the codes for those code lengths.

private int[] generateCodes(int[] codeLengths)

{

int[] codes = new int[codeLengths.length];

int length = codeLengths[0];

int code = 0;

for (int i = 0; i < codeLengths.length; i++)

{

while( length > codeLengths[i] )

{

code >>>= 1;

length--;

}

codes[i] = code++;

}

return codes;

}

This works from the bottom of the tree up. The nodes at the greatest depth

will start at zero and increase by one. When you go up a level you simply

divide the code by 2.

That definitely is not canonical. The codes for all nodes at a given depth

should increment by 1.

Mar 5, 2003, 1:02:22 AM3/5/03

to

"Dale King" <Dale[dot]Ki...@thomson.net> wrote in message

news:3e65...@news.tce.com...

news:3e65...@news.tce.com...

> That definitely is not canonical. The codes for all nodes at a given depth

> should increment by 1.

Do you mean the lengths or what?

Mar 5, 2003, 12:29:29 PM3/5/03

to

"Andrey Savov" <sa...@hotmail.com> wrote in message

news:Opg9a.2733$5x4...@news1.west.cox.net...

No, the codes for all symbols of the same length should form a linear

sequence changin by one. Taking the bottom of Gareth's example:

> symbol: 74 length: 6 code: 000001

> symbol: 61 length: 7 code: 0000111

> symbol: 63 length: 7 code: 0000100

> symbol: 6e length: 7 code: 0000001

> symbol: 71 length: 7 code: 0000110

> symbol: 76 length: 7 code: 0000101

> symbol: 78 length: 8 code: 00000001

> symbol: 6d length: 9 code: 000000001

> symbol: 70 length: 9 code: 000000000

Notice that at depth 7 the code 0000010 is not present and that is used for

the length 6 code. Thereforre this tree cannot have the Canonical shape of

all branch nodes to the left side and all leaf nodes to the right. Those

numbers mean that the leaf node for symbol 74 is in the middle of the ones

for length 7 symbols.

A canonical code would be

symbol: 74 length: 6 code: 000011

symbol: 61 length: 7 code: 0000001

symbol: 63 length: 7 code: 0000010

symbol: 6e length: 7 code: 0000011

symbol: 71 length: 7 code: 0000100

symbol: 76 length: 7 code: 0000101

symbol: 78 length: 8 code: 00000001

symbol: 6d length: 9 code: 000000000

symbol: 70 length: 9 code: 000000001

It is this simple pattern to the codes that makes it so easy to translate

code lengths. The pattern is much easier to see if sort by length ascending

and symbol value descending.

symbol: 74 length: 6 code: 000011

symbol: 76 length: 7 code: 0000101

symbol: 71 length: 7 code: 0000100

symbol: 6e length: 7 code: 0000011

symbol: 63 length: 7 code: 0000010

symbol: 61 length: 7 code: 0000001

symbol: 78 length: 8 code: 00000001

symbol: 70 length: 9 code: 000000001

symbol: 6d length: 9 code: 000000000

Basically each code is "one" greater than the one below it. The meaning of

"one" changes as you change levels.

For more info see http://www.compressconsult.com/huffman/

Mar 5, 2003, 5:14:25 PM3/5/03

to

> No, the codes for all symbols of the same length should form a linear

> sequence changin by one. Taking the bottom of Gareth's example:

> sequence changin by one. Taking the bottom of Gareth's example:

> Basically each code is "one" greater than the one below it. The meaning of

> "one" changes as you change levels.

Right, I think that's clearer in my mind now. The tree is only used to

find the length of the codes. Once this is known, all symbols that

have codes of the same length are sorted alphabeticaly, and then

incremented according to their order.

So for example symbols A, B, C and D have code lengths of 4, their

codes will be

A: 0000

B: 0001

C: 0010

D: 0011

And say symbols E, F, G, and H have code lengths of 3, their codes wil

be :

E: 000

F: 001

G: 010

H: 011

I still don't see how to decode this though. If the encoded bit stream

starts with '000', how does the decoder know if this is E, or if it is

part of A?

Mar 5, 2003, 11:14:20 PM3/5/03

to

"Dale King" <Dale[dot]Ki...@thomson.net> wrote in message

news:3e66...@news.tce.com...> > > That definitely is not canonical. The codes for all nodes at a given

> depth

> > > should increment by 1.

> >

> > Do you mean the lengths or what?

>

> No, the codes for all symbols of the same length should form a linear

> sequence changin by one. Taking the bottom of Gareth's example:

I don't think that's the definition of canonical. Canonical comes from

Schwartz, E. S. and Kallick, B. 1964. Generating a canonical prefix

encoding. Communications of the ACM 7, 166--169.

and has nothing to do with the actual code values, but rather the code tree,

which only determines the code-lengths. The code assignments are certainly

not unique within a code tree.

I understand that there is a canonical code assignment, which is probably

what you refer to, but that doesn't mean his codes are not canonical. After

all code assignments really don't matter once you've fixed the lengths.

Mar 6, 2003, 4:58:31 AM3/6/03

to

Gareth wrote:

)> No, the codes for all symbols of the same length should form a linear

)> sequence changin by one. Taking the bottom of Gareth's example:

)

)> Basically each code is "one" greater than the one below it. The meaning of

)> "one" changes as you change levels.

)

) Right, I think that's clearer in my mind now. The tree is only used to

) find the length of the codes. Once this is known, all symbols that

) have codes of the same length are sorted alphabeticaly, and then

) incremented according to their order.

) So for example symbols A, B, C and D have code lengths of 4, their

) codes will be

)

) A: 0000

) B: 0001

) C: 0010

) D: 0011

)

) And say symbols E, F, G, and H have code lengths of 3, their codes wil

) be :

)

) E: 000

) F: 001

) G: 010

) H: 011

)> No, the codes for all symbols of the same length should form a linear

)> sequence changin by one. Taking the bottom of Gareth's example:

)

)> Basically each code is "one" greater than the one below it. The meaning of

)> "one" changes as you change levels.

)

) Right, I think that's clearer in my mind now. The tree is only used to

) find the length of the codes. Once this is known, all symbols that

) have codes of the same length are sorted alphabeticaly, and then

) incremented according to their order.

) So for example symbols A, B, C and D have code lengths of 4, their

) codes will be

)

) A: 0000

) B: 0001

) C: 0010

) D: 0011

)

) And say symbols E, F, G, and H have code lengths of 3, their codes wil

) be :

)

) E: 000

) F: 001

) G: 010

) H: 011

You don't reset the counter, you just chop the last bit off it.

This means that after D, the counter would be 0100, and you get:

E: 010

F: 011

G: 100

H: 101

And if you have I with codelength 2, you finish with:

I: 11

The tree was (fixed font):

I

/ /H

x-x

/ \G

/

x /F

\ x

\ / \E

x /D

\ x

\ / \C

x

\ /B

x

\A

P.S. The 'last' bits you chop off should always be 0, otherwise you don't

have a valid tree.

SaSW, Willem (at stack dot nl)

--

Disclaimer: I am in no way responsible for any of the statements

made in the above text. For all I know I might be

drugged or something..

No I'm not paranoid. You all think I'm paranoid, don't you !

#EOT

Mar 6, 2003, 3:09:58 PM3/6/03

to

"Andrey Savov" <sa...@hotmail.com> wrote in message

news:wWz9a.363$hW6...@news1.west.cox.net...> "Dale King" <Dale[dot]Ki...@thomson.net> wrote in message

> news:3e66...@news.tce.com...

> > > > That definitely is not canonical. The codes for all nodes at a given

> > depth

> > > > should increment by 1.

> > >

> > > Do you mean the lengths or what?

> >

> > No, the codes for all symbols of the same length should form a linear

> > sequence changin by one. Taking the bottom of Gareth's example:

>

> I don't think that's the definition of canonical. Canonical comes from

It is not the definition of canonical, it is a direct consequence of the

definition of canonical.

> Schwartz, E. S. and Kallick, B. 1964. Generating a canonical prefix

> encoding. Communications of the ACM 7, 166--169.

>

>

> and has nothing to do with the actual code values, but rather the code

tree,

> which only determines the code-lengths. The code assignments are

certainly

> not unique within a code tree.

I don't have access to that specific paper, but I have read other papers

that cite that paper. I certainly agree that the assignment of codes does

not have to be unique. The most important characteristic of a canonical

Huffman tree is its shape. You can reorder the leaf nodes on any level and

still have a canonically shaped tree. Although it is usual to also order the

leaves on a given level sorted by symbol so that all you need to know to

recreate the tree is the length of the code for each symbol.

If you look at this paper which cites the one you mentioned (under the

section labeled Method B):

D. S. Hirschberg and D. A. Lelewer. Efficient Decoding of Prefix Codes.

Communications of the ACM, 33(4): 449--459, 1990.

http://citeseer.nj.nec.com/hirschberg90efficient.html

You will see that they reiterate what I said:

"The canonical code posesses some nice mathematical properties. The

codewords of a given length are consecutive binary numbers .... In other

words, the first codeword of length l is obtained from the last codeword of

length l - 1 by adding 1 to the binary number represented by d(l-1) and

shifting that binary number left once... <lengthy explanation of what to do

when a length is unused snipped>... We say that a canonical code has the

numerical sequence property"

> I understand that there is a canonical code assignment, which is probably

> what you refer to, but that doesn't mean his codes are not canonical.

After

> all code assignments really don't matter once you've fixed the lengths.

The issue is not how he assigned symbols to codes, but how codes were

assigned to each length. In order to be canonical it must have the numerical

sequence property. If it does not then it cannot be a canonically shaped

tree.

Look at the portion of the codes I referred to (I eliminated the symbol

values as the issue is not which symbol goes with which code):

> length: 6 code: 000001

> length: 7 code: 0000111

> length: 7 code: 0000100

> length: 7 code: 0000001

> length: 7 code: 0000110

> length: 7 code: 0000101

Let's shorten this tree as all start with 0000 and just look at that

subtree.:

> length: 2 code: 01

> length: 3 code: 111

> length: 3 code: 100

> length: 3 code: 001

> length: 3 code: 110

> length: 3 code: 101

If you view the following tree representation with a fixed width font you

will see that it does not have a canonical shape. That length 6 (or 2) code

is stuck in between nodes of greater depth. That was why I said it was not

canonical.

O

/ \

O O

/ \ / \

O X O O

/ \ / \ / \

O X X X X X

The fact that it is not canonical can be seen from the fact that it does not

have the numerical sequence property. There is a length 7 code of 0000001

and one of 0000100 but the values in-between are not assigned to a length 7

code. Therefore it cannot have a canonical shape.

Mar 6, 2003, 6:15:05 PM3/6/03

to

> > A: 0000

> > B: 0001

> > C: 0010

> > D: 0011

> > B: 0001

> > C: 0010

> > D: 0011

> You don't reset the counter, you just chop the last bit off it.

> This means that after D, the counter would be 0100, and you get:

>

> E: 010

> F: 011

> G: 100

> This means that after D, the counter would be 0100, and you get:

>

> E: 010

> F: 011

> G: 100

You say that the chopped off bit should always be '0'? What if there

were only 2 symbols with length 3, then F would be the last symbol of

length 3 and '1' would be stripped off. So where has it gone wrong?

Mar 7, 2003, 6:19:41 AM3/7/03

to

Gareth wrote:

)> > A: 0000

)> > B: 0001

)> > C: 0010

)> > D: 0011

)> You don't reset the counter, you just chop the last bit off it.)> > A: 0000

)> > B: 0001

)> > C: 0010

)> > D: 0011

)> This means that after D, the counter would be 0100, and you get:

)>

)> E: 010

)> F: 011

)> G: 100

)

) You say that the chopped off bit should always be '0'? What if there

) were only 2 symbols with length 3, then F would be the last symbol of

) length 3 and '1' would be stripped off. So where has it gone wrong?

If there were 2 symbols of length 3, then the counter would be 100, which

becomes 10.

I assume you mean if there were 3 symbols of length 3. Well there can't be.

There's always an even number of symbols of any length, except the shortest

length. If there isn't, you're wasting encoding space.

Mar 7, 2003, 11:53:47 AM3/7/03

to

"Willem" <n...@sp.am.invalid> wrote in message

news:slrnb6h02...@toad.stack.nl...

> Gareth wrote:

> )> > A: 0000

> )> > B: 0001

> )> > C: 0010

> )> > D: 0011

> )> You don't reset the counter, you just chop the last bit off it.

> )> This means that after D, the counter would be 0100, and you get:

> )>

> )> E: 010

> )> F: 011

> )> G: 100

> )

> ) You say that the chopped off bit should always be '0'? What if there

> ) were only 2 symbols with length 3, then F would be the last symbol of

> ) length 3 and '1' would be stripped off. So where has it gone wrong?

>

> If there were 2 symbols of length 3, then the counter would be 100, which

> becomes 10.

> I assume you mean if there were 3 symbols of length 3. Well there can't

be.

> There's always an even number of symbols of any length, except the

shortest

> length. If there isn't, you're wasting encoding space.

news:slrnb6h02...@toad.stack.nl...

> Gareth wrote:

> )> > A: 0000

> )> > B: 0001

> )> > C: 0010

> )> > D: 0011

> )> You don't reset the counter, you just chop the last bit off it.

> )> This means that after D, the counter would be 0100, and you get:

> )>

> )> E: 010

> )> F: 011

> )> G: 100

> )

> ) You say that the chopped off bit should always be '0'? What if there

> ) were only 2 symbols with length 3, then F would be the last symbol of

> ) length 3 and '1' would be stripped off. So where has it gone wrong?

>

> If there were 2 symbols of length 3, then the counter would be 100, which

> becomes 10.

> I assume you mean if there were 3 symbols of length 3. Well there can't

be.

> There's always an even number of symbols of any length, except the

shortest

> length. If there isn't, you're wasting encoding space.

Huh? There's always an even number of symbols of the longest length, but

there is no restriction on other lengths. Other lengths can be odd or even.

Consider the worst case for a Huffman tree when the frequencies are a

Fibonacci sequence:

1 1 2 3 5 8 13

Draw the Huffman tree and you will see you get 2 symbols of the longest

length and only one of every other length.

Or cosider the frequencies:

1 1 1 1 3 3

That gives you 4 symbols of the longest length and 2 of the shortest.

Mar 7, 2003, 12:11:05 PM3/7/03

to

> > > A: 0000

> > > B: 0001

> > > C: 0010

> > > D: 0011

> > You don't reset the counter, you just chop the last bit off it.

> > This means that after D, the counter would be 0100, and you get:

> >

> > E: 010

> > F: 011

> > G: 100

>

> You say that the chopped off bit should always be '0'? What if there

> were only 2 symbols with length 3, then F would be the last symbol of

> length 3 and '1' would be stripped off. So where has it gone wrong?

> > > B: 0001

> > > C: 0010

> > > D: 0011

> > You don't reset the counter, you just chop the last bit off it.

> > This means that after D, the counter would be 0100, and you get:

> >

> > E: 010

> > F: 011

> > G: 100

>

> You say that the chopped off bit should always be '0'? What if there

> were only 2 symbols with length 3, then F would be the last symbol of

> length 3 and '1' would be stripped off. So where has it gone wrong?

You forgot to increment it first. It works fine you increment from F's

symbol 011 and get 100 then strip off the 0 to get 10, just like we did to

go from D to E

To answer your next question about what would happen if you had E, F, and G

were three bits and you had H as 2 bits, the answer is that it cannot happen

as that would not be a complete binary tree (it would have an internal node

with only 1 child). There is no way you strip off a 1 if the tree is a

complete binary tree, which is a necesary condition for a tree to be a

Huffman tree.

Mar 7, 2003, 12:28:04 PM3/7/03

to

> I assume you mean if there were 3 symbols of length 3. Well there can't be.

> There's always an even number of symbols of any length, except the shortest

> length. If there isn't, you're wasting encoding space.

> There's always an even number of symbols of any length, except the shortest

> length. If there isn't, you're wasting encoding space.

OK, the fog is lifting slowly. The most important thing is to make

sure that the tree is built properly. So say there are 6 symbols with

frequencies 1, 2, 4, 5, 6, and 8. Building a huffman tree by the

normal rules, ie, adding the 2 lowest frequencies together to form a

new node, I get the following tree :

8 1

/ /

/ 3

15 / \

/ \ / 2

/ 7

/ \

26 4

\

\ 5

\ /

11

\

6

This tree gives codes of length :

Symbol Length

1 4

2 4

4 3

5 2

6 2

8 2

This is wrong because there are an uneven number of codes of length

two and three. So what is the correct way to build this tree?

Mar 7, 2003, 2:32:43 PM3/7/03

to

"Dale King" <Dale[dot]Ki...@thomson.net> wrote in message news:<3e67...@news.tce.com>...

[explanation snipped]

Ah, yes -- the numerical sequence property! I stand corrected.

Thanks for pointing that out Dale.

Mar 7, 2003, 3:03:34 PM3/7/03

to

I've utilized a code similar to followin to generate de bit

configuration for a canonical huffman, gived the size of

each codes:

configuration for a canonical huffman, gived the size of

each codes:

/////////////////////////////////////////////////////////////

int codeSize[] = { 2, 2, 2, 3, 4, 4 };

#define srcCount (sizeof(codeSize) / sizeof(int))

int codeConf[srcCount];

...

int i, actualCodeSize=0, actualCodeConf=0;

for(i=0; i<srcCount; i++)

{

while(actualCodeSize < codeSize[i])

{

actualCodeConf *= 2;

actualCodeSize++;

}

codeConf[i] = actualCodeConf++;

}

/////////////////////////////////////////////////////////////

In this code, 'codeSize' array contain the lenght of the

code of each symbol. At output, codeConf array contain an

integer that represent the bit configuration. As an example,

to printout this bit configuration i've utilized following

code:

/////////////////////////////////////////////////////////////

// Printout bit configuration

char res[20];

for(i=0; i<srcCount; i++)

{

int useSize = codeSize[i], useConf = codeConf[i];

res[useSize] = 0;

for(int j=useSize; j; j--)

{

res[j - 1] = (useConf & 1)? '1': '0';

useConf /= 2;

}

printf("%d=%s\t", codeSize[i], res);

}

/////////////////////////////////////////////////////////////

Hope this usefull.

Federico

Mar 9, 2003, 6:11:52 PM3/9/03

to

> int codeSize[] = { 2, 2, 2, 3, 4, 4 };

> #define srcCount (sizeof(codeSize) / sizeof(int))

> int codeConf[srcCount];

>

> ...

>

> int i, actualCodeSize=0, actualCodeConf=0;

> for(i=0; i<srcCount; i++)

> {

> while(actualCodeSize < codeSize[i])

> {

> actualCodeConf *= 2;

> actualCodeSize++;

> }

> codeConf[i] = actualCodeConf++;

> }

> #define srcCount (sizeof(codeSize) / sizeof(int))

> int codeConf[srcCount];

>

> ...

>

> int i, actualCodeSize=0, actualCodeConf=0;

> for(i=0; i<srcCount; i++)

> {

> while(actualCodeSize < codeSize[i])

> {

> actualCodeConf *= 2;

> actualCodeSize++;

> }

> codeConf[i] = actualCodeConf++;

> }

I can't see what this code does. I assume a huffman tree has already

been used to fill the codeSize array. In the while loop is the line

actualCodeConf *= 2;

Since actualCodeConf is initialised to zero the above line does

nothing so no matter how many times you go round the loop

actulaCodeConf will always be zero.

> char res[20];

> for(i=0; i<srcCount; i++)

> {

> int useSize = codeSize[i], useConf = codeConf[i];

> res[useSize] = 0;

> for(int j=useSize; j; j--)

> {

> res[j - 1] = (useConf & 1)? '1': '0';

> useConf /= 2;

> }

> printf("%d=%s\t", codeSize[i], res);

> }

Could you tell me what the line

res[j - 1] = (useConf & 1)? '1': '0';

does. It is only the bit after the = that I don't understand.

> Hope this usefull.

>

> Federico

Mar 9, 2003, 10:37:10 PM3/9/03

to

> > int codeSize[] = { 2, 2, 2, 3, 4, 4 };

> > #define srcCount (sizeof(codeSize) / sizeof(int))

> > int codeConf[srcCount];

> >

> > ...

> >

> > int i, actualCodeSize=0, actualCodeConf=0;

> > for(i=0; i<srcCount; i++)

> > {

> > while(actualCodeSize < codeSize[i])

> > {

> > actualCodeConf *= 2;

> > actualCodeSize++;

> > }

> > codeConf[i] = actualCodeConf++;

> > }

>

> I can't see what this code does. I assume a huffman tree has already

> been used to fill the codeSize array. In the while loop is the line

>

> actualCodeConf *= 2;

>

> Since actualCodeConf is initialised to zero the above line does

> nothing so no matter how many times you go round the loop

> actulaCodeConf will always be zero.

>

...> > #define srcCount (sizeof(codeSize) / sizeof(int))

> > int codeConf[srcCount];

> >

> > ...

> >

> > int i, actualCodeSize=0, actualCodeConf=0;

> > for(i=0; i<srcCount; i++)

> > {

> > while(actualCodeSize < codeSize[i])

> > {

> > actualCodeConf *= 2;

> > actualCodeSize++;

> > }

> > codeConf[i] = actualCodeConf++;

> > }

>

> I can't see what this code does. I assume a huffman tree has already

> been used to fill the codeSize array. In the while loop is the line

>

> actualCodeConf *= 2;

>

> Since actualCodeConf is initialised to zero the above line does

> nothing so no matter how many times you go round the loop

> actulaCodeConf will always be zero.

>

There are two loops, the outer for and the inner while. actualCodeConf was

initialized to zero prior to the loops, and at the end of the for loop is:

codeConf[i] = actualCodeConf++;

which will use the current value of actualCodeConf in the assignment to

codeConf[i] and then increment it by one.

Mar 10, 2003, 6:29:59 AM3/10/03

to

Dale wrote:

) Huh? There's always an even number of symbols of the longest length, but

) there is no restriction on other lengths. Other lengths can be odd or even.

) Consider the worst case for a Huffman tree when the frequencies are a

) Fibonacci sequence:

) Huh? There's always an even number of symbols of the longest length, but

) there is no restriction on other lengths. Other lengths can be odd or even.

) Consider the worst case for a Huffman tree when the frequencies are a

) Fibonacci sequence:

Oops, my bad. I didnt't really think this through.

What I meant is that the total number of nodes plus leaves at each depth is

always even. Obviously there can be an odd number of nodes at any but the

deepest level, so there can also be an odd number of leaves.

Mar 10, 2003, 6:47:31 AM3/10/03

to

> I can't see what this code does. I assume a huffman tree has already

> been used to fill the codeSize array. In the while loop is the line...Yes. The **lenghts** of each code are into 'codeSize' array as output of

a huffman tree construction stage.

And, as Glenn say, the while loop make nothing over 'actualCodeConf', but

only in first iteration of 'for' loop. After first, the while loop 'make

room' for extended bits in 'actualCodeConf'.

> Could you tell me what the line

>

> res[j - 1] = (useConf & 1)? '1': '0';

> does. It is only the bit after the = that I don't understand.

I want to make a printable version of each configuration for print it.

One method is to make the string of 0's and 1's from right to left,

'printing' the value of bit 0 (useConf & 1) and eliminating bit 0 after,

with (useConf/2).

Federico

Mar 10, 2003, 4:39:39 PM3/10/03

to

As I already established there is nothing wrong with that. The longest

length must have an even number of nodes, but you can't tell anything by

whether the number of nodes at other levels is odd or even.

> So what is the correct way to build this tree?

There's nothing wrong with your tree. Following the procedure for assigning

codes to give a Canonical code would give you codes of:

Symbol Length

1 0000

2 0001

4 001

5 01

6 10

8 11

Mar 10, 2003, 4:44:52 PM3/10/03

to

"Willem" <n...@sp.am.invalid> wrote in message

news:slrnb6otp...@toad.stack.nl...> Dale wrote:

> ) Huh? There's always an even number of symbols of the longest length, but

> ) there is no restriction on other lengths. Other lengths can be odd or

even.

> ) Consider the worst case for a Huffman tree when the frequencies are a

> ) Fibonacci sequence:

>

> Oops, my bad. I didnt't really think this through.

> What I meant is that the total number of nodes plus leaves at each depth

is

> always even. Obviously there can be an odd number of nodes at any but the

> deepest level, so there can also be an odd number of leaves.

Which is easy to see because that number must be 2 times the number of

internal nodes at the previous level. Not sure that is really helpful in

this analysis.

Mar 10, 2003, 5:20:45 PM3/10/03

to

Thanks for your explantions, I understand what it is doing now. I have

modified Frederico's code so that it works with mine. The output I get

for the code lengths in Fredrico's example is :

modified Frederico's code so that it works with mine. The output I get

for the code lengths in Fredrico's example is :

00 length : 2

01 length : 2

10 length : 2

110 length : 3

1110 length : 4

1111 length : 4

This looks correct to me, the symbols with the same code length

increment for each instance.

Now for a longer example :-) Again, for a given code length, the code

increments for each instance :

000 length : 3

001 length : 3

0100 length : 4

0101 length : 4

0110 length : 4

0111 length : 4

1000 length : 4

1001 length : 4

1010 length : 4

1011 length : 4

11000 length : 5

11001 length : 5

11010 length : 5

11011 length : 5

11100 length : 5

11101 length : 5

111100 length : 6

1111010 length : 7

1111011 length : 7

1111100 length : 7

1111101 length : 7

1111110 length : 7

11111110 length : 8

111111110 length : 9

111111111 length : 9

Now the decoder only knows how long each symbol's code is. I thought

the leading '1's would give a good idea of which code length the

current code is going to be, but this doesn't always work. Codes of

length 5 start with 2 leading '1's but then after a while this becomes

3. I'm still struggling to see how the decoding works.

> > res[j - 1] = (useConf & 1)? '1': '0';

>

> I want to make a printable version of each configuration for print it.

> One method is to make the string of 0's and 1's from right to left,

> 'printing' the value of bit 0 (useConf & 1) and eliminating bit 0 after,

> with (useConf/2).

Thanks, I understand what (useConf & 1) and (useConf/2) are doing, it

was the

? '1': '0';

that I didn't get. I assume it is doing somthing like "if what's in

the brackets is true, then '1', else, '0' "

I've just never seen this syntax in C before.

Mar 11, 2003, 9:39:48 AM3/11/03

to

Gareth,

I don't think that the start 1 or 0 must, by need, follow any rule. It

depend strictly of your codelen configuration...

If you need a fast decoder, the best way is construct a struct's tree,

where each struct have a pointer to one's and zero's branch. A flag

can indicate if you're at a leaf and force stop tree descending.

Something as:

///////////////////////////////////////////////////////////////

struct CodeSymbol

{

CodeSymbol *oneBranch, *zeroBranch;

int actualSymbol; // -1 flag not leaf, >=0 is leaf

}

///////////////////////////////////////////////////////////////

So, gived a root, you walk too fast throug tree, decoding input:

///////////////////////////////////////////////////////////////

CodeSymbol *actual = treeRoot;

while(actual->actualSymbol < 0)

{

bool bit = GetInputBit(); // Read only 1 bit

actual = bit? actual->oneBranch: actual->zeroBranch;

}

// here, actual->actualSymbol is the decoded symbol

///////////////////////////////////////////////////////////////

Another:

The ? and : operator came from first hour! It was defined by

Kerningan and Ritchie in their first C's implementation. And, yes

it can be traduced to something as:

x = (a)? (b): (c)

if(a) x=b; else x=c;

Federico

Mar 11, 2003, 11:35:36 AM3/11/03

to

"Federico" <fvi...@alberdijujuy.com.ar> wrote in message

news:a88326b0.0303...@posting.google.com...

> Gareth,

>

> I don't think that the start 1 or 0 must, by need, follow any rule. It

> depend strictly of your codelen configuration...

>

> If you need a fast decoder, the best way is construct a struct's tree,

news:a88326b0.0303...@posting.google.com...

> Gareth,

>

> I don't think that the start 1 or 0 must, by need, follow any rule. It

> depend strictly of your codelen configuration...

>

> If you need a fast decoder, the best way is construct a struct's tree,

No, no, no! If you want a fast decoder you don't want a tree at all. You

want look up tables. Anything that looks one bit at a time and decides to

branch based on that single bit is dead slow.

For some papers on this subject see:

http://www.cs.mu.oz.au/~alistair/abstracts/mt97:ieeecomm.html

http://citeseer.nj.nec.com/hirschberg90efficient.html

http://citeseer.nj.nec.com/brodnik98sublinear.html

Mar 11, 2003, 5:35:54 PM3/11/03

to

> If you need a fast decoder, the best way is construct a struct's tree,

> where each struct have a pointer to one's and zero's branch. A flag

> can indicate if you're at a leaf and force stop tree descending.

> where each struct have a pointer to one's and zero's branch. A flag

> can indicate if you're at a leaf and force stop tree descending.

Yes that's the way i got my plain huffman coder working. The reason I

am going for cannonical huffman coding is that my decoder must use

minimal memory. Speed is not an issue. Rebuilding the tree takes up

ram so i want to avoid having to do that.

Mar 13, 2003, 1:35:32 PM3/13/03

to

"Gareth" <the_a...@hotmail.com> wrote in message

news:370afe3d.03031...@posting.google.com...Then you also want a look up table approach. I too had the limitation of

minimal memory. What I do is store the an array of all nodes ordered from

top to bottom, left-to-right. Then all you need is another array with one

element per depth of the tree that gives how many symbols are at each depth.

The code I generate is the mirror image of yours: So taking your longer

example and using the mirror image:

110 length : 3

111 length : 3

0100 length : 4

0101 length : 4

0110 length : 4

0111 length : 4

1000 length : 4

1001 length : 4

1010 length : 4

1011 length : 4

00010 length : 5

00011 length : 5

00100 length : 5

00101 length : 5

00110 length : 5

00111 length : 5

000011 length : 6

0000001 length : 7

0000010 length : 7

0000011 length : 7

0000100 length : 7

0000101 length : 7

00000001 length : 8

000000000 length : 9

000000001 length : 9

You would have an array of the nodes in order from top to bottom in the

above list. In addition you would have another array that is the number of

symbols on each level:

unsigned tree[] = { 0, 0, 2, 8, 6, 1, 5, 1, 2 }

Here is some code that should extract the next symbol. This probably has

errors as I had to strip out a bunch of other things that only matter to my

application (I have two different sizes of "symbols" and I store the tree

within the array of nodes instead of separately):

int nodes= 0;

int index = 0;

int treeIndex = 0;

int code = 0;

int base = 1;

do

{

code <<= 1;

code |= nextBitOfData();

index += nodes;

nodes = tree[ treeIndex++ ];

base <<= 1;

base -= nodes;

} while( code < base );

index += code - base;

// index is the index of the next symbol.

The code is easier to follow if you step through it looking at an actual

tree. base is the code for the first leaf node on the given level. nodes is

the number of symbols on the level. For the above tree the values of base at

the end of each iteration will be:

10

100

110

100

10

11

1

1

0

Which corresponds to the code for the first symbol at each depth. The first

two will fail at the end of the loop as they have more bits than extracted.

To make this faster you would use multiple bits at once to index a table at

the expense of more memory.

Mar 16, 2003, 4:07:25 PM3/16/03

to

> The code I generate is the mirror image of yours: So taking your longer

> example and using the mirror image:

>

> 110 length : 3

> 111 length : 3

>.> example and using the mirror image:

>

> 110 length : 3

> 111 length : 3

>.

>.

> 000000000 length : 9

> 000000001 length : 9

>

> You would have an array of the nodes in order from top to bottom in the

> above list. In addition you would have another array that is the number of

> symbols on each level:

>

> unsigned tree[] = { 0, 0, 2, 8, 6, 1, 5, 1, 2 }

You start with the code and thier lengths, and then another array with

the number of symbols on each level. But where do theses come from?

Are thay sent in the compressed file? I thought the point of canonical

huffman coding was that you could decompress knowing only the length

of the code for each symbol. passing arrays along with the compressed

data is going to greatly reduce the compression ratio of smaller

files.

> Here is some code that should extract the next symbol. This probably has

> errors as I had to strip out a bunch of other things that only matter to my

> application (I have two different sizes of "symbols" and I store the tree

> within the array of nodes instead of separately):

Here you mention a tree, but where has it come from? Plain huffman

codeing would be easy to decode and use minimal memory if you sent the

whole tree with the compressed data but the tree will greatly increase

the size of the compressed data.

The light at the end of the tunnel that is my complete understanding

of canonical huffman coding is getting brighter! Don't give up on me!!

Thanks,

Gareth

Reply all

Reply to author

Forward

0 new messages