Data compression for dictionary

caprice

unread,

Aug 26, 2003, 11:32:00 AM8/26/03

to

Hi all, I'm a newbie here.
I'm going to develop a Pocket PC dictionary, but I have no idea about
the database compression.
I've been searching on the internet for many days but there are too
many different kind of methods and papers which confused me.

Can anyone suggest one or two good compression for dictionary
database?

Thanks alot!

caprice

unread,

Aug 26, 2003, 11:32:02 AM8/26/03

to

Dr Chaos

unread,

Aug 26, 2003, 3:17:11 PM8/26/03

to

A prefix tree, since dictionary entries usually come sorted,
and many words in a dictionary are prefixes of others.

> Thanks alot!

caprice

unread,

Aug 27, 2003, 2:27:05 PM8/27/03

to

> A prefix tree, since dictionary entries usually come sorted,
> and many words in a dictionary are prefixes of others.

Thanks for advise!

How about using LZW then use prefix tree searching algorithm?
Since many papers said that LZW is the best compression for text data.

Btw, is prefix tree a compression algorithm or searching algorithm? (or both?)

Thanks in advance!

Regards,
Sara

Dr Chaos

unread,

Aug 27, 2003, 5:15:40 PM8/27/03

to

On 27 Aug 2003 11:27:05 -0700, caprice <sara_...@hotmail.com> wrote:
>> A prefix tree, since dictionary entries usually come sorted,
>> and many words in a dictionary are prefixes of others.
>
> Thanks for advise!
>
> How about using LZW then use prefix tree searching algorithm?
> Since many papers said that LZW is the best compression for text data.

No, it isn't even remotely so. The markov-modeling methods combined
with arithmetic coders are the best performing at the moment.

But typically to do LZW or LZ78 compression algorithms, it is
common (but not essential) to build a prefix tree to implement them.

> Btw, is prefix tree a compression algorithm or searching algorithm? (or both?)

It is a data structure which supports efficient algorithms to perform
queries.

Once you construct one, it is easy to search whether any given word
is in the structure or not.

The point is that if you store a dictionary it is usually in alphabetical
order, and you want to search it in alphabetical order.

You do not have to compress arbitrary sequences of words in arbitrary
orders, and thus you can gain.

Then you need an efficient way to serially encode a prefix tree
but that should not be hard at all.

Raymond Wan

unread,

Aug 28, 2003, 12:32:35 AM8/28/03

to

On Wed, 27 Aug 2003, Dr Chaos wrote:
> On 27 Aug 2003 11:27:05 -0700, caprice <sara_...@hotmail.com> wrote:
> >> A prefix tree, since dictionary entries usually come sorted,
> >> and many words in a dictionary are prefixes of others.
> >
> > Thanks for advise!
> >
> > How about using LZW then use prefix tree searching algorithm?
> > Since many papers said that LZW is the best compression for text data.
>
> No, it isn't even remotely so. The markov-modeling methods combined
> with arithmetic coders are the best performing at the moment.

Yes, quite true; in fact, to the OP, what papers actually said
that LZW is the best compression for text data? I'd be interested to take
a look at one or two, just for interest sake...

Ray

caprice

unread,

Aug 28, 2003, 11:15:30 AM8/28/03

to

So you mean simply construct a prefix tree is good enough for a
dictionary?
Do you know where can I get the information or theory? (e.g. web
sites) Since I can hardly find them on internet...

Thanks very much!

Dr Chaos

unread,

Aug 28, 2003, 2:26:47 PM8/28/03

to

On 28 Aug 2003 08:15:30 -0700, caprice <sara_...@hotmail.com> wrote:
> So you mean simply construct a prefix tree is good enough for a
> dictionary?

To do what?

Generally you have to say what words are in the dictionary, write
algorithms to search it, and write algorithms to serialize and
deserialize it efficiently if you want to 'compress' a dictionary.

You may need a slight modifcation to a prefix tree.

In a prefix tree, for every word which is in a tree, all prefixes of
it are also in a tree. Like if ABC is in a tree then AB and A is in a
tree. Think of a tree with nodes descending for each extant character
successively forward in sequence. (a suffix tree goes backwards).

In a list of words like a dictionary of natural language, it is common
for many words which are in the dictionary to have prefixes which are
also in the dictionary, but it is not always so.

You will probably need additional indicators, maybe a boolean flag per
node to say 'yes this element in the tree is also in the dictionary'.

Thus the nodes of the prefix tree will be a superset of the words
in the dictionary.

The hope is that given the empirical structure of dictionaries of
natural language, this excess will not be too significant.

> Do you know where can I get the information or theory? (e.g. web
> sites) Since I can hardly find them on internet...

I think you may need a textbook on algorithms. A prefix tree is
fairly simple.

There are probably lecture notes on line which may get you started but
a real book is better.

> Thanks very much!

OlEnSh

unread,

Aug 29, 2003, 12:21:13 PM8/29/03

to

try the 'data compression reference book' by david salomon - very
comprehensive.
but, first, read the latest papers from ieee and acm regarding the subject.
they'll give you an idea of what's the 'best'. don't just randomly search
the internet. you'll end up with a loadful of confusion.

"caprice" <sara_...@hotmail.com> wrote in message
news:dd26c21.03082...@posting.google.com...

OlEnSh

unread,

Aug 29, 2003, 12:22:42 PM8/29/03

to

> Yes, quite true; in fact, to the OP, what papers actually said
> that LZW is the best compression for text data? I'd be interested to take
> a look at one or two, just for interest sake...
>

well, it's faster compared to some other techniques and is commonly used.
maybe that's what is meant by 'best'. 'best' is pretty subjective.

caprice

unread,

Aug 30, 2003, 10:39:08 AM8/30/03

to

Thanks all! :)

I've borrowed the book today and started to read.

Actually I'm really not familiar with those compression technique and
searching algorithm.
From your professional point of view, for a dictionary, is that
different algorithms make a big difference on performance? (in terms
of searching rate)
As I have to include background study in my report (school
assignment), I wonder if I should study and compare all kinds of
algorithms!

pardon me my poor English..