compression of tiny strings using a custom dictionary

Oliver Mattos

unread,

Jan 3, 2010, 7:55:41 PM1/3/10

to

Hi,

I'm looking for an easy method to compress small strings (about 200
bytes), using a custom dictionary. I'm using the custom dictionary
because most of the strings I'm compressing contain some common
phrases, and therefore putting these in a custom dictionary would
reduce the output size considerably. The dictionary would be static
and available to both the compressor and decompressor, and would not
need to be stored as part of the compressed data.

I want to achieve this using standard linux tools or sample code - I'd
prefer not to rewrite gzip or do much software development.

My idea for an easy way to do this was:

1. set up a standard compression tool as stream compressor.
2. pipe in my dictionary, and discard the output - this should fill
all internal buffers with the dictionary words.
3. next pipe in my data to compress, and store the output.

And to decompress:
* set up a compression tool, and stream in a dictionary, and capture
the output (this is identical to what was discarded during
compression)
* pipe this output into a decompression tool, followed by the
compressed data.
* the decompression tools output will be the dictionary concatenated
with the uncompressed data - seperate them and you're done.

In theory, this works, in practice the "stream compressor" seems to
work on blocks of data which are way larger than my dictionary or my
data, so I get no output during stage 2 of compression, and a block of
random data during stage 3, which has nothing in common with other
stage3 outputs with different input data.

Has anyone any ideas?

Oliver Mattos

unread,

Jan 3, 2010, 8:14:16 PM1/3/10

to

> Has anyone any ideas?

Just in case this is an easy task and I get hundreds of responses,
lets make this a little competition. Compress the following strings
(one per line) independantly:

<test><data>1123444</data>hello world <abcde></abcde></test>

<test>qwerty 7723444 more data <abcde>hello world</abcde></test>

You are allowed the following as a dictionary (this string is
available during both compression and decompression):

<test><data>qwerty</data><abcde></abcde></test>hello world

All are ascii strings, winner is whoever can losslessly compress the
strings above independantly, and have the lowest mean compressed data
size (in bytes). it looks like xml, but it's not - compression needs
to be general purpose, not xml specific... Prize is fame here on
comp.compression.

glen herrmannsfeldt

unread,

Jan 3, 2010, 10:51:32 PM1/3/10

to

Oliver Mattos <oma...@gmail.com> wrote:

> I'm looking for an easy method to compress small strings (about 200
> bytes), using a custom dictionary. I'm using the custom dictionary
> because most of the strings I'm compressing contain some common
> phrases, and therefore putting these in a custom dictionary would
> reduce the output size considerably. The dictionary would be static
> and available to both the compressor and decompressor, and would not
> need to be stored as part of the compressed data.

(snip of possible way to do it)

My choice would be to match the fixed strings with a finite state
automaton such as used in the first part of the Aho-Corasick algorithm
and store each as a single byte in the string. Then run the result
through any compression algorithm such as LZW.

-- glen

BGB / cr88192

unread,

Jan 4, 2010, 12:42:27 AM1/4/10

to

"Oliver Mattos" <oma...@gmail.com> wrote in message
news:c248dc23-9f84-44f0...@m16g2000yqc.googlegroups.com...

maybe you might be better off in this case not using a dictionary...

consider instead that you produce a context:
for every prior 1 or 2 characters, figure out the ranking off all characters
likely to come next (can be figured out from a tool that simply uses the
prior 1 or 2 characters as a key, and counting the characters which follow,
from an input corpus, and then sorting the resultant rankings to produce the
model).

after this, the strings can be transformed via the model, converting each
character into its position in the ranking for the given context.

the result can be compressed with huffman, or even rice (rice coding is very
simple).

http://en.wikipedia.org/wiki/Rice_code

rice, k=0
0 0
10 1
110 2
1110 3
11110 4
...

k=1
00 0
01 1
100 2
101 3
1100 4
1101 5
...

k=2
000 0
001 1
010 2
011 3
1000 4
1001 5
1010 6
1011 7
...

there is another trivial variant of rice-coding, mostly differing in that
the prefix bits are inverted:
1
01
001
...

...

another simple code I know of (don't know the exact name of):
use a small rice code to encode the length of a suffix (in bits), then the
encode the suffix-((2^n)-1).

example with k=0:
0 0
100 1
101 2
11000 3
11001 4
11010 5
11011 6
1110000 7
1110001 8
1110010 9
1110011 10
1110100 11
1110101 12
1110110 13
1110111 14
...

k=1:
00 0
010 1
011 2
10000 4
10001 5
10010 6
10011 7
101000 8
101001 9
101010 10
101011 11
101100 12
101101 13
101110 14
101111 15
...

of course, if the prefix is inverted, then it appears to be
exponential-golumb coding:
http://en.wikipedia.org/wiki/Exponential-Golomb_coding

which is better depends on the particular distribution, which I don't know
off-hand.

rice starts steep and stays steep, whereas the other starts steep and
flattens out.

both are much simpler than huffman though, and one only has to send k to
configure them, that or hard-code k, where the ideal k could be done as part
of building the model (allowing for a no-parameters codec).

note:
8-bit context would produce a 64kB model (if full 8 bits are used);
a 12-bit context (such as the hash of the 2 or 3 bytes) would produce a 1MB
model;
a 16-bit context (2 bytes or a hash of 3+ bytes) would produce a 16MB model.

this is tunable based on how many bits of a model one is willing to have.

OTOH, a simple (0 bits context) ranking could produce a cheap analogue of
huffman coding (I have actually done this before), but this has no
"predictive" power.

some space in the model can be saved either by reducing the character set,
or by compressing the model (although the gains are unlikely to be that
exciting).

or such...

Mark Adler

unread,

Jan 4, 2010, 11:12:03 AM1/4/10

to

On 2010-01-03 16:55:41 -0800, Oliver Mattos <oma...@gmail.com> said:
> The dictionary would be static
> and available to both the compressor and decompressor, and would not
> need to be stored as part of the compressed data.

zlib supports exactly this. See the deflateSetDictionary() and
inflateSetDictionary() calls.

Mark

Ernst

unread,

Jan 7, 2010, 6:00:13 PM1/7/10

to

Thanks Oliver, and Mark interesting read.

Oliver Mattos

unread,

Jan 10, 2010, 2:33:45 PM1/10/10

to

Thanks everyone for the responses. It looks like the zlib custom
dictionaries are exactly what I was after, so I'll knock together a
program that can do stream compression (like gzip), using the zlib
backend, and taking a custom dictionary as an argument. If I get it
working well I'll post it here.

Having looked at the way DEFLATE works, it looks like a further size
improvement could be made by having custom huffman tables instead of
either using standard huffman tables (which means you don't get
compression based on character frequncy), or using dynamic huffman
tables (which requires the table itself be stored in the output).

cr88192's solution looks like it could work well, but I'm afraid I
don't have the resources to implement it for this project.

Thanks
Oliver

On Jan 4, 4:12 pm, Mark Adler <mad...@alumni.caltech.edu> wrote:

Mark Adler

unread,

Jan 10, 2010, 3:15:39 PM1/10/10

to

On 2010-01-10 11:33:45 -0800, Oliver Mattos said:
> Having looked at the way DEFLATE works, it looks like a further size
> improvement could be made by having custom huffman tables

Oliver,

Yes, it could. While there is a way to feed inflate a pre-set dynamic
block header and follow with transmitted codes, there is not currently
a way to make deflate use a pre-set dynamic block header. At least not
without some involved changes to the code.

A typical dynamic block header is around 80 bytes. So for your
200-byte strings, chances are that deflate will almost never choose to
use dynamic blocks -- deflate will use fixed blocks since that result
will turn out to be shorter.

Mark

BGB / cr88192

unread,

Jan 10, 2010, 4:24:37 PM1/10/10

to

"Mark Adler" <mad...@alumni.caltech.edu> wrote in message
news:2010011012153916807-madler@alumnicaltechedu...

yep.

this is also why my idea used pre-computed rank-sorting and rice-coding,
rather than either a huffman variant or arithmetic coding, since huffman
would need a header, and adaptive huffman and arithmetic coding would likely
generate suboptimal results, as well as being more complicated.

granted, with a modeled input stream, a PAQ-style coder could work ok, since
one could also pre-calculate the model for this as well.

however, rice coding is still likely to be a little simpler, and still
likely to do fairly well despite its simplicity (I have noted how well it
does in FLAC, and have also had good success with it in a few
special-purpose image compressors, ...).

however, deflate with a pre-built dictionary is still likely a fairly good
option as well.

I guess it could be compared to the tradeoffs between a hand-drill and a
power-drill...

not all tasks are better done by one or the other, as some tasks may be
better done with a hand-drill, and for others, one really needs the
power-drill...

> Mark
>