I'm looking for an easy method to compress small strings (about 200
bytes), using a custom dictionary. I'm using the custom dictionary
because most of the strings I'm compressing contain some common
phrases, and therefore putting these in a custom dictionary would
reduce the output size considerably. The dictionary would be static
and available to both the compressor and decompressor, and would not
need to be stored as part of the compressed data.
I want to achieve this using standard linux tools or sample code - I'd
prefer not to rewrite gzip or do much software development.
My idea for an easy way to do this was:
1. set up a standard compression tool as stream compressor.
2. pipe in my dictionary, and discard the output - this should fill
all internal buffers with the dictionary words.
3. next pipe in my data to compress, and store the output.
And to decompress:
* set up a compression tool, and stream in a dictionary, and capture
the output (this is identical to what was discarded during
compression)
* pipe this output into a decompression tool, followed by the
compressed data.
* the decompression tools output will be the dictionary concatenated
with the uncompressed data - seperate them and you're done.
In theory, this works, in practice the "stream compressor" seems to
work on blocks of data which are way larger than my dictionary or my
data, so I get no output during stage 2 of compression, and a block of
random data during stage 3, which has nothing in common with other
stage3 outputs with different input data.
Has anyone any ideas?
Just in case this is an easy task and I get hundreds of responses,
lets make this a little competition. Compress the following strings
(one per line) independantly:
<test><data>76234987</data>data5??< qwerty <>>bad></test>
<test><data>1123444</data>hello world <abcde></abcde></test>
<test>qwerty 7723444 more data <abcde>hello world</abcde></test>
You are allowed the following as a dictionary (this string is
available during both compression and decompression):
<test><data>qwerty</data><abcde></abcde></test>hello world
All are ascii strings, winner is whoever can losslessly compress the
strings above independantly, and have the lowest mean compressed data
size (in bytes). it looks like xml, but it's not - compression needs
to be general purpose, not xml specific... Prize is fame here on
comp.compression.
(snip of possible way to do it)
My choice would be to match the fixed strings with a finite state
automaton such as used in the first part of the Aho-Corasick algorithm
and store each as a single byte in the string. Then run the result
through any compression algorithm such as LZW.
-- glen
maybe you might be better off in this case not using a dictionary...
consider instead that you produce a context:
for every prior 1 or 2 characters, figure out the ranking off all characters
likely to come next (can be figured out from a tool that simply uses the
prior 1 or 2 characters as a key, and counting the characters which follow,
from an input corpus, and then sorting the resultant rankings to produce the
model).
after this, the strings can be transformed via the model, converting each
character into its position in the ranking for the given context.
the result can be compressed with huffman, or even rice (rice coding is very
simple).
http://en.wikipedia.org/wiki/Rice_code
rice, k=0
0 0
10 1
110 2
1110 3
11110 4
...
k=1
00 0
01 1
100 2
101 3
1100 4
1101 5
...
k=2
000 0
001 1
010 2
011 3
1000 4
1001 5
1010 6
1011 7
...
there is another trivial variant of rice-coding, mostly differing in that
the prefix bits are inverted:
1
01
001
...
...
another simple code I know of (don't know the exact name of):
use a small rice code to encode the length of a suffix (in bits), then the
encode the suffix-((2^n)-1).
example with k=0:
0 0
100 1
101 2
11000 3
11001 4
11010 5
11011 6
1110000 7
1110001 8
1110010 9
1110011 10
1110100 11
1110101 12
1110110 13
1110111 14
...
k=1:
00 0
010 1
011 2
10000 4
10001 5
10010 6
10011 7
101000 8
101001 9
101010 10
101011 11
101100 12
101101 13
101110 14
101111 15
...
of course, if the prefix is inverted, then it appears to be
exponential-golumb coding:
http://en.wikipedia.org/wiki/Exponential-Golomb_coding
which is better depends on the particular distribution, which I don't know
off-hand.
rice starts steep and stays steep, whereas the other starts steep and
flattens out.
both are much simpler than huffman though, and one only has to send k to
configure them, that or hard-code k, where the ideal k could be done as part
of building the model (allowing for a no-parameters codec).
note:
8-bit context would produce a 64kB model (if full 8 bits are used);
a 12-bit context (such as the hash of the 2 or 3 bytes) would produce a 1MB
model;
a 16-bit context (2 bytes or a hash of 3+ bytes) would produce a 16MB model.
this is tunable based on how many bits of a model one is willing to have.
OTOH, a simple (0 bits context) ranking could produce a cheap analogue of
huffman coding (I have actually done this before), but this has no
"predictive" power.
some space in the model can be saved either by reducing the character set,
or by compressing the model (although the gains are unlikely to be that
exciting).
or such...
zlib supports exactly this. See the deflateSetDictionary() and
inflateSetDictionary() calls.
Mark
Having looked at the way DEFLATE works, it looks like a further size
improvement could be made by having custom huffman tables instead of
either using standard huffman tables (which means you don't get
compression based on character frequncy), or using dynamic huffman
tables (which requires the table itself be stored in the output).
cr88192's solution looks like it could work well, but I'm afraid I
don't have the resources to implement it for this project.
Thanks
Oliver
On Jan 4, 4:12 pm, Mark Adler <mad...@alumni.caltech.edu> wrote:
Oliver,
Yes, it could. While there is a way to feed inflate a pre-set dynamic
block header and follow with transmitted codes, there is not currently
a way to make deflate use a pre-set dynamic block header. At least not
without some involved changes to the code.
A typical dynamic block header is around 80 bytes. So for your
200-byte strings, chances are that deflate will almost never choose to
use dynamic blocks -- deflate will use fixed blocks since that result
will turn out to be shorter.
Mark
yep.
this is also why my idea used pre-computed rank-sorting and rice-coding,
rather than either a huffman variant or arithmetic coding, since huffman
would need a header, and adaptive huffman and arithmetic coding would likely
generate suboptimal results, as well as being more complicated.
granted, with a modeled input stream, a PAQ-style coder could work ok, since
one could also pre-calculate the model for this as well.
however, rice coding is still likely to be a little simpler, and still
likely to do fairly well despite its simplicity (I have noted how well it
does in FLAC, and have also had good success with it in a few
special-purpose image compressors, ...).
however, deflate with a pre-built dictionary is still likely a fairly good
option as well.
I guess it could be compared to the tradeoffs between a hand-drill and a
power-drill...
not all tasks are better done by one or the other, as some tasks may be
better done with a hand-drill, and for others, one really needs the
power-drill...
> Mark
>