I don't need all of these requirements but I need something that can
easily be modified to suit. This is just quicky for a protoype : )
The words "lightweight" and "PPM" don't generally go together, at
least not with respect to the resources required.
> It needs to be able to compression short messages (in english)
For a compressor to be good at all at compressing short messages, it
needs to be preloaded with some sort of relevant context for the
messages expected. Otherwise pretty much any compressor will do
relatively poorly on short messages, as compared to a large number of
such messages compressed together. One way to build the context is to
consider the sequence of short messages as one long message that's
being compressed. The downside there is that if you lose the context
on the other end, then you can't decompress any subsequent messages.
> roughly 30-40% its original size with at most 128Kb of memory ( 32Kb
> or 64Kb would be desirable)
That's not really the regime that PPM thrives in. That's the regime
for LZish compressors.
Mark
Thanks for any help you can offer, Im really hoping to avoid having to
implement this.
For a pre-loaded model you might want a pre-allocated model. The
escape code and other complications added by PPMC etc help the model
grow dynamically but that doesn't makes much sense for a preloaded
model.
There is no free lunch here, but I once wrote a simple order-0
arithmetic coder in Java, it can be very easily extended for using
higher order statically allocated models or for model pre-loading.
http://sachingarg.com/compression/entropy_coding/java_range_coder
If you do want to implement a dynamically growing model like PPMC
(where the entire tree is not initially allocated and then escape
codes are used to grow the tree over time), then the source you
mentioned in your first post is the best bet.
Sachin Garg [India]
www.sachingarg.com | www.c10n.info
ps. in case you are wondering, range coder is just a faster alternate
to arithmetic coder, both are based on same concepts.
Cheers
Very poorly, again unless some context is pre-loaded that has some
matches for the short strings.
Mark
>>Just out of interest, how well do LZ compressors do on short english
>>strings (possibly 70 characters long)?
>
>
> Very poorly, again unless some context is pre-loaded that has some
> matches for the short strings.
I suggest evaluating actual measurements on your inputs.
You might be favorably impressed by something such as lzma,
which uses deflate plus a range encoder.
Ratios (output/input) of 0.4 won't be common, but 0.6 could be.
--
you know, PPM and Java, are almost the furthest thing from "lightweight" you
can get.
getting much of anything written in Java to run in that little memory is a
stretch, much less *PPM* of all things.
or at least this is my impression...
but, the bigger problem is not *just* the PPM, but, Java...
if no one has noticed, Java, even for trivial things, tends to use teh-huge
amounts of memory.
partial reason:
Java likes dynamically allocating things, and using a GC;
though good for convinience, ... these things are bad for memory footprint.
even in C, one has to use very specific representations of data (and
possibly custom memory management) to keep the footprint low.
this is, for the most part, beyond the capabilities of Java.
however, naive code will have a bad footprint in either case (dunno if the
typical JVM MM is more or less space-efficient than a typical C malloc
implementation).
I am speaking more specifically of specialized approaches, for example
slabs, special space saving tricks (for example, using the same slot for
different things in different situations, packing multiple values per
variable, ...), ...
someone may disagree with me, I don't know...
well, be careful not to use classes, or at least be very careful with
classes, as well (each instance can easily use tens to hundreds of bytes,
with some indeterminant amount of MM overhead).
maybe keep as much state as possible in single large arrays (keeping down
the actual number of such arrays).
...
as for strings, it doesn't help that java uses UTF-16...
in my projects, internally, typically I use UTF-8 (or often just ASCII).
IMO, for most things, UTF-8 would seem to be a better option than UTF-16
(only converting to UTF-16 for some specific kinds of operations, which IME
haven't really occured in practice).
well, UTF-8 represents the same values as UTF-16 (more or less, what is used
in Java).
however, note that different characters will take a different number of
bytes to represent. for data compression, even if for text, it may make more
sense to view things in terms of strings of raw-bytes (rather than
'characters').
oh well, I use UTF-8, but in my code support is chaotic (at best) as some
code will support UTF-8, and other code will not (treating it like raw
ascii). one issue is that, normally, C represents strings as "char *", but
for UTF-8, I need "uchar *" or "byte *". this leads to extra variables and
casts in many places.
different code does different things in different places though...
The thing I dislike most about Java is its lack of unsigned integers.
Almost all algorithm's implementation needs unsigned maths and I
haven't yet found any reasonable explanation on why this was decided.
<snip>
>>
>> oh well, I use UTF-8, but in my code support is chaotic (at best) as some
>> code will support UTF-8, and other code will not (treating it like raw
>> ascii). one issue is that, normally, C represents strings as "char *",
>> but
>> for UTF-8, I need "uchar *" or "byte *". this leads to extra variables
>> and
>> casts in many places.
>>
>> different code does different things in different places though...
>
> The thing I dislike most about Java is its lack of unsigned integers.
> Almost all algorithm's implementation needs unsigned maths and I
> haven't yet found any reasonable explanation on why this was decided.
>
can't say myself, I don't know.
Probably because you can do everything with signed numbers and the >>>
unsigned right shift operator. In C++, unsigned numbers are a pain
when expressions like (s.size() > -1) don't work the way you would
expect because -1 is silently converted to a huge number.
-- Matt Mahoney
many cases, signed numbers make sense, some cases, unsigned would help a
lot. then again, one can always take a smaller signed number and mask it to
get a larger unsigned one, but oh well.
I am just surprised no one debated with me about memory footprint issue,
trying to claim Java had a smaller footprint than C or something, but then
again, I don't think many are much doubting me on this one...
> -- Matt Mahoney
>
>
Yes, this does makes sense, but its just that with so much legacy code
using such issues as features, it becomes a pain to port it :-)
There will still be 1-bit of lost precision.
> many cases, signed numbers make sense, some cases, unsigned would help a
> lot. then again, one can always take a smaller signed number and mask it to
> get a larger unsigned one, but oh well.
That is what I have done to port range-coder to Java, use 64-bit
variables. You can take a quick look at code in case there is a way to
improve it. With >>> it might be possible to fit it back into 32-bits.
64-bit Java implementation:
http://www.sachingarg.com/compression/entropy_coding/java_range_coder/code/RangeEncoder64.java
32-bit C++ implementation:
http://www.sachingarg.com/compression/entropy_coding/64bit/code/entropy/range32.cpp
64-bit C++ implementation:
http://www.sachingarg.com/compression/entropy_coding/64bit/code/entropy/range64.cpp
> I am just surprised no one debated with me about memory footprint issue,
> trying to claim Java had a smaller footprint than C or something, but then
> again, I don't think many are much doubting me on this one...
Yep :-)
yes.
>> many cases, signed numbers make sense, some cases, unsigned would help a
>> lot. then again, one can always take a smaller signed number and mask it
>> to
>> get a larger unsigned one, but oh well.
>
> That is what I have done to port range-coder to Java, use 64-bit
> variables. You can take a quick look at code in case there is a way to
> improve it. With >>> it might be possible to fit it back into 32-bits.
>
> 64-bit Java implementation:
>
> http://www.sachingarg.com/compression/entropy_coding/java_range_coder/code/RangeEncoder64.java
>
> 32-bit C++ implementation:
>
> http://www.sachingarg.com/compression/entropy_coding/64bit/code/entropy/range32.cpp
>
> 64-bit C++ implementation:
>
> http://www.sachingarg.com/compression/entropy_coding/64bit/code/entropy/range64.cpp
>
sad though that 64 bit math is usually slower.
may not look into this much right now.
>> I am just surprised no one debated with me about memory footprint issue,
>> trying to claim Java had a smaller footprint than C or something, but
>> then
>> again, I don't think many are much doubting me on this one...
>
> Yep :-)
>
yeah.
> It needs to be able to compression short messages (in english) to
> roughly 30-40% its original size with at most 128Kb of memory ( 32Kb
there is one rather small known compressor (written by me) -
http://www.haskell.org/bz/cm.zip
it is the only static PPM compressor ever developed. what it does:
counts all order-N contexts, then omits rare ones and encode the rest
very like the idea of encoding huffman tree in deflate. then seond
pass runs where data encoded using this PPM statistics
this code used in several compression projects what need random
access. you do something like this - either save one PPM tree for many
messages or use one fixed tree for all messages
approximately, if you use 64kb for final tree, then full tree (before
cutting rare contexts) may be ~10x larger and in compressed form this
tree should ~10x smaller. so, one possible solution is to build tree
on large set of typical data using ~1mb of memory and order 3 or 4,
cut it off down to 64kb and save in some form - all these on PC
then on your device you can put this tree to memory and use it to
encode/decode strings. because you already cutted of rare cases
counted using big amount of memory, efficiency of such encoding will
be close to using 512k of memory with regular PPM algorithm
i also recommend you to use something like http://www.compression.ru/ds/ppmdj1.rar
which uses memory more efficiently than my algorithm. finally you
should get performance close to that of http://www.compression.ru/ds/ppmsi.rar
Sachin Garg, I downloaded your Range coder and convinced it give me a
byte array of text read in from a file. Im guessing its adaptive so it
has no model of any language beforehand but estimates character
probabilities as it encounters them? I have been getting poor results,
roughly a byte per character, I was hoping for 4 bits per character at
least. Perhaps Im doing something wrong? I don't know your code
insideout so its hard to tell right away whats going on at run time.
- Ross
Correct.
> I have been getting poor results,
> roughly a byte per character, I was hoping for 4 bits per character at
> least.
You won't get very good results when just using that code as-is, it
only has a order-0 model. But for normal (and large enough, like a few
KBs) data you should get better than 'a byte per character' (which
seems to mean 'no compression').
> Perhaps Im doing something wrong?
I can't tell :-), If you can post relevant parts of your code, I can
take a look.
> I don't know your code
> insideout so its hard to tell right away whats going on at run time.
I thought you were looking for something that can be extended for pre-
loading a static order-1 model etc... its not very hard to do but will
need some education :-) Let me know if you have any specific
questions.
Try these links if you are still on basics,
http://marknelson.us/articles/arith/part1.htm
http://marknelson.us/articles/arith/part2.htm
http://www.arturocampos.com/ac_range.html
If you are just looking for a quick-fix, I am not aware of any already
available solutions. In this case the original code you mentioned in
your first post is the next best option to Java's inbuilt deflate
streams.