Open source java implementations of PPM?

digi...@gmail.com

unread,

May 7, 2007, 8:04:19 PM5/7/07

to

Hi, Im looking a lightweight implementation of PPM or some variant
that I can use to test the potential of integrating it with my own
system. I've sort of found what Im looking for here:
http://www.colloquial.com/ArithmeticCoding/ but its not quite what I
need. Its a huge package and seems like overkill for what I need.
It needs to be able to compression short messages (in english) to
roughly 30-40% its original size with at most 128Kb of memory ( 32Kb
or 64Kb would be desirable) so paramters such as the order of the
model would of course need to be configurable. Ideally the result
would be returned as a byte array.

I don't need all of these requirements but I need something that can
easily be modified to suit. This is just quicky for a protoype : )

Mark Adler

unread,

May 7, 2007, 8:11:41 PM5/7/07

to

On May 7, 5:04 pm, digiol...@gmail.com wrote:
> Hi, Im looking a lightweight implementation of PPM

The words "lightweight" and "PPM" don't generally go together, at
least not with respect to the resources required.

> It needs to be able to compression short messages (in english)

For a compressor to be good at all at compressing short messages, it
needs to be preloaded with some sort of relevant context for the
messages expected. Otherwise pretty much any compressor will do
relatively poorly on short messages, as compared to a large number of
such messages compressed together. One way to build the context is to
consider the sequence of short messages as one long message that's
being compressed. The downside there is that if you lose the context
on the other end, then you can't decompress any subsequent messages.

> roughly 30-40% its original size with at most 128Kb of memory ( 32Kb
> or 64Kb would be desirable)

That's not really the regime that PPM thrives in. That's the regime
for LZish compressors.

Mark

digi...@gmail.com

unread,

May 8, 2007, 4:32:27 PM5/8/07

to

Hi Mark, this actually has been done before using a variant PPMC at
orders 2 and 3 (partial context I would say) with only 32kb of memory.
The application would of course be preloaded with a context model
which would be present at both receiving and sending ends. It acheives
roughly the same compression ratio as im aiming for, unfortunately the
code isnt available to me. H
However, here is a paper outlining the approach used, its a little
over my head at the moment, probably too much for a one man job:
http://www.academypublisher.com/jcp/vol01/no06/jcp01060110.pdf
Just out of interest, how well do LZ compressors do on short english
strings (possibly 70 characters long)?

Thanks for any help you can offer, Im really hoping to avoid having to
implement this.

Sachin Garg

unread,

May 8, 2007, 5:40:42 PM5/8/07

to

For a pre-loaded model you might want a pre-allocated model. The
escape code and other complications added by PPMC etc help the model
grow dynamically but that doesn't makes much sense for a preloaded
model.

There is no free lunch here, but I once wrote a simple order-0
arithmetic coder in Java, it can be very easily extended for using
higher order statically allocated models or for model pre-loading.

http://sachingarg.com/compression/entropy_coding/java_range_coder

If you do want to implement a dynamically growing model like PPMC
(where the entire tree is not initially allocated and then escape
codes are used to grow the tree over time), then the source you
mentioned in your first post is the best bet.

Sachin Garg [India]
www.sachingarg.com | www.c10n.info

ps. in case you are wondering, range coder is just a faster alternate
to arithmetic coder, both are based on same concepts.

digi...@gmail.com

unread,

May 8, 2007, 6:57:53 PM5/8/07

to

Hi, thanks for the reply. I'll try looking at your implementation and
see how I can go about extending it to order 2 or 3, I'll probably be
back with lots of questions if I do : ) Im just beginning to explore
compression. Im also glad you brought range coder to my attention,
I've been looking for something like this.

Cheers

Mark Adler

unread,

May 9, 2007, 3:08:57 AM5/9/07

to

On May 8, 1:32 pm, digiol...@gmail.com wrote:
> Just out of interest, how well do LZ compressors do on short english
> strings (possibly 70 characters long)?

Very poorly, again unless some context is pre-loaded that has some
matches for the short strings.

Mark

John Reiser

unread,

May 9, 2007, 10:46:49 AM5/9/07

to

Mark Adler wrote:

>>Just out of interest, how well do LZ compressors do on short english
>>strings (possibly 70 characters long)?
>
>
> Very poorly, again unless some context is pre-loaded that has some
> matches for the short strings.

I suggest evaluating actual measurements on your inputs.
You might be favorably impressed by something such as lzma,
which uses deflate plus a range encoder.
Ratios (output/input) of 0.4 won't be common, but 0.6 could be.

--

cr88192

unread,

May 9, 2007, 4:01:09 PM5/9/07

to

<digi...@gmail.com> wrote in message
news:1178582659....@e65g2000hsc.googlegroups.com...

you know, PPM and Java, are almost the furthest thing from "lightweight" you
can get.

getting much of anything written in Java to run in that little memory is a
stretch, much less *PPM* of all things.

or at least this is my impression...

digi...@gmail.com

unread,

May 10, 2007, 8:06:24 PM5/10/07

to

Yeah, there's no doubt it'd be a challenge, but not impossible. I
think I could implement a scaled down PPM implementation within those
memory limits its just the arithmetic/range coding and dealing with
variable length codes that gives me the willys *shudders*. Hopefully
Sachin's order-0 implementation can be easily extended.

cr88192

unread,

May 11, 2007, 12:28:56 AM5/11/07

to

<digi...@gmail.com> wrote in message
news:1178841984....@y80g2000hsf.googlegroups.com...

but, the bigger problem is not *just* the PPM, but, Java...

if no one has noticed, Java, even for trivial things, tends to use teh-huge
amounts of memory.

partial reason:
Java likes dynamically allocating things, and using a GC;
though good for convinience, ... these things are bad for memory footprint.

even in C, one has to use very specific representations of data (and
possibly custom memory management) to keep the footprint low.

this is, for the most part, beyond the capabilities of Java.

however, naive code will have a bad footprint in either case (dunno if the
typical JVM MM is more or less space-efficient than a typical C malloc
implementation).

I am speaking more specifically of specialized approaches, for example
slabs, special space saving tricks (for example, using the same slot for
different things in different situations, packing multiple values per
variable, ...), ...

someone may disagree with me, I don't know...

digi...@gmail.com

unread,

May 11, 2007, 8:32:40 AM5/11/07

to

Your right, you do have to be careful with what Java is doing behind
the scenes, if you know your stuff and are careful for example not
declaring variables in loops, not using + for string concatenation and
staying away from Vectors, Strings ect for convenience, these can all
help.
Im using Java because I have no other choice for this application (its
in J2ME to be specific).

cr88192

unread,

May 11, 2007, 3:14:00 PM5/11/07

to

<digi...@gmail.com> wrote in message
news:1178886760.6...@w5g2000hsg.googlegroups.com...

well, be careful not to use classes, or at least be very careful with
classes, as well (each instance can easily use tens to hundreds of bytes,
with some indeterminant amount of MM overhead).

maybe keep as much state as possible in single large arrays (keeping down
the actual number of such arrays).

...

as for strings, it doesn't help that java uses UTF-16...

in my projects, internally, typically I use UTF-8 (or often just ASCII).

IMO, for most things, UTF-8 would seem to be a better option than UTF-16
(only converting to UTF-16 for some specific kinds of operations, which IME
haven't really occured in practice).

digi...@gmail.com

unread,

May 11, 2007, 9:34:18 PM5/11/07

to

Yeah, I recently noticed this, I had 2000 strings stored in a vector
(because it was convenient to access that way), using a single
character array really brought down the memory usage.
Im not sure how to change Java's internal character representation but
I'd probably use bytes with my own mapping from characters to byte
values if necessary, I think char primitives are 16 bit by default. If
UTF-8 does the same job without me having to worry about it then I'll
try that when I figure it out : )

cr88192

unread,

May 11, 2007, 10:47:34 PM5/11/07

to

<digi...@gmail.com> wrote in message
news:1178933658....@w5g2000hsg.googlegroups.com...

well, UTF-8 represents the same values as UTF-16 (more or less, what is used
in Java).

however, note that different characters will take a different number of
bytes to represent. for data compression, even if for text, it may make more
sense to view things in terms of strings of raw-bytes (rather than
'characters').

oh well, I use UTF-8, but in my code support is chaotic (at best) as some
code will support UTF-8, and other code will not (treating it like raw
ascii). one issue is that, normally, C represents strings as "char *", but
for UTF-8, I need "uchar *" or "byte *". this leads to extra variables and
casts in many places.

different code does different things in different places though...

Sachin Garg

unread,

May 12, 2007, 9:41:59 AM5/12/07

to

On May 12, 7:47 am, "cr88192" <cr88...@NOSPAM.hotmail.com> wrote:
> <digiol...@gmail.com> wrote in message

The thing I dislike most about Java is its lack of unsigned integers.
Almost all algorithm's implementation needs unsigned maths and I
haven't yet found any reasonable explanation on why this was decided.

cr88192

unread,

May 12, 2007, 3:46:43 PM5/12/07

to

"Sachin Garg" <sch...@gmail.com> wrote in message
news:1178977319.0...@w5g2000hsg.googlegroups.com...

> On May 12, 7:47 am, "cr88192" <cr88...@NOSPAM.hotmail.com> wrote:
>> <digiol...@gmail.com> wrote in message
>>
>> news:1178933658....@w5g2000hsg.googlegroups.com...
>>

<snip>

>>
>> oh well, I use UTF-8, but in my code support is chaotic (at best) as some
>> code will support UTF-8, and other code will not (treating it like raw
>> ascii). one issue is that, normally, C represents strings as "char *",
>> but
>> for UTF-8, I need "uchar *" or "byte *". this leads to extra variables
>> and
>> casts in many places.
>>
>> different code does different things in different places though...
>
> The thing I dislike most about Java is its lack of unsigned integers.
> Almost all algorithm's implementation needs unsigned maths and I
> haven't yet found any reasonable explanation on why this was decided.
>

can't say myself, I don't know.

Matt Mahoney

unread,

May 12, 2007, 5:29:38 PM5/12/07

to

On May 12, 9:41 am, Sachin Garg <schn...@gmail.com> wrote:
> The thing I dislike most about Java is its lack of unsigned integers.
> Almost all algorithm's implementation needs unsigned maths and I
> haven't yet found any reasonable explanation on why this was decided.

Probably because you can do everything with signed numbers and the >>>
unsigned right shift operator. In C++, unsigned numbers are a pain
when expressions like (s.size() > -1) don't work the way you would
expect because -1 is silently converted to a huge number.

-- Matt Mahoney

cr88192

unread,

May 12, 2007, 6:01:44 PM5/12/07

to

"Matt Mahoney" <matma...@yahoo.com> wrote in message
news:1179005378.4...@n59g2000hsh.googlegroups.com...

many cases, signed numbers make sense, some cases, unsigned would help a
lot. then again, one can always take a smaller signed number and mask it to
get a larger unsigned one, but oh well.

I am just surprised no one debated with me about memory footprint issue,
trying to claim Java had a smaller footprint than C or something, but then
again, I don't think many are much doubting me on this one...

> -- Matt Mahoney
>
>

Sachin Garg

unread,

May 13, 2007, 5:39:31 AM5/13/07

to

On May 13, 3:01 am, "cr88192" <cr88...@NOSPAM.hotmail.com> wrote:
> "Matt Mahoney" <matmaho...@yahoo.com> wrote in message

>
> news:1179005378.4...@n59g2000hsh.googlegroups.com...
>
> > On May 12, 9:41 am, Sachin Garg <schn...@gmail.com> wrote:
> >> The thing I dislike most about Java is its lack of unsigned integers.
> >> Almost all algorithm's implementation needs unsigned maths and I
> >> haven't yet found any reasonable explanation on why this was decided.
>
> > Probably because you can do everything with signed numbers and the >>>
> > unsigned right shift operator. In C++, unsigned numbers are a pain
> > when expressions like (s.size() > -1) don't work the way you would
> > expect because -1 is silently converted to a huge number.

Yes, this does makes sense, but its just that with so much legacy code
using such issues as features, it becomes a pain to port it :-)

There will still be 1-bit of lost precision.

> many cases, signed numbers make sense, some cases, unsigned would help a
> lot. then again, one can always take a smaller signed number and mask it to
> get a larger unsigned one, but oh well.

That is what I have done to port range-coder to Java, use 64-bit
variables. You can take a quick look at code in case there is a way to
improve it. With >>> it might be possible to fit it back into 32-bits.

64-bit Java implementation:
http://www.sachingarg.com/compression/entropy_coding/java_range_coder/code/RangeEncoder64.java

32-bit C++ implementation:
http://www.sachingarg.com/compression/entropy_coding/64bit/code/entropy/range32.cpp

64-bit C++ implementation:
http://www.sachingarg.com/compression/entropy_coding/64bit/code/entropy/range64.cpp

> I am just surprised no one debated with me about memory footprint issue,
> trying to claim Java had a smaller footprint than C or something, but then
> again, I don't think many are much doubting me on this one...

Yep :-)

cr88192

unread,

May 13, 2007, 7:22:27 AM5/13/07

to

"Sachin Garg" <sch...@gmail.com> wrote in message

news:1179049171.4...@p77g2000hsh.googlegroups.com...

>
> On May 13, 3:01 am, "cr88192" <cr88...@NOSPAM.hotmail.com> wrote:
>> "Matt Mahoney" <matmaho...@yahoo.com> wrote in message
>>
>> news:1179005378.4...@n59g2000hsh.googlegroups.com...
>>
>> > On May 12, 9:41 am, Sachin Garg <schn...@gmail.com> wrote:
>> >> The thing I dislike most about Java is its lack of unsigned integers.
>> >> Almost all algorithm's implementation needs unsigned maths and I
>> >> haven't yet found any reasonable explanation on why this was decided.
>>
>> > Probably because you can do everything with signed numbers and the >>>
>> > unsigned right shift operator. In C++, unsigned numbers are a pain
>> > when expressions like (s.size() > -1) don't work the way you would
>> > expect because -1 is silently converted to a huge number.
>
> Yes, this does makes sense, but its just that with so much legacy code
> using such issues as features, it becomes a pain to port it :-)
>
> There will still be 1-bit of lost precision.
>

yes.

>> many cases, signed numbers make sense, some cases, unsigned would help a
>> lot. then again, one can always take a smaller signed number and mask it
>> to
>> get a larger unsigned one, but oh well.
>
> That is what I have done to port range-coder to Java, use 64-bit
> variables. You can take a quick look at code in case there is a way to
> improve it. With >>> it might be possible to fit it back into 32-bits.
>
> 64-bit Java implementation:
>
> http://www.sachingarg.com/compression/entropy_coding/java_range_coder/code/RangeEncoder64.java
>
> 32-bit C++ implementation:
>
> http://www.sachingarg.com/compression/entropy_coding/64bit/code/entropy/range32.cpp
>
> 64-bit C++ implementation:
>
> http://www.sachingarg.com/compression/entropy_coding/64bit/code/entropy/range64.cpp
>

sad though that 64 bit math is usually slower.
may not look into this much right now.

>> I am just surprised no one debated with me about memory footprint issue,
>> trying to claim Java had a smaller footprint than C or something, but
>> then
>> again, I don't think many are much doubting me on this one...
>
> Yep :-)
>

yeah.

Bulat Ziganshin

unread,

May 23, 2007, 12:52:35 PM5/23/07

to

On 8 май, 04:04, digiol...@gmail.com wrote:
> Hi, Im looking a lightweight implementation of PPM or some variant

> It needs to be able to compression short messages (in english) to

> roughly 30-40% its original size with at most 128Kb of memory ( 32Kb

there is one rather small known compressor (written by me) -
http://www.haskell.org/bz/cm.zip

it is the only static PPM compressor ever developed. what it does:
counts all order-N contexts, then omits rare ones and encode the rest
very like the idea of encoding huffman tree in deflate. then seond
pass runs where data encoded using this PPM statistics

this code used in several compression projects what need random
access. you do something like this - either save one PPM tree for many
messages or use one fixed tree for all messages

approximately, if you use 64kb for final tree, then full tree (before
cutting rare contexts) may be ~10x larger and in compressed form this
tree should ~10x smaller. so, one possible solution is to build tree
on large set of typical data using ~1mb of memory and order 3 or 4,
cut it off down to 64kb and save in some form - all these on PC

then on your device you can put this tree to memory and use it to
encode/decode strings. because you already cutted of rare cases
counted using big amount of memory, efficiency of such encoding will
be close to using 512k of memory with regular PPM algorithm

i also recommend you to use something like http://www.compression.ru/ds/ppmdj1.rar
which uses memory more efficiently than my algorithm. finally you
should get performance close to that of http://www.compression.ru/ds/ppmsi.rar

digi...@gmail.com

unread,

Jun 9, 2007, 3:40:18 PM6/9/07

to

Thanks for all your replies.

Sachin Garg, I downloaded your Range coder and convinced it give me a
byte array of text read in from a file. Im guessing its adaptive so it
has no model of any language beforehand but estimates character
probabilities as it encounters them? I have been getting poor results,
roughly a byte per character, I was hoping for 4 bits per character at
least. Perhaps Im doing something wrong? I don't know your code
insideout so its hard to tell right away whats going on at run time.

- Ross

Sachin Garg

unread,

Jun 9, 2007, 10:41:12 PM6/9/07

to

On Jun 10, 12:40 am, digiol...@gmail.com wrote:
> Thanks for all your replies.
>
> Sachin Garg, I downloaded your Range coder and convinced it give me a
> byte array of text read in from a file. Im guessing its adaptive so it
> has no model of any language beforehand but estimates character
> probabilities as it encounters them?

Correct.

> I have been getting poor results,
> roughly a byte per character, I was hoping for 4 bits per character at
> least.

You won't get very good results when just using that code as-is, it
only has a order-0 model. But for normal (and large enough, like a few
KBs) data you should get better than 'a byte per character' (which
seems to mean 'no compression').

> Perhaps Im doing something wrong?

I can't tell :-), If you can post relevant parts of your code, I can
take a look.

> I don't know your code
> insideout so its hard to tell right away whats going on at run time.

I thought you were looking for something that can be extended for pre-
loading a static order-1 model etc... its not very hard to do but will
need some education :-) Let me know if you have any specific
questions.

Try these links if you are still on basics,

http://marknelson.us/articles/arith/part1.htm
http://marknelson.us/articles/arith/part2.htm
http://www.arturocampos.com/ac_range.html

If you are just looking for a quick-fix, I am not aware of any already
available solutions. In this case the original code you mentioned in
your first post is the next best option to Java's inbuilt deflate
streams.