Google Groups no longer supports new Usenet posts or subscriptions. Historical content remains viewable.
Dismiss

[Caml-list] Marshalling data format deteriorates compressibility

6 views
Skip to first unread message

Markus Mottl

unread,
Jun 27, 2006, 5:08:00 PM6/27/06
to ocaml
Hi,

we have just found out that under certain circumstances the current
binary format used for encoding marshalled OCaml-values badly affects
compressibility of the data by a wide class of compression algorithms
(especially Lempel-Ziv based ones).

Before we write out datastructures to disk we usually hashcons them to
maximize sharing. This, of course, leads to much smaller file sizes.
Suprisingly, however, compressing the hashconsed files using e.g. gzip
leads to very significantly (e.g. three times!) larger files than
gzipping marshalled datastructures that have not been hashconsed.

We finally found out what causes the problem: OCaml represents
pointers to shared data values using relative offsets instead of
absolute positions within the marshalled data. This means that e.g.
an array containing pointers to these values will be represented by a
sequence of increasing relative offsets, which essentially renders it
almost incompressible to the usual compression algorithms.

As it seems, the current marshalling algorithm uses this relative
addressing approach to save space: relative offsets are encoded with
variable length (this assumes some degree of locality), which is not
possible with absolute addressing. Unfortunately, this does not take
compression algorithms into account, which may greatly benefit from
repeating patterns of pointers.

Could one of the maintainers please tell us what would be the least
intrusive way of adding a new marshalling format? If this data format
were added to the distribution, it could be turned on using a new flag
in the Marshal-module. A different magic number might be used for
chosing the right unmarshalling function.

Of course, we would be even more grateful if there were an
implementation of the alternative encoding in the next release... ;-)

Regards,
Markus

--
Markus Mottl http://www.ocaml.info markus...@gmail.com

_______________________________________________
Caml-list mailing list. Subscription management:
http://yquem.inria.fr/cgi-bin/mailman/listinfo/caml-list
Archives: http://caml.inria.fr
Beginner's list: http://groups.yahoo.com/group/ocaml_beginners
Bug reports: http://caml.inria.fr/bin/caml-bugs

Martin Jambon

unread,
Jun 27, 2006, 7:47:00 PM6/27/06
to caml...@inria.fr
[to the list admin: why can I send messages with this subscription and not
with martin...@emailuser.net?]

On Tue, 27 Jun 2006, Markus Mottl wrote:

> We finally found out what causes the problem: OCaml represents
> pointers to shared data values using relative offsets instead of
> absolute positions within the marshalled data. This means that e.g.
> an array containing pointers to these values will be represented by a
> sequence of increasing relative offsets, which essentially renders it
> almost incompressible to the usual compression algorithms.
>
> As it seems, the current marshalling algorithm uses this relative
> addressing approach to save space: relative offsets are encoded with
> variable length (this assumes some degree of locality), which is not
> possible with absolute addressing. Unfortunately, this does not take
> compression algorithms into account, which may greatly benefit from
> repeating patterns of pointers.

Maybe you can convert the data into a marshal-optimized format before
marshalling, where you put your shared data into an array, and substitute the
pointers by array indices, e.g.
type t = string list
becomes:
type marshalled_t = (string array * int list)

["a"; "b"; "a"; "a"] -> ([| "a"; "b" |], [0; 1; 0; 0])

That seems like a lot of work, but it shouldn't be too hard to maintain.

By the way, floats don't compress very well. Rounding them as much as possible
used to save me about 50% of space.


Martin

--
Martin Jambon, PhD
http://martin.jambon.free.fr

Jon Harrop

unread,
Jun 27, 2006, 8:03:52 PM6/27/06
to caml...@yquem.inria.fr
On Tuesday 27 June 2006 22:04, Markus Mottl wrote:
> As it seems, the current marshalling algorithm uses this relative
> addressing approach to save space: relative offsets are encoded with
> variable length (this assumes some degree of locality), which is not
> possible with absolute addressing. Unfortunately, this does not take
> compression algorithms into account, which may greatly benefit from
> repeating patterns of pointers.

Interesting. Perhaps using a higher-order bitwise arithmetic coder instead of
*zip would be the least intrusive solution? Alternatively, a custom rewrite
tool in order to evade the compiler's internals?

--
Dr Jon D Harrop, Flying Frog Consultancy Ltd.
Objective CAML for Scientists
http://www.ffconsultancy.com/products/ocaml_for_scientists

Martin Jambon

unread,
Jun 28, 2006, 10:52:43 AM6/28/06
to caml...@inria.fr
[to the list admin: why can I send messages with this subscription and not
with martin...@emailuser.net?]

On Tue, 27 Jun 2006, Markus Mottl wrote:

> We finally found out what causes the problem: OCaml represents
> pointers to shared data values using relative offsets instead of
> absolute positions within the marshalled data. This means that e.g.
> an array containing pointers to these values will be represented by a
> sequence of increasing relative offsets, which essentially renders it
> almost incompressible to the usual compression algorithms.
>

> As it seems, the current marshalling algorithm uses this relative
> addressing approach to save space: relative offsets are encoded with
> variable length (this assumes some degree of locality), which is not
> possible with absolute addressing. Unfortunately, this does not take
> compression algorithms into account, which may greatly benefit from
> repeating patterns of pointers.

Maybe you can convert the data into a marshal-optimized format before

marshalling, where you put your shared data into an array, and substitute the
pointers by array indices, e.g.
type t = string list
becomes:
type marshalled_t = (string array * int list)

["a"; "b"; "a"; "a"] -> ([| "a"; "b" |], [0; 1; 0; 0])

That seems like a lot of work, but it shouldn't be too hard to maintain.

By the way, floats don't compress very well. Rounding them as much as possible
used to save me about 50% of space.


Martin

--
Martin Jambon, PhD
http://martin.jambon.free.fr

_______________________________________________

Markus Mottl

unread,
Jun 28, 2006, 6:51:40 PM6/28/06
to caml...@inria.fr
Hi,

I have just finished (and attached) a patch for the latest CVS-release
of OCaml, which adds a new marshalling flag. This flag generates a
different marshalling format, which uses absolute addresses to refer
to shared values. This fixes the problem of bad compressibility of
marshalled OCaml data. See the patch header for more information.

Since the patch is quite small and does not break any existing code,
we'd be very grateful if it could be made part of the next
OCaml-release. I'll also enter the patch into the OCaml bugtracker so
that people running into the same problem can find it there, too.

Best regards,

absolute.patch
0 new messages