binary serialization


fft1976

Aug 10, 2009, 10:25:43 PM
to Clojure
Is there a way to do binary serialization of Clojure/Java values?
ASCII (read) and (write) are nice, but they waste space, truncate
floats, and are probably slow compared to binary serialization.

Kyle R. Burton

Aug 10, 2009, 10:42:49 PM
to clo...@googlegroups.com

The following utility functions have worked in many cases for me:

;; Write any Serializable object to a file.
(defn object->file [obj file]
  (with-open [outp (java.io.ObjectOutputStream.
                    (java.io.FileOutputStream. file))]
    (.writeObject outp obj)))


;; Read back an object written by object->file.
(defn file->object [file]
  (with-open [inp (java.io.ObjectInputStream. (java.io.FileInputStream. file))]
    (.readObject inp)))

;; Serialize one object (or several, wrapped in a vector) to a byte array.
(defn freeze
  ([obj]
     (with-open [baos (java.io.ByteArrayOutputStream. 1024)
                 oos  (java.io.ObjectOutputStream. baos)]
       (.writeObject oos obj)
       (.toByteArray baos)))
  ([obj & objs]
     (freeze (vec (cons obj objs)))))

One caveat, though: some of the Clojure data types (like Symbols)
that I thought would have been serializable currently are not. I
think that is being addressed in the case of Clojure symbols, though.

Hope this helps,

Regards,

Kyle Burton

Kyle R. Burton

Aug 10, 2009, 10:57:35 PM
to clo...@googlegroups.com
On Mon, Aug 10, 2009 at 10:42 PM, Kyle R. Burton<kyle....@gmail.com> wrote:
>> Is there a way to do binary serialization of Clojure/Java values?
>> ASCII (read) and (write) are nice, but they are wasting space,
>> truncating floats and are probably slow compared to binary
>> serialization.
>
> The following utility functions have worked in many cases for me:
>
> (defn object->file [obj file]
>  (with-open [outp (java.io.ObjectOutputStream.
> (java.io.FileOutputStream. file))]
>    (.writeObject outp obj)))
>
>
> (defn file->object [file]
>  (with-open [inp (java.io.ObjectInputStream. (java.io.FileInputStream. file))]
>    (.readObject inp)))
>
> (defn freeze
>  ([obj]
>     (with-open [baos (java.io.ByteArrayOutputStream. 1024)
>                 oos  (java.io.ObjectOutputStream. baos)]
>       (.writeObject oos obj)
>       (.toByteArray baos)))
>  ([obj & objs]
>     (freeze (vec (cons obj objs)))))

Sorry, forgot to offer up the inverse of freeze, thaw:

(defn thaw [bytes]
  (with-open [bais (java.io.ByteArrayInputStream. bytes)
              ois  (java.io.ObjectInputStream. bais)]
    (.readObject ois)))
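
A minimal round-trip sketch, assuming a Clojure version whose vectors implement java.io.Serializable (the caveat about Symbols above suggests this was not true of every type at the time). The freeze and thaw definitions are repeated so the snippet runs on its own; the explicit (.flush oos) and the names sample and round-tripped are additions for illustration, not part of the original functions.

```clojure
(import '(java.io ByteArrayInputStream ByteArrayOutputStream
                  ObjectInputStream ObjectOutputStream))

;; Serialize obj to a byte array using Java object serialization.
(defn freeze [obj]
  (with-open [baos (ByteArrayOutputStream. 1024)
              oos  (ObjectOutputStream. baos)]
    (.writeObject oos obj)
    (.flush oos)            ; defensive: push any buffered bytes into baos
    (.toByteArray baos)))

;; Rebuild an object from a byte array produced by freeze.
(defn thaw [^bytes bs]
  (with-open [bais (ByteArrayInputStream. bs)
              ois  (ObjectInputStream. bais)]
    (.readObject ois)))

;; Hypothetical sample data: doubles are written in binary form,
;; so they come back bit-for-bit rather than via a printed decimal.
(def sample [Math/PI 0.1 1e-300 "binary"])
(def round-tripped (thaw (freeze sample)))
```

Comparing sample and round-tripped with = at the REPL is a quick way to check that nothing was truncated.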


Regards,

Kyle

fft1976

Aug 10, 2009, 11:10:24 PM
to Clojure
On Aug 10, 7:57 pm, "Kyle R. Burton" <kyle.bur...@gmail.com> wrote:
Does all this work with cycles, Java arrays, etc.?

Kyle R. Burton

Aug 10, 2009, 11:19:22 PM
to clo...@googlegroups.com
> Does all this work with cycles, Java arrays, etc.?


It will work with anything that implements the Serializable interface
in Java. Arrays do implement that interface (provided their elements
are serializable), as do the boxed wrappers for all the primitives.
With respect to cycles, I'd suspect it does, but would test it. If you
have a repl handy it should be pretty easy to test those functions out
on your data structures.

What class has the cycle? Is it a standard collection?

Regards,

Kyle


--
------------------------------------------------------------------------------
kyle....@gmail.com http://asymmetrical-view.com/
------------------------------------------------------------------------------

fft1976

Aug 11, 2009, 12:17:46 AM
to Clojure
On Aug 10, 8:19 pm, "Kyle R. Burton" <kyle.bur...@gmail.com> wrote:
> > Does all this work with cycles, Java arrays, etc.?
>
> It will work with anything that implements the Serializable interface
> in Java.  Arrays do implement that interface, as do all the
> primitives.  With respect to cycles, I'd suspect it does, but would
> test it.  If you have a repl handy it should be pretty easy to test
> those functions out on your data structures.
>
> What class has the cycle?  Is it a standard collection?
>

Cycles are a special case of substructure sharing. Let's talk about
that instead.

(def common [1 2 3 4 5])
(def a [6 common])
(def b [7 common])
(def c [a b])

If you are serializing c, I want "common" to get copied only once.

I don't know the JVM too well, but I think no efficient user-level
solution is possible. Why? To take care of substructure sharing, you
need to remember a set of shareable values that have already been
serialized, and do "reference equality" comparisons when new
substructures are serialized.

This comparison and a set implementation can easily be done with
pointers (because you have "<"), but there are no pointers in the JVM,
and no "reference inequality", so you must use linear seeks, making
the time complexity of serialization quadratic, where in C/C++ it
could be O(N log N)

Christian Vest Hansen

Aug 11, 2009, 4:07:27 AM
to clo...@googlegroups.com
Java object serialization handles cycles based on object identity.
--
Venlig hilsen / Kind regards,
Christian Vest Hansen.
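
The identity-based behaviour described here can be checked against the shared-substructure example from earlier in the thread. This is only a sketch: it assumes a Clojure version whose vectors are Serializable, repeats freeze and thaw so it is self-contained, and the names c2, shared-after? and distinct-from-original? are made up for illustration.

```clojure
(import '(java.io ByteArrayInputStream ByteArrayOutputStream
                  ObjectInputStream ObjectOutputStream))

(defn freeze [obj]
  (with-open [baos (ByteArrayOutputStream. 1024)
              oos  (ObjectOutputStream. baos)]
    (.writeObject oos obj)
    (.flush oos)
    (.toByteArray baos)))

(defn thaw [^bytes bs]
  (with-open [bais (ByteArrayInputStream. bs)
              ois  (ObjectInputStream. bais)]
    (.readObject ois)))

;; The structure from the earlier post: "common" is reachable via a and b.
(def common [1 2 3 4 5])
(def a [6 common])
(def b [7 common])
(def c [a b])

(def c2 (thaw (freeze c)))

;; If sharing is preserved, both paths lead to one and the same object.
(def shared-after?
  (identical? (get-in c2 [0 1]) (get-in c2 [1 1])))

;; The copy is a new object graph, not a reference to the original.
(def distinct-from-original?
  (not (identical? common (get-in c2 [0 1]))))
```

Within a single ObjectOutputStream, an object that has already been written is emitted as a back-reference, which is why "common" should be stored only once.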

John Harrop

Aug 11, 2009, 11:36:45 AM
to clo...@googlegroups.com
On Tue, Aug 11, 2009 at 12:17 AM, fft1976 <fft...@gmail.com> wrote:
> I don't know JVM too well, but I think no efficient user-level
> solution is possible. Why? To take care of substructure sharing, you
> need to remember a set of shareable values that have already been
> serialized, and do "reference equality" comparisons when new
> substructures are serialized.
>
> This comparison and a set implementation can easily be done with
> pointers (because you have "<"), but there are no pointers in the JVM,
> and no "reference inequality", so you must use linear seeks, making
> the time complexity of serialization quadratic, where in C/C++ it
> could be O(N log N)

Reference equality is available in the JVM (instructions if_acmpeq and if_acmpne), in Java (operators == and !=), and in Clojure (predicate identical?). Ordering comparisons such as < on references are not available, so a tree-map keyed on already-serialized structures is out, but Java provides System.identityHashCode() and IdentityHashMap, which use a hash that respects reference equality. So one can in fact implement one's own serialization that is O(n), using O(1) hashmap lookups. It would rely on reflection, so it would not work where a SecurityManager won't let you call setAccessible on private fields and the like, for example in an unsigned applet.
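
As a rough sketch of the bookkeeping side of that idea (not a serializer), the following walks nested Clojure collections and gives each distinct collection a handle number, using java.util.IdentityHashMap so that lookups go by reference identity rather than by value. The function name assign-handles and the choice to track only collections are assumptions made for the example.

```clojure
(import 'java.util.IdentityHashMap)

(defn assign-handles
  "Walks root depth-first and returns an IdentityHashMap from each
  distinct collection (compared by reference) to a handle number
  assigned in order of first visit."
  [root]
  (let [^IdentityHashMap seen (IdentityHashMap.)]
    (letfn [(walk [x]
              ;; Only collections are tracked; scalars are skipped.
              (when (and (coll? x) (not (.containsKey seen x)))
                (.put seen x (.size seen))
                (doseq [child x]
                  (walk child))))]
      (walk root)
      seen)))

;; The shared-substructure example from earlier in the thread.
(def common [1 2 3 4 5])
(def a [6 common])
(def b [7 common])
(def c [a b])

(def handles (assign-handles c))
```

Here "common" is reached twice but registered once; a real serializer would write a back-reference to its handle on the second visit. Each containsKey and put is an expected constant-time hash operation, so the walk is linear in the size of the graph.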

(Another use for reference equality is to see if Double.valueOf() is caching, something that arose as an issue in another thread. On my system, Sun JVM 1.6.0_13 -server and Clojure 1.0.0, it apparently is not:

user=> (identical? 2.0 2.0)
false

If this comes out to true then it's caching. Integer.valueOf() is caching on my system, but only for small integers:

user=> (identical? 1 1)
true
user=> (identical? 5 5)
true
user=> (identical? 50 50)
true
user=> (identical? 500 500)
false
user=> (identical? 255 255)
false
user=> (identical? 127 127)
true
user=> (identical? 128 128)
false

The threshold seems to be at values that will fit in one signed byte; Integer.valueOf() is documented to always cache at least the range -128 to 127.

[Remember the literals get boxed when passed to a function like identical? that isn't inlined. And identical? isn't inlined:

user=> (meta (var identical?))
{:ns #<Namespace clojure.core>, :name identical?, :doc "Tests if 2 arguments are the same object", :arglists ([x y])}

whereas two-argument + is:

user=> (meta (var +))
{:ns #<Namespace clojure.core>, :name +, :file "clojure/core.clj", :line 549, :arglists ([] [x] [x y] [x y & more]), :inline-arities #{2}, :inline #<core$fn__3329 clojure.core$fn__3329@d337d3>, :doc "Returns the sum of nums. (+) returns 0."}

You can also use ^#'identical? and ^#'+ but I like my Clojure looking like Lisp, not like Perl. :)])

John Harrop

Aug 11, 2009, 11:15:10 AM
to clo...@googlegroups.com
On Mon, Aug 10, 2009 at 10:57 PM, Kyle R. Burton <kyle....@gmail.com> wrote:
> Sorry, forgot to offer up the inverse of freeze, thaw:
>
> (defn thaw [bytes]
>  (with-open [bais (java.io.ByteArrayInputStream. bytes)
>              ois  (java.io.ObjectInputStream. bais)]
>    (.readObject ois)))
>
> Regards,
>
> Kyle

Which in turn gives us this, otherwise sorely lacking from the Java standard library, but much less useful to us Clojurians who tend to mainly use immutable objects:

(defn deep-copy [obj]
  (thaw (freeze obj)))

(Object.clone() does a shallow copy and typically isn't as widely available as Serializable.)
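
A small usage sketch of deep-copy on a mutable Java collection, where a copy is actually useful. The freeze and thaw definitions are repeated so the snippet runs on its own (with an added defensive flush), and the names original and copy-of-original are illustrative only.

```clojure
(import '(java.io ByteArrayInputStream ByteArrayOutputStream
                  ObjectInputStream ObjectOutputStream)
        'java.util.ArrayList)

(defn freeze [obj]
  (with-open [baos (ByteArrayOutputStream. 1024)
              oos  (ObjectOutputStream. baos)]
    (.writeObject oos obj)
    (.flush oos)
    (.toByteArray baos)))

(defn thaw [^bytes bs]
  (with-open [bais (ByteArrayInputStream. bs)
              ois  (ObjectInputStream. bais)]
    (.readObject ois)))

;; Deep copy by serializing and immediately deserializing.
(defn deep-copy [obj]
  (thaw (freeze obj)))

;; A mutable list; changes to the copy must not leak into the original.
(def original (ArrayList. [1 2 3]))
(def ^ArrayList copy-of-original (deep-copy original))
(.add copy-of-original 4)
```

After the .add call, original still holds three elements while copy-of-original holds four. The approach only works for object graphs in which everything reachable is Serializable.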

fft1976

Aug 11, 2009, 2:31:48 PM
to Clojure


On Aug 11, 8:36 am, John Harrop <jharrop...@gmail.com> wrote:

> System.identityHashCode() and IdentityHashMap. These use a hash that
> respects reference equality. So one in fact can implement one's own
> serialization that is O(n) using O(1) hashmap lookups (and using reflection,
> and not working if SecurityManager won't let you setAccessible private
> fields and the like, so not in an unsigned applet).

Good to know, thanks. By the way, hash table operations are O(log N),
because calculating the hash needs to be O(log N), but I'm nitpicking
now.

fft1976

Aug 11, 2009, 2:33:30 PM
to Clojure
On Aug 11, 8:15 am, John Harrop <jharrop...@gmail.com> wrote:

> Which in turn gives us this, otherwise sorely lacking from the Java standard
> library, but much less useful to us Clojurians who tend to mainly use
> immutable objects:
>
> (defn deep-copy [obj]
>   (thaw (freeze obj)))

Somebody should benchmark that vs manual implementations of deep copy
in Java.

John Harrop

Aug 11, 2009, 11:00:56 PM
to clo...@googlegroups.com
The time taken to calculate an object's hash depends on that object's class. For a String for instance it is linear in the String's length; for an Integer, it is constant. Furthermore, hashes are often cached (java.lang.String caches its hash for example). So if A and B are objects that share a common reference to an object C in their fields, and use C's hash code in computing their own, C's hash code may be computed only once, and the cost of hashing A and B may be lower than the sum of the cost of hashing only A and the cost of hashing only B. Furthermore, if A and B are serialized again later, their own hashes may not need to be recalculated...