Is there a way to do binary serialization of Clojure/Java values?
ASCII (read) and (write) are nice, but they are wasting space,
truncating floats and are probably slow compared to binary
serialization.
> Is there a way to do binary serialization of Clojure/Java values? > ASCII (read) and (write) are nice, but they are wasting space, > truncating floats and are probably slow compared to binary > serialization.
The following utility functions have worked in many cases for me:
One caveat though is that currently some of the clojure data types (like Symbols) that I thought would have been serializable are not. I think that in the case of Clojure symbols it is being addressed though.
On Mon, Aug 10, 2009 at 10:42 PM, Kyle R. Burton<kyle.bur...@gmail.com> wrote: >> Is there a way to do binary serialization of Clojure/Java values? >> ASCII (read) and (write) are nice, but they are wasting space, >> truncating floats and are probably slow compared to binary >> serialization.
> The following utility functions have worked in many cases for me:
> On Mon, Aug 10, 2009 at 10:42 PM, Kyle R. Burton<kyle.bur...@gmail.com> wrote:
> >> Is there a way to do binary serialization of Clojure/Java values?
> >> ASCII (read) and (write) are nice, but they are wasting space,
> >> truncating floats and are probably slow compared to binary
> >> serialization.
> > The following utility functions have worked in many cases for me:
> Does all this work with cycles, Java arrays, etc.?
It will work with anything that implements the Serializable interface in Java. Arrays do implement that interface, as do all the primitives. With respect to cycles, I'd suspect it does, but would test it. If you have a repl handy it should be pretty easy to test those functions out on your data structures.
What class has the cycle? Is it a standard collection?
On Aug 10, 8:19 pm, "Kyle R. Burton" <kyle.bur...@gmail.com> wrote:
> > Does all this work with cycles, Java arrays, etc.?
> It will work with anything that implements the Serializable interface
> in Java. Arrays do implement that interface, as do all the
> primitives. With respect to cycles, I'd suspect it does, but would
> test it. If you have a repl handy it should be pretty easy to test
> those functions out on your data structures.
> What class has the cycle? Is it a standard collection?
Cycles are a special case of substructure sharing. Let's talk about
that instead.
(def common [1 2 3 4 5])
(def a [6 common])
(def b [7 common])
(def c [a b])
If you are serializing c, I want "common" to get copied only once.
I don't know JVM too well, but I think no efficient user-level
solution is possible. Why? To take care of substructure sharing, you
need to remember a set of shareable values that have already been
serialized, and do "reference equality" comparisons when new new
substructures are serialized.
This comparison and a set implementation can easily be done with
pointers (because you have "<"), but there are no pointers in the JVM,
and no "reference inequality", so you must use linear seeks, making
the time complexity of serialization quadratic, where in C/C++ it
could be O(N log N)
On Tue, Aug 11, 2009 at 6:17 AM, fft1976<fft1...@gmail.com> wrote:
> On Aug 10, 8:19 pm, "Kyle R. Burton" <kyle.bur...@gmail.com> wrote:
>> > Does all this work with cycles, Java arrays, etc.?
>> It will work with anything that implements the Serializable interface
>> in Java. Arrays do implement that interface, as do all the
>> primitives. With respect to cycles, I'd suspect it does, but would
>> test it. If you have a repl handy it should be pretty easy to test
>> those functions out on your data structures.
>> What class has the cycle? Is it a standard collection?
> Cycles are a special case of substructure sharing. Let's talk about
> that instead.
> (def common [1 2 3 4 5])
> (def a [6 common])
> (def b [7 common])
> (def c [a b])
> If you are serializing c, I want "common" to get copied only once.
> I don't know JVM too well, but I think no efficient user-level
> solution is possible. Why? To take care of substructure sharing, you
> need to remember a set of shareable values that have already been
> serialized, and do "reference equality" comparisons when new new
> substructures are serialized.
> This comparison and a set implementation can easily be done with
> pointers (because you have "<"), but there are no pointers in the JVM,
> and no "reference inequality", so you must use linear seeks, making
> the time complexity of serialization quadratic, where in C/C++ it
> could be O(N log N)
-- Venlig hilsen / Kind regards,
Christian Vest Hansen.
On Tue, Aug 11, 2009 at 12:17 AM, fft1976 <fft1...@gmail.com> wrote: > I don't know JVM too well, but I think no efficient user-level > solution is possible. Why? To take care of substructure sharing, you > need to remember a set of shareable values that have already been > serialized, and do "reference equality" comparisons when new new > substructures are serialized.
> This comparison and a set implementation can easily be done with > pointers (because you have "<"), but there are no pointers in the JVM, > and no "reference inequality", so you must use linear seeks, making > the time complexity of serialization quadratic, where in C/C++ it > could be O(N log N)
Reference equality is available in the JVM (instructions if_acmpeq and if_acmpne), in Java (operators == and !=), and in Clojure (predicate identical?). Furthermore, though < on pointers isn't, so a tree-map of already serialized structures to themselves also isn't, Java provides System.identityHashCode() and IdentityHashMap. These use a hash that respects reference equality. So one in fact can implement one's own serialization that is O(n) using O(1) hashmap lookups (and using reflection, and not working if SecurityManager won't let you setAccessible private fields and the like, so not in an unsigned applet).
(Another use for reference equality is to see if Double.valueOf() is caching, something that arose as an issue in another thread. On my system, Sun JVM 1.6.0_13 -server and Clojure 1.0.0, it apparently is not:
user=> (identical? 2.0 2.0) false
If this comes out to true then it's caching. Integer.valueOf() is caching on my system, but only for small integers:
Which in turn gives us this, otherwise sorely lacking from the Java standard library, but much less useful to us Clojurians who tend to mainly use immutable objects:
(defn deep-copy [obj] (thaw (freeze obj)))
(Object.clone() does a shallow copy and typically isn't as widely available as Serializable.)
On Aug 11, 8:36 am, John Harrop <jharrop...@gmail.com> wrote:
> System.identityHashCode() and IdentityHashMap. These use a hash that
> respects reference equality. So one in fact can implement one's own
> serialization that is O(n) using O(1) hashmap lookups (and using reflection,
> and not working if SecurityManager won't let you setAccessible private
> fields and the like, so not in an unsigned applet).
Good to know, thanks. By the way, hash table operations are O(log N),
because calculating the hash needs to be O(log N), but I'm nitpicking
now.
On Aug 11, 8:15 am, John Harrop <jharrop...@gmail.com> wrote:
> Which in turn gives us this, otherwise sorely lacking from the Java standard
> library, but much less useful to us Clojurians who tend to mainly use
> immutable objects:
> (defn deep-copy [obj]
> (thaw (freeze obj)))
Somebody should benchmark that vs manual implementations of deep copy
in Java.
On Tue, Aug 11, 2009 at 2:31 PM, fft1976 <fft1...@gmail.com> wrote: > On Aug 11, 8:36 am, John Harrop <jharrop...@gmail.com> wrote:
> > System.identityHashCode() and IdentityHashMap. These use a hash that > > respects reference equality. So one in fact can implement one's own > > serialization that is O(n) using O(1) hashmap lookups (and using > reflection, > > and not working if SecurityManager won't let you setAccessible private > > fields and the like, so not in an unsigned applet).
> Good to know, thanks. By the way, hash table operations are O(log N), > because calculating the hash needs to be O(log N), but I'm nitpicking > now.
The time taken to calculate an object's hash depends on that object's class. For a String for instance it is linear in the String's length; for an Integer, it is constant. Furthermore, hashes are often cached (java.lang.String caches its hash for example). So if A and B are objects that share a common reference to an object C in their fields, and use C's hash code in computing their own, C's hash code may be computed only once, and the cost of hashing A and B may be lower than the sum of the cost of hashing only A and the cost of hashing only B. Furthermore, if A and B are serialized again later, their own hashes may not need to be recalculated...