Lazy deserialization / reserialization, aatree release 0.3.0

145 views
Skip to first unread message

William la Forge

unread,
Sep 28, 2015, 8:39:28 PM9/28/15
to Clojure
Too often when dealing with data saved on disk you find yourself having to deserialize a large block of data, make a small update, and then reserialize the whole thing and write it back to disk. Deserialization and reserialization are quite slow, so having to deserialize and reserialize a large block of data for each small update creates a bottleneck. In this case, incremental deserialization / reserialization will be a substantial improvement.

Release 0.3.0 provides lazy aa maps and lazy aa vectors, which can be used as the top-level container for large blocks of durable data. Serialized data is binary and organized in a binary tree. To access an entry, only the nodes in the path from the root to the node of interested must be deserialized. And only the updated nodes (and their parent nodes) need to be subsequently reserialized. So the speed improvement is substantial. (Benchmark code is provided)

As with previous releases, collection-check is used to validate the vector and map implementations.


Your feedback here is greatly appreciated!

Bill

Nathan Davis

unread,
Oct 1, 2015, 3:16:43 PM10/1/15
to Clojure
Hi Bill,

This sounds interesting, but the descriptions here and on the project page are pretty high-level.  Are there more details available?  One question I have in particular is, what do you mean by "update"?  From what I can tell, aatree provides an immutable/persistent data structure compatible with Clojure maps.  But then you say that serializing "updates" is done incrementally.  Could you elaborate on this?  Do you mean that the new copy to be serialized is basically diff'd agaist the old version, and just the difference is serialized?

Thanks,

Nathan Davis

William la Forge

unread,
Oct 1, 2015, 3:58:15 PM10/1/15
to Clojure

A value can have more than one form or implementation. For example, a String and a char array may both represent the same value. Most objects, for example, can be serialized. Changing the form of a value then is not a change of value. This is the idea behind lazy deserialization / reserialization.


We can represent a data structure as a tree, where the values in the tree can be either the immutable value of the object or a read-only ByteBuffer holding the serialized form of the value, or both. Both forms can be held by nodes using atoms, so the form of the values can be changed without having to create a new tree. The value remains unchanged and the tree can be safely shared between threads.


Lazy deserialization comes after loading a map or vector from a ByteBuffer. Initially only the root exists, which holds only the ByteBuffer. As you access and alter the contents of the map or vector, the portions of the tree which are accessed are deserialized.


Incremental reserialization then occurs when writing a map or vector back to a ByteBuffer. Unaltered portions of the tree need not be serialized, as a reference to the ByteBuffer that the structure was loaded from is retained.


Currently only the top-level vector or map is lazy. The contents of the structure is serialized/deserialized using prn-str/read-string. Read-string options, if needed, can be passed in an opts map (i.e. resources map) which is the last argument of the lazy-load method.


The benchmark for map illustrates this:


(ns aatree.map-updates
(:require [aatree.core :refer :all])
(:import (java.nio ByteBuffer)))

(def map-size 1000000)
(def updates 1000)

(defn bld [m i]
(conj m [i i]))

(println)
(def t0 (System/currentTimeMillis))
(def lazy-map (reduce bld emptyLazyAAMap (range map-size)))
(def t1 (System/currentTimeMillis))
(def micr-0 (* 1000. (- t1 t0)))
(println "Time to build a map of size" map-size "=" micr-0 "microseconds")
(println "Time per entry:" (/ micr-0 map-size) "microseconds")

(defn upd [m i]
(let [m1 (assoc m i (- i))
bb (ByteBuffer/allocate (lazy-byte-length m1))]
(lazy-write m1 bb)
(.flip bb)
(load-aamap bb)))

(println)
(def t0 (System/currentTimeMillis))
(def lazy-m (reduce upd lazy-map (range updates)))
(def t1 (System/currentTimeMillis))
(def micr-0 (* 1000. (- t1 t0)))
(println "Time to deserialize/update/reserialize " updates "times =" micr-0 "microseconds")
(println "Time per complete update:" (/ micr-0 updates) "microseconds")

(println)

Nathan Davis

unread,
Oct 1, 2015, 4:41:33 PM10/1/15
to Clojure


On Thursday, October 1, 2015 at 2:58:15 PM UTC-5, William la Forge wrote:

A value can have more than one form or implementation. For example, a String and a char array may both represent the same value. Most objects, for example, can be serialized. Changing the form of a value then is not a change of value. This is the idea behind lazy deserialization / reserialization.


We can represent a data structure as a tree, where the values in the tree can be either the immutable value of the object or a read-only ByteBuffer holding the serialized form of the value, or both. Both forms can be held by nodes using atoms, so the form of the values can be changed without having to create a new tree. The value remains unchanged and the tree can be safely shared between threads.


I think we are on the same page to this point.
 

Lazy deserialization comes after loading a map or vector from a ByteBuffer. Initially only the root exists, which holds only the ByteBuffer. As you access and alter the contents of the map or vector, the portions of the tree which are accessed are deserialized.


This is where I get confused.  Are aatrees persistent structures (in the functional programming sense, like Clojure's data structures) whose values cannot be changed (although their internal representation ("form") may change), or are they mutable?  If they are persistent, what do you mean by "alter the contents"?
 
Nathan Davis

William la Forge

unread,
Oct 1, 2015, 5:22:48 PM10/1/15
to Clojure
Sorry for my bad language. I've not written much about immutable structures before so old phraseology keeps coming in inappropriately.

The structures are immutable. A change results in a new tree with only the nodes which are changed (and their parent nodes) replaced. But the key property here is that deserialization and serialization operations which are performed on these trees are done in place, without making any new nodes--thanks to the magic of atoms. So for example, if several threads are accessing the same tree instance, they all share the deserialized values while any changes to the tree contents results in new tree instances. I.E. It is done right, to the best of my ability. (Extra eyeballs on the code always being appreciated, of course!)

Now I'm sorry for having been so terse and ambiguous. I need practice talking about this stuff. So your questions are appreciated twice over! :-) 

Nathan Davis

unread,
Oct 1, 2015, 6:15:06 PM10/1/15
to Clojure
Thanks, that really helps clarify things.  So is the new tree saved when it is created (i.e., when the old tree is "updated"), or do you have to "write it back" at some point?  In other words, as a user of aatree, do I obtain a new tree and then tell aatree "Here is a new tree.  Please (incrementally) save it."  Or does the mere fact that I have a new tree mean that it has already been saved?

As for terminology, I wouldn't worry too much about using the word "update" when talking about immutable datastructures.  Most functional programmers understand the term "update" to mean "make a 'copy' of something, with some part of it changed in some way" in such contexts.  I think what was confusing is that (a) the Readme didn't explicitly say aatrees were immutable (from a value perspective) and (b) there aren't any examples on using the library.  I think adding a few examples (with perhaps a little explanation of what is happening under the covers) would go a long way.

Nathan Davis
Message has been deleted

William la Forge

unread,
Oct 1, 2015, 7:15:46 PM10/1/15
to Clojure
Lets look at the upd function that is part of the map benchmark I posted above...

(defn upd [m i]
(let [m1 (assoc m i (- i))
bb (ByteBuffer/allocate (lazy-byte-length m1))]
(lazy-write m1 bb)
(.flip bb)
(load-aamap bb)))

And its invocation...

(def lazy-m (reduce upd lazy-map (range updates)))

Here we see that upd is called repeatedly by the reduce, its first argument being the result of the previous call to upd.

m1 (assoc m i (- i)

The above is in the upd function. It uses assoc to create an updated copy of the map.

bb (ByteBuffer/allocate (lazy-byte-length m1)

bb is a byte buffer with a capacity equal to the length of the serialized contents of m1.

    (lazy-write m1 bb)
(.flip bb)
(load-aamap bb))

Now we write the updated map to the byte buffer, flip it, and then create a new map from the byte buffer. This new map is then passed back to the reduce method which calls upd again with the new/updated map.

This usage, being a benchmark, is a bit atypical as we are serializing the contents of the map with each update and then loading the result back into a new map. But. You can see that the map works like the standard clojure map, and only it works additionally with the lazy-byte-length, lazy-write and aamap-load functions. The underlying tree is not normally accessed by the application developer, just as the red/black tree used to implement clojure sorted maps are not normally accessed by the application developer.

So aamap is used just like sorted-map, except for the additional capability of being able to quickly load it from and write it to a ByteBuffer. And, like a sorted-map, you can also specify a comparator for ordering the keys, though the comparator would be passed in the optional opts map on the aamap-load function or in the create-lazy-aamap function.

The full API is given here: https://github.com/laforge49/aatree/blob/master/src/aatree/core.clj (Still need to add doc strings though.)

William la Forge

unread,
Oct 3, 2015, 1:06:56 PM10/3/15
to Clojure
I've just written the API for aatree: https://github.com/laforge49/aatree/wiki/API

Please let me know if anything needs clarification! (I'm way way to close to all this, having worked on similar logic for the past several years.)

(We should strive for beauty in the eye of the reader.)
Reply all
Reply to author
Forward
0 new messages