are serialized bytes for each object in a stream independent of each other?

183 views
Skip to first unread message

Reynold Xin

unread,
Mar 23, 2015, 12:33:59 AM3/23/15
to kryo-...@googlegroups.com
Hi Nate / Martin (and other helpful people here),

In Spark we are thinking about taking the bytes serialized by writeClassAndObject and move it around. Since what we are about to do isn't part of Kryo's contract, it would be great to have somebody more familiar with Kryo to confirm.

What we are hoping to do is similar to the following:

1. writeClassAndObject(a)
2. flush()
3. writeClassAndObject(b)
4. flush()

and then swap the bytes written by the two writeClassAndObject calls to swap the ordering of a and b (i.e. upon deserialization, we want to get b, followed by a).

I took a quick look at writeClassAndObject's source code, and did some basic experiments. It seems to me that one invocation of that method is independent of another invocation. One place that concerned me was the following on the readme page: "Subsequent appearances of that object type within the same object graph are written using a variable length int." However, based on my experiment, it seems like subsequent appearances in a stream does not belong in the "same object graph" -- if I'm understanding this correctly, an object graph does not apply to multiple objects in a stream, but rather only applies to the serialization of a single object (and its children).

Can I get somebody who's more familiar with Kryo to comment on this? 

Thanks!

Chetan Narsude

unread,
Mar 23, 2015, 2:24:13 AM3/23/15
to kryo-...@googlegroups.com
Others can correct me if I am wrong. I am sure that the object graph belongs to a separately and then b separately. So you can safely do what you have set out for.

Chetan 


--
You received this message because you are subscribed to the "kryo-users" group.
http://groups.google.com/group/kryo-users
---
You received this message because you are subscribed to the Google Groups "kryo-users" group.
To unsubscribe from this group and stop receiving emails from it, send an email to kryo-users+...@googlegroups.com.
For more options, visit https://groups.google.com/d/optout.

Joachim Durchholz

unread,
Mar 23, 2015, 3:44:12 AM3/23/15
to kryo-...@googlegroups.com
Am 23.03.2015 um 05:33 schrieb Reynold Xin:
> Hi Nate / Martin (and other helpful people here),
>
> In Spark we are thinking about taking the bytes serialized
> by writeClassAndObject and move it around. Since what we are about to do
> isn't part of Kryo's contract, it would be great to have somebody more
> familiar with Kryo to confirm.

Please be aware that if it is not in Kryo's contract, what you're doing
may break in future versions of Kryo.
May or may not be relevant to your work, of course, but if you do it,
make sure that it's documented as a dependency and has a unit test.

Non-exhaustive list of things to test:
* Does it work if a and b have different classes?
* Does it work if a and b are the first objects of the class that Kryo
sees? The order in which class name information and class data are sent
may vary across Kryo versions, and Kryo might make assumptions about
consistent ordering.
* Does it work if a and b have references to a shared object? (This is
the most likely candidate for breakage I think.) If this breaks and you
want to swap bytes anyway, you'll want to put really big warning signs
into the class definitions of a and b.

Personally, what you want might be easier to achieve and more robust if
you write a serializer/deserializer class that deserializes in a
different order than it serializes.
Or maybe there is yet another approach - it really depends on whether
you want to do this just to allow the deserializer to see something
early (and maybe start background processes, or abort the transmission),
or for on-the-fly data transformation, or something entirely different
that I can't think of now.

Nate

unread,
Mar 23, 2015, 8:33:23 AM3/23/15
to kryo-users
On Mon, Mar 23, 2015 at 5:33 AM, Reynold Xin <rx...@databricks.com> wrote:
Hi Nate / Martin (and other helpful people here),

In Spark we are thinking about taking the bytes serialized by writeClassAndObject and move it around. Since what we are about to do isn't part of Kryo's contract, it would be great to have somebody more familiar with Kryo to confirm.

What we are hoping to do is similar to the following:

1. writeClassAndObject(a)
2. flush()
3.
​​
writeClassAndObject(b)
4. flush()

and then swap the bytes written by the two writeClassAndObject calls to swap the ordering of a and b (i.e. upon deserialization, we want to get b, followed by a).

​Output is similar to ByteBuffer (has a position, capacity, etc) + DataOutputStream (convenience methods). If the same Output is used for multiple serializations (ie calls to ​writeClassAndObject) then the Output's byte[] will contain the bytes for each (assuming they fit).

What is "flush()" in your example? Output#flush is only useful when the Output has an OutputStream. In that case the Output is emptied into the stream and the position is set to 0.

What specifically are you doing that isn't part of Kryo's contract?

I took a quick look at writeClassAndObject's source code, and did some basic experiments. It seems to me that one invocation of that method is independent of another invocation. One place that concerned me was the following on the readme page: "Subsequent appearances of that object type within the same object graph are written using a variable length int." However, based on my experiment, it seems like subsequent appearances in a stream does not belong in the "same object graph" -- if I'm understanding this correctly, an object graph does not apply to multiple objects in a stream, but rather only applies to the serialization of a single object (and its children).

​See ​Kryo#setAutoReset. You can set it to false if you want the varint ordinals written for object references to be shared for multiple calls to writeClassAndObject. Default is true.

-Nate

Reynold Xin

unread,
Mar 23, 2015, 3:23:40 PM3/23/15
to kryo-...@googlegroups.com
On Mon, Mar 23, 2015 at 5:32 AM, Nate <nathan...@gmail.com> wrote:
On Mon, Mar 23, 2015 at 5:33 AM, Reynold Xin <rx...@databricks.com> wrote:
Hi Nate / Martin (and other helpful people here),

In Spark we are thinking about taking the bytes serialized by writeClassAndObject and move it around. Since what we are about to do isn't part of Kryo's contract, it would be great to have somebody more familiar with Kryo to confirm.

What we are hoping to do is similar to the following:

1. writeClassAndObject(a)
2. flush()
3.
​​
writeClassAndObject(b)
4. flush()

and then swap the bytes written by the two writeClassAndObject calls to swap the ordering of a and b (i.e. upon deserialization, we want to get b, followed by a).

​Output is similar to ByteBuffer (has a position, capacity, etc) + DataOutputStream (convenience methods). If the same Output is used for multiple serializations (ie calls to ​writeClassAndObject) then the Output's byte[] will contain the bytes for each (assuming they fit).

What is "flush()" in your example? Output#flush is only useful when the Output has an OutputStream. In that case the Output is emptied into the stream and the position is set to 0.

It was Output.flush().
 

What specifically are you doing that isn't part of Kryo's contract?

Basically we are serializing a bunch of objects onto a single stream, and then we sort (based on some other identifier) the serialized bytes of these objects and then write them out in sorted order. I'm wondering if it is safe to do this. Based on your reply, it seems like as long as "setAutoReset" is true, then it is safe to do so.

And if I am understanding this correctly, does it mean that on Kryo's side, the performance of serializing directly to a stream should be the same as serializing each objects individually into some byte arrays (assuming these byte arrays are pre-allocated, and next to each other so you don't lose cache locality).

 
I took a quick look at writeClassAndObject's source code, and did some basic experiments. It seems to me that one invocation of that method is independent of another invocation. One place that concerned me was the following on the readme page: "Subsequent appearances of that object type within the same object graph are written using a variable length int." However, based on my experiment, it seems like subsequent appearances in a stream does not belong in the "same object graph" -- if I'm understanding this correctly, an object graph does not apply to multiple objects in a stream, but rather only applies to the serialization of a single object (and its children).

​See ​Kryo#setAutoReset. You can set it to false if you want the varint ordinals written for object references to be shared for multiple calls to writeClassAndObject. Default is true.

This is great to know. Thanks.

 


-Nate

--
You received this message because you are subscribed to the "kryo-users" group.
http://groups.google.com/group/kryo-users
---
You received this message because you are subscribed to a topic in the Google Groups "kryo-users" group.
To unsubscribe from this topic, visit https://groups.google.com/d/topic/kryo-users/6ZUSyfjjtdo/unsubscribe.
To unsubscribe from this group and all its topics, send an email to kryo-users+...@googlegroups.com.

Ian O'Connell

unread,
Mar 23, 2015, 3:53:47 PM3/23/15
to kryo-...@googlegroups.com
Totally aside from your question Reynold, but I presume you've delt with the inconsistency between sorting on bytes and in memory jvm objects that comes up a bit. (Sets, Maps, Enums, etc..) ?

You received this message because you are subscribed to the Google Groups "kryo-users" group.
To unsubscribe from this group and stop receiving emails from it, send an email to kryo-users+...@googlegroups.com.

Reynold Xin

unread,
Mar 23, 2015, 4:01:38 PM3/23/15
to kryo-users
Hi Ian,

We are not actually sorting the serialized bytes based on the bytes themselves -- we are sorting the serialized bytes based on some other integer id we assigned to each bytes (basically a partition id). It is fairly hard to normalize all data structures such as sets and maps, unless we sort those internally too (which can be expensive).


Ian O'Connell

unread,
Mar 23, 2015, 4:05:51 PM3/23/15
to kryo-...@googlegroups.com
Yep, was just making sure. In general btw for the most part we haven't seen the keys having maps that large that sorting is too prohibitive so far. For the advantage of consistent sorting on bytes has been reasonably decent in an MR setting. (Using macros to automatically build stable serializers + ordering).
Reply all
Reply to author
Forward
0 new messages