Small question about serialization of implicitly registered classes


mongonix

Jun 7, 2012, 6:55:41 AM
to kryo-...@googlegroups.com
Hi Nate,

I see the following in Kryo.java, in the public Registration getRegistration(Class type) method.
If a class is not registered and registration is not required, Kryo tries to register the class implicitly.
There is dedicated code to do that:
registration = registerInternal(new Registration(type, getDefaultSerializer(type), NAME));

But this line passes NAME as the last parameter. As a result, no numeric ID is assigned to this class,
and every time (!!!) you serialize an object of this type, the complete fully qualified class name is written,
which introduces quite some overhead.


I'd like to understand the reason for doing it this way. Why not call this method instead, for example:
registration = register(type, getDefaultSerializer(type));

It would register the class using the next available numeric ID. But maybe you want to treat
implicitly registered classes differently? If so, could you please clarify?

Thanks,
Leo

Nate

Jun 7, 2012, 1:39:47 PM
to kryo-...@googlegroups.com
If you register with IDs, whether explicitly or implicitly, you must have the exact same IDs when you deserialize. When you register implicitly, the order in which classes are seen is not guaranteed. When you deserialize, if you got an ID, you would not know what class the ID belonged to.

Note the class name is only written once per object graph.
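
A minimal sketch of the ordering problem, assuming a registry that hands out the next free int the first time it sees a class. OrdinalRegistry and idFor are hypothetical names for illustration, not Kryo's API:

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Hypothetical stand-in for implicit registration with auto-assigned ordinals.
class OrdinalRegistry {
    private final Map<Class<?>, Integer> ids = new LinkedHashMap<>();

    /** Returns the ID for the class, assigning the next free ordinal on first sight. */
    int idFor(Class<?> type) {
        Integer id = ids.get(type);
        if (id == null) {
            id = ids.size();
            ids.put(type, id);
        }
        return id;
    }

    public static void main(String[] args) {
        // The writer happens to see String first, then Integer.
        OrdinalRegistry writer = new OrdinalRegistry();
        writer.idFor(String.class);
        writer.idFor(Integer.class);

        // The reader happens to see the same classes in the opposite order.
        OrdinalRegistry reader = new OrdinalRegistry();
        reader.idFor(Integer.class);
        reader.idFor(String.class);

        // Same class, different ID on each side.
        System.out.println(writer.idFor(String.class) + " vs " + reader.idFor(String.class)); // prints "0 vs 1"
    }
}
```

Both sides assign unique IDs locally, but the IDs only mean the same thing if both sides saw the classes in the same order.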

-Nate



mongonix

Jun 7, 2012, 5:38:23 PM
to kryo-...@googlegroups.com


On Thursday, June 7, 2012 7:39:47 PM UTC+2, Nate wrote:
If you register with IDs, whether explicitly or implicitly, you must have the exact same IDs when you deserialize.

 Yes. True.
 
When you register implicitly, the order classes are seen is not guaranteed. When you deserialize if you got an ID, you would not know what class the ID belonged to.


I understand what you mean, but I do not quite agree with you. Let me give you an example. 
I work on Kryo-based serialization for Scala/Akka. I added to Akka the ability to define in a config file the set of classes (class names) that should be pre-registered. You can provide either just the class names or the class names together with their IDs. This also ensures that the serializer and the deserializer use the same registration order. But there is a problem. Many Scala classes, e.g. the collection classes from the standard library, come with a lot of subclasses, nested classes and other classes that the Scala compiler generates from the original source. Many of these classes have very long and cryptic names. It is very annoying, if not unrealistic, to require every user to list all those strange class names just to use Kryo for serializing collection objects. It would be much better if these classes were registered implicitly. This is what happens now, but with the problem described in the original post. And BTW, since all "main" classes are explicitly listed in the same order in the config files, all dependent generated classes are also seen and implicitly registered in the same order. So, in this sense, if IDs were assigned, they would be the same.
 
Note the class name is only written once per object graph.

True. But since e.g. collection classes and all the generated ones are used very often (e.g. every time you send a message between actors), this ends up introducing a lot of overhead. I see performance drop by a factor of 2 or 3 because of the way Kryo currently handles these implicit registrations. For one, writing fully qualified class names adds overhead during both serialization and deserialization. Secondly, during deserialization you do a class lookup at least once per object graph:
Class.forName(className, false, classLoader)
This is also a very expensive operation.
 
So, maybe something can be done about it? Maybe we can introduce an option to auto-assign IDs for implicitly registered classes? The option can be off by default to preserve the old semantics. Moreover, to avoid ID conflicts, one could even have an option to assign unusual IDs to them, e.g. the hashCodes of those class names. This would guarantee that they do not overlap with anything else and are unique enough. What do you think?

BTW, if we look at the problem in a more generic way and think about using Kryo as a framework for remote invocations or message exchange sessions between sender and receiver, one could think of introducing the concept of session contexts. I.e. sender and receiver exchange information about the mappings between class names and IDs and later use it for all messages inside the same session. It would be very similar to your remark "the class name is only written once per object graph", with the difference that the class name is only written once per session and may be referred to by any number of object graphs exchanged during this session. But this is just an idea for the future and would most likely require bigger changes.


mongonix

Jun 7, 2012, 5:42:04 PM
to kryo-...@googlegroups.com
And secondly, during deserialization you always do at least once per object graph class lookup: 
Class.forName(className, false, classLoader)
This is also a very expensive operation.
 
To elaborate on this: My profiler shows that under heavy load, 40% of CPU time is spent on this operation, because it is performed for every actor message received.
I'd say that local caching in Kryo using a Map<String, Class> could help a lot here.
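
A minimal sketch of such a cache, assuming one cache instance per class loader. ClassNameCache, resolve and size are hypothetical names for illustration and not part of Kryo:

```java
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.ConcurrentMap;

// Hypothetical name -> Class cache placed in front of Class.forName.
class ClassNameCache {
    private final ConcurrentMap<String, Class<?>> cache = new ConcurrentHashMap<>();
    private final ClassLoader classLoader;

    ClassNameCache(ClassLoader classLoader) {
        this.classLoader = classLoader;
    }

    /** Resolves the class name, calling Class.forName only on a cache miss. */
    Class<?> resolve(String className) {
        Class<?> type = cache.get(className);
        if (type == null) {
            try {
                type = Class.forName(className, false, classLoader);
            } catch (ClassNotFoundException ex) {
                throw new IllegalStateException("Unable to find class: " + className, ex);
            }
            cache.put(className, type);
        }
        return type;
    }

    /** Number of distinct class names resolved so far. */
    int size() {
        return cache.size();
    }
}
```

Only the first lookup of a given name pays for Class.forName; every later lookup of the same name is a plain map read.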

Nate

Jun 7, 2012, 6:20:35 PM
to kryo-...@googlegroups.com
On Thu, Jun 7, 2012 at 2:38 PM, mongonix <romi...@gmail.com> wrote:


On Thursday, June 7, 2012 7:39:47 PM UTC+2, Nate wrote:
If you register with IDs, whether explicitly or implicitly, you must have the exact same IDs when you deserialize.

 Yes. True.
 
When you register implicitly, the order classes are seen is not guaranteed. When you deserialize if you got an ID, you would not know what class the ID belonged to.


I understand what you mean, but I do not quite agree with you. Let me give you an example. 
I work on kryo-based serialization for Scala/Akka. I added to Akka the possibility to define in a config file the set of classes (classnames) that should be pre-registered. You can either provide just classnames or even their IDs. This way you can also ensure that the same order of registration is used by both serializer and deserializer. But there is a problem. Many of Scala classes, e.g. collection classes from the standard library contain a lot of sub-classes, nested classes and other classes generated by Scala compiler from the original source. Many of these classes have very long and cryptic names. It is very annoying if not unrealistic to demand all users to provide a set of all those strange class names, if they want to use Kryo for serializing collection objects. It would be much better, if these classes are registered implicitly. This is what happens now, but with the problem described in the original post. And BTW, since all "main" classes are explicitly listed in the same order in the config files, all dependent generated classes are also seen and implicitly registered in the same order. So, in this sense, if ids would be assigned, they would be the same.

Registering the main classes does not cause the generated classes to also be registered. Currently FieldSerializer (and others) defer looking up serializers for the fields until the first time an instance is serialized. This enables registering classes without requiring the order to match the dependencies between classes, and solves this situation:

class A {
   B b;
}
class B {
   A a;
}

Could you write something that inspects each of your "main" classes and explicitly registers the generated classes? This may be harder than it seems at first. Imagine three classes: A, B extends A, and C extends A. If one of your main classes has a field of type A, you would need to register A, B, and C.

Note the generated classes are likely not public API, so any changes to them will invalidate previously serialized bytes.

Also note that a class identifier is only written when necessary. If the type of a field is final, the class will not be written because at deserialization time we know the concrete type of the value must match the field type (no polymorphism). Kryo#isFinal(Class) can be overridden to decide what classes are treated as final. I don't know if it would be valid to treat the generated classes as final.
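
To illustrate the point about final types, here is a minimal sketch of the kind of check that decides whether a class identifier could be skipped. It relies only on java.lang.reflect.Modifier. FinalTypeCheck and canSkipClassId are hypothetical names, and whether Kryo#isFinal is implemented this way is an assumption:

```java
import java.lang.reflect.Modifier;

// Hypothetical helper, not part of Kryo.
class FinalTypeCheck {
    /** True if the declared type is final, so a non-null value must be exactly that class. */
    static boolean canSkipClassId(Class<?> declaredType) {
        return Modifier.isFinal(declaredType.getModifiers());
    }

    public static void main(String[] args) {
        // String is final: a String field can only hold a String.
        System.out.println(canSkipClassId(String.class));
        // List is an interface: the concrete class must be written.
        System.out.println(canSkipClassId(java.util.List.class));
    }
}
```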

 
Note the class name is only written once per object graph.

True. But since e.g. collection classes and all generated ones are used very often (e.g. every time you send a message between actors) it ends up introducing a lot of overhead. I see performance drops by a factor of 2 or 3 due to the way how these implicit registrations handled now by Kryo. For one, writing fully qualified class names introduces overhead during serialization and deserialization. And secondly, during deserialization you always do at least once per object graph class lookup: 
Class.forName(className, false, classLoader)
This is also a very expensive operation.
 
So, may be something can be done about it? May be we can introduce an option to auto-assign ids for implicitly registered classes? The option can be off by default to preserve the old semantics. More over, to avoid id conflicts, one could even have an option to assign unusual ids to it, e.g. hashCodes of those class names. This would guarantee that they do not overlap with anything else and are unique enough. What do you think?

I'm open to doing this if we can come up with a solution that will work and is not application specific. ID uniqueness is easily done with int ordinals. The real problem is that you need a deterministic order for the classes. Without that, implicit ID assignment doesn't make sense.
 

BTW, if we look at the problem in a more generic way and think about using Kryo as a framework for remote invocations or message exchange sessions between sender and receiver, one could think of introducing the concept of session contexts. I.e. sender and receiver exchange information about mappings between classnames and ids and later use it for all messages inside the same session. It will be very similar to your remark "the class name is only written once per object graph" with the difference that the class name is only written once per session and may be referred by any number of object graphs exchanged  during this session. But this is just an idea for the future and would most likely require bigger changes.

Possibly this could be implemented on top of Kryo. Maybe override Kryo#getRegistration(Class), call super, if the returned registration has ID == NAME, then register the class explicitly (this will overwrite NAME with an auto assigned ID) and send a message to the other side with the class name string and int ID.

-Nate

Nate

Jun 7, 2012, 6:38:47 PM
to kryo-...@googlegroups.com

It does seem that Class.forName is expensive:

Class.forName: 61.39909ms
ObjectMap: 4.607253ms
HashMap: 5.77879ms

Code for that is below. I'll add caching to avoid Class.forName.

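import java.util.HashMap;

import com.esotericsoftware.kryo.Kryo;
// Assumption: ObjectMap is Kryo's own map class from the util package.
import com.esotericsoftware.kryo.util.ObjectMap;

public class ClassForNameBenchmark {
    public static void main (String[] args) throws Exception {
        int count = 100000;
        String name = Kryo.class.getName();
        ClassLoader classLoader = Kryo.class.getClassLoader();
        // Loops forever so that later passes show JIT-warmed timings; stop it manually.
        while (true) {
            // Baseline: resolve the same class name through Class.forName every time.
            Class[] types = new Class[count];
            long s = System.nanoTime();
            for (int i = 0, n = types.length; i < n; i++) {
                types[i] = Class.forName(name, false, classLoader);
            }
            long e = System.nanoTime();
            System.out.println("Class.forName: " + (e - s) / 1e6f + "ms");

            // Fill an ObjectMap outside the timed section, then time only the lookups.
            ObjectMap cache = new ObjectMap();
            String[] names = new String[count];
            for (int i = 0, n = types.length; i < n; i++) {
                names[i] = name + i; // Use key similar in length to the class name.
                cache.put(names[i], types[i]);
            }
            s = System.nanoTime();
            for (int i = 0, n = types.length; i < n; i++) {
                types[i] = (Class)cache.get(names[i]);
            }
            e = System.nanoTime();
            System.out.println("ObjectMap: " + (e - s) / 1e6f + "ms");

            // Same measurement with java.util.HashMap.
            HashMap cache2 = new HashMap();
            for (int i = 0, n = types.length; i < n; i++)
                cache2.put(names[i], types[i]);
            s = System.nanoTime();
            for (int i = 0, n = types.length; i < n; i++) {
                types[i] = (Class)cache2.get(names[i]);
            }
            e = System.nanoTime();
            System.out.println("HashMap: " + (e - s) / 1e6f + "ms");
        }
    }
}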

mongonix

Jun 7, 2012, 6:58:20 PM
to kryo-...@googlegroups.com


On Friday, June 8, 2012 12:20:35 AM UTC+2, Nate wrote:
On Thu, Jun 7, 2012 at 2:38 PM, mongonix  wrote:


On Thursday, June 7, 2012 7:39:47 PM UTC+2, Nate wrote:
If you register with IDs, whether explicitly or implicitly, you must have the exact same IDs when you deserialize.

 Yes. True.
 
When you register implicitly, the order classes are seen is not guaranteed. When you deserialize if you got an ID, you would not know what class the ID belonged to.


I understand what you mean, but I do not quite agree with you. Let me give you an example. 
I work on kryo-based serialization for Scala/Akka. I added to Akka the possibility to define in a config file the set of classes (classnames) that should be pre-registered. You can either provide just classnames or even their IDs. This way you can also ensure that the same order of registration is used by both serializer and deserializer. But there is a problem. Many of Scala classes, e.g. collection classes from the standard library contain a lot of sub-classes, nested classes and other classes generated by Scala compiler from the original source. Many of these classes have very long and cryptic names. It is very annoying if not unrealistic to demand all users to provide a set of all those strange class names, if they want to use Kryo for serializing collection objects. It would be much better, if these classes are registered implicitly. This is what happens now, but with the problem described in the original post. And BTW, since all "main" classes are explicitly listed in the same order in the config files, all dependent generated classes are also seen and implicitly registered in the same order. So, in this sense, if ids would be assigned, they would be the same.

Registering the main classes does not cause the generated classes to also be registered. Currently FieldSerializer (and others) defer looking up serializers for the fields until the first time an instance is serialized. This enables registering classes without requiring the order to match the dependencies between classes, and solves this situation:

class A {
   B b;
}
class B {
   A a;
}


OK. I see.
 
Could you write something that inspects each of your "main" classes and explicitly registers the generated classes? This may be harder than it seems at first. Imagine three classes: A, B extends A, and C extends A. If one of your main classes has field of type A, you would need to register A, B, and C.


I'll think about it. But it can be very tricky, as you say. It is already tricky in Java and will be even more so in Scala.
 
Note the generated classes are likely not public API, so any changes to them will invalidate previously serialized bytes.


Well, generated classes in Scala are resulting from case classes, closures, overloaded operators with symbolic names, etc. They are often public, but their names are really cryptic in many cases ;-)
 
Also note that a class identifier is only written when necessary. If the type of a field is final, the class will not be written because at deserialization time we know the concrete type of the value must match the field type (no polymorphism). Kryo#isFinal(Class) can be overridden to decide what classes are treated as final. I don't know if it would be valid to treat the generated classes as final.

I'm not sure that all those classes are final. Plus, I'd like to avoid explicit dependency on their names in the code, because the implementation of Scala library may change in the future and there will be new class names to take care of. Therefore a generic solution would be nicer.

 
 
Note the class name is only written once per object graph.

True. But since e.g. collection classes and all generated ones are used very often (e.g. every time you send a message between actors) it ends up introducing a lot of overhead. I see performance drops by a factor of 2 or 3 due to the way how these implicit registrations handled now by Kryo. For one, writing fully qualified class names introduces overhead during serialization and deserialization. And secondly, during deserialization you always do at least once per object graph class lookup: 
Class.forName(className, false, classLoader)
This is also a very expensive operation.
 
So, may be something can be done about it? May be we can introduce an option to auto-assign ids for implicitly registered classes? The option can be off by default to preserve the old semantics. More over, to avoid id conflicts, one could even have an option to assign unusual ids to it, e.g. hashCodes of those class names. This would guarantee that they do not overlap with anything else and are unique enough. What do you think?

I'm open to doing this if we can come up with a solution that will work and is not application specific. ID uniqueness is easily done with int ordinals. The real problem is that you need a deterministic order for the classes. Without that, implicit ID assignment doesn't make sense.

Hmm. Have you really thought about my idea of using the hashCode of a fully qualified class name as the ID for implicitly registered classes? I recently ran some tests over all JDK classes + all Scala lib classes + all Glassfish classes + all WebLogic classes, and experimented with different hash functions. The results showed that even String.hashCode() produces just a few collisions for about 90000 class names. So it can basically be considered unique, and it is the same for a given class name regardless of the Kryo instance and the IDs of other registered classes. This means we can do it like this: the first time we see a class, we implicitly register it, assign ID = hashCode(classname), and serialize both this ID and the class name. When we deserialize, we read the ID and the class name and register this new mapping on the deserializer side. The next time we send a message, we simply use this ID. Of course, such an ID is longer (I've seen experimentally that 32 bits, or even 24 of the 32 bits, of the hashCode produce unique enough results) and may require 3-4 bytes instead of 1-2 bytes for short IDs. But it is way better than writing fully qualified class names in every object graph.
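
A minimal sketch of the hashCode-as-ID idea, assuming String.hashCode() of the class name is used directly as the ID. HashIdRegistry, register and nameFor are hypothetical names for illustration, not Kryo's API. Since a few collisions do occur, the sketch reports them instead of silently overwriting a mapping:

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical registry keyed by the hash of the fully qualified class name.
class HashIdRegistry {
    private final Map<Integer, String> nameById = new HashMap<>();

    /** The ID depends only on the name, never on registration order. */
    static int idFor(String className) {
        return className.hashCode();
    }

    /** Registers the name under its hash ID; returns false if another name already owns that ID. */
    boolean register(String className) {
        String existing = nameById.putIfAbsent(idFor(className), className);
        return existing == null || existing.equals(className);
    }

    /** Returns the registered name for the ID, or null if the ID is unknown. */
    String nameFor(int id) {
        return nameById.get(id);
    }
}
```

For example, the strings "Aa" and "BB" both hash to 2112, so registering the second one returns false here. A real implementation would need some fallback for such cases, for instance writing the full name for the colliding class.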
 
 

BTW, if we look at the problem in a more generic way and think about using Kryo as a framework for remote invocations or message exchange sessions between sender and receiver, one could think of introducing the concept of session contexts. I.e. sender and receiver exchange information about mappings between classnames and ids and later use it for all messages inside the same session. It will be very similar to your remark "the class name is only written once per object graph" with the difference that the class name is only written once per session and may be referred by any number of object graphs exchanged  during this session. But this is just an idea for the future and would most likely require bigger changes.

Possibly this could be implemented on top of Kryo. Maybe override Kryo#getRegistration(Class), call super, if the returned registration has ID == NAME, then register the class explicitly (this will overwrite NAME with an auto assigned ID) and send a message to the other side with the class name string and int ID.

Nice idea. I'll think about it. The real problem will be probably to associate a given Kryo instance with a session context. Akka for example does not support "session" concept directly. Currently, it completely decouples networking layer from serialization layer. Therefore, serializers have no idea about a session context they are used for. I've started working on improving that.

-Leo


Nate

Jun 7, 2012, 7:21:58 PM
to kryo-...@googlegroups.com
On Thu, Jun 7, 2012 at 3:58 PM, mongonix <romi...@gmail.com> wrote:

Hmm. Have you really though about my idea of using hashCode of a fully qualified class name as an ID for implicitly registered classes? I did some tests recently for all JDK classes + all Scala lib classes + all Glassfish classes + all WebLogic classes. I experimented with different hash functions. The results have shown that even String.hashCode() results in just a few collisions for about 90000 class names. So, basically it can be considered unique and it is the same for each classname independent of the Kryo instance and ids of other registered classes. Which means that we can do it like this: first time we see a class, we implicitly register id, assign an ID = hashCode(classname), serialize both this id and classname. When we deserialize, we read ID and classname and register this new mapping on the deserializer side. Next time we send a message, we simply use this ID. Of course, such an ID is longer (I've seen experimentally that 32 bits or even 24 out of 32 bits of hashCode are producing unique enough results) and may require 3-4 bytes instead of 1-2 bytes for short IDs. But it is way better than writing fully qualified class names in every object graph.

I'm not sure how this is different than writing an int ordinal? It seems the part that is really different is "Next time we send a message, we simply use this ID.". We could do that with ordinals, which use fewer bytes and can't collide. But this approach only works if messages sent are deserialized in the same order. This makes me feel like it is application specific. Does it make sense as a Kryo feature? If not, we can still make any needed changes to Kryo to make it possible.

-Nate

mongonix

Jun 7, 2012, 7:41:11 PM
to kryo-...@googlegroups.com


On Friday, June 8, 2012 1:21:58 AM UTC+2, Nate wrote:
On Thu, Jun 7, 2012 at 3:58 PM, mongonix  wrote:

Hmm. Have you really though about my idea of using hashCode of a fully qualified class name as an ID for implicitly registered classes? I did some tests recently for all JDK classes + all Scala lib classes + all Glassfish classes + all WebLogic classes. I experimented with different hash functions. The results have shown that even String.hashCode() results in just a few collisions for about 90000 class names. So, basically it can be considered unique and it is the same for each classname independent of the Kryo instance and ids of other registered classes. Which means that we can do it like this: first time we see a class, we implicitly register id, assign an ID = hashCode(classname), serialize both this id and classname. When we deserialize, we read ID and classname and register this new mapping on the deserializer side. Next time we send a message, we simply use this ID. Of course, such an ID is longer (I've seen experimentally that 32 bits or even 24 out of 32 bits of hashCode are producing unique enough results) and may require 3-4 bytes instead of 1-2 bytes for short IDs. But it is way better than writing fully qualified class names in every object graph.

I'm not sure how this is different than writing an int ordinal? It seems the part that is really different is "Next time we send a message, we simply use this ID.". We could do that with ordinals,

I guess we misunderstand each other here, but I'm not sure where exactly ;-) By IDs I mean 32-bit or 24-bit integers (whatever is enough to represent the value of a numeric ID in binary form). I guess you mean the same with int ordinals.

 
> which use fewer bytes and can't collide.
Why can't they collide? I see that on each side (i.e. serializer and deserializer) the IDs are assigned uniquely using Kryo's current approach. But what guarantees that the same ID is not used for different classes on the two sides?

But this approach only works if messages sent are deserialized in the same order. This makes me feel like it is application specific. Does it make sense as a Kryo feature? If not, we can still make any needed changes to Kryo to make it possible.

I cannot quite follow you. Why does it only work if messages are deserialized in the same order? My whole point was that if we map an FQCN to its hashCode as the ID, the mapping is unique and does not depend on the order in which classes were registered on either side. Anyone who registers an (implicit) mapping for this FQCN would always map it to the same ID, namely the hashCode of the FQCN. Doesn't that solve the ordering issue? Or am I missing something?

-Leo

Nate

Jun 7, 2012, 9:37:05 PM
to kryo-...@googlegroups.com
On Thu, Jun 7, 2012 at 4:41 PM, mongonix <romi...@gmail.com> wrote:


On Friday, June 8, 2012 1:21:58 AM UTC+2, Nate wrote:
On Thu, Jun 7, 2012 at 3:58 PM, mongonix  wrote:

Hmm. Have you really though about my idea of using hashCode of a fully qualified class name as an ID for implicitly registered classes? I did some tests recently for all JDK classes + all Scala lib classes + all Glassfish classes + all WebLogic classes. I experimented with different hash functions. The results have shown that even String.hashCode() results in just a few collisions for about 90000 class names. So, basically it can be considered unique and it is the same for each classname independent of the Kryo instance and ids of other registered classes. Which means that we can do it like this: first time we see a class, we implicitly register id, assign an ID = hashCode(classname), serialize both this id and classname. When we deserialize, we read ID and classname and register this new mapping on the deserializer side. Next time we send a message, we simply use this ID. Of course, such an ID is longer (I've seen experimentally that 32 bits or even 24 out of 32 bits of hashCode are producing unique enough results) and may require 3-4 bytes instead of 1-2 bytes for short IDs. But it is way better than writing fully qualified class names in every object graph.

I'm not sure how this is different than writing an int ordinal? It seems the part that is really different is "Next time we send a message, we simply use this ID.". We could do that with ordinals,

I guess we misunderstand each other here, but I'm not sure where exactly ;-) By IDs I mean 32bit or 24bits integer numbers (whatever is enough to represent the value of a numeric ID in binary form). I guess you mean the same with int ordinals.

 
> which use fewer bytes and can't collide.
Why they cannot collide? I see that on each side (i.e. serializer and deserializer) the IDs are assigned uniquely using current Kryo's approach. But what guarantees that the same id is not used for different classes on each of the sides?

It depends on how you build the system. I would think the sending side would decide on its own IDs, and the receiving side would learn them. For two way communication you'd have two sets of IDs.
 

But this approach only works if messages sent are deserialized in the same order. This makes me feel like it is application specific. Does it make sense as a Kryo feature? If not, we can still make any needed changes to Kryo to make it possible.

 I cannot quite follow you. Why does it work only if messages are deserialized in the same order? My whole point was that if we map FQCN to the FQCN's hashCode as its ID, then it is a unique mapping and it is not dependent on the order how classes were registered on either side. Anyone who registers a (implicit) mapping for this FQCN would always map it to the same ID, namely hashCode of FQCN. Doesn't it solve the issue with ordering? Or do I miss something?

Here is what I was thinking:

1) We serialize object graph A, encounter a class for the first time, write the FQCN and an ID (doesn't matter whether an ordinal or FQCN hashcode for now), subsequent encounters of the same class we only write the ID.

2) We serialize object graph B, any classes we have already encountered in the first object graph we only write their ID, otherwise we write the FQCN and ID.

During deserialization we are learning the ID->FQCN mapping from the data, so we must deserialize object graph A first, then B. If we deserialize B first, we may encounter IDs whose FQCN were only written in object graph A.
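
The scheme in 1) and 2) can be sketched roughly as follows. This is a toy model under stated assumptions, not Kryo code: SessionClassIds, Writer and Reader are hypothetical names, a message is modeled as a list of string tokens, and only class references are encoded (no object data):

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Toy model of learning the ID -> FQCN mapping from the data itself.
class SessionClassIds {
    /** Sender side: remembers which class names it has already announced. */
    static class Writer {
        private final Map<String, Integer> idByName = new HashMap<>();

        /** Encodes the class references of one object graph as tokens. */
        List<String> write(String... classNames) {
            List<String> message = new ArrayList<>();
            for (String name : classNames) {
                Integer id = idByName.get(name);
                if (id == null) {
                    id = idByName.size();
                    idByName.put(name, id);
                    message.add(id + "=" + name); // first encounter: ID and FQCN
                } else {
                    message.add(String.valueOf(id)); // later encounters: ID only
                }
            }
            return message;
        }
    }

    /** Receiver side: learns mappings as it reads and fails on an unknown ID. */
    static class Reader {
        private final Map<Integer, String> nameById = new HashMap<>();

        List<String> read(List<String> message) {
            List<String> names = new ArrayList<>();
            for (String token : message) {
                int eq = token.indexOf('=');
                int id = Integer.parseInt(eq >= 0 ? token.substring(0, eq) : token);
                if (eq >= 0) nameById.put(id, token.substring(eq + 1));
                String name = nameById.get(id);
                if (name == null) throw new IllegalStateException("Unknown class ID: " + id);
                names.add(name);
            }
            return names;
        }
    }
}
```

If one Writer produces object graph A and then B, a Reader that receives A before B resolves every ID. A fresh Reader that receives B first hits an ID it has never been told about and throws.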

Were you thinking of it working in this way or did I go wrong somewhere?

-Nate

mongonix

Jun 8, 2012, 6:29:21 AM
to kryo-...@googlegroups.com

On Friday, June 8, 2012 3:37:05 AM UTC+2, Nate wrote:
On Thu, Jun 7, 2012 at 4:41 PM, mongonix wrote:


On Friday, June 8, 2012 1:21:58 AM UTC+2, Nate wrote:
On Thu, Jun 7, 2012 at 3:58 PM, mongonix  wrote:

Hmm. Have you really though about my idea of using hashCode of a fully qualified class name as an ID for implicitly registered classes? I did some tests recently for all JDK classes + all Scala lib classes + all Glassfish classes + all WebLogic classes. I experimented with different hash functions. The results have shown that even String.hashCode() results in just a few collisions for about 90000 class names. So, basically it can be considered unique and it is the same for each classname independent of the Kryo instance and ids of other registered classes. Which means that we can do it like this: first time we see a class, we implicitly register id, assign an ID = hashCode(classname), serialize both this id and classname. When we deserialize, we read ID and classname and register this new mapping on the deserializer side. Next time we send a message, we simply use this ID. Of course, such an ID is longer (I've seen experimentally that 32 bits or even 24 out of 32 bits of hashCode are producing unique enough results) and may require 3-4 bytes instead of 1-2 bytes for short IDs. But it is way better than writing fully qualified class names in every object graph.

I'm not sure how this is different than writing an int ordinal? It seems the part that is really different is "Next time we send a message, we simply use this ID.". We could do that with ordinals,

I guess we misunderstand each other here, but I'm not sure where exactly ;-) By IDs I mean 32bit or 24bits integer numbers (whatever is enough to represent the value of a numeric ID in binary form). I guess you mean the same with int ordinals.

 
> which use fewer bytes and can't collide.
Why they cannot collide? I see that on each side (i.e. serializer and deserializer) the IDs are assigned uniquely using current Kryo's approach. But what guarantees that the same id is not used for different classes on each of the sides?

It depends on how you build the system. I would think the sending side would decide on its own IDs, and the receiving side would learn them. For two way communication you'd have two sets of IDs.

In general, you are right. It would be a preferred way. 

 

But this approach only works if messages sent are deserialized in the same order. This makes me feel like it is application specific. Does it make sense as a Kryo feature? If not, we can still make any needed changes to Kryo to make it possible.

 I cannot quite follow you. Why does it work only if messages are deserialized in the same order? My whole point was that if we map FQCN to the FQCN's hashCode as its ID, then it is a unique mapping and it is not dependent on the order how classes were registered on either side. Anyone who registers a (implicit) mapping for this FQCN would always map it to the same ID, namely hashCode of FQCN. Doesn't it solve the issue with ordering? Or do I miss something?

Here is what I was thinking:

1) We serialize object graph A, encounter a class for the first time, and write the FQCN and an ID (it doesn't matter for now whether it is an ordinal or the FQCN hashcode); on subsequent encounters of the same class we write only the ID.

2) We serialize object graph B; for any classes we have already encountered in the first object graph we write only their ID, otherwise we write the FQCN and ID.


Yes. It also implies that we reuse the Kryo instance that we used for A for B as well, because otherwise we would not remember the IDs of already encountered classes.
 
During deserialization we are learning the ID->FQCN mapping from the data, so we must deserialize object graph A first, then B. If we deserialize B first, we may encounter IDs whose FQCN were only written in object graph A.
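
The ordering problem described above can be reproduced with a toy model. This is only a sketch under simplifying assumptions: a message is a list of string tokens rather than bytes, and LearnedIdsDemo, Writer and Reader are made-up names, not Kryo classes.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Toy model, not Kryo code: a token "id=name" introduces a mapping,
// a token "id" alone refers back to a mapping introduced earlier.
final class LearnedIdsDemo {

    static final class Writer {
        private final Map<String, Integer> ids = new HashMap<>();

        // Writes name and ID on first sight, only the ID afterwards.
        // The table survives across calls, i.e. across object graphs.
        List<String> write(String... classNames) {
            List<String> out = new ArrayList<>();
            for (String name : classNames) {
                Integer id = ids.get(name);
                if (id == null) {
                    id = ids.size();
                    ids.put(name, id);
                    out.add(id + "=" + name);
                } else {
                    out.add(String.valueOf(id));
                }
            }
            return out;
        }
    }

    static final class Reader {
        private final Map<Integer, String> names = new HashMap<>();

        // Learns mappings from the data and fails on an ID it has never seen.
        List<String> read(List<String> message) {
            List<String> out = new ArrayList<>();
            for (String token : message) {
                int eq = token.indexOf('=');
                if (eq >= 0) {
                    String name = token.substring(eq + 1);
                    names.put(Integer.parseInt(token.substring(0, eq)), name);
                    out.add(name);
                } else {
                    String name = names.get(Integer.parseInt(token));
                    if (name == null) {
                        throw new IllegalStateException("Unknown class ID " + token);
                    }
                    out.add(name);
                }
            }
            return out;
        }
    }
}
```

If graph A and then graph B are written with one Writer, a Reader that sees A first decodes both, while a fresh Reader handed B first throws, because B carries only the bare ID.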


I see your point now. You meant the order in which messages are delivered to the deserializer. So you are saying that if in-order message delivery is not guaranteed, then B may be delivered before A and contain an ID which has not been mapped to an FQCN yet, and then the deserializer doesn't know what to do. And then you conclude that to get around this, a pessimistic approach should be used and every serialized object graph needs to contain this ID->FQCN mapping once. Then it doesn't matter which message is deserialized first, as the deserializer will always learn the mapping first and then do its work. I agree that this approach solves the problem. The questions are:
- Is it the only possible approach?
- Is it optimal in typical cases?
- At what price does it come?
 
Were you thinking of it working in this way or did I go wrong somewhere?


I think you are right for the most general case. 

But in many cases, one can guarantee the in-order message delivery (e.g. JMS, sessions with in-order message delivery, etc). Or one can build a system which e.g. (locally) serializes/deserializes a few messages of all possible types (based on a configuration read at run-time) at startup to initialize the serialization sub-system and register explicitly and implicitly all required FQCNs (It also means that you sort of control initialization (and its order) on both sides, i.e. for serializer and deserializer, which is a case for akka and many similar frameworks).  So, assuming one is able to do it, what can be done in Kryo to support those use-cases? 

The goal is to avoid putting the same ID->FQCN mapping in each serialized object graph for implicitly registered classes. Assume also that the serialization subsystem cannot influence the communication subsystem (i.e. we cannot provide any info from deserializer to serializer), but the communication subsystem has some way to influence serialization, i.e. it can create Kryo instances, set flags, register classes, etc. So, for example, it may indicate that it supports in-order message delivery by setting a special flag in Kryo. If this flag is set, then we add ID->FQCN only once, in the first serialized graph that uses this FQCN.

-Leo

mongonix

Jun 8, 2012, 7:06:50 AM6/8/12
to kryo-...@googlegroups.com


On Friday, June 8, 2012 12:29:21 PM UTC+2, mongonix wrote:

But in many cases, one can guarantee the in-order message delivery (e.g. JMS, sessions with in-order message delivery, etc). Or one can build a system which e.g. (locally) serializes/deserializes a few messages of all possible types (based on a configuration read at run-time) at startup to initialize the serialization sub-system and register explicitly and implicitly all required FQCNs (It also means that you sort of control initialization (and its order) on both sides, i.e. for serializer and deserializer, which is a case for akka and many similar frameworks).  So, assuming one is able to do it, what can be done in Kryo to support those use-cases? 


BTW, one possible, but very wasteful way to do it, would be like this:
- Scan all classes available in the JVM classloader
- Build ID->FQCN mappings for all of them. Use hashCode or any other strong enough hash function for it to avoid collisions. The only condition is that everyone uses the same hash function
- Use those IDs during serialization/deserialization

Since everyone would end up with the same ID->FQCN mappings for classes present on both sides, there is no need to exchange any info about mappings at all. Ordering is not an issue any more.

The price for all this is that you'll keep a rather big Map in memory (e.g. 50000 unique key/value pairs).
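
A rough sketch of such a precomputed table follows. HashedClassRegistry is a hypothetical class, not something Kryo provides, and the class names are assumed to be passed in, since scanning a classloader is outside the scope of the sketch.

```java
import java.util.Collection;
import java.util.HashMap;
import java.util.Map;

// Hypothetical registry, not a Kryo class: each side builds it independently
// from the class names it can see, using the same hash function.
final class HashedClassRegistry {
    private final Map<Integer, String> idToName = new HashMap<>();

    // Fails fast if two different names hash to the same ID.
    HashedClassRegistry(Collection<String> fqcns) {
        for (String name : fqcns) {
            String previous = idToName.put(name.hashCode(), name);
            if (previous != null && !previous.equals(name)) {
                throw new IllegalArgumentException(
                        "Hash collision: " + previous + " vs " + name);
            }
        }
    }

    // The ID is a pure function of the name; the lookup only checks it is known.
    int idOf(String fqcn) {
        if (!fqcn.equals(idToName.get(fqcn.hashCode()))) {
            throw new IllegalArgumentException("Class not in registry: " + fqcn);
        }
        return fqcn.hashCode();
    }

    String nameOf(int id) {
        String name = idToName.get(id);
        if (name == null) {
            throw new IllegalArgumentException("No class known for ID " + id);
        }
        return name;
    }

    int size() {
        return idToName.size();
    }
}
```

Two registries built from different lists, in different orders, agree on the ID of every class they both contain, which is why no mapping has to be exchanged.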

Nate

Jun 8, 2012, 2:42:01 PM6/8/12
to kryo-...@googlegroups.com
On Fri, Jun 8, 2012 at 3:29 AM, mongonix <romi...@gmail.com> wrote:
 
During deserialization we are learning the ID->FQCN mapping from the data, so we must deserialize object graph A first, then B. If we deserialize B first, we may encounter IDs whose FQCN were only written in object graph A.


I see your point now. You meant the order in which messages are delivered to the deserializer. So you are saying that if in-order message delivery is not guaranteed, then B may be delivered before A and contain an ID which has not been mapped to an FQCN yet, and then the deserializer doesn't know what to do. And then you conclude that to get around this, a pessimistic approach should be used and every serialized object graph needs to contain this ID->FQCN mapping once. Then it doesn't matter which message is deserialized first, as the deserializer will always learn the mapping first and then do its work. I agree that this approach solves the problem. The questions are:
- Is it the only possible approach?
- Is it optimal in typical cases?
- At what price does it come?

There are many uses for serialization. It is common to write to a file or database, where objects may not be read back in any particular order. The most general and flexible usage is to write the FQCN once for each object graph. Currently the other mode supported is to register the classes. Kryo should be flexible enough to support other uses.
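
To illustrate the "FQCN once per object graph" mode in isolation, here is a toy sketch using string tokens instead of bytes. PerGraphNames is a made-up name and this is not how Kryo encodes data; the point is only that the name table is scoped to a single graph.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Toy model, not Kryo code: the name table lives only for one graph,
// so every encoded graph carries all the names it needs.
final class PerGraphNames {

    // "id=name" on the first occurrence within this graph, "id" afterwards.
    static List<String> writeGraph(String... classNames) {
        Map<String, Integer> ids = new HashMap<>(); // starts empty for every graph
        List<String> out = new ArrayList<>();
        for (String name : classNames) {
            Integer id = ids.get(name);
            if (id == null) {
                id = ids.size();
                ids.put(name, id);
                out.add(id + "=" + name);
            } else {
                out.add(String.valueOf(id));
            }
        }
        return out;
    }

    // Decodes one graph using only the mappings found inside that graph.
    static List<String> readGraph(List<String> graph) {
        Map<Integer, String> names = new HashMap<>();
        List<String> out = new ArrayList<>();
        for (String token : graph) {
            int eq = token.indexOf('=');
            if (eq >= 0) {
                names.put(Integer.parseInt(token.substring(0, eq)), token.substring(eq + 1));
                out.add(token.substring(eq + 1));
            } else {
                out.add(names.get(Integer.parseInt(token)));
            }
        }
        return out;
    }
}
```

Because each graph repeats the name once, graphs stored in a file or database can be read back in any order, at the cost of those repeated names.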
 
 
Were you thinking of it working in this way or did I go wrong somewhere?


I think you are right for the most general case. 

But in many cases, one can guarantee the in-order message delivery (e.g. JMS, sessions with in-order message delivery, etc). Or one can build a system which e.g. (locally) serializes/deserializes a few messages of all possible types (based on a configuration read at run-time) at startup to initialize the serialization sub-system and register explicitly and implicitly all required FQCNs (It also means that you sort of control initialization (and its order) on both sides, i.e. for serializer and deserializer, which is a case for akka and many similar frameworks).  So, assuming one is able to do it, what can be done in Kryo to support those use-cases? 

The goal is to avoid putting the same ID->FQCN mapping in each serialized object graph for implicitly registered classes. Assume also that the serialization subsystem cannot influence the communication subsystem (i.e. we cannot provide any info from deserializer to serializer), but the communication subsystem has some way to influence serialization, i.e. it can create Kryo instances, set flags, register classes, etc. So, for example, it may indicate that it supports in-order message delivery by setting a special flag in Kryo. If this flag is set, then we add ID->FQCN only once, in the first serialized graph that uses this FQCN.

I think it makes sense to make resolving classes pluggable. I had this in the past for v2 but ripped it out because it felt like feature creep at the time. This does not break the API and makes the Kryo class a little smaller, which is nice. It is committed as revision 271:
https://code.google.com/p/kryo/source/detail?r=271
A ClassResolver is given to Kryo at construction. It controls how classes are registered, writing bytes to represent a class, and reading bytes that represent a class. I would appreciate a code review from any or all of you guys. :)
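
The general shape of a pluggable resolver can be sketched without Kryo at all. Everything below (ToyClassResolver, NameResolver, HashIdResolver, ToyCodec) is invented for illustration and is deliberately not Kryo's actual ClassResolver interface; see revision 271 for the real one.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.DataInputStream;
import java.io.DataOutputStream;
import java.io.IOException;
import java.io.UncheckedIOException;
import java.util.HashMap;
import java.util.Map;

// Hypothetical stand-in for a pluggable resolver: it decides which bytes represent a class.
interface ToyClassResolver {
    void writeClass(DataOutputStream out, String fqcn) throws IOException;
    String readClass(DataInputStream in) throws IOException;
}

// Strategy 1: always write the full class name.
final class NameResolver implements ToyClassResolver {
    public void writeClass(DataOutputStream out, String fqcn) throws IOException {
        out.writeUTF(fqcn);
    }
    public String readClass(DataInputStream in) throws IOException {
        return in.readUTF();
    }
}

// Strategy 2: write a 4-byte hash ID; both sides must know the names up front.
final class HashIdResolver implements ToyClassResolver {
    private final Map<Integer, String> known = new HashMap<>();
    HashIdResolver(String... fqcns) {
        for (String name : fqcns) known.put(name.hashCode(), name);
    }
    public void writeClass(DataOutputStream out, String fqcn) throws IOException {
        out.writeInt(fqcn.hashCode());
    }
    public String readClass(DataInputStream in) throws IOException {
        int id = in.readInt();
        String name = known.get(id);
        if (name == null) throw new IOException("Unknown class ID " + id);
        return name;
    }
}

// The codec receives its resolver at construction and does not care which strategy it is.
final class ToyCodec {
    private final ToyClassResolver resolver;
    ToyCodec(ToyClassResolver resolver) {
        this.resolver = resolver;
    }
    byte[] encode(String fqcn) {
        try {
            ByteArrayOutputStream bytes = new ByteArrayOutputStream();
            DataOutputStream out = new DataOutputStream(bytes);
            resolver.writeClass(out, fqcn);
            out.flush();
            return bytes.toByteArray();
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }
    String decode(byte[] data) {
        try {
            return resolver.readClass(new DataInputStream(new ByteArrayInputStream(data)));
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }
}
```

Swapping new NameResolver() for new HashIdResolver(...) changes the wire format without touching ToyCodec, which is the benefit of passing the resolver in at construction.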

-Nate

mongonix

Jun 8, 2012, 3:35:06 PM6/8/12
to kryo-...@googlegroups.com


On Friday, June 8, 2012 8:42:01 PM UTC+2, Nate wrote:

Very cool, Nate!
I'll look into it and see if I can use this new feature to implement what I had in mind. But this will require some time, I guess. I'll report back in the next few days or next week on how it goes.

-Leo

mongonix

Jun 11, 2012, 5:15:24 AM6/11/12
to kryo-...@googlegroups.com
Hi Nate,


On Friday, June 8, 2012 8:42:01 PM UTC+2, Nate wrote:

I had a quick look. I defined my own resolver (which assigns IDs to implicitly registered classes), based on your default resolver; I simply derived from it. To be able to do that, I changed all private fields and methods in your default class to protected. I think it would be a good idea to do the same in your code and change private to protected, so that people do not need to write resolvers from scratch and can reuse as much of your code as possible.

Other than that, I'm quite happy with how it looks now when it comes to class resolving. I'm still working on ensuring registration order, etc. But this is an issue on
my side, not on Kryo's side.

Thanks,
  Leo

Nate

Jun 11, 2012, 7:25:20 PM6/11/12
to kryo-...@googlegroups.com
On Mon, Jun 11, 2012 at 2:15 AM, mongonix <romi...@gmail.com> wrote:
I think it makes sense to make resolving classes pluggable. I had this in the past for v2 but ripped it out because it felt like feature creep at the time. This does not break the API and makes the Kryo class a little smaller, which is nice. It is committed as revision 271:
https://code.google.com/p/kryo/source/detail?r=271
A ClassResolver is given to Kryo at construction. It controls how classes are registered, writing bytes to represent a class, and reading bytes that represent a class. I would appreciate a code review from any or all of you guys. :)

I had a quick look. I defined my own resolver (which assigns IDs to implicitly registered classes), based on your default resolver; I simply derived from it. To be able to do that, I changed all private fields and methods in your default class to protected. I think it would be a good idea to do the same in your code and change private to protected, so that people do not need to write resolvers from scratch and can reuse as much of your code as possible.

Other than that, I'm quite happy with how it looks now when it comes to class resolving. I'm still working on ensuring registration order, etc. But this is an issue on
my side, not on Kryo's side.

Nice, glad that worked out. I usually resist making everything protected in a lib, but I guess it is ok for this. I doubt the DefaultClassResolver code will change too much in the future and you're right that it can reduce the amount of code needed to customize resolving classes.

-Nate

mongonix

Nov 15, 2012, 1:28:34 AM11/15/12
to kryo-...@googlegroups.com


On Wednesday, November 14, 2012 9:37:58 PM UTC+1, Chetan Narsude wrote:
The topic under discussion in this thread is precisely what I was after. I also looked at the r=271 code. Maybe there are others who want the same feature. If so, maybe we can let this feature creep.

Mongonix, is your code sharable? Thanks.


IIRC, Nate added a while ago the possibility to define your own class resolvers and derive them from a default one.
I used it for my Kryo-based Scala serialization library for Akka:
 
Does it answer your question? Or do you mean something else? This thread was long and there were other issues discussed here ;-)

Cheers,
 Leo

Chetan Narsude

Nov 15, 2012, 2:56:42 AM11/15/12
to kryo-...@googlegroups.com

You pointed me to precisely what I wanted to know. Wondering though if your deserializer hashes all the possible classes?

Actually shortly after I sent the previous email, I researched the source code and found out that Nate had committed the feature a few days after this discussion with you.

I used it and took a slightly different approach than you did. Instead of a hash, I use an incrementing ID and, as Nate suggested in one of the replies, send that mapping to the deserializer.

Works great and have not heard any murmur of possible collision ;-)
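
One way the incrementing-ID approach could look is sketched below. This is a guess at the general shape, not Chetan's actual code: IncrementingIdAssigner and drainUnsent are hypothetical names, and how the drained mappings reach the deserializer is left to the transport.

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Hypothetical sender-side helper: hands out IDs 0, 1, 2, ... in order of
// first use and remembers which mappings the receiver has not been told yet.
final class IncrementingIdAssigner {
    private final Map<String, Integer> ids = new LinkedHashMap<>();
    private final Map<Integer, String> unsent = new LinkedHashMap<>();

    // Returns the existing ID, or assigns the next free one on first use.
    int idFor(String fqcn) {
        Integer id = ids.get(fqcn);
        if (id == null) {
            id = ids.size();
            ids.put(fqcn, id);
            unsent.put(id, fqcn);
        }
        return id;
    }

    // Returns the mappings created since the last call and marks them as sent.
    // The caller must deliver them before any data that uses these IDs.
    Map<Integer, String> drainUnsent() {
        Map<Integer, String> batch = new LinkedHashMap<>(unsent);
        unsent.clear();
        return batch;
    }
}
```

Since the sender alone assigns the IDs, two classes can never share one, but the scheme relies on the mapping reaching the deserializer before the data that refers to it.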
