registration = registerInternal(new Registration(type, getDefaultSerializer(type), NAME));
But this line uses NAME as the last parameter. As a result, no numeric ID is assigned to this class,
and every time (!!!) you serialize an object of this type, the complete fully qualified class name is written,
which introduces quite some overhead.
I'd like to understand the reason for doing it this way. Why not call this method instead, for example:
registration = register(type, getDefaultSerializer(type))
It would register the class using the next available numeric ID. But maybe you want to treat implicitly
registered classes differently? If so, could you please clarify?
Thanks,
Leo
--
You received this message because you are subscribed to the "kryo-users" group.
http://groups.google.com/group/kryo-users
If you register with IDs, whether explicitly or implicitly, you must have the exact same IDs when you deserialize.
When you register implicitly, the order in which classes are seen is not guaranteed. If you got an ID when deserializing, you would not know what class the ID belonged to.
Note the class name is only written once per object graph.
-Nate
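To illustrate the ordering problem described above, here is a minimal sketch in plain Java. It does not use Kryo; OrdinalRegistry and idFor are hypothetical names invented for this example. It assumes IDs are handed out as "next free int" the first time a class is seen, and shows that two sides which see the same classes in a different order end up with different IDs.

```java
import java.util.LinkedHashMap;
import java.util.Map;

/** Toy registry (hypothetical, not Kryo's API): assigns the next free int ID on first sight. */
final class OrdinalRegistry {
    private final Map<Class<?>, Integer> ids = new LinkedHashMap<>();

    /** Returns the ID for the type, assigning the next ordinal the first time it is seen. */
    int idFor(Class<?> type) {
        Integer id = ids.get(type);
        if (id == null) {
            id = ids.size();
            ids.put(type, id);
        }
        return id;
    }

    public static void main(String[] args) {
        OrdinalRegistry writer = new OrdinalRegistry();
        OrdinalRegistry reader = new OrdinalRegistry();
        // Both sides see the same two classes, but in a different order.
        writer.idFor(String.class);
        writer.idFor(Integer.class);
        reader.idFor(Integer.class);
        reader.idFor(String.class);
        System.out.println("writer ID for String: " + writer.idFor(String.class));
        System.out.println("reader ID for String: " + reader.idFor(String.class));
    }
}
```

The two printed IDs differ, so an ID written by the writer would be resolved to the wrong class by the reader.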
On Thursday, June 7, 2012 7:39:47 PM UTC+2, Nate wrote:
> If you register with IDs, whether explicitly or implicitly, you must have the exact same IDs when you deserialize.
Yes. True.
> When you register implicitly, the order classes are seen is not guaranteed. When you deserialize if you got an ID, you would not know what class the ID belonged to.
I understand what you mean, but I do not quite agree with you. Let me give you an example. I work on Kryo-based serialization for Scala/Akka. I added to Akka the possibility to define in a config file the set of classes (class names) that should be pre-registered. You can provide either just the class names or also their IDs. This way you can also ensure that the same order of registration is used by both serializer and deserializer.
But there is a problem. Many Scala classes, e.g. collection classes from the standard library, come with a lot of subclasses, nested classes and other classes generated by the Scala compiler from the original source. Many of these classes have very long and cryptic names. It is very annoying, if not unrealistic, to demand that all users provide the set of all those strange class names if they want to use Kryo for serializing collection objects. It would be much better if these classes were registered implicitly. This is what happens now, but with the problem described in the original post. And BTW, since all "main" classes are explicitly listed in the same order in the config files, all dependent generated classes are also seen and implicitly registered in the same order. So, in this sense, if IDs were assigned, they would be the same.
> Note the class name is only written once per object graph.
True. But since e.g. collection classes and all the generated ones are used very often (e.g. every time you send a message between actors), it ends up introducing a lot of overhead. I see performance drop by a factor of 2 or 3 due to the way these implicit registrations are currently handled by Kryo. For one, writing fully qualified class names introduces overhead during serialization and deserialization. And secondly, during deserialization you always do a class lookup at least once per object graph:
Class.forName(className, false, classLoader)
This is also a very expensive operation.
So, maybe something can be done about it? Maybe we can introduce an option to auto-assign IDs for implicitly registered classes? The option can be off by default to preserve the old semantics. Moreover, to avoid ID conflicts, one could even have an option to assign unusual IDs, e.g. the hash codes of those class names. This would guarantee that they do not overlap with anything else and are unique enough. What do you think?
BTW, if we look at the problem in a more generic way and think about using Kryo as a framework for remote invocations or message exchange sessions between sender and receiver, one could think of introducing the concept of session contexts. I.e. sender and receiver exchange information about the mappings between class names and IDs and later use it for all messages inside the same session. It would be very similar to your remark "the class name is only written once per object graph", with the difference that the class name is written only once per session and may be referred to by any number of object graphs exchanged during this session. But this is just an idea for the future and would most likely require bigger changes.
On Thu, Jun 7, 2012 at 2:38 PM, mongonix wrote:
> And BTW, since all "main" classes are explicitly listed in the same order in the config files, all dependent generated classes are also seen and implicitly registered in the same order. So, in this sense, if ids would be assigned, they would be the same.
Registering the main classes does not cause the generated classes to also be registered. Currently FieldSerializer (and others) defer looking up serializers for the fields until the first time an instance is serialized. This enables registering classes without requiring the order to match the dependencies between classes, and solves this situation:
class A {
    B b;
}
class B {
    A a;
}
Could you write something that inspects each of your "main" classes and explicitly registers the generated classes? This may be harder than it seems at first. Imagine three classes: A, B extends A, and C extends A. If one of your main classes has a field of type A, you would need to register A, B, and C.
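As a rough sketch of what such an inspection could look like, the following plain-Java example walks the declared field types of a class with reflection and tolerates the A/B cycle from the example above. FieldTypeCollector and collect are hypothetical names, and Kryo is not involved. It only finds declared field types: it does not look at superclass fields, and it cannot discover subclasses such as B and C of a field typed A, which is exactly the hard part mentioned above.

```java
import java.lang.reflect.Field;
import java.lang.reflect.Modifier;
import java.util.LinkedHashSet;
import java.util.Set;

/** Hypothetical helper: collects a class and the declared types of its instance fields, transitively. */
final class FieldTypeCollector {
    // The mutually referencing classes from the example above.
    static class A { B b; }
    static class B { A a; }

    /** Returns the root type plus every type reachable through declared instance fields. */
    static Set<Class<?>> collect(Class<?> root) {
        Set<Class<?>> seen = new LinkedHashSet<>();
        visit(root, seen);
        return seen;
    }

    private static void visit(Class<?> type, Set<Class<?>> seen) {
        // Skip primitives; the 'seen' set stops the recursion on cycles such as A <-> B.
        if (type.isPrimitive() || !seen.add(type)) return;
        for (Field field : type.getDeclaredFields()) {
            if (Modifier.isStatic(field.getModifiers())) continue;
            visit(field.getType(), seen);
        }
    }

    public static void main(String[] args) {
        // Each collected class could then be registered explicitly, in this deterministic order.
        for (Class<?> type : collect(A.class)) {
            System.out.println(type.getName());
        }
    }
}
```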
Note the generated classes are likely not public API, so any changes to them will invalidate previously serialized bytes.
Also note that a class identifier is only written when necessary. If the type of a field is final, the class will not be written because at deserialization time we know the concrete type of the value must match the field type (no polymorphism). Kryo#isFinal(Class) can be overridden to decide what classes are treated as final. I don't know if it would be valid to treat the generated classes as final.
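The rule above can be sketched with standard reflection. This is only an illustration of the idea, not Kryo's implementation: ClassIdPolicy and mustWriteClass are hypothetical names, and unwrapping array types to their element type is an assumption made for this sketch.

```java
import java.lang.reflect.Modifier;

/** Hypothetical policy: decides whether a class identifier is needed for a field's value. */
final class ClassIdPolicy {
    /**
     * Returns true when a value stored in a field of the declared type could be an
     * instance of a subclass, so the concrete class has to be written with the value.
     */
    static boolean mustWriteClass(Class<?> declaredType) {
        Class<?> type = declaredType;
        // Assumption for this sketch: an array is as polymorphic as its element type.
        while (type.isArray()) {
            type = type.getComponentType();
        }
        return !Modifier.isFinal(type.getModifiers());
    }

    public static void main(String[] args) {
        System.out.println("String: " + mustWriteClass(String.class));
        System.out.println("Object: " + mustWriteClass(Object.class));
        System.out.println("Object[]: " + mustWriteClass(Object[].class));
    }
}
```

String is final, so no class identifier is needed for a String field, while a field declared as Object or Object[] can hold any subtype and needs one.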
> So, may be something can be done about it? May be we can introduce an option to auto-assign ids for implicitly registered classes? The option can be off by default to preserve the old semantics.
I'm open to doing this if we can come up with a solution that will work and is not application specific. ID uniqueness is easily done with int ordinals. The real problem is that you need a deterministic order for the classes. Without that, implicit ID assignment doesn't make sense.
> I.e. sender and receiver exchange information about mappings between classnames and ids and later use it for all messages inside the same session.
Possibly this could be implemented on top of Kryo. Maybe override Kryo#getRegistration(Class), call super, if the returned registration has ID == NAME, then register the class explicitly (this will overwrite NAME with an auto assigned ID) and send a message to the other side with the class name string and int ID.
Hmm. Have you really thought about my idea of using the hashCode of a fully qualified class name as the ID for implicitly registered classes? I did some tests recently for all JDK classes + all Scala lib classes + all Glassfish classes + all WebLogic classes. I experimented with different hash functions. The results have shown that even String.hashCode() results in just a few collisions for about 90000 class names. So, basically it can be considered unique, and it is the same for each class name independent of the Kryo instance and the IDs of other registered classes.
Which means that we can do it like this: the first time we see a class, we implicitly register it, assign an ID = hashCode(classname), and serialize both this ID and the class name. When we deserialize, we read the ID and class name and register this new mapping on the deserializer side. The next time we send a message, we simply use this ID. Of course, such an ID is longer (I've seen experimentally that 32 bits or even 24 out of the 32 bits of the hashCode produce unique enough results) and may require 3-4 bytes instead of 1-2 bytes for short IDs. But it is way better than writing fully qualified class names in every object graph.
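A minimal sketch of the hash-based ID idea, in plain Java without Kryo. HashIds, idFor and collisions are hypothetical names. It relies on the fact that String.hashCode() is specified by the Java API, so the same class name yields the same ID on every JVM, and it shows how collisions could be detected for a given set of names. "Aa" and "BB" are used as stand-ins for class names because they are a well-known colliding pair.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

/** Hypothetical helper: derives IDs from class names and reports collisions. */
final class HashIds {
    /** The ID depends only on the name, not on registration order or on other classes. */
    static int idFor(String className) {
        return className.hashCode();
    }

    /** Returns every name whose ID was already claimed by a different name earlier in the list. */
    static List<String> collisions(List<String> classNames) {
        Map<Integer, String> owner = new HashMap<>();
        List<String> clashes = new ArrayList<>();
        for (String name : classNames) {
            String previous = owner.putIfAbsent(idFor(name), name);
            if (previous != null && !previous.equals(name)) {
                clashes.add(name);
            }
        }
        return clashes;
    }

    public static void main(String[] args) {
        System.out.println("ID of java.util.ArrayList: " + idFor("java.util.ArrayList"));
        // "Aa" and "BB" have the same String.hashCode(), so the second one is reported.
        System.out.println("Collisions: " + collisions(Arrays.asList("Aa", "BB")));
    }
}
```

A real scheme would still need a rule for the rare colliding names, for example falling back to writing the full class name for them.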
On Thu, Jun 7, 2012 at 3:58 PM, mongonix wrote:
> Which means that we can do it like this: first time we see a class, we implicitly register it, assign an ID = hashCode(classname), serialize both this id and classname. When we deserialize, we read ID and classname and register this new mapping on the deserializer side. Next time we send a message, we simply use this ID.
I'm not sure how this is different from writing an int ordinal? It seems the part that is really different is "Next time we send a message, we simply use this ID.". We could do that with ordinals, which use fewer bytes and can't collide.
But this approach only works if messages sent are deserialized in the same order. This makes me feel like it is application specific. Does it make sense as a Kryo feature? If not, we can still make any needed changes to Kryo to make it possible.
On Friday, June 8, 2012 1:21:58 AM UTC+2, Nate wrote:
> I'm not sure how this is different than writing an int ordinal? It seems the part that is really different is "Next time we send a message, we simply use this ID.". We could do that with ordinals,
> which use fewer bytes and can't collide.
I guess we misunderstand each other here, but I'm not sure where exactly ;-) By IDs I mean 32-bit or 24-bit integer numbers (whatever is enough to represent the value of a numeric ID in binary form). I guess you mean the same with int ordinals. Why can they not collide? I see that on each side (i.e. serializer and deserializer) the IDs are assigned uniquely using Kryo's current approach. But what guarantees that the same ID is not used for different classes on the two sides?
> But this approach only works if messages sent are deserialized in the same order. This makes me feel like it is application specific. Does it make sense as a Kryo feature? If not, we can still make any needed changes to Kryo to make it possible.
I cannot quite follow you. Why does it work only if messages are deserialized in the same order? My whole point was that if we map an FQCN to the FQCN's hashCode as its ID, then it is a unique mapping and it does not depend on the order in which classes were registered on either side. Anyone who registers an (implicit) mapping for this FQCN would always map it to the same ID, namely the hashCode of the FQCN. Doesn't that solve the issue with ordering? Or am I missing something?
On Thu, Jun 7, 2012 at 4:41 PM, mongonix wrote:
> Why they cannot collide? I see that on each side (i.e. serializer and deserializer) the IDs are assigned uniquely using current Kryo's approach. But what guarantees that the same id is not used for different classes on each of the sides?
It depends on how you build the system. I would think the sending side would decide on its own IDs, and the receiving side would learn them. For two way communication you'd have two sets of IDs.
> I cannot quite follow you. Why does it work only if messages are deserialized in the same order?
Here is what I was thinking:
1) We serialize object graph A, encounter a class for the first time, and write the FQCN and an ID (it doesn't matter for now whether it is an ordinal or the FQCN hash code). On subsequent encounters of the same class we only write the ID.
2) We serialize object graph B. For any classes we already encountered in the first object graph we only write their ID; otherwise we write the FQCN and ID.
During deserialization we are learning the ID->FQCN mapping from the data, so we must deserialize object graph A first, then B. If we deserialize B first, we may encounter IDs whose FQCN were only written in object graph A.
Were you thinking of it working in this way or did I go wrong somewhere?
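The two steps and the resulting order dependence can be sketched in plain Java as follows. This is a toy model, not Kryo code: SessionClassIds, Writer, Reader and the "id=name" token format are all invented for this illustration, and an object graph is reduced to the list of class names it uses.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

/** Toy model (hypothetical) of class IDs that are learned from the data during a session. */
final class SessionClassIds {
    /** Sender side: remembers which class names it has already announced. */
    static final class Writer {
        private final Map<String, Integer> ids = new HashMap<>();

        /** Emits "id=name" the first time a class is used in the session, and just "id" afterwards. */
        List<String> encode(List<String> classNames) {
            List<String> tokens = new ArrayList<>();
            for (String name : classNames) {
                Integer id = ids.get(name);
                if (id == null) {
                    id = ids.size();
                    ids.put(name, id);
                    tokens.add(id + "=" + name);
                } else {
                    tokens.add(String.valueOf(id));
                }
            }
            return tokens;
        }
    }

    /** Receiver side: learns the ID -> name mapping from the tokens it reads. */
    static final class Reader {
        private final Map<Integer, String> names = new HashMap<>();

        List<String> decode(List<String> tokens) {
            List<String> classNames = new ArrayList<>();
            for (String token : tokens) {
                int eq = token.indexOf('=');
                if (eq >= 0) {
                    String name = token.substring(eq + 1);
                    names.put(Integer.parseInt(token.substring(0, eq)), name);
                    classNames.add(name);
                } else {
                    String name = names.get(Integer.parseInt(token));
                    if (name == null) {
                        throw new IllegalStateException("unknown class ID " + token);
                    }
                    classNames.add(name);
                }
            }
            return classNames;
        }
    }

    public static void main(String[] args) {
        Writer writer = new Writer();
        List<String> graphA = writer.encode(Arrays.asList("x.Foo", "x.Foo"));
        List<String> graphB = writer.encode(Arrays.asList("x.Foo"));
        Reader inOrder = new Reader();
        System.out.println(inOrder.decode(graphA) + " then " + inOrder.decode(graphB));
        try {
            new Reader().decode(graphB); // B arrives first: its ID was only defined in A
        } catch (IllegalStateException e) {
            System.out.println("Out of order: " + e.getMessage());
        }
    }
}
```

Decoding A and then B works, while a fresh reader that receives B first fails because the ID it contains was only defined in A.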
> During deserialization we are learning the ID->FQCN mapping from the data, so we must deserialize object graph A first, then B. If we deserialize B first, we may encounter IDs whose FQCN were only written in object graph A.
I see your point now. You meant the order in which messages are delivered to the deserializer. So, you say that if in-order message delivery is not guaranteed, then B may be delivered before A and contain an ID which has not been mapped to an FQCN yet, and then the deserializer doesn't know what to do. And then you conclude that to get around it, a pessimistic approach should be used and every serialized object graph needs to contain this ID->FQCN mapping once. Then it doesn't matter which message is deserialized first, as the deserializer will always learn the mapping first and then do its work. I agree that this approach solves the problem. The questions are:
- Is it the only possible approach?
- Is it optimal in typical cases?
- At what price does it come?
> Were you thinking of it working in this way or did I go wrong somewhere?
I think you are right for the most general case.
But in many cases, one can guarantee in-order message delivery (e.g. JMS, sessions with in-order message delivery, etc). Or one can build a system which e.g. (locally) serializes/deserializes a few messages of all possible types (based on a configuration read at run-time) at startup, to initialize the serialization subsystem and register explicitly and implicitly all required FQCNs. (It also means that you sort of control initialization, and its order, on both sides, i.e. for serializer and deserializer, which is the case for Akka and many similar frameworks.) So, assuming one is able to do it, what can be done in Kryo to support those use cases?
The goal is to avoid putting the same ID->FQCN mapping in each serialized object graph for implicitly registered classes. Assume also that the serialization subsystem cannot influence the communication subsystem (i.e. we cannot provide any info from deserializer to serializer), but the communication subsystem has some way to influence serialization, i.e. it can create Kryo instances, set flags, register classes, etc. So, for example, it may indicate that it supports in-order message delivery by setting a special flag in Kryo. If this flag is set, then we add the ID->FQCN mapping only once, in the first serialized graph that uses this FQCN.
I think it makes sense to make resolving classes pluggable. I had this in the past for v2 but ripped it out because it felt like feature creep at the time. This does not break the API and makes the Kryo class a little smaller, which is nice. It is committed as revision 271: https://code.google.com/p/kryo/source/detail?r=271
A ClassResolver is given to Kryo at construction. It controls how classes are registered, writing bytes to represent a class, and reading bytes that represent a class. I would appreciate a code review from any or all of you guys. :)
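To give a feel for what a pluggable strategy of this kind involves, here is a simplified sketch in plain Java. It is not Kryo's ClassResolver interface: ClassCodec, NameCodec, TableCodec and CodecDemo are hypothetical names, and the one-byte table index is an assumption that limits this toy to 256 classes. It contrasts writing the class name with writing a small ID from a table that both sides must build in the same order.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.DataInputStream;
import java.io.DataOutputStream;
import java.io.IOException;
import java.io.UncheckedIOException;
import java.util.ArrayList;
import java.util.List;

/** Hypothetical strategy interface: how a class is represented in the byte stream. */
interface ClassCodec {
    void writeClass(DataOutputStream out, Class<?> type) throws IOException;
    Class<?> readClass(DataInputStream in) throws IOException, ClassNotFoundException;
}

/** Writes the fully qualified class name and looks the class up by name when reading. */
final class NameCodec implements ClassCodec {
    public void writeClass(DataOutputStream out, Class<?> type) throws IOException {
        out.writeUTF(type.getName());
    }
    public Class<?> readClass(DataInputStream in) throws IOException, ClassNotFoundException {
        return Class.forName(in.readUTF());
    }
}

/** Writes a one-byte index into a table; both sides must register classes in the same order. */
final class TableCodec implements ClassCodec {
    private final List<Class<?>> table = new ArrayList<>();

    TableCodec register(Class<?> type) {
        table.add(type);
        return this;
    }
    public void writeClass(DataOutputStream out, Class<?> type) throws IOException {
        int id = table.indexOf(type);
        if (id < 0) throw new IllegalArgumentException("unregistered: " + type.getName());
        out.writeByte(id);
    }
    public Class<?> readClass(DataInputStream in) throws IOException {
        return table.get(in.readUnsignedByte());
    }
}

final class CodecDemo {
    /** Encodes one class reference with the given strategy. */
    static byte[] encode(ClassCodec codec, Class<?> type) {
        ByteArrayOutputStream bytes = new ByteArrayOutputStream();
        try (DataOutputStream out = new DataOutputStream(bytes)) {
            codec.writeClass(out, type);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return bytes.toByteArray();
    }

    /** Decodes one class reference with the given strategy. */
    static Class<?> decode(ClassCodec codec, byte[] data) {
        try {
            return codec.readClass(new DataInputStream(new ByteArrayInputStream(data)));
        } catch (IOException | ClassNotFoundException e) {
            throw new IllegalStateException(e);
        }
    }

    public static void main(String[] args) {
        ClassCodec byName = new NameCodec();
        ClassCodec byId = new TableCodec().register(ArrayList.class);
        System.out.println("name bytes: " + encode(byName, ArrayList.class).length);
        System.out.println("id bytes:   " + encode(byId, ArrayList.class).length);
    }
}
```

Because the rest of the serializer only talks to the interface, the choice between names, ordinals or hash-based IDs stays a local decision of the strategy that is plugged in.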
I had a quick look. I defined my own resolver (which assigns IDs to implicitly registered classes), based on your default resolver. I simply derived from it. To be able to do that, I changed all private fields and methods in your default class to protected. I think it would be a good idea to do the same in your code and change private to protected, so that people do not need to write resolvers from scratch and can reuse as much of your code as possible.
Other than that, I'm quite happy with how it looks now when it comes to class resolving. I'm still working on ensuring registration order, etc. But this is an issue on my side, not on Kryo's side.
The topic under discussion in this thread is precisely what I was after. I also looked at the r=271 code. Maybe there are others who want the same feature. If so, maybe we can let this feature creep.
Mongonix, is your code sharable? Thanks.
You pointed me to precisely what I wanted to know. Wondering though if your deserializer hashes all the possible classes?
Actually shortly after I sent the previous email, I researched the source code and found out that Nate had committed the feature a few days after this discussion with you.
I used it and took a slightly different approach than you did. Instead of a hash, I use an incrementing ID and, as Nate suggested in one of the replies, send that mapping to the deserializer.
Works great and have not heard any murmur of possible collision ;-)
--