Hazelcast supporting alternatives to Java built-in serialization?


mongonix

Nov 2, 2010, 3:50:18 AM
to Hazelcast
Hi,

Hazelcast seems to use only the standard Java serialization mechanism,
AFAIK. While this is very standards-conforming, the performance of many
applications may really suffer from it. As you can see from the
benchmarks listed below, standard Java serialization can be an order of
magnitude slower than the alternatives.

Are there any plans to support alternative serialization
implementations, e.g. protobuf or Thrift? It would be even better if
the choice of serialization framework were configurable, so that users
could pick the one that suits their needs best.

Please see here for more information about benchmarks comparing
different alternatives:
http://github.com/eishay/jvm-serializers/wiki
http://code.google.com/p/thrift-protobuf-compare/wiki/BeyondNumbers

Thanks,
Mongonix

Fuad Malikov

Nov 2, 2010, 4:08:39 AM
to haze...@googlegroups.com
Hi,

Yes, we know that, and there is already an issue for it:
http://code.google.com/p/hazelcast/issues/detail?id=153
Until that lands, there is another way to implement custom
serialization (similar to Externalizable).
Here is how you do it:
http://www.hazelcast.com/documentation.jsp#InternalsSerialization

Hope this will help.
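For readers following along, the Externalizable-style hook described in the linked documentation boils down to the class writing and reading its own fields. Below is a minimal, hypothetical sketch of that pattern using only plain JDK streams; the writeData/readData method shapes mirror what the documentation describes, while the Person class and its round-trip helpers are invented for illustration:

```java
import java.io.*;

// Hand-written, field-by-field serialization: the class itself decides
// exactly what bytes to write and read, avoiding Java serialization's
// reflective overhead.
public class Person {
    public String name;
    public int age;

    public Person() {}  // no-arg constructor needed for reconstruction
    public Person(String name, int age) { this.name = name; this.age = age; }

    // Equivalent in spirit to DataSerializable.writeData(DataOutput)
    public void writeData(DataOutput out) throws IOException {
        out.writeUTF(name);
        out.writeInt(age);
    }

    // Equivalent in spirit to DataSerializable.readData(DataInput)
    public void readData(DataInput in) throws IOException {
        name = in.readUTF();
        age = in.readInt();
    }

    // Round-trip helpers through plain JDK streams, for demonstration
    public static byte[] toBytes(Person p) throws IOException {
        ByteArrayOutputStream bos = new ByteArrayOutputStream();
        p.writeData(new DataOutputStream(bos));
        return bos.toByteArray();
    }

    public static Person fromBytes(byte[] bytes) throws IOException {
        Person p = new Person();
        p.readData(new DataInputStream(new ByteArrayInputStream(bytes)));
        return p;
    }
}
```

As the thread notes below, writing this by hand for many complex classes is laborious, which is exactly the motivation for pluggable frameworks.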

> --
> You received this message because you are subscribed to the Google Groups "Hazelcast" group.
> To post to this group, send email to haze...@googlegroups.com.
> To unsubscribe from this group, send email to hazelcast+...@googlegroups.com.
> For more options, visit this group at http://groups.google.com/group/hazelcast?hl=en.
>
>

--

@fuadm

mongonix

Nov 2, 2010, 4:32:34 AM
to Hazelcast
Fuad Malikov wrote:
> Hi,
>
> Yes we know that and there is already an issue for that.
> http://code.google.com/p/hazelcast/issues/detail?id=153

Nice to know, thanks for the link!
But the discussion on that issue seems to have been pretty quiet since
April. So nothing has happened in this area since then?

I'd like to ask another question, though it is a bit of a hijacking of
this thread ;-)
It is related, and I posted it on another thread (http://
groups.google.com/group/hazelcast/browse_thread/thread/
4b201276a91d28d8), but there was no reaction. It would be nice if you
could answer some of the questions there. In the meantime, I'd like to
ask one serialization-related question here again:

- Hazelcast ItemListeners and other listeners receive an already
deserialized object as input. But when is it deserialized, and by
which thread? For big/complex objects under high load,
(de)serialization may consume quite a lot of CPU and thus affect the
processing of other requests, because CPU time is now spent elsewhere.
It would be nice if it were possible to control when and how
(de)serialization happens, e.g. by providing a custom thread pool that
is used to convert Hazelcast network objects into deserialized Java
objects. Would that be possible? It would also be nice to have a way
to register a listener for raw Hazelcast data packets, so that an
application can decide on its own when to deserialize them.

> Before than that there is another way to implement custom
> serialization(similar to externalizable)
> Here is how you do it:
> http://www.hazelcast.com/documentation.jsp#InternalsSerialization.

Yes, I've read it. But this seems too laborious for our use cases. We
have a lot of classes with complex structure, and writing optimized
serializers by hand is out of the question ;-( We need some automation
for that.

Fuad Malikov

Nov 2, 2010, 9:18:47 AM
to haze...@googlegroups.com
Why don't you then consider passing the byte[] values of your entries
to Hazelcast? That is, do all serialization and deserialization on
your side and put the raw data into Hazelcast. That seems to solve all
your problems. The only downside is that you won't be able to use
queries on maps. All the rest should work just fine.
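A minimal sketch of this byte[] approach, using plain JDK streams. The serializer shown is standard Java serialization, but this is exactly the spot where a faster framework such as protobuf or Kryo could be swapped in; the class and method names are invented for illustration:

```java
import java.io.*;

// The application owns (de)serialization entirely: Hazelcast only ever
// sees opaque byte arrays, so it never needs the user's classes or
// classloader.
public class RawBytes {
    public static byte[] serialize(Serializable value) throws IOException {
        ByteArrayOutputStream bos = new ByteArrayOutputStream();
        try (ObjectOutputStream oos = new ObjectOutputStream(bos)) {
            oos.writeObject(value);  // replaceable with protobuf/Thrift/etc.
        }
        return bos.toByteArray();
    }

    public static Object deserialize(byte[] bytes)
            throws IOException, ClassNotFoundException {
        try (ObjectInputStream ois =
                 new ObjectInputStream(new ByteArrayInputStream(bytes))) {
            return ois.readObject();
        }
    }
}
```

The application would then do something like map.put(key, RawBytes.serialize(value)) against a hypothetical map of byte[] values and deserialize on read, on whichever thread it chooses.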

Fuad

Fuad Malikov

Nov 2, 2010, 10:18:31 AM
to haze...@googlegroups.com
Hi Roman,

We discussed this internally; here are a couple of points we must
consider in this situation. Even though what I suggested to you on the
other thread was better than your approach, there is an even better
way.

First, I will answer your question about threads. For queues and
topics we must guarantee the order of notifications: it is guaranteed
that a queue is FIFO and that the messages on a topic are sent to the
listeners in the right order. For the sake of ordering there is only
one thread per queue and topic that notifies the listeners. And since
the event-receiving methods pass the actual object, those objects are
deserialized by that thread. This is the actual bottleneck. Instead of
the actual object, we should have passed a MessageObject kind of
thing, where the actual object is serialized or deserialized only when
you call messageObject.getObject(). That way you could receive the
events in order and process them in as many threads as you want, and
all deserialization would be done on your threads. We plan to change
this in future Hazelcast versions. It will solve two issues:
1. Deserialization happening on a single thread
2. The classloading problem; your classloader can be different from
the one Hazelcast uses.
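The proposed MessageObject could look roughly like this hypothetical sketch: it carries the raw serialized bytes and deserializes lazily, caching the result so that repeated getObject() calls pay the cost only once (all names here are invented; this is not an existing Hazelcast API):

```java
import java.io.*;

// Lazy wrapper around the wire form of a message. Deserialization runs
// on whichever thread calls getObject(), not on the notifier thread,
// and the result is memoized.
public class MessageObject {
    private final byte[] raw;
    private Object cached;  // deserialized form, filled on first access

    public MessageObject(byte[] raw) { this.raw = raw; }

    public synchronized Object getObject()
            throws IOException, ClassNotFoundException {
        if (cached == null) {
            try (ObjectInputStream in =
                     new ObjectInputStream(new ByteArrayInputStream(raw))) {
                cached = in.readObject();  // cost paid on the caller's thread
            }
        }
        return cached;
    }
}
```

Caching also addresses the case raised later in this thread of several listeners in the same JVM each deserializing the same message.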


So for now: we know that there is only one thread that notifies. I am
refining my suggestion to use the queue approach, where you can
control the number of threads. I suggested using one thread that calls
q.take() in a loop; this is essential if you care about ordering. If
you don't, you can have multiple threads waiting on q.take() and
either process the items themselves or hand them off to another thread
pool (executor service) that processes them. This way:
1. You can take from the queue concurrently.
2. Deserialization will be done on a user thread, so you can control it.

We expect this result to perform much better.
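This multi-threaded consumer setup can be sketched as follows, using a plain JDK BlockingQueue as a stand-in for a Hazelcast distributed queue (the class, method names, and handler hook are invented for illustration):

```java
import java.util.concurrent.*;
import java.util.function.Consumer;

// Several taker threads block on take() concurrently and hand each item
// to a worker pool; deserialization and processing happen on these
// user-controlled threads, at the cost of giving up ordering.
public class QueueConsumers {
    // Returns the taker pool so the caller can shut it down later.
    public static ExecutorService consume(BlockingQueue<byte[]> q,
                                          int takers,
                                          ExecutorService workers,
                                          Consumer<byte[]> handler) {
        ExecutorService takerPool = Executors.newFixedThreadPool(takers);
        for (int i = 0; i < takers; i++) {
            takerPool.submit(() -> {
                try {
                    while (true) {
                        byte[] raw = q.take();  // blocks until an item arrives
                        // deserialize + handle on the worker pool
                        workers.submit(() -> handler.accept(raw));
                    }
                } catch (InterruptedException e) {
                    Thread.currentThread().interrupt();  // clean shutdown
                }
            });
        }
        return takerPool;
    }
}
```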

Hope I didn't confuse you.

Fuad


Here is a more detailed answer.

For queues and topics we need to guarantee the order of messages,
therefore there is just one thread that notifies the listeners. And
since we are passing the object itself, that thread has to deserialize
it, which, as you said, can be very costly. So the actual bottleneck
is this one notifying thread.

On Tue, Nov 2, 2010 at 10:32 AM, mongonix <romi...@gmail.com> wrote:

mongonix

Nov 3, 2010, 3:48:02 AM
to Hazelcast
Hi Fuad,

On Nov 2, 2:18 pm, Fuad Malikov <f...@hazelcast.com> wrote:
> Why don't you consider then to pass the byte[] values of your entries
> to Hazelcast. That is do all serialization and deserialization on your
> part and put the raw data into Hazelcast. Seems to solve all your
> problems. The only downside is that you'll not be able to use queries
> for maps. All the rest should work just fine.

Yes, this can be an approach, of course. It is a bit too explicit when
it comes to the conversions from user classes into byte[], but it
would do the job.

If possible, I'd prefer an API where I could specify the points in
time (in the processing flow) and the thread pools where
(de)serialization should happen.

mongonix

Nov 3, 2010, 4:14:53 AM
to Hazelcast
Hi Fuad,

> We discussed this internally here is couple of points that we must
> consider in this situation. Even though what I suggested to you on the
> other thread was better than your approach but there is a better a
> way.

It's getting interesting.

> First I will answer to your question about threads. For Queues and
> Topics we must guarantee the order of notifications. So it is
> guaranteed that the Queue is FIFO and the messages on Topic are sent
> to the listeners in the appropriate order. For the sake of order there
> is only one thread per Queue and Topic who notifies the listeners.

BTW, what if I need a guaranteed order of notifications only for
messages from the same source, with no ordering restrictions between
messages from different sources sent to the same destination?
How would you model that? Multiple queues or topics, one per source?
Are there any "cheaper" alternatives?

> And since the appropriate event receiving methods pass the actual object,
> those objects are deserialized by that thread. And this is the actual
> bottleneck.

Thanks! This analysis makes our current situation very clear.

> Instead of actual Object we should have passed the
> MessageObject kind of thing where the actual object is serialized or
> deserialized only when you call messageObject.getObject(). This way
> you could receive the events in ordered way and process them in as
> many threads as you want. And all deserialization would be done on
> your threads. We plan to change this on future Hazelcast versions.
> This will solve two issues
> 1. Deserialization that takes on 1 thread
> 2. Classloading problem; your classloader can be different from what
> Hazelcast uses.

Very cool! Sounds like a good plan. I'm looking forward to seeing it
in trunk.

A few comments on this feature:
a) messageObject.getObject() should probably cache its result once it
is called, to avoid deserializing the same object multiple times.
b) If there are multiple listeners inside the same JVM/classloader,
this approach would result in every listener doing its own
deserialization of the same object. Maybe this can be optimized
somehow.

> So for now,
> We know that there is only one thread who notifies. I am redefining my
> suggestion as  to use the Queue approach. There you can control the
> number of threads. I suggested to use on thread which in while loop
> calls q.take. This is essential if you care the order. If you don't,
> you can have multiple threads who wait on q.take and either process
> the item themselves or again send them to other threadpool (executor
> service) that processes them. This way
> 1. you can concurrently take from the Queue

> 2. The deserialization will be done on user thread, so you can control it.
I guess you assume that I use the byte[] approach and do
deserialization in my threads, right?

I have a few comments on the queue-based approach as such:
a) q.take() is a blocking operation, as you say. What if I need to
wait for inputs from thousands of queues? I guess for such scenarios
(which are not as rare as one might think), the q.take() approach does
not scale. There are two alternatives:
   - use asynchronous notifications (as is the case with listeners
now)
   - introduce UNIX select-like functionality, i.e. a blocking wait
for input from any of a set of provided sources.
    This could be either an API call or could be modeled as a
"virtual" queue or topic that in fact collects from multiple sources
(I think it would be rather elegant to implement it as a virtual queue
or topic).

I think it is very important and useful to support this kind of
functionality.
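The "virtual queue" idea from point (a) could look roughly like the following sketch, where poller threads fan items from many source queues into one merged queue that the consumer blocks on. Plain JDK types stand in for Hazelcast queues, and all names are invented; note that a dedicated thread per source would itself not scale to thousands of queues, which is exactly why first-class support would be valuable:

```java
import java.util.List;
import java.util.concurrent.*;

// Fan-in: one poller thread per source forwards items into a single
// merged queue, so the consumer waits on one take() instead of many.
public class VirtualQueue<T> {
    private final BlockingQueue<T> merged = new LinkedBlockingQueue<>();
    private final ExecutorService pollers;

    public VirtualQueue(List<BlockingQueue<T>> sources) {
        pollers = Executors.newFixedThreadPool(sources.size());
        for (BlockingQueue<T> src : sources) {
            pollers.submit(() -> {
                try {
                    while (true) merged.put(src.take());  // forward items
                } catch (InterruptedException e) {
                    Thread.currentThread().interrupt();
                }
            });
        }
    }

    public T take() throws InterruptedException { return merged.take(); }

    public void shutdown() { pollers.shutdownNow(); }
}
```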

b) What about control over serialization? Which thread does it, the
user's thread or a Hazelcast thread? When does it happen? Should the
approach be symmetric to deserialization?

c) Imagine a situation where Hazelcast is used for both intra-JVM and
distributed setups, with the concrete topology decided at run-time. It
may very well happen that all producers and consumers of a given
distributed data structure (queue, topic, map) are in the same JVM or
even the same classloader. It would be nice if Hazelcast could
optimize for this situation and avoid (de)serialization altogether,
unless it is really required to support remote listeners. It would be
similar to the local vs. remote EJB invocation optimization, where in
the local case objects are passed by reference instead of being
copied. Of course, passing by reference would mean that listeners must
not modify the message object (i.e. it is immutable), but this is
acceptable in many cases and could even be indicated at the API level,
e.g. by doing something like myQueue.setProperty("immutable.messages",
true);

d) Minor proposal: it would be nice if listeners could also get the
name of the shared data structure that triggered them. This would
allow using the same listener to listen to (thousands of) inputs. This
is pure convenience and can be modeled even now, but only with wrapper
objects.

e) Minor proposal: it would be nice to have notifications about
changes to a given data structure instance that do not contain the
object itself; they would just say which data structure changed (e.g.
by name) and what kind of change it was (e.g. add, remove, update,
destroy). Such a notification is rather cheap, and the listener can
then decide what to do next, e.g. it may want to perform an expensive
operation like obtaining the changed object. The advantage of this
approach is that it avoids expensive sending/serialization/
deserialization of complex objects and lets user code decide what to
do when changes happen. It also suits the asynchronous (IO) model much
better, as it avoids any blocking calls.

Please let me know what you think about these comments and proposals.
Do they make sense?

> We expect this result to perform much better.

> Hope i didn't confuse you.

Not at all, your explanation is very clear. Thanks a lot!

-Roman

Fuad Malikov

Nov 4, 2010, 6:05:47 AM
to haze...@googlegroups.com
Hi Roman,

On Wed, Nov 3, 2010 at 10:14 AM, mongonix <romi...@gmail.com> wrote:
> Hi Fuad,
>
>> We discussed this internally here is couple of points that we must
>> consider in this situation. Even though what I suggested to you on the
>> other thread was better than your approach but there is a better a
>> way.
>
> It's getting interesting.
>
>> First I will answer to your question about threads. For Queues and
>> Topics we must guarantee the order of notifications. So it is
>> guaranteed that the Queue is FIFO and the messages on Topic are sent
>> to the listeners in the appropriate order. For the sake of order there
>> is only one thread per Queue and Topic who notifies the listeners.
>
> BTW, what if I need a guaranteed order of notifications only for the
> messages from the same source?
> But there is no restrictions for the messages from different sources
> sent to the same destination.
> How would you model that? Multiple queues or topics - one per source?
> Are there any "cheaper" alternatives?

Yes, it seems that is the way you should do it.


>
>> And since the appropriate event receiving methods pass the actual object,
>> those objects are deserialized by that thread. And this is the actual
>> bottleneck.
>
> Thanks! This analysis makes our current situation very clear.
>
>> Instead of actual Object we should have passed the
>> MessageObject kind of thing where the actual object is serialized or
>> deserialized only when you call messageObject.getObject(). This way
>> you could receive the events in ordered way and process them in as
>> many threads as you want. And all deserialization would be done on
>> your threads. We plan to change this on future Hazelcast versions.
>> This will solve two issues
>> 1. Deserialization that takes on 1 thread
>> 2. Classloading problem; your classloader can be different from what
>> Hazelcast uses.
>
> Very cool! Sounds like a good plan. I'm looking forward to see it in
> trunk.

Can you please create an issue for this and add a link to this
conversation? You'll be notified whenever there is action on the
issue.


>
> Few comments on this feature:
> a) messageObject.getObject() should probably remember its result, once
> it is called, to avoid a deserialization of the same object happening
> multiple times.
> b) If there are multiple listeners inside the same JVM/classloader,
> this approach would result in every listener doing its own
> deserialization of the same object. May be this can be optimized
> somehow.
>

Agreed; can you also add these to the issue?

>> So for now,
>> We know that there is only one thread who notifies. I am redefining my
>> suggestion as  to use the Queue approach. There you can control the
>> number of threads. I suggested to use on thread which in while loop
>> calls q.take. This is essential if you care the order. If you don't,
>> you can have multiple threads who wait on q.take and either process
>> the item themselves or again send them to other threadpool (executor
>> service) that processes them. This way
>> 1. you can concurrently take from the Queue
>
>> 2. The deserialization will be done on user thread, so you can control it.
> I guess you assume that I use byte[] approach and do deserialization
> in my threads, right?

No, since this approach does not use listeners. The bottleneck occurs
when you use listeners: then Hazelcast notifies you, and it uses one
thread per queue or topic to preserve ordering.
If you don't use listeners and instead have your own threads taking
from the queue, deserialization will be done on your threads, and you
can increase the number of threads that concurrently wait on take()
and consume the queue. With this approach you get all the benefits but
give up ordering.

And if you put and get byte[] to and from Hazelcast, then all the
problems I described above disappear: Hazelcast will never try to
serialize/deserialize your objects, and having one thread notify you
won't be a big problem.


>
> I have a few comments on the queue-based approach as such:
> a) q.take() is a blocking operation, as you say. What if I need to
> wait for inputs from thousands of queues? I guess for such scenarios
> (which are not so seldom, as one may think), the q.take() approach
> does not scale. There are two alternatives:
>   - use asynchronous notifications (as it is the case with listeners
> now)
>   - introduce a UNIX select-like functionality, i.e. blocking wait
> for an input from one of the provided sources.
>    This can be either an API call or can be eventually modeled as a
> "virtual" queue or topic that in fact collects from multiple sources
> (I think it would be rather elegant to implement it as a virtual queue
> or topic).

You can use q.poll() with a timeout instead of take(); it will wait up
to that timeout and return null if nothing arrives. But I agree that
it is still not a very good way.
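For reference, in the JDK's BlockingQueue API (which distributed queue implementations generally mirror), the timed variant is poll(timeout, unit) rather than take(); a small sketch, with the class name and timeout value chosen for illustration:

```java
import java.util.concurrent.*;

// poll(timeout, unit) waits up to the timeout and returns null if
// nothing arrived, so a consumer loop can periodically do other work
// instead of blocking indefinitely like take().
public class TimedPoll {
    public static String pollOnce(BlockingQueue<String> q)
            throws InterruptedException {
        return q.poll(100, TimeUnit.MILLISECONDS);  // null on timeout
    }
}
```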

>
>    I think it is very important and useful to support this kind of
> functionality
>
> b) What about control over the serialization? Which thread does it -
> user's thread or Hazelcast thread? When it happens? Should it be
> symmetric in approach to deserialization?

I think I have answered this question above. If it is not clear, let me know.

>
> c) Imagine the situation, where Hazelcast is used for both intra-JVM
> and distributed JVM setup, where concrete topology is decided at run-
> time. It may very well happen, that all producers and consumers of a
> given distributed datastructure (queue, topic, map) are on the same
> JVM or even in the same class-loader. It would be nice in this case,
> if Hazelcast could optimize for this situation and eventually avoid
> the (de)serialization all together, unless it is really required to
> support remote listeners. It would be similar to the local EJB vs.
> remote EJB invocations optimization, where in local case objects are
> passed by reference, instead of copying them. Of course, passing by
> reference would mean that listeners should not modify the message
> object (i.e. it is immutable), but this is acceptable in many cases
> and can be even indicated at the API level may be. I.e. by doing
> something like. myQueue.setProperty("immutable.messages", true);

Serialization happens when you put into a queue (or map, or topic),
and Hazelcast never stores your objects, so it must occur. Imagine
that you add another remote member and the data starts to migrate, or
a backup is taken on that remote member: serialization has to have
happened at put time.
For maps we also keep a reference to the deserialized object for fast
local access. That makes sense for maps, since you may get the same
object more than once, but it doesn't make sense for queues and
topics, which are one-shot objects. I agree that there can be
optimizations, but there are certain rules we don't want to change for
the sake of simplicity and portability.


>
> d) Minor proposal: It would be nice if listeners could get also the
> name of the shared datastructure that triggered them. This would allow
> for using the same listener to listen for (thousands of ) multiple
> inputs. This is a pure convenience and can be modeled even now, but
> with wrapper objects.

I agree; when we implement the messageObject idea it can be added as
something like messageObject.getSource().


>
> e) Minor proposal: It would be nice to have notifications about
> changes of a given instance of data-structure. Such a notification
> should not contain the object itself. It should just say which
> datastructure (e.g. name) and what kind of change (e.g. add, remove,
> update, destroy). Therefore, such a notification is rather cheap. And
> the listener can then decide what to do next, e.g. it may want to
> perform an expensive operation like obtaining the changed object. The
> advantage of this approach is that it avoids expensive sending/
> serialization/deserialization of complex objects and lets user code
> decide what to do when changes happen. It also suites the asynchronous
> (IO) model much better as it avoids any blocking calls.

For maps, the notification passes an event object, and when adding a
listener you can specify whether you want the value to be included in
the event object or not. Again, these optimizations can be done with
messageObject.


>
> Please, let me know what think about these comments and proposals. Do
> they make sense?
>
>> We expect this result to perform much better.
>
>> Hope i didn't confuse you.
>
> Not at all, your explanation is very clear. Thanks a lot!
>
> -Roman

Best Regards,

Fuad

mongonix

Nov 4, 2010, 3:12:31 PM
to Hazelcast
Hi,

As you proposed, I've submitted the following issues:

http://code.google.com/p/hazelcast/issues/detail?id=407
http://code.google.com/p/hazelcast/issues/detail?id=408
http://code.google.com/p/hazelcast/issues/detail?id=409
http://code.google.com/p/hazelcast/issues/detail?id=410

At the moment they all have the type "Defect", even though they are
actually enhancements or new features. I couldn't find a way to select
the proper type; maybe it can only be done by project owners?

Best Regards,
Roman