Hazelcast query performance


Sorin Constantinescu

Mar 7, 2010, 4:03:13 PM
to Hazelcast
I tried a simple query (name like "John") against a map with 50,000
records and it took 2.8 seconds. After checking the Hazelcast code, I
discovered that you keep the data serialized in memory and deserialize
it every time you execute a query.

Why did you take this approach? In my opinion, you should keep objects
in memory and serialize them only when transferring them between nodes
of the cluster. When a user does a get() from the map, you should also
do a serialization/deserialization round-trip so that the user gets
back a copy of the original object. For queries, you would then need
serialization only for the objects the query returns. In my case, for
example, instead of 50,000 deserializations you would do only 1,
because the query returns a single object.

This would give you a huge performance increase. For 500,000 records a
query currently takes 20 seconds, which, as you can imagine, is not
usable in a production environment.
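The caching Sorin proposes can be sketched in plain Java: keep the canonical serialized form (cheap to hand to remote nodes) and lazily cache the deserialized object after the first local read, so each record is deserialized at most once. The Record class below is a hypothetical standalone sketch, not Hazelcast's actual internals:

```java
import java.io.*;

// Hypothetical sketch: a record keeps its canonical serialized form and
// lazily caches the deserialized object after the first local read.
class Record {
    private final byte[] valueData;      // canonical serialized form
    private volatile Object cachedValue; // filled on first deserialization

    Record(byte[] valueData) {
        this.valueData = valueData;
    }

    // Deserializes at most once; later reads and queries reuse the cache.
    Object getValue() throws IOException, ClassNotFoundException {
        Object v = cachedValue;
        if (v == null) {
            try (ObjectInputStream in =
                     new ObjectInputStream(new ByteArrayInputStream(valueData))) {
                v = in.readObject();
            }
            cachedValue = v; // benign race: concurrent readers produce equal values
        }
        return v;
    }

    // The serialized form is what gets shipped to other nodes.
    byte[] getValueData() {
        return valueData;
    }
}
```

With this layout, a query over 50,000 records pays the deserialization cost once per record at most, and remote transfers reuse the byte[] as-is.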

Talip Ozturk

Mar 8, 2010, 5:38:51 AM
to haze...@googlegroups.com
Hi Sorin,

1. Why do we keep values in byte[] form?
a) We have to, because most requests will come from other nodes
(imagine a 10-node cluster: 9 of them are remote), and we don't want to
serialize objects to byte[] for every get() call.
b) If we keep everything in byte[] form then Hazelcast can be a

2. Why is the query slow?
Because the current implementation is not good enough :) and we are
well aware of it. We have great plans for queries; a significant
portion of the implementation will be rewritten. Here are some of the
things that come to mind right now:
1. No deserialization, as you said (the very first read or query will
cache the value in object form).
2. Concurrency: currently queries are only partially concurrent;
QueryService itself is a bottleneck.
3. Indexing optimizations: the current indexing logic is really good,
so we will keep it, but it will be optimized for concurrency.
4. Query-based listeners: get events for the entries matching your predicate.

The design of the new query implementation is almost complete, and it
will be released as part of 1.8.2.

-talip


Talip Ozturk

Mar 8, 2010, 5:50:15 AM
to haze...@googlegroups.com
completing my sentence :)

> b) If we keep everything in byte[] form then Hazelcast can be a generic remote cache that any kind of client (C#, C++, ...) can use, as long as we have a client implementation for it.

-talip

Fuad Malikov

Mar 8, 2010, 5:56:18 AM
to haze...@googlegroups.com
Hi Sorin,

Did you try adding an index on the "name" field? Indexing skips
deserialization for non-matching entries; only the matching entries are
deserialized.
Adding an index looks like the sample below. With the current Hazelcast
version you should add the indexes on all nodes before putting any entries.
IMap imap = Hazelcast.getMap("employees");
imap.addIndex("age", true);        // ordered, since we run range queries on this field
imap.addIndex("active", false);    // not ordered, because a boolean field has no range

-Fuad

Sorin Constantinescu

Mar 11, 2010, 6:14:20 PM
to Hazelcast
Thanks to all for the answers provided.

I discovered that you do in fact cache the deserialized object in the
class Record.RecordEntry, but only after the first query.
The index works great for a query like name = 'John' (about 100 times
faster than without the index), but it is not used for a query like
name like 'Joh%'. I would like to suggest a small optimization: if a
sorted index exists on a field and the LIKE pattern does not start
with %, you can first do a binary search on the index.
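The prefix optimization can be illustrated with a plain sorted map: for LIKE 'Joh%', scan only the index range from "Joh" up to (but not including) "Joh" followed by the maximum character, instead of touching every record. This is a hypothetical standalone sketch (the class and method names are made up, not Hazelcast API):

```java
import java.util.*;

// Hypothetical sketch of a sorted index that answers LIKE 'prefix%' with a
// range scan instead of a full scan over all entries.
class SortedFieldIndex {
    // field value -> record keys having that value
    private final TreeMap<String, List<Object>> index = new TreeMap<>();

    void add(String fieldValue, Object recordKey) {
        index.computeIfAbsent(fieldValue, k -> new ArrayList<>()).add(recordKey);
    }

    // Binary-searches to the first entry >= prefix, then walks forward only
    // while entries still share the prefix.
    List<Object> likePrefix(String prefix) {
        List<Object> matches = new ArrayList<>();
        // Smallest string that no longer shares the prefix (good enough for
        // a sketch; values containing Character.MAX_VALUE are an edge case).
        String upper = prefix + Character.MAX_VALUE;
        for (List<Object> keys : index.subMap(prefix, true, upper, false).values()) {
            matches.addAll(keys);
        }
        return matches;
    }
}
```

TreeMap.subMap positions itself in O(log n), so the cost becomes proportional to the number of matches rather than the map size.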

If I execute the same query twice, do you cache the query results? If
not, this could be another optimization. The query cache would only
need to be evicted when the map data on the local node of the cluster
is modified.
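Such a result cache could look roughly like the following hypothetical sketch (not Hazelcast code): results are keyed by the query string and the whole cache is dropped on any local write, matching the eviction rule suggested above.

```java
import java.util.*;
import java.util.concurrent.*;

// Hypothetical sketch of a per-node query-result cache that is invalidated
// whenever the local map data changes.
class QueryResultCache<V> {
    private final Map<String, Collection<V>> results = new ConcurrentHashMap<>();

    // Returns the cached matches for this query, or null if not cached.
    Collection<V> get(String query) {
        return results.get(query);
    }

    void put(String query, Collection<V> matches) {
        results.put(query, matches);
    }

    // Hook this into the map's write path (put/remove on the local node).
    void invalidate() {
        results.clear();
    }
}
```

Clearing the whole cache on any write is coarse but safe; per-query invalidation would need predicate analysis to know which cached results a write can affect.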

Congratulations for the product :)

Talip Ozturk

Mar 12, 2010, 6:03:57 PM
to haze...@googlegroups.com
> I discovered that you do in fact cache the deserialized object in the
> class Record.RecordEntry, but only after the first query.
> The index works great for a query like name = 'John' (about 100 times
> faster than without the index), but it is not used for a query like
> name like 'Joh%'. I would like to suggest a small optimization: if a
> sorted index exists on a field and the LIKE pattern does not start
> with %, you can first do a binary search on the index.

Makes sense. Thanks.

> If I execute the same query twice, do you cache the query results?

No, we don't cache the query results.

> If not, this could be another optimization. The query cache would only
> need to be evicted when the map data on the local node of the cluster
> is modified.

Yes, doable. It could be made configurable.

Great feedback. We are actually in the process of making queries much
better; part of the query implementation is being rewritten, because
the current implementation is not concurrent enough. The new version
of the query implementation will be faster with less memory and CPU
consumption. We are very close. You cannot imagine how much fun I am
having right now.

-talip
