Scale up/down cluster with high traffic


Stefano S

Jul 31, 2017, 10:55:55 AM
to Hazelcast
Hi all,
we have 3 front-end applications.
These applications use a Hazelcast cluster of 3 nodes, with near cache and smart routing enabled.
The front-end applications serve 150 req/sec.
For each request they call the HZ cluster and, normally, the total response time is a few ms (5-6 ms).
So, if we scale the HZ cluster up from 3 to 4 nodes, everything works fine!
When we scale down, from 4 to 3, we sometimes see a spike in response times.
The response time reaches 3, 6, even 10 seconds.
We have metrics on the front-end apps, and I saw that this time is spent on get/put operations on the IMap.
On the node that is shutting down, I see clear logs about Hazelcast shutting down gracefully.
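For reference, a client configuration along the lines described (smart routing plus a near cache) can be sketched in hazelcast-client.xml; the map name and member addresses below are placeholders, not our real values:

```xml
<hazelcast-client xmlns="http://www.hazelcast.com/schema/client-config">
    <network>
        <cluster-members>
            <!-- placeholder addresses of the 3 cluster members -->
            <address>10.0.0.1:5701</address>
            <address>10.0.0.2:5701</address>
            <address>10.0.0.3:5701</address>
        </cluster-members>
        <!-- smart routing: client connects to all members and routes
             each operation to the partition owner directly -->
        <smart-routing>true</smart-routing>
    </network>
    <!-- near cache keeps hot entries locally on the client -->
    <near-cache name="myMap">
        <invalidate-on-change>true</invalidate-on-change>
    </near-cache>
</hazelcast-client>
```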

Any idea what causes this?
I mean, is there any log to enable, or any setting to check, in order to avoid these situations?

Thanks
Stefano.



Peter Veentjer

Jul 31, 2017, 12:40:00 PM
to haze...@googlegroups.com
On Mon, Jul 31, 2017 at 5:55 PM, Stefano S <stefano...@gmail.com> wrote:
> Hi all,
> we have 3 front-end application.
> These applications use an Hazelcast cluster of 3 nodes with near cache and with smart routing.
> The front end applications are serving 150 req/secs.

I assume the number of requests per second is limited for now, because 150 req/second is almost nothing.

 
> For each request they call the HZ cluster and, normally, the total response time is few ms (5, 6 ms).

This is still quite high. Please explain how much data each request transfers. Also provide information
about your setup: hardware, OS, network, Java version, HZ version, etc. Also how big the IMap is
(number of keys and size of the values).
 
> So.. if we scale up the HZ from 3 to 4 nodes.. all works fine!
> When we scale down, from 4 to 3, sometimes we have a spike on response times.
> The response time reach 3, 6, 10 seconds.

This should be temporary; I guess it is caused by the partitions that are migrating. Does it eventually go
back to its average response time of 5-6 ms?
 
> We have metrics on the front-end apps and I saw that this time is spent on get/put operation on the IMap.
> In the node that is "shutting down" I have clear logs about graceful shutting down of Hazelcast.

> Any idea about this?

Probably migrations.

You can try adding this system property:

-Dhazelcast.partition.migration.interval=5

This adds some delay between migrations, so that regular operations are not held up as much.
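Besides passing it as a JVM flag, the same property can be set declaratively on the members; a sketch in hazelcast.xml:

```xml
<hazelcast xmlns="http://www.hazelcast.com/schema/config">
    <properties>
        <!-- pause, in seconds, between successive partition migrations -->
        <property name="hazelcast.partition.migration.interval">5</property>
    </properties>
</hazelcast>
```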

 
> I mean.. is there any log to enable, any settings to check in order to avoid these situations?
>
> Thanks
> Stefano.




Stefano S

Sep 21, 2017, 8:59:52 AM
to Hazelcast
Hi Peter,
thank you for your answer, and sorry for replying after so much time.
So...
The number of req/sec is what we had at the time, but yes, we plan to increase it a lot.
About this:
> This should be a temporary; I guess this is the partitions that are migrating. Does it eventually go
> back to its average response time of 5/6ms ?
Yes, the high response time is temporary. But it is a very big problem for us.

We tried following this:
> You can try to add this one:
> -Dhazelcast.partition.migration.interval=5
> This will add some time between the migrations so the regular operations don't get delayed that much
and we got the opposite effect!! We reached more than 1 minute of response time! We had a big outage of our services!

I would add some info:
We are using Kubernetes. The HZ nodes are Java applications running on k8s.
Entries are smaller than 1 KB, and there are thousands of them.
We are using 1 (sync) backup with our cache (we are using IMap<>, so data is distributed).
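The map setup described above (one synchronous backup) would look roughly like this in hazelcast.xml; the map name is a placeholder:

```xml
<map name="myMap">
    <!-- one synchronous backup: a put completes only after
         the backup copy is written on another member -->
    <backup-count>1</backup-count>
    <async-backup-count>0</async-backup-count>
</map>
```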

Now, these issues are related to migrations.
But is there any way to avoid this kind of problem?
I'm talking about a graceful shutdown of a node, not an unexpected crash.

I suppose we can mitigate this problem using async backups, is that right?
And what about this system property?
hazelcast.migration.min.delay.on.member.removed.seconds
The default is 5 seconds, so could we see a delay because of this? I mean, the member is shutting down, but its data will not be available until the migration is finished, could that be?
Is it safe to use 0 seconds?
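Switching to an async backup would be a one-line change in the map configuration (a sketch, with a placeholder map name). Note the trade-off: async backups do not block the put, so a member failure can lose the most recent updates:

```xml
<map name="myMap">
    <!-- no synchronous backups; one asynchronous backup instead,
         so puts return before the backup copy is written -->
    <backup-count>0</backup-count>
    <async-backup-count>1</async-backup-count>
</map>
```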


Thank you again
Stefano




