Nodes (EC2 instances) not being recognized as being in the cluster or as having been destroyed


Simeon Angelov

Oct 31, 2018, 7:42:57 AM
to jgroups-dev
Hi all,

I would like to ask you for some suggestions/recommendations for improving our JGroups configuration, or our setup as a whole, in order to address some issues we faced during our last sales period.

Environment setup: we are running an e-commerce platform on Hybris (Tomcat) that uses JGroups version 3.6.16. Please note that we are using TCP rather than UDP, because AWS does not support multicast in our VPC.

Issues:

1. We are getting the "JGRP000034: hybrisnode-8355: failure sending message to hybrisnode-8369: java.net.SocketTimeoutException: connect timed out" error, especially at times when we scale some of the cluster nodes up/down, as user traffic increases significantly and the nodes become heavily loaded (CPU above 80-90%).
Is sock_conn_timeout="300" the property we need to increase in order to avoid this SocketTimeoutException? Are there any other related configuration properties we should consider changing along with it?

2. At some point an AWS EC2 instance (node) receives a SIGKILL(9), e.g. when a scale-down policy executes or the instance is terminated. This forces the application to stop and the EC2 instance to be terminated. What we see occasionally (not always) is that the other nodes in the cluster do not recognize that this instance is out of the cluster / already terminated. So the other nodes keep trying to send messages to that node, even though it has already been terminated. It is as if the terminated EC2 instance never managed to notify the other instances that it left the cluster. We can identify this because we again see the
"JGRP000034: hybrisnode-8355: failure sending message to hybrisnode-8369: java.net.SocketTimeoutException: connect timed out" error, where the destination EC2 instance the message was sent to is the node that was terminated.
Again, this does not happen every time; we only see this situation occasionally.
Is there a property we could change in order to mitigate this? And what could be the reason a terminated EC2 instance is not recognized as terminated by the other EC2 instances?

3. Also, occasionally, we see that even though an AWS EC2 instance is in the cluster, messages sent to that node fail for a while - for a couple of minutes, say (again with the "JGRP000034:" error).
What could be the reason behind that? Is the destination EC2 instance too busy to accept any JGroups messages, perhaps? Could we address this case with a configuration change or in some other way?


Please find below our current JGroups configuration:



<config xmlns="urn:org:jgroups"
        xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
        xsi:schemaLocation="urn:org:jgroups http://www.jgroups.org/schema/JGroups-3.0.xsd">

    <TCP loopback="true"
         recv_buf_size="${tcp.recv_buf_size:20M}"
         send_buf_size="${tcp.send_buf_size:640K}"
         discard_incompatible_packets="true"
         max_bundle_size="64K"
         max_bundle_timeout="30"
         enable_bundling="true"
         use_send_queues="true"
         sock_conn_timeout="300"
         timer_type="new"
         timer.min_threads="4"
         timer.max_threads="10"
         timer.keep_alive_time="3000"
         timer.queue_max_size="500"
         thread_pool.enabled="true"
         thread_pool.min_threads="40"
         thread_pool.max_threads="250"
         thread_pool.keep_alive_time="5000"
         thread_pool.queue_enabled="false"
         thread_pool.queue_max_size="10000"
         thread_pool.rejection_policy="discard"
         oob_thread_pool.enabled="true"
         oob_thread_pool.min_threads="5"
         oob_thread_pool.max_threads="40"
         oob_thread_pool.keep_alive_time="5000"
         oob_thread_pool.queue_enabled="false"
         oob_thread_pool.queue_max_size="10000"
         oob_thread_pool.rejection_policy="discard"
         bind_addr="${hybris.jgroups.bind_addr}"
         bind_port="${hybris.jgroups.bind_port}" />

    <JDBC_PING connection_driver="${hybris.database.driver}"
               connection_password="${hybris.database.password}"
               connection_username="${hybris.database.user}"
               connection_url="${hybris.database.url}"
               initialize_sql="${hybris.jgroups.schema}"
               datasource_jndi_name="${hybris.datasource.jndi.name}" />

    <MERGE2 min_interval="10000" max_interval="30000" />
    <FD_SOCK />
    <FD timeout="3000" max_tries="3" />
    <VERIFY_SUSPECT timeout="1500" />
    <BARRIER />
    <pbcast.NAKACK use_mcast_xmit="false" exponential_backoff="500" discard_delivered_msgs="true" />

    <UNICAST />
    <pbcast.STABLE stability_delay="1000" desired_avg_gossip="50000" max_bytes="4M" />
    <pbcast.GMS print_local_addr="true" join_timeout="3000" view_bundling="true" />

    <UFC max_credits="20M" min_threshold="0.6" />
    <MFC max_credits="20M" min_threshold="0.4" />

    <FRAG2 frag_size="60K" />
    <pbcast.STATE_TRANSFER />

</config>


Thank you kindly
Simeon

Bela Ban

Dec 6, 2018, 9:50:30 AM
to jgrou...@googlegroups.com


On 31.10.18 12:42, Simeon Angelov wrote:
> Hi all,
>
> I would like to ask you for some propositions/recommendations on
> improvement on our jgroups configuration or some improvements as a whole
> in order to delegate some issues we have faced on our last sales period.
>
> Set up of the environment: We are using an e-commerce platform running
> on Hybris (Tomcat) that uses jgroups version 3.6.16. Please note that we
> are using TCP as opposed to UDP because AWS can't support multicast on
> our VPC.
>
> Issues:
>
> 1. Getting "JGRP000034: hybrisnode-8355: failure sending message to
> hybrisnode-8369: java.net.SocketTimeoutException: connect timed out"
> error , especially on a time when we scale up/down some of the nodes in
> the clusters as the users traffic increase significantly and the nodes
> in the cluster becomes heavy (high CPU - more than 80 ~ 90% ).
> Is the sock_conn_timeout="300" what we need to increase here in order
> to not facing such a SocketTimeoutException exception ? Are there any
> other depended configuration properties to it we should have in mind
> changing it?

In general, a low timeout is fine: if a connection cannot be
established, the next attempt might succeed (retransmission will take
care of that).


> 2. At some point a  AWS EC2 instance (node) (e.g. when a scale down
> policy is being executed, or a  AWS EC2 instance (node) is being
> terminated), is receiving a SIGKILL(9) . This force the application to
> stop and also the AWS EC2 instance to be terminated. What we are facing
> incidentally (not always happening) is that that EC2 instance node is
> not recognized by the other nodes in the cluster that is out of the
> cluster / terminated already.

This would be detected by FD_SOCK (if the sockets of the process are
closed) or by FD (if FD_SOCK doesn't detect it). I recommend replacing
FD with FD_ALL, which is much faster at detecting multiple member crashes.
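
For reference, a minimal sketch of what that substitution could look like in the stack above; the timeout/interval values here are illustrative examples, not tuned recommendations:

```xml
<!-- Replaces <FD timeout="3000" max_tries="3" />. FD_ALL suspects any
     member from which no heartbeat has been received within `timeout` ms;
     every member sends a heartbeat each `interval` ms. Example values only. -->
<FD_SOCK />
<FD_ALL timeout="12000" interval="3000" />
<VERIFY_SUSPECT timeout="1500" />
```

FD_SOCK still detects clean socket closes immediately; FD_ALL covers instances that vanish without closing their sockets (e.g. a SIGKILL followed by instance termination).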

> And so, the other nodes are trying to send
> messages to that node (even it has been terminated already). It is like
> the terminated EC2 instance couldn't manage to notify the other EC2
> instances that is our of the cluster. We can identify that as we are
> having again the
> "JGRP000034: hybrisnode-8355: failure sending message to
> hybrisnode-8369: java.net.SocketTimeoutException: connect timed out"
> error where the destination EC2 instance, for where the message has been
> sent, is the node has been terminated.
> Again, this is not a at all time happening, it is incidentally where we
> can see that situation.
> Is there a property we could change in order to mitigate that situation

If UNICAST3 tries to retransmit to a member that has died,
max_retransmit_time will remove that connection after the given time and
the sending will stop. Also take a look at conn_close_timeout.

I suggest upgrading from UNICAST to UNICAST3 anyway.
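
As a sketch of the two properties mentioned (values are examples, not recommendations):

```xml
<!-- Replaces <UNICAST />. max_retransmit_time: stop retransmitting to
     (and drop the connection of) a member that has been unreachable for
     this many ms. conn_close_timeout: how long a closed connection is
     retained before removal. Example values only. -->
<UNICAST3 max_retransmit_time="60000"
          conn_close_timeout="10000" />
```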

> ? And what could be the reason the EC2 instance to be not recognized by
> the other EC2 instance has been terminated?

It *should* be recognized after ca. 10 seconds (3 * 3s in FD plus 1.5s in
VERIFY_SUSPECT), or immediately by FD_SOCK. I have not seen such
behavior in my own AWS tests...

> 3. Also, we can see, incidentally again, that a AWS EC2 instance even is
> in the cluster , messages has been send to that node for a while - for
> couple of minutes, let's say (again we could see "JGRP000034:" error).
> What could be the reason behind that ? The destination EC2 instance is
> too busy to accepts any kind of jgroups messages, per say ? Could we
> delegate that case with a configuration change or some other way ?

Not sure I understand what you mean... do you have any firewalls in place?
> --
> You received this message because you are subscribed to the Google
> Groups "jgroups-dev" group.
> To unsubscribe from this group and stop receiving emails from it, send
> an email to jgroups-dev...@googlegroups.com
> <mailto:jgroups-dev...@googlegroups.com>.
> To post to this group, send email to jgrou...@googlegroups.com
> <mailto:jgrou...@googlegroups.com>.
> To view this discussion on the web visit
> https://groups.google.com/d/msgid/jgroups-dev/c2e48e85-c109-4655-8dd1-9ca62f616ca1%40googlegroups.com
> <https://groups.google.com/d/msgid/jgroups-dev/c2e48e85-c109-4655-8dd1-9ca62f616ca1%40googlegroups.com?utm_medium=email&utm_source=footer>.
> For more options, visit https://groups.google.com/d/optout.

--
Bela Ban | http://www.jgroups.org

Simeon Angelov

Dec 11, 2018, 7:23:32 AM
to jgroups-dev
Hi Bela,


Thanks for getting back to us and for replying.

Please have a look at our observations so far:

On Thursday, 6 December 2018 14:50:30 UTC, Bela Ban wrote:


On 31.10.18 12:42, Simeon Angelov wrote:
> Hi all,
>
> I would like to ask you for some propositions/recommendations on
> improvement on our jgroups configuration or some improvements as a whole
> in order to delegate some issues we have faced on our last sales period.
>
> Set up of the environment: We are using an e-commerce platform running
> on Hybris (Tomcat) that uses jgroups version 3.6.16. Please note that we
> are using TCP as opposed to UDP because AWS can't support multicast on
> our VPC.
>
> Issues:
>
> 1. Getting "JGRP000034: hybrisnode-8355: failure sending message to
> hybrisnode-8369: java.net.SocketTimeoutException: connect timed out"
> error , especially on a time when we scale up/down some of the nodes in
> the clusters as the users traffic increase significantly and the nodes
> in the cluster becomes heavy (high CPU - more than 80 ~ 90% ).
> Is the sock_conn_timeout="300" what we need to increase here in order
> to not facing such a SocketTimeoutException exception ? Are there any
> other depended configuration properties to it we should have in mind
> changing it?

> In general, a low timeout is fine: if a connection cannot be
> established, the next attempt might succeed (retransmission will take
> care of that).

Yes, we have left sock_conn_timeout at 300.


> 2. At some point a  AWS EC2 instance (node) (e.g. when a scale down
> policy is being executed, or a  AWS EC2 instance (node) is being
> terminated), is receiving a SIGKILL(9) . This force the application to
> stop and also the AWS EC2 instance to be terminated. What we are facing
> incidentally (not always happening) is that that EC2 instance node is
> not recognized by the other nodes in the cluster that is out of the
> cluster / terminated already.

> This would be determined by FD_SOCK (if the sockets of the process are
> closed) or by FD (if FD_SOCK doesn't detect this). I recommend replace
> FD with FD_ALL, which is much faster in detecting multiple member crashes.


Yes, we have both configured: FD_SOCK and FD.
We changed FD to <FD timeout="6000" max_tries="3" />, increasing the timeout before a node is suspected.
FD_ALL - I will revisit it.
We don't have any firewalls in place that would break the nodes' JGroups communication.
My question above is whether there is a scenario where a node could be so busy that it cannot accept new JGroups messages - credit limits, say?


We made some config changes - mainly increasing the queue sizes on some of the thread pools - and when we ran some performance tests we saw a much cleaner picture in terms of JGroups errors.
The changes are:
thread_pool.queue_max_size: from 10000 to 20000
oob_thread_pool.queue_max_size: from 10000 to 20000
Changed MERGE2 to MERGE3
VERIFY_SUSPECT timeout: from 1500 to 3000
MFC max_credits: from 20M to 40M

We are giving the nodes more credits for accepting JGroups messages, and larger queue sizes.
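
Put together, a sketch of the changed lines in our stack (values taken from the list above; the MERGE3 intervals are an assumption, carried over from the old MERGE2 line):

```xml
<MERGE3 min_interval="10000" max_interval="30000" />  <!-- was MERGE2 -->
<FD timeout="6000" max_tries="3" />                   <!-- was timeout="3000" -->
<VERIFY_SUSPECT timeout="3000" />                     <!-- was 1500 -->
<MFC max_credits="40M" min_threshold="0.4" />         <!-- was 20M -->
<!-- plus, on <TCP>: thread_pool.queue_max_size="20000" and
     oob_thread_pool.queue_max_size="20000" (these only take effect when
     the respective queue_enabled attribute is "true") -->
```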




 


Thanks,
Simeon

Bela Ban

Dec 11, 2018, 7:44:56 AM
to jgrou...@googlegroups.com


On 11.12.18 13:23, Simeon Angelov wrote:

> Not sure I understand what you mean... do you have any firewalls in place?
>
> We are not having any firewalls in place, breaking the nodes jgroups
> communication.

OK

> My idea above is there a scenario where a node could be so busy that can
> not accept new jgroups messages - credits limits per say?

No, unless your thread pool is too small, or has a queue, which is not
the case. Although: you could be facing GC, or heavy context switching
when running a lot of threads, delaying things...


> We did  some config changes - actually increase number of messages on
> the some threads pools - and when we did some performance tests we had
> cleared picture in terms of jgroups errors.
> Changes are:
> thread_pool.queue_max_size from 10000 to 20000

This is moot, as you have the queue disabled anyway.

> oob_thread_pool.queue_max_size from 10000 to 20000

Ditto

> Changed MERGE2 to MERGE3

OK

> VERIFY_SUSPECT timeout from 1500 to 3000
> MFC max_credits from 20M to 40M

I don't think this is needed

> We are giving more credits for the nodes accepting the jgroups messages
> and more queue sizes.

> Thanks,
> Simeon
>

Harsh Modi

Apr 21, 2020, 1:40:41 PM
to jgroups-dev
@Simeon


Did you find a solution for this? Please share if you have any suggestions that helped to solve this problem.

Thanks