Scaling up EC2 instances in an AWS cluster with the JGroups protocol leads to an increased number of DB connections


Simeon Angelov

Aug 14, 2018, 12:04:48 PM
to jgroups-dev

Hi guys,

Setup of the environment: We are running an e-commerce platform on Hybris (Tomcat) that uses JGroups version 3.4.1. Please note that we are using TCP as opposed to UDP, because AWS does not support multicast on our VPC.

On the last few occasions when we ran a discount sale, we had to increase (via auto scaling) the number of EC2 instances that serve customer traffic, say 7 new nodes on top of the 3 we had before. This is where we are facing some weird issues: the number of DB connections gets significantly high (with around 12 customer-facing EC2 instances in the cluster, it goes up to 30,000 - 35,000 DB connections per EC2 instance), which leads to DB resource contention and eventually the application stops serving customer traffic.
Note: we see the increase in DB connections only while we are scaling up. Once all EC2 instances are up and running, say around 18 nodes in the cluster, the number of DB connections stays within normal limits, more or less around 100.

We have configured JDBC_PING in JGroups and we suspect some of the threads create a few thousand connections to the DB. Our jgroups-tcp.xml is provided below.


<config xmlns="urn:org:jgroups"
        xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
        xsi:schemaLocation="urn:org:jgroups http://www.jgroups.org/schema/JGroups-3.0.xsd">

    <TCP loopback="true"
         recv_buf_size="${tcp.recv_buf_size:20M}"
         send_buf_size="${tcp.send_buf_size:640K}"
         discard_incompatible_packets="true"
         max_bundle_size="64K"
         max_bundle_timeout="30"
         enable_bundling="true"
         use_send_queues="true"
         sock_conn_timeout="300"
         timer_type="new"
         timer.min_threads="4"
         timer.max_threads="10"
         timer.keep_alive_time="3000"
         timer.queue_max_size="500"
         thread_pool.enabled="true"
         thread_pool.min_threads="40"
         thread_pool.max_threads="250"
         thread_pool.keep_alive_time="5000"
         thread_pool.queue_enabled="false"
         thread_pool.queue_max_size="10000"
         thread_pool.rejection_policy="discard"
         oob_thread_pool.enabled="true"
         oob_thread_pool.min_threads="5"
         oob_thread_pool.max_threads="40"
         oob_thread_pool.keep_alive_time="5000"
         oob_thread_pool.queue_enabled="false"
         oob_thread_pool.queue_max_size="10000"
         oob_thread_pool.rejection_policy="discard"
         bind_addr="${hybris.jgroups.bind_addr}"
         bind_port="${hybris.jgroups.bind_port}" />

    <JDBC_PING connection_driver="${hybris.database.driver}"
               connection_password="${hybris.database.password}"
               connection_username="${hybris.database.user}"
               connection_url="${hybris.database.url}"
               initialize_sql="${hybris.jgroups.schema}"
               datasource_jndi_name="${hybris.datasource.jndi.name}"/>

    <MERGE2 min_interval="10000" max_interval="30000" />
    <FD_SOCK />
    <FD timeout="3000" max_tries="3" />
    <VERIFY_SUSPECT timeout="1500" />
    <BARRIER />
    <pbcast.NAKACK use_mcast_xmit="false" exponential_backoff="500" discard_delivered_msgs="true" />

    <UNICAST />
    <pbcast.STABLE stability_delay="1000" desired_avg_gossip="50000" max_bytes="4M" />
    <pbcast.GMS print_local_addr="true" join_timeout="3000" view_bundling="true" />

    <UFC max_credits="20M" min_threshold="0.6" />
    <MFC max_credits="20M" min_threshold="0.4" />

    <FRAG2 frag_size="60K" />
    <pbcast.STATE_TRANSFER />

</config>

We have also attached a thread dump from one of the affected EC2 instances.

We also captured some JGroups statistics with the JGroups Probe tool (the JMX stats of MFC and NAKACK), although we do not fully understand what they tell us.
NAKACK statistics:

#1 (1959 bytes):
local_addr=hybrisnode-82243 [5fa277bc-8216-d941-7c67-0792b84f671e]
cluster=tu-broadcast
view=[hybrisnode-83113|2606] (14) [hybrisnode-83113, hybrisnode-0, hybrisnode-82178, hybrisnode-83218, hybrisnode-82237, hybrisnode-83219, hybrisnode-83245, hybrisnode-83226, hybrisnode-82191, hybrisnode-82243, hybrisnode-8330, hybrisnode-83164, hybrisnode-82248, hybrisnode-83233]
physical_addr=172.31.82.243:7800

jmx=NAKACK={num_messages_received=4725458, use_mcast_xmit=false, current_seqno=271042, ergonomics=true, xmit_table_max_compaction_time=600000, become_server_queue_size=50, non_member_messages=0, xmit_stagger_timeout=200, print_stability_history_on_failed_xmit=false, discard_delivered_msgs=true, suppress_time_non_member_warnings=60000, xmit_rsps_sent=2161, xmit_table_num_rows=5, stats=true, xmit_from_random_member=false, size_of_all_messages=1152, log_not_found_msgs=true, xmit_table_resize_factor=1.2, xmit_table_missing_messages=0, id=15, max_rebroadcast_timeout=2000, msgs=hybrisnode-82243:

hybrisnode-83113: [2894377 (2894377)]
hybrisnode-0: [47023917 (47023917)]
hybrisnode-83219: [579595 (579595)]
hybrisnode-83245: [251862 (251862)]
hybrisnode-82237: [2827706 (2827706)]
hybrisnode-82243: [271042 (271042) (size=17, missing=0, highest stability=271025)]
hybrisnode-8330: [7676 (7676)]
hybrisnode-82248: [733 (733)]
hybrisnode-83164: [596664 (596664)]
hybrisnode-82191: [263111 (263111)]
hybrisnode-82178: [760904 (760904)]
hybrisnode-83218: [947031 (947031)]
hybrisnode-83233: [206 (206)]
hybrisnode-83226: [368099 (368099)]

, pending_xmit_requests=0, xmit_table_size=17, xmit_table_msgs_per_row=10000, retransmit_timeout=[I@67cdc0ee, size_of_all_messages_incl_headers=2053, max_msg_batch_size=100, exponential_backoff=500, num_messages_sent=271033, log_discard_msgs=true, xmit_reqs_received=2161, name=NAKACK, xmit_rsps_received=1413, use_mcast_xmit_req=false, xmit_reqs_sent=1413, use_range_based_retransmitter=true}

version=3.4.1.Final

MFC statistics: -- sending probe on /ff0e:0:0:0:0:0:75:75:7500


#1 (1208 bytes):

local_addr=hybrisnode-82243 [5fa277bc-8216-d941-7c67-0792b84f671e]
cluster=tu-broadcast
view=[hybrisnode-83113|2606] (14) [hybrisnode-83113, hybrisnode-0, hybrisnode-82178, hybrisnode-83218, hybrisnode-82237, hybrisnode-83219, hybrisnode-83245, hybrisnode-83226, hybrisnode-82191, hybrisnode-82243, hybrisnode-8330, hybrisnode-83164, hybrisnode-82248, hybrisnode-83233]
physical_addr=172.31.82.243:7800

jmx=MFC={ergonomics=true, average_time_blocked=0.0, number_of_credit_responses_sent=100, min_credits=8000000, ignore_synchronous_response=false, number_of_credit_requests_received=1, min_threshold=0.4, total_time_blocked=0, stats=true, receivers=hybrisnode-83113: 15207925

hybrisnode-0: 16166436
hybrisnode-83219: 16151370
hybrisnode-83245: 14427593
hybrisnode-82237: 19821107
hybrisnode-82243: 18221987
hybrisnode-8330: 17618119
hybrisnode-82248: 19950042
hybrisnode-83164: 11538170
hybrisnode-82191: 8629123
hybrisnode-82178: 17534244
hybrisnode-83218: 17558613
hybrisnode-83233: 19985889
hybrisnode-83226: 18462899
, max_credits=20000000, name=MFC, number_of_credit_requests_sent=0, number_of_credit_responses_received=106, id=44, number_of_blockings=0, max_block_time=5000}

version=3.4.1.Final

Our questions, for anyone who could help here, are:
- What could lead to such an increase in DB connections at the time when we scale up the customer-facing EC2 instances from, let's say, 5 to 10 nodes? (Please also bear in mind that we have other instances in the cluster: instances for batch processing, and instances for employees managing cockpits and customer support.)
- If JGroups is somehow causing the increase in DB connections, how can we improve our configuration above?


Thank you kindly
Simeon
jstack.5321.005422.280521129

Bela Ban

Aug 15, 2018, 4:51:48 AM
to jgrou...@googlegroups.com


On 14/08/18 18:04, Simeon Angelov wrote:
>
> Hi guys,
>
> Setup of the environment: We are running an e-commerce platform on
> Hybris (Tomcat) that uses JGroups version 3.4.1.

3.4.1 is ~7 years old; please use a more recent version, e.g. 3.6.16.
There have been quite a few fixes to JDBC_PING, and your version is missing
most of them.
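
If your build pulls the JGroups jar in via Maven (just an assumption on my
part; Hybris may bundle it differently), the upgrade would be a one-line
dependency change along these lines:

<!-- Illustrative Maven dependency for the suggested upgrade; adjust to
     however your platform actually resolves the JGroups jar. -->
<dependency>
    <groupId>org.jgroups</groupId>
    <artifactId>jgroups</artifactId>
    <version>3.6.16.Final</version>
</dependency>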

This assumes that JDBC_PING is the one creating the DB connections -
have you verified that?

> Please note that we
> are using TCP as opposed to UDP, because AWS does not support multicast on
> our VPC.


That should not be a problem.


> On the last few occasions when we ran a discount sale, we had to
> increase (via auto scaling) the number of EC2 instances that serve customer
> traffic, say 7 new nodes on top of the 3 we had before. This is where we
> are facing some weird issues: the number of DB connections gets
> significantly high (with around 12 customer-facing EC2 instances in the
> cluster, it goes up to 30,000 - 35,000 DB connections per EC2 instance),
> which leads to DB resource contention and eventually the application
> stops serving customer traffic.
> Note: we see the increase in DB connections only while we are scaling
> up. Once all EC2 instances are up and running, say around 18 nodes in the
> cluster, the number of DB connections stays within normal limits, more or
> less around 100.

OK.

> We have configured JDBC_PING in JGroups and we suspect some of the threads
> create a few thousand connections to the DB.

Take a thread dump at the time this is happening, to see which threads
are causing this.
Do you have a valid datasource configured? If so, you don't need the
other attributes in JDBC_PING!
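
Something along these lines should be enough in that case (a sketch only,
reusing your existing property placeholders; keep initialize_sql if you need
JDBC_PING to create its table with a custom statement):

<JDBC_PING datasource_jndi_name="${hybris.datasource.jndi.name}"
           initialize_sql="${hybris.jgroups.schema}" />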


> <MERGE2 min_interval="10000" max_interval="30000" />
> <FD_SOCK />
> <FD timeout="3000" max_tries="3" />
> <VERIFY_SUSPECT timeout="1500" />
> <BARRIER />
> <pbcast.NAKACK use_mcast_xmit="false" exponential_backoff="500"
> discard_delivered_msgs="true" />
>
> <UNICAST />
> <pbcast.STABLE stability_delay="1000" desired_avg_gossip="50000"
> max_bytes="4M" />
> <pbcast.GMS print_local_addr="true" join_timeout="3000"
> view_bundling="true" />
>
> <UFC max_credits="20M" min_threshold="0.6" />
> <MFC max_credits="20M" min_threshold="0.4" />
>
> <FRAG2 frag_size="60K" />
> <pbcast.STATE_TRANSFER />
>
> </config>


> Our questions, for anyone who could help here, are:
> - What could lead to such an increase in DB connections at the time
> when we scale up the customer-facing EC2 instances from, let's say, 5 to 10
> nodes? (Please also bear in mind that we have other instances in
> the cluster: instances for batch processing, and instances for employees
> managing cockpits and customer support.)


I suggest reproducing this in a testing environment and then switching to
3.6.16. If the issue disappears, then it must be a bug in 3.4.1. If the issue
is still there, then I'm interested in looking into it.

If you can't switch to a more recent version, then I'm afraid I won't be
able to help you, as I don't want to invest time to find a bug that's
been fixed a long time ago. In that case, an alternative might be to use
an AWS-centric discovery protocol, such as S3_PING or NATIVE_S3_PING [1].

[1] http://www.jgroups.org/manual/index.html#DiscoveryProtocols
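
For illustration, plain S3_PING would replace JDBC_PING with a stanza roughly
like the following; the bucket name and credential properties are placeholders,
not taken from your setup. NATIVE_S3_PING is configured similarly, but uses the
AWS SDK and can typically pick up IAM instance-profile credentials instead of
explicit keys.

<S3_PING location="my-discovery-bucket"
         access_key="${aws.access_key}"
         secret_access_key="${aws.secret_key}" />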

> - If JGroups is somehow causing the increase in DB connections,
> how can we improve our configuration above?
>
>
> Thank you kindly
> Simeon
>

--
Bela Ban | http://www.jgroups.org

Simeon Angelov

Oct 30, 2018, 5:55:35 AM
to jgroups-dev


Thank you, Bela, for your helpful input.

We took some of the actions you proposed; please find my comments below.

On Wednesday, 15 August 2018 09:51:48 UTC+1, Bela Ban wrote:


On 14/08/18 18:04, Simeon Angelov wrote:
>
> Hi guys,
>
> Setup of the environment: We are running an e-commerce platform on
> Hybris (Tomcat) that uses JGroups version 3.4.1.

3.4.1 is ~7 years old; please use a more recent version, e.g. 3.6.16.
There have been quite a few fixes to JDBC_PING, and your version is missing
most of them.


Right, we upgraded the library to 3.6.16 and it is already in production.
 

This assumes that JDBC_PING is the one creating the DB connections -
have you verified that?

This assumption turned out to be wrong, as per our investigations. The DB was our constraint in the end: the Oracle instance we are using seems to accept at most around 1,500 connections.
Once we reached 15 app nodes serving user traffic, and taking into account the DB connections from all the other nodes we have (batch, admin nodes, etc.), we pushed the total past 1,700 connections, which caused our DB to get stuck. At the steady-state figure of roughly 100 connections per node mentioned earlier, 15 customer-facing nodes alone already account for about 1,500 connections, before the batch and admin nodes are counted.
The action we have taken so far is to cap our application nodes at a maximum of 10. Even with high CPU from time to time during sales periods, it looks to be working so far. Of course, more optimizations and actions will be taken.
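
One further option along these lines would be capping the per-node connection
pool itself, so that node count times pool size stays under the Oracle limit.
A minimal sketch, assuming purely for illustration that the JNDI datasource
behind datasource_jndi_name were defined as a plain Tomcat Resource; Hybris may
wire its pool differently, so the resource name and limits below are
hypothetical:

<!-- Illustrative only: 10 nodes x maxActive="100" stays below ~1,500
     connections in total. -->
<Resource name="jdbc/hybrisDS"
          auth="Container"
          type="javax.sql.DataSource"
          factory="org.apache.tomcat.jdbc.pool.DataSourceFactory"
          driverClassName="oracle.jdbc.OracleDriver"
          url="${hybris.database.url}"
          username="${hybris.database.user}"
          password="${hybris.database.password}"
          maxActive="100"
          maxIdle="20"
          maxWait="10000" />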


 

> Please note that we
> are using TCP as opposed to UDP, because AWS does not support multicast on
> our VPC.


That should not be a problem.


> On the last few occasions when we ran a discount sale, we had to
> increase (via auto scaling) the number of EC2 instances that serve customer
> traffic, say 7 new nodes on top of the 3 we had before. This is where we
> are facing some weird issues: the number of DB connections gets
> significantly high (with around 12 customer-facing EC2 instances in the
> cluster, it goes up to 30,000 - 35,000 DB connections per EC2 instance),
> which leads to DB resource contention and eventually the application
> stops serving customer traffic.
> Note: we see the increase in DB connections only while we are scaling
> up. Once all EC2 instances are up and running, say around 18 nodes in the
> cluster, the number of DB connections stays within normal limits, more or
> less around 100.

OK.

> We have configured JDBC_PING in JGroups and we suspect some of the threads
> create a few thousand connections to the DB.

Take a thread dump at the time this is happening, to see which threads
are causing this.


Here, we were actually looking at the JMX-reported DB connections, which showed more than 15,000; those were not actual DB connections. Our mistake was not looking at the right metric.
During our last sale, where we had at most 10 application nodes, we did not face this issue again, as the actual number of DB connections no longer had the chance to grow past the limit.
 
 