Hi guys,
Environment setup: we are running an e-commerce platform on Hybris (Tomcat), which uses JGroups version 3.4.1. Please note that we use TCP rather than UDP because AWS does not support multicast in our VPC.
On the last few occasions when we ran a discount sale, we had to scale up (auto scaling) the number of EC2 instances serving customer traffic - say 7 new nodes on top of the existing 3. This is where we face some strange behaviour: the number of DB connections grows dramatically (with around 12 customer-facing EC2 instances in the cluster it reaches 30,000 - 35,000 DB connections per EC2 instance), which leads to DB resource contention and eventually the application stops serving customer traffic.
Note that the increase in DB connections happens only while we are scaling up. Once all EC2 instances are up and running - let's say around 18 nodes in the cluster - the number of DB connections is back within normal limits, more or less around 100.
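To give an idea of how such per-instance counts can be observed from the database side, here is a minimal sketch. It is purely illustrative and assumes a MySQL-compatible database (the information_schema.PROCESSLIST query would differ for other vendors); the class name and arguments are hypothetical, not part of our platform.
import java.sql.*;

// Illustration only (our assumption): count DB connections grouped by client host,
// using MySQL's information_schema.PROCESSLIST. The query differs for other vendors.
public class ConnectionCount {
    public static void main(String[] args) throws Exception {
        String url = args[0], user = args[1], password = args[2];
        String sql = "SELECT SUBSTRING_INDEX(host, ':', 1) AS client, COUNT(*) AS conns"
                   + " FROM information_schema.PROCESSLIST GROUP BY client ORDER BY conns DESC";
        try (Connection con = DriverManager.getConnection(url, user, password);
             Statement st = con.createStatement();
             ResultSet rs = st.executeQuery(sql)) {
            while (rs.next())
                System.out.printf("%-40s %d%n", rs.getString("client"), rs.getLong("conns"));
        }
    }
}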
We have JDBC_PING configured in JGroups, and we suspect that some of the threads create a few thousand connections to the DB. Our jgroups-tcp.xml is below (a sketch of how we understand JDBC_PING to behave follows the config):
<config xmlns="urn:org:jgroups"
xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
xsi:schemaLocation="urn:org:jgroups http://www.jgroups.org/schema/JGroups-3.0.xsd">
<TCP loopback="true"
recv_buf_size="${tcp.recv_buf_size:20M}"
send_buf_size="${tcp.send_buf_size:640K}"
discard_incompatible_packets="true"
max_bundle_size="64K"
max_bundle_timeout="30"
enable_bundling="true"
use_send_queues="true"
sock_conn_timeout="300"
timer_type="new"
timer.min_threads="4"
timer.max_threads="10"
timer.keep_alive_time="3000"
timer.queue_max_size="500"
thread_pool.enabled="true"
thread_pool.min_threads="40"
thread_pool.max_threads="250"
thread_pool.keep_alive_time="5000"
thread_pool.queue_enabled="false"
thread_pool.queue_max_size="10000"
thread_pool.rejection_policy="discard"
oob_thread_pool.enabled="true"
oob_thread_pool.min_threads="5"
oob_thread_pool.max_threads="40"
oob_thread_pool.keep_alive_time="5000"
oob_thread_pool.queue_enabled="false"
oob_thread_pool.queue_max_size="10000"
oob_thread_pool.rejection_policy="discard"
bind_addr="${hybris.jgroups.bind_addr}"
bind_port="${hybris.jgroups.bind_port}" />
<JDBC_PING connection_driver="${hybris.database.driver}"
connection_password="${hybris.database.password}"
connection_username="${hybris.database.user}"
connection_url="${hybris.database.url}"
initialize_sql="${hybris.jgroups.schema}"
datasource_jndi_name="${hybris.datasource.jndi.name}"/>
<MERGE2 min_interval="10000" max_interval="30000" />
<FD_SOCK />
<FD timeout="3000" max_tries="3" />
<VERIFY_SUSPECT timeout="1500" />
<BARRIER />
<pbcast.NAKACK use_mcast_xmit="false" exponential_backoff="500" discard_delivered_msgs="true" />
<UNICAST />
<pbcast.STABLE stability_delay="1000" desired_avg_gossip="50000" max_bytes="4M" />
<pbcast.GMS print_local_addr="true" join_timeout="3000" view_bundling="true" />
<UFC max_credits="20M" min_threshold="0.6" />
<MFC max_credits="20M" min_threshold="0.4" />
<FRAG2 frag_size="60K" />
<pbcast.STATE_TRANSFER />
</config>
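For discussion, here is our rough mental model of what JDBC_PING does against the database on each discovery operation. This is only a simplified sketch in plain JDBC, not the actual JGroups 3.4.1 code, and the class, table and column names (JGROUPSPING, own_addr, cluster_name, ping_data) are placeholders - ours come from ${hybris.jgroups.schema} - so please correct us if this picture is wrong.
import java.sql.*;
import java.util.*;

// Simplified sketch (our understanding, possibly wrong) of one JDBC_PING discovery round:
// open a connection, read all rows for the cluster, write/refresh our own row, close.
// Class, table and column names are placeholders for discussion only.
public class JdbcPingSketch {
    List<String> discover(String url, String user, String password,
                          String clusterName, String ownAddr, byte[] pingData) throws SQLException {
        List<String> members = new ArrayList<>();
        // A new physical connection per discovery operation unless a pooled DataSource is used.
        try (Connection con = DriverManager.getConnection(url, user, password)) {
            try (PreparedStatement sel = con.prepareStatement(
                    "SELECT own_addr FROM JGROUPSPING WHERE cluster_name = ?")) {
                sel.setString(1, clusterName);
                try (ResultSet rs = sel.executeQuery()) {
                    while (rs.next())
                        members.add(rs.getString(1));
                }
            }
            try (PreparedStatement ins = con.prepareStatement(
                    "INSERT INTO JGROUPSPING (own_addr, cluster_name, ping_data) VALUES (?, ?, ?)")) {
                ins.setString(1, ownAddr);
                ins.setString(2, clusterName);
                ins.setBytes(3, pingData);
                ins.executeUpdate();
            }
        }
        return members;
    }
}
If that picture is roughly right, many nodes joining at the same time would each open their own short-lived connections for these reads and writes, which is why we suspect JDBC_PING.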
We have also attached a thread dump from one of the affected EC2 instances.
We captured some JGroups statistics with the JGroups Probe tool (JMX stats for MFC and NAKACK), although we do not fully understand what they tell us.
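For reference, probe.sh in the JGroups distribution just wraps org.jgroups.tests.Probe, so the way we requested these stats was roughly equivalent to the snippet below (reproduced from memory, so the exact arguments may differ; the wrapper class is only illustrative).
// Illustrative wrapper only: probe.sh ultimately calls org.jgroups.tests.Probe.
public class CollectStats {
    public static void main(String[] args) throws Exception {
        // "jmx=NAKACK" is the key echoed back in the NAKACK output below;
        // we ran it again with "jmx=MFC" for the MFC statistics.
        org.jgroups.tests.Probe.main(new String[] { "jmx=NAKACK" });
    }
}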
NAKACK statistics:
#1 (1959 bytes):
local_addr=hybrisnode-82243 [5fa277bc-8216-d941-7c67-0792b84f671e]
cluster=tu-broadcast
view=[hybrisnode-83113|2606] (14) [hybrisnode-83113, hybrisnode-0, hybrisnode-82178, hybrisnode-83218, hybrisnode-82237, hybrisnode-83219, hybrisnode-83245, hybrisnode-83226, hybrisnode-82191, hybrisnode-82243, hybrisnode-8330, hybrisnode-83164, hybrisnode-82248, hybrisnode-83233]
physical_addr=172.31.82.243:7800
jmx=NAKACK={num_messages_received=4725458, use_mcast_xmit=false, current_seqno=271042, ergonomics=true, xmit_table_max_compaction_time=600000, become_server_queue_size=50, non_member_messages=0, xmit_stagger_timeout=200, print_stability_history_on_failed_xmit=false, discard_delivered_msgs=true, suppress_time_non_member_warnings=60000, xmit_rsps_sent=2161, xmit_table_num_rows=5, stats=true, xmit_from_random_member=false, size_of_all_messages=1152, log_not_found_msgs=true, xmit_table_resize_factor=1.2, xmit_table_missing_messages=0, id=15, max_rebroadcast_timeout=2000, msgs=hybrisnode-82243:
hybrisnode-83113: [2894377 (2894377)]
hybrisnode-0: [47023917 (47023917)]
hybrisnode-83219: [579595 (579595)]
hybrisnode-83245: [251862 (251862)]
hybrisnode-82237: [2827706 (2827706)]
hybrisnode-82243: [271042 (271042) (size=17, missing=0, highest stability=271025)]
hybrisnode-8330: [7676 (7676)]
hybrisnode-82248: [733 (733)]
hybrisnode-83164: [596664 (596664)]
hybrisnode-82191: [263111 (263111)]
hybrisnode-82178: [760904 (760904)]
hybrisnode-83218: [947031 (947031)]
hybrisnode-83233: [206 (206)]
hybrisnode-83226: [368099 (368099)]
, pending_xmit_requests=0, xmit_table_size=17, xmit_table_msgs_per_row=10000, retransmit_timeout=[I@67cdc0ee, size_of_all_messages_incl_headers=2053, max_msg_batch_size=100, exponential_backoff=500, num_messages_sent=271033, log_discard_msgs=true, xmit_reqs_received=2161, name=NAKACK, xmit_rsps_received=1413, use_mcast_xmit_req=false, xmit_reqs_sent=1413, use_range_based_retransmitter=true}
version=3.4.1.Final
MFC statistics:
-- sending probe on /ff0e:0:0:0:0:0:75:75:7500
#1 (1208 bytes):
local_addr=hybrisnode-82243 [5fa277bc-8216-d941-7c67-0792b84f671e]
cluster=tu-broadcast
view=[hybrisnode-83113|2606] (14) [hybrisnode-83113, hybrisnode-0, hybrisnode-82178, hybrisnode-83218, hybrisnode-82237, hybrisnode-83219, hybrisnode-83245, hybrisnode-83226, hybrisnode-82191, hybrisnode-82243, hybrisnode-8330, hybrisnode-83164, hybrisnode-82248, hybrisnode-83233]
physical_addr=172.31.82.243:7800
jmx=MFC={ergonomics=true, average_time_blocked=0.0, number_of_credit_responses_sent=100, min_credits=8000000, ignore_synchronous_response=false, number_of_credit_requests_received=1, min_threshold=0.4, total_time_blocked=0, stats=true, receivers=hybrisnode-83113: 15207925
hybrisnode-0: 16166436
hybrisnode-83219: 16151370
hybrisnode-83245: 14427593
hybrisnode-82237: 19821107
hybrisnode-82243: 18221987
hybrisnode-8330: 17618119
hybrisnode-82248: 19950042
hybrisnode-83164: 11538170
hybrisnode-82191: 8629123
hybrisnode-82178: 17534244
hybrisnode-83218: 17558613
hybrisnode-83233: 19985889
hybrisnode-83226: 18462899
, max_credits=20000000, name=MFC, number_of_credit_requests_sent=0, number_of_credit_responses_received=106, id=44, number_of_blockings=0, max_block_time=5000}
version=3.4.1.Final
Our questions, for anyone who could help here, are:
- What could lead to such an increase in DB connections at the time we scale up the customer-facing EC2 instances from, let's say, 5 to 10 nodes? (Please also bear in mind that there are other instances in the cluster as well: instances for batch processing, and instances for employees managing cockpits and customer support.)
- If JGroups is somehow responsible for the increase in DB connections, how can we improve the configuration above?
Thank you kindly
Simeon