Network Partitioning issue in PRD env


Kushagra Bindal

Jul 22, 2020, 9:27:26 AM7/22/20
to rabbitm...@googlegroups.com
Hi Experts,

We recently upgraded the production environment to RabbitMQ 3.8.3 with Erlang 22.2.8, on 15 July 2020.

In the last 7 days we have encountered multiple instances of network partitioning and have had to perform a Blue/Green deployment multiple times. Below is a screenshot of the logs that I received from production.


image.png

I am trying to get additional details from production. Please let me know if there are any specific details I should extract from the production environment that would help identify the root cause of the problem.

Below are the rabbitmq.conf entries in production.

# ======================================
# RabbitMQ Logging section
# ======================================
log.connection.level = error
log.channel.level = error
log.queue.level = error
log.mirroring.level = error
log.federation.level = error
log.upgrade.level = error
log.exchange.level = info

# ======================================
# RabbitMQ properties section
# ======================================
queue_master_locator = min-masters
collect_statistics_interval = 60000
cluster_partition_handling = autoheal
vm_memory_high_watermark.relative = 0.66

# =============================================================
# RabbitMQ (amq.rabbitmq.log) exchange section
# To view the amq.rabbitmq.log exchange in RabbitMQ new version
# =============================================================
log.exchange = true


Below is the content of advanced.config.

[
  {rabbit, [
    {credit_flow_default_credit, {200, 50}}
  ]}
].

Can you please suggest any specific setting or configuration that should be applied to address this network partitioning issue?
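
From the clustering documentation, pause_minority looks like an alternative to autoheal for clusters of three or more nodes. Would something along these lines be worth evaluating? This is only a sketch on my side (assuming we keep an odd number of nodes); I have not tried it yet:

# alternative partition handling strategy (sketch, not yet tested by us)
cluster_partition_handling = pause_minority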


Meanwhile I am trying to extract more details from production.

--
Regards,
Kushagra

Kushagra Bindal

Jul 23, 2020, 7:36:29 AM7/23/20
to rabbitm...@googlegroups.com
Hi,

Please find attached the Splunk messages from the two PRD environments.

On PRD we are using an n1-standard-8 deployment (8 CPUs, 30 GB memory) with only ~600 queues, ~40 vhosts, and ~100 exchanges.

I have already shared rabbitmq.conf and advanced.config. Please let me know if any additional information is needed from my side.

Please help us resolve this issue, and let us know how we can avoid this behavior or recover the environment from it.


--
Regards,
Kushagra
File1.json
File2.json

Luke Bakken

Jul 23, 2020, 4:26:55 PM7/23/20
to rabbitmq-users
Hi Kushagra,

This is the relevant error message -

2020-07-18 01:35:15.217 [info] <0.532.0> node 'rabbit@cust01-prd02-ins01-dmq10-app-1594705498-2' down: connection_closed

The TCP connection between the nodes that is used for distributed Erlang communication was closed somehow. This is the port 25672 connection.
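
If you want to double-check the inter-node listener and basic connectivity on each node, rabbitmq-diagnostics can help. A rough sketch, assuming a default 3.8 installation where clustering uses port 25672:

rabbitmq-diagnostics listeners
rabbitmq-diagnostics check_port_connectivity

The first command lists all listeners, including the inter-node (clustering) one; the second verifies that the node's listener ports accept TCP connections.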

Thanks,
Luke

Kushagra Bindal

Jul 24, 2020, 4:14:09 AM7/24/20
to rabbitm...@googlegroups.com
Hi Luke,

Thanks for your response. 

I just received information from the support team that, due to the continuous issues, they restarted the RabbitMQ nodes. So some of the logs may belong to those occurrences.

This morning we encountered the network partitioning issue again. I have attached the Splunk messages for that incident.

If possible, please suggest a resolution or configuration changes for the problem we are facing so frequently.

If you need any additional information, please let me know as well.

PS: With 3.6.10, network partitioning issues used to occur only once in a while (roughly every 6 months).


One more thing comes to mind: is there a known issue in Erlang 22.2.8, or any newly introduced configuration that works in combination with autoheal and could solve the problem?

I have seen the tick time (net_ticktime) mentioned somewhere; does it play any role here?
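
If raising the tick time is worth trying, would extending our advanced.config like this be the right place for it? This is only my sketch of what I understand from the docs (the kernel application's net_ticktime, in seconds; the value below is just illustrative):

[
  {kernel, [
    {net_ticktime, 120}
  ]},
  {rabbit, [
    {credit_flow_default_credit, {200, 50}}
  ]}
].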

For local debugging, is there a way I can reproduce this behavior on my local server? Then I could try your suggested solution on my local deployment for a quicker turnaround.
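
One idea I had (my own assumption, not something from the docs) is to block the inter-node port with iptables on one node of a local multi-node test cluster and see whether the same partition/autoheal sequence appears in the logs:

# on one node of a local multi-node test cluster
sudo iptables -A INPUT -p tcp --dport 25672 -j DROP
sleep 150    # longer than the default 60s net_ticktime, so the peer is reported down
sudo iptables -D INPUT -p tcp --dport 25672 -j DROP

Would that be a realistic way to simulate what we are seeing in production?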

Pardon me for raising so many questions in a single email, but since this is a production blocker, it is getting critical for us.





--
Regards,
Kushagra
rabbit.json

barak spoj

Aug 2, 2020, 10:40:22 AM8/2/20
to rabbitmq-users
We have encountered a similar issue on one of our deployments. I was only able to reproduce it in a test environment by killing the port 25672 TCP connection manually.
Does anyone have an idea of how to monitor that TCP connection in our production environment, so we can trace what is causing the problem in the first place?
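
The only thing I could think of so far (an assumption on our side, not something we have in place yet) is to periodically dump the socket state and retransmission counters for the distribution port with ss, for example:

ss -tni '( sport = :25672 or dport = :25672 )'

and keep that output next to the RabbitMQ logs so we can correlate it when the next partition happens. Is there a better way?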

Kushagra Bindal

Aug 2, 2020, 12:19:21 PM8/2/20
to rabbitm...@googlegroups.com
Hi Barak,

We are still trying to identify the root cause of the problem.

Could anyone from the group please help identify the root cause, and suggest what else we can try with autoheal to solve the network partitioning problem?


barak spoj

Aug 3, 2020, 5:01:01 AM8/3/20
to rabbitmq-users
We used tcpdump and found that the TCP connection received an ECN flag, which indicates network congestion. Machine I/O seemed fine at the time of the incident, but we use shared resources, so it is really hard to tell what causes it. By the way, it reproduces regularly about once a day, not always at the same time.
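
In case anyone wants to look at the same thing on their side: not necessarily the exact command we used, but a capture along these lines is enough to record the inter-node traffic and then inspect the ECN-related TCP flags in Wireshark afterwards:

sudo tcpdump -i any -w dist-25672.pcap 'tcp port 25672'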

