RabbitMQ 3.8.9 large number of Erlang processes


Vladislav Dermendzhiev

unread,
Jan 4, 2021, 9:54:09 AM1/4/21
to rabbitm...@googlegroups.com
Hello Everybody,

First time posting a message in this group.

I'm running a 3-node RabbitMQ 3.8.9 cluster (Erlang/OTP 23), deployed on Kubernetes using the rabbitmq/rabbitmq:3.8.9-management-alpine image from Docker Hub.

The cluster uses rabbitmq-peer-discovery-k8s for cluster formation.
TLS is enabled for cluster node communication.

The cluster has 100 quorum queues, 10 connections and 50 channels.

The environment is not in production and rarely sees any message traffic.

Despite this, around a week after the cluster is started, each node climbs above 300,000 Erlang processes. Each process uses only a small amount of memory, but in aggregate this drives the nodes to the memory high watermark, which blocks the publishers.
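For scale, a back-of-the-envelope check shows why that many idle processes matter. The ~3 KB-per-process figure and the 2 GiB container limit below are assumptions for illustration, not measurements from this cluster; 0.4 is RabbitMQ's default vm_memory_high_watermark:

```python
# Rough estimate: why 300,000 small Erlang processes still matter.
# Per-process footprint and node RAM are assumed values, not measured.
PROCESSES = 300_000
BYTES_PER_PROCESS = 3 * 1024          # assumed idle-process footprint (~3 KB)
WATERMARK_FRACTION = 0.4              # RabbitMQ's default vm_memory_high_watermark
NODE_RAM = 2 * 1024**3                # hypothetical 2 GiB container limit

process_memory = PROCESSES * BYTES_PER_PROCESS
watermark = WATERMARK_FRACTION * NODE_RAM

print(f"process memory: {process_memory / 1024**2:.0f} MiB")
print(f"watermark:      {watermark / 1024**2:.0f} MiB")
print("over watermark:", process_memory > watermark)
```

Under those assumptions the idle processes alone already exceed the watermark, which matches the blocked-publisher symptom.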

The memory breakdown reports most of the memory as occupied by "other_processes".

I've used "rabbitmq-top" and "rabbitmq-diagnostics observer", but neither points me to what is creating so many Erlang processes.

There is no connection churn.

Node logs do not show any relevant information.

Could you please give me some advice on how to pinpoint what is creating so many active Erlang processes?

Thank You!

Best Regards,
Vlad



Michal Kuratczyk

unread,
Jan 4, 2021, 10:08:30 AM1/4/21
to rabbitm...@googlegroups.com
1. Have you tried using the official Kubernetes Operator (https://github.com/rabbitmq/cluster-operator)? We haven't seen such issues, so most likely you can just migrate to the Operator to solve the problem.

2. Watch https://www.youtube.com/channel/UCSg9GRMGAo7euj3baJi4dOg for details on Kubernetes deployments, memory usage debugging and other related topics.

3. Share your setup/YAML so that we can see how you configured RabbitMQ and the deployment (e.g. the readiness probes).

Best,



--
Michał

Luke Bakken

unread,
Jan 4, 2021, 2:13:59 PM1/4/21
to rabbitmq-users
Hi Vlad,

Since this isn't a production system I would be curious to know what processes are accumulating. If you can run this command on a node that has a large number of processes it will collect some information for me. Please redirect the output to a file and attach the file to your response:

rabbitmqctl eval 'M0=maps:new(),lists:foldl(fun(P,M)->[_,{name,N}]=rabbit_top_util:obtain_name(P),C=maps:get(N,M,0),maps:put(N,C+1,M)end,M0,erlang:processes()).'
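If the map that command prints is large, one quick way to rank the counts is a small script. This is a sketch, not part of the thread's tooling; the sample string mimics the Erlang map the command prints, and in practice you would read the redirected file instead:

```python
import re

# Rank process names by count from the Erlang map printed by the
# rabbitmqctl eval command above. Sample string stands in for the file.
sample = (
    '#{rabbit_log_prelaunch_lager_event => 1,'
    'ssl_manager_dist => 1,'
    '<<"{ssl_server_session_cache,init,1}">> => 225551}'
)

# Match either a binary name (<<"...">>) or a plain atom before "=>".
pairs = re.findall(r'(<<"[^"]*">>|[A-Za-z_]\w*)\s*=>\s*(\d+)', sample)
ranked = sorted(((int(count), name) for name, count in pairs), reverse=True)

for count, name in ranked[:3]:
    print(count, name)
```

The largest entry is the process type worth investigating first.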

Thanks,
Luke

Vladislav Dermendzhiev

unread,
Jan 5, 2021, 5:32:15 AM1/5/21
to rabbitmq-users
Hi Luke,

The attached file erlang_proc_list_sanitized.txt (inside the zip) contains the output of the command (I've obfuscated the queue names a bit).
The node on which the command was run had around 226000 Erlang processes. It is a dev env. There is very little message traffic.
Right away this grabbed my attention: "{ssl_server_session_cache,init,1}">> => 225551
We have ssl enabled on:
- amqp port (currently not used by publishers/consumers; the default non-TLS port is used for now)
- internode cluster communication
- management api
- prometheus metrics

The SSL certificates used are autogenerated by Kubernetes and self-signed. The RabbitMQ cluster is not exposed outside Kubernetes.

Attached (inside the zip) are the config files that have ssl related configurations.

I just noticed that I have an incorrect setting in rabbitmq-env.conf: ERL_SSL_PATH="/usr/local/lib/erlang/lib/ssl-9.4/ebin"
The correct path for the container we are using should be: ERL_SSL_PATH="/usr/local/lib/erlang/lib/ssl-10.1/ebin"

I do not know if this is causing the issue. In any case, I'll apply the correct config and monitor the Erlang process count.
logs_and_config_files.zip

Vladislav Dermendzhiev

unread,
Jan 5, 2021, 7:42:48 AM1/5/21
to rabbitmq-users
Hi Luke,

I've set ERL_SSL_PATH to the correct value "/usr/local/lib/erlang/lib/ssl-10.1/ebin", but after the RabbitMQ nodes are restarted the Erlang process count still keeps slowly increasing.

Yesterday I ran an experiment and disabled TLS for the inter-node cluster communication on another, identical environment. So far the Erlang process count on that environment has not increased.
It seems the issue appears only when inter-node TLS is enabled.

Best Regards,
Vlad

Luke Bakken

unread,
Jan 5, 2021, 10:36:25 AM1/5/21
to rabbitmq-users
Hi Vlad,

I really appreciate the detailed diagnosis and information. My guess is that there is some Kubernetes health check that queries the HTTP API and causes this behavior. I will see if I can reproduce it on my workstation today.

Can you check to see if there are many open TCP connections on the node experiencing the high process count? I wonder if there are many open TLS connections.

Thanks,
Luke

Luke Bakken

unread,
Jan 5, 2021, 10:40:18 AM1/5/21
to rabbitmq-users
Hi again,

I suspect that the suggestions in this message will fix your issue:


You can add the settings to your rabbitmq.conf file as follows:

ssl_options.reuse_sessions = false
management.ssl.reuse_sessions = false
management.listener.ssl_opts.reuse_sessions = false

Let me know if the above fixes your issue. I'll try it here as well.
Luke

Vladislav Dermendzhiev

unread,
Jan 5, 2021, 11:18:41 AM1/5/21
to rabbitmq-users
Hi Luke,

The health checks Kubernetes runs are "rabbitmq-diagnostics ping" and "rabbitmq-diagnostics status", every 10 seconds.
I've checked the TCP connections on one of the nodes, but we do not have a high count.

I've applied the suggested configs. Let's give it until tomorrow and see how the process count goes.

Thank You very much!

Best Regards,
Vlad

Vladislav Dermendzhiev

unread,
Jan 6, 2021, 5:48:11 AM1/6/21
to rabbitmq-users
Hello Luke,

Unfortunately, the three settings above did not solve the issue.

In addition to them, I've added "-ssl session_lifetime 120" and also "{reuse_sessions, false}" in the inter-node TLS options file, but the issue still persists.
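For reference, a sketch of where those two knobs live (paths and file names below are placeholders, not this cluster's actual configuration): inter-node TLS is typically wired up through SERVER_ADDITIONAL_ERL_ARGS in rabbitmq-env.conf, e.g. `-proto_dist inet_tls -ssl_dist_optfile /path/to/inter_node_tls.config -ssl session_lifetime 120`, with the optfile shaped along these lines:

```erlang
%% inter_node_tls.config -- sketch only; certificate paths are placeholders
[{server, [{certfile, "/path/to/tls.crt"},
           {keyfile,  "/path/to/tls.key"},
           {reuse_sessions, false}]},
 {client, [{reuse_sessions, false}]}].
```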

Best Regards,
Vlad

Luke Bakken

unread,
Jan 6, 2021, 9:02:56 AM1/6/21
to rabbitmq-users
Thank you for following up, Vlad.

Could you please describe anything that interacts with your cluster? AMQP clients, HTTP API requests, etc. You mentioned the two commands that k8s uses for health checks.

The reason I ask is that it should help me reproduce the issue locally.

Thanks,
Luke

Vladislav Dermendzhiev

unread,
Jan 6, 2021, 9:59:28 AM1/6/21
to rabbitmq-users
Hello Luke,

we have the following interactions with the cluster:

- AMQP clients (not using TLS here). The Erlang process count increases even without message traffic.
- "rabbitmq-diagnostics ping" and "rabbitmq-diagnostics status" every 10 seconds for Kubernetes health checks
- RabbitMQ Management GUI used manually
- we scrape the Prometheus metrics port every 30 seconds
- inter-node cluster communication
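Assuming (this is an assumption, not something established in the thread) that each periodic TLS contact leaves one ssl_server_session_cache process behind, the polling rates above are roughly the right order of magnitude for the counts observed after about a week:

```python
# Back-of-the-envelope: leaked processes per node per day, assuming
# one leaked ssl_server_session_cache process per TLS contact.
SECONDS_PER_DAY = 24 * 60 * 60

health_checks = 2 * SECONDS_PER_DAY // 10   # ping + status, every 10 s
prometheus    = SECONDS_PER_DAY // 30       # one scrape every 30 s

per_day = health_checks + prometheus
print(per_day, "per day ->", 7 * per_day, "per week")
```

That lands in the low hundreds of thousands per week before counting inter-node traffic and manual Management UI use, consistent with the ~226,000 processes reported above.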

Best Regards,
Vlad

Luke Bakken

unread,
Jan 8, 2021, 3:07:47 PM1/8/21
to rabbitmq-users
Hi Vlad,

I can reproduce this issue. It seems that any rabbitmqctl command executed will cause this counter to go up:

RABBITMQ_CONF_ENV_FILE=/home/lbakken/issues/rabbitmq-users/tls-session-explosion-47U6D9Hbhrg/repo/rabbitmq-env.conf ./sbin/rabbitmqctl eval 'M0=maps:new(),lists:foldl(fun(P,M)->[_,{name,N}]=rabbit_top_util:obtain_name(P),C=maps:get(N,M,0),maps:put(N,C+1,M)end,M0,erlang:processes()).' | fgrep session_cache
  dtls_server_session_cache_sup => 1,rabbit_log_prelaunch_lager_event => 1,
  ssl_server_session_cache_supdist => 1,ssl_manager_dist => 1,
  ssl_server_session_cache_sup => 1,
  <<"{ssl_server_session_cache,init,1}">> => 11,

More than likely this is a bug in Erlang/OTP itself. I'll continue investigating to see if I can find a workaround.
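The failure mode, reduced to a toy model (pure illustration, not OTP internals): a per-connection session-cache "process" is spawned on each TLS connection but never terminated when the connection closes, so short-lived connections accumulate state forever:

```python
# Toy model of the suspected leak: the session cache entry survives
# connection close on the "leaky" node, but is cleaned up on the fixed one.
class Node:
    def __init__(self, leaky):
        self.leaky = leaky
        self.caches = set()

    def tls_connect(self, conn_id):
        self.caches.add(conn_id)          # cache "process" spawned

    def tls_close(self, conn_id):
        if not self.leaky:
            self.caches.discard(conn_id)  # fixed behavior: terminated on close

leaky, fixed = Node(leaky=True), Node(leaky=False)
for i in range(1000):                     # e.g. periodic CLI health checks
    for node in (leaky, fixed):
        node.tls_connect(i)
        node.tls_close(i)

print(len(leaky.caches), len(fixed.caches))   # 1000 vs 0
```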

Thanks,
Luke


Luke Bakken

unread,
Jan 8, 2021, 3:32:14 PM1/8/21
to rabbitmq-users
Hi again Vlad,

I have confirmed that this is a bug in Erlang - https://bugs.erlang.org/browse/ERL-1458

I haven't found a way to work around this issue. Hopefully we'll get some input from the Erlang team about it.

Have a great weekend!
Luke

Vladislav Dermendzhiev

unread,
Jan 11, 2021, 3:54:27 AM1/11/21
to rabbitmq-users
Luke, thank You very much for the help!

Best Regards,
Vlad

Vladislav Dermendzhiev

unread,
Feb 17, 2021, 7:45:27 AM2/17/21
to rabbitmq-users
Hello Luke,

According to ERL-1458, the bug is fixed in OTP 23.2.4.

The RabbitMQ 3.8.12 image on Docker Hub uses OTP 23.2.5 (per its Dockerfile).

So I can use the 3.8.12 image, right?

Best Regards,
Vlad

Lucía Cheung

unread,
Feb 28, 2021, 10:08:49 PM2/28/21
to rabbitmq-users
I was having this problem in version 3.8.8. Upgrading to 3.8.12 solved it.
Thank you guys!

Lucía