RabbitMQ intermittent memory rise and crash

Danijel Bosnjak

Feb 15, 2022, 10:08:00 AM
to rabbitmq-users
Hi,

we have issues in an environment running RabbitMQ 3.9.11-1 and Erlang 24.2-1 on Rocky Linux 8.5. Everything works fine for a few days, but then the node starts to show the following:

* other_mem starts to rise
see attached below

* number of connections starts to rise and fall as publishers are slowed down
see attached below

* it hits the memory high watermark, but memory continues to rise (e.g. the high watermark was hit at 14:00, yet memory kept rising)
see attached below:

* rabbit_event memory usage and its Erlang mailbox grow
see attached below:


* and it usually ends with a segmentation fault

[Mon Jan 24 16:18:30 2022] 9_scheduler[2241837]: segfault at 7f180bdd7cb8 ip 000055bd9a3a374a sp 00007f1a94c35780 error 4 in beam.smp[55bd9a209000+418000]

Jan 24 16:19:11 bogus-server.lan systemd[1]: rabbitmq-server.service: Main process exited, code=killed, status=11/SEGV

We tried to reproduce the behaviour in a test environment, even with a higher number of publishers and reduced resources, but had no luck.

We have tuned the following kernel parameters:

net.ipv4.tcp_fin_timeout = 30
net.ipv4.tcp_keepalive_time = 30
net.ipv4.tcp_keepalive_intvl = 10
net.ipv4.tcp_keepalive_probes = 4
net.ipv4.tcp_tw_reuse = 1

and additional Erlang arguments:

RABBITMQ_SERVER_ADDITIONAL_ERL_ARGS="+sbwt none +MBlmbcs 8192 +MHlmbcs 8192 +MMscs 4096"
and RabbitMQ config:

tcp_listen_options.backlog = 4096
tcp_listen_options.buffer = 65536
tcp_listen_options.sndbuf = 65536
tcp_listen_options.recbuf = 65536
log.file.level = critical

[
  {kernel, [
    {inet_default_connect_options, [{nodelay, true}]},
    {inet_default_listen_options,  [{nodelay, true}]}
           ]
  }
].

Do you have any recommendations on where to start next and what could be causing this behaviour?

Thnx in advance!

number_of_conn.png
mem_usage.png
top_process.png
other_mem.png

jo...@cloudamqp.com

Feb 15, 2022, 5:37:14 PM
to rabbitmq-users
In my experience "rabbit_event" can end up in this state when there are _a lot_ of events that need to be processed (usually from high connection/channel/consumer churn).
- What do your churn statistics look like?
- Are you logging _a lot_? (What is logged?) If not, change the log level to debug for a second and then switch back to info/error/none.
- Is something connected to the event exchange? If not, start it, bind it to a queue for a second, and then remove the binding again.
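
For that last step, one way to do it with the Python pika client, assuming the rabbitmq_event_exchange plugin is enabled so that the amq.rabbitmq.event exchange exists (host, credentials and the throwaway queue are placeholders, not something from your setup):

import pika

# Attach a temporary queue to the internal event exchange for a moment
# to see which events are flowing (placeholder host, default credentials).
conn = pika.BlockingConnection(pika.ConnectionParameters(host="localhost"))
ch = conn.channel()
q = ch.queue_declare(queue="", exclusive=True).method.queue
ch.queue_bind(exchange="amq.rabbitmq.event", queue=q, routing_key="#")

def on_event(channel, method, properties, body):
    # The event type is the routing key; the details are in the headers.
    print(method.routing_key, properties.headers)

ch.basic_consume(queue=q, on_message_callback=on_event, auto_ack=True)
try:
    ch.start_consuming()
finally:
    conn.close()  # the exclusive queue and its binding go away with the connection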

I was helping someone with a similar issue here: https://rabbitmq.slack.com/archives/C1EDN83PA/p1632930772001800, but they never replied with what the actual issue was.

Feel free to also run these three commands and send us the output (you might need to remove some terms if they contain PII or other sensitive info):
rabbitmqctl eval 'recon:info(whereis(rabbit_event),current_stacktrace).'

rabbitmqctl eval 'recon:info(whereis(rabbit_event),dictionary).'

rabbitmqctl eval 'recon:info(whereis(rabbit_event),backtrace).'

/Johan

Danijel Bosnjak

Feb 16, 2022, 6:07:16 PM
to rabbitmq-users
Hi Johan,

Churn statistics are attached below. We are currently logging at ERROR level and there are quite a lot of entries; we'll try enabling DEBUG level for a couple of seconds as you suggest and then switch back to ERROR. No, nothing is connected to the event exchange; we will try to bind it to a queue, collect some events, and remove the binding again.


# rabbitmqctl eval 'recon:info(whereis(rabbit_event),current_stacktrace).'
{current_stacktrace,[{gen_event,fetch_msg,6,
                                [{file,"gen_event.erl"},{line,331}]},
                     {proc_lib,init_p_do_apply,3,
                               [{file,"proc_lib.erl"},{line,226}]}]}

# rabbitmqctl eval 'recon:info(whereis(rabbit_event),dictionary).'
{dictionary,[{'$initial_call',{gen_event,init_it,6}},
             {'$ancestors',[rabbit_event_sup,rabbit_sup,<11920.224.0>]}]}

# rabbitmqctl eval 'recon:info(whereis(rabbit_event),backtrace).'
{backtrace,<<"Program counter: 0x00007f9a131dbbb4 (gen_event:fetch_msg/6 + 116)\ny(0)     false\ny(1)     []\ny(2)     infinity\ny(3)     [{handler,rabbit_mgmt_db_handler,false,[],false},{handler,rabbit_mgmt_reset_handler,false,[],false},{handler,rabbit_connection_tracking_handler,false,[],false},{handler,rabbit_channel_tracking_handler,false,[],false}]\ny(4)     rabbit_event\ny(5)     <0.415.0>\n\n0x00007f9abab0fc80 Return addr 0x00007f9a132f4994 (proc_lib:init_p_do_apply/3 + 196)\ny(0)     []\ny(1)     []\ny(2)     Catch 0x00007f9a132f49b4 (proc_lib:init_p_do_apply/3 + 228)\n\n0x00007f9abab0fca0 Return addr 0x00007f9a130cd948 (<terminate process normally>)\n\n0x00007f9abab0fca8 Return addr 0x00007f9a130cd948 (<terminate process normally>)\n">>}

P.S. Johan, thanks for your patience and support!
churn_stats.png

jo...@cloudamqp.com

Feb 17, 2022, 12:20:50 PM
to rabbitmq-users
OK, it looks like you are hitting the anti-pattern "open one connection, one channel, publish, close channel, close connection", and at a very high rate! This is the most computationally expensive way to send messages to/from RabbitMQ. The short-term "fix" is to turn off logging completely (log level none), at least for connection-related items, and to keep restarting rabbit_event when it grows too big (rabbitmqctl eval 'erlang:exit(whereis(rabbit_event),kill).').
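
Concretely, something along these lines (set_log_level should be available on 3.9, but double-check it for your version; the eval is the same one as above):

rabbitmqctl set_log_level none
rabbitmqctl eval 'erlang:exit(whereis(rabbit_event),kill).'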

The long-term solution here is to fix that massive connection and channel churn. You can either keep connections (and channels) long-lived (preferred) or put a proxy on the clients: https://www.cloudamqp.com/blog/maintaining-long-lived-connections-with-AMQProxy.html. If that is not feasible, you could use the following setup (if it is only the publishers that churn, and the consumers are long-lived): use one or several servers that accept these short-lived clients and shovel the messages to another cluster where long-lived consumers handle them.
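
As a rough illustration of the first option, a minimal long-lived publisher with the Python pika client (host, queue name and the message loop are just placeholders):

import pika

# One connection and one channel, opened once and reused for every publish,
# instead of connect/open/publish/close per message (placeholder host/queue).
params = pika.ConnectionParameters(host="localhost", heartbeat=30)
connection = pika.BlockingConnection(params)
channel = connection.channel()
channel.queue_declare(queue="events", durable=True)

def publish(body: bytes) -> None:
    # Reuses the already-open channel, so there is no per-message churn.
    channel.basic_publish(exchange="", routing_key="events", body=body)

for i in range(10):
    publish(f"message {i}".encode())

connection.close()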

/Johan

Danijel Bosnjak

Mar 9, 2022, 3:50:31 AM
to rabbitmq-users
Hi Johan,

tnx a lot for this precious information. We have discussed your suggestions internally and are still testing, but in the meantime we have a couple of questions:
  • Can reducing the amount of collected logs and metrics alleviate the issues described above?
  • We've also limited the heap size; does that make sense, and what's your opinion about it?

jo...@cloudamqp.com

Mar 9, 2022, 10:55:03 AM
to rabbitmq-users
Yes, turning off logs and metrics can help, as will regularly restarting rabbit_event. Note that these are short-term fixes until you can start using long-lived connections.
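
In rabbitmq.conf that could look roughly like this (disable_metrics_collector turns off the management plugin's fine-grained stats, so only use it if you collect metrics some other way, e.g. via the Prometheus plugin; check both settings against the docs for your version):

log.file.level = none
management_agent.disable_metrics_collector = true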

Which heap are you referring to, and how are you changing the size of it?

/Johan

Danijel Bosnjak

Mar 18, 2022, 11:23:16 AM
to rabbitmq-users
Hi Johan,

regarding the heap, this is how we have specified the size:
 
cat /etc/rabbitmq/rabbitmq-env.conf

RABBITMQ_SERVER_ADDITIONAL_ERL_ARGS="+sbwt none +MBlmbcs 8192 +MHlmbcs 8192 +MMscs 4096 +hmax 2000000000"

jo...@cloudamqp.com

Mar 18, 2022, 1:54:25 PM
to rabbitmq-users
OK! I've never seen that value tuned, and I cannot comment on whether it would help or not (I will check with a co-worker next week).

/Johan

Danijel Bosnjak

Oct 4, 2024, 4:42:44 AM
to rabbitmq-users
In the end it was due to high connection churn on the publisher services. It was resolved by creating a publisher thread that keeps a single long-lived connection, roughly along the lines of the sketch below. Tnx Johan
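
For anyone finding this later, the pattern looks roughly like this in Python with pika (host, queue name and payloads are illustrative, not our actual code):

import queue
import threading
import pika

outbox = queue.Queue()

def publisher_loop():
    # A single long-lived connection/channel owned by this thread; other
    # threads only put payloads on the in-process queue (placeholder host).
    conn = pika.BlockingConnection(pika.ConnectionParameters(host="localhost"))
    ch = conn.channel()
    ch.queue_declare(queue="events", durable=True)
    while True:
        body = outbox.get()
        if body is None:  # shutdown sentinel
            break
        ch.basic_publish(exchange="", routing_key="events", body=body)
    conn.close()

t = threading.Thread(target=publisher_loop, daemon=True)
t.start()

outbox.put(b"hello")
outbox.put(None)  # tell the publisher thread to stop
t.join()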