RabbitMQ intermittent memory rise and crash

Danijel Bosnjak

Feb 15, 2022, 10:08:00 AM
to rabbitmq-users
Hi,

we have issues in an environment running RabbitMQ 3.9.11-1 and Erlang 24.2-1 on Rocky Linux 8.5. Everything works fine for a few days, but then the node starts to show the following:

* other_mem starts to rise
see attached below

* number of connections starts to rise and fall as publishers are slowed down
see attached below

* it hits the memory high watermark, but memory continues to rise (e.g. the high watermark was hit at 14:00, yet memory kept rising)
see attached below:

* rabbit_event memory usage and its Erlang mailbox grow
see attached below:


* and it usually ends with a segmentation fault

[Mon Jan 24 16:18:30 2022] 9_scheduler[2241837]: segfault at 7f180bdd7cb8 ip 000055bd9a3a374a sp 00007f1a94c35780 error 4 in beam.smp[55bd9a209000+418000]

Jan 24 16:19:11 bogus-server.lan systemd[1]: rabbitmq-server.service: Main process exited, code=killed, status=11/SEGV

We tried to reproduce the behaviour in a test environment, even with a higher number of publishers and reduced resources, but had no luck.

We have tuned the following kernel parameters:

net.ipv4.tcp_fin_timeout = 30
net.ipv4.tcp_keepalive_time = 30
net.ipv4.tcp_keepalive_intvl = 10
net.ipv4.tcp_keepalive_probes = 4
net.ipv4.tcp_tw_reuse = 1

and additional Erlang arguments:

RABBITMQ_SERVER_ADDITIONAL_ERL_ARGS="+sbwt none +MBlmbcs 8192 +MHlmbcs 8192 +MMscs 4096"
and RabbitMQ config:

tcp_listen_options.backlog = 4096
tcp_listen_options.buffer = 65536
tcp_listen_options.sndbuf = 65536
tcp_listen_options.recbuf = 65536
log.file.level = critical

[
  {kernel, [
    {inet_default_connect_options, [{nodelay, true}]},
    {inet_default_listen_options,  [{nodelay, true}]}
           ]
  }
].

Do you have any recommendations on where to start next and what could be causing this behaviour?

Thnx in advance!

number_of_conn.png
mem_usage.png
top_process.png
other_mem.png

jo...@cloudamqp.com

Feb 15, 2022, 5:37:14 PM
to rabbitmq-users
In my experience "rabbit_event" can end up in this state when there are _a lot_ of events that need to be processed (usually from high connection/channel/consumer churn).
- What do your churn statistics look like?
- Are you logging _a lot_? (What is logged?) If not, change the log level to debug for a second and then switch back to info/error/none.
- Is something connected to the event exchange? If not, start it, bind it to a queue for a second, and then remove the binding again.
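
For that last step, one way to do it with the Python pika client, assuming the rabbitmq_event_exchange plugin is enabled so that the amq.rabbitmq.event exchange exists (host, credentials and the throwaway queue are placeholders, not something from your setup):

import pika

# Attach a temporary queue to the internal event exchange for a moment
# to see which events are flowing (placeholder host, default credentials).
conn = pika.BlockingConnection(pika.ConnectionParameters(host="localhost"))
ch = conn.channel()
q = ch.queue_declare(queue="", exclusive=True).method.queue
ch.queue_bind(exchange="amq.rabbitmq.event", queue=q, routing_key="#")

def on_event(channel, method, properties, body):
    # The event type is the routing key; the details are in the headers.
    print(method.routing_key, properties.headers)

ch.basic_consume(queue=q, on_message_callback=on_event, auto_ack=True)
try:
    ch.start_consuming()
finally:
    conn.close()  # the exclusive queue and its binding go away with the connection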

I was helping someone with a similar issue here: https://rabbitmq.slack.com/archives/C1EDN83PA/p1632930772001800, but they never replied with what the actual issue was.

Feel free to also run these three commands and send us the output (you might need to remove some terms if they contain PII or other sensitive info):
rabbitmqctl eval 'recon:info(whereis(rabbit_event),current_stacktrace).'

rabbitmqctl eval 'recon:info(whereis(rabbit_event),dictionary).'

rabbitmqctl eval 'recon:info(whereis(rabbit_event),backtrace).'

/Johan

Danijel Bosnjak

Feb 16, 2022, 6:07:16 PM
to rabbitmq-users
Hi Johan,

Churn statistics are attached below. We are currently logging at ERROR level and there are quite a lot of entries; we'll try enabling DEBUG level for a couple of seconds as you suggest and then switch back to ERROR. No, nothing is connected to the event exchange; we will try to bind it to a queue, collect some events, and remove the binding again.


# rabbitmqctl eval 'recon:info(whereis(rabbit_event),current_stacktrace).'
{current_stacktrace,[{gen_event,fetch_msg,6,
                                [{file,"gen_event.erl"},{line,331}]},
                     {proc_lib,init_p_do_apply,3,
                               [{file,"proc_lib.erl"},{line,226}]}]}

# rabbitmqctl eval 'recon:info(whereis(rabbit_event),dictionary).'
{dictionary,[{'$initial_call',{gen_event,init_it,6}},
             {'$ancestors',[rabbit_event_sup,rabbit_sup,<11920.224.0>]}]}

# rabbitmqctl eval 'recon:info(whereis(rabbit_event),backtrace).'
{backtrace,<<"Program counter: 0x00007f9a131dbbb4 (gen_event:fetch_msg/6 + 116)\ny(0)     false\ny(1)     []\ny(2)     infinity\ny(3)     [{handler,rabbit_mgmt_db_handler,false,[],false},{handler,rabbit_mgmt_reset_handler,false,[],false},{handler,rabbit_connection_tracking_handler,false,[],false},{handler,rabbit_channel_tracking_handler,false,[],false}]\ny(4)     rabbit_event\ny(5)     <0.415.0>\n\n0x00007f9abab0fc80 Return addr 0x00007f9a132f4994 (proc_lib:init_p_do_apply/3 + 196)\ny(0)     []\ny(1)     []\ny(2)     Catch 0x00007f9a132f49b4 (proc_lib:init_p_do_apply/3 + 228)\n\n0x00007f9abab0fca0 Return addr 0x00007f9a130cd948 (<terminate process normally>)\n\n0x00007f9abab0fca8 Return addr 0x00007f9a130cd948 (<terminate process normally>)\n">>}

P.S. Johan, thanks for your patience and support!
churn_stats.png

jo...@cloudamqp.com

Feb 17, 2022, 12:20:50 PM
to rabbitmq-users
OK, it looks like you are hitting the anti-pattern "open one connection, one channel, publish, close channel, close connection", and at a very high rate! This is the most computationally expensive way to send messages to/from RabbitMQ. The short-term "fix" is to turn off logging completely (log level none), at least for connection-related items, and to keep restarting rabbit_event when it grows too big (rabbitmqctl eval 'erlang:exit(whereis(rabbit_event),kill).').
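
Concretely, something along these lines (set_log_level should be available on 3.9, but double-check it for your version; the eval is the same one as above):

rabbitmqctl set_log_level none
rabbitmqctl eval 'erlang:exit(whereis(rabbit_event),kill).'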

The long-term solution here is to fix that massive connection and channel churn. You can either keep connections (and channels) long-lived (preferred) or put a proxy on the clients: https://www.cloudamqp.com/blog/maintaining-long-lived-connections-with-AMQProxy.html. If that is not feasible, you could use the following setup (if it is only the publishers that churn, and the consumers are long-lived): use one or several servers that accept these short-lived clients and shovel the messages to another cluster where long-lived consumers handle them.
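
As a rough illustration of the first option, a minimal long-lived publisher with the Python pika client (host, queue name and the message loop are just placeholders):

import pika

# One connection and one channel, opened once and reused for every publish,
# instead of connect/open/publish/close per message (placeholder host/queue).
params = pika.ConnectionParameters(host="localhost", heartbeat=30)
connection = pika.BlockingConnection(params)
channel = connection.channel()
channel.queue_declare(queue="events", durable=True)

def publish(body: bytes) -> None:
    # Reuses the already-open channel, so there is no per-message churn.
    channel.basic_publish(exchange="", routing_key="events", body=body)

for i in range(10):
    publish(f"message {i}".encode())

connection.close()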

/Johan

Danijel Bosnjak

Mar 9, 2022, 3:50:31 AM
to rabbitmq-users
Hi Johan,

tnx a lot for this precious information. We have discussed your suggestions internally and are still testing, but in the meantime we have a couple of questions:
  • Can reducing the amount of collected logs and metrics alleviate the issues described above?
  • We've also limited the heap size; does that make sense, and what's your opinion about it?

jo...@cloudamqp.com

Mar 9, 2022, 10:55:03 AM
to rabbitmq-users
Yes, turning off logs and metrics can help, as will regularly restarting rabbit_event. Note that these are short-term fixes until you can start using long-lived connections.
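
In rabbitmq.conf that could look roughly like this (disable_metrics_collector turns off the management plugin's fine-grained stats, so only use it if you collect metrics some other way, e.g. via the Prometheus plugin; check both settings against the docs for your version):

log.file.level = none
management_agent.disable_metrics_collector = true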

Which heap are you referring to, and how are you changing the size of it?

/Johan

Danijel Bosnjak

Mar 18, 2022, 11:23:16 AM
to rabbitmq-users
Hi Johan,

regarding the heap, this is how we have specified the size:
 
cat /etc/rabbitmq/rabbitmq-env.conf

RABBITMQ_SERVER_ADDITIONAL_ERL_ARGS="+sbwt none +MBlmbcs 8192 +MHlmbcs 8192 +MMscs 4096 +hmax 2000000000"

jo...@cloudamqp.com

Mar 18, 2022, 1:54:25 PM
to rabbitmq-users
OK! I've never seen that value tuned, and I cannot comment on whether it would help or not (I will check with a co-worker next week).

/Johan

Danijel Bosnjak

Oct 4, 2024, 4:42:44 AM
to rabbitmq-users
In the end it was due to high connection churn on the publisher services. It was resolved by creating a publisher thread that keeps a single long-lived connection, roughly along the lines of the sketch below. Tnx Johan
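
For anyone finding this later, the pattern looks roughly like this in Python with pika (host, queue name and payloads are illustrative, not our actual code):

import queue
import threading
import pika

outbox = queue.Queue()

def publisher_loop():
    # A single long-lived connection/channel owned by this thread; other
    # threads only put payloads on the in-process queue (placeholder host).
    conn = pika.BlockingConnection(pika.ConnectionParameters(host="localhost"))
    ch = conn.channel()
    ch.queue_declare(queue="events", durable=True)
    while True:
        body = outbox.get()
        if body is None:  # shutdown sentinel
            break
        ch.basic_publish(exchange="", routing_key="events", body=body)
    conn.close()

t = threading.Thread(target=publisher_loop, daemon=True)
t.start()

outbox.put(b"hello")
outbox.put(None)  # tell the publisher thread to stop
t.join()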