RabbitMQ crashing after reaching memory high watermark


Jelle Smet

Aug 2, 2018, 4:44:37 AM
to rabbitmq-users
Dear list,

Rabbitmq-server: 3.7.7
Erlang: 21.0.4

We are seeing the same issue with 3.6.5


We have a number of stand-alone RabbitMQ instances which accept incoming messages and shovel them to a set of central RabbitMQ instances.
There is one RabbitMQ instance that is misbehaving, presumably because of its high load; it is an outlier compared to the other instances.

We have made the following observations:
  • The instance accepts and shovels incoming messages in real time. There is no queuing happening.
    The message rate is about 7K messages/s.
  • The instance can run for about 5 minutes before hitting the memory high watermark, which is set to the default of 0.4.
    The host has 16 GB of RAM.
  • Memory usage keeps flapping around the high watermark: it continuously drops below the mark (clearing the alarm) and then hits the limit again.
    After some time, the RabbitMQ process exits.
  • We can free up the allocated memory by "force closing" the connection the shovel created to consume the queue (a command-line sketch follows this list).
    After force closing this connection, allocated memory drops to a few MB, after which it builds up again at a high rate and runs into the same issues.

Top shows the following output:

(screenshot)

The memory consumption:

(screenshot)

The memory allocation breakdown chart:

(screenshot)

The Shovel configuration is the following:

(screenshot)

I have tried applying the setting {queue_explicit_gc_run_operation_threshold, 500} in an attempt to let GC kick in earlier, but to no avail.
That seems logical in hindsight, I suppose, since the GC needs to happen within the connection's processes.
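
For reference, a minimal sketch of how such a rabbit application setting can be checked at runtime; this assumes rabbitmqctl is run on the affected node, and the setting itself would normally be placed in advanced.config under the rabbit application:

# Prints the currently effective value (assumption: the setting was applied
# via advanced.config or the environment before the node started).
rabbitmqctl eval 'application:get_env(rabbit, queue_explicit_gc_run_operation_threshold).'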



Please advise.


Jelle





Luke Bakken

Aug 2, 2018, 10:03:16 AM
to rabbitmq-users
Hello,

Thanks for the very detailed information from your environment.

What stands out is the very high value in the connection writer's process - 1031841 messages. These are Erlang VM messages that have yet to be processed. This suggests a slow network connection or other TCP issue. If you use the netstat or ss command to investigate the connection, do you see a high TCP Send-Q value? What does the receiving connection look like? Can you use iperf to check that network link's throughput?
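
A hedged sketch of those checks as they might be run on the shovel host; the destination address is taken from the netstat output later in the thread, and the default AMQP port 5672 is an assumption, so substitute your own values:

DEST=10.70.69.193                 # destination broker (assumed)
ss -tni dst "$DEST"               # per-connection Send-Q and TCP internals
netstat -tn | grep 5672           # Recv-Q / Send-Q overview for AMQP connections
# Rough throughput test; run `iperf -s` on the destination host first.
iperf -c "$DEST" -t 30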

Thanks,
Luke

Michael Klishin

Aug 2, 2018, 3:27:24 PM
to rabbitm...@googlegroups.com
To put it in slightly simpler terms, RabbitMQ has "scheduled" some protocol frames (quite a few of them, in fact) to be written out but
socket writes are falling behind.

Please check your network monitoring to see if it may be close to being maxed out.




--
MK

Staff Software Engineer, Pivotal/RabbitMQ

Jelle Smet

Aug 2, 2018, 4:10:54 PM
to rabbitmq-users
Hi Luke,

This suggests a slow network connection or other TCP issue.
Ok
 
If you use the netstat or ss command to investigate the connection, do you see a high TCP Send-Q value?

Yes indeed here's the relevant netstat output:

Proto Recv-Q Send-Q Local Address           Foreign Address         State       
tcp        0 406888 host-10-67-78-20.:37218 10.70.69.193:amqp       ESTABLISHED

 
What does the receiving connection look like?

In the destination RabbitMQ I can see the connection is continuously in state "flow" 



 
Can you use iperf to check that network link's throughput?

Yes I guess I could do that ... The idea is to check the available bandwidth compared to what is needed for this shovel transfer?

Thanks!

Michael Klishin

Aug 2, 2018, 4:13:48 PM
to rabbitm...@googlegroups.com
Yes. Keep in mind that some tools use megabytes/second and some use megabits/second when reporting network throughput.

RabbitMQ uses megabytes.
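
As a quick worked example of that unit difference (the 900 figure here is just an illustrative link speed, not a measurement from this thread):

# 900 megabits/second expressed in megabytes/second
echo $(( 900 / 8 ))    # => 112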


Jelle Smet

Aug 2, 2018, 4:17:48 PM
to rabbitmq-users
To put it in slightly simpler terms, RabbitMQ has "scheduled" some protocol frames (quite a few of them, in fact) to be written out but
socket writes are falling behind.

OK ... I'm probably greatly over-simplifying, but wouldn't a lack of bandwidth throttle the shovel connection "naturally", so that messages simply start to queue?
A lack of bandwidth is outside the sender's control, so I'm wondering how to cope with such situations ...
  
 
Please check your network monitoring to see if it may be close to being maxed out.

Ok will do 

Michael Klishin

Aug 2, 2018, 4:31:35 PM
to rabbitm...@googlegroups.com
Connections that max out the socket throughput rate will, in theory, be throttled by the OS
when TCP buffers fill up. It's a lot more involved in a message-passing [as in Erlang] based system such as RabbitMQ,
since there's more internal buffering going on, and that cannot be avoided.

RabbitMQ tries to deliver messages to consumers (or send other protocol frames to clients, although a high rate of
those is fairly unusual) and the socket writes don't keep up. The socket writer really doesn't do much, so the only
things that can slow it down are the network or runtime settings.

Note that in your netstat output there's a Send-Q that's well above 0. This is yet another indicator that the local connection
is not able to send data fast enough. There can be plenty of reasons for that. Knowing your network bandwidth is essential
to reasoning about this.

You can limit Shovel rate by using a low prefetch [1][2] (1000 by default, which is fairly high [2])
and confirming on acknowledgements (this is the default) [1][3].
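
As an illustration only, a dynamic shovel defined with a low prefetch and per-message confirms might look like the sketch below; the shovel name, URIs and queue names are placeholders, and a statically configured shovel would take the equivalent prefetch_count / ack_mode keys instead:

# Hypothetical dynamic shovel; adjust names, URIs and queues to your setup.
rabbitmqctl set_parameter shovel edge-to-central '{
  "src-protocol": "amqp091",  "src-uri": "amqp://",              "src-queue": "inbound",
  "dest-protocol": "amqp091", "dest-uri": "amqp://central-host", "dest-queue": "inbound",
  "prefetch-count": 250, "ack-mode": "on-confirm"
}'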

There are other options, such as compressing large message bodies, or storing them in common storage
and only passing identifiers/keys of the payloads around. The latter option will sometimes merely shift the problem
to a different place, of course.




Jelle Smet

Aug 3, 2018, 10:28:31 AM
to rabbitmq-users
You can limit Shovel rate by using a low prefetch [1][2] (1000 by default, which is fairly high [2])
and confirming on acknowledgements (this is the default) [1][3].


I have done this (prefetch of 250, acknowledgements set to on_publish) and at the moment the problem seems to have disappeared.
Throughput is still fine and no queues are building up. We're going to leave this configuration in place for a while to evaluate it.

Meanwhile, my networking colleagues are investigating the networking layer for congestion points.
I'm a bit reluctant to run the iperf test on the production network myself :)

Thanks for the great feedback & information.  I will update the thread with relevant info for future reference.

Michael Klishin

Aug 6, 2018, 3:04:42 AM
to rabbitm...@googlegroups.com
Thank you for reporting back to the list!


Jelle Smet

Aug 14, 2018, 3:50:12 AM
to rabbitmq-users
It seems we're not out of the woods yet.
A recap of things I have done:



  1. Set the configured shovel prefetch to 250 and the ack mode to on_publish ...
    This made the problem appear less frequently but didn't solve the issue after all.

  2. Bypassed the load balancer
    The RabbitMQ instances to which the shovel connections connect are behind an F5 load balancer.
    I configured the shovel to connect directly to one of the RabbitMQ nodes behind the load balancer and we experienced the same problem.

  3. We did an iperf test between the sending and receiving RabbitMQ nodes and got a throughput of ~900 Mbit/s (roughly 112 megabytes/s) without any further notable issues.

  4. Tweaked the TCP stack on the node with the following settings (a sketch of applying them via sysctl follows this list):

      net.core.rmem_max:
        value: '16777216'
      net.core.wmem_max:
        value: '16777216'
      net.ipv4.tcp_rmem:
        value: '4096 87380 16777216'
      net.ipv4.tcp_wmem:
        value: '4096 65536 16777216'
      net.core.netdev_max_backlog:
        value: '300000'

    This change kept the TCP Send-Q at 0 but didn't solve our problem, however.

    With these settings we still ran into the same issue. Screenshot of the memory consumption breakdown:

    screenshot_85.png
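
As referenced in item 4, one hedged way the listed kernel settings could be applied and persisted, assuming a distribution that reads /etc/sysctl.d (the file name is arbitrary):

# Values mirror the settings listed above.
cat <<'EOF' | sudo tee /etc/sysctl.d/90-rabbitmq-tcp.conf
net.core.rmem_max = 16777216
net.core.wmem_max = 16777216
net.ipv4.tcp_rmem = 4096 87380 16777216
net.ipv4.tcp_wmem = 4096 65536 16777216
net.core.netdev_max_backlog = 300000
EOF
sudo sysctl --system    # reload all sysctl configuration files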

I'm not going to randomly tweak parameters, so I'll await any suggestions.

Thanks!

Jelle

Luke Bakken

Aug 14, 2018, 8:59:40 AM
to rabbitmq-users
Hi Jelle,

Previously you showed some more information - the output of the Erlang top command and netstat. I would be interested to see that output from both the source and the destination.

Is the connection in flow state as it was before?

What is the output of netstat -s on both sides of the connection?
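
For what it's worth, a hedged way to capture that on both hosts and surface the counters that typically indicate congestion (the grep pattern is just one choice):

# Run on both the source and destination hosts.
netstat -s | egrep -i 'retrans|overflow|pruned|collapsed'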

Thanks for running the iperf test. How long did you run it?

Thanks,
Luke