high connection churn fills rabbit_event mailbox: unbounded memory increase


tho...@deepomatic.com

Apr 2, 2019, 12:44:46 PM
to rabbitmq-users
Hello everyone,

This is a pseudo bug report; https://github.com/rabbitmq/rabbitmq-server/blob/master/CONTRIBUTING.md strongly suggests posting here by default.

Context:
- using kubernetes helm chart: https://github.com/helm/charts/tree/master/stable/rabbitmq (v4.10.0)
- bitnami docker rabbitmq image: docker.io/bitnami/rabbitmq:3.7.12
- RabbitMQ 3.7.12 (same issue with 3.7.13, 3.7.14)
- Erlang 21.2
- default helm chart plugins: [rabbitmq_management,rabbitmq_peer_discovery_k8s].

Scenario:
- multiple (~5) consumers that connect, get an exception, and reconnect in a loop
- rabbitmq log for one connection:
2019-04-02 16:09:22.043 [info] <0.18555.969> accepting AMQP connection <0.18555.969> (10.4.3.189:49982 -> 10.4.0.54:5672)
2019-04-02 16:09:22.050 [info] <0.18555.969> connection <0.18555.969> (10.4.3.189:49982 -> 10.4.0.54:5672): user 'xxx' authenticated and granted access to vhost '/'
2019-04-02 16:09:22.053 [error] <0.14635.969> Channel error on connection <0.18555.969> (10.4.3.189:49982 -> 10.4.0.54:5672, vhost: '/', user: 'xxx'), channel 1:
2019-04-02 16:09:22.056 [info] <0.18555.969> closing AMQP connection <0.18555.969> (10.4.3.189:49982 -> 10.4.0.54:5672, vhost: '/', user: 'xxx')
- after some time the memory alarm is raised

Analysis:
- rabbitmq RSS: >10GB
- most of the memory is in the "other_proc" category of the memory breakdown
- after installing rabbitmq_top, we see ~4M messages in the `rabbit_event` process's Erlang mailbox, using 10GB of memory, with ~500k reductions/s: https://user-images.githubusercontent.com/1730297/55419837-ee5b3480-5575-11e9-9890-296984b49644.png
- it seems related to the discussion in https://github.com/rabbitmq/rabbitmq-server/issues/1722 , but I don't believe I use the rabbitmq-event-exchange plugin (nothing in the chart mentions it, and rabbitmq-plugins list says it's *not* enabled)


Workaround:
As suggested in https://github.com/rabbitmq/rabbitmq-server/issues/1722, I will try to reduce the connection churn, but a single misbehaving (not even actively malicious) AMQP client should ideally *not* be able to destabilize the broker like this.


Questions:
- is this behavior expected?
- should I open an issue on GitHub?


Thanks,
Thomas

Luke Bakken

Apr 2, 2019, 2:25:18 PM
to rabbitmq-users
Hi Thomas,

When you say "get an exception" what is the client application doing to trigger the channel error?

Luke

tho...@deepomatic.com

Apr 3, 2019, 1:57:24 PM
to rabbitmq-users
Hi Luke,
The line explaining the channel error somehow disappeared from my email, sorry...

The channel exception is caused by an application logic error: the client tries to bind a queue to a non-existent exchange, the broker replies with a channel exception, and the application handles it by closing the whole connection and retrying from the start, in an infinite loop.
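
For reference, the failing loop looks roughly like this (a minimal Python/pika sketch, not the actual application code; exchange and queue names are illustrative):

# Hypothetical reconstruction of the misbehaving client (pika assumed).
import pika

params = pika.ConnectionParameters(
    host="localhost",
    credentials=pika.PlainCredentials("guest", "guest"),
)

while True:
    connection = None
    try:
        connection = pika.BlockingConnection(params)
        channel = connection.channel()
        result = channel.queue_declare(queue="", exclusive=True)
        # The exchange was never declared, so the broker answers with a
        # 404 NOT_FOUND channel exception.
        channel.queue_bind(queue=result.method.queue, exchange="does-not-exist")
    except pika.exceptions.AMQPError:
        pass  # the application gives up on the whole connection...
    finally:
        if connection is not None and connection.is_open:
            connection.close()
        # ...and immediately retries from scratch, with no delay between attempts.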


Thank you,
Thomas

Luke Bakken

Apr 3, 2019, 5:24:02 PM
to rabbitmq-users
Hi Thomas,

I tried to reproduce what you report using RabbitMQ 3.7.14 and Erlang 21.3.3, but couldn't.

Here are the configuration and reproduction scripts I am using: https://gist.github.com/lukebakken/a1487672d26fa7b73a4c3cc7d08ddac1

Notice that I am using exclusive queues, which will be auto-deleted by RabbitMQ when the channel and connection die. I let this run for 30 minutes and RabbitMQ memory never exceeded 130 MiB.

If I use non-exclusive queues, they will remain and will eventually cause RabbitMQ to run out of memory.
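
For illustration, the difference comes down to the exclusive flag at declaration time (a Python/pika sketch, not the gist code itself; server-named queues assumed):

import pika

connection = pika.BlockingConnection(pika.ConnectionParameters("localhost"))
channel = connection.channel()

# Exclusive, server-named queue: owned by this connection and deleted by the
# broker as soon as the connection closes, so a reconnect loop leaves nothing behind.
channel.queue_declare(queue="", exclusive=True)

# Non-exclusive, server-named queue: outlives the connection, so every iteration
# of a reconnect loop leaves one more queue on the broker.
# channel.queue_declare(queue="")

connection.close()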

Could you please share your code that reproduces the issue, or a minimal set of code to reproduce it?

Thanks,
Luke

Thomas Riccardi

Apr 4, 2019, 10:43:28 AM
to rabbitm...@googlegroups.com
Hi Luke,

I did some tests based on your reproduction scripts:

1/ Using the exact same RabbitMQ instance as in my initial report, I see no memory issue either, as you found.
I then checked the connection churn: I get ~130 new connections/s on my machine.

When I initially encountered the issue, it was with a C++ consumer, which seems to reconnect much faster: with 5 consumers I was at ~500 new connections/s.

I also tried with a RabbitMQ deployed locally (using minikube + helm, cf. [1]) and got the same connection churn.

2/ Patching your `run.sh` to launch 50 repro.py processes instead of 10, I get ~450 new connections/s and finally reproduce the issue: memory increases and the rabbit_event mailbox fills up.

=> I suggest you try with more repro.py processes until the connection churn saturates; you should then see the memory increasing.
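
Something like this launcher is what I mean (a rough Python sketch; it assumes repro.py from your gist is in the current directory and takes no arguments):

# Run N copies of repro.py in parallel to saturate the connection churn.
import subprocess
import sys

count = int(sys.argv[1]) if len(sys.argv) > 1 else 50

procs = [subprocess.Popen([sys.executable, "repro.py"]) for _ in range(count)]
try:
    for proc in procs:
        proc.wait()
except KeyboardInterrupt:
    for proc in procs:
        proc.terminate()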

If you still can't reproduce it on your side, you could try with another RabbitMQ source, maybe the official docker image?


Thanks,
Thomas

---

[1] minikube, helm:
values.yaml:

rabbitmq:
  username: guest
  password: guest
  plugins: |-
    [rabbitmq_management, rabbitmq_top].
  configuration: |-
    loopback_users.guest = false

# deploy helm chart
helm install --name rabbitmq-repro -f values.yaml stable/rabbitmq --version=4.10.0

# wait for pod and service to be ready
kubectl port-forward --namespace default svc/rabbitmq-repro 5672:5672 15672:15672 &

(It deploys the bitnami/rabbitmq:3.7.14 docker image with erlang 21.3.)




Luke Bakken

Apr 4, 2019, 10:48:56 AM
to rabbitmq-users
Hi Thomas,

I do see the same memory increase when I run 50 python processes.

The question at this point is: what should really be done about this? 500 connection attempts per second is effectively a DoS attack on the server.

I'll ask the team their opinion about this behavior, thanks for pointing it out.

Luke

Luke Bakken

Apr 9, 2019, 11:07:36 AM
to rabbitmq-users
Hi Thomas,

I did some more testing. You can configure RabbitMQ to accept connections at a lower rate by using the following settings in rabbitmq.conf:

num_acceptors.tcp = 1
num_acceptors.ssl = 1
tcp_listen_options.backlog = 2

https://gist.github.com/lukebakken/a1487672d26fa7b73a4c3cc7d08ddac1#file-rabbitmq-conf

With those settings, my test script running 50 python processes no longer overwhelms the server. As soon as I increase the backlog to 4, I start seeing messages pile up in rabbit_event. Since the TCP accept backlog is much smaller, the backpressure comes from the TCP stack itself in this case.

Of course, this will also have a small effect on the rate at which legitimate connections can be made, but I'm pretty sure you won't notice it in your environment. Testing is recommended, of course.

Thanks,
Luke

Thomas Riccardi

Apr 9, 2019, 11:19:36 AM
to rabbitm...@googlegroups.com
Hi Luke,

It is always good to know we can throttle new connections.

For my particular case, I simply ended up adding a sleep between reconnections on the client side.
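
Roughly this pattern (a Python sketch of the idea; the real client is C++ and the delay value is illustrative):

import time

def consume():
    # Hypothetical placeholder for the real client logic:
    # connect, declare, bind, consume until an error occurs.
    raise ConnectionError("bind failed")

RETRY_DELAY = 1.0  # seconds between reconnect attempts

while True:
    try:
        consume()
    except Exception:
        pass
    # Sleeping here caps the reconnection rate so the rabbit_event mailbox
    # can drain faster than it fills.
    time.sleep(RETRY_DELAY)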

My message to this mailing list was mainly to raise awareness in the community and among the RabbitMQ developers, perhaps as a follow-up to the "moderately drastic" solutions discussed in https://github.com/rabbitmq/rabbitmq-server/issues/1722.

Anyway, thanks for the analysis and proposed workarounds,
Thomas

