rabbitmq busy with auto delete queues eviction


Antoine Galataud

Nov 14, 2017, 12:17:19 PM
to rabbitmq-users
Hi,

I'm currently stress testing a two-node RabbitMQ 3.6.12 cluster with a high number of MQTT clients. One of my tests consists of spawning a large number of clients, abruptly terminating them, and checking whether the disconnections are correctly noticed and the related items evicted.
When that happens, I notice that the connection count drops to 0 within seconds (low MQTT keepalive interval), but the queues (auto-delete, since clean session is set to true) are evicted slowly. During that period (up to several minutes), new client connections are slowed down and 90% aren't accepted.

Here is my rabbitmq.config:

[
    {kernel, [
        {net_ticktime, 120},
        {inet_default_connect_options, [{nodelay, true}]},
        {inet_default_listen_options, [{nodelay, true}]}
    ]},
    {rabbit, [
        {tcp_listeners, [5672]},
        {tcp_listen_options, [
            {backlog, 128},
            {nodelay, true},
            {linger, {true, 0}},
            {exit_on_close, false},
            {sndbuf, 196608},
            {recbuf, 196608}
        ]},
        {auth_backends, [rabbit_auth_backend_internal, {rabbit_auth_backend_http, rabbit_auth_backend_internal}, rabbit_auth_backend_cache]},
        {vm_memory_high_watermark, 0.8},
        {disk_free_limit, {mem_relative, 0.4}},
        {cluster_partition_handling, autoheal},
        {cluster_keepalive_interval, 30000},
        {collect_statistics_interval, 60000},
        {hipe_compile, false}
    ]},
    {rabbitmq_auth_backend_cache, [
        {cached_backend, rabbit_auth_backend_http},
        {cache_ttl, 360000}
    ]},
    {rabbitmq_auth_backend_http, [
        {user_path, "http://<IP>/v2/rauth/user/"},
        {vhost_path, "http://<IP>/v2/rauth/vhost/"},
        {resource_path, "http://<IP>/v2/rauth/resource/"}
    ]},
    {rabbitmq_mqtt, [
        {allow_anonymous, false},
        {vhost, <<"/">>},
        {exchange, <<"amq.topic">>},
        {subscription_ttl, 1800000},
        {prefetch, 10},
        {tcp_listeners, [1883]},
        {tcp_listen_options, [
            {backlog, 32768},
            {nodelay, true},
            {sndbuf, 8192},
            {recbuf, 8192}
        ]}
    ]},
    {rabbitmq_stomp, [
        {tcp_listeners, [61613]},
        {tcp_listen_options, [
            {backlog, 4096},
            {nodelay, true},
            {sndbuf, 32768},
            {recbuf, 32768}
        ]}
    ]}
].

as well as my kernel parameters:

net.core.rmem_max = 16777216
net.core.wmem_max = 16777216
net.ipv4.tcp_rmem = 4096 87380 16777216
net.ipv4.tcp_wmem = 4096 65536 16777216
net.ipv4.tcp_window_scaling = 1
net.ipv4.tcp_tw_reuse = 1
net.ipv4.tcp_fin_timeout = 5
net.ipv4.tcp_syncookies = 0
net.ipv4.tcp_rfc1337 = 1
net.ipv4.conf.all.rp_filter = 1
net.ipv4.tcp_max_syn_backlog = 8192
net.ipv6.conf.all.autoconf = 0
net.core.somaxconn = 32768

Antoine

Antoine Galataud

Nov 15, 2017, 5:25:28 AM
to rabbitmq-users
I also tested with expiring queues: when the consumer is gone, the queue remains for 30 minutes.
Before expiration happens, new connections are accepted as usual. When a high number of queues gets evicted once the TTL is reached, it ends up in the same situation as with auto-delete queues: no connections are accepted until the cleanup is done.
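
For reference, the 30 minutes matches the {subscription_ttl, 1800000} (ms) entry in the rabbitmq.config above. For a plain AMQP queue, the equivalent knob would be the x-expires queue argument. A minimal sketch with the RabbitMQ Java client, assuming a recent amqp-client where Connection and Channel are AutoCloseable; host and queue name are placeholders:

import java.util.HashMap;
import java.util.Map;

import com.rabbitmq.client.Channel;
import com.rabbitmq.client.Connection;
import com.rabbitmq.client.ConnectionFactory;

public class ExpiringQueueSketch {
    public static void main(String[] args) throws Exception {
        ConnectionFactory factory = new ConnectionFactory();
        factory.setHost("localhost"); // placeholder

        try (Connection conn = factory.newConnection();
             Channel ch = conn.createChannel()) {
            // x-expires deletes the queue 30 minutes after it is last used,
            // mirroring the MQTT plugin's subscription_ttl above.
            Map<String, Object> queueArgs = new HashMap<>();
            queueArgs.put("x-expires", 1800000);
            ch.queueDeclare("expiring-queue-demo", true, false, false, queueArgs);
        }
    }
}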

What could explain that?

Luke Bakken

Nov 17, 2017, 7:00:22 AM
to rabbitmq-users
Hi Antoine,

I can't explain the situation you are seeing right off the top of my head (maybe one of my teammates can provide their thoughts). I would be interested in reproducing what you are seeing since it sounds like you have code that reliably shows the issue. Would you mind sharing it?

Thanks,
Luke

Antoine Galataud

Nov 17, 2017, 8:46:10 AM
to rabbitm...@googlegroups.com
Hi Luke,

I'll try to provide as many details as I can. In addition to the rabbitmq.config and sysctl customizations (above), here is the output of rabbitmqctl status:

[{pid,1698},
 {running_applications,
     [{rabbitmq_topic_authorization,"RabbitMQ topic-based authorization",[]},
      {rabbitmq_web_stomp,"Rabbit WEB-STOMP - WebSockets to Stomp adapter",
          "3.6.12"},
      {rabbitmq_stomp,"RabbitMQ STOMP plugin","3.6.12"},
      {rabbitmq_mqtt,"RabbitMQ MQTT Adapter","3.6.12"},
      {rabbitmq_management,"RabbitMQ Management Console","3.6.12"},
      {rabbitmq_management_agent,"RabbitMQ Management Agent","3.6.12"},
      {rabbitmq_event_exchange,"Event Exchange Type","3.6.12"},
      {rabbitmq_auth_backend_http,"RabbitMQ HTTP Authentication Backend",
          "3.6.12+2.g430e0e3"},
      {rabbitmq_web_dispatch,"RabbitMQ Web Dispatcher","3.6.12"},
      {rabbit,"RabbitMQ","3.6.12"},
      {rabbitmq_auth_backend_cache,"RabbitMQ Authentication Backend cache",[]},
      {amqp_client,"RabbitMQ AMQP Client","3.6.12"},
      {rabbit_common,
          "Modules shared by rabbitmq-server and rabbitmq-erlang-client",
          "3.6.12"},
      {sockjs,"SockJS","0.3.4"},
      {xmerl,"XML parser","1.3.14"},
      {cowboy,"Small, fast, modular HTTP server.","1.0.4"},
      {cowlib,"Support library for manipulating Web protocols.","1.0.2"},
      {syntax_tools,"Syntax tools","2.1.1"},
      {mnesia,"MNESIA  CXC 138 12","4.14.3"},
      {inets,"INETS  CXC 138 49","6.3.9"},
      {ranch,"Socket acceptor pool for TCP protocols.","1.3.0"},
      {ssl,"Erlang/OTP SSL application","8.1.3"},
      {os_mon,"CPO  CXC 138 46","2.4.2"},
      {public_key,"Public key infrastructure","1.4"},
      {crypto,"CRYPTO","3.7.4"},
      {compiler,"ERTS  CXC 138 10","7.0.4"},
      {asn1,"The Erlang ASN1 compiler version 4.0.4","4.0.4"},
      {sasl,"SASL  CXC 138 11","3.0.3"},
      {stdlib,"ERTS  CXC 138 10","3.3"},
      {kernel,"ERTS  CXC 138 10","5.2"}]},
 {os,{unix,linux}},
 {erlang_version,
     "Erlang/OTP 19 [erts-8.3.5] [source] [64-bit] [smp:8:8] [async-threads:128] [hipe] [kernel-poll:true]\n"},
 {memory,
     [{connection_readers,0},
      {connection_writers,0},
      {connection_channels,0},
      {connection_other,3376512},
      {queue_procs,1812168},
      {queue_slave_procs,11240},
      {plugins,88326640},
      {other_proc,238397456},
      {metrics,285464},
      {mgmt_db,18407600},
      {mnesia,83720},
      {other_ets,2667272},
      {binary,3312136},
      {msg_index,44176},
      {code,28833745},
      {atom,1139913},
      {other_system,2645091526},
      {total,3031789568}]},
 {alarms,[]},
 {listeners,
     [{clustering,25672,"::"},
      {'http/web-stomp',15674,"::"},
      {amqp,5672,"::"},
      {http,15672,"::"},
      {mqtt,1883,"::"},
      {stomp,61613,"::"}]},
 {vm_memory_calculation_strategy,rss},
 {vm_memory_high_watermark,0.8},
 {vm_memory_limit,26991175270},
 {disk_free_limit,13495587635},
 {disk_free,200424062976},
 {file_descriptors,
     [{total_limit,999900},
      {total_used,97},
      {sockets_limit,899908},
      {sockets_used,95}]},
 {processes,[{limit,1048576},{used,690}]},
 {run_queue,0},
 {uptime,6886},
 {kernel,{net_ticktime,120}}]

This is a two-node cluster.

The test uses a hand-made Java client based on Paho:

import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.Random;
import java.util.UUID;
import java.util.concurrent.CountDownLatch;

import org.apache.commons.lang.math.RandomUtils;
import org.eclipse.paho.client.mqttv3.MqttClient;
import org.eclipse.paho.client.mqttv3.MqttConnectOptions;
import org.eclipse.paho.client.mqttv3.MqttMessage;
import org.eclipse.paho.client.mqttv3.persist.MemoryPersistence;

public class MqttTest {

    public static int            NB_CLIENTS;
    public static int            NB_THREADS;

    public static boolean        SEND_MSG           = true;
    public static long           MSG_TO_SEND_PER_CLI;
    public static long           PAUSE_BTW_SEND     = 200;
    public static byte[]         MSG_CONTENT        = "some data".getBytes();

    public static boolean        JOIN_BEFORE_CON    = true, JOIN_BEFORE_DISCO = true;
    public static CountDownLatch START_SIGNAL       = new CountDownLatch(1);
    // Initialized in main(): NB_THREADS is still 0 during class initialization,
    // so creating this latch here would make STOP_SIGNAL.await() return immediately.
    public static CountDownLatch STOP_SIGNAL;
    public static int            SLEEP_BEFORE_DISCO = 1000000;

    public static int            MQTT_PING_INTERVAL = 10;
    
    public static List<String>   NODE_IPS           = Arrays.asList("IP_NODE1", "IP_NODE2");
    public static String         RABBIT_USERNAME    = "username";
    public static String         RABBIT_PASSWORD    = "password";
    
    static class Worker implements Runnable {

        List<MqttClient>   clients  = new ArrayList<>();
        MqttConnectOptions connOpts = new MqttConnectOptions();

        public Worker() throws Exception {

            connOpts.setUserName(RABBIT_USERNAME);
            connOpts.setPassword(RABBIT_PASSWORD.toCharArray());
            connOpts.setKeepAliveInterval(MQTT_PING_INTERVAL);
            connOpts.setCleanSession(true);

            for (int i = 0; i < (NB_CLIENTS / NB_THREADS); i++) {
                MqttClient client = new MqttClient("tcp://" + getNodeIp() + ":1883",
                        UUID.randomUUID().toString().substring(0, 7), new MemoryPersistence());
                clients.add(client);
            }
        }

        @Override
        public void run() {
            try {

                if (JOIN_BEFORE_CON) {
                    START_SIGNAL.await();
                }

                for (MqttClient client : clients) {
                    client.connect(connOpts);
                    // '#' is the MQTT multi-level wildcard ('*' is AMQP topic syntax)
                    client.subscribe("device/" + client.getClientId() + "/#");
                }

                if (SEND_MSG) {
                    for (int i = 0; i < MSG_TO_SEND_PER_CLI; i++) {
                        for (MqttClient client : clients) {
                            client.publish("device/" + client.getClientId() + "/fake",
                                    new MqttMessage(MSG_CONTENT));
                        }
                        Thread.sleep(PAUSE_BTW_SEND * RandomUtils.nextInt(5));
                    }
                }

                Thread.sleep(SLEEP_BEFORE_DISCO);

                STOP_SIGNAL.countDown();

            } catch (Exception e) {
                e.printStackTrace();
            }
        }
    }

    public static void main(String[] args) throws Exception {

        NB_CLIENTS = Integer.parseInt(args[0]);
        NB_THREADS = Integer.parseInt(args[1]);
        MSG_TO_SEND_PER_CLI = Integer.parseInt(args[2]);
        STOP_SIGNAL = new CountDownLatch(NB_THREADS);

        for (int i = 0; i < NB_THREADS; i++) {
            new Thread(new Worker()).start();
        }
        START_SIGNAL.countDown();

        if (JOIN_BEFORE_DISCO)
            STOP_SIGNAL.await();
    }

    private static final Random random = new Random();

    private static String getNodeIp() {
        return NODE_IPS.get(random.nextInt(NODE_IPS.size()));
    }
}

You have to change NODE_IPS, RABBIT_USERNAME, and RABBIT_PASSWORD.
Then I run it with:

java -Xmx16G -Xss256k -cp . MqttTest <NB_CLIENTS> <NB_THREADS> <MESSAGES_PER_CLIENT>

NB_CLIENTS = 10000 is enough. Once this number is reached, just Ctrl-C. In the RabbitMQ management UI the connection count drops to 0 almost instantly, whereas queues are evicted little by little (~300 every 5 s). Until the queue count reaches 0, no new connection is allowed (they time out).




Antoine Galataud

Nov 20, 2017, 8:32:53 AM
to rabbitmq-users
Hi,

Let me know if I can provide more info. I can reproduce the problem 100% of the time, so I'm happy to provide rabbitmqctl output or logs.
Just note that during such a massive queue eviction, some things aren't available (e.g. rabbitmqctl node_health_check times out).

Antoine

Luke Bakken

Nov 20, 2017, 9:41:07 AM
to rabbitmq-users
Hi Antoine -

Thanks for the information. I will try to spend some time diagnosing this when I am done with higher priority support issues.

Luke

Antoine Galataud

Nov 29, 2017, 4:54:28 AM
to rabbitm...@googlegroups.com
Hi,

We encountered the same problem on a production cluster this morning, with around 12000 MQTT clients connected to it. Most of the queues (85%) are auto-delete.
A significant portion of the clients got disconnected, and since the cluster wasn't accepting connections while evicting queues (and maybe doing something else, I don't know), they were unable to reconnect.
As the clients also reconnect automatically, we ended up with an unstable system where:
- connections and channels pile up (on the RabbitMQ side; on the load balancer and client side they appear idle, as they get no response)
- the queue count yo-yos

Our workaround is to stop accepting connections on the load balancer side for some time, wait for the cluster to be ready to accept connections again, then let the clients reconnect (the backoff mechanism also helps; they reconnect less frequently after a few attempts).

Antoine


Antoine Galataud

Nov 30, 2017, 7:09:33 AM
to rabbitmq-users
Hi,

A bit more information about the last occurrence of this: it was a two-node cluster, running RabbitMQ 3.6.8 with Erlang 19.3.
This cluster has a different sysctl and software setup, so let me know if I can provide useful information like logs, rabbitmqctl output, etc. I can also provide access to the test cluster on which I first reproduced the problem, if necessary.

I also tried to reproduce locally on a 3.6.14 single-node setup, but couldn't so far. I didn't push it very far though (mainly, I need to increase the number of connections to make sure the problem becomes visible).

Note also that for both my first report (test environment) and my last one (prod), everything is running in AWS. That could be important if network performance is a key factor.

I'll perform more tests on my side, especially to see whether the problem only arises on installations with 2+ nodes.

Antoine

Antoine Galataud

Dec 1, 2017, 5:45:10 AM
to rabbitmq-users
I spent some time testing and managed to reproduce the problem on a 3.6.12 cluster, running in AWS on c4.4xlarge instances, with servers and test clients in the same availability zone. The test creates 25000 MQTT clients.

First, I tried with a single-node cluster: in that case queue eviction is fast (around 1000/sec). I couldn't reproduce with this configuration, but I guess that's because eviction is fast enough that the client connection timeout isn't reached.

Then I added a 2nd node to the cluster and noticed that both the connection acceptance rate and the queue eviction rate decreased significantly. Below is a test run where I started 25000 MQTT clients, abruptly stopped them, then immediately started a new set of 25000 clients. What I observed:
- the connection count drops to 0 almost instantly (netstat reports no connections to 1883)
- the queue eviction rate is around 200/sec
- new clients attempt to connect but don't succeed: in the screenshots below you'll see 100 accepted connections while queues are still being evicted; netstat also reports around 50 connections per instance. Connections seem to be accepted, but the clients never get an answer. (Note that the test code waits for each client to connect before proceeding to the next one. As I run the test clients on multiple servers, each with multiple threads, I can control the connection rate but also notice when connections stop being accepted. You'll see 100 connections in the screenshots because there were 10 test servers with 10 threads each.)
- the heartbeat timeout is reached for the 100 accepted new clients:

SEVERE: 04b96d8: Timed out as no activity, keepAlive=15,000 lastOutboundActivity=1,512,122,293,382 lastInboundActivity=1,512,122,278,381 time=1,512,122,308,382 lastPing=1,512,122,293,382
Timed out waiting for a response from the server (32000)
at org.eclipse.paho.client.mqttv3.internal.ExceptionHelper.createMqttException(ExceptionHelper.java:31)
at org.eclipse.paho.client.mqttv3.internal.ClientState.checkForActivity(ClientState.java:694)
at org.eclipse.paho.client.mqttv3.internal.ClientComms.checkForActivity(ClientComms.java:784)
at org.eclipse.paho.client.mqttv3.internal.ClientComms.checkForActivity(ClientComms.java:770)
at org.eclipse.paho.client.mqttv3.TimerPingSender$PingTask.run(TimerPingSender.java:77)
at java.util.TimerThread.mainLoop(Timer.java:555)
at java.util.TimerThread.run(Timer.java:505)

- finally, once the queues are fully evicted, all new connections are accepted again.

[Screenshot: Management UI just after closing 25000 MQTT connections]

[Screenshot: Management UI a few seconds after]

[Screenshot: Management UI when a new batch of 25000 clients tries to connect]

I also noticed that the queue eviction rate worsens if you increase the number of nodes or if the nodes run in different AZs, so I believe the network and node synchronization are part of the problem.

This is a major problem for us, as we expect to have many more MQTT connections in the coming weeks, and any massive reconnection would mean downtime and manual intervention. The only workaround I see for now is to have MQTT clients connect with clean session = false (see the sketch below). But that leads to other scalability problems (clients bouncing between nodes behind a LB until they connect to the node where their queue was created, or the need to make MQTT queues HA, which has a cost).
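
For illustration, the clean session change is a one-liner in the Paho options used by the test client earlier in this thread. A sketch only, given the drawbacks just mentioned; broker address and client id are placeholders:

import org.eclipse.paho.client.mqttv3.MqttClient;
import org.eclipse.paho.client.mqttv3.MqttConnectOptions;
import org.eclipse.paho.client.mqttv3.persist.MemoryPersistence;

public class PersistentSessionSketch {
    public static void main(String[] args) throws Exception {
        MqttConnectOptions opts = new MqttConnectOptions();
        // clean session = false: the broker keeps a durable queue bound to this
        // client id instead of an auto-delete one, so a mass disconnect doesn't
        // trigger a mass queue eviction (at the cost of the problems noted above).
        opts.setCleanSession(false);

        // The client id must be stable across restarts for the session to be resumed.
        MqttClient client = new MqttClient("tcp://BROKER_IP:1883", "stable-client-id",
                new MemoryPersistence());
        client.connect(opts);
        client.subscribe("device/stable-client-id/#", 1); // QoS 1 so the subscription is durable
        client.disconnect();
    }
}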

Thanks in advance for your help.
Antoine

Antoine Galataud

Dec 6, 2017, 2:54:19 PM
to rabbitmq-users
Hi,

I still have no clue what's happening, despite spending time on this. If someone has a hint I'd be glad to hear it. Let me know if I can look into something I didn't think of in the first place.

Antoine

Luke Bakken

Dec 6, 2017, 3:05:51 PM
to rabbitmq-users
Hi Antoine -

I have been busy with higher-priority issues. At this point I recommend not using clustering for your use case.

You have provided plenty of information to work with, thank you.

Luke


Antoine Galataud

Dec 6, 2017, 4:33:49 PM
to rabbitm...@googlegroups.com
Hi Luke,

Thanks for your suggestion. As you can imagine this has an impact or two on our messaging infrastructure, but it's worth trying.
I'd still be interested in a way to solve the issue I reported, as we already have some clusters running, and moving to a non-clustered setup couldn't be achieved overnight.

Thanks,
Antoine



Luke Bakken

Dec 6, 2017, 4:46:30 PM
to rabbitmq-users
Hi Antoine -

I'm working on this issue now and am trying to reproduce it locally using plain AMQP with a two-node cluster, to see whether it manifests itself without MQTT.

Quick question - do you have any HA policies defined? I don't see any in the discussion history but I thought I would confirm.

Thanks -
Luke

Antoine Galataud

Dec 6, 2017, 5:38:06 PM
to rabbitm...@googlegroups.com
I have no HA policy defined. The only one I had was setting a max-length of 20 messages for MQTT queues, but I also tried without it and it didn't change anything.

Antoine 

Luke Bakken

Dec 6, 2017, 8:04:44 PM
to rabbitmq-users
Thanks for the info.

I am getting the following exception when running the test Java code:

java.lang.OutOfMemoryError: unable to create new native thread

This is on an Arch Linux workstation with 32 GiB of RAM. I've checked the code you provided into this repository, and provided a README with the system settings I applied and how I am running the app:

https://github.com/lukebakken/rabbitmq-users-hgRmhpL8Y6o
So at this point I can't reproduce the issue you see with your code. If you can assist with running the test code, that would be great.

I was able to create over 20000 AMQP connections using the consumer.go application that I also have in that repo (golang/consumer.go). Exiting that application does not appear to exhibit the slow queue eviction you see, but I haven't tested it across a network yet. I will do that tomorrow.
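
(For reference, the shape of that test in Java would be roughly the following. This is a sketch assuming consumer.go does the straightforward thing: open many connections, each with an auto-delete queue and a consumer, then drop them all at once. Host and connection count are placeholders.)

import java.util.ArrayList;
import java.util.List;

import com.rabbitmq.client.Channel;
import com.rabbitmq.client.Connection;
import com.rabbitmq.client.ConnectionFactory;
import com.rabbitmq.client.DefaultConsumer;

public class ManyAutoDeleteQueues {
    public static void main(String[] args) throws Exception {
        ConnectionFactory factory = new ConnectionFactory();
        factory.setHost("localhost"); // placeholder

        List<Connection> connections = new ArrayList<>();
        for (int i = 0; i < 20000; i++) { // needs a generous file descriptor limit
            Connection conn = factory.newConnection();
            Channel ch = conn.createChannel();
            // Server-named auto-delete queue: deleted once its consumer goes away.
            String queue = ch.queueDeclare("", false, false, true, null).getQueue();
            ch.basicConsume(queue, true, new DefaultConsumer(ch));
            connections.add(conn);
        }

        // Dropping everything at once is what triggers the mass queue eviction.
        for (Connection conn : connections) {
            conn.abort();
        }
    }
}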

Thanks,
Luke

Antoine Galataud

Dec 7, 2017, 4:11:16 AM
to rabbitm...@googlegroups.com
Hi Luke,

Thanks for your feedback.
The Paho library creates quite a lot of threads for each connection. Using the repo you provided, I wasn't able to go further than 5400 connections (less RAM on my workstation, though there's no need to increase the heap size anyway, since thread stacks aren't allocated on the heap), with basically the same sysctl / ulimit / JVM config. With this library I was still able to reach 200k connections by running it on several AWS instances (t2 type).
Still, I could observe that MQTT queue eviction is not instant.

I've also tried the Go app you provided and, although I ran into some instabilities (seg faults, connection resets, ...), I couldn't observe any slowness in queue eviction either.

I'm also convinced that the network plays a role here: there's a big difference between 2 nodes in the same AZ and 2 nodes in 2 different AZs (AWS Beijing, but I have no stats about network latency between zones there).

Antoine


Antoine Galataud

Dec 7, 2017, 6:37:57 AM
to rabbitm...@googlegroups.com
Hi,

I've also been able to reproduce this on a single-node RabbitMQ 3.6.12 / Erlang 19.3.6, on a c4.8xlarge (EBS gp2 disk, 900 IOPS). I started 60000 MQTT clients, stopped them, and immediately restarted them. I ran into the same issue as described before (timed out waiting for a server answer).

Antoine 

Antoine Galataud

Dec 7, 2017, 8:05:08 AM
to rabbitm...@googlegroups.com
A few other tests I've performed:
- using a single c3.8xlarge (instance SSD disk), in order to eliminate potential network latency between the instance and the disk
- using QoS 0 for subscriptions (see the sketch below), so queues are non-durable and auto-delete by default, which greatly reduces disk access

Although the queue eviction rate is improved, I can still reproduce the problem with 60000 MQTT clients.
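
(In the Paho test client from earlier in the thread, the QoS 0 change is just an extra argument on the subscribe call; a sketch of the modified line:)

// QoS 0 subscription: the MQTT plugin backs it with a non-durable,
// auto-delete queue, so message data largely stays off the disk.
client.subscribe("device/" + client.getClientId() + "/#", 0);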

Luke Bakken

Dec 7, 2017, 10:07:31 AM
to rabbitmq-users
Hi Antoine -

Clustering across AZs is possible but as you can see, has negative consequences due to the network.

I'll give reproducing this another try using a cloud provider. I think it should be obvious at this point that you should not expect a fix any time soon - this is most likely going to be a difficult issue for which to find a root cause.

Thanks -
Luke

Antoine Galataud

Dec 7, 2017, 10:29:13 AM
to rabbitm...@googlegroups.com
Hi Luke,

Now I'm focusing more on finding ways to increase the queue eviction rate, in order to minimize the time during which connections aren't accepted. It will probably also be a matter of correct sizing.

Antoine


Antoine Galataud

Dec 12, 2017, 8:29:16 AM
to rabbitm...@googlegroups.com
Hi Luke,

Did you manage to reproduce the problem using a cloud provider?

Thanks,
Antoine


Michael Klishin

Dec 13, 2017, 1:07:47 PM
to rabbitmq-users
The last thing I heard was a "no, we can't reproduce". Luke will be back on the list next week.

Luke Bakken

Dec 18, 2017, 3:51:15 PM
to rabbitmq-users
Hey Antoine,

I modified my Go test app and have been able to reproduce the issue (code: https://github.com/lukebakken/rabbitmq-users-hgRmhpL8Y6o).

You can check out the README for details about running the app to reproduce the issue.

Now I'll see what's happening under the covers that prevents new connections during queue eviction.

Thanks -
Luke

Antoine Galataud

Dec 18, 2017, 4:14:28 PM
to rabbitm...@googlegroups.com
Hi Luke,

That's good news, in a way. Were you able to reproduce it with a deployment where the network was involved (a cluster in the cloud)? Your tests were AMQP-only, so this looks more like a general problem than just an MQTT implementation issue.

Antoine 

Luke Bakken

Dec 18, 2017, 4:21:59 PM
to rabbitmq-users
I can reproduce it on my local workstation, no network (other than loopback) involved.


Antoine Galataud

Dec 21, 2017, 10:21:40 AM
to rabbitmq-users
Hi,

How can we follow up on this? I'd be interested to know whether the investigation has stalled (reproduced, but the root cause won't be investigated), whether a GitHub issue has been created (I didn't find one; I can create one if needed), and whether, basically, there's a chance of seeing a fix in 3.6, in 3.7, or not at all.

Let me know if I can help in any matter.
Antoine

Luke Bakken

Dec 21, 2017, 1:15:52 PM
to rabbitmq-users
Hi Antoine -

I've discussed this issue with other team members and, most likely, this is due to mnesia lock contention while cleaning up queue metadata. I'm doing an eprof investigation to confirm that. I can't make any promises about prioritizing a fix for this behavior, but I have added it as a work item in our internal issue tracker.

One thing I have noticed is that (at least in my test application) it is not the re-connections that are blocked, but the queue.declare operation. In order not to overload the TCP accept queue, my test application only attempts 2K connections and queue declares in a "batch" before proceeding to the next batch. So the first batch of queue declarations is blocked for a while, but they eventually succeed, and the rest of the connection / declaration batches succeed as well.
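
(In Java, that batching strategy might look like the sketch below; BATCH_SIZE and the connectOne helper are illustrative stand-ins, not the actual Go test code.)

import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.CompletableFuture;

public class BatchedConnector {

    static final int BATCH_SIZE = 2000; // the 2K batches mentioned above

    // Hypothetical helper: opens one connection and declares its queue,
    // completing the future once queue.declare has returned.
    static CompletableFuture<Void> connectOne(int clientIndex) {
        return CompletableFuture.runAsync(() -> {
            // ... connect + queueDeclare for client #clientIndex ...
        });
    }

    public static void main(String[] args) {
        int totalClients = 25000;
        for (int start = 0; start < totalClients; start += BATCH_SIZE) {
            List<CompletableFuture<Void>> batch = new ArrayList<>();
            for (int i = start; i < Math.min(start + BATCH_SIZE, totalClients); i++) {
                batch.add(connectOne(i));
            }
            // Wait for the whole batch before opening more sockets, so the
            // broker's TCP accept queue is never flooded with un-served connections.
            CompletableFuture.allOf(batch.toArray(new CompletableFuture[0])).join();
        }
    }
}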

If you have control over how your MQTT clients re-connect, I suggest following a similar strategy. Or, if the client can't connect successfully, use a backoff and retry until it does.

As I find more information I will update this discussion thread, and if I find enough to file an issue I will do that and report it back here.

Thanks -
Luke

Antoine Galataud

Dec 22, 2017, 3:01:59 AM
to rabbitm...@googlegroups.com
Hi Luke,

Thanks for sharing these details.
We also noticed that connections were accepted while reproducing the problem. We actually have backoff in place in our devices, but it's reset as soon as a connection is accepted. We're currently working on extending the mechanism so that it's only reset once the first MQTT heartbeat response is received (which only happens when the connection is fully working, the queue declared, etc.).
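
(A rough sketch of that client-side logic in Paho terms. Since Paho doesn't expose PINGRESP directly, the "connection verified" point is approximated here by a successful synchronous subscribe; names and timings are illustrative.)

import org.eclipse.paho.client.mqttv3.MqttClient;
import org.eclipse.paho.client.mqttv3.MqttConnectOptions;
import org.eclipse.paho.client.mqttv3.persist.MemoryPersistence;

public class BackoffReconnectSketch {
    public static void main(String[] args) throws Exception {
        MqttConnectOptions opts = new MqttConnectOptions();
        opts.setCleanSession(true);
        opts.setKeepAliveInterval(15);

        MqttClient client = new MqttClient("tcp://BROKER_IP:1883", "some-client-id",
                new MemoryPersistence());

        long backoffMs = 1000;
        boolean verified = false;
        while (!verified) {
            try {
                client.connect(opts);
                // Don't trust CONNACK alone: during a mass queue eviction the broker
                // may accept the connection but never finish setting it up. Only
                // treat the connection as good after a full round-trip succeeds.
                client.subscribe("device/some-client-id/#", 1);
                verified = true;
                backoffMs = 1000; // reset only once the connection is verified working
            } catch (Exception e) {
                try { client.disconnectForcibly(); } catch (Exception ignored) { }
                Thread.sleep(backoffMs);
                backoffMs = Math.min(backoffMs * 2, 60000); // exponential, capped at 60s
            }
        }
    }
}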

Antoine

--
You received this message because you are subscribed to a topic in the Google Groups "rabbitmq-users" group.
To unsubscribe from this topic, visit https://groups.google.com/d/topic/rabbitmq-users/hgRmhpL8Y6o/unsubscribe.
To unsubscribe from this group and all its topics, send an email to rabbitmq-users+unsubscribe@googlegroups.com.
To post to this group, send email to rabbitmq-users@googlegroups.com.