MQTT cannot accept a connection: client ID registration timed out

Roy

Nov 1, 2023, 3:00:11 PM

to rabbitmq-users

Hi, we recently upgraded our six-node RabbitMQ cluster from v3.8.3 to v3.11.13. The install was done using conda, with rabbitmq-server package v3.11.13 and erlang package v25.0.4. After the upgrade, however, we started seeing MQTT connection issues on one of the nodes. We noticed the following error messages in the log on the problematic node.

MQTT cannot accept connection <client-ip>:<port> -> <server-ip>:<port> due to an internal error or unavailable component

[error] <> MQTT cannot accept a connection: client ID registration timed out

Restarting the cluster did not help; we saw the same error afterwards. We usually restart the cluster by stopping the nodes in the order 6, 5, 4, 3, 2, 1 and then starting them in the order 1, 2, 3, 4, 5, 6. Since the problematic node kept logging the same error after the cluster restart, we tried restarting the RabbitMQ service on just that node. Immediately after that restart, we noticed the following info/error messages in the log, along with some successful MQTT connections.

2023-10-30 08:17:04.703399-05:00 [info] <0.1312.0> MQTT detected network error for "<client-ip>:<port> -> <server-ip>:<port>" timeout

2023-10-30 08:17:04.704050-05:00 [error] <0.1272.0> MQTT: a socket write failed, the socket might already be closed

…

2023-10-30 08:23:52.455705-05:00 [info] <0.9903.0> accepting MQTT connection <0.9903.0> (<client-ip>:<port> -> <server-ip>:<port>, client id: NY1020a62102c2Em00SF9o2)

2023-10-30 08:23:52.456803-05:00 [info] <0.9887.0> accepting MQTT connection <0.9887.0> (<client-ip>:<port> -> <server-ip>:<port>, client id: NY1020a6210814vS00SF1l2)
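For anyone triaging a similar log, here is a minimal Python sketch that tallies the MQTT messages quoted in this thread. The line layout in the regular expression is inferred only from the excerpts above, and the names `LOG_LINE`, `PATTERNS`, and `tally_mqtt_events` are made up for illustration; this is not a RabbitMQ tool.

```python
import re
from collections import Counter

# Line layout assumed from the excerpts in this thread:
# "<date> <time><utc-offset> [<level>] <pid> <message>"
LOG_LINE = re.compile(
    r"^(?P<ts>\d{4}-\d{2}-\d{2} \d{2}:\d{2}:\d{2}\.\d+[+-]\d{2}:\d{2}) "
    r"\[(?P<level>\w+)\] "
    r"<(?P<pid>[\d.]*)> "
    r"(?P<msg>.*)$"
)

# Substrings copied from the messages quoted in this thread.
PATTERNS = {
    "registration_timeout": "client ID registration timed out",
    "socket_write_failed": "a socket write failed",
    "network_error": "MQTT detected network error",
    "accepted": "accepting MQTT connection",
}


def tally_mqtt_events(lines):
    """Count how often each known MQTT message appears in the given log lines."""
    counts = Counter()
    for line in lines:
        match = LOG_LINE.match(line.strip())
        if not match:
            continue  # skip lines that do not follow the assumed layout
        for name, needle in PATTERNS.items():
            if needle in match.group("msg"):
                counts[name] += 1
    return counts


sample = [
    '2023-10-30 08:17:04.703399-05:00 [info] <0.1312.0> MQTT detected network error for "<client-ip>:<port> -> <server-ip>:<port>" timeout',
    "2023-10-30 08:17:04.704050-05:00 [error] <0.1272.0> MQTT: a socket write failed, the socket might already be closed",
    "2023-10-30 08:23:52.455705-05:00 [info] <0.9903.0> accepting MQTT connection <0.9903.0> (<client-ip>:<port> -> <server-ip>:<port>, client id: NY1020a62102c2Em00SF9o2)",
]
print(dict(tally_mqtt_events(sample)))
```

Feeding it the lines of a full node log (for example `open(path).read().splitlines()`) gives a rough picture of how many registration timeouts occurred relative to accepted connections.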

Since we were having trouble getting all our apps to connect to the broker in time, we were forced to roll back the upgrade by stopping the v3.11.13 version of RabbitMQ on the cluster and restarting the old version. The v3.11.13 and v3.8.3 installs use different Mnesia databases, which is why we were able to stop one version and start the other (in case you were wondering). Also, we have other v3.11.13 clusters (three-node and four-node) running successfully in our environment.

Does anybody have any idea why we were seeing the “MQTT cannot accept connection … due to an internal error or unavailable component” error? Has anyone seen this error message before? I appreciate any help in troubleshooting this issue.

I will attach the log from the problematic node to the case.

Thanks

Roy

rabbit@ny4e-supchp17.rabbit.jax.drw.log.1

rabbit@ny4e-supchp17.rabbit.jax.drw.log.0

rabbit@ny4e-supchp17.rabbit.jax.drw.log.2

Roy

Nov 1, 2023, 3:32:08 PM

to rabbitmq-users

...to add some more information: we noticed the following error message on one of the other, healthy nodes, in case it helps identify the root issue. The "this should not happen!" message is interesting.

2023-10-30 07:54:24.324259-05:00 [info] <0.784.0> Ready to start client connection listeners
2023-10-30 07:54:24.327024-05:00 [notice] <0.1702.0> mqtt_node: candidate -> leader in term: 4 machine version: 1
2023-10-30 07:54:24.341233-05:00 [info] <0.1788.0> started TCP listener on 10.98.8.25:5672
2023-10-30 07:54:24.375969-05:00 [error] <0.1702.0> mqtt_node: leader saw append_entries_rpc for same term 4 this should not happen!
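Some context on why that last line stands out: Ra is RabbitMQ's Raft implementation, and Raft allows at most one leader per term, so a leader receiving an append_entries_rpc for its own term suggests another member also believed it was leader in term 4. A minimal Python sketch for pulling these events out of a log follows; the regular expressions are inferred only from the two lines quoted above, and all names are hypothetical.

```python
import re

# Patterns inferred from the two log lines quoted in this message.
BECAME_LEADER = re.compile(
    r"(?P<server>\w+): candidate -> leader in term: (?P<term>\d+)"
)
SAME_TERM_RPC = re.compile(
    r"(?P<server>\w+): leader saw append_entries_rpc for same term (?P<term>\d+)"
)


def leadership_events(lines):
    """Return (event, server, term) tuples for the leadership messages found."""
    events = []
    for line in lines:
        match = BECAME_LEADER.search(line)
        if match:
            events.append(("became_leader", match.group("server"), int(match.group("term"))))
            continue
        match = SAME_TERM_RPC.search(line)
        if match:
            events.append(("conflicting_leader", match.group("server"), int(match.group("term"))))
    return events


sample = [
    "2023-10-30 07:54:24.327024-05:00 [notice] <0.1702.0> mqtt_node: candidate -> leader in term: 4 machine version: 1",
    "2023-10-30 07:54:24.375969-05:00 [error] <0.1702.0> mqtt_node: leader saw append_entries_rpc for same term 4 this should not happen!",
]
for event in leadership_events(sample):
    print(event)
```

Running this over the logs of every node and comparing terms would show whether more than one node claimed leadership in the same term.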

I am attaching the latest log from the problematic node (I was not able to attach it in my previous message) and the log from the node where I saw the error message posted above.

thanks

rabbit@ny4e-supchp17.rabbit.jax.drw.log

rabbit@nj1n-supchp09.rabbit.jax.drw.log

Roy

Nov 4, 2023, 4:30:18 PM

to rabbitmq-users

Hi, I would really appreciate any help with this issue. In a six-node cluster, we have noticed some nodes not being able to recognize the Ra leader. Sometimes after a restart of the cluster, one of the nodes comes back online with NO messages like "mqtt_node: detected a new leader" in the log, even after waiting for longer than 10 minutes. Usually I see the leader message within 5 minutes on a healthy node.

Is there a way to force the non-working node to identify the leader?

When I run the following command on any of our nodes (even a healthy node that accepts MQTT connections), I get the "system_not_started" output. What does that mean?

$ rabbitmqctl eval "ra:overview()."
system_not_started

Thanks

Luke Bakken

Nov 4, 2023, 5:56:15 PM

to rabbitmq-users

Hi Roy,

It has only been 3 days since your first message, so please be patient. We try to answer within a week's time.

I strongly suggest upgrading to the latest version of RabbitMQ (3.12.8), if for nothing else than the native MQTT support. The latest version of Erlang (26) also has improvements.

https://blog.rabbitmq.com/posts/2023/03/native-mqtt/

Luke Bakken

Nov 4, 2023, 5:56:35 PM

to rabbitmq-users

One other point - your cluster should (must) have an odd number of nodes.
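The reasoning behind the odd-number advice can be sketched with a little arithmetic. Majority-based consensus (Raft, which Ra implements) needs more than half of the members available, so an even-sized cluster tolerates no more failures than the next smaller odd-sized one, and an even network split leaves neither side with a majority. The function names below are made up for illustration.

```python
def quorum_size(nodes: int) -> int:
    """Smallest number of members that forms a strict majority."""
    return nodes // 2 + 1


def tolerated_failures(nodes: int) -> int:
    """How many members can be lost while a majority is still available."""
    return nodes - quorum_size(nodes)


for n in (3, 4, 5, 6, 7):
    print(f"{n} nodes: quorum {quorum_size(n)}, tolerates {tolerated_failures(n)} failure(s)")
```

For the six-node cluster in this thread the quorum is four members, so it tolerates two failures, the same as a five-node cluster would.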

Markus Gustavsson

Nov 5, 2023, 3:46:24 PM

to rabbitmq-users

Just to add to the above: I recently asked a similar question about ra:overview() showing not_started (although for quorum queues). The response might be helpful to you as well: https://github.com/rabbitmq/rabbitmq-server/discussions/9666

Roy

Nov 27, 2023, 6:07:03 PM

to rabbitmq-users

Hi Luke,

Please let me know if you got a chance to look at the logs and see what is causing this behavior (the "MQTT cannot accept a connection: client ID registration timed out" error) from time to time on some 3.11.13 nodes.

We are in the process of testing a rollout of 3.12.x as you suggested. Meanwhile, we still have some 3.11.13 nodes/clusters, and would really appreciate it if you could shed some light on this error and a possible workaround.

Hi Markus,

Thanks for linking the previous post. Unfortunately, that link doesn't seem to explain the root cause of the issue or the steps to resolve it.

Thanks

Luke Bakken

Nov 28, 2023, 9:08:48 AM

to rabbitmq-users

Hey Roy,

I have been very busy working on the version 7 release of the .NET client and have had to exclusively focus on that.

I strongly suggest using the latest version of RabbitMQ which includes much more performant MQTT support.

Roy

Nov 28, 2023, 10:48:43 AM

to rabbitmq-users

Hi Luke,

Appreciate your quick response. I will work on upgrading our clusters to the latest version, rather than spending too much time fighting with this issue.

That said, without looking into the logs (or spending too much time on it), would you be able to let me know how to trigger a leader election in v3.11.13? That way, at least, I can rectify the problematic node, while I upgrade all our clusters in our environment to 3.12.x? I wasn’t able to find any doc on it online.

I am not too familiar with Erlang expressions, but when I run “rabbitmqctl eval "ra:members('mqtt_node')."” on the problematic node I get “{timeout,mqtt_node}”, while the healthy nodes respond with the following output, which includes the problematic node as well.

$ rabbitmqctl eval "ra:members('mqtt_node')."
{ok,[{mqtt_node,'rab...@rabbithost01.rabbit.company'},
     {mqtt_node,'rab...@rabbithost02.rabbit.company'}],
    {mqtt_node,'rab...@rabbithost03.rabbit.company'}}
$

Running “rabbitmqctl eval "ra:trigger_election('mqtt_node')."” from healthy nodes doesn’t fix the issue either.
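To compare how each node answers the membership query, a small Python wrapper around the same `rabbitmqctl eval` call could look like the sketch below. It assumes `rabbitmqctl` is on the PATH and that the `--node` option can reach the target node from where the script runs; the helper names and the node name in the usage note are hypothetical, and the classification only looks at the two reply shapes quoted in this message.

```python
import subprocess

# Expression taken from the command quoted in this message.
MEMBERS_EXPR = "ra:members('mqtt_node')."


def build_eval_command(node, expr=MEMBERS_EXPR):
    """Build the rabbitmqctl invocation for one target node."""
    return ["rabbitmqctl", "--node", node, "eval", expr]


def classify_members_output(output):
    """Map the raw eval output to 'ok', 'timeout', or 'unknown'."""
    text = output.strip()
    if text.startswith("{ok,"):
        return "ok"
    if text.startswith("{timeout,"):
        return "timeout"
    return "unknown"


def check_node(node, timeout_s=30):
    """Run the query against one node (needs rabbitmqctl and cluster access)."""
    try:
        result = subprocess.run(
            build_eval_command(node),
            capture_output=True,
            text=True,
            timeout=timeout_s,
        )
    except subprocess.TimeoutExpired:
        return "timeout"
    return classify_members_output(result.stdout)
```

Calling `check_node("rabbit@host1")` for each cluster member (with real node names substituted) would show which nodes see the Ra members and which time out. It only reports the state; it does not trigger an election or repair anything.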

I would appreciate any pointers on leader election commands (if that is something you can provide without spending too much time on this case), so that I can avoid restarting the node.

Thank you!

Roy

Luke Bakken

Dec 11, 2023, 1:24:47 PM

to rabbitmq-users

Hi Roy,

Here is what a colleague of mine said when I brought your question to his attention:

"This timeout can occur if there is currently no Ra leader. I suggest to upgrade to 3.12 and enable feature flag delete_ra_cluster_mqtt_node which will delete the Ra cluster and therefore prevent that timeout. With that feature flag enabled, MQTT client IDs will be stored in pg instead of Ra"
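As a rough sketch of that last step, the feature flag named in the quote can be enabled with `rabbitmqctl enable_feature_flag` once the cluster is on 3.12. The Python helper below only assembles the two commands by default (`dry_run=True`) so they can be reviewed first; the helper name and the dry-run behavior are illustrative choices, not something from the reply. As far as I understand, a feature flag can only be enabled once every node in the cluster runs a version that supports it.

```python
import subprocess

# Flag name taken from the reply quoted above.
FLAG = "delete_ra_cluster_mqtt_node"


def enable_flag(flag=FLAG, dry_run=True):
    """Return the rabbitmqctl commands; execute them only when dry_run is False."""
    commands = [
        ["rabbitmqctl", "list_feature_flags"],          # inspect current flag states
        ["rabbitmqctl", "enable_feature_flag", flag],   # enable the requested flag
    ]
    if not dry_run:
        for cmd in commands:
            subprocess.run(cmd, check=True)
    return commands


for cmd in enable_flag():
    print(" ".join(cmd))
```

Calling `enable_flag(dry_run=False)` on a node of the upgraded cluster would run both commands for real.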
