Hi, We recently upgraded our six-node RabbitMQ cluster from v3.8.3 to v3.11.13. The install was done using conda, with rabbitmq-server package v3.11.13 and erlang package v25.0.4. After the upgrade, we started seeing MQTT connection issues on one of the nodes. We noticed the following error messages in the log on the problematic node:
MQTT cannot accept connection <client-ip>:<port> -> <server-ip>:<port> due to an internal error or unavailable component
[error] <> MQTT cannot accept a connection: client ID registration timed out
Restarting the cluster did not help; we kept seeing the same error afterwards. We usually restart the cluster in the following order: stop the nodes in the order 6, 5, 4, 3, 2, 1, then start them in the order 1, 2, 3, 4, 5, 6. Since the problematic node showed the same error even after the full cluster restart, we tried restarting the RabbitMQ service on just that node. Immediately after that restart, we noticed the following info/error messages in the log, along with some successful MQTT connections:
2023-10-30 08:17:04.703399-05:00 [info] <0.1312.0> MQTT detected network error for "<client-ip>:<port> -> <server-ip>:<port>" timeout
2023-10-30 08:17:04.704050-05:00 [error] <0.1272.0> MQTT: a socket write failed, the socket might already be closed
…
2023-10-30 08:23:52.455705-05:00 [info] <0.9903.0> accepting MQTT connection <0.9903.0> (<client-ip>:<port> -> <server-ip>:<port>, client id: NY1020a62102c2Em00SF9o2)
2023-10-30 08:23:52.456803-05:00 [info] <0.9887.0> accepting MQTT connection <0.9887.0> (<client-ip>:<port> -> <server-ip>:<port>, client id: NY1020a6210814vS00SF1l2)
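For clarity, the full-cluster restart order described before the log excerpt can be sketched as below. This is only an illustration of the ordering: the function names are made up for this sketch, and the actual stop/start command is deliberately left out because it depends on how the service is managed on each host.

```shell
#!/bin/sh
# Sketch of the restart ordering only; stop_order/start_order are
# hypothetical helper names, not RabbitMQ tooling.

# Stop order: highest-numbered node first, e.g. 6 5 4 3 2 1.
stop_order() {
  i=$1
  out=""
  while [ "$i" -ge 1 ]; do
    out="$out $i"
    i=$((i - 1))
  done
  echo "${out# }"
}

# Start order: the reverse of the stop order, e.g. 1 2 3 4 5 6,
# so the node stopped last is started first.
start_order() {
  i=1
  out=""
  while [ "$i" -le "$1" ]; do
    out="$out $i"
    i=$((i + 1))
  done
  echo "${out# }"
}

echo "stop:  $(stop_order 6)"
echo "start: $(start_order 6)"
```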
Since we were having trouble getting all our apps to connect to the broker in time, we were forced to roll back the upgrade by stopping v3.11.13 of RabbitMQ on the cluster and restarting the old version. v3.11.13 and v3.8.3 use different mnesia databases, which is why we were able to stop one version and start the other (in case you were wondering). Also, we have other v3.11.13 clusters (3-node and 4-node) running successfully in our environment.
Does anybody have any idea why we were seeing the “MQTT cannot accept connection … due to an internal error or unavailable component” error? Has anyone seen this error message before? I would appreciate any help in troubleshooting this issue.
I will attach the log from the problematic node to the case.
Thanks
Roy
Hi Luke,
I appreciate your quick response. I will work on upgrading our clusters to the latest version rather than spending too much time fighting this issue.
That said, without looking into the logs (or spending too much time on it), would you be able to let me know how to trigger a leader election in v3.11.13? That way I can at least fix the problematic node while I upgrade all the clusters in our environment to 3.12.x. I wasn’t able to find any documentation on it online.
I am not too familiar with Erlang expressions, but when I run “rabbitmqctl eval "ra:members('mqtt_node')."” on the problematic node, I get “{timeout,mqtt_node}”. The healthy nodes respond with the following output, which includes the problematic node as well:
$ rabbitmqctl eval "ra:members('mqtt_node')."
{ok,[{mqtt_node,'rab...@rabbithost01.rabbit.company'},
{mqtt_node,'rab...@rabbithost02.rabbit.company'}],
{mqtt_node,'rab...@rabbithost03.rabbit.company'}}
$
Running “rabbitmqctl eval "ra:trigger_election('mqtt_node')."” from the healthy nodes doesn’t fix the issue either.
I would appreciate some pointers on leader election commands (if that is something you can do without spending too much time on this case), so that I can avoid restarting the node.
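To compare what each node reports, here is a minimal sketch that builds the same ra:members check for every node. The node names are hypothetical placeholders (not our real ones), members_cmd is a made-up helper name, and the script only prints the commands rather than running them; -n is rabbitmqctl's option for choosing the target node.

```shell
#!/bin/sh
# Hypothetical node names; substitute the real cluster members.
NODES="rabbit@host01 rabbit@host02 rabbit@host03"

# Print (do not run) the rabbitmqctl command that asks one node
# for its view of the mqtt_node membership.
members_cmd() {
  printf "rabbitmqctl -n %s eval \"ra:members('mqtt_node').\"\n" "$1"
}

# Emit one command per node so the outputs can be compared side by side.
for node in $NODES; do
  members_cmd "$node"
done
```

The printed lines can be pasted into a shell on a host with rabbitmqctl available; a node that answers with {timeout,mqtt_node} while the others return a member list is the one to look at.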
Thank you!
Roy