issue of status replication


Uroš Kolarič

Nov 28, 2022, 8:05:19 AM
to vernemq-users
Hi All,

We are running VerneMQ in k8s as an MQTT cluster. We have many IoT devices that connect to this MQTT cluster. Connections go through a dedicated load balancer (a hardware box).

Here is the simplified picture:

MQTT setup simplified.png

On initial setup the LB splits device connections between the MQTT pods.

verne status page.png

We use a KEEPALIVE mechanism to see whether our IoT devices are connected.
mqtt keepalive.png

Over time our IoT devices disconnect and reconnect. When this happens they may connect to a different pod in the MQTT cluster than last time.
As a result, in one MQTT pod a device is marked as KEEPALIVE offline while it is online in another one. When our dedicated backend software connects to MQTT it can get mixed results.
What are we doing wrong? Do we need to set some policy that removes offline KEEPALIVEs when the MQTT cluster does not sync the OFFLINE and ONLINE KEEPALIVE messages?
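
For illustration, the device side of our KEEPALIVE scheme looks roughly like the sketch below (Python with the paho-mqtt 1.x client; the broker address, client id, and topic are placeholders rather than our real values):

# Minimal sketch of the device-side KEEPALIVE/presence scheme (placeholders only).
import json
import paho.mqtt.client as mqtt

BROKER = "mqtt.example.com"              # placeholder: the LB in front of the VerneMQ pods
PRESENCE_TOPIC = "KEEPALIVE/DEVICE_ID"   # placeholder: one status topic per device

client = mqtt.Client(client_id="DEVICE_ID", clean_session=True)

# Last will: the broker publishes a retained "Offline" state if the connection drops.
client.will_set(PRESENCE_TOPIC, payload=json.dumps({"state": "Offline"}), qos=1, retain=True)

client.connect(BROKER, 1883, keepalive=60)

# Once connected, overwrite the retained state with "Online".
client.publish(PRESENCE_TOPIC, payload=json.dumps({"state": "Online"}), qos=1, retain=True)

client.loop_forever()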


Best Uros

André Fatton

Nov 28, 2022, 10:27:50 AM
to vernemq-users
Hello Uros,

It is unclear to me what you are asking, and what you mean by getting "mixed results". Can you take the guesswork out for me?
Happy to try to help as soon as I understand.

As a side note: please be aware that using the binary VerneMQ packages requires a paid subscription (as per the EULA).
Kind regards,
André

Jarred Utt

Dec 2, 2022, 9:56:28 AM
to vernemq-users
Hi!

Mainly we have an issue when a device goes offline and the keepalive message expires (done via last will). Then when that same device is healthy again and goes back online this message is not distributed to all nodes within the cluster so we get mixed results in our consumers depending on which node they end up connected to. We have not had any success in identifying a root cause.
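
For context, the consumer side is essentially just a subscription to those status topics. A stripped-down sketch of what a consumer does (Python paho-mqtt 1.x; the broker address and topic filter are placeholders, not our actual code):

# Minimal sketch of a status consumer (placeholders only).
import paho.mqtt.client as mqtt

BROKER = "mqtt.example.com"     # placeholder: consumers connect through the same LB
STATUS_FILTER = "KEEPALIVE/+"   # placeholder: filter matching all device status topics

def on_connect(client, userdata, flags, rc):
    # Re-subscribe on every (re)connect so retained state is delivered again.
    client.subscribe(STATUS_FILTER, qos=1)

def on_message(client, userdata, msg):
    # msg.retain distinguishes the stored state from a live update.
    kind = "retained" if msg.retain else "live"
    print(f"{msg.topic}: {msg.payload.decode()} ({kind})")

client = mqtt.Client(client_id="status-consumer", clean_session=False)
client.on_connect = on_connect
client.on_message = on_message
client.connect(BROKER, 1883, keepalive=60)
client.loop_forever()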

We are building and maintaining our own VerneMQ docker container in compliance with the documentation :D

André Fatton

Dec 5, 2022, 5:24:07 AM
to vernemq-users
> Hi!
>
> Mainly we have an issue when a device goes offline and the keepalive message expires (done via last will). Then when that same device is healthy again and goes back online this message is not distributed to all nodes within the cluster so we get mixed results in our consumers depending on which node they end up connected to. We have not had any success in identifying a root cause.

Hi,
Terminology: keepalive messages are the MQTT PINGs that a client sends to the broker. Since you mention last will, what you call a keepalive message is obviously not the standard understanding. You seem to be building some sort of application-side presence service. That is, you seem to send a message when the client goes offline (using LWT), and then send another message when it comes online. You seem to expect that this "presence" message goes to all cluster nodes, so you must have subscribers for that presence topic on all cluster nodes (why?). Also, how do those subscribers behave? Are those shared subscribers?
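
To make that last question concrete: the difference is only in the topic filter used at subscribe time, using VerneMQ's $share/group/topic syntax. A sketch with made-up topic and group names:

# Sketch: normal vs. shared subscription (made-up topic and group names).
import paho.mqtt.client as mqtt

client = mqtt.Client(client_id="backend-consumer")
client.connect("mqtt.example.com", 1883, keepalive=60)

# Normal subscription: every subscriber receives every presence message.
client.subscribe("presence/+", qos=1)

# Shared subscription: messages on the topic are load-balanced across
# the members of the "backend" group instead of going to all of them.
client.subscribe("$share/backend/presence/+", qos=1)

client.loop_forever()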

Maybe you do the second part with a plugin implementing the auth_on_register hook. But that's just another guess.
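
If it is done via vmq_webhooks rather than a native Erlang plugin, the hook endpoint is just an HTTP handler. A very rough sketch (Python standard library; the exact request and response JSON formats are described in the vmq_webhooks docs, so treat the fields below as illustrative):

# Very rough sketch of a vmq_webhooks endpoint handling auth_on_register.
# Check the payload/response shapes against the vmq_webhooks docs; this only
# shows the general idea (accept the registration and, e.g., mark the client
# as online in some external store).
import json
from http.server import BaseHTTPRequestHandler, HTTPServer

class HookHandler(BaseHTTPRequestHandler):
    def do_POST(self):
        hook = self.headers.get("vernemq-hook", "")            # e.g. "auth_on_register"
        body = self.rfile.read(int(self.headers.get("Content-Length", 0)))
        event = json.loads(body or b"{}")                       # client details per the docs

        if hook == "auth_on_register":
            print("register:", event.get("client_id"))          # illustrative field name

        self.send_response(200)
        self.send_header("Content-Type", "application/json")
        self.end_headers()
        self.wfile.write(json.dumps({"result": "ok"}).encode())

HTTPServer(("0.0.0.0", 8080), HookHandler).serve_forever()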


> We are building and maintaining our own VerneMQ docker container in compliance with the documentation :D

Alright, thanks :)
For the avoidance of doubt, your Dockerfile has to compile Verne and *not* pull the EULA-based tar.gz (see https://github.com/vernemq/docker-vernemq/blob/9f4956a7e06b95368f4dc5d27bf1e377149de020/Dockerfile#L20)

Best,
André

Uroš Kolarič

Dec 16, 2022, 7:53:36 AM
to vernemq-users
Hi André

Maybe we are not following the EULA correctly :/
We have contacted VerneMQ sales to ask about the price of a commercial license.

Regarding our setup problem:
We have fixed our last-will implementation and it is much more stable now.
However, we now have a problem with split-brain scenarios.

We use VerneMQ version 1.11.0.

In a single k8s cluster we have 2 MQTT clusters. Let's call them the CDG and ORY clusters.
Let's look at the CDG one.

In CDG around 200 devices connect to this MQTT cluster. The MQTT cluster consists of 2 VerneMQ pods.
When we start the cluster initially it works OK: all clients connect and the cluster is correctly formed.

Last week, over the weekend, our client restarted some nodes in the k8s cluster, which restarted the MQTT cluster for CDG as well.
After the restart, pod-0 did not re-join the cluster. See the attached logs.
First we tried to restart just pod-0, in order to make it re-join the cluster.
However we got this error instead:

11:25:09.764 [info] Try to start vmq_generic_msg_store: ok
11:25:09.877 [info] cluster event handler 'vmq_cluster' registered
11:25:09.924 [info] loaded 0 subscriptions into vmq_reg_trie
11:25:10.545 [error] can't auth publish [<<"ADB Safegate Safedock MQTT Default">>,{[],<<"ADBSG_Mop_0020980373fc_STATUS">>},1,[<<"SAFEDOCK">>,<<"DGS">>,<<"KEEPALIVE">>,<<"V1">>,<<"0020980373FC">>],<<"{\"apiVersion\":\"1.2.0\",\"data\":{\"state\":\"Offline\",\"sourceUID\":\"0020980373FC\",\"stationId\":\"LFPG\",\"terminalId\":\"Terminal_2\",\"locationId\":\"Terminal_2\",\"deviceId\":\"DGSC14\",\"deviceAddress\":\"172.28.97.68\",\"centerlines\":[{\"centerlineId\":\"C14\"}]}}">>,true] due to no_matching_hook_found
11:25:10.545 [warning] can't authenticate last will for client {[],<<"ADBSG_Mop_0020980373fc_STATUS">>} due to not_allowed
11:25:10.545 [error] can't auth publish [<<"ADB Safegate Safedock MQTT Default">>,{[],<<"ADBSG_Mop_002098037453_STATUS">>},1,[<<"SAFEDOCK">>,<<"DGS">>,<<"KEEPALIVE">>,<<"V1">>,<<"002098037453">>],<<"{\"apiVersion\":\"1.2.0\",\"data\":{\"state\":\"Offline\",\"sourceUID\":\"002098037453\",\"stationId\":\"LFPG\",\"terminalId\":\"Terminal_2\",\"locationId\":\"Terminal_2\",\"deviceId\":\"DGSC10_C12\",\"deviceAddress\":\"172.28.97.70\",\"centerlines\":[{\"centerlineId\":\"C10\"},{\"centerlineId\":\"C12\"}]}}">>,true] due to no_matching_hook_found
11:25:10.545 [warning] can't authenticate last will for client {[],<<"ADBSG_Mop_002098037453_STATUS">>} due to not_allowed
11:25:11.265 [info] Sent join request to: 'Ver...@safedock-mqtt-1.safedock-mqtt.sam-gate-cdg.svc.cluster.local'
11:25:11.278 [info] Unable to connect to 'Ver...@safedock-mqtt-1.safedock-mqtt.sam-gate-cdg.svc.cluster.local'

Since this did not work, we then restarted pod-1 (which looked like the leader). Up until this point pod-1 had all the connections.
The restart did the trick: after the restart, pod-1 could join pod-0.

Why does an individual pod try to re-join the cluster only once? We would expect it to keep retrying.
Is there some known bug linked to cluster formation in this version?
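
For reference, I assume the manual way to check the cluster state and re-trigger the join from inside pod-0 would be something along these lines (vmq-admin, per the VerneMQ clustering docs; the discovery node name below is a placeholder, not our actual node name):

# Inside the pod that failed to join (pod-0); the node name is a placeholder.
vmq-admin cluster show
vmq-admin cluster join discovery-node=VerneMQ@<pod-1-dns-name>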

Our client regularly updates the Linux servers on which k8s is running. When doing this they do not cordon and drain the node first but restart the VM directly.
While I agree this is not good practice, we expect the MQTT cluster to survive such failures.

Best UK

On Monday, 5 December 2022 at 11:24:07 UTC+1, a...@erl.io wrote:
safedock-mqtt-0.cdg.log
safedock-mqtt-1.cdg.log