Hi André,
Maybe we are not following the EULA correctly :/
We have contacted VerneMQ sales to ask about the price of a commercial license.
Now, regarding our setup problem:
We have fixed our last-will implementation and it is much more stable now.
However, we have a problem with split-brain scenarios.
We use VerneMQ version 1.11.0.
In a single k8s cluster we run 2 MQTT clusters; let's call them the CDG and ORY clusters.
Let's look at the CDG one.
In CDG, around 200 devices connect to this MQTT cluster. The MQTT cluster consists of 2 VerneMQ pods.
When we start the cluster initially, it works OK: all clients connect and the cluster is correctly formed.
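For context, this is roughly how we verify that the cluster has formed after a fresh start (namespace and pod names are taken from the logs below; the exact invocation may differ from our manifests):

  # check cluster membership as seen from pod 0
  kubectl -n sam-gate-cdg exec safedock-mqtt-0 -- vmq-admin cluster show

When everything is fine, both nodes are listed as running.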
Last week, over the weekend, our client restarted some nodes in the k8s cluster, which restarted the MQTT cluster for CDG as well.
After the restart, pod 0 did not re-join the cluster. See the attached logs.
First we tried to restart just pod 0, in order to make it re-join the cluster.
However, we got the following error instead:
11:25:09.764 [info] Try to start vmq_generic_msg_store: ok
12/16/2022 12:25:09 PM 11:25:09.877 [info] cluster event handler 'vmq_cluster' registered
12/16/2022 12:25:09 PM 11:25:09.924 [info] loaded 0 subscriptions into vmq_reg_trie
12/16/2022 12:25:10 PM 11:25:10.545 [error] can't auth publish [<<"ADB Safegate Safedock MQTT Default">>,{[],<<"ADBSG_Mop_0020980373fc_STATUS">>},1,[<<"SAFEDOCK">>,<<"DGS">>,<<"KEEPALIVE">>,<<"V1">>,<<"0020980373FC">>],<<"{\"apiVersion\":\"1.2.0\",\"data\":{\"state\":\"Offline\",\"sourceUID\":\"0020980373FC\",\"stationId\":\"LFPG\",\"terminalId\":\"Terminal_2\",\"locationId\":\"Terminal_2\",\"deviceId\":\"DGSC14\",\"deviceAddress\":\"172.28.97.68\",\"centerlines\":[{\"centerlineId\":\"C14\"}]}}">>,true] due to no_matching_hook_found
12/16/2022 12:25:10 PM 11:25:10.545 [warning] can't authenticate last will
12/16/2022 12:25:10 PM for client {[],<<"ADBSG_Mop_0020980373fc_STATUS">>} due to not_allowed
12/16/2022 12:25:10 PM 11:25:10.545 [error] can't auth publish [<<"ADB Safegate Safedock MQTT Default">>,{[],<<"ADBSG_Mop_002098037453_STATUS">>},1,[<<"SAFEDOCK">>,<<"DGS">>,<<"KEEPALIVE">>,<<"V1">>,<<"002098037453">>],<<"{\"apiVersion\":\"1.2.0\",\"data\":{\"state\":\"Offline\",\"sourceUID\":\"002098037453\",\"stationId\":\"LFPG\",\"terminalId\":\"Terminal_2\",\"locationId\":\"Terminal_2\",\"deviceId\":\"DGSC10_C12\",\"deviceAddress\":\"172.28.97.70\",\"centerlines\":[{\"centerlineId\":\"C10\"},{\"centerlineId\":\"C12\"}]}}">>,true] due to no_matching_hook_found
12/16/2022 12:25:10 PM 11:25:10.545 [warning] can't authenticate last will
12/16/2022 12:25:10 PM for client {[],<<"ADBSG_Mop_002098037453_STATUS">>} due to not_allowed
12/16/2022 12:25:11 PM 11:25:11.265 [info] Sent join request to: 'Ver...@safedock-mqtt-1.safedock-mqtt.sam-gate-cdg.svc.cluster.local'
12/16/2022 12:25:11 PM 11:25:11.278 [info] Unable to connect to 'Ver...@safedock-mqtt-1.safedock-mqtt.sam-gate-cdg.svc.cluster.local'
Since restarting pod 0 did not work, we then restarted pod-1 (which looked like the leader). Up until this point pod-1 had held all the connections.
The restart did the trick: after restarting, pod-1 could join pod-0.
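Is re-issuing the join by hand (instead of restarting pods) the intended recovery for this situation? Something like the following, where the discovery-node value is the full node name of pod 1 that is shortened to 'Ver...' in the log above:

  # explicitly ask pod 0 to join pod 1 again
  kubectl -n sam-gate-cdg exec safedock-mqtt-0 -- \
    vmq-admin cluster join discovery-node='<full node name of pod 1>'

We did not try this at the time, so we do not know whether it would have succeeded.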
Why does an individual pod try to re-join the cluster only once? We would expect it to keep retrying.
Is there a known bug related to cluster formation in this version?
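If there is no built-in retry, a workaround we are considering is a small loop run inside the pod after startup (a sketch only, not something we have deployed; the node name is a placeholder):

  # keep trying to join pod 1 until it shows up in the cluster view
  for i in $(seq 1 30); do
    vmq-admin cluster show | grep -q 'safedock-mqtt-1' && break
    vmq-admin cluster join discovery-node='<full node name of pod 1>' || true
    sleep 10
  done

Would you recommend something like this, or is there a configuration option we are missing?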
Our client regularly updates the Linux servers on which k8s is running. When doing this they do not drain the node first but restart the VM directly.
While I agree this is not good practice, we expect the MQTT cluster to survive such failures.
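For the record, this is roughly the procedure we will ask them to follow instead (standard kubectl commands; exact flags may vary with their k8s version):

  kubectl drain <node> --ignore-daemonsets --delete-emptydir-data   # also cordons the node
  # ... reboot / update the VM ...
  kubectl uncordon <node>

Still, as said above, we would expect the MQTT cluster to recover from such restarts on its own.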
Best,
UK
On Monday, 5 December 2022 at 11:24:07 UTC+1, a...@erl.io wrote: