VerneMQ crashing repeatedly.

Eduardo Hering

Apr 1, 2021, 10:48:51 AM4/1/21
to vernem...@googlegroups.com
Hello!

I have two VerneMQ pods created by the VerneMQ Kubernetes Operator. When I drive traffic to them, the number of connections keeps rising to less than 1,000 and then falling back to zero, repeatedly. The pod logs show the following repeated messages:

[…]
14:30:10.609 [error] Ranch listener {{0,0,0,0},1883} terminated with reason: bad argument in vmq_swc_store_batcher:add_to_batch/2 line 28
14:30:10.610 [error] CRASH REPORT Process <0.12525.0> with 0 neighbours crashed with reason: bad argument in vmq_swc_store_batcher:add_to_batch/2 line 28
14:30:10.611 [error] Ranch listener {{0,0,0,0},1883} terminated with reason: bad argument in vmq_swc_store_batcher:add_to_batch/2 line 28
14:30:10.611 [error] CRASH REPORT Process <0.12511.0> with 0 neighbours crashed with reason: bad argument in vmq_swc_store_batcher:add_to_batch/2 line 28
14:30:10.611 [error] Ranch listener {{0,0,0,0},1883} terminated with reason: bad argument in vmq_swc_store_batcher:add_to_batch/2 line 28
14:30:10.633 [error] vmq_queue process <0.13224.0> exit for subscriber {[],<<"68:02:b8:62:cf:d3">>} due to {badarg,[{vmq_swc_store_batcher,add_to_batch,2,[{file,"/vernemq-build/apps/vmq_swc/src/vmq_swc_store_batcher.erl"},{line,28}]},{vmq_swc_store,enqueue_op_sync,2,[{file,"/vernemq-build/apps/vmq_swc/src/vmq_swc_store.erl"},{line,672}]},{vmq_swc_metrics,timed_measurement,4,[{file,"/vernemq-build/apps/vmq_swc/src/vmq_swc_metrics.erl"},{line,102}]},{vmq_queue,handle_session_down,3,[{file,"/vernemq-build/apps/vmq_server/src/vmq_queue.erl"},{line,715}]},{gen_fsm,handle_msg,8,[{file,"gen_fsm.erl"},{line,486}]},{proc_lib,init_p_do_apply,3,[{file,"proc_lib.erl"},{line,249}]}]}
14:30:10.645 [error] vmq_queue process <0.13226.0> exit for subscriber {[],<<"68:02:b8:62:e6:50">>} due to {badarg,[{vmq_swc_store_batcher,add_to_batch,2,[{file,"/vernemq-build/apps/vmq_swc/src/vmq_swc_store_batcher.erl"},{line,28}]},{vmq_swc_store,enqueue_op_sync,2,[{file,"/vernemq-build/apps/vmq_swc/src/vmq_swc_store.erl"},{line,672}]},{vmq_swc_metrics,timed_measurement,4,[{file,"/vernemq-build/apps/vmq_swc/src/vmq_swc_metrics.erl"},{line,102}]},{vmq_queue,handle_session_down,3,[{file,"/vernemq-build/apps/vmq_server/src/vmq_queue.erl"},{line,715}]},{gen_fsm,handle_msg,8,[{file,"gen_fsm.erl"},{line,486}]},{proc_lib,init_p_do_apply,3,[{file,"proc_lib.erl"},{line,249}]}]}
14:30:10.653 [error] vmq_queue process <0.13244.0> exit for subscriber {[],<<"68:02:b8:62:f7:66">>} due to {badarg,[{vmq_swc_store_batcher,add_to_batch,2,[{file,"/vernemq-build/apps/vmq_swc/src/vmq_swc_store_batcher.erl"},{line,28}]},{vmq_swc_store,enqueue_op_sync,2,[{file,"/vernemq-build/apps/vmq_swc/src/vmq_swc_store.erl"},{line,672}]},{vmq_swc_metrics,timed_measurement,4,[{file,"/vernemq-build/apps/vmq_swc/src/vmq_swc_metrics.erl"},{line,102}]},{vmq_queue,handle_session_down,3,[{file,"/vernemq-build/apps/vmq_server/src/vmq_queue.erl"},{line,715}]},{gen_fsm,handle_msg,8,[{file,"gen_fsm.erl"},{line,486}]},{proc_lib,init_p_do_apply,3,[{file,"proc_lib.erl"},{line,249}]}]}
14:30:10.654 [error] vmq_queue process <0.13266.0> exit for subscriber {[],<<"68:02:b8:5e:00:c8">>} due to {badarg,[{vmq_swc_store_batcher,add_to_batch,2,[{file,"/vernemq-build/apps/vmq_swc/src/vmq_swc_store_batcher.erl"},{line,28}]},{vmq_swc_store,enqueue_op_sync,2,[{file,"/vernemq-build/apps/vmq_swc/src/vmq_swc_store.erl"},{line,672}]},{vmq_swc_metrics,timed_measurement,4,[{file,"/vernemq-build/apps/vmq_swc/src/vmq_swc_metrics.erl"},{line,102}]},{vmq_queue,handle_session_down,3,[{file,"/vernemq-build/apps/vmq_server/src/vmq_queue.erl"},{line,715}]},{gen_fsm,handle_msg,8,[{file,"gen_fsm.erl"},{line,486}]},{proc_lib,init_p_do_apply,3,[{file,"proc_lib.erl"},{line,249}]}]}
[…]
14:30:10.754 [info] Sent join request to: 'v...@vernemq-k8s-1.vernemq-k8s-service.mqtt.svc.cluster.local'
14:30:10.759 [info] Unable to connect to 'v...@vernemq-k8s-1.vernemq-k8s-service.mqtt.svc.cluster.local'
14:30:10.760 [error] Execute error: ["vmq-admin","cluster","join","discovery-node=v...@vernemq-k8s-1.vernemq-k8s-service.mqtt.svc.cluster.local"] "Couldn't join cluster due to not_reachable\n"
[os_mon] memory supervisor port (memsup): Erlang has closed
[os_mon] cpu supervisor port (cpu_sup): Erlang has closed
14:30:11.615 [warning] lager_error_logger_h dropped 991 messages in the last second that exceeded the limit of 100 messages/sec
{"Kernel pid terminated",application_controller,"{application_terminated,vmq_server,shutdown}"}
Kernel pid terminated (application_controller) ({application_terminated,vmq_server,shutdown})
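
I can pull more details from the pods and the cluster state if that helps; roughly like this (namespace and pod names are taken from the log above and may need adjusting):

# pod status and restart counts in the broker namespace
kubectl -n mqtt get pods -o wide

# recent broker logs from one pod
kubectl -n mqtt logs vernemq-k8s-0 --tail=200

# cluster membership as seen from inside a pod
kubectl -n mqtt exec -it vernemq-k8s-0 -- vmq-admin cluster show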

Can somebody please help? Any information would be useful.

Best regards,

Eduardo

André Fatton

Apr 1, 2021, 11:23:00 AM4/1/21
to vernemq-users

Hi Eduardo,

What does "driving traffic" to the cluster mean? That is, how do you actually test?

Do you separate test runs so that you don't have historical state in the cluster? Has the cluster seen restarts/rescheduling of nodes due to Kubernetes? (see https://github.com/vernemq/docker-vernemq/pull/264 for possible context).
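
(If you need a clean slate between test runs, wiping the on-disk state of the pods before restarting them is usually enough. A rough sketch only; the namespace and pod names come from your log, and the data path assumes the defaults of the docker-vernemq image:)

# stop the test traffic first, then per pod:
kubectl -n mqtt exec vernemq-k8s-0 -- sh -c 'rm -rf /vernemq/data/*'   # default data dir of the Docker image (assumption)
kubectl -n mqtt delete pod vernemq-k8s-0                               # the StatefulSet/Operator recreates it with empty state
# repeat for vernemq-k8s-1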

Have you also tested with the Plumtree protocol instead of SWC?
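
(For reference, switching the metadata plugin is a one-line change; a sketch, assuming the standard vernemq.conf key and the usual docker-vernemq environment mapping:)

# vernemq.conf
metadata_plugin = vmq_plumtree

# or as an environment variable on the container
DOCKER_VERNEMQ_METADATA_PLUGIN=vmq_plumtree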

All in all, this looks to me like SWC sync-state problems caused by terminated (i.e. "leave"'d) cluster nodes that re-join with existing state.

Best,

André

Eduardo Hering

Apr 1, 2021, 2:38:05 PM4/1/21
to vernem...@googlegroups.com
Hi André.

What does "driving traffic" to the cluster mean? That is, how do you actually test?


It means I have two subdomains that I can point at the cluster load balancer: one with a few hundred connections and the other with more than a hundred thousand.

I’ve got two VerneMQ pods, one on each node (8 cores, 32 GB RAM per node).
I also have the HAProxy Ingress Controller distributing incoming traffic directly to the pods using proxy protocol v2.
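
The listener side is set up along these lines (a rough sketch of the relevant vernemq.conf settings, not the exact config):

listener.tcp.default = 0.0.0.0:1883
listener.tcp.proxy_protocol = on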

When I point the high-traffic subdomain at the HAProxy external IP, the number of open sessions increases to a few hundred, then drops to zero and starts to increase again, repeating this behaviour.

When I point the low-traffic subdomain at the cluster, it remains stable.

Do you separate test runs so that you don't have historical state in the cluster?

What kind of historical state? How can I reset that state?

Has the cluster seen restarts/rescheduling of nodes due to Kubernetes? (see https://github.com/vernemq/docker-vernemq/pull/264 for possible context).

Nodes are not being restarted.

NAME                                  STATUS   ROLES   AGE   VERSION    INTERNAL-IP   EXTERNAL-IP   OS-IMAGE             KERNEL-VERSION     CONTAINER-RUNTIME
aks-agentpool-10002502-vmss000000     Ready    agent   39h   v1.18.14   10.240.0.4    <none>        Ubuntu 18.04.5 LTS   5.4.0-1041-azure   docker://19.3.14
aks-bwmprodpool-10002502-vmss000000   Ready    agent   39h   v1.18.14   10.240.0.5    <none>        Ubuntu 18.04.5 LTS   5.4.0-1041-azure   docker://19.3.14
aks-bwmprodpool-10002502-vmss000001   Ready    agent   39h   v1.18.14   10.240.0.6    <none>        Ubuntu 18.04.5 LTS   5.4.0-1041-azure   docker://19.3.14
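
(For completeness, pod-level restarts or rescheduling, as opposed to node restarts, can be checked with something like the following; the grep pattern is just a convenience:)

kubectl -n mqtt get pods -o wide                                         # RESTARTS column and pod AGE
kubectl -n mqtt describe pod vernemq-k8s-0 | grep -iE 'restart|started'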

Have you also tested with the Plumtree protocol instead of SWC?

This is unknown to me. I’ll search and learn about it.

All in all, this looks to me like SWC sync-state problems caused by terminated (i.e. "leave"'d) cluster nodes that re-join with existing state.



Cluster node uptime is above 38 hours; they have not been restarted.

Best regards,

Eduardo.



André Fatton

Apr 2, 2021, 3:32:44 AM4/2/21
to vernemq-users
Ok, thanks. Is this production traffic, or a test setup?
If it's a test setup, what is the actual connection setup rate?
How fast would the connections fire re-connect attempts if they get a delayed answer?
Do you see "action timeouts" in the Verne logs?
This would point to a situation where your connections come in, get connected to a node but then somehow re-connect due to a timeout. This could create a lot of sync overhead.

Do you observe the Verne metrics and logs?
Are you within the configured rate limits of HAProxy, as well as any rate limits on the Azure side?
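
(A quick way to spot a connect/disconnect churn is to watch the socket and queue counters, roughly like this from inside a pod; the metric names are from memory:)

kubectl -n mqtt exec -it vernemq-k8s-0 -- vmq-admin metrics show | grep -E 'socket_open|socket_close|queue_setup|queue_teardown'
# or scrape the Prometheus endpoint of the default HTTP listener (port 8888, path /metrics)

If socket_close keeps pace with socket_open while the load ramps up, the clients are indeed reconnecting in a loop.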

André


