I know this has come up a few times and I've read some of the relevant topics about it but I wanted to ask the group to be sure I didn't miss anything. Sorry for the length of the post, it's a bit of an involved topic.
I'm looking to use Node-RED for my home automation system, but I really need/want high availability. I'm running everything on Raspberry Pis, using the BitScope Blade for mounting and power. I'm using MQTT and RabbitMQ for the bus. All flows begin with either a timer firing or a message off the bus. The software on the Pis is actually set up as a Docker Swarm, with everything in Docker containers. This provides some amount of resiliency, as Swarm will restart downed applications and move them around if a Pi dies. I still need to verify this works as well as I hope, but if it doesn't, I'll switch to Kubernetes, which I know will handle this nicely.
The problem I'm trying to solve is that I really don't want an outage while the container orchestration system is detecting the failure and moving things around. This, in turn, led me to research Node-RED clustering and high availability, and I'm wondering if someone already has a good solution. I know one suggested approach is to use a load balancer, but that only works if your flows are triggered by inbound HTTP requests. Since mine are triggered by timers and bus messages, a load balancer isn't a good fit. There are ways to use queues with client ids to distribute work, but that doesn't cover timers or load leveling. Basically, I don't know of an existing approach that covers everything I want to do.
My plan, assuming someone here doesn't have something already, is to create three custom nodes based on etcd. Etcd is a high-performance distributed key/value store; it's the backbone datastore for Kubernetes, and it's also one of the tools supported by Docker Swarm for making the swarm managers highly available.
The first and second custom nodes would be used for sharing data across the cluster: one for input, the other for output. The output node would take whatever you give it, plus a name, and write it to etcd for use anywhere in the cluster. The input node would take a name and return the value from the cluster. I might need to add more semantics or other node options, but in the end these are just for making data available on the cluster itself - like Node-RED's global context, but cluster-wide.
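To make the semantics concrete, here's a minimal sketch of the two data-sharing nodes. An in-memory Map stands in for etcd (a real node would use an etcd client and handle async I/O), and the function names `clusterSet`/`clusterGet` are just placeholders I made up for illustration:

```javascript
const store = new Map(); // stand-in for the etcd keyspace

// "cluster output" node: write msg.payload to the cluster under a configured name
function clusterSet(name, msg) {
  store.set(name, JSON.stringify(msg.payload));
  return msg; // pass the message along unchanged
}

// "cluster input" node: read the named value back onto msg.payload
function clusterGet(name, msg) {
  const raw = store.get(name);
  msg.payload = raw === undefined ? undefined : JSON.parse(raw);
  return msg;
}

// One instance writes, any instance in the cluster could then read
clusterSet('livingroom/temp', { payload: 21.5 });
const out = clusterGet('livingroom/temp', {});
console.log(out.payload); // 21.5
```

Serializing through JSON keeps the stored values language-neutral, which matters if anything other than Node-RED ever reads the keys.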
The next node is more interesting, though dead simple to use - call it the "run once" node for the sake of discussion. It would never be the first node in a flow; almost always the second. Basically, this node takes in any input at all and just passes it along to its output if this instance of Node-RED is the first one past a distributed "lock". Under the covers, when this node receives a message, it uses put-if-absent behavior (in etcd v3 that's a transaction that only writes if the key doesn't already exist) to try to insert the hash of the inbound value as the key, with a random number as the value. If the write succeeds, the node passes the message on. If it fails, it stops the flow from processing any further. There are some options to figure out, like using a cached random UUID for the value and maybe using the flow id to namespace the keys, but none of these change the core idea.
The reason I believe this is a workable solution is that you can deploy as many identical copies of Node-RED as you want, all with the exact same flows deployed. Every time a flow is triggered, the "run once" node ensures the flow executes on only a single instance, while imposing only a very tiny overhead (etcd is amazingly fast). This has the added benefit of load balancing the work among instances, though that's not really a requirement for me. Docker Swarm already requires something like etcd for HA operation, etcd itself is HA, and RabbitMQ is also capable of HA operation when set up with mirroring. So in case of a Pi failure (hardware, OS, etc.), the applications will die and Swarm will move them elsewhere. There will be zero impact to Node-RED though, as the loss of an instance just means the other instances pick up more of the flows. By using the clustering data nodes, it's possible to share data between instances as well, in case your flows require that. And this works for anything that triggers a flow - even if the flow is only triggered on a single node, or if there's only a single instance running (like in development).
Anyways, I'd like to know if this is potentially of interest to anyone else or if there is already a better approach out there. Thanks!
-Mike