High Availability for Node-RED - Possible Solution


MikeFromKC

Jun 29, 2017, 3:06:20 PM
to Node-RED
I know this has come up a few times and I've read some of the relevant topics about it but I wanted to ask the group to be sure I didn't miss anything. Sorry for the length of the post, it's a bit of an involved topic. 

I'm looking to use Node-RED for my home automation system but I really need/want high availability. I'm running everything on Raspberry Pis using the BitScope Blade for mounting and powering.  I'm using MQTT and RabbitMQ for the bus. All flows begin with either timers firing or a message off the bus. The software running on the Pis is actually set up using Docker Swarm with everything in Docker containers. This provides for some amount of resiliency as Docker Swarm will restart downed applications and move them around if a Pi dies. I need to verify this works as well as I hope but if it doesn't, I'll switch to Kubernetes which I know will handle this nicely. 

The problem I'm trying to solve is that I really don't want there to be an outage while the container orchestration system is detecting failure and moving things around. This, in turn, leads me to research Node-RED clustering and high availability. I'm wondering if someone already has a good solution. I know one suggested approach is to use a load balancer, but that only works if your flows are triggered by inbound HTTP requests. Since I'm using timed tasks and bus-message triggering, a load balancer isn't a good fit. There are ways to use queues with client ids to handle distribution, but that doesn't support timers or load leveling. Basically, I don't know of an existing approach that covers everything I want to do.

My plan, assuming someone here doesn't have something already, is to create three custom nodes based on etcd. Etcd is a high-performance distributed key/value store and is already used as the backbone for Kubernetes. It's also one of the tools supported by Docker Swarm to make the swarm managers highly available.

The first and second custom nodes would be used for sharing data across the cluster: one for input, the other for output. One would take whatever you give it, along with a name, and write it to etcd for use anywhere in the cluster. The other would take a name and return the value from the cluster. I might need to add more semantics or other node options, but in the end these are just for making data available on the cluster itself - like Node-RED's global context, but cluster-wide.
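The behaviour of these two data-sharing nodes can be sketched as a pair of functions. This is only an illustration: a plain `Map` stands in for etcd, and the names `clusterSet`/`clusterGet` are made up for discussion; a real implementation would call an etcd client instead.

```javascript
// Sketch of the proposed cluster-data nodes' core behaviour.
// The store is injectable: a Map here, but in production it would be
// an etcd client. All names (clusterSet/clusterGet) are hypothetical.
const store = new Map();

// "Output" node: write any value under a name, cluster-wide.
function clusterSet(name, value) {
  store.set(name, JSON.stringify(value)); // etcd stores strings/bytes
}

// "Input" node: read a named value back anywhere in the cluster.
function clusterGet(name) {
  const raw = store.get(name);
  return raw === undefined ? undefined : JSON.parse(raw);
}

clusterSet('thermostat/setpoint', { temp: 21.5 });
console.log(clusterGet('thermostat/setpoint').temp); // 21.5
```

Serializing to JSON keeps the stand-in honest about what a real key/value store can hold: strings, not live object references.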

The next node is more interesting, though dead simple to use - call it the "run once" node for the sake of discussion. It would never be the first node in a flow; almost always the second. Basically, this node takes in any input at all and passes it along to its output only if this instance of Node-RED is the first to get past a distributed "lock". Under the covers, when this node receives data from another node, it uses etcd's putIfAbsent behavior to try to insert the hash of the inbound value as the key and a random number as the value. If the call is successful, it passes the value on. If it's unsuccessful, it stops the flow from processing any further. There are some options to figure out, like using a random cached UUID for the value and maybe using the flow id to specify the distributed map to use, but none of these change the core idea.

The reason I believe this is a workable solution is that you can deploy as many identical copies of Node-RED as you want, all with the exact same flows deployed. Every time a flow is triggered, the "run once" node will ensure that the flow is only executed on a single instance, while imposing only a very tiny overhead (etcd is amazingly fast). This has the added benefit of load balancing the work among instances as well, though that's not really a requirement for me. Docker Swarm already requires something like etcd for HA operation, and etcd is HA already. RabbitMQ is also capable of HA operation when set up with mirroring. So in case of a Pi failure (hardware, OS, etc.) the applications will die and Swarm will move them to another instance. There will be zero impact to Node-RED, though, as the loss of an instance will just mean the other instances pick up more of the flows. By using the cluster data nodes, it is possible to share data between instances as well, in case that is required for your flows. Also, this will work for anything that triggers a flow - even if the flow is only triggered on a single node, or if there is only a single instance running (like in development).

Anyways, I'd like to know if this is potentially of interest to anyone else or if there is already a better approach out there. Thanks!

-Mike

Julian Knight

Jun 30, 2017, 10:03:56 AM
to Node-RED
Sounds like a lot of work for a HA project.

I'd be tempted to do something more low-tech. 

Firstly, assess just how time-critical your HA system is: does it really matter if the system goes down for a few minutes, except maybe at a few critical moments? If not, the risk is really low; it's only a problem if the whole thing goes down just when a switch command should be executed. How many of those actually happen during a day?

A Pi with a decent power supply connected to a PC UPS and using an HDD/SSD rather than an SD card is an incredibly reliable platform.

If you decide to go further, a second Pi running a small Node.js service could easily monitor for when your primary Pi's services fail. MQTT is good for this, though it means you also have to test that the primary MQTT broker is still working if everything runs on the same primary Pi; that's easily done using a backup broker. The Node.js service would fire up its own local copy of NR and associated services when it detected the loss of the primary. This should happen within a few seconds, with maybe an additional 20 seconds of startup for the backup service - certainly under a minute. All fairly low-tech, but it gets the job done without the massive complexity of the other tools, which are great for enterprise architectures but horrible to learn, set up and maintain for a home system.
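The monitoring service described here boils down to a small heartbeat watchdog. This is only a sketch under assumptions: in a real deployment `beat()` would be called from an MQTT heartbeat subscription and `onFailure()` would spawn the local Node-RED (e.g. via `child_process`); time is passed in explicitly so the core logic stands alone without a broker, and all names are made up for illustration.

```javascript
// Sketch of the backup Pi's failover monitor. beat() is called on every
// heartbeat from the primary; check() is polled on a timer. If the
// primary has been silent longer than timeoutMs, onFailure() runs once.
class Watchdog {
  constructor(timeoutMs, onFailure) {
    this.timeoutMs = timeoutMs;
    this.onFailure = onFailure;
    this.lastBeat = null;
    this.failed = false;
  }
  beat(now) {                 // heartbeat received from the primary
    this.lastBeat = now;
    this.failed = false;
  }
  check(now) {                // poll periodically (e.g. every second)
    if (this.failed || this.lastBeat === null) return;
    if (now - this.lastBeat > this.timeoutMs) {
      this.failed = true;
      this.onFailure();       // e.g. spawn the local Node-RED copy
    }
  }
}
```

A periodic timer would call `check(Date.now())`; as long as the primary publishes heartbeats faster than the timeout, `onFailure` never fires.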

Add in a simple git and/or rsync service to ensure that changes to the primary Pi (e.g. flows, new modules, etc.) are reflected through to the backup.

I've used the example of a Pi as they are cheap but reliable. But of course, this would apply to any platform choice.

Bart Butenaers

Apr 1, 2018, 8:57:04 AM
to Node-RED
Hi guys,

Mike's setup with the BitScope Blade looks very professional, but my wife is going to kick me out of the house when I start installing racks ;-)
Currently my landscape at home looks like this:

All devices (sensors, ...) are wired to my 'production' Raspberry, where the (stable) production flow is running.
On my 'test' Raspberry, a flow is running with all my experimental stuff.  As soon as something works fine in test, I add that functionality to my production flow.
My production flow reads all the sensor data and sends it via MQTT to my 'test' Raspberry, where the test flow is running.
So thanks to MQTT, I have all production sensor data also available in my test flow ...

Last night my production Raspberry wasn't reachable anymore, so now I want to set up a second 'production' Raspberry as failover.
Lots of questions are haunting my mind currently:
  1. Is this possible with your setup?  Because the sensors are only wired to the first Raspberry...  Or do I need such a Blade solution, or another kind of hardware?
  2. And how should I set up such a Node.js service that handles the switching between the Raspberries?  Or can I do the same from within the Node-RED flows?
  3. How do I ensure that all production Raspberries are running the same up-to-date flows? Can this be automated?
Thanks in advance !!!
Bart

Colin Law

Apr 1, 2018, 9:16:48 AM
to node...@googlegroups.com
The biggest problem probably relates to the sensors. Unless you have redundant sensors wired separately to a backup system how can you cope with a failure of the pi?

I have tackled this problem by addressing each failure mode as it has come along.  A couple of times over a long period my Pi (which is connected by wifi) has lost its wifi connection.  A reboot of the Pi fixed it, so in the Node-RED flow I have a watchdog checking that it can see the base station router; if it loses contact for more than five minutes then it reboots itself.  Another example: I had an old router whose wifi used to lock up occasionally. I handled this by having a watchdog in another Pi on the wired network that checked that devices on the wifi were accessible; if they disappeared for an extended period, that triggered a reboot of the router.  I no longer need that, however, as I replaced the router, so that problem no longer exists.
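This kind of reboot watchdog can be sketched as a Node-RED function-node-style counter. It is only a sketch under assumptions: the threshold, message shapes, and wiring (a ping node feeding this function, its output feeding an exec node that runs the reboot) are illustrative, not the exact flow described above.

```javascript
// Sketch of a self-reboot watchdog, shaped like a Node-RED function
// node. It receives the result of a periodic ping (say one msg per
// minute), counts consecutive failures, and emits a reboot command
// once the router has been unreachable long enough. In a real flow
// the output would feed an exec node.
let failures = 0;
const REBOOT_AFTER = 5; // consecutive failed pings (~ five minutes)

function onPingResult(msg) {
  if (msg.payload === false) {          // ping node reports false on timeout
    failures += 1;
  } else {
    failures = 0;                       // any success resets the counter
  }
  if (failures >= REBOOT_AFTER) {
    failures = 0;
    return { payload: 'sudo reboot' };  // route to an exec node
  }
  return null;                          // nothing to do; send no message
}
```

Resetting the counter on any success keeps a single dropped packet from ever triggering a reboot; only a sustained outage does.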

I have to accept the fact that if the pi completely fails then my system will fail and I have independent systems making sure that, for example, the heating does not stay on full blast heating up the house to sauna levels.

Colin


Bart Butenaers

Apr 1, 2018, 10:37:53 AM
to Node-RED
Hi Colin,

Ok, it is perhaps a good idea for my flow to check whether it is connected to the network.  If not, it should reboot (after X time).  Good tip...

It seems that yesterday my power adapter was defective, so end of story.  I will have to buy a decent one...  When reading about Raspberries that become unresponsive, it seems that most of the time it is due to a bad power supply (or a dip in the power).

Had a quick look at the BitScope Blade, and I don't think that can solve my wiring issue.

Another idea that came up in my mind: suppose I don't connect the critical sensors directly to my Raspberry's GPIO pins, but via the I2C interface (e.g. via an IO Pi Plus expansion board).  If I could add both my production Raspberries to the same I2C bus, they could both read the same sensors.  But from this discussion I assume it isn't possible to connect two Raspberries to the same I2C bus.

So I'm running out of ideas ...

Bart

Colin Law

Apr 1, 2018, 10:46:58 AM
to node...@googlegroups.com
On 1 April 2018 at 15:37, Bart Butenaers <bart.bu...@gmail.com> wrote:
Hi Colin,

> Ok, it is perhaps a good idea for my flow to check whether it is connected to the network.  If not, it should reboot (after X time).  Good tip...

If you find that the wifi is 100% reliable then probably no need to do it.
 

> It seems that yesterday my power adapter was defective, so end of story.  I will have to buy a decent one...  When reading about Raspberries that become unresponsive, it seems that most of the time it is due to a bad power supply (or a dip in the power).

I would agree with that.
 

> Had a quick look at the BitScope Blade, and I don't think that can solve my wiring issue.
>
> Another idea that came up in my mind: suppose I don't connect the critical sensors directly to my Raspberry's GPIO pins, but via the I2C interface (e.g. via an IO Pi Plus expansion board).  If I could add both my production Raspberries to the same I2C bus, they could both read the same sensors.  But from this discussion I assume it isn't possible to connect two Raspberries to the same I2C bus.

I don't know anything about I2C so can't comment on that.

As a matter of interest what are you sensing with the critical sensors and what are you doing with the data?

Colin
 

Walter Kraembring

Apr 1, 2018, 11:02:15 AM
to Node-RED
> critical sensors

To get real redundancy with critical sensors, you have to double them as well (or triple them, and exclude the reading from the one that diverges from the other two).
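This triple-redundancy idea fits in a few lines: with three copies of a sensor, taking the median automatically excludes whichever reading diverges from the other two. A minimal sketch (the function name is made up):

```javascript
// 2-of-3 voting for triple-redundant sensors: sort the three readings
// numerically and take the middle one. The diverging sensor, whether
// it reads high or low, never wins the vote.
function voteOfThree(a, b, c) {
  return [a, b, c].sort((x, y) => x - y)[1]; // the median
}

// A faulty sensor reading 98.0 among two healthy ones is ignored:
console.log(voteOfThree(21.4, 21.5, 98.0)); // 21.5
```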

Walter Kraembring

Apr 1, 2018, 11:03:35 AM
to Node-RED
..but then I guess we are talking space shuttles

Bart Butenaers

Apr 1, 2018, 1:59:03 PM
to Node-RED
Well, 'critical' is perhaps a little bit exorbitant, since my life doesn't depend on it.  And to be complete: it is not going to be used by NASA ...
For me critical sensors are:
  • Reed relays on doors that shouldn't be opened when I'm not at home.
  • Water level sensor, since we suffer from small flooding from time to time
I assume devices like reed relays could be measured by two Pis simultaneously.  Just add the wire (with a single external pull-up resistor) to the GPIO pins of both Pis.  When the relay closes, both GPIO pins will measure the same falling edge:


However, for devices like the acoustic water level sensor this doesn't work: a trigger pulse needs to be sent periodically to the pisrf node, and the time until the echo pulse returns is measured.  Only one of the Pis is allowed to send a trigger pulse; they can't do it simultaneously:

So I have no solution for the moment ...

Bart