SD Card Use and System Reliability

JethroNull

Jul 5, 2015, 4:10:56 PM
to node...@googlegroups.com
So, picking up from one of the meanderings that the thread "Console Output to SSH?" started, I am hoping to continue the discussion on best practice for long-term reliability in RPi and Pi-like setups.

My own requirement is a commercial one where an RPi 2, or something like it, is likely to be used 24x7x365 (+ 1 sec) in outdoor environments with zero accessibility by technical people.  My background has been embedded processors without an OS doing cool stuff in the background, with simple hardware watchdogs to pick up errant code.  Switching to RPi/Linux introduces a whole bunch of new concerns over long-term reliability. I'm teetering on the fence between minimal SD card use and running read-only.  My application could go read-only but a lot of configuration data would be lost on a power failure and that might be bad for customer opinion.  On the other hand, not having to worry about diagnosing unpredictable weirdness as a result of failing SD cards and not having to build UPS functionality into the devices would save a bunch of cost and heartache.  In the previous thread I learnt a lot about minimizing SD card writes, mostly by using tmpfs, but in the absence of hard specs that define what maximum card usage/time is allowable, and moreover, not really knowing what Linux wants/needs to write to the card, I'm still uncomfortable.  On the read-only side I found a great blog that covers a lot of the issues involved http://k3a.me/how-to-make-raspberrypi-truly-read-only-reliable-and-trouble-free/ but I still can't decide which way to jump.  Anyone else had to make this decision?  Are there any wrinkles to either approach you've come across?
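
For anyone weighing the minimal-write route: one common way to keep configuration from being half-written when the power drops is to write the new version to a temporary file, flush it to the card, and then rename it over the old one. A minimal Python sketch of that idea follows. The function name and the JSON format are assumptions for illustration, not anything from the thread, and it cannot help if the card's own controller misbehaves on power loss.

```python
import json
import os
import tempfile


def save_config_atomically(path, config):
    """Write config as JSON so a power cut leaves either the old or the new file."""
    directory = os.path.dirname(os.path.abspath(path))
    # The temp file lives in the same directory so the rename stays on one filesystem.
    fd, tmp_path = tempfile.mkstemp(dir=directory, prefix=".cfg-", suffix=".tmp")
    try:
        with os.fdopen(fd, "w") as tmp:
            json.dump(config, tmp)
            tmp.flush()
            os.fsync(tmp.fileno())  # push the data to the card before renaming
        os.replace(tmp_path, path)  # atomic swap on POSIX filesystems
    except BaseException:
        # Clean up the temp file if anything went wrong before the swap.
        if os.path.exists(tmp_path):
            os.unlink(tmp_path)
        raise
    # Sync the directory entry as well, so the rename itself reaches the card.
    dir_fd = os.open(directory, os.O_RDONLY)
    try:
        os.fsync(dir_fd)
    finally:
        os.close(dir_fd)
```

The cost is one extra file write per save, so this suits configuration that changes rarely rather than data that is logged continuously.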

Jon (JethroNull)

Adrian Brown

Jul 5, 2015, 5:06:27 PM
to node...@googlegroups.com
We have faced similar issues in our potential applications and, like you, we are using tmpfs. The other options are to use industrial-grade SD cards or designs based on the Raspberry Pi Compute Module.

Obviously the last two options come at a cost penalty.

cheers
Adrian

Lawrence Griffiths

Jul 6, 2015, 5:45:00 AM
to node...@googlegroups.com
There is a sort of halfway house between consumer-grade and industrial-grade cards that might do the job.

Shem Jamieson

Jul 6, 2015, 9:36:55 AM
to node...@googlegroups.com
I have pulled the plug on an RPi many times with no problems. Has anyone done a test of randomly pulling the power on one multiple times and measured the results?

SJ

Greg EVA

Jul 6, 2015, 10:00:15 AM
to node...@googlegroups.com
Hey Jon,

I guess if you've been doing telematics/M2M for 25 years, you have a pretty good handle on what's important in a data collection system.

A problem for me is that it is missing an RTC. You need to know the time in order to log data; hoping that you can get it online is pretty good, except if connectivity is sporadic.

No watchdog.  I have OpenHAB running on an RPi... there is some Java bug going on that slowly makes OpenHAB unresponsive, and then the other system services, and then the entire system (no ping response).  Unknown bugs in the field which cause the system to be unreachable are pretty bad if you intend to do everything remotely.

The other thing that I have found is that you can add things onto the RPi (or similar) systems to deal with these issues, however they are afterthoughts which add complexity and cost, and quickly push the total BOM cost of the $35 computer to more like $200.

I know you mentioned that cost was a major limiting factor for you, but how many of these devices are you planning on deploying in the "fields"?  Are these multiple customers buying one unit, or multiple units being sold to a few customers?  Have you considered just using a small embedded computer which is designed for real world deployment?  Unless you're planning on selling tens of thousands of these things, I can't help but think that the cost savings that you might get by using a cheap device may quickly be eaten up with extra engineering, support, travel costs.

@Shem - due to my aforementioned OpenHAB/Java bug, I often pull the power on one of my Pis, and in the past year I haven't had a problem where it won't boot.  That said, the FS has a counter for the number of improper unmounts, and should its limit be reached, the device may no longer boot (this used to be the case, but may have changed).  When I back up the SD card, I also take the opportunity to run a file system check on it to make sure it's all good while it is in my laptop.

Cheers,

Greg

Jon Richings

Jul 6, 2015, 3:19:44 PM
to node...@googlegroups.com
Shem, I've been wishing I had the time to do that very test.  But my experience is the same as yours, lots of brutal power-downs with no (obvious) ill-effects.  I need to check out that counter Greg mentioned.  That would be interesting to keep track of.

Adrian, the Compute Module is tempting since they seem to intend it for commercial applications, and yet they are not promising industrial temp spec (though the individual components seem capable).  But when you have to add back all the stuff that RPi has, it's not a very good deal.  I want to get my hands on an OlinuXino A10.  Higher spec than RPi in many ways but still about $40.  At some point I'll have to start a thread about using NR on that and whatever else might be different from RPi.

Greg, yeah, I've been in this a looong time.  But I feel like a real noob having got all excited about Linux and cheap RPi-ish platforms.  Our application is more than a few hundred, but probably less than tens of thousands (I should be so lucky).  In the past when we've designed things for really high volume we bespoke everything.  The unit price is then as low as it can possibly be, but the development time and cost are huge.  RPi-ish things seem to give us a platform that is both economical (because of IT's volume) and yet still versatile enough to be origamied into whatever we need (thank you Node-Red!).  We can still keep the price down by designing our own motherboard that has anything RPi doesn't give us (a good PSU, PoE, USB-to-serial, RTC, etc).  And that doesn't cost much.  Custom enclosures are still the biggest cost.

RPi does appear to have a watchdog as part of the Broadcom processor.  I stumbled on a little description of using it here: http://k3a.me/how-to-make-raspberrypi-truly-read-only-reliable-and-trouble-free/  but I haven't looked into how that works.  Sounds like it may not be a timer watchdog but triggered by low memory?  From experience with OS-less processors in other systems, the watchdog can be a source of many problems on its own, but it is worth the effort as an 'if-all-else-fails' measure.  There is also mention in that article of reboot on "kernel.panic", which sounds worth investigating too.  We've jumped on the PM2 thing as that seems to have a number of ways of picking things up when they break (and then killing them if they keep breaking), which I'm hoping will ensure that if our code ('coz NR never would) causes an inescapable problem we'll still be able to remote into a shell via SSH and do some cleanup in the background.
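
For reference, the standard Linux watchdog device is a plain countdown timer: once a program opens /dev/watchdog, it has to keep writing to it or the board resets. The sketch below shows only that keep-alive loop in Python. The function name and the `healthy` callback are hypothetical, and it assumes the watchdog driver is loaded so that the device file exists.

```python
import time


def pet_watchdog(device, healthy, interval_s=5.0, sleep=time.sleep):
    """Keep writing to the watchdog device for as long as healthy() returns True.

    Returns the number of keep-alive writes made. When healthy() turns False
    we simply stop writing and leave the hardware timer to reset the board.
    """
    pets = 0
    while healthy():
        device.write(b"\0")  # any write restarts the hardware countdown
        device.flush()
        pets += 1
        sleep(interval_s)
    return pets
```

On a Pi this would be called with something like `open("/dev/watchdog", "wb", buffering=0)` as the device, and `interval_s` needs to be comfortably shorter than the hardware timeout. Whether closing the device stops the timer depends on the driver and kernel configuration, so check that before relying on it.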

Jon

Mike Biddell

Jul 6, 2015, 4:04:43 PM
to node...@googlegroups.com

Jon

I worked in aerospace most of my life. The way we made High Integrity Computers was to have two computers running the same programme, with their address and data lines taken into comparators. Any discrepancy on the address or data lines caused the comparator circuit to reboot both computers.

mike

JethroNull

Jul 6, 2015, 5:44:35 PM
to node...@googlegroups.com
Mike, I've heard tell of similar things in aerospace and other high-reliability systems. Most recently we were peripherally involved with some wind farm stuff that did the same, where the consequences of a turbine going out of control are both scary and expensive.  What happened in your world if the problem caused continual resets so that nothing was online?  Anyway, that's not even an option at the price point of our products.

Jon

Mike Biddell

Jul 7, 2015, 2:39:56 AM
to node...@googlegroups.com
Jon

The HIC comparator pulled both channel A computers into reset/reboot and changed to channel B, where there were two identical computers running the same program. Only the controlling computers had access to the Input/Output Bus..... so in other words when the channel toggled, the I/O did too.... amazing system and is still used on Jumbo 747 today. I designed one of the engine control computers (as part of a 3 man team). It worked and still works very well today.

So if you want high-integrity control, it wouldn't be too hard to do with Raspberry Pis (four of them), two I/O boards and, of course, Node-RED. The first Node-RED HIC computer !!!!!
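
The channel-switching behaviour described above can be mimicked in a few lines of Python. This is a toy sketch under loud assumptions: each "computer" is just a function, the comparator checks final results rather than address and data lines, and all names are hypothetical.

```python
def run_lockstep(channels, inputs):
    """Run inputs through redundant channels, switching channel on disagreement.

    channels is a list of (computer_a, computer_b) pairs of functions.
    For each input, both computers of the active channel compute a result.
    If they disagree, control toggles to the next channel (standing in for
    the reset and I/O bus handover), which then handles that input.
    Returns (outputs, number_of_channel_switches).
    """
    active = 0
    outputs = []
    switches = 0
    for value in inputs:
        for _attempt in range(len(channels)):
            computer_a, computer_b = channels[active]
            out_a, out_b = computer_a(value), computer_b(value)
            if out_a == out_b:
                outputs.append(out_a)
                break
            # Comparator tripped: hand control to the other channel.
            active = (active + 1) % len(channels)
            switches += 1
        else:
            # Every channel disagreed on this input.
            raise RuntimeError("no channel produced an agreed output")
    return outputs, switches
```

With four Pis, each pair would form one channel, and the comparison would have to happen on some shared link or external logic rather than inside one process as it does here.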

Mike 

Dave C-J

Jul 7, 2015, 3:35:16 AM
to node...@googlegroups.com
+1 - Jumbo-Pi ! 

Greg EVA

Jul 7, 2015, 6:24:27 AM
to node...@googlegroups.com
@Jon - very interesting about the watchdog... didn't know that was there.  Certainly worth investigation!  Let us know what you find.

Regarding the Linux side of things for embedded devices, I'm far from an expert, but have been getting into the field more and more over the years, and there are quite a large number of pros and cons which come up.  Obviously with the sheer power and capability comes complexity and challenge.  Over here in France, there are firms who focus on the open source software stack of this sort of application.  I mentioned Yocto.... to build an embedded Linux version... you can have it all in ~30 MB, booting in a couple of seconds... obviously meaning that OTA updates are much more manageable.  Just an idea, but you might consider outsourcing the OS side of things and staying focused on the M2M app and hardware side.  Wind River are "THE" Embedded Linux guys... but their OS is licensed, whereas Yocto custom builds obviously just require the initial engineering.

It's a lot of overhead, and more of a full stack Linux option, but you should at least have a look at Eclipse Kura.  Even if you don't program in Java, the platform provides some pretty cool stuff for device management, configuration deployment, etc. using MQTT.  (It comes from the same origins as MQTT.)

Thanks for the link for that Olimex board... it looks pretty darn awesome!

JethroNull

Jul 7, 2015, 1:18:14 PM
to node...@googlegroups.com
Yocto looks awesome Greg, but over my head right now.  That'll have to wait till I am much more up to Linux speed, or maybe get you to give us a hand.  For now I think we'll keep on the RPi road and plan to jump to OlinuXino and Yocto in due course.  And as you say, we are better used focussed on the hardware and the app (we have a ton of web-end stuff to look after too).

In the meantime though it does sound like we could strip out a lot of Raspbian's standard packages to make more room for other things to play.  Any experience with that?  Is it just a case of apt-get remov'ing things or does it run deeper than that?

I'll definitely keep the thread updated on what I find with the watchdog.

Java?   no, no, no way, nope, ...

Jon

Dave C-J

Jul 7, 2015, 2:10:34 PM
to node...@googlegroups.com

Lots of good links like...
Stripping down a standard Raspbian installation | Technology ... http://blog.qruizelabs.com/2014/04/23/stripping-down-a-standard-raspbian-installation/

Also remove Scratch... Minecraft.. Etc...

Jon Richings

Jul 7, 2015, 4:33:00 PM
to node...@googlegroups.com
Thanks Dave, I had found that one.  Taking Wolfram off has cleared a big chunk of space.  One version of our gizmo runs headless and so all the GUI stuff can go.  But another version runs a kind of kiosk mode and so we need some of the desktop stuff.  It's difficult to know what's what amongst the zillions of bits and pieces.  You don't happen to know of a site that lists everything and its purpose and dependencies, do you?

Julian Knight

Jul 7, 2015, 4:54:48 PM
to node...@googlegroups.com
Not done any scientific measurements but I've certainly had cause to pull the plug fairly regularly. And I've certainly corrupted one or two cheap cards this way but I've not yet experienced any corruption on branded cards.

Julian Knight

Jul 7, 2015, 5:13:52 PM
to node...@googlegroups.com
Looks like the watchdog module has quite a few tests it can react to. Not just memory or CPU:

The watchdog daemon does several tests to check the system status:

·  Is the process table full?
·  Is there enough free memory?
·  Are some files accessible?
·  Have some files changed within a given interval?
·  Is the average work load too high?
·  Has a file table overflow occurred?
·  Is a process still running? The process is specified by a pid file.
·  Do some IP addresses answer to ping?
·  Do network interfaces receive traffic?
·  Is the temperature too high? (Temperature data not always available.)
·  Execute a user defined command to do arbitrary tests.

If any of these checks fail watchdog will cause a shutdown. Should any of these tests except the user defined binary last longer than one minute the machine will be rebooted, too.

If, however, you happen to also be including an embedded microcontroller, I would think you might be better off using that to directly control the power for the ultimate reset. It is similar to the discussion on aircraft systems (I trained as an aeronautical engineer but found IT to be a more reliable and easier-to-get-into industry :) ) but much cheaper to set up. If, for example, you were to include an ESP8266 for Wi-Fi, that has enough of an embedded controller to do the trick, perhaps using software to send a "tick" out of one of the Pi's GPIO ports, monitored by the microcontroller.
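
The monitoring side of that "tick" idea can be sketched as below. This is Python for readability only; on a real microcontroller it would be firmware, and the class name, timeout, and boot grace period are all assumptions for illustration.

```python
class HeartbeatMonitor:
    """Decides when the Pi has gone quiet for too long and should be power-cycled."""

    def __init__(self, timeout_s, boot_grace_s=0.0):
        self.timeout_s = timeout_s
        self.boot_grace_s = boot_grace_s
        self.last_level = None
        self.deadline = None

    def start(self, now):
        # Allow extra time after power-up before the first tick is expected.
        self.deadline = now + self.boot_grace_s + self.timeout_s
        self.last_level = None

    def sample(self, level, now):
        """Feed in the current pin level; returns True if the Pi should be reset."""
        if self.last_level is not None and level != self.last_level:
            # Only a change of level counts as a tick, so a pin stuck high
            # or stuck low cannot keep the monitor happy.
            self.deadline = now + self.timeout_s
        self.last_level = level
        return now >= self.deadline
```

Requiring a level change rather than a particular level means a Pi that hangs with the pin held in either state still times out. After cutting and restoring power, the monitor would call `start()` again so the boot time is not counted as silence.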

Julian Knight

Jul 7, 2015, 5:16:37 PM
to node...@googlegroups.com
BTW, do you not find that PM2 is a memory hog? I found that it was almost unusable on my old Pi and even caused some problems on my dev laptop.

JethroNull

Jul 7, 2015, 5:20:56 PM
to node...@googlegroups.com
Nice find Julian, where was that?

Yeah, using a little external uC to run the watchdog is my first thought too.  But that much monitoring by the Broadcom chip is hard to ignore!

Julian Knight

Jul 7, 2015, 5:36:51 PM
to node...@googlegroups.com
That is from the man page for watchdog.

You might also find this page useful that has a little more explanation.

Jon Richings

Jul 7, 2015, 6:04:37 PM
to node...@googlegroups.com
That's a great help, Thanks Julian.

PM2 a memory hog?  Hasn't looked like that to me.  A quick look at one of our kiosk mode boxes that's been running for a couple of weeks has Chromium (not surprisingly) consuming some 25%, Node-Red 8.5%, Node 6.3% and PM2 2%. That's maybe a lot for what it does but I can live with that.  But given your concern I'll keep an eye on it.  Thanks for the heads-up.

Julian Knight

Jul 7, 2015, 6:11:06 PM
to node...@googlegroups.com
No worries.

Re PM2, maybe it was some other problem and I'm being paranoid. Maybe I'll try again.