How to make a SW for a micro controller, that in addition to its normal
operation (control of something), from time to time it will also check
itself if it is doing okay or not ? How a program can test itself? Can
some one suggest any intelligent method (other than watch dog) ?
That's called a 'watchdog' timer and is standard in most microcontrollers.
It's basically a countdown timer which the program running on the
microcontroller must reload several times per second to prevent it reaching
zero. When it reaches zero the microcontroller is reset. So when a program
'hangs', it stops reloading the watchdog countdown timer and the
microcontroller is reset.
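To make the countdown behaviour concrete, here is a minimal sketch in C that models the watchdog in software. The names (watchdog_t, wdt_init, wdt_kick, wdt_tick) are made up for illustration; on a real part the counter lives in hardware and the "kick" is a write to a vendor-specific register.

```c
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical software model of a watchdog countdown timer. */
typedef struct {
    uint32_t reload;   /* value loaded on every kick */
    uint32_t counter;  /* counts down once per tick  */
} watchdog_t;

void wdt_init(watchdog_t *w, uint32_t reload_ticks)
{
    w->reload = reload_ticks;
    w->counter = reload_ticks;
}

/* Called by the application while it is still making progress. */
void wdt_kick(watchdog_t *w)
{
    w->counter = w->reload;
}

/* Called once per timer tick. Returns true when the counter has
 * reached zero, i.e. the point where real hardware would reset. */
bool wdt_tick(watchdog_t *w)
{
    if (w->counter > 0) {
        w->counter--;
    }
    return w->counter == 0;
}
```

A hung program never calls wdt_kick(), so after reload_ticks ticks wdt_tick() reports expiry.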
One way to check hardware is to run another identical processor and compare
that they behave the same. If you have three or more then you can perform
voting so that the most popular answer is the one that gets used.
Peter
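The voting idea can be sketched in a few lines of C, assuming each of the three processors produces an 8-bit output word. The function names are illustrative only, and a voter written in software on a single chip is of course itself a single point of failure.

```c
#include <stdint.h>

/* Bitwise 2-of-3 majority: each output bit takes the value that
 * at least two of the three inputs agree on. */
uint8_t vote3(uint8_t a, uint8_t b, uint8_t c)
{
    return (uint8_t)((a & b) | (a & c) | (b & c));
}

/* Identify the odd one out when comparing whole values.
 * Returns -1 if all three agree, 0..2 for the index of the single
 * dissenting channel, or 3 if no two channels agree. */
int vote3_dissenter(uint8_t a, uint8_t b, uint8_t c)
{
    if (a == b && b == c) return -1;
    if (a == b) return 2;
    if (a == c) return 1;
    if (b == c) return 0;
    return 3;
}
```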
In comp.arch.embedded SelfTest <SelfTEst> wrote:
> How to make a SW for a micro controller, that in addition to its normal
> operation (control of something), from time to time it will also check
> itself if it is doing okay or not ?
Ultimately, you can't. A CPU can no more meaningfully ask itself "Am
I still working OK?" than you can ask yourself meaningfully "Have I
fallen asleep yet?"
You can use watchdogs or internal consistency checking to some extent
to determine general health of the software. Assertions can be
inserted into the code, i.e. conditions that you know must come out
true at all times, because otherwise something's fatally wrong.
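As a rough sketch of such an assertion in C: the INVARIANT macro, fault_handler and the duty-cycle example below are all hypothetical names, and a real fault handler would also drive the outputs to whatever the system's safe state is.

```c
#include <stdbool.h>

/* Hypothetical fault record: source line of the first failed check. */
volatile int g_fault_line = 0;

void fault_handler(int line)
{
    if (g_fault_line == 0) {
        g_fault_line = line;   /* remember the first failure only */
    }
}

/* Check a condition that must always hold; report the line if not. */
#define INVARIANT(cond) \
    do { if (!(cond)) { fault_handler(__LINE__); } } while (0)

/* Example: a duty-cycle setpoint must stay within 0..100 percent. */
bool set_duty_percent(int percent, int *out)
{
    INVARIANT(percent >= 0 && percent <= 100);
    if (percent < 0 || percent > 100) {
        return false;          /* reject the value, do not apply it */
    }
    *out = percent;
    return true;
}
```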
But there's often little or no point trying to detect hardware faults
--- if the hardware does break you're quite probably toast anyway.
You can't usually fix such a problem from the software side, and by
The Usual Kind of Luck, the faults that do occur will be exactly those
you can't, or at least didn't test for. And that's before you
consider that such tests mean more code in total, and thus more
opportunities for bugs.
Moral: if you don't know what to do with the answer, don't ask the
question.
--
Hans-Bernhard Broeker (bro...@physik.rwth-aachen.de)
Even if all the snow were burnt, ashes would remain.
That is cool idea !..
Of course, the next question you should ask is "What do I do when I detect a
failure". If it is a safety critical system (e.g. the something you're
controlling is a train, nuclear reactor or gas furnace rather than a lego
windmill) there's a whole other set of questions you should ask even before
asking the first one.
hth,
Alf
And not so simple. What takes the vote? What if it fails?
--
Chuck F (cbfal...@yahoo.com) (cbfal...@worldnet.att.net)
Available for consulting/temporary embedded and systems.
<http://cbfalconer.home.att.net> USE worldnet address!
have a look at
http://www.embedded.com/story/OEG20030115S0042
There seems to be a lot involved in making even a little old WD bulletproof.
martin
Three things are certain:
Death, taxes and lost data.
Guess which has occurred.
> Say we have a micro controller with limited memory.
> Say it will perform some realtime control of something.
>
> How to make a SW for a micro controller, that in addition to its normal
> operation (control of something), from time to time it will also check
> itself if it is doing okay or not?
Without special hardware support, you can't.
> How a program can test itself?
It can't.
> Can some one suggest any intelligent method (other than watch dog) ?
Redundant hardware running independently developed sw with
majority voting of outputs.
--
Grant Edwards grante Yow! I HAVE a towel.
at
visi.com
You also need to consider the likelihood of a problem occurring in the first place - time spent
designing the hardware to be reliable (e.g. EM/ESD immunity) is time much better spent than trying
to second-guess what might go wrong and then hope you can do something useful about it.
For example, in the old days when systems typically comprised separate MCU/RAM/ROM chips, it made
sense to test SRAM and checksum ROM, as these involved many interconnections and sockets which could
fail. It makes much less sense to do it on a single-chip MCU, where the sort of failures that are
plausible on a separate-chip system just don't happen.
> For example, in the old days when systems typically comprised
> separate MCU/RAM/ROM chips, it made sense to test SRAM and
> checksum ROM, as these involved many interconnections and
> sockets which could fail. It makes much less sense to do it on
> a single-chip MCU, where the sort of failures that are
> plausible on a separate-chip system just don't happen.
And the probability that your program will still be able to run
and do predictable things when there is a failure in the MCU is
also small.
Multiply the probability of MCU failure by the probability your
program will run with such a failure, and you get a number
sufficiently close to zero yadda, yadda, ...
--
Grant Edwards grante Yow! Spreading peanut
at butter reminds me of
visi.com opera!! I wonder why?
Hans,
In this case, the moral should really be:
Moral: if you don't know how the answer [i.e. the sensor/hardware] could
fool you, don't ask the question.
Most of the microcontrollers I've seen that are intended for
applications like this have a built-in watchdog timer (I'm assuming
when you say "other than watch dog" you mean "other than external
watchdog"). In the case of the processor I know best, the HC11, it's
called the COP (Computer Operating Properly) timer. The idea here is
your software has to reset it occasionally; if the timer ever goes
off, it's because your control program has gotten itself wedged.
--
Joseph J. Pfeiffer, Jr., Ph.D. Phone -- (505) 646-1605
Department of Computer Science FAX -- (505) 646-1002
New Mexico State University http://www.cs.nmsu.edu/~pfeiffer
Southwestern NM Regional Science and Engr Fair: http://www.nmsu.edu/~scifair
If you have access to a decent library, check out one of these standards
before you choose which hardware to use:
ANSI/AAMI SW68, Medical Device Software - Software Life-Cycle
Processes
ANSI UL1998, the Standard for Safety of Software in Programmable
Systems
EN/IEC 60601-1-4, the Collateral Standard for Programmable Electrical
Medical Systems
Best regards,
Spehro Pefhany
--
"it's the network..." "The Journey is the reward"
sp...@interlog.com Info for manufacturers: http://www.trexon.com
Embedded software/hardware/analog Info for designers: http://www.speff.com
>jiang wrote:
>>
>>> One way to check hardware is to run another identical processor
>>> and compare that they behave the same. If you have three or more
>>> then you can perform voting so that the most popular answer is
>>> the one that gets used.
>>
>> That is cool idea !..
>
>And not so simple. What takes the vote? What if it fails?
Use mechanical or pneumatic voting, not electric.
For instance, if you want to control a bidirectional relay, use a core
with three separate coils, each controlled by a separate processor. If
the currents in two coils flow in opposite directions, their resultant
magnetic field is zero. Then the third coil alone will determine the
resultant force.
Paul
...and adding to that list: an external pulse-maintained relay. This device has
to be fed a change of polarity of its input signal at a regular rate in
order for it to maintain a relay in its energised state. If any single
component fails, the power supply goes off or the input does not change,
then the relay just de-energises and opens its contacts. The pulse drive
for such a circuit should be driven from the processor's internal sanity
checks that your software is performing (all checks OK, so change the state
of the output). This device can elevate a single processor from SIL0 to
SIL1 with very little effort.
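A minimal sketch of the software side of that relay drive, in C. The health_t fields and the g_relay_drive variable are placeholders for whatever self-checks and output pin a given design actually has; the point is only that the output changes state when, and only when, every check has passed.

```c
#include <stdbool.h>

/* Hypothetical results of the periodic self-checks. */
typedef struct {
    bool ram_ok;
    bool comms_ok;
    bool control_loop_ok;
} health_t;

/* Logical level of the pin feeding the pulse-maintained relay. */
bool g_relay_drive = false;

/* Call once per control cycle. Toggles the output only if every
 * check passed; otherwise the pin is left alone, the pulses stop,
 * and the relay drops out by itself. Returns true if toggled. */
bool relay_drive_update(const health_t *h)
{
    if (h->ram_ok && h->comms_ok && h->control_loop_ok) {
        g_relay_drive = !g_relay_drive;
        return true;
    }
    return false;
}
```

Because a stuck-high and a stuck-low output both look like "no pulses" to the relay circuit, a hung program fails toward the de-energised state.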
Further, your microcontroller may be communicating with other systems in
order to perform its control. Doing sanity checks on the communication link
and checking its integrity in operation will yield a good idea of
sub-system health. You will need checksums and/or CRCs on all messages
between systems.
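For the message check, one common choice (an assumption here, the post does not name a particular CRC) is the 16-bit CCITT polynomial 0x1021 with an initial value of 0xFFFF. The frame layout below, payload followed by the CRC high byte and then the low byte, is likewise just an illustration.

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Bitwise CRC-16, polynomial 0x1021, initial value 0xFFFF,
 * no bit reflection, no final XOR. */
uint16_t crc16_ccitt(const uint8_t *data, size_t len)
{
    uint16_t crc = 0xFFFF;
    for (size_t i = 0; i < len; i++) {
        crc ^= (uint16_t)((uint16_t)data[i] << 8);
        for (int bit = 0; bit < 8; bit++) {
            if (crc & 0x8000) {
                crc = (uint16_t)((crc << 1) ^ 0x1021);
            } else {
                crc = (uint16_t)(crc << 1);
            }
        }
    }
    return crc;
}

/* Assumed frame layout: payload bytes, then CRC high byte, low byte. */
bool frame_is_valid(const uint8_t *frame, size_t len)
{
    if (len < 2) {
        return false;
    }
    uint16_t received = (uint16_t)(((uint16_t)frame[len - 2] << 8) | frame[len - 1]);
    return crc16_ccitt(frame, len - 2) == received;
}
```

A table-driven version is faster if code space allows, but the bitwise form is small enough for most memory-limited parts.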
An integral step-wise walking memory test and other walking sanity checks
can detect potential failure points quite early on.
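One way to read "step-wise" is to test a single memory cell per control cycle so the test never steals much time. The sketch below walks a single 1 bit through one byte and restores the original contents; the function names are made up, and on a real target interrupts would have to be masked around ram_test_cell() if an interrupt handler could touch the same cell.

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Walk a single 1 bit through one byte, then restore its contents.
 * Returns false if any pattern fails to read back. */
bool ram_test_cell(volatile uint8_t *cell)
{
    uint8_t saved = *cell;
    bool ok = true;
    for (uint8_t pattern = 1; pattern != 0; pattern = (uint8_t)(pattern << 1)) {
        *cell = pattern;
        if (*cell != pattern) {
            ok = false;
            break;
        }
    }
    *cell = saved;
    return ok;
}

/* Test one cell per call and advance the index, wrapping at the end,
 * so the whole region is covered over many control cycles. */
bool ram_test_step(volatile uint8_t *base, size_t len, size_t *next)
{
    if (len == 0) {
        return true;
    }
    bool ok = ram_test_cell(&base[*next]);
    *next = (*next + 1) % len;
    return ok;
}
```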
There are a number of others.
> Of course, the next question you should ask is "What do I do when I detect
> a
> failure". If it is a safety critical system (e.g. the something you're
> controlling is a train, nuclear reactor or gas furnace rather than a lego
> windmill) there's a whole other set of questions you should ask even
> before asking the first one.
You should do an evaluation of what the system safe state is going to be
(off, bypassed or gracefully degrading). Then your design efforts should
always lean the system toward achieving those safe states unless it is
continuing to work properly.
--
********************************************************************
Paul E. Bennett ....................<email://peb@a...>
Forth based HIDECS Consultancy .....<http://www.amleth.demon.co.uk/>
Mob: +44 (0)7811-639972 .........NOW AVAILABLE:- HIDECS COURSE......
Tel: +44 (0)1235-811095 .... see http://www.feabhas.com for details.
Going Forth Safely ..... EBA. www.electric-boat-association.org.uk..
********************************************************************
There are plenty of simple things you can consider if something is failing.
1) Turn yourself off; there is no need to draw power if you are battery operated.
2) Turn off any external device which should not operate when the program
is not active.
3) Reset yourself.
If the fault is due to a temporary problem, a reset works quite well.
--
Best Regards,
Ulf Samuelsson u...@a-t-m-e-l.com
This is a personal view which may or may not be
share by my Employer Atmel Nordic AB
On one machine I'm very familiar with there are three safety
interlocks (one electrical (not electronic), one hydraulic, and one
mechanical). Only when all 3 agree it is safe is the electronics
allowed to do what it wants.
--
Guy Macon, Electronics Engineer & Project Manager. http://www.guymacon.com/
I worked on an aerospace actuator that did it like this:
Three hydraulic actuators have three electronic control systems.
Each actuator monitors the other two and has two outputs that
are at +5V if it thinks that actuator is good, -25V if it
thinks that actuator is bad. The actual monitoring consists
of challenges/responses through six dual-redundant actuator-
to-actuator digital communication links and looking at extra
pressure transducers on the monitored actuator that are read
by the monitoring actuator. This identifies wrong behavior.
Each actuator has an input that connects to the outputs of
the other actuators through two resistors that form a summing
junction. If the sum is > -5V, it operates normally. If the sum
is < -5V, it goes into "freewheeling mode", where it exerts no
force and is easy to move. If one or both of the other actuators
asserts -25V it freewheels.
Each of the two resistors mentioned above is actually a pair of
resistors in series. The summing junction also has a pair of
high-value resistors in series to local common to hold the input
at 0V in the case of two open input signals.
One actuator can drag along two freewheeling actuators and
control the aircraft.
Two actuators working together can drag along a third actuator
that is trying as hard as it can to go the other way and control
the aircraft.
Result: no single point of failure in the actuator electronics
or voting system can result in loss of control of the aircraft.
--
Guy Macon, Electronics Engineer & Project Manager for hire.
Remember Doc Brown from the _Back to the Future_ movies? Do you
have an "impossible" engineering project that only someone like
Doc Brown can solve? My resume is at http://www.guymacon.com/
>it was feasible to write one, or a small number, of "sanity checks",
>small tests that would evaluate whether arguments being passed and/or
>state variables had values that were appropriate at the moment.
>
>If a sanity check failed we displayed "Fatal Error nnnnn", where nnnnn
>was the program counter at the point where the check failed, and then
>we halted the processor.
[snip]
Don, may I have permission to put your story up on my web page?
Here is another technique which I use:
Start with "finished" and "debugged" code.
Have one programmer insert N bugs in another programmer's code, keeping
careful records of what and where. The idea is to put in errors typical
of the errors that the person writing the code normally makes.
Have the author of the code debug and fix all bugs that he can find,
stopping when he can't find any more bugs. Keep record of all bugs
fixed. Don't tell him which are his or how many were inserted.
Let's say that we inserted 20 bugs, he found 10 of them, and he found
20 of his own bugs. That tells us that there are around 20 of his
own bugs still undiscovered.
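The arithmetic behind that estimate assumes the author finds his own bugs at the same rate as the inserted ones. A small sketch in C (the function name is invented for illustration):

```c
/* Estimate how many of the author's own bugs are still undiscovered.
 * Assumes own bugs are found at the same rate as the seeded ones:
 *   total_own ~= own_found * seeded / seeded_found
 * Returns -1 when no estimate is possible (bad input, or none of the
 * seeded bugs were found). */
int estimate_remaining_bugs(int seeded, int seeded_found, int own_found)
{
    if (seeded <= 0 || seeded_found <= 0 ||
        seeded_found > seeded || own_found < 0) {
        return -1;
    }
    /* Integer division rounded to the nearest whole bug. */
    int total_own = (own_found * seeded + seeded_found / 2) / seeded_found;
    return total_own - own_found;
}
```

With the numbers above, estimate_remaining_bugs(20, 10, 20) gives 20: half the seeded bugs were found, so the 20 real bugs found are taken to be half of about 40.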
The psychology is interesting. The programmers write code with far
fewer bugs and do a far better job of testing before saying that they
are done. The programmer who finds all of the inserted bugs and no
new bugs is a hero. (I reinforce that with bonuses and with specific
mention in writing of this accomplishment during performance reviews.)
As SelfTest hasn't come back yet to give any more info or comments, I am
looking at his "(other than watch dog)" and wondering if the question is
really "Is my micro still running and going about its normal business?"
Usually the first thing any programmer learns is how to flash a LED.
By adding a LED and resistor to an output pin, you can call a "turn LED
on", and "turn LED off" in a sequence, say flash 4 times on power up
being OK.
Extending this further, you can test for certain I/O operations taking
place correctly with a set number of flashes.
Many companies use 7 segment LEDs on their products, and such things as
"system alive" can mean the 7 segment LED running around in a figure 8.
Power up, self test, and real time diagnostics can be performed from a
simple single LED, right up to multiple computer systems to monitor the
operations.
I believe that anybody that designs a useful lump of hardware should
have at least one LED that can be pulsed under program control for this
purpose.
Cheers Don...
--
Don McKenzie
E-Mail Contact Page: http://www.e-dotcom.com/ecp.php?un=Dontronics
USB to RS232 Converter that works http://www.dontronics.com/usb_232.html
Don's Free Guide To Spam Reduction http://www.e-dotcom.com/spam_exp.php
> One other item that helped with the sanity checks, we filled all memory
> with 0xAAAA initially, and even when some memory was released. That
> oddball value was unlikely to be a reasonable value for most state
> variables and helped us fail more sanity checks.
On the Amiga computer one of the testing packages used 0xDEADBEEF to
fill unused memory. ;-)
It also added guard band areas around allocated memory and then checked
those after the free to be sure you didn't write outside of your
allocated area.
That second idea would work best if you had an OS or at least memory
management code.
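Even without an OS, the guard band idea can be applied to a single statically allocated buffer. This is only a sketch: the 4-byte guard width, the 0xAA fill byte and the function names are assumptions chosen for the example.

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

#define GUARD_BYTES 4u
#define GUARD_FILL  0xAAu   /* arbitrary "oddball" fill byte */

/* Lay out a caller-supplied region as [guard][payload][guard], fill
 * all of it with the pattern, and return a pointer to the payload.
 * Returns NULL if the region is too small to hold any payload. */
uint8_t *guarded_init(uint8_t *region, size_t region_len)
{
    if (region_len <= 2u * GUARD_BYTES) {
        return NULL;
    }
    memset(region, GUARD_FILL, region_len);
    return region + GUARD_BYTES;
}

/* Return true if both guard bands still hold the fill pattern,
 * i.e. nothing has written just outside the payload area. */
bool guarded_check(const uint8_t *region, size_t region_len)
{
    if (region_len <= 2u * GUARD_BYTES) {
        return false;
    }
    for (size_t i = 0; i < GUARD_BYTES; i++) {
        if (region[i] != GUARD_FILL) {
            return false;
        }
        if (region[region_len - 1u - i] != GUARD_FILL) {
            return false;
        }
    }
    return true;
}
```

Calling guarded_check() from the periodic sanity checks catches an overrun soon after it happens rather than only when the buffer is released.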
--
Gerald Bonnstetter
Bonnsoft
bonn...@antispamextrastuffnetins.net
>>Here is another technique which I use:
>
>>Start with "finished" and "debugged" code.
>
>>Have one programmer insert N bugs in another programmer's code, keeping
>>careful records of what and where. The idea is to put in errors typical
>>of the errors that the person writing the code normally makes.
>
>I've read about that and given that considerable thought. But I've
>never quite been able to convince myself just what would be appropriate
>to put into the code and where. If you have really found a successful
>way of doing that I'd be interested.
I let the other engineers make that decision after seeing the programmer's
past errors. And when I am wearing my manager hat I insist that any result
other than perfect performance be kept confidential, even from me. This
is a tool for reducing errors, not a tool for beating programmers over
the head.
Let me guess, it was too heavy to fly? ;-)
--
Ben Jackson
<b...@ben.com>
http://www.ben.com/
>>Don, may I have permission to put your story up on my web page?
>
>Feel free. I might even be able to do a better job describing this.
It's quite good as is, but if you want to rewrite it so much the better.
Just post the improved version if you decide to improve it.
Judge for yourself:
http://www.fas.org/man/dod-101/sys/ac/c-17.htm
:)
I am sure that this can be an effective tool. But it seems less than
optimal to introduce bugs in order to get the programmers to debug
existing bugs. Maybe that is just me...
I have read that it can be useful to track the number of bugs found over
time. This typically follows a curve of exponential decay and can help
you predict the number of bugs left in a product. Certainly this is
less intrusive and has less overhead.
One thing I don't support is the idea of engineers beating each other up
over mistakes. I worked at one place where a mistake that was checked
back into version control would result in the author receiving the
"Arrow of Shame". I did not agree that the tip of version control is
what you work with or ship and I certainly did not agree with whacking
people over the head when they made a mistake. I stopped this tradition
on my project.
--
Rick "rickman" Collins
rick.c...@XYarius.com
Ignore the reply address. To email me use the above address with the XY
removed.
Arius - A Signal Processing Solutions Company
Specializing in DSP and FPGA design URL http://www.arius.com
4 King Ave 301-682-7772 Voice
Frederick, MD 21701-3110 301-682-7666 FAX
> Say we have a micro controller with limited memory.
> Say it will perform some realtime control of something.
>
> How to make a SW for a micro controller, that in addition to its
> normal operation (control of something), from time to time it will
> also check itself if it is doing okay or not ? How a program can test
> itself? Can some one suggest any intelligent method (other than watch
> dog) ?
>
>
Going to the ridiculous extreme, we adapted the production test vectors
for the ARM7 core and turned them into a modular program which could be
fired off at intervals, perform a few instructions that exercised part
of the core and affected some of the registers, then wrote those
registers out into a hardware register that accumulated a CRC value. We
actually set this up for a dual-processor system that was used in an
Anti-lock Braking System. The nice feature of that braking system is
that it could fall back to a "dumb" mode if either of the processors
noticed that the other wasn't getting the same results.
The test sets were fine-tuned by running them through a simulation of
the core that allowed us to simulate every possible stuck at one, stuck
at zero fault. The best we could come up with in the time and codespace
allowed was something like a 92% fault detection rate (which equated to
96% of all 'discoverable' faults).
I believe this is now a licensable package available from ARM.
Peter.
I agree completely. Source control works best when developers check in
often. This should really be tempered with individual developer branches but
that requires a little more discipline. At one place I worked it was the
rule that 'main' was sacred. Only a small handful of assigned people could
touch it. All developers would create a branch even if just to fix one bug.
The flexibility to isolate the developer's changes is worth it if you can
afford the demands required by such a system.
The toughest thing was merging everyone's changes back together but the
system served many purposes well.
Also, since I'm ranting already, some source control packages, like
Perforce, are adept at supporting the developers. It is fast and convenient to
'synch' your workstation to whichever check-in point you desire. This makes
it easy to find that one place where some difficult-to-find bug crept in.
This is a form of a technique known as "process pairs". The OP should
do some searching using those keywords.
Anyone who enables the Watchdog timer is advertising:-
1) My code is dodgy.
2) My hardware is EMC prone.
3) I have a new source of error; the watchdog itself.
Cheers
Robin
>Anyone who enables the Watchdog timer is advertising:-
>
>1) My code is dodgy.
>2) My hardware is EMC prone.
>3) I have a new source of error; the watchdog itself.
You will forgive me if I prefer that you stay out of aerospace... <smile>
For any non-trivial application, all three are true.
What a pile of bullshit.
There are more reasons for an embedded system to fail than you
can even begin to imagine. Not using watchdogs (in a sensible
way, of course) is totally irresponsible in my opinion.
http://www.ganssle.com/watchdogs.htm
--
- Alan Kilian <alank(at)timelogic.com>
Director of Bioinformatics, TimeLogic Corporation 763-449-7622
Anyone with a concern for safety and reliability should read this -
and then some.
Well, you have me there, I can only think of four (ignoring <hardware failure>):-
<firmware bug>
<spontaneous alpha particle emission>
<brown-out>
<lightning strike>
Cheers
Robin
There is a lot of interesting detail about space-craft software and
the claim that a WDT could have saved the mission is no more or less
true than fixing the original floating point exception that caused it.
The article then gives an example of crashing cooker-hood-fan firmware
and assumes the WDT had *not* been used. He cannot know this. If the
firmware is poor, then the WDT was likely poorly implemented too.
Here is a quote from the article:-
<start of quote>
"Well-designed watchdog timers fire off a lot, daily and quietly
saving systems and lives without the esteem offered to other, human,
heroes. Perhaps the developers producing such reliable WDTs deserve a
parade. Poorly-designed WDTs fire off a lot, too, sometimes saving
things, sometimes making them worse."
<end of quote>
I disagree. When the WDT fires, it is a disaster that needs fixing and
if it goes off "a lot" and especially "quietly" it is a cover-up where
the developers *should* be paraded.
Cheers
Robin
>I disagree. When the WDT fires, it is a disaster that needs fixing and
>if it goes off "a lot" and especially "quietly" it is a cover-up where
>the developers *should* be paraded.
You don't understand.
Best regards,
Spehro Pefhany
Here is a counter-example. The hardware is operating in a noisy
environment. This induces dropped bits, etc. The software can
handle most of the data errors, but has a few problems when the IC
is altered and it is driven off to executing random data. Time
for the three fingered salute, administered by the faithful hound.
--
Chuck F
>"robin...@tesco.net" wrote:
>>
>... snip ...
Let me "requote" some of that, so I can respond to it here:
>>The article then gives an example of crashing cooker-hood-fan firmware
>>and assumes the WDT had *not* been used. He cannot know this. If the
>>firmware is poor, then the WDT was likely poorly implemented too.
Putting the discussion of WDT's aside for a moment, I find it
inexcusable (engineering-wise) that such a simple application as the
cooker-hood-fan would crash or fail (maybe in development, but
certainly not in production), whether it's from (a) firmware bug(s) or
susceptibility to static discharge.
OTOH, I can see where a marketing person might play with it for two
minutes (before adequate testing is done), declare to management in
the heat of time-to-market pressures "It works, let's ship it" and a
bad/untested design goes out the door, perhaps even over the
protestations of the person(s) who designed it.
>>Here is a quote from the article:-
>>
>><start of quote>
>>"Well-designed watchdog timers fire off a lot, daily and quietly
>>saving systems and lives without the esteem offered to other, human,
>>heroes. Perhaps the developers producing such reliable WDTs deserve a
>>parade. Poorly-designed WDTs fire off a lot, too, sometimes saving
>>things, sometimes making them worse."
>><end of quote>
WDT's ARE valuable, but certainly not for the reasoning given
above.
What it SHOULD have said (IMHO) is:
Well-designed watchdog timers in well-designed systems RARELY if
EVER fire off, but like an airbag and seat belts in a car accident,
when they do fire off they save systems that would otherwise, perhaps
literally as well as figuratively, be "lost in space."
>> I disagree. When the WDT fires, it is a disaster that needs
>> fixing and if it goes off "a lot" and especially "quietly" it
>> is a cover-up where the developers *should* be paraded.
I certainly agree that WDT's should RARELY if ever fire. It helps
to have it turned off for general development, but there should be a
testing time where it's on (and the timer reset point should of course
be carefully thought out as part of the design), and any reset
generated should be investigated for its cause (this is where an
emulator and logic analyzer are really worth their rental fees) and a
correction put into place.
I've read and enjoyed some of Jack Ganssle's articles before, but
Robin points out very well that Jack misses the mark on this one. Has
anyone emailed him about this thread yet?
>Here is a counter-example. The hardware is operating in a noisy
>environment. This induces dropped bits, etc. The software can
>handle most of the data errors, but has a few problems when the IC
>is altered and it is driven off to executing random data. Time
>for the three fingered salute, administered by the faithful hound.
This is an example where the hardware isn't shielded well enough
from the environment, or isn't robust enough or rad-hard enough to
operate reliably in the environment. Fix that, then go for long-term
testing to see if the WDT ever fires.
Having a WDT reset the hardware doesn't make a system reliable. It
is only a protection against rare, worst-case conditions. And I mean
TRULY rare conditions, not "rare" as the word is (ab)used on eBay.
Here, I'll frame it for you. Print it, cut it out and paste it on
your monitor:
_________________________________________________________________
/ \
| Having a WDT reset the hardware doesn't make a system reliable. |
\_________________________________________________________________/
I use a similar technique to keep the developers and validators
thinking. Developers occasionally add little changes that aren't
specified or are true mistakes. The validators occasionally report
or demonstrate problems that are fictitious. Both groups keep tabs
on each other.
Worried that your new function isn't properly tested? Break it
or add something silly like an off-color display or easter egg.
After a change is validated, break it again a little while later
and see if a regression test was done.
Worried that a developer isn't paying attention? Report an
error that was fixed or can't happen. See how long it takes
to discover the hoax. If you are evil and spot a developer's
terminal unoccupied, make a small change -- wording, duplicate
line, etc.
Each group can play the game. Does your developer sneak in
undocumented code changes? Do random checks on version control.
Will the person checking in the final code notice the random
comment "XYZ checked in this code and didn't notice; owes ABC
a snickers bar"? Did the manager really read that status
report or design document?
Good natured fun can liven up the group and keep them 'awake'.
David
The causes could be numerous - static discharge (not just the effects
of lightning strikes), radio interference, other forms of radiation,
electrical shorts due to fluid spillage, inappropriate scope of
device usage (I don't consider it a software bug here) --- all these
faults could leave the device in a state where the software can't run.
The reason that it is used in the medical field is that it provides a
cost-effective mitigation for many ailments. Designing equipment to
operate in a room full of X-Ray, MRI, etc equipment - some dating back
a few decades, can be a very daunting exercise. Of course there is a
minimum standard EMC requirement that medical equipment conform to.
Also I disagree with the notion that using a watchdog "advertises"
some deficiency of the device (paraphrasing here). For me its use
suggests that the developers have applied due diligence and have
used it as a mitigation against faults which they've identified
through some analysis.
Ken.
>
>
>Cheers
>Robin
+====================================+
I hate junk email. Please direct any
genuine email to: kenlee at hotpop.com
I am glad you have unlimited funds to spend on your productions.
A few pounds of lead around the system is always welcome, and
encourages sales. Some of us believe in engineering the product
to fit the desired use.
--
fix (vb.): 1. to paper over, obscure, hide from public view; 2.
to work around, in a way that produces unintended consequences
that are worse than the original problem. Usage: "Windows ME
fixes many of the shortcomings of Windows 98 SE". - Hutchison
>Ben Bradley wrote:
>> CBFalconer <cbfal...@yahoo.com> wrote:
>>
>... snip ...
>>
>>> Here is a counter-example. The hardware is operating in a noisy
>>> environment. This induces dropped bits, etc. The software can
>>> handle most of the data errors, but has a few problems when the IC
>>> is altered and it is driven off to executing random data. Time
>>> for the three fingered salute, administered by the faithful hound.
>>
>> This is an example where the hardware isn't shielded well enough
>> from the environment, or isn't robust enough or rad-hard enough to
>> operate reliably in the environment. Fix that, then go for
>> long-term testing to see if the WDT ever fires.
>
>I am glad you have unlimited funds to spend on your productions.
It appears that you are thinking that the proper way to design a
product is to make a complete product and then start to wonder how to
get it through the EMC and other tests, hoping that a ferrite bead
here and a bypass capacitor there will solve the problems. Then you spend a
lot of time trying, usually with several iterations, to get the device
to just pass the test, and still wonder about random lockups and justify
the use of the WDT.
EMC design should be part of the whole design cycle. You should design
the RF filter return paths and static electricity discharge paths so
that they do not go through any sensitive areas, since the tracks will
have a significant inductance and thus have a high reactance (or even
resonate) at high frequencies, or generate quite a high voltage when a
high current from a static discharge passes through them. This does not
necessarily cost very much as a whole, since it is done in the design
phase.
A metallic (or at least conductive) box or extra ground planes on the
PCB may also be required. This of course may cost some extra, but it
reduces support cost in the field.
A system designed for good EMC performance should also be quite immune
to "unexplained" crashes or lockups and thus reduce the need for WDT.
>A few pounds of lead around the system is always welcome, and
>encourages sales. Some of us believe in engineering the product
>to fit the desired use.
"Desired use" seems to be to get the product sold, but not to care if the
customer has to throw it away as useless. I just wonder if the
customer is going to buy anything else with the same brand name in the
future. I am glad that the CE requirements removed at least some of the
worst trash from the European market.
Paul
>> This is an example where the hardware isn't shielded well enough
>> from the environment, or isn't robust enough or rad-hard enough to
>> operate reliably in the environment. Fix that, then go for
>> long-term testing to see of the WDT ever fires.
>
> I am glad you have unlimited funds to spend on your productions.
> A few pounds of lead around the system is always welcome, and
> encourages sales. Some of us believe in engineering the product
> to fit the desired use.
>
> --
> fix (vb.): 1. to paper over, obscure, hide from public view; 2.
> to work around, in a way that produces unintended consequences
> that are worse than the original problem. Usage: "Windows ME
> fixes many of the shortcomings of Windows 98 SE". - Hutchison
Protecting the hardware is not really a costly exercise. Most of the
time it involves little more than appropriate filtering of the inputs,
maybe a thin metal can over sensitive circuitry, using metal boxes
instead of plastic ones. Look at it as developing boxes within boxes
and using appropriate barrier techniques at the barrier boundaries.
The total cost can often be less than not doing these simple things.
--
********************************************************************
Paul E. Bennett ....................<email://peb@a...>
Forth based HIDECS Consultancy .....<http://www.amleth.demon.co.uk/>
Mob: +44 (0)7811-639972 .........NOW AVAILABLE:- HIDECS COURSE......
Tel: +44 (0)1235-811095 .... see http://www.feabhas.com for details.
Going Forth Safely ..... EBA. www.electric-boat-association.org.uk..
********************************************************************
Lead? You're afraid of cosmic rays? Is not magnetic induction more of a risk?
Robin
: Well, you have me there, I can only think of four (ignoring <hardware failure>):-
I would think hardware failure is a good enough reason in and of itself, and
in fact that is the usual reason I thought watchdogs were for.
If your code PROM/EPROM/EEPROM/flash fails and the mcu starts executing
random memory as code, you want to make sure your motors, pumps, X-ray tube,
etc. shut down.
--
==========================================================
Chris Candreva -- ch...@westnet.com -- (914) 967-7816
WestNet Internet Services of Westchester
http://www.westnet.com/
Whatever the cause of the problem, a WDT won't fix it, though it
may cover it up for a while.
I suspect CB was angered that I pointed out a flaw in his
counter-example, so he came back with something mean-spirited. I
didn't mean my response as a personal attack, but this is Usenet and I
can't take responsibility for how others read my posts.
>Robin
>In comp.robotics.misc robin...@tesco.net <robin...@tesco.net> wrote:
>
>: Well, you have me there, I can only think of four (ignoring <hardware failure>):-
>
>I would think hardware failure is a good enough reason in and of itself, and
>in fact that is the usual reason I thought watchdogs were for.
If it appears that the hardware is falling apart, how could you trust
it to make any sensible decisions? Of course, if each output
individually falls into a fail-safe state when not refreshed by the
processor, then it makes sense to halt the processor immediately if
something suspicious happens. Trying to do something after a watchdog
reset will usually just worsen the situation if the hardware is
suspect.
>If your code PROM/EPROM/EEPROM/flash fails and the mcu starts executing
>random memory as code, you want to make sure your motors, pumps, X-ray tube,
>etc shuts down.
In any really safety-critical system, you should use a double or triple
(voting) redundant system, not watchdogs.
Paul
Hardly. The particulars do not matter. The point is that,
whatever the product, there is a limit to the practical production
cost. You need the best bang for the buck. Random external
events may require prodigious efforts to block. You, not I,
brought up radiation shielding, and I only mentioned a means of
blocking such. (To robin: cosmics are only one of a wide range of
radiation extant. They are extremely hard to block.)
You need to face reality, in that something is going to fail.
When it does, you need a means of avoiding further damage and/or
effecting recovery. If you think you can build anything that is
failure, damage, and idiot proof you have delusions of grandeur.
: If it appears that the hardware is falling apart, how could you trust
: that it makes any sensible decisions ? Of course, if each output
You've changed the situation -- 'the hardware is falling apart' is hardly the
same as a single hardware failure.
Generally, an MCU on reset sets the outputs to a known value -- all 0 or all
1. If you design fail-safe, then a hardware reset, in the face of some
failing hardware, will at least make sure everything is off.
: In any really safety critical system, you should use double or triple
: (voting) redundant system, not watchdogs.
There is a WHOLE class of problems for which that is complete overkill.
Take an arcade game, or vending machine, or any machine that is going to take
physical punishment and need regular maintenance.
People are going to beat on a soda machine. Do you want to put
triple-redundancy memory on that, or just design it such that when it breaks
it just sits there resetting itself, so no one can get free soda?
Arcade games use watchdogs because there is a very small window in which they
will make money. (Or used to, when they were dedicated hardware; now it's
largely PC-level hardware, but I digress.) Competition means getting the thing
out the door relatively quickly, and cheap enough to sell.
You want to get every bug, but if you wait too long, you'll be into the next
generation. The watchdog means that if there IS a bug, the machine will just
reset and keep earning money, instead of not earning money until an op gets
to it.
Fail-safe means that WHEN the thing fails, you try your best to make sure
it's in a 'safe' condition.
Which brings up Robin's original point about "dodgy code". Like it or
not, code defects will occasionally make their way into any non-trivial
project produced in the real world. In the face of difficult deadlines,
compromises will occasionally get made, people may screw up, QA may fall
down on the job.
Anyone who claims NEVER, EVER to have unwittingly released "dodgy code",
or to have been part of a team that did so, is either:
1) lying
2) someone who has never had to code under pressure (time and cost constraints)
3) lying -- to themselves
4) someone who has not been coding for very long, or never on a project with much complexity
As another poster put it, watchdogs are one facet of an entire process
of due diligence, which should also encompass code reviews, sane coding
and design techniques, thorough QA, etc. In general, not implementing
watchdogs where it might make sense to do so is, frankly, foolish.
--
(Replies: cleanse my address of the Mark of the Beast!)
Teleoperate a roving mobile robot from the web:
http://www.swampgas.com/robotics/rover.html
Coauthor with Dennis Clark of "Building Robot Drive Trains".
Buy several copies today!
Besides: redundancy still isn't a good reason not to use watchdogs.
You may have 4 redundant devices, but what if they all fail at the same
time (which could happen under extreme, unplanned conditions)?
What if only one of them fails, but there is another unexpected failure
that prevents the redundancy from functioning as expected (that is, you
have 3 working devices, but the whole system fails to notice there is
something wrong with the 4th)? Well, you get the idea.
If fighter planes were perfect, pilots were perfect and conditions
were perfect, guaranteed 100% of the time, we wouldn't need to design
ejection seats. But we still design them, and once in a while, they
are actually useful and save a life. That's exactly the same thing.
Who cares whose fault it is when an unexpected event occurs? It's
useful to be able to retrieve detailed info about failures, but at the
moment it happens nobody cares: the system has to recover in the
quickest way possible. Period.
As a basic rule of thumb, I'd just say that watchdogs are good for
dealing with transient, temporary, unexpected failures. Redundancy
is used more with a long-term (or complete) failure of one or several
devices in mind. Of course, if designed in a sensible manner, they
can complement one another and even interact with one another. That's
when things get interesting.
>>If your code PROM/EPROM/EEPROM/flash fails and the mcu starts executing
>>random memory as code, you want to make sure your motors, pumps, X-ray
>>tube, etc shuts down.
>
> In any really safety critical system, you should use double or triple
> (voting) redundant system, not watchdogs.
Double or triple redundancy is not always the answer for Safety Critical
Systems. Sometimes just a different logical processor (or even a relay
based interlocking scheme) will provide the protection. Sometimes you
have to even consider fully mechanical interlocking as part of the
system. Whatever mitigation scheme you need to use should be based on
the risk assessment arising from a fully discovered HAZOP study.
Having watched over a lot of the responses, I am in the camp that is
aimed at getting the code as correct as you possibly can before you
begin to worry about turning the watchdog on. However, I also use a
separate Pulse Maintained Relay circuit that has to be kept energised
by a correctly responding system. This relay automatically signals
unhealthy if it de-energises, whether because the system fails to kick
it properly or because of a failure in its own circuitry (see my
Reading and Writing the World articles on my website).
>
>Double or triple redundancy is not always the answer for Safety Critical
>Systems. Sometimes just a different logical processor (or even a relay
>based interlocking scheme) will provide the protection. Sometimes you
>have to even consider fully mechanical interlocking as part of the
>system. Whatever mitigation scheme you need to use should be based on
>the risk assessment arising from a fully discovered HAZOP study.
The main purpose of redundant systems is to let the system operate
normally even if some controllers fail, not safety. I fully agree that
the last ditch security system should not rely on computer logic and
preferably not even on electricity.
Paul
>In comp.robotics.misc Paul Keinanen <kein...@sci.fi> wrote:
>
>: If it appears that the hardware is falling apart, how could you trust
>: that it makes any sensible decisions ? Of course, if each output
>
>You've changed the situation -- 'the hardware is falling apart' is hardly the
>same as a single hardware failure.
But how does the WDT tell the difference between a transient failure
and the hardware falling apart ?
The self test routines after reset may detect some permanent failure
or it might not. The self test routine itself could go crazy due to
permanent hardware problems and the WDT kicks in again.
Now we have another interesting situation, which has not been
discussed so far. If there is a permanent hardware/software error and
the WDT triggers over and over again, this can also cause a lot of
damage (e.g. due to repeated large startup currents in some big
loads). Thus, the WDT should be allowed to kick in only for a
predefined number of times and then disable the whole system until
manual intervention.
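A minimal sketch of that reset-limit idea, not taken from anyone's
actual product. It assumes a small record that survives a watchdog reset
(no-init RAM, battery-backed RAM or EEPROM; how it gets there is
platform-specific and not shown) and a flag derived from the MCU's
reset-cause register. The names reset_log_t, startup_allowed,
RESET_LOG_MAGIC and MAX_WDT_RESETS are made up for illustration.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

#define RESET_LOG_MAGIC 0x57445421u /* arbitrary "record is valid" marker */
#define MAX_WDT_RESETS  3u          /* assumed limit, tune per product */

/* Hypothetical record kept in memory that survives a watchdog reset. */
typedef struct {
    uint32_t magic;      /* equals RESET_LOG_MAGIC once initialised */
    uint8_t  wdt_resets; /* consecutive watchdog resets seen so far */
} reset_log_t;

/* Call once at start-up, before any output is touched.
 * was_wdt_reset would be read from the reset-cause register.
 * Returns true if the system may start, false if it must stay
 * locked out until a power cycle or manual reset. */
bool startup_allowed(reset_log_t *log, bool was_wdt_reset)
{
    assert(log != NULL);

    if (log->magic != RESET_LOG_MAGIC) { /* cold start: contents are garbage */
        log->magic = RESET_LOG_MAGIC;
        log->wdt_resets = 0;
    }
    if (!was_wdt_reset) {                /* power-on or manual reset */
        log->wdt_resets = 0;
        return true;
    }
    if (log->wdt_resets >= MAX_WDT_RESETS) {
        return false;                    /* limit reached: stay locked out */
    }
    log->wdt_resets++;
    return true;
}
```

With the assumed limit of 3, the first three watchdog restarts are
allowed and the fourth locks the system out until a non-watchdog reset
clears the count. A real design would probably also clear the count
after some period of healthy running.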
Paul
I have also noticed a trend for some newer WDOG devices to have quite
long timeout options (mins to even hours). This can have merit, as
examples given in another thread show the problems with designing too
close to a WDOG's poorly defined timebase.
Other WDOGs I've seen have a longer FIRST trigger window, to allow
more elasticity on POST/Boot modes, until the operational SW proper
starts working.
It would be a good idea to check for annoyance/damage modes in the
case where the WDOG fires continually.
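To make the "longer FIRST trigger window" behaviour concrete, here is a
small software model of such a watchdog. It is only a sketch for
reasoning about timeouts in a host-side test, not a driver for any real
WDOG device; the names soft_wdt_t, wdt_start, wdt_kick and wdt_tick are
invented for this example.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Software model of a watchdog whose first window is longer than the
 * windows that follow. Time is counted in abstract ticks. */
typedef struct {
    uint32_t run_timeout; /* ticks allowed between kicks after the first */
    uint32_t remaining;   /* ticks left before the watchdog expires */
    bool     expired;     /* latched once the timeout is reached */
} soft_wdt_t;

/* Arm the watchdog with a long first window (covers POST/boot). */
void wdt_start(soft_wdt_t *w, uint32_t first_timeout, uint32_t run_timeout)
{
    assert(w != NULL && first_timeout > 0 && run_timeout > 0);
    w->run_timeout = run_timeout;
    w->remaining = first_timeout;
    w->expired = false;
}

/* Kick: every kick reloads the shorter run-time window.
 * A kick after expiry has no effect, as with a hardware reset. */
void wdt_kick(soft_wdt_t *w)
{
    assert(w != NULL);
    if (!w->expired) {
        w->remaining = w->run_timeout;
    }
}

/* Advance time by one tick. Returns true once the watchdog has expired. */
bool wdt_tick(soft_wdt_t *w)
{
    assert(w != NULL);
    if (!w->expired && --w->remaining == 0) {
        w->expired = true;
    }
    return w->expired;
}
```

For example, wdt_start(&w, 5, 2) tolerates four ticks of boot time
without a kick, but once the first kick has arrived the software has to
kick again before two more ticks pass.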
-jg
> But how does the WDT tell the difference between a transient failure
> and the hardware falling apart ?
>
> The self test routines after reset may detect some permanent failure
> or it might not. The self test routine itself could go crazy due to
> permanent hardware problems and the WDT kicks in again.
>
> Now we have an other interesting situation, which has not been
> discussed so far. If there is a permanent hardware/software error and
> the WDT triggers over and over again, this can also cause a lot of
> damage (e.g. due to repeated large startup currents in some big
> loads). Thus, the WDT should be allowed to kick in only for a
> predefined number of times and then disable the whole system until
> manual intervention.
The answer to that is you DO NOT turn on any outputs until your system
can determine for itself that it is able to function within its design
parameters. You can count the number of watchdog kicks once you have
completed the POST routines to ensure that a minimum number of correct
kicks have happened before you enable the outputs to be turned on.
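Roughly, that gating could look like the sketch below. This is my own
illustration of the idea, not the poster's code: output_gate_t,
gate_init, gate_on_good_kick and the threshold MIN_GOOD_KICKS are
assumed names and values, and the actual POST and output drivers are
left out.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

#define MIN_GOOD_KICKS 10u /* assumed threshold, tune per product */

/* Tracks whether the outputs may be driven yet. */
typedef struct {
    bool     post_passed;     /* result of the power-on self test */
    uint32_t good_kicks;      /* correct watchdog kicks since POST */
    bool     outputs_enabled; /* false means outputs stay in safe state */
} output_gate_t;

/* Call after POST has run; outputs always start disabled. */
void gate_init(output_gate_t *g, bool post_passed)
{
    assert(g != NULL);
    g->post_passed = post_passed;
    g->good_kicks = 0;
    g->outputs_enabled = false;
}

/* Call each time the main loop has passed its own checks and kicked
 * the watchdog. Returns true once the outputs may be switched on. */
bool gate_on_good_kick(output_gate_t *g)
{
    assert(g != NULL);
    if (!g->post_passed) {
        return false; /* a failed POST never enables the outputs */
    }
    if (g->good_kicks < MIN_GOOD_KICKS) {
        g->good_kicks++;
    }
    if (g->good_kicks >= MIN_GOOD_KICKS) {
        g->outputs_enabled = true;
    }
    return g->outputs_enabled;
}
```

Because every reset goes back through gate_init, a system that keeps
getting reset by the watchdog never collects enough good kicks and so
never switches its loads on.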
As colleague DW said, " ... idiot proof. It proves we're
idiots." He was kidding, of course.
Regards. Mel.
I remember reading a warranty clause like this:
"The only guarantee you'll get from us is that eventually all our equipment
will fail."
--
Best Regards
Ulf at atmel dot com
These comments are intended to be my own opinion and they
may, or may not be shared by my employer, Atmel Sweden.
> Anyone who claims NEVER, EVER to have unwittingly released "dodgy
> code", or to have been part of a team that did so is either:
>
> 1) lying
> 2) never had to code under pressure (time and cost constraints)
> 3) lying -- to themselves
> 4) not been coding for very long, or never on a project with much
> complexity
Amen to #4. I remember reading a story
about a company that, when hiring salesmen, would always ask the
prospective salesman about the major accounts that he had *lost*. If he
had never lost a customer, he didn't get hired, because that meant that he
had never "played in the major leagues."
Part of being a geek is having a tendency to grossly overestimate the role
that personal ability plays in the success of one's work. The reality is
that the highest levels of intelligence (or its correlates) that have been
observed in human beings are *far, far* away from the levels that would
guarantee perfection. Any business process that relies on humans being
omniscient is, by definition, a failure. There is *no* way to guarantee
that Mr. Murphy will never pay you a visit. There are practices that will
make him feel distinctly unwelcome (and there are practices that amount to
buying him a first-class plane ticket and putting him up in the penthouse
suite of the most expensive hotel in town), but none of them will offer you
absolute certainty.
No system can be 100% reliable. All that matters is to get the level of
reliability required by the application.
Generally, the voter mechanism is designed to be far more reliable
than the rest of the system (more reliable even than the overall
reliability the voting scheme is meant to achieve).
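One reason a voter can be made that reliable is that its logic is
tiny. A sketch of a 2-out-of-3 voter in software follows, assuming
three channels that should produce identical integer results. The names
vote3 and vote_status_t are invented here, and real voters are often
done in hardware instead.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

typedef enum {
    VOTE_OK,       /* all three channels agree */
    VOTE_DEGRADED, /* two agree, one channel is the odd one out */
    VOTE_FAILED    /* no two channels agree */
} vote_status_t;

/* 2-out-of-3 majority vote on three channel results.
 * On VOTE_OK or VOTE_DEGRADED the majority value is stored in *out.
 * On VOTE_FAILED *out is left untouched and the caller should go to
 * its safe state. */
vote_status_t vote3(int32_t a, int32_t b, int32_t c, int32_t *out)
{
    assert(out != NULL);
    if (a == b && b == c) {
        *out = a;
        return VOTE_OK;
    }
    if (a == b || a == c) {
        *out = a;
        return VOTE_DEGRADED;
    }
    if (b == c) {
        *out = b;
        return VOTE_DEGRADED;
    }
    return VOTE_FAILED;
}
```

The VOTE_DEGRADED result matters as much as the value itself: it is the
only way the rest of the system learns that one channel has gone bad
while the other two are still carrying on.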
Marc