"Am I still working okay?" asked the micro controller...

SelfTest

May 19, 2004, 9:05:23 AM

Say we have a micro controller with limited memory.
Say it will perform some realtime control of something.

How can one write software for a micro controller that, in addition to its
normal operation (control of something), also checks itself from time to
time to see whether it is doing okay? How can a program test itself? Can
someone suggest any intelligent method (other than a watchdog)?

Uddo Graaf

May 19, 2004, 9:12:38 AM

"SelfTest" <SelfTEst> wrote in message
news:40ab5b93$0$3034$afc3...@news.optusnet.com.au...

That's called a 'watchdog' timer and is standard in most microcontrollers.
It's basically a countdown timer which the program running on the
microcontroller must reload several times per second to keep it from
reaching zero. When it reaches zero, the microcontroller is reset. So when a
program 'hangs', it stops reloading the watchdog timer and the
microcontroller is reset.
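
As a rough illustration of the countdown idea, here is a host-side model in C.
It is only a sketch: a real watchdog is a hardware peripheral with
vendor-specific registers, and the names watchdog_t, wdt_init, wdt_kick and
wdt_tick are made up for this example.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical software model of a countdown watchdog. */
typedef struct {
    uint32_t reload;   /* value loaded on every kick */
    uint32_t counter;  /* current countdown value */
} watchdog_t;

static void wdt_init(watchdog_t *w, uint32_t reload_ticks)
{
    assert(reload_ticks > 0u);  /* a zero timeout would reset immediately */
    w->reload = reload_ticks;
    w->counter = reload_ticks;
}

/* Called by the application while it is still making progress. */
static void wdt_kick(watchdog_t *w)
{
    w->counter = w->reload;
}

/* Called once per timer tick; returns true when a reset should occur. */
static bool wdt_tick(watchdog_t *w)
{
    if (w->counter > 0u) {
        w->counter--;
    }
    return w->counter == 0u;
}
```

A hung program never reaches wdt_kick(), so wdt_tick() eventually reports
that a reset is due.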

moocowmoo

May 19, 2004, 9:46:03 AM

"SelfTest" <SelfTEst> wrote in message
news:40ab5b93$0$3034$afc3...@news.optusnet.com.au...

One way to check hardware is to run another identical processor and compare
that they behave the same. If you have three or more then you can perform
voting so that the most popular answer is the one that gets used.
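
A minimal sketch of the voting step in C, assuming each of three processors
reports an 8-bit output word. The function names vote3 and dissenter3 are
invented here, and the sketch says nothing about how the voter itself is
protected.

```c
#include <assert.h>  /* handy for host-side checks of these helpers */
#include <stdint.h>

/* Bitwise 2-of-3 majority: each output bit takes the value held by at
 * least two of the three inputs. */
static uint8_t vote3(uint8_t a, uint8_t b, uint8_t c)
{
    return (uint8_t)((a & b) | (a & c) | (b & c));
}

/* Returns the index (0..2) of the first channel that differs from the
 * voted value, or -1 if all three channels agree with it. */
static int dissenter3(uint8_t a, uint8_t b, uint8_t c)
{
    uint8_t v = vote3(a, b, c);
    if (a != v) { return 0; }
    if (b != v) { return 1; }
    if (c != v) { return 2; }
    return -1;
}
```

With inputs 0x5A, 0x5A and 0x00 the voted value is 0x5A and channel 2 is
flagged as the odd one out.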

Peter


Hans-Bernhard Broeker

May 19, 2004, 9:43:46 AM

[OP forgot to limit F'up2; fixed. Removed non-existent c.a.e.piclist
from Newsgroups:]

In comp.arch.embedded SelfTest <SelfTEst> wrote:

> How to make a SW for a micro controller, that in addition to its normal
> operation (control of something), from time to time it will also check
> itself if it is doing okay or not ?

Ultimately, you can't. A CPU can no more meaningfully ask itself "Am
I still working OK?" than you can ask yourself meaningfully "Have I
fallen asleep yet?"

You can use watchdogs or internal consistency checking to some extent
to determine general health of the software. Assertions can be
inserted into the code, i.e. conditions that you know must come out
true at all times, because otherwise something's fatally wrong.
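
A minimal sketch of such an assertion in C, assuming a small system where
the standard assert() and its abort() are not wanted. SANITY_CHECK,
sanity_failed and set_duty_percent are hypothetical names used only for
illustration.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical failure record; a real system would also force its outputs
 * to a safe state and then halt or reset. */
static volatile uint16_t last_failed_line = 0;
static volatile uint32_t failure_count = 0;

static void sanity_failed(uint16_t line)
{
    last_failed_line = line;
    failure_count++;
}

/* Check a condition that must always hold; record the line if it does not. */
#define SANITY_CHECK(cond) \
    do { if (!(cond)) { sanity_failed((uint16_t)__LINE__); } } while (0)

/* Example: a PWM duty cycle must stay within 0..100 percent. */
static uint8_t set_duty_percent(int requested)
{
    SANITY_CHECK(requested >= 0 && requested <= 100);
    if (requested < 0)   { requested = 0; }
    if (requested > 100) { requested = 100; }
    return (uint8_t)requested;
}
```

The check costs one comparison and tells you where the impossible value
showed up, which is often all you can hope to learn.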

But there's often little or no point trying to detect hardware faults
--- if the hardware does break you're quite probably toast anyway.
You can't usually fix such a problem from the software side, and by
The Usual Kind of Luck, the faults that do occur will be exactly those
you can't, or at least didn't test for. And that's before you
consider that such tests mean more code in total, and thus more
opportunities for bugs.

Morale: if you don't know what to do with the answer, don't ask the
question.

--
Hans-Bernhard Broeker (bro...@physik.rwth-aachen.de)
Even if all the snow were burnt, ashes would remain.

jiang

May 19, 2004, 9:48:47 AM

> One way to check hardware is to run another identical processor and
> compare that they behave the same. If you have three or more then you
> can perform voting so that the most popular answer is the one that gets
> used.
>
> Peter
>
>

That is a cool idea!

Unbeliever

May 19, 2004, 9:58:24 AM

"SelfTest" <SelfTEst> wrote in message
news:40ab5b93$0$3034$afc3...@news.optusnet.com.au...

You are correct in identifying watchdog timers as one form of COP (computer
operating properly) test. Other things I've often used are:
1) Background checksum on code and constant/initializer areas of memory.
2) Flags and timers which indicate that critical routines and interrupts
are running at about the right rate, usually checked in the watchdog timer
interrupt.
3) Guardwords between stacks and other memory, and regular checks that
these have not been compromised (again, often in the watchdog timer
interrupt).
4) Feedback of critical output signals to ensure the hardware is working
correctly (the hardware is much more likely to suffer random failures than
the software).
5) A decent watchdog timer with an algorithmic stimulus and response
(e.g. watchdog processor supplies a pseudorandom number and main processor
replies with the next pseudorandom number in a sequence). Much better than
the primitive kick-within-a-certain-time style of watchdog, which is prone
to failing to detect runaway software that includes a kick.
6) One I haven't used, but have seen used on a critical PLC-style system, is
an odd number of redundant processors (3 in this case) which vote on the
state of an output (output follows the state of two agreeing inputs).
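
Item 5 can be sketched roughly as follows in C. This is an assumption-laden
illustration, not a description of any particular watchdog chip: the
supervisor_t type and function names are invented, and the pseudorandom
sequence is a 16-bit Galois LFSR with the commonly used tap mask 0xB400
(the seed must be nonzero).

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* One step of a 16-bit Galois LFSR (maximal length for taps 0xB400). */
static uint16_t lfsr_next(uint16_t s)
{
    uint16_t lsb = (uint16_t)(s & 1u);
    s >>= 1;
    if (lsb) {
        s ^= 0xB400u;
    }
    return s;
}

/* State kept by the hypothetical external watchdog processor. */
typedef struct {
    uint16_t state;  /* current challenge; must start nonzero */
} supervisor_t;

/* The number the watchdog processor presents to the main processor. */
static uint16_t supervisor_challenge(const supervisor_t *s)
{
    return s->state;
}

/* Returns true if the reply is the next number in the sequence. On a
 * wrong reply the caller would force a reset of the main processor. */
static bool supervisor_check(supervisor_t *s, uint16_t response)
{
    uint16_t expected = lfsr_next(s->state);
    if (response != expected) {
        return false;
    }
    s->state = expected;  /* advance so the next challenge differs */
    return true;
}
```

The main processor answers with lfsr_next(challenge). Runaway code that
merely repeats an old reply fails the next check, which a plain timed kick
would not catch.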

Of course, the next question you should ask is "What do I do when I detect a
failure". If it is a safety critical system (e.g. the something you're
controlling is a train, nuclear reactor or gas furnace rather than a lego
windmill) there's a whole other set of questions you should ask even before
asking the first one.

hth,
Alf

moocowmoo

May 19, 2004, 10:13:51 AM

<jiang> wrote in message
news:40ab65bf$0$1587$afc3...@news.optusnet.com.au...

It's not my idea; NASA uses a set of five computers for the Space Shuttle
flight software.

CBFalconer

May 19, 2004, 10:54:55 AM

jiang wrote:
>
>> One way to check hardware is to run another identical processor
>> and compare that they behave the same. If you have three or more
>> then you can perform voting so that the most popular answer is
>> the one that gets used.
>

> That is cool idea !..

And not so simple. What takes the vote? What if it fails?

--
Chuck F (cbfal...@yahoo.com) (cbfal...@worldnet.att.net)
Available for consulting/temporary embedded and systems.
<http://cbfalconer.home.att.net> USE worldnet address!

martin griffith

May 19, 2004, 10:55:21 AM

have a look at
http://www.embedded.com/story/OEG20030115S0042
There seems to be a lot to getting just a little old WD bulletproof

martin

Three things are certain:
Death, taxes and lost data.
Guess which has occurred.

Grant Edwards

May 19, 2004, 11:11:40 AM

On 2004-05-19, SelfTest <> wrote:

> Say we have a micro controller with limited memory.
> Say it will perform some realtime control of something.
>
> How to make a SW for a micro controller, that in addition to its normal
> operation (control of something), from time to time it will also check
> itself if it is doing okay or not?

Without special hardware support, you can't.

> How a program can test itself?

It can't.

> Can some one suggest any intelligent method (other than watch dog) ?

Redundant hardware running independently developed sw with
majority voting of outputs.

--
Grant Edwards grante Yow! I HAVE a towel.
at
visi.com

Mike Harrison

May 19, 2004, 12:21:25 PM

On Wed, 19 May 2004 23:05:23 +1000, "SelfTest" <SelfTEst> wrote:

You also need to consider the likelihood of a problem occurring in the first place - time spent
designing the hardware to be reliable (e.g. EM/ESD immunity) is time much better spent than trying
to second-guess what might go wrong and then hoping you can do something useful about it.

For example, in the old days when systems typically comprised separate MCU/RAM/ROM chips, it made
sense to test SRAM and checksum ROM, as these involved many interconnections and sockets which could
fail. It makes much less sense to do it on a single-chip MCU, where the sort of failures that are
plausible on a separate-chip system just don't happen.

Grant Edwards

May 19, 2004, 12:36:14 PM

On 2004-05-19, Mike Harrison <mi...@whitewing.co.uk> wrote:

> For example, in the old days when systems typically comprised
> separate MCU/RAM/ROM chips, it made sense to test SRAM and
> checksum ROM, as these involved many interconnections and
> sockets which could fail. It makes much less sense to do it on
> a single-chip MCU, where the sort of failures that are
> plausible on a separate-chip system just don't happen.

And the probability that your program will still be able to run
and do predictable things when there is a failure in the MCU is
also small.

Multiply the probability of MCU failure by the probability your
program will run with such a failure, and you get a number
sufficiently close to zero yadda, yadda, ...

--
Grant Edwards grante Yow! Spreading peanut
at butter reminds me of
visi.com opera!! I wonder why?

Jim Hewitt

May 19, 2004, 1:18:24 PM

"Hans-Bernhard Broeker" <bro...@physik.rwth-aachen.de> wrote in message
news:2h16kiF...@uni-berlin.de...

> Morale: if you don't know what to do with the answer, don't ask the
> question.

Hans,

In this case, the very next question should be
Moral: if you don't know how the answer [i.e. the sensor/hardware] could
fool you, don't ask the question.

Joe Pfeiffer

May 19, 2004, 12:28:34 PM

"SelfTest" <SelfTEst> writes:

Most of the microcontrollers I've seen that are intended for
applications like this have a built-in watchdog timer (I'm assuming
when you say "other than watch dog" you mean "other than external
watchdog"). In the case of the processor I know best, the HC11, it's
called the COP (Computer Operating Properly) timer. The idea here is
your software has to reset it occasionally; if the timer ever goes
off, it's because your control program has gotten itself wedged.
--
Joseph J. Pfeiffer, Jr., Ph.D. Phone -- (505) 646-1605
Department of Computer Science FAX -- (505) 646-1002
New Mexico State University http://www.cs.nmsu.edu/~pfeiffer
Southwestern NM Regional Science and Engr Fair: http://www.nmsu.edu/~scifair

Spehro Pefhany

May 19, 2004, 1:40:20 PM

On Wed, 19 May 2004 23:05:23 +1000, the renowned "SelfTest" <SelfTEst>
wrote:

If you have access to a decent library, check out one of these standards
before you choose which hardware to use:

ANSI/AAMI SW68, Medical Device Software - Software Life-Cycle
Processes

ANSI UL1998, the Standard for Safety of Software in Programmable
Systems

EN/IEC 60601-1-4, the Collateral Standard for Programmable Electrical
Medical Systems

Best regards,
Spehro Pefhany
--
"it's the network..." "The Journey is the reward"
sp...@interlog.com Info for manufacturers: http://www.trexon.com
Embedded software/hardware/analog Info for designers: http://www.speff.com

Paul Keinanen

May 19, 2004, 2:12:52 PM

On Wed, 19 May 2004 14:54:55 GMT, CBFalconer <cbfal...@yahoo.com>
wrote:

>jiang wrote:
>>
>>> One way to check hardware is to run another identical processor
>>> and compare that they behave the same. If you have three or more
>>> then you can perform voting so that the most popular answer is
>>> the one that gets used.
>>
>> That is cool idea !..
>
>And not so simple. What takes the vote? What if it fails?

Use mechanical or pneumatic voting, not electric.

For instance, if you want to control a bidirectional relay, use a core
with three separate coils, each controlled by a separate processor. If
the currents in two coils flow in opposite directions, their resultant
magnetic field is zero. Then the third coil alone will determine the
resultant force.

Paul

Paul E. Bennett

May 19, 2004, 1:22:19 PM

Unbeliever wrote:

..and adding to that list. External Pulse Maintained relay. This device has
to be fed a change of polarity of its input signal at a regular rate in
order for it to maintain a relay in its energised state. If any single
component fails, the power supply goes off or the input does not change
then the relay just de-energises and opens its contacts. The pulse drive
for such a circuit should be driven from the processor internal sanity
checks that your software is performing (all check OK so change the state
of the output). This device can elevate a single processor from SIL0 to
SIL1 with very little effort.
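
The software side of this can be sketched in C as below. It is only an
illustration under assumptions: heartbeat_step, check_fn and the sample
check_comms check are hypothetical names, and the returned level would be
written to whatever output pin drives the relay circuit.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* A sanity check returns true when the thing it watches looks healthy. */
typedef bool (*check_fn)(void);

/* Placeholder check for illustration; a real one might test link timeouts. */
static bool comms_alive = true;
static bool check_comms(void)
{
    return comms_alive;
}

/* Call at the required pulse rate. Returns the new pin level: it toggles
 * only if every check passes, otherwise the level is held and the
 * pulse-maintained relay drops out by itself. */
static bool heartbeat_step(bool pin_level, const check_fn *checks, size_t n)
{
    for (size_t i = 0; i < n; i++) {
        if (!checks[i]()) {
            return pin_level;
        }
    }
    return !pin_level;
}
```

Note that the failure action is simply to stop doing something, so a hung
or crashed processor ends up in the same de-energised state.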

Further, your microcontroller may be communicating with other systems in
order to perform its control. Doing sanity checks on the communication link
and checking its integrity in operation will yield a good idea of
sub-system health. You will need checksums and/or CRCs on all messages
between systems.
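
As one possible illustration in C, the sketch below appends a 16-bit CRC to
each message and checks it on receipt. The choice of the CRC-16/CCITT-FALSE
parameters (polynomial 0x1021, initial value 0xFFFF) and the big-endian
trailer layout are assumptions for the example, as are the function names.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Bitwise CRC-16/CCITT-FALSE: polynomial 0x1021, initial value 0xFFFF. */
static uint16_t crc16_ccitt(const uint8_t *data, size_t len)
{
    uint16_t crc = 0xFFFFu;
    for (size_t i = 0; i < len; i++) {
        crc ^= (uint16_t)((uint16_t)data[i] << 8);
        for (int bit = 0; bit < 8; bit++) {
            if (crc & 0x8000u) {
                crc = (uint16_t)((crc << 1) ^ 0x1021u);
            } else {
                crc = (uint16_t)(crc << 1);
            }
        }
    }
    return crc;
}

/* A frame is payload bytes followed by the CRC, high byte first. */
static bool frame_valid(const uint8_t *frame, size_t len)
{
    if (len < 2u) {
        return false;
    }
    uint16_t received =
        (uint16_t)(((uint16_t)frame[len - 2u] << 8) | frame[len - 1u]);
    return crc16_ccitt(frame, len - 2u) == received;
}
```

A table-driven version is faster if code space allows; the bitwise form is
shown because it is small and easy to check by hand.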

Integral step-wise walking memory test and other walking sanity checks.
This can detect potential failure points quite early on.
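
A rough sketch of a step-wise memory test in C: each call checks one byte
with a walking-one pattern and restores its contents, so the whole region
is covered gradually in the background. The ramtest_t type and function
names are invented for this example, and on a real target the step would
need interrupts disabled while a cell is being exercised.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

typedef struct {
    volatile uint8_t *base;  /* start of the region under test */
    size_t size;             /* number of bytes in the region */
    size_t next;             /* index of the next byte to test */
} ramtest_t;

static void ramtest_init(ramtest_t *t, volatile uint8_t *base, size_t size)
{
    assert(size > 0u);
    t->base = base;
    t->size = size;
    t->next = 0u;
}

/* Tests one byte per call and restores its original contents.
 * Returns false if any written pattern did not read back. */
static bool ramtest_step(ramtest_t *t)
{
    volatile uint8_t *cell = &t->base[t->next];
    uint8_t saved = *cell;
    bool ok = true;

    for (unsigned bit = 0; bit < 8u; bit++) {
        uint8_t pattern = (uint8_t)(1u << bit);  /* walking one */
        *cell = pattern;
        if (*cell != pattern) {
            ok = false;
        }
    }
    *cell = saved;
    t->next = (t->next + 1u) % t->size;
    return ok;
}
```

Calling ramtest_step() once per main-loop pass spreads the cost thinly
enough that it rarely disturbs the control timing.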

There are a number of others.

> Of course, the next question you should ask is "What do I do when I
> detect a failure". If it is a safety critical system (e.g. the something
> you're controlling is a train, nuclear reactor or gas furnace rather than
> a lego windmill) there's a whole other set of questions you should ask
> even before asking the first one.

You should do an evaluation of what the system safe state is going to be
(off, bypassed or gracefully degrading). Then your design efforts should
always lean the system toward achieving those safe states unless it is
continuing to work properly.

--
********************************************************************
Paul E. Bennett ....................<email://peb@a...>
Forth based HIDECS Consultancy .....<http://www.amleth.demon.co.uk/>
Mob: +44 (0)7811-639972 .........NOW AVAILABLE:- HIDECS COURSE......
Tel: +44 (0)1235-811095 .... see http://www.feabhas.com for details.
Going Forth Safely ..... EBA. www.electric-boat-association.org.uk..
********************************************************************

Ulf Samuelsson

May 19, 2004, 2:08:30 PM

> You can use watchdogs or internal consistency checking to some extent
> to determine general health of the software. Assertions can be
> inserted into the code, i.e. conditions that you know must come out
> true at all times, because otherwise something's fatally wrong.
>
> But there's often little or no point trying to detect hardware faults
> --- if the hardware does break you're quite probably toast anyway.
> You can't usually fix such a problem from the software side, and by
> The Usual Kind of Luck, the faults that do occur will be exactly those
> you can't, or at least didn't test for. And that's before you
> consider that such tests mean more code in total, and thus more
> opportunities for bugs.
>
> Morale: if you don't know what to do with the answer, don't ask the
> question.

There are plenty of simple things you can consider if something is failing.
1) Turn yourself off, no need to draw power if you are battery operated.
2) Turn off any external device, which should not operate when the program
is not active
3) Reset yourself.
If it is not OK, due to a temporary problem, this is quite good.

>

--
Best Regards,
Ulf Samuelsson u...@a-t-m-e-l.com
This is a personal view which may or may not be
shared by my Employer Atmel Nordic AB

Spehro Pefhany

May 19, 2004, 2:58:17 PM

On one machine I'm very familiar with there are three safety
interlocks (one electrical (not electronic), one hydraulic, and one
mechanical). Only when all 3 agree it is safe is the electronics
allowed to do what it wants.

Guy Macon

May 19, 2004, 3:14:12 PM

What do you plan to have the microcontroller do if the answer is "no"?

--
Guy Macon, Electronics Engineer & Project Manager. http://www.guymacon.com/

Guy Macon

May 19, 2004, 4:37:44 PM

CBFalconer <cbfal...@yahoo.com> says...

>
>jiang wrote:
>
>>> One way to check hardware is to run another identical processor
>>> and compare that they behave the same. If you have three or more
>>> then you can perform voting so that the most popular answer is
>>> the one that gets used.
>>
>> That is cool idea !..
>
>And not so simple. What takes the vote? What if it fails?

I worked on an aerospace actuator that did it like this:

Three hydraulic actuators have three electronic control systems.

Each actuator monitors the other two and has two outputs that
are at +5V if it thinks that actuator is good, -25V if it
thinks that actuator is bad. The actual monitoring consists
of challenges/responses through six dual-redundant actuator-
to-actuator digital communication links and looking at extra
pressure transducers on the monitored actuator that are read
by the monitoring actuator. This identifies wrong behavior.

Each actuator has an input that connects to the outputs of
the other actuators through two resistors that form a summing
junction. If the sum is > -5V, it operates normally. If the sum
is < -5V, it goes into "freewheeling mode", where it exerts no
force and is easy to move. If one or both of the other actuators
asserts -25V it freewheels.

Each of the two resistors mentioned above is actually a pair of
resistors in series. The summing junction also has a pair of
high-value resistors in series to local common to hold the input
at 0V in the case of two open input signals.

One actuator can drag along two freewheeling actuators and
control the aircraft.

Two actuators working together can drag along a third actuator
that is trying as hard as it can to go the other way and control
the aircraft.

Result: no single point of failure in the actuator electronics
or voting system can result in loss of control of the aircraft.

--
Guy Macon, Electronics Engineer & Project Manager for hire.
Remember Doc Brown from the _Back to the Future_ movies? Do you
have an "impossible" engineering project that only someone like
Doc Brown can solve? My resume is at http://www.guymacon.com/

Guy Macon

May 19, 2004, 4:48:52 PM

There are some applications where instead of having a watchdog reset
the system when it goes astray you can simply reset the system again
and again with a periodic reset. This can be the output of an
oscillator or even the push of a button (a common way of designing
toys).

Guy Macon

May 19, 2004, 5:09:45 PM

Don Taylor <do...@agora.rdrop.com> says...

>it was feasible to write one, or a small number, of "sanity checks",
>small tests that would evaluate whether arguments being passed and/or
>state variables had values that were appropriate at the moment.
>
>If a sanity check failed we displayed "Fatal Error nnnnn", where nnnnn
>was the program counter at the point where the check failed, and then
>we halted the processor.

[snip]

Don, may I have permission to put your story up on my web page?

Here is another technique which I use:

Start with "finished" and "debugged" code.

Have one programmer insert N bugs in another programmer's code, keeping
careful records of what and where. The idea is to put in errors typical
of the errors that the person writing the code normally makes.

Have the author of the code debug and fix all bugs that he can find,
stopping when he can't find any more bugs. Keep record of all bugs
fixed. Don't tell him which are his or how many were inserted.

Let's say that we inserted 20 bugs, he found 10 of them, and he found
20 of his own bugs. That tells us that there are around 20 of his
own bugs still undiscovered.
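
The arithmetic behind that estimate: finding 10 of 20 inserted bugs suggests
a 50% detection rate, so 20 real bugs found implies about 40 real bugs in
total, leaving about 20. A small C helper (the name estimate_remaining_bugs
is made up here) makes the same calculation, assuming inserted and real
bugs are equally easy to find.

```c
#include <assert.h>

/* Estimates how many of the author's own bugs remain undiscovered.
 *   total_own ~= own_found * seeded / seeded_found
 *   remaining  = total_own - own_found
 * Returns -1 when no seeded bug was found, since there is then no basis
 * for an estimate. */
static int estimate_remaining_bugs(int seeded, int seeded_found, int own_found)
{
    if (seeded <= 0 || seeded_found <= 0) {
        return -1;
    }
    /* Integer division rounded to nearest. */
    int estimated_total =
        (own_found * seeded + seeded_found / 2) / seeded_found;
    return estimated_total - own_found;
}
```

For the numbers above, estimate_remaining_bugs(20, 10, 20) gives 20.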

The psychology is interesting. The programmers write code with far
fewer bugs and do a far better job of testing before saying that they
are done. The programmer who finds all of the inserted bugs and no
new bugs is a hero. (I reinforce that with bonuses and with specific
mention in writing of this accomplishment during performance reviews.)

Don McKenzie

May 19, 2004, 5:36:29 PM

As SelfTest hasn't come back yet to give any more info or comments, I am
looking at his "(other than watch dog)" and wondering if the question is
really "Is my micro still running and going about its normal business?"

Usually the first thing any programmer learns is how to flash a LED.
By adding a LED and resistor to an output pin, you can call a "turn LED
on", and "turn LED off" in a sequence, say flash 4 times on power up
being OK.

Extending this further, you can test for certain I/O operations taking
place correctly with a set number of flashes.

Many companies use 7 segment LEDs on their products, and such things as
"system alive" can mean the 7 segment LED running around in a figure 8.

Power up, self test, and real time diagnostics can be performed from a
simple single LED, right up to multiple computer systems to monitor the
operations.

I believe that anybody that designs a useful lump of hardware should
have at least one LED that can be pulsed under program control for this
purpose.

Cheers Don...

--
Don McKenzie
E-Mail Contact Page: http://www.e-dotcom.com/ecp.php?un=Dontronics

USB to RS232 Converter that works http://www.dontronics.com/usb_232.html
Don's Free Guide To Spam Reduction http://www.e-dotcom.com/spam_exp.php

Gerald Bonnstetter

May 19, 2004, 6:33:07 PM

Don Taylor wrote:

> One other item that helped with the sanity checks, we filled all memory
> with 0xAAAA initially, and even when some memory was released. That
> oddball value was unlikely to be a reasonable value for most state
> variables and helped us fail more sanity checks.

On the Amiga computer one of the testing packages used 0xDEADBEEF to
fill unused memory. ;-)

It also added guard band areas around allocated memory and then checked
those after the free to be sure you didn't write outside of your
allocated area.

That second idea would work best if you had an OS or at least memory
management code.
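
Without an OS, a cut-down version of both ideas can still be applied to a
fixed buffer. The C sketch below is only an illustration: guarded_buf_t and
its functions are hypothetical, the 16-byte size is arbitrary, and the fill
byte 0xAA mirrors the 0xAAAA fill mentioned in the quoted post.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>
#include <string.h>

#define GUARD_WORD 0xDEADBEEFu
#define FILL_BYTE  0xAAu

/* A buffer fenced by guard words on both sides. */
typedef struct {
    uint32_t guard_lo;
    uint8_t  data[16];
    uint32_t guard_hi;
} guarded_buf_t;

/* Sets both guards and fills the payload with an unlikely value, so
 * that reads of never-written data tend to fail later sanity checks. */
static void guarded_init(guarded_buf_t *b)
{
    b->guard_lo = GUARD_WORD;
    memset(b->data, FILL_BYTE, sizeof b->data);
    b->guard_hi = GUARD_WORD;
}

/* Returns true while neither guard word has been overwritten. */
static bool guarded_intact(const guarded_buf_t *b)
{
    return b->guard_lo == GUARD_WORD && b->guard_hi == GUARD_WORD;
}
```

Checking guarded_intact() periodically catches many overruns, though a
write that skips over the guard word goes unnoticed.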

--
Gerald Bonnstetter
Bonnsoft
bonn...@antispamextrastuffnetins.net

Guy Macon

May 19, 2004, 9:50:49 PM

Don Taylor <do...@agora.rdrop.com> says...

>
>Guy Macon <http://www.guymacon.com> writes:

>>Here is another technique which I use:
>
>>Start with "finished" and "debugged" code.
>
>>Have one programmer insert N bugs in another programmer's code, keeping
>>careful records of what and where. The idea is to put in errors typical
>>of the errors that the person writing the code normally makes.
>

>I've read about that and given that considerable thought. But I've
>never quite been able to convince myself just what would be appropriate
>to put into the code and where. If you have really found a successful
>way of doing that I'd be interested.

I let the other engineers make that decision after seeing the programmer's
past errors. And when I am wearing my manager hat I insist that any result
other than perfect performance be kept confidential, even from me. This
is a tool for reducing errors, not a tool for beating programmers over
the head.

Ben Jackson

May 19, 2004, 9:50:55 PM

In article <asidna6bw4c...@speakeasy.net>,

Guy Macon <http://www.guymacon.com> wrote:
>
>I worked on an aerospace actuator that did it like this:
>
>Three hydraulic actuators have three electronic control systems.

Let me guess, it was too heavy to fly? ;-)

--
Ben Jackson
<b...@ben.com>
http://www.ben.com/

Guy Macon

May 19, 2004, 9:52:39 PM

Don Taylor <do...@agora.rdrop.com> says...
>
>Guy Macon <http://www.guymacon.com> writes:

>>Don, may I have permission to put your story up on my web page?
>

>Feel free. I might even be able to do a better job describing this.

It's quite good as is, but if you want to rewrite it so much the better.
Just post the improved version if you decide to improve it.

Guy Macon

May 19, 2004, 10:50:54 PM

Ben Jackson <b...@ben.com> says...

>
>Guy Macon <http://www.guymacon.com> wrote:
>>
>>I worked on an aerospace actuator that did it like this:
>>
>>Three hydraulic actuators have three electronic control systems.
>
>Let me guess, it was too heavy to fly? ;-)

Judge for yourself:

http://www.fas.org/man/dod-101/sys/ac/c-17.htm

:)

rickman

May 20, 2004, 10:40:41 AM

Guy Macon wrote:
>
> Don Taylor <do...@agora.rdrop.com> says...
> >
> >Guy Macon <http://www.guymacon.com> writes:
>
> >>Here is another technique which I use:
> >
> >>Start with "finished" and "debugged" code.
> >
> >>Have one programmer insert N bugs in another programmer's code, keeping
> >>careful records of what and where. The idea is to put in errors typical
> >>of the errors that the person writing the code normally makes.
> >
> >I've read about that and given that considerable thought. But I've
> >never quite been able to convince myself just what would be appropriate
> >to put into the code and where. If you have really found a successful
> >way of doing that I'd be interested.
>
> I let the other engineers make that decision after seeing the programmer's
> past errors. And when I am wearing my manager hat I insist that any result
> other than perfect performance be kept confidential, even from me. This
> is a tool for reducing errors, not a tool for beating programmers over
> the head.

I am sure that this can be an effective tool. But it seems less than
optimal to introduce bugs in order to get the programmers to debug
existing bugs. Maybe that is just me...

I have read that it can be useful to track the number of bugs found over
time. This typically follows a curve of exponential decay and can help
you predict the number of bugs left in a product. Certainly this is
less intrusive and has less overhead.

One thing I don't support is the idea of engineers beating each other up
over mistakes. I worked at one place where a mistake that was checked
back into version control would result in the author receiving the
"Arrow of Shame". I did not agree that the tip of version control is
what you work with or ship and I certainly did not agree with whacking
people over the head when they made a mistake. I stopped this tradition
on my project.

--

Rick "rickman" Collins

rick.c...@XYarius.com
Ignore the reply address. To email me use the above address with the XY
removed.

Arius - A Signal Processing Solutions Company
Specializing in DSP and FPGA design URL http://www.arius.com
4 King Ave 301-682-7772 Voice
Frederick, MD 21701-3110 301-682-7666 FAX

CodeSprite

May 21, 2004, 1:16:34 AM

"SelfTest" <SelfTEst> wrote in
news:40ab5b93$0$3034$afc3...@news.optusnet.com.au:

> Say we have a micro controller with limited memory.
> Say it will perform some realtime control of something.
>
> How to make a SW for a micro controller, that in addition to its
> normal operation (control of something), from time to time it will
> also check itself if it is doing okay or not ? How a program can test
> itself? Can some one suggest any intelligent method (other than watch
> dog) ?
>
>

Going to the ridiculous extreme, we adapted the production test vectors
for the ARM7 core and turned them into a modular program which could be
fired off at intervals, perform a few instructions that exercised part
of the core and affected some of the registers, then wrote those
registers out into a hardware register that accumulated a CRC value. We
actually set this up for a dual-processor system that was used in an
Anti-lock Braking System. The nice feature of that braking system is
that it could fall back to a "dumb" mode if either of the processors
noticed that the other wasn't getting the same results.

The test sets were fine-tuned by running them through a simulation of
the core that allowed us to simulate every possible stuck at one, stuck
at zero fault. The best we could come up with in the time and codespace
allowed was something like a 92% fault detection rate (which equated to
96% of all 'discoverable' faults).

I believe this is now a licensable package available from ARM.

Peter.

DM McGowan II

May 21, 2004, 11:59:28 AM

"rickman" <spamgo...@yahoo.com> wrote in message
news:40ACC369...@yahoo.com...

I agree completely. Source control works best when developers check in
often. This should really be tempered with individual developer branches but
that requires a little more discipline. At one place I worked it was the
rule that 'main' was sacred. Only a small handful of assigned people could
touch it. All developers would create a branch even if just to fix one bug.
The flexibility to isolate the developer's changes is worth it if you can
afford the demands required by such a system.

The toughest thing was merging everyone's changes back together but the
system served many purposes well.

Also, since I'm ranting already, some source control packages, like
Perforce, are adept at supporting the developers. It is fast and convenient
to 'synch' your workstation to whichever check-in point you desire. This
makes it easy to find the one check-in where some hard-to-find bug crept in.

Clifford Heath

May 23, 2004, 8:01:32 PM

CodeSprite wrote:
> ... it could fall back to a "dumb" mode if either of the processors

> noticed that the other wasn't getting the same results.

This is a form of a technique known as "process pairs". The OP should
do some searching using those keywords.

robin...@tesco.net

May 26, 2004, 4:36:41 AM

"SelfTest" <SelfTEst> wrote in message news:<40ab5b93$0$3034$afc3...@news.optusnet.com.au>...

> Say we have a micro controller with limited memory.
> Say it will perform some realtime control of something.
>
> How to make a SW for a micro controller, that in addition to its normal
> operation (control of something), from time to time it will also check
> itself if it is doing okay or not ? How a program can test itself? Can
> some one suggest any intelligent method (other than watch dog) ?

Anyone who enables the Watchdog timer is advertising:-

1) My code is dodgy.
2) My hardware is EMC prone.
3) I have a new source of error; the watchdog itself.

Cheers
Robin

Guy Macon

May 26, 2004, 6:02:57 AM

robin...@tesco.net <robin...@tesco.net> says...

>Anyone who enables the Watchdog timer is advertising:-
>
>1) My code is dodgy.
>2) My hardware is EMC prone.
>3) I have a new source of error; the watchdog itself.

You will forgive me if I prefer that you stay out of aerospace... <smile>

Dave VanHorn

May 26, 2004, 9:40:37 AM

> Anyone who enables the Watchdog timer is advertising:-
>
> 1) My code is dodgy.
> 2) My hardware is EMC prone.
> 3) I have a new source of error; the watchdog itself.
>
> Cheers
> Robin

For any non-trivial application, all three are true.

Captain Bly

May 26, 2004, 11:12:36 AM

Robin should stick to Lego and not electronics.

Guillaume

May 26, 2004, 11:48:12 AM

> Anyone who enables the Watchdog timer is advertising:-
>
> 1) My code is dodgy.
> 2) My hardware is EMC prone.
> 3) I have a new source of error; the watchdog itself.

What a pile of bullshit.
There are more reasons for an embedded system to fail than you
can even begin to imagine. Not using watchdogs (in a sensible
way, of course) is totally irresponsible in my opinion.

Alan Kilian

May 26, 2004, 3:36:04 PM

Jack Ganssle wrote a GREAT article on why you should use watchdogs, and why they are so tricky to use properly.

http://www.ganssle.com/watchdogs.htm

--
- Alan Kilian <alank(at)timelogic.com>
Director of Bioinformatics, TimeLogic Corporation 763-449-7622

Guillaume

May 26, 2004, 8:44:32 PM

I had already read most points he talks about in other articles,
but this is great nevertheless.

Anyone with a concern for safety and reliability should read this -
and then some.

robin...@tesco.net

May 27, 2004, 4:59:00 AM

Guillaume <grsN...@NO-SPAMmail.com> wrote in message news:<40b4bc3c$0$314$7a62...@news.club-internet.fr>...

Well, you have me there, I can only think of four (ignoring <hardware failure>):-

Cheers
Robin

robin...@tesco.net

May 27, 2004, 10:04:12 AM

Guillaume <grsN...@NO-SPAMmail.com> wrote in message news:<40b539ef$0$317$7a62...@news.club-internet.fr>...

There is a lot of interesting detail about space-craft software and
the claim that a WDT could have saved the mission is no more or less
true than fixing the original floating point exception that caused it.

The article then gives an example of crashing cooker-hood-fan firmware
and assumes the WDT had *not* been used. He cannot know this. If the
firmware is poor, then the WDT was likely poorly implemented too.

Here is a quote from the article:-

<start of quote>
"Well-designed watchdog timers fire off a lot, daily and quietly
saving systems and lives without the esteem offered to other, human,
heroes. Perhaps the developers producing such reliable WDTs deserve a
parade. Poorly-designed WDTs fire off a lot, too, sometimes saving
things, sometimes making them worse."
<end of quote>

I disagree. When the WDT fires, it is a disaster that needs fixing and
if it goes off "a lot" and especially "quietly" it is a cover-up where
the developers *should* be paraded.

Cheers
Robin

Spehro Pefhany

unread,

May 27, 2004, 10:17:12 AM5/27/04

to

On 27 May 2004 07:04:12 -0700, the renowned robin...@tesco.net
(robin...@tesco.net) wrote:

>I disagree. When the WDT fires, it is a disaster that needs fixing and
>if it goes off "a lot" and especially "quietly" it is a cover-up where
>the developers *should* be paraded.

You don't understand.

Best regards,
Spehro Pefhany
--
"it's the network..." "The Journey is the reward"
sp...@interlog.com Info for manufacturers: http://www.trexon.com
Embedded software/hardware/analog Info for designers: http://www.speff.com

CBFalconer

unread,

May 27, 2004, 12:35:17 PM5/27/04

to

"robin...@tesco.net" wrote:
>
... snip ...

>
> I disagree. When the WDT fires, it is a disaster that needs
> fixing and if it goes off "a lot" and especially "quietly" it
> is a cover-up where the developers *should* be paraded.

Here is a counter-example. The hardware is operating in a noisy
environment. This induces dropped bits, etc. The software can
handle most of the data errors, but has a few problems when the IC
is altered and it is driven off to executing random data. Time
for the three fingered salute, administered by the faithful hound.

--
Chuck F (cbfal...@yahoo.com) (cbfal...@worldnet.att.net)
Available for consulting/temporary embedded and systems.
<http://cbfalconer.home.att.net> USE worldnet address!

Ben Bradley

unread,

May 27, 2004, 1:36:49 PM5/27/04

to

On Thu, 27 May 2004 16:35:17 GMT, CBFalconer <cbfal...@yahoo.com>
wrote:

>"robin...@tesco.net" wrote:
>>
>... snip ...

Let me "requote" some of that, so I can respond to it here:

>>The article then gives an example of crashing cooker-hood-fan firmware
>>and assumes the WDT had *not* been used. He cannot know this. If the
>>firmware is poor, then the WDT was likely poorly implemented too.

Putting the discussion of WDT's aside for a moment, I find it
inexcusable (engineering-wise) that such a simple application as the
cooker-hood-fan would crash or fail (maybe in development, but
certainly not in production), whether it's from (a) firmware bug(s) or
susceptibility to static discharge.
OTOH, I can see where a marketing person might play with it for two
minutes (before adequate testing is done), declare to management in
the heat of time-to-market pressures "It works, let's ship it" and a
bad/untested design goes out the door, perhaps even over the
protestations of the person(s) who designed it.

>>Here is a quote from the article:-
>>
>><start of quote>
>>"Well-designed watchdog timers fire off a lot, daily and quietly
>>saving systems and lives without the esteem offered to other, human,
>>heroes. Perhaps the developers producing such reliable WDTs deserve a
>>parade. Poorly-designed WDTs fire off a lot, too, sometimes saving
>>things, sometimes making them worse."
>><end of quote>

WDT's ARE valuable, but certainly not for the reasoning given
above.
What it SHOULD have said (IMHO) is:

Well-designed watchdog timers in well-designed systems RARELY if
EVER fire off, but like an airbag and seat belts in a car accident,
when they do fire off they save systems that would otherwise, perhaps
literally as well as figuratively, be "lost in space."

>> I disagree. When the WDT fires, it is a disaster that needs
>> fixing and if it goes off "a lot" and especially "quietly" it
>> is a cover-up where the developers *should* be paraded.

I certainly agree that WDT's should RARELY if ever fire. It helps
to have it turned off for general development, but there should be a
testing time where it's on (and the timer reset point should of course
be carefully thought out as part of the design), and any reset
generated should be investigated for its cause (this is where an
emulator and logic analyzer are really worth their rental fees) and a
correction put into place.
I've read and enjoyed some of Jack Ganssle's articles before, but
Robin points out very well that Jack misses the mark on this one. Has
anyone emailed him about this thread yet?
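
A minimal sketch in C of how "any reset generated should be investigated" could be supported in firmware. Everything named here is an assumption for illustration: the `RESET_CAUSE_*` bits, the `reset_log_t` layout and the checkpoint idea are hypothetical, and on a real part the cause flags come from a device-specific register while the log sits in RAM or non-volatile memory that survives a reset.

```c
#include <stdint.h>

/* Hypothetical reset-cause bits; real parts name and place these differently. */
#define RESET_CAUSE_POWER_ON 0x01u
#define RESET_CAUSE_EXTERNAL 0x02u
#define RESET_CAUSE_WATCHDOG 0x04u

/* Assumed to be placed in memory that keeps its content across a reset. */
typedef struct {
    uint32_t watchdog_resets;    /* watchdog resets since power-up */
    uint32_t current_checkpoint; /* updated by the running code */
    uint32_t culprit_checkpoint; /* checkpoint reached before the last WDT reset */
} reset_log_t;

/* Called at interesting points of the main loop to leave a breadcrumb. */
static void mark_checkpoint(reset_log_t *log, uint32_t id)
{
    log->current_checkpoint = id;
}

/* Called once at start-up. Returns 1 if this boot was caused by the watchdog. */
static int record_reset(reset_log_t *log, uint8_t cause_flags)
{
    if (cause_flags & RESET_CAUSE_POWER_ON) {
        /* Memory content is meaningless after power-up: start a clean log. */
        log->watchdog_resets = 0;
        log->current_checkpoint = 0;
        log->culprit_checkpoint = 0;
        return 0;
    }
    if (cause_flags & RESET_CAUSE_WATCHDOG) {
        /* Keep the evidence so the cause can be investigated later. */
        log->watchdog_resets++;
        log->culprit_checkpoint = log->current_checkpoint;
        return 1;
    }
    return 0;
}
```

With something like this in place, a watchdog reset seen during the test phase leaves behind a count and a rough location instead of disappearing quietly.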

>Here is a counter-example. The hardware is operating in a noisy
>environment. This induces dropped bits, etc. The software can
>handle most of the data errors, but has a few problems when the IC
>is altered and it is driven off to executing random data. Time
>for the three fingered salute, administered by the faithful hound.

This is an example where the hardware isn't shielded well enough
from the environment, or isn't robust enough or rad-hard enough to
operate reliably in the environment. Fix that, then go for long-term
testing to see if the WDT ever fires.

Having a WDT reset the hardware doesn't make a system reliable. It
is only a protection against rare, worst-case conditions. And I mean
TRULY rare conditions, not "rare" as the word is (ab)used on eBay.

Here, I'll frame it for you. Print it, cut it out and paste it on
your monitor:

_________________________________________________________________
/ \
| Having a WDT reset the hardware doesn't make a system reliable. |
\_________________________________________________________________/

-----
http://mindspring.com/~benbradley

David

unread,

May 27, 2004, 9:06:07 PM5/27/04

to

I use a similar technique to keep the developers and validators
thinking. Developers occasionally add little changes that aren't
specified or are true mistakes. The validators occasionally report
or demonstrate problems that are fictitious. Both groups keep tabs
on each other.

Worried that your new function isn't properly tested? Break it
or add something silly like an off-color display or easter egg.
After a change is validated, break it again a little while later
and see if a regression test was done.

Worried that a developer isn't paying attention? Report an
error that was fixed or can't happen. See how long it takes
to discover the hoax. If you are evil and spot a developer's
terminal unoccupied, make a small change -- wording, duplicate
line, etc.

Each group can play the game. Does your developer sneak in
undocumented code changes? Do random checks on version control.
Will the person checking in the final code notice the random
comment "XYZ checked in this code and didn't notice; owes ABC
a snickers bar." Did the manager really read that status
report or design document?

Good natured fun can liven up the group and keep them 'awake'.

David

Ken Lee

unread,

May 27, 2004, 10:07:47 PM5/27/04

to

On 27 May 2004 01:59:00 -0700, robin...@tesco.net
(robin...@tesco.net) wrote:

The causes could be numerous - static discharge (not just the effects
of lightning strikes), radio interference, other forms of radiation,
electrical shorts due to fluid spillage, inappropriate scope of
device usage (I don't consider it a software bug here) --- all these
faults could leave the device in a state where the software can't run.

The reason that it is used in the medical field is that it provides a
cost-effective mitigation for many ailments. Designing equipment to
operate in a room full of X-Ray, MRI, etc equipment - some dating back
a few decades, can be a very daunting exercise. Of course there is a
minimum standard EMC requirement that medical equipment conform to.

Also I disagree with the notion that using a watchdog "advertises"
some deficiency of the device (paraphrasing here). For me its use
does suggest that the developers have applied due diligence and have
used it as a mitigation against faults which they've arrived at
through some analysis.

Ken.

>
>
>Cheers
>Robin

+====================================+
I hate junk email. Please direct any
genuine email to: kenlee at hotpop.com

CBFalconer

unread,

May 27, 2004, 10:19:51 PM5/27/04

to

Ben Bradley wrote:
> CBFalconer <cbfal...@yahoo.com> wrote:
>
... snip ...

>
>> Here is a counter-example. The hardware is operating in a noisy
>> environment. This induces dropped bits, etc. The software can
>> handle most of the data errors, but has a few problems when the IC
>> is altered and it is driven off to executing random data. Time
>> for the three fingered salute, administered by the faithful hound.
>
> This is an example where the hardware isn't shielded well enough
> from the environment, or isn't robust enough or rad-hard enough to
> operate reliably in the environment. Fix that, then go for
> long-term testing to see if the WDT ever fires.

I am glad you have unlimited funds to spend on your productions.
A few pounds of lead around the system is always welcome, and
encourages sales. Some of us believe in engineering the product
to fit the desired use.

--
fix (vb.): 1. to paper over, obscure, hide from public view; 2.
to work around, in a way that produces unintended consequences
that are worse than the original problem. Usage: "Windows ME
fixes many of the shortcomings of Windows 98 SE". - Hutchison

Paul Keinanen

unread,

May 28, 2004, 2:24:25 AM5/28/04

to

On Fri, 28 May 2004 02:19:51 GMT, CBFalconer <cbfal...@yahoo.com>
wrote:

>Ben Bradley wrote:
>> CBFalconer <cbfal...@yahoo.com> wrote:
>>
>... snip ...
>>
>>> Here is a counter-example. The hardware is operating in a noisy
>>> environment. This induces dropped bits, etc. The software can
>>> handle most of the data errors, but has a few problems when the IC
>>> is altered and it is driven off to executing random data. Time
>>> for the three fingered salute, administered by the faithful hound.
>>
>> This is an example where the hardware isn't shielded well enough
>> from the environment, or isn't robust enough or rad-hard enough to
>> operate reliably in the environment. Fix that, then go for
>> long-term testing to see if the WDT ever fires.
>
>I am glad you have unlimited funds to spend on your productions.

It appears that you are thinking that the proper way to design a
product is to make a complete product and then start to wonder how to
get it through the EMC and other tests and hoping that a ferrite bead
there and a bypass capacitor will solve the problems. Then you spend a
lot of time trying, usually with several iterations, to get the device
to just pass the test and still wonder about random lockups and justify
the use of the WDT.

EMC design should be part of the whole design cycle. You should design
the RF filter return paths and static electricity discharge paths so
that they do not go through any sensitive areas, since the tracks will
have a significant inductance and thus have a high reactance (or even
resonate) at high frequencies or generate quite a high voltage when a
high current from a static discharge passes through them. This does not
necessarily cost very much as a whole, since it is done in the design
phase.

A metallic (or at least conductive) box or extra ground planes on the
PCB may also be required. This may of course cost some extra, but it
reduces support costs in the field.

A system designed for good EMC performance should also be quite immune
to "unexplained" crashes or lockups and thus reduce the need for WDT.

>A few pounds of lead around the system is always welcome, and
>encourages sales. Some of us believe in engineering the product
>to fit the desired use.

"Desired use" seems to be get the product sold, but not care, if the
customer has to throw it away as useless. Just wondering, if the
customer is going to buy anything else with the same brand name in the
future. I am glad that the CE requirements removed at least some of the
worst trash from the European market.

Paul

Paul E. Bennett

unread,

May 28, 2004, 3:47:19 AM5/28/04

to

CBFalconer wrote:

>> This is an example where the hardware isn't shielded well enough
>> from the environment, or isn't robust enough or rad-hard enough to
>> operate reliably in the environment. Fix that, then go for
>> long-term testing to see if the WDT ever fires.
>
> I am glad you have unlimited funds to spend on your productions.
> A few pounds of lead around the system is always welcome, and
> encourages sales. Some of us believe in engineering the product
> to fit the desired use.
>
> --
> fix (vb.): 1. to paper over, obscure, hide from public view; 2.
> to work around, in a way that produces unintended consequences
> that are worse than the original problem. Usage: "Windows ME
> fixes many of the shortcomings of Windows 98 SE". - Hutchison

Protecting the hardware is not really a costly exercise. Most of the
time it involves little more than appropriate filtering of the inputs,
maybe a thin metal can over sensitive circuitry, using metal boxes
instead of plastic ones. Look at it as developing boxes within boxes
and using appropriate barrier techniques at the barrier boundaries.
The total cost can often be less than not doing these simple things.

--
********************************************************************
Paul E. Bennett ....................<email://peb@a...>
Forth based HIDECS Consultancy .....<http://www.amleth.demon.co.uk/>
Mob: +44 (0)7811-639972 .........NOW AVAILABLE:- HIDECS COURSE......
Tel: +44 (0)1235-811095 .... see http://www.feabhas.com for details.
Going Forth Safely ..... EBA. www.electric-boat-association.org.uk..
********************************************************************

robin...@tesco.net

unread,

May 28, 2004, 11:45:14 AM5/28/04

to

CBFalconer <cbfal...@yahoo.com> wrote in message news:<40B6388D...@yahoo.com>...

> Ben Bradley wrote:
> > CBFalconer <cbfal...@yahoo.com> wrote:
> >
> ... snip ...
> >
> >> Here is a counter-example. The hardware is operating in a noisy
> >> environment. This induces dropped bits, etc. The software can
> >> handle most of the data errors, but has a few problems when the IC
> >> is altered and it is driven off to executing random data. Time
> >> for the three fingered salute, administered by the faithful hound.
> >
> > This is an example where the hardware isn't shielded well enough
> > from the environment, or isn't robust enough or rad-hard enough to
> > operate reliably in the environment. Fix that, then go for
> > long-term testing to see if the WDT ever fires.
>
> I am glad you have unlimited funds to spend on your productions.
> A few pounds of lead around the system is always welcome, and
> encourages sales. Some of us believe in engineering the product
> to fit the desired use.

Lead? You're afraid of cosmic rays? Is not magnetic induction more of a risk?

Robin

Christopher X. Candreva

unread,

May 28, 2004, 11:45:20 AM5/28/04

to

In comp.robotics.misc robin...@tesco.net <robin...@tesco.net> wrote:

: Well, you have me there, I can only think of four (ignoring <hardware failure>):-

I would think hardware failure is a good enough reason in and of itself, and
in fact that is the usual reason I thought watchdogs were for.

If your code PROM/EPROM/EEPROM/flash fails and the mcu starts executing
random memory as code, you want to make sure your motors, pumps, X-ray tube,
etc shuts down.

--
==========================================================
Chris Candreva -- ch...@westnet.com -- (914) 967-7816
WestNet Internet Services of Westchester
http://www.westnet.com/

Ben Bradley

unread,

May 28, 2004, 2:13:35 PM5/28/04

to

On 28 May 2004 08:45:14 -0700, robin...@tesco.net
(robin...@tesco.net) wrote:

Whatever the cause of the problem, a WDT won't fix it, though it
may cover it up for a while.
I suspect CB was angered that I pointed out a flaw in his
counter-example, so he came back with something mean-spirited. I
didn't mean my response as a personal attack, but this is Usenet and I
can't take responsibility for how others read my posts.

>Robin

-----
http://mindspring.com/~benbradley

Paul Keinanen

unread,

May 28, 2004, 3:26:17 PM5/28/04

to

On Fri, 28 May 2004 15:45:20 GMT, "Christopher X. Candreva"
<ch...@westnet.com> wrote:

>In comp.robotics.misc robin...@tesco.net <robin...@tesco.net> wrote:
>
>: Well, you have me there, I can only think of four (ignoring <hardware failure>):-
>
>I would think hardware failure is a good enough reason in and of itself, and
>in fact that is the usual reason I thought watchdogs were for.

If it appears that the hardware is falling apart, how could you trust
that it makes any sensible decisions ? Of course, if each output
individually falls into a fail-safe state if not refreshed by the
processor, then it makes sense to halt the processor immediately, if
something suspicious happens. Trying to do something after a watchdog
reset usually just will worsen the situation, if the hardware is
suspect.

>If your code PROM/EPROM/EEPROM/flash fails and the mcu starts executing
>random memory as code, you want to make sure your motors, pumps, X-ray tube,
>etc shuts down.

In any really safety critical system, you should use double or triple
(voting) redundant system, not watchdogs.

Paul

CBFalconer

unread,

May 28, 2004, 4:10:50 PM5/28/04

to

Ben Bradley wrote:
> (robin...@tesco.net) wrote:
>> CBFalconer <cbfal...@yahoo.com> wrote

>>> Ben Bradley wrote:
>>>> CBFalconer <cbfal...@yahoo.com> wrote:
>>>>
>>> ... snip ...
>>>>
>>>>> Here is a counter-example. The hardware is operating in a noisy
>>>>> environment. This induces dropped bits, etc. The software can
>>>>> handle most of the data errors, but has a few problems when the IC
>>>>> is altered and it is driven off to executing random data. Time
>>>>> for the three fingered salute, administered by the faithful hound.
>>>>
>>>> This is an example where the hardware isn't shielded well enough
>>>> from the environment, or isn't robust enough or rad-hard enough to
>>>> operate reliably in the environment. Fix that, then go for
>>>> long-term testing to see if the WDT ever fires.
>>>
>>> I am glad you have unlimited funds to spend on your productions.
>>> A few pounds of lead around the system is always welcome, and
>>> encourages sales. Some of us believe in engineering the product
>>> to fit the desired use.
>>
>> Lead? You're afraid of cosmic rays? Is not magnetic induction
>> more of a risk?
>
> Whatever the cause of the problem, a WDT won't fix it, though it
> may cover it up for a while.
>
> I suspect CB was angered that I pointed out a flaw in his
> counter-example, so he came back with something mean-spirited. I
> didn't mean my response as a personal attack, but this is Usenet
> and I can't take responsibility for how others read my posts.

Hardly. The particulars do not matter. The point is that,
whatever the product, there is a limit to the practical production
cost. You need the best bang for the buck. Random external
events may require prodigious efforts to block. You, not I,
brought up radiation shielding, and I only mentioned a means of
blocking such. (To robin: cosmics are only one of a wide range of
radiation extant. They are extremely hard to block.)

You need to face reality, in that something is going to fail.
When it does, you need a means of avoiding further damage and/or
effecting recovery. If you think you can build anything that is
failure, damage, and idiot proof you have delusions of grandeur.

Christopher X. Candreva

unread,

May 28, 2004, 4:57:21 PM5/28/04

to

In comp.robotics.misc Paul Keinanen <kein...@sci.fi> wrote:

: If it appears that the hardware is falling apart, how could you trust

: that it makes any sensible decisions ? Of course, if each output

You've changed the situation -- 'the hardware is falling apart' is hardly the
same as a single hardware failure.

Generally, an MCU on reset sets the outputs to a known value -- all 0 or all
1. If you design fail-safe, then a hardware reset, in the face of some
failing hardware, will at least make sure everything is off.

: In any really safety critical system, you should use double or triple

: (voting) redundant system, not watchdogs.

There is a WHOLE class of problems for which that is completely overkill.
Take an arcade game, or vending machine, or any machine that is going to take
physical punishment and need regular maintenance.

People are going to beat on a soda machine. Do you want to put
triple-redundancy memory on that, or just design it such that when it breaks
it just sits there resetting itself, so no one can get free soda ?

Arcade games use watchdogs because there is a very small window where they
will make money. (Or used, when it was dedicated hardware, now it's largely
PC level hardware, but I digress) Competition means getting the thing out
the door relatively quickly, and cheap enough to sell.

You want to get every bug, but if you wait too long, you'll be into the next
generation. The watchdog means that if there IS a bug, the machine will just
reset and keep earning money, instead of not earning money until an op gets
to it.

Fail-safe means that WHEN the thing fails, you try your best to make sure
it's in a 'safe' condition.

The Artist Formerly Known as Kap'n Salty

unread,

May 28, 2004, 5:09:11 PM5/28/04

to

Christopher X. Candreva wrote:

Which brings up Robin's original point about "dodgy code". Like it or
not, code defects will occasionally make their way into any non-trivial
project produced in the real world. In the face of difficult deadlines,
compromises will occasionally get made, people may screw-up, QA may fall
down on the job.

Anyone who claims NEVER, EVER to have unwittingly released "dodgy code",
or to have been part of a team that did so is either:

1) lying
2) never had to code under pressure (time and cost constraints)
3) lying -- to themselves
4) not been coding for very long, or never on a project with much complexity

As another poster put it, watchdogs are one facet of an entire process
of due diligence, which should also encompass code reviews, sane coding
and design techniques, thorough QA, etc. In general, not implementing
watchdogs where it might make sense to do so is, frankly, foolish.
--
(Replies: cleanse my address of the Mark of the Beast!)

Teleoperate a roving mobile robot from the web:
http://www.swampgas.com/robotics/rover.html

Coauthor with Dennis Clark of "Building Robot Drive Trains".
Buy several copies today!

Guillaume

unread,

May 28, 2004, 8:15:03 PM5/28/04

to

> : In any really safety critical system, you should use double or triple
> : (voting) redundant system, not watchdogs.
>
> There is a WHOLE class of problems for which that is completely overkill.

Besides: redundancy still isn't a good reason not to use watchdogs.

You may have 4 redundant devices, but what if they all fail at the same
time (which could happen under extreme, unplanned conditions)?
What if only one of them fails, but there is another unexpected failure
that prevents redundancy from functioning as expected (that is, you have
3 working devices, but the whole system fails to notice there is
something wrong with the 4th)? Well, you get the idea.

If fighter planes were perfect, pilots were perfect and conditions
were perfect, guaranteed 100% of the time, we wouldn't need to design
ejection seats. But we still design them, and once in a while, they
are actually useful and save a life. That's exactly the same thing.
Who cares whose fault it is when an unexpected event occurs? It's
useful to be able to retrieve detailed info of failures, but right
when it happens, nobody cares at this point: the system has to
recover in the quickest way possible. Period.

As a basic rule of thumb, I'd just say that watchdogs are good for
dealing with transient, temporary, unexpected failures. Redundancy
is used more with a long-term (or complete) failure of one or several
devices in mind. Of course, if designed in a sensible manner, they
can complement one another and even interact with one another. That's
when things get interesting.

Paul E. Bennett

unread,

May 28, 2004, 7:40:22 PM5/28/04

to

Paul Keinanen wrote:

>>If your code PROM/EPROM/EEPROM/flash fails and the mcu starts executing
>>random memory as code, you want to make sure your motors, pumps, X-ray
>>tube, etc shuts down.
>
> In any really safety critical system, you should use double or triple
> (voting) redundant system, not watchdogs.

Double or triple redundancy is not always the answer for Safety Critical
Systems. Sometimes just a different logical processor (or even a relay
based interlocking scheme) will provide the protection. Sometimes you
have to even consider fully mechanical interlocking as part of the
system. Whatever mitigation scheme you need to use should be based on
the risk assessment arising from a fully discovered HAZOP study.

Having watched over a lot of the responses, I am in the camp that is
aimed at getting the code as correct as you possibly can before you
begin to worry about turning the watchdog on. However, I also use a
separate Pulse Maintained Relay circuit that has to be kept energised
by a correctly responding system. This relay automatically signals
unhealthy if it de-energises due to a system failing to kick it
properly or by a failure in its own circuitry (see my Reading and
Writing the World articles on my website).
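
A pulse maintained relay is hardware, but its behaviour can be modelled in a few lines of C to show what the software has to do to keep it in. This is only a sketch under stated assumptions: the names (`pulse_relay_t`, `relay_drive`, `relay_tick`), the hold time, and the rule that only a change of level on the drive line counts as a pulse are all hypothetical and are not taken from the circuit described above.

```c
#include <stdbool.h>
#include <stdint.h>

/* Software model of a relay that only stays energised while it keeps
 * receiving pulses. A drive line stuck high or stuck low lets it drop out. */
typedef struct {
    uint32_t hold_ms;        /* how long one pulse keeps the relay in */
    uint32_t since_pulse_ms; /* time since the last accepted pulse */
    bool last_level;         /* last level seen on the drive line */
    bool energised;          /* true = system reported healthy */
} pulse_relay_t;

static void relay_init(pulse_relay_t *r, uint32_t hold_ms)
{
    r->hold_ms = hold_ms;
    r->since_pulse_ms = hold_ms; /* start de-energised */
    r->last_level = false;
    r->energised = false;
}

/* The controller writes its drive line; only an edge re-arms the relay. */
static void relay_drive(pulse_relay_t *r, bool level)
{
    if (level != r->last_level) {
        r->since_pulse_ms = 0;
        r->energised = true;
    }
    r->last_level = level;
}

/* Advance the model by elapsed_ms; the relay drops out when the hold time is used up. */
static void relay_tick(pulse_relay_t *r, uint32_t elapsed_ms)
{
    if (elapsed_ms >= r->hold_ms - r->since_pulse_ms) {
        r->since_pulse_ms = r->hold_ms; /* clamp, avoids overflow */
        r->energised = false;
    } else {
        r->since_pulse_ms += elapsed_ms;
    }
}

static bool relay_healthy(const pulse_relay_t *r)
{
    return r->energised;
}
```

The point of the edge rule in this model is that a crashed program holding the line at one level cannot keep the relay in; only code that is still toggling the line on schedule can.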

Paul Keinanen

unread,

May 29, 2004, 5:21:57 AM5/29/04

to

On Sat, 29 May 2004 00:40:22 +0100, "Paul E. Bennett"
<p...@amleth.demon.co.uk> wrote:

>
>Double or triple redundancy is not always the answer for Safety Critical
>Systems. Sometimes just a different logical processor (or even a relay
>based interlocking scheme) will provide the protection. Sometimes you
>have to even consider fully mechanical interlocking as part of the
>system. Whatever mitigation scheme you need to use should be based on
>the risk assessment arising from a fully discovered HAZOP study.

The main purpose of redundant systems is to let the system operate
normally even if some controllers fail, not safety. I fully agree that
the last ditch security system should not rely on computer logic and
preferably not even on electricity.

Paul

Paul Keinanen

unread,

May 29, 2004, 5:21:58 AM5/29/04

to

On Fri, 28 May 2004 20:57:21 GMT, "Christopher X. Candreva"
<ch...@westnet.com> wrote:

>In comp.robotics.misc Paul Keinanen <kein...@sci.fi> wrote:
>
>: If it appears that the hardware is falling apart, how could you trust
>: that it makes any sensible decisions ? Of course, if each output
>
>You've changed the situation -- 'the hardware is falling apart' is hardly the
>same as a single hardware failure.

But how does the WDT tell the difference between a transient failure
and the hardware falling apart ?

The self test routines after reset may detect some permanent failure
or it might not. The self test routine itself could go crazy due to
permanent hardware problems and the WDT kicks in again.

Now we have another interesting situation, which has not been
discussed so far. If there is a permanent hardware/software error and
the WDT triggers over and over again, this can also cause a lot of
damage (e.g. due to repeated large startup currents in some big
loads). Thus, the WDT should be allowed to kick in only for a
predefined number of times and then disable the whole system until
manual intervention.
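
A minimal C sketch of such a restart limit, assuming a counter kept in memory that survives a watchdog reset and assuming that a power cycle or manual reset counts as the manual intervention. `MAX_WDT_RESTARTS`, `restart_state_t` and `decide_boot` are hypothetical names chosen for illustration.

```c
#include <stdint.h>

#define MAX_WDT_RESTARTS 3u /* assumed limit, tune per application */

typedef enum { BOOT_NORMAL, BOOT_LOCKED_OUT } boot_action_t;

/* Assumed to live in memory that keeps its content across a watchdog reset. */
typedef struct {
    uint8_t consecutive_wdt_resets;
} restart_state_t;

/* Called once at start-up with the cause of this boot. */
static boot_action_t decide_boot(restart_state_t *st, int reset_was_watchdog)
{
    if (!reset_was_watchdog) {
        /* Power cycle or manual reset: treat as operator intervention. */
        st->consecutive_wdt_resets = 0;
        return BOOT_NORMAL;
    }
    if (st->consecutive_wdt_resets < UINT8_MAX) {
        st->consecutive_wdt_resets++; /* saturate instead of wrapping */
    }
    /* Allow MAX_WDT_RESTARTS automatic restarts, then stay off. */
    return (st->consecutive_wdt_resets > MAX_WDT_RESTARTS) ? BOOT_LOCKED_OUT
                                                           : BOOT_NORMAL;
}
```

In the locked-out case the start-up code would leave all outputs in their safe state and wait for a person. A real design might also clear the counter after a long period of healthy running, which this sketch leaves out.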

Paul

Jim Granville

unread,

May 29, 2004, 6:27:43 AM5/29/04

to

Paul Keinanen wrote:
<snip>

> Now we have another interesting situation, which has not been
> discussed so far. If there is a permanent hardware/software error and
> the WDT triggers over and over again, this can also cause a lot of
> damage (e.g. due to repeated large startup currents in some big
> loads). Thus, the WDT should be allowed to kick in only for a
> predefined number of times and then disable the whole system until
> manual intervention.

I have also noticed a trend for some newer WDOG devices to have quite
long timeout options (mins to even hours). This can have merit, as
examples given in another thread show the problems with designing too
close to a WDOG's poorly defined timebase.
Other WDOGs I've seen have a longer FIRST trigger window, to allow
more elasticity on POST/Boot modes, until the operational SW proper
starts working.

It would be a good idea to check for annoyance/damage modes, in a
continually firing WDOG failure instance.
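
The longer FIRST window can be pictured with a small software model in C. This is only a sketch: `wdog_t` and its functions are hypothetical, the timeouts are arbitrary, and a real WDOG does this in hardware with its own timebase.

```c
#include <stdbool.h>
#include <stdint.h>

/* Model of a watchdog whose first window after reset is longer than the
 * window used once the operational software is kicking it. */
typedef struct {
    uint32_t run_timeout_ms; /* window applied after every kick */
    uint32_t remaining_ms;   /* time left before the watchdog fires */
    bool expired;            /* true once the watchdog has fired */
} wdog_t;

static void wdog_init(wdog_t *w, uint32_t first_timeout_ms, uint32_t run_timeout_ms)
{
    w->run_timeout_ms = run_timeout_ms;
    w->remaining_ms = first_timeout_ms; /* generous window for POST/boot */
    w->expired = false;
}

/* A kick reloads the short run-time window; it cannot undo an expiry. */
static void wdog_kick(wdog_t *w)
{
    if (!w->expired) {
        w->remaining_ms = w->run_timeout_ms;
    }
}

/* Advance time; the watchdog fires when the current window runs out. */
static void wdog_tick(wdog_t *w, uint32_t elapsed_ms)
{
    if (w->expired) {
        return;
    }
    if (elapsed_ms >= w->remaining_ms) {
        w->remaining_ms = 0;
        w->expired = true;
    } else {
        w->remaining_ms -= elapsed_ms;
    }
}
```

For example, with a 5000 ms first window and a 100 ms run window, a 3000 ms boot sequence survives, but once the first kick has happened the software has to keep kicking within 100 ms.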

-jg

Paul E. Bennett

unread,

May 29, 2004, 7:59:21 AM5/29/04

to

Paul Keinanen wrote:

> But how does the WDT tell the difference between a transient failure
> and the hardware falling apart ?
>
> The self test routines after reset may detect some permanent failure
> or it might not. The self test routine itself could go crazy due to
> permanent hardware problems and the WDT kicks in again.
>
> Now we have another interesting situation, which has not been
> discussed so far. If there is a permanent hardware/software error and
> the WDT triggers over and over again, this can also cause a lot of
> damage (e.g. due to repeated large startup currents in some big
> loads). Thus, the WDT should be allowed to kick in only for a
> predefined number of times and then disable the whole system until
> manual intervention.

The answer to that is you DO NOT turn on any outputs until your system
can determine for itself that it is able to function within its design
parameters. You can count the number of watchdog kicks once you have
completed the POST routines to ensure that a minimum number of correct
kicks have happened before you enable the outputs to be turned on.
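
A rough C sketch of that gating, assuming the firmware can tell whether each watchdog kick arrived on schedule. `MIN_GOOD_KICKS`, `startup_gate_t` and the function names are hypothetical and only illustrate the idea of keeping outputs off until POST has passed and enough correct kicks have been counted.

```c
#include <stdbool.h>
#include <stdint.h>

#define MIN_GOOD_KICKS 5u /* assumed threshold, tune per application */

typedef struct {
    bool post_passed;     /* result of the power-on self test */
    uint32_t good_kicks;  /* consecutive on-time kicks since POST passed */
    bool outputs_enabled; /* outputs may only be driven when true */
} startup_gate_t;

static void gate_init(startup_gate_t *g)
{
    g->post_passed = false;
    g->good_kicks = 0;
    g->outputs_enabled = false;
}

static void gate_post_result(startup_gate_t *g, bool passed)
{
    g->post_passed = passed;
}

/* Called on every watchdog kick; on_time says whether it met its schedule. */
static void gate_on_kick(startup_gate_t *g, bool on_time)
{
    if (!g->post_passed || !on_time) {
        /* Any doubt: back to the safe state and start counting again. */
        g->good_kicks = 0;
        g->outputs_enabled = false;
        return;
    }
    if (g->good_kicks < MIN_GOOD_KICKS) {
        g->good_kicks++;
    }
    if (g->good_kicks >= MIN_GOOD_KICKS) {
        g->outputs_enabled = true;
    }
}
```

A system stuck in a reset loop never accumulates the required kicks in this scheme, so its loads are never switched on in the first place.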

Mel Wilson

unread,

May 29, 2004, 9:19:59 AM5/29/04

to

In article <-vKdnRJZ9Ml...@speakeasy.net>,

Guy Macon <http://www.guymacon.com> wrote:
>

>CBFalconer <cbfal...@yahoo.com> says...

>
>>If you think you can build anything that is failure, damage,
>>and idiot proof you have delusions of grandeur.
>

>...or you are in management. "Our company policy states that
>all of our products are failure, damage, and idiot proof."

As colleague DW said, " ... idiot proof. It proves we're
idiots." He was kidding, of course.

Regards. Mel.

Ulf Samuelsson

unread,

Jun 1, 2004, 7:50:11 AM6/1/04

to

"Guy Macon" <http://www.guymacon.com> skrev i meddelandet
news:-vKdnRJZ9Ml...@speakeasy.net...
>
> CBFalconer <cbfal...@yahoo.com> says...

>
> >If you think you can build anything that is failure, damage,
> >and idiot proof you have delusions of grandeur.
>

> ...or you are in management. "Our company policy states that
> all of our products are failure, damage, and idiot proof."
>

I remember reading a warranty clause like this:
"The only guarantee you'll get from us is that eventually all our equipment
will fail."

--
Best Regards
Ulf at atmel dot com
These comments are intended to be my own opinion and they
may, or may not be shared by my employer, Atmel Sweden.

Eric Bohlman

unread,

Jun 5, 2004, 2:17:40 AM6/5/04

to

The Artist Formerly Known as Kap'n Salty <mike...@swamp666gas.com>
wrote in news:10bfajp...@corp.supernews.com:

> Anyone who claims NEVER, EVER to have unwittingly released "dodgy
> code", or to have been part of a team that did so is either:
>
> 1) lying
> 2) never had to code under pressure (time and cost constraints)
> 3) lying -- to themselves
> 4) not been coding for very long, or never on a project with much
> complexity

Amen to #4. I remember reading a story
about a company that, when hiring salesmen, would always ask the
prospective salesman about the major accounts that he had *lost*. If he
had never lost a customer, he didn't get hired, because that meant that he
had never "played in the major leagues."

Part of being a geek is having a tendency to grossly overestimate the role
that personal ability plays in the success of one's work. The reality is
that the highest levels of intelligence (or its correlates) that have been
observed in human beings are *far, far* away from the levels that would
guarantee perfection. Any business process that relies on humans being
omniscient is, by definition, a failure. There is *no* way to guarantee
that Mr. Murphy will never pay you a visit. There are practices that will
make him feel distinctly unwelcome (and there are practices that amount to
buying him a first-class plane ticket and putting him up in the penthouse
suite of the most expensive hotel in town), but none of them will offer you
absolute certainty.

Marc Le Roy

unread,

Jun 13, 2004, 12:49:18 PM6/13/04

to

CBFalconer wrote:
> jiang wrote:
>>
>>> One way to check hardware is to run another identical processor
>>> and compare that they behave the same. If you have three or more
>>> then you can perform voting so that the most popular answer is
>>> the one that gets used.
>>
>> That is cool idea !..
>
> And not so simple. What takes the vote? What if it fails?

No system can be 100% reliable. All that matters is to get the level of
reliability required by the application.
Generally, the voter mechanism is designed in order to be far more reliable
than the other part of the system (reliability even better than the
resulting reliability of the voting algorithm).
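
For illustration, the core of a two-out-of-three voter fits in a few lines of C. This is a sketch under assumptions: `vote3` and `vote_result_t` are hypothetical names, the three channels are assumed to deliver exactly comparable integer results, and in a real system the voter is often simple dedicated hardware rather than software for the reliability reason given above.

```c
#include <stdbool.h>
#include <stdint.h>

typedef struct {
    int32_t value; /* majority value, only meaningful if agreed is true */
    bool agreed;   /* false when all three channels disagree */
} vote_result_t;

/* Two-out-of-three majority vote over the results of three channels. */
static vote_result_t vote3(int32_t a, int32_t b, int32_t c)
{
    vote_result_t r = { 0, false };
    if (a == b || a == c) {
        r.value = a;
        r.agreed = true;
    } else if (b == c) {
        r.value = b;
        r.agreed = true;
    }
    return r; /* no majority: the caller must fall back to a safe state */
}
```

Keeping the voter this small is one way to make it far easier to verify than the channels it supervises.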

Marc