Google Groups no longer supports new Usenet posts or subscriptions. Historical content remains viewable.

Dismiss

Power On Self Test

173 views

Skip to first unread message

pozz

unread,

Sep 23, 2020, 6:13:49 AM9/23/20

I'd like to implement a Power On Self Test to be sure all (or many)
parts of the electronics are functioning well.

The tests to be done to check external hardware depends on the actual
hardware that is present.

What about the MCU that features an internal Flash (where is the code)
and RAM and some peripherals? Are there any strategies to test if the
internal RAM or Flash are good? Do you think these kind of tests could
be useful?

What about a test of the clock based on an external crystal?

Don Y

unread,

Sep 23, 2020, 6:58:04 AM9/23/20

You can test whatever you think you need confidence in prior to declaring the
system healthy enough to boot.

Historically, a small area of ROM was relied upon to contain enough code to
verify it's integrity along with the rest of the POST/BIST code's integrity.
This was done without referencing RAM (which may be defective).

Some folks would include "CPU tests" to verify the basic integrity of the
processor. I think these are dubious as it likely either works or it doesn't
("Gee, it can ADD but can't JMP!"). With more advanced CPUs, you'd likely
want to verify the cache, ECC and VMM hardware behave as expected (sometimes
this requires adding hooks in order to be able to synthesize faults).

RAM was then tested using strategies appropriate to the RAM technology used.
E.g., DRAM wanted to be tested with long delays between the write and
read-verify to ensure any failures in hardware refresh mechanisms were
given an opportunity to manifest. (Even nonexistent memory can appear to
be present and functional if the test is poorly designed, in certain hardware
configurations)

From there, different peripherals could be tested while relying on the now
assumed *functional* ROM & RAM to conduct those tests. I.e., the test
application can start to look like a more full-featured application instead
of tight little bits of code.

You're at the mercy of the hardware designer to incorporate appropriate hooks
to test many aspects of the circuitry. E.g., can you generate a serial
data stream to test if a UART is receiving correctly? transmitting? Does
your MAC let you push octets onto the wire and see them or is the loopback
interface purely inside the NIC?

In the past, I've taken unused outputs and used them as termination voltages
for high impedance pullups/pulldowns that I could use to determine if an
external bit of kit was plugged into the system. I.e., drive the termination
up, then down -- possibly multiple times, depending on what those "inputs"
feed -- and see if anything is detected. If not, it is hopefully because the
external device is driving those inputs with lower impedance signals. So,
test the external device!

You can test for stuck keys/buttons -- if you can ensure the user (or
mechanism) -- can be relied upon NOT to activate them during the POST.

You can test for a functional XTAL -- but only if you have some other timebase
(which may be crude/inaccurate) is operational.

[I once diagnosed a pinball machine as having a defective crystal simply by
observing the refresh of the displays with my unaided eyes -- PGDs appear to
vibrate when lit. Had the POST for the machine been able to detect -- and
flag -- that, it could have diagnosed itself!]

You also have to decide what role the test will play in the user's device
experience; will you flash an indicator telling the user that a fault has
been detected (if so, what will the user do)? Or, will you attempt to
workaround any faults? How reliable will your "indicator" be?? Will you
want to convey anything more informative than "check engine"?

I've added circuitry to my designs to allow me to dynamically (POST as well
as BIST) verify the operational status of the hardware. E.g., every speaker
is accompanied by a microphone -- so I can "listen" to the sounds that I'm
generating to verify the speaker is operational. And, likewise, so I can
generate sounds of particular characteristics to know that my microphone is
working! Of course, having those "devices" on hand means I can also
find uses for them that I might not have originally included in the design!

In my application, I can move the device to a "testing" state at any time.
In this state, I can load diagnostics (once the device itself has verified
that it is capable of executing those diagnostics!) to do whatever testing
I deem necessary. E.g., if I encounter lots of ECC errors in the onboard
RAM, I can take the device offline and run a comprehensive memory diagnostic.
Depending on the results of that test, I can recertify the device for
normal operation, some subset of normal *or* flag it as faulted.

But, my environment expects the devices to operate "unattended" for very
long periods of time, 24/7/365, so I can't rely on the activation of a
POST at power-up.

Think hard about the types of failures you EXPECT to see (i.e., many are
USER errors!) and don't invest too much time detecting things that will
likely never fail OR whose failure you won't be able to do much about.

David Brown

unread,

Sep 23, 2020, 8:13:23 AM9/23/20

On 23/09/2020 12:57, Don Y wrote:
> On 9/23/2020 3:13 AM, pozz wrote:
>> I'd like to implement a Power On Self Test to be sure all (or many)
>> parts of the electronics are functioning well.
>>
>> The tests to be done to check external hardware depends on the actual
>> hardware that is present.
>>
>> What about the MCU that features an internal Flash (where is the code)
>> and RAM and some peripherals? Are there any strategies to test if the
>> internal RAM or Flash are good? Do you think these kind of tests could
>> be useful?
>>
>> What about a test of the clock based on an external crystal?
>

<snip>

> Think hard about the types of failures you EXPECT to see (i.e., many are
> USER errors!) and don't invest too much time detecting things that will
> likely never fail OR whose failure you won't be able to do much about.

This last bit is crucial.

A lot of testing "requirements" that are specified are completely
pointless - or far worse than useless, as they introduce real points of
failure in their attempts to cover everything.

First, figure out what you should /not/ test.

Don't bother testing something unless you can usefully handle the
failure. If the way you communicate errors is through a UART, there is
no point in trying to check that the UART is working. If you have a
single microcontroller in the system, there is no point in trying to
check the cpu or the on-chip ram. There is no point in checking that
you can write to flash or on-chip eeprom - all you do is reduce its
lifetime and make it more likely to fail.

Don't write any test code which cannot itself be tested. If you cannot
induce a failure, or at least simulate it reasonably, do not write code
to check or handle that failure. The reality is that the untested code
will have a higher risk of problems than the thing you are testing.

Don't check the ram or the flash of the microcontroller - there's
nothing you can do if there is a failure. (You can check that you have
successfully loaded a new software update, or that there hasn't been a
reset during an update - a CRC for that kind of thing is a good idea.)
If you have a system that is important enough that ram or flash failures
need to be checked and handled, use a safety-qualified microcontroller
with ECC ram, flash, cache, etc., and perhaps even redundant cores (you
get these with PowerPC and Cortex-R cores).

And think about what can reasonably go wrong, how it can go wrong, and
what can be done about it. Other than for devices susceptible to
current surges (like filament light bulbs), most hardware failures are
in usage, not while power is off - checking on power-up (rather than
while the system is in use) usually only makes sense if it is likely for
a user to see there is a problem and try to "fix" it by turning power
off and on again.

Richard Damon

unread,

Sep 23, 2020, 8:52:03 AM9/23/20

Testing RAM can be useful, letting the system fail gracefully rather
than acting flaky, perhaps just locking up into a tight loop flashing a
LED as a fault indicator. Similarly, you could CRC check the program
flash, and fail on an error, preferably falling into a minimal system
that allows a user reflash, but it might mean just bricking.

Note, that as you say, most faults will happen while powered up, but
many faults will cause a system crash, that the user is likely to power
cycle to try and clear, so power up is a good time to check (since many
things are a lot harder to check while the system is running in operation).

David Brown

unread,

Sep 23, 2020, 10:03:11 AM9/23/20

Have you ever seen microcontroller RAM that failed? It's a possibility
for dynamic ram on PC's that is pushed to its limits for power and
speed, and made as cheaply as possible. But for static RAM in a
microcontroller, the risk of failures is pretty much negligible. The
exception is if a bit is hit by a cosmic ray (or other serious
radiation), which can flip a bit, but that won't be detected by any RAM
test of this kind.

Testing RAM is useful /if/ it can fail, and /if/ you can do something
useful when it fails. (I agree that "a tight loop flashing an LED"
might count as something useful, depending on the situation.)

I've seen "safety standards requirements" that included regular ram
tests. Such requirements generally originate decades ago, and are not
appropriately nuanced for real-life systems. I've seen resulting code
used to implement such tests, added solely to fulfil such requirements.
And I've seen such code written in a way that is untested and
untestable, and in a way that has risks that /hugely/ outweigh those of
a fault occurring in the on-board RAM.

If the OP is in the situation where there are customer requirements for
fulfilling certain safety requirements that include ram tests, and where
"mindlessly obeying these rules no matter how pointless they are in
reality" is the right choice to please arse-covering lawyers, then go
for it. If not, then think long and hard about the realism of such a
failure and such a test, and whether it is truly a positive contribution
to the project as a whole.

> Similarly, you could CRC check the program
> flash, and fail on an error, preferably falling into a minimal system
> that allows a user reflash, but it might mean just bricking.
>

The possibility of a flash failure is a great deal higher than that of a
RAM failure. Flash writes are analogue - a bit can be written in such a
way that it reads back correctly at programming time, but goes outside
the margins over time or at different temperatures or voltages. So yes,
sometimes a CRC of the flash is worth doing. But remember that the
program doing the check is just as much at risk of such failures
(perhaps even more so, if you have a "boot" program that does the check
of the "main" program, as the boot program is less likely to be updated
and thus its bits will have decayed over a longer time).

If flash fails are a real risk, and the system is important enough, it's
better to pick a microcontroller with ECC flash.

> Note, that as you say, most faults will happen while powered up, but
> many faults will cause a system crash, that the user is likely to power
> cycle to try and clear, so power up is a good time to check (since many
> things are a lot harder to check while the system is running in operation).
>

Yes, I mentioned that. (It assumes the embedded system has a user that
can do such a power-cycle.)

Don Y

unread,

Sep 23, 2020, 12:37:00 PM9/23/20

On 9/23/2020 7:03 AM, David Brown wrote:
> On 23/09/2020 14:51, Richard Damon wrote:
>> Testing RAM can be useful, letting the system fail gracefully rather
>> than acting flaky, perhaps just locking up into a tight loop flashing a
>> LED as a fault indicator.

Exactly. If memory is expected to work -- and NEVER expected to fail -- then
it's a small cost to actually make some attempt to prove that is actually the
case. Otherwise, when that "Can't Happen" actually does, you're left clueless.

[In the 70's, a common system failure I encountered was an address decoding
error which would effectively disable all memory (think misprogrammed PLA).
It was readily apparent as the processor would be found halted at ~0x0076
(IIRC) -- 0x76 being the opcode for HALT which would be the low byte of the
address still "floating" on the multiplexed address/data bus. Nowadays, one
can imagine similar failures -- including grown defects -- deleteriously
affecting deployed product.]

> Have you ever seen microcontroller RAM that failed? It's a possibility
> for dynamic ram on PC's that is pushed to its limits for power and
> speed, and made as cheaply as possible. But for static RAM in a
> microcontroller, the risk of failures is pretty much negligible. The
> exception is if a bit is hit by a cosmic ray (or other serious
> radiation), which can flip a bit, but that won't be detected by any RAM
> test of this kind.

You're assuming that there is only one, predetermined way to get into the
self-test routine. And, that nothing in the machine has failed that would
render that assumption false.

At each point in your code, you should know what assumptions are safe and
which are yet to be proven/made safe. If you're in the self-test routine,
you shouldn't have to wonder if memory works, is configured as you expect
it to be, etc. ("Hey, I'm running code so why bother to TEST the code
image??") Assuming that the memory is operational NOW (while I am executing
this piece of self-test code) is a hazard waiting to happen.

For example, an errant RETURN could land the program IN the self-test code
WITHOUT the benefit of having been through the controlled, repeatable start-up
sequence. (i.e., the RAM -- or other resource -- may NOW be mapped to a
different location in the address space such that the code written under
the assumption that it resides in its "power on reset" configuration no
longer works properly.)

I'd rather have that code FAIL and report the error to me -- because it
tried to verify some assumption(s) and failed -- than have the code continue
to operate FROM THERE on the assumption that it is actually (later) talking
to functional RAM that has yet-to-be reconfigured. Otherwise, you get a
"fluke" that you can never resolve (and, because you can't easily sort out
what might have happened in order to reproduce and repair, you shrug it
off due to time pressures -- even though YOU SAW IT FAIL!).

[I have an entry point in all of my products called RESET. It manually and
deliberately works to restore the hardware to the same condition that it
was in just after the application of power. So, any code that executes
after passing through that entry point -- to "RESETTED" -- SHOULD behave the
same regardless of whether power was just applied, or not]

Note that there's a difference between the sort of "confidence testing"
that occurs at POST (how many devices perform exhaustive tests at POST?
How many users would tolerate that sort of delay?) and "diagnostic testing"
which truly provides an assessment of the health of the device and can often
be used to assist in determine the need for replacement (or, for self-healing).

In most cases, you can test RAM with a single write pass followed by a
verification read pass and be reasonably sure that you've caught stuck-at
failures as well as decode failures -- no need for a whole barrage of
different tests when you're typically looking for a simple Go-NoGo.

[I run three passes on a 512MB block and use that as a crude assessment
as to whether or not the memory will LIKELY accept a program image.
Installing -- and verifying -- that image acts as a further test of
the memory's crude functionality. Thereafter, I swap pages of memory
out and exercise them to verify that I'm not seeing an increase in
ECC activity in a particular region -- which I will remap if need be.]

You also need to know how the device is fabricated; a memory module
will experience different errors than memory that is soldered down.
(and, in the latter case, you have to be prepared for the memory to
NOT be what you THOUGHT it was going to be, at design time). And,
soldered down memory will behave different than chip-on-chip.

Folks write ONE memory test and then assume all memory behaves (fails!)
the same.

If you don't understand your hardware and how it can fail, you shouldn't
be the one who is designing the test suite!

David Brown

unread,

Sep 23, 2020, 3:02:58 PM9/23/20

On 23/09/2020 18:36, Don Y wrote:

> If you don't understand your hardware and how it can fail, you shouldn't
> be the one who is designing the test suite!

That bit is correct. The rest - well, I don't want to get into a long
and protracted argument.

Any system is made up of layers. Higher level layers assume that lower
level layers work according to specification (which may include
indicating an error for some kinds of detectable fault). If you think
the higher level part can fully verify the lower level parts - "prove"
that the assumptions hold - you are fooling yourself. When you design a
system based on a microcontroller, you pick a device that is as reliable
as you need it to be - so that you /can/ assume the core parts (cpu,
ram, flash, interrupts, etc.) work well enough for your needs. If you
are not sure it is reliable enough, pick a different device or make a
redundant system.

No amount of testing can /ever/ prove that something works - it can only
prove that something does /not/ work.

Don Y

unread,

Sep 23, 2020, 3:32:58 PM9/23/20

A system is not a static entity. It changes over time (even if the design is
frozen). So, while your RAM (or any other component) may not be LIKELY to
fail, the rest of the system that enables the RAM to function as intended
can change in ways that manifest as RAM (or other resources) failures.

[Picking the "world's most reliable MCU" won't guarantee that it won't throw
RAM errors in a deployed product.]

Simply assuming it "can't fail" is naive.

And, identifying faulty "can't happen" behavior EARLY (e.g. POST) rather
than late gives you a better idea of what to report to the user/customer
because you are closer to the problem's manifestation. You don't end
up misbehaving and wondering "why?"

[And, all of this assumes "bugfree software" so any errors are entirely a
result of hardware faults]

Paul Rubin

unread,

Sep 23, 2020, 4:05:25 PM9/23/20

David Brown <david...@hesbynett.no> writes:
> When you design a system based on a microcontroller, you pick a device
> that is as reliable as you need it to be

That might not exist. E.g. it's common for security processors and
software to continuously self-test while running, since the user might
be trying to tamper with them. "Differential fault analysis" is a
relevant search string. The attacker does stuff like intentionally
overclock the processor in the hope of introducing errors, so they can
observe the difference between the error result and the normal result,
and infer stuff about the supposedly-secured info inside the processor.
There is no magic way to defeat these attacks, but the cpu designers do
what they can.

Mike Perkins

unread,

Sep 23, 2020, 6:13:21 PM9/23/20

I've done this with the STM32 variety of MCUs. The device itself has a
Flash checksum and if this fails it won't start.

ST also proved some example code and libraries for POST. These are more
comprehensive than just checking RAM.

Might be worth have a look.

--
Mike Perkins
Video Solutions Ltd
www.videosolutions.ltd.uk

Richard Damon

unread,

Sep 23, 2020, 9:55:48 PM9/23/20

I think I have seen it once, the part had gotten electrically stressed
in debugging and one of the banks of internal ram failed. We had only
put the test in because the uniit was going to be in critical
infrastructure where certain types of malfunctions could present dangers
to people in the area.

It is also possible that many failure might trip a watchdog that forces
a reset, and the unit then finds the fault and locks itself 'safe'.

pozz

unread,

Sep 24, 2020, 3:18:05 AM9/24/20

Could you give me a link on this code? Thanks

David Brown

unread,

Sep 24, 2020, 5:11:07 AM9/24/20

On 23/09/2020 21:32, Don Y wrote:
> On 9/23/2020 12:02 PM, David Brown wrote:
>> On 23/09/2020 18:36, Don Y wrote:
>>
>>> If you don't understand your hardware and how it can fail, you shouldn't
>>> be the one who is designing the test suite!
>>
>> That bit is correct. The rest - well, I don't want to get into a long
>> and protracted argument.
>>
>> Any system is made up of layers. Higher level layers assume that lower
>> level layers work according to specification (which may include
>> indicating an error for some kinds of detectable fault). If you think
>> the higher level part can fully verify the lower level parts - "prove"
>> that the assumptions hold - you are fooling yourself. When you design a
>> system based on a microcontroller, you pick a device that is as reliable
>> as you need it to be - so that you /can/ assume the core parts (cpu,
>> ram, flash, interrupts, etc.) work well enough for your needs. If you
>> are not sure it is reliable enough, pick a different device or make a
>> redundant system.
>>
>> No amount of testing can /ever/ prove that something works - it can only
>> prove that something does /not/ work.
>
> A system is not a static entity. It changes over time (even if the
> design is
> frozen). So, while your RAM (or any other component) may not be LIKELY to
> fail, the rest of the system that enables the RAM to function as intended
> can change in ways that manifest as RAM (or other resources) failures.

Ah, nonsense. Sure, it is /conceivable/, but it's a one in a million
possibility. This is not the 70's any more, and we are not talking
about dynamic ram. The onboard ram is static ram - it is a sea of
simple flip-flops. These /can/ be bit-flipped by cosmic rays, if they
are small enough, but they don't suddenly stop working. If "the rest of
the system" fails in a way that stops the ram bits working, you can be
confident that the cpu core and other critical parts have stopped too,
as the problem is your clock, your voltage supply, or overheating.

If you are going to try to make sensible decisions about what can fail,
and where it is useful to test, you need to understand how devices work
- devices that you are using /today/, not systems from 50 years ago.
Otherwise your testing is counter-productive as the tests have higher
risks of failures than the thing you are testing.

>
> [Picking the "world's most reliable MCU" won't guarantee that it won't
> throw
> RAM errors in a deployed product.]
>

/Nothing/ will give you guarantees like that. But if you pick a
microcontroller with ECC on its onboard ram (and cache, if it has it),
you reduce, by many orders of magnitude, the risk of single-event upsets
(such as cosmic rays) leading to failures of the system. Anything else
you can do in software is pointless in comparison. "Testing" your ram
can't possibly detect such issues.

Not many products justify the extra expense of such microcontrollers,
but they are available for those that need them.

> Simply assuming it "can't fail" is naive.

Of course. Simply assuming that you can do a test at startup and think
that makes the system more reliable is at least equally naïve.

>
> And, identifying faulty "can't happen" behavior EARLY (e.g. POST) rather
> than late gives you a better idea of what to report to the user/customer
> because you are closer to the problem's manifestation. You don't end
> up misbehaving and wondering "why?"
>
> [And, all of this assumes "bugfree software" so any errors are entirely a
> result of hardware faults]

And there is perhaps your biggest invalid assumption. Software is
always a risk. Software that can't be properly tested is a
significantly higher risk. Software designed to handle situations that
cannot possibly be reproduced for testing purposes, cannot be properly
tested. So writing software test routines for something that has no
realistic chance of happening in the field, /reduces/ the reliability of
the product.

David Brown

unread,

Sep 24, 2020, 5:15:50 AM9/24/20

Security against deliberate attacks is a completely different ballgame.

If you have made a system where an attacker can cause processor
overclocking, and such attacks are realistic, then you need to put
whatever checks, tests and mitigations are needed to deal with that
situation.

If there are no feasible scenarios where the processor clock can be
suddenly increased to the point where hardware or software becomes
unreliable, then any tests or handling of such a situation is worse than
useless. You are just adding more stuff that can go wrong (or be
attacked), without any benefits.

David Brown

unread,

Sep 24, 2020, 5:27:00 AM9/24/20

On 24/09/2020 03:55, Richard Damon wrote:
> On 9/23/20 10:03 AM, David Brown wrote:

>> Have you ever seen microcontroller RAM that failed? It's a possibility
>> for dynamic ram on PC's that is pushed to its limits for power and
>> speed, and made as cheaply as possible. But for static RAM in a
>> microcontroller, the risk of failures is pretty much negligible. The
>> exception is if a bit is hit by a cosmic ray (or other serious
>> radiation), which can flip a bit, but that won't be detected by any RAM
>> test of this kind.
>
> I think I have seen it once, the part had gotten electrically stressed
> in debugging and one of the banks of internal ram failed. We had only
> put the test in because the uniit was going to be in critical
> infrastructure where certain types of malfunctions could present dangers
> to people in the area.

Presumably you are careful about keeping the systems that developers
have potentially broken separate from the systems that get delivered to
customers. (Another possible cause of this kind of failure is ESD
damage. Production departments are usually a lot more meticulous about
ESD than developers.)

If you have a system that is safety critical, you have to do an analysis
of the risks of things going wrong, the consequences of those failures,
and how these (risks and consequences) can be reduced or mitigated. If
you figure out that static failure of the memory is a risk, then testing
can be worth doing. You might also decide that ECC ram, or redundant
devices, or external monitors are a better solution. There's no fixed
answer.

>
> It is also possible that many failure might trip a watchdog that forces
> a reset, and the unit then finds the fault and locks itself 'safe'.
>

That is definitely possible. But again, be very careful with watchdogs
- watchdog handling code is rarely properly tested, because it is
handling situations that don't occur. (Usually it /can/ be tested, but
that does not mean it /is/ tested.)

Don Y

unread,

Sep 24, 2020, 7:11:44 AM9/24/20

On 9/24/2020 2:10 AM, David Brown wrote:
> On 23/09/2020 21:32, Don Y wrote:
>> On 9/23/2020 12:02 PM, David Brown wrote:
>> A system is not a static entity. It changes over time (even if the
>> design is
>> frozen). So, while your RAM (or any other component) may not be LIKELY to
>> fail, the rest of the system that enables the RAM to function as intended
>> can change in ways that manifest as RAM (or other resources) failures.
>
> Ah, nonsense. Sure, it is /conceivable/, but it's a one in a million
> possibility. This is not the 70's any more, and we are not talking
> about dynamic ram. The onboard ram is static ram - it is a sea of
> simple flip-flops. These /can/ be bit-flipped by cosmic rays, if they
> are small enough, but they don't suddenly stop working. If "the rest of
> the system" fails in a way that stops the ram bits working, you can be
> confident that the cpu core and other critical parts have stopped too,
> as the problem is your clock, your voltage supply, or overheating.

By your argument, you should test NOTHING and just wait for the user to
complain that the device "isn't working". And, hope that this manifests in
a spectacular -- but not costly -- way. AND, hope it doesn't piss off
the user who now has a product that isn't performing as he had hoped
(and you had ADVERTISED) it would.

The whole point of BIST/POST is to provide a point in time where failures
will hopefully manifest -- instead of SILENTLY affecting the operation
of the device in question, in typically unpredictable ways.

> If you are going to try to make sensible decisions about what can fail,
> and where it is useful to test, you need to understand how devices work
> - devices that you are using /today/, not systems from 50 years ago.
> Otherwise your testing is counter-productive as the tests have higher
> risks of failures than the thing you are testing.

How is a RAM test going to fail post deployment that didn't happen
prior to release? POST/BIST are considerably easier to "get right"
than application code. Their goals are much more concretely defined
and implementation verified.

"50 years ago" you didn't have SRAM suffering from disturb errors.
Yet, now this is a fact of life for even caches. Technology advances
and, with it, come new "challenges".

I suggest you've been basing your assumptions on SRAM reliability on
50 year old anecdotes and not the consequences of more modern implementations,
shrinking device geometries and lower operating voltages. Have a run through
the literature to see...

>> [Picking the "world's most reliable MCU" won't guarantee that it won't
>> throw
>> RAM errors in a deployed product.]
>
> /Nothing/ will give you guarantees like that. But if you pick a
> microcontroller with ECC on its onboard ram (and cache, if it has it),
> you reduce, by many orders of magnitude, the risk of single-event upsets
> (such as cosmic rays) leading to failures of the system. Anything else
> you can do in software is pointless in comparison. "Testing" your ram
> can't possibly detect such issues.
>
> Not many products justify the extra expense of such microcontrollers,
> but they are available for those that need them.

Few designs have the features that they require, let alone DESIRE.
Unless you're working in a market where customers will pay "whatever it
takes", most designs have to live with some subset of what they would
LIKE to have in their product.

>> Simply assuming it "can't fail" is naive.
>
> Of course. Simply assuming that you can do a test at startup and think
> that makes the system more reliable is at least equally naïve.

You miss the point of POST. It doesn't MAKE a system more reliable.
Instead, it tells you when a system is not meeting your expectations.
This is true of ALL testing. You have a defined point in time -- and
operating conditions -- in which you hope to catch a failure so that
you can report on it. A user (customer) is more willing to accept
"there's a flashing red light on the device" than "the &*^($^& thing
doesn't work worth a sh*t -- but I can't provide Tech Support with
any information beyond the fact that I'm frustrated and UNHAPPY WITH
MY PURCHASE"

(and, even if they are willing to replace the device for me -- at no
charge and only minor inconvenience to me -- I still don't feel
confident that the next device won't have exactly the same problem!)

"The worst thing you can do to a system is power it up; the SECOND worst
thing you can do to a system is power it DOWN!" Both of these bad things
have happened just before you run POST.

Few systems can afford to test RAM (regardless of technology) while the system
is actively running. And, few can defer power off to run such a test just
prior to shutdown (where it will be LESS useful as it will miss the
consequences of that impending shutdown and the subsequent powerup).

[OTOH, there are systems that don't see regular/periodic power cyclings.
Do you just let defects grow until the system resets itself (and THEN
invokes POST)?]

BUT, the cost and ease of testing RAM (regardless of technology) at power up
is typically easy to bear in a product's design. It costs me a fraction of
a second to give a cursory test of 500MB. Chances are, I'm going to find
failures THERE instead of "dubious behaviors" in the running product.

>> And, identifying faulty "can't happen" behavior EARLY (e.g. POST) rather
>> than late gives you a better idea of what to report to the user/customer
>> because you are closer to the problem's manifestation. You don't end
>> up misbehaving and wondering "why?"
>>
>> [And, all of this assumes "bugfree software" so any errors are entirely a
>> result of hardware faults]
>
> And there is perhaps your biggest invalid assumption. Software is
> always a risk. Software that can't be properly tested is a
> significantly higher risk. Software designed to handle situations that
> cannot possibly be reproduced for testing purposes, cannot be properly
> tested. So writing software test routines for something that has no
> realistic chance of happening in the field, /reduces/ the reliability of
> the product.

YOUR biggest invalid assumption is that is has no realistic chance of
happening.

Your SECOND biggest assumption is thinking that folks who are qualified
to write application software (for often ill-defined scenarios) are
NOT capable of developing reliable test programs (for very WELL-DEFINED
scenarios).

Do you think *all* MCU-device failures are simply attributable to software
bugs? Why test anything? ASSUME the power supply and power conditioning
circuitry will never fail. Assume the various I/Os will never fail.
Blame every failure on "it must be a bug". Never scrap returned product
cuz all it needs -- along with every unit coming off the line, TODAY -- is
a reflash!

Are all of your products short-lived and in inconsequential applications?

Naive.

Please DO the research regarding TODAY's SRAM implementations. Understand why
they fail and why folks are now adding EDAC to SRAM ("50 years ago" you
wouldn't see EDAC and SRAM discussed in the same sentence).

Then: 10um process. Now: < 30nm.

Then: "5V". Now: < 2V.

Then: 60W/Mb. Now: 20nW/Mb.

Then: ~200ns. Now: ~1ns.

Then: fixed power/speed. Now: dynamically variable power vs. speed.

None of these things were heard of "50 years ago". Do you think there are no
consequences of these changes? The illusive "win-win"??

Better yet, convince your employer/client to let you design a full custom.
Make sure it has SRAM onboard. Then, notice how much attention the fab
pays to TESTING that SRAM vs. junk logic -- as well as HOW they test it.
Ask them what to expect from your component after a year operating in "typ"
conditions; two years; five years. Ask them if they can quantifiably predict
the effects of electromigration on the component in those periods. Changes
in power supply sensitivity. Etc. "Chips" age. When you commit to purchasing
your parts from their fab, ask them what sort of guarantees they'd be willing
to offer on the devices that THEY will be producing for you. Will they
defend it's operational status 5 years down the road? 10? 20??

("Hey, SRAM doesn't fail so you should be willing to extend GENEROUS
warranty terms to me, right? BEYOND just the cost of replacing the
component! After all, there's no REALISTIC CHANCE of it failing!!")

Do some reading. You'll learn something.

David Brown

unread,

Sep 24, 2020, 7:58:37 AM9/24/20

On 24/09/2020 13:11, Don Y wrote:
> On 9/24/2020 2:10 AM, David Brown wrote:
>> On 23/09/2020 21:32, Don Y wrote:
>>> On 9/23/2020 12:02 PM, David Brown wrote:
>>> A system is not a static entity. It changes over time (even if the
>>> design is
>>> frozen). So, while your RAM (or any other component) may not be
>>> LIKELY to
>>> fail, the rest of the system that enables the RAM to function as
>>> intended
>>> can change in ways that manifest as RAM (or other resources) failures.
>>
>> Ah, nonsense. Sure, it is /conceivable/, but it's a one in a million
>> possibility. This is not the 70's any more, and we are not talking
>> about dynamic ram. The onboard ram is static ram - it is a sea of
>> simple flip-flops. These /can/ be bit-flipped by cosmic rays, if they
>> are small enough, but they don't suddenly stop working. If "the rest of
>> the system" fails in a way that stops the ram bits working, you can be
>> confident that the cpu core and other critical parts have stopped too,
>> as the problem is your clock, your voltage supply, or overheating.
>
> By your argument, you should test NOTHING and just wait for the user to
> complain that the device "isn't working". And, hope that this manifests in
> a spectacular -- but not costly -- way. AND, hope it doesn't piss off
> the user who now has a product that isn't performing as he had hoped
> (and you had ADVERTISED) it would.

I can't see how you came to that bizarre conclusion.

>
> The whole point of BIST/POST is to provide a point in time where failures
> will hopefully manifest -- instead of SILENTLY affecting the operation
> of the device in question, in typically unpredictable ways.
>

Failures rarely occur when a device is switched off. They happen when
the device is running. (They also happen during production or putting
together a system, and it's worth doing checks then.)

If you think that failures might realistically occur, and the tradeoffs
between costs, reliability, safety, etc., warrant it, then you put in
the appropriate level of failure detection and mitigation at /runtime/
in the system. There's little help in the failure leading to operation
problems, and then saying afterwards that you could have spotted that
problem in a POST.

>> If you are going to try to make sensible decisions about what can fail,
>> and where it is useful to test, you need to understand how devices work
>> - devices that you are using /today/, not systems from 50 years ago.
>> Otherwise your testing is counter-productive as the tests have higher
>> risks of failures than the thing you are testing.
>
> How is a RAM test going to fail post deployment that didn't happen
> prior to release? POST/BIST are considerably easier to "get right"
> than application code. Their goals are much more concretely defined
> and implementation verified.
>

Never underestimate the complexity of these things, nor the ability of
software developers to get things wrong.

> "50 years ago" you didn't have SRAM suffering from disturb errors.
> Yet, now this is a fact of life for even caches. Technology advances
> and, with it, come new "challenges".
>

Yes, "disturb errors" as you call them - "single-event upsets",
bit-flips, etc., are a possibility with ram. They are more likely in
dynamic ram, but can occur in small, fast static ram cells. And POSTs
and other ram checks are totally and completely /useless/ at identifying
them or dealing with them. That is why I say you need to understand the
hardware and the possible failure modes in order to make reliable systems.

Are you sure you understand what POSTs can do, and the difference
between transient failures and static failures?

> I suggest you've been basing your assumptions on SRAM reliability on
> 50 year old anecdotes and not the consequences of more modern
> implementations,
> shrinking device geometries and lower operating voltages. Have a run
> through
> the literature to see...

You are the one that was discussing 50 year old anecdotes!

>
>>> [Picking the "world's most reliable MCU" won't guarantee that it won't
>>> throw
>>> RAM errors in a deployed product.]
>>
>> /Nothing/ will give you guarantees like that. But if you pick a
>> microcontroller with ECC on its onboard ram (and cache, if it has it),
>> you reduce, by many orders of magnitude, the risk of single-event upsets
>> (such as cosmic rays) leading to failures of the system. Anything else
>> you can do in software is pointless in comparison. "Testing" your ram
>> can't possibly detect such issues.
>>
>> Not many products justify the extra expense of such microcontrollers,
>> but they are available for those that need them.
>
> Few designs have the features that they require, let alone DESIRE.
> Unless you're working in a market where customers will pay "whatever it
> takes", most designs have to live with some subset of what they would
> LIKE to have in their product.
>

In a safety-critical system, the cost of using a microcontroller with
ECC ram is negligible. These are used all the time in the automotive
industry.

>>> Simply assuming it "can't fail" is naive.
>>
>> Of course. Simply assuming that you can do a test at startup and think
>> that makes the system more reliable is at least equally naïve.
>
> You miss the point of POST. It doesn't MAKE a system more reliable.

I know it doesn't do that - I've been saying this all along.

> Instead, it tells you when a system is not meeting your expectations.
> This is true of ALL testing. You have a defined point in time -- and
> operating conditions -- in which you hope to catch a failure so that
> you can report on it. A user (customer) is more willing to accept
> "there's a flashing red light on the device" than "the &*^($^& thing
> doesn't work worth a sh*t -- but I can't provide Tech Support with
> any information beyond the fact that I'm frustrated and UNHAPPY WITH
> MY PURCHASE"

For /some/ devices, some kind of POST can be useful. For many, it is
pointless - it does not detect the failures that actually matter, and
can only detect ones that have negligible chances of occurring.

If you have a device that is regularly restarted, and where the hardware
is so fault-prone that you really are finding problems with a POST, then
yes - go for it.

All I am arguing for is that people /think/ before making a POST, and do
some analysis and investigation to see if it really is a useful feature.

>
> BUT, the cost and ease of testing RAM (regardless of technology) at
> power up
> is typically easy to bear in a product's design. It costs me a fraction of
> a second to give a cursory test of 500MB. Chances are, I'm going to find
> failures THERE instead of "dubious behaviors" in the running product.

Do you understand the concept of cost/use analysis? If something is
useless, or worse than useless, it doesn't help if it is cheap. Well,
it helps for the marketing folks.

>
>>> And, identifying faulty "can't happen" behavior EARLY (e.g. POST) rather
>>> than late gives you a better idea of what to report to the user/customer
>>> because you are closer to the problem's manifestation. You don't end
>>> up misbehaving and wondering "why?"
>>>
>>> [And, all of this assumes "bugfree software" so any errors are
>>> entirely a
>>> result of hardware faults]
>>
>> And there is perhaps your biggest invalid assumption. Software is
>> always a risk. Software that can't be properly tested is a
>> significantly higher risk. Software designed to handle situations that
>> cannot possibly be reproduced for testing purposes, cannot be properly
>> tested. So writing software test routines for something that has no
>> realistic chance of happening in the field, /reduces/ the reliability of
>> the product.
>
> YOUR biggest invalid assumption is that is has no realistic chance of
> happening.
>

Again, in your enthusiasm you have failed to notice what I have written
repeatedly. If there is a /realistic/ chance of a failure, then it will
often make sense to test for it. If there is no such chance - or
negligible chance of it failing without some other major failure, or
nothing you can do about a failure, then there is no point in trying to
test.

> Your SECOND biggest assumption is thinking that folks who are qualified
> to write application software (for often ill-defined scenarios) are
> NOT capable of developing reliable test programs (for very WELL-DEFINED
> scenarios).

That is often a realistic assumption - different people specialise in
different things. However, it was not an assumption I made - again, you
seem to prefer to make things up than read my posts.

Software is always a risk. It might be low risk, but it is always a risk.

>
> Do you think *all* MCU-device failures are simply attributable to software
> bugs? Why test anything? ASSUME the power supply and power conditioning
> circuitry will never fail. Assume the various I/Os will never fail.
> Blame every failure on "it must be a bug". Never scrap returned product
> cuz all it needs -- along with every unit coming off the line, TODAY -- is
> a reflash!

Another wild idea all of your own.

>
> Are all of your products short-lived and in inconsequential applications?
>

I've made systems that are buried in concrete in oil installations,
working for decades. Do I do that by relying on POSTs, memory tests and
perhaps a watchdog? No.

>
> Do some reading. You'll learn something.

Try it yourself. You could start by reading what I wrote. Then, when
you have learned a bit about this stuff, you can start applying a bit of
/thought/ to the process. And when you look at my posts here, you'll
see that what I have been advocating is that people /think/ about what
they are doing with tests - what are they actually trying to achieve,
what use it is, what the risks are. And stop making pointless code just
because you can.

Richard Damon

unread,

Sep 24, 2020, 9:01:46 AM9/24/20

I wasn't saying that such a test does make sense, but that such a test
CAN be done reasonably, if for some legal/political reason it is
introduced as a requirement. I brought up the example to show that this
type of error CAN occur. Yes, unless some externally imposed requirement
says to test internal ram, I am unlikely to add such a test for a
production system (I have at time done it in development, mostly to
confirm that I understand the limitations and operation of the device).

>
>>
>> It is also possible that many failure might trip a watchdog that forces
>> a reset, and the unit then finds the fault and locks itself 'safe'.
>>
>
> That is definitely possible. But again, be very careful with watchdogs
> - watchdog handling code is rarely properly tested, because it is
> handling situations that don't occur. (Usually it /can/ be tested, but
> that does not mean it /is/ tested.)
>

Yes, testing watchdogs is tricky.

Don Y

unread,

Sep 24, 2020, 10:10:05 AM9/24/20

Well, if the "clock, voltage supply or overheating" is the problem -- and
you can't DIRECTLY test for any of those -- then why are you testing ANYTHING
(except as secondary evidence that some ASSUMPTION your design relies upon
has been violated -- clock, volts, temp)?

>> The whole point of BIST/POST is to provide a point in time where failures
>> will hopefully manifest -- instead of SILENTLY affecting the operation
>> of the device in question, in typically unpredictable ways.
>
> Failures rarely occur when a device is switched off. They happen when
> the device is running. (They also happen during production or putting
> together a system, and it's worth doing checks then.)

Failures rarely occur when the device IS off. But, the act of removing power
to a device is just as hazardous as APPLYING power. Power supplies rarely are
designed to cleanly go up and down without inflicting transients on the devices
they power. Many designers fail to note, carefully, how power transitions
are expected to be managed (in ages past, with many supplies per device, this
was more "in your face" and less easy to ignore)

Of course, to a typical user, the failure will only manifest when the device is
NEXT powered up. You can't test while it's powered down!

> If you think that failures might realistically occur, and the tradeoffs
> between costs, reliability, safety, etc., warrant it, then you put in
> the appropriate level of failure detection and mitigation at /runtime/
> in the system. There's little help in the failure leading to operation
> problems, and then saying afterwards that you could have spotted that
> problem in a POST.

POST provides a reassurance that "all appears well". It can't be thorough
because it is a serial activity with "bringing the system on-line" -- and
few people are willing to wait for exhaustive tests to complete when they
will typically not uncover errors.

But, systems/devices *routinely* fail POST -- for a variety of reasons.
Some may be misapplication (the user has done something he shouldn't).
Some hardware faults (the system hasn't endured as expected). Some
from tampering (nowadays, you can rest assured that folks WILL open
your product and try to tinker with it... to increase memory, enable
an unused feature, patch the firmware, access "hidden" capabilities, etc.)

Your code, however, is based on a set of assumptions -- some formally
codified and some simply internalized. Before it runs, it should verify
that those assumptions are valid, NOW (or, just shrug if the product
misbehaves).

I designed a device used in performing blood assays. It had socketed
DRAM (DIPS) to allow the data store to be increased in 6KB increments
(replace a 16Kx1 DRAM with a 64Kx1 DRAM and you've got 6KB more capacity).
Of course, I had to "size" and "query" the data store's complexion on
startup (which devices are 16Kb and which are 64Kb). But, I also
had to address the fact that the technician in the hospital may have
removed ALL of the devices (shame on him! but, maybe he simply forgot
to install the new set?) *or* left one "bit lane" empty (I used a
portion of the lower 16KB as "system RAM" so can't do much without it)

Do I just wait until someone tries to use the device and then <cough>...
while they have a micropipette loaded with a blood sample in their hand?
I've got no writeable memory -- how can I tell the user that this has
happened? Do I just start "squealing" to induce a panic??

Similarly, the "sensor array" onto which the assayed samples were placed
was connected by a detachable cord. What if it is not present? What if
it IS present but one of the conductors in the cord has failed? What if
the cord is connected and intact but the array has been "soiled" by a
sample (rendering portions of it unusable)? (these are actions that the
USER -- not the technician -- could initiate)

IME, it's foolish to blindly rely on anything being as you hope. If
you NEED something to be a certain way, then you have to do whatever it
takes to gain confidence that it IS that way.

[Think about how much happens inside a PC that the manufacturers'
likely didn't INTEND in creating their designs. Overclocking processors,
replacing CPUs and active coolers, adding daughter cards (does anyone actually
verify that their system can electrically -- not just mechanically -- support
al of these things? or, do they just plug them in and "let's see if it
works"??)]

>>> If you are going to try to make sensible decisions about what can fail,
>>> and where it is useful to test, you need to understand how devices work
>>> - devices that you are using /today/, not systems from 50 years ago.
>>> Otherwise your testing is counter-productive as the tests have higher
>>> risks of failures than the thing you are testing.
>>
>> How is a RAM test going to fail post deployment that didn't happen
>> prior to release? POST/BIST are considerably easier to "get right"
>> than application code. Their goals are much more concretely defined
>> and implementation verified.
>
> Never underestimate the complexity of these things, nor the ability of
> software developers to get things wrong.

As I said, there is a difference between POST/BIST and "diagnostics".
The former provide a basic reassurance of expected operating condition.
The latter provide (often exhaustive) analysis to QUANTIFY the operating
condition.

How many ECC errors do you tolerate in your product? Do you try to
recover/self-heal from problems -- or, just illuminate "check engine"?
How do you handle a checksum error in your ROM/FLASH -- do you reload
a backup copy or panic()? Do you keep track of how OFTEN you are doing
this? Or, do you just do it open-loop? What costs have you ADDED to
your product (and passed along to the customer) to support these "fixes"?

How costly is it to your customer (and, by extension, YOU!) to encounter
an error and have to take some remedial action (even if that is just an
irate phone call)? How long do you expect your customer to keep the
device in service? How reluctant will he be to "upgrade" (for enhanced
functionality OR to fix a fault)? Does he already bear the cost of
maintaining kit similar to yours? Or, is this a cost he's going to be
unhappy with bearing?

In the 80's, I designed a bit of medical kit that cost a few hundred dollars
to produce. A firmware upgrade/fix cost $600 in labor to perform if the
device was sited "just down the road". You can imagine there was a big
emphasis on NOT having to update the firmware and to be able to provide
an indication of machine faults that the user could convey to support
staff over the phone (instead of requiring a visit). The same sort of costs
were present if I had to replace (swap out, repair at depot) a display
board, power supply, backup battery, etc.

Much consumer kit places the cost of maintenance on the consumer.
Worst case, he returns the product for a refund. This is a costly
proposition because you've lost more than you would have made on
the sale (handling the return) AND have likely annoyed a customer who
MIGHT have represented repeat business -- as well as performing in an
advertising role (word of mouth).

Industrial kit often has local support staff on hand that can diagnose
problems (IF your product and documentation provide a means for them to
do so). But, the cost of that staff is figured into the "burden"
your product imposes; if they are spending inordinate amounts of time
fixing YOUR problems, then your products suffer in their eyes (cuz
management is always under pressure to "do more with less" -- staff)

My experience has been that providing MORE information to a user
always works to the manufacturer's advantage. A user confronted
with a flashing red light will cost you more (even if you don't lose
the sale) than a user who is told to "check connection at J1".
Anything that removes a potential "issue" from his thought process
is an improvement ("How do I know that the cache memory isn't defective?
Is he testing that, too? Am I going to spend hours tracking down
a problem that's buried in a place that I can't access/test?")

>> "50 years ago" you didn't have SRAM suffering from disturb errors.
>> Yet, now this is a fact of life for even caches. Technology advances
>> and, with it, come new "challenges".
>
> Yes, "disturb errors" as you call them - "single-event upsets",
> bit-flips, etc., are a possibility with ram. They are more likely in
> dynamic ram, but can occur in small, fast static ram cells. And POSTs
> and other ram checks are totally and completely /useless/ at identifying
> them or dealing with them. That is why I say you need to understand the
> hardware and the possible failure modes in order to make reliable systems.

Please tell me where I indicated that puzz should be checking for
disturb errors in SRAM, DRAM or FLASH (where all can occur -- as well as
in "junk logic"). You can't just run a simple, quick test to determine
if you have a problem with these.

OTOH, if you have a system that is running and can "do this on the side"
(with or without hardware EDAC), then you can compile statistics regarding
their likely frequency.

If you DON'T have a closed system, you can also use these observations as
indicators of possible "attacks" or poorly coded applications (that, left
to their own BENIGN devices, could compromise your system). If you notice
WHEN they occur, you can also take actions to thwart them (e.g., if
TryToGainRoot() is the active process when a statistically greater frequency
of such events occurs, then you might want to blacklist TryToGainRoot()
so that it never runs, again.)

> Are you sure you understand what POSTs can do, and the difference
> between transient failures and static failures?

You do understand that there are differences between truly transient (i.e.,
self-healing) errors and persistent consequences of things like SEUs?

Are you sure the code in your FLASH (ROM) is intact, NOW (assuming XIP)?
Are you sure the code that you loaded from that FLASH into (S/D)RAM hasn't
been corrupted, NOW (ignore the effects of bugs)?

Will your customer notice if it has been corrupted? Will the consequences
of the corruption be masked (by whatever)? Or, will it manifest in a
spectacular way?

[There have been several studies of how resilient various applications
are to memory errors. Given that they can occur "anywhere", it's easy to
see how some can be masked or contribute to "system noise". But, that's
not a given for all...]

What are you doing about this, besides hoping to catch it at the next POST
(assuming you even bother to test for it)?

>> I suggest you've been basing your assumptions on SRAM reliability on
>> 50 year old anecdotes and not the consequences of more modern
>> implementations,
>> shrinking device geometries and lower operating voltages. Have a run
>> through
>> the literature to see...
>
> You are the one that was discussing 50 year old anecdotes!

I'm showing how YOUR confidence in SRAM is rooted in 50 year old
anecdotes and not "modern practices".

>>>> [Picking the "world's most reliable MCU" won't guarantee that it won't
>>>> throw
>>>> RAM errors in a deployed product.]
>>>
>>> /Nothing/ will give you guarantees like that. But if you pick a
>>> microcontroller with ECC on its onboard ram (and cache, if it has it),
>>> you reduce, by many orders of magnitude, the risk of single-event upsets
>>> (such as cosmic rays) leading to failures of the system. Anything else
>>> you can do in software is pointless in comparison. "Testing" your ram
>>> can't possibly detect such issues.
>>>
>>> Not many products justify the extra expense of such microcontrollers,
>>> but they are available for those that need them.
>>
>> Few designs have the features that they require, let alone DESIRE.
>> Unless you're working in a market where customers will pay "whatever it
>> takes", most designs have to live with some subset of what they would
>> LIKE to have in their product.
>
> In a safety-critical system, the cost of using a microcontroller with
> ECC ram is negligible. These are used all the time in the automotive
> industry.

So, only safety critical products need to work, reliably? It must be
really easy designing with a bar set that low!

You don't need to rely on hardware EDAC to improve your confidence in
the retentive powers of the RAM (any RAM). That just provides a more
immediate indication of a particular detected/corrected fault.

It's not uncommon for me to have running checksum processes that continually
scan the program store looking for "disturbances". I can't necessarily point
to a specific location. Or, an exact time at which the disturbance crept into
the system.

But, I *do* know that the contents of that memory region are no longer what
they SHOULD be. If I have hardware protecting write access to that region,
then I can deduce that the error is caused by a fault in a device (even if
I can't point to a specific device).

In either case, I can't vouch for my product's "output"/functionality.
(Or, I can stick my head in the sand and assume that memory is never
corrupted)

Hardware EDAC also only tells you about errors in REFERENCED locations.
So, if your code doesn't reference every location "frequently" (for some
value of "frequently"), you may not discover the corruption until hours
after it occurred. And, the single error may have become a multiple-bit
error -- now your EDAC (SECDEC) is useless.

[This is the same false sense of security that folks using RAID rely on;
if you aren't looking at EVERYTHING periodically, then you have no idea
as to whether or not it's been corrupted and/or is recoverable (hence
the reason for patrol reads)]

>>>> Simply assuming it "can't fail" is naive.
>>>
>>> Of course. Simply assuming that you can do a test at startup and think
>>> that makes the system more reliable is at least equally naïve.
>>
>> You miss the point of POST. It doesn't MAKE a system more reliable.
>
> I know it doesn't do that - I've been saying this all along.

Then why are you assuming *I* am professing that?

>> Instead, it tells you when a system is not meeting your expectations.
>> This is true of ALL testing. You have a defined point in time -- and
>> operating conditions -- in which you hope to catch a failure so that
>> you can report on it. A user (customer) is more willing to accept
>> "there's a flashing red light on the device" than "the &*^($^& thing
>> doesn't work worth a sh*t -- but I can't provide Tech Support with
>> any information beyond the fact that I'm frustrated and UNHAPPY WITH
>> MY PURCHASE"
>
> For /some/ devices, some kind of POST can be useful. For many, it is
> pointless - it does not detect the failures that actually matter, and
> can only detect ones that have negligible chances of occurring.

You install POST/BIST *before* you release the product. You likely
discover hardware reliability problems AFTER the design is complete
(potentially after it has been released to manufacturing). Few people
intentionally design with poor reliability as a goal, implied or
otherwise.

You don't know what your problems will be -- until you start doing
/post mortems/ on returned product. This is the WORST time to find
out because you likely have lots of product in the field before
you can see a pattern in their failures. Now you throw away profit
and reputation in trying to compensate for those shortcomings.

> If you have a device that is regularly restarted, and where the hardware
> is so fault-prone that you really are finding problems with a POST, then
> yes - go for it.
>
> All I am arguing for is that people /think/ before making a POST, and do
> some analysis and investigation to see if it really is a useful feature.

An engineer should always be "thinking" (not necessarily true of a
"programmer"). But, there are costs to "omissions" that can be
sizeable.

>> BUT, the cost and ease of testing RAM (regardless of technology) at
>> power up
>> is typically easy to bear in a product's design. It costs me a fraction of
>> a second to give a cursory test of 500MB. Chances are, I'm going to find
>> failures THERE instead of "dubious behaviors" in the running product.
>
> Do you understand the concept of cost/use analysis? If something is
> useless, or worse than useless, it doesn't help if it is cheap. Well,
> it helps for the marketing folks.

Again, you're assuming it IS "useless". Most memory failures that
I've encountered are caught in a POST -- stuck at faults, decode
faults or problems with "external factors". By catching them, there,
before the application runs, I avoid annoying the user. (yeah, he
may be disappointed that the device won't run -- or will only
run with reduced capabilities -- but he won't be annoyed that he
produced $30,000 of stainless steel parts that are out of tolerance.
Or, that 8 hours' production of pharmaceuticals have to be scrapped
(cuz you can't test millions of individual tablets!)

>>>> And, identifying faulty "can't happen" behavior EARLY (e.g. POST) rather
>>>> than late gives you a better idea of what to report to the user/customer
>>>> because you are closer to the problem's manifestation. You don't end
>>>> up misbehaving and wondering "why?"
>>>>
>>>> [And, all of this assumes "bugfree software" so any errors are
>>>> entirely a result of hardware faults]
>>>
>>> And there is perhaps your biggest invalid assumption. Software is
>>> always a risk. Software that can't be properly tested is a
>>> significantly higher risk. Software designed to handle situations that
>>> cannot possibly be reproduced for testing purposes, cannot be properly
>>> tested. So writing software test routines for something that has no
>>> realistic chance of happening in the field, /reduces/ the reliability of
>>> the product.
>>
>> YOUR biggest invalid assumption is that is has no realistic chance of
>> happening.
>
> Again, in your enthusiasm you have failed to notice what I have written
> repeatedly. If there is a /realistic/ chance of a failure, then it will
> often make sense to test for it. If there is no such chance - or
> negligible chance of it failing without some other major failure, or
> nothing you can do about a failure, then there is no point in trying to
> test.

But you dismiss this testing as being targeted at something that "won't
happen". I contend that it will and does. (though I can't speak re:
the OP's specific product)

>> Your SECOND biggest assumption is thinking that folks who are qualified
>> to write application software (for often ill-defined scenarios) are
>> NOT capable of developing reliable test programs (for very WELL-DEFINED
>> scenarios).
>
> That is often a realistic assumption - different people specialise in
> different things. However, it was not an assumption I made - again, you
> seem to prefer to make things up than read my posts.

You've stated that adding the test(s) decreases reliability. Do the tests
physically damage the product? If not, then the only potential downside
is if they are implemented defectively -- hence the above.

> Software is always a risk. It might be low risk, but it is always a risk.
>
>> Do you think *all* MCU-device failures are simply attributable to software
>> bugs? Why test anything? ASSUME the power supply and power conditioning
>> circuitry will never fail. Assume the various I/Os will never fail.
>> Blame every failure on "it must be a bug". Never scrap returned product
>> cuz all it needs -- along with every unit coming off the line, TODAY -- is
>> a reflash!
>
> Another wild idea all of your own.
>
>> Are all of your products short-lived and in inconsequential applications?
>
> I've made systems that are buried in concrete in oil installations,
> working for decades. Do I do that by relying on POSTs, memory tests and
> perhaps a watchdog? No.

Instead, you rely on expensive staff being available in the event that
a problem occurs. Thats not the case with most products or customers.

I design differently for environments where I can reasonably expect to
have "capable" staff on hand. I expose more details about what I've
"noticed" in my product(s) so they can use that to determine how to
further test, repair or replace the items. This is no different than
"test equipment" manufacturers making diagnostic and calibration
procedures available to end users.

In some cases, downtime is paramount so I design the entire product
with ease of replacement in mind -- swap out the questionable unit,
install the spare, forward the old one to us for analysis (or do
your own testing, "offline"). This is more than just thinking about
making it replaceable; you also have to consider the activities that
will be involved in making that replacement!

In consumer applications, the typical remedy is to have the consumer
get annoyed -- dealing with online "chat", or phone support -- as even
the simplest problems (operator error) take hours or more to resolve
("The current hold time is 27 minutes.") This has a direct cost to
the manufacturer (support staff, repairs, returns) as well as an indirect
cost (pissed off customer who typically is more willing to badmouth a
disappointing product than praise a delightfully performant one!).

The dollars involved "per incident" vary -- as do the quantities.
But, I can survive a "bad experience" (in THEIR minds) with an industrial
user more readily; they might make me squirm a bit or may extract
other concessions from me going forward... but, chances are, they
aren't going to pull all of my products and move on to a competitor.
It's a more rational "business decision" instead of an EMOTIONAL
reaction (for a consumer).

OTOH, if I misdiagnose or mistreat a patient and some litigation
(and possibly loss) ensues, I can likely write off that business
for the foreseeable future (even if I don't directly incur those
losses)!

>> Do some reading. You'll learn something.
>
> Try it yourself. You could start by reading what I wrote. Then, when
> you have learned a bit about this stuff, you can start applying a bit of
> /thought/ to the process. And when you look at my posts here, you'll
> see that what I have been advocating is that people /think/ about what
> they are doing with tests - what are they actually trying to achieve,
> what use it is, what the risks are. And stop making pointless code just
> because you can.

You've not JUST said that. You've said testing SRAM is pointless
because (effectively) it never fails.

David Brown

unread,

Sep 24, 2020, 10:21:11 AM9/24/20

Agreed.

> I brought up the example to show that this
> type of error CAN occur.

Fair enough (and I specifically asked for such examples).

> Yes, unless some externally imposed requirement
> says to test internal ram, I am unlikely to add such a test for a
> production system (I have at time done it in development, mostly to
> confirm that I understand the limitations and operation of the device).
>

And that is fine, of course.

I've also had such tests in production, especially for external memories
- it confirms there are no (obvious) soldering defects.

Mike Perkins

unread,

Sep 30, 2020, 10:14:38 AM9/30/20

https://www.st.com/en/embedded-software/stm32-classb-spl.html

If I recall it uses a timer which you will have to repurpose if you use
it yourself.

HTH

Dave Nadler

unread,

Oct 2, 2020, 8:29:20 PM10/2/20

On Thursday, September 24, 2020 at 10:10:05 AM UTC-4, Don Y wrote:
> Failures rarely occur when the device IS off. But, the act of
> removing power to a device is just as hazardous as APPLYING power.

Absolutely!
My two favorites, sadly seen multiple times in shipping systems where
I was asked to find out what happened, both involving capacitors:

1) Large cap added to uC ADC to 'smooth things out'.
Guess where that stored energy goes when power is removed?

2) Large cap after secondary linear regulator.
Guess where that stored energy goes when power is removed?

Sorry nothing to do with POST...
See ya, Dave

David Brown

unread,

Oct 3, 2020, 12:22:54 PM10/3/20

I like to put a diode from each low-voltage power lines to the next
higher voltage line. That way you get simple and clean power-off
sequencing automatically. (Yes, that's a little bit of a simplification
- if you've got boost regulators, standby power lines, etc., then it
gets more complicated.)

Dave Nadler

unread,

Oct 3, 2020, 12:27:50 PM10/3/20

Right, to avoid exceeding reverse-voltage limits on the step-down regulator,
you often need a Schottky diode, but they can have non-trivial reverse leakage..

Don Y

unread,

Oct 3, 2020, 3:09:04 PM10/3/20

On 10/2/2020 5:29 PM, Dave Nadler wrote:
> On Thursday, September 24, 2020 at 10:10:05 AM UTC-4, Don Y wrote:
>> Failures rarely occur when the device IS off. But, the act of
>> removing power to a device is just as hazardous as APPLYING power.
>
> Absolutely!
> My two favorites, sadly seen multiple times in shipping systems where
> I was asked to find out what happened, both involving capacitors:
>
> 1) Large cap added to uC ADC to 'smooth things out'.
> Guess where that stored energy goes when power is removed?

What about the off-board cumzintas and gozoutas? What can they
potentially encounter that could deleteriously affect (yet not
spectacularly destroy) the functionality of the circuit?

While you can put invariants in your code to constrain/verify the
nature of inputs at some place in the code, doing so with hardware
requires incurring recurring costs and an active imagination to
consider the many ways your device can be "misapplied".

How often do your products see any formal shake-n-bake prior to release?
Any predictive analysis as to their expected longevity? Expected
warranty repair costs?

Do you routinely get reports from manufacturing test as to the
problems they encountered after formal release? If not YOUR
responsibility, does the manufacturing engineer consult (even
informally) with you so you are aware of the problems that
your designs are experiencing "getting out the door"?

What about returned devices? Do you get a report as to why
they are being returned? Are there genuine failures? UX issues?
How do you close your design loop so the next product gets
*better*?

[Yeah, you might be 99.73% sure that the user "did something
wrong"... but, do you refuse to replace the device and risk
the customer's wrath (and bad "publicity") just to cut your
warranty costs?]

I rescued a Nest thermostat some time ago (wanting it for it's
ergonomics, not electronics). It had been scrapped because the
Rh input was no longer operational (the software can detect
the presence of signals on each wire and could never see, nor
utilize, the signal on Rh).

Did the user legitimately connect the device to their home
wiring and it failed -- from NORMAL use? Or, did the user
screw up, in some way?

> 2) Large cap after secondary linear regulator.
> Guess where that stored energy goes when power is removed?
>
> Sorry nothing to do with POST...

The problem with "power" -- for most devices -- is that the designer
(and, thus, the device) has very little control of it after deployment.

The user can rapidly reapply power after removing it (off, then on,
quickly). Will the supplies come up in the prescribed sequence, given
that they may not have completely discharged to their NORMAL "starting
point" (0V)?

The mains can do the same.

The mains can sag or run hot -- whenever and for as long as they want.
Does your device "know" when this is happening? Or, does it only
"know" in the sense that it is stressed beyond its nominal design
conditions?

The "wrong" power supply can be used. E.g., a wall wart that "happens
to fit" but is the wrong voltage (low or high), polarity (+, -, AC),
or inadequate ampacity.

The power supply may fail to age "gracefully" (does anyone really think
all these crappy 'lytics were used in products INTENTIONALLY?). Were
those 2000Hr caps? ... now still trying to perform 3 years down the road?

The power supply may be considered a "check off" item and not truly
designed to the actual usage of the device in question. E.g.,
a device left powered on 24/7/365 sees ~8000 PoHrs each year.
Are the components rated for that (esp low ESR filter caps used
in the switcher)? Was the wall wart outsourced to the cheapest
vendor on Alibaba? ("Yes, it's 12V @3A... you can trust us!")

A device may see a different usage pattern than intended (any
environmental factor stressed: time, temperature, humidity, etc.).

I power the devices in my current design via PoE. So, it is
worth my design effort to ensure the PSE has added hooks to let
me QUANTITATIVELY verify its proper operation; any monies spent
there are amortized over the 100+ PDs.

And, as "I" am the entity powering the devices up/down, there's
little cost to tracking how often I do this (for each device)
along with the PoH that each device experiences. The alternative
is just to make FUTURE design decisions based on assumptions about
usage patterns that led to failures/faults.

David Brown

unread,

Oct 4, 2020, 6:48:17 AM10/4/20

As I said, it's a simplification.

To be more honest, what I like to do is tell the schematic designer "put
a diode or something between these lines to get a controlled power-off
sequencing". The details are not my area of expertise.

0 new messages