Well, if the "clock, voltage supply or overheating" is the problem -- and
you can't DIRECTLY test for any of those -- then why are you testing ANYTHING
(except as secondary evidence that some ASSUMPTION your design relies upon
has been violated -- clock, volts, temp)?
>> The whole point of BIST/POST is to provide a point in time where failures
>> will hopefully manifest -- instead of SILENTLY affecting the operation
>> of the device in question, in typically unpredictable ways.
>
> Failures rarely occur when a device is switched off. They happen when
> the device is running. (They also happen during production or putting
> together a system, and it's worth doing checks then.)
Failures rarely occur when the device IS off. But, the act of removing power
to a device is just as hazardous as APPLYING power. Power supplies are
rarely designed to cleanly go up and down without inflicting transients
on the devices they power. Many designers fail to note, carefully, how
power transitions are expected to be managed (in ages past, with many
supplies per device, this was more "in your face" and less easy to
ignore).
Of course, to a typical user, the failure will only manifest when the device is
NEXT powered up. You can't test while it's powered down!
> If you think that failures might realistically occur, and the tradeoffs
> between costs, reliability, safety, etc., warrant it, then you put in
> the appropriate level of failure detection and mitigation at /runtime/
> in the system. There's little help in the failure leading to operation
> problems, and then saying afterwards that you could have spotted that
> problem in a POST.
POST provides a reassurance that "all appears well". It can't be thorough
because it runs in series with "bringing the system on-line" -- and
few people are willing to wait for exhaustive tests to complete when they
will typically not uncover errors.
But, systems/devices *routinely* fail POST -- for a variety of reasons.
Some may be misapplication (the user has done something he shouldn't).
Some are hardware faults (the system hasn't endured as expected). Some
come from tampering (nowadays, you can rest assured that folks WILL open
your product and try to tinker with it... to increase memory, enable
an unused feature, patch the firmware, access "hidden" capabilities, etc.).
Your code, however, is based on a set of assumptions -- some formally
codified and some simply internalized. Before it runs, it should verify
that those assumptions are valid, NOW (or, just shrug if the product
misbehaves).
I designed a device used in performing blood assays. It had socketed
DRAM (DIPs) to allow the data store to be increased in 6KB increments
(replace a 16Kx1 DRAM with a 64Kx1 DRAM and you've got 6KB more capacity).
Of course, I had to "size" and "query" the data store's complexion on
startup (which devices are 16Kb and which are 64Kb) -- a sketch of that
check appears below. But, I also had to address the fact that the
technician in the hospital may have removed ALL of the devices (shame
on him! but, maybe he simply forgot to install the new set?) *or* left
one "bit lane" empty (I used a portion of the lower 16KB as "system RAM"
so can't do much without it).
Do I just wait until someone tries to use the device and then <cough>...
while they have a micropipette loaded with a blood sample in their hand?
I've got no writeable memory -- how can I tell the user that this has
happened? Do I just start "squealing" to induce a panic??
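
Roughly, the sizing/lane check works like the sketch below (NOT the
original firmware -- the base address and names are purely illustrative).
It assumes a 16Kx1 part ignores the upper address lines, so a write past
the 16K boundary aliases back onto the low region, while a 64Kx1 part
keeps the two locations distinct:

#include <stdint.h>

#define BANK_BASE  ((volatile uint8_t *)0x4000u)   /* illustrative base  */
#define SIZE_16K   (16u * 1024u)
#define SIZE_64K   (64u * 1024u)

/* A bit lane that can't hold both 0 and 1 is absent or dead. */
uint8_t missing_lanes(void)
{
    volatile uint8_t *p = BANK_BASE;
    uint8_t bad = 0;

    *p = 0x00u;  bad |= *p;              /* bits stuck high            */
    *p = 0xFFu;  bad |= (uint8_t)~*p;    /* bits stuck low / missing   */
    return bad;                          /* nonzero => unusable lanes  */
}

/* Distinguish 16Kx1 from 64Kx1 parts by looking for address aliasing. */
uint32_t bank_size(void)
{
    volatile uint8_t *lo = BANK_BASE;
    volatile uint8_t *hi = BANK_BASE + SIZE_16K;

    *lo = 0x55u;
    *hi = 0xAAu;              /* aliases onto *lo if 16K parts fitted */

    return (*lo == 0xAAu) ? SIZE_16K : SIZE_64K;
}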
Similarly, the "sensor array" onto which the assayed samples were placed
was connected by a detachable cord. What if it is not present? What if
it IS present but one of the conductors in the cord has failed? What if
the cord is connected and intact but the array has been "soiled" by a
sample (rendering portions of it unusable)? (these are actions that the
USER -- not the technician -- could initiate).
IME, it's foolish to blindly rely on anything being as you hope. If
you NEED something to be a certain way, then you have to do whatever it
takes to gain confidence that it IS that way.
[Think about how much happens inside a PC that the manufacturers
likely didn't INTEND when creating their designs. Overclocking processors,
replacing CPUs and active coolers, adding daughter cards (does anyone actually
verify that their system can electrically -- not just mechanically -- support
all of these things? or, do they just plug them in and "let's see if it
works"??)]
>>> If you are going to try to make sensible decisions about what can fail,
>>> and where it is useful to test, you need to understand how devices work
>>> - devices that you are using /today/, not systems from 50 years ago.
>>> Otherwise your testing is counter-productive as the tests have higher
>>> risks of failures than the thing you are testing.
>>
>> How is a RAM test going to fail post deployment that didn't happen
>> prior to release? POST/BIST are considerably easier to "get right"
>> than application code. Their goals are much more concretely defined
>> and implementation verified.
>
> Never underestimate the complexity of these things, nor the ability of
> software developers to get things wrong.
As I said, there is a difference between POST/BIST and "diagnostics".
The former provide a basic reassurance of expected operating condition.
The latter provide (often exhaustive) analysis to QUANTIFY the operating
condition.
How many ECC errors do you tolerate in your product? Do you try to
recover/self-heal from problems -- or, just illuminate "check engine"?
How do you handle a checksum error in your ROM/FLASH -- do you reload
a backup copy or panic()? Do you keep track of how OFTEN you are doing
this? Or, do you just do it open-loop? What costs have you ADDED to
your product (and passed along to the customer) to support these "fixes"?
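
As a concrete (hedged) illustration of the FLASH question, the shape of
the decision is something like this -- verify_image(), load_backup_image(),
nv_counter_bump() and panic() are invented names standing in for whatever
your platform actually provides:

#include <stdbool.h>
#include <stdint.h>

extern bool     verify_image(const void *img, uint32_t len, uint32_t crc);
extern bool     load_backup_image(void);         /* false if no backup    */
extern uint32_t nv_counter_bump(const char *name); /* persistent counter  */
extern void     panic(const char *why);

#define REPAIR_LIMIT 3u    /* illustrative: give up after repeated fixes */

void check_program_store(const void *img, uint32_t len, uint32_t crc)
{
    if (verify_image(img, len, crc))
        return;                              /* all appears well        */

    /* Not open-loop: count how OFTEN we have had to repair the store.  */
    if (nv_counter_bump("flash_repairs") > REPAIR_LIMIT
        || !load_backup_image())
        panic("program store corrupt");      /* illuminate check engine */
}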
How costly is it to your customer (and, by extension, YOU!) to encounter
an error and have to take some remedial action (even if that is just an
irate phone call)? How long do you expect your customer to keep the
device in service? How reluctant will he be to "upgrade" (for enhanced
functionality OR to fix a fault)? Does he already bear the cost of
maintaining kit similar to yours? Or, is this a cost he's going to be
unhappy with bearing?
In the '80s, I designed a bit of medical kit that cost a few hundred dollars
to produce. A firmware upgrade/fix cost $600 in labor to perform if the
device was sited "just down the road". You can imagine there was a big
emphasis on NOT having to update the firmware and to be able to provide
an indication of machine faults that the user could convey to support
staff over the phone (instead of requiring a visit). The same sort of costs
were present if I had to replace (swap out, repair at depot) a display
board, power supply, backup battery, etc.
Much consumer kit places the cost of maintenance on the consumer.
Worst case, he returns the product for a refund. This is a costly
proposition because you've lost more than you would have made on
the sale (handling the return) AND have likely annoyed a customer who
MIGHT have represented repeat business -- as well as performing in an
advertising role (word of mouth).
Industrial kit often has local support staff on hand that can diagnose
problems (IF your product and documentation provide a means for them to
do so). But, the cost of that staff is figured into the "burden"
your product imposes; if they are spending inordinate amounts of time
fixing YOUR problems, then your products suffer in their eyes (cuz
management is always under pressure to "do more with less" -- staff).
My experience has been that providing MORE information to a user
always works to the manufacturer's advantage. A user confronted
with a flashing red light will cost you more (even if you don't lose
the sale) than a user who is told to "check connection at J1".
Anything that removes a potential "issue" from his thought process
is an improvement ("How do I know that the cache memory isn't defective?
Is he testing that, too? Am I going to spend hours tracking down
a problem that's buried in a place that I can't access/test?")
>> "50 years ago" you didn't have SRAM suffering from disturb errors.
>> Yet, now this is a fact of life for even caches. Technology advances
>> and, with it, come new "challenges".
>
> Yes, "disturb errors" as you call them - "single-event upsets",
> bit-flips, etc., are a possibility with ram. They are more likely in
> dynamic ram, but can occur in small, fast static ram cells. And POSTs
> and other ram checks are totally and completely /useless/ at identifying
> them or dealing with them. That is why I say you need to understand the
> hardware and the possible failure modes in order to make reliable systems.
Please tell me where I indicated that puzz should be checking for
disturb errors in SRAM, DRAM or FLASH (where all can occur -- as well as
in "junk logic"). You can't just run a simple, quick test to determine
if you have a problem with these.
OTOH, if you have a system that is running and can "do this on the side"
(with or without hardware EDAC), then you can compile statistics regarding
their likely frequency.
If you DON'T have a closed system, you can also use these observations as
indicators of possible "attacks" or poorly coded applications (that, left
to their own BENIGN devices, could compromise your system). If you notice
WHEN they occur, you can also take actions to thwart them (e.g., if
TryToGainRoot() is the active process when a statistically greater frequency
of such events occurs, then you might want to blacklist TryToGainRoot()
so that it never runs again.)
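
A loose sketch of that bookkeeping: charge each corrected-error event to
whatever task was running and flag the ones whose rate stands out.
current_task_id(), quarantine_task() and the threshold are all assumptions
made up for the illustration:

#include <stdint.h>

#define TASK_MAX           32u
#define SUSPECT_THRESHOLD  16u      /* illustrative */

extern unsigned current_task_id(void);      /* 0 .. TASK_MAX-1          */
extern void     quarantine_task(unsigned id);

static uint32_t err_count[TASK_MAX];

/* Call from the ECC corrected-error interrupt/trap handler. */
void on_corrected_error(void)
{
    unsigned id = current_task_id();

    if (++err_count[id] > SUSPECT_THRESHOLD)
        quarantine_task(id);   /* e.g., blacklist it from running again */
}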
> Are you sure you understand what POSTs can do, and the difference
> between transient failures and static failures?
You do understand that there are differences between truly transient (i.e.,
self-healing) errors and persistent consequences of things like SEUs?
Are you sure the code in your FLASH (ROM) is intact, NOW (assuming XIP)?
Are you sure the code that you loaded from that FLASH into (S/D)RAM hasn't
been corrupted, NOW (ignore the effects of bugs)?
Will your customer notice if it has been corrupted? Will the consequences
of the corruption be masked (by whatever)? Or, will it manifest in a
spectacular way?
[There have been several studies of how resilient various applications
are to memory errors. Given that they can occur "anywhere", it's easy to
see how some can be masked or contribute to "system noise". But, that's
not a given for all...]
What are you doing about this, besides hoping to catch it at the next POST
(assuming you even bother to test for it)?
>> I suggest you've been basing your assumptions on SRAM reliability on
>> 50 year old anecdotes and not the consequences of more modern
>> implementations,
>> shrinking device geometries and lower operating voltages. Have a run
>> through
>> the literature to see...
>
> You are the one that was discussing 50 year old anecdotes!
I'm showing how YOUR confidence in SRAM is rooted in 50 year old
anecdotes and not "modern practices".
>>>> [Picking the "world's most reliable MCU" won't guarantee that it won't
>>>> throw
>>>> RAM errors in a deployed product.]
>>>
>>> /Nothing/ will give you guarantees like that. But if you pick a
>>> microcontroller with ECC on its onboard ram (and cache, if it has it),
>>> you reduce, by many orders of magnitude, the risk of single-event upsets
>>> (such as cosmic rays) leading to failures of the system. Anything else
>>> you can do in software is pointless in comparison. "Testing" your ram
>>> can't possibly detect such issues.
>>>
>>> Not many products justify the extra expense of such microcontrollers,
>>> but they are available for those that need them.
>>
>> Few designs have the features that they require, let alone DESIRE.
>> Unless you're working in a market where customers will pay "whatever it
>> takes", most designs have to live with some subset of what they would
>> LIKE to have in their product.
>
> In a safety-critical system, the cost of using a microcontroller with
> ECC ram is negligible. These are used all the time in the automotive
> industry.
So, only safety critical products need to work, reliably? It must be
really easy designing with a bar set that low!
You don't need to rely on hardware EDAC to improve your confidence in
the retentive powers of the RAM (any RAM). That just provides a more
immediate indication of a particular detected/corrected fault.
It's not uncommon for me to have running checksum processes that continually
scan the program store looking for "disturbances". I can't necessarily point
to a specific location. Or, an exact time at which the disturbance crept into
the system.
But, I *do* know that the contents of that memory region are no longer what
they SHOULD be. If I have hardware protecting write access to that region,
then I can deduce that the error is caused by a fault in a device (even if
I can't point to a specific device).
In either case, I can't vouch for my product's "output"/functionality.
(Or, I can stick my head in the sand and assume that memory is never
corrupted.)
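
The mechanism is nothing exotic. A thumbnail of such a "running checksum
process" might look like the following, where crc32_update(),
report_corruption() and the region symbols are assumptions for the sketch;
each call advances the CRC over a small slice so the scan stays cheap,
"on the side":

#include <stddef.h>
#include <stdint.h>

extern uint32_t crc32_update(uint32_t crc, const void *p, size_t len);
extern void     report_corruption(void);

extern const uint8_t  __text_start[], __text_end[];  /* linker symbols   */
extern const uint32_t expected_crc;                  /* recorded at boot */

#define SLICE 256u          /* bytes checked per call                    */

void scrub_step(void)
{
    static const uint8_t *cursor = __text_start;
    static uint32_t       crc    = 0xFFFFFFFFu;

    size_t remain = (size_t)(__text_end - cursor);
    size_t chunk  = remain < SLICE ? remain : SLICE;

    crc     = crc32_update(crc, cursor, chunk);
    cursor += chunk;

    if (cursor >= __text_end) {              /* one full pass complete  */
        if ((crc ^ 0xFFFFFFFFu) != expected_crc)
            report_corruption();  /* contents no longer what they SHOULD be */
        cursor = __text_start;               /* start the next pass     */
        crc    = 0xFFFFFFFFu;
    }
}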
Hardware EDAC also only tells you about errors in REFERENCED locations.
So, if your code doesn't reference every location "frequently" (for some
value of "frequently"), you may not discover the corruption until hours
after it occurred. And, the single error may have become a multiple-bit
error -- now your EDAC (SECDED) can detect it but no longer correct it.
[This is the same false sense of security that folks using RAID rely on;
if you aren't looking at EVERYTHING periodically, then you have no idea
as to whether or not it's been corrupted and/or is recoverable (hence
the reason for patrol reads).]
>>>> Simply assuming it "can't fail" is naive.
>>>
>>> Of course. Simply assuming that you can do a test at startup and think
>>> that makes the system more reliable is at least equally naïve.
>>
>> You miss the point of POST. It doesn't MAKE a system more reliable.
>
> I know it doesn't do that - I've been saying this all along.
Then why are you assuming *I* am professing that?
>> Instead, it tells you when a system is not meeting your expectations.
>> This is true of ALL testing. You have a defined point in time -- and
>> operating conditions -- in which you hope to catch a failure so that
>> you can report on it. A user (customer) is more willing to accept
>> "there's a flashing red light on the device" than "the &*^($^& thing
>> doesn't work worth a sh*t -- but I can't provide Tech Support with
>> any information beyond the fact that I'm frustrated and UNHAPPY WITH
>> MY PURCHASE"
>
> For /some/ devices, some kind of POST can be useful. For many, it is
> pointless - it does not detect the failures that actually matter, and
> can only detect ones that have negligible chances of occurring.
You install POST/BIST *before* you release the product. You likely
discover hardware reliability problems AFTER the design is complete
(potentially after it has been released to manufacturing). Few people
intentionally design with poor reliability as a goal, implied or
otherwise.
You don't know what your problems will be -- until you start doing
/post mortems/ on returned product. This is the WORST time to find
out because you likely have lots of product in the field before
you can see a pattern in their failures. Now you throw away profit
and reputation in trying to compensate for those shortcomings.
> If you have a device that is regularly restarted, and where the hardware
> is so fault-prone that you really are finding problems with a POST, then
> yes - go for it.
>
> All I am arguing for is that people /think/ before making a POST, and do
> some analysis and investigation to see if it really is a useful feature.
An engineer should always be "thinking" (not necessarily true of a
"programmer"). But, there are costs to "omissions" that can be
sizeable.
>> BUT, the cost and ease of testing RAM (regardless of technology) at
>> power up
>> is typically easy to bear in a product's design. It costs me a fraction of
>> a second to give a cursory test of 500MB. Chances are, I'm going to find
>> failures THERE instead of "dubious behaviors" in the running product.
>
> Do you understand the concept of cost/use analysis? If something is
> useless, or worse than useless, it doesn't help if it is cheap. Well,
> it helps for the marketing folks.
Again, you're assuming it IS "useless". Most memory failures that
I've encountered are caught in a POST -- stuck-at faults, decode
faults or problems with "external factors". By catching them there,
before the application runs, I avoid annoying the user. (Yeah, he
may be disappointed that the device won't run -- or will only
run with reduced capabilities -- but he won't be annoyed that he
produced $30,000 of stainless steel parts that are out of tolerance.
Or, that 8 hours' production of pharmaceuticals has to be scrapped --
cuz you can't test millions of individual tablets!)
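
For reference, the sort of cursory test I mean is nothing more than a
walking-ones pass over the data bus plus a power-of-two address sweep --
enough to catch stuck-at and decode faults in a fraction of a second,
without pretending to be a "diagnostic". The pointers passed in are
placeholders for whatever region the product actually has:

#include <stddef.h>
#include <stdint.h>

/* Walking-ones at a single location: finds data lines stuck or shorted. */
int test_data_bus(volatile uint32_t *addr)
{
    for (uint32_t pattern = 1u; pattern != 0u; pattern <<= 1) {
        *addr = pattern;
        if (*addr != pattern)
            return -1;
    }
    return 0;
}

/* Marks at power-of-two offsets: finds address lines stuck/open/shorted. */
int test_address_bus(volatile uint32_t *base, size_t nwords)
{
    const uint32_t mark = 0xAAAAAAAAu, anti = 0x55555555u;

    for (size_t off = 1; off < nwords; off <<= 1)
        base[off] = mark;

    base[0] = anti;                      /* must not disturb the marks   */
    for (size_t off = 1; off < nwords; off <<= 1)
        if (base[off] != mark)
            return -1;

    for (size_t off = 1; off < nwords; off <<= 1) {
        base[off] = anti;                /* exercise each line alone     */
        for (size_t chk = 1; chk < nwords; chk <<= 1)
            if (chk != off && base[chk] != mark)
                return -1;
        base[off] = mark;
    }
    return 0;
}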
>>>> And, identifying faulty "can't happen" behavior EARLY (e.g. POST) rather
>>>> than late gives you a better idea of what to report to the user/customer
>>>> because you are closer to the problem's manifestation. You don't end
>>>> up misbehaving and wondering "why?"
>>>>
>>>> [And, all of this assumes "bugfree software" so any errors are
>>>> entirely a result of hardware faults]
>>>
>>> And there is perhaps your biggest invalid assumption. Software is
>>> always a risk. Software that can't be properly tested is a
>>> significantly higher risk. Software designed to handle situations that
>>> cannot possibly be reproduced for testing purposes, cannot be properly
>>> tested. So writing software test routines for something that has no
>>> realistic chance of happening in the field, /reduces/ the reliability of
>>> the product.
>>
>> YOUR biggest invalid assumption is that it has no realistic chance of
>> happening.
>
> Again, in your enthusiasm you have failed to notice what I have written
> repeatedly. If there is a /realistic/ chance of a failure, then it will
> often make sense to test for it. If there is no such chance - or
> negligible chance of it failing without some other major failure, or
> nothing you can do about a failure, then there is no point in trying to
> test.
But you dismiss this testing as being targeted at something that "won't
happen". I contend that it will and does. (though I can't speak re:
the OP's specific product)
>> Your SECOND biggest assumption is thinking that folks who are qualified
>> to write application software (for often ill-defined scenarios) are
>> NOT capable of developing reliable test programs (for very WELL-DEFINED
>> scenarios).
>
> That is often a realistic assumption - different people specialise in
> different things. However, it was not an assumption I made - again, you
> seem to prefer to make things up than read my posts.
You've stated that adding the test(s) decreases reliability. Do the tests
physically damage the product? If not, then the only potential downside
is if they are implemented defectively -- hence the above.
> Software is always a risk. It might be low risk, but it is always a risk.
>
>> Do you think *all* MCU-device failures are simply attributable to software
>> bugs? Why test anything? ASSUME the power supply and power conditioning
>> circuitry will never fail. Assume the various I/Os will never fail.
>> Blame every failure on "it must be a bug". Never scrap returned product
>> cuz all it needs -- along with every unit coming off the line, TODAY -- is
>> a reflash!
>
> Another wild idea all of your own.
>
>> Are all of your products short-lived and in inconsequential applications?
>
> I've made systems that are buried in concrete in oil installations,
> working for decades. Do I do that by relying on POSTs, memory tests and
> perhaps a watchdog? No.
Instead, you rely on expensive staff being available in the event that
a problem occurs. That's not the case with most products or customers.
I design differently for environments where I can reasonably expect to
have "capable" staff on hand. I expose more details about what I've
"noticed" in my product(s) so they can use that to determine how to
further test, repair or replace the items. This is no different than
"test equipment" manufacturers making diagnostic and calibration
procedures available to end users.
In some cases, minimizing downtime is paramount, so I design the entire product
with ease of replacement in mind -- swap out the questionable unit,
install the spare, forward the old one to us for analysis (or do
your own testing, "offline"). This is more than just thinking about
making it replaceable; you also have to consider the activities that
will be involved in making that replacement!
In consumer applications, the typical remedy is to have the consumer
get annoyed -- dealing with online "chat", or phone support -- as even
the simplest problems (operator error) take hours or more to resolve
("The current hold time is 27 minutes.") This has a direct cost to
the manufacturer (support staff, repairs, returns) as well as an indirect
cost (pissed off customer who typically is more willing to badmouth a
disappointing product than praise a delightfully performant one!).
The dollars involved "per incident" vary -- as do the quantities.
But, I can survive a "bad experience" (in THEIR minds) with an industrial
user more readily; they might make me squirm a bit or may extract
other concessions from me going forward... but, chances are, they
aren't going to pull all of my products and move on to a competitor.
It's a more rational "business decision" instead of an EMOTIONAL
reaction (for a consumer).
OTOH, if I misdiagnose or mistreat a patient and some litigation
(and possibly loss) ensues, I can likely write off that business
for the foreseeable future (even if I don't directly incur those
losses)!
>> Do some reading. You'll learn something.
>
> Try it yourself. You could start by reading what I wrote. Then, when
> you have learned a bit about this stuff, you can start applying a bit of
> /thought/ to the process. And when you look at my posts here, you'll
> see that what I have been advocating is that people /think/ about what
> they are doing with tests - what are they actually trying to achieve,
> what use it is, what the risks are. And stop making pointless code just
> because you can.
You've not JUST said that. You've said testing SRAM is pointless
because (effectively) it never fails.