
RAM Failure modes [long -- whiners don't read]


Don Y

Apr 25, 2016, 11:18:58 PM

[crossposted; feel free to elide either group in your reply]

I have typically tested "RAM" (writeable memory) in POST for
gross errors. Usually, a couple of passes writing, then reading
back, the output of an LFSR-derived PRNG with a long and "relatively
prime" period. (One goal of POST being 'short and sweet'.)

This catches "stuck at" errors, decode errors, etc. It does
very little by way of catching soft errors (unless it gets lucky).
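
(For the sake of discussion, a bare-bones sketch of one such pass -- the
polynomial, seed, region, and names here are placeholders, not taken from
any particular design:)

#include <stdint.h>
#include <stddef.h>

/* 32-bit Galois LFSR step; taps 32,22,2,1 give a maximal (2^32 - 1) period,
   "relatively prime" to any power-of-two memory size.  Seed must be nonzero. */
static uint32_t lfsr_next(uint32_t s)
{
    return (s >> 1) ^ (-(s & 1u) & 0x80200003u);
}

/* One write pass followed by one verify pass over a word-aligned region.
   Returns 0 on success, -1 on the first mismatch (stuck-at/decode fault). */
int post_ram_pass(volatile uint32_t *base, size_t words, uint32_t seed)
{
    uint32_t s = seed;

    for (size_t i = 0; i < words; i++) {   /* fill with the PRNG stream */
        base[i] = s;
        s = lfsr_next(s);
    }

    s = seed;
    for (size_t i = 0; i < words; i++) {   /* read back and compare */
        if (base[i] != s)
            return -1;
        s = lfsr_next(s);
    }
    return 0;
}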

For modest amounts of memory and relatively short power-on times
(hours/days), this has been satisfactory.

But, for larger memory configurations and much longer up-times
(weeks/months/years), I suspect not so much.

My question(s) concern internal (MCU) memory (typically static or pseudostatic)
and external (DRAM) memory.

Additionally, memory that is used to store code (r/o) as well as data.

"Code" is protected from accidental overwrites by hardware (so, only
"suspect" if that hardware fails *or* software deliberately disables
it -- bug, no need to worry about those).

"Data" is, well, data; hard to really know WHAT it should be at any
time (unless I wrap everything in monitors, etc.).

All the memory is "soldered down" so no issues with flaky connectors,
vibration, etc. ECC is not (easily) available; I'd have to create and
verify syndromes with external logic and would be unable to do anything
more than complain (crash) when an error was detected (no ability to
rerun bus cycles).

Assume operating conditions are "within published specifications".
Separately, I'll ask about the value of putting in hardware to VERIFY that
this remains true, ongoing.

BEYOND POST...

I can verify the contents of "code" memory by periodically recomputing
checksums (hashes). I.e., compute the hash when the code
is loaded; then verify it remains unchanged during execution.
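
(Again just a sketch -- it assumes linker-provided text-segment symbols and
uses CRC-32 purely as an example hash; all names are illustrative:)

#include <stdint.h>
#include <stddef.h>

/* Start/end of the code image; assumed to come from the linker script. */
extern const uint8_t __text_start[], __text_end[];

/* Plain (reflected) CRC-32, bit-at-a-time; slow but tiny. */
static uint32_t crc32(const uint8_t *p, size_t n)
{
    uint32_t crc = 0xFFFFFFFFu;
    while (n--) {
        crc ^= *p++;
        for (int i = 0; i < 8; i++)
            crc = (crc >> 1) ^ (0xEDB88320u & -(crc & 1u));
    }
    return ~crc;
}

static uint32_t code_hash_at_load;

void code_check_init(void)        /* call once, right after loading */
{
    code_hash_at_load = crc32(__text_start, (size_t)(__text_end - __text_start));
}

int code_check_verify(void)       /* call from a low-priority task; nonzero = corrupted */
{
    return crc32(__text_start, (size_t)(__text_end - __text_start)) != code_hash_at_load;
}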

I can regularly check pages of memory as they are released from use
("free") as well as regularly swap out in use pages (code or data)
for analysis.

I can coordinate groups of such pages -- at some difficulty and cost
(in terms of idled resources) -- to check for decode errors.

And, of course, rely on various watchdogs/daemons to HOPEFULLY spot
behaviors that manifest as the result of corrupted data or code.

There are, of course, run time costs for all of this (which I can
bear -- IF they are fruitful).

[I'm operating in the 1Mb internal SRAM, 2Gb DRAM (DDR2/LPDDR) arena.]

So, the questions I have are:

- what are the typical POST DELIVERY failure modes for DRAM technologies?
internal SRAM/PSRAM?
- how do these correlate with product age, temperature, etc.? (is it
more effective to MORE tightly control certain aspects of the environment)
- is it worthwhile to monitor the environment and signal operating conditions
that are suggestive of memory failures (instead of hunting for broken bits)?
- is it better to just "refresh" memory contents inherently in the design
than to rely on them remaining static and unchanged?

Jasen Betts

Apr 26, 2016, 7:01:59 AM

On 2016-04-26, Don Y <blocked...@foo.invalid> wrote:
> [crossposted; feel free to elide either group in your reply]

> - what are the typical POST DELIVERY failure modes for DRAM technologies?
> internal SRAM/PSRAM?

"row hammer" springs to mind.

--
¯\_(ツ)_/¯

David Brown

Apr 26, 2016, 7:23:49 AM

I would first ask why you are concerned about this. I assume you have
already thought about at least some of the points below (I know you are
not doing memory testing merely for fun!), but perhaps you have not
thought about them all, and perhaps answers to them can help you or
others to find answers to your specific questions.

First, have you ever found memory problems with the systems you have?
My experience with memory is that it very rarely fails, and when it does
it is mostly a system problem (like poorly terminated buses, bad
connections, running beyond maximum speed, etc.) rather than an issue
with the memory itself. Almost all memory errors will then be caught in
a brief check of address lines and data lines during production testing
- power-up or online testing is then unnecessary.

Secondly, what would you do if you found memory problems? If you cannot
rely on your memory, it is difficult to rely on /anything/ in the
system. On many systems with ECC memory, detection of an uncorrectable
error leads to immediate shutdown because it is better to stop /now/
than to risk causing more problems. I have no idea what sort of systems
you are designing, but if you can't make such an immediate shutdown, and
you feel memory errors are a realistic concern, then perhaps you have no
choice but to use some sort of ECC memory or other redundancy rather
than trying to spot a problem after it has happened.


Don Y

Apr 26, 2016, 1:23:18 PM

On 4/26/2016 4:23 AM, David Brown wrote:
> I would first ask why you are concerned about this.

Obviously, because I am concerned with reliability and availability.

> I assume you have
> already thought about at least some of the points below (I know you are
> not doing memory testing merely for fun!), but perhaps you have not
> thought about them all, and perhaps answers to them can help you or
> others to find answers to your specific questions.
>
> First, have you ever found memory problems with the systems you have?

Historically? Yes. But, memory technology has improved greatly
in the decades that have passed. I'd never even consider a gigabit
of memory built from 4kx1 devices!

> My experience with memory is that it very rarely fails, and when it does
> it is mostly a system problem (like poorly terminated buses, bad
> connections, running beyond maximum speed, etc.) rather than an issue
> with the memory itself. Almost all memory errors will then be caught in
> a brief check of address lines and data lines during production testing
> - power-up or online testing is then unnecessary.

That's not what the literature indicates. Also, it doesn't explain why
ECC memory is used.

> Secondly, what would you do if you found memory problems? If you cannot
> rely on your memory, it is difficult to rely on /anything/ in the
> system. On many systems with ECC memory, detection of an uncorrectable
> error leads to immediate shutdown because it is better to stop /now/,

That's not true. You *expect* some number of errors in any memory subsystem.
A more interesting question is "how many ECC *corrected* errors before you
start worrying about the ability of your ECC to *detect* errors -- even
uncorrectable ones?"

> that to risk causing more problems. I have no idea what sort of systems
> you are designing, but if you can't make such an immediate shutdown, and
> you feel memory issues are a realistic issue, then perhaps you have no
> choice but to use some sort of ECC memory or other redundancy rather
> than trying to spot a problem after it has happened.

Memory errors do not imply a system has erred.

If I'm examining a page of memory that isn't currently executing
(because the program's control is currently somewhere else in
the text segment) and I find an error (hard or soft) and I
correct it or replace that page BEFORE the program has a chance
to execute any of the commands affected by that error, then
the program hasn't been compromised.

If I find an error that causes a bit to assume the value that it
*should* assume (e.g., lsb of a location is stuck at one but the
location is intended to hold the value '0x9') then, likewise, no
problem.

If I find an error that causes a bit to assume a BAD value -- but
conditions in the program effectively make that irrelevant
(e.g., the value specifies a timeout for an operation -- but,
the operation still manages to successfully complete before the
timeout expires), again, no problem.

Etc.

You can have a system apparently running successfully in spite
of ongoing errors.

Or not.

The difference is, whether you KNOW about the errors or wait to
find out about them by the system misbehaving (e.g., a watchdog
kicking in or some other VERY INDIRECT measurement of reliability).

If the only time you have to test memory is POST (or, an explicit
BIST invoked by the user), then you have to rely on interrupting
the normal services of your device in order to perform that test
and (re)gain that confidence.

"We'll be making a stop in East Bumph*ck, Iowa, while we run
a regular test on the memory in our avionics systems. We're
sorry for the delay and promise to have you back on your way
as soon as possible!"

The point of my questions is to inquire as to how people see and
expect to see memory failures -- in (external) DRAM as well as
(internal) SRAM.

As I suspect most folks only test at POST, how would they react to
a situation where the user just happened NOT to shut down their
product/system for 10 years? Would they feel confident that it
was still intact? Executing (out of RAM) the same code that they
loaded, there, 10 years earlier? (bugs in their software can't
corrupt the RAM's contents -- but the RAM can degrade!)

Don Y

Apr 26, 2016, 1:46:51 PM

That's typically a result of a specific usage pattern.
So, you can adopt the attitude of NOT letting those
types of behaviors into your code *or* resign yourself
to their inevitability and count on some increased
number of SOFT errors, as a result.

Dimiter_Popoff

Apr 26, 2016, 2:08:45 PM

On 26.4.2016 г. 20:22, Don Y wrote:
> On 4/26/2016 4:23 AM, David Brown wrote:
>> I would first ask why you are concerned about this.
>
> Obviously, because I am concerned with reliability and availability.
> .....
>
> The point of my questions is to inquire as to how people see and
> expect to see memory failures -- in (external) DRAM as well as
> (internal) SRAM.
> ....

Hi Don,

I think nowadays David's attitude is both the obvious and the correct
one.
Leave memory testing to the silicon and board manufacturers, they have
better means to test it than the CPU which uses it. If you need the
feeling of some extra reliability use a part with ECC (and populate
the chips for it....).

I have never noticed a memory failure in the last 30 years which could
not be tracked down to something external to the memory, e.g. bad board
connection, missing/bad bypass caps etc.

It is just that the failure probability of everything else dwarfs the
one the memory silicon has; all these tests won't buy you much, if anything.

Dimiter

------------------------------------------------------
Dimiter Popoff, TGI http://www.tgi-sci.com
------------------------------------------------------
http://www.flickr.com/photos/didi_tgi/






Robert Wessel

Apr 26, 2016, 3:08:38 PM

It's not like there's a lack of literature on that subject...

But if you assume independence in bit errors, the calculation for
uncorrectable faults is simple. If you want to consider some
probability of more difficult failures (an entire chip, an entire
DIMM), those generate hard failures immediately, unless you add
heroics like DRAM sparing (basically RAID for RAM - IBM likes the term
"RAIM"), as is done on high end servers.

If you're just doing monitoring, and then preventive maintenance,
based on an accumulated soft error rate, there again has been a fair
bit of literature, but they all come to approximately the same
conclusion - soft errors are pretty rare for most devices, and on a
handful they tend to be much more common. So the exact threshold is
actually not that important. So a DIMM getting a soft error every few
months is ignorable, several per day is not, and there's little in the
real world between those.

That's called scrubbing, and it is fundamental to any redundant storage
scheme, RAM or disk. Even if you have nothing but ordinary single bit
(for RAM) errors, the odds of that turning into an uncorrectable error
scale with the amount of time the condition persists and the odds of
another bit in the protected block getting hit. So you must scrub.
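
(Roughly, a scrubber boils down to something like the sketch below; the
pace, the names, and the error-counter polling are placeholders -- the ECC
reporting path is entirely platform-specific:)

#include <stdint.h>
#include <stddef.h>

#define SCRUB_WORDS_PER_TICK 1024u   /* pace it so a full pass takes hours, not seconds */

/* Touch a slice of the ECC-protected array each tick; the read itself is
   what gives the ECC hardware the chance to detect/correct a single-bit
   error before a second bit in the same word accumulates. */
void scrub_tick(volatile uint64_t *base, size_t total_words)
{
    static size_t pos;

    for (size_t i = 0; i < SCRUB_WORDS_PER_TICK; i++) {
        (void)base[pos];                 /* read; controller corrects/reports */
        pos = (pos + 1) % total_words;
    }

    /* A real scrubber would also poll the controller's correctable-error
       counters here and raise an alarm past some threshold. */
}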

And proper machine check architectures do distinguish between
immediate errors and deferrable ones. If you get an uncorrectable
memory error during an instruction fetch, that thread is going to be
dead. But if scrubbing turns up an uncorrectable error, it can be
reported to the OS, which can deal with it at its leisure. That could
be reloading the page (if possible), or killing every process that has
that page mapped (if it cannot be reconstructed).

Don Y

Apr 26, 2016, 3:41:53 PM

Hi Dimiter,

On 4/26/2016 11:08 AM, Dimiter_Popoff wrote:

> I think nowadays David's attitude is both the obvious and the correct
> one.

Would you leave one of your instruments running (with code executing
out of RAM) for a year and expect the program image to be intact?
Would you leave 50 of them, side by side, and expect the same?

[Recall, I have lots of processors coordinating their efforts so
all that much more opportunity for failures to manifest]

Errors *do* occur. How vulnerable you are to them is a different
issue. If a soft/hard error causes a light to blink at 3Hz instead
of 2Hz... <shrug>

> Leave memory testing to the silicon and board manufacturers, they have
> better means to test it than the CPU which uses it. If you need the
> feeling of some extra reliability use a part with ECC (and populate
> the chips for it....).

But there aren't many parts that *do* support ECC, natively.
You can glue on external syndrome logic -- but all that will
tell you is if *it* detected/corrected a memory error. If
testing in software "doesn't make sense", then how does adding
hardware make sense?

> I have never noticed a memory failure last 30 years which could
> not be tracked down to something external to the memory, e.g. bad board
> connection, missing/bad bypass caps etc.

How do you know you've had a memory failure? Or, have they been
"catastrophic" (hard to ignore)? Without ECC -- and runtime
tools that monitor and log those errors -- you can't say whether
you're experiencing none... or MANY! And, where the threshold lies
between "none", "some" and "many".

> It is just that the failure probability of everything else dwarfs the
> one memory silicon has, all these tests won't buy you much if anything.

Then how do you tell when one of those "everything else"s fails?
Wait until your instrument starts "acting funny", reset it and see
if the POST finds "bad memory"?

This was why I presented "monitoring the environment" as an alternative
strategy to "testing memory"; if the goal is to ensure the product is
accurately executing its intended sequence of instructions, is it
better to look at how well those instructions are preserved *in*
memory? Look carefully at the environmental characteristics around
the chip? etc.

David Brown

Apr 26, 2016, 4:00:24 PM

On 26/04/16 20:08, Dimiter_Popoff wrote:
> On 26.4.2016 г. 20:22, Don Y wrote:
>> On 4/26/2016 4:23 AM, David Brown wrote:
>>> I would first ask why you are concerned about this.
>>
>> Obviously, because I am concerned with reliability and availability.
>> .....
>>
>> The point of my questions is to inquire as to how people see and
>> expect to see memory failures -- in (external) DRAM as well as
>> (internal) SRAM.
> > ....
>
> Hi Don,
>
> I think nowadays David's attitude is both the obvious and the correct
> one.
> Leave memory testing to the silicon and board manufacturers, they have
> better means to test it than the CPU which uses it. If you need the
> feeling of some extra reliability use a part with ECC (and populate
> the chips for it....).
>
> I have never noticed a memory failure last 30 years which could
> not be tracked down to something external to the memory, e.g. bad board
> connection, missing/bad bypass caps etc.
>
> It is just that the failure probability of everything else dwarfs the
> one memory silicon has, all these tests won't buy you much if anything.
>

You put that very well.

When considering the reliability of anything, and its failure modes and
their consequences, you have to consider the balance of probabilities.
There is no point in testing memory just because you can test it - the
testing is not free, and at some point it becomes negative value (for
example, it takes so much of the processor time that you need a faster
processor with lower reliability). You do the tests that make sense,
based on the likelihood of there being a problem, the consequences of
the problem, and what you can do if you spot a problem.

You see the same sort of situation in security. It makes sense to have
a good lock on your front door - but once it is so good that it is
easier for burglars to break a window, then any expense on more locks is
wasted.

And in the memory test situation, who cares if your card's memory is
perfect after 10 years if the electrolytic capacitors have decayed after
5 years? You need to make sure the effort is put in the right places.

Now, I am not saying that Don should /not/ test his memory - I am just
asking if he has clear thoughts (and preferably, real numbers)
justifying it.


Don Y

Apr 26, 2016, 4:06:18 PM

But the literature tends to be concerned with large memory arrays.
And, large memory arrays tend to be constructed differently. And,
operated in different environments, attended by "professionals", etc.

> But if you assume independence in bit errors, the calculation is for

The literature suggests failures repeat. So, it's not a uniform distribution
across an entire device/array. I.e., stumbling onto a soft error suggests
you're more likely to find another in that same spot than elsewhere.
(This, in turn, suggests you treat the soft error as if it is -- or will
become -- a hard error)

> uncorrectable faults is simple. If you want to consider some
> probability of more difficult failures (and entire chip, and entire
> DIMM), those generate hard failures immediately, unless you add
> heroics like DRAM sparing (basically RAID for RAM - IBM likes the term
> "RAIM"), as is done on high end servers.
>
> If you're just doing monitoring, and then preventive maintenance,
> based on an accumulated soft error rate, there again has been a fair
> bit of literature, but they all come to approximately the same
> conclusion - soft errors are pretty rare for most devices, and on a
> handful they tend to be much more common. So the exact threshold is
> actually not that important. So a DIMM getting a soft error every few
> months in ignorable, several per day is not, and there's little in the
> real world between those.

Some studies show hard errors are more prevalent than soft; others
show the exact opposite. A google study (big data farms) claimed
~50,000 FiT/Mb. Further, it appeared to correlate error rates with
device age -- as if cells were "wearing out" from use.

And, it's not "a soft error every few months" but, rather, several thousand
per year (per GB, so figure I'm at 1/4 of that -- per device node!)

But you can do more than that. You can retire the affected memory
so it no longer presents a (potential) problem.
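
(E.g., a bare-bones sketch of frame retirement -- the names and the bitmap
scheme are invented for illustration, not taken from any particular OS:)

#include <stdint.h>
#include <stddef.h>
#include <stdbool.h>

#define NUM_FRAMES 65536u                  /* e.g. 256MB of 4KB frames */

static uint32_t retired[NUM_FRAMES / 32];  /* one bit per physical frame */

void retire_frame(size_t frame)            /* never hand this frame out again */
{
    retired[frame / 32] |= 1u << (frame % 32);
}

bool frame_is_usable(size_t frame)
{
    return (retired[frame / 32] & (1u << (frame % 32))) == 0;
}

/* The allocator consults frame_is_usable() before returning a frame; on a
   soft error that looks likely to recur, copy/reload the page's contents,
   remap it to a good frame, then retire_frame() the suspect one. */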

> And proper machine check architectures do distinguish between
> immediate errors and deferrable ones. If you get a uncorrectable
> memory error during an instruction fetch, that thread is going to be

That isn't true, either. What if the opcode should have been a
"jump if zero" and, instead, gets decoded (bad fetch) as "jump
unconditionally"? This only causes a problem if the value was NOT
zero!

And, even if it *was* zero, the consequences may not be fatal.
Maybe a light blinks a little faster. Or, a newline gets inserted
in the middle of a line of text. etc.

This can be happening all the time and no one would be the wiser!

OTOH, if something told you that it was likely happening -- and could
quantify that frequency -- you might be more predisposed to taking
remedial action before the fit hits the shan!

If the only assessment you make of the integrity of your memory
happens at POST, a cautious user would be resetting the device often
just to ensure POST runs often!

Don Y

Apr 26, 2016, 4:30:00 PM

On 4/26/2016 12:41 PM, Don Y wrote:
>> I have never noticed a memory failure last 30 years which could
>> not be tracked down to something external to the memory, e.g. bad board
>> connection, missing/bad bypass caps etc.
>
> How do you know you've had a memory failure? Or, have they been
> "catastrophic" (hard to ignore)? Without ECC -- and runtime
> tools that monitor and log those errors -- you can't say whether
> your experiencing none... or MANY! And, where the threshold lies
> between "none", "some" and "many".

Here's a good starting point:
<https://static.googleusercontent.com/media/research.google.com/en//pubs/archive/35162.pdf>

Note you can argue that google's environment is more *benign* (their
machines don't have motor controls colocated on the same PCB; they
have a bigger budget for hardware/maintenance/monitoring; they actively
control the environment; etc.). Or, you could argue that they are
relying on commodity hardware/software instead of things designed for
a specific task/application.

<shrug> The same is true of much of the other literature...

Dimiter_Popoff

Apr 26, 2016, 5:21:21 PM

On 26.4.2016 г. 22:41, Don Y wrote:
> Hi Dimiter,
>

Hi Don,

> On 4/26/2016 11:08 AM, Dimiter_Popoff wrote:
>
>> I think nowadays David's attitude is both the obvious and the correct
>> one.
>
> Would you leave one of your instruments running (with code executing
> out of RAM) for a year and expect the program image to be intact?

Of course, happens all the time. For many months at least.

> Would you leave 50 of them, side by side, and expect the same?

Yes (64 or 128 M DDRAM, no ECC).

> Errors *do* occur. How vulnerable you are to them is a different
> issue. If a soft/hard error causes a light to blink at 3Hz instead
> of 2Hz... <shrug>

Of course errors do occur. My point is "do you take preventive measures
to not be hit by a meteor while crossing a rush hour street?".

>
>> Leave memory testing to the silicon and board manufacturers, they have
>> better means to test it than the CPU which uses it. If you need the
>> feeling of some extra reliability use a part with ECC (and populate
>> the chips for it....).
>
> But there aren't many parts that *do* support ECC, natively.

Well silicon makers must have good risk assessment strategies for
the decision when to put in ECC, which underscores my point.
If a chip does not have it, it is because in all likelihood it
does not need it. What is the point of ECC for a part which
costs $10 or less and will be programmed in C or some other
HLL, where the programmer will never know what exactly the
code does?

The larger parts from Freescale/NXP do have ECC on their DDRAM
controllers; IIRC those I have seen correct single bit errors
and signal larger ones. How justified this is technically I
just don't know -- I don't have the testing data they have collected
over the years -- but it clearly is economically justified (your
system won't be discarded at some decision making point because
it does not have ECC - and at this size ECC is no big part of
the cost).

> How do you know you've had a memory failure? Or, have they been
> "catastrophic" (hard to ignore)? Without ECC -- and runtime
> tools that monitor and log those errors -- you can't say whether
> your experiencing none... or MANY! And, where the threshold lies
> between "none", "some" and "many".

Well, like I said, many systems running for many months without being
reset is nothing special for me, so if there were some memory problem
I'd have noticed it. I have noticed things much harder to imagine;
like interference into an I2C line which could cause some hang (was
software correctable once I realized this occurred) etc., and other
events which occur very rarely and are very hard to detect. Memory
issues have not been one of them. Then again, I speak only of a
few systems I have designed and manufactured.

Don Y

Apr 26, 2016, 8:36:09 PM

Hi Dimiter,

>> Would you leave one of your instruments running (with code executing
>> out of RAM) for a year and expect the program image to be intact?
>
> Of course, happens all the time. For many months at least.

Then I suspect you are just not aware of the errors that are
occurring -- or, that they are masked by "expectations", etc.

Using the error rates predicted in google's paper:

25,000 FiT/Mbit * 64MB * 8 bits/byte = 12,800,000 FiT
12,800,000 failures per 1,000,000,000 hours = 12.8 per 1,000 hours
or, one every ~80 hours.

Using their high figure (75000 FiT/Mb) cuts that to one error
every ~1 day!

For a 128MB system, that's a range of 1 error every 12 - 40 hours.
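
(The same arithmetic as a throwaway helper, using the paper's figures
rather than measurements of any particular part; it assumes independent,
uniformly distributed errors:)

#include <stdio.h>

/* Expected hours between errors for a given FIT rate (failures per 10^9
   device-hours, per Mbit) and memory size. */
static double hours_between_errors(double fit_per_mbit, double megabytes)
{
    double total_fit = fit_per_mbit * megabytes * 8.0;   /* MB -> Mbit */
    return 1.0e9 / total_fit;                            /* FIT is per 10^9 h */
}

int main(void)
{
    printf("64MB @ 25,000 FiT/Mbit: one error per %.0f hours\n",
           hours_between_errors(25000.0, 64.0));          /* ~78 */
    printf("64MB @ 75,000 FiT/Mbit: one error per %.0f hours\n",
           hours_between_errors(75000.0, 64.0));          /* ~26 */
    return 0;
}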

If you're not seeing these (have you verified the code image in
RAM is unchanged?), then there is something different about an
embedded product (e.g., soldered-down RAM devices?) or a difference
in the components you're using or the conditions under which they
are operated (e.g. larger device geometries -- though some
studies claim smaller geometries are not responsible for increases
in error rates).

>> Would you leave 50 of them, side by side, and expect the same?
>
> Yes (64 or 128 M DDRAM, no ECC).

With 50 units running concurrently (and independently distributed
errors), you should see one of those machines experiencing an error
every 15 - 60 minutes.

>> Errors *do* occur. How vulnerable you are to them is a different
>> issue. If a soft/hard error causes a light to blink at 3Hz instead
>> of 2Hz... <shrug>
>
> Of course errors do occur. My point is "do you take preventive measures
> to not be hit by a meteor while crossing a rush hour street?".

There were 63 reported meteorites in the 2001-2012 period, WORLD-WIDE.
Let's extrapolate that rate to an average lifetime -- say 500 (regardless
of where you might be at the time)

The planet's surface area is ~200 million square miles. Let's assume I
am a one square mile target -- even when I'm indoors! So, in my lifetime,
I stand a 500/200,000,000 chance of being hit by a meteorite.

In an 80 year span (roughly 700,000 hours... call it a million hours),
that says I'd have a 500/(200M*1M) chance of getting hit in a given hour.
Or, the equivalent of ~0.0025 FiT

[lots of handwaving here to give a relative sense of scale]

>>> Leave memory testing to the silicon and board manufacturers, they have
>>> better means to test it than the CPU which uses it. If you need the
>>> feeling of some extra reliability use a part with ECC (and populate
>>> the chips for it....).
>>
>> But there aren't many parts that *do* support ECC, natively.
>
> Well silicon makers must have good risk assessment strategies for
> the decision when to put in ECC, which underscores my point.

No, they also have MARKETING strategies involved! PC's have been sold
with ECC "optional" for many years -- despite the sizes of the memory
complements installed! Because adding 15% to the cost of a DIMM
would make the product "too expensive"?

> If a chip does not have it it is because in all likelihood it
> does not need it. What is the point of ECC for a part which
> costs $10 or less and will be programmed in C or some other
> HLL where the programmer will never know what exactly does the
> code do.
>
> The larger parts from Freescale/NXP do have ECC on their DDRAM
> controllers, IIRC those I have seen correct single bit errors
> and signal larger ones. How justified is this technically I
> just don't know, don't have their testing etc. data collected
> over the years, but it clearly is economically justified (your
> system won't be discarded at some decision making point because
> it does not have ECC - and at this size ECC is no big part of
> the cost).
>
>> How do you know you've had a memory failure? Or, have they been
>> "catastrophic" (hard to ignore)? Without ECC -- and runtime
>> tools that monitor and log those errors -- you can't say whether
>> your experiencing none... or MANY! And, where the threshold lies
>> between "none", "some" and "many".
>
> Well like I said many systems running for many months without being
> reset is nothing special for me so if there were some memory problem

I disagree. There are numerous ways for a memory error to slip through
without disturbing operation in a noticeable (*verifiable*) way.
Will your customers notice if the LSB in a raw datum is toggled?

Have a read of:
<http://www.cse.psu.edu/~mtk2/guw_DSN04.pdf>
pay attention to the "not manifested" results -- cases where a KNOWN
error was intentionally injected into the system but the system appeared
to not react to it.

As I say, I suspect errors *are* happening (the FiT figures suggest
it and the experiment above shows how easily errors can slip through)

Dimiter_Popoff

Apr 26, 2016, 9:21:55 PM

On 27.4.2016 г. 03:35, Don Y wrote:
> Hi Dimiter,
>
>>> Would you leave one of your instruments running (with code executing
>>> out of RAM) for a year and expect the program image to be intact?
>>
>> Of course, happens all the time. For many months at least.
>
> Then I suspect you are just not aware of the errors that are
> occurring -- or, that they are masked by "expectations", etc.
>
> Using the error rates predicted in google's paper:
>
> 25,000FiT/Mb * 64MB * 8 = 12,800,000 Fit
> 12,800,000 / 1,000,000,000 hrs = 12.8/1000 hrs
> or, one every ~80 hours.
>
> Using their high figure (75000 FiT/Mb) cuts that to one error
> every ~1 day!
>
> For a 128MB system, that's a range of 1 error every 12 - 40 hours.

Hi Don,

I would first question the basic data you are using. Having never
seen the google paper I doubt they can produce a result on memory
reliability judging by the memories on their servers. Knowing what
a mess the software they distribute is, I would say just about all the
errors they have attributed to memory failure must have been
down to their buggy software.
Again, I have not seen their paper and I won't spend time investigating
but I'll choose to stay where my intuition/experience has led me; I
have more reason to trust these than to trust google.


> Have a read of:
> <http://www.cse.psu.edu/~mtk2/guw_DSN04.pdf>
> pay attention to the "not manifested" results -- cases where a KNOWN
> error was intentionally injected into the system but the system appeared
> to not react to it.
>
> As I say, I suspect errors *are* happening (the FiT figures suggest
> it and the experiment above shows how easily errors can slip through)

Oh come on, for nuclear spectrometry gadgets - e.g. an MCA - it is vital
to survive months without being reset; there are measurements and
experiments which just last very long. While damage to the data memory
would be unnoticed - the data themselves are random enough - a few
megabytes of code and critical system data are constantly in use,
damage something there and you'll just see a crash or at least
erratic behaviour.

So my "mind the meteors while crossing a rush hour street in a big
city" still holds as far as I am concerned.

I have never looked at memory maker data about bit failures, I might
pay more attention to these if available than I would to some google
talk.

Robert Wessel

Apr 27, 2016, 12:29:14 AM

On Tue, 26 Apr 2016 13:05:56 -0700, Don Y
<blocked...@foo.invalid> wrote:

>> If you're just doing monitoring, and then preventive maintenance,
>> based on an accumulated soft error rate, there again has been a fair
>> bit of literature, but they all come to approximately the same
>> conclusion - soft errors are pretty rare for most devices, and on a
>> handful they tend to be much more common. So the exact threshold is
>> actually not that important. So a DIMM getting a soft error every few
>> months in ignorable, several per day is not, and there's little in the
>> real world between those.
>
>Some studies show hard errors are more prevalent than soft; others
>show the exact opposite. A google study (big data farms) claimed
>~50,000 FiT/Mb. Further, it appeared to correlate error rates with
>device age -- as if cells were "wearing out" from use.
>
>And, its not "a soft error every few months" but, rather, several thousands
>per year (per GB so figure I'm at 1/4 of that -- per device node!)


On a per-DIMM basis, the Google paper has 8.2% of DIMMs experiencing
one or more correctable errors per year, and of those 8.2%, the median
number of errors is 64 per year (with the overall average being
3751!). They go on to mention that for the DIMMs with errors, 20% of
those account for 94% of the errors.

Don Y

Apr 27, 2016, 1:26:13 AM

There are LOTS of holes in the study -- you'd need access to all
the raw data *plus* things they probably haven't even considered
to record (e.g., physical locations of the individual servers in
their racks -- esp if you consider SEU's from cosmic rays probably
affecting those at the top of their racks more than those "shaded"
by the machines above).

If you were doing this sort of thing for yourself, you'd try
moving DIMMs, moving servers, etc. to try to identify the cause
of the unresolved ambiguities reported.

Regardless, the takeaway is: can *you* predict what sort of error
rate YOUR device will experience "in the lab"? What about "in the
wild"? Do you know if your customer will be operating it at sea
level or a mile (or more) up? Do you know how the design wears
with age? etc.

Ages ago, you could build a DRAM controller out of discrete logic.
Now, the complexities of timing signals for the various DDR technologies
suggest you have to rely on the MCU vendor's implementation; is it
guaranteed to be "bug free" in all possible combinations of scheduled
cycles?

I can't see how you can rely on a one-time QUICK check of RAM to
express any sort of confidence in the CONTINUING operation of a
device -- short of catching permanent "stuck at" or "decode" faults.

And, if you have to restart a device to get that information, then
you're relying on implicit down time as part of your normal operating
procedure -- like MS's "reboot windows" approach to reliability!

Don Y

Apr 27, 2016, 9:36:57 AM

Hi Dimiter,

On 4/26/2016 6:21 PM, Dimiter_Popoff wrote:

>> Using the error rates predicted in google's paper:
>>
>> 25,000FiT/Mb * 64MB * 8 = 12,800,000 Fit
>> 12,800,000 / 1,000,000,000 hrs = 12.8/1000 hrs
>> or, one every ~80 hours.
>>
>> Using their high figure (75000 FiT/Mb) cuts that to one error
>> every ~1 day!
>>
>> For a 128MB system, that's a range of 1 error every 12 - 40 hours.
>
> I would first question the basic data you are using. Having never
> seen the google paper I doubt they can produce a result on memory
> reliability judging by the memories on their servers.

There have been other papers looking at other "processor pools"
(workstations, other "big iron", etc.). Their data vary but all
suggest memory can't be relied upon (without ECC -- or some other
"assurance method"). Of course, bigger arrays see more errors.
"Even using a relatively conservative error rate (500 FIT/Mbit),
a system with 1 GByte of RAM can expect an error every two weeks"
(note that's 100 times lower error rate than google's study turned up;
and 10 times lower than what other surveys have concluded)

And, if you treat your population of products as if it were a single
collection of memory, that means SOMEONE, SOMEWHERE is seeing
an error (and the thing they all have in common is the vendor
from whom they purchased the product)

Sun apparently had some spectacular failures traced to some memory
manufactured by IBM.

Of course, SRAM is also subject to the same sorts of "upset events".
And, SRAM is increasingly found in large FPGA's. (e.g., XCV1000)
"If a product contains just a single 1 megagate SRAM-based FPGA and
has shipped 50,000 units, there is a significant risk of field failures
due to firm errors. Even for such a simple system, the manufacturer
can expect that within his customer base, there will be a field failure
due to a firm error every 17 hours."
And, of course, an SRAM error in an FPGA can cause the hardware to be
configured in a "CAN'T HAPPEN" state (like turning on a pullup AND a
pulldown, simultaneously)

> Knowing what
> a mess the software they distribute is I would say about all the
> errors they have attributed to memory failure must have been
> down to their buggy software.

One of the researchers was not affiliated with google. Note that other
similar experiments (conducted by other firms on other hardware) have
yielded FiT's in the 20,000 range. It's not like google's numbers
are an isolated report.

> Again, I have not seen their paper and I won't spend time investigating
> but I'll choose to stay where my intuition/experience has lead me, I
> have more reason to trust these than to trust google.

<frown> I don't like relying on intuition when it comes to product
design. Just because you haven't seen (or, perhaps, RECOGNIZED) an
error, doesn't mean it doesn't exist.

>> Have a read of:
>> <http://www.cse.psu.edu/~mtk2/guw_DSN04.pdf>
>> pay attention to the "not manifested" results -- cases where a KNOWN
>> error was intentionally injected into the system but the system appeared
>> to not react to it.
>>
>> As I say, I suspect errors *are* happening (the FiT figures suggest
>> it and the experiment above shows how easily errors can slip through)
>
> Oh come on, for nuclear spectrometry gadgets - e.g. an MCA - it is vital
> to survive months without being reset, there are measurements and
> experiments which just last very long. While damage to the data memory
> would be unnoticed - the data themselves are random enough - a few
> megabytes of code and critical system data are constantly in use,
> damage something there and you'll just see a crash or at least
> erratic behaviour.

No, that's not a necessary conclusion. *READ* the papers cited. Or,
do you want to dismiss their software/techniques ALSO?

In that case, INSTRUMENT one of your NetMCA's and see what *it*
reports for errors over the course of months of operation.

The takeaway, for me, is that I should actually LOG any observed errors
knowing they would represent just the tip of the iceberg in terms of what
must be happening in normal operation -- but undetected in the absence of
ECC hardware! Let my devices gather data.

> So my "mind the meteors while crossing a rush hour street in a big
> city" still holds as far as I am concerned.
>
> I have never looked at memory maker data about bit failures, I might
> pay more attention to these if available than I would to some google
> talk.

Their silence is deafening. Given the "buzz" in the literature questioning
the integrity of their products (after all, the sole purpose of MEMORY
is to REMEMBER, *accurately*!), you would assume an organization with
access to virtually unlimited amounts of memory would conduct and
publish a comprehensive study refuting these claims!

Dimiter_Popoff

Apr 27, 2016, 7:47:43 PM

On 27.4.2016 г. 16:36, Don Y wrote:
> Hi Dimiter,
>
> On 4/26/2016 6:21 PM, Dimiter_Popoff wrote:
>
>>> Using the error rates predicted in google's paper:
>>>
>>> 25,000FiT/Mb * 64MB * 8 = 12,800,000 Fit
>>> 12,800,000 / 1,000,000,000 hrs = 12.8/1000 hrs
>>> or, one every ~80 hours.
>>>
>>> Using their high figure (75000 FiT/Mb) cuts that to one error
>>> every ~1 day!
>>>
>>> For a 128MB system, that's a range of 1 error every 12 - 40 hours.
>>
>> I would first question the basic data you are using. Having never
>> seen the google paper I doubt they can produce a result on memory
>> reliability judging by the memories on their servers.
>
> There have been other papers looking at other "processor pools"
> (workstations, other "big iron", etc. Their data vary but all
> suggest memory can't be relied upon (without ECC -- or some other
> "assurance method"). Of course, bigger arrays see more errors.
> "Even using a relatively conservative error rate (500 FIT/Mbit),
> a system with 1 GByte of RAM can expect an error every two weeks"
> (note that's 100 times lower error rate than google's study turned up;
> and 10 times lower than what other surveys have concluded)

Hi Don,

The more papers you read on the topic, the wider the interval of results
will get. Apparently these have been done by people who have had some
problem with their memory - or thought they had one and could not
discover the true source of the error, typically a bug.
And I am not saying there are no faulty memories and poorly designed
boards where memory errors do occur - but the solution is just to
have good silicon on properly designed boards. Much more efficient
than chasing errors which nobody knows when, if and why they do occur.
Our testing here is running a newborn unit for 72 hours, measuring
continuously with its HV at maximum (usually 5kV); we have never had
a memory issue during this test and have never had one with devices
in the field, some of which run for months without being reset,
all while being on a network.

Then, at different densities the probability of an error might well
be different. At DDR1 densities - I typically use 2 x16 chips to get 64
or 128 megabytes - I have not seen one error for years, and I am
pretty good at spotting things if they are not right.

Perhaps at gigabytes-per-chip densities things get worse. But then
the controllers meant for such chips have ECC, which all but eliminates
the probability of an error hitting you (if the silicon/board are
good); even if you get 1 bit error per hour, the probability of getting
two at the same time at the same address is vanishingly small.
Once I have a system with ECC to port DPS to, I'll probably be able
to see how many memory errors the ECC sees and corrects; I strongly
suspect they will still be 0, but we'll see. Anyway, at a few G of
memory ECC makes sense, I suppose.


>> Again, I have not seen their paper and I won't spend time investigating
>> but I'll choose to stay where my intuition/experience has lead me, I
>> have more reason to trust these than to trust google.
>
> <frown> I don't like relying on intuition when it comes to product
> design. Just because you haven't seen (or, perhaps, RECOGNIZED) an
> error, doesn't mean it doesn't exist.

Well I said intuition/experience, we all use both all the time simply
because we don't have many other options.


>
>>> Have a read of:
>>> <http://www.cse.psu.edu/~mtk2/guw_DSN04.pdf>
>>> pay attention to the "not manifested" results -- cases where a KNOWN
>>> error was intentionally injected into the system but the system appeared
>>> to not react to it.
>>>
>>> As I say, I suspect errors *are* happening (the FiT figures suggest
>>> it and the experiment above shows how easily errors can slip through)
>>
>> Oh come on, for nuclear spectrometry gadgets - e.g. an MCA - it is vital
>> to survive months without being reset, there are measurements and
>> experiments which just last very long. While damage to the data memory
>> would be unnoticed - the data themselves are random enough - a few
>> megabytes of code and critical system data are constantly in use,
>> damage something there and you'll just see a crash or at least
>> erratic behaviour.
>
> No, that's not a necessary conclusion. *READ* the papers cited. Or,
> do you want to dismiss their software/techniques ALSO?

Well I am not sure I'll read it any time soon, have other things to do.
But I may get back to it if at some point I feel I have a related
problem or something.

Dimiter_Popoff

Apr 27, 2016, 10:09:47 PM

On 28.4.2016 г. 02:47, Dimiter_Popoff wrote:
> On 27.4.2016 г. 16:36, Don Y wrote:
>....
>>>> Have a read of:
>>>> <http://www.cse.psu.edu/~mtk2/guw_DSN04.pdf>
>>>> pay attention to the "not manifested" results -- cases where a KNOWN
>>>> error was intentionally injected into the system but the system
>>>> appeared
>>>> to not react to it.
>>>>
>>>> As I say, I suspect errors *are* happening (the FiT figures suggest
>>>> it and the experiment above shows how easily errors can slip through)
>>>
>>> Oh come on, for nuclear spectrometry gadgets - e.g. an MCA - it is vital
>>> to survive months without being reset, there are measurements and
>>> experiments which just last very long. While damage to the data memory
>>> would be unnoticed - the data themselves are random enough - a few
>>> megabytes of code and critical system data are constantly in use,
>>> damage something there and you'll just see a crash or at least
>>> erratic behaviour.
>>
>> No, that's not a necessary conclusion. *READ* the papers cited. Or,
>> do you want to dismiss their software/techniques ALSO?
>
> Well I am not sure I'll read it any time soon, have other things to do.
> But I may get back to it if at some point I feel I have a related
> problem or something.
>

Hi Don,

had a look at the paper (just the abstract). It is not really relevant:
they compare the error immunity of different processors, but this has little
to do with RAM errors and my not seeing them. They try to crash linux
by injecting errors - well, given that it is written in C and bloated
by at least a factor of 10 (on occasion 100+ times) relative to its DPS
equivalent (vpa written), no wonder there is plenty of room in RAM
wasted which can be damaged to no consequence.
I am quite sure dps won't survive a fraction of their intentional
memory damage - remember, I am programming in it all day and I know
what happens if my code does something stupid... I am not saying their
results are invalid, just not applicable to how I estimate the
likelihood of memory errors; the bloat factor difference is just
way too big.

Dimiter

