Hi Dimiter,
On 4/26/2016 6:21 PM, Dimiter_Popoff wrote:
>> Using the error rates predicted in google's paper:
>>
>> 25,000FiT/Mb * 64MB * 8 = 12,800,000 Fit
>> 12,800,000 / 1,000,000,000 hrs = 12.8/1000 hrs
>> or, one every ~80 hours.
>>
>> Using their high figure (75000 FiT/Mb) cuts that to one error
>> every ~1 day!
>>
>> For a 128MB system, that's a range of 1 error every 12 - 40 hours.
>
> I would first question the basic data you are using. Having never
> seen the google paper I doubt they can produce a result on memory
> reliability judging by the memories on their servers.
There have been other papers looking at other "processor pools"
(workstations, other "big iron", etc.). Their data vary, but all
suggest memory can't be relied upon (without ECC -- or some other
"assurance method"). Of course, bigger arrays see more errors.
"Even using a relatively conservative error rate (500 FIT/Mbit),
a system with 1 GByte of RAM can expect an error every two weeks"
(note that's an error rate 100 times lower than what google's study
turned up, and 10 times lower than what other surveys have concluded)
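The arithmetic behind all of these "one error every N hours" figures
is the same FIT conversion; a small sketch (figures taken from the
numbers above -- 1 FIT is one failure per 10^9 device-hours):

```python
def fit_to_interval_hours(fit_per_mbit: float, megabytes: float) -> float:
    """Expected hours between errors for an array of the given size,
    given a per-megabit FIT rate (failures per 10^9 device-hours)."""
    megabits = megabytes * 8
    total_fit = fit_per_mbit * megabits
    return 1e9 / total_fit

# google's low figure (25,000 FIT/Mbit) on a 64 MB array:
print(round(fit_to_interval_hours(25_000, 64)))   # ~78 hours, i.e. "~80"

# the "conservative" 500 FIT/Mbit figure on 1 GB:
print(round(fit_to_interval_hours(500, 1024)))    # ~244 hours, ~10 days
```

(Note the 500 FIT/Mbit case works out to roughly a week and a half,
consistent with the "every two weeks" order of magnitude quoted.)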
And, if you treat your population of products as a single collection
of memory, that means SOMEONE, SOMEWHERE is seeing an error (and the
thing they all have in common is the vendor from whom they purchased
the product).
Sun apparently had some spectacular failures traced to some memory
manufactured by IBM.
Of course, SRAM is also subject to the same sorts of "upset events".
And, SRAM is increasingly found in large FPGAs (e.g., the XCV1000).
"If a product contains just a single 1 megagate SRAM-based FPGA and
has shipped 50,000 units, there is a significant risk of field failures
due to firm errors. Even for such a simple system, the manufacturer
can expect that within his customer base, there will be a field failure
due to a firm error every 17 hours."
And, of course, an SRAM error in an FPGA can cause the hardware to be
configured in a "CAN'T HAPPEN" state (like turning on a pullup AND a
pulldown, simultaneously).
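The quoted "every 17 hours" figure is just the per-unit failure rate
scaled across the installed base. The quote doesn't give the per-device
rate it assumed, but a figure of roughly 1,200 FIT/device (back-computed
here -- an assumption, not from the source) reproduces it:

```python
def fleet_error_interval_hours(units_shipped: int, fit_per_unit: float) -> float:
    """Expected hours between firm errors summed over the whole fleet.
    fit_per_unit is failures per 10^9 device-hours for ONE unit."""
    fleet_failures_per_hour = units_shipped * fit_per_unit / 1e9
    return 1.0 / fleet_failures_per_hour

# 50,000 shipped units, ~1,200 FIT each (assumed, back-computed figure):
print(round(fleet_error_interval_hours(50_000, 1_200)))  # ~17 hours
```

The point being: a rate that looks negligible per unit (one failure in
~100 years of operation) becomes a weekly support call once you ship
tens of thousands of them.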
> Knowing what
> a mess the software they distribute is I would say about all the
> errors they have attributed to memory failure must have been
> down to their buggy software.
One of the researchers was not affiliated with google. Note that other
similar experiments (conducted by other firms on other hardware) have
yielded FITs in the 20,000 range. It's not like google's numbers
are an isolated report.
> Again, I have not seen their paper and I won't spend time investigating
> but I'll choose to stay where my intuition/experience has lead me, I
> have more reason to trust these than to trust google.
<frown> I don't like relying on intuition when it comes to product
design. Just because you haven't seen (or, perhaps, RECOGNIZED) an
error, doesn't mean it doesn't exist.
>> Have a read of:
>> <http://www.cse.psu.edu/~mtk2/guw_DSN04.pdf>
>> pay attention to the "not manifested" results -- cases where a KNOWN
>> error was intentionally injected into the system but the system appeared
>> to not react to it.
>>
>> As I say, I suspect errors *are* happening (the FiT figures suggest
>> it and the experiment above shows how easily errors can slip through)
>
> Oh come on, for nuclear spectrometry gadgets - e.g. an MCA - it is vital
> to survive months without being reset, there are measurements and
> experiments which just last very long. While damage to the data memory
> would be unnoticed - the data themselves are random enough - a few
> megabytes of code and critical system data are constantly in use,
> damage something there and you'll just see a crash or at least
> erratic behaviour.
No, that's not a necessary conclusion. *READ* the papers cited. Or,
do you want to dismiss their software/techniques ALSO?
In that case, INSTRUMENT one of your NetMCA's and see what *it*
reports for errors over the course of months of operation.
The takeaway, for me, is that I should actually LOG any observed errors
knowing they would represent just the tip of the iceberg in terms of what
must be happening in normal operation -- but undetected in the absence of
ECC hardware! Let my devices gather data.
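A minimal sketch of that kind of logging: periodically re-checksum a
region that *should* never change (code, constant tables) and log any
mismatch. The CRC-over-a-static-region approach and all the names here
are illustrative, not how any particular device does it:

```python
import logging
import zlib

logging.basicConfig(level=logging.INFO)
log = logging.getLogger("scrub")

def snapshot(region: bytes) -> int:
    """Record a reference CRC over a region believed to be immutable."""
    return zlib.crc32(region)

def scrub(region: bytes, reference_crc: int) -> bool:
    """Re-check the region; log a suspected upset on mismatch.
    Returns True if the region still matches its reference CRC."""
    ok = zlib.crc32(region) == reference_crc
    if not ok:
        log.warning("suspected memory upset: CRC mismatch in static region")
    return ok
```

Run `scrub()` from a low-priority periodic task; even without ECC, the
accumulated log tells you whether "it never happens" actually holds
across months of field operation.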
> So my "mind the meteors while crossing a rush hour street in a big
> city" still holds as far as I am concerned.
>
> I have never looked at memory maker data about bit failures, I might
> pay more attention to these if available than I would to some google
> talk.
Their silence is deafening. Given the "buzz" in the literature questioning
the integrity of their products (after all, the sole purpose of MEMORY
is to REMEMBER, *accurately*!), you would assume an organization with
access to virtually unlimited amounts of memory would conduct and
publish a comprehensive study refuting these claims!