My motive, of course, for posting this is that I am close to making a decision
on upgrading my current 5-year-old Athlon XP system to something dual-core
etc., and simply don't know at this point whether to go the Intel or AMD route.
I'm pretty agnostic along these lines; I just want the best bang for my buck.
Anyway, here's the link:
http://www.gigabyte.com.tw/FileList/NewTech/2006_motherboard_newtech/article_02_all_solid.htm
Thanks to you all in advance.
TLG
This is essentially a high-availability server design practice trickling down
to the desktop. Every five-nines+ server motherboard I've designed from late
1999 on used aluminum capacitors sporting solid polymer material (see Sanyo
OS-CON, for example) wherever high capacitance/low ESR devices were required -
which was mostly on the processor and chipset VRD rails.
These aren't cheap, so I was restrained from using more than theory required
(though I always put a few extra, strategically located footprints down and
left them off the assembly BOMs). This generally reflects a trade-off between
load-lines and how complex (which pretty much means how many phases) a
switching regulator one wants to construct. 6-phase switchers have been fairly
standard in the Xeon space, but 8-phasers are gaining, trading FETs for fat
caps. With fewer caps comes less resistance to putting down premium devices
and burying the life-span issue.
Does it make any difference to a typical desktop user? Plainly, the likelihood
of failure of electrolytic caps as the desktop ages is going to be a major
factor, with those who stick with a system for 5 years more likely to succumb
to sudden system death, and those who keep on the cutting edge (if not the
bleeding edge) most likely to beat the reaper...
Cheers
/daytripper
Thanks for the great post, Daytripper. I **think** I understand most of
what you said. Yeah, my mobo is approaching five years now and I'm starting
to see problems that I'm almost sure are h/w related. Won't go into that
here, just wanted some feedback on the OP. You're the MAN, Daytripper.
Glad to see you're still alive! :-)
/TLG (still wondering why there are so many MIAs from the .chips NG)
Sure, it is more likely that a sudden system death will occur in 5
years than in 3 years, as long as the probability of sudden system
death in the latter two years is non-zero.
What is more interesting is whether the rate of system death increases
with age, and by how much.
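As a back-of-the-envelope illustration (the numbers below are invented for
the example, not measurements), a simple Weibull model separates the two
questions: with shape 1 the failure rate is constant and the 5-year figure is
higher only because more time has passed, while with shape greater than 1 the
rate itself climbs with age:

import math

def weibull_cum_fail(t_years, scale=20.0, shape=1.0):
    # shape == 1.0 -> constant hazard (exponential);
    # shape > 1.0  -> hazard increasing with age
    return 1.0 - math.exp(-(t_years / scale) ** shape)

for shape in (1.0, 2.0):
    p3 = weibull_cum_fail(3, shape=shape)
    p5 = weibull_cum_fail(5, shape=shape)
    print(f"shape={shape}: P(fail by 3y)={p3:.1%}, P(fail by 5y)={p5:.1%}")

With the constant rate both figures stay modest and roughly proportional to
elapsed time; with the rising rate the five-year number pulls away much
faster.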
We have two consumer-type boards that have been running all the time for
seven years now, so I think that any fear that the reaper is coming soon to
visit desktops as soon as they reach five years is exaggerated,
especially since many desktops are turned off much of the time.
Sudden deaths that we see occur more in power supplies, RAM, fans, and
disks.
- anton
--
M. Anton Ertl Some things have to be seen to be believed
an...@mips.complang.tuwien.ac.at Most things have to be believed to be seen
http://www.complang.tuwien.ac.at/anton/home.html
>daytripper <day_t...@REMOVEyahoo.com> writes:
>>Does it make any difference to a typical desktop user? Plainly, the likelihood
>>of failure of electrolytic caps as the desktop ages is going to be a major
>>factor, with those who stick with a system for 5 years more likely to succumb
>>to sudden system death,
>
>Sure it is more likely that a sudden system death will occur in 5
>years than in 3 years, as long as the probability of sudden system
>death in the latter two years is non-zero.
>
>What is more interesting is whether the rate of system death increases
>with age, and by how much.
>
>We have two consumer-type boards that have been running all the time for
>seven years now, so I think that any fear that the reaper is coming soon to
>visit desktops as soon as they reach five years is exaggerated,
>especially since many desktops are turned off much of the time.
>
>Sudden deaths that we see occur more in power supplies, RAM, fans, and
>disks.
>
>- anton
Your methodology is intrinsically unsound. Two measurements using boards of
unknown construction do not support your conclusion.
Liquid electrolyte caps have a very low MTBF.
That's the beginning and end of the question...
Cheers
/daytripper
> Liquid electrolyte caps have a very low MTBF.
> That's the beginning and end of the question...
Actually, no, it's not. The issue is whether, empirically, electrolytic
failure *on motherboards* (not in power supplies) is a significant
cause of motherboard death, and in turn whether motherboard death is a
significant contributor to the death of desktops. My guess is the
answers to these are "yes" and "no" respectively.
The kinds of design adopted for high-reliability machines are
interesting here, but not for the reason you might think: every machine
with any pretension to high reliability that I've ever seen has
redundant hot-swappable PSUs and fans (as well as disks, obviously).
Assuming the designers were competent, that tells you what tends to
fail: power supplies and fans (and disks). Further, since those things
can now be swapped without taking the machine down, the reliability of
other components becomes the thing that controls the reliability of the
whole system. For machines where the system board can't be swapped
with the machine up, which is most of them, that means you might need
to pay serious attention to that (for machines where system boards
*can* be swapped with the machine up you're probably paying so much for
the machine that you expect serious attention to be paid anyway).
It would be interesting to see statistics as to what kills desktops.
My guess (which should be taken for what it's worth, namely nothing)
is:
1. most of them are thrown away, working;
2. disk failure
3. PSU failure
4. fan failure
5. everything else, trailing a long way behind.
--tim
No pretensions here: http://www.stratus.com/products/index.htm
An entire "side" - power, cooling, logic - is field replaceable without any
user or process ever realizing it's happening - from fault, through automagic
call-home-to-mama, through FRU replacement, through re synchronization.
Yet, and, as I said earlier, the motherboards use capacitors that have an mtbf
an order of magnitude higher than that found on desktops (until recently).
Because you don't get to 6 nines on the cheap, son.
No matter what is replaceable once the system's been fired up...
Cheers
/daytripper
> No pretensions here: http://www.stratus.com/products/index.htm
Do these machines (or any other HA systems) have much to do with
desktops (which was the original question, remember)? No.
> Because you don't get to 6 nines on the cheap, son.
> No matter what is replaceable once the system's been fired up...
Indeed you do not. And again: what exactly does this have to do with
the reliability of desktop motherboards? I suggest: nothing.
...
> 1. most of them are thrown away, working;
> 2. disk failure
> 3. PSU failure
> 4. fan failure
> 5. everything else, trailing a long way behind.
That (with the exception of the top point) was indeed the approximate
ordering I read recently in a paper by credible people (may have been
one of the recent ones discussing real-world vs. specced disk MTBFs - or
not). But it's not clear what the service life of motherboards was
considered to be, so it's possible that inclusion of 5+ year old MBs
would have changed the conclusions.
- bill
>On 2007-03-30 23:34:00 +0100, daytripper <day_t...@REMOVEyahoo.com> said:
>
>> No pretensions here: http://www.stratus.com/products/index.htm
>
>Do these machines (or any other HA systems) have much to do with
>desktops (which was the original question, remember)? No.
Yes, it has to do with the thread, specifically the migration of components
used in highly-available systems down to the desktop.
Also, it refutes your notion that something easily replaceable in the field
without disruption could afford to use cheaper, lower-MTBF components.
>> Because you don't get to 6 nines on the cheap, son.
>> No matter what is replaceable once the system's been fired up...
>
>Indeed you do not. And again: what exactly does this have to do with
>the reliability of desktop motherboards? I suggest: nothing.
Ok, so you got confused along the way. Mostly your own doing, as the point has
always been: use cheap, low-MTBF caps and you run a higher risk of a system
failure, compared to the use of high-MTBF components.
For whatever misguided reason, you got caught up in the phrase "sudden system
death", and tried to make some kind of point countering it. I can't imagine
why; a failure that takes down the system is usually sudden, and the passage
of time is no friend to a low-MTBF component used in fairly high quantity...
Cheers
/daytripper
> Ok, so you got confused along the way. Mostly your own doing, as the point has
> always been: use cheap, low-MTBF caps and you run a higher risk of a system
> failure, compared to the use of high-MTBF components.
Perhaps I was not clear enough for you: the issue is whether the lower
reliability of electrolytics on the system boards of desktops (go and
read the original article if you have forgotten) is a significant
factor in desktop reliability. Don't bring up Stratus or other HA
systems in your answer.
Or to put it another way: if I wanted to spend a given amount of money
to make a desktop more reliable, would I do it by buying a motherboard
with no electrolytics or (say) by buying a system with redundant power or
mirrored disks? Don't bring up Stratus or other HA systems in your
answer.
>
>For whatever misguided reason, you got caught up in the phrase "sudden system
> death",
I at no point used the word "sudden". I did use the word "death" to
mean "failure".
Never mind, I'm not going to waste more of comp.arch's time on this idiocy.
Okay then: [Moe to the boys: "Settle down".....'SLAP']
Hey everyone, I didn't post this question as a troll or to start an
argument. It was just an honest question about whether or not this hardware
(ie, solid caps) can make a significant difference in the overall "lifespan"
of a consumer-grade, desktop machine. As I originally posted, my homebuilt
machine (motherboard) is about five years old now, and is starting to
misbehave a bit. I finally found the time to pull the guts out of my
machine the other day only to find that yes, one of the capacitors is indeed
showing signs of leaking and I guess imminent failure. The machine still
works, but I'm definitely now on borrowed time.
I just simply wanted to know if the new solid-cap technology from Gigabyte
was worth the few extra euros (oops, I mean dollars) that I'm sure they will
ask. I find this thread so far quite interesting, but I simply don't
understand all the "mala leche" (bad blood). ;-) Thanks to everyone who has
offered their thoughts.
/TLG
I don't disagree with any of the previous posters, but wish to
point out that this depends entirely on what you mean by "lifespan",
and what the solid caps are replacing -- there have been
several epidemics of bad caps (often unstable isolation oil).
When the choice is between bad caps and solid, then yes, solid
will give longer desktop life. When the choice is between
good electrolytics (not always an oxymoron!) and solid, the life
extension is probably in the out-years, beyond normal lifespan.
> As I originally posted, my homebuilt machine (motherboard)
> is about five years old now, and is starting to misbehave a bit.
Five years is above the historical "lifespan" of most
MS-Windows machines (3-4 usually considered max). OTOH,
I've used the same mobo continuously powered for 8 years.
You may be seeing abnormally bad caps. Try
http://www.badcaps.net
-- Robert
The real problem is not that cheap desktop machines use inferior parts,
but rather that they often run the parts close to or past their ratings. If
you buy a well-designed, corporate-quality pc, you are likely to find
that it's been properly engineered and all the components will be run
well within their ratings. This costs money. You buy cheap and may find
that many of the parts will be just good enough to get the job done. Smaller
fans, lower value / ripple current caps, smaller heatsinks on regulators
etc.
Nothing kills electronics faster than high temperatures, and the cooler
you can keep everything, the better. Electrolytics are especially
sensitive to temperature and tend to dry out. This is exacerbated if
excessive ripple current is causing self-heating. Run well within
ratings, modern electrolytics can have a 10- or 20-year lifetime with no
problems. For example, consumer TV caps are thermally cycled every day,
get quite warm, yet often last for 10 years or more before failure.
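As a rough sketch of the usual rule of thumb (the figures below are generic
illustrations, not from any particular datasheet), electrolytic life is quoted
at the rated temperature and roughly doubles for every 10 degC you run below
it:

# Rule-of-thumb estimate with assumed numbers, not vendor data:
# life roughly doubles per 10 degC below the rated temperature.
rated_life_h = 2000.0      # e.g. a part rated 2000 h at 105 degC
rated_temp_c = 105.0
operating_temp_c = 45.0    # assumed board temperature

life_h = rated_life_h * 2 ** ((rated_temp_c - operating_temp_c) / 10.0)
print(f"Estimated life: {life_h:.0f} h (~{life_h / 8760:.0f} years continuous)")
# -> about 128000 h, i.e. on the order of 15 years if run 24/7

Run the same part at 65 degC instead and the estimate drops to under four
years of continuous operation, which is why cap placement next to hot devices
matters so much.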
Solid electrolytics may have a lower ESR and higher ripple current
rating, which might mean fewer / smaller caps for the same performance,
or the same number for better performance. The benefits are better
supply regulation, lower ripple and added life *only* if the capacitors
are correctly sized to the problem. Since you can't tell what the design
criteria were, it's difficult to decide if their use is about sales
spin, or if they really are trying to build a better product...
Chris
What often helps in such cases is to push all socketed stuff a little
bit (or alternatively, pull it out, and reseat it).
>I just simply wanted to know if the new solid-cap technology from Gigabyte
>was worth the few extra euros (oops, I mean dollars) that I'm sure they will
>ask.
If you are using liquid cooling or some other exotic cooler that does
not provide airflow to the CPU capacitors, electrolytic capacitors get
hotter and age faster than with normal air cooling (especially if you
also overclock); in these circumstances the solid capacitors may have
a significant advantage.
To be honest, I don't know what I mean by "lifespan." For me personally, I
would think it would mean a mobo that can last at least five years before
going belly up. <shrug>
:
: When the choice is between bad caps and solid, then yes, solid
: will give longer desktop life. When the choice is between
: good electrolytics (not always an oxymoron!) and solid, the life
: extension is probably in the out-years, beyond normal lifespan.
Uh, ok. Not entirely sure what you mean here, but I think I get the idea.
:: As I originally posted, my homebuilt machine (motherboard)
:: is about five years old now, and is starting to misbehave a
:: bit.
:
: Five years is above the historical "lifespan" of most
: MS-Windows machines (3-4 usually considered max). OTOH,
: I've used the same mobo continuously powered for 8 years.
That's where I differ. My machine is almost never powered on 24/7. On the
contrary, it gets powered up/down sometimes three or four times a day. I
fully realize that repeated cold starts (vs. running continuously at a constant
state) are **much** harder on electronics, especially HDs and PSUs.
: You may be seeing abnormally bad caps. Try
: http://www.badcaps.net
I KNOW I'm seeing a (singular) bad cap. Didn't you read the final paragraph
of my last post? <snigger> ;-)
/TLG
Been there, done that. But thanks anyway.
:: I just simply wanted to know if the new solid-cap technology
:: from Gigabyte was worth the few extra euros (oops, I mean
:: dollars) that I'm sure they will ask.
:
: If you are using liquid cooling or some other exotic cooler
: that does not provide airflow to the CPU capacitors,
: electrolytic capacitors get hotter and age faster than with
: normal air cooling (especially if you also overclock); in
: these circumstances the solid capacitors may have a
: significant advantage.
Just using "normal", standard run-of-the mill cooling here. I DO NOT
overclock...that's either for dummy's or people who don't care about
reliability. I mean, Geez! With so much horsepower available nowadays, why
on earth would anyone consider OC their machine with it's concomitant
instabilities??
/TLG
>Hey everyone, I didn't post this question as a troll or to start an
>argument. It was just an honest question about whether or not this hardware
>(ie, solid caps) can make a significant difference in the overall "lifespan"
>of a consumer-grade, desktop machine. As I originally posted, my homebuilt
>machine (motherboard) is about five years old now, and is starting to
>misbehave a bit. I finally found the time to pull the guts out of my
>machine the other day only to find that yes, one of the capacitors is indeed
>showing signs of leaking and I guess imminent failure. The machine still
>works, but I'm definitely now on borrowed time.
>
>I just simply wanted to know if the new solid-cap technology from Gigabyte
>was worth the few extra euros (oops, I mean dollars) that I'm sure they will
>ask.
IMO, the most important characteristics of capacitors used in
switchmode PSUs are their equivalent series resistance (ESR) and
temperature rating. In fact the most important tool in my toolkit,
even more useful than a DMM, is my ESR meter. A capacitor with high
ESR will experience ohmic heating by ripple currents, resulting in
premature failure. The "solid" caps are claimed to have half the ESR
of typical high grade low ESR aluminium electrolytics.
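A quick back-of-the-envelope comparison (the ripple current and ESR values
below are assumed, purely for illustration) shows why halving the ESR matters
for self-heating:

# Illustrative sketch, assumed values only (not from any datasheet):
# ohmic self-heating in a VRM cap is roughly I_ripple^2 * ESR.
i_ripple_rms = 3.0        # A rms of ripple through one cap (assumed)
esr_electrolytic = 0.030  # ohms, a typical low-ESR aluminium electrolytic
esr_polymer = 0.012       # ohms, a typical solid polymer part

for name, esr in (("electrolytic", esr_electrolytic), ("polymer", esr_polymer)):
    p_heat = i_ripple_rms ** 2 * esr
    print(f"{name}: I^2 * ESR = {p_heat * 1000:.0f} mW of self-heating")

Roughly 270 mW versus 110 mW per cap in this made-up case; less self-heating
means a cooler, longer-lived part for the same ripple duty.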
See the range of "Aluminum, Organic Semiconductor" types here:
http://www.vishay.com/capacitors/aluminum/radial/
Here is one 105degC OS-CON example:
http://www.vishay.com/docs/90009/94sv.pdf
It boasts "approximately two times the capacitance of existing
capacitors and less than half the ESR".
Here is a tech page that talks about a popular ESR meter kit:
http://members.ozemail.com.au/~bobpar/esrmeter.htm
This page describes the mechanism of capacitor failure:
http://members.ozemail.com.au/~bobpar/esrtext.htm
- Franc Zabkar
--
Please remove one 'i' from my address when replying by email.
>For whatever misguided reason, you got caught up in the phrase "sudden system
>death", and tried to make some kind of point countering it. I can't imagine
>why; a failure that takes down the system is usually sudden, and the passage
>of time is no friend to a low-MTBF component used in fairly high quantity...
>
>Cheers
>
>/daytripper
Aluminium electrolytic capacitor failures do not usually result in
"sudden system death". I have replaced *thousands* of electrolytics in
all manner of equipment and the prevailing failure mode is an
intermittent one, or a thermal one. Marginal capacitors often come
good after being allowed to warm up. You can sometimes see this on an
old TV where the image suffers from reduced height and has retrace
lines at the top until the set stabilises. This is caused by dried out
caps in the vertical deflection circuit.
An example of caps that *do* fail suddenly and catastrophically are
tantalum electrolytics. These often go short circuit and/or catch
fire.
Thanks for the links, Franc. I'll get to reading them today.
/TLG
Ouch! I don't power down even MS-Windows machines (which need frequent
reboots to recover memory leaks) unless the anticipated powerdown
is greater than 8 hours. If you want to save energy/environment,
please don't suboptimize: consider life-cycle costs.
> : You may be seeing abnormally bad caps. Try :
> http://www.badcaps.net
> I KNOW I'm seeing a (singular) bad cap. Didn't you read
> the final paragraph of my last post? <snigger> ;-)
Certainly I read. I think you should check out your mobo to be
able to put your current experience into context. Is it one of the
known bad-actors? Then a normal "good" mobo should be sufficient.
Is it not on the list? Then either it's a random failure _or_ your
usage is extreme, and you'd really best get a mobo with solid caps.
-- Robert
Do you have any data supporting your insinuation that the life-cycle
costs are higher with his usage model than with yours? From my
experience, I doubt it. Power is a significant cost during the
lifetime of a PC (especially in an air-conditioned environment),
whereas failures of PCs that are powered on and off frequently are
rare in my experience.
Strangely, in my experience machines that are generally powered on all
the time seem to be more likely to fail when power cycled.
Since almost *all* electronic device failures occur during power
transitions, your observation is not "strange" at all. A power
cycle is stressful in two ways: voltage/current surges, and
temperature cycling. The latter is the more serious in the
long term but sets up the conditions for failure due to the
former.
Recall that failures of modern electronics (other than "infant
mortality") are extremely rare. Devices which are power-cycled
frequently are just as likely to fail during a power cycle as
devices that are powered on all the time, but in the latter case
the power cycle itself is also "rare", and thus our psychological
tendencies towards "trigger induction" tend to cause us to notice
the correlation more than when power-cycling is a common event.
Severe temperature cycling is also stressful even when the
device is powered *off*, as when a laptop is repeatedly carried
from an air-conditioned office to the trunk of a car that's
left in the sun, and vice-versa. [Note: Exposure of lithium-ion
batteries to temperatures higher than ~45 C (113 F) is *VERY*
bad for them, still another reason to *not* leave your laptop
in the car on a hot day!!]
-Rob
p.s. Even non-"electronic" electrical components tend to fail
upon power-cycling, especially power-on. Consider incandescent
lightbulbs, for which almost *all* failures occur during the
power-on surge.
-----
Rob Warnock <rp...@rpw3.org>
627 26th Avenue <URL:http://rpw3.org/>
San Mateo, CA 94403 (650)572-2607
Rob Warnock wrote:
>Since almost *all* electronic device failures occur during power
>transitions
Evidence, please.
It's a fact - current and sometimes voltage surges on power-up stress the
components to a greater extent than during normal operation. This
especially applies to electrolytics, which must withstand a high surge
charge current on power-up. Under these conditions, the current is
limited by the rise time of the power rail and circuit + capacitor
internal resistance + whatever other limiting device the manufacturer
has included in the design. This may be a moot point for small-value
caps, where the current tends to be self-limiting, but not the case in a
typical cheap 'n cheerful pc switchmode psu, where the parts are much
more highly stressed. The military and high-end computer vendors have
always been aware of these factors, which is why (for example) quality
kit always has what appear to be oversized power supplies and lots of
fans, compared to a cheap pc.
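A crude worked example (all values assumed, and real supplies add NTC
thermistors or soft-start circuits to tame this) gives a feel for the scale of
the first-cycle surge into a bulk capacitor:

# Back-of-the-envelope sketch with assumed values: worst-case first-cycle
# inrush is roughly the peak rail voltage over the total series resistance.
v_peak = 325.0    # V, peak of rectified 230 V mains
r_series = 2.5    # ohms: cap ESR + wiring + whatever limiter exists (assumed)

i_inrush_peak = v_peak / r_series
print(f"Worst-case inrush peak ~ {i_inrush_peak:.0f} A")   # ~130 A for an instant

Even if the real peak is a fraction of that, it dwarfs anything the capacitor
sees in steady-state operation, which is the point about power-up stress.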
As for evidence, it's basic electronics theory - but feel free to check
with Google. Check out "thermal cycling", "power surge" and associated
topics on the MTBF of components...
Chris
--
----------------------
Greenfield Designs Ltd
Electronic and Embedded System Design
Oxford, England
(44) 1865 750 681
I just checked Google, and found nothing there (or in your post)
that is evidence of the "almost *all* electronic device failures"
claim. You mention the specific cases of electrolytics and pc
switchmode PSUs (you could have added light bulbs), which implies
that there are other cases that don't "almost always" fail only during
power transitions.
While thermal cycling is an effect of power cycling, it is only
so in components that use significant power. An all-CMOS
design with a very slow clock doesn't heat up much compared
to variations in room temperature. Also, why would you think
that the thermal-cycling-caused failures only occur during power
transitions?
Do always-on LEDs last almost forever while the same LEDs in
flashing mode fail quickly?
Please see the papers listed on Google or another competent
web search engine under "power cycling failure".
Fast cycling is an industry standard method of accelerated
aging/failure/lifetime estimation.
-- Robert
Another source of failure is aging because of moisture creep along the
plastic/metal interface of the chip packages, which happens when the device
cools down after power-off. Such moisture creep is the reason why the parts
should be kept in a desiccated container before soldering. Fast heating
would cause great stress to that interface because of moisture
evaporation.
>
> I just checked Google, and found nothing there (or in your post)
> that is evidence of the "almost *all* electronic device failures"
> claim. You mention the specific cases of electrolytics and pc
> switchmode PSUs (you could have added light bulbs), which implies
> that there are other cases that don't "almost always" fail only during
> power transitions.
>
> While thermal cycling is an effect of power cycling, it is only
> so in components that use significant power. An all-CMOS
> design with a very slow clock doesn't heat up much compared
> to variations in room temperature. Also, why would you think
> that the thermal-cycling-caused failures only occur during power
> transitions?
>
> Do always-on LEDs last almost forever while the same LEDs in
> flashing mode fail quickly?
>
I'm sorry, but I don't have time to do the research for you for stuff that
I know to be true. I have already tried to explain the power-on surge
failure mechanism. While some pc components don't draw much power,
major components like CPUs, chipsets etc. do, CMOS or not. Devices are
often not only plastic encapsulated, but are mounted on low cost glass
fibre or similar ball grid array substrates, and they get quite hot. The
various parts can have widely different coefficients of expansion, which
mechanically stresses soldering and chip bonds every time the machine is
power / thermally cycled. The thermal cycling also creates voids between
plastic and metal, which allows contaminants to enter the device. The
plastics are much better than they used to be, but still don't provide a
good hermetic seal long term under adverse conditions. This is why mil-
and aerospace-qualified devices are often expensive solder-brazed
ceramic or glass-metal-seal encapsulated. They really don't spend this
money for no reason.
For another example, domestic CRT-type TV sets, despite frequent power
on/off, are actually quite reliable. However, the most common failures
are soldered joints / dry joints / solder embrittlement due to the
mechanical stress caused by thermal cycling, or electrolytics drying out
and failing due to proximity to some hot device. Such problems are rare
on cooler parts of the board.
Modern electronics run well within electrical ratings have an essentially
indefinite life if the temperature is kept low and constant. Anything
else increases the failure rate...
Just to be a bit contrarian...
The recent Google paper on disk drive reliability ("Failure Trends in
a Large Disk Drive Population") has a few interesting tidbits. Mind
you I have a few problems with the research as presented, but I think
some of the gross trends are likely to be in line with what they
present. They have a few specific topical observations:
First, within the first couple of years of life of a disk drive they
found no correlation between power cycle counts and failures. And
while they did see a correlation for older drives, they had some
caveats. Second, they did *not* find a correlation between operating
temperature and failures (obviously they weren't running any drives at
200C, but still most of us would have expected some correlation, even
within the "normal" operating temperature range). Third, they also
did *not* find a correlation between usage (IOW, net IOPS) and
failures.
All three are unexpected results, especially the latter two, which
also specifically contradict manufacturer published data. OTOH, that
the published disk drive reliability data is at significant variance
with real world performance has been suspected/understood for a long
time.
The other recent paper on the subject (USENIX, "Disk Failures in the
real world: What does an MTTF of 1,000,000 hours mean to you?"), is
IMO a better bit of research, and matches the overall conclusions of
the Google paper, but doesn't break down the failures in the same way.
...
> First, within the first couple of years of life of a disk drive they
> found no correlation between power cycle counts and failures. And
> while they did see a correlation for older drives, they had some
> caveats.
Well, one of the caveats was that the server disk population they
studied just didn't do much power cycling at all: they were powered
almost all the time. Since IIRC modern disks are designed to survive
50K power cycles (which works out to more than once every hour during a
standard 5-year service life), one would not expect to see any
noticeable failure rate (let alone one sufficient to form any real
correlation) when studying drives that only underwent a few tens or at
most hundreds of power cycles during their lifetime.
To evaluate the rule of thumb that electronic devices tend to fail
during power cycles, one really would need to study a population where
power cycles were far more frequent (both in an absolute sense and with
respect to design values).
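A quick sanity check of that arithmetic (using the 50K design figure exactly
as recalled above, so treat it as an assumption):

design_cycles = 50000
service_hours = 5 * 365 * 24               # a 5-year service life in hours
print(design_cycles / service_hours)       # ~1.14 cycles per hour of budget
print(300 / design_cycles)                 # a few hundred real cycles = 0.6% of it

So a server drive that sees only hundreds of power cycles has consumed well
under one percent of its nominal cycle budget, which is why no correlation
would be expected to show up.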
> Second, they did *not* find a correlation between operating
> temperature and failures (obviously they weren't running any drives at
> 200C, but still most of us would have expected some correlation, even
> within the "normal" operating temperature range).
That certainly used to be the conventional wisdom (the doubling of
failure rate for a 15C temperature rise came from vendor studies IIRC).
However, to state that they found no correlation is incorrect: they
did find a correlation with drives toward the end of their service life
operating in the higher portion of the permitted temperature range (IIRC
the normal design limit is 55C or 60C).
This again is not all that surprising: if temperature-related failures
are the result of accumulated deterioration (as seems intuitively
reasonable), then the older the drive and the higher the temperature,
the greater the accumulation. So given that the drives are designed to
tolerate temperatures up to 55C or 60C, the fact that little difference
is seen between operation, say, at 25C vs. 40C is somewhat less than
astonishing (if one assumes that things like temperature-related
lubrication breakdown may rise steeply as one nears the design limit).
The high early death rate in drives operating at unusually *low*
temperatures could be due to lubrication operating below its design
point (and thus placing unanticipated stress on components, such that they
either failed early or not at all). At least (as a non-specialist in
this area, but with a lifetime's experience with automotive and similar
lubricants) I don't find it unlikely that the band of optimal operation
could be fairly narrow here for such a sensitive, high-performance
component.
> Third, they also
> did *not* find a correlation between usage (IOW, net IOPS) and
> failures.
Again, that statement is incorrect: they found at least moderate
correlation in both very young disks and in disks near the end of
their service life. And, again, there is a potentially easy explanation
for this: the parts which are marginal in tolerating the stress get
weeded out early, and the parts which are marginal in tolerating wear
fail near the end.
And as in the case of power cycles, Google's access patterns tend not to
stress drives the way manufacturers stress them: IIRC Google accesses
are typically fairly large, rather than small and seek-intensive, so they
place not that much more stress on a drive than just sitting there
spinning does. With something more like an OLTP workload (which
high-end disks are supposedly optimized for) the results might be quite
different.
...
> The other recent paper on the subject (USENIX, "Disk Failures in the
> real world: What does an MTTF of 1,000,000 hours mean to you?"), is
> IMO a better bit of research, and matches the overall conclusions of
> the Google paper, but doesn't break down the failures in the same way.
In particular, it says nothing regarding the three suggestions that you
made above - though it does lend support to my suggestions above, with
respect to both the temperature and activity correlations, that wear is a
factor even within a disk's nominal lifetime (i.e., that the floor of
the traditional 'bathtub' failure graph has a significant upward slope
to it).
The most glaring deficiency of the CMU paper IMO is its failure to
characterize the workloads studied: if they're not
small-access-seek-intensive, then the fact that SATA drives fared just
about as well as high-end drives does not seem surprising at all.
The most important single observation to take away from both papers is
that manufacturer AFR specs understate real-world failure rates by
factors of 2x - 10x (possibly even more for high-end disks if none of
the workloads were truly seek-intensive, though one would hope that this
would not affect failure rates of high-end disks all that much since
that's the environment they're supposedly designed to serve; conversely,
it would not be fair to criticize any additional effect upon SATA
failure rates - save for Raptors - because they're *not* designed to
survive in that environment, though it would be very useful to know with
respect to whether substituting them for high-end disks is suitable in
such environments).
- bill
They found one: There was a temperature range (IIRC around 40 degrees
C) with the lowest failure probability; much cooler drives had a
larger failure probability; they did not have much warmer
drives in their population, but there the failure probability seemed
to rise, too.
You seem to have misunderstood my sentence. What I meant is that they
seem to be more likely to fail on power cycling than machines that are
power cycled regularly; like, a failure for every few hundred power
cycles for these nearly-always-on machines, whereas maybe one for every
several thousand power cycles for my home machines.
Also, I have seen a number of machine failures during power-on
conditions, from stuck fans to dead RAM and dead disks.
>A power
>cycle is stressful in two ways: voltage/current surges, and
>temperature cycling. The latter is the more serious in the
>long term but sets up the conditions for failure due to the
>former.
OTOH, power-on hours also take their toll on the electronics. For
semiconductors, electromigration will eventually make the device fail;
for electrolytic capacitors, the liquid tends to evaporate faster with
higher temperature, and they are hotter when the machine is powered
on.
Over 40 years of experience in the industry, designing and
manufacturing computers & related devices, mainly, but also
using them.
Note, however, that I was *not* referring to devices with a
high component of mechanical motion such as fans or disk drives
[as another branch of this thread seemed to get fixated on].
I was referring primarily to printed circuit boards and the
non-moving parts normally mounted on them, such as chips,
capacitors, resistors, etc.
-Rob
Indeed so. In the worst case, this can cause such failures as
the top of a chip literally blowing off during wave-soldering.
-Rob
No, I understood your sentence completely, but disagreed with it.
You, in turn, seem to have missed the meaning of my reply, which
is that I considered your sentence/observation to be a false value
judgement (due to the trigger induction effect). You see, it really
doesn't *matter* if a nearly-always-on machine fails after "a few
hundred power cycles" *if* [as is likely] the mean *TIME* before
that failure is much, much *larger* than the mean time before failure
of the machine [such as your home machine] that is being power-cycled
"thousands" of times.
To say it yet another way, the "fraction of power cycles which result
in failures" is *NOT* (usually) an important value metric!! What's
(usually) important is the mean *TIME* to failure. And for a large
number/style of systems in a wide range of conditions, frequent power
cycling generally tends to result in less total time before failure.
There are exceptions, of course, at the extremes of operating regimes.
If you only turn your system on once a year for one hour per session,
it might very well last longer than my machines which run continuously.
But for myself, personally, I consider that operating regime to be
nearly useless. YMMV.
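A toy calculation (every number below is invented purely to illustrate the
point) shows why the per-cycle failure fraction alone is misleading:

# Assumed, illustrative rates only - not measurements of real machines.
always_on_cycles_per_day = 1 / 30.0   # cycled about once a month
always_on_fail_per_cycle = 1 / 300.0  # fails once per few hundred cycles

home_cycles_per_day = 4.0             # cycled several times a day
home_fail_per_cycle = 1 / 3000.0      # fails once per few thousand cycles

def mean_days_to_failure(cycles_per_day, fail_per_cycle):
    return 1.0 / (cycles_per_day * fail_per_cycle)

print(mean_days_to_failure(always_on_cycles_per_day, always_on_fail_per_cycle))
print(mean_days_to_failure(home_cycles_per_day, home_fail_per_cycle))

With these made-up figures the "worse per cycle" always-on machine averages
roughly 9000 days between failures, while the frequently-cycled machine
averages about 750 - which is exactly the distinction between failure
fraction per cycle and mean time to failure.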
+---------------
| Also, I have seen a number of machine failures during power-on
| conditions, from stuck fans to dead RAM and dead disks.
+---------------
Uh... I didn't disagree with that. Whether frequently or infrequently
power-cycled, the most likely moment for a failure is during a
power-on event [and the second-most-likely is a power-off event].
But also see my other reply, where I mentioned that I wasn't
especially addressing devices with a high component of mechanical
motion or stress. Those have much more complex tradeoffs between
frequency of power-cycling, on-time per session, total on-time,
and MTBF. [Bearings dry out while running; seals dry out while
*not* running; "stiction" is aggravated by long "off" times; heads
can crash from overly-*short* "off" times (power flicks); etc.]
+---------------
| >A power cycle is stressful in two ways: voltage/current surges,
| >and temperature cycling. The latter is the more serious in the
| >long term but sets up the conditions for failure due to the
| >former.
|
| OTOH, power-on hours also take their toll on the electronics.
| For semiconductors, electromigration will eventually make the
| device fail;
+---------------
True, but for devices manufactured since the phenomenon of
electromigration was understood and (somewhat) mitigated against,
thermal cycling is probably a more serious stressor. The worst
case -- when both are active -- is a 95% "on" time with frequent
"off" cycles just long enough for the system to cool off.
+---------------
| for electrolytic capacitors, the liquid tends to evaporate faster
| with higher temperature, and they are hotter when the machine is
| powered on.
+---------------
Also somewhat true, though any machine which is operated at such
a high temperature that this is a dominant effect is in serious
danger from failing for many *other* reasons!! ;-} Under normal
"room-temperature" conditions with circuits designed not to overly-
stress the capacitors' A.C. current-handling rating [which is IME
a more-serious source of high temps in electrolytics than ambient
temp per se], thermal cycling is more likely to be the cause of
failure, e.g., by the cracking of seals or solder joints.
Though I will agree with you completely about those old *huge*
electrolytics that used to be used in large-computer linear
power supplies [back before switchers!], e.g., the 100,000 uF
16 V caps that were used in the DEC PDP-10 supplies. I used to
help run a PDP-10 that was in a small closed room with barely
adequate air conditioning. If the air conditioning failed --
which it sometimes did in the hot Atlanta summers -- and the
PDP-10 got hot enough for the internal thermal sensors to shut
the machine down, then, just like clockwork, about two weeks
(yes, weeks!) after the air conditioning outage one of those
monster caps would suddenly explode, spraying electrolyte all
over the place [and resulting in hours of DEC Field Circus time
to get it fixed]. We were convinced the failures were due to
the caps having dried out during the HVAC outage, causing the
A.C. impedance to go up, which caused the caps to continually
overheat later, which accelerated the aging in a runaway cycle
until... *BOOM*!
But modern desktop system mainboards don't contain those sorts
of "wet cell" electrolytics any more.
-Rob
Not always. In some circumstances, the most likely moment is a
power-yer-whaa? event. Spikes and phase jumps are not good news,
and nor are even brown-outs and very short-term power drops.
But that is just nitpicking :-)
|> But modern desktop system mainboards don't contain those sorts
|> of "wet cell" electrolytics any more.
Nor do most other devices. Thank heavens.
Regards,
Nick Maclaren.
ChrisQuayle wrote:
>
>m...@privacy.net wrote:
>> Do always-on LEDs last almost forever while the same LEDs in
>> flashing mode fail quickly?
>
>I'm sorry, but I don't have time to do the research for you for stuff that
>I know to be true.
In other words, you know that always-on LEDs and flashing LEDs have
about the same lifetime, and yet you won't admit that the "almost
all electronic device failures occur during power transitions"
claim is false in the case of LEDs. And diodes. And resistors.
And inductors. And electrolytic capacitors that aren't subjected
to currents that exceed their ratings.
>While some pc components don't draw much power, major components
>like CPUs, chipsets etc. do, CMOS or not.
Rob Warnock's claim was (exact quote) "almost *all* electronic
device failures occur during power transitions." You are defending
another claim that he didn't make -- that *some* electronic device
failures occur during power transitions. Nobody is disputing that.
>Devices are often not only plastic encapsulated, but are mounted on
>low cost glass fibre or similar ball grid array substrates
Again you are talking about some, not all. There are many electronic
devices that are neither plastic encapsulated nor mounted on glass
fibre ball grid array substrates.
Please address the actual claim that Rob Warnock made, not some other,
easier to prove claim.
Rob Warnock wrote:
>
><m...@privacy.net> wrote:
>+---------------
>| Rob Warnock wrote:
>| >Since almost *all* electronic device failures occur during power
>| >transitions
>|
>| Evidence, please.
>+---------------
>
>Over 40 years of experience in the industry,
That's not evidence.
Has it been your experience that flashing LEDs fail more
quickly than always-on LEDs? That resistors, diodes, etc.
that aren't at the edges of the rating envelope fail much
more quickly when subjected to power on/off cycles?
>designing and manufacturing computers & related devices,
>mainly, but also using them.
That's not evidence. And "computers" are a small subset of
"all electronic devices."
>Note, however, that I was *not* referring to devices with a
>high component of mechanical motion such as fans or disk drives
>[as another branch of this thread seemed to get fixated on].
One might argue that those are also part of "all electronic
devices", but I am inclined to assume that you meant pure
electronic rather than electromechanical. Clearly they have
differing failure mechanisms.
>I was referring primarily to printed circuit boards and the
>non-moving parts normally mounted on them, such as chips,
>capacitors, resistors, etc.
A 1/4 watt resistor that dissipates a constant 1/16 watt
(many resistors in opamp circuits dissipate far less than
that) will last about as long as one that cycles from 1/16
watt to 0 watts and back again. And when the latter does
fail it will be at a random time, not at the exact moment
the power transition happens.
Rob Warnock wrote:
>Rob Warnock writes:
>| >Since almost *all* electronic device failures occur during power
>| >transitions.
>for a large number/style of systems in a wide range of conditions,
>frequent power cycling generally tends to result in less total
>time before failure.
The latter claim is true. The former is false.
Evidence, please. You cannot hold others to standards
without first complying with them yourself.
Furthermore, what evidence do you have that resistor
failures cause most device failures? They are easy to
find, but finding a failed resistor doesn't mean it caused
the failure. Resistor burnout is just as likely to have
resulted from some other initiating cause.
-- Robert
This is borne out by data from network operations of a large
organisation I cannot name without checking with them.
Even having room temperatures of 28 degrees C for years on end
does not seem to have affected equipment in statistically significant
ways, as long as the equipment was properly serviced and did not
age too badly.
There were strong correlations between model/batch numbers and failures.
Bad production series are a very real problem, and were the only significant
non-abuse hardware problem for young equipment (< 5 years old).
This especially applies to disks and power supplies. Fans are also
an issue.
At the 5-year mark, power and restart events started to show up
with hardware failures. Power+disk failures dominated to the point
of masking out other events.
But some batches just kept going; 10-12 years of continuous service
was not uncommon.
>All three are unexpected results, especially the latter two, which
>also specifically contradict manufacturer published data. OTOH, that
>the published disk drive reliability data is at significant variance
>with real world performance has been suspected/understood for a long
>time.
Amen!
>The other recent paper on the subject (USENIX, "Disk Failures in the
>real world: What does an MTTF of 1,000,000 hours mean to you?"), is
>IMO a better bit of research, and matches the overall conclusions of
>the Google paper, but doesn't break down the failures in the same way.
I have seen my share of cascading failures in RAID systems. If more
than one disk in a batch fails over a period of a month, change them all.
-- mrr
>
> In other words, you know that always-on LEDs and flashing LEDS have
> about the same lifetime, and yet you won't admit that the "almost
> all electronic device failures occur during power transitions."
> claim is false in the case of LEDs. And diodes. And resistors.
> And inductors. And electrolytic capacitors that aren't subjected
> to currents that exceeed their ratings.
>
I don't know where LEDs came into this discussion, perhaps a point that
you were trying to make, but I don't remember mentioning them at all.
However, if you accept that chip bonds are a significant failure
mechanism, then LEDs would be expected to be more reliable as a device
than a device with 144 pins, because you have only 2 instead of 144 chip
bonds. LEDs typically don't dissipate much power either, so there is
little thermal cycling anyway, and they may be a poor example to illustrate
the point you were trying to make.
>
> Rob Warnock's claim was (exact quote) "almost *all* electronic
> device failures occur during power transitions." You are defending
> another claim that he didn't make -- that *some* electronic device
> failures occur during power transitions. Nobody is disputing that.
No, not defending, just explaining how such failure mechanisms work. If
you don't understand that or disagree, then do the research to put
numbers on it. I'm quite happy to accept that power-on surge and thermal
cycling are significant failure mechanisms, but I really don't have time
for semantic nit-picking over nearly all, all, half, whatever. If you
think you are right - i.e., not nearly all - then contribute to the sum of
knowledge by getting some numbers to support your theory and post back
here. To save confusion, perhaps we should start by defining what
"nearly all" means :-).
>
> Please address the actual claim that Rob Warnock made, not some other,
> easier to prove claim.
>
What was this "easier to prove claim"? In fact, I'm not arguing with
his conclusions at all, and though I don't have any direct evidence re
"nearly all failures", my own experience of several decades in
electronic design would suggest that power-on surge failure comes well
up the list of possible failure modes. Different semantics perhaps, but
the same conclusion to a designer.
So, how do you go about doing the research? You need a large sample
size to get meaningful results. Do you have a large number of different
types of kit that you can afford to lose, or the time to test to
failure? Sometimes, reports from a large number of individuals with extensive
experience in the field are about as close as you can get, unless you are
suggesting that there is a hidden agenda here or a conspiracy to distort
the truth. If you think this, what do you think the agenda is?...
Robert Redelmeier wrote:
>Furthermore, what evidence do you have that resistor
>failures cause most device failures?
A resistor is an electronic device.
If you want to make an argument about the failure of
electronic systems instead of "all electronic devices"
please call them electronic systems, not electronic
devices. Note that the original "all electronic devices"
claim was in the context of a discussion of capacitors,
which are devices, not systems.
Many electronic devices together make up an electronic system.
If the subset of devices that tend to fail at the moment
of power cycling fail far more often than the subset
that does not, then the system itself will reflect the
failure characteristics of the fail-on-power-cycle subset.
That seems like a very real possibility and a reasonable
argument. But Rob Warnock's claim (exact quote: "almost
*all* electronic device failures occur during power
transitions") is quite simply not true.
I see, you wish to argue semantics and terminology.
I'll indulge you, briefly:
First, a resistor is not an "electronic device"; it is an electrical
part. Electronics is a term reserved for semiconductors (perhaps
including tubes), where the unique behaviour of electrons [holes]
(as opposed to current) is controlled.
Second, a device is not a single simple part. Devices are
more complicated than parts but less complex than systems.
> But Rob Warnock's claim (exact quote: "almost
> *all* electronic device failures occur during power
> transitions") is quite simply not true.
Oh, in what way? Please note, the clock may be considered
a power transition as well, especially in the predominant
CMOS technology.
-- Robert
Not at the company where I work. Nor at the school from which I
graduated.
>
> Second, a device is not a single simple part. Devices are
> more complicated than parts but less complex than systems.
Again, not in my neck of the woods. Ever hear of the "one device cell"?
I know you have.
Where is it you are located?
>
>> But Rob Warnock's claim (exact quote: "almost
>> *all* electronic device failures occur during power
>> transitions") is quite simply not true.
It may or may not be true depending on the device and its application.
>
> Oh, in what way? Please note, the clock may be considered
> a power transition as well, especially in the predominant
> CMOS technology.
Oh please.
A place that is only now, slowly and grudgingly, switching
from relay logic to semiconductors for critical safety systems.
Welcome to my world.
-- Robert
That has been my experience also. Nothing was more dreaded by our sysadmin at
the university than the biannual tests of the emergency power supply to the
building: because this meant turning off all power, not only did he have to
shut down and power down the whole network, he could also count on, on average,
about three PSUs (out of several hundred) dying in the process.
Jan