
Do SSD drives really fail a lot ?


Lynn McGuire

May 3, 2011, 11:30:46 AM
Do SSD drives really fail a lot ?
http://www.codinghorror.com/blog/2011/05/the-hot-crazy-solid-state-drive-scale.html

"… I feel ethically and morally obligated to let you in on a
dirty little secret I've discovered in the last two years of
full time SSD ownership. Solid state hard drives fail. A lot.
And not just any fail. I'm talking about catastrophic,
oh-my-God-what-just-happened-to-all-my-data instant gigafail.
It's not pretty. "

Lynn

Don Phillipson

May 3, 2011, 6:11:16 PM
"Lynn McGuire" <l...@winsim.com> wrote in message
news:ipp73a$oo4$1...@dont-email.me...

LM omitted from the next page:
"Solid state hard drives are so freaking amazing performance wise, and the
experience you will have with them is so transformative, that I don't even
care if they fail every 12 months on average! I can't imagine using a
computer without a SSD any more; it'd be like going back to dial-up internet
. . . "


--
Don Phillipson
Carlsbad Springs
(Ottawa, Canada)


Arno

May 3, 2011, 10:14:56 PM

> "? I feel ethically and morally obligated to let you in on a


> dirty little secret I've discovered in the last two years of
> full time SSD ownership. Solid state hard drives fail. A lot.
> And not just any fail. I'm talking about catastrophic,
> oh-my-God-what-just-happened-to-all-my-data instant gigafail.
> It's not pretty. "
> Lynn

It depends on your usage pattern and the SSD. Failure rate is
a designed feature with SSDs, i.e. the manufacturers know pretty
well how much writing an SSD can take. By choosing the
wear-leveling scheme and the amount of spare capacity, they
effectively design the write load at which a drive dies. In the
beginning this process is shaky, though, and whole drive series
can have worse reliability.
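
For illustration, here is a rough back-of-the-envelope endurance
estimate. All the figures are made-up assumptions for the sake of
the example, not from any particular datasheet:

  # Rough SSD wear-out estimate; every figure here is an
  # illustrative assumption, not a real datasheet value.
  capacity_gb         = 120     # user-visible capacity
  pe_cycles           = 3000    # assumed program/erase cycles per cell
  write_amplification = 1.5     # extra internal writes from wear-leveling/GC
  host_writes_gb_day  = 20      # assumed average daily write load

  total_writable_gb = capacity_gb * pe_cycles / write_amplification
  lifetime_years    = total_writable_gb / host_writes_gb_day / 365.0
  print("wears out after about %.1f years" % lifetime_years)
  # -> about 33 years in this (optimistic) example; a heavier write
  #    load or higher write amplification shrinks it quickly.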

The typical reliability design goal is a 5% failure rate
per year for an average usage pattern. Consumers are willing
to tolerate that. That is a real failure rate, but it is
not "all the time". There are people who think that because
SSDs are not susceptible to mechanical damage, they can do
without backups. Those people will lose their data, no matter
what storage medium it is on, until the day comes when no money
can be saved by aiming for that 5% and reliability slowly goes up.

That said, I think the Coding Horror person (who has some
pretty nice things about coding on his blog) has a census of
mostly early models. These, like any new technology, have
increased failure rates, as the manufacturers try to aim
for that 5%/year but make mistakes in the process. It could
also just be a statistical anomaly.

There is one additional thing: SSDs are susceptible to
heat, just like any other electronics, and to bad power.
It is possible that the guy with the 8 of 8 dead drives
just killed them by overheating or by voltage spikes
from a cheap/bad PSU. For heat, the rule of thumb for
semiconductors is half the lifetime for every 10C of
temperature increase, and it works pretty well. I have seen
it several times now, once on a 22-unit network card sample.
As SSDs contain power circuitry, some parts of them run much
hotter (step-up regulators converting 5V to the needed write
voltage), and a lifetime of 5 years is typically calculated
at 40C environmental temperature. Run them at 60C and you get
1.25 years average lifetime. Another example: memory and logic
chips have something like 30 years at 25C (a figure from a very
old Intel databook). Run them at 65C and you get around 2 years
lifetime. That means you get the first failures (depending on
sample size) after 1-1.5 years, and after 3 years most are
dead. This, incidentally, was my initial measurement and
prediction for the 22 network cards, and it is what happened.
Note that high-performance CPUs are different, as they are
designed more like power semiconductors. But chipsets are not.
I have seen several fail from inadequate cooling in 1-3 years.
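
The 10C rule of thumb is easy to write down; a small sketch using
the example figures from above:

  # Rule of thumb: lifetime halves for every 10C of temperature
  # increase above a reference point.
  def lifetime_years(ref_years, ref_temp_c, temp_c):
      return ref_years * 2 ** ((ref_temp_c - temp_c) / 10.0)

  print(lifetime_years(5, 40, 60))   # SSD: 5 years at 40C -> 1.25 years at 60C
  print(lifetime_years(30, 25, 65))  # logic chip: 30 years at 25C -> ~1.9 years at 65C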

There is one other effect at work here: A lot of people
expected SSDs to be much more reliable than HDDs.
They are not, in general; see above. This can lead
to disappointment and hence to overstatement of the problem.

Altogether, I don't believe we are seeing more than
early-adopter problems, and they are always the same.
Also, there are certainly cheap SSDs and better
SSDs, just as always, and it is possible to treat SSDs
well or badly.

Arno
--
Arno Wagner, Dr. sc. techn., Dipl. Inform., CISSP -- Email: ar...@wagner.name
GnuPG: ID: 1E25338F FP: 0C30 5782 9D93 F785 E79C 0296 797F 6B50 1E25 338F
----
Cuddly UI's are the manifestation of wishful thinking. -- Dylan Evans

Franc Zabkar

May 16, 2011, 8:43:49 PM
On Tue, 03 May 2011 10:30:46 -0500, Lynn McGuire <l...@winsim.com> put
finger to keyboard and composed:

The most common reason for failure (90%) in flash drives appears to be
translator corruption (damaged lookup tables), especially if the power
fails while the translator is being updated. Afterwards the drive
powers up in safe mode with a very small capacity.

What are the Flash drives' typical failures [Public Forum]:
http://www.salvationdata.com/forum/topic1873.html

I suspect that SSDs may be similarly affected. Perhaps that's why some
newer models have large super capacitors for power backup.

- Franc Zabkar
--
Please remove one 'i' from my address when replying by email.

Arno

May 17, 2011, 12:06:43 AM
Franc Zabkar <fza...@iinternode.on.net> wrote:
> On Tue, 03 May 2011 10:30:46 -0500, Lynn McGuire <l...@winsim.com> put
> finger to keyboard and composed:

> The most common reason for failure (90%) in flash drives appears to be
> translator corruption (damaged lookup tables), especially if the power
> fails while the translator is being updated. Afterwards the drive
> powers up in safe mode with a very small capacity.

That should not happen if the firmware designers know how
to do this. The trick is to use a log structure. In addition,
enough stored power to complete one write is a good idea,
but not strictly needed.
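
A toy sketch of what I mean by a log structure (heavily simplified,
certainly not any vendor's actual firmware): mapping updates are
appended to a log and only count once completely written, so a
power cut during an update leaves the previous mapping intact.

  # Toy log-structured mapping table. An interrupted append just
  # means the new record is absent; the old mapping stays valid,
  # so the translator is never left half-updated.
  log = []    # append-only list of (logical_block, physical_block)

  def write_mapping(logical, physical):
      log.append((logical, physical))

  def lookup(logical):
      # The newest complete record wins; older ones are the fallback.
      for l, p in reversed(log):
          if l == logical:
              return p
      return None

  write_mapping(7, 100)
  write_mapping(7, 212)   # remap block 7; if power fails before this
                          # append completes, lookup(7) still gives 100
  print(lookup(7))        # -> 212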

I did have USB flash drives lose all data and return different data
on each read. That would be an explanation. The problem went away
after a full overwrite. I guess the developers of these devices
are still learning how to do this right. Note that the relevant
algorithms have been around for several decades. This is possibly
an education problem.

> What are the Flash drives' typical failures [Public Forum]:
> http://www.salvationdata.com/forum/topic1873.html

> I suspect that SSDs may be similarly affected. Perhaps that's why some
> newer models have large super capacitors for power backup.

With a supercap you can always complete the write.

It is possible to deal with this issue at the filesystem level
by accepting that writes from some time (seconds) before the
power failure may get lost. The filesystem needs to be aware of
the SSD block size, though. Otherwise you can get corruption in
data that was not actually requested to be written, which is
really bad. I guess how to do this in practice is still being
hashed out at this time.
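
To illustrate the "data that was not actually requested to be
written" part: many filesystem blocks share one SSD erase block,
so a badly handled power loss during a rewrite can take unrelated
neighbours with it. A toy calculation (both block sizes are
assumptions, purely for illustration):

  # Which 4 KiB filesystem blocks share an erase block with a
  # given block? (512 KiB erase block assumed for illustration.)
  fs_block    = 4 * 1024
  erase_block = 512 * 1024
  per_erase   = erase_block // fs_block   # 128 fs blocks per erase block

  def neighbours(block_no):
      first = (block_no // per_erase) * per_erase
      return list(range(first, first + per_erase))

  # Rewriting filesystem block 1000 can, on a bad power loss,
  # endanger all 128 blocks in its erase block, not just block 1000:
  print(neighbours(1000)[:4], "...")      # [896, 897, 898, 899] ...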

Personally, I do not trust SSDs at the moment, because of
this error amplification property and for other reasons.
The one SSD I have with critical data is in a RAID1 with
normal disks. Reads are done from the SSD, unless there
is an error, which gives me SSD speeds for my application.

JW

May 17, 2011, 6:32:45 AM
On Tue, 17 May 2011 10:43:49 +1000 Franc Zabkar
<fza...@iinternode.on.net> wrote in Message id:
<scg3t69r7ftgnd0u7...@4ax.com>:

Be wary of the new Intel SSD 320 series. Currently, there's a bug in the
controller that can cause the device to revert to 8MB during a power
failure. AFAIK they have not yet publicly announced it, and won't have a
firmware fix ready for release until the end of July.

We had an SSD 320 600GB 2.5" SATA drive in for evaluation from our Intel
rep. I was able to kill it in two or three hours by power cycling it.
Apparently (according to the Intel rep) when the power failure is
happening, the SSD device tries to reconnect with the SATA port instead of
initiating a proper shutdown. Something to do with interrupt priority
being higher for reconnection than for a proper shutdown.

I was able to kill their 80GB device as well. We've sent both drives back
to Intel and they're going to give us their pre-release firmware for
testing.

Arno

May 17, 2011, 2:32:41 PM

Interesting. Goes to show that firmware development is apparently
not done any better than other software development. I am tempted
to run my next SSD through similar tests before using it.

JW

Aug 16, 2011, 8:21:41 AM
On Tue, 17 May 2011 06:32:45 -0400 JW <no...@dev.null> wrote in Message id:
<l3j4t65ofinhm36kp...@4ax.com>:

The pre-release firmware also had the problem. I ended up supplying Intel
SSD engineering with my test platform and they reproduced the problem and
have a fix pending. See:
http://communities.intel.com/thread/24121?tstart=0

The firmware is not yet released however.

Looks like this Usenet thread caused quite a bit of commotion on their
forum:
http://communities.intel.com/thread/22227?tstart=0

:)

Arno

Aug 16, 2011, 12:22:23 PM

This is rather pathetic on their side (not at all so on your side,
obviously).

> The firmware is not yet released however.

> Looks like this Usenet thread caused quite a bit of commotion on their
> forum:
> http://communities.intel.com/thread/22227?tstart=0

> :)

Understandable. The conclusion can only be to stay away from
Intel SSDs for the next few years, until they have
demonstrated that they have their Q/A under control and have
started to take the data safety of their customers seriously.

It also underlines something I have been saying for a while,
namely that SSDs should be regarded as less reliable than HDDs at
this time, because of engineering screw-ups like this one.

My SSDs are either in a RAID with non-SSDs (with "write mostly",
which gives SSD read speeds under Linux software RAID) or do
not have critical data on them.
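
For reference, the "write mostly" setup is just a per-device flag
on the slower RAID1 member; a minimal sketch with placeholder
device names:

  # RAID1 of an SSD and an HDD; the HDD is marked write-mostly so
  # reads are served from the SSD unless it fails. The device
  # names are placeholders.
  mdadm --create /dev/md0 --level=1 --raid-devices=2 \
        /dev/ssd_partition --write-mostly /dev/hdd_partition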
