
Quarterly RAID5 Rant - enhanced


Art S. Kagel

Nov 26, 2003, 1:26:17 PM
Well, it's that time of year again. Here 'tis:

RAID5 versus RAID10 (or even RAID3 or RAID4)

What is RAID5?

OK, here is the deal: RAID5 uses ONLY ONE parity drive per stripe, and many RAID5 arrays are 5 drives wide (if your counts are different, adjust the calculations appropriately): 4 data and 1 parity, though it is not a single drive holding all of the parity as in RAID3 & RAID4, but read on. If you have 10 drives of say 20GB each for 200GB raw, RAID5 (as two 5-drive sets) will use 20% for parity, so you will have 160GB of storage. Now since RAID10, like mirroring (RAID1), uses 1 (or more) mirror drive for each primary drive, you are using 50% for redundancy, so to get the same 160GB of storage you will need 8 pairs, or 16 drives of 20GB each, which is why RAID5 is so popular. This intro is just to put things into perspective.
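
If you want that arithmetic spelled out, here is a quick back-of-the-envelope sketch in Python (the drive counts, sizes, and the 5-drive RAID5 set width are just the assumptions from my example above):

def raid5_usable_gb(total_drives, drive_gb, drives_per_set=5):
    # Each RAID5 set gives up one drive's worth of space to parity.
    sets = total_drives // drives_per_set
    return sets * (drives_per_set - 1) * drive_gb

def raid10_usable_gb(total_drives, drive_gb):
    # Every drive has exactly one mirror, so half the raw space is usable.
    return (total_drives // 2) * drive_gb

print(raid5_usable_gb(10, 20))   # 160 GB usable from 10 drives (two 5-drive sets)
print(raid10_usable_gb(16, 20))  # the same 160 GB needs 16 drives (8 mirrored pairs)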

RAID5 is physically a stripe set like RAID0 but with data recovery
included. RAID5 reserves one disk block out of each stripe block for
parity data. The parity block contains an error correction code which can
correct any error in the RAID5 block, in effect it is used in combination
with the remaining data blocks to recreate any single missing block, gone
missing because a drive has failed. The innovation of RAID5 over RAID3 &
RAID4 is that the parity is distributed on a round robin basis so that
there can be independent reading of different blocks from the several
drives. This is why RAID5 became more popular than RAID3 & RAID4 which
must synchronously read the same block from all drives together. So, say Drive2 fails: stripe blocks 1, 2, 4, 5, 6 & 7 have data blocks on this drive, while stripe blocks 3 and 8 have their parity blocks on this drive. That means the parity on Drive5 will be used to recreate the data block from Drive2 if block 1 is requested before a new drive replaces Drive2, or during the rebuilding of the new Drive2 replacement. Likewise the parity on Drive1 will be used to repair block 2 and the parity on Drive3 will repair block 4, etc. For block 3 all the data is safely on the remaining drives, but during the rebuilding of Drive2's replacement a new parity block will be calculated from the block 3 data and written to the new Drive2.
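
To make that round-robin rotation concrete, here is a tiny Python sketch that reproduces the mapping in my example (this is my own model of the rotation, not any particular vendor's firmware layout, which may rotate differently):

NUM_DRIVES = 5   # Drive1..Drive5, stripe blocks numbered 1, 2, 3, ...

def parity_drive(stripe):
    # Parity rotates one drive per stripe: stripe 1 -> Drive5,
    # stripe 2 -> Drive1, stripe 3 -> Drive2, and so on around again.
    return ((stripe - 2) % NUM_DRIVES) + 1

for stripe in range(1, 9):
    role = "parity" if parity_drive(stripe) == 2 else "data"
    print("stripe %d: Drive2 holds %s, parity is on Drive%d"
          % (stripe, role, parity_drive(stripe)))
# Drive2 holds parity for stripes 3 and 8 and data for the rest, so a read
# of stripe 1 from a dead Drive2 is rebuilt using the parity on Drive5.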

Now when a disk block is read from the array the RAID software/firmware
calculates which RAID block contains the disk block, which drive the disk
block is on and which drive contains the parity block for that RAID block
and reads ONLY the data drive, returning the data block. If you later modify the data block, it recalculates the parity by subtracting the old block and adding in the new version, then, in two separate operations, it writes the data block followed by the new parity block. To do this it must first read the parity block from whichever drive contains the parity for that stripe block and reread the unmodified data for the updated block from the original drive. This read-read-write-write sequence is known as the RAID5 write penalty: since these two writes are sequential and synchronous, the write system call cannot return until the reread and both writes complete, for safety, so writing to RAID5 is up to 50% slower than RAID0 for an array of the same capacity.
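
Since RAID5 parity is just XOR, the "subtract the old block, add the new one" bookkeeping is easy to show. Here is a minimal sketch of that read-read-write-write sequence on a toy in-memory stripe (byte strings standing in for disk blocks, purely for illustration):

def xor(a, b):
    return bytes(x ^ y for x, y in zip(a, b))

# A toy stripe: 4 data blocks plus their parity block.
data = [b"AAAA", b"BBBB", b"CCCC", b"DDDD"]
parity = xor(xor(xor(data[0], data[1]), data[2]), data[3])

def raid5_small_write(index, new_block):
    global parity
    old_block = data[index]      # read 1: the unmodified data block
    old_parity = parity          # read 2: the old parity block
    # 'subtract' the old block and 'add' the new one -- both are XOR
    parity = xor(xor(old_parity, old_block), new_block)   # write 1: parity
    data[index] = new_block                               # write 2: data

raid5_small_write(2, b"XXXX")
# The parity still reconstructs any single missing block, e.g. block 0:
assert xor(xor(xor(parity, data[1]), data[2]), data[3]) == data[0]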

Now what is RAID10:

RAID10 is one of the combinations of RAID1 (mirroring) and RAID0
(striping) which are possible. There used to be confusion about what
RAID10 or RAID01 meant, and different RAID vendors defined them differently. About five years or so ago I proposed the following standard language, which seems to have taken hold. When N mirrored pairs are striped together this is called RAID10, because the mirroring (RAID1) is applied before the striping (RAID0). The other option is to create two stripe sets and mirror them one to the other; this is known as RAID01 (because the RAID0 is applied first). In either a RAID01 or RAID10 system each and
every disk block is completely duplicated on its drive's mirror.
Performance-wise both RAID01 and RAID10 are functionally equivalent. The
difference comes in during recovery where RAID01 suffers from some of the
same problems I will describe affecting RAID5 while RAID10 does not.
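
One way to see where that recovery difference comes from is to count which two-drive failures each layout can survive. The sketch below assumes a naive RAID01 implementation that treats a whole stripe set as failed as soon as any of its members fails (the way RAID01 is usually described); your particular controller may be smarter:

from itertools import combinations

PAIRS = 4   # 8 drives: RAID10 = 4 mirrored pairs striped,
            # RAID01 = two 4-drive stripe sets mirrored to each other

def raid10_survives(failed):
    # Data is lost only if BOTH members of some mirrored pair fail.
    return all(not {2 * p, 2 * p + 1} <= failed for p in range(PAIRS))

def raid01_survives(failed):
    # Naive RAID01: one bad drive takes out its whole stripe set, so data
    # is lost as soon as both stripe sets have at least one failed member.
    set_a, set_b = set(range(PAIRS)), set(range(PAIRS, 2 * PAIRS))
    return not (failed & set_a) or not (failed & set_b)

failures = list(combinations(range(2 * PAIRS), 2))
print(sum(raid10_survives(set(f)) for f in failures), "of", len(failures))
print(sum(raid01_survives(set(f)) for f in failures), "of", len(failures))
# RAID10 survives 24 of the 28 possible two-drive failures, RAID01 only 12.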

Now if a drive in the RAID5 array dies, is removed, or is shut off, data is returned by reading the blocks from the remaining drives and calculating the missing data using the parity, assuming the defunct drive is not the parity-block drive for that RAID block. Note that it takes 4 physical reads to replace the missing disk block (for a 5-drive array) for four out of every five disk blocks, leading to a 64% performance degradation until the problem is discovered and a new drive can be mapped in to begin recovery.
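
Rebuilding a block that lived on the dead drive is the same XOR trick in the other direction: read every surviving block in the stripe, data and parity alike, and XOR them together. A rough sketch for one stripe of a 5-drive array (byte strings as blocks, again just an illustration):

def xor_all(blocks):
    out = blocks[0]
    for b in blocks[1:]:
        out = bytes(x ^ y for x, y in zip(out, b))
    return out

# One stripe on a 5-drive set: Drives 1-4 hold data, Drive5 holds its parity.
stripe = {1: b"AAAA", 2: b"BBBB", 3: b"CCCC", 4: b"DDDD"}
stripe[5] = xor_all(list(stripe.values()))

failed_drive = 2
# A healthy read of Drive2's block is 1 I/O; in degraded mode it takes 4 reads
# (every surviving drive) plus the XOR to recreate it:
survivors = [blk for drv, blk in stripe.items() if drv != failed_drive]
assert len(survivors) == 4
assert xor_all(survivors) == b"BBBB"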

If a drive in the RAID10 array dies data is returned from its mirror drive
in a single read with only minor (6.25% on average) performance reduction
when two non-contiguous blocks are needed from the damaged pair and none
otherwise.

One begins to get an inkling of what is going on and why I dislike RAID5,
but there's more.

What's wrong besides a bit of performance I don't know I'm missing?

OK, so that brings us to the final question of the day, which is: what is the problem with RAID5? It does recover a failed drive, right? So writes are slower; I don't do enough writing to worry about it, and the cache helps a lot also, I've got LOTS of cache! The problem is that despite the
improved reliability of modern drives and the improved error correction
codes on most drives, and even despite the additional 8 bytes of error
correction that EMC puts on every Clariion drive disk block (if you are
lucky enough to use EMC systems), it is more than a little possible that a
drive will become flaky and begin to return garbage. This is known as
partial media failure. Now SCSI controllers reserve several hundred disk
blocks to be remapped to replace fading sectors with unused ones, but if the drive is going these will not last very long before they run out, and SCSI does NOT report correctable errors back to the OS! Therefore you will not know the drive is becoming unstable until it is too late, when there are no more replacement sectors, the errors begin to exceed the drive's ECC coding's ability to correct single-bit-per-byte errors, and the drive
begins to return garbage. [Note that the recently popular ATA drives do
not (to my knowledge) include bad sector remapping in their firmware so
garbage is returned that much sooner.] When a drive returns garbage, RAID5 will never notice, since RAID5 does not EVER check parity on read (RAID3 & RAID4 do, BTW, and both perform better for databases than RAID5 to boot); when you write the garbage sector back, garbage parity will be calculated and your RAID5 integrity is lost! Similarly, if a drive fails and one of the remaining drives is flaky, the replacement will be rebuilt with garbage also.

Need more? During recovery, read performance for a RAID5 array is
degraded by as much as 80%. Some advanced arrays let you configure the
preference more toward recovery or toward performance. However, favoring performance will increase recovery time and increase the likelihood of losing a second drive in the array before recovery completes, resulting in catastrophic data loss. RAID10, on the other hand, will only be recovering one drive out of 4 or more pairs, with ONLY reads from the recovering pair degraded, making the performance hit to the array overall only about 20%! Plus there is no parity calculation time used during recovery - it's a straight data copy.

What about that thing about losing a second drive? Well, with RAID10 there is no danger unless the one mirror that is recovering also fails, and that's 80% or more less likely than the chance that any other drive in a RAID5 array will fail! And since most multiple drive failures are caused by undetected manufacturing defects, you can make even this possibility vanishingly small by making sure to mirror every drive with one from a different manufacturer's lot number. ("Oh," you say, "this scenario does not seem likely!" Pooh! We lost 50 drives over two weeks when a batch of 200 IBM drives began to fail. IBM discovered that the single lot of drives would have their spindle bearings freeze after so many hours of operation. Fortunately, due in part to RAID10 and in part to a herculean effort by DG techs and our own people over 2 weeks, no data was lost. HOWEVER, one RAID5 filesystem was a total loss after a second drive failed during recovery. Fortunately everything was on tape.)

Conclusion? For safety and performance favor RAID10 first, RAID3 second,
RAID4 third, and RAID5 last! The original reason for the RAID2-5 specs
was that the high cost of disks was making RAID1, mirroring, impractical.
That is no longer the case! Drives are commodity priced, even the biggest
fastest drives are cheaper in absolute dollars than drives were then and
cost per MB is a tiny fraction of what it was. Does RAID5 make ANY sense
anymore? Obviously I think not.

To put things into perspective: if a drive costs $1000US (and most are far less expensive than that) then switching from a 4-pair RAID10 array to a 5-drive RAID5 array will save 3 drives, or $3000US. What is the cost of
overtime, wear and tear on the technicians, DBAs, managers, and customers
of even a recovery scare? What is the cost of reduced performance and
possibly reduced customer satisfaction? Finally what is the cost of lost
business if data is unrecoverable? I maintain that the drives are FAR
cheaper! Hence my mantra:

NO RAID5! NO RAID5! NO RAID5! NO RAID5! NO RAID5! NO RAID5! NO RAID5!

Art S. Kagel

Andy Kent

Nov 27, 2003, 4:00:15 AM
Interesting and comprehensive discussion.

But I couldn't help noticing that in the "Microsoft Guide To Get SQL Server Certification" (and let's resist any parochial pettiness in this debate) there were some sample questions on "Which RAID level would you choose in the following scenarios?" and one of the answers was RAID 5.

Although I shuddered at the answer myself (and would probably have put something else and therefore lost a mark), and whatever we might say about Microsoft or SQL Server, it's interesting to note that an organisation with the wellie of Microsoft is still recommending RAID 5 to its up-and-coming DBAs.

Also, I was in the fortunate position to be able to try configuring IDS 9.3 on a big Sun with twin T3 arrays last year; I tried RAID5 on one and RAID10 on the other and was surprised to find that both read and write performance were almost identical.

Now, you know I'm with you on your disdain for RAID5, Art, but I'd be interested to hear your (or anyone else's) feedback on all of this. As I said, it would be nice if we could keep the parochial stuff out of the debate.

Andy

malcolm.iiug

Nov 27, 2003, 7:32:15 AM

Basically, what Art says is that using raid5 rather than raid10 is a means
of saving money on protection of your data. Can you afford to do this?

But if you are using Microsoft products, can you really be concerned about protecting your data? I have never known the idea of protected data and Microsoft to be happy bedfellows.

But on a more serious note: how many customers are really aware of this problem? How many are using RAID, and what types? Perhaps if we got those answers we could establish whether the industry in general is still prepared to take risks with their data to save a few pennies.

regards

Malcolm

sending to informix-list

Toni Arte

Nov 27, 2003, 8:38:07 AM
Andy Kent wrote:
> Also I was in the fortunate position to be able to try configuring IDS
> 9.3 on a big Sun with twin T3 arrays last year, tried RAID5 on one and
> RAID10 on the other and was surprised to find that both read and write
> performance were almost identical.

We used to run a twin T3 array for our chunkfiles for Informix 7.31 in
RAID5 mode. Actually that was raid 51, i.e. two RAID5 boxes which were
mirrored by the volume manager. Never had any issues and the performance
was generally quite OK.

Now we have a new system with a newer version of the T3 array pair. This time we configured it without using RAID5, but the issue with the T3 is that you cannot mirror individual disks across the boxes. So we ended up creating stripes within the box, mirroring the stripes within the box, and then mirroring the mirrored stripes across the boxes. So each file system has four stripes underneath :-) Sounds like a hack, but the T3 is not really flexible in that sense.
--
Toni

Richard Kofler

Nov 27, 2003, 6:10:39 PM
"malcolm.iiug" wrote:
>
> Basically, what Art says is that using raid5 rather than raid10 is a means
> of saving money on protection of your data. Can you afford to do this?
>
> But if you are using Microsoft products, can you really be concerned
> about protecting your data? I have never known the idea of protected data
> and Microsoft to be happy bedfellows.
>
> But on a more serious note:-
> How many customers are really aware of this problem. How many are using
> Raid and what types? Perhaps if we got those answers we could establish
> whether the industry in general is still prepared to take risks with their
> data to save a few pennies.

Here, the answer is definitely yes.

And SAN (re-)sellers do play their part in that game:

[All of the following disks are 10k rpm, 8MB on-disk cache]
1 GB is 1.00 Euro on Serial ATA 1 or parallel ATA133
1 GB is 3.20 Euro on a stand-alone SCSI disk (with SCSI320 interface)
Inside a SAN the very same SCSI disk is priced at 15 Euro / 1 GB.

No wonder one has to look for the cheapest way to go on the winding road of storage consolidation.

Industry seems to be able to cope with the loss of performance more easily than with the 15-fold increase in $$ (compared to IDE).

What I cannot understand is that no one seems to see the fact that you get the same number of GB for the same price, but almost 2 times the sustained access rates (number of seeks/sec, that is) if you go with 250 GB IDE disks and use only 25 GB of each disk: the 'outer', faster tracks...

In terms of 2KB pages per I/O
(both solutions via 2 FC 2Gbit/sec attachments):

32 serial ATA 1 IDE disks with 2 IDE RAID controllers,
8+8 disks on each controller, all disks 250 GB, but
only 146 GB used per disk:
Performance: approx. 210000 pages dskrd/sec

8 disks in a SAN (2 times 3+1 in 2 RAID5 cages),
including 4GB SAN cache, all disks 146 GB:
Performance: approx. 121000 pages dskrd/sec

(Performance as seen in onstat -p, so on the SAN it
includes all in-SAN caching, which doesn't help too much.)

(The SAN is still 10% more $$, but there is no
IDE solution with a higher price on the market.)

dic_k
--
Richard Kofler
SOLID STATE EDV
Dienstleistungen GmbH
Vienna/Austria/Europe
