Written by Jeff Layton
Tuesday, 13 June 2006
Some aid for those who use RAID
The Beowulf mailing list provides detailed discussions about issues concerning Linux HPC clusters. In this article we turn our attention to other mailing lists that can also provide useful information: I review some postings on the Rocks-Discuss and LVM mailing lists that report on RAID and file system preferences.
ROCKS: RAID
Most of the time, the mailing lists for specific cluster applications or distributions are devoted to specific questions about that application or distribution. Sometimes, however, you will see general questions draw very good responses from knowledgeable people on these lists. Rocks is a popular cluster distribution, and on January 6, 2004, a
simple question to the Rocks mailing list gave rise to some good
recommendations. Purushotham Komaravolu asked for recommendations for a
RAID configuration for about 200 GB of data (recall that RAID stands
for Redundant Array of Inexpensive Disks).
Greg Bruno provided the first answer. He said that for pure capacity
(not necessarily throughput) you should use a 3ware 8006-2LP serial ATA
controller with two 200 GB (Gigabyte) serial ATA drives that are
configured for mirroring (RAID-1). He said that this should give about
80 Megabytes/sec (MB/s) in read performance and about 40 MB/s in write
performance. For more performance, Greg recommended using a 3ware
8506-4LP serial ATA controller and four 100 GB ATA drives configured as
RAID-10 (two mirrored pairs which are then striped together). Greg estimated performance at about 160 MB/s for read I/O and 80 MB/s for write I/O, assuming decent disks.
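Greg's numbers follow from simple back-of-the-envelope arithmetic: a mirror can service reads from both copies but must write every block to both disks, while striping multiplies single-disk throughput by the number of stripe members. The sketch below reproduces his estimates; the 40 MB/s per-disk figure is an assumption that roughly matches drives of that era, and real controllers will come in lower.

```python
def raid_estimate(level, disks, per_disk_mb_s):
    """Back-of-the-envelope read/write throughput for simple RAID levels.

    Returns (read_MB_s, write_MB_s). Ignores controller and bus
    overhead, so these are ceilings, not measurements.
    """
    if level == "raid0":  # pure striping: all disks serve reads and writes
        return disks * per_disk_mb_s, disks * per_disk_mb_s
    if level == "raid1":  # two-way mirror: read from both, write to both
        assert disks == 2
        return 2 * per_disk_mb_s, per_disk_mb_s
    if level == "raid10":  # striped mirrors: disks/2 mirrored pairs
        assert disks % 2 == 0
        pairs = disks // 2
        return disks * per_disk_mb_s, pairs * per_disk_mb_s
    raise ValueError(level)

# Greg's two configurations, assuming ~40 MB/s per drive:
print(raid_estimate("raid1", 2, 40))   # (80, 40)
print(raid_estimate("raid10", 4, 40))  # (160, 80)
```

That real hardware falls short of these ceilings is one reason the measured numbers reported on the list varied so much.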
Jon Forrest joined in the discussion saying that he had a difficult
time getting the Promise and Iwill RAID cards (RAID-0 or RAID-1)
working with Linux. Greg Bruno responded that they had good luck with
3ware controllers and bad luck with the controllers that Jon mentioned.
However, Tim Carlson joined in that he was not impressed with the
RAID-5 performance of the 3ware controllers even using serial ATA
(SATA) drives. Tim said that he had never gotten more than 50 MB/s
using RAID-5 and SATA. He recommended going with SCSI drives and a SCSI
RAID controller along with software RAID. Tim finally suggested using a box of IDE (ATA) disks behind a back-end controller that presents them as SCSI or FC (Fibre Channel). He said that in his experience this solution scales nicely to tens of terabytes (TB).
Joe Landman jumped in to say that using RAID-5 for high performance is not a good idea; rather, one should use something like RAID-0 (striping) for increased performance. Joe also took issue with the idea
of using SCSI disks. Joe said that in his experience ATA drives were
very good but suffered from an interrupt problem that led to increased CPU load, to the point that you could swamp a CPU by writing
many, many small blocks at the same time (think of a cluster head node
or NFS file server). SCSI controllers hide this behind a controller
interface. Joe went on to discuss that current CPUs have much more
power than the controller on a RAID card. However, he noted, combining software RAID over a cheap hardware controller is asking for trouble, particularly under heavy loads. Joe closed by agreeing with Tim's recommendation of IDE disks behind a back-end controller that converts to SCSI or FC.
A little later Joe said that the important question was what file
system people were running on their RAID disks. Joe said that XFS was
the best and should be incorporated into ROCKS (note that XFS is now
part of the standard 2.4 and 2.6 kernels from kernel.org). Joe Kaiser
chimed in that he thought XFS was great and that they had had very good luck with it. Tim Carlson jumped back in to say that he had had good luck with ext3. Joe Kaiser responded that they had seen some data corruption with ext3 on large arrays when the disk had been filled completely. Joe and Tim then discussed several aspects of design, including the
importance of understanding your data needs and your data layout.
This discussion points out that there are several important
considerations when designing a file server for a cluster.
Considerations such as your data layout, the host machine (CPU power),
disk types, RAID controllers, monitoring capabilities, and file system
choice can all have a great effect on the resulting I/O performance.
ROCKS: Using Other File Systems
A couple of months after the previous discussion about RAID, a discussion about alternative file systems began on the Rocks-Discuss mailing list. On April 16, 2004, Yaron Minsky asked about
using something other than ext2 on the master node of his ROCKS
cluster, particularly ReiserFS or XFS. Phillip Papadopoulos replied that a bug in ROCKS 3.1.0 forced the use of ext2 and that it would be fixed in the next release. However, he did say that you could convert an ext2 file system to ext3 in place by using tune2fs to add a journal.
Laurence Liew responded that he thought ext2, ext3, ReiserFS, and XFS
all had their strengths and weaknesses. He suggested using ext2 for a
while to understand the application usage pattern. He also said that in
some cases, modifying the layout of the cluster would have a bigger
impact than changing file systems. Yaron replied that he thought ext3 fared worse than XFS or JFS in benchmarks. Laurence replied that
he remembered some SNAP benchmark results that showed ext3 winning in
certain cases.
There was some discussion about whether Red Hat included ReiserFS
and/or XFS in the version of RHEL (Red Hat Enterprise Linux) that ROCKS
uses. It was finally determined that XFS was not included, while ReiserFS was included only as an unsupported RPM. Later on, Josh Brandt mentioned
that he thought ReiserFS would do better on lots of small files
compared to other file systems. However, for large files ReiserFS
performed worse than other file systems. Yaron, the original poster,
posted his basic usage pattern (size of files, number of files, number
of directories, etc.). Josh thought he should give ReiserFS a try.
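Claims like "ReiserFS is faster on small files" are easy to check against your own usage pattern. Below is a minimal, hypothetical sketch (the file count and file size are arbitrary choices) that times creating and then re-reading a batch of small files on whatever file system holds the target directory; run it once per candidate file system and compare.

```python
import os
import tempfile
import time

def small_file_benchmark(directory, count=1000, size=1024):
    """Time creating, then reading back, `count` files of `size` bytes.

    Point `directory` at a path on each file system under test
    (ext3, ReiserFS, XFS, ...) and compare the two elapsed times.
    """
    payload = b"x" * size
    t0 = time.time()
    for i in range(count):
        with open(os.path.join(directory, f"f{i}"), "wb") as f:
            f.write(payload)
    create_s = time.time() - t0
    t0 = time.time()
    for i in range(count):
        with open(os.path.join(directory, f"f{i}"), "rb") as f:
            f.read()
    read_s = time.time() - t0
    return create_s, read_s

with tempfile.TemporaryDirectory() as d:
    create_s, read_s = small_file_benchmark(d, count=200)
    print(f"create: {create_s:.3f}s  read: {read_s:.3f}s")
```

A toy loop like this mostly exercises metadata operations, which is exactly where the small-file differences between file systems tend to show up.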
While this discussion was brief, it does show that file system performance, and preferences for particular file systems, vary considerably among people and groups.
The Beowulf mailing list provides detailed discussions about issues concerning Linux HPC clusters. In this column I report on using semi-public PCs for grid-type applications and on how to handle large numbers of files. We also turn to the ganglia-developers mailing list to report on how one can add a "disk alive" metric to ganglia. You can consult the Beowulf archives, the bioclusters archives, and the ganglia archives for the actual conversations.
Using Semi-Public PCs
There was an interesting discussion a few months back on the bioclusters mailing list about using semi-public PCs for heavy computational jobs. On February 15, 2004, Arnon Klein asked about running his jobs on semi-public machines running various flavors of Windows. Arnon asked because he is doing graduate research and needs computational power; he had already exhausted the machines easily available to him, so he was looking for suggestions about what to do next.
The first response came from Chris Dwan. Chris responded that he's in a
similar boat but has managed to put together some systems from various
campuses into something like a grid. He also provided a very useful
ranking of systems in terms of access difficulty. For example, systems
that he maintains were easiest to get into followed by systems running
Linux or OS X (which Chris also runs). The lowest two ranked systems
were Windows machines that either could be rebooted at night or could
not be rebooted at all. Chris went on to talk about some schedulers that can steal cycles from idle workstations (e.g., SGE, Torque, LSF), although he said that integrating disparate schedulers can be very difficult. He did mention Condor from the University of Wisconsin as a
possible solution. He also mentioned the grid software from United
Devices, which runs on Windows machines but will use compute cycles
from other machines.
Farud Ghazali mentioned that he too was looking for a solution to this type of problem. He pointed out that there were many practical difficulties, including authentication across disparate resources. Chris Dwan jumped in to explain how he had hacked up something to do authentication for him.
Ron Chen joined the conversation to mention that SGE (Sun Grid Engine) version 6.0 would integrate with JXTA, which in turn offers Jgrid, providing P2P (peer-to-peer) workload management in a fashion similar to SETI@home. However, he did say that SGE 6.0 wouldn't be out until May of 2004 (and that it might slip slightly). Until then, Ron recommended using BOINC. This package starts jobs and transmits data using port 80, which makes it easier to get in and out of a firewall than other approaches. It also has versions for Windows, Linux, Solaris, and OS X.
John van Workum also mentioned GreenTea, which offers a Java P2P client that gives grid capabilities for running jobs. Bruce Moxon also mentioned that the Cornell Theory Center has some tools that might help with Windows machines.
While this discussion was short, it did offer some ideas that could help people in similar situations. Many people and groups are thinking about the same issues that Arnon raised in his first posting.
Disk Alive Metric
I'm sure many readers are aware of ganglia. It is a scalable
distributed monitoring system for high performance computing systems
such as clusters and grids. It is open source and in use on over 500
clusters throughout the world. On December 22, 2003, on the Ganglia
Developers mailing list Federico Sacerdoti asked about a metric that
ganglia could watch that would report if a disk was alive or not. It
seems that Federico was talking to a Purdue (my alma mater) system
administrator about a cluster that is put together from old PCs. The
disks in the machines kept failing, but ganglia failed to report the disks as down, since the ganglia daemon will still report a heartbeat even when the node is effectively down. Federico posted a possible solution that he had worked out with the administrator but had not yet tried it.
Brooks Davis replied that he didn't think it would work, at least in
FreeBSD, because of the way Unix and Unix-like systems work. He did
offer another solution that read random blocks from a file system to
make sure the drive was still functioning.
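Brooks's random-read check can be sketched in a few lines of Python. The function below is a hypothetical illustration, not his actual code: it reads a handful of random blocks from a path and treats any I/O error as a sign of trouble. On a real node you would point it at a device file such as /dev/sda, which requires root; here it works against any readable file.

```python
import os
import random

def disk_read_check(path, block_size=4096, samples=8):
    """Read `samples` random blocks from `path`; return True if all succeed.

    An OSError on open, seek, or read suggests the underlying drive
    (or file system) is failing.
    """
    try:
        size = os.path.getsize(path)
        with open(path, "rb") as f:
            for _ in range(samples):
                f.seek(random.randrange(max(size - block_size, 1)))
                f.read(block_size)
    except OSError:
        return False
    return True
```

A check like this actually touches the platters, which is the point: a dying drive can keep a daemon's heartbeat alive long after real I/O has started failing.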
Robert Walsh responded that he had been trying to get information from the SMART (Self-Monitoring, Analysis, and Reporting Technology)
data in most hard drives into ganglia. Brooks Davis mentioned that he
thought integrating smartmontools with ganglia might offer a solution.
smartmontools is a package that allows you to control and monitor the
SMART data contained in virtually all modern hard drives.
The discussion spilled over into January of 2004, where Sander van
Vliet announced that he had a preliminary working version of a gmetric
code that would test if the drives were alive. The code walks the
/proc/mounts file looking for drives that are mounted and then attempts
to write 4 bytes to the end of the currently used file system to determine if the disk is alive; if there are no errors along the way, the disk is considered alive. Sander then posted that he had a version of his code working that used the SMART data, but that job has to be run as root.
This problem was sorted out fairly quickly, though. Throughout the conversation, there was an effort to make the code work under Linux and the various BSD flavors, especially FreeBSD. At this point the thread
died out, but it appears as though the code was working correctly for
Linux and FreeBSD.
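Sander's approach (walk /proc/mounts, then try a small write on each disk-backed file system) can be sketched as follows. This is an illustration under stated assumptions, not his actual gmetric code: the /dev/ prefix filter and the scratch-file name are my choices, and his version wrote to the end of an existing file rather than creating a probe file.

```python
import os

def local_writable_mounts(mounts_file="/proc/mounts"):
    """Yield (device, mount_point) for disk-backed entries in /proc/mounts."""
    with open(mounts_file) as f:
        for line in f:
            device, mount_point = line.split()[:2]
            if device.startswith("/dev/"):  # skip proc, tmpfs, NFS, ...
                yield device, mount_point

def disk_alive(mount_point, probe_name=".disk_alive_probe"):
    """Write 4 bytes to a scratch file on the file system; True if it worked."""
    probe = os.path.join(mount_point, probe_name)
    try:
        with open(probe, "wb") as f:
            f.write(b"ping")
            f.flush()
            os.fsync(f.fileno())  # force the write through to the disk
        os.remove(probe)
    except OSError:
        return False
    return True
```

The resulting per-mount boolean could then be published to ganglia with the gmetric command-line tool, which is essentially what the thread converged on.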