Overcoming the I/O Bottleneck with General Parallel File System

milan

Sep 11, 2006, 11:40:11 PM
to 中国高性能计算论坛
Overcoming the I/O Bottleneck with General Parallel File System
By Andrew Naiberg
----------------------------------------------------------------------------------------------
It used to be that I/O was faster than computation. In fact, not too
long ago a supercomputer could be loosely defined as any machine that
turned a compute-bound problem into an I/O-bound problem. However,
dramatic increases in CPU, memory and bus speeds have turned this
relationship on its head—now disk I/O is usually the critical factor
limiting application performance and the ability to share data across a
computing cluster. First seen in scientific supercomputers, this I/O
bottleneck is now common in many data-intensive business applications
such as digital media, financial analysis, business intelligence,
engineering design, medical imaging, geographic/geological analysis and
so forth. And with data volumes, CPU speeds and interconnect speeds all
still increasing, the I/O bottleneck will likely only get worse.

Parallel file systems (also known as cluster file systems) have emerged
as a powerful solution to the I/O bottleneck and IBM’s General
Parallel File System (GPFS) is among the best. Originally developed for
digital-media applications, GPFS now powers many of the world’s most
powerful scientific supercomputers, holds the world record for the
terabyte sort and has won several other performance awards. Parallel file systems
offer three primary advantages over traditional distributed and SAN
file systems:

· High bandwidth—Parallel file systems are very effective
when distributed or SAN file systems can’t deliver the aggregate
bandwidth required for the environment. Where network file systems
typically deliver less than 125 MB/second and SAN file systems top out
around 500 MB/second, GPFS has delivered 15 GB/second on a single node.
Moreover, GPFS can scale this performance as more nodes are attached,
delivering enormous aggregate bandwidth. In addition to its terabyte
sort world record, GPFS won awards for both the highest bandwidth
and the most I/Os per second (running a real application) at the 2004
Supercomputing Conference.

· Data sharing—Another key advantage of parallel file systems
is that all of the attached nodes have equal access to all of the data
on the underlying disks, making parallel file systems ideal for cluster
environments where many users or applications work with the same data
(e.g., many engineers can share a single set of design files). And GPFS
recently added unique “multi-cluster” support, enabling data
sharing and collaboration across interconnected clusters; this
capability is currently being used to share data and results across a
consortium of European research centers.

· High reliability without bottlenecks—Unlike distributed
file systems, which transfer all of the data through a single server
and path, parallel file systems aren’t client/server designs and
employ redundant paths, allowing configurations that eliminate all
single points of failure; if one path fails, data can flow via another
one. Even SAN file systems, which don’t transfer all of the data
through a single data server, typically must access a single metadata
server to initiate a transfer, again impacting performance. And despite
efforts to alleviate these bottlenecks with simple mechanisms to split
the load among multiple data or metadata servers, there’s simply no
good way to prevent overload or failure with these designs. Conversely,
GPFS stores both data and metadata across any or all of the disks in
the cluster so there’s no single data server, metadata server or data
path to act as a bottleneck or single point of failure.
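
As a rough, simplified illustration of that last point (and not GPFS’s
actual on-disk layout), the Python sketch below spreads metadata
placement across all of the shared disks by hashing the file
identifier, so no single server or disk sits in every access path; the
disk names and the helper function are hypothetical.

import hashlib

# Simplified illustration (not GPFS's real layout): metadata records are
# spread across all shared disks, so no single metadata server has to
# mediate every file access. All names here are hypothetical.

DISKS = [f"disk{i}" for i in range(8)]   # shared disks visible to every node

def metadata_disk(file_id: str) -> str:
    """Pick the disk holding a file's metadata by hashing its identifier."""
    digest = int(hashlib.sha1(file_id.encode()).hexdigest(), 16)
    return DISKS[digest % len(DISKS)]

# Every node computes the same placement from the same input, so any node
# can reach a file's metadata directly on shared storage, with no central
# server in the path to overload or fail.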

With these capabilities come several others, including high scalability
and the ability for multiple users or applications to access different
parts of a single file simultaneously. GPFS currently supports
production clusters of more than 2,200 nodes and file systems
comprising more than 1,000 disks and hundreds of terabytes of storage.

How Does GPFS Do It?
The key to GPFS’s bandwidth is that GPFS divides individual files
into multiple blocks and “stripes” (stores) these blocks across all
of the disks in a file system. To read or write a
file, GPFS initiates multiple I/Os in parallel, thereby performing the
transfer quickly. In addition, each block is much larger than in a
traditional file system—typically 256 KB and up to 1024 KB. This
enables GPFS to transfer large amounts of data with each operation and
reduce the effect of seek latency.
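
As a rough illustration of the striping idea (not GPFS’s actual
implementation), the Python sketch below maps each block to a disk
round-robin and issues one read per block in parallel; the block size,
the disk count and the read_block() helper are hypothetical stand-ins
for the real I/O path.

import concurrent.futures

BLOCK_SIZE = 256 * 1024   # a typical GPFS block size (256 KB)

def block_location(offset, num_disks):
    """Map a byte offset to (disk index, block number) under round-robin striping."""
    block_no = offset // BLOCK_SIZE
    return block_no % num_disks, block_no

def read_file_striped(read_block, file_size, num_disks):
    """Read every block of a file in parallel and reassemble it in order.

    read_block(disk, block_no) is a caller-supplied (hypothetical) function
    that returns the bytes of one block from one disk.
    """
    num_blocks = (file_size + BLOCK_SIZE - 1) // BLOCK_SIZE
    with concurrent.futures.ThreadPoolExecutor(max_workers=num_disks) as pool:
        futures = [
            pool.submit(read_block, *block_location(i * BLOCK_SIZE, num_disks))
            for i in range(num_blocks)
        ]
        return b"".join(f.result() for f in futures)

Because consecutive blocks land on different disks, the transfers
proceed concurrently and the effective bandwidth grows with the number
of disks rather than being limited to a single spindle.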

To keep track of all of these blocks and ensure data integrity, GPFS
implements a distributed byte-range locking mechanism. The
“distributed” part synchronizes file system operations across
compute nodes so that, although file system management is distributed
across many machines for optimal performance, the entire file system
looks like a single file server to every node in the cluster. The
“byte-range” part means that rather than locking an entire file
whenever it’s accessed, as traditional file systems do (blocking all
other access), GPFS locks individual parts of a file
separately. This enables multiple users, applications or parallel jobs
to work on different parts of a single file simultaneously, offering
many benefits. For example, multiple engineers or applications can
access and update a single design file simultaneously, eliminating the
additional storage and overhead associated with storing multiple
copies, not to mention the effort required to merge all of the copies
into a finished product. In a broadcast news environment, video editors
can work on a live video feed as it streams in from the field,
accelerating the time to air.
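
GPFS’s locking is its own distributed, token-based mechanism, but the
byte-range idea itself can be sketched on a single host with ordinary
POSIX advisory locks (Python’s fcntl module, Unix only). The code below
is only that illustration, not GPFS code: it locks just the region
being written, so writers touching disjoint ranges of the same file
never block one another.

import fcntl
import os

def update_region(path, offset, data):
    """Lock only [offset, offset + len(data)) while writing, not the whole file."""
    fd = os.open(path, os.O_RDWR)
    try:
        # Exclusive advisory lock on just this byte range; other ranges of
        # the file remain available to other writers.
        fcntl.lockf(fd, fcntl.LOCK_EX, len(data), offset, os.SEEK_SET)
        os.pwrite(fd, data, offset)
        fcntl.lockf(fd, fcntl.LOCK_UN, len(data), offset, os.SEEK_SET)
    finally:
        os.close(fd)

Two processes calling update_region() on non-overlapping ranges of the
same file proceed concurrently instead of serializing on a whole-file
lock, which is the behavior the byte-range design gives GPFS across an
entire cluster.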

Designed for High Reliability
In addition to eliminating the single points of failure associated with
individual servers or paths (as previously discussed), GPFS is also
designed to accommodate hardware failures. In fact, several commercial
customers use GPFS primarily for its reliability rather than its
performance. To protect against failure of a compute node, each node
logs all of its updates and stores them on shared disks just like all
of the other data and metadata. If a node fails, another node can
access the failed node’s log to determine what updates were in
progress and restore the affected file(s). These files can be accessed
normally once they’re consistent again. To protect against disk
failures, GPFS can stripe its data across RAID disks and be configured
to store copies of data and metadata on different disks. Finally, GPFS
doesn’t require the file system to be taken down to make
configuration changes such as adding, removing or replacing disks in an
existing file system or adding nodes to an existing cluster.
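
The logging-and-replay idea can be sketched roughly as follows; the
record format and the apply_update() helper are hypothetical, and GPFS
manages its real recovery logs internally on the shared disks.

# Toy sketch of per-node write-ahead logging and replay: an intent record
# is appended before an update is applied, and a surviving node replays
# any unfinished records from a failed node's log to restore consistency.

def log_update(log, file_id, offset, data):
    """Append an intent record before the update is applied."""
    log.append({"file": file_id, "offset": offset, "data": data, "done": False})

def mark_done(log, index):
    """Mark a logged update as safely applied."""
    log[index]["done"] = True

def recover(log, apply_update):
    """Replay unfinished updates from a failed node's log.

    apply_update(file_id, offset, data) is a caller-supplied (hypothetical)
    function that re-applies one update to shared storage.
    """
    for entry in log:
        if not entry["done"]:
            apply_update(entry["file"], entry["offset"], entry["data"])
            entry["done"] = True

Once recovery finishes, the affected files are consistent again and can
be accessed normally, as described above.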

Highly Available Access
GPFS supports a variety of disk hardware and offers three configuration
options ranging from a full-access SAN implementation for ultimate
performance to a shared-disk server model that’s less expensive as
the cluster gets very large. GPFS is a proven solution for virtually
any environment requiring extremely reliable, high-bandwidth shared
data access.
------------------------------------------------------------------------------------------------------------------

Andrew Naiberg is the product-marketing manager for pSeries software.
He’s also been a software engineer and service delivery specialist
since joining IBM in 1997. Andrew can be reached at anai...@us.ibm.com.
