Hi all,
I need to prepare a small report on “NFS vs. Lustre”.
I could find a lot of resources about Lustre vs. CXFS, GPFS, and GFS …
Can you please provide a few tips, URLs, etc.?
cheers,
__
tharindu
The advantage of NFS is that it's native to many Unix systems and is widely
available. The advantage of Lustre is its performance.
GPFS is a parallel filesystem very similar to Lustre, but it's backed
by IBM. It runs on AIX and Linux. It's good but costly.
CXFS and GFS work similarly. You need a shared block device such as a
SAN or NetApp (iSCSI). They're not really for performance; they're mostly for
high availability.
What are you trying to solve? We may be able to help.
It's Saturday, and the family is out running around. I have time to think
about this question. Unfortunately for you, I do this more for myself,
which means this is going to be a stream-of-consciousness thing far more
than a well-organized discussion. Sorry.
I'd begin by motivating both NFS and Lustre. Why do they exist? What
problems do they solve?
NFS first.
Way back in the day, ethernet and the concept of a workstation got
popular. There were many tools to copy files between machines but few
ways to share a name space: to have the directory hierarchy and its
content directly accessible to an application on a foreign machine. This
made file sharing awkward. The model was to copy the file or files to
the workstation where the work was going to be done, do the work, and
copy the results back to some, hopefully, well maintained central
machine.
There *were* solutions to this at the time. I recall an attractive
alternative called RFS (I believe) from the Bell Labs folks, via some
place in England if I'm remembering right; it's been a looong time, after
all. It had issues, though. The nastiest issue for me was that if a
client went down, the service side would freeze, at least partially.
Since this could happen willy-nilly, depending on the user's wishes and
how well the power button on his workstation was protected, together
with the power cord and ethernet connection, this freezing of service
for any amount of time was difficult to accept. This was so even in a
rather small collection of machines.
The problem with RFS (?) and its cousins was that they were all
stateful. The service side depended on state that was held at the
client. If the client went down, the service side couldn't continue
without a whole lot of recovery, timeouts, etc. It was a very *annoying*
problem.
In the latter half of the 1980's (am I remembering right?) SUN proposed
an open protocol called NFS. An implementation using this protocol could
do most everything RFS(?) could but it didn't suffer the service-side
hangs. It couldn't. It was stateless. If the client went down, the
server just didn't care. If the server went down, the client had the
opportunity to either give up on the local operation, usually with an
error returned, or wait. It was always up to the user and for client
failures the annoyance was limited to the user(s) on that client.
SUN, also, wisely desired the protocol to be ubiquitous. They published
it. They wanted *everyone* to adopt it. More, they would help
competitors. SUN held interoperability bake-a-thons to help with this.
It looks like they succeeded, all around :)
Let's sum up, then. The goals for NFS were:
1) Share a local file system name space across the network.
2) Do it in a robust, resilient way. Pesky FS issues because some user
kicked the cord out of his workstation were unacceptable.
3) Make it ubiquitous. SUN was a workstation vendor. They sold servers
but almost everyone had a VAX in their back pocket where they made the
infrastructure investment. SUN needed the high-value machines to support
this protocol.
Now Lustre.
Lustre has a weird story and I'm not going to go into all of it. The
shortest, relevant, part is that while there was at least one solution
that DOE/NNSA felt acceptable, GPFS, it was not available on anything
other than an IBM platform and because DOE/NNSA had a semi-formal policy
of buying from different vendors at each of the three labs we were kind
of stuck. Other file systems, existing and imminent, at the time were
examined, but they were all distributed file systems and we needed IO
*bandwidth*. We needed lots and lots of bandwidth.
We also needed that ubiquitous thing that SUN had as one of their goals.
We didn't want to pay millions of dollars for another GPFS. We felt that
would only be painting ourselves into a corner. Whatever we did, the
result *had* to be open. It also had to be attractive to smaller sites
as we wanted to turn loose of the thing at some point. If it was
attractive for smaller machines we felt we would win in the long term
as, eventually, the cost to further and maintain this thing was spread
across the community.
As far as technical goals, I guess we just wanted GPFS, but open. More
though, we wanted it to survive in our platform roadmaps for at least a
decade. The actual technical requirements for the contract that DOE/NNSA
executed with HP (CFS was the sub-contractor responsible for
development) can be found here:
<http://www-cs-students.stanford.edu/~trj/SGS_PathForward_SOW.pdf>
LLNL used to host this but it's no longer there? Oh well, hopefully this
link will be good for a while, at least.
I'm just going to jump to the end and sum the goals up:
1) It must do *everything* NFS can. We relaxed the stateless thing,
though; see the next item for why.
2) It must support full POSIX semantics: last writer wins, POSIX locks,
etc.
3) It must support all of the transports we are interested in.
4) It must be scalable, in that we can cheaply attach storage and both
performance (reading *and* writing) and capacity within a single mounted
file system increase in direct proportion.
5) We wanted it to be easy, administratively. Our goal was that it be no
harder than NFS to set up and maintain. We were involving too many folks
with PhDs in the operation of our machines at the time. Before you yell
FAIL, I'll say we did try. I'll also say we didn't make CFS responsible
for this part of the task. Don't blame them overly much, OK?
6) We recognized we were asking for a stateful system, so we wanted to
mitigate that by having some focus on resiliency. These were big
machines and clients died all the time.
7) While not in the SOW, we structured the contract to accomplish some
future form of wide acceptance. We wanted it to be ubiquitous.
That's a lot of goals! For the technical ones, the main ones are all
pretty much structured to ask two things of what became Lustre. First,
give us everything NFS functionally does but go far beyond it in
performance. Second, give us everything NFS functionally does but make
it completely equivalent to a local file system, semantically.
There's a little more we have to consider. NFS4 is a different beast
than NFS2 or NFS3. NFS{2,3} had some serious issues that became more
prominent as time went by. First, security: it had none. Folks had
bandaged on some different things to try to cure this, but they weren't
standard across platforms. Second, it couldn't do the full semantics
POSIX requires. That was attacked with the NFS lock protocols, but it
was such an after-thought it will always remain problematic. Third, the new
authorization possibilities introduced by Microsoft and then POSIX,
called ACLs, had no way of being supported.
NFS4 addresses those by:
1) Introducing state. Can do full POSIX now without the lock servers.
Lots of resiliency mechanisms introduced to offset the downside of this,
too.
2) Formalizing and offering standardized authentication headers.
3) Introducing ACLs that map to equivalents in POSIX and Microsoft.
Strengths and Weaknesses of the Two
-----------------------------------
NFS4 does most everything Lustre can, with one very important exception:
IO bandwidth.
Both seem able to deliver metadata performance at roughly the same
speeds. File create, delete, and stat rates are about the same. NetApp
seems to have a partial enhancement. They bought the Spinnaker goodies
some time back and have deployed that technology, and redirection
too(?), within their servers. The good thing about that is that two users in
different directories *could* leverage two servers, independently, and,
so, scale metadata performance. It's not guaranteed but at least there
is the possibility. If the two users are in the same directory, it's not
much different, though, I'm thinking. Someone correct me if I'm wrong?
Both can offer full POSIX now. It's nasty in both cases but, yes, in
theory you can export mail directory hierarchies with locking.
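For reference, this is roughly what that POSIX advisory locking looks like
from an application's point of view. A minimal sketch only: the path is
invented, and, if I remember right, Lustre clients need the flock mount
option for fcntl locks to be coherent across nodes, while NFS routes them
through its lock protocol.

    #include <fcntl.h>
    #include <stdio.h>
    #include <unistd.h>

    /* Minimal sketch: take an exclusive POSIX advisory lock on a shared
     * file, update it, then release the lock. The path is hypothetical. */
    int main(void)
    {
        int fd = open("/mnt/shared/mailbox.lock", O_RDWR | O_CREAT, 0644);
        if (fd < 0) {
            perror("open");
            return 1;
        }

        struct flock fl = {
            .l_type   = F_WRLCK,    /* exclusive (write) lock */
            .l_whence = SEEK_SET,
            .l_start  = 0,
            .l_len    = 0,          /* 0 means "to end of file" */
        };

        if (fcntl(fd, F_SETLKW, &fl) < 0) {  /* block until granted */
            perror("fcntl(F_SETLKW)");
            return 1;
        }

        /* ... modify the protected data here ... */

        fl.l_type = F_UNLCK;                 /* release the lock */
        fcntl(fd, F_SETLK, &fl);
        close(fd);
        return 0;
    }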
The NFS client and server are far easier to set up and maintain. The
tools to debug issues are advanced. While the Lustre folks have done
much to improve this area, NFS is just leaps and bounds ahead. It's
easier to deal with NFS than Lustre. Just far, far easier, still.
NFS is just built into everything. My TV has it, for heck's sake. Lustre
is, seemingly, always an add-on. It's also a moving target. We're
constantly futzing with it, upgrading, and patching. Lustre might be
compilable most everywhere we care about, but building it isn't trivial.
The supplied modules are great but, still, moving targets in that we
wait for SUN to catch up to the vendor-supplied changes that affect
Lustre. Given Lustre's size and interaction with other components in the
OS, that happens far more frequently than desired. NFS just plain wins
the ubiquity argument at present.
NFS IO performance does *not* scale. It's still an in-band protocol. The
data is carried in the same message as the request and is, practically,
limited in size. Reads are more scalable than writes; a popular
file segment can be satisfied from the cache on reads, but that develops
issues at some point. For writes, NFS3 and NFS4 help in that they
directly support write-behind, so that a client doesn't have to wait for
data to go to disk, but it's just not enough. If one streams data
to/from the store, it can be larger than the cache. A client that might
read a file already made "hot", but at a very different rate, just loses.
A client that is writing is always looking for free memory to buffer content.
Again, with too many of these running simultaneously, performance descends to
the native speed of the attached back-end store, and that store can only
get so big.
Lustre IO performance *does* scale. It uses a 3rd-party transfer.
Requests are made to the metadata server and IO moves directly between
the affected storage component(s) and the client. The more storage
components, the less possibility of contention between clients and the
more data can be accepted/supplied per unit time.
NFS4 has a proposed extension, called pNFS, to address this problem. It
just introduces the 3rd-party data transfers that Lustre enjoys. If and
when that is a standard, and is well supported by clients and vendors,
the really big technical difference will virtually disappear. It's been
a long time coming, though. It's still not there. Will it ever be,
really?
The answer to the NFS vs. Lustre question comes down to the workload for
a given application then, since they do have overlap in their solution
space. If I were asked to look at a platform and recommend a solution I
would worry about IO bandwidth requirements. If the platform in question
were read-mostly and, practically, never needed sustained read or
write bandwidth, NFS would be an easy choice. I'd even think hard about
NFS if the platform created many files but all were very small; today's
filers have very respectable IOPS rates. If it came down to IO
bandwidth, I'm still on the parallel file system bandwagon. NFS just
can't deal with that at present and I do still have the folks, in house,
to manage the administrative burden.
Done. That was useful for me. I think five years ago I might have opted
for Lustre in the "create many small files" case, where I would consider
NFS today, so re-examining the motivations, relative strengths, and
weaknesses of both was useful. As I said, I did this more as a
self-exercise than anything else but I hope you can find something
useful here, too. The family is back from their errands, too :) Best
wishes and good luck.
--Lee
Thanks for posting this. I found the background and perspective very
interesting.
John
John K. Dawson
jkda...@gmail.com
612-860-2388
This should be on the Wiki :-)
On Sat, Aug 29, 2009 at 11:56:40AM -0600, Lee Ward wrote:
> NFS4 addresses those by:
>
> 1) Introducing state. Can do full POSIX now without the lock servers.
> Lots of resiliency mechanisms introduced to offset the downside of this,
> too.
NFS4 implementations are able to handle Posix advisory locks, but unlike
Lustre, they don't support full Posix filesystem semantics. For example, NFS4
still follows the traditional NFS close-to-open cache consistency model whereas
with Lustre, individual write()s are atomic and become immediately visible to
all clients.
Regards,
Daniel.
NFSv4 can't handle O_APPEND, and has those close-to-open semantics.
Those are the two large departures from POSIX in NFSv4.
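To make the O_APPEND point concrete, a minimal sketch (path and record are
invented). On a local POSIX filesystem, and as far as I know on Lustre, each
such write lands atomically at the current end of file; over NFS the append
offset is effectively resolved on the client, so concurrent appenders can
clobber each other's records.

    #include <fcntl.h>
    #include <stdio.h>
    #include <string.h>
    #include <unistd.h>

    /* Minimal sketch: append one log record to a shared file. */
    int main(void)
    {
        int fd = open("/mnt/shared/app.log",
                      O_WRONLY | O_CREAT | O_APPEND, 0644);
        if (fd < 0) {
            perror("open");
            return 1;
        }

        const char *rec = "worker 42: job done\n";
        /* With O_APPEND the kernel is supposed to position this write at
         * the current end of file atomically; NFS cannot provide that
         * guarantee because the end-of-file offset is looked up remotely
         * before the data is sent. */
        if (write(fd, rec, strlen(rec)) < 0)
            perror("write");

        close(fd);
        return 0;
    }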
NFSv4.1 also adds metadata/data separation and data distribution, much
like Lustre, but with the same POSIX semantics departures mentioned
above. Also, NFSv4.1's "pNFS" concept doesn't have room for
"capabilities" (in the distributed filesystem sense, not in the Linux
capabilities sense), which means that OSSes and MDSes have to
communicate to get permissions to be enforced. There are also
differences with respect to recovery, etcetera.
One thing about NFS is that it's meant to be neutral w.r.t. the type of
filesystem it shares. So NFSv4, for example, has features for dealing
with filesystems that don't have a notion of persistent inode number.
Whereas Lustre has its own on-disk format and therefore can't be used to
share just any type of filesystem.
Nico
--
You have "stumbled on to" an interesting, significant difference between
NFS and Lustre. NFS is a protocol for sharing an existing filesystem.
Lustre is a filesystem -- so much so in fact, that NFS can even share it
out.
b.
Indeed. pNFS is not really a protocol for sharing generic, pre-existing
filesystems anymore either. The moment you want to distribute the
filesystem itself you can no longer just substitute any filesystem into
an implementation of the protocol.
(Yes, I understand that when Lustre was layered above the VFS one could
conceivably have changed the underlying fs, though that didn't work out,
if only for practical reasons. But even then, one couldn't have used the
underlying fs directly, not in a meaningful way.)
On Sun, Aug 30, 2009 at 04:12:11PM -0500, Nicolas Williams wrote:
> NFSv4 can't handle O_APPEND, and has those close-to-open semantics.
> Those are the two large departures from POSIX in NFSv4.
Along these lines, it's probably worth mentioning commit-on-close as well, an
area where NFS (v3 and v4, optionally relaxed when using write delegations) is
more strict than Posix. This is to make sure that NFS still has the possibility
to notify the user about errors when trying to save their data. Lustre's
standard config follows Posix and allows dirty client-side caches after
close(). Performance improves as a result, of course, but in case something
goes wrong on the net or the server, users potentially lose data just like on
any local Posix filesystem. The difference being that users tend to notice when
their local machine crashes. It's much easier to miss a remote server or a
switch going down, and hence suffer from silent data loss. (Admins will
typically notice, eg. via eviction messages in the logs, but have a hard time
telling which files had been affected.) The solution is to fsync() all
valuable data on a Posix filesystem, but that's not necessarily within reach
for an average end user.
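For what it's worth, the fsync() pattern being suggested looks like this.
A minimal sketch with an invented path; it applies equally to Lustre, NFS,
or a local filesystem.

    #include <fcntl.h>
    #include <stdio.h>
    #include <string.h>
    #include <unistd.h>

    /* Minimal sketch: write valuable data and force it to stable storage
     * before close(), so a later server or network failure cannot silently
     * discard it. The path is hypothetical. */
    int main(void)
    {
        int fd = open("/mnt/lustre/results.dat",
                      O_WRONLY | O_CREAT | O_TRUNC, 0644);
        if (fd < 0) {
            perror("open");
            return 1;
        }

        const char *data = "precious results\n";
        if (write(fd, data, strlen(data)) < 0) {
            perror("write");
            return 1;
        }

        /* Block until the data is on stable storage; any error the servers
         * hit while persisting it is reported here rather than lost. */
        if (fsync(fd) < 0) {
            perror("fsync");
            return 1;
        }

        close(fd);
        return 0;
    }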
Regards,
Daniel.
Hi,
> Lustre's
> standard config follows Posix and allows dirty client-side caches after
> close(). Performance improves as a result, of course, but in case something
> goes wrong on the net or the server, users potentially lose data just like on
> any local Posix filesystem.
I don't think this is true. This is something that I am only
peripherally knowledgeable about and I am sure somebody like Andreas or
Johann can correct me if/where I go wrong...
You are right that there is an opportunity for a client to write to an
OST and get its write(2) call returned before data goes to physical
disk. But Lustre clients know that, and therefore they keep the state
needed to replay that write(2) to the server until the server sends back
a commit callback. The commit callback is what tells the client that
the data actually went to physical media and that it can now purge any
state required to replay that transaction.
Until that commit callback is received, the client holds on to whatever
state it would need to do that write(2) all over again, for exactly the
case you cite, which is the server going down before the data goes to
physical media.
This data, which the client caches until the commit callback is
received, is what the recovery mechanisms use when a
target comes back on-line.
Hope that clarifies things, and further, I hope my understanding is
correct as is my explanation.
b.
There are still cases where data can be lost, though:
1) the client crashes before the server has written the data to disk
(data that made it to the server should be written, but that is asynchronous),
2) the server returns an error to the client (EIO, i.e. due to errors on
the OST),
3) the client is evicted by the server (i.e., due to communication issues)
before writing data to disk, or
4) the server reboots and recovery fails (i.e., in 1.6.x a _different_ client
does not reconnect to replay transactions). With version-based recovery
in 1.8, clients might be able to still replay some transactions even if
another client crashed/rebooted while the server was down.
fsync() is the best way to ensure data is on disk, for both Lustre and a
local filesystem.
Kevin
The client will have DLM locks outstanding if it has dirty data, so that
the client's death can be used to detect that its open, dirty files are
now potentially corrupted.
Client death with dirty data is not all that different from process
death with dirty data in user-land. Think of an application that does
write(2), write(2), close(2), _exit(2), but dies between writes.
Compare that to a client that dies after flushing the first of those
writes but before flushing the second, though after the application
calls close(2). Nothing special is usually done in the first case, even
though, if the process did have byte-range locks outstanding, the OS
could flag the affected file as potentially corrupted.
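In code, the scenario described above is simply (file name invented):

    #include <fcntl.h>
    #include <string.h>
    #include <unistd.h>

    /* Two related writes, then close and exit. If the process dies between
     * the two write(2)s, the kernel still flushes the first one, leaving
     * the file half-updated -- much like a client node dying after flushing
     * only the first write. */
    int main(void)
    {
        int fd = open("/tmp/state.dat", O_WRONLY | O_CREAT, 0644);
        if (fd < 0)
            _exit(1);

        write(fd, "record-part-1\n", strlen("record-part-1\n"));
        /* <-- a crash here leaves only part 1 on disk */
        write(fd, "record-part-2\n", strlen("record-part-2\n"));

        close(fd);
        _exit(0);
    }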
I don't think Lustre actually does anything to mark files that it
could detect as potentially corrupted. Some
applications can recover automatically -- think of databases, such as
SQLite3, or think of plain log files. Other applications might well be
affected. Since corruption detection in this case is heuristic, and
since the impact will vary by application, I don't think there's an easy
answer as to what Lustre ought to do about it. Ideally we could track
the "potentially corrupt" status as an advisory meta-data item that
could be fetched with a stat(2)-like system call, and have applications
reset it when they recover.
Nico
--
It is very different; with a user application crash, all the writes to
that point will be completed by the kernel.
With the node crash, there are no guarantees about what made it to disk
since the last fsync() -- a later write may be partly flushed before an
earlier write, so the node could crash after the second write made it to
disk, but before the first one does. But this is also true for a local
filesystem.
Kevin
But not writes that the application hasn't done yet but which it was
working on putting together at the time that it died. If those never-
done writes are related to writes that did get made, then you may have a
problem.
For example, consider an RDBMS. Say you begin a transaction, do some
INSERTs/UPDATEs/DELETEs, then COMMIT. This will almost certainly
require multiple write(2)s (even for a DB that uses COW principles).
And suppose that somewhere in the middle the process doing the writes
dies. There should be some undo/redo log somewhere, and on restart the
RDBMS must recover from a partially unfinished transaction.
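To make that concrete, here is a minimal sketch using SQLite's C API; the
database name, table, and amounts are made up. The single logical COMMIT
turns into several write(2)s to the database file and its journal, and if
the process (or the client node) dies partway through, SQLite rolls the
incomplete transaction back from its journal the next time the database is
opened.

    #include <sqlite3.h>
    #include <stdio.h>

    /* Minimal sketch of a multi-write transaction with undo/redo logging
     * handled by SQLite. Error handling is kept to a bare minimum. */
    int main(void)
    {
        sqlite3 *db;
        if (sqlite3_open("ledger.db", &db) != SQLITE_OK) {
            fprintf(stderr, "cannot open database\n");
            return 1;
        }

        sqlite3_exec(db, "CREATE TABLE IF NOT EXISTS acct"
                         "(id INTEGER PRIMARY KEY, bal INTEGER);",
                     NULL, NULL, NULL);

        sqlite3_exec(db, "BEGIN;", NULL, NULL, NULL);
        sqlite3_exec(db, "UPDATE acct SET bal = bal - 100 WHERE id = 1;",
                     NULL, NULL, NULL);
        sqlite3_exec(db, "UPDATE acct SET bal = bal + 100 WHERE id = 2;",
                     NULL, NULL, NULL);
        /* A crash here means neither UPDATE is visible after recovery:
         * on the next open, SQLite rolls the transaction back using its
         * journal. Only after COMMIT returns is the change durable. */
        sqlite3_exec(db, "COMMIT;", NULL, NULL, NULL);

        sqlite3_close(db);
        return 0;
    }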
[ ... ]
lee> 3) It must support all of the transports we are interested in.
Except for some corner cases (that an HEP site might well have)
that today tends to reduce to the classic Ethernet/IP pair...
lee> 4) It must be scalable, in that we can cheaply attach
lee> storage and both performance (reading *and* writing) and
lee> capacity within a single mounted file system increase in
lee> direct proportion.
I suspect that scalability is more of a dream, as to me it
involves more requirements including scalable backup (not so
easy) and scalable 'fsck' (not so easy).
These are easier with Lustre because it does not provide "a
single mounted file system" but a single mounted *namespace*
which is a very different thing, even if largely equivalent for
most users.
[ ... ]
lee> NFS4 does most everything Lustre can with one very
lee> important exception, IO bandwidth. [ ... ] Lustre IO
lee> performance *does* scale. It uses a 3rd-party transfer.
That can be summarized by saying that Lustre is a parallel
distributed metafilesystem, while NFS is a protocol used to
access what usually is something not distributed and an actual
filesystem. The limitations of the NFS protocol can be overcome,
and as you say, pNFS turns it into a parallel distributed
metafilesystem too:
lee> NFS4 has a proposed extension, called pNFS, to address this
lee> problem. It just introduces the 3rd-party data transfers
lee> that Lustre enjoys. If and when that is a standard, and is
lee> well supported by clients and vendors, the really big
lee> technical difference will virtually disappear. It's been a
lee> long time coming, though. It's still not there. Will it
lee> ever be, really?
My impression is that it is a lot more real than it was only a
couple years ago, and here is an amusing mashup:
http://FT.ORNL.gov/pubs-archive/ipdps2009-wyu-final.pdf
«Parallel NFS (pNFS) is an emergent open standard for
parallelizing data transfer over a variety of I/O
protocols. Prototypes of pNFS are actively being developed
by industry and academia to examine its viability and
possible enhancements. In this paper, we present the design,
implementation, and evaluation of lpNFS, a Lustre-based
parallel NFS. [ ... ] Our initial performance evaluation
shows that the performance of lpNFS is comparable to that of
original Lustre.»
lee> Done. That was useful for me. I think five years ago I
lee> might have opted for Lustre in the "create many small
lee> files" case, where I would consider NFS today,
Looks optimistic to me -- I don't see any good solution to the
"create many small files" case, at least as to shared storage.
For smaller situations I am looking, out of interest, at some
other distributed filesystems, which are a bit more researchy,
but seem fairly reliable already.
On Mon, Aug 31, 2009 at 04:34:58PM -0400, Brian J. Murrell wrote:
> On Mon, 2009-08-31 at 21:56 +0200, Daniel Kobras wrote:
> > Lustre's
> > standard config follows Posix and allows dirty client-side caches after
> > close(). Performance improves as a result, of course, but in case something
> > goes wrong on the net or the server, users potentially lose data just like on
> > any local Posix filesystem.
>
> I don't think this is true. This is something that I am only
> peripherally knowledgeable about and I am sure somebody like Andreas or
> Johann can correct me if/where I go wrong...
>
> You are right that there is an opportunity for a client to write to an
> OST and get it's write(2) call returned before data goes to physical
> disk. But Lustre clients know that, and therefore they keep the state
> needed to replay that write(2) to the server until the server sends back
> a commit callback. The commit callback is what tells the client that
> the data actually went to physical media and that it can now purge any
> state required to replay that transaction.
Lustre can recover from certain error conditions just fine, of course, but
still it cannot recover gracefully from others. Think double failures or, more
likely, connectivity problems to a subset of hosts. For instance, if, say, an
Ethernet switch goes down for a few minutes with IB still available, all
Ethernet-connected clients will get evicted. Users won't necessarily notice
that there was a problem, but they've just potentially lost data. VBR makes the
data loss less likely in this case, but the possibility is still there. I'd
suspect you'll always be able to construct similar corner cases as long as the
networked filesystem allows dirty caches after close().
I'd like to throw in my 2c as well. I'm not a Lustre dev, just a
sysadmin who manages a small (<10TB) Lustre data store.
For some background, we're using it in a web hosting environment for a
particularly large set of websites. We're also considering it for use
as a storage backend to a cluster of VPS servers. Our "default" choice
for new clusters is usually NFS, for many of the reasons mentioned
already: pretty good read performance, makes good use of client- and
server-side caching with no extra work, and above all it's *extremely*
simple to maintain. You can install 2 machines with completely stock
Linux distros, and the odds are both of them will support being an NFS
server *and* client, and will talk to each other with only minimal
effort.
Our problems with NFS: Occasionally we need better locking support
than NFS delivers. Often capacity scalability is a concern (if you
planned for it, you can grow the NFS-exported volume to some extent).
Scaling out to many clients (frontend web servers in our case,
usually) is sometimes a problem, although realistically we just don't
need that many frontends very often.
The downside to Lustre is the complexity. Initial setup is much
simpler than, say, Red Hat GFS or OCFS2, but still *vastly* more
complicated than NFS, due in large part to the ubiquity of NFS. If NFS
breaks (and it rarely does for us), the fix is usually pretty simple.
If Lustre breaks... well, let's just say I don't like being the guy
on-call. It could be worse, but it's no picnic. We've had a lot more
downtime with our *redundant* Lustre cluster than we ever did with the
standalone NFS servers it replaced.
Documentation-wise, a lot of NFS documentation is extremely dated, and
what used to be good advice often isn't anymore. My personal opinion
is that the Red Hat GFS documentation is an utter disaster. It looks
great from 50,000 feet but is nigh-impossible to implement without
much head-bashing. You may have found that really nice article in Red
Hat Magazine about NFS vs GFS scalability. Looks cool, doesn't it? We
tried that and gave up a week later when we just couldn't make it
stable; yeah, we could make it work, but it'd be a *constant* headache.
Lustre, on the other hand, has pretty good documentation. The admin
guide is beefy and detailed, and has a lot of good info. Some of it
feels dated (1.6 vs 1.8), but all in all I'm happy with it.
Redundancy is a problem: you can sorta do HA-NFS, but it's not
particularly pretty, and it's not conveniently active-active. Lustre
has some redundancy abilities, although none of them are what I'd call
"native". To me, native failover redundancy would mean Lustre handles
the data migration/synchronization and the actual failover. Lustre
supports multiple targets for the same data, and will try them both if
it's not working... but it's up to *you* to make sure the data is
actually *available* in both places. We use DRBD for this, and
heartbeat to handle it. It mostly works, but I'm not really happy with
it. It's no worse than what NFS offers, and sometimes better.
You can easily do a LOT of disk space on one server if needed. I've
seen a 25TB array on one server (Dell MD1000's + Windows!), and
*heard* of as much as 67TB on one server (not NFS though). I really
don't know how well NFS handles arrays that size, but it should at
least function. Of course, with Lustre, you can still do that much on
one server, *plus* more servers with that much too.
There's also staffing to consider. Being so much simpler, NFS wins
because you don't need as highly trained staff to deal with it. NFS
probably costs less from a personnel standpoint: Lustre admins are
rarer, and therefore probably command higher salaries, and it's not
obvious that you would need fewer of them. At some point a manager
will have to decide if the technological benefits of Lustre outweigh
the extra staffing costs to maintain it (if there actually are any
such costs).
All in all, neither is really ideal, and they have different
strengths. If you need to be 24/7 and not a lot of your staff is going
to have time to become proficient with a complicated storage subsystem
like Lustre, you're probably better off with NFS. If you really need
better scalability or POSIX-ness, and can stand the administrative
overhead, Lustre works.
I guess the proof is in the pudding: we're not planning on migrating
en-masse from NFS to Lustre. We're sticking with NFS as our default
choice, at least for the time being.
Happy sysadmin-ing,
Jake