Hi guys,
I finally got a chance to catch up and bring up Rick’s pNFS setup on a couple of test machines. He’s right, obviously - the “plan A” approach is a bit convoluted and, not at all surprisingly, slow. With all of those transits twixt kernel and userland, not to mention glusterfs itself, which has not really been tuned for our platform (there are a number of papers on this that not all of us have read yet), we’re obviously still in the “first make it work” stage.
That said, I think there are probably more possible plans than just A and B here, and we should give the broader topic of “what does FreeBSD want to do in the Enterprise / Cloud computing space?" at least some consideration at the same time, since there are more than a few goals running in parallel here.
First, let’s talk about our story around clustered filesystems + associated command-and-control APIs in FreeBSD. There is something of an embarrassment of riches in the industry at the moment - glusterfs, ceph, Hadoop HDFS, RiakCS, MooseFS, etc. All or most of them offer different pros and cons, and all offer more than just the ability to store files and scale “elastically”. They also have REST APIs for configuring and monitoring the health of the cluster, some offer object as well as file storage, and Riak offers a distributed KVS for storing information *about* file objects in addition to the objects themselves (and when your application involves storing and managing several million photos, for example, the idea of distributing the index as well as the files in a fault-tolerant fashion is also compelling). Some, if not most, of them are also far better supported under Linux than FreeBSD (I don’t think we even have a working ceph port yet). I’m not saying we need to blindly follow the herds and do all the same things others are doing here, either, I’m just saying that it’s a much bigger problem space than simply “parallelizing NFS” and if we can kill multiple birds with one stone on the way to doing that, we should certainly consider doing so.
Why? Because pNFS was first introduced in draft form (what eventually became RFC5661 <https://datatracker.ietf.org/doc/rfc5661/>) back around 2005. The Linux folks have been working on it <http://events.linuxfoundation.org/sites/events/files/slides/pnfs.pdf> since 2006. Ten years is a long time in this business, and when I raised the topic of pNFS at the recent SNIA DSI conference (where storage developers gather to talk about trends and things), the most prevalent reaction I got was “people are still using pNFS?!” This is clearly one of those technologies that may still have some runway left, but it’s been rapidly overtaken by other approaches to solving more or less the same problems in coherent, distributed filesystem access and if we want to get mindshare for this, we should at least have an answer ready for the “why did you guys do pNFS that way rather than just shimming it on top of ${someNewerHotness}??” argument. I’m not suggesting pNFS is dead - hell, even AFS <https://www.openafs.org/> still appears to be somewhat alive, but there’s a difference between appealing to an increasingly narrow niche and trying to solve the sorts of problems most DevOps folks working At Scale these days are running into.
That is also why I am not sure I would totally embrace the idea of a central MDS being a Real Option. Sure, the risks can be mitigated (as you say, by mirroring it), but even saying the words “central MDS” (or central anything) may be such a turn-off to those very same DevOps folks, folks who have been burned so many times by SPOFs and scaling bottlenecks in large environments, that we'll lose the audience the minute they hear the trigger phrase. Even if it means signing up for Other Problems later, it’s a lot easier to “sell” the concept of completely distributed mechanisms where, if there is any notion of centralization at all, it’s at least the result of a quorum election and the DevOps folks don’t have to do anything manually to cause it to happen - the cluster is “resilient" and "self-healing" and they are happy with being able to say those buzzwords to the CIO, who nods knowingly and tells them they’re doing a fine job!
Let’s get back, however, to the notion of downing multiple avians with the same semi-spherical kinetic projectile: What seems to be The Rage at the moment, and I don’t know how well it actually scales since I’ve yet to be at the pointy end of such a real-world deployment, is the idea of clustering the storage (“somehow”) underneath and then providing NFS and SMB protocol access entirely in userland, usually with both of those services cooperating with the same lock manager and even the same ACL translation layer. Our buddies at Red Hat do this with glusterfs at the bottom and NFS Ganesha + Samba on top - I talked to one of the Samba core team guys at SNIA and he indicated that this was increasingly common, with the team having helped here and there when approached by different vendors with the same idea. We (iXsystems) also get a lot of requests to be able to make the same file(s) available via both NFS and SMB at the same time and they don’t much at all like being told “but that’s dangerous - don’t do that! Your file contents and permissions models are not guaranteed to survive such an experience!” They really want to do it, because the rest of the world lives in heterogeneous environments and that’s just the way it is.
Even the object storage folks, like Openstack’s Swift project, are spending significant amounts of mental energy on the topic of how to re-export their object stores as shared filesystems over NFS and SMB, the single consistent and distributed object store being, of course, Their Thing. They wish, of course, that the rest of the world would just fall into line and use their object system for everything, but they also get that the "legacy stuff” just won’t go away and needs some sort of attention if they’re to remain players at the standards table.
So anyway, that’s the view I have from the perspective of someone who actually sells storage solutions for a living, and while I could certainly “sell some pNFS” to various customers who just want to add a dash of steroids to their current NFS infrastructure, or need to use NFS but also need to store far more data into a single namespace than any one box will accommodate, I also know that offering even more elastic solutions will be a necessary part of serving the growing contingent of folks who are not tied to any existing storage infrastructure and have various non-greybearded folks shouting in their ears about object this and cloud that. Might there not be some compromise solution which allows us to put more of this in userland with fewer context switches in and out of the kernel, also giving us the option of presenting a more united front to multiple protocols that require more ACL and lock impedance-matching than we’d ever want to put in the kernel anyway?
- Jordan
Put another way, I'd really like to hear a "user" perspective.
Actually, I would have worded this as "will anyone ever use pNFS?".
Although 10 years is a long time in this business, it doesn't seem to be long
at all in the standards world where the NFSv4 protocols are being developed.
- You note that the Linux folk started development in 2006.
I will note that RFC5661 (the RFC that describes pNFS) is dated 2010.
I will also note that I believe the first vendor shipment of a server that supported pNFS
happened sometime after the RFC was published.
- I could be wrong, but I'd guess that Netapp's clustered Filers were the
first to ship, about 4 years ago.
To date, very few vendors have actually shipped working pNFS servers,
as far as I am aware. Other than Netapp, the only ones I know of that have shipped
are the large EMC servers (not Isilon).
I am not sure if Oracle/Solaris has ever shipped a pNFS server to customers yet.
Same goes for Panasas. I am not aware of a Linux-based pNFS server usable in
a production environment, although Ganesha-NFS might be shipping with pNFS support now.
- If others are aware of other pNFS servers that are shipping to customers,
please correct me. (I haven't been to a NFSv4.1 testing event for 3 years,
so my info is definitely dated.)
Note that the "Flex Files" layout I used for the Plan A experiment is only an
Internet draft at this time and hasn't even made it to the RFC stage.
--> As such, I think it is very much an open question whether this protocol
    will become widely used or end up as yet another forgotten standard.
I also suspect that some storage vendors that have invested considerable
resources in NFSv4.1/pNFS development might ask the same question in-house;-)
> This is clearly one of those
> technologies that may still have some runway left, but it’s been rapidly
> overtaken by other approaches to solving more or less the same problems in
> coherent, distributed filesystem access and if we want to get mindshare for
> this, we should at least have an answer ready for the “why did you guys do
> pNFS that way rather than just shimming it on top of ${someNewerHotness}??”
> argument. I’m not suggesting pNFS is dead - hell, even AFS
> <https://www.openafs.org/> still appears to be somewhat alive, but there’s a
> difference between appealing to an increasingly narrow niche and trying to
> solve the sorts of problems most DevOps folks working At Scale these days
> are running into.
>
> That is also why I am not sure I would totally embrace the idea of a central
> MDS being a Real Option. Sure, the risks can be mitigated (as you say, by
> mirroring it), but even saying the words “central MDS” (or central anything)
> may be such a turn-off to those very same DevOps folks, folks who have been
> burned so many times by SPOFs and scaling bottlenecks in large environments,
> that we'll lose the audience the minute they hear the trigger phrase. Even
> if it means signing up for Other Problems later, it’s a lot easier to “sell”
> the concept of completely distributed mechanisms where, if there is any
> notion of centralization at all, it’s at least the result of a quorum
> election and the DevOps folks don’t have to do anything manually to cause it
> to happen - the cluster is “resilient" and "self-healing" and they are happy
> with being able to say those buzzwords to the CIO, who nods knowingly and
> tells them they’re doing a fine job!
>
I'll admit that I'm a bits and bytes guy. I have a hunch how difficult it is
to get "resilient" and "self-healing" to really work. I also know it is way
beyond what I am capable of.
> Let’s get back, however, to the notion of downing multiple avians with the
> same semi-spherical kinetic projectile: What seems to be The Rage at the
> moment, and I don’t know how well it actually scales since I’ve yet to be at
> the pointy end of such a real-world deployment, is the idea of clustering
> the storage (“somehow”) underneath and then providing NFS and SMB protocol
> access entirely in userland, usually with both of those services cooperating
> with the same lock manager and even the same ACL translation layer. Our
> buddies at Red Hat do this with glusterfs at the bottom and NFS Ganesha +
> Samba on top - I talked to one of the Samba core team guys at SNIA and he
> indicated that this was increasingly common, with the team having helped
> here and there when approached by different vendors with the same idea. We
> (iXsystems) also get a lot of requests to be able to make the same file(s)
> available via both NFS and SMB at the same time and they don’t much at all
> like being told “but that’s dangerous - don’t do that! Your file contents
> and permissions models are not guaranteed to survive such an experience!”
> They really want to do it, because the rest of the world lives in
> Heterogenous environments and that’s just the way it is.
>
If you want to make SMB and NFS work together on the same underlying file systems,
I suspect it is doable, although messy. To do this with the current FreeBSD nfsd,
it would require someone with Samba/Windows knowledge pointing out what Samba
needs in order to interact with NFSv4, and those hooks could probably be implemented.
(I know nothing about Samba/Windows, so I'd need someone else doing that side
of it.)
I actually mentioned Ganesha-NFS at the little talk/discussion I gave.
At this time, they have ripped the FreeBSD port out of their sources and they
use Linux-specific thread primitives.
--> It would probably be significant work to get Ganesha-NFS up to speed on
FreeBSD. Maybe a good project, but it needs some person/group dedicating
resources to get it to happen.
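Just to give a flavour of the kind of Linux-ism involved (I have not audited which
primitives Ganesha actually uses; the raw gettid() syscall is simply a common example,
and the helper name below is made up), the portability shims tend to look something
like this:

    #include <stdio.h>
    #if defined(__FreeBSD__)
    #include <pthread_np.h>         /* pthread_getthreadid_np() */
    #else
    #include <sys/syscall.h>        /* Linux-only: SYS_gettid */
    #include <unistd.h>
    #endif

    /* Hypothetical helper: return a numeric id for the calling thread. */
    static long
    my_gettid(void)
    {
    #if defined(__FreeBSD__)
            return ((long)pthread_getthreadid_np());
    #else
            return ((long)syscall(SYS_gettid));
    #endif
    }

    int
    main(void)
    {
            printf("tid = %ld\n", my_gettid());
            return (0);
    }

(Compile with -pthread on FreeBSD.) Multiply that by every Linux-specific call in the
tree and you get an idea of the size of the porting job.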
> Even the object storage folks, like Openstack’s Swift project, are spending
> significant amounts of mental energy on the topic of how to re-export their
> object stores as shared filesystems over NFS and SMB, the single consistent
> and distributed object store being, of course, Their Thing. They wish, of
> course, that the rest of the world would just fall into line and use their
> object system for everything, but they also get that the "legacy stuff” just
> won’t go away and needs some sort of attention if they’re to remain players
> at the standards table.
>
> So anyway, that’s the view I have from the perspective of someone who
> actually sells storage solutions for a living, and while I could certainly
> “sell some pNFS” to various customers who just want to add a dash of
> steroids to their current NFS infrastructure, or need to use NFS but also
> need to store far more data into a single namespace than any one box will
> accommodate, I also know that offering even more elastic solutions will be a
> necessary part of offering solutions to the growing contingent of folks who
> are not tied to any existing storage infrastructure and have various
> non-greybearded folks shouting in their ears about object this and cloud
> that. Might there not be some compromise solution which allows us to put
> more of this in userland with less context switches in and out of the
> kernel, also giving us the option of presenting a more united front to
> multiple protocols that require more ACL and lock impedance-matching than
> we’d ever want to put in the kernel anyway?
>
For SMB + NFS in userland, the combination of Samba and Ganesha is probably
your main open source choice, as far as I am aware.
I am one guy who does this as a spare time retirement hobby. As such, doing
something like a Ganesha port etc is probably beyond what I am interested in.
When saying this, I don't want to imply that it isn't a good approach.
You sent me the URL for an abstract for a paper discussing how Facebook is
using GlusterFS. It would be nice to get more details w.r.t. how they use it,
such as:
- How do their client servers access it? (NFS, Fuse, or ???)
- Whether or not they've tried the Ganesha-NFS stuff that GlusterFS is
transitioning to?
Put another way, they might have some insight into whether NFS in userland
via Ganesha works well or not.
Hopefully some "users" for this stuff will respond, rick
ps: Maybe this could be reposted in a place they are likely to read it.
> On Jun 18, 2016, at 3:50 PM, Jordan Hubbard <j...@ixsystems.com> wrote:
>
>
>> On Jun 13, 2016, at 3:28 PM, Rick Macklem <rmac...@uoguelph.ca> wrote:
>>
>> You may have already heard of Plan A, which sort of worked
>> and you could test by following the instructions here:
>>
>> http://people.freebsd.org/~rmacklem/pnfs-setup.txt
>>
>> However, it is very slow for metadata operations (everything other than
>> read/write) and I don't think it is very useful.
>
I think we should also be careful to define our terms in such a discussion. Specifically:
1. Are we talking about block-level clustering underneath ZFS (e.g. HAST or ${somethingElse}) or otherwise incorporated into ZFS itself at some low level? If you Google for “High-availability ZFS” you will encounter things like RSF-1 or the somewhat more mysterious Zetavault (http://www.zeta.systems/zetavault/high-availability/), but it’s not entirely clear how these technologies work; they simply claim to “scale-out ZFS” or “cluster ZFS” (which can be done within ZFS or one level above and still probably pass the Marketing Test for what people are willing to put on a web page).
2. Are we talking about clustering at a slightly higher level, in a filesystem-agnostic fashion which still preserves filesystem semantics?
3. Are we talking about clustering for data objects, in a fashion which does not necessarily provide filesystem semantics (a sharding database which can store arbitrary BLOBs would qualify)?
For all of the above: Are we seeking to be compatible with any other mechanisms, or are we talking about a FreeBSD-only solution?
This is why I brought up glusterfs / ceph / RiakCS in my previous comments - when talking to the $users that Rick wants to involve in the discussion, they rarely come to the table asking for “some or any sort of clustering, don’t care which or how it works” - they ask if I can offer an S3 compatible object store with horizontal scaling, or if they can use NFS in some clustered fashion where there’s a single namespace offering petabytes of storage with configurable redundancy such that no portion of that namespace is ever unavailable.
I’d be interested in what Justin had in mind when he asked Matt about this. Being able to “attach ZFS pools to one another” in such a fashion that all clients just see One Big Pool and ZFS’s own redundancy / snapshotting characteristics magically apply to the überpool would be Pretty Cool, obviously, and would allow one to do round-robin DNS for NFS such that any node could serve the same contents, but that also sounds pretty ambitious, depending on how it’s implemented.
Umm, look at Panzura, who have been selling this on FreeBSD for 4 years
<plug>and need FreeBSD devs in the bay area (or closer than me)</plug>
Well, unlike Panzura, I think we’re also looking for an open source solution that can be upstreamed to FreeBSD and/or (probably better) the OpenZFS project. Any takers on that? My hand is up. :)
- Jordan
I suspect #1 sits at a low enough level that redirecting I/O via the pNFS layouts
isn't practical, since ZFS is taking care of block allocations, etc.
I see #3 as a separate problem space, since NFS deals with files and not objects.
However, GlusterFS maps file objects on top of the POSIX-like FS, so I suppose that
could be done at the client end. (What glusterfs.org calls SwiftOnFile, I think?)
It is also possible to map POSIX files onto file objects, but that sounds like more
work, which would need to be done under the NFS service.
> For all of the above: Are we seeking to be compatible with any other
> mechanisms, or are we talking about a FreeBSD-only solution?
>
> This is why I brought up glusterfs / ceph / RiakCS in my previous comments -
> when talking to the $users that Rick wants to involve in the discussion,
> they rarely come to the table asking for “some or any sort of clustering,
> don’t care which or how it works” - they ask if I can offer an S3 compatible
> object store with horizontal scaling, or
> if they can use NFS in some
> clustered fashion where there’s a single namespace offering petabytes of
> storage with configurable redundancy such that no portion of that namespace
> is ever unavailable.
>
I tend to think of this last case as the target for any pNFS server. The basic
idea is to redirect the I/O operations to wherever the data is actually stored,
so that I/O performance doesn't degrade with scale.
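To make the "redirect the I/O" idea concrete, here is a grossly over-simplified sketch
of the client-side flow from RFC5661. The mds_*/ds_* helpers and the host name are made
up for illustration; a real client does all of this with compound NFSv4.1 RPCs and real
layout/deviceid structures:

    #include <stdio.h>

    struct layout { const char *ds_addr; };     /* where the data lives */

    /* Stand-ins for RPCs against the MetaData Server (MDS). */
    static struct layout
    mds_open_and_layoutget(const char *path)
    {
            printf("MDS: OPEN + LAYOUTGET for %s\n", path);
            return ((struct layout){ .ds_addr = "ds1.example.net" });
    }

    static void
    mds_layoutreturn(const char *path)
    {
            printf("MDS: LAYOUTRETURN for %s\n", path);
    }

    /* Stand-in for I/O done directly against a Data Server (DS). */
    static void
    ds_read(const struct layout *lo, long off, long len)
    {
            printf("DS %s: READ %ld bytes at offset %ld (MDS not involved)\n",
                lo->ds_addr, len, off);
    }

    int
    main(void)
    {
            struct layout lo = mds_open_and_layoutget("/pool/bigfile");

            /* Bulk I/O goes straight to wherever the data actually is. */
            ds_read(&lo, 0L, 1048576L);
            ds_read(&lo, 1048576L, 1048576L);

            mds_layoutreturn("/pool/bigfile");
            return (0);
    }

With the Flex Files layout mentioned above, the DS in that sketch can even be an
ordinary NFSv3 server.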
If redundancy is a necessary feature, then maybe Plan A is preferable to Plan B,
since GlusterFS does provide for redundancy and resilvering of lost copies, at
least from my understanding of the docs on gluster.org.
I'd also like to see how GlusterFS performs on a typical Linux setup.
Even without having the nfsd use FUSE, access of GlusterFS via FUSE results in crossing
user (syscall on mount) --> kernel --> user (glusterfs daemon) within the client machine,
if I understand how GlusterFS works. Then the gluster brick server glusterfsd daemon does
file system syscall(s) to get at the actual file on the underlying FS (xfs or ZFS or ...).
As such, there are already a lot of user<->kernel boundary crossings.
I wonder how much delay is added by the extra nfsd step for metadata?
- I can't say much about performance of Plan A yet, but metadata operations are slow
and latency seems to be the issue. (I actually seem to get better performance by
disabling SMP, for example.)
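If anyone wants to put rough numbers on that, a trivial (and admittedly crude)
microbenchmark along these lines, run once against a local file and once against a
file on the FUSE/GlusterFS mount, gives a feel for the per-operation metadata latency:

    #include <sys/stat.h>
    #include <stdio.h>
    #include <stdlib.h>
    #include <time.h>

    int
    main(int argc, char **argv)
    {
            struct stat sb;
            struct timespec t0, t1;
            double us;
            int i, n = 10000;

            if (argc < 2) {
                    fprintf(stderr, "usage: %s path [count]\n", argv[0]);
                    return (1);
            }
            if (argc > 2)
                    n = atoi(argv[2]);

            clock_gettime(CLOCK_MONOTONIC, &t0);
            for (i = 0; i < n; i++) {
                    if (stat(argv[1], &sb) != 0) {
                            perror("stat");
                            return (1);
                    }
            }
            clock_gettime(CLOCK_MONOTONIC, &t1);

            us = (t1.tv_sec - t0.tv_sec) * 1e6 +
                (t1.tv_nsec - t0.tv_nsec) / 1e3;
            printf("%d stat() calls, %.1f us/call\n", n, us / n);
            return (0);
    }

(Attribute caching on the NFS/FUSE client will flatter the numbers, so this only gives
a lower bound, but the difference between the two runs is still instructive.)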
> I’d be interested in what Justin had in mind when he asked Matt about this.
> Being able to “attach ZFS pools to one another” in such a fashion that all
> clients just see One Big Pool and ZFS’s own redundancy / snapshotting
> characteristics magically apply to the überpool would be Pretty Cool,
> obviously, and would allow one to do round-robin DNS for NFS such that any
> node could serve the same contents, but that also sounds pretty ambitious,
> depending on how it’s implemented.
>
This would probably work with the extant nfsd and wouldn't have a use for pNFS.
I also agree that this sounds pretty ambitious.
rick
One of the Linux pNFS presentations notes that the first Linux distribution that shipped
with pNFS support was RHEL6.4, in 2013.
So, I have no idea if it will catch on, but I don't think it can be considered
end of life. (Many use NFSv3 and its RFC is dated June 1995.)
rick
> That is also why I am not sure I would totally embrace the idea of a
> central MDS being a Real Option. Sure, the risks can be mitigated (as you
> say, by mirroring it), but even saying the words “central MDS” (or central
> anything) may be such a turn-off to those very same DevOps folks, folks who
> have been burned so many times by SPOFs and scaling bottlenecks in large
> environments, that we'll lose the audience the minute they hear the trigger
> phrase. Even if it means signing up for Other Problems later, it’s a lot
> easier to “sell” the concept of completely distributed mechanisms where, if
> there is any notion of centralization at all, it’s at least the result of a
> quorum election and the DevOps folks don’t have to do anything manually to
> cause it to happen - the cluster is “resilient" and "self-healing" and they
> are happy with being able to say those buzzwords to the CIO, who nods
> knowingly and tells them they’re doing a fine job!
>
My main reason for liking NFS is that it has decent client support in
upstream Linux. One reason I started working on pNFS was that at $work, our
existing cluster filesystem product, which uses a proprietary client
protocol, caused us to delay OS upgrades for months while we waited for
$vendor to port their client code to RHEL7. The NFS protocol is well
documented with several accessible reference implementations and pNFS gives
enough flexibility to support a distributed filesystem at an interesting
scale.
You mention a 'central MDS' as being an issue. I'm not going to go through
your list but at least HDFS also has this 'issue' and it doesn't seem to be
a problem for many users storing >100 PB across >10^5 servers. In practice,
the MDS would be replicated for redundancy - there are lots of approaches
for this, my preference being Paxos but Raft would work just as well.
Google's GFS also followed this model and was an extremely reliable large
scale filesystem.
I am building an MDS as a layer on top of a key/value database which offers
the possibility of moving the backing store to some kind of distributed
key/value store in future which would remove the scaling and reliability
concerns.
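For illustration, a much-simplified version of that kind of schema (not my actual
design, just the general shape that KV-backed metadata services tend to take) looks
something like this - lookup becomes a single get, readdir becomes a prefix scan, and
replicating the KV store with Paxos/Raft replicates the MDS along with it:

    #include <stdint.h>
    #include <stdio.h>

    /* Value stored under the inode key. */
    struct kv_attrs {
            uint64_t size;
            uint32_t mode, uid, gid;
            uint64_t mtime;
    };

    /* "i<ino>" -> struct kv_attrs (the inode itself). */
    static void
    inode_key(char *buf, size_t len, uint64_t ino)
    {
            snprintf(buf, len, "i%016jx", (uintmax_t)ino);
    }

    /* "d<parent>/<name>" -> child inode number (a directory entry). */
    static void
    dirent_key(char *buf, size_t len, uint64_t parent, const char *name)
    {
            snprintf(buf, len, "d%016jx/%s", (uintmax_t)parent, name);
    }

    int
    main(void)
    {
            char k[300];

            dirent_key(k, sizeof(k), 2, "etc");     /* lookup("/", "etc") */
            printf("get(%s) -> child inode number\n", k);
            inode_key(k, sizeof(k), 1234);          /* then fetch its attributes */
            printf("get(%s) -> struct kv_attrs\n", k);
            return (0);
    }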
I can agree with this - everything I'm working on is in userland. Given
that I'm not trying to export a local filesystem, most of the reasons for
wanting a kernel implementation disappear. Adding support for NFS over RDMA
removes all the network context switching, and frequently accessed data
would typically be served out of a userland cache, which removes the rest of
the context switches.
So, let me just set the record straight by saying that I’m all in favor of pNFS. It addresses a very definite need in the Enterprise marketplace and gives FreeBSD yet another arrow in its quiver when it comes to being “a player” in that (ever-growing) arena. The only point I was trying to make before was that if we could ALSO address clustering in a more general way as part of providing a pNFS solution, that would be great. I am not, however, the one writing the code and if my comments were in any way discouraging to the folks that are, I apologize and want to express my enthusiasm for it. If iXsystems engineering resources can contribute in any way to moving this ball forward, let me know and we’ll start doing so.
On the more general point of “NFS is hard, let’s go shopping” let me also say that it’s kind of important not to conflate end-user targeted solutions with enterprise solutions. Kerberized NFSv4, for example, is not really designed to be trivial to set up, and if anyone is waiting for that to happen, they may be waiting a very long time (like, forever). NFS and SMB are both fairly simple technologies to use if you restrict yourself to using, say, just 20% of their overall feature-sets. Once you add ACLs, Directory Services, user/group and permissions mappings, and any of the other more enterprise-centric features of these filesharing technologies, however, things rapidly get more complicated and the DevOps people who routinely play in these kinds of environments are quite happy to have all those options available because they’re not consumers operating in consumer environments.
Sun didn’t design NFS to be particularly consumer-centric, for that matter, and if you think SMB is “simple” because you clicked Network on Windows Explorer one day and stuff just automagically appeared, you should try operating it in a serious Windows Enterprise environment (just flip through some of the SMB bugs in the FreeNAS bug tracker - https://bugs.freenas.org/projects/freenas/issues?utf8=✓&set_filter=1&f%5B%5D=status_id&op%5Bstatus_id%5D=*&f%5B%5D=category_id&op%5Bcategory_id%5D=%3D&v%5Bcategory_id%5D%5B%5D=57&f%5B%5D=&c%5B%5D=tracker&c%5B%5D=status&c%5B%5D=priority&c%5B%5D=subject&c%5B%5D=assigned_to&c%5B%5D=updated_on&c%5B%5D=fixed_version&group_by= - if you want to see the kinds of problems users wrestle with all the time).
Anyway, I’ll get off the soapbox now, I just wanted to make the point that “simple file sharing” which is also “secure file sharing” and “flexible file sharing” doesn’t really exist. The simplest end-user oriented file sharing system I’ve used to date is probably AFP, and Apple has been trying to kill it for years, probably because it doesn’t have all those extra knobs and Kerberos / Directory Services integration business users have been asking for (it’s also not particularly industry standard).
- Jordan
VMware now has interest in pNFS.
Technology gets driven by business/enterprise. I talked to a CA at a
large electronics chain and asked why they are using Ceph, and he said
about 100 words, then said it was because Red Hat recommends it with OpenStack.
Intel is driving Lustre. Red Hat is driving Ceph. VMware is driving pNFS. I don't
see anyone driving Gluster.
Every once in a while you see products grow on their merit (watching
Proxmox and Zerto right now), but those usually get scooped up by a
bigger player.
On the point of setting up Kerberized NFS: AD has made Kerberos easy, and it
could be just as easy with NFS. Everything is easy once you know it.
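As a rough sketch (host names made up, and assuming the KDC/AD side, /etc/krb5.conf and
the nfs/host keytab already exist), the FreeBSD server-side pieces are roughly:

    # /etc/rc.conf
    rpcbind_enable="YES"
    nfs_server_enable="YES"
    nfsv4_server_enable="YES"
    nfsuserd_enable="YES"
    mountd_enable="YES"
    gssd_enable="YES"

    # /etc/exports
    V4: /data -sec=krb5:krb5i:krb5p
    /data -sec=krb5:krb5i:krb5p -network=192.168.1.0/24

    # and on a client:
    mount -t nfs -o nfsv4,sec=krb5 nfs.example.com:/ /mnt

Not quite point-and-click, but not rocket science either once the Kerberos side is in
place.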
lk
https://www.socallinuxexpo.org/scale/14x/presentations/scaling-glusterfs-facebook
Facebook is a user, but a large one.
Although GlusterFS seems to support OpenStack stuff, it seems to be layered on top of the
POSIX file system using something they call SwiftOnFile.
Thanks for the comments, rick
As for defending pNFS, all I was trying to say was that "although it is hard
to believe, it has taken 10 years for pNFS to hit the streets". As such, it
is anyone's guess whether or not it will become widely adopted.
If it came across as more than that, I am the one that should be apologizing
and am in no way discouraged by any of the comments.
> So, let me just set the record straight by saying that I’m all in favor of
> pNFS. It addresses a very definite need in the Enterprise marketplace and
> gives FreeBSD yet another arrow in its quiver when it comes to being “a
> player” in that (ever-growing) arena. The only point I was trying to make
> before was that if we could ALSO address clustering in a more general way as
> part of providing a pNFS solution, that would be great.
When I did a fairly superficial evaluation of the open source clustering systems
out there (looking at online doc and not actually their code), it seemed that
GlusterFS was the best bet for "one size fits all".
It had:
- a distributed file system (replication, etc.) with a POSIX/FUSE interface.
- SwiftOnFile, which puts the Swift/OpenStack object API on top of this.
- decentralized metadata handling.
For pNFS:
- an NFSv3 server built into it.
- a FreeBSD port.
The others had one or more of these drawbacks:
- object store only, with no POSIX file system support.
- a single centralized metadata store (MooseFS, for example).
- no FreeBSD port, and rumoured to be hard to port (Ceph and Lustre are two examples).
Now that I've worked with GlusterFS a little bit, I am skeptical that it can
deliver adequate performance for pNFS using the nfsd. I am still hoping I will
be proven wrong on this, but???
A GlusterFS/Ganesha-NFS user space solution may be feasible. This is what the
GlusterFS folk are planning. However, for FreeBSD...
- Ganesha-NFS apparently was ported to FreeBSD, but the port was removed from
their source tree and it is said it now uses Linux-specific thread primitives.
--> As such, I have no idea what effort would be involved in getting this ported and
    working well on FreeBSD.
- I would also wait until this is working in Linux and would want to do an
evaluation of that, to make sure it actually works/performs well, before
considering this.
*** For me personally, I am probably not interested in working on this. I
know the FreeBSD nfsd kernel code well and can easily work with that,
but Ganesha-NFS would be an entirely different beast.
Bottom line, at this point I am skeptical that a generic clustering system
will work for pNFS.
> I am not, however,
> the one writing the code and if my comments were in any way discouraging to
> the folks that are, I apologize and want to express my enthusiasm for it.
> If iXsystems engineering resources can contribute in any way to moving this
> ball forward, let me know and we’ll start doing so.
>
Well, although they may not be useful for building a pNFS server, some sort
of evaluation of the open source clustering systems might be useful.
Sooner or later, the Enterprise marketplace may want one or more of these and
it seems to me that having one of these layered on top of ZFS may be an attractive
solution.
- Some will never be ported to FreeBSD, but the ones that are could probably be
evaluated fairly easily, if you have the resources.
Since almost all the code I've written gets reused if I do a Plan B, I will
probably pursue that, leaving the GlusterFS interface bits in place in case
they are useful.
Thanks for all the interesting comments, rick
To rip just a bit of your text out of context:
On 18-6-2016 22:50, Jordan Hubbard wrote:
> Some, if not most, of them are also far
> better supported under Linux than FreeBSD (I don’t think we even have
> a working ceph port yet).
In the spare time I have left, I'm trying to get a lot of small fixes
into the Ceph tree to get it actually compiling, testing, and running on
FreeBSD. But Ceph is a lot of code, and since a lot of people are
working on it, the number of code changes is big, and just keeping up
with that is sometimes hard. More and more Linux-isms are dropped into
the code, so progress is slow - also because it is hard to get people
to look at the commits and get them in.
Current state is that I have everything compiling, and I can run 120 of
129 tests with success. I once had them all completing, but then a busload
of changes was dropped into the tree, and so I needed to start
"repairing" again.
I gave a small presentation of my work thus far at Ceph Day Cern in
Geneva. https://indico.cern.ch/event/542464/contributions/2202309/
Differences in the C++ code are not really that big; most of the
things to fix are additional tools that have to deal with infrastructure
that fully assumes it is running on a Linux distro.
Next to that, Ceph is going to its own disk store system, BlueStore, whereas
I hope(d) to base it on an underlying ZFS layer...
To run BlueStore, AIO is needed for disk devices, but the current AIO is
not call-for-call compatible and requires a glue layer. I have not
looked into the size of the semantic problems between Linux and FreeBSD
here.
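For reference, the primitive such a glue layer would have to build on is POSIX AIO as
FreeBSD provides it; a minimal asynchronous read looks like the sketch below
(illustrative only - Linux's io_setup()/io_submit()/io_getevents() batch model does not
map one-to-one onto this, which is exactly where the glue gets messy):

    #include <aio.h>
    #include <err.h>
    #include <fcntl.h>
    #include <stdio.h>
    #include <string.h>
    #include <unistd.h>

    int
    main(int argc, char **argv)
    {
            static char buf[4096];
            struct aiocb cb;
            const struct aiocb *list[1];
            ssize_t n;
            int fd;

            if (argc != 2)
                    errx(1, "usage: %s file", argv[0]);
            if ((fd = open(argv[1], O_RDONLY)) < 0)
                    err(1, "open");

            memset(&cb, 0, sizeof(cb));
            cb.aio_fildes = fd;
            cb.aio_buf = buf;
            cb.aio_nbytes = sizeof(buf);
            cb.aio_offset = 0;

            if (aio_read(&cb) != 0)                 /* queue the async read */
                    err(1, "aio_read");
            list[0] = &cb;
            if (aio_suspend(list, 1, NULL) != 0)    /* wait for completion */
                    err(1, "aio_suspend");
            if ((n = aio_return(&cb)) < 0)
                    err(1, "aio_return");
            printf("read %zd bytes asynchronously\n", n);
            close(fd);
            return (0);
    }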
On the other hand, they just declared CephFS (a POSIX filesystem running
on Ceph) stable and ready to be used.
--WjW
Hi Willem,
Yes, I read your paper on the porting effort!
I also took a look at porting ceph myself, about 2 years ago, and rapidly concluded that it wasn’t a small / trivial effort by any means and would require a strong justification in terms of ceph’s feature set over glusterfs / MooseFS / OpenAFS / RiakCS / etc. Since that time, there’s been customer interest but nothing truly “strong” per se. My attraction to ceph remains centered around at least these 4 things:
1. Distributed Object store with S3-compatible ReST API
2. Interoperates with Openstack via Swift compatibility
3. Block storage (RADOS) - possibly useful for iSCSI and other block storage requirements
4. Filesystem interface
Is there anything we can do to help? Do the Ceph folks seem receptive to actually having a “Tier 1” FreeBSD port? I know that stas@ did an early almost-port a while back, but it never reached fruition and my feeling was that they (ceph) might be a little gun-shy about seeing another port that might wind up in the same place, crufting up their code base to no purpose. Do you have any initial impressions about that? I’ve never talked to any of the 3 principal guys working on the project and this is pure guesswork on my part.
- Jordan
I've been going at it since last November... and all I got in are about 3
batches of FreeBSD-specific commits. Lots has to do with release windows
and code slush, like we know on FreeBSD. But then reviews still tend to be
slow and I need to push people to look at them. Whilst in the mean time
all kinds of things get pulled and inserted into the tree that are
seriously not FreeBSD-friendly. Sometimes I see them during commit, and
"negotiate" better compatibility with the author. At other times I miss the
whole thing, and I need to rebase to get rid of merge conflicts - only to
find out the hard way that somebody has made the whole peer communication
async and has thrown kqueue for the BSDs at it. But they don't work (yet).
So to get my other patches in, I first need to fix this. Takes a lot of
time.....
That all said, I was in Geneva and a lot of the Ceph people were there,
including Sage Weil. And I got the feeling they appreciated a larger
community. I think they see what ZFS has done with OpenZFS and see that
communities get somewhere.
Now one of the things to do to continue, now that I can sort of compile
and run the first test set, is to set up my own Jenkins setup of sorts, so
that I can at least test-drive some of the tree automagically and get
some test coverage of the code on FreeBSD. In my mind (and Sage warned me
that it will be more or less required) it is the only way to actually
get a serious foot in the door with the Ceph guys.
> My attraction to ceph remains centered around at least these
> 4 things:
>
> 1. Distributed Object store with S3-compatible ReST API
> 2. Interoperates with Openstack via Swift compatibility
> 3. Block storage (RADOS) - possibly useful for iSCSI and other block storage
> requirements
> 4. Filesystem interface
>
> Is there anything we can do to help?
I'll get back on that in a separate Email.
> Do the CEPH folks seem
> receptive to actually having a “Tier 1” FreeBSD port? I know that
> stas@ did an early almost-port awhile back, but it never reached
> fruition and my feeling was that they (ceph) might be a little
> gun-shy about seeing another port that might wind up in the same
> place, crufting up their code base to no purpose.
Well, as you know, I am from the era before there was automake....
So then porting was still very much an art. So I've been balancing
between crufting up the code and hiding things nicely and cleanly in C++
classes in the right places, and as a go-between stuff gets stuck in compat.h.
One of my slides was actually about the impact of foreign code in the
tree, and up till now that is relatively minimal, which seemed to please
a lot of the folks. But they also like the idea that getting FreeBSD
stuff in actually showed code weaknesses (and fixes) in the odd corners.
> Do you have any
> initial impressions about that? I’ve never talked to any of the 3
> principle guys working on the project and this is pure guesswork on
> my part.
I think they are going their own path, like writing their own datastore
so they can do things they require that POSIX can't deliver,
and as such they are also diverging from what is default on Linux.
The system architect in me also sees painful things happen because of
the "reinvention" of things. But then again, that happens with projects
this big. Things like checksums, compression, encryption, ....
Lots of stuff I've seen happen to ZFS over its time.
But so be it, everybody gets to choose their own axes to grind.
The community person to talk to is perhaps Patrick McGarry, but even
Sage would be good to talk to.
--WjW