
pNFS server Plan B


Rick Macklem

Jun 13, 2016, 7:44:09 PM
You may have already heard of Plan A, which sort of worked
and you could test by following the instructions here:

http://people.freebsd.org/~rmacklem/pnfs-setup.txt

However, it is very slow for metadata operations (everything other than
read/write) and I don't think it is very useful.

After my informal talk at BSDCan, here are some thoughts I have:
- I think the slowness is related to latency w.r.t. all the messages
being passed between the nfsd, GlusterFS via Fuse and between the
GlusterFS daemons. As such, I don't think faster hardware is likely
to help a lot w.r.t. performance.
- I have considered switching to MooseFS, but I would still be using Fuse.
*** MooseFS uses a centralized metadata store, which would imply only
a single Metadata Server (MDS) could be supported, I think?
(More on this later...)
- dfr@ suggested that avoiding Fuse and doing everything in userspace
might help.
- I thought of porting the nfsd to userland, but that would be quite a
bit of work, since it uses the kernel VFS/VOP interface, etc.

All of the above has led me to Plan B.
It would be limited to a single MDS, but as you'll see
I'm not sure that is as large a limitation as I thought it would be.
(If you aren't interested in details of this Plan B design, please
skip to "Single Metadata server..." for the issues.)

Plan B:
- Do it all in kernel using a slightly modified nfsd. (FreeBSD nfsd would
be used for both the MDS and Data Server (DS).)
- One FreeBSD server running nfsd would be the MDS. It would
build a file system tree that looks exactly like it would without pNFS,
except that the files would be empty. (size == 0)
--> As such, all the current nfsd code would do metadata operations on
this file system exactly like the nfsd does now.
- When a new file is created (an Open operation on NFSv4.1), the file would
be created exactly like it is now for the MDS.
- Then DS(s) would be selected and the MDS would do
a Create of a data storage file on these DS(s).
(This algorithm could become interesting later, but initially it would
probably just pick one DS at random or similar.)
- These file(s) would be in a single directory on the DS(s) and would have
a file name which is simply the File Handle for this file on the
MDS (an FH is 28 bytes -> 48 bytes of hex in ASCII).
- Extended attributes would be added to the Metadata file for:
- The data file's actual size.
- The DS(s) the data file is on.
- The File Handle for these data files on the DS(s).
This would add some overhead to the Open/create, which would be one
Create RPC for each DS the data file is on.
*** Initially there would only be one file on one DS. Mirroring for
redundancy can be added later.
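
As a rough illustration of the naming and extended-attribute steps above, here is a
minimal userland-style sketch in C. It is not the actual nfsd code (the real thing
would live in the kernel and use VOP_SETEXTATTR()), and the "pnfsd.*" attribute
names and helper names are made up for the example:

/*
 * Hypothetical sketch only: hex-encode the MDS file handle to get the data
 * file's name on the DS, and record the DS location and real size on the
 * (empty) metadata file using extended attributes.
 */
#include <sys/param.h>
#include <sys/mount.h>		/* fhandle_t */
#include <sys/extattr.h>
#include <stdint.h>
#include <stdio.h>
#include <string.h>

/* The binary FH becomes a hex string, used as the file name on the DS. */
static void
fh_to_dsname(const fhandle_t *fh, char *name, size_t namelen)
{
	const unsigned char *p = (const unsigned char *)fh;
	size_t i;

	for (i = 0; i < sizeof(*fh) && 2 * i + 2 < namelen; i++)
		snprintf(name + 2 * i, 3, "%02x", (unsigned int)p[i]);
}

/* Remember which DS holds the data, its FH there, and the actual size. */
static int
record_ds_info(const char *mdspath, const char *dshost,
    const fhandle_t *dsfh, uint64_t size)
{
	if (extattr_set_file(mdspath, EXTATTR_NAMESPACE_SYSTEM,
	    "pnfsd.dshost", dshost, strlen(dshost)) < 0 ||
	    extattr_set_file(mdspath, EXTATTR_NAMESPACE_SYSTEM,
	    "pnfsd.dsfh", dsfh, sizeof(*dsfh)) < 0 ||
	    extattr_set_file(mdspath, EXTATTR_NAMESPACE_SYSTEM,
	    "pnfsd.size", &size, sizeof(size)) < 0)
		return (-1);
	return (0);
}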

Now, the layout would be generated from these extended attributes for any
NFSv4.1 client that asks for it.

If I/O operations (read/write/setattr_of_size) are performed on the Metadata
server, it would act as a proxy and do them on the DS using the extended
attribute information (doing an RPC on the DS for the client).

When the file is removed on the Metadata server (link cnt --> 0), the
Metadata server would do Remove RPC(s) on the DS(s) for the data file(s).
(This requires the file name, which is just the Metadata FH in ASCII.)

The only addition that the nfsd for the DS(s) would need would be a callback
to the MDS done whenever a client (not the MDS) does
a write to the file, notifying the Metadata server the file has been
modified and is now Size=K, so the Metadata server can keep the attributes
up to date for the file. (It can identify the file by the MDS FH.)
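
Just to make that callback concrete, a hypothetical argument structure for the
DS-to-MDS notification might look like the following (the names are invented for
illustration, not an existing RPC definition):

#include <sys/param.h>
#include <sys/mount.h>		/* fhandle_t */
#include <stdint.h>
#include <time.h>		/* struct timespec */

/*
 * Hypothetical arguments for a DS->MDS "file was written" callback.  The DS
 * identifies the file by the MDS file handle (its local file name is just
 * that FH in hex) and reports the size after the write.
 */
struct pnfsd_sizecb_args {
	fhandle_t	scb_mdsfh;	/* file handle on the MDS */
	uint64_t	scb_size;	/* file size after the write */
	struct timespec	scb_mtime;	/* new modify time (optional) */
};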

All of this is a relatively small amount of change to the FreeBSD nfsd,
so it shouldn't be that much work (I'm a lazy guy looking for a minimal
solution;-).

Single Metadata server...
The big limitation to all of the above is the "single MDS" limitation.
I had thought this would be a serious limitation to the design scaling
up to large stores.
However, I'm not so sure it is a big limitation??
1 - Since the files on the MDS are all empty, the file system is only
i-nodes, directories and extended attribute blocks.
As such, I hope it can be put on fast storage.
*** I don't know anything about current and near term future SSD technologies.
Hopefully others can suggest how large/fast a store for the MDS could
be built easily?
--> I am hoping that it will be possible to build an MDS that can handle
a lot of DS/storage this way?
(If anyone has access to hardware and something like SpecNFS, they could
test an RPC load with almost no Read/Write RPCs and this would probably
show about what the metadata RPC limits are for one of these.)

2 - Although it isn't quite having multiple MDSs, the directory tree could
be split up with an MDS for each subtree. This would allow some scaling
beyond one MDS.
(Although not implemented in FreeBSD's NFSv4.1 yet, Referrals are basically
an NFS server driven "automount" that redirects the NFSv4.1 client to
a different server for a subtree. This might be a useful tool for
splitting off subtrees to different MDSs?)

If you actually read this far, any comments on this would be welcome.
In particular, if you have an opinion w.r.t. this single MDS limitation
and/or how big an MDS could be built, that would be appreciated.

Thanks for any comments, rick

Doug Rabson

Jun 14, 2016, 4:47:45 AM
As I mentioned to Rick, I have been working on similar lines to put
together a pNFS implementation. Comments embedded below.

On 13 June 2016 at 23:28, Rick Macklem <rmac...@uoguelph.ca> wrote:

> You may have already heard of Plan A, which sort of worked
> and you could test by following the instructions here:
>
> http://people.freebsd.org/~rmacklem/pnfs-setup.txt
>
> However, it is very slow for metadata operations (everything other than
> read/write) and I don't think it is very useful.
>
> After my informal talk at BSDCan, here are some thoughts I have:
> - I think the slowness is related to latency w.r.t. all the messages
> being passed between the nfsd, GlusterFS via Fuse and between the
> GlusterFS daemons. As such, I don't think faster hardware is likely
> to help a lot w.r.t. performance.
> - I have considered switching to MooseFS, but I would still be using Fuse.
> *** MooseFS uses a centralized metadata store, which would imply only
> a single Metadata Server (MDS) could be supported, I think?
> (More on this later...)
> - dfr@ suggested that avoiding Fuse and doing everything in userspace
> might help.
> - I thought of porting the nfsd to userland, but that would be quite a
> bit of work, since it uses the kernel VFS/VOP interface, etc.
>

I ended up writing everything from scratch as userland code rather than
consider porting the kernel code. It was quite a bit of work :)


>
> All of the above has led me to Plan B.
> It would be limited to a single MDS, but as you'll see
> I'm not sure that is as large a limitation as I thought it would be.
> (If you aren't interested in details of this Plan B design, please
> skip to "Single Metadata server..." for the issues.)
>
> Plan B:
> - Do it all in kernel using a slightly modified nfsd. (FreeBSD nfsd would
> be used for both the MDS and Data Server (DS).)
> - One FreeBSD server running nfsd would be the MDS. It would
> build a file system tree that looks exactly like it would without pNFS,
> except that the files would be empty. (size == 0)
> --> As such, all the current nfsd code would do metadata operations on
> this file system exactly like the nfsd does now.
> - When a new file is created (an Open operation on NFSv4.1), the file would
> be created exactly like it is now for the MDS.
> - Then DS(s) would be selected and the MDS would do
> a Create of a data storage file on these DS(s).
> (This algorithm could become interesting later, but initially it would
> probably just pick one DS at random or similar.)
> - These file(s) would be in a single directory on the DS(s) and would
> have
> a file name which is simply the File Handle for this file on the
> MDS (an FH is 28bytes->48bytes of Hex in ASCII).
>

I have something similar but using a directory hierarchy to try to avoid
any one directory being excessively large.


> - Extended attributes would be added to the Metadata file for:
> - The data file's actual size.
> - The DS(s) the data file is on.
> - The File Handle for these data files on the DS(s).
> This would add some overhead to the Open/create, which would be one
> Create RPC for each DS the data file is on.
>

An alternative here would be to store the extra metadata in the file itself
rather than use extended attributes.


> *** Initially there would only be one file on one DS. Mirroring for
> redundancy can be added later.
>

The scale of filesystem I want to build more or less requires the extra
redundancy of mirroring so I added this at the start. It does add quite a
bit of complexity to the MDS to keep track of which DS should have which
piece of data and to handle DS failures properly, re-silvering data etc.


>
> Now, the layout would be generated from these extended attributes for any
> NFSv4.1 client that asks for it.
>
> If I/O operations (read/write/setattr_of_size) are performed on the
> Metadata
> server, it would act as a proxy and do them on the DS using the extended
> attribute information (doing an RPC on the DS for the client).
>
> When the file is removed on the Metadata server (link cnt --> 0), the
> Metadata server would do Remove RPC(s) on the DS(s) for the data file(s).
> (This requires the file name, which is just the Metadata FH in ASCII.)
>

Currently I have a non-nfs control protocol for this but strictly speaking
it isn't necessary as you note.


>
> The only addition that the nfsd for the DS(s) would need would be a
> callback
> to the MDS done whenever a client (not the MDS) does
> a write to the file, notifying the Metadata server the file has been
> modified and is now Size=K, so the Metadata server can keep the attributes
> up to date for the file. (It can identify the file by the MDS FH.)
>

I don't think you need this - the client should perform LAYOUTCOMMIT rpcs
which will inform the MDS of the last write position and last modify time.
This can be used to update the file metadata. The Linux client does this
before the CLOSE rpc on the client as far as I can tell.


>
> All of this is a relatively small amount of change to the FreeBSD nfsd,
> so it shouldn't be that much work (I'm a lazy guy looking for a minimal
> solution;-).
>
> Single Metadata server...
> The big limitation to all of the above is the "single MDS" limitation.
> I had thought this would be a serious limitation to the design scaling
> up to large stores.
> However, I'm not so sure it is a big limitation??
> 1 - Since the files on the MDS are all empty, the file system is only
> i-nodes, directories and extended attribute blocks.
> As such, I hope it can be put on fast storage.
> *** I don't know anything about current and near term future SSD
> technologies.
> Hopefully others can suggest how large/fast a store for the MDS could
> be built easily?
> --> I am hoping that it will be possible to build an MDS that can
> handle
> a lot of DS/storage this way?
> (If anyone has access to hardware and something like SpecNFS, they
> could
> test an RPC load with almost no Read/Write RPCs and this would
> probably
> show about what the metadata RPC limits are for one of these.)
>

I think a single MDS can scale up to petabytes of storage easily. It
remains to be seen how far it can scale for TPS. I will note that Google's
GFS filesystem (you can find a paper describing it at
http://static.googleusercontent.com/media/research.google.com/en//archive/gfs-sosp2003.pdf)
uses effectively a single MDS, replicated for redundancy but still serving
just from one master MDS at a time. That filesystem scaled pretty well for
both data size and transactions so I think the approach is viable.



>
> 2 - Although it isn't quite having multiple MDSs, the directory tree could
> be split up with an MDS for each subtree. This would allow some scaling
> beyond one MDS.
> (Although not implemented in FreeBSD's NFSv4.1 yet, Referrals are
> basically
> an NFS server driven "automount" that redirects the NFSv4.1 client to
> a different server for a subtree. This might be a useful tool for
> splitting off subtrees to different MDSs?)
>
> If you actually read this far, any comments on this would be welcome.
> In particular, if you have an opinion w.r.t. this single MDS limitation
> and/or how big an MDS could be built, that would be appreciated.
>
> Thanks for any comments, rick
>

My back-of-envelope calculation assumed a 10 PB filesystem containing
mostly large files which would be striped in 10 MB pieces. Guessing that we
need 200 bytes of metadata per piece, that gives around 200 GB of metadata
which is very reasonable. Even for file sets containing much smaller files,
a single server should have no trouble storing the metadata.
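
(Spelling that out: 10 PB / 10 MB per piece = 10^9 pieces, and 10^9 pieces x
200 bytes = 200 GB of metadata.)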

Rick Macklem

Jun 14, 2016, 6:36:05 PM
I thought of that, but since no one will be doing an "ls" of it, I wasn't going to
bother doing multiple dirs initially. However, now that I think of it, the Create
and Remove RPCs will end up doing VOP_LOOKUP()s, so breaking these up into multiple
directories sounds like a good idea. (I may just hash the FH and let the hash choose
a directory.)
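
A minimal sketch of that idea, purely as an illustration (the hash function and
bucket count here are arbitrary assumptions, not what the nfsd would necessarily
use):

#include <sys/types.h>
#include <stdio.h>

#define	DS_NDIRS	256	/* assumed number of bucket directories */

/* FNV-1a over the raw MDS file handle bytes; any cheap hash would do. */
static unsigned int
fh_hash(const unsigned char *fh, size_t fhlen)
{
	unsigned int h = 2166136261u;
	size_t i;

	for (i = 0; i < fhlen; i++)
		h = (h ^ fh[i]) * 16777619u;
	return (h);
}

/* Build "<bucket>/<hex FH>" relative to the DS data directory. */
static void
ds_object_path(const unsigned char *fh, size_t fhlen, const char *hexname,
    char *path, size_t pathlen)
{
	snprintf(path, pathlen, "%02x/%s", fh_hash(fh, fhlen) % DS_NDIRS,
	    hexname);
}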

Good suggestion, thanks.

>
> > - Extended attributes would be added to the Metadata file for:
> > - The data file's actual size.
> > - The DS(s) the data file is on.
> > - The File Handle for these data files on the DS(s).
> > This would add some overhead to the Open/create, which would be one
> > Create RPC for each DS the data file is on.
> >
>
> An alternative here would be to store the extra metadata in the file itself
> rather than use extended attributes.
>
Yep. I'm not sure if there is any performance advantage to storing it as file data vs. extended attributes.
When I developed the NFSv4.1_Files layout client, I had three servers to test
against.
- The Netapp filer just returned EOPNOTSUPP for LayoutCommit.
- The Linux test server (had MDS and DS on the same Linux system) accepted the
LayoutCommit, but didn't do anything for it, so doing it had no effect.
- The only pNFS server I've ever tested against that needed LayoutCommit was
Oracle/Solaris and the Oracle folk never explained why their server required
it or what would break if you didn't do it. (I don't recall attributes being
messed up when I didn't do it correctly.)
As such, I've never been sure what it is used for.

I need to read the LayoutCommit stuff in the RFC and Flex Files draft again.
It would be nice if the DS->MDS calls could be avoided for every write.
Doing one when the DS receives a Commit RPC wouldn't be too bad.
Thanks for all the good comments, rick
ps: Good luck with your pNFS server. Maybe someday it will be available for FreeBSD?

Jordan Hubbard

Jun 18, 2016, 4:50:53 PM

> On Jun 13, 2016, at 3:28 PM, Rick Macklem <rmac...@uoguelph.ca> wrote:
>
> You may have already heard of Plan A, which sort of worked
> and you could test by following the instructions here:
>
> http://people.freebsd.org/~rmacklem/pnfs-setup.txt
>
> However, it is very slow for metadata operations (everything other than
> read/write) and I don't think it is very useful.

Hi guys,

I finally got a chance to catch up and bring up Rick’s pNFS setup on a couple of test machines. He’s right, obviously - The “plan A” approach is a bit convoluted and not at all surprisingly slow. With all of those transits twixt kernel and userland, not to mention glusterfs itself which has not really been tuned for our platform (there are a number of papers on this we probably haven’t even all read yet), we’re obviously still in the “first make it work” stage.

That said, I think there are probably more possible plans than just A and B here, and we should give the broader topic of “what does FreeBSD want to do in the Enterprise / Cloud computing space?" at least some consideration at the same time, since there are more than a few goals running in parallel here.

First, let’s talk about our story around clustered filesystems + associated command-and-control APIs in FreeBSD. There is something of an embarrassment of riches in the industry at the moment - glusterfs, ceph, Hadoop HDFS, RiakCS, moose, etc. All or most of them offer different pros and cons, and all offer more than just the ability to store files and scale “elastically”. They also have ReST APIs for configuring and monitoring the health of the cluster, some offer object as well as file storage, and Riak offers a distributed KVS for storing information *about* file objects in addition to the object themselves (and when your application involves storing and managing several million photos, for example, the idea of distributing the index as well as the files in a fault-tolerant fashion is also compelling). Some, if not most, of them are also far better supported under Linux than FreeBSD (I don’t think we even have a working ceph port yet). I’m not saying we need to blindly follow the herds and do all the same things others are doing here, either, I’m just saying that it’s a much bigger problem space than simply “parallelizing NFS” and if we can kill multiple birds with one stone on the way to doing that, we should certainly consider doing so.

Why? Because pNFS was first introduced as a draft RFC (RFC5661 <https://datatracker.ietf.org/doc/rfc5661/>) in 2005. The linux folks have been working on it <http://events.linuxfoundation.org/sites/events/files/slides/pnfs.pdf> since 2006. Ten years is a long time in this business, and when I raised the topic of pNFS at the recent SNIA DSI conference (where storage developers gather to talk about trends and things), the most prevalent reaction I got was “people are still using pNFS?!” This is clearly one of those technologies that may still have some runway left, but it’s been rapidly overtaken by other approaches to solving more or less the same problems in coherent, distributed filesystem access and if we want to get mindshare for this, we should at least have an answer ready for the “why did you guys do pNFS that way rather than just shimming it on top of ${someNewerHotness}??” argument. I’m not suggesting pNFS is dead - hell, even AFS <https://www.openafs.org/> still appears to be somewhat alive, but there’s a difference between appealing to an increasingly narrow niche and trying to solve the sorts of problems most DevOps folks working At Scale these days are running into.

That is also why I am not sure I would totally embrace the idea of a central MDS being a Real Option. Sure, the risks can be mitigated (as you say, by mirroring it), but even saying the words “central MDS” (or central anything) may be such a turn-off to those very same DevOps folks, folks who have been burned so many times by SPOFs and scaling bottlenecks in large environments, that we'll lose the audience the minute they hear the trigger phrase. Even if it means signing up for Other Problems later, it’s a lot easier to “sell” the concept of completely distributed mechanisms where, if there is any notion of centralization at all, it’s at least the result of a quorum election and the DevOps folks don’t have to do anything manually to cause it to happen - the cluster is “resilient" and "self-healing" and they are happy with being able to say those buzzwords to the CIO, who nods knowingly and tells them they’re doing a fine job!

Let’s get back, however, to the notion of downing multiple avians with the same semi-spherical kinetic projectile: What seems to be The Rage at the moment, and I don’t know how well it actually scales since I’ve yet to be at the pointy end of such a real-world deployment, is the idea of clustering the storage (“somehow”) underneath and then providing NFS and SMB protocol access entirely in userland, usually with both of those services cooperating with the same lock manager and even the same ACL translation layer. Our buddies at Red Hat do this with glusterfs at the bottom and NFS Ganesha + Samba on top - I talked to one of the Samba core team guys at SNIA and he indicated that this was increasingly common, with the team having helped here and there when approached by different vendors with the same idea. We (iXsystems) also get a lot of requests to be able to make the same file(s) available via both NFS and SMB at the same time and they don’t much at all like being told “but that’s dangerous - don’t do that! Your file contents and permissions models are not guaranteed to survive such an experience!” They really want to do it, because the rest of the world lives in Heterogenous environments and that’s just the way it is.

Even the object storage folks, like Openstack’s Swift project, are spending significant amounts of mental energy on the topic of how to re-export their object stores as shared filesystems over NFS and SMB, the single consistent and distributed object store being, of course, Their Thing. They wish, of course, that the rest of the world would just fall into line and use their object system for everything, but they also get that the "legacy stuff” just won’t go away and needs some sort of attention if they’re to remain players at the standards table.

So anyway, that’s the view I have from the perspective of someone who actually sells storage solutions for a living, and while I could certainly “sell some pNFS” to various customers who just want to add a dash of steroids to their current NFS infrastructure, or need to use NFS but also need to store far more data into a single namespace than any one box will accommodate, I also know that offering even more elastic solutions will be a necessary part of offering solutions to the growing contingent of folks who are not tied to any existing storage infrastructure and have various non-greybearded folks shouting in their ears about object this and cloud that. Might there not be some compromise solution which allows us to put more of this in userland with less context switches in and out of the kernel, also giving us the option of presenting a more united front to multiple protocols that require more ACL and lock impedance-matching than we’d ever want to put in the kernel anyway?

- Jordan

Rick Macklem

Jun 18, 2016, 7:06:10 PM
Jordan Hubbard wrote:
>
> > On Jun 13, 2016, at 3:28 PM, Rick Macklem <rmac...@uoguelph.ca> wrote:
> >
> > You may have already heard of Plan A, which sort of worked
> > and you could test by following the instructions here:
> >
> > http://people.freebsd.org/~rmacklem/pnfs-setup.txt
> >
> > However, it is very slow for metadata operations (everything other than
> > read/write) and I don't think it is very useful.
>
I am going to respond to a few of the comments, but I hope that people who
actually run server farms and might be users of a fairly large/inexpensive
storage cluster will comment.

Put another way, I'd really like to hear a "user" perspective.

Actually, I would have worded this as "will anyone ever use pNFS?".

Although 10 years is a long time in this business, it doesn't seem to be long
at all in the standards world where the NFSv4 protocols are being developed.
- You note that the Linux folk started development in 2006.
I will note that RFC5661 (the RFC that describes pNFS) is dated 2010.
I will also note that I believe the first vendor shipment of a server that supported
pNFS happened sometime after the RFC was published.
- I could be wrong, but I'd guess that Netapp's clustered Filers were the
first to ship, about 4 years ago.

To date, very few vendors have actually shipped working pNFS servers
as far as I am aware. Other than Netapp, the only ones I know of that have shipped
are the large EMC servers (not Isilon).
I am not sure if Oracle/Solaris has ever shipped a pNFS server to customers yet.
Same goes for Panasas. I am not aware of a Linux based pNFS server usable in
a production environment, although Ganesha-NFS might be shipping with pNFS support now.
- If others are aware of other pNFS servers that are shipping to customers,
please correct me. (I haven't been to a NFSv4.1 testing event for 3 years,
so my info is definitely dated.)

Note that the "Flex Files" layout I used for the Plan A experiment is only an
Internet draft at this time and hasn't even made it to the RFC stage.

--> As such, I think it is very much an open question w.r.t. whether or not
this protocol will become widely used or yet another forgotten standard?
I also suspect that some storage vendors that have invested considerable
resources in NFSv4.1/pNFS development might ask the same question in-house;-)

> This is clearly one of those
> technologies that may still have some runway left, but it’s been rapidly
> overtaken by other approaches to solving more or less the same problems in
> coherent, distributed filesystem access and if we want to get mindshare for
> this, we should at least have an answer ready for the “why did you guys do
> pNFS that way rather than just shimming it on top of ${someNewerHotness}??”
> argument. I’m not suggesting pNFS is dead - hell, even AFS
> <https://www.openafs.org/> still appears to be somewhat alive, but there’s a
> difference between appealing to an increasingly narrow niche and trying to
> solve the sorts of problems most DevOps folks working At Scale these days
> are running into.
>
> That is also why I am not sure I would totally embrace the idea of a central
> MDS being a Real Option. Sure, the risks can be mitigated (as you say, by
> mirroring it), but even saying the words “central MDS” (or central anything)
> may be such a turn-off to those very same DevOps folks, folks who have been
> burned so many times by SPOFs and scaling bottlenecks in large environments,
> that we'll lose the audience the minute they hear the trigger phrase. Even
> if it means signing up for Other Problems later, it’s a lot easier to “sell”
> the concept of completely distributed mechanisms where, if there is any
> notion of centralization at all, it’s at least the result of a quorum
> election and the DevOps folks don’t have to do anything manually to cause it
> to happen - the cluster is “resilient" and "self-healing" and they are happy
> with being able to say those buzzwords to the CIO, who nods knowingly and
> tells them they’re doing a fine job!
>

I'll admit that I'm a bits and bytes guy. I have a hunch how difficult it is
to get "resilient" and "self-healing" to really work. I also know it is way
beyond what I am capable of.

> Let’s get back, however, to the notion of downing multiple avians with the
> same semi-spherical kinetic projectile: What seems to be The Rage at the
> moment, and I don’t know how well it actually scales since I’ve yet to be at
> the pointy end of such a real-world deployment, is the idea of clustering
> the storage (“somehow”) underneath and then providing NFS and SMB protocol
> access entirely in userland, usually with both of those services cooperating
> with the same lock manager and even the same ACL translation layer. Our
> buddies at Red Hat do this with glusterfs at the bottom and NFS Ganesha +
> Samba on top - I talked to one of the Samba core team guys at SNIA and he
> indicated that this was increasingly common, with the team having helped
> here and there when approached by different vendors with the same idea. We
> (iXsystems) also get a lot of requests to be able to make the same file(s)
> available via both NFS and SMB at the same time and they don’t much at all
> like being told “but that’s dangerous - don’t do that! Your file contents
> and permissions models are not guaranteed to survive such an experience!”
> They really want to do it, because the rest of the world lives in
> Heterogenous environments and that’s just the way it is.
>

If you want to make SMB and NFS work together on the same underlying file systems,
I suspect it is doable, although messy. To do this with the current FreeBSD nfsd,
it would require someone with Samba/Windows knowledge pointing out what Samba
needs to interact with NFSv4, and those hooks could probably be implemented.
(I know nothing about Samba/Windows, so I'd need someone else doing that side
of it.)

I actually mentioned Ganesha-NFS at the little talk/discussion I gave.
At this time, they have ripped a FreeBSD port out of their sources and they
use Linux-specific thread primitives.
--> It would probably be significant work to get Ganesha-NFS up to speed on
FreeBSD. Maybe a good project, but it needs some person/group dedicating
resources to get it to happen.

> Even the object storage folks, like Openstack’s Swift project, are spending
> significant amounts of mental energy on the topic of how to re-export their
> object stores as shared filesystems over NFS and SMB, the single consistent
> and distributed object store being, of course, Their Thing. They wish, of
> course, that the rest of the world would just fall into line and use their
> object system for everything, but they also get that the "legacy stuff” just
> won’t go away and needs some sort of attention if they’re to remain players
> at the standards table.
>
> So anyway, that’s the view I have from the perspective of someone who
> actually sells storage solutions for a living, and while I could certainly
> “sell some pNFS” to various customers who just want to add a dash of
> steroids to their current NFS infrastructure, or need to use NFS but also
> need to store far more data into a single namespace than any one box will
> accommodate, I also know that offering even more elastic solutions will be a
> necessary part of offering solutions to the growing contingent of folks who
> are not tied to any existing storage infrastructure and have various
> non-greybearded folks shouting in their ears about object this and cloud
> that. Might there not be some compromise solution which allows us to put
> more of this in userland with less context switches in and out of the
> kernel, also giving us the option of presenting a more united front to
> multiple protocols that require more ACL and lock impedance-matching than
> we’d ever want to put in the kernel anyway?
>

For SMB + NFS in userland, the combination of Samba and Ganesha is probably
your main open source choice, from what I am aware of.

I am one guy who does this as a spare time retirement hobby. As such, doing
something like a Ganesha port etc is probably beyond what I am interested in.
When saying this, I don't want to imply that it isn't a good approach.

You sent me the URL for an abstract for a paper discussing how Facebook is
using GlusterFS. It would be nice to get more details w.r.t. how they use it,
such as:
- How do their client servers access it? (NFS, Fuse, or ???)
- Whether or not they've tried the Ganesha-NFS stuff that GlusterFS is
transitioning to?
Put another way, they might have some insight into whether NFS in userland
via Ganesha works well or not?

Hopefully some "users" for this stuff will respond, rick
ps: Maybe this could be reposted in a place they are likely to read it.

Chris Watson

Jun 18, 2016, 9:14:51 PM
Since Jordan brought up clustering, I would be interested to hear Justin Gibbs's thoughts here. I know about a year ago he was asked on an "after hours" video chat hosted by Matt Ahrens about a feature he would really like to see and he mentioned he would really like, in a universe filled with time and money I'm sure, to work on a native clustering solution for FreeBSD. I don't know if he is subscribed to the list, and I'm certainly not throwing him under the bus by bringing his name up, but I know he has at least been thinking about this for some time and probably has some value to add here.

Chris

Sent from my iPhone 5

> On Jun 18, 2016, at 3:50 PM, Jordan Hubbard <j...@ixsystems.com> wrote:
>
>
>> On Jun 13, 2016, at 3:28 PM, Rick Macklem <rmac...@uoguelph.ca> wrote:
>>
>> You may have already heard of Plan A, which sort of worked
>> and you could test by following the instructions here:
>>
>> http://people.freebsd.org/~rmacklem/pnfs-setup.txt
>>
>> However, it is very slow for metadata operations (everything other than
>> read/write) and I don't think it is very useful.
>

> Hi guys,
>
> I finally got a chance to catch up and bring up Rick’s pNFS setup on a couple of test machines. He’s right, obviously - The “plan A” approach is a bit convoluted and not at all surprisingly slow. With all of those transits twixt kernel and userland, not to mention glusterfs itself which has not really been tuned for our platform (there are a number of papers on this we probably haven’t even all read yet), we’re obviously still in the “first make it work” stage.
>
> That said, I think there are probably more possible plans than just A and B here, and we should give the broader topic of “what does FreeBSD want to do in the Enterprise / Cloud computing space?" at least some consideration at the same time, since there are more than a few goals running in parallel here.
>
> First, let’s talk about our story around clustered filesystems + associated command-and-control APIs in FreeBSD. There is something of an embarrassment of riches in the industry at the moment - glusterfs, ceph, Hadoop HDFS, RiakCS, moose, etc. All or most of them offer different pros and cons, and all offer more than just the ability to store files and scale “elastically”. They also have ReST APIs for configuring and monitoring the health of the cluster, some offer object as well as file storage, and Riak offers a distributed KVS for storing information *about* file objects in addition to the object themselves (and when your application involves storing and managing several million photos, for example, the idea of distributing the index as well as the files in a fault-tolerant fashion is also compelling). Some, if not most, of them are also far better supported under Linux than FreeBSD (I don’t think we even have a working ceph port yet). I’m not saying we need to blindly follow the herds and do all the same things others are doing here, either, I’m just saying that it’s a much bigger problem space than simply “parallelizing NFS” and if we can kill multiple birds with one stone on the way to doing that, we should certainly consider doing so.
>

> Why? Because pNFS was first introduced as a draft RFC (RFC5661 <https://datatracker.ietf.org/doc/rfc5661/>) in 2005. The linux folks have been working on it <http://events.linuxfoundation.org/sites/events/files/slides/pnfs.pdf> since 2006. Ten years is a long time in this business, and when I raised the topic of pNFS at the recent SNIA DSI conference (where storage developers gather to talk about trends and things), the most prevalent reaction I got was “people are still using pNFS?!” This is clearly one of those technologies that may still have some runway left, but it’s been rapidly overtaken by other approaches to solving more or less the same problems in coherent, distributed filesystem access and if we want to get mindshare for this, we should at least have an answer ready for the “why did you guys do pNFS that way rather than just shimming it on top of ${someNewerHotness}??” argument. I’m not suggesting pNFS is dead - hell, even AFS <https://www.openafs.org/> still appears to be somewhat alive, but there’s a difference between appealing to an increasingly narrow niche and trying to solve the sorts of problems most DevOps folks working At Scale these days are running into.


>
> That is also why I am not sure I would totally embrace the idea of a central MDS being a Real Option. Sure, the risks can be mitigated (as you say, by mirroring it), but even saying the words “central MDS” (or central anything) may be such a turn-off to those very same DevOps folks, folks who have been burned so many times by SPOFs and scaling bottlenecks in large environments, that we'll lose the audience the minute they hear the trigger phrase. Even if it means signing up for Other Problems later, it’s a lot easier to “sell” the concept of completely distributed mechanisms where, if there is any notion of centralization at all, it’s at least the result of a quorum election and the DevOps folks don’t have to do anything manually to cause it to happen - the cluster is “resilient" and "self-healing" and they are happy with being able to say those buzzwords to the CIO, who nods knowingly and tells them they’re doing a fine job!
>

> Let’s get back, however, to the notion of downing multiple avians with the same semi-spherical kinetic projectile: What seems to be The Rage at the moment, and I don’t know how well it actually scales since I’ve yet to be at the pointy end of such a real-world deployment, is the idea of clustering the storage (“somehow”) underneath and then providing NFS and SMB protocol access entirely in userland, usually with both of those services cooperating with the same lock manager and even the same ACL translation layer. Our buddies at Red Hat do this with glusterfs at the bottom and NFS Ganesha + Samba on top - I talked to one of the Samba core team guys at SNIA and he indicated that this was increasingly common, with the team having helped here and there when approached by different vendors with the same idea. We (iXsystems) also get a lot of requests to be able to make the same file(s) available via both NFS and SMB at the same time and they don’t much at all like being told “but that’s dangerous - don’t do that! Your file contents and permissions models are not guaranteed to survive such an experience!” They really want to do it, because the rest of the world lives in Heterogenous environments and that’s just the way it is.
>

> Even the object storage folks, like Openstack’s Swift project, are spending significant amounts of mental energy on the topic of how to re-export their object stores as shared filesystems over NFS and SMB, the single consistent and distributed object store being, of course, Their Thing. They wish, of course, that the rest of the world would just fall into line and use their object system for everything, but they also get that the "legacy stuff” just won’t go away and needs some sort of attention if they’re to remain players at the standards table.
>
> So anyway, that’s the view I have from the perspective of someone who actually sells storage solutions for a living, and while I could certainly “sell some pNFS” to various customers who just want to add a dash of steroids to their current NFS infrastructure, or need to use NFS but also need to store far more data into a single namespace than any one box will accommodate, I also know that offering even more elastic solutions will be a necessary part of offering solutions to the growing contingent of folks who are not tied to any existing storage infrastructure and have various non-greybearded folks shouting in their ears about object this and cloud that. Might there not be some compromise solution which allows us to put more of this in userland with less context switches in and out of the kernel, also giving us the option of presenting a more united front to multiple protocols that require more ACL and lock impedance-matching than we’d ever want to put in the kernel anyway?
>

Jordan Hubbard

Jun 18, 2016, 9:51:14 PM

> On Jun 18, 2016, at 6:14 PM, Chris Watson <bsdu...@gmail.com> wrote:
>
> Since Jordan brought up clustering, I would be interested to hear Justin Gibbs thoughts here. I know about a year ago he was asked on an "after hours" video chat hosted by Matt Aherns about a feature he would really like to see and he mentioned he would really like, in a universe filled with time and money I'm sure, to work on a native clustering solution for FreeBSD. I don't know if he is subscribed to the list, and I'm certainly not throwing him under the bus by bringing his name up, but I know he has at least been thinking about this for some time and probably has some value to add here.

I think we should also be careful to define our terms in such a discussion. Specifically:

1. Are we talking about block-level clustering underneath ZFS (e.g. HAST or ${somethingElse}) or otherwise incorporated into ZFS itself at some low level? If you Google for “High-availability ZFS” you will encounter things like RSF-1 or the somewhat more mysterious Zetavault (http://www.zeta.systems/zetavault/high-availability/) but it’s not entirely clear how these technologies work, they simply claim to “scale-out ZFS” or “cluster ZFS” (which can be done within ZFS or one level above and still probably pass the Marketing Test for what people are willing to put on a web page).

2. Are we talking about clustering at a slightly higher level, in a filesystem-agnostic fashion which still preserves filesystem semantics?

3. Are we talking about clustering for data objects, in a fashion which does not necessarily provide filesystem semantics (a sharding database which can store arbitrary BLOBs would qualify)?

For all of the above: Are we seeking to be compatible with any other mechanisms, or are we talking about a FreeBSD-only solution?

This is why I brought up glusterfs / ceph / RiakCS in my previous comments - when talking to the $users that Rick wants to involve in the discussion, they rarely come to the table asking for “some or any sort of clustering, don’t care which or how it works” - they ask if I can offer an S3 compatible object store with horizontal scaling, or if they can use NFS in some clustered fashion where there’s a single namespace offering petabytes of storage with configurable redundancy such that no portion of that namespace is ever unavailable.

I’d be interested in what Justin had in mind when he asked Matt about this. Being able to “attach ZFS pools to one another” in such a fashion that all clients just see One Big Pool and ZFS’s own redundancy / snapshotting characteristics magically apply to the überpool would be Pretty Cool, obviously, and would allow one to do round-robin DNS for NFS such that any node could serve the same contents, but that also sounds pretty ambitious, depending on how it’s implemented.

Julian Elischer

Jun 19, 2016, 12:31:44 PM
On 19/06/2016 9:50 AM, Jordan Hubbard wrote:
> 1. Are we talking about block-level clustering underneath ZFS (e.g. HAST or ${somethingElse}) or otherwise incorporated into ZFS itself at some low level? If you Google for “High-availability ZFS” you will encounter things like RSF-1 or the somewhat more mysterious Zetavault (http://www.zeta.systems/zetavault/high-availability/) but it’s not entirely clear how these technologies work, they simply claim to “scale-out ZFS” or “cluster ZFS” (which can be done within ZFS or one level above and still probably pass the Marketing Test for what people are willing to put on a web page).

umm look at Panzura who have been selling this on FreeBSD for 4 years
<plug>and need FreeBSD devs in the bay area (or closer than me)</plug>

Jordan Hubbard

Jun 19, 2016, 1:54:29 PM

> On Jun 19, 2016, at 9:31 AM, Julian Elischer <jul...@freebsd.org> wrote:
>
> umm look at Panzura who have been selling this on FreeBSD for 4 years <plug>and need FreeBSD devs in the bay area (or closer than me))</plug>

Well, unlike Panzura, I think we’re also looking for an open source solution that can be upstreamed to FreeBSD and/or (probably better) the OpenZFS project. Any takers on that? My hand is up. :)

- Jordan

Rick Macklem

Jun 19, 2016, 7:29:46 PM
Jordan Hubbard wrote:
>
> > On Jun 18, 2016, at 6:14 PM, Chris Watson <bsdu...@gmail.com> wrote:
> >
> > Since Jordan brought up clustering, I would be interested to hear Justin
> > Gibbs thoughts here. I know about a year ago he was asked on an "after
> > hours" video chat hosted by Matt Aherns about a feature he would really
> > like to see and he mentioned he would really like, in a universe filled
> > with time and money I'm sure, to work on a native clustering solution for
> > FreeBSD. I don't know if he is subscribed to the list, and I'm certainly
> > not throwing him under the bus by bringing his name up, but I know he has
> > at least been thinking about this for some time and probably has some
> > value to add here.
>
> I think we should also be careful to define our terms in such a discussion.
> Specifically:
>
> 1. Are we talking about block-level clustering underneath ZFS (e.g. HAST or
> ${somethingElse}) or otherwise incorporated into ZFS itself at some low
> level? If you Google for “High-availability ZFS” you will encounter things
> like RSF-1 or the somewhat more mysterious Zetavault
> (http://www.zeta.systems/zetavault/high-availability/) but it’s not entirely
> clear how these technologies work, they simply claim to “scale-out ZFS” or
> “cluster ZFS” (which can be done within ZFS or one level above and still
> probably pass the Marketing Test for what people are willing to put on a web
> page).
>
> 2. Are we talking about clustering at a slightly higher level, in a
> filesystem-agnostic fashion which still preserves filesystem semantics?
>
> 3. Are we talking about clustering for data objects, in a fashion which does
> not necessarily provide filesystem semantics (a sharding database which can
> store arbitrary BLOBs would qualify)?
>
For the pNFS use case I am looking at, I would say #2.

I suspect #1 sits at a low enough level that redirecting I/O via the pNFS layouts
isn't practical, since ZFS is taking care of block allocations, etc.

I see #3 as a separate problem space, since NFS deals with files and not objects.
However, GlusterFS maps file objects on top of the POSIX-like FS, so I suppose that
could be done at the client end. (What glusterfs.org calls SwiftonFile, I think?)
It is also possible to map POSIX files onto file objects, but that sounds like more
work, which would need to be done under the NFS service.

> For all of the above: Are we seeking to be compatible with any other
> mechanisms, or are we talking about a FreeBSD-only solution?
>
> This is why I brought up glusterfs / ceph / RiakCS in my previous comments -
> when talking to the $users that Rick wants to involve in the discussion,
> they rarely come to the table asking for “some or any sort of clustering,
> don’t care which or how it works” - they ask if I can offer an S3 compatible
> object store with horizontal scaling, or

> if they can use NFS in some
> clustered fashion where there’s a single namespace offering petabytes of
> storage with configurable redundancy such that no portion of that namespace
> is ever unavailable.
>

I tend to think of this last case as the target for any pNFS server. The basic
idea is to redirect the I/O operations to wherever the data is actually stored,
so that I/O performance doesn't degrade with scale.

If redundancy is a necessary feature, then maybe Plan A is preferable to Plan B,
since GlusterFS does provide for redundancy and resilvering of lost copies, at
least from my understanding of the docs on gluster.org.

I'd also like to see how GlusterFS performs on a typical Linux setup.
Even without having the nfsd use FUSE, access of GlusterFS via FUSE results in crossing
user (syscall on mount) --> kernel --> user (glusterfs daemon) within the client machine,
if I understand how GlusterFS works. Then the gluster brick server glusterfsd daemon does
file system syscall(s) to get at the actual file on the underlying FS (xfs or ZFS or ...).
As such, there is already a lot of user<->kernel boundary crossings.
I wonder how much delay is added by the extra nfsd step for metadata?
- I can't say much about performance of Plan A yet, but metadata operations are slow
and latency seems to be the issue. (I actually seem to get better performance by
disabling SMP, for example.)

> I’d be interested in what Justin had in mind when he asked Matt about this.
> Being able to “attach ZFS pools to one another” in such a fashion that all
> clients just see One Big Pool and ZFS’s own redundancy / snapshotting
> characteristics magically apply to the überpool would be Pretty Cool,
> obviously, and would allow one to do round-robin DNS for NFS such that any
> node could serve the same contents, but that also sounds pretty ambitious,
> depending on how it’s implemented.
>

This would probably work with the extant nfsd and wouldn't have a use for pNFS.
I also agree that this sounds pretty ambitious.

rick

Rick Macklem

Jun 19, 2016, 9:55:06 PM
Here are a few pNFS papers from the Netapp and Panasas sites. They are
dated 2012->2015: (these papers give a nice overview of what pNFS is)
http://www.netapp.com/us/media/tr-4063.pdf
http://www.netapp.com/us/media/tr-4239.pdf
http://www.netapp.com/us/media/wp-7153.pdf
http://www.panasas.com/products/pnfs-overview

One of these notes that the first Linux distribution that shipped with pNFS
support was RHEL6.4 in 2013.

So, I have no idea if it will catch on, but I don't think it can be considered
end of life. (Many use NFSv3 and its RFC is dated June 1995.)

rick

Doug Rabson

Jun 20, 2016, 6:02:01 AM

> That is also why I am not sure I would totally embrace the idea of a
> central MDS being a Real Option. Sure, the risks can be mitigated (as you
> say, by mirroring it), but even saying the words “central MDS” (or central
> anything) may be such a turn-off to those very same DevOps folks, folks who
> have been burned so many times by SPOFs and scaling bottlenecks in large
> environments, that we'll lose the audience the minute they hear the trigger
> phrase. Even if it means signing up for Other Problems later, it’s a lot
> easier to “sell” the concept of completely distributed mechanisms where, if
> there is any notion of centralization at all, it’s at least the result of a
> quorum election and the DevOps folks don’t have to do anything manually to
> cause it to happen - the cluster is “resilient" and "self-healing" and they
> are happy with being able to say those buzzwords to the CIO, who nods
> knowingly and tells them they’re doing a fine job!
>

My main reason for liking NFS is that it has decent client support in
upstream Linux. One reason I started working on pNFS was that at $work our
existing cluster filesystem product, which uses a proprietary client
protocol, caused us to delay OS upgrades for months while we waited for
$vendor to port their client code to RHEL7. The NFS protocol is well
documented with several accessible reference implementations and pNFS gives
enough flexibility to support a distributed filesystem at an interesting
scale.

You mention a 'central MDS' as being an issue. I'm not going to go through
your list, but at least HDFS also has this 'issue' and it doesn't seem to be
a problem for many users storing >100 PB across >10^5 servers. In practice,
the MDS would be replicated for redundancy - there are lots of approaches
for this, my preference being Paxos but Raft would work just as well.
Google's GFS also followed this model and was an extremely reliable large
scale filesystem.

I am building an MDS as a layer on top of a key/value database, which offers
the possibility of moving the backing store to some kind of distributed
key/value store in the future, which would remove the scaling and reliability
concerns.
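
Purely to illustrate the shape of such a layer (this is not the actual
implementation; the record layout and names below are invented), the per-file
value kept in the key/value store, keyed by the file's NFS file handle, might
look roughly like:

#include <stdint.h>

#define	EX_MAXMIRRORS	4	/* assumed mirroring limit */

/* Where one copy of the data lives. */
struct ex_ds_loc {
	uint32_t	ds_id;		/* which data server */
	uint32_t	ds_fhlen;
	unsigned char	ds_fh[128];	/* file handle on that DS (128 = NFSv4 max) */
};

/* Example per-file metadata record. */
struct ex_file_md {
	uint64_t	size;		/* authoritative file size */
	uint64_t	mtime_sec;
	uint32_t	mtime_nsec;
	uint32_t	nmirrors;	/* how many DS copies exist */
	struct ex_ds_loc mirrors[EX_MAXMIRRORS];
};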

I can agree with this - everything I'm working on is in userland. Given
that I'm not trying to export a local filesystem, most of the reasons for
wanting a kernel implementation disappear. Adding support for NFS over RDMA
removes all the network context switching, and frequently accessed data
would typically be served out of a userland cache, which removes the rest of
the context switches.

Jordan Hubbard

Jun 20, 2016, 10:54:36 PM
OK, wow. This appears to have turned into something of a referendum on NFS and, just based on Rick and Doug’s defense of pNFS, I also think my commentary on that may have been misconstrued somewhat.

So, let me just set the record straight by saying that I’m all in favor of pNFS. It addresses a very definite need in the Enterprise marketplace and gives FreeBSD yet another arrow in its quiver when it comes to being “a player” in that (ever-growing) arena. The only point I was trying to make before was that if we could ALSO address clustering in a more general way as part of providing a pNFS solution, that would be great. I am not, however, the one writing the code and if my comments were in any way discouraging to the folks that are, I apologize and want to express my enthusiasm for it. If iXsystems engineering resources can contribute in any way to moving this ball forward, let me know and we’ll start doing so.

On the more general point of “NFS is hard, let’s go shopping” let me also say that it’s kind of important not to conflate end-user targeted solutions with enterprise solutions. Setting up a Kerberized NFSv4, for example, is not really designed to be trivial to set up and if anyone is waiting for that to happen, they may be waiting a very long time (like, forever). NFS and SMB are both fairly simple technologies to use if you restrict yourself to using, say, just 20% of their overall feature-sets. Once you add ACLs, Directory Services, user/group and permissions mappings, and any of the other more enterprise-centric features of these filesharing technologies, however, things rapidly get more complicated and the DevOps people who routinely play in these kinds of environments are quite happy to have all those options available because they’re not consumers operating in consumer environments.

Sun didn’t design NFS to be particularly consumer-centric, for that matter, and if you think SMB is “simple” because you clicked Network on Windows Explorer one day and stuff just automagically appeared, you should try operating it in a serious Windows Enterprise environment (just flip through some of the SMB bugs in the FreeNAS bug tracker - https://bugs.freenas.org/projects/freenas/issues?utf8=✓&set_filter=1&f%5B%5D=status_id&op%5Bstatus_id%5D=*&f%5B%5D=category_id&op%5Bcategory_id%5D=%3D&v%5Bcategory_id%5D%5B%5D=57&f%5B%5D=&c%5B%5D=tracker&c%5B%5D=status&c%5B%5D=priority&c%5B%5D=subject&c%5B%5D=assigned_to&c%5B%5D=updated_on&c%5B%5D=fixed_version&group_by= - if you want to see the kinds of problems users wrestle with all the time).

Anyway, I’ll get off the soapbox now, I just wanted to dispute the premise that “simple file sharing” that is also “secure file sharing” and “flexible file sharing” doesn’t really exist. The simplest end-user oriented file sharing system I’ve used to date is probably AFP, and Apple has been trying to kill it for years, probably because it doesn’t have all those extra knobs and Kerberos / Directory Services integration business users have been asking for (it’s also not particularly industry standard).

- Jordan

Linda Kateley

Jun 21, 2016, 12:20:52 PM
I have really enjoyed this discussion. Just to echo this point further:
I have spent most of my career with 1 foot in open source and the other 3
feet in the enterprise (and yes, I have 4 feet). Enterprise always makes
decisions based on reliability or someone telling them something is
reliable. If you ask 100 VMware admins why they use NFS, probably 100
will say because VMware recommends it. If you ask a CT at VMware why
they recommend it, the couple I have asked have said because it is a
reliable transport.

VMware now has interest in pNFS.

Technology gets driven by business/enterprise. I talked to a CA at a
large electronics chain and asked why they are using Ceph, and he said
about 100 words, then said because Red Hat recommends it with OpenStack.

Intel is driving Lustre. Red Hat is driving Ceph. VMware is driving pNFS. I don't
see anyone driving Gluster.

Every once in awhile you see products grow on their merit(watching
proxmox and zerto right now) but those usually get swooped up by a
bigger one.

To the point of setting up Kerberized NFS: AD has made Kerberos easy, and it
could be just as easy with NFS. Everything is easy once you know it.

lk

Rick Macklem

Jun 21, 2016, 4:43:00 PM
Linda Kateley wrote:
> I have really enjoyed this discussion. Just to echo this point further:
> I have spent most of my career with 1 foot in open source and the other 3
> feet in the enterprise (and yes, I have 4 feet). Enterprise always makes
> decisions based on reliability, or on someone telling them something is
> reliable. If you ask 100 VMware admins why they use NFS, probably 100
> will say because VMware recommends it. If you ask a CT at VMware why
> they recommend it, the couple I have asked have said because it is a
> reliable transport.
>
> VMware now has interest in pNFS.
>
> Technology gets driven by business/enterprise. I talked to a CA at a
> large electronics chain and asked why they are using Ceph; he said
> about 100 words, then said because Red Hat recommends it with OpenStack.
>
> Intel is driving Lustre. Red Hat is driving Ceph. VMware is driving pNFS. I don't
> see anyone driving Gluster.
>
I don't know of any vendors driving it (Red Hat people basically maintain it, afaik),
but Jordan sent me this a little while back:

https://www.socallinuxexpo.org/scale/14x/presentations/scaling-glusterfs-facebook

Facebook is only a user, but a large one.

Although GlusterFS seems to support OpenStack stuff, that support appears to be layered on top of its
POSIX file system using something they call SwiftOnFile.

Thanks for the comments, rick

Rick Macklem

unread,
Jun 21, 2016, 5:55:17 PM6/21/16
to
Jordan Hubbard wrote:
> OK, wow. This appears to have turned into something of a referendum on NFS
> and, just based on Rick and Doug’s defense of pNFS, I also think my
> commentary on that may have been misconstrued somewhat.
>
Actually, I thought it had become a referendum on LDAP;-)

As for defending pNFS, all I was trying to say was that "although it is hard
to believe, it has taken 10 years for pNFS to hit the streets". As such, it
is anyone's guess whether or not it will become widely adopted.
If it came across as more than that, I am the one that should be apologizing,
and I am in no way discouraged by any of the comments.

> So, let me just set the record straight by saying that I’m all in favor of
> pNFS. It addresses a very definite need in the Enterprise marketplace and
> gives FreeBSD yet another arrow in its quiver when it comes to being “a
> player” in that (ever-growing) arena. The only point I was trying to make
> before was that if we could ALSO address clustering in a more general way as
> part of providing a pNFS solution, that would be great.

When I did a fairly superficial evaluation of the open source clustering systems
out there (looking at online docs and not actually their code), it seemed that
GlusterFS was the best bet for "one size fits all".
It had:
- a distributed file system (replication, etc.) with a POSIX/FUSE interface.
- SwiftOnFile, which put Swift/OpenStack on top of this.
- decentralized metadata handling.
For pNFS, it also had:
- an NFSv3 server built into it.
- a FreeBSD port.

The others had one or more of these issues:
- object store only, with no POSIX file system support;
- a single centralized metadata store (MooseFS, for example);
- no FreeBSD port, and rumoured to be hard to port (Ceph and Lustre are two examples).

Now that I've worked with GlusterFS a little bit, I am skeptical that it can
deliver adequate performance for pNFS using the nfsd. I am still hoping I will
be proven wrong on this, but???

A GlusterFS/Ganesha-NFS user-space solution may be feasible. This is what the
GlusterFS folk are planning. However, for FreeBSD...
- Ganesha-NFS apparently was ported to FreeBSD, but the port was removed from
their source tree and it is said it now uses Linux-specific thread primitives.
--> As such, I have no idea how much effort would be involved in getting this
ported and working well on FreeBSD.
- I would also wait until this is working in Linux and would want to do an
evaluation of that, to make sure it actually works/performs well, before
considering this.
*** For me personally, I am probably not interested in working on this. I
know the FreeBSD nfsd kernel code well and can easily work with that,
but Ganesha-NFS would be an entirely different beast.

Bottom line, at this point I am skeptical that a generic clustering system
will work for pNFS.

> I am not, however,
> the one writing the code and if my comments were in any way discouraging to
> the folks that are, I apologize and want to express my enthusiasm for it.
> If iXsystems engineering resources can contribute in any way to moving this
> ball forward, let me know and we’ll start doing so.
>

Well, although they may not be useful for building a pNFS server, some sort
of evaluation of the open source clustering systems might be useful.
Sooner or later, the Enterprise marketplace may want one or more of these and
it seems to me that having one of these layered on top of ZFS may be an attractive
solution.
- Some will never be ported to FreeBSD, but the ones that are could probably be
evaluated fairly easily, if you have the resources.

Since almost all the code I've written gets reused if I do a Plan B, I will
probably pursue that, leaving the GlusterFS interface bits in place in case
they are useful.

Thanks for all the interesting comments, rick

Willem Jan Withagen

unread,
Jun 22, 2016, 7:26:41 AM6/22/16
to
Hi Jordan,

To rip just a bit of your text out of context:


On 18-6-2016 22:50, Jordan Hubbard wrote:
> Some, if not most, of them are also far
> better supported under Linux than FreeBSD (I don’t think we even have
> a working ceph port yet).

In the spare time I have left, I'm trying to get a lot of small fixes
into the Ceph tree to get it actually compiling, testing, and running on
FreeBSD. But Ceph is a lot of code, and since a lot of people are
working on it, the number of code changes is big, and just keeping up
with that is sometimes hard. More and more Linux-isms are dropped into
the code, so progress is slow, partly because it is hard to get people
to look at the commits and get them merged.
The current state is that I can compile everything, and I can run 120 of
129 tests with success. I once had them all passing, but then a busload
of changes was dropped in the tree, and so I needed to start
"repairing" again.

I gave a small presentation of my work thus far at Ceph Day CERN in
Geneva: https://indico.cern.ch/event/542464/contributions/2202309/
The differences are not really that big in the C++ code; most of the
things to fix are additional tools that have to deal with infrastructure
that fully assumes it is running on a Linux distro.

Next to that, Ceph is moving to its own disk store system, BlueStore,
whereas I hope(d) to base it on an underlying ZFS layer...
To run BlueStore, AIO is needed for disk devices, but FreeBSD's current
AIO is not call-for-call compatible with Linux's and requires a glue
layer. I have not looked into the size of the semantic differences
between Linux and FreeBSD here.
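
To make "glue layer" concrete, here is a rough sketch of the shape of it,
mapping a libaio-style submit/reap pair onto FreeBSD's POSIX AIO. The
function names are made up, error handling is simplified, and this is only
an illustration of the mismatch, not what BlueStore actually needs:

/* Hypothetical glue: a libaio-style "submit a pread, reap a completion"
 * pair expressed with FreeBSD's POSIX AIO.  Just a sketch of the shape
 * of the problem. */
#include <sys/types.h>
#include <aio.h>
#include <string.h>

/* Queue an asynchronous read at the given offset. */
static int
glue_submit_pread(struct aiocb *cb, int fd, void *buf, size_t len, off_t off)
{
	memset(cb, 0, sizeof(*cb));
	cb->aio_fildes = fd;
	cb->aio_buf = buf;
	cb->aio_nbytes = len;
	cb->aio_offset = off;
	return (aio_read(cb));		/* 0 on success, -1 + errno on error */
}

/* Block until one outstanding request finishes.  FreeBSD's
 * aio_waitcomplete(2) hands back the completed control block, which is
 * roughly what an io_getevents()-style reaper wants. */
static ssize_t
glue_reap_one(struct aiocb **donep)
{
	return (aio_waitcomplete(donep, NULL));	/* bytes done, or -1 */
}

The annoying part is that libaio batches submissions and reaps many
completions at once, while the POSIX calls used here work one request at a
time, so a real shim would also have to fake that batching.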

On the other hand, they just declared CephFS (a POSIX filesystem running
on top of Ceph) stable and ready to be used.

--WjW

Jordan Hubbard

unread,
Jun 24, 2016, 3:35:40 AM6/24/16
to

> On Jun 22, 2016, at 1:56 AM, Willem Jan Withagen <w...@digiware.nl> wrote:
>
> In the spare time I have left, I'm trying to get a lot of small fixes
> into the ceph tree to get it actually compiling, testing, and running on
> FreeBSD. But Ceph is a lot of code, and since a lot of people are
> working on it, the number of code changes are big.

Hi Willem,

Yes, I read your paper on the porting effort!

I also took a look at porting ceph myself, about 2 years ago, and rapidly concluded that it wasn’t a small / trivial effort by any means and would require a strong justification in terms of ceph’s feature set over glusterfs / moose / OpenAFS / RiakCS / etc. Since that time, there’s been customer interest but nothing truly “strong” per-se. My attraction to ceph remains centered around at least these 4 things:

1. Distributed Object store with S3-compatible ReST API
2. Interoperates with Openstack via Swift compatibility
3. Block storage (RADOS) - possibly useful for iSCSI and other block storage requirements
4. Filesystem interface
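
Just to give a sense of how small the surface area of #1 actually is, a
minimal librados client looks roughly like the following (an untested
sketch; the pool name, object name and config path are made up):

/* Minimal librados sketch: connect to a cluster, write and read back one
 * object.  Pool name, object name and config path are hypothetical. */
#include <rados/librados.h>
#include <stdio.h>

int
main(void)
{
	rados_t cluster;
	rados_ioctx_t io;
	char buf[64];

	if (rados_create(&cluster, "admin") < 0 ||
	    rados_conf_read_file(cluster, "/etc/ceph/ceph.conf") < 0 ||
	    rados_connect(cluster) < 0) {
		fprintf(stderr, "cannot reach cluster\n");
		return 1;
	}
	if (rados_ioctx_create(cluster, "testpool", &io) == 0) {
		rados_write_full(io, "hello-obj", "hello", 5);	  /* store object */
		rados_read(io, "hello-obj", buf, sizeof(buf), 0); /* read it back */
		rados_ioctx_destroy(io);
	}
	rados_shutdown(cluster);
	return 0;
}

The S3/Swift gateway and the block layer are built on top of that same
object store, which is a big part of the attraction.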

Is there anything we can do to help? Do the CEPH folks seem receptive to actually having a “Tier 1” FreeBSD port? I know that stas@ did an early almost-port a while back, but it never reached fruition and my feeling was that they (ceph) might be a little gun-shy about seeing another port that might wind up in the same place, crufting up their code base to no purpose. Do you have any initial impressions about that? I’ve never talked to any of the 3 principal guys working on the project and this is pure guesswork on my part.

- Jordan

Willem Jan Withagen

unread,
Jun 24, 2016, 4:32:08 AM6/24/16
to
On 24-6-2016 09:35, Jordan Hubbard wrote:
>
>> On Jun 22, 2016, at 1:56 AM, Willem Jan Withagen <w...@digiware.nl>
>> wrote:
>>
>> In the spare time I have left, I'm trying to get a lot of small
>> fixes into the ceph tree to get it actually compiling, testing, and
>> running on FreeBSD. But Ceph is a lot of code, and since a lot of
>> people are working on it, the number of code changes are big.
>
> Hi Willem,
>
> Yes, I read your paper on the porting effort!
>
> I also took a look at porting ceph myself, about 2 years ago, and
> rapidly concluded that it wasn’t a small / trivial effort by any
> means and would require a strong justification in terms of ceph’s
> feature set over glusterfs / moose / OpenAFS / RiakCS / etc. Since
> that time, there’s been customer interest but nothing truly “strong”
> per-se.

I've been going at it since last November... And all I got in are about 3
batches of FreeBSD-specific commits. Lots has to do with release windows
and code slush, like we know on FreeBSD. But even then reviews tend to be
slow and I need to push people to look at them. Whilst in the meantime
all kinds of things get pulled and inserted into the tree that seriously
are not FreeBSD-friendly. Sometimes I see them during commit and
"negotiate" better compatibility with the author. At other times I miss
the whole thing and need to rebase to get rid of merge conflicts, only to
find out the hard way that somebody has made the whole peer communication
async and has thrown kqueue for the BSDs at it. But that doesn't work
(yet), so to get my other patches in, I first need to fix this. Takes a
lot of time.....
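
For reference, the kqueue side of such an event loop is not a lot of code;
a bare-bones sketch (made-up function, no error handling) of what a BSD
backend has to do looks like:

/* Bare-bones kqueue read loop, the kind of thing the BSD backend of an
 * async messenger boils down to.  Dispatch is left as a comment. */
#include <sys/types.h>
#include <sys/event.h>
#include <sys/time.h>
#include <unistd.h>

static void
watch_and_dispatch(int fd)
{
	struct kevent change, events[8];
	int kq, i, n;

	kq = kqueue();
	EV_SET(&change, fd, EVFILT_READ, EV_ADD, 0, 0, NULL);
	kevent(kq, &change, 1, NULL, 0, NULL);		/* register interest */

	for (;;) {
		n = kevent(kq, NULL, 0, events, 8, NULL);	/* wait */
		if (n <= 0)
			break;
		for (i = 0; i < n; i++) {
			/* events[i].ident is the ready fd; hand it to the
			 * messenger's read path here. */
		}
	}
	close(kq);
}

The loop itself is the easy part; presumably wiring it into the
messenger's threading and shutdown paths is where it still breaks.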

That all said, I was in Geneva and a lot of the Ceph people were there,
including Sage Weil. And I got the feeling they appreciated a larger
community. I think they see what ZFS has done with OpenZFS and see that
communities get somewhere.

Now one of the things to do to continue, now that I can sort of compile
and run the first test set, is to set up sort of my own Jenkins stuff, so
that I can at least test-drive some of the tree automagically and get
some test coverage of the code on FreeBSD. In my mind (and Sage warned me
that it will be more or less required) it is the only way to actually
get a serious foot in the door with the Ceph guys.

> My attraction to ceph remains centered around at least these
> 4 things:
>
> 1. Distributed Object store with S3-compatible ReST API
> 2. Interoperates with Openstack via Swift compatibility
> 3. Block storage (RADOS) - possibly useful for iSCSI and other block
> storage requirements
> 4. Filesystem interface
>
> Is there anything we can do to help?

I'll get back on that in a separate Email.

> Do the CEPH folks seem
> receptive to actually having a “Tier 1” FreeBSD port? I know that
> stas@ did an early almost-port a while back, but it never reached
> fruition and my feeling was that they (ceph) might be a little
> gun-shy about seeing another port that might wind up in the same
> place, crufting up their code base to no purpose.

Well, as you know, I am from the era before there was automake....
Back then porting was still very much an art. So I've been balancing
between crufting up the code and hiding things nicely and cleanly in C++
classes and the like. And as a go-between, stuff gets stuck in compat.h.
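
To give an idea, a typical compat.h-style entry (purely illustrative, not
the actual Ceph code) papering over a Linux-only pthread call looks
something like:

/* Illustrative compat shim: Linux code calls pthread_setname_np(), while
 * FreeBSD spells it pthread_set_name_np() and keeps it in <pthread_np.h>. */
#include <pthread.h>
#if defined(__FreeBSD__)
#include <pthread_np.h>
#endif

static inline int
compat_pthread_setname(pthread_t thread, const char *name)
{
#if defined(__FreeBSD__)
	pthread_set_name_np(thread, name);		/* returns void on FreeBSD */
	return 0;
#else
	return pthread_setname_np(thread, name);	/* glibc variant; needs _GNU_SOURCE */
#endif
}

The individual shims are trivial; the pain is that every new Linux-ism
needs another one, and upstream has to be willing to call the wrapper.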

One of my slides was actually about the impact of foreign code in the
tree, and up till now that is relatively minimal, which seemed to please
a lot of the folks. But they also liked the idea that getting FreeBSD
stuff in actually exposed code weaknesses (and fixes) in the odd corners.

> Do you have any
> initial impressions about that? I’ve never talked to any of the 3
> principal guys working on the project and this is pure guesswork on
> my part.

I think they are going their own path, like writing their own datastore
so they can do things they require that POSIX can't deliver, and as such
are also diverging from what is the default on Linux.

The system architect in me also sees painful things happen because of
the "reinvention" of things, but then again, that happens with projects
this big. Things like checksums, compression, encryption, ....
Lots of stuff I've seen happen to ZFS over its lifetime.
But so be it; everybody gets to choose their own axes to grind.

The community person to talk to is perhaps Patrick McGarry, but even
Sage would be good to talk to.

--WjW
