Please pull from 'master' branch of
git://git.kernel.org/pub/scm/linux/kernel/git/sage/ceph-client.git master
to receive the Ceph distributed file system client. The fs has made a
half dozen rounds on linux-fsdevel, and has been in linux-next for the
last month or so. Although review has been sparse, Andrew said the code
looks reasonable for 2.6.33.
The git tree includes the full patchset posted in October and incremental
changes since then. I've tried to cram in all the anticipated protocol
changes, but the file system is still strictly EXPERIMENTAL and is marked
as such. Merging now will attract new eyes and make it easier to test and
evaluate the system (both the client and server side).
Basic features include:
* High availability and reliability. No single points of failure.
* Strong data and metadata consistency between clients
* N-way replication of all data across storage nodes
* Seamless scaling from 1 to potentially many thousands of nodes
* Fast recovery from node failures
* Automatic rebalancing of data on node addition/removal
* Easy deployment: most FS components are userspace daemons
More info on Ceph at
Thanks-
sage
Julia Lawall (2):
fs/ceph: introduce missing kfree
fs/ceph: Move a dereference below a NULL test
Noah Watkins (3):
ceph: replace list_entry with container_of
ceph: remove redundant use of le32_to_cpu
ceph: fix intra strip unit length calculation
Sage Weil (93):
ceph: documentation
ceph: on-wire types
ceph: client types
ceph: ref counted buffer
ceph: super.c
ceph: inode operations
ceph: directory operations
ceph: file operations
ceph: address space operations
ceph: MDS client
ceph: OSD client
ceph: CRUSH mapping algorithm
ceph: monitor client
ceph: capability management
ceph: snapshot management
ceph: messenger library
ceph: message pools
ceph: nfs re-export support
ceph: ioctls
ceph: debugfs
ceph: Kconfig, Makefile
ceph: document shared files in README
ceph: show meaningful version on module load
ceph: include preferred_osd in file layout virtual xattr
ceph: gracefully avoid empty crush buckets
ceph: fix mdsmap decoding when multiple mds's are present
ceph: renew mon subscription before it expires
ceph: fix osd request submission race
ceph: revoke osd request message on request completion
ceph: fail gracefully on corrupt osdmap (bad pg_temp mapping)
ceph: reset osd session on fault, not peer_reset
ceph: cancel osd requests before resending them
ceph: update to mon client protocol v15
ceph: add file layout validation
ceph: ignore trailing data in monamp
ceph: remove unused CEPH_MSG_{OSD,MDS}_GETMAP
ceph: add version field to message header
ceph: convert encode/decode macros to inlines
ceph: initialize sb->s_bdi, bdi_unregister after kill_anon_super
ceph: move generic flushing code into helper
ceph: flush dirty caps via the cap_dirty list
ceph: correct subscribe_ack msgpool payload size
ceph: warn on allocation from msgpool with larger front_len
ceph: move dirty caps code around
ceph: enable readahead
ceph: include preferred osd in placement seed
ceph: v0.17 of client
ceph: move directory size logic to ceph_getattr
ceph: remove small mon addr limit; use CEPH_MAX_MON where appropriate
ceph: reduce parse_mount_args stack usage
ceph: silence uninitialized variable warning
ceph: fix, clean up string mount arg parsing
ceph: allocate and parse mount args before client instance
ceph: correct comment to match striping calculation
ceph: fix object striping calculation for non-default striping schemes
ceph: fix uninitialized err variable
crush: always return a value from crush_bucket_choose
ceph: init/destroy bdi in client create/destroy helpers
ceph: use fixed endian encoding for ceph_entity_addr
ceph: fix endian conversions for ceph_pg
ceph: fix sparse endian warning
ceph: convert port endianness
ceph: clean up 'osd%d down' console msg
ceph: make CRUSH hash functions non-inline
ceph: use strong hash function for mapping objects to pgs
ceph: make object hash a pg_pool property
ceph: make CRUSH hash function a bucket property
ceph: do not confuse stale and dead (unreconnected) caps
ceph: separate banner and connect during handshake into distinct stages
ceph: remove recon_gen logic
ceph: exclude snapdir from readdir results
ceph: initialize i_size/i_rbytes on snapdir
ceph: pr_info when mds reconnect completes
ceph: build cleanly without CONFIG_DEBUG_FS
ceph: fix page invalidation deadlock
ceph: remove bad calls to ceph_con_shutdown
ceph: remove unnecessary ceph_con_shutdown
ceph: handle errors during osd client init
ceph: negotiate authentication protocol; implement AUTH_NONE protocol
ceph: move mempool creation to ceph_create_client
ceph: small cleanup in hash function
ceph: fix debugfs entry, simplify fsid checks
ceph: decode updated mdsmap format
ceph: reset requested max_size after mds reconnect
ceph: reset msgr backoff during open, not after successful handshake
ceph: remove dead code
ceph: remove useless IS_ERR checks
ceph: plug leak of request_mutex
ceph: whitespace cleanup
ceph: hide /.ceph from readdir results
ceph: allow preferred osd to be get/set via layout ioctl
ceph: update MAINTAINERS entry with correct git URL
ceph: mark v0.18 release
Yehuda Sadeh (1):
ceph: mount fails immediately on error
----
Documentation/filesystems/ceph.txt | 139 ++
Documentation/ioctl/ioctl-number.txt | 1 +
MAINTAINERS | 9 +
fs/Kconfig | 1 +
fs/Makefile | 1 +
fs/ceph/Kconfig | 26 +
fs/ceph/Makefile | 37 +
fs/ceph/README | 20 +
fs/ceph/addr.c | 1115 +++++++++++++
fs/ceph/auth.c | 225 +++
fs/ceph/auth.h | 77 +
fs/ceph/auth_none.c | 120 ++
fs/ceph/auth_none.h | 28 +
fs/ceph/buffer.c | 34 +
fs/ceph/buffer.h | 55 +
fs/ceph/caps.c | 2863 ++++++++++++++++++++++++++++++++
fs/ceph/ceph_debug.h | 37 +
fs/ceph/ceph_frag.c | 21 +
fs/ceph/ceph_frag.h | 109 ++
fs/ceph/ceph_fs.c | 74 +
fs/ceph/ceph_fs.h | 648 ++++++++
fs/ceph/ceph_hash.c | 118 ++
fs/ceph/ceph_hash.h | 13 +
fs/ceph/ceph_strings.c | 176 ++
fs/ceph/crush/crush.c | 151 ++
fs/ceph/crush/crush.h | 180 ++
fs/ceph/crush/hash.c | 149 ++
fs/ceph/crush/hash.h | 17 +
fs/ceph/crush/mapper.c | 596 +++++++
fs/ceph/crush/mapper.h | 20 +
fs/ceph/debugfs.c | 450 +++++
fs/ceph/decode.h | 159 ++
fs/ceph/dir.c | 1222 ++++++++++++++
fs/ceph/export.c | 223 +++
fs/ceph/file.c | 904 +++++++++++
fs/ceph/inode.c | 1624 +++++++++++++++++++
fs/ceph/ioctl.c | 160 ++
fs/ceph/ioctl.h | 40 +
fs/ceph/mds_client.c | 2976 ++++++++++++++++++++++++++++++++++
fs/ceph/mds_client.h | 327 ++++
fs/ceph/mdsmap.c | 170 ++
fs/ceph/mdsmap.h | 54 +
fs/ceph/messenger.c | 2103 ++++++++++++++++++++++++
fs/ceph/messenger.h | 253 +++
fs/ceph/mon_client.c | 751 +++++++++
fs/ceph/mon_client.h | 115 ++
fs/ceph/msgpool.c | 181 ++
fs/ceph/msgpool.h | 27 +
fs/ceph/msgr.h | 167 ++
fs/ceph/osd_client.c | 1364 ++++++++++++++++
fs/ceph/osd_client.h | 150 ++
fs/ceph/osdmap.c | 916 +++++++++++
fs/ceph/osdmap.h | 124 ++
fs/ceph/rados.h | 370 +++++
fs/ceph/snap.c | 887 ++++++++++
fs/ceph/super.c | 984 +++++++++++
fs/ceph/super.h | 895 ++++++++++
fs/ceph/types.h | 29 +
fs/ceph/xattr.c | 842 ++++++++++
59 files changed, 25527 insertions(+), 0 deletions(-)
I would still like to see ceph merged for 2.6.33. It's certainly not
production ready, but it would be greatly beneficial to be in mainline for
the same reasons other file systems like btrfs and exofs were merged
early.
Is there more information you'd like to see from me before pulling? If
there was a reason you decided not to pull, please let me know.
Thanks-
sage
On Fri, 18 Dec 2009, Sage Weil wrote:
>
> I would still like to see ceph merged for 2.6.33. It's certainly not
> production ready, but it would be greatly beneficial to be in mainline for
> the same reasons other file systems like btrfs and exofs were merged
> early.
So what happened to ceph is the same thing that happened to the alacrityvm
pull request (Greg Haskins added to cc): I pretty much continually had a
_lot_ of pull requests, and all the time the priority for the ceph and
alacrityvm pull requests was just low enough on my priority list that I
never felt I had the reason to look into the background enough to make an
even half-assed decision of whether to pull or not.
And no, "just pull" is not my default answer - if I don't have a reason,
the default action is "don't pull".
I used to say that "my job is to say 'no'", although I've been so good at
farming work out to submaintainers that most of the time my real job is to pull
from submaintainers who hopefully know how to say 'no'. But when it comes
to whole new driver features, I'm still "no by default - tell me _why_ I
should pull".
So what is a new subsystem person to do?
The best thing to do is to try to have users that are vocal about the
feature, and talk about how great it is. Some advocates for it, in other
words. Just a few other people saying "hey, I use this, it's great" is
actually a big deal to me. For alacrityvm and cephfs, I didn't have that,
or they just weren't loud enough for me to hear.
So since you mentioned btrfs as an "early merge", I'll mention it too, as
a great example of how something got merged early because it had easily
gotten past my "people are asking for it" filter, to the point where _I_
was interested in trying it out personally, and asking Chris&co to tell me
when it was ready.
Ok, so that was somewhat unusual - I'm not suggesting you'd need to try to
drum up quite _that_ much hype - but it kind of illustrates the opposite
extreme of your issue. Get some PR going, get people talking about it, get
people testing it out. Get people outside of your area saying "hey, I use
it, and I hate having to merge it every release".
Then, when I see a pull request during the merge window, the pull suddenly
has a much higher priority, and I go "Ok, I know people are using this".
So no astro-turfing, but real grass-roots support really does help (or
top-down feedback for that matter - if a _distribution_ says "we're going
to merge this in our distro regardless", that also counts as a big hint
for me that people actually expect to use it and would like to not go
through the pain of merging).
Linus
FWIW: I'd like to see it go in.
Ceph is new and experimental so you're not going to see production shops
like ours jumping up and down saying we use it and are tired of merging it,
like we would say if Lustre were (again) on the table.
However, I will say Ceph looks good and, in the interest of nurturing future
options, I'm for merging it!
Jim Garlick
Lawrence Livermore National Laboratory
Is the on-the-wire protocol believed to be correct, complete, and stable? How
about any userspace APIs and on-disk formats? In other words...
> > The git tree includes the full patchset posted in October and incremental
> > changes since then. I've tried to cram in all the anticipated protocol
> > changes, but the file system is still strictly EXPERIMENTAL and is marked
Anything left dangling on the changes?
One issue with ceph is that I'm not sure it has any users at all.
The mailing list seems to be pretty much dead?
On a philosophical level I agree that network file systems are
definitely an area that could use some more improvement.
> like ours jumping up and down saying we use it and are tired of merging it,
> like we would say if Lustre were (again) on the table.
OT, but I took a look at some Lustre srpm a few months ago and it no
longer seemed to require all the horrible VFS patches that the older
versions were plagued with (or perhaps I missed them). Because it
definitely seems to have a large real-world user base, perhaps it
would be something for staging at least these days?
-Andi
--
a...@linux.intel.com -- Speaking for myself only.
> Jim Garlick <gar...@llnl.gov> writes:
> >
> > Ceph is new and experimental so you're not going to see production shops
>
> One issue with ceph is that I'm not sure it has any users at all.
> The mailing list seems to be pretty much dead?
> On a philosophical level I agree that network file systems are
> definitely an area that could use some more improvement.
The list is slow. The developers all work in the same office, so most of
the technical discussion ends up face to face (we're working on moving
more of it to the list). I also tend to send users actively testing it to
the IRC channel.
That said, there aren't many active users. I see lots of interested
people lurking on the list and 'waiting for stability,' but I think the
prospect of testing an unstable cluster fs is much more daunting than a
local one.
If you want stability, then it's probably too early to merge. If you want
active users, that essentially hinges on stability too. But if it's
interest in/demand for an alternative distributed fs, then the sooner it's
merged the better.
From my point of view, merging now will be a bit rockier with coordinating
releases, bug fixes, and dealing with any unforeseen client-side changes,
but I think it'll be worth it. OTOH, another release cycle will bring
greater stability and better first impressions.
sage
The wire protocol is close. There is a corner case with MDS failure
recovery that needs attention, but it can be resolved in a
backward-compatible way. I think a compat/incompat flags mechanism during the
initial handshake might be appropriate to make changes easier going
forward. I don't anticipate any other changes there.
There are some as-yet unresolved interface and performance issues with the
way the storage nodes interact with btrfs that have on-disk format
implications. I hope to resolve those shortly. Those, of course, do not
impact the client code.
sage
Having compat/incompat flags for the network protocol, implemented
correctly, is really critical for long term maintenance. For Lustre,
we ended up using a single set of compatibility flags:
- client sends full set of features that it understands
- server replies with strict subset of flags that it also understands
  (i.e. client_features & server_supported_features)
- if client doesn't have required support for a feature needed by the
  server, server refuses to allow client to mount
- if server doesn't have feature required by client (e.g. understands only
  some older implementation no longer supported by client), client refuses
  to mount filesystem
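
For illustration, a minimal userspace sketch of that negotiation might
look like the following; the feature bits, struct, and helper names here
are hypothetical, not taken from the Lustre or Ceph code:

#include <stdint.h>
#include <stdio.h>

/* Hypothetical feature bits for illustration only. */
#define FEAT_CRC        (1ULL << 0)   /* message checksums      */
#define FEAT_RECONNECT  (1ULL << 1)   /* MDS reconnect v2       */
#define FEAT_SNAPSHOTS  (1ULL << 2)   /* snapshot-aware caps    */

struct peer {
	uint64_t supported;   /* everything this side understands     */
	uint64_t required;    /* subset the peer must also understand */
};

/* Server side: reply with the strict subset of the client's features
 * that it also understands (client_features & server_supported). */
static uint64_t negotiate(const struct peer *server, uint64_t client_features)
{
	return client_features & server->supported;
}

/* Either side: mounting only proceeds if every locally required
 * feature survived the intersection. */
static int check_required(const struct peer *self, uint64_t agreed)
{
	uint64_t missing = self->required & ~agreed;

	if (missing) {
		fprintf(stderr, "refusing mount: missing features 0x%llx\n",
			(unsigned long long)missing);
		return -1;
	}
	return 0;
}

int main(void)
{
	struct peer client = {
		.supported = FEAT_CRC | FEAT_RECONNECT | FEAT_SNAPSHOTS,
		.required  = FEAT_RECONNECT,
	};
	struct peer server = {
		.supported = FEAT_CRC | FEAT_RECONNECT,
		.required  = FEAT_CRC,
	};

	/* 1. client sends client.supported in its handshake message */
	/* 2. server replies with the intersection                   */
	uint64_t agreed = negotiate(&server, client.supported);

	/* 3. both sides verify their required features are present  */
	if (check_required(&server, agreed) || check_required(&client, agreed))
		return 1;

	printf("mount proceeds with features 0x%llx\n",
	       (unsigned long long)agreed);
	return 0;
}

Tracking each side's required bits separately from its full supported
set is what lets optional features be retired later without breaking
older peers.
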
We've been able to use this mechanism for the past 5 years to maintain
protocol interoperability for Lustre, though we don't promise
perpetual interoperability, only for about 3 years or so before users
have to upgrade to a newer release. That allows us to drop support
for ancient code instead of having to carry around baggage for every
possible combination of old features.
Using simple version numbers for the protocol means you have to carry
the baggage of every single previous version, and it isn't possible to
have "experimental" features that are out in the wild, but eventually
don't make sense to keep around forever.
Cheers, Andreas
--
Andreas Dilger
Sr. Staff Engineer, Lustre Group
Sun Microsystems of Canada, Inc.
We have had bugzillas opened with us (Red Hat) requesting that CEPH be included
in Fedora/RHEL, so I'm here to yell loudly that somebody wants it :).
The problem for these particular users is that sucking down a git tree and
applying patches and building a kernel is a very high entry cost to test
something they are very excited about, so they depend on distributions to ship
the new fun stuff for them to start testing. The problem is the distributions
do not want to ship new fun stuff that's not upstream if at all possible
(especially when it comes to filesystems). I personally have no issues with
just sucking a bunch of patches into the Fedora kernel so people can start
testing it, but I think that sends the wrong message, since we're supposed to be
following upstream and encouraging people to push their code upstream first.
Not to mention that it makes the actual Fedora kernel team antsy, and I already
bug them enough with what I pull in for btrfs :).
So for the time being I'm just going to pull the userspace stuff into Fedora.
If you still feel that there are not enough users to justify pulling CEPH in, I
will probably pull the patches into the rawhide Fedora kernel when F13 branches
off and hopefully that will pull even more users in. Thanks,
Josef