Google Groups no longer supports new Usenet posts or subscriptions. Historical content remains viewable.

Dismiss

[PATCH 00/26] ceph distributed file system client

2 views

Skip to first unread message

Sage Weil

unread,

Mar 2, 2010, 8:00:04 PM3/2/10

Hi everyone,

This is the latest patchset for the Ceph distributed file system client.
I've broken it down into individual files for ease of review.
Alternatively, the ceph-client.git tree includes a full history (from
mid-October). A shortlog for changes since the last time I sent this
out (during the 2.6.33 merge window) at the end of this email.

We've been hammering on this steadily for the past few weeks and are
pretty happy with the stability at this point. We would definitely
benefit from the broader testing that would come from inclusion in
2.6.34. If there are no major complaints, I will send this to Linus
again shortly.

Questions, comments, review, etc. are all welcome. :)

Thanks-
sage

Git tree:
git://git.kernel.org/pub/scm/linux/kernel/git/sage/ceph-client.git for-linus

Server-side git:
git://ceph.newdream.net/ceph.git

Below is a shortlog of changes since v0.18, the last patchset sent out
to LKML/fsdevel for review. The main items are support for client and
server authentication (auth_x), some support for easing future protocol
changes, and a variety of bug fixes. The full history is available at

http://ceph.newdream.net/git/?p=ceph-client.git;a=shortlog;h=unstable

Alexander Beregalov (1):
ceph: move dereference after NULL test

Julia Lawall (1):
ceph: remove duplicate variable initialization

Sage Weil (80):
ceph: use kref for ceph_buffer
ceph: simplify ceph_buffer interface
ceph: use kref for struct ceph_mds_request
ceph: use kref for ceph_osd_request
ceph: use kref for ceph_msg
ceph: do not feed bad device ids to crush
ceph: fix leak of monc mutex
ceph: carry explicit msg reference for currently sending message
ceph: plug msg leak in con_fault
ceph: detect lossy state of connection
ceph: don't save sent messages on lossy connections
ceph: hex dump corrupt server data to KERN_DEBUG
ceph: plug leak of incoming message during connection fault/close
ceph: make mds ops interruptible
ceph: include link to bdi in debugfs
ceph: ensure rename target dentry fails revalidation
ceph: do not drop lease during revalidate
ceph: fix error paths for corrupt osdmap messages
ceph: fix incremental osdmap pg_temp decoding bug
ceph: do not touch_caps while iterating over caps list
ceph: only unregister registered bdi
ceph: unregister canceled/timed out osd requests
ceph: use connection mutex to protect read and write stages
ceph: control access to page vector for incoming data
ceph: more informative msgpool errors
ceph: include transaction id in ceph_msg_header (protocol change)
ceph: add feature bits to connection handshake (protocol change)
ceph: support ceph_pagelist for message payload
ceph: use ceph_pagelist for mds reconnect message; change encoding (protocol change)
ceph: remove unused erank field
ceph: display pgid in debugfs osd request dump
ceph: mark MDS CREATE as a write op
ceph: properly handle aborted mds requests
ceph: precede encoded ceph_pg_pool struct with version
ceph: include type in ceph_entity_addr, filepath
ceph: release all pages after successful osd write response
ceph: buffer decoding helpers
ceph: aes crypto and base64 encode/decode helpers
ceph: allow renewal of auth credentials
ceph: add struct version to auth encoding
ceph: add support for auth_x authentication protocol
ceph: add uid field to ceph_pg_pool
ceph: cap revocation fixes
ceph: do not retain caps that are being revoked
ceph: fix sync read eof check deadlock
ceph: cleanup async writeback, truncation, invalidate helpers
ceph: invalidate pages even if truncate is pending
ceph: remove bogus invalidate_mapping_pages
ceph: fix msgr to keep sent messages until acked
ceph: reset osd connections after fault
ceph: allow connection to be reopened by fault callback
ceph: cancel delayed work when closing connection
ceph: use rbtree for mds requests
ceph: use rbtree for snap_realms
ceph: use rbtree for mon statfs requests
ceph: fix authentication races, auth_none oops
ceph: clean up readdir caps reservation
ceph: fix iterate_caps removal race
ceph: fix memory leak when destroying osdmap with pg_temp mappings
ceph: use rbtree for pg pools; decode new osdmap format
ceph: v0.19 release
ceph: fix typo in ceph_queue_writeback debug output
ceph: fix check for invalidate_mapping_pages success
ceph: fix up unexpected message handling
ceph: fix comments, locking in destroy_inode
ceph: drop messages on unregistered mds sessions; cleanup
ceph: fix client_request_forward decoding
ceph: invalidate_authorizer without con->mutex held
ceph: fix connection fault STANDBY check
ceph: remove fragile __map_osds optimization
ceph: remove bogus mds forward warning
ceph: reset bits on connection close
ceph: use single osd op reply msg
ceph: fix snaptrace decoding on cap migration between mds
ceph: reset front len on return to msgpool; BUG on mismatched front iov
ceph: set osd request message front length correctly
ceph: return EBADF if waiting for caps on closed file
ceph: fix osdmap decoding when pools include (removed) snaps
ceph: include migrating caps in issued set
ceph: fix flush_dirty_caps race with caps migration

Yehuda Sadeh (23):
ceph: fix msgpool reservation leak
ceph: remove unaccessible code
ceph: writepage grabs and releases inode
ceph: writeback congestion control
ceph: fix copy_user_to_page_vector()
ceph: change dentry offset and position after splice_dentry
ceph: allocate middle of message before stating to read
ceph: refactor messages data section allocation
ceph: alloc message data pages and check if tid exists
ceph: keep reserved replies on the request structure
ceph: remove unreachable code
ceph: always send truncation info with read and write osd ops
ceph: remove unused variable
ceph: put unused osd connections on lru
ceph: fix short synchronous reads
ceph: refactor ceph_write_begin, fix ceph_page_mkwrite
ceph: fix truncation when not holding caps
ceph: sync read/write considers page cache
ceph: remove page upon writeback completion if lost cache cap
ceph: don't truncate dirty pages in invalidate work thread
ceph: cleanup redundant code in handle_cap_grant
ceph: don't clobber write return value when using O_SYNC
ceph: reset osd after relevant messages timed out

--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majo...@vger.kernel.org
More majordomo info at http://vger.kernel.org/majordomo-info.html
Please read the FAQ at http://www.tux.org/lkml/

Sage Weil

unread,

Mar 2, 2010, 8:00:03 PM3/2/10

File open and close operations, and read and write methods that ensure
we have obtained the proper capabilities from the MDS cluster before
performing IO on a file. We take references on held capabilities for
the duration of the read/write to avoid prematurely releasing them
back to the MDS.

We implement two main paths for read and write: one that is buffered
(and uses generic_aio_{read,write}), and one that is fully synchronous
and blocking (operating either on a __user pointer or, if O_DIRECT,
directly on user pages).

Signed-off-by: Sage Weil <sa...@newdream.net>
---
fs/ceph/file.c | 937 ++++++++++++++++++++++++++++++++++++++++++++++++++++++++
1 files changed, 937 insertions(+), 0 deletions(-)
create mode 100644 fs/ceph/file.c

diff --git a/fs/ceph/file.c b/fs/ceph/file.c
new file mode 100644
index 0000000..5d2af84
--- /dev/null
+++ b/fs/ceph/file.c
@@ -0,0 +1,937 @@
+#include "ceph_debug.h"
+
+#include <linux/sched.h>
+#include <linux/file.h>
+#include <linux/namei.h>
+#include <linux/writeback.h>
+
+#include "super.h"
+#include "mds_client.h"
+
+/*
+ * Ceph file operations
+ *
+ * Implement basic open/close functionality, and implement
+ * read/write.
+ *
+ * We implement three modes of file I/O:
+ * - buffered uses the generic_file_aio_{read,write} helpers
+ *
+ * - synchronous is used when there is multi-client read/write
+ * sharing, avoids the page cache, and synchronously waits for an
+ * ack from the OSD.
+ *
+ * - direct io takes the variant of the sync path that references
+ * user pages directly.
+ *
+ * fsync() flushes and waits on dirty pages, but just queues metadata
+ * for writeback: since the MDS can recover size and mtime there is no
+ * need to wait for MDS acknowledgement.
+ */
+
+
+/*
+ * Prepare an open request. Preallocate ceph_cap to avoid an
+ * inopportune ENOMEM later.
+ */
+static struct ceph_mds_request *
+prepare_open_request(struct super_block *sb, int flags, int create_mode)
+{
+ struct ceph_client *client = ceph_sb_to_client(sb);
+ struct ceph_mds_client *mdsc = &client->mdsc;
+ struct ceph_mds_request *req;
+ int want_auth = USE_ANY_MDS;
+ int op = (flags & O_CREAT) ? CEPH_MDS_OP_CREATE : CEPH_MDS_OP_OPEN;
+
+ if (flags & (O_WRONLY|O_RDWR|O_CREAT|O_TRUNC))
+ want_auth = USE_AUTH_MDS;
+
+ req = ceph_mdsc_create_request(mdsc, op, want_auth);
+ if (IS_ERR(req))
+ goto out;
+ req->r_fmode = ceph_flags_to_mode(flags);
+ req->r_args.open.flags = cpu_to_le32(flags);
+ req->r_args.open.mode = cpu_to_le32(create_mode);
+ req->r_args.open.preferred = cpu_to_le32(-1);
+out:
+ return req;
+}
+
+/*
+ * initialize private struct file data.
+ * if we fail, clean up by dropping fmode reference on the ceph_inode
+ */
+static int ceph_init_file(struct inode *inode, struct file *file, int fmode)
+{
+ struct ceph_file_info *cf;
+ int ret = 0;
+
+ switch (inode->i_mode & S_IFMT) {
+ case S_IFREG:
+ case S_IFDIR:
+ dout("init_file %p %p 0%o (regular)\n", inode, file,
+ inode->i_mode);
+ cf = kmem_cache_alloc(ceph_file_cachep, GFP_NOFS | __GFP_ZERO);
+ if (cf == NULL) {
+ ceph_put_fmode(ceph_inode(inode), fmode); /* clean up */
+ return -ENOMEM;
+ }
+ cf->fmode = fmode;
+ cf->next_offset = 2;
+ file->private_data = cf;
+ BUG_ON(inode->i_fop->release != ceph_release);
+ break;
+
+ case S_IFLNK:
+ dout("init_file %p %p 0%o (symlink)\n", inode, file,
+ inode->i_mode);
+ ceph_put_fmode(ceph_inode(inode), fmode); /* clean up */
+ break;
+
+ default:
+ dout("init_file %p %p 0%o (special)\n", inode, file,
+ inode->i_mode);
+ /*
+ * we need to drop the open ref now, since we don't
+ * have .release set to ceph_release.
+ */
+ ceph_put_fmode(ceph_inode(inode), fmode); /* clean up */
+ BUG_ON(inode->i_fop->release == ceph_release);
+
+ /* call the proper open fop */
+ ret = inode->i_fop->open(inode, file);
+ }
+ return ret;
+}
+
+/*
+ * If the filp already has private_data, that means the file was
+ * already opened by intent during lookup, and we do nothing.
+ *
+ * If we already have the requisite capabilities, we can satisfy
+ * the open request locally (no need to request new caps from the
+ * MDS). We do, however, need to inform the MDS (asynchronously)
+ * if our wanted caps set expands.
+ */
+int ceph_open(struct inode *inode, struct file *file)
+{
+ struct ceph_inode_info *ci = ceph_inode(inode);
+ struct ceph_client *client = ceph_sb_to_client(inode->i_sb);
+ struct ceph_mds_client *mdsc = &client->mdsc;
+ struct ceph_mds_request *req;
+ struct ceph_file_info *cf = file->private_data;
+ struct inode *parent_inode = file->f_dentry->d_parent->d_inode;
+ int err;
+ int flags, fmode, wanted;
+
+ if (cf) {
+ dout("open file %p is already opened\n", file);
+ return 0;
+ }
+
+ /* filter out O_CREAT|O_EXCL; vfs did that already. yuck. */
+ flags = file->f_flags & ~(O_CREAT|O_EXCL);
+ if (S_ISDIR(inode->i_mode))
+ flags = O_DIRECTORY; /* mds likes to know */
+
+ dout("open inode %p ino %llx.%llx file %p flags %d (%d)\n", inode,
+ ceph_vinop(inode), file, flags, file->f_flags);
+ fmode = ceph_flags_to_mode(flags);
+ wanted = ceph_caps_for_mode(fmode);
+
+ /* snapped files are read-only */
+ if (ceph_snap(inode) != CEPH_NOSNAP && (file->f_mode & FMODE_WRITE))
+ return -EROFS;
+
+ /* trivially open snapdir */
+ if (ceph_snap(inode) == CEPH_SNAPDIR) {
+ spin_lock(&inode->i_lock);
+ __ceph_get_fmode(ci, fmode);
+ spin_unlock(&inode->i_lock);
+ return ceph_init_file(inode, file, fmode);
+ }
+
+ /*
+ * No need to block if we have any caps. Update wanted set
+ * asynchronously.
+ */
+ spin_lock(&inode->i_lock);
+ if (__ceph_is_any_real_caps(ci)) {
+ int mds_wanted = __ceph_caps_mds_wanted(ci);
+ int issued = __ceph_caps_issued(ci, NULL);
+
+ dout("open %p fmode %d want %s issued %s using existing\n",
+ inode, fmode, ceph_cap_string(wanted),
+ ceph_cap_string(issued));
+ __ceph_get_fmode(ci, fmode);
+ spin_unlock(&inode->i_lock);
+
+ /* adjust wanted? */
+ if ((issued & wanted) != wanted &&
+ (mds_wanted & wanted) != wanted &&
+ ceph_snap(inode) != CEPH_SNAPDIR)
+ ceph_check_caps(ci, 0, NULL);
+
+ return ceph_init_file(inode, file, fmode);
+ } else if (ceph_snap(inode) != CEPH_NOSNAP &&
+ (ci->i_snap_caps & wanted) == wanted) {
+ __ceph_get_fmode(ci, fmode);
+ spin_unlock(&inode->i_lock);
+ return ceph_init_file(inode, file, fmode);
+ }
+ spin_unlock(&inode->i_lock);
+
+ dout("open fmode %d wants %s\n", fmode, ceph_cap_string(wanted));
+ req = prepare_open_request(inode->i_sb, flags, 0);
+ if (IS_ERR(req)) {
+ err = PTR_ERR(req);
+ goto out;
+ }
+ req->r_inode = igrab(inode);
+ req->r_num_caps = 1;
+ err = ceph_mdsc_do_request(mdsc, parent_inode, req);
+ if (!err)
+ err = ceph_init_file(inode, file, req->r_fmode);
+ ceph_mdsc_put_request(req);
+ dout("open result=%d on %llx.%llx\n", err, ceph_vinop(inode));
+out:
+ return err;
+}
+
+
+/*
+ * Do a lookup + open with a single request.
+ *
+ * If this succeeds, but some subsequent check in the vfs
+ * may_open() fails, the struct *file gets cleaned up (i.e.
+ * ceph_release gets called). So fear not!
+ */
+/*
+ * flags
+ * path_lookup_open -> LOOKUP_OPEN
+ * path_lookup_create -> LOOKUP_OPEN|LOOKUP_CREATE
+ */
+struct dentry *ceph_lookup_open(struct inode *dir, struct dentry *dentry,
+ struct nameidata *nd, int mode,
+ int locked_dir)
+{
+ struct ceph_client *client = ceph_sb_to_client(dir->i_sb);
+ struct ceph_mds_client *mdsc = &client->mdsc;
+ struct file *file = nd->intent.open.file;
+ struct inode *parent_inode = get_dentry_parent_inode(file->f_dentry);
+ struct ceph_mds_request *req;
+ int err;
+ int flags = nd->intent.open.flags - 1; /* silly vfs! */
+
+ dout("ceph_lookup_open dentry %p '%.*s' flags %d mode 0%o\n",
+ dentry, dentry->d_name.len, dentry->d_name.name, flags, mode);
+
+ /* do the open */
+ req = prepare_open_request(dir->i_sb, flags, mode);
+ if (IS_ERR(req))
+ return ERR_PTR(PTR_ERR(req));
+ req->r_dentry = dget(dentry);
+ req->r_num_caps = 2;
+ if (flags & O_CREAT) {
+ req->r_dentry_drop = CEPH_CAP_FILE_SHARED;
+ req->r_dentry_unless = CEPH_CAP_FILE_EXCL;
+ }
+ req->r_locked_dir = dir; /* caller holds dir->i_mutex */
+ err = ceph_mdsc_do_request(mdsc, parent_inode, req);
+ dentry = ceph_finish_lookup(req, dentry, err);
+ if (!err && (flags & O_CREAT) && !req->r_reply_info.head->is_dentry)
+ err = ceph_handle_notrace_create(dir, dentry);
+ if (!err)
+ err = ceph_init_file(req->r_dentry->d_inode, file,
+ req->r_fmode);
+ ceph_mdsc_put_request(req);
+ dout("ceph_lookup_open result=%p\n", dentry);
+ return dentry;
+}
+
+int ceph_release(struct inode *inode, struct file *file)
+{
+ struct ceph_inode_info *ci = ceph_inode(inode);
+ struct ceph_file_info *cf = file->private_data;
+
+ dout("release inode %p file %p\n", inode, file);
+ ceph_put_fmode(ci, cf->fmode);
+ if (cf->last_readdir)
+ ceph_mdsc_put_request(cf->last_readdir);
+ kfree(cf->last_name);
+ kfree(cf->dir_info);
+ dput(cf->dentry);
+ kmem_cache_free(ceph_file_cachep, cf);
+
+ /* wake up anyone waiting for caps on this inode */
+ wake_up(&ci->i_cap_wq);
+ return 0;
+}
+
+/*
+ * build a vector of user pages
+ */
+static struct page **get_direct_page_vector(const char __user *data,
+ int num_pages,
+ loff_t off, size_t len)
+{
+ struct page **pages;
+ int rc;
+
+ pages = kmalloc(sizeof(*pages) * num_pages, GFP_NOFS);
+ if (!pages)
+ return ERR_PTR(-ENOMEM);
+
+ down_read(&current->mm->mmap_sem);
+ rc = get_user_pages(current, current->mm, (unsigned long)data,
+ num_pages, 0, 0, pages, NULL);
+ up_read(&current->mm->mmap_sem);
+ if (rc < 0)
+ goto fail;
+ return pages;
+
+fail:
+ kfree(pages);
+ return ERR_PTR(rc);
+}
+
+static void put_page_vector(struct page **pages, int num_pages)
+{
+ int i;
+
+ for (i = 0; i < num_pages; i++)
+ put_page(pages[i]);
+ kfree(pages);
+}
+
+void ceph_release_page_vector(struct page **pages, int num_pages)
+{
+ int i;
+
+ for (i = 0; i < num_pages; i++)
+ __free_pages(pages[i], 0);
+ kfree(pages);
+}
+
+/*
+ * allocate a vector new pages
+ */
+static struct page **alloc_page_vector(int num_pages)
+{
+ struct page **pages;
+ int i;
+
+ pages = kmalloc(sizeof(*pages) * num_pages, GFP_NOFS);
+ if (!pages)
+ return ERR_PTR(-ENOMEM);
+ for (i = 0; i < num_pages; i++) {
+ pages[i] = alloc_page(GFP_NOFS);
+ if (pages[i] == NULL) {
+ ceph_release_page_vector(pages, i);
+ return ERR_PTR(-ENOMEM);
+ }
+ }
+ return pages;
+}
+
+/*
+ * copy user data into a page vector
+ */
+static int copy_user_to_page_vector(struct page **pages,
+ const char __user *data,
+ loff_t off, size_t len)
+{
+ int i = 0;
+ int po = off & ~PAGE_CACHE_MASK;
+ int left = len;
+ int l, bad;
+
+ while (left > 0) {
+ l = min_t(int, PAGE_CACHE_SIZE-po, left);
+ bad = copy_from_user(page_address(pages[i]) + po, data, l);
+ if (bad == l)
+ return -EFAULT;
+ data += l - bad;
+ left -= l - bad;
+ po += l - bad;
+ if (po == PAGE_CACHE_SIZE) {
+ po = 0;
+ i++;
+ }
+ }
+ return len;
+}
+
+/*
+ * copy user data from a page vector into a user pointer
+ */
+static int copy_page_vector_to_user(struct page **pages, char __user *data,
+ loff_t off, size_t len)
+{
+ int i = 0;
+ int po = off & ~PAGE_CACHE_MASK;
+ int left = len;
+ int l, bad;
+
+ while (left > 0) {
+ l = min_t(int, left, PAGE_CACHE_SIZE-po);
+ bad = copy_to_user(data, page_address(pages[i]) + po, l);
+ if (bad == l)
+ return -EFAULT;
+ data += l - bad;
+ left -= l - bad;
+ if (po) {
+ po += l - bad;
+ if (po == PAGE_CACHE_SIZE)
+ po = 0;
+ }
+ i++;
+ }
+ return len;
+}
+
+/*
+ * Zero an extent within a page vector. Offset is relative to the
+ * start of the first page.
+ */
+static void zero_page_vector_range(int off, int len, struct page **pages)
+{
+ int i = off >> PAGE_CACHE_SHIFT;
+
+ off &= ~PAGE_CACHE_MASK;
+
+ dout("zero_page_vector_page %u~%u\n", off, len);
+
+ /* leading partial page? */
+ if (off) {
+ int end = min((int)PAGE_CACHE_SIZE, off + len);
+ dout("zeroing %d %p head from %d\n", i, pages[i],
+ (int)off);
+ zero_user_segment(pages[i], off, end);
+ len -= (end - off);
+ i++;
+ }
+ while (len >= PAGE_CACHE_SIZE) {
+ dout("zeroing %d %p len=%d\n", i, pages[i], len);
+ zero_user_segment(pages[i], 0, PAGE_CACHE_SIZE);
+ len -= PAGE_CACHE_SIZE;
+ i++;
+ }
+ /* trailing partial page? */
+ if (len) {
+ dout("zeroing %d %p tail to %d\n", i, pages[i], (int)len);
+ zero_user_segment(pages[i], 0, len);
+ }
+}
+
+
+/*
+ * Read a range of bytes striped over one or more objects. Iterate over
+ * objects we stripe over. (That's not atomic, but good enough for now.)
+ *
+ * If we get a short result from the OSD, check against i_size; we need to
+ * only return a short read to the caller if we hit EOF.
+ */
+static int striped_read(struct inode *inode,
+ u64 off, u64 len,
+ struct page **pages, int num_pages,
+ int *checkeof)
+{
+ struct ceph_client *client = ceph_inode_to_client(inode);
+ struct ceph_inode_info *ci = ceph_inode(inode);
+ u64 pos, this_len;
+ int page_off = off & ~PAGE_CACHE_MASK; /* first byte's offset in page */
+ int left, pages_left;
+ int read;
+ struct page **page_pos;
+ int ret;
+ bool hit_stripe, was_short;
+
+ /*
+ * we may need to do multiple reads. not atomic, unfortunately.
+ */
+ pos = off;
+ left = len;
+ page_pos = pages;
+ pages_left = num_pages;
+ read = 0;
+
+more:
+ this_len = left;
+ ret = ceph_osdc_readpages(&client->osdc, ceph_vino(inode),
+ &ci->i_layout, pos, &this_len,
+ ci->i_truncate_seq,
+ ci->i_truncate_size,
+ page_pos, pages_left);
+ hit_stripe = this_len < left;
+ was_short = ret >= 0 && ret < this_len;
+ if (ret == -ENOENT)
+ ret = 0;
+ dout("striped_read %llu~%u (read %u) got %d%s%s\n", pos, left, read,
+ ret, hit_stripe ? " HITSTRIPE" : "", was_short ? " SHORT" : "");
+
+ if (ret > 0) {
+ int didpages =
+ ((pos & ~PAGE_CACHE_MASK) + ret) >> PAGE_CACHE_SHIFT;
+
+ if (read < pos - off) {
+ dout(" zero gap %llu to %llu\n", off + read, pos);
+ zero_page_vector_range(page_off + read,
+ pos - off - read, pages);
+ }
+ pos += ret;
+ read = pos - off;
+ left -= ret;
+ page_pos += didpages;
+ pages_left -= didpages;
+
+ /* hit stripe? */
+ if (left && hit_stripe)
+ goto more;
+ }
+
+ if (was_short) {
+ /* was original extent fully inside i_size? */
+ if (pos + left <= inode->i_size) {
+ dout("zero tail\n");
+ zero_page_vector_range(page_off + read, len - read,
+ pages);
+ read = len;
+ goto out;
+ }
+
+ /* check i_size */
+ *checkeof = 1;
+ }
+
+out:
+ if (ret >= 0)
+ ret = read;
+ dout("striped_read returns %d\n", ret);
+ return ret;
+}
+
+/*
+ * Completely synchronous read and write methods. Direct from __user
+ * buffer to osd, or directly to user pages (if O_DIRECT).
+ *
+ * If the read spans object boundary, just do multiple reads.
+ */
+static ssize_t ceph_sync_read(struct file *file, char __user *data,
+ unsigned len, loff_t *poff, int *checkeof)
+{
+ struct inode *inode = file->f_dentry->d_inode;
+ struct page **pages;
+ u64 off = *poff;
+ int num_pages = calc_pages_for(off, len);
+ int ret;
+
+ dout("sync_read on file %p %llu~%u %s\n", file, off, len,
+ (file->f_flags & O_DIRECT) ? "O_DIRECT" : "");
+
+ if (file->f_flags & O_DIRECT) {
+ pages = get_direct_page_vector(data, num_pages, off, len);
+
+ /*
+ * flush any page cache pages in this range. this
+ * will make concurrent normal and O_DIRECT io slow,
+ * but it will at least behave sensibly when they are
+ * in sequence.
+ */
+ } else {
+ pages = alloc_page_vector(num_pages);
+ }
+ if (IS_ERR(pages))
+ return PTR_ERR(pages);
+
+ ret = filemap_write_and_wait(inode->i_mapping);
+ if (ret < 0)
+ goto done;
+
+ ret = striped_read(inode, off, len, pages, num_pages, checkeof);
+
+ if (ret >= 0 && (file->f_flags & O_DIRECT) == 0)
+ ret = copy_page_vector_to_user(pages, data, off, ret);
+ if (ret >= 0)
+ *poff = off + ret;
+
+done:
+ if (file->f_flags & O_DIRECT)
+ put_page_vector(pages, num_pages);
+ else
+ ceph_release_page_vector(pages, num_pages);
+ dout("sync_read result %d\n", ret);
+ return ret;
+}
+
+/*
+ * Write commit callback, called if we requested both an ACK and
+ * ONDISK commit reply from the OSD.
+ */
+static void sync_write_commit(struct ceph_osd_request *req,
+ struct ceph_msg *msg)
+{
+ struct ceph_inode_info *ci = ceph_inode(req->r_inode);
+
+ dout("sync_write_commit %p tid %llu\n", req, req->r_tid);
+ spin_lock(&ci->i_unsafe_lock);
+ list_del_init(&req->r_unsafe_item);
+ spin_unlock(&ci->i_unsafe_lock);
+ ceph_put_cap_refs(ci, CEPH_CAP_FILE_WR);
+}
+
+/*
+ * Synchronous write, straight from __user pointer or user pages (if
+ * O_DIRECT).
+ *
+ * If write spans object boundary, just do multiple writes. (For a
+ * correct atomic write, we should e.g. take write locks on all
+ * objects, rollback on failure, etc.)
+ */
+static ssize_t ceph_sync_write(struct file *file, const char __user *data,
+ size_t left, loff_t *offset)
+{
+ struct inode *inode = file->f_dentry->d_inode;
+ struct ceph_inode_info *ci = ceph_inode(inode);
+ struct ceph_client *client = ceph_inode_to_client(inode);
+ struct ceph_osd_request *req;
+ struct page **pages;
+ int num_pages;
+ long long unsigned pos;
+ u64 len;
+ int written = 0;
+ int flags;
+ int do_sync = 0;
+ int check_caps = 0;
+ int ret;
+ struct timespec mtime = CURRENT_TIME;
+
+ if (ceph_snap(file->f_dentry->d_inode) != CEPH_NOSNAP)
+ return -EROFS;
+
+ dout("sync_write on file %p %lld~%u %s\n", file, *offset,
+ (unsigned)left, (file->f_flags & O_DIRECT) ? "O_DIRECT" : "");
+
+ if (file->f_flags & O_APPEND)
+ pos = i_size_read(inode);
+ else
+ pos = *offset;
+
+ ret = filemap_write_and_wait_range(inode->i_mapping, pos, pos + left);
+ if (ret < 0)
+ return ret;
+
+ ret = invalidate_inode_pages2_range(inode->i_mapping,
+ pos >> PAGE_CACHE_SHIFT,
+ (pos + left) >> PAGE_CACHE_SHIFT);
+ if (ret < 0)
+ dout("invalidate_inode_pages2_range returned %d\n", ret);
+
+ flags = CEPH_OSD_FLAG_ORDERSNAP |
+ CEPH_OSD_FLAG_ONDISK |
+ CEPH_OSD_FLAG_WRITE;
+ if ((file->f_flags & (O_SYNC|O_DIRECT)) == 0)
+ flags |= CEPH_OSD_FLAG_ACK;
+ else
+ do_sync = 1;
+
+ /*
+ * we may need to do multiple writes here if we span an object
+ * boundary. this isn't atomic, unfortunately. :(
+ */
+more:
+ len = left;
+ req = ceph_osdc_new_request(&client->osdc, &ci->i_layout,
+ ceph_vino(inode), pos, &len,
+ CEPH_OSD_OP_WRITE, flags,
+ ci->i_snap_realm->cached_context,
+ do_sync,
+ ci->i_truncate_seq, ci->i_truncate_size,
+ &mtime, false, 2);
+ if (IS_ERR(req))
+ return PTR_ERR(req);
+
+ num_pages = calc_pages_for(pos, len);
+
+ if (file->f_flags & O_DIRECT) {
+ pages = get_direct_page_vector(data, num_pages, pos, len);
+ if (IS_ERR(pages)) {
+ ret = PTR_ERR(pages);
+ goto out;
+ }
+
+ /*
+ * throw out any page cache pages in this range. this
+ * may block.
+ */
+ truncate_inode_pages_range(inode->i_mapping, pos, pos+len);
+ } else {
+ pages = alloc_page_vector(num_pages);
+ if (IS_ERR(pages)) {
+ ret = PTR_ERR(pages);
+ goto out;
+ }
+ ret = copy_user_to_page_vector(pages, data, pos, len);
+ if (ret < 0) {
+ ceph_release_page_vector(pages, num_pages);
+ goto out;
+ }
+
+ if ((file->f_flags & O_SYNC) == 0) {
+ /* get a second commit callback */
+ req->r_safe_callback = sync_write_commit;
+ req->r_own_pages = 1;
+ }
+ }
+ req->r_pages = pages;
+ req->r_num_pages = num_pages;
+ req->r_inode = inode;
+
+ ret = ceph_osdc_start_request(&client->osdc, req, false);
+ if (!ret) {
+ if (req->r_safe_callback) {
+ /*
+ * Add to inode unsafe list only after we
+ * start_request so that a tid has been assigned.
+ */
+ spin_lock(&ci->i_unsafe_lock);
+ list_add(&ci->i_unsafe_writes, &req->r_unsafe_item);
+ spin_unlock(&ci->i_unsafe_lock);
+ ceph_get_cap_refs(ci, CEPH_CAP_FILE_WR);
+ }
+ ret = ceph_osdc_wait_request(&client->osdc, req);
+ }
+
+ if (file->f_flags & O_DIRECT)
+ put_page_vector(pages, num_pages);
+ else if (file->f_flags & O_SYNC)
+ ceph_release_page_vector(pages, num_pages);
+
+out:
+ ceph_osdc_put_request(req);
+ if (ret == 0) {
+ pos += len;
+ written += len;
+ left -= len;
+ if (left)
+ goto more;
+
+ ret = written;
+ *offset = pos;
+ if (pos > i_size_read(inode))
+ check_caps = ceph_inode_set_size(inode, pos);
+ if (check_caps)
+ ceph_check_caps(ceph_inode(inode), CHECK_CAPS_AUTHONLY,
+ NULL);
+ }
+ return ret;
+}
+
+/*
+ * Wrap generic_file_aio_read with checks for cap bits on the inode.
+ * Atomically grab references, so that those bits are not released
+ * back to the MDS mid-read.
+ *
+ * Hmm, the sync read case isn't actually async... should it be?
+ */
+static ssize_t ceph_aio_read(struct kiocb *iocb, const struct iovec *iov,
+ unsigned long nr_segs, loff_t pos)
+{
+ struct file *filp = iocb->ki_filp;
+ loff_t *ppos = &iocb->ki_pos;
+ size_t len = iov->iov_len;
+ struct inode *inode = filp->f_dentry->d_inode;
+ struct ceph_inode_info *ci = ceph_inode(inode);
+ void *base = iov->iov_base;
+ ssize_t ret;
+ int got = 0;
+ int checkeof = 0, read = 0;
+
+ dout("aio_read %p %llx.%llx %llu~%u trying to get caps on %p\n",
+ inode, ceph_vinop(inode), pos, (unsigned)len, inode);
+again:
+ __ceph_do_pending_vmtruncate(inode);
+ ret = ceph_get_caps(ci, CEPH_CAP_FILE_RD, CEPH_CAP_FILE_CACHE,
+ &got, -1);
+ if (ret < 0)
+ goto out;
+ dout("aio_read %p %llx.%llx %llu~%u got cap refs on %s\n",
+ inode, ceph_vinop(inode), pos, (unsigned)len,
+ ceph_cap_string(got));
+
+ if ((got & CEPH_CAP_FILE_CACHE) == 0 ||
+ (iocb->ki_filp->f_flags & O_DIRECT) ||
+ (inode->i_sb->s_flags & MS_SYNCHRONOUS))
+ /* hmm, this isn't really async... */
+ ret = ceph_sync_read(filp, base, len, ppos, &checkeof);
+ else
+ ret = generic_file_aio_read(iocb, iov, nr_segs, pos);
+
+out:
+ dout("aio_read %p %llx.%llx dropping cap refs on %s = %d\n",
+ inode, ceph_vinop(inode), ceph_cap_string(got), (int)ret);
+ ceph_put_cap_refs(ci, got);
+
+ if (checkeof && ret >= 0) {
+ int statret = ceph_do_getattr(inode, CEPH_STAT_CAP_SIZE);
+
+ /* hit EOF or hole? */
+ if (statret == 0 && *ppos < inode->i_size) {
+ dout("aio_read sync_read hit hole, reading more\n");
+ read += ret;
+ base += ret;
+ len -= ret;
+ checkeof = 0;
+ goto again;
+ }
+ }
+ if (ret >= 0)
+ ret += read;
+
+ return ret;
+}
+
+/*
+ * Take cap references to avoid releasing caps to MDS mid-write.
+ *
+ * If we are synchronous, and write with an old snap context, the OSD
+ * may return EOLDSNAPC. In that case, retry the write.. _after_
+ * dropping our cap refs and allowing the pending snap to logically
+ * complete _before_ this write occurs.
+ *
+ * If we are near ENOSPC, write synchronously.
+ */
+static ssize_t ceph_aio_write(struct kiocb *iocb, const struct iovec *iov,
+ unsigned long nr_segs, loff_t pos)
+{
+ struct file *file = iocb->ki_filp;
+ struct inode *inode = file->f_dentry->d_inode;
+ struct ceph_inode_info *ci = ceph_inode(inode);
+ struct ceph_osd_client *osdc = &ceph_client(inode->i_sb)->osdc;
+ loff_t endoff = pos + iov->iov_len;
+ int got = 0;
+ int ret, err;
+
+ if (ceph_snap(inode) != CEPH_NOSNAP)
+ return -EROFS;
+
+retry_snap:
+ if (ceph_osdmap_flag(osdc->osdmap, CEPH_OSDMAP_FULL))
+ return -ENOSPC;
+ __ceph_do_pending_vmtruncate(inode);
+ dout("aio_write %p %llx.%llx %llu~%u getting caps. i_size %llu\n",
+ inode, ceph_vinop(inode), pos, (unsigned)iov->iov_len,
+ inode->i_size);
+ ret = ceph_get_caps(ci, CEPH_CAP_FILE_WR, CEPH_CAP_FILE_BUFFER,
+ &got, endoff);
+ if (ret < 0)
+ goto out;
+
+ dout("aio_write %p %llx.%llx %llu~%u got cap refs on %s\n",
+ inode, ceph_vinop(inode), pos, (unsigned)iov->iov_len,
+ ceph_cap_string(got));
+
+ if ((got & CEPH_CAP_FILE_BUFFER) == 0 ||
+ (iocb->ki_filp->f_flags & O_DIRECT) ||
+ (inode->i_sb->s_flags & MS_SYNCHRONOUS)) {
+ ret = ceph_sync_write(file, iov->iov_base, iov->iov_len,
+ &iocb->ki_pos);
+ } else {
+ ret = generic_file_aio_write(iocb, iov, nr_segs, pos);
+
+ if ((ret >= 0 || ret == -EIOCBQUEUED) &&
+ ((file->f_flags & O_SYNC) || IS_SYNC(file->f_mapping->host)
+ || ceph_osdmap_flag(osdc->osdmap, CEPH_OSDMAP_NEARFULL))) {
+ err = vfs_fsync_range(file, file->f_path.dentry,
+ pos, pos + ret - 1, 1);
+ if (err < 0)
+ ret = err;
+ }
+ }
+ if (ret >= 0) {
+ spin_lock(&inode->i_lock);
+ __ceph_mark_dirty_caps(ci, CEPH_CAP_FILE_WR);
+ spin_unlock(&inode->i_lock);
+ }
+
+out:
+ dout("aio_write %p %llx.%llx %llu~%u dropping cap refs on %s\n",
+ inode, ceph_vinop(inode), pos, (unsigned)iov->iov_len,
+ ceph_cap_string(got));
+ ceph_put_cap_refs(ci, got);
+
+ if (ret == -EOLDSNAPC) {
+ dout("aio_write %p %llx.%llx %llu~%u got EOLDSNAPC, retrying\n",
+ inode, ceph_vinop(inode), pos, (unsigned)iov->iov_len);
+ goto retry_snap;
+ }
+
+ return ret;
+}
+
+/*
+ * llseek. be sure to verify file size on SEEK_END.
+ */
+static loff_t ceph_llseek(struct file *file, loff_t offset, int origin)
+{
+ struct inode *inode = file->f_mapping->host;
+ int ret;
+
+ mutex_lock(&inode->i_mutex);
+ __ceph_do_pending_vmtruncate(inode);
+ switch (origin) {
+ case SEEK_END:
+ ret = ceph_do_getattr(inode, CEPH_STAT_CAP_SIZE);
+ if (ret < 0) {
+ offset = ret;
+ goto out;
+ }
+ offset += inode->i_size;
+ break;
+ case SEEK_CUR:
+ /*
+ * Here we special-case the lseek(fd, 0, SEEK_CUR)
+ * position-querying operation. Avoid rewriting the "same"
+ * f_pos value back to the file because a concurrent read(),
+ * write() or lseek() might have altered it
+ */
+ if (offset == 0) {
+ offset = file->f_pos;
+ goto out;
+ }
+ offset += file->f_pos;
+ break;
+ }
+
+ if (offset < 0 || offset > inode->i_sb->s_maxbytes) {
+ offset = -EINVAL;
+ goto out;
+ }
+
+ /* Special lock needed here? */
+ if (offset != file->f_pos) {
+ file->f_pos = offset;
+ file->f_version = 0;
+ }
+
+out:
+ mutex_unlock(&inode->i_mutex);
+ return offset;
+}
+
+const struct file_operations ceph_file_fops = {
+ .open = ceph_open,
+ .release = ceph_release,
+ .llseek = ceph_llseek,
+ .read = do_sync_read,
+ .write = do_sync_write,
+ .aio_read = ceph_aio_read,
+ .aio_write = ceph_aio_write,
+ .mmap = ceph_mmap,
+ .fsync = ceph_fsync,
+ .splice_read = generic_file_splice_read,
+ .splice_write = generic_file_splice_write,
+ .unlocked_ioctl = ceph_ioctl,
+ .compat_ioctl = ceph_ioctl,
+};
+
--
1.7.0

Sage Weil

unread,

Mar 2, 2010, 8:00:03 PM3/2/10

Define the authentication interface and abstract interface for
authentication protocols. This allows us to implement different
authentication schemes, including a trivial scheme that performs
no actual authentication. A proper implementation allows us to
authenticate ourselves once with the monitor, and use a ticket to
access services (MDS, OSD) with mutual authentication.

When the client initially connects to the monitor, it lists
supported authentication protocols and the monitor indicates
which will be used for the session.

Signed-off-by: Sage Weil <sa...@newdream.net>
---

fs/ceph/auth.c | 257 ++++++++++++++++++++++++++++++++++++++++++++++++++++++++
fs/ceph/auth.h | 84 ++++++++++++++++++
2 files changed, 341 insertions(+), 0 deletions(-)
create mode 100644 fs/ceph/auth.c
create mode 100644 fs/ceph/auth.h

diff --git a/fs/ceph/auth.c b/fs/ceph/auth.c
new file mode 100644
index 0000000..abb204f
--- /dev/null
+++ b/fs/ceph/auth.c
@@ -0,0 +1,257 @@
+#include "ceph_debug.h"
+
+#include <linux/module.h>
+#include <linux/err.h>
+
+#include "types.h"
+#include "auth_none.h"
+#include "auth_x.h"
+#include "decode.h"
+#include "super.h"
+
+#include "messenger.h"
+
+/*
+ * get protocol handler
+ */
+static u32 supported_protocols[] = {
+ CEPH_AUTH_NONE,
+ CEPH_AUTH_CEPHX
+};
+
+int ceph_auth_init_protocol(struct ceph_auth_client *ac, int protocol)
+{
+ switch (protocol) {
+ case CEPH_AUTH_NONE:
+ return ceph_auth_none_init(ac);
+ case CEPH_AUTH_CEPHX:
+ return ceph_x_init(ac);
+ default:
+ return -ENOENT;
+ }
+}
+
+/*
+ * setup, teardown.
+ */
+struct ceph_auth_client *ceph_auth_init(const char *name, const char *secret)
+{
+ struct ceph_auth_client *ac;
+ int ret;
+
+ dout("auth_init name '%s' secret '%s'\n", name, secret);
+
+ ret = -ENOMEM;
+ ac = kzalloc(sizeof(*ac), GFP_NOFS);
+ if (!ac)
+ goto out;
+
+ ac->negotiating = true;
+ if (name)
+ ac->name = name;
+ else
+ ac->name = CEPH_AUTH_NAME_DEFAULT;
+ dout("auth_init name %s secret %s\n", ac->name, secret);
+ ac->secret = secret;
+ return ac;
+
+out:
+ return ERR_PTR(ret);
+}
+
+void ceph_auth_destroy(struct ceph_auth_client *ac)
+{
+ dout("auth_destroy %p\n", ac);
+ if (ac->ops)
+ ac->ops->destroy(ac);
+ kfree(ac);
+}
+
+/*
+ * Reset occurs when reconnecting to the monitor.
+ */
+void ceph_auth_reset(struct ceph_auth_client *ac)
+{
+ dout("auth_reset %p\n", ac);
+ if (ac->ops && !ac->negotiating)
+ ac->ops->reset(ac);
+ ac->negotiating = true;
+}
+
+int ceph_entity_name_encode(const char *name, void **p, void *end)
+{
+ int len = strlen(name);
+
+ if (*p + 2*sizeof(u32) + len > end)
+ return -ERANGE;
+ ceph_encode_32(p, CEPH_ENTITY_TYPE_CLIENT);
+ ceph_encode_32(p, len);
+ ceph_encode_copy(p, name, len);

+ return 0;
+}
+
+/*

+ * Initiate protocol negotiation with monitor. Include entity name
+ * and list supported protocols.
+ */
+int ceph_auth_build_hello(struct ceph_auth_client *ac, void *buf, size_t len)
+{
+ struct ceph_mon_request_header *monhdr = buf;
+ void *p = monhdr + 1, *end = buf + len, *lenp;
+ int i, num;
+ int ret;
+
+ dout("auth_build_hello\n");
+ monhdr->have_version = 0;
+ monhdr->session_mon = cpu_to_le16(-1);
+ monhdr->session_mon_tid = 0;
+
+ ceph_encode_32(&p, 0); /* no protocol, yet */
+
+ lenp = p;
+ p += sizeof(u32);
+
+ ceph_decode_need(&p, end, 1 + sizeof(u32), bad);
+ ceph_encode_8(&p, 1);
+ num = ARRAY_SIZE(supported_protocols);
+ ceph_encode_32(&p, num);
+ ceph_decode_need(&p, end, num * sizeof(u32), bad);
+ for (i = 0; i < num; i++)
+ ceph_encode_32(&p, supported_protocols[i]);
+
+ ret = ceph_entity_name_encode(ac->name, &p, end);

+ if (ret < 0)
+ return ret;

+ ceph_decode_need(&p, end, sizeof(u64), bad);
+ ceph_encode_64(&p, ac->global_id);
+
+ ceph_encode_32(&lenp, p - lenp - sizeof(u32));
+ return p - buf;
+
+bad:
+ return -ERANGE;
+}
+
+int ceph_build_auth_request(struct ceph_auth_client *ac,
+ void *msg_buf, size_t msg_len)
+{
+ struct ceph_mon_request_header *monhdr = msg_buf;
+ void *p = monhdr + 1;
+ void *end = msg_buf + msg_len;
+ int ret;
+
+ monhdr->have_version = 0;
+ monhdr->session_mon = cpu_to_le16(-1);
+ monhdr->session_mon_tid = 0;
+
+ ceph_encode_32(&p, ac->protocol);
+
+ ret = ac->ops->build_request(ac, p + sizeof(u32), end);
+ if (ret < 0) {
+ pr_err("error %d building request\n", ret);
+ return ret;
+ }
+ dout(" built request %d bytes\n", ret);
+ ceph_encode_32(&p, ret);
+ return p + ret - msg_buf;
+}
+
+/*
+ * Handle auth message from monitor.
+ */
+int ceph_handle_auth_reply(struct ceph_auth_client *ac,
+ void *buf, size_t len,
+ void *reply_buf, size_t reply_len)
+{
+ void *p = buf;
+ void *end = buf + len;
+ int protocol;
+ s32 result;
+ u64 global_id;
+ void *payload, *payload_end;
+ int payload_len;
+ char *result_msg;
+ int result_msg_len;
+ int ret = -EINVAL;
+
+ dout("handle_auth_reply %p %p\n", p, end);
+ ceph_decode_need(&p, end, sizeof(u32) * 3 + sizeof(u64), bad);
+ protocol = ceph_decode_32(&p);
+ result = ceph_decode_32(&p);
+ global_id = ceph_decode_64(&p);
+ payload_len = ceph_decode_32(&p);
+ payload = p;
+ p += payload_len;
+ ceph_decode_need(&p, end, sizeof(u32), bad);
+ result_msg_len = ceph_decode_32(&p);
+ result_msg = p;
+ p += result_msg_len;
+ if (p != end)
+ goto bad;
+
+ dout(" result %d '%.*s' gid %llu len %d\n", result, result_msg_len,
+ result_msg, global_id, payload_len);
+
+ payload_end = payload + payload_len;
+
+ if (global_id && ac->global_id != global_id) {
+ dout(" set global_id %lld -> %lld\n", ac->global_id, global_id);
+ ac->global_id = global_id;
+ }
+
+ if (ac->negotiating) {
+ /* server does not support our protocols? */
+ if (!protocol && result < 0) {
+ ret = result;
+ goto out;
+ }
+ /* set up (new) protocol handler? */
+ if (ac->protocol && ac->protocol != protocol) {
+ ac->ops->destroy(ac);
+ ac->protocol = 0;
+ ac->ops = NULL;
+ }
+ if (ac->protocol != protocol) {
+ ret = ceph_auth_init_protocol(ac, protocol);
+ if (ret) {
+ pr_err("error %d on auth protocol %d init\n",
+ ret, protocol);

+ goto out;
+ }
+ }
+

+ ac->negotiating = false;
+ }
+
+ ret = ac->ops->handle_reply(ac, result, payload, payload_end);
+ if (ret == -EAGAIN) {
+ return ceph_build_auth_request(ac, reply_buf, reply_len);
+ } else if (ret) {
+ pr_err("authentication error %d\n", ret);
+ return ret;
+ }
+ return 0;
+
+bad:
+ pr_err("failed to decode auth msg\n");
+out:

+ return ret;
+}
+

+int ceph_build_auth(struct ceph_auth_client *ac,
+ void *msg_buf, size_t msg_len)
+{
+ if (!ac->protocol)
+ return ceph_auth_build_hello(ac, msg_buf, msg_len);
+ BUG_ON(!ac->ops);
+ if (!ac->ops->is_authenticated(ac))
+ return ceph_build_auth_request(ac, msg_buf, msg_len);

+ return 0;
+}
+

+int ceph_auth_is_authenticated(struct ceph_auth_client *ac)
+{
+ if (!ac->ops)
+ return 0;
+ return ac->ops->is_authenticated(ac);
+}
diff --git a/fs/ceph/auth.h b/fs/ceph/auth.h
new file mode 100644
index 0000000..ca4f57c
--- /dev/null
+++ b/fs/ceph/auth.h
@@ -0,0 +1,84 @@
+#ifndef _FS_CEPH_AUTH_H
+#define _FS_CEPH_AUTH_H
+
+#include "types.h"
+#include "buffer.h"
+
+/*
+ * Abstract interface for communicating with the authenticate module.
+ * There is some handshake that takes place between us and the monitor
+ * to acquire the necessary keys. These are used to generate an
+ * 'authorizer' that we use when connecting to a service (mds, osd).
+ */
+
+struct ceph_auth_client;
+struct ceph_authorizer;
+
+struct ceph_auth_client_ops {
+ /*
+ * true if we are authenticated and can connect to
+ * services.
+ */
+ int (*is_authenticated)(struct ceph_auth_client *ac);
+
+ /*
+ * build requests and process replies during monitor
+ * handshake. if handle_reply returns -EAGAIN, we build
+ * another request.
+ */
+ int (*build_request)(struct ceph_auth_client *ac, void *buf, void *end);
+ int (*handle_reply)(struct ceph_auth_client *ac, int result,
+ void *buf, void *end);
+
+ /*
+ * Create authorizer for connecting to a service, and verify
+ * the response to authenticate the service.
+ */
+ int (*create_authorizer)(struct ceph_auth_client *ac, int peer_type,
+ struct ceph_authorizer **a,
+ void **buf, size_t *len,
+ void **reply_buf, size_t *reply_len);
+ int (*verify_authorizer_reply)(struct ceph_auth_client *ac,
+ struct ceph_authorizer *a, size_t len);
+ void (*destroy_authorizer)(struct ceph_auth_client *ac,
+ struct ceph_authorizer *a);
+ void (*invalidate_authorizer)(struct ceph_auth_client *ac,
+ int peer_type);
+
+ /* reset when we (re)connect to a monitor */
+ void (*reset)(struct ceph_auth_client *ac);
+
+ void (*destroy)(struct ceph_auth_client *ac);
+};
+
+struct ceph_auth_client {
+ u32 protocol; /* CEPH_AUTH_* */
+ void *private; /* for use by protocol implementation */
+ const struct ceph_auth_client_ops *ops; /* null iff protocol==0 */
+
+ bool negotiating; /* true if negotiating protocol */
+ const char *name; /* entity name */
+ u64 global_id; /* our unique id in system */
+ const char *secret; /* our secret key */
+ unsigned want_keys; /* which services we want */
+};
+
+extern struct ceph_auth_client *ceph_auth_init(const char *name,
+ const char *secret);
+extern void ceph_auth_destroy(struct ceph_auth_client *ac);
+
+extern void ceph_auth_reset(struct ceph_auth_client *ac);
+
+extern int ceph_auth_build_hello(struct ceph_auth_client *ac,
+ void *buf, size_t len);
+extern int ceph_handle_auth_reply(struct ceph_auth_client *ac,
+ void *buf, size_t len,
+ void *reply_buf, size_t reply_len);
+extern int ceph_entity_name_encode(const char *name, void **p, void *end);
+
+extern int ceph_build_auth(struct ceph_auth_client *ac,
+ void *msg_buf, size_t msg_len);
+
+extern int ceph_auth_is_authenticated(struct ceph_auth_client *ac);
+
+#endif
--
1.7.0

Sage Weil

unread,

Mar 2, 2010, 8:00:04 PM3/2/10

Basic NFS re-export support is included. This mostly works. However,
Ceph's MDS design precludes the ability to generate a (small)
filehandle that will be valid forever, so this is of limited utility.

Signed-off-by: Sage Weil <sa...@newdream.net>
---

fs/ceph/export.c | 223 ++++++++++++++++++++++++++++++++++++++++++++++++++++++
1 files changed, 223 insertions(+), 0 deletions(-)
create mode 100644 fs/ceph/export.c

diff --git a/fs/ceph/export.c b/fs/ceph/export.c
new file mode 100644
index 0000000..fc68e39
--- /dev/null
+++ b/fs/ceph/export.c
@@ -0,0 +1,223 @@
+#include "ceph_debug.h"
+
+#include <linux/exportfs.h>
+#include <asm/unaligned.h>
+
+#include "super.h"
+
+/*
+ * NFS export support
+ *
+ * NFS re-export of a ceph mount is, at present, only semireliable.
+ * The basic issue is that the Ceph architectures doesn't lend itself
+ * well to generating filehandles that will remain valid forever.
+ *
+ * So, we do our best. If you're lucky, your inode will be in the
+ * client's cache. If it's not, and you have a connectable fh, then
+ * the MDS server may be able to find it for you. Otherwise, you get
+ * ESTALE.
+ *
+ * There are ways to this more reliable, but in the non-connectable fh
+ * case, we won't every work perfectly, and in the connectable case,
+ * some changes are needed on the MDS side to work better.
+ */
+
+/*
+ * Basic fh
+ */
+struct ceph_nfs_fh {
+ u64 ino;
+} __attribute__ ((packed));
+
+/*
+ * Larger 'connectable' fh that includes parent ino and name hash.
+ * Use this whenever possible, as it works more reliably.
+ */
+struct ceph_nfs_confh {
+ u64 ino, parent_ino;
+ u32 parent_name_hash;
+} __attribute__ ((packed));
+
+static int ceph_encode_fh(struct dentry *dentry, u32 *rawfh, int *max_len,
+ int connectable)
+{
+ struct ceph_nfs_fh *fh = (void *)rawfh;
+ struct ceph_nfs_confh *cfh = (void *)rawfh;
+ struct dentry *parent = dentry->d_parent;
+ struct inode *inode = dentry->d_inode;
+ int type;
+
+ /* don't re-export snaps */
+ if (ceph_snap(inode) != CEPH_NOSNAP)
+ return -EINVAL;
+
+ if (*max_len >= sizeof(*cfh)) {
+ dout("encode_fh %p connectable\n", dentry);
+ cfh->ino = ceph_ino(dentry->d_inode);
+ cfh->parent_ino = ceph_ino(parent->d_inode);
+ cfh->parent_name_hash = parent->d_name.hash;
+ *max_len = sizeof(*cfh);
+ type = 2;
+ } else if (*max_len > sizeof(*fh)) {
+ if (connectable)
+ return -ENOSPC;
+ dout("encode_fh %p\n", dentry);
+ fh->ino = ceph_ino(dentry->d_inode);
+ *max_len = sizeof(*fh);
+ type = 1;
+ } else {
+ return -ENOSPC;
+ }
+ return type;
+}
+
+/*
+ * convert regular fh to dentry
+ *
+ * FIXME: we should try harder by querying the mds for the ino.
+ */
+static struct dentry *__fh_to_dentry(struct super_block *sb,
+ struct ceph_nfs_fh *fh)
+{
+ struct inode *inode;
+ struct dentry *dentry;
+ struct ceph_vino vino;
+ int err;
+
+ dout("__fh_to_dentry %llx\n", fh->ino);
+ vino.ino = fh->ino;
+ vino.snap = CEPH_NOSNAP;
+ inode = ceph_find_inode(sb, vino);
+ if (!inode)
+ return ERR_PTR(-ESTALE);
+
+ dentry = d_obtain_alias(inode);
+ if (!dentry) {
+ pr_err("fh_to_dentry %llx -- inode %p but ENOMEM\n",
+ fh->ino, inode);
+ iput(inode);
+ return ERR_PTR(-ENOMEM);
+ }
+ err = ceph_init_dentry(dentry);
+
+ if (err < 0) {
+ iput(inode);
+ return ERR_PTR(err);
+ }
+ dout("__fh_to_dentry %llx %p dentry %p\n", fh->ino, inode, dentry);
+ return dentry;
+}
+
+/*
+ * convert connectable fh to dentry
+ */
+static struct dentry *__cfh_to_dentry(struct super_block *sb,
+ struct ceph_nfs_confh *cfh)
+{
+ struct ceph_mds_client *mdsc = &ceph_client(sb)->mdsc;
+ struct inode *inode;
+ struct dentry *dentry;
+ struct ceph_vino vino;
+ int err;
+
+ dout("__cfh_to_dentry %llx (%llx/%x)\n",
+ cfh->ino, cfh->parent_ino, cfh->parent_name_hash);
+
+ vino.ino = cfh->ino;
+ vino.snap = CEPH_NOSNAP;
+ inode = ceph_find_inode(sb, vino);
+ if (!inode) {

+ struct ceph_mds_request *req;
+

+ req = ceph_mdsc_create_request(mdsc, CEPH_MDS_OP_LOOKUPHASH,
+ USE_ANY_MDS);

+ if (IS_ERR(req))
+ return ERR_PTR(PTR_ERR(req));
+

+ req->r_ino1 = vino;
+ req->r_ino2.ino = cfh->parent_ino;
+ req->r_ino2.snap = CEPH_NOSNAP;
+ req->r_path2 = kmalloc(16, GFP_NOFS);
+ snprintf(req->r_path2, 16, "%d", cfh->parent_name_hash);

+ req->r_num_caps = 1;

+ err = ceph_mdsc_do_request(mdsc, NULL, req);
+ ceph_mdsc_put_request(req);
+ inode = ceph_find_inode(sb, vino);
+ if (!inode)
+ return ERR_PTR(err ? err : -ESTALE);
+ }
+
+ dentry = d_obtain_alias(inode);
+ if (!dentry) {
+ pr_err("cfh_to_dentry %llx -- inode %p but ENOMEM\n",
+ cfh->ino, inode);
+ iput(inode);
+ return ERR_PTR(-ENOMEM);
+ }
+ err = ceph_init_dentry(dentry);
+ if (err < 0) {
+ iput(inode);
+ return ERR_PTR(err);
+ }
+ dout("__cfh_to_dentry %llx %p dentry %p\n", cfh->ino, inode, dentry);

+ return dentry;
+}
+

+static struct dentry *ceph_fh_to_dentry(struct super_block *sb, struct fid *fid,
+ int fh_len, int fh_type)
+{
+ if (fh_type == 1)
+ return __fh_to_dentry(sb, (struct ceph_nfs_fh *)fid->raw);
+ else
+ return __cfh_to_dentry(sb, (struct ceph_nfs_confh *)fid->raw);
+}
+
+/*
+ * get parent, if possible.
+ *
+ * FIXME: we could do better by querying the mds to discover the
+ * parent.
+ */
+static struct dentry *ceph_fh_to_parent(struct super_block *sb,
+ struct fid *fid,
+ int fh_len, int fh_type)
+{
+ struct ceph_nfs_confh *cfh = (void *)fid->raw;
+ struct ceph_vino vino;
+ struct inode *inode;
+ struct dentry *dentry;
+ int err;
+
+ if (fh_type == 1)
+ return ERR_PTR(-ESTALE);
+
+ pr_debug("fh_to_parent %llx/%d\n", cfh->parent_ino,
+ cfh->parent_name_hash);
+
+ vino.ino = cfh->ino;
+ vino.snap = CEPH_NOSNAP;
+ inode = ceph_find_inode(sb, vino);
+ if (!inode)
+ return ERR_PTR(-ESTALE);
+
+ dentry = d_obtain_alias(inode);
+ if (!dentry) {
+ pr_err("fh_to_parent %llx -- inode %p but ENOMEM\n",
+ cfh->ino, inode);
+ iput(inode);
+ return ERR_PTR(-ENOMEM);
+ }
+ err = ceph_init_dentry(dentry);
+ if (err < 0) {
+ iput(inode);
+ return ERR_PTR(err);
+ }
+ dout("fh_to_parent %llx %p dentry %p\n", cfh->ino, inode, dentry);

+ return dentry;
+}
+

+const struct export_operations ceph_export_ops = {
+ .encode_fh = ceph_encode_fh,
+ .fh_to_dentry = ceph_fh_to_dentry,
+ .fh_to_parent = ceph_fh_to_parent,
+};
--
1.7.0

Christoph Hellwig

unread,

Mar 3, 2010, 2:50:02 AM3/3/10

On Tue, Mar 02, 2010 at 04:46:54PM -0800, Sage Weil wrote:
> Basic NFS re-export support is included. This mostly works. However,
> Ceph's MDS design precludes the ability to generate a (small)
> filehandle that will be valid forever, so this is of limited utility.

So don't include it if it's broken by design.

Sage Weil

unread,

Mar 3, 2010, 4:40:02 PM3/3/10

On Wed, 3 Mar 2010, Christoph Hellwig wrote:
> On Tue, Mar 02, 2010 at 04:46:54PM -0800, Sage Weil wrote:
> > Basic NFS re-export support is included. This mostly works. However,
> > Ceph's MDS design precludes the ability to generate a (small)
> > filehandle that will be valid forever, so this is of limited utility.
>
> So don't include it if it's broken by design.

Well, there are changes we can make to the MDS to make it work that are on
the roadmap (they'll also help with disaster recovery). They will require
some change on the client side, though, so it's not like the current
implementation will suddenly work perfectly with a server-side upgrade.

That said, I don't think Ceph is the only filesystem that has this problem
(fat? fuse? cifs?), and it is still useful in its current state. But I
can leave it off by default (via Kconfig option), or include a kern.log
warning, or leave it out entirely, if that's what people prefer.

sage

Miklos Szeredi

unread,

Mar 4, 2010, 7:00:02 AM3/4/10

On Wed, 3 Mar 2010, Sage Weil wrote:
> On Wed, 3 Mar 2010, Christoph Hellwig wrote:
> > On Tue, Mar 02, 2010 at 04:46:54PM -0800, Sage Weil wrote:
> > > Basic NFS re-export support is included. This mostly works. However,
> > > Ceph's MDS design precludes the ability to generate a (small)
> > > filehandle that will be valid forever, so this is of limited utility.
> >
> > So don't include it if it's broken by design.
>
> Well, there are changes we can make to the MDS to make it work that are on
> the roadmap (they'll also help with disaster recovery). They will require
> some change on the client side, though, so it's not like the current
> implementation will suddenly work perfectly with a server-side upgrade.
>
> That said, I don't think Ceph is the only filesystem that has this problem
> (fat? fuse? cifs?), and it is still useful in its current state. But I
> can leave it off by default (via Kconfig option), or include a kern.log
> warning, or leave it out entirely, if that's what people prefer.

Fuse uses a fixed 64bit handle on the userspace/kernel API and on the
"low level" userspace API, so that's NFS export compatible.

On the "high level" API fuse uses path strings, which are not so
suitable for NFS exporting. Despite that, people have been exporting
these fuse filesystems over NFS and dealing with the occasional ESTALE
if the inode has been evicted from the cache on the server side but
the client still holds a file handle.

CIFS doesn't seem to have any export capability. FAT tries, but is
also not guaranteed to work if the inodes are no longer in the cache.

Thanks,
Miklos

0 new messages