
[PATCH] dcache: Translating dentry into pathname without taking rename_lock


Waiman Long

Sep 4, 2013, 3:06:21 PM
to Alexander Viro, Linus Torvalds, Waiman Long, linux-...@vger.kernel.org, linux-...@vger.kernel.org, Chandramouleeswaran, Aswin, Norton, Scott J
When running AIM7's short workload, Linus' lockref patch eliminated
most of the spinlock contention. However, some contention remained:

8.46% reaim [kernel.kallsyms] [k] _raw_spin_lock
|--42.21%-- d_path
| proc_pid_readlink
| SyS_readlinkat
| SyS_readlink
| system_call
| __GI___readlink
|
|--40.97%-- sys_getcwd
| system_call
| __getcwd

The big one here is the rename_lock (seqlock) contention in d_path()
and the getcwd system call. This patch will eliminate the need to take
the rename_lock while translating dentries into the full pathnames.

The rename_lock is taken to make sure that no rename operation can
be ongoing while the translation is in progress. However, only one
thread at a time can hold the rename_lock, blocking all the other
threads that need it even though the translation process makes no
change to the dentries.

This patch replaces the writer's write_seqlock/write_sequnlock
sequence on the rename_lock in the callers of the prepend_path() and
__dentry_path() functions with the reader's read_seqbegin/read_seqretry
sequence inside these two functions. As a result, the code has to
retry if one or more rename operations were performed in the meantime.
In addition, the RCU read lock is taken during the translation process
to make sure that no dentries go away.
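The read_seqbegin/read_seqretry pattern described above can be sketched as a toy seqcount. This is an illustrative, single-threaded model with invented names, not the kernel's seqlock implementation:

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>

/* Toy seqcount modeled loosely on the kernel's seqlock; the names and
 * simplifications here are illustrative only. */
struct toy_seqcount {
	atomic_uint seq;
};

static unsigned toy_read_seqbegin(struct toy_seqcount *s)
{
	unsigned seq;

	/* An odd count means a writer (a rename) is in progress: wait. */
	while ((seq = atomic_load(&s->seq)) & 1)
		;
	return seq;
}

static bool toy_read_seqretry(struct toy_seqcount *s, unsigned seq)
{
	/* True if any writer ran since toy_read_seqbegin(): must retry. */
	return atomic_load(&s->seq) != seq;
}

static void toy_write_seqlock(struct toy_seqcount *s)
{
	atomic_fetch_add(&s->seq, 1);	/* count becomes odd: writer active */
}

static void toy_write_sequnlock(struct toy_seqcount *s)
{
	atomic_fetch_add(&s->seq, 1);	/* count even again: readers may pass */
}
```

A reader snapshots the count, does the whole walk, then redoes everything if the count moved; readers never block each other, which is the point of dropping the writer-side lock here.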

The dentry's d_lock is also no longer taken, further reducing
spinlock contention. However, this means that the name pointer and
other dentry fields may not be valid. As a result, the code was
enhanced to handle the case where the parent pointer or the name
pointer is NULL.

With this patch, _raw_spin_lock now accounts for only 1.2% of the
total CPU cycles for the short workload. This patch also reduces the
distortion that running perf introduces into its own profile, since
the perf command itself can be a heavy user of the d_path() function
depending on the complexity of the workload.

When taking the perf profile of the high-systime workload, the
spinlock contention contributed by running perf without this patch
was about 16%. With this patch, the spinlock contention caused by
running perf goes away and we get a more accurate perf profile.

Signed-off-by: Waiman Long <Waima...@hp.com>
---
fs/dcache.c | 118 +++++++++++++++++++++++++++++++++++++++++------------------
1 files changed, 82 insertions(+), 36 deletions(-)

diff --git a/fs/dcache.c b/fs/dcache.c
index 96655f4..12d07d7 100644
--- a/fs/dcache.c
+++ b/fs/dcache.c
@@ -2512,7 +2512,19 @@ static int prepend(char **buffer, int *buflen, const char *str, int namelen)

static int prepend_name(char **buffer, int *buflen, struct qstr *name)
{
- return prepend(buffer, buflen, name->name, name->len);
+ /*
+ * With RCU path tracing, it may race with rename. Use
+ * ACCESS_ONCE() to make sure that it is either the old or
+ * the new name pointer. The length does not really matter as
+ * the sequence number check will eventually catch any ongoing
+ * rename operation.
+ */
+ const char *dname = ACCESS_ONCE(name->name);
+ int dlen = name->len;
+
+ if (unlikely(!dname || !dlen))
+ return -EINVAL;
+ return prepend(buffer, buflen, dname, dlen);
}

/**
@@ -2522,7 +2534,12 @@ static int prepend_name(char **buffer, int *buflen, struct qstr *name)
* @buffer: pointer to the end of the buffer
* @buflen: pointer to buffer length
*
- * Caller holds the rename_lock.
+ * The function tries to write out the pathname without taking any lock other
+ * than the RCU read lock to make sure that dentries won't go away. It only
+ * checks the sequence number of the global rename_lock as any change in the
+ * dentry's d_seq will be preceded by changes in the rename_lock sequence
+ * number. If the sequence number had been changed, it will restart the whole
+ * pathname back-tracing sequence again.
*/
static int prepend_path(const struct path *path,
const struct path *root,
@@ -2533,7 +2550,15 @@ static int prepend_path(const struct path *path,
struct mount *mnt = real_mount(vfsmnt);
bool slash = false;
int error = 0;
+ unsigned seq;
+ char *bptr;
+ int blen;

+restart:
+ bptr = *buffer;
+ blen = *buflen;
+ seq = read_seqbegin(&rename_lock);
+ rcu_read_lock();
while (dentry != root->dentry || vfsmnt != root->mnt) {
struct dentry * parent;

@@ -2547,22 +2572,38 @@ static int prepend_path(const struct path *path,
continue;
}
parent = dentry->d_parent;
+ if (unlikely(parent == NULL)) {
+ rcu_read_unlock();
+ if (read_seqretry(&rename_lock, seq))
+ goto restart;
+ goto garbage;
+ }
prefetch(parent);
- spin_lock(&dentry->d_lock);
- error = prepend_name(buffer, buflen, &dentry->d_name);
- spin_unlock(&dentry->d_lock);
+ error = prepend_name(&bptr, &blen, &dentry->d_name);
if (!error)
- error = prepend(buffer, buflen, "/", 1);
+ error = prepend(&bptr, &blen, "/", 1);
if (error)
break;

slash = true;
dentry = parent;
}
+ rcu_read_unlock();
+ if (read_seqretry(&rename_lock, seq))
+ goto restart;

if (!error && !slash)
- error = prepend(buffer, buflen, "/", 1);
+ error = prepend(&bptr, &blen, "/", 1);
+
+ret:
+ if (error == -EINVAL)
+ goto garbage;
+ *buffer = bptr;
+ *buflen = blen;
+ return error;

+garbage:
+ error = prepend(buffer, buflen, "(garbage)", 10);
return error;

global_root:
@@ -2575,11 +2616,14 @@ global_root:
WARN(1, "Root dentry has weird name <%.*s>\n",
(int) dentry->d_name.len, dentry->d_name.name);
}
+ rcu_read_unlock();
+ if (read_seqretry(&rename_lock, seq))
+ goto restart;
if (!slash)
- error = prepend(buffer, buflen, "/", 1);
+ error = prepend(&bptr, &blen, "/", 1);
if (!error)
error = is_mounted(vfsmnt) ? 1 : 2;
- return error;
+ goto ret;
}

/**
@@ -2607,9 +2651,7 @@ char *__d_path(const struct path *path,

prepend(&res, &buflen, "\0", 1);
br_read_lock(&vfsmount_lock);
- write_seqlock(&rename_lock);
error = prepend_path(path, root, &res, &buflen);
- write_sequnlock(&rename_lock);
br_read_unlock(&vfsmount_lock);

if (error < 0)
@@ -2628,9 +2670,7 @@ char *d_absolute_path(const struct path *path,

prepend(&res, &buflen, "\0", 1);
br_read_lock(&vfsmount_lock);
- write_seqlock(&rename_lock);
error = prepend_path(path, &root, &res, &buflen);
- write_sequnlock(&rename_lock);
br_read_unlock(&vfsmount_lock);

if (error > 1)
@@ -2696,9 +2736,7 @@ char *d_path(const struct path *path, char *buf, int buflen)

get_fs_root(current->fs, &root);
br_read_lock(&vfsmount_lock);
- write_seqlock(&rename_lock);
error = path_with_deleted(path, &root, &res, &buflen);
- write_sequnlock(&rename_lock);
br_read_unlock(&vfsmount_lock);
if (error < 0)
res = ERR_PTR(error);
@@ -2744,44 +2782,57 @@ char *simple_dname(struct dentry *dentry, char *buffer, int buflen)
*/
static char *__dentry_path(struct dentry *dentry, char *buf, int buflen)
{
- char *end = buf + buflen;
- char *retval;
+ char *end, *retval;
+ int len, seq, error = 0;

- prepend(&end, &buflen, "\0", 1);
+restart:
+ end = buf + buflen;
+ len = buflen;
+ prepend(&end, &len, "\0", 1);
if (buflen < 1)
goto Elong;
/* Get '/' right */
retval = end-1;
*retval = '/';
-
+ seq = read_seqbegin(&rename_lock);
+ rcu_read_lock();
while (!IS_ROOT(dentry)) {
struct dentry *parent = dentry->d_parent;
int error;

+ if (unlikely(parent == NULL))
+ goto chk_seq;
prefetch(parent);
- spin_lock(&dentry->d_lock);
- error = prepend_name(&end, &buflen, &dentry->d_name);
- spin_unlock(&dentry->d_lock);
- if (error != 0 || prepend(&end, &buflen, "/", 1) != 0)
- goto Elong;
+ error = prepend_name(&end, &len, &dentry->d_name);
+ if (!error)
+ error = prepend(&end, &len, "/", 1);
+ if (error)
+ goto chk_seq;

retval = end;
dentry = parent;
}
+ rcu_read_unlock();
+ if (read_seqretry(&rename_lock, seq))
+ goto restart;
return retval;
Elong:
return ERR_PTR(-ENAMETOOLONG);
+chk_seq:
+ rcu_read_unlock();
+ if (read_seqretry(&rename_lock, seq))
+ goto restart;
+ if (error == -ENAMETOOLONG)
+ goto Elong;
+ error = prepend(&end, &len, "//garbage", 10);
+ if (error)
+ goto Elong;
+ return end;
}

char *dentry_path_raw(struct dentry *dentry, char *buf, int buflen)
{
- char *retval;
-
- write_seqlock(&rename_lock);
- retval = __dentry_path(dentry, buf, buflen);
- write_sequnlock(&rename_lock);
-
- return retval;
+ return __dentry_path(dentry, buf, buflen);
}
EXPORT_SYMBOL(dentry_path_raw);

@@ -2790,7 +2841,6 @@ char *dentry_path(struct dentry *dentry, char *buf, int buflen)
char *p = NULL;
char *retval;

- write_seqlock(&rename_lock);
if (d_unlinked(dentry)) {
p = buf + buflen;
if (prepend(&p, &buflen, "//deleted", 10) != 0)
@@ -2798,7 +2848,6 @@ char *dentry_path(struct dentry *dentry, char *buf, int buflen)
buflen++;
}
retval = __dentry_path(dentry, buf, buflen);
- write_sequnlock(&rename_lock);
if (!IS_ERR(retval) && p)
*p = '/'; /* restore '/' overriden with '\0' */
return retval;
@@ -2837,7 +2886,6 @@ SYSCALL_DEFINE2(getcwd, char __user *, buf, unsigned long, size)

error = -ENOENT;
br_read_lock(&vfsmount_lock);
- write_seqlock(&rename_lock);
if (!d_unlinked(pwd.dentry)) {
unsigned long len;
char *cwd = page + PAGE_SIZE;
@@ -2845,7 +2893,6 @@ SYSCALL_DEFINE2(getcwd, char __user *, buf, unsigned long, size)

prepend(&cwd, &buflen, "\0", 1);
error = prepend_path(&pwd, &root, &cwd, &buflen);
- write_sequnlock(&rename_lock);
br_read_unlock(&vfsmount_lock);

if (error < 0)
@@ -2866,7 +2913,6 @@ SYSCALL_DEFINE2(getcwd, char __user *, buf, unsigned long, size)
error = -EFAULT;
}
} else {
- write_sequnlock(&rename_lock);
br_read_unlock(&vfsmount_lock);
}

--
1.7.1

--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majo...@vger.kernel.org
More majordomo info at http://vger.kernel.org/majordomo-info.html
Please read the FAQ at http://www.tux.org/lkml/

Al Viro

Sep 4, 2013, 3:11:28 PM
to Waiman Long, Linus Torvalds, linux-...@vger.kernel.org, linux-...@vger.kernel.org, Chandramouleeswaran, Aswin, Norton, Scott J
On Wed, Sep 04, 2013 at 03:05:23PM -0400, Waiman Long wrote:
>
> static int prepend_name(char **buffer, int *buflen, struct qstr *name)
> {
> - return prepend(buffer, buflen, name->name, name->len);
> + /*
> + * With RCU path tracing, it may race with rename. Use
> + * ACCESS_ONCE() to make sure that it is either the old or
> + * the new name pointer. The length does not really matter as
> + * the sequence number check will eventually catch any ongoing
> + * rename operation.
> + */
> + const char *dname = ACCESS_ONCE(name->name);
> + int dlen = name->len;
> +
> + if (unlikely(!dname || !dlen))
> + return -EINVAL;
> + return prepend(buffer, buflen, dname, dlen);

NAK. A race with d_move() can very well leave you with dname pointing into
an object of length smaller than dlen. You *can* copy it byte-by-byte
and rely on NUL-termination, but you can't rely on length being accurate -
not without having excluded d_move().
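The byte-by-byte copy Al alludes to could look like this minimal sketch (a hypothetical helper, not code from the patch): it trusts the NUL terminator rather than the possibly-stale length, so it never reads past the real end of the name even when dlen is wrong.

```c
#include <assert.h>
#include <stddef.h>

/* Copy a possibly-racing, NUL-terminated name, trusting the terminator
 * rather than the (possibly stale) length bound. Each byte is read and
 * tested before use, so a too-large maxlen cannot walk off the real
 * object. Hypothetical helper; dst must hold maxlen + 1 bytes. */
static size_t copy_name_nul_bounded(char *dst, const char *src, size_t maxlen)
{
	size_t i;

	for (i = 0; i < maxlen; i++) {
		char c = src[i];	/* read, then test: stop at the NUL */

		if (c == '\0')
			break;
		dst[i] = c;
	}
	dst[i] = '\0';
	return i;
}
```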

Waiman Long

Sep 4, 2013, 3:26:58 PM
to Waiman Long, Alexander Viro, Linus Torvalds, linux-...@vger.kernel.org, linux-...@vger.kernel.org, Chandramouleeswaran, Aswin, Norton, Scott J
In terms of AIM7 performance, this patch has a performance boost of
about 6-7% on top of Linus' lockref patch on an 8-socket, 80-core DL980.

User Range | 10-100 | 200-10000 | 1100-2000 |
Mean JPM w/o patch | 4,365,114 | 7,211,115 | 6,964,439 |
Mean JPM with patch | 3,872,850 | 7,655,958 | 7,422,598 |
% Change | -11.3% | +6.2% | +6.6% |

I am not sure if it is too aggressive not to take the d_lock before
prepend_name(). I can add back those locking instructions if necessary.

Regards,
Longman

Waiman Long

Sep 4, 2013, 3:33:19 PM
to Al Viro, Linus Torvalds, linux-...@vger.kernel.org, linux-...@vger.kernel.org, Chandramouleeswaran, Aswin, Norton, Scott J
On 09/04/2013 03:11 PM, Al Viro wrote:
> On Wed, Sep 04, 2013 at 03:05:23PM -0400, Waiman Long wrote:
>>
>> static int prepend_name(char **buffer, int *buflen, struct qstr *name)
>> {
>> - return prepend(buffer, buflen, name->name, name->len);
>> + /*
>> + * With RCU path tracing, it may race with rename. Use
>> + * ACCESS_ONCE() to make sure that it is either the old or
>> + * the new name pointer. The length does not really matter as
>> + * the sequence number check will eventually catch any ongoing
>> + * rename operation.
>> + */
>> + const char *dname = ACCESS_ONCE(name->name);
>> + int dlen = name->len;
>> +
>> + if (unlikely(!dname || !dlen))
>> + return -EINVAL;
>> + return prepend(buffer, buflen, dname, dlen);
> NAK. A race with d_move() can very well leave you with dname pointing into
> an object of length smaller than dlen. You *can* copy it byte-by-byte
> and rely on NUL-termination, but you can't rely on length being accurate -
> not without having excluded d_move().

I have thought about that. But if a d_move() is going on, the string in
the buffer will be discarded as the sequence number will change. So
whether or not it has an embedded null byte shouldn't matter. That is why
I didn't add code to do a byte-by-byte copy in this first patch. I can
add code to do that if you think it is safer to do so.

Regards,
Longman

Al Viro

Sep 4, 2013, 3:43:50 PM
to Waiman Long, Linus Torvalds, linux-...@vger.kernel.org, linux-...@vger.kernel.org, Chandramouleeswaran, Aswin, Norton, Scott J
On Wed, Sep 04, 2013 at 03:33:00PM -0400, Waiman Long wrote:

> I have thought about that. But if a d_move() is going on, the string
> in the buffer will be discarded as the sequence number will change.
> So whether or not it have embedded null byte shouldn't matter. That
> is why I didn't add code to do byte-by-byte copy at this first
> patch. I can add code to do that if you think it is safer to do so.

Sigh... Junk in the output is not an issue; reading from invalid address
is, since you might not survive to the sequence number check. Again,
if p is an address returned by kmalloc(size, ...), dereferencing p + offset
is not safe unless offset is less than size.

John Stoffel

Sep 4, 2013, 5:04:14 PM
to Waiman Long, Alexander Viro, Linus Torvalds, linux-...@vger.kernel.org, linux-...@vger.kernel.org, Chandramouleeswaran, Aswin, Norton, Scott J
>>>>> "Waiman" == Waiman Long <waima...@hp.com> writes:

Waiman> In term of AIM7 performance, this patch has a performance boost of
Waiman> about 6-7% on top of Linus' lockref patch on a 8-socket 80-core DL980.

Waiman> User Range | 10-100 | 200-10000 | 1100-2000 |
Waiman> Mean JPM w/o patch | 4,365,114 | 7,211,115 | 6,964,439 |
Waiman> Mean JPM with patch | 3,872,850 | 7,655,958 | 7,422,598 |
Waiman> % Change | -11.3% | +6.2% | +6.6% |

This -11% impact is worrisome to me, because at smaller numbers of
users I would still expect the performance to go up. So why the big
drop?

Also, what is the impact of these changes on smaller 1-socket, 4-core
systems? Just because it helps a couple of big boxes doesn't mean it
won't hurt the more common small case.

John

Linus Torvalds

Sep 4, 2013, 5:31:59 PM
to Waiman Long, Alexander Viro, linux-fsdevel, Linux Kernel Mailing List, Chandramouleeswaran, Aswin, Norton, Scott J
On Wed, Sep 4, 2013 at 12:05 PM, Waiman Long <Waima...@hp.com> wrote:
> + rcu_read_unlock();
> + if (read_seqretry(&rename_lock, seq))
> + goto restart;

Btw, you have this pattern twice, and while it's not necessarily
incorrect, it's a bit worrisome for performance.

The rcu_read_unlock() is very possibly going to trigger an immediate
scheduling event, so checking the sequence lock after dropping the
read-lock sounds like it would make it much easier to hit the race
with some rename.

I'm also a tiny bit worried about livelocking on the sequence lock in
the presence of lots of renames, so I'm wondering if the locking
should try to approximate what we do for the RCU lookup path: start
off optimistically using just the RCU lock and a sequence point, but
if that fails, fall back to taking the real lock. Maybe using a
counter ("try twice, then get the rename lock for writing")

Hmm?

Linus

Waiman Long

Sep 4, 2013, 9:56:30 PM
to Al Viro, Linus Torvalds, linux-...@vger.kernel.org, linux-...@vger.kernel.org, Chandramouleeswaran, Aswin, Norton, Scott J
On 09/04/2013 03:43 PM, Al Viro wrote:
> On Wed, Sep 04, 2013 at 03:33:00PM -0400, Waiman Long wrote:
>
>> I have thought about that. But if a d_move() is going on, the string
>> in the buffer will be discarded as the sequence number will change.
>> So whether or not it have embedded null byte shouldn't matter. That
>> is why I didn't add code to do byte-by-byte copy at this first
>> patch. I can add code to do that if you think it is safer to do so.
> Sigh... Junk in the output is not an issue; reading from invalid address
> is, since you might not survive to the sequence number check. Again,
> if p is an address returned by kmalloc(size, ...), dereferencing p + offset
> is not safe unless offset is less than size.

Yeah, I understand that. As said in my reply to Linus, I will use
memchr() to see if there is a null byte within the specified length.
If one is found, I will assume the string is not valid and return an
error to the caller.

Longman

Waiman Long

Sep 4, 2013, 10:04:50 PM
to John Stoffel, Alexander Viro, Linus Torvalds, linux-...@vger.kernel.org, linux-...@vger.kernel.org, Chandramouleeswaran, Aswin, Norton, Scott J
On 09/04/2013 04:40 PM, John Stoffel wrote:
>>>>>> "Waiman" == Waiman Long<waima...@hp.com> writes:
> Waiman> In term of AIM7 performance, this patch has a performance boost of
> Waiman> about 6-7% on top of Linus' lockref patch on a 8-socket 80-core DL980.
>
> Waiman> User Range | 10-100 | 200-10000 | 1100-2000 |
> Waiman> Mean JPM w/o patch | 4,365,114 | 7,211,115 | 6,964,439 |
> Waiman> Mean JPM with patch | 3,872,850 | 7,655,958 | 7,422,598 |
> Waiman> % Change | -11.3% | +6.2% | +6.6% |
>
> This -11% impact is worisome to me, because at smaller numbers of
> users, I would still expect the performance to go up. So why the big
> drop?
>
> Also, how is the impact of these changes on smaller 1 socket, 4 core
> systems? Just because it helps a couple of big boxes, doesn't mean it
> won't hurt the more common small case.
>
> John

I don't believe the patch will make it slower with fewer users. It is
more a result of run-to-run variation. The short workload typically
completes in a very short time. In the 10-100 user range, the completion
times range from 0.02-0.11s. With a higher user count, it needs several
seconds to run and hence the results are more reliable.

Longman

Waiman Long

Sep 4, 2013, 10:17:24 PM
to Linus Torvalds, Alexander Viro, linux-fsdevel, Linux Kernel Mailing List, Chandramouleeswaran, Aswin, Norton, Scott J
On 09/04/2013 05:31 PM, Linus Torvalds wrote:
> On Wed, Sep 4, 2013 at 12:05 PM, Waiman Long<Waima...@hp.com> wrote:
>> + rcu_read_unlock();
>> + if (read_seqretry(&rename_lock, seq))
>> + goto restart;
> Btw, you have this pattern twice, and while it's not necessarily
> incorrect, it's a bit worrisome for performance.

The rcu_read_unlock sequence in the middle of prepend_path() is not
likely to be executed, so it shouldn't affect performance except for
the conditional check.

> The rcu_read_unlock() is very possibly going to trigger an immediate
> scheduling event, so checking the sequence lock after dropping the
> read-lock sounds like it would make it much easier to hit the race
> with some rename.

I can put read_seqbegin/read_seqretry within the
rcu_read_lock/rcu_read_unlock block. However, read_seqbegin() can spin
for a while if a rename is in progress. So it depends on which is more
important: a shorter RCU critical section at the expense of more
retries, or vice versa.

> I'm also a tiny bit worried about livelocking on the sequence lock in
> the presence of lots of renames, so I'm wondering if the locking
> should try to approximate what we do for the RCU lookup path: start
> off optimistically using just the RCU lock and a sequence point, but
> if that fails, fall back to taking the real lock. Maybe using a
> counter ("try twice, then get the rename lock for writing")
>
> Hmm?

Yes, I can implement a counter that switches to taking the rename_lock
when the count reaches 0. It shouldn't be too hard. That should avoid
the possibility of a livelock.

Longman
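The counter idea can be sketched as follows; the seqcount stand-in and the simulated racing rename below are toys for illustration, not the eventual kernel patch:

```c
#include <assert.h>
#include <stdbool.h>

enum { LOCKLESS_TRIES = 3 };

static unsigned g_rename_seq;	/* toy stand-in for rename_lock's seqcount */
static int g_racing_renames;	/* how many attempts a fake rename spoils */

static unsigned toy_read_begin(void)
{
	return g_rename_seq;
}

static bool toy_read_retry(unsigned seq)
{
	if (g_racing_renames > 0) {	/* simulate a rename racing with us */
		g_racing_renames--;
		g_rename_seq += 2;
	}
	return g_rename_seq != seq;
}

/* Returns 1 if the lockless fast path succeeded, 0 if we fell back to
 * excluding renames (write_seqlock in the real code). */
static int translate_with_fallback(void)
{
	int tries;

	for (tries = 0; tries < LOCKLESS_TRIES; tries++) {
		unsigned seq = toy_read_begin();
		/* ... lockless pathname walk would happen here ... */
		if (!toy_read_retry(seq))
			return 1;
	}
	/* write_seqlock(&rename_lock); walk; write_sequnlock(); */
	return 0;
}
```

The fallback bounds the worst case: a rename storm can cost at most LOCKLESS_TRIES wasted walks before the translation is done under the exclusive lock.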

Al Viro

Sep 4, 2013, 10:42:32 PM
to Waiman Long, Linus Torvalds, linux-...@vger.kernel.org, linux-...@vger.kernel.org, Chandramouleeswaran, Aswin, Norton, Scott J
On Wed, Sep 04, 2013 at 09:55:43PM -0400, Waiman Long wrote:
> On 09/04/2013 03:43 PM, Al Viro wrote:
> >On Wed, Sep 04, 2013 at 03:33:00PM -0400, Waiman Long wrote:
> >
> >>I have thought about that. But if a d_move() is going on, the string
> >>in the buffer will be discarded as the sequence number will change.
> >>So whether or not it have embedded null byte shouldn't matter. That
> >>is why I didn't add code to do byte-by-byte copy at this first
> >>patch. I can add code to do that if you think it is safer to do so.
> >Sigh... Junk in the output is not an issue; reading from invalid address
> >is, since you might not survive to the sequence number check. Again,
> >if p is an address returned by kmalloc(size, ...), dereferencing p + offset
> >is not safe unless offset is less than size.
>
> Yeah, I understand that. As said in my reply to Linus, I will use
> memchr() to see if there is null byte within the specified length.
> If one is found, I will assume the string is not valid and return
> error to the caller.

Umm... Strictly speaking, memchr() behaviour is undefined if the third
argument exceeds the size of object pointed to by the first one. IOW,
it has every right to assume that all characters in the range to be
searched in are safely readable. You can't assume that it will read
them one by one until it hits the one you are searching for. In practice
it's probably almost[1] true for all our implementations of memchr(), but...

[1] reads past the character being searched for are very likely, but they'll
be within the same page, which is safe.

Linus Torvalds

Sep 4, 2013, 10:48:22 PM
to Waiman Long, Al Viro, Scott J Norton, linux-fsdevel, Chandramouleeswaran, Aswin, Linux Kernel Mailing List
On Wed, Sep 4, 2013 at 6:49 PM, Waiman Long <waima...@hp.com> wrote:
>
> So what I am going to do is to use memchr() to locate a null
> byte within the given length. If one is found, the string must be invalid
> and there is no point in doing the copying. Instead, EINVAL will be returned
> and the code will check the sequence number then. The memchr() function can
> be fast if an architecture-specific version exists.

We don't have an architecture-specific fast version of memchr, because
nobody sane has ever cared about that abomination of a function. Also,
if rename() overwrites the pathname (switching inline names), I guess
a zero could disappear between your memchr and the copy. Although I
think we always have an ending NUL byte for the inline case, so that
should make sure that memchr would find it *eventually*.

But regardless, that's really not what you want to do.

You should do:

- use the name length as a maximum

- do a byte-at-a-time copy, stopping at a zero (it's going to be
faster than memchr anyway)

Then, later on, we can do one that does a word-at-a-time using the
CONFIG_DCACHE_WORD_ACCESS magic: we know dentry names are always
word-aligned, and we have an efficient "has_zero()" function for
finding zero bytes in a word.

Really.

Linus
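The zero-byte-in-a-word test Linus refers to is the classic bit trick; here is a standalone sketch in the spirit of the kernel's has_zero(), but not its actual interface:

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Classic "does this word contain a zero byte?" trick, in the spirit of
 * the kernel's CONFIG_DCACHE_WORD_ACCESS has_zero(); sketch only. */
static bool word_has_zero_byte(uint64_t v)
{
	/* Subtracting 0x01 from each byte sets the high bit of any byte
	 * that was 0x00 (the subtraction borrows); masking with ~v then
	 * discards bytes whose high bit was already set in v. A borrow
	 * can only chain upward from a genuine zero byte, so the test
	 * is exact. */
	return ((v - 0x0101010101010101ULL) &
		~v & 0x8080808080808080ULL) != 0;
}
```

Scanning a name one aligned word at a time with this predicate finds the terminating NUL without a per-byte branch, which is why it beats memchr for short strings.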

Al Viro

Sep 5, 2013, 12:20:15 AM
to Linus Torvalds, Waiman Long, Scott J Norton, linux-fsdevel, Chandramouleeswaran, Aswin, Linux Kernel Mailing List
On Wed, Sep 04, 2013 at 07:48:14PM -0700, Linus Torvalds wrote:
> - use the name length as a maximum
>
> - do a byte-at-a-time copy, stopping at a zero (it's going to be
> faster than memchr anyway)
>
> Then, later on, we can do one that does a word-at-a-time using the
> CONFIG_DCACHE_WORD_ACCESS magic: we know dentry names are always
> word-aligned, and we have an efficient "has_zero()" function for
> finding zero bytes in a word.

Umm... Dentry names are word-aligned, but the place where we copy them
doesn't have to be. OTOH, DCACHE_WORD_ACCESS is for architectures where
unaligned stores are fine, so...

George Spelvin

Sep 5, 2013, 12:30:42 AM
to waima...@hp.com, li...@horizon.com, linux-...@vger.kernel.org, linux-...@vger.kernel.org, torv...@linux-foundation.org, vi...@zeniv.linux.org.uk
As long as you're removing locks from prepend_name and complicating its
innards, I notice that each and every call site follows it by prepending
"/". How about moving that into prepend_name as well?

Also, if you happen to feel like it, you can delete the slash flag
and replace it with "bptr != *buffer".

Another small tweak would be to the global_root part of the code.
You could move the is_mounted(vfsmnt) test up, and combine the tail of
that code path with the regular exit. All you have to do is change
the !slash test to:

	if (error >= 0 && bptr == *buffer) {	/* Root directory */
		if (--blen < 0)
			error = -ENAMETOOLONG;
		else
			*--bptr = '/';
	}

This modified form is no more code than an inlined copy of prepend(),
so we haven't actually slowed the fast path, but it avoids corrupting
the return value of 0/1/2 if possible.

John Stoffel

Sep 5, 2013, 9:29:50 AM
to Waiman Long, John Stoffel, Alexander Viro, Linus Torvalds, linux-...@vger.kernel.org, linux-...@vger.kernel.org, Chandramouleeswaran, Aswin, Norton, Scott J
>>>>> "Waiman" == Waiman Long <waima...@hp.com> writes:

Waiman> On 09/04/2013 04:40 PM, John Stoffel wrote:
>>>>>>> "Waiman" == Waiman Long<waima...@hp.com> writes:
Waiman> In term of AIM7 performance, this patch has a performance boost of
Waiman> about 6-7% on top of Linus' lockref patch on a 8-socket 80-core DL980.
>>
Waiman> User Range | 10-100 | 200-10000 | 1100-2000 |
Waiman> Mean JPM w/o patch | 4,365,114 | 7,211,115 | 6,964,439 |
Waiman> Mean JPM with patch | 3,872,850 | 7,655,958 | 7,422,598 |
Waiman> % Change | -11.3% | +6.2% | +6.6% |
>>
>> This -11% impact is worisome to me, because at smaller numbers of
>> users, I would still expect the performance to go up. So why the big
>> drop?
>>
>> Also, how is the impact of these changes on smaller 1 socket, 4 core
>> systems? Just because it helps a couple of big boxes, doesn't mean it
>> won't hurt the more common small case.
>>
>> John

Waiman> I don't believe the patch will make it slower with less
Waiman> user. It is more a result of run-to-run variation. The short
Waiman> workload typically completed in a very short time. In the
Waiman> 10-100 user range, the completion times range from
Waiman> 0.02-0.11s. With a higher user count, it needs several seconds
Waiman> to run and hence the results are more reliable.

Can you then show the variation over multiple runs? I think you have
a good justification for larger boxes to make this change, I just
worry about smaller systems getting hit and losing performance.

John

Waiman Long

Sep 5, 2013, 1:06:55 PM
to George Spelvin, linux-...@vger.kernel.org, linux-...@vger.kernel.org, torv...@linux-foundation.org, vi...@zeniv.linux.org.uk
On 09/05/2013 12:30 AM, George Spelvin wrote:
> As long as you're removing locks from prepend_name and complicating its
> innards, I notice that each and every call site follows it by prepending
> "/". How about moving that into prepend_name as well?
>
> Also, if you happen to feel like it, you can delete the slash flag
> and replace it with "bptr != *buffer".
>
> Another small tweak would be to the global_root part of the code.
> You could move the is_mounted(vfsmnt) test up, and combine the tail of
> that code path with the regular exit. All you have to do is change
> the !slash test to:
>
> 	if (error >= 0 && bptr == *buffer) {	/* Root directory */
> 		if (--blen < 0)
> 			error = -ENAMETOOLONG;
> 		else
> 			*--bptr = '/';
> 	}
>
> This modified form is no more code than an inlined copy of prepend(),
> so we haven't actually slowed the fast path, but it avoids corrupting
> the return value of 0/1/2 if possible.

Thanks for the suggestions. I will implement them in my v2 patch.

-Longman

Waiman Long

Sep 5, 2013, 2:56:16 PM
to Alexander Viro, Linus Torvalds, Waiman Long, linux-...@vger.kernel.org, linux-...@vger.kernel.org, Chandramouleeswaran, Aswin, Norton, Scott J, George Spelvin, John Stoffel
In addition, the RCU read lock will be taken during the translation
process to make sure that no dentries will go away. To prevent
live-lock from happening, the code will switch back to taking the
rename_lock if read_seqretry() fails three times.

To further reduce spinlock contention, this patch also tries not
to take the dentry's d_lock if the dentry name is internal. For an
external dentry name, it is safer to take the d_lock, as racing with
d_move() may cause the code to access an invalid memory location and
crash. For safety, there is also an additional check that the name
string is valid (no embedded null).

The following code re-factorings are also made:
1. Move prepend('/') into prepend_name().
2. Move the global root check in prepend_path() back to the top of
the while loop.

With this patch, _raw_spin_lock now accounts for only 1.2% of the
total CPU cycles for the short workload. This patch also reduces the
distortion that running perf introduces into its own profile, since
the perf command itself can be a heavy user of the d_path() function
depending on the complexity of the workload.

When taking the perf profile of the high-systime workload, the
spinlock contention contributed by running perf without this patch
was about 16%. With this patch, the spinlock contention caused by
running perf goes away and we get a more accurate perf profile.

Signed-off-by: Waiman Long <Waima...@hp.com>
---
fs/dcache.c | 213 ++++++++++++++++++++++++++++++++++++++++++-----------------
1 files changed, 151 insertions(+), 62 deletions(-)

diff --git a/fs/dcache.c b/fs/dcache.c
index 96655f4..9703aa6 100644
--- a/fs/dcache.c
+++ b/fs/dcache.c
@@ -2510,9 +2510,57 @@ static int prepend(char **buffer, int *buflen, const char *str, int namelen)
return 0;
}

-static int prepend_name(char **buffer, int *buflen, struct qstr *name)
+static int prepend_name(char **buffer, int *buflen, struct dentry *dentry)
{
- return prepend(buffer, buflen, name->name, name->len);
+ /*
+ * With RCU path tracing, it may race with rename. Use
+ * ACCESS_ONCE() to make sure that it is either the old or
+ * the new name pointer. The length does not really matter as
+ * the sequence number check will eventually catch any ongoing
+ * rename operation.
+ */
+ const char *dname = ACCESS_ONCE(dentry->d_name.name);
+ u32 dlen = dentry->d_name.len;
+ int error;
+
+ if (likely(dname == (const char *)dentry->d_iname)) {
+ /*
+ * Internal dname, the string memory is valid as long
+ * as its length is not over the limit.
+ */
+ if (unlikely(dlen > sizeof(dentry->d_iname)))
+ return -EINVAL;
+ } else if (!dname)
+ return -EINVAL;
+ else {
+ const char *ptr;
+ u32 len;
+
+ /*
+ * External dname, need to fetch name pointer and length
+ * again under d_lock to get a consistent set and avoid
+ * racing with d_move() which will take d_lock before
+ * acting on the dentries.
+ */
+ spin_lock(&dentry->d_lock);
+ dname = dentry->d_name.name;
+ dlen = dentry->d_name.len;
+ spin_unlock(&dentry->d_lock);
+
+ if (unlikely(!dname || !dlen))
+ return -EINVAL;
+ /*
+ * As the length and the content of the string may not be
+ * valid, need to scan the string and return EINVAL if there
+ * is embedded null byte within the length of the string.
+ */
+ for (ptr = dname, len = dlen; len; len--, ptr++) {
+ if (*ptr == '\0')
+ return -EINVAL;
+ }
+ }
+ error = prepend(buffer, buflen, dname, dlen);
+ return error ? error : prepend(buffer, buflen, "/", 1);
}

/**
@@ -2522,7 +2570,14 @@ static int prepend_name(char **buffer, int *buflen, struct qstr *name)
* @buffer: pointer to the end of the buffer
* @buflen: pointer to buffer length
*
- * Caller holds the rename_lock.
+ * The function tries to write out the pathname without taking any lock other
+ * than the RCU read lock to make sure that dentries won't go away. It only
+ * checks the sequence number of the global rename_lock as any change in the
+ * dentry's d_seq will be preceded by changes in the rename_lock sequence
+ * number. If the sequence number has changed, it will restart the whole
+ * pathname back-tracing sequence again. It performs a total of 3 trials of
+ * lockless back-tracing sequences before falling back to take the
+ * rename_lock.
*/
static int prepend_path(const struct path *path,
const struct path *root,
@@ -2531,54 +2586,80 @@ static int prepend_path(const struct path *path,
struct dentry *dentry = path->dentry;
struct vfsmount *vfsmnt = path->mnt;
struct mount *mnt = real_mount(vfsmnt);
- bool slash = false;
int error = 0;
+ unsigned seq = 0;
+ char *bptr;
+ int blen;
+ int retry_cnt = 3; /* Try lockless path conversion 3 times */

+restart:
+ bptr = *buffer;
+ blen = *buflen;
+ if (retry_cnt) {
+ seq = read_seqbegin(&rename_lock);
+ rcu_read_lock();
+ } else
+ write_seqlock(&rename_lock);
while (dentry != root->dentry || vfsmnt != root->mnt) {
struct dentry * parent;

if (dentry == vfsmnt->mnt_root || IS_ROOT(dentry)) {
/* Global root? */
- if (!mnt_has_parent(mnt))
- goto global_root;
- dentry = mnt->mnt_mountpoint;
- mnt = mnt->mnt_parent;
- vfsmnt = &mnt->mnt;
- continue;
+ if (mnt_has_parent(mnt)) {
+ dentry = mnt->mnt_mountpoint;
+ mnt = mnt->mnt_parent;
+ vfsmnt = &mnt->mnt;
+ continue;
+ }
+ /*
+ * Filesystems needing to implement special "root names"
+ * should do so with ->d_dname()
+ */
+ if (IS_ROOT(dentry) &&
+ (dentry->d_name.len != 1 ||
+ dentry->d_name.name[0] != '/')) {
+ WARN(1, "Root dentry has weird name <%.*s>\n",
+ (int) dentry->d_name.len,
+ dentry->d_name.name);
+ }
+ if (!error)
+ error = is_mounted(vfsmnt) ? 1 : 2;
+ break;
}
parent = dentry->d_parent;
+ if (unlikely(parent == NULL)) {
+ error = -EINVAL;
+ break; /* Check sequence number */
+ }
prefetch(parent);
- spin_lock(&dentry->d_lock);
- error = prepend_name(buffer, buflen, &dentry->d_name);
- spin_unlock(&dentry->d_lock);
- if (!error)
- error = prepend(buffer, buflen, "/", 1);
+ error = prepend_name(&bptr, &blen, dentry);
if (error)
break;

- slash = true;
dentry = parent;
}
+ if (retry_cnt) {
+ retry_cnt--;
+ rcu_read_unlock();
+ if (read_seqretry(&rename_lock, seq))
+ goto restart;
+ } else
+ write_sequnlock(&rename_lock);

- if (!error && !slash)
- error = prepend(buffer, buflen, "/", 1);
-
+ if (error >= 0 && bptr == *buffer) {
+ if (--blen < 0)
+ error = -ENAMETOOLONG;
+ else
+ *--bptr = '/';
+ }
+ if (error == -EINVAL)
+ goto garbage;
+ *buffer = bptr;
+ *buflen = blen;
return error;

-global_root:
- /*
- * Filesystems needing to implement special "root names"
- * should do so with ->d_dname()
- */
- if (IS_ROOT(dentry) &&
- (dentry->d_name.len != 1 || dentry->d_name.name[0] != '/')) {
- WARN(1, "Root dentry has weird name <%.*s>\n",
- (int) dentry->d_name.len, dentry->d_name.name);
- }
- if (!slash)
- error = prepend(buffer, buflen, "/", 1);
- if (!error)
- error = is_mounted(vfsmnt) ? 1 : 2;
+garbage:
+ error = prepend(buffer, buflen, "(garbage)", 10);
return error;
}

@@ -2607,9 +2688,7 @@ char *__d_path(const struct path *path,

prepend(&res, &buflen, "\0", 1);
br_read_lock(&vfsmount_lock);
- write_seqlock(&rename_lock);
error = prepend_path(path, root, &res, &buflen);
- write_sequnlock(&rename_lock);
br_read_unlock(&vfsmount_lock);

if (error < 0)
@@ -2628,9 +2707,7 @@ char *d_absolute_path(const struct path *path,

prepend(&res, &buflen, "\0", 1);
br_read_lock(&vfsmount_lock);
- write_seqlock(&rename_lock);
error = prepend_path(path, &root, &res, &buflen);
- write_sequnlock(&rename_lock);
br_read_unlock(&vfsmount_lock);

if (error > 1)
@@ -2696,9 +2773,7 @@ char *d_path(const struct path *path, char *buf, int buflen)

get_fs_root(current->fs, &root);
br_read_lock(&vfsmount_lock);
- write_seqlock(&rename_lock);
error = path_with_deleted(path, &root, &res, &buflen);
- write_sequnlock(&rename_lock);
br_read_unlock(&vfsmount_lock);
if (error < 0)
res = ERR_PTR(error);
@@ -2733,10 +2808,10 @@ char *simple_dname(struct dentry *dentry, char *buffer, int buflen)
char *end = buffer + buflen;
/* these dentries are never renamed, so d_lock is not needed */
if (prepend(&end, &buflen, " (deleted)", 11) ||
- prepend_name(&end, &buflen, &dentry->d_name) ||
+ prepend(&end, &buflen, dentry->d_name.name, dentry->d_name.len) ||
prepend(&end, &buflen, "/", 1))
end = ERR_PTR(-ENAMETOOLONG);
- return end;
+ return end;
}

/*
@@ -2744,30 +2819,55 @@ char *simple_dname(struct dentry *dentry, char *buffer, int buflen)
*/
static char *__dentry_path(struct dentry *dentry, char *buf, int buflen)
{
- char *end = buf + buflen;
- char *retval;
+ char *end, *retval;
+ int len, seq = 0;
+ int error = 0;
+ int retry_cnt = 3; /* Try lockless 3 times before taking lock */

- prepend(&end, &buflen, "\0", 1);
+restart:
+ end = buf + buflen;
+ len = buflen;
+ prepend(&end, &len, "\0", 1);
if (buflen < 1)
goto Elong;
/* Get '/' right */
retval = end-1;
*retval = '/';
-
+ if (retry_cnt) {
+ seq = read_seqbegin(&rename_lock);
+ rcu_read_lock();
+ } else
+ write_seqlock(&rename_lock);
while (!IS_ROOT(dentry)) {
struct dentry *parent = dentry->d_parent;
int error;

+ if (unlikely(parent == NULL)) {
+ error = -EINVAL;
+ break;
+ }
prefetch(parent);
- spin_lock(&dentry->d_lock);
- error = prepend_name(&end, &buflen, &dentry->d_name);
- spin_unlock(&dentry->d_lock);
- if (error != 0 || prepend(&end, &buflen, "/", 1) != 0)
- goto Elong;
+ error = prepend_name(&end, &len, dentry);
+ if (error)
+ break;

retval = end;
dentry = parent;
}
+ if (retry_cnt) {
+ retry_cnt--;
+ rcu_read_unlock();
+ if (read_seqretry(&rename_lock, seq))
+ goto restart;
+ } else
+ write_sequnlock(&rename_lock);
+ if (error == -EINVAL) {
+ error = prepend(&end, &len, "//garbage", 10);
+ if (!error)
+ return end;
+ }
+ if (error)
+ goto Elong;
return retval;
Elong:
return ERR_PTR(-ENAMETOOLONG);
@@ -2775,13 +2875,7 @@ Elong:

char *dentry_path_raw(struct dentry *dentry, char *buf, int buflen)
{
- char *retval;
-
- write_seqlock(&rename_lock);
- retval = __dentry_path(dentry, buf, buflen);
- write_sequnlock(&rename_lock);
-
- return retval;
+ return __dentry_path(dentry, buf, buflen);
}
EXPORT_SYMBOL(dentry_path_raw);

@@ -2790,7 +2884,6 @@ char *dentry_path(struct dentry *dentry, char *buf, int buflen)
char *p = NULL;
char *retval;

- write_seqlock(&rename_lock);
if (d_unlinked(dentry)) {
p = buf + buflen;
if (prepend(&p, &buflen, "//deleted", 10) != 0)
@@ -2798,7 +2891,6 @@ char *dentry_path(struct dentry *dentry, char *buf, int buflen)
buflen++;
}
retval = __dentry_path(dentry, buf, buflen);
- write_sequnlock(&rename_lock);
if (!IS_ERR(retval) && p)
*p = '/'; /* restore '/' overriden with '\0' */
return retval;
@@ -2837,7 +2929,6 @@ SYSCALL_DEFINE2(getcwd, char __user *, buf, unsigned long, size)

error = -ENOENT;
br_read_lock(&vfsmount_lock);
- write_seqlock(&rename_lock);
if (!d_unlinked(pwd.dentry)) {
unsigned long len;
char *cwd = page + PAGE_SIZE;
@@ -2845,7 +2936,6 @@ SYSCALL_DEFINE2(getcwd, char __user *, buf, unsigned long, size)

prepend(&cwd, &buflen, "\0", 1);
error = prepend_path(&pwd, &root, &cwd, &buflen);
- write_sequnlock(&rename_lock);
br_read_unlock(&vfsmount_lock);

if (error < 0)
@@ -2866,7 +2956,6 @@ SYSCALL_DEFINE2(getcwd, char __user *, buf, unsigned long, size)
error = -EFAULT;
}
} else {
- write_sequnlock(&rename_lock);
br_read_unlock(&vfsmount_lock);
}

--
1.7.1

Waiman Long
Sep 5, 2013, 2:56:32 PM
to Alexander Viro, Linus Torvalds, Waiman Long, linux-...@vger.kernel.org, linux-...@vger.kernel.org, Chandramouleeswaran, Aswin, Norton, Scott J, George Spelvin, John Stoffel
Change History

v1->v2:
- Check for internal vs external dname, taking d_lock only for
external dname for safety.
- Replace memchr() by a byte-by-byte checking loop.
- Try lockless dentry to pathname conversion 3 times before falling
back to taking the rename_lock to prevent live-lock.
- Make code re-factoring suggested by George Spelvin.

Waiman Long (1):
dcache: Translating dentry into pathname without taking rename_lock

fs/dcache.c | 213 ++++++++++++++++++++++++++++++++++++++++++-----------------
1 files changed, 151 insertions(+), 62 deletions(-)

Linus Torvalds
Sep 5, 2013, 3:35:15 PM
to Waiman Long, Alexander Viro, linux-fsdevel, Linux Kernel Mailing List, Chandramouleeswaran, Aswin, Norton, Scott J, George Spelvin, John Stoffel
No. Stop all these stupid games.

No d_lock, no trying to make d_name/d_len consistent.

No "compare d_name against d_iname".

No EINVAL.

No idiotic racy "let's fetch each byte one-by-one and test them
against NUL", which is just racy and stupid.

Just do what I told you to do: *copy* the name one byte at a time, and
stop copying if you hit NUL. IOW, make prepend() just use "strncpy()"
instead of "memcpy()". You don't even need to check the end result -
if it's bogus, the sequence count will fix it.

No special cases. No games. No crap. Just make "prepend()" able to
handle the special case of "oops, the name length didn't match, but we
must not ever go past the end of the buffer".

We can optimize strncpy() to use word accesses if it ends up ever
being a performance issue. I suspect it never even shows up, but it's
not hard to do if it does.

Linus

Al Viro
Sep 5, 2013, 4:04:18 PM
to Waiman Long, Linus Torvalds, linux-...@vger.kernel.org, linux-...@vger.kernel.org, Chandramouleeswaran, Aswin, Norton, Scott J, George Spelvin, John Stoffel
On Thu, Sep 05, 2013 at 02:55:16PM -0400, Waiman Long wrote:
> + const char *dname = ACCESS_ONCE(dentry->d_name.name);
> + u32 dlen = dentry->d_name.len;
> + int error;
> +
> + if (likely(dname == (const char *)dentry->d_iname)) {
> + /*
> + * Internal dname, the string memory is valid as long
> + * as its length is not over the limit.
> + */
> + if (unlikely(dlen > sizeof(dentry->d_iname)))
> + return -EINVAL;
> + } else if (!dname)
> + return -EINVAL;
Can't happen.
> + else {
> + const char *ptr;
> + u32 len;
> +
> + /*
> + * External dname, need to fetch name pointer and length
> + * again under d_lock to get a consistent set and avoid
> + * racing with d_move() which will take d_lock before
> + * acting on the dentries.
> + */
> + spin_lock(&dentry->d_lock);
> + dname = dentry->d_name.name;
> + dlen = dentry->d_name.len;
> + spin_unlock(&dentry->d_lock);
> +
> + if (unlikely(!dname || !dlen))
> + return -EINVAL;
Can't happen.

> + /*
> + * As the length and the content of the string may not be
> + * valid, need to scan the string and return EINVAL if there
> + * is embedded null byte within the length of the string.
> + */
> + for (ptr = dname, len = dlen; len; len--, ptr++) {
> + if (*ptr == '\0')
> + return -EINVAL;

Egads... First of all, this is completely pointless - if you've grabbed
->d_name.name and ->d_name.len under ->d_lock, you don't *need* that crap.
At all. The whole point of that exercise is to avoid taking ->d_lock;
_that_ is where the "read byte by byte until you hit NUL" comes from.
And if you do that, you can bloody well just go ahead and store them in
the target array *as* *you* *go*. No reason to bother with memcpy()
afterwards.

Damnit, just grab len and name (no ->d_lock, etc.). Check if you've got
enough space in the buffer, treat "not enough" as an overflow. Then
proceed to copy the damn thing over there (starting at *buffer -= len)
byte by byte, stopping when you've copied len bytes *or* when the byte you've
got happens to be NUL. Don't bother with EINVAL, etc. - just return to
caller and let rename_lock logics take care of the races. That's it - nothing
more is needed.

Waiman Long
Sep 5, 2013, 4:29:29 PM
to Linus Torvalds, Alexander Viro, linux-fsdevel, Linux Kernel Mailing List, Chandramouleeswaran, Aswin, Norton, Scott J, George Spelvin, John Stoffel
On 09/05/2013 03:35 PM, Linus Torvalds wrote:
> No. Stop all these stupid games.
>
> No d_lock, no trying to make d_name/d_len consistent.
>
> No "compare d_name against d_iname".
>
> No EINVAL.
>
> No idiotic racy "let's fetch each byte one-by-one and test them
> against NUL", which is just racy and stupid.
>
> Just do what I told you to do: *copy* the name one byte at a time, and
> stop copying if you hit NUL. IOW, make prepend() just use "strncpy()"
> instead of "memcpy()". You don't even need to check the end result -
> if it's bogus, the sequence count will fix it.
>
> No special cases. No games. No crap. Just make "prepend()" able to
> handle the special case of "oops, the name length didn't match, but we
> must not ever go past the end of the buffer".
>
> We can optimize strncpy() to use word accesses if it ends up ever
> being a performance issue. I suspect it never even shows up, but it's
> not hard to do if it does.
>
> Linus

It is not as simple as doing a strncpy(). The pathname was built from
the leaf up to the root, and from the end of the buffer toward the
beginning. As it goes through the while loop, the buffer will look like:

" /c"
" /b/c"
"/a/b/c"

If the content of the string is unreliable, I have to do at least 2 passes:
1) Locate the end of the string and determine the actual length
2) Copy the whole string or byte-by-byte backward

That is why I am looking for the null byte. Using strncpy() alone won't
work. However, if the actual length doesn't match that of the dlen, the
name string will be invalid and there is no point in proceeding any further.

I also considered the possible, but unlikely, scenario that during a
rename operation, a short old name could be freed and replaced by a long
new name. The old name could then be allocated to another user who fills
it with non-NULL bytes. If the buffer happens to sit at the boundary
between a valid and an invalid memory page, the code may step over to
an invalid address while looking for the null byte. The current code
probably won't free the buffer while the RCU read lock is held, but a
future code change may invalidate that assumption. Blindly taking the
d_lock may be too expensive, as in the original code. That is why I came
up with the internal vs. external dname check, taking the lock only for
external dnames for safety.

I can change the code to just locate the end of the string and then copy
it backward, but that would still not be a plain strncpy().

-Longman

Linus Torvalds
Sep 5, 2013, 4:42:23 PM
to Waiman Long, Alexander Viro, linux-fsdevel, Linux Kernel Mailing List, Chandramouleeswaran, Aswin, Norton, Scott J, George Spelvin, John Stoffel
On Thu, Sep 5, 2013 at 1:29 PM, Waiman Long <waima...@hp.com> wrote:
>
> It is not as simple as doing a strncpy().

Yes it damn well is.

Stop the f*cking stupid arguments, and instead listen to what I say.

Here. Let me bold-face the most important part for you, so that you
don't miss it in all the other crap:

MAKE prepend() JUST USE "strncpy()" INSTEAD OF "memcpy()".

Nothing else. Seriously. Your "you can't do it because we copy
backwards" arguments are pure and utter garbage, exactly BECAUSE YOU
DON'T CHANGE ANY OF THAT. You can actually use the unreliable length
variable BUT YOU MUST STILL STOP AT A ZERO.

Get it?

You're complicating the whole thing for no good reason. I'm telling
you (and HAVE BEEN telling you multiple times) that you cannot use
"memcpy()" because the length may not be reliable, so you need to
check for zero in the middle and stop early. All your arguments have
been totally pointless, because you don't seem to see that simple and
fundamental issue. You don't change ANYTHING else. But you damn well
not do a "memcpy", you do something that stops when it hits a NUL
character.

We call that function "strncpy()". I'd actually prefer to write it out
by hand (because somebody could implement "strncpy()" as a
questionable function that accesses past the NUL as long as it's
within the 'n'), and because I think we might want to do that
word-at-a-time version of it, but for a first approximation, just do
that one-liner version.

Don't do anything else. Don't do locking. Don't do memchr. Just make
sure that you stop at a NUL character, and don't trust the length,
because the length may not match the pointer. That's was always ALL
you needed to do.

Linus

Waiman Long
Sep 5, 2013, 4:43:38 PM
to Al Viro, Linus Torvalds, linux-...@vger.kernel.org, linux-...@vger.kernel.org, Chandramouleeswaran, Aswin, Norton, Scott J, George Spelvin, John Stoffel
That is what I thought too. I am just not totally sure about it. So yes,
I can scrap all these additional checks.

As the internal dname buffer is at least 32 bytes, most dentries will
use the internal buffer instead of allocating from kmem. IOW, the d_lock
taking code path is unlikely to be used.

> Damnit, just grab len and name (no ->d_lock, etc.). Check if you've got
> enough space in the buffer, treat "not enough" as an overflow. Then
> proceed to copy the damn thing over there (starting at *buffer -= len)
> byte by byte, stopping when you've copied len bytes *or* when the byte you've
> got happens to be NUL. Don't bother with EINVAL, etc. - just return to
> caller and let rename_lock logics take care of the races. That's it - nothing
> more is needed.

OK, I will do that.

-Longman

Al Viro
Sep 5, 2013, 4:46:22 PM
to Waiman Long, Linus Torvalds, linux-fsdevel, Linux Kernel Mailing List, Chandramouleeswaran, Aswin, Norton, Scott J, George Spelvin, John Stoffel
On Thu, Sep 05, 2013 at 04:29:06PM -0400, Waiman Long wrote:

> It is not as simple as doing a strncpy(). The pathname was built
> from the leaf up to the root, and from the end of buffer toward the
> beginning. As it goes through the while loop, the buffer will look
> like:
>
> " /c"
> " /b/c"
> "/a/b/c"
>
> If the content of the string is unreliable, I have to do at least 2 passes:
> 1) Locate the end of the string and determine the actual length
> 2) Copy the whole string or byte-by-byte backward

No, you do not need anything of that kind. All you need is
a) don't step out of the array (which will contain NUL at the end
at all times, no matter what) and
b) generate correct output *IF* no d_move() happens while you
do that.

Nothing else matters at all. You trust the length to be correct in absence
of d_move(). You can not trust it to match the size of ->d_name.name when
d_move() is happening, but you can trust everything up to ->d_name.len *or*
the first NUL, whichever happens first, to be safe to access.

Again, the contents copied into the buffer needs to be valid only if d_move()
hasn't happened; if it has, we don't give a fuck - read_seqretry() will take
care of that. All you need to care about in that case is not oopsing the
damn thing.

static int prepend_name(char **buffer, int *buflen, struct qstr *name)
{
const char *s = ACCESS_ONCE(name->name);
unsigned len = ACCESS_ONCE(name->len);
char *p, c;

*buflen -= len;
if (*buflen < 0)
return -ENAMETOOLONG;
p = *buffer -= len;
while (len--) {
c = *s++;
if (!c)
break;
*p++ = c;
}
return 0;
}

And that's *all* - just call that under rcu_read_lock() and within
seq = read_seqbegin(&rename_lock)/read_seqretry(&rename_lock, seq)
loop over the whole prepend_path/path_with_deleted/__dentry_path
thing.

Linus Torvalds
Sep 5, 2013, 5:27:34 PM
to Al Viro, Waiman Long, linux-fsdevel, Linux Kernel Mailing List, Chandramouleeswaran, Aswin, Norton, Scott J, George Spelvin, John Stoffel
On Thu, Sep 5, 2013 at 1:46 PM, Al Viro <vi...@zeniv.linux.org.uk> wrote:
>
> static int prepend_name(char **buffer, int *buflen, struct qstr *name)
> {
> const char *s = ACCESS_ONCE(name->name);
> unsigned len = ACCESS_ONCE(name->len);
> char *p;
>
> *buflen -= len;
> if (*buflen < 0)
> return -ENAMETOOLONG;
> p = *buffer -= len;
> while (len--) {
> c = *s++;
> if (!c)
> break;
> *p++ = c;
> }
> return 0;
> }

... and appended is a *totally* untested "dentry_string_copy()" that
does things a word at a time (it doesn't have your "buffer/buflen -=
len" parts, this is purely the copying code).

Note that the return value is only optimistic - it may be copying NUL
bytes *without* returning an error. We don't care. We really only care
that it not access past the end of the string using the same rules as
the RCU lookup does: it *can* access aligned words. It's just meant as
a "hey, if we had to go to the bother of checking for a NUL byte and
we found it, we might as well let the user know so that he could
choose to stop the optimistic RCU thing early".

Also note that this thing depends on CONFIG_DCACHE_WORD_ACCESS (which
in turn depends on little-endian), _and_ only works if the
architecture supports fast unaligned stores (which is true on x86, but
may not be true on ARM). And it does require that it can overshoot the
destination string - it won't *change* the destination string past the
end, but it will do an unaligned store there needs padding. And that
might be the biggest problem with this routine and the thing that
scuttles it. We could do that final store as a series of byte writes
instead, I guess. So maybe it should do

unsigned char *t = (unsigned char *)ct;
do {
*t++ = a;
a >>= 8;
} while (--tcount);

at the end instead of doing the final word store.

I haven't actually tested it, but it looks right (within the above
constraints), and it generates asm that looks good enough. It is not
entirely obvious that it's really worth it over the "stupid"
character-at-a-time model, though: the loop counts are likely not
really all that high, and the unaligned store together with loading
the big constants to do the "test word for zero" might just add enough
overhead that it's not a big win. But the word-at-a-time approach was
a noticeable win for the rcu dentry lookup, so it's possible that it's
noticeable here too.

Linus

----
/*
* The source comes from a dentry, which means that it is guaranteed to
* be aligned. However, the destination is not likely to be aligned.
*
* The only rule is: we must *not* overstep the source by more than that
* aligned word access if it has a NUL character in it.
*
* NOTE NOTE NOTE! This also requires that we can overshoot the destination
* string by up to one unaligned word access!
*/
int dentry_string_copy(const unsigned char *_cs, unsigned char *_ct,
unsigned tcount)
{
const struct word_at_a_time constants = WORD_AT_A_TIME_CONSTANTS;
const unsigned long *cs = (const void *)_cs;
unsigned long *ct = (void *)_ct;
unsigned long a,mask;

for (;;) {
unsigned long adata;
a = *cs;
if (tcount < sizeof(unsigned long))
break;
*ct = a;
if (unlikely(has_zero(a, &adata, &constants)))
return -EINVAL;
cs++; ct++;
tcount -= sizeof(unsigned long);
if (!tcount)
return 0;
}
/* We don't care if there's a NUL in the last word */
mask = ~(~0ul << tcount*8);
*ct = (a & mask) | (~mask & *ct);
return 0;
}

Waiman Long
Sep 5, 2013, 10:01:56 PM
to Linus Torvalds, Alexander Viro, linux-fsdevel, Linux Kernel Mailing List, Chandramouleeswaran, Aswin, Norton, Scott J, George Spelvin, John Stoffel
I am sorry that I misunderstand what you said. I will do what you and Al
advise me to do.

-Longman

Linus Torvalds
Sep 6, 2013, 12:54:24 AM
to Waiman Long, Alexander Viro, linux-fsdevel, Linux Kernel Mailing List, Chandramouleeswaran, Aswin, Norton, Scott J, George Spelvin, John Stoffel
On Thu, Sep 5, 2013 at 7:01 PM, Waiman Long <waima...@hp.com> wrote:
>
> I am sorry that I misunderstand what you said. I will do what you and Al
> advise me to do.

I'm sorry I shouted at you. I was getting a bit frustrated there..

Linus

Waiman Long
Sep 6, 2013, 12:09:38 PM
to Alexander Viro, Linus Torvalds, Waiman Long, linux-...@vger.kernel.org, linux-...@vger.kernel.org, Chandramouleeswaran, Aswin, Norton, Scott J, George Spelvin, John Stoffel
Change History

v2->v3:
- Simplify prepend_name() to blindly copy the dname over until it
  reaches a null byte or the specified length, leaving the sequence
  check to handle the error case.

v1->v2:
- Check for internal vs external dname, taking d_lock only for
external dname for safety.
- Replace memchr() by a byte-by-byte checking loop.
- Try lockless dentry to pathname conversion 3 times before falling
back to taking the rename_lock to prevent live-lock.
- Make code re-factoring suggested by George Spelvin.

Waiman Long (1):
dcache: Translating dentry into pathname without taking rename_lock

fs/dcache.c | 178 ++++++++++++++++++++++++++++++++++++++--------------------
1 files changed, 116 insertions(+), 62 deletions(-)

Waiman Long
Sep 6, 2013, 12:10:02 PM
to Alexander Viro, Linus Torvalds, Waiman Long, linux-...@vger.kernel.org, linux-...@vger.kernel.org, Chandramouleeswaran, Aswin, Norton, Scott J, George Spelvin, John Stoffel
To further reduce spinlock contention, this patch does not take the
dentry's d_lock when copying the filename from the dentries. Instead,
it treats the name pointer and length as unreliable and just copies
the string byte-by-byte until it hits a null byte or the end of the
string as specified by the length. This avoids stepping into an
invalid memory address. The error cases are left to be handled by
the sequence number check.

The following code re-factorings are also made:
1. Move prepend('/') into prepend_name() to remove one conditional
check.
2. Move the global root check in prepend_path() back to the top of
the while loop.

With this patch, _raw_spin_lock now accounts for only 1.2% of the
total CPU cycles for the short workload. This patch also reduces the
effect that running perf has on its own profile, since the perf
command itself can be a heavy user of the d_path() function, depending
on the complexity of the workload.

When taking the perf profile of the high-systime workload, the amount
of spinlock contention contributed by running perf without this patch
was about 16%. With this patch, the spinlock contention caused by
running perf goes away and we get a more accurate perf profile.

Signed-off-by: Waiman Long <Waima...@hp.com>
---
fs/dcache.c | 178 ++++++++++++++++++++++++++++++++++++++--------------------
1 files changed, 116 insertions(+), 62 deletions(-)

diff --git a/fs/dcache.c b/fs/dcache.c
index 96655f4..da095ce 100644
--- a/fs/dcache.c
+++ b/fs/dcache.c
@@ -89,6 +89,12 @@ EXPORT_SYMBOL(rename_lock);
static struct kmem_cache *dentry_cache __read_mostly;

/*
+ * Try 3 times of lockless dentry to pathname conversion before falling
+ * back to take the rename_lock.
+ */
+#define D_LOCKLESS_PREPEND_RETRY 3
+
+/*
* This is the single most critical data structure when it comes
* to the dcache: the hashtable for lookups. Somebody should try
* to make this good - I've just made it work.
@@ -2512,7 +2518,33 @@ static int prepend(char **buffer, int *buflen, const char *str, int namelen)

static int prepend_name(char **buffer, int *buflen, struct qstr *name)
{
- return prepend(buffer, buflen, name->name, name->len);
+ /*
+ * With RCU path tracing, it may race with d_move(). Use
+ * ACCESS_ONCE() to make sure that either the old or the new name
+ * pointer and length are fetched. However, there may be mismatch
+ * between length and pointer. The length cannot be trusted, we need
+ * to copy it byte-by-byte until the length is reached or a null
+ * byte is found. It also prepends "/" at the beginning of the name.
+ * The sequence number check at the caller will retry it again when
+ * a d_move() does happen. So any garbage in the buffer due to
+ * mismatched pointer and length will be discarded.
+ */
+ const char *dname = ACCESS_ONCE(name->name);
+ u32 dlen = ACCESS_ONCE(name->len);
+ char *p;
+
+ if (*buflen < dlen + 1)
+ return -ENAMETOOLONG;
+ *buflen -= dlen + 1;
+ p = *buffer -= dlen + 1;
+ *p++ = '/';
+ while (dlen--) {
+ char c = *dname++;
+ if (!c)
+ break;
+ *p++ = c;
+ }
+ return 0;
}

/**
@@ -2522,7 +2554,14 @@ static int prepend_name(char **buffer, int *buflen, struct qstr *name)
* @buffer: pointer to the end of the buffer
* @buflen: pointer to buffer length
*
- * Caller holds the rename_lock.
+ * The function tries to write out the pathname without taking any lock other
+ * than the RCU read lock to make sure that dentries won't go away. It only
+ * checks the sequence number of the global rename_lock as any change in the
+ * dentry's d_seq will be preceded by changes in the rename_lock sequence
+ * number. If the sequence number has changed, it will restart the whole
+ * pathname back-tracing sequence again. It performs a total of 3 trials of
+ * lockless back-tracing sequences before falling back to take the
+ * rename_lock.
*/
static int prepend_path(const struct path *path,
const struct path *root,
@@ -2531,54 +2570,70 @@ static int prepend_path(const struct path *path,
struct dentry *dentry = path->dentry;
struct vfsmount *vfsmnt = path->mnt;
struct mount *mnt = real_mount(vfsmnt);
- bool slash = false;
int error = 0;
+ unsigned seq = 0;
+ char *bptr;
+ int blen;
+ int retry_cnt = D_LOCKLESS_PREPEND_RETRY;

+restart:
+ bptr = *buffer;
+ blen = *buflen;
+ if (retry_cnt) {
+ seq = read_seqbegin(&rename_lock);
+ rcu_read_lock();
+ } else
+ write_seqlock(&rename_lock);
while (dentry != root->dentry || vfsmnt != root->mnt) {
struct dentry * parent;

if (dentry == vfsmnt->mnt_root || IS_ROOT(dentry)) {
/* Global root? */
- if (!mnt_has_parent(mnt))
- goto global_root;
- dentry = mnt->mnt_mountpoint;
- mnt = mnt->mnt_parent;
- vfsmnt = &mnt->mnt;
- continue;
+ if (mnt_has_parent(mnt)) {
+ dentry = mnt->mnt_mountpoint;
+ mnt = mnt->mnt_parent;
+ vfsmnt = &mnt->mnt;
+ continue;
+ }
+ /*
+ * Filesystems needing to implement special "root names"
+ * should do so with ->d_dname()
+ */
+ if (IS_ROOT(dentry) &&
+ (dentry->d_name.len != 1 ||
+ dentry->d_name.name[0] != '/')) {
+ WARN(1, "Root dentry has weird name <%.*s>\n",
+ (int) dentry->d_name.len,
+ dentry->d_name.name);
+ }
+ if (!error)
+ error = is_mounted(vfsmnt) ? 1 : 2;
+ break;
}
parent = dentry->d_parent;
prefetch(parent);
- spin_lock(&dentry->d_lock);
- error = prepend_name(buffer, buflen, &dentry->d_name);
- spin_unlock(&dentry->d_lock);
- if (!error)
- error = prepend(buffer, buflen, "/", 1);
+ error = prepend_name(&bptr, &blen, &dentry->d_name);
if (error)
break;

- slash = true;
dentry = parent;
}
+ if (retry_cnt) {
+ retry_cnt--;
+ rcu_read_unlock();
+ if (read_seqretry(&rename_lock, seq))
+ goto restart;
+ } else
+ write_sequnlock(&rename_lock);

- if (!error && !slash)
- error = prepend(buffer, buflen, "/", 1);
-
- return error;
-
-global_root:
- /*
- * Filesystems needing to implement special "root names"
- * should do so with ->d_dname()
- */
- if (IS_ROOT(dentry) &&
- (dentry->d_name.len != 1 || dentry->d_name.name[0] != '/')) {
- WARN(1, "Root dentry has weird name <%.*s>\n",
- (int) dentry->d_name.len, dentry->d_name.name);
- }
- if (!slash)
- error = prepend(buffer, buflen, "/", 1);
- if (!error)
- error = is_mounted(vfsmnt) ? 1 : 2;
+ if (error >= 0 && bptr == *buffer) {
+ if (--blen < 0)
+ error = -ENAMETOOLONG;
+ else
+ *--bptr = '/';
+ }
+ *buffer = bptr;
+ *buflen = blen;
return error;
}

@@ -2607,9 +2662,7 @@ char *__d_path(const struct path *path,

prepend(&res, &buflen, "\0", 1);
br_read_lock(&vfsmount_lock);
- write_seqlock(&rename_lock);
error = prepend_path(path, root, &res, &buflen);
- write_sequnlock(&rename_lock);
br_read_unlock(&vfsmount_lock);

if (error < 0)
@@ -2628,9 +2681,7 @@ char *d_absolute_path(const struct path *path,

prepend(&res, &buflen, "\0", 1);
br_read_lock(&vfsmount_lock);
- write_seqlock(&rename_lock);
error = prepend_path(path, &root, &res, &buflen);
- write_sequnlock(&rename_lock);
br_read_unlock(&vfsmount_lock);

if (error > 1)
@@ -2696,9 +2747,7 @@ char *d_path(const struct path *path, char *buf, int buflen)

get_fs_root(current->fs, &root);
br_read_lock(&vfsmount_lock);
- write_seqlock(&rename_lock);
error = path_with_deleted(path, &root, &res, &buflen);
- write_sequnlock(&rename_lock);
br_read_unlock(&vfsmount_lock);
if (error < 0)
res = ERR_PTR(error);
@@ -2733,10 +2782,10 @@ char *simple_dname(struct dentry *dentry, char *buffer, int buflen)
char *end = buffer + buflen;
/* these dentries are never renamed, so d_lock is not needed */
if (prepend(&end, &buflen, " (deleted)", 11) ||
- prepend_name(&end, &buflen, &dentry->d_name) ||
+ prepend(&end, &buflen, dentry->d_name.name, dentry->d_name.len) ||
prepend(&end, &buflen, "/", 1))
end = ERR_PTR(-ENAMETOOLONG);
- return end;
+ return end;
}

/*
@@ -2744,30 +2793,46 @@ char *simple_dname(struct dentry *dentry, char *buffer, int buflen)
*/
static char *__dentry_path(struct dentry *dentry, char *buf, int buflen)
{
- char *end = buf + buflen;
- char *retval;
+ char *end, *retval;
+ int len, seq = 0;
+ int error = 0;
+ int retry_cnt = D_LOCKLESS_PREPEND_RETRY;

- prepend(&end, &buflen, "\0", 1);
+restart:
+ end = buf + buflen;
+ len = buflen;
+ prepend(&end, &len, "\0", 1);
if (buflen < 1)
goto Elong;
/* Get '/' right */
retval = end-1;
*retval = '/';
-
+ if (retry_cnt) {
+ seq = read_seqbegin(&rename_lock);
+ rcu_read_lock();
+ } else
+ write_seqlock(&rename_lock);
while (!IS_ROOT(dentry)) {
struct dentry *parent = dentry->d_parent;
int error;

prefetch(parent);
- spin_lock(&dentry->d_lock);
- error = prepend_name(&end, &buflen, &dentry->d_name);
- spin_unlock(&dentry->d_lock);
- if (error != 0 || prepend(&end, &buflen, "/", 1) != 0)
- goto Elong;
+ error = prepend_name(&end, &len, &dentry->d_name);
+ if (error)
+ break;

retval = end;
dentry = parent;
}
+ if (retry_cnt) {
+ retry_cnt--;
+ rcu_read_unlock();
+ if (read_seqretry(&rename_lock, seq))
+ goto restart;
+ } else
+ write_sequnlock(&rename_lock);
+ if (error)
+ goto Elong;
return retval;
Elong:
return ERR_PTR(-ENAMETOOLONG);
@@ -2775,13 +2840,7 @@ Elong:

char *dentry_path_raw(struct dentry *dentry, char *buf, int buflen)
{
- char *retval;
-
- write_seqlock(&rename_lock);
- retval = __dentry_path(dentry, buf, buflen);
- write_sequnlock(&rename_lock);
-
- return retval;
+ return __dentry_path(dentry, buf, buflen);
}
EXPORT_SYMBOL(dentry_path_raw);

@@ -2790,7 +2849,6 @@ char *dentry_path(struct dentry *dentry, char *buf, int buflen)
char *p = NULL;
char *retval;

- write_seqlock(&rename_lock);
if (d_unlinked(dentry)) {
p = buf + buflen;
if (prepend(&p, &buflen, "//deleted", 10) != 0)
@@ -2798,7 +2856,6 @@ char *dentry_path(struct dentry *dentry, char *buf, int buflen)
buflen++;
}
retval = __dentry_path(dentry, buf, buflen);
- write_sequnlock(&rename_lock);
if (!IS_ERR(retval) && p)
*p = '/'; /* restore '/' overriden with '\0' */
return retval;
@@ -2837,7 +2894,6 @@ SYSCALL_DEFINE2(getcwd, char __user *, buf, unsigned long, size)

error = -ENOENT;
br_read_lock(&vfsmount_lock);
- write_seqlock(&rename_lock);
if (!d_unlinked(pwd.dentry)) {
unsigned long len;
char *cwd = page + PAGE_SIZE;
@@ -2845,7 +2901,6 @@ SYSCALL_DEFINE2(getcwd, char __user *, buf, unsigned long, size)

prepend(&cwd, &buflen, "\0", 1);
error = prepend_path(&pwd, &root, &cwd, &buflen);
- write_sequnlock(&rename_lock);
br_read_unlock(&vfsmount_lock);

if (error < 0)
@@ -2866,7 +2921,6 @@ SYSCALL_DEFINE2(getcwd, char __user *, buf, unsigned long, size)
error = -EFAULT;
}
} else {
- write_sequnlock(&rename_lock);
br_read_unlock(&vfsmount_lock);
}

--
1.7.1

Linus Torvalds

Sep 6, 2013, 4:52:56 PM
to Waiman Long, Alexander Viro, linux-fsdevel, Linux Kernel Mailing List, Chandramouleeswaran, Aswin, Norton, Scott J, George Spelvin, John Stoffel
On Fri, Sep 6, 2013 at 9:08 AM, Waiman Long <Waima...@hp.com> wrote:
>
> This patch will replace the writer's write_seqlock/write_sequnlock
> sequence of the rename_lock of the callers of the prepend_path() and
> __dentry_path() functions with the reader's read_seqbegin/read_seqretry
> sequence within these 2 functions.

Ok, this actually looks really good.

I do have one comment, from just reading the patch:

I would really like the stuff inside the

restart:
	bptr = *buffer;
	blen = *buflen;
	if (retry_cnt) {
		seq = read_seqbegin(&rename_lock);
		rcu_read_lock();
	} else
		write_seqlock(&rename_lock);

	... guts of path generation ...

	if (retry_cnt) {
		retry_cnt--;
		rcu_read_unlock();
		if (read_seqretry(&rename_lock, seq))
			goto restart;
	} else
		write_sequnlock(&rename_lock);

could possible be done as a separate function?

Alternatively (or perhaps additionally), maybe the locking could be
done as an inline function too (taking the retry count as an argument)
to make things a bit more easy to understand.

Right now there is a lot of fairly subtle things going on in that
__dentry_path() function. It's not a huge function, but I think that
"while()" loop inside that locking could be done as its own function
and make it even more readable.

But I could already apply this as-is, so it's not a big deal.

Al - do you have comments? Do you want to take this through your tree,
or are you working on other things? I can take this directly too..

Linus

Al Viro

Sep 6, 2013, 5:06:13 PM
to Linus Torvalds, Waiman Long, linux-fsdevel, Linux Kernel Mailing List, Chandramouleeswaran, Aswin, Norton, Scott J, George Spelvin, John Stoffel
On Fri, Sep 06, 2013 at 01:52:49PM -0700, Linus Torvalds wrote:

> Al - do you have comments? Do you want to take this through your tree,
> or are you working on other things? I can take this directly too..

I can take that, but I'm really not convinced that we need writer lock
there at all. After all, if we really can get livelocks on that one,
we would be getting them on d_lookup()...

Let's not complicate the things without need; if we ever run into loads
where livelock start to happen, we can look into dealing with them.

Linus Torvalds

Sep 6, 2013, 5:48:40 PM
to Al Viro, Waiman Long, linux-fsdevel, Linux Kernel Mailing List, Chandramouleeswaran, Aswin, Norton, Scott J, George Spelvin, John Stoffel
On Fri, Sep 6, 2013 at 2:05 PM, Al Viro <vi...@zeniv.linux.org.uk> wrote:
>
> I can take that, but I'm really not convinced that we need writer lock
> there at all. After all, if we really can get livelocks on that one,
> we would be getting them on d_lookup()...

d_lookup() does a _single_ path component. That's a *big* difference.

Sure, the hash chain that d_lookup() (well, __d_lookup()) ends up
walking is a bit more complicated than just following the dentry
parent pointer, but that's much harder to trigger than just creating a
really deep directory structure of single-letter nested directories,
and then doing a "getcwd()" that walks 1024+ parents, while another
thread is looping renaming things..

So I personally do feel a lot safer with the fallback to write locking here.

Especially since it's pretty simple, so there isn't really much downside.

Linus

Al Viro

Sep 6, 2013, 8:00:53 PM
to Linus Torvalds, Waiman Long, linux-fsdevel, Linux Kernel Mailing List, Chandramouleeswaran, Aswin, Norton, Scott J, George Spelvin, John Stoffel
On Fri, Sep 06, 2013 at 02:48:32PM -0700, Linus Torvalds wrote:
> On Fri, Sep 6, 2013 at 2:05 PM, Al Viro <vi...@zeniv.linux.org.uk> wrote:
> >
> > I can take that, but I'm really not convinced that we need writer lock
> > there at all. After all, if we really can get livelocks on that one,
> > we would be getting them on d_lookup()...
>
> d_lookup() does a _single_ path component. That's a *big* difference.
>
> Sure, the hash chain that d_lookup() (well, __d_lookup()) ends up
> walking is a bit more complicated than just following the dentry
> parent pointer, but that's much harder to trigger than just creating a
> really deep directory structure of single-letter nested directories,
> and then doing a "getcwd()" that walks 1024+ parents, while another
> thread is looping renaming things..
>
> So I personally do feel a lot safer with the fallback to write locking here.
>
> Especially since it's pretty simple, so there isn't really much downside.

Er... what will happen if you have done just what you've described and have
a process call d_lookup()?

Linus Torvalds

Sep 6, 2013, 8:19:44 PM
to Al Viro, Waiman Long, linux-fsdevel, Linux Kernel Mailing List, Chandramouleeswaran, Aswin, Norton, Scott J, George Spelvin, John Stoffel
On Fri, Sep 6, 2013 at 5:00 PM, Al Viro <vi...@zeniv.linux.org.uk> wrote:
>
> Er... what will happen if you have done just what you've described and have
> a process call d_lookup()?

Umm. Yes?

What part of "one single path component" did you not get?

To repeat: d_lookup() NEVER LOOKS UP A WHOLE PATHNAME. It looks up
just a single path component. It matters not one whit whether you look
up a filename that is 1500 components deep, d_lookup() only ever works
on one single component. It's a single short hash chain lookup.

Sure, it can fail, but people really have to work at it. You're not
going to be able to make it fail by just calling "rename()" in a loop.
You're going to have to do multiple threads at least, and now you need
to do it on multiple different filesystems, since otherwise those
multiple threads are going to be serialized by the (outer)
per-filesystem vfs-layer rename semaphores. In other words, it sounds
impossible to trigger, wouldn't you say? Or if you try, you're going
to stand out for using a *lot* of resources.

In contrast, doing the getcwd() lookup really is following potentially
quite long chains.

So it's quite possible that just a single thread doing rename() in a
loop (on, say, /tmp, so that there isn't any IO) can trigger the
sequence write-lock frequently enough that traversing 1500 d_parent
entries might never complete.

Have I tried it? No. But think about the two different scenarios.
There really is a *big* difference between looking up one single
dentry from its parent using our hash tables, and traversing a
potentially almost unbounded parenthood chain.

(We're bounded in practice by PATH_MAX, so you can't make getcwd()
traverse more than about 2000 parents (single character filename plus
the slash for each level), and for all I know filesystems might cap it
before that, so it's not unbounded, but the difference between "1" and
"2000" is pretty damn big)

Linus

Linus Torvalds

Sep 6, 2013, 8:59:18 PM
to Al Viro, Waiman Long, linux-fsdevel, Linux Kernel Mailing List, Chandramouleeswaran, Aswin, Norton, Scott J, George Spelvin, John Stoffel
On Fri, Sep 6, 2013 at 5:19 PM, Linus Torvalds
<torv...@linux-foundation.org> wrote:
>
> (We're bounded in practice by PATH_MAX, so you can't make getcwd()
> traverse more than about 2000 parents (single character filename plus
> the slash for each level), and for all I know filesystems might cap it
> before that, so it's not unbounded, but the difference between "1" and
> "2000" is pretty damn big)

.. in particular, it's big enough that one is pretty much guaranteed
to fit in any reasonable L1 cache (if we have dentry hash chains so
long that that becomes a problem for traversing a single chain, we're
screwed anyway), while the other can most likely be a case of "not a
single L1 cache hit because by the time you fail and go back to the
start, you've flushed the L1 cache".

Now, whether 2000 L2 cache misses is long enough to give people a
chance to run the whole rename system call path in a loop a few times,
I don't know, but it sure as heck sounds likely.

Of course, you might still ask "why should we even care?" At least
without preemption, you might be able to trigger some really excessive
latencies and possibly a watchdog screaming at you as a result. But
that said, maybe we wouldn't care. I just think that the solution is
so simple (what, five extra lines or so) that it's worth avoiding even
the worry.

Waiman Long

Sep 6, 2013, 10:24:45 PM
to Linus Torvalds, Alexander Viro, linux-fsdevel, Linux Kernel Mailing List, Chandramouleeswaran, Aswin, Norton, Scott J, George Spelvin, John Stoffel
I would prefer putting the begin and end blocks into 2 inlined helper
functions to make the code easier to read. I will work on this
over the weekend.

-Longman

Al Viro

Sep 6, 2013, 11:01:46 PM
to Linus Torvalds, Waiman Long, linux-fsdevel, Linux Kernel Mailing List, Chandramouleeswaran, Aswin, Norton, Scott J, George Spelvin, John Stoffel
On Fri, Sep 06, 2013 at 05:58:51PM -0700, Linus Torvalds wrote:
> On Fri, Sep 6, 2013 at 5:19 PM, Linus Torvalds
> <torv...@linux-foundation.org> wrote:
> >
> > (We're bounded in practice by PATH_MAX, so you can't make getcwd()
> > traverse more than about 2000 parents (single character filename plus
> > the slash for each level), and for all I know filesystems might cap it
> > before that, so it's not unbounded, but the difference between "1" and
> > "2000" is pretty damn big)
>
> .. in particular, it's big enough that one is pretty much guaranteed
> to fit in any reasonable L1 cache (if we have dentry hash chains so
> long that that becomes a problem for traversing a single chain, we're
> screwed anyway), while the other can most likely be a case of "not a
> single L1 cache hit because by the time you fail and go back to the
> start, you've flushed the L1 cache".
>
> Now, whether 2000 L2 cache misses is long enough to give people a
> chance to run the whole rename system call path in a loop a few times,
> I don't know, but it sure as heck sounds likely.
>
> Of course, you might still ask "why should we even care?" At least
> without preemption, you might be able to trigger some really excessive
> latencies and possibly a watchdog screaming at you as a result. But
> that said, maybe we wouldn't care. I just think that the solution is
> so simple (what, five extra lines or so) that it's worth avoiding even
> the worry.

We already have that kind of logic - see select_parent() et al. in
mainline or d_walk() in vfs.git#for-linus (pull request will go in
a few minutes). With this patch we get

* plain seqretry loop (d_lookup(), is_subdir(), autofs4_getpath(),
ceph_misc_build_path(), [cifs] build_path_from_dentry(), nfs_path(),
[audit] handle_path())
* try seqretry once, then switch to write_seqlock() (the things
that got unified into d_walk())
* try seqretry three times, then switch to write_seqlock() (d_path()
and friends)
* several pure write_seqlock() users (d_move(), d_set_mounted(),
d_materialize_unique())

The last class is not a problem - these we want as writers. I really don't
like the way the rest is distributed - if nothing else, nfs_path() and
friends are in exactly the same situation as d_path(). Moreover, why
the distinction between "try once" and "try thrice"?

_If_ we fold the second and the third groups together (and probably have
a bunch from the first one join that), we at least get something
understandable, but I really wonder if seqlock has the right calling
conventions for that (and at least I'd like to fold the "already got writelock"
flag into seq - we do have a spare bit there).

Comments?

Al Viro

Sep 7, 2013, 1:32:54 PM
to Linus Torvalds, Waiman Long, linux-fsdevel, Linux Kernel Mailing List, Chandramouleeswaran, Aswin, Norton, Scott J, George Spelvin, John Stoffel, Sage Weil, Ian Kent
On Sat, Sep 07, 2013 at 04:01:10AM +0100, Al Viro wrote:
> * plain seqretry loop (d_lookup(), is_subdir(), autofs4_getpath(),
> ceph_misc_build_path(), [cifs] build_path_from_dentry(), nfs_path(),
_mds_, actually - sorry.
> [audit] handle_path())
> * try seqretry once, then switch to write_seqlock() (the things
> that got unified into d_walk())
> * try seqretry three times, then switch to write_seqlock() (d_path()
> and friends)
> * several pure write_seqlock() users (d_move(), d_set_mounted(),
> d_materialize_unique())

BTW, autofs4_getpath() looks really odd:
	static int autofs4_getpath(struct autofs_sb_info *sbi,
				   struct dentry *dentry, char **name)
and *name is never modified in it. Why not simply pass it by value?
Moreover, I'm not sure I understand what do we need sbi->fs_lock in
there. Other than that, it's very close to dentry_path() (well, that
and different calling conventions). Ian?

ceph_mds_build_path() is similar, but it does kmalloc to store the result
and grabs ->d_lock for ->d_name protection. This one, BTW, is much more
likely to get stalls - it ends up doing kmalloc on each attempt (after
having calculated the current length). Bugger if I understand what's wrong
with simply grabbing a page and doing that once - before everything else...

build_path_from_dentry() - same story, might very well have been the source
of ceph_mds_build_path().

nfs_path() - not far from open-coded dentry_path() with some postprocessing,
uses ->d_lock for ->d_name protection.

Linus Torvalds

Sep 7, 2013, 1:52:19 PM
to Al Viro, Waiman Long, linux-fsdevel, Linux Kernel Mailing List, Chandramouleeswaran, Aswin, Norton, Scott J, George Spelvin, John Stoffel
On Fri, Sep 6, 2013 at 8:01 PM, Al Viro <vi...@zeniv.linux.org.uk> wrote:
>
> * plain seqretry loop (d_lookup(), is_subdir(), autofs4_getpath(),
> ceph_misc_build_path(), [cifs] build_path_from_dentry(), nfs_path(),
> [audit] handle_path())
> * try seqretry once, then switch to write_seqlock() (the things
> that got unified into d_walk())
> * try seqretry three times, then switch to write_seqlock() (d_path()
> and friends)
> * several pure write_seqlock() users (d_move(), d_set_mounted(),
> d_materialize_unique())
>
> The last class is not a problem - these we want as writers. I really don't
> like the way the rest is distributed - if nothing else, nfs_path() and
> friends are in exactly the same situation as d_path(). Moreover, why
> the distinction between "try once" and "try thrice"?
>
> _If_ we fold the second and the third groups together (and probably have
> a bunch from the first one join that), we at least get something
> understandable, but the I really wonder if seqlock has the right calling
> conventions for that (and at least I'd like to fold the "already got writelock"
> flag into seq - we do have a spare bit there).
>
> Comments?

So I think we could make a more complicated data structure that looks
something like this:

	struct seqlock_retry {
		unsigned int seq_no;
		int state;
	};

and pass that around. Gcc should do pretty well, especially if we
inline things (but even if not, small structures that fit in 64 bytes
generate reasonable code even on 32-bit targets, because gcc knows
about using two registers for passing data around)..

Then you can make "state" have a retry counter in it, and have a
negative value mean "I hold the lock for writing". Add a couple of
helper functions, and you can fairly easily handle the mixed "try for
reading first, then fall back to writing".
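This mixed "read first, fall back to writing" state machine can be sketched as a small userspace model (all names here are illustrative, not the kernel's; a fake counter stands in for seqlock_t, and mainline later merged a different shape of this idea as read_seqbegin_or_lock()/need_seqretry()/done_seqretry()):

```c
#include <assert.h>
#include <stdbool.h>

/* Stand-in for a seqlock: writers bump the counter on lock and unlock. */
struct fake_seqlock { unsigned seq; };

static unsigned fake_read_seqbegin(struct fake_seqlock *l) { return l->seq; }
static bool fake_read_seqretry(struct fake_seqlock *l, unsigned s) { return l->seq != s; }
static void fake_write_seqlock(struct fake_seqlock *l) { l->seq++; }
static void fake_write_sequnlock(struct fake_seqlock *l) { l->seq++; }

/* state > 0: lockless attempts left; state < 0: we hold the write lock. */
struct seqlock_retry {
	unsigned seq_no;
	int state;
};

static void sr_begin(struct fake_seqlock *l, struct seqlock_retry *sr)
{
	if (sr->state > 0) {
		sr->seq_no = fake_read_seqbegin(l);
	} else {
		fake_write_seqlock(l);
		sr->state = -1;
	}
}

/* Returns true if the walk must be restarted. */
static bool sr_retry(struct fake_seqlock *l, struct seqlock_retry *sr)
{
	if (sr->state > 0) {
		if (fake_read_seqretry(l, sr->seq_no)) {
			sr->state--;	/* burn one lockless attempt */
			return true;
		}
	} else {
		fake_write_sequnlock(l);
	}
	return false;
}
```

A walk that keeps losing races to a renamer converges: after the retry budget is exhausted, the final pass runs under the write lock and cannot be invalidated.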

That said, __d_lookup() still shows up as very performance-critical on
some loads (symlinks in particular cause us to fall out of the RCU
cases) so I'd like to keep that using the simple pure read case. I
don't believe you can livelock it, as mentioned. But the other ones
might well be worth moving to a "fall back to write-locking after <n>
tries" model. They might all traverse user-specified paths of fairly
arbitrary depth, no?

So this "seqlock_retry" thing wouldn't _replace_ bare seqlocks, it
would just be a helper thing for this kind of behavior where we want
to normally do things with just the read-lock, but want to guarantee
that we don't live-lock.

Sounds reasonable?

Linus

Al Viro

Sep 7, 2013, 2:07:55 PM
to Linus Torvalds, Waiman Long, linux-fsdevel, Linux Kernel Mailing List, Chandramouleeswaran, Aswin, Norton, Scott J, George Spelvin, John Stoffel
More or less; I just wonder if we are overdesigning here - if we don't
do "repeat more than once", we can simply use the lower bit of seq -
read_seqlock() always returns an even value. So we could do something
like seqretry_and_lock(lock, &seq):
	if ((*seq & 1) || !read_seqretry(lock, *seq))
		return true;
	*seq |= 1;
	write_seqlock(lock);
	return false;
and seqretry_done(lock, seq):
	if (seq & 1)
		write_sequnlock(lock);
with these loops turning into
	seq = read_seqlock(&rename_lock);
	...
	if (!seqretry_and_lock(&rename_lock, &seq))
		goto again;
	...
	seqretry_done(&rename_lock);

But I'd really like to understand the existing zoo - in particular, ceph and
cifs users can't be converted to anything of that kind (blocking kmalloc()
can't live under write_seqlock()) and they are _easier_ to livelock than
d_path(), due to the same kmalloc() widening the window. Guys, do we really
care about precisely-sized allocations there?
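The seqretry_and_lock()/seqretry_done() sketch above can be turned into a compilable userspace model (names illustrative; the fake seqlock below merely mimics seqlock_t's even-when-idle counter):

```c
#include <assert.h>
#include <stdbool.h>

/* Stand-in for seqlock_t: the counter is even when no writer holds it. */
struct fake_seqlock { unsigned seq; };

static unsigned fake_read_seqbegin(struct fake_seqlock *l) { return l->seq & ~1u; }
static bool fake_read_seqretry(struct fake_seqlock *l, unsigned s) { return l->seq != s; }
static void fake_write_seqlock(struct fake_seqlock *l) { l->seq++; }
static void fake_write_sequnlock(struct fake_seqlock *l) { l->seq++; }

/*
 * Returns true when the pass was good (or we already hold the write
 * lock); false means "retry", with the write lock now held and the low
 * bit of *seq marking that fact.
 */
static bool seqretry_and_lock(struct fake_seqlock *l, unsigned *seq)
{
	if ((*seq & 1) || !fake_read_seqretry(l, *seq))
		return true;
	*seq |= 1;
	fake_write_seqlock(l);
	return false;
}

static void seqretry_done(struct fake_seqlock *l, unsigned seq)
{
	if (seq & 1)
		fake_write_sequnlock(l);
}
```

The design point is that no separate "already locked" flag is needed: a reader's snapshot is always even, so the spare low bit of the snapshot itself records the fallback to write locking.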

Al Viro

Sep 7, 2013, 2:54:05 PM
to Linus Torvalds, Waiman Long, linux-fsdevel, Linux Kernel Mailing List, Chandramouleeswaran, Aswin, Norton, Scott J, George Spelvin, John Stoffel
On Sat, Sep 07, 2013 at 07:07:24PM +0100, Al Viro wrote:

> with these loops turning into
> 	seq = read_seqlock(&rename_lock);
again:
> 	...
> 	if (!seqretry_and_lock(&rename_lock, &seq))
> 		goto again;
> 	...
> 	seqretry_done(&rename_lock);

Forgot the label, sorry.

And I agree that d_lookup() ought to stay as is - this is just about
the ones that try readlock once and then fall back to writer.

Ian Kent

Sep 8, 2013, 12:15:58 AM
to Al Viro, Linus Torvalds, Waiman Long, linux-fsdevel, Linux Kernel Mailing List, Chandramouleeswaran, Aswin, Norton, Scott J, George Spelvin, John Stoffel, Sage Weil
On Sat, 2013-09-07 at 18:32 +0100, Al Viro wrote:
> On Sat, Sep 07, 2013 at 04:01:10AM +0100, Al Viro wrote:
> > * plain seqretry loop (d_lookup(), is_subdir(), autofs4_getpath(),
> > ceph_misc_build_path(), [cifs] build_path_from_dentry(), nfs_path(),
> _mds_, actually - sorry.
> > [audit] handle_path())
> > * try seqretry once, then switch to write_seqlock() (the things
> > that got unified into d_walk())
> > * try seqretry three times, then switch to write_seqlock() (d_path()
> > and friends)
> > * several pure write_seqlock() users (d_move(), d_set_mounted(),
> > d_materialize_unique())
>
> BTW, autofs4_getpath() looks really odd:
> static int autofs4_getpath(struct autofs_sb_info *sbi,
> struct dentry *dentry, char **name)
> and *name is never modified in it. Why not simply pass it by value?
> Moreover, I'm not sure I understand what do we need sbi->fs_lock in
> there. Other than that, it's very close to dentry_path() (well, that
> and different calling conventions). Ian?

Yes, it is close to dentry_path() but the path functions used to return
the full path to the root, although I don't see where dentry_path() gets
the root, mmm ...

autofs4_getpath() is supposed to return a path relative to the current
dentry.

That goes back to the pseudo direct mounts of autofs version 4 where the
keys could be a relative path.

The fs_lock probably isn't needed. This was a tree of directories and I
didn't want mount requests coming in while I was getting the path, and
didn't want an expire to happen either, although there should only be
one expire process anyway.

Ian

Al Viro

Sep 8, 2013, 12:58:26 AM
to Ian Kent, Linus Torvalds, Waiman Long, linux-fsdevel, Linux Kernel Mailing List, Chandramouleeswaran, Aswin, Norton, Scott J, George Spelvin, John Stoffel, Sage Weil
On Sun, Sep 08, 2013 at 12:15:46PM +0800, Ian Kent wrote:
> > and *name is never modified in it. Why not simply pass it by value?
> > Moreover, I'm not sure I understand what do we need sbi->fs_lock in
> > there. Other than that, it's very close to dentry_path() (well, that
> > and different calling conventions). Ian?
>
> Yes, it is close to dentry_path() but the path functions used to return
> the full path to the root, although I don't see where dentry_path() get
> the root, mmm ...

dentry_path(), unlike d_path(), is relative to the root of filesystem
containing dentry in question. There are 3 of those suckers:
	d_path(): vfsmount/dentry => path to current process' root
	d_absolute_path(): ditto, but ignores chroot jails (goes to
		the absolute root of namespace)
	dentry_path(): dentry => path from root of fs containing that
		dentry.

IOW, if you have something mounted on /jail/mnt/foo and are chrooted into
/jail, passing vfsmount/dentry of /jail/mnt/foo/bar/baz to d_path() will
yield "/mnt/foo/bar/baz", to d_absolute_path() - "/jail/mnt/foo/bar/baz" and
passing the same dentry to dentry_path() - "/bar/baz".

Ian Kent

Sep 8, 2013, 4:52:15 AM
to Al Viro, Linus Torvalds, Waiman Long, linux-fsdevel, Linux Kernel Mailing List, Chandramouleeswaran, Aswin, Norton, Scott J, George Spelvin, John Stoffel, Sage Weil
On Sun, 2013-09-08 at 05:58 +0100, Al Viro wrote:
> On Sun, Sep 08, 2013 at 12:15:46PM +0800, Ian Kent wrote:
> > > and *name is never modified in it. Why not simply pass it by value?
> > > Moreover, I'm not sure I understand what do we need sbi->fs_lock in
> > > there. Other than that, it's very close to dentry_path() (well, that
> > > and different calling conventions). Ian?
> >
> > Yes, it is close to dentry_path() but the path functions used to return
> > the full path to the root, although I don't see where dentry_path() get
> > the root, mmm ...
>
> dentry_path(), unlike d_path(), is relative to the root of filesystem
> containing dentry in question. There are 3 of those suckers:
> 	d_path(): vfsmount/dentry => path to current process' root
> 	d_absolute_path(): ditto, but ignores chroot jails (goes to
> 		the absolute root of namespace)
> 	dentry_path(): dentry => path from root of fs containing that
> 		dentry.
>
> IOW, if you have something mounted on /jail/mnt/foo and are chrooted into
> /jail, passing vfsmount/dentry of /jail/mnt/foo/bar/baz to d_path() will
> yield "/mnt/foo/bar/baz", to d_absolute_path() - "/jail/mnt/foo/bar/baz" and
> passing the same dentry to dentry_path() - "/bar/baz".

Right, so all I need to do is drop the leading "/" and I'll have what I
need. I'll do that.

Thanks
Ian

Waiman Long

Sep 9, 2013, 10:31:57 AM
to Al Viro, Linus Torvalds, linux-fsdevel, Linux Kernel Mailing List, Chandramouleeswaran, Aswin, Norton, Scott J, George Spelvin, John Stoffel
> 	if ((*seq & 1) || !read_seqretry(lock, *seq))
> 		return true;
> 	*seq |= 1;
> 	write_seqlock(lock);
> 	return false;
> and seqretry_done(lock, seq):
> 	if (seq & 1)
> 		write_sequnlock(lock);
> with these loops turning into
> 	seq = read_seqlock(&rename_lock);
> 	...
> 	if (!seqretry_and_lock(&rename_lock, &seq))
> 		goto again;
> 	...
> 	seqretry_done(&rename_lock);

I am fine with trying it once and then locking instead of trying it a few
times. Now are you planning to have these helper functions for the
dcache layer only or as part of the seqlock infrastructure? If we are
going to touch the seqlock layer, I would suggest adding a blocking
reader that takes the lock, but won't update the sequence number so that
it won't block other sequence readers as my original seqlock patch was
doing.

-Longman

Waiman Long

Sep 9, 2013, 12:19:08 PM
to Alexander Viro, Linus Torvalds, Waiman Long, linux-...@vger.kernel.org, linux-...@vger.kernel.org, Chandramouleeswaran, Aswin, Norton, Scott J, George Spelvin, John Stoffel
Change History

v3->v4:
- Extract the begin and end blocks of the rename_lock sequence number
check into helper functions to make the code easier to read.

v2->v3:
- Simplify prepend_name() to blindly copy the dname over until it
reaches a null byte or the specified length leaving the sequence
check to handle error case.

v1->v2:
- Check for internal vs external dname, taking d_lock only for
external dname for safety.
- Replace memchr() with a byte-by-byte checking loop.
- Try lockless dentry to pathname conversion 3 times before falling
back to taking the rename_lock to prevent live-lock.
- Make code re-factoring suggested by George Spelvin.
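The v2->v3 change above ("blindly copy the dname over until it reaches a null byte or the specified length") can be modeled in plain userspace C (an illustrative sketch of the patch's prepend_name(); the kernel version returns -ENAMETOOLONG and fetches the qstr fields with ACCESS_ONCE()):

```c
#include <assert.h>
#include <string.h>

/*
 * Userspace model of the lockless prepend_name(): reserve dlen + 1
 * bytes, write a leading '/', then copy at most dlen bytes, stopping
 * early at a NUL.  Under a racing d_move() the name/length pair may be
 * inconsistent; any resulting garbage is discarded when the caller's
 * rename_lock sequence check forces a retry.
 */
static int prepend_name_model(char **buffer, int *buflen,
			      const char *dname, unsigned dlen)
{
	char *p;

	if (*buflen < (int)(dlen + 1))
		return -1;		/* -ENAMETOOLONG in the kernel */
	*buflen -= dlen + 1;
	p = *buffer -= dlen + 1;
	*p++ = '/';
	while (dlen--) {
		char c = *dname++;
		if (!c)
			break;
		*p++ = c;
	}
	return 0;
}
```

Prepending "bar" and then "foo" into the tail of a buffer yields "/foo/bar", mirroring how the dentry walk builds the path from leaf to root.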

Waiman Long (1):
dcache: Translating dentry into pathname without taking rename_lock

fs/dcache.c | 196 ++++++++++++++++++++++++++++++++++++++++-------------------
1 files changed, 133 insertions(+), 63 deletions(-)

Waiman Long

Sep 9, 2013, 12:19:58 PM
to Alexander Viro, Linus Torvalds, Waiman Long, linux-...@vger.kernel.org, linux-...@vger.kernel.org, Chandramouleeswaran, Aswin, Norton, Scott J, George Spelvin, John Stoffel
When running the AIM7's short workload, Linus' lockref patch eliminated
most of the spinlock contention. However, there were still some left:

8.46% reaim [kernel.kallsyms] [k] _raw_spin_lock
|--42.21%-- d_path
| proc_pid_readlink
| SyS_readlinkat
| SyS_readlink
| system_call
| __GI___readlink
|
|--40.97%-- sys_getcwd
| system_call
| __getcwd

The big one here is the rename_lock (seqlock) contention in d_path()
and the getcwd system call. This patch will eliminate the need to take
the rename_lock while translating dentries into the full pathnames.

The need to take the rename_lock is to make sure that no rename
operation can be ongoing while the translation is in progress. However,
only one thread can take the rename_lock thus blocking all the other
threads that need it even though the translation process won't make
any change to the dentries.

This patch will replace the writer's write_seqlock/write_sequnlock
sequence of the rename_lock of the callers of the prepend_path() and
__dentry_path() functions with the reader's read_seqbegin/read_seqretry
sequence within these 2 functions.

fs/dcache.c | 196 ++++++++++++++++++++++++++++++++++++++++-------------------
1 files changed, 133 insertions(+), 63 deletions(-)

diff --git a/fs/dcache.c b/fs/dcache.c
index ca8e9cd..8186ff9 100644
--- a/fs/dcache.c
+++ b/fs/dcache.c
@@ -88,6 +88,44 @@ EXPORT_SYMBOL(rename_lock);

static struct kmem_cache *dentry_cache __read_mostly;

+/**
+ * read_seqbegin_or_lock - begin a sequence number check or locking block
+ * lock: sequence lock
+ * seq : sequence number to be checked
+ *
+ * First try it once optimistically without taking the lock. If that fails,
+ * take the lock. The sequence number is also used as a marker for deciding
+ * whether to be a reader (even) or writer (odd).
+ * N.B. seq must be initialized to an even number to begin with.
+ */
+static inline void read_seqbegin_or_lock(seqlock_t *lock, int *seq)
+{
+ if (!(*seq & 1)) { /* Even */
+ *seq = read_seqbegin(lock);
+ rcu_read_lock();
+ } else /* Odd */
+ write_seqlock(lock);
+}
+
+/**
+ * read_seqretry_or_unlock - end a seqretry or lock block & return retry status
+ * lock : sequence lock
+ * seq : sequence number
+ * Return: 1 to retry operation again, 0 to continue
+ */
+static inline int read_seqretry_or_unlock(seqlock_t *lock, int *seq)
+{
+ if (!(*seq & 1)) { /* Even */
+ rcu_read_unlock();
+ if (read_seqretry(lock, *seq)) {
+ (*seq)++; /* Take writer lock */
+ return 1;
+ }
+ } else /* Odd */
+ write_sequnlock(lock);
+ return 0;
+}
+
/*
* This is the single most critical data structure when it comes
* to the dcache: the hashtable for lookups. Somebody should try
@@ -2647,9 +2685,39 @@ static int prepend(char **buffer, int *buflen, const char *str, int namelen)
return 0;
}

+/**
+ * prepend_name - prepend a pathname in front of current buffer pointer
+ * buffer: buffer pointer
+ * buflen: allocated length of the buffer
+ * name: name string and length qstr structure
+ *
+ * With RCU path tracing, it may race with d_move(). Use ACCESS_ONCE() to
+ * make sure that either the old or the new name pointer and length are
+ * fetched. However, there may be mismatch between length and pointer.
+ * The length cannot be trusted, we need to copy it byte-by-byte until
+ * the length is reached or a null byte is found. It also prepends "/" at
+ * the beginning of the name. The sequence number check at the caller will
+ * retry it again when a d_move() does happen. So any garbage in the buffer
+ * due to mismatched pointer and length will be discarded.
+ */
static int prepend_name(char **buffer, int *buflen, struct qstr *name)
{
- return prepend(buffer, buflen, name->name, name->len);
+ const char *dname = ACCESS_ONCE(name->name);
+ u32 dlen = ACCESS_ONCE(name->len);
+ char *p;
+
+ if (*buflen < dlen + 1)
+ return -ENAMETOOLONG;
+ *buflen -= dlen + 1;
+ p = *buffer -= dlen + 1;
+ *p++ = '/';
+ while (dlen--) {
+ char c = *dname++;
+ if (!c)
+ break;
+ *p++ = c;
+ }
+ return 0;
}

/**
@@ -2659,7 +2727,14 @@ static int prepend_name(char **buffer, int *buflen, struct qstr *name)
* @buffer: pointer to the end of the buffer
* @buflen: pointer to buffer length
*
- * Caller holds the rename_lock.
+ * The function tries to write out the pathname without taking any lock other
+ * than the RCU read lock to make sure that dentries won't go away. It only
+ * checks the sequence number of the global rename_lock as any change in the
+ * dentry's d_seq will be preceded by changes in the rename_lock sequence
+ * number. If the sequence number had been changed, it will restart the whole
+ * pathname back-tracing sequence again. It performs a total of 3 trials of
+ * lockless back-tracing sequences before falling back to take the
+ * rename_lock.
*/
static int prepend_path(const struct path *path,
const struct path *root,
@@ -2668,54 +2743,60 @@ static int prepend_path(const struct path *path,
struct dentry *dentry = path->dentry;
struct vfsmount *vfsmnt = path->mnt;
struct mount *mnt = real_mount(vfsmnt);
- bool slash = false;
int error = 0;
+ unsigned seq = 0;
+ char *bptr;
+ int blen;

+restart:
+ bptr = *buffer;
+ blen = *buflen;
+ read_seqbegin_or_lock(&rename_lock, &seq);
+ if (read_seqretry_or_unlock(&rename_lock, &seq))
+ goto restart;
@@ -2744,9 +2825,7 @@ char *__d_path(const struct path *path,

prepend(&res, &buflen, "\0", 1);
br_read_lock(&vfsmount_lock);
- write_seqlock(&rename_lock);
error = prepend_path(path, root, &res, &buflen);
- write_sequnlock(&rename_lock);
br_read_unlock(&vfsmount_lock);

if (error < 0)
@@ -2765,9 +2844,7 @@ char *d_absolute_path(const struct path *path,

prepend(&res, &buflen, "\0", 1);
br_read_lock(&vfsmount_lock);
- write_seqlock(&rename_lock);
error = prepend_path(path, &root, &res, &buflen);
- write_sequnlock(&rename_lock);
br_read_unlock(&vfsmount_lock);

if (error > 1)
@@ -2833,9 +2910,7 @@ char *d_path(const struct path *path, char *buf, int buflen)

get_fs_root(current->fs, &root);
br_read_lock(&vfsmount_lock);
- write_seqlock(&rename_lock);
error = path_with_deleted(path, &root, &res, &buflen);
- write_sequnlock(&rename_lock);
br_read_unlock(&vfsmount_lock);
if (error < 0)
res = ERR_PTR(error);
@@ -2870,10 +2945,10 @@ char *simple_dname(struct dentry *dentry, char *buffer, int buflen)
char *end = buffer + buflen;
/* these dentries are never renamed, so d_lock is not needed */
if (prepend(&end, &buflen, " (deleted)", 11) ||
- prepend_name(&end, &buflen, &dentry->d_name) ||
+ prepend(&end, &buflen, dentry->d_name.name, dentry->d_name.len) ||
prepend(&end, &buflen, "/", 1))
end = ERR_PTR(-ENAMETOOLONG);
- return end;
+ return end;
}

/*
@@ -2881,30 +2956,36 @@ char *simple_dname(struct dentry *dentry, char *buffer, int buflen)
*/
static char *__dentry_path(struct dentry *dentry, char *buf, int buflen)
{
- char *end = buf + buflen;
- char *retval;
+ char *end, *retval;
+ int len, seq = 0;
+ int error = 0;

- prepend(&end, &buflen, "\0", 1);
+restart:
+ end = buf + buflen;
+ len = buflen;
+ prepend(&end, &len, "\0", 1);
if (buflen < 1)
goto Elong;
/* Get '/' right */
retval = end-1;
*retval = '/';
-
+ read_seqbegin_or_lock(&rename_lock, &seq);
while (!IS_ROOT(dentry)) {
struct dentry *parent = dentry->d_parent;
-		int error;

prefetch(parent);
- spin_lock(&dentry->d_lock);
- error = prepend_name(&end, &buflen, &dentry->d_name);
- spin_unlock(&dentry->d_lock);
- if (error != 0 || prepend(&end, &buflen, "/", 1) != 0)
- goto Elong;
+ error = prepend_name(&end, &len, &dentry->d_name);
+ if (error)
+ break;

retval = end;
dentry = parent;
}
+ if (read_seqretry_or_unlock(&rename_lock, &seq))
+ goto restart;
+ if (error)
+ goto Elong;
return retval;
Elong:
return ERR_PTR(-ENAMETOOLONG);
@@ -2912,13 +2993,7 @@ Elong:

char *dentry_path_raw(struct dentry *dentry, char *buf, int buflen)
{
- char *retval;
-
- write_seqlock(&rename_lock);
- retval = __dentry_path(dentry, buf, buflen);
- write_sequnlock(&rename_lock);
-
- return retval;
+ return __dentry_path(dentry, buf, buflen);
}
EXPORT_SYMBOL(dentry_path_raw);

@@ -2927,7 +3002,6 @@ char *dentry_path(struct dentry *dentry, char *buf, int buflen)
char *p = NULL;
char *retval;

- write_seqlock(&rename_lock);
if (d_unlinked(dentry)) {
p = buf + buflen;
if (prepend(&p, &buflen, "//deleted", 10) != 0)
@@ -2935,7 +3009,6 @@ char *dentry_path(struct dentry *dentry, char *buf, int buflen)
buflen++;
}
retval = __dentry_path(dentry, buf, buflen);
- write_sequnlock(&rename_lock);
if (!IS_ERR(retval) && p)
*p = '/'; /* restore '/' overriden with '\0' */
return retval;
@@ -2974,7 +3047,6 @@ SYSCALL_DEFINE2(getcwd, char __user *, buf, unsigned long, size)

error = -ENOENT;
br_read_lock(&vfsmount_lock);
- write_seqlock(&rename_lock);
if (!d_unlinked(pwd.dentry)) {
unsigned long len;
char *cwd = page + PAGE_SIZE;
@@ -2982,7 +3054,6 @@ SYSCALL_DEFINE2(getcwd, char __user *, buf, unsigned long, size)

prepend(&cwd, &buflen, "\0", 1);
error = prepend_path(&pwd, &root, &cwd, &buflen);
- write_sequnlock(&rename_lock);
br_read_unlock(&vfsmount_lock);

if (error < 0)
@@ -3003,7 +3074,6 @@ SYSCALL_DEFINE2(getcwd, char __user *, buf, unsigned long, size)
error = -EFAULT;
}
} else {
- write_sequnlock(&rename_lock);
br_read_unlock(&vfsmount_lock);
}

--
1.7.1
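The tolerant copy loop that the patch adds in prepend_name() can be exercised as a standalone userspace sketch. prepend_name_sketch() and its plain pointer/length arguments are hypothetical stand-ins for the kernel's qstr and ACCESS_ONCE() fetches; the point is only the byte-by-byte copy that stops at a NUL rather than trusting a possibly stale length:

```c
#include <assert.h>
#include <errno.h>

/* Userspace sketch of the lockless prepend_name() above.  Because the
 * name pointer and length are fetched without d_lock, the length may be
 * stale relative to the pointer, so the copy stops early at a NUL
 * instead of trusting dlen.  Plain pointers and an unsigned length
 * stand in for the kernel's qstr and ACCESS_ONCE() fetches. */
static int prepend_name_sketch(char **buffer, int *buflen,
			       const char *dname, unsigned dlen)
{
	char *p;

	if (*buflen < (int)dlen + 1)
		return -ENAMETOOLONG;
	*buflen -= dlen + 1;
	p = *buffer -= dlen + 1;
	*p++ = '/';			/* leading separator */
	while (dlen--) {
		char c = *dname++;
		if (!c)			/* stale length: stop at the NUL */
			break;
		*p++ = c;
	}
	return 0;
}
```

If a racing d_move() leaves dlen longer than the actual name, the NUL check stops the copy; any garbage left in the buffer is discarded anyway when the caller's sequence-number check forces a retry.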

Al Viro

Sep 9, 2013, 1:29:21 PM
to Waiman Long, Linus Torvalds, linux-...@vger.kernel.org, linux-...@vger.kernel.org, Chandramouleeswaran, Aswin, Norton, Scott J, George Spelvin, John Stoffel
On Mon, Sep 09, 2013 at 12:18:13PM -0400, Waiman Long wrote:
> +/**
> + * read_seqbegin_or_lock - begin a sequence number check or locking block
> + * lock: sequence lock
> + * seq : sequence number to be checked
> + *
> + * First try it once optimistically without taking the lock. If that fails,
> + * take the lock. The sequence number is also used as a marker for deciding
> + * whether to be a reader (even) or writer (odd).
> + * N.B. seq must be initialized to an even number to begin with.
> + */
> +static inline void read_seqbegin_or_lock(seqlock_t *lock, int *seq)
> +{
> + if (!(*seq & 1)) { /* Even */
> + *seq = read_seqbegin(lock);
> + rcu_read_lock();
> + } else /* Odd */
> + write_seqlock(lock);
> +}

> +static inline int read_seqretry_or_unlock(seqlock_t *lock, int *seq)
> +{
> + if (!(*seq & 1)) { /* Even */
> + rcu_read_unlock();
> + if (read_seqretry(lock, *seq)) {
> + (*seq)++; /* Take writer lock */
> + return 1;
> + }
> + } else /* Odd */
> + write_sequnlock(lock);
> + return 0;
> +}

I'm not sure I like mixing rcu_read_lock() into that - d_path() and friends
can do that themselves just fine (it needs to be taken when seq is even),
and e.g. d_walk() doesn't need it at all. Other than that, I'm OK with
this variant.
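The pair of helpers quoted above implement a two-phase protocol (optimistic even-sequence pass, then an exclusive odd-sequence pass) that is easy to model in userspace. Everything below is an illustrative stand-in, not the kernel API: fake_seqlock replaces seqlock_t, write_locked replaces the real write_seqlock()/write_sequnlock(), and the RCU side is omitted:

```c
#include <assert.h>

/* Userspace model of the posted read_seqbegin_or_lock() /
 * read_seqretry_or_unlock() protocol: a first lockless pass with an
 * even sequence number, escalating to an exclusive (odd) pass if a
 * writer intervened.  fake_seqlock stands in for seqlock_t. */
struct fake_seqlock { unsigned seqcount; int write_locked; };

static void begin_or_lock(struct fake_seqlock *l, int *seq)
{
	if (!(*seq & 1))		/* even: optimistic reader */
		*seq = l->seqcount;
	else				/* odd: exclusive writer */
		l->write_locked = 1;
}

static int retry_or_unlock(struct fake_seqlock *l, int *seq)
{
	if (!(*seq & 1)) {		/* even */
		if (l->seqcount != (unsigned)*seq) {
			(*seq)++;	/* odd now: writer lock next pass */
			return 1;
		}
	} else {			/* odd */
		l->write_locked = 0;	/* write_sequnlock() */
	}
	return 0;
}

/* Returns how many passes were needed; 'rename' injects one concurrent
 * sequence bump (a write_seqlock/write_sequnlock pair) during pass 1. */
static int translate(struct fake_seqlock *l, int rename)
{
	int seq = 0, passes = 0;
	do {
		passes++;
		begin_or_lock(l, &seq);
		if (rename) {
			l->seqcount += 2;
			rename = 0;
		}
	} while (retry_or_unlock(l, &seq));
	return passes;
}
```

With no concurrent rename the walk completes in one lockless pass; with one, it retries exactly once as the writer and then unlocks.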

Linus Torvalds

Sep 9, 2013, 1:45:47 PM
to Al Viro, Waiman Long, linux-fsdevel, Linux Kernel Mailing List, Chandramouleeswaran, Aswin, Norton, Scott J, George Spelvin, John Stoffel
On Mon, Sep 9, 2013 at 10:29 AM, Al Viro <vi...@zeniv.linux.org.uk> wrote:
>
> I'm not sure I like mixing rcu_read_lock() into that - d_path() and friends
> can do that themselves just fine (it needs to be taken when seq is even),
> and e.g. d_walk() doesn't need it at all. Other than that, I'm OK with
> this variant.

Hmm.. I think you need the RCU read lock even when you get the write_seqlock().

Yes, getting the seqlock for write implies that you get a spinlock and
in many normal circumstances that basically is equivalent to being
rcu-locked, but afaik in some configurations that is *not* sufficient
protection against an RCU grace period on another CPU. You need to do
a real rcu_read_lock that increments that whole rcu_read_lock_nesting
level, which a spinlock won't do.

And while the rename sequence lock protects against _renames_, it does
not protect against just plain dentries getting free'd under memory
pressure.

So I think the RCU-readlockness really needs to be independent of the
sequence lock.

Linus

Waiman Long

Sep 9, 2013, 1:55:34 PM
to Al Viro, Linus Torvalds, linux-...@vger.kernel.org, linux-...@vger.kernel.org, Chandramouleeswaran, Aswin, Norton, Scott J, George Spelvin, John Stoffel
On 09/09/2013 01:29 PM, Al Viro wrote:
> On Mon, Sep 09, 2013 at 12:18:13PM -0400, Waiman Long wrote:
>> +/**
>> + * read_seqbegin_or_lock - begin a sequence number check or locking block
>> + * lock: sequence lock
>> + * seq : sequence number to be checked
>> + *
>> + * First try it once optimistically without taking the lock. If that fails,
>> + * take the lock. The sequence number is also used as a marker for deciding
>> + * whether to be a reader (even) or writer (odd).
>> + * N.B. seq must be initialized to an even number to begin with.
>> + */
>> +static inline void read_seqbegin_or_lock(seqlock_t *lock, int *seq)
>> +{
>> + if (!(*seq & 1)) { /* Even */
>> + *seq = read_seqbegin(lock);
>> + rcu_read_lock();
>> + } else /* Odd */
>> + write_seqlock(lock);
>> +}
>> +static inline int read_seqretry_or_unlock(seqlock_t *lock, int *seq)
>> +{
>> + if (!(*seq & 1)) { /* Even */
>> + rcu_read_unlock();
>> + if (read_seqretry(lock, *seq)) {
>> + (*seq)++; /* Take writer lock */
>> + return 1;
>> + }
>> + } else /* Odd */
>> + write_sequnlock(lock);
>> + return 0;
>> +}
> I'm not sure I like mixing rcu_read_lock() into that - d_path() and friends
> can do that themselves just fine (it needs to be taken when seq is even),
> and e.g. d_walk() doesn't need it at all. Other than that, I'm OK with
> this variant.

I think rcu_read_lock() is needed to make sure that the dentry won't be
freed as we don't take d_lock now.

-Longman

Waiman Long

Sep 9, 2013, 1:57:10 PM
to Linus Torvalds, Al Viro, linux-fsdevel, Linux Kernel Mailing List, Chandramouleeswaran, Aswin, Norton, Scott J, George Spelvin, John Stoffel
On 09/09/2013 01:45 PM, Linus Torvalds wrote:
> On Mon, Sep 9, 2013 at 10:29 AM, Al Viro<vi...@zeniv.linux.org.uk> wrote:
>> I'm not sure I like mixing rcu_read_lock() into that - d_path() and friends
>> can do that themselves just fine (it needs to be taken when seq is even),
>> and e.g. d_walk() doesn't need it at all. Other than that, I'm OK with
>> this variant.
> Hmm.. I think you need the RCU read lock even when you get the write_seqlock().
>
> Yes, getting the seqlock for write implies that you get a spinlock and
> in many normal circumstances that basically is equivalent to being
> rcu-locked, but afaik in some configurations that is *not* sufficient
> protection against an RCU grace period on another CPU. You need to do
> a real rcu_read_lock that increments that whole rcu_read_lock_nesting
> level, which a spinlock won't do.
>
> And while the rename sequence lock protects against _renames_, it does
> not protect against just plain dentries getting free'd under memory
> pressure.
>
> So I think the RCU-readlockness really needs to be independent of the
> sequence lock.
>
> Linus

Yes, you are right. It will be safer to take rcu_read_lock() even if we
are taking the rename_lock. It wasn't needed before as d_lock was taken.
Will update the patch to move rcu_read_lock() out of the primitives to
reflect that.

Regards,
Longman

Al Viro

Sep 9, 2013, 2:06:16 PM
to Linus Torvalds, Waiman Long, linux-fsdevel, Linux Kernel Mailing List, Chandramouleeswaran, Aswin, Norton, Scott J, George Spelvin, John Stoffel
On Mon, Sep 09, 2013 at 10:45:38AM -0700, Linus Torvalds wrote:
> On Mon, Sep 9, 2013 at 10:29 AM, Al Viro <vi...@zeniv.linux.org.uk> wrote:
> >
> > I'm not sure I like mixing rcu_read_lock() into that - d_path() and friends
> > can do that themselves just fine (it needs to be taken when seq is even),
> > and e.g. d_walk() doesn't need it at all. Other than that, I'm OK with
> > this variant.
>
> Hmm.. I think you need the RCU read lock even when you get the write_seqlock().
>
> Yes, getting the seqlock for write implies that you get a spinlock and
> > in many normal circumstances that basically is equivalent to being
> rcu-locked, but afaik in some configurations that is *not* sufficient
> protection against an RCU grace period on another CPU. You need to do
> a real rcu_read_lock that increments that whole rcu_read_lock_nesting
> level, which a spinlock won't do.
>
> And while the rename sequence lock protects against _renames_, it does
> not protect against just plain dentries getting free'd under memory
> pressure.

It protects the chain of ->d_parent, so they'd better not get freed at
all...

> So I think the RCU-readlockness really needs to be independent of the
> sequence lock.

Actually, now that I've tried to convert d_walk() to those guys, I think
I like my proposal for the set of primitives better:

static inline bool seqretry_and_lock(seqlock_t *lock, unsigned *seq)
{
if ((*seq & 1) || !read_seqretry(lock, *seq))
return true;
*seq |= 1;
write_seqlock(lock);
return false;
}

static inline void seqretry_done(seqlock_t *lock, unsigned seq)
{
if (seq & 1)
write_sequnlock(lock);
}

with the prepend_path() and friends becoming

rcu_read_lock();
seq = read_seqbegin(&rename_lock);
again:
....
if (!seqretry_and_lock(&rename_lock, &seq))
goto again; /* now as writer */
seqretry_done(&rename_lock, seq);
rcu_read_unlock();

The thing is, d_walk() does essentially

seq = read_seqbegin(&rename_lock);
again:
....
spin_lock(&d->d_lock);
if (!seqretry_and_lock(&rename_lock, &seq)) {
spin_unlock(&d->d_lock);
goto again; /* now as writer */
}
/* now we are holding ->d_lock on it and we know
* that d has not gone stale until that point.
*/
do stuff with d
spin_unlock(&d->d_lock);
seqretry_done(&rename_lock, seq);

OTOH, it's not impossible to handle with Waiman's primitives, just more
massage to do that...

Al Viro

Sep 9, 2013, 2:07:30 PM
to Waiman Long, Linus Torvalds, linux-...@vger.kernel.org, linux-...@vger.kernel.org, Chandramouleeswaran, Aswin, Norton, Scott J, George Spelvin, John Stoffel
On Mon, Sep 09, 2013 at 01:55:06PM -0400, Waiman Long wrote:

> >I'm not sure I like mixing rcu_read_lock() into that - d_path() and friends
> >can do that themselves just fine (it needs to be taken when seq is even),
> >and e.g. d_walk() doesn't need it at all. Other than that, I'm OK with
> >this variant.
>
> I think rcu_read_lock() is needed to make sure that the dentry won't
> be freed as we don't take d_lock now.

Sure, you do need that; the question is whether you need to take it in
the primitives you are introducing.

Linus Torvalds

Sep 9, 2013, 2:15:54 PM
to Al Viro, Waiman Long, linux-fsdevel, Linux Kernel Mailing List, Chandramouleeswaran, Aswin, Norton, Scott J, George Spelvin, John Stoffel
On Mon, Sep 9, 2013 at 11:06 AM, Al Viro <vi...@zeniv.linux.org.uk> wrote:
>>
>> And while the rename sequence lock protects against _renames_, it does
>> not protect against just plain dentries getting free'd under memory
>> pressure.
>
> It protects the chain of ->d_parent, so they'd better not get freeds at
> all...

Yeah, I think you're right because of the other constraints. Holding
just the rename-lock for writing is actually sufficient, because
(a) you are guaranteed to already hold on to the end-point of the
walk (it's where you begin your path lookup), and so the memory
pressure issue is actually irrelevant.
(b) the only dentry lookup thing that remains is the parenthood chain
which is recursively reference-count-protected from the end, and yes,
in the absence of normal memory pressure, renames are the only thing
that will mess with that.

So even without holding d_lock, I guess we're actually safe even
without a real rcu read lock if we hold the rename lock for writing.

That actually argues for doing the rcu_read_lock() inside the helper
function. HOWEVER, I'd like a comment to that effect, to show why we
can access all those dentries without any other synchronization.

Linus

Al Viro

Sep 9, 2013, 2:21:21 PM
to Linus Torvalds, Waiman Long, linux-fsdevel, Linux Kernel Mailing List, Chandramouleeswaran, Aswin, Norton, Scott J, George Spelvin, John Stoffel
On Mon, Sep 09, 2013 at 07:06:04PM +0100, Al Viro wrote:
> I like my proposal for the set of primitives better:
>
> static inline bool seqretry_and_lock(seqlock_t *lock, unsigned *seq):
> {
> if ((*seq & 1) || !read_seqretry(lock, *seq))
> return true;
> *seq |= 1;
> write_seqlock(lock);
> return false;
> }
>
> static inline void seqretry_done(seqlock_t *lock, unsigned seq)
> {
> if (seq & 1)
> write_sequnlock(lock);
> }
>
> with the prepend_path() and friends becoming
>
> rcu_read_lock();
> seq = read_seqbegin(&rename_lock);
> again:
> ....
> if (!seqretry_and_lock(&rename_lock, &seq))
> goto again; /* now as writer */
> seqretry_done(&rename_lock, seq);
> rcu_read_unlock();

Actually, it's better for prepend_path() as well, because it's actually

rcu_read_lock();
seq = read_seqbegin(&rename_lock);
again:
....
if (error)
goto done;
....
if (!seqretry_and_lock(&rename_lock, &seq))
goto again; /* now as writer */
done:
seqretry_done(&rename_lock, seq);
rcu_read_unlock();

Posted variant will sometimes hit the following path:
* seq_readlock()
* start generating the output
* hit an error
[another process has taken and released rename_lock for some reason]
* hit read_seqretry_and_unlock(), which returns 1.
* retry everything with seq_writelock(), despite the error.

It's not too horrible (we won't be looping indefinitely, ignoring error
all along), but it's certainly subtle enough...
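The control-flow difference Al describes can be modelled in a few lines of userspace C. Everything here (fake_seqlock, the hard-coded error, the injected "rename") is an illustrative stand-in, not the kernel API; it only shows that the posted shape repeats a failed pass under the writer lock, while the `goto done` shape does not:

```c
#include <assert.h>

/* Minimal stand-in for the sequence-count check. */
struct fake_seqlock { unsigned seqcount; };

static int fake_read_seqretry(struct fake_seqlock *l, int seq)
{
	return l->seqcount != (unsigned)seq;
}

/* Posted variant: the retry check runs even after an error, so the
 * failed pass is repeated once more under the "writer lock" (odd seq). */
static int passes_posted(struct fake_seqlock *l)
{
	int seq = 0, passes = 0, error;
restart:
	passes++;
	error = -1;		/* output generation hits an error */
	l->seqcount += 2;	/* rename_lock taken/released meanwhile */
	(void)error;		/* ...and the error is never consulted */
	if (!(seq & 1) && fake_read_seqretry(l, seq)) {
		seq++;		/* switch to writer mode */
		goto restart;	/* retried despite the error */
	}
	return passes;
}

/* Variant with "if (error) goto done;": the error short-circuits the
 * retry check, so the failed pass is never repeated. */
static int passes_with_goto_done(struct fake_seqlock *l)
{
	int seq = 0, passes = 0, error;
restart:
	passes++;
	error = -1;
	l->seqcount += 2;
	if (error)
		goto done;
	if (!(seq & 1) && fake_read_seqretry(l, seq)) {
		seq++;
		goto restart;
	}
done:
	return passes;
}
```

Under the same injected race, the posted variant runs two passes and the `goto done` variant runs one.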

Al Viro

Sep 9, 2013, 2:36:19 PM
to Linus Torvalds, Waiman Long, linux-fsdevel, Linux Kernel Mailing List, Chandramouleeswaran, Aswin, Norton, Scott J, George Spelvin, John Stoffel
On Mon, Sep 09, 2013 at 07:21:11PM +0100, Al Viro wrote:
> Actually, it's better for prepend_path() as well, because it's actually
>
> rcu_read_lock();
> seq = read_seqbegin(&rename_lock);
> again:
> ....
> if (error)
> goto done;
> ....
> if (!seqretry_and_lock(&rename_lock, &seq))
> goto again; /* now as writer */
> done:
> seqretry_done(&rename_lock, seq);
> rcu_read_unlock();
>
> Posted variant will sometimes hit the following path:
> * seq_readlock()
> * start generating the output
> * hit an error
> [another process has taken and released rename_lock for some reason]
> * hit read_seqretry_and_unlock(), which returns 1.
> * retry everything with seq_writelock(), despite the error.
>
> It's not too horrible (we won't be looping indefinitely, ignoring error
> all along), but it's certainly subtle enough...

FWIW, what I propose is this (just the d_path-related parts):

diff --git a/fs/dcache.c b/fs/dcache.c
index 761e31b..b963605 100644
--- a/fs/dcache.c
+++ b/fs/dcache.c
@@ -88,6 +88,21 @@ EXPORT_SYMBOL(rename_lock);

static struct kmem_cache *dentry_cache __read_mostly;

+static inline bool seqretry_and_lock(seqlock_t *lock, unsigned *seq)
+{
+ if ((*seq & 1) || !read_seqretry(lock, *seq))
+ return true;
+ *seq |= 1;
+ write_seqlock(lock);
+ return false;
+}
+
+static inline void seqretry_done(seqlock_t *lock, unsigned seq)
+{
+ if (seq & 1)
+ write_sequnlock(lock);
+}
+
/*
* This is the single most critical data structure when it comes
* to the dcache: the hashtable for lookups. Somebody should try
@@ -2644,9 +2659,39 @@ static int prepend(char **buffer, int *buflen, const char *str, int namelen)
@@ -2656,7 +2701,14 @@ static int prepend_name(char **buffer, int *buflen, struct qstr *name)
* @buffer: pointer to the end of the buffer
* @buflen: pointer to buffer length
*
- * Caller holds the rename_lock.
+ * The function tries to write out the pathname without taking any lock other
+ * than the RCU read lock to make sure that dentries won't go away. It only
+ * checks the sequence number of the global rename_lock as any change in the
+ * dentry's d_seq will be preceded by changes in the rename_lock sequence
+ * number. If the sequence number had been changed, it will restart the whole
+ * pathname back-tracing sequence again. It performs a total of 3 trials of
+ * lockless back-tracing sequences before falling back to take the
+ * rename_lock.
*/
static int prepend_path(const struct path *path,
const struct path *root,
@@ -2665,54 +2717,64 @@ static int prepend_path(const struct path *path,
struct dentry *dentry = path->dentry;
struct vfsmount *vfsmnt = path->mnt;
struct mount *mnt = real_mount(vfsmnt);
- bool slash = false;
int error = 0;
+ unsigned seq;
+ char *bptr;
+ int blen;

+ rcu_read_lock();
+ seq = read_seqbegin(&rename_lock);
+restart:
+ bptr = *buffer;
+ blen = *buflen;
+ if (error >= 0 && !seqretry_and_lock(&rename_lock, &seq))
+ goto restart;

- if (!error && !slash)
- error = prepend(buffer, buflen, "/", 1);
-
- return error;
+ seqretry_done(&rename_lock, seq);
+ rcu_read_unlock();

-global_root:
- /*
- * Filesystems needing to implement special "root names"
- * should do so with ->d_dname()
- */
- if (IS_ROOT(dentry) &&
- (dentry->d_name.len != 1 || dentry->d_name.name[0] != '/')) {
- WARN(1, "Root dentry has weird name <%.*s>\n",
- (int) dentry->d_name.len, dentry->d_name.name);
- }
- if (!slash)
- error = prepend(buffer, buflen, "/", 1);
- if (!error)
- error = is_mounted(vfsmnt) ? 1 : 2;
+ if (error >= 0 && bptr == *buffer) {
+ if (--blen < 0)
+ error = -ENAMETOOLONG;
+ else
+ *--bptr = '/';
+ }
+ *buffer = bptr;
+ *buflen = blen;
return error;
}

@@ -2741,9 +2803,7 @@ char *__d_path(const struct path *path,

prepend(&res, &buflen, "\0", 1);
br_read_lock(&vfsmount_lock);
- write_seqlock(&rename_lock);
error = prepend_path(path, root, &res, &buflen);
- write_sequnlock(&rename_lock);
br_read_unlock(&vfsmount_lock);

if (error < 0)
@@ -2762,9 +2822,7 @@ char *d_absolute_path(const struct path *path,

prepend(&res, &buflen, "\0", 1);
br_read_lock(&vfsmount_lock);
- write_seqlock(&rename_lock);
error = prepend_path(path, &root, &res, &buflen);
- write_sequnlock(&rename_lock);
br_read_unlock(&vfsmount_lock);

if (error > 1)
@@ -2830,9 +2888,7 @@ char *d_path(const struct path *path, char *buf, int buflen)

get_fs_root(current->fs, &root);
br_read_lock(&vfsmount_lock);
- write_seqlock(&rename_lock);
error = path_with_deleted(path, &root, &res, &buflen);
- write_sequnlock(&rename_lock);
br_read_unlock(&vfsmount_lock);
if (error < 0)
res = ERR_PTR(error);
@@ -2867,10 +2923,10 @@ char *simple_dname(struct dentry *dentry, char *buffer, int buflen)
char *end = buffer + buflen;
/* these dentries are never renamed, so d_lock is not needed */
if (prepend(&end, &buflen, " (deleted)", 11) ||
- prepend_name(&end, &buflen, &dentry->d_name) ||
+ prepend(&end, &buflen, dentry->d_name.name, dentry->d_name.len) ||
prepend(&end, &buflen, "/", 1))
end = ERR_PTR(-ENAMETOOLONG);
- return end;
+ return end;
}

/*
@@ -2878,30 +2934,40 @@ char *simple_dname(struct dentry *dentry, char *buffer, int buflen)
*/
static char *__dentry_path(struct dentry *dentry, char *buf, int buflen)
{
- char *end = buf + buflen;
- char *retval;
+ char *end, *retval;
+ int len, seq;
+ int error = 0;

- prepend(&end, &buflen, "\0", 1);
+ rcu_read_lock();
+ seq = read_seqbegin(&rename_lock);
+restart:
+ end = buf + buflen;
+ len = buflen;
+ prepend(&end, &len, "\0", 1);
if (buflen < 1)
goto Elong;
/* Get '/' right */
retval = end-1;
*retval = '/';
-
while (!IS_ROOT(dentry)) {
struct dentry *parent = dentry->d_parent;
-		int error;

prefetch(parent);
- spin_lock(&dentry->d_lock);
- error = prepend_name(&end, &buflen, &dentry->d_name);
- spin_unlock(&dentry->d_lock);
- if (error != 0 || prepend(&end, &buflen, "/", 1) != 0)
- goto Elong;
+ error = prepend_name(&end, &len, &dentry->d_name);
+ if (error)
+ goto done;

retval = end;
dentry = parent;
}
+ if (!seqretry_and_lock(&rename_lock, &seq))
+ goto restart;
+done:
+ seqretry_done(&rename_lock, seq);
+ rcu_read_unlock();
+ if (error)
+ goto Elong;
return retval;
Elong:
return ERR_PTR(-ENAMETOOLONG);
@@ -2909,13 +2975,7 @@ Elong:

char *dentry_path_raw(struct dentry *dentry, char *buf, int buflen)
{
- char *retval;
-
- write_seqlock(&rename_lock);
- retval = __dentry_path(dentry, buf, buflen);
- write_sequnlock(&rename_lock);
-
- return retval;
+ return __dentry_path(dentry, buf, buflen);
}
EXPORT_SYMBOL(dentry_path_raw);

@@ -2924,7 +2984,6 @@ char *dentry_path(struct dentry *dentry, char *buf, int buflen)
char *p = NULL;
char *retval;

- write_seqlock(&rename_lock);
if (d_unlinked(dentry)) {
p = buf + buflen;
if (prepend(&p, &buflen, "//deleted", 10) != 0)
@@ -2932,7 +2991,6 @@ char *dentry_path(struct dentry *dentry, char *buf, int buflen)
buflen++;
}
retval = __dentry_path(dentry, buf, buflen);
- write_sequnlock(&rename_lock);
if (!IS_ERR(retval) && p)
*p = '/'; /* restore '/' overriden with '\0' */
return retval;
@@ -2971,7 +3029,6 @@ SYSCALL_DEFINE2(getcwd, char __user *, buf, unsigned long, size)

error = -ENOENT;
br_read_lock(&vfsmount_lock);
- write_seqlock(&rename_lock);
if (!d_unlinked(pwd.dentry)) {
unsigned long len;
char *cwd = page + PAGE_SIZE;
@@ -2979,7 +3036,6 @@ SYSCALL_DEFINE2(getcwd, char __user *, buf, unsigned long, size)

prepend(&cwd, &buflen, "\0", 1);
error = prepend_path(&pwd, &root, &cwd, &buflen);
- write_sequnlock(&rename_lock);
br_read_unlock(&vfsmount_lock);

if (error < 0)
@@ -3000,7 +3056,6 @@ SYSCALL_DEFINE2(getcwd, char __user *, buf, unsigned long, size)
error = -EFAULT;
}
} else {
- write_sequnlock(&rename_lock);
br_read_unlock(&vfsmount_lock);
}

Al Viro

Sep 9, 2013, 2:46:45 PM
to Linus Torvalds, Waiman Long, linux-fsdevel, Linux Kernel Mailing List, Chandramouleeswaran, Aswin, Norton, Scott J, George Spelvin, John Stoffel
On Mon, Sep 09, 2013 at 07:36:08PM +0100, Al Viro wrote:
> FWIW, what I propose is this (just the d_path-related parts):

Grrr... It's still the wrong set of primitives ;-/ Nevermind that one...

Waiman Long

Sep 9, 2013, 2:47:16 PM
to Al Viro, Linus Torvalds, linux-fsdevel, Linux Kernel Mailing List, Chandramouleeswaran, Aswin, Norton, Scott J, George Spelvin, John Stoffel
I am fine with your proposed change as long as it gets the job done. It
doesn't really matter if you do it or I do it.

Thanks for taking the effort to make it better for us all.

-Longman

Al Viro

Sep 9, 2013, 3:10:39 PM
to Waiman Long, Linus Torvalds, linux-fsdevel, Linux Kernel Mailing List, Chandramouleeswaran, Aswin, Norton, Scott J, George Spelvin, John Stoffel
On Mon, Sep 09, 2013 at 02:46:57PM -0400, Waiman Long wrote:

> I am fine with your proposed change as long as it gets the job done.

I suspect that the real problem is the unlock part of read_seqretry_or_unlock();
for d_walk() we want to be able to check if we need retry and continue walking
if we do not. Let's do it that way: I've applied your patch as is, with the
next step being
* split read_seqretry_or_unlock():
need_seqretry() (return !(seq & 1) && read_seqretry(lock, seq)),
done_seqretry() (if (seq & 1) write_sequnlock(lock)),
your if (read_seqretry_or_unlock(&rename_lock, &seq))
goto restart;
becoming
if (need_seqretry(&rename_lock, seq)) {
seq = 1;
goto restart;
}
done_seqretry(&rename_lock, seq);

Then d_walk() is trivially massaged to use read_seqbegin_or_lock(),
need_seqretry() and done_seqretry(). Give me a few, I'll post it...
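The proposed split can be sketched in userspace as follows. fake_seqlock and its write_locked field are illustrative stand-ins for the kernel's seqlock_t and write_seqlock()/write_sequnlock(), not the real implementation; the point is that need_seqretry() only reports whether a lockless (even-seq) pass raced with a writer, and done_seqretry() unlocks iff the odd-seq fallback was taken:

```c
#include <assert.h>

/* Stand-in for seqlock_t: write_locked models holding rename_lock
 * for write during the odd-sequence fallback pass. */
struct fake_seqlock { unsigned seqcount; int write_locked; };

static int fake_read_seqretry(struct fake_seqlock *l, int seq)
{
	return l->seqcount != (unsigned)seq;
}

/* Retry only if this was a lockless pass (even seq) that raced. */
static int need_seqretry(struct fake_seqlock *lock, int seq)
{
	return !(seq & 1) && fake_read_seqretry(lock, seq);
}

/* Drop the write lock only if the odd-seq fallback was taken. */
static void done_seqretry(struct fake_seqlock *lock, int seq)
{
	if (seq & 1)
		lock->write_locked = 0;	/* write_sequnlock(lock) */
}
```

This lets a caller like d_walk() test for the race and keep walking when none occurred, instead of having the check and the unlock fused into one primitive.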

Al Viro

Sep 9, 2013, 3:28:46 PM
to Waiman Long, Linus Torvalds, linux-fsdevel, Linux Kernel Mailing List, Chandramouleeswaran, Aswin, Norton, Scott J, George Spelvin, John Stoffel
On Mon, Sep 09, 2013 at 08:10:29PM +0100, Al Viro wrote:
> On Mon, Sep 09, 2013 at 02:46:57PM -0400, Waiman Long wrote:
>
> > I am fine with your proposed change as long as it gets the job done.
>
> I suspect that the real problem is the unlock part of read_seqretry_or_unlock();
> for d_walk() we want to be able to check if we need retry and continue walking
> if we do not. Let's do it that way: I've applied your patch as is, with the
> next step being
> * split read_seqretry_or_unlock():
> need_seqretry() (return !(seq & 1) && read_seqretry(lock, seq)),
> done_seqretry() (if (seq & 1) write_sequnlock(lock)),
> your if (read_seqretry_or_unlock(&rename_lock, &seq))
> goto restart;
> becoming
> if (need_seqretry(&rename_lock, seq)) {
> seq = 1;
> goto restart;
> }
> done_seqretry(&rename_lock, seq);
>
> Then d_walk() is trivially massaged to use read_seqbegin_or_lock(),
> need_seqretry() and done_seqretry(). Give me a few, I'll post it...

OK, how about this? It splits read_seqretry_or_unlock(), takes
rcu_read_{lock,unlock} in the callers and converts d_walk() to those
primitives. I've pushed that and your commit into vfs.git#experimental
(head at 48f5ec2, should propagate in a few); guys, please give it a look
and comment.

diff --git a/fs/dcache.c b/fs/dcache.c
index 38b1b09..b9caf47 100644
--- a/fs/dcache.c
+++ b/fs/dcache.c
@@ -100,30 +100,21 @@ static struct kmem_cache *dentry_cache __read_mostly;
*/
static inline void read_seqbegin_or_lock(seqlock_t *lock, int *seq)
{
- if (!(*seq & 1)) { /* Even */
+ if (!(*seq & 1)) /* Even */
*seq = read_seqbegin(lock);
- rcu_read_lock();
- } else /* Odd */
+ else /* Odd */
write_seqlock(lock);
}

-/**
- * read_seqretry_or_unlock - end a seqretry or lock block & return retry status
- * lock : sequence lock
- * seq : sequence number
- * Return: 1 to retry operation again, 0 to continue
- */
-static inline int read_seqretry_or_unlock(seqlock_t *lock, int *seq)
+static inline int need_seqretry(seqlock_t *lock, int seq)
{
- if (!(*seq & 1)) { /* Even */
- rcu_read_unlock();
- if (read_seqretry(lock, *seq)) {
- (*seq)++; /* Take writer lock */
- return 1;
- }
- } else /* Odd */
+ return !(seq & 1) && read_seqretry(lock, seq);
+}
+
+static inline void done_seqretry(seqlock_t *lock, int seq)
+{
+ if (seq & 1)
write_sequnlock(lock);
- return 0;
}

/*
@@ -1047,7 +1038,7 @@ void shrink_dcache_for_umount(struct super_block *sb)
* the parenthood after dropping the lock and check
* that the sequence number still matches.
*/
-static struct dentry *try_to_ascend(struct dentry *old, int locked, unsigned seq)
+static struct dentry *try_to_ascend(struct dentry *old, unsigned seq)
{
struct dentry *new = old->d_parent;

@@ -1061,7 +1052,7 @@ static struct dentry *try_to_ascend(struct dentry *old, int locked, unsigned seq
*/
if (new != old->d_parent ||
(old->d_flags & DCACHE_DENTRY_KILLED) ||
- (!locked && read_seqretry(&rename_lock, seq))) {
+ need_seqretry(&rename_lock, seq)) {
spin_unlock(&new->d_lock);
new = NULL;
}
@@ -1098,13 +1089,12 @@ static void d_walk(struct dentry *parent, void *data,
{
struct dentry *this_parent;
struct list_head *next;
- unsigned seq;
- int locked = 0;
+ unsigned seq = 0;
enum d_walk_ret ret;
bool retry = true;

- seq = read_seqbegin(&rename_lock);
again:
+ read_seqbegin_or_lock(&rename_lock, &seq);
this_parent = parent;
spin_lock(&this_parent->d_lock);

@@ -1158,13 +1148,13 @@ resume:
*/
if (this_parent != parent) {
struct dentry *child = this_parent;
- this_parent = try_to_ascend(this_parent, locked, seq);
+ this_parent = try_to_ascend(this_parent, seq);
if (!this_parent)
goto rename_retry;
next = child->d_u.d_child.next;
goto resume;
}
- if (!locked && read_seqretry(&rename_lock, seq)) {
+ if (need_seqretry(&rename_lock, seq)) {
spin_unlock(&this_parent->d_lock);
goto rename_retry;
}
@@ -1173,17 +1163,13 @@ resume:

out_unlock:
spin_unlock(&this_parent->d_lock);
- if (locked)
- write_sequnlock(&rename_lock);
+ done_seqretry(&rename_lock, seq);
return;

rename_retry:
if (!retry)
return;
- if (locked)
- goto again;
- locked = 1;
- write_seqlock(&rename_lock);
+ seq = 1;
goto again;
}

@@ -2745,6 +2731,7 @@ static int prepend_path(const struct path *path,
char *bptr;
int blen;

+ rcu_read_lock();
restart:
bptr = *buffer;
blen = *buflen;
@@ -2783,8 +2770,13 @@ restart:

dentry = parent;
}
- if (read_seqretry_or_unlock(&rename_lock, &seq))
+ if (!(seq & 1))
+ rcu_read_unlock();
+ if (need_seqretry(&rename_lock, seq)) {
+ seq = 1;
goto restart;
+ }
+ done_seqretry(&rename_lock, seq);

if (error >= 0 && bptr == *buffer) {
if (--blen < 0)
@@ -2957,6 +2949,7 @@ static char *__dentry_path(struct dentry *dentry, char *buf, int buflen)
int len, seq = 0;
int error = 0;

+ rcu_read_lock();
restart:
end = buf + buflen;
len = buflen;
@@ -2979,8 +2972,13 @@ restart:
retval = end;
dentry = parent;
}
- if (read_seqretry_or_unlock(&rename_lock, &seq))
+ if (!(seq & 1))
+ rcu_read_unlock();
+ if (need_seqretry(&rename_lock, seq)) {
+ seq = 1;
goto restart;
+ }
+ done_seqretry(&rename_lock, seq);
if (error)
goto Elong;
return retval;

Waiman Long

Sep 9, 2013, 6:57:51 PM
to Al Viro, Linus Torvalds, linux-fsdevel, Linux Kernel Mailing List, Chandramouleeswaran, Aswin, Norton, Scott J, George Spelvin, John Stoffel
On 09/09/2013 03:28 PM, Al Viro wrote:
> On Mon, Sep 09, 2013 at 08:10:29PM +0100, Al Viro wrote:
>> On Mon, Sep 09, 2013 at 02:46:57PM -0400, Waiman Long wrote:
>>
>>> I am fine with your proposed change as long as it gets the job done.
>> I suspect that the real problem is the unlock part of read_seqretry_or_unlock();
>> for d_walk() we want to be able to check if we need retry and continue walking
>> if we do not. Let's do it that way: I've applied your patch as is, with the
>> next step being
>> * split read_seqretry_or_unlock():
>> need_seqretry() (return !(seq & 1) && read_seqretry(lock, seq)),
>> done_seqretry() (if (seq & 1) write_sequnlock(lock)),
>> your if (read_seqretry_or_unlock(&rename_lock, &seq))
>> goto restart;
>> becoming
>> if (need_seqretry(&rename_lock, seq)) {
>> seq = 1;
>> goto restart;
>> }
>> done_seqretry(&rename_lock, seq);
>>
>> Then d_walk() is trivially massaged to use of read_seqbegin_or_lock(),
>> need_seqretry() and done_seqretry(). Give me a few, I'll post it...
> OK, how about this? It splits read_seqretry_or_unlock(), takes
> rcu_read_{lock,unlock} in the callers and converts d_walk() to those
> primitives. I've pushed that and your commit into vfs.git#experimental
> (head at 48f5ec2, should propagate in a few); guys, please give it a look
> and comment.

The changes look good to me. I was planning to take rcu_read_lock() out
and do something similar, but your change is good. BTW, I think Linus
wants some comments added on why the RCU lock is needed without the
rename_lock; I can put that in with a follow-up patch once the current
change is merged.

Thanks for your help and inspiration on this patch.

-Longman

George Spelvin

Sep 9, 2013, 8:40:38 PM
to torv...@linux-foundation.org, vi...@zeniv.linux.org.uk, as...@hp.com, jo...@stoffel.org, linux-...@vger.kernel.org, linux-...@vger.kernel.org, li...@horizon.com, scott....@hp.com, Waima...@hp.com
I'm really wondering about only trying once before taking the write lock.
Yes, using the lsbit is a cute hack, but are we using it for its cuteness
rather than its effectiveness?

Renames happen occasionally. If that causes all the current pathname
translations to fall back to the write lock, that is fairly heavy.
Worse, all of those translations will (unnecessarily) bump the write
seqcount, triggering *other* translations to fall back to the write-lock
path.

One patch to fix this would be to have the fallback read algorithm take
sl->lock but *not* touch sl->seqcount, so it wouldn't break concurrent
readers.

But another is to simply retry at least once (two attempts) on the
non-exclusive path before falling back to the exclusive one. This means
that the count lsbit is no longer enough space for a retry counter, but
oh, well.

(If you really want to use one word, perhaps a better heuristic
as to how to retry would be to examine the *number* of writes
to the seqlock during the read. If there was only one, there's a fair
chance that another read will succeed. If there was more than one
(i.e. the seqlock has incremented by 3 or more), then forcing the
writers to stop is probably necessary.)

Al Viro

Sep 9, 2013, 8:58:07 PM
to George Spelvin, torv...@linux-foundation.org, as...@hp.com, jo...@stoffel.org, linux-...@vger.kernel.org, linux-...@vger.kernel.org, scott....@hp.com, Waima...@hp.com
On Mon, Sep 09, 2013 at 08:40:20PM -0400, George Spelvin wrote:
> I'm really wondering about only trying once before taking the write lock.
> Yes, using the lsbit is a cute hack, but are we using it for its cuteness
> rather than its effectiveness?
>
> Renames happen occasionally. If that causes all the current pathname
> translations to fall back to the write lock, that is fairly heavy.
> Worse, all of those translations will (unnecessarily) bump the write
> seqcount, triggering *other* translations to fail back to the write-lock
> path.

_What_ "pathname translations"? Pathname resolution doesn't fall back to
seq_writelock() at all.

Ramkumar Ramachandra

Sep 9, 2013, 9:16:25 PM
to Al Viro, George Spelvin, Linus Torvalds, Chandramouleeswaran, Aswin, jo...@stoffel.org, linux-fsdevel, LKML, Norton, Scott J, Waiman Long
Al Viro wrote:
> _What_ "pathname translations"? Pathname resolution doesn't fall back to
> seq_writelock() at all.

Maybe it should then?

Linus Torvalds

Sep 9, 2013, 9:34:43 PM
to Ramkumar Ramachandra, Al Viro, George Spelvin, Chandramouleeswaran, Aswin, John Stoffel, linux-fsdevel, LKML, Norton, Scott J, Waiman Long
On Mon, Sep 9, 2013 at 6:15 PM, Ramkumar Ramachandra <arta...@gmail.com> wrote:
>
> Maybe it should then?

It doesn't need to. The RCU lookup looks at individual dentry sequence
numbers and doesn't care about the bigger rename sequence number at
all.

The fallback (if you hit one of the very very rare races, or if you
hit a symlink) ends up doing per-path-component lookups under the
rename sequence lock, but for it, read-locking it until it succeeds is
the right thing to do.

Linus

Al Viro

Sep 9, 2013, 10:25:24 PM
to Linus Torvalds, Ramkumar Ramachandra, George Spelvin, Chandramouleeswaran, Aswin, John Stoffel, linux-fsdevel, LKML, Norton, Scott J, Waiman Long
On Mon, Sep 09, 2013 at 06:34:16PM -0700, Linus Torvalds wrote:
> On Mon, Sep 9, 2013 at 6:15 PM, Ramkumar Ramachandra <arta...@gmail.com> wrote:
> >
> > Maybe it should then?
>
> It doesn't need to. The RCU lookup looks at individual dentry sequence
> numbers and doesn't care about the bigger rename sequence number at
> all.

One name: Mark V. Shaney...

Linus Torvalds

Sep 9, 2013, 10:33:17 PM
to Al Viro, George Spelvin, Chandramouleeswaran, Aswin, John Stoffel, linux-fsdevel, LKML, Norton, Scott J, Waiman Long
On Mon, Sep 9, 2013 at 7:25 PM, Al Viro <vi...@zeniv.linux.org.uk> wrote:
>
> One name: Mark V. Shaney...

Heh, yes. I had ignored the earlier emails, and that last one looked
more reasonable than the earlier ones ;)

Linus

Ramkumar Ramachandra

Sep 9, 2013, 11:13:44 PM
to Linus Torvalds, Al Viro, George Spelvin, Chandramouleeswaran, Aswin, John Stoffel, linux-fsdevel, LKML, Norton, Scott J, Waiman Long
Linus Torvalds wrote:
> It doesn't need to. The RCU lookup looks at individual dentry sequence
> numbers and doesn't care about the bigger rename sequence number at
> all.

Right; it's sequential.

> The fallback (if you hit one of the very very rare races, or if you
> hit a symlink) ends up doing per-path-component lookups under the
> rename sequence lock, but for it, read-locking it until it succeeds is
> the right thing to do.

No, it's write-locking.

Waiman Long

Sep 9, 2013, 11:58:05 PM
to George Spelvin, torv...@linux-foundation.org, vi...@zeniv.linux.org.uk, as...@hp.com, jo...@stoffel.org, linux-...@vger.kernel.org, linux-...@vger.kernel.org, scott....@hp.com
On 09/09/2013 08:40 PM, George Spelvin wrote:
> I'm really wondering about only trying once before taking the write lock.
> Yes, using the lsbit is a cute hack, but are we using it for its cuteness
> rather than its effectiveness?
>
> Renames happen occasionally. If that causes all the current pathname
> translations to fall back to the write lock, that is fairly heavy.
> Worse, all of those translations will (unnecessarily) bump the write
> seqcount, triggering *other* translations to fail back to the write-lock
> path.
>
> One patch to fix this would be to have the fallback read algorithm take
> sl->lock but *not* touch sl->seqcount, so it wouldn't break concurrent
> readers.

Actually, a follow-up patch I am planning will introduce a
read_seqlock() primitive in seqlock.h that does exactly that. The
write_seqlock() call in this patch will then be changed to
read_seqlock().

-Longman

George Spelvin

Sep 10, 2013, 4:24:42 AM
to li...@horizon.com, vi...@zeniv.linux.org.uk, as...@hp.com, jo...@stoffel.org, linux-...@vger.kernel.org, linux-...@vger.kernel.org, scott....@hp.com, torv...@linux-foundation.org, Waima...@hp.com
> _What_ "pathname translations"? Pathname resolution doesn't fall back to
> seq_writelock() at all.

I meant the translation of dentry to pathname that this thread is about.
Apologies for my confusing abbreviation. Since the reverse translation
(pathname to dentry) is done a level at a time and people understand
that it's not atomic, I agree with Linus that a livelock there is too
remote a possibility to worry about.