UTF-8 and case-insensitivity

tri...@samba.org

Feb 17, 2004, 4:15:33 AM

to linux-...@vger.kernel.org

Given how much pain the "kernel is agnostic to charset encoding"
attitude has cost me in terms of programming pain, I thought I should
de-cloak from lurk mode and put my 2c into the UTF-8 issue.

Personally I think that eventually the Linux kernel will have to
embrace the interpretation of the byte streams that applications have
given it, despite the fact that this will be very painful and
potentially quite complex. The reason is that I think that eventually
the Linux kernel will need to efficiently support a userspace policy
of case-insensitivity and the only way to do case-insensitive filename
operations is to interpret those byte streams as a particular
encoding.

Personally I much prefer the systems I use to be case-sensitive, but
there are important applications that require case-insensitivity for
interoperability. Right now it is not possible to write a case
insensitive application on Linux in an efficient manner. With the
current "encoding agnostic" APIs a simple open() or stat() call
becomes a horrendously expensive operation and one that is fraught
with race conditions. Providing the same functionality in the kernel
is dirt cheap by comparison (not cheap in terms of code complexity,
but cheap in terms of runtime efficiency).
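To make the cost concrete, here is a minimal sketch of what a userspace case-insensitive lookup has to do today: scan the whole directory and compare every entry. The helper name `ci_find_in_dir` is hypothetical (it is not from Samba or the kernel), and `strcasecmp()` only folds ASCII case in the default locale, so this is an illustration under those assumptions rather than a real implementation.

```c
#include <assert.h>
#include <dirent.h>
#include <stdio.h>
#include <string.h>
#include <strings.h>

/*
 * Hypothetical helper: look for an entry in `dirpath` whose name matches
 * `name` when case is ignored (ASCII-only folding via strcasecmp()).
 * On a match the on-disk spelling is copied into `out`.
 * Returns 1 if found, 0 if not found, -1 if the directory cannot be read.
 */
static int ci_find_in_dir(const char *dirpath, const char *name,
                          char *out, size_t outlen)
{
    assert(dirpath && name && out && outlen > 0);

    DIR *d = opendir(dirpath);
    if (d == NULL)
        return -1;

    int found = 0;
    struct dirent *de;
    /* Every lookup is a linear scan; a miss always reads the whole directory. */
    while ((de = readdir(d)) != NULL) {
        if (strcasecmp(de->d_name, name) == 0) {
            snprintf(out, outlen, "%s", de->d_name);
            found = 1;
            break;
        }
    }
    closedir(d);
    return found;
}
```

Note that nothing stops another process from creating or renaming an entry between this scan and the later open() on the returned name, which is the race mentioned above.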

Cheers, Tridge
-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majo...@vger.kernel.org
More majordomo info at http://vger.kernel.org/majordomo-info.html
Please read the FAQ at http://www.tux.org/lkml/

Linus Torvalds

Feb 17, 2004, 5:15:05 AM

to Andrew Tridgell, Kernel Mailing List, Al Viro

[ Al cc'd, because while I'm pretty certain that he agrees with me 100% on
the craziness of case-insensitive name lookups, he may have some input
on the "samba helper" function approach. That input may well boil down
to "Linus is crazy", of course. Wouldn't be the first time ;)

Andrew - you really should assume that case insensitivity is a hell of a
lot more costly than you think it is, and forget that particular idea.
Let's see if there are acceptable half-measures. ]

On Tue, 17 Feb 2004 tri...@samba.org wrote:
>
> Given how much pain the "kernel is agnostic to charset encoding"
> attitude has cost me in terms of programming pain, I thought I should
> de-cloak from lurk mode and put my 2c into the UTF-8 issue.
>
> Personally I think that eventually the Linux kernel will have to
> embrace the interpretation of the byte streams that applications have
> given it, despite the fact that this will be very painful and
> potentially quite complex.

I seriously doubt it. There just isn't any point.

> The reason is that I think that eventually
> the Linux kernel will need to efficiently support a userspace policy
> of case-insensitivity and the only way to do case-insensitive filename
> operations is to interpret those byte streams as a particular
> encoding.

The thing is, if you want to do efficient user-space case-insensitive
lookups, that is a _completely_ different matter from having the kernel do
case-insensitivity.

Kernel-level case insensitivity is a total disaster, and your "very
painful and potentially quite complex" assertion is the understatement of
the year. The thing is, you can't sanely do dentry caching, since the case
insensitivity has to be per-open or at least per-process (you MUST NOT be
case-insensitive in a POSIX process).

So the only way to do case-insensitive names is to do all lookups very
slowly. I'm willing to bet that WNT opens files a hell of a lot slower
than Linux does, and one big portion of that is exactly the fact that
Linux can do a really good job with the dentry cache.

And that _depends_ on a well-defined and unique filename setup (by
changing the hashing function and compare function, a filesystem can do a
limited kind of case-insensitivity right now in Linux, but then it will
have to be not only fairly slow, but also case-insensitive for _everybody_
which is unacceptable in a mixed POSIX/samba environment).

In other words, just forget the whole notion. The only set of people who have
any reason at _all_ to want it is the samba team, and we can solve the
samba-specific problems other ways.

Just take that as a simple fact - case insensitivity in the kernel is such
a horribly bad idea, that you really shouldn't go there.

With that destructive criticism out of the way, let's look at somewhat
more constructive approaches, ie some way to allow certain processes that
need it better help in their quest for case insensitivity.

Let's start with some assumptions:

- MOST name lookups are likely results of some kind of "readdir()"
  lookup, and tend to have the case right in the first place. So that
  should go fast. Maybe Tridge has some statistics on this one?

- samba probably has certain pretty well-defined special patterns for
  what it wants to do with a filename, so you probably don't need a
  generic "everything that takes a filename should be case-insensitive",
  and it would be acceptable to have a few _very_ specific system calls.

With those assumptions out of the way, we could think of an interface that
exports some partial functionality of the "lookup_path()" code in the kernel
as a special system call. In particular, something that takes an input
pathname, and is able to stop at any point of the name when a lookup
fails.

So some variation of the interface

    int magic_open(
        /* Input arguments */
        const char *pathname,
        unsigned long flags,
        mode_t mode,

        /* output arguments */
        int *fd,
        struct stat *st,
        int *successful_path_length);

ie the system call would:

- look up as far into the pathname (using _exact_ lookup) as possible
- return the error code of the last failure
- the "flags" could be extended so that you can specify that you mustn't
  traverse ".." or symlinks (ie those would count as failures)

but also:

- fill in the "struct stat" information for the last _successful_
  pathname component.
- fill in the "fd" with a fd of the last _successful_ pathname component.
- tell how much of the pathname it could traverse.

so that the user can do a "readdir" and try to "fix up" the problem
without having to restart the whole thing. For the (hopefully common case)
where the cases match, this would just boil down to an "open with stat
information" thing.
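The behaviour described above can be approximated in userspace today, one component at a time, which may help show what the proposed call would return. This is only a sketch under stated assumptions: `partial_walk` is a hypothetical name, it opens every component with O_RDONLY (so it fails on entries the caller cannot read), and it has no "flags" to refuse ".." or symlinks. A real magic_open() would do all of this in a single system call.

```c
#include <assert.h>
#include <errno.h>
#include <fcntl.h>
#include <string.h>
#include <sys/stat.h>
#include <unistd.h>

/*
 * Hypothetical userspace approximation of the proposed magic_open().
 * Walks `pathname` with exact-match lookups and stops at the first failure.
 *   *fd     : descriptor of the last component that resolved
 *   *st     : fstat() information for that component
 *   *ok_len : number of bytes of `pathname` successfully traversed
 * Returns 0 if the whole path resolved, otherwise the errno of the failure.
 */
static int partial_walk(const char *pathname, int *fd, struct stat *st,
                        size_t *ok_len)
{
    assert(pathname && fd && st && ok_len);
    *fd = -1;
    *ok_len = 0;

    /* Start from "/" for absolute paths, from the cwd otherwise. */
    int cur = open(pathname[0] == '/' ? "/" : ".", O_RDONLY);
    if (cur < 0)
        return errno;

    int err = 0;
    const char *p = pathname;
    for (;;) {
        while (*p == '/')               /* skip separators */
            p++;
        if (*p == '\0')
            break;

        size_t n = strcspn(p, "/");     /* length of the next component */
        char comp[256];
        if (n >= sizeof comp) {
            err = ENAMETOOLONG;
            break;
        }
        memcpy(comp, p, n);
        comp[n] = '\0';

        int next = openat(cur, comp, O_RDONLY);
        if (next < 0) {
            err = errno;                /* remember why the walk stopped */
            break;
        }
        close(cur);
        cur = next;
        p += n;
        *ok_len = (size_t)(p - pathname);
    }

    if (fstat(cur, st) < 0) {
        err = errno;
        close(cur);
        return err;
    }
    *fd = cur;
    return err;
}
```

On ENOENT the caller would readdir() the returned descriptor, look for a case-insensitive match for the component starting at `pathname + ok_len`, and continue from there instead of restarting the whole lookup.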

We'd need something more interesting to guarantee unique filename on file
create, possibly even including letting a trusted process maintain some
locks in the VFS layer. The point being that the kernel can _help_ some
specific usage, but making case-insensitive names be part of the VFS layer
proper is not acceptable.

I suspect we can do case-insensitive names faster than WNT even with a
fairly complex user-mode interface. Just because _not_ having them in the
kernel allows us to have much faster default behaviour.

Linus

Tim Connors

Feb 17, 2004, 5:27:14 AM

to linux-...@vger.kernel.org

tri...@samba.org said on Tue, 17 Feb 2004 15:12:06 +1100:

> Given how much pain the "kernel is agnostic to charset encoding"
> attitude has cost me in terms of programming pain, I thought I should
> de-cloak from lurk mode and put my 2c into the UTF-8 issue.
>
> Personally I think that eventually the Linux kernel will have to
> embrace the interpretation of the byte streams that applications have
> given it,

What applications?

> despite the fact that this will be very painful and
> potentially quite complex. The reason is that I think that eventually
> the Linux kernel will need to efficiently support a userspace policy
> of case-insensitivity and the only way to do case-insensitive filename
> operations is to interpret those byte streams as a particular
> encoding.
>
> Personally I much prefer the systems I use to be case-sensitive, but
> there are important applications that require case-insensitivity for
> interoperability.

Why? Sounds pretty idiotic to me.

If you don't like it, use some microshit filesystem like vfat. I'll
keep using ext3 etc, thanks.

--
TimC -- http://astronomy.swin.edu.au/staff/tconnors/
Conclusion to my thesis -- "It is trivial to show that it is
clearly obvious that this is not woofly."

tri...@samba.org

Feb 17, 2004, 6:55:14 AM

to Linus Torvalds, Kernel Mailing List, Al Viro

Linus,

> Kernel-level case insensitivity is a total disaster, and your "very
> painful and potentially quite complex" assertion is the understatement of
> the year. The thing is, you can't sanely do dentry caching, since the case
> insensitivity has to be per-open or at least per-process (you MUST NOT be
> case-insensitive in a POSIX process).

right, and the patches to add this support to Linux that I have been
involved with in the past have been per-process. You are right that it
is messy, but it is not *horribly* messy. In fact I'd say it is no
worse than many of the other things we already have in the kernel,
although it certainly is much harder than sticking to the "bag of
bytes" interpretation of filenames. I just think that in this case the
simple solution is also wrong.

> So the only way to do case-insensitive names is to do all lookups very
> slowly.

I don't agree with this at all. I agree that the worst-case will get
worse, but I see absolutely no reason why the average case will get
significantly worse and I think that the worst case will be rare.

In fact, John Bonesio did a patch to the 2.4 kernel with XFS that
implemented per-process case-insensitivity. It's been a long time since
I played with that patch, but I certainly don't recall any significant
slowdowns. The patch was messy, but it wasn't grossly
inefficient. (that patch was just a proof of concept, and just used
strcasecmp() instead of doing a proper UTF-8 case-insensitive compare,
so there will be some amount of additional cost to adding that).
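To illustrate the gap between a strcasecmp()-style compare and a proper UTF-8 one, here is a small sketch of an ASCII-only case-insensitive comparison. The function names are hypothetical and this is not the code from the patch. It folds only 'A'..'Z', so the bytes of multibyte UTF-8 sequences pass through unchanged and "ä" (0xC3 0xA4) never matches "Ä" (0xC3 0x84).

```c
#include <assert.h>
#include <stddef.h>

/* Fold ASCII 'A'..'Z' to lowercase; every other byte is returned as-is. */
static unsigned char ascii_fold(unsigned char c)
{
    return (c >= 'A' && c <= 'Z') ? (unsigned char)(c + ('a' - 'A')) : c;
}

/*
 * Hypothetical helper: returns 1 if `a` and `b` are equal when only ASCII
 * case is ignored, 0 otherwise. UTF-8 sequences are compared byte for byte.
 */
static int ascii_ci_equal(const char *a, const char *b)
{
    assert(a && b);
    while (*a && *b) {
        if (ascii_fold((unsigned char)*a) != ascii_fold((unsigned char)*b))
            return 0;
        a++;
        b++;
    }
    return *a == *b;    /* equal only if both strings ended together */
}
```

A full compare would have to decode UTF-8 and apply Unicode case mapping tables, which is where the additional cost would come from.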

From memory, the patch added new classes of dentries to the current
"+ve" and "-ve" dentries. It added concepts like a "-ve
case-insensitive" dentry and a "-ve case-sensitive" dentry. It
certainly adds more code in trying to deal with these variants, but I
see no reason why it should be significantly computationally less
efficient.

> I'm willing to bet that WNT opens files a hell of a lot slower
> than Linux does, and one big portion of that is exactly the fact that
> Linux can do a really good job with the dentry cache.

Anyone have any lmbench filesystem numbers for w2k3? The only windows
boxes I use are in vmware sessions, so running performance tests
myself is pretty pointless.

> And that _depends_ on a well-defined and unique filename setup (by
> changing the hashing function and compare function, a filesystem can do a
> limited kind of case-insensitivity right now in Linux, but then it will
> have to be not only fairly slow, but also case-insensitive for _everybody_
> which is unacceptable in a mixed POSIX/samba environment).

right, and that's why bones made it per-process in his patch. It was
set using a process personality bit, which really wasn't ideal (that
was one of my contributions to the patch) but it did work.

> In other words, just forget the whole notion. The only set of people who have
> any reason at _all_ to want it is the samba team, and we can solve the
> samba-specific problems other ways.

Nope, it's not just Samba, though perhaps Samba is the app that cares
the most about the actual performance. The other obvious people who
care are wine and anyone porting an application from windows. Also,
the problem isn't just one of performance, it's also hard to make it
raceless from userspace.

I also think that if the choice were given then some linux distros
(the likes of Lindows comes to mind) would choose to run all processes
case-insensitive. These sorts of distros are aiming at the sorts of
users that would want everything to be case-insensitive.

> Just take that as a simple fact - case insensitivity in the kernel is such
> a horribly bad idea, that you really shouldn't go there.

I'm yet to be convinced :)

> - MOST name lookups are likely results of some kind of "readdir()"
> lookup, and tend to have the case right in the first place. So that
> should go fast. Maybe Tridge has some statistics on this one?

ok, the first thing you need to understand about case-insensitivity on
a case-sensitive system is that the hardest thing to do is prove that
a file doesn't exist. File operations on nonexistent files are *very*
common. If you can come up with a solution that allows me to prove
that a file doesn't exist in any case combination then we will be most
of the way there.

That immediately throws out most of the "why don't you just use a
cache" arguments that everyone seems to come up with. We *do* use a
cache that primes the "most likely" filename code, it's just that a
cache is almost useless when you are trying to prove that a file
definitely doesn't exist.

> - samba probably has certain pretty well-defined special patterns for
> what it wants to do with a filename, so you probably don't need a
> generic "everything that takes a filename should be case-insensitive",
> and it would be acceptable to have a few _very_ specific system calls.

yes, if we had a single function that took a pathname and gave us
either -1/ENOENT or the pathname of a file that matches
case-insensitively then that would be great. Then again, if we had
such a function then it would be really easy to use that function in
the VFS to make Linux case-insensitive on a per-process basis.

So let's imagine we have such a function like this:

    int ci_normalize(char *path);

Let's assume it takes a pathname and returns either -1/ENOENT or
modifies the pathname in place (totally ignoring the fact that the
length of the pathname could change, and that the "char *" is really a
"const char *" - pedants go home).

now let's build a ci_unlink() on top of that:

    int ci_unlink(char *path)
    {
        if (task_is_case_sensitive(current)) {
            return unlink(path);
        }
        if (ci_normalize(path) == -1) {
            return -1;
        }
        return unlink(path);
    }

The problem is the negative dentries. If you do the above then
case-sensitive processes will be fast, but case-insensitive processes
will effectively be running without the negative dcache, so unlink()
on paths that don't exist will be slow each and every time. That's why
doing this with any sort of decent efficiency needs dcache changes.

btw, I already know that Al is completely and utterly opposed to
putting any case-insensitivity in the dcache (I think the phrase "over
my dead body" was mentioned), so I know that I'm fighting an uphill
battle here, but I like trying every now and again to see if I can
make any progress.

> With those assumptions out of the way, we could think of an interface that
> exports some partial functionality of the "lookup_path()" code in the kernel
> as a special system call. In particular, something that takes an input
> pathname, and is able to stop at any point of the name when a lookup
> fails.
> So some variation of the interface
>
> int magic_open(

....

how would this interact with the negative dcache entries? That is the
key.

> I suspect we can do case-insensitive names faster than WNT even with a
> fairly complex user-mode interface. Just because _not_ having them in the
> kernel allows us to have much faster default behaviour.

on this I completely disagree. Any solution that doesn't cope with
case insensitive properties of negative dentries is just going to
start filling the dcache with lots of useless entries (case
combinations) or effectively not end up using the dcache at
all. Either way it's a big loss compared to making the dcache know
about case insensitivity properly.

Cheers, Tridge

PS: ahh, what timing, someone just posted a request to the rsync list
asking for case-insensitivity in rsync.

H. Peter Anvin

Feb 17, 2004, 7:45:52 AM

to linux-...@vger.kernel.org

Followup to: <16433.38038....@samba.org>
By author: tri...@samba.org
In newsgroup: linux.dev.kernel

>
> Given how much pain the "kernel is agnostic to charset encoding"
> attitude has cost me in terms of programming pain, I thought I should
> de-cloak from lurk mode and put my 2c into the UTF-8 issue.
>
> Personally I think that eventually the Linux kernel will have to
> embrace the interpretation of the byte streams that applications have
> given it, despite the fact that this will be very painful and
> potentially quite complex. The reason is that I think that eventually
> the Linux kernel will need to efficiently support a userspace policy
> of case-insensitivity and the only way to do case-insensitive filename
> operations is to interpret those byte streams as a particular
> encoding.
>

Realistically, the only sane way to do this is to put our foot down
and say: UTF-8 is *the* encoding. A good step in that direction would
be to set utf-8 to be the default NLS in the kernel, but as long as
people keep the whole sick idea that we can continue to use
locale-dependent encoding we're in for a world of hurt.

That's really the long and short of it. Until people are willing to
say "we support UTF-8, anything else and it's anyone's guess what
happens" then nothing is going to happen.

-hpa
--
PGP public key available - finger h...@zytor.com
Key fingerprint: 2047/2A960705 BA 03 D3 2C 14 A8 A8 BD 1E DF FE 69 EE 35 BD 74
"The earth is but one country, and mankind its citizens." -- Bahá'u'lláh
Just Say No to Morden * The Shadows were defeated -- Babylon 5 is renewed!!

H. Peter Anvin

Feb 17, 2004, 8:07:01 AM

to linux-...@vger.kernel.org

Followup to: <c0sgnc$ngo$1...@terminus.zytor.com>
By author: h...@zytor.com (H. Peter Anvin)
In newsgroup: linux.dev.kernel

>
> Realistically, the only sane way to do this is to put our foot down
> and say: UTF-8 is *the* encoding. A good step in that direction would
> be to set utf-8 to be the default NLS in the kernel, but as long as
> people keep the whole sick idea that we can continue to use
> locale-dependent encoding we're in for a world of hurt.
>
> That's really the long and short of it. Until people are willing to
> say "we support UTF-8, anything else and it's anyone's guess what
> happens" then nothing is going to happen.
>

Oh yes, on top of that, if you want case insensitivity, then you also
need to start thinking about a whole lot of other things, including
what normalization form(s) you care about. Keeping normalization (as
well as case-conversion) data for the entire Unicode space in the
kernel is a boatload of memory.

Then, you have to deal with your filesystem going sour on you when two
files suddenly alias, because there is a new revision of the mapping
tables.

Case seemed simple when we were dealing with the "let's teach them all
English" world, but even when you're dealing with languages like
German (ß) or Dutch (Ĳ) things get fuzzy... what's worse, in
Turkish the uppercase equivalent of "i" (U+0069) isn't "I" (U+0049),
it's "İ" (U+0130)! There is no table which can tell you that, since
it's context-dependent. Thus, you may now need to consider larger
equivalence classes, but is the other user expecting the same thing?
You can't just treat the same base letter as equivalent everywhere,
or a Swedish user would beat the sh*t out of you for confusing the
words "vas" and "väs". On the other hand, the Swedish user would be
perfectly happy having "ä" equivalent with "æ" and "ü" equivalent
with "y"!

Therein lies madness.

Neil Brown

Feb 17, 2004, 8:35:08 AM

to tri...@samba.org, Linus Torvalds, Kernel Mailing List, Al Viro

On Tuesday February 17, tri...@samba.org wrote:
>
> I also think that if the choice were given then some linux distros
> (the likes of Lindows comes to mind) would choose to run all processes
> case-insensitive. These sorts of distros are aiming at the sorts of
> users that would want everything to be case-insensitive.

This is the bit I don't understand.

Surely the value of case-insensitivity is that you can type in a
filename from memory and not worry about what case you used when you
created the file.

Yet with Lindows / MS-Windows style interfaces, you virtually never
type the name of a pre-existing file. So case-insensitivity doesn't
seem to be a win to the user.

I thought the value of case-insensitive filenames was for
legacy applications which have been written to the WIN32 API and took
lots of liberties with "pretty-casing" filenames between readdir and
open.

NeilBrown

Dave Kleikamp

Feb 17, 2004, 2:26:40 PM

to tri...@samba.org, linux-kernel

On Mon, 2004-02-16 at 22:12, tri...@samba.org wrote:
> Given how much pain the "kernel is agnostic to charset encoding"
> attitude has cost me in terms of programming pain, I thought I should
> de-cloak from lurk mode and put my 2c into the UTF-8 issue.
>
> Personally I think that eventually the Linux kernel will have to
> embrace the interpretation of the byte streams that applications have
> given it, despite the fact that this will be very painful and
> potentially quite complex. The reason is that I think that eventually
> the Linux kernel will need to efficiently support a userspace policy
> of case-insensitivity and the only way to do case-insensitive filename
> operations is to interpret those byte streams as a particular
> encoding.
>
> Personally I much prefer the systems I use to be case-sensitive, but
> there are important applications that require case-insensitivity for
> interoperability. Right now it is not possible to write a case
> insensitive application on Linux in an efficient manner. With the
> current "encoding agnostic" APIs a simple open() or stat() call
> becomes a horrendously expensive operation and one that is fraught
> with race conditions. Providing the same functionality in the kernel
> is dirt cheap by comparison (not cheap in terms of code complexity,
> but cheap in terms of runtime efficiency).

This would be easy to do in JFS due to the baggage we carried over to be
compatible with OS/2-formatted volumes. In OS/2, the directories were
ordered in a case-insensitive fashion. This would have to be a mkfs
option, and would not be a per-process option. The directories must be
created either case-sensitive or not.

Shaggy
--
David Kleikamp
IBM Linux Technology Center

Linus Torvalds

Feb 17, 2004, 3:14:50 PM

to tri...@samba.org, Kernel Mailing List, Al Viro

On Tue, 17 Feb 2004 tri...@samba.org wrote:
>
> From memory, the patch added new classes of dentries to the current
> "+ve" and "-ve" dentries. It added concepts like a "-ve
> case-insensitive" dentry and a "-ve case-sensitive" dentry. It
> certainly adds more code in trying to deal with these variants, but I
> see no reason why it should be significantly computationally less
> efficient.

Yes, we could add context sensitivity to the dcache with a context
bitmask.

However, it's _not_ correct.

It assumes that there is only one way to do lower/upper case, which just
isn't true. What about different locales that have different case rules?
Your "one bit per dentry" becomes "one bit per locale per dentry". That's
just horribly hard to do.

I don't know how Windows does it, so maybe this thing is hardcoded, and
you don't even want "true" case insensitivity. How "correct" is Windows?

(And don't even bother telling me about the translation table in NTFS
volumes - I'm not interested. This would have to work on a sane filesystem
to be useful, even for samba.)

Linus

Linus Torvalds

Feb 17, 2004, 5:00:57 PM

to tri...@samba.org, Kernel Mailing List, Al Viro

On Tue, 17 Feb 2004, Linus Torvalds wrote:
>
> It assumes that there is only one way to do lower/upper case, which just
> isn't true. What about different locales that have different case rules?
> Your "one bit per dentry" becomes "one bit per locale per dentry". That's
> just horribly hard to do.

It's also hard to know what to do when there are two filenames that
literally _are_ the same when not comparing cases. Which can obviously
happen under Linux - you'd have a case-sensitive app that creates both
"makefile" and "Makefile", and now you have a case-insensitive app that
looks it up (or worse, removes it), and what the *heck* is the dcache now
supposed to really do?

This is why I'd hate for the generic Linux dcache to know about case
sensitivity, and I'd be a lot happier having a separate path (which isn't
as speed-critical) that can be used to help implement helper functions for
doing case-insensitive things.

That way the bugs and strange behaviour would all be limited to the
case-insensitive special code, and not pollute the "sane" side.

For example, I fundamentally can't easily do an atomic exclusive
case-insensitive "create" or "rename", but we _could_ expose things like
directory generation counts to the special interfaces, and thus allow at
least "local-atomic" operations (but they would _not_ be atomic over a
network, to give you an idea of the kinds of _fundamental_ limitations
there are here).

That's why I'd advocate having a few very special system calls for doing
the operations that samba (and I'll throw wine into the pot too) wants to
do. So you could literally do an atomic create with something like

- regular atomic create of random case-_sensitive_ name using something
  tempnam()-like (use a prefix that is invalid on windows or something:
  make the first character be 0xff or whatever).
- "read directory local sequence count"
- readdir to make sure that the new name is still unique even in the
  case-insensitive sense
- "atomic move conditionally on the local sequence count still being X"
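The four steps above can be sketched with today's interfaces, minus the two that do not exist. This is an illustration only: `ci_name_taken` and `ci_create_racy` are hypothetical names, there is no call to read a directory's sequence count, and plain rename() is not conditional on anything. The window between the scan and the rename is exactly the race that the sequence count is meant to close.

```c
#include <assert.h>
#include <dirent.h>
#include <errno.h>
#include <fcntl.h>
#include <stdio.h>
#include <strings.h>
#include <unistd.h>

/* Returns 1 if `dir` holds an entry matching `name` ignoring ASCII case,
 * 0 if not, -1 if the directory cannot be read. */
static int ci_name_taken(const char *dir, const char *name)
{
    DIR *d = opendir(dir);
    if (d == NULL)
        return -1;
    int taken = 0;
    struct dirent *de;
    while ((de = readdir(d)) != NULL) {
        if (strcasecmp(de->d_name, name) == 0) {
            taken = 1;
            break;
        }
    }
    closedir(d);
    return taken;
}

/*
 * Hypothetical, deliberately racy case-insensitive exclusive create.
 * Returns 0 on success, -1 with errno set on failure.
 */
static int ci_create_racy(const char *dir, const char *name)
{
    char tmp[4096], final[4096];
    assert(dir && name);
    snprintf(tmp, sizeof tmp, "%s/.ci_tmp_%ld", dir, (long)getpid());
    snprintf(final, sizeof final, "%s/%s", dir, name);

    /* Step 1: atomic create of a throwaway case-sensitive name. */
    int fd = open(tmp, O_CREAT | O_EXCL | O_WRONLY, 0600);
    if (fd < 0)
        return -1;
    close(fd);

    /* Step 2 (read the directory sequence count) has no real interface,
     * so it is skipped here. */

    /* Step 3: scan to check that the target is unique ignoring case. */
    int taken = ci_name_taken(dir, name);
    if (taken != 0) {
        int saved = (taken > 0) ? EEXIST : errno;
        unlink(tmp);
        errno = saved;
        return -1;
    }

    /* Step 4: unconditional move; another process may have created a
     * clashing name since the scan, and nothing here detects that. */
    if (rename(tmp, final) < 0) {
        int saved = errno;
        unlink(tmp);
        errno = saved;
        return -1;
    }
    return 0;
}
```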

The thing is, we can do hacks like the above, and yes, we could do them all
inside the kernel, and give user space a reasonably nice interface with
"pseudo-atomic" behaviour (ie it will _not_ be atomic if multiple clients
do this over NFS, but I doubt you care).

But it wouldn't be "open()" and "rename()". It would be a totally separate
kernel path. It would be in the "case-insensitivity-module". It would be
_outside_ the regular VFS layer, although it would have some visibility
into it (ie it could follow dentries on its own, and know about the RCU
etc locking rules).

We can even allow that case-insensitive module to set some flags in the
dentries (so that you can create negative dentries that have a flag set
"this is negative for all cases").

Trust me, this is much less intrusive, and a lot easier to debug too. It
won't be as fast as the regular path operations, but depending on what the
common cases are (hopefully "look up name that is exact"), it would likely
not be horrible either. And it could probably be debugged as a real
module, without impacting any existing code, which would make it a lot
easier to create.

See where I'm going? Would this be acceptable to you? Are there any samba
people who are knowledgeable about the VFS-layer and have the time/energy
to try something like this?

Al? What do you think?

vi...@parcelfarce.linux.theplanet.co.uk

Feb 17, 2004, 7:45:36 PM

to Linus Torvalds, tri...@samba.org, Kernel Mailing List

On Tue, Feb 17, 2004 at 08:57:40AM -0800, Linus Torvalds wrote:
> Trust me, this is much less intrusive, and a lot easier to debug too. It
> won't be as fast as the regular path operations, but depending on what the
> common cases are (hopefully "look up name that is exact"), it would likely
> not be horrible either. And it could probably be debugged as a real
> module, without impacting any existing code, which would make it a lot
> easier to create.
>
> See where I'm going? Would this be acceptable to you? Are there any samba
> people who are knowledgeable about the VFS-layer and have the time/energy
> to try something like this?
>
> Al? What do you think?

What will protect your generation counts during the operation itself?
->i_sem?

If anything, I'd suggest doing it as
    cretinous_rename(dir_fd, name1, name2)
with the following semantics:

* if directory had been changed since open() that gave us dir_fd -
  -EFOAD
* otherwise, rename name1 to name2 (no cross-directory renames here).

No need to expose generation counts to userland - we can just compare the
count at open() time with that at operation time. The rest can be done
in userland (including creation of files).
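The semantics can be modelled with a toy in-memory directory, purely to show how comparing the count at open time with the count at operation time plays out. Everything here is hypothetical: the `toy_*` names, the fixed-size table, and `TOY_ECHANGED`, which stands in for the tongue-in-cheek error code above. No kernel interface is implied.

```c
#include <assert.h>
#include <stdio.h>
#include <string.h>
#include <strings.h>

#define TOY_MAX      8
#define TOY_NAMELEN  32
#define TOY_ECHANGED (-2)   /* stand-in for "directory changed since open" */

struct toy_dir {
    unsigned long gen;                  /* bumped on every modification */
    int count;
    char names[TOY_MAX][TOY_NAMELEN];
};

struct toy_handle {
    struct toy_dir *dir;
    unsigned long gen_at_open;          /* snapshot taken at "open" time */
};

static struct toy_handle toy_open(struct toy_dir *d)
{
    struct toy_handle h = { d, d->gen };
    return h;
}

/* Re-snapshot the count, as a refresh of the handle would. */
static void toy_refresh(struct toy_handle *h)
{
    h->gen_at_open = h->dir->gen;
}

static int toy_create(struct toy_dir *d, const char *name)
{
    if (d->count >= TOY_MAX)
        return -1;
    snprintf(d->names[d->count++], TOY_NAMELEN, "%s", name);
    d->gen++;
    return 0;
}

/* Userland side: does any entry match `name` ignoring ASCII case? */
static int toy_ci_exists(const struct toy_dir *d, const char *name)
{
    for (int i = 0; i < d->count; i++)
        if (strcasecmp(d->names[i], name) == 0)
            return 1;
    return 0;
}

/* Rename only if the directory is unchanged since the handle's snapshot. */
static int toy_rename_if_unchanged(struct toy_handle *h,
                                   const char *from, const char *to)
{
    struct toy_dir *d = h->dir;
    assert(from && to);
    if (h->gen_at_open != d->gen)
        return TOY_ECHANGED;
    for (int i = 0; i < d->count; i++) {
        if (strcmp(d->names[i], from) == 0) {
            snprintf(d->names[i], TOY_NAMELEN, "%s", to);
            d->gen++;
            return 0;
        }
    }
    return -1;
}
```

A caller that gets TOY_ECHANGED refreshes the handle, rescans with toy_ci_exists() to check for a clash, and retries, so all case comparison stays in userland.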

We _definitely_ don't want to put "UTF-8 case-insensitive comparison" anywhere
near the kernel - it's insane. If samba wants it, they get to pay the price,
both in performance and keeping butt-ugly code (after all, the goal of project
is to imitate butt-ugly system for butt-ugly clients). The same goes for Wine.

And we really don't want to encourage those who port Windows userland in
not fixing the idiotic semantics. As for Lindows... let's just say that
I can't find any way to describe what I really think of those clowns, their
intellect and their morals that wouldn't lead to a lawsuit from them.

Linus Torvalds

Feb 17, 2004, 8:15:52 PM

to vi...@parcelfarce.linux.theplanet.co.uk, tri...@samba.org, Kernel Mailing List

On Tue, 17 Feb 2004 vi...@parcelfarce.linux.theplanet.co.uk wrote:
>
> What will protect your generation counts during the operation itself?
> ->i_sem?

Yes. You have to take it anyway, so why not?

> If anything, I'd suggest doing it as
> cretinous_rename(dir_fd, name1, name2)
> with the following semantics:
>
> * if directory had been changed since open() that gave us dir_fd -
> -EFOAD
> * otherwise, rename name1 to name2 (no cross-directory renames here).

Sure, that works.

> No need to expose generation counts to userland - we can just compare the
> count at open() time with that at operation time. The rest can be done
> in userland (including creation of files).

Note that I'm not sure we would expose generation counts at all to user
space: we might keep all of this inside the "crapola windows behaviour"
module, and user space could actually see some easier highlevel interface.
Something like yours, but I suspect we'd want to see what the whole
user-level loop would look like to know what the architecture should be
like.

I do believe we'd need to have some way to "refresh" the fd in your
example, without restarting the whole lookup. So that when the user gets
EFOAD, it can do

    refresh(fd);
    readdir(fd);
    /* Check that nothing clashes */
    goto try_again;

or similar. So the generation count _semantics_ would be exposed, even if
the numbers themselves would be hidden inside the kernel.

> We _definitely_ don't want to put "UTF-8 case-insensitive comparison" anywhere
> near the kernel - it's insane. If samba wants it, they get to pay the price,
> both in performance and keeping butt-ugly code (after all, the goal of project
> is to imitate butt-ugly system for butt-ugly clients). The same goes for Wine.

I agree. We'd need to let user space do the equality comparisons, I just
don't see how to sanely do it in kernel land.

> And we really don't want to encourage those who port Windows userland in
> not fixing the idiotic semantics. As for Lindows... let's just say that
> I can't find any way to describe what I really think of those clowns, their
> intellect and their morals that wouldn't lead to a lawsuit from them.

Heh.

I suspect most people don't care that much, but I also suspect that
projects like samba have to have an "anal mode" where they really act like
Windows, even when it's "wrong". People can then choose to say "screw that
idiocy", but by just _having_ a very compatible mode you deflect a lot of
criticism. Regardless of whether people want the anal mode or not in real
life.

Backwards compatibility is King. It's _hugely_ important. It's one of the
most important things to me in the kernel, and by the same logic I do see
that it is important to others as well - even when the backwards
compatibility ends up being inherited from a broken Windows setup. So
while I hate case-insensitive names, I do understand that people want to
have some way to emulate the braindamage for some _really_ "ass-backwards"
compatibility reasons.

So I think it's worth some pain, as long as we keep that compatibility
from starting to encrust the _good_ stuff.

Linus

vi...@parcelfarce.linux.theplanet.co.uk

Feb 17, 2004, 8:20:10 PM

to Linus Torvalds, tri...@samba.org, Kernel Mailing List

On Tue, Feb 17, 2004 at 12:10:23PM -0800, Linus Torvalds wrote:
> I do believe we'd need to have some way to "refresh" the fd in your
> example, without restarting the whole lookup. So that when the user gets
> EFOAD, it can do
>
> refresh(fd);

lseek(fd, 0, 0);

> > And we really don't want to encourage those who port Windows userland in
> > not fixing the idiotic semantics. As for Lindows... let's just say that
> > I can't find any way to describe what I really think of those clowns, their
> > intellect and their morals that wouldn't lead to a lawsuit from them.
>
> Heh.
>
> I suspect most people don't care that much, but I also suspect that
> projects like samba have to have a "anal mode" where they really act like
> Windows, even when it's "wrong". People can then choose to say "screw that
> idiocy", but by just _having_ a very compatible mode you deflect a lot of
> criticism. Regardless of whether people want the anal mode or not in real
> life.

Umm... Samba deals with Windows clients. Windows software allegedly being
ported to Linux is a different story and in that case there's no excuse for
demanding case-insensitive operations.

Linus Torvalds

Feb 17, 2004, 8:30:23 PM

to vi...@parcelfarce.linux.theplanet.co.uk, tri...@samba.org, Kernel Mailing List

On Tue, 17 Feb 2004 vi...@parcelfarce.linux.theplanet.co.uk wrote:
>

> > refresh(fd);
>
> lseek(fd, 0, 0);

Yes. We can make that implicitly refresh, I'm certainly ok with that.

> > I suspect most people don't care that much, but I also suspect that
> > projects like samba have to have a "anal mode" where they really act like
> > Windows, even when it's "wrong". People can then choose to say "screw that
> > idiocy", but by just _having_ a very compatible mode you deflect a lot of
> > criticism. Regardless of whether people want the anal mode or not in real
> > life.
>
> Umm... Samba deals with Windows clients. Windows software allegedly being
> ported to Linux is a different story and in that case there's no excuse for
> demanding case-insensitive operations.

"wine". It's not porting, it's emulation.

But yes, I agree, I don't see any other cases where we want it.

We basically want to support broken clients - whether they be on the other
side of the network, or the other side of an emulation interface. That is
the only valid reason to do this crap.

It's a fairly sizeable reason, though. On another front ("World
Domination, Fast!") we'll try to fix the problem another way, but there's
nothing wrong with fighting on multiple fronts if you have the man-power.

Linus

Robin Rosenberg

Feb 17, 2004, 9:13:20 PM

to Linus Torvalds, tri...@samba.org, Kernel Mailing List, Al Viro

On Tuesday 17 February 2004 17.57, Linus Torvalds wrote:
[case-insanesititvity proposal ///]

> See where I'm going? Would this be acceptable to you? Are there any samba
> people who are knowledgeable about the VFS-layer and have the time/energy
> to try something like this?

So the same guy that strongly insists that a file is a string of bytes and nothing else,
now thinks it is sane to even think of "case" of a byte. That's impossible unless you
actually DO believe it's a bunch of characters. What is it?

-- robin

Linus Torvalds

Feb 17, 2004, 9:25:36 PM

to Robin Rosenberg, tri...@samba.org, Kernel Mailing List, Al Viro

On Tue, 17 Feb 2004, Robin Rosenberg wrote:
>
> On Tuesday 17 February 2004 17.57, Linus Torvalds wrote:
> [case-insanesititvity proposal ///]
> > See where I'm going? Would this be acceptable to you? Are there any samba
> > people who are knowledgeable about the VFS-layer and have the time/energy
> > to try something like this?
>
> So the same guy that strongly insists that a file is a string of bytes and nothing else,
> now thinks it is sane to even think of "case" of a byte. That's impossible unless you
> actually DO believe it's a bunch of characters. What is it?

Which part of my argument don't you understand?

The kernel proper thinks it's just a stream of bytes, and all the existing
interfaces do likewise.

But we'd have a kernel helper module to let samba do what it already does
now, except help it do so more efficiently?

The fact that _I_ think pathnames are just a nice stream of bytes sadly
doesn't make Windows clients do the same. Some day when I'm King Of The
World, and I can outlaw windows clients, we'll finally get rid of the
braindamage, but until then I'm pragmatic enough to say "let's help out
the poor samba people who have to deal with the crap day in and day out".

What's your problem with that?

Linus

Robin Rosenberg

Feb 17, 2004, 10:38:12 PM

to Linus Torvalds, tri...@samba.org, Kernel Mailing List, Al Viro

On Tuesday 17 February 2004 22.17, Linus Torvalds wrote:
> The fact that _I_ think pathnames are just a nice stream of bytes sadly
> doesn't make Windows clients do the same. Some day when I'm King Of The
> World, and I can outlaw windows clients, we'll finally get rid of the

LPA = Linus' Patriot Act.

> braindamage, but until then I'm pragmatic enough to say "let's help out
> the poor samba people who have to deal with the crap day in and day out".
>
> What's your problem with that?

Nothing wrong with helping people.

Having to put up with the existence of Windows day in and out is the reason I'm still on
an eight-bit encoding. Sorry for not explaining the REAL problem, but only a partial
problem. I need to support all kinds of clients on Windows with protocols that convey no
character set info. With samba that's no problem. Having to put up with a Unix world running
ISO-8859-1 (or ISO-8859-15) is another. Of course that means Linux machines also add
to the disturbance by not storing things as Unicode. The real obstacle is file names;
everything else, including content of files, I can handle (I think). Maybe I'll find a solution
for the filenames too, but usually some hot discussions are needed for the brain to kick
into the right gear.

I want to switch to UTF-8 to work better with the outside world, but as things are people will
start to take notice of what OS is running in the shadows when they see the filename problems, and
start demanding Windows, and ... You see; I'm not mean; I don't want to do that to them (or myself).

-- robin

tri...@samba.org

Feb 17, 2004, 11:23:50 PM

to Linus Torvalds, Kernel Mailing List, Al Viro

Linus,

> Yes, we could add context sensitivity to the dcache with a context
> bitmask.
>
> However, it's _not_ correct.
>
> It assumes that there is only one way to do lower/upper case, which just
> isn't true. What about different locales that have different case rules?
> Your "one bit per dentry" becomes "one bit per locale per dentry". That's
> just horribly hard to do.

I think you're making it sound much harder than it really is.

We just add a VFS hook in the filesystems. The filesystem chooses the
encoding specific comparison function. If the filesystem doesn't
provide one then don't do case insensitivity. If the filesystem does
provide one (for example NTFS, JFS) then use it. Then all I need to do
is convince one of the filesystem maintainers to add a mount time
option to specify the case table (for example by specifying the name
of a file in the filesystem that holds it).

So, all the really ugly stuff is then in the per-filesystem code, and
all the VFS and dcache has to do is know about a single context bit
per dentry.

> I don't know how Windows does it, so maybe this thing is hardcoded, and
> you don't even want "true" case insensitivity.

NTFS has a 128k table on disk, created at mkfs time and indexed by the
UCS2 character. The interesting thing about this table is that it
doesn't seem to vary between different locales as one might expect. I
have checked 3 locales so far (Swedish, Japanese and English) and all
have the same 128k table. I should check a few more locales to see if
it really is the same everywhere. Contact me off-list if you have a
NTFS filesystem created in a different locale and would be willing to
run a test program against it to see if the table is different from
the one we have in Samba.
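
For illustration, a table of that shape could be sketched in C roughly as
below. This is only a sketch under stated assumptions: the names `upcase`,
`upcase_init` and `ucs2_names_equal` are made up for the example, and the
ASCII and Latin-1 rules filled in here are placeholders, not the contents
of the real NTFS table. What it shows is the size arithmetic (65536 entries
of 2 bytes each is 128k) and that a comparison is one array lookup per
UCS-2 code unit.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define UCS2_CODE_UNITS 65536u

/* One 16-bit entry per UCS-2 code unit: 65536 * 2 bytes = 128 KiB. */
static uint16_t upcase[UCS2_CODE_UNITS];

/* Fill the table with identity mappings, then a few illustrative rules.
 * These rules are examples only, not the real on-disk table. */
static void upcase_init(void)
{
    for (uint32_t c = 0; c < UCS2_CODE_UNITS; c++)
        upcase[c] = (uint16_t)c;
    for (uint32_t c = 'a'; c <= 'z'; c++)          /* ASCII a-z -> A-Z */
        upcase[c] = (uint16_t)(c - 0x20);
    for (uint32_t c = 0x00E0; c <= 0x00FE; c++)    /* Latin-1 lower case */
        if (c != 0x00F7)                           /* skip the division sign */
            upcase[c] = (uint16_t)(c - 0x20);
}

/* Compare two UCS-2 names (lengths in code units), ignoring case. */
static int ucs2_names_equal(const uint16_t *a, size_t alen,
                            const uint16_t *b, size_t blen)
{
    assert(a != NULL && b != NULL);
    if (alen != blen)
        return 0;
    for (size_t i = 0; i < alen; i++)
        if (upcase[a[i]] != upcase[b[i]])
            return 0;
    return 1;
}
```

Because the lookup is a plain array index, the comparison needs no locale
state at all, which fits the observation that the table looks the same
across the locales checked so far.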

There is stuff in the charset handling of every locale that does vary
in windows, but it isn't the case table, it's the "valid characters"
map used to determine what characters are allowed when converting
strings into legacy multi-byte encodings. Even I don't think that the
kernel will ever have to deal with that crap unless someone is foolish
enough to port Samba into the kernel (several people have actually
done that despite the insanity of the idea, but they all did an
absolutely terrible job of it and certainly didn't take care to get
all the charset handling right).

> How "correct" is Windows?

from my rather limited point of view I always have to assume that
windows is "correct", unless I can show that its behaviour leads to
data loss, a security hole or something equally extreme.

Cheers, Tridge

tri...@samba.org

Feb 17, 2004, 11:24:55 PM

to Neil Brown, Linus Torvalds, Kernel Mailing List, Al Viro

Neil,

> I thought the value of a case-insensitive filenames was for
> legacy applications which have been written to the WIN32 API and took
> lots of liberties with "pretty-casing" filenames between readdir and
> open.

No, that's a common misconception. It does happen (the "pretty-casing")
but it's relatively rare these days. The real problem is *proving* that
a file doesn't exist. If a file does exist then there are all sorts of
heuristic and cache mechanisms that can be used to get the real
filename quickly on average, but if you have to prove absolutely that
a file does not exist then all of that stuff is pretty much useless.

Samba (and any other system that wants case-insensitive semantics on
Linux) can't make do with "oh, it probably doesn't exist". That way
leads to data loss. You have to know with 100% certainty that the file
doesn't exist in any case combination.

Unfortunately, that is also the hardest thing to do.
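
To make the cost concrete, here is a rough sketch of what user space has
to do with the existing interfaces. The function name `ci_lookup` is made
up for this example, and `strcasecmp()` only folds case by the current C
library locale rules, so it stands in for whatever comparison the
application really needs. A hit can stop early, but a miss is only known
after every entry has been read, and nothing stops another process from
creating a matching name right after the scan finishes.

```c
#define _POSIX_C_SOURCE 200809L
#include <assert.h>
#include <dirent.h>
#include <stdio.h>
#include <strings.h>

/* Returns 1 and copies the on-disk name into out if some entry of dirpath
 * matches want ignoring case, 0 if the whole directory was read without a
 * match, and -1 if the directory could not be opened. */
static int ci_lookup(const char *dirpath, const char *want,
                     char *out, size_t outlen)
{
    assert(dirpath && want && out && outlen > 0);

    DIR *d = opendir(dirpath);
    if (d == NULL)
        return -1;

    int found = 0;
    struct dirent *de;
    while ((de = readdir(d)) != NULL) {
        if (strcasecmp(de->d_name, want) == 0) {
            snprintf(out, outlen, "%s", de->d_name);
            found = 1;
            break;              /* a hit may stop early ... */
        }
    }
    /* ... but a miss means all n entries were read: O(n) per lookup. */
    closedir(d);
    return found;
}
```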

Cheers, Tridge

Linus Torvalds

Feb 17, 2004, 11:46:26 PM

to tri...@samba.org, Kernel Mailing List, Al Viro

On Wed, 18 Feb 2004 tri...@samba.org wrote:
>
> I think you're making it sound much harder than it really is.

I think I'm just making the mistake of assuming that anybody would care to
do it "right", while everybody really only cares to get it to be compatible
with Windows.

For example, if you only want to be compatible with Windows, you don't
have to worry about UCS-4, you only have the UCS-2 part, which means that
you can do a silly array-lookup based thing or something.

> We just add a VFS hook in the filesystems. The filesystem chooses the
> encoding specific comparison function. If the filesystem doesn't
> provide one then don't do case insensitivity. If the filesystem does
> provide one (for example NTFS, JFS) then use it. Then all I need to do
> is convince one of the filesystem maintainers to add a mount time
> option to specify the case table (for example by specifying the name
> of a file in the filesystem that holds it).

Ugh. What a horrible kludge, and it won't work without "preparing" the
filesystem at mount-time. I'd much rather leave the translation table in
user space, and just give it as an argument to the "look up case
insensitive" special thing.

That would mean that we can hold the directory semaphore over the whole
thing, which would simplify _my_ kludge, since there would be no need to
worry about user space having separate stages.

The hard part would be negative dentries. We'd have to invalidate all
"case-insensitive" negative dentries when creating any new file in a
directory, and that would be something the generic VFS layer would have to
know about, and that might be unacceptable to Al.

Linus

tri...@samba.org

Feb 18, 2004, 12:01:56 AM

to Linus Torvalds, Kernel Mailing List, Al Viro

Linus,

> It's also hard to know what to do when there are two filenames that
> literally _are_ the same when not comparing cases. Which can obviously
> happen under Linux - you'd have a case-sensitive app that creates both
> "makefile" and "Makefile", and now you have a case-insensitive app that
> looks it up (or worse, removes it), and what the *heck* is the dcache now
> supposed to really do?

This is really not as bad as it first seems. Just think what the
absolutely obvious thing to do is and do that. It's like all those
things in POSIX where it says "if you do XXX then the behaviour is
undefined" and the implementations end up doing whatever the heck they
find easiest to do. It's the same here.

In the example you give then you just give whatever file you come
across first or happen to have in the dcache. You can't do better than
that, as the problem is fundamentally insoluble in a sane fashion, so
just don't try. We've been doing exactly that in Samba for 12 years
(picking the first file we come across) and I can't recall a *single*
complaint about that behaviour. Users *expect* the server to just pick
one, and have no pre-conceived idea of which one it will pick.

Of course, some samba-tuned filesystem could have a mount option to
refuse to allow the creation of filenames that conflict in this way,
but don't even try to enforce this in the kernel core.

> This is why I'd hate for the generic Linux dcache to know about case
> sensitivity, and I'd be a lot happier having a separate path (which isn't
> as speed-critical) that can be used to help implement helper functions for
> doing case-insensitive things.

The problem is that if that separate path doesn't go via the dcache
then we won't get the invalidation of our negative dentries so we
won't be able to do any better than scanning the whole directory every
time to prove files don't exist. The dcache has to know about this as
it's the only place where all the information that is needed comes
together (I'm sure you'll correct me if I'm wrong about this).

> That way the bugs and strange behaviour would be all be limited to the
> case-insensitive special code, and not pollute the "sane" side.

except when something like a file create happens on the "sane" side of
things and we then have no way of knowing that our name space has just
changed. I suppose we could create a completely new dcache in parallel
with the current one and have some sort of notify between the "sane"
and "insane" worlds, but I suspect the glue code between them would be
worse than just adding that context bit to the main dcache.

> For example, I fundamentally can't easily do an atomic exclusive
> case-insensitive "create" or "rename", but we _could_ expose things like
> directory generation counts to the special interfaces, and thus allow at
> least "local-atomic" operations (but they would _not_ be atomic over a
> network, to give you an idea of the kinds of _fundamental_ limitations
> there are here).

yes, doing atomic network file operations sucks, but please don't let
that stop us doing it in a reasonable fashion for local filesystems.

Doing a nice atomic case-insensitive create or rename is really *no*
different from what we do now in Linux, it just means that we need to
have case-insensitive dentries that mean "this is a negative dentry
that covers all possible case combinations of the name it
contains". It is up to the filesystem to provide you with that -ve
dentry (just like the filesystem provides the case-sensitive -ve
dentries now) and the dcache just has to use it in the same way that
it uses the existing ones.
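
As a toy model of that idea (plain user-space C, not kernel code): the
struct, the `NEG_ALL_CASES` flag and the function names below are invented
for the example, and ASCII-style folding via `strcasecmp()` is assumed. A
negative entry carries one bit saying it covers every case variant of its
name, and creating any name drops the negative entries it contradicts.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>
#include <strings.h>

#define NEG_ALL_CASES 0x1u   /* "this name is absent in every case variant" */
#define MAX_NEG 16

struct neg_entry {
    char name[64];
    unsigned flags;
    int valid;
};

static struct neg_entry neg_cache[MAX_NEG];

/* Record that name was looked up and found absent. */
static void neg_add(const char *name, unsigned flags)
{
    assert(name != NULL);
    if (strlen(name) >= sizeof(neg_cache[0].name))
        return;                 /* too long: simply do not cache it */
    for (size_t i = 0; i < MAX_NEG; i++) {
        if (!neg_cache[i].valid) {
            strcpy(neg_cache[i].name, name);
            neg_cache[i].flags = flags;
            neg_cache[i].valid = 1;
            return;
        }
    }
    /* Cache full: not caching is always safe for a negative cache. */
}

/* Does the cache prove that name is absent for this kind of lookup? */
static int neg_hit(const char *name, int case_insensitive)
{
    for (size_t i = 0; i < MAX_NEG; i++) {
        const struct neg_entry *e = &neg_cache[i];
        if (!e->valid)
            continue;
        if (e->flags & NEG_ALL_CASES) {
            if (strcasecmp(e->name, name) == 0)
                return 1;       /* every case variant is known absent */
        } else if (!case_insensitive && strcmp(e->name, name) == 0) {
            return 1;           /* only this exact spelling is known absent */
        }
    }
    return 0;
}

/* A file called name was just created: drop every entry it contradicts. */
static void neg_invalidate_on_create(const char *name)
{
    for (size_t i = 0; i < MAX_NEG; i++) {
        struct neg_entry *e = &neg_cache[i];
        if (!e->valid)
            continue;
        if (strcmp(e->name, name) == 0 ||
            ((e->flags & NEG_ALL_CASES) && strcasecmp(e->name, name) == 0))
            e->valid = 0;
    }
}
```

The point of the sketch is the last function: a create from the
case-sensitive side has to reach the all-cases negatives, which is why the
invalidation needs to live where both kinds of lookup meet.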

If you really don't want to do this then fine, in which case I'll ask
again in a year or two's time and see if I can convince you then. I
know this would make the code messier, and making code messier for the
sake of interoperability with windows is perhaps reason enough not to
do it. But please don't tell me it *can't* be done or that it is just
too hard. That's just not true.

> - regular atomic create of random case-_sensitive_ name using something
> tempnam()-like (use a prefix that is invalid on windows or something:
> make the first character be 0xff or whatever).
> - "read directory local sequence count"
> - readdir to make sure that the new name is still unique even in the
> case-insensitive sense
> - "atomic move conditionally on the local sequence count still being X"

that could make things atomic, but it won't make it fast. Think about
the fact that modern filesystems are now using better than linear
lists for directories. So in most cases lookups in large directories
can be done in much better than O(n) time (for reasonable values of
n). The above solution means Samba will never be better than O(n), so
for large directories we will always suck performance wise. It doesn't
have to be that way.

> We can even allow that case-insensitive module to set some flags in the
> dentries (so that you can create negative dentries that have a flag set
> "this is negative for all cases").

ahh! yipee!

yes, if we have that dentry bit then we have a hope. Without that I
think it won't help much.

> See where I'm going? Would this be acceptable to you? Are there any samba
> people who are knowledgeable about the VFS-layer and have the time/energy
> to try something like this?

I'll discuss this with some of the people here in OzLabs and see if we
can come up with a plan. I suspect most of OzLabs will be avoiding me
for a day or two in an attempt to not be the one to do this :-)

Cheers, Tridge

Neil Brown

Feb 18, 2004, 12:15:21 AM

to tri...@samba.org, Linus Torvalds, Kernel Mailing List, Al Viro

On Wednesday February 18, tri...@samba.org wrote:
>
> Samba (and any other system that wants case-insensitive semantics on
> Linux) can't make do with "oh, it probably doesn't exist". That way
> leads to data loss. You have to know with 100% certainty that the file
> doesn't exist in any case combination.
>
> Unfortunately, that is also the hardest thing to do.

Hi Tridge,

Maybe if it is so hard, we should just define it to be easy.... just
change the universe a bit.....

I'm sure you've thought about this a lot more than I have or will, so
I must be missing something, but there seems to be a solution that is
efficient, predictable, and should be acceptable.

The first observation is that POSIX applications and WIN32 applications
cannot both get exactly the file system semantics they expect in the
same directory. The example:
    POSIX:
        create "Makefile"
        create "makefile"
    WIN32:
        unlink "MakeFile"
seems to show that.

So decide up front that a WIN32 application will see something
different, and decide what the best thing for it to see would be
(i.e. change the universe).

First cut:
An application that wants case-insensitive filenames only
sees those filenames that are in a case-insensitive-canonical-form.
So the interface maps all file names in requests to a canonical
form, and the readdir equivalent discards all non-canonical names.

Thus in the above example, the WIN32 app would unlink "makefile"
and never notice that "Makefile" exists.
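
A minimal sketch of that first cut, assuming ASCII lower case as the
canonical form (the proposal does not say which form to use) and using
made-up function names: requests are mapped to the canonical form, and the
readdir equivalent drops every name that is not already canonical.

```c
#include <assert.h>
#include <ctype.h>
#include <stddef.h>

/* Assumed canonical form for this sketch: ASCII lower case.
 * Output is silently truncated if out is too small. */
static void canonicalize(const char *in, char *out, size_t outlen)
{
    assert(in && out && outlen > 0);
    size_t i = 0;
    for (; in[i] != '\0' && i + 1 < outlen; i++)
        out[i] = (char)tolower((unsigned char)in[i]);
    out[i] = '\0';
}

/* A name is canonical if canonicalizing it would change nothing. */
static int is_canonical(const char *name)
{
    for (; *name != '\0'; name++)
        if (tolower((unsigned char)*name) != (unsigned char)*name)
            return 0;
    return 1;
}

/* readdir equivalent: keep only canonical names, return how many. */
static size_t filter_canonical(const char *const *names, size_t n,
                               const char **out)
{
    size_t kept = 0;
    for (size_t i = 0; i < n; i++)
        if (is_canonical(names[i]))
            out[kept++] = names[i];
    return kept;
}
```

With a directory holding "Makefile" and "makefile", the filter keeps only
"makefile", and an unlink of "MakeFile" is canonicalized to the same name.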

This has (to me) two problems.
1/ case gets lost, so if I save "My File", I will find "my file"
has been created (unless the application pretty-cases things, in
which case I can expect case to change anyway).

2/ Files created by posix apps might be invisible.

To answer 2/, I'd say "tough". If you want posix files to be
visible to WIN32 apps, choose appropriate names. However I would
allow there to be a process, either once-off or periodic, which
creates symlinks from canonical names to non-canonical filenames.
This would allow you to access pre-existing files where there was
no ambiguity.

To answer 1/ I would suggest a second cut at the problem...

Second cut:
As above, but readdir tries to be clever. If it sees two (or
more) names which have the same canonical form, it chooses just
one of them (predictably), preferring a non-canonical name which is
a symlink to the canonical name.

Then when creating an object, you create it with the canonical
name and (if that succeeds) subsequently create a symlink from the
requested name to the canonical name (if that is possible, don't
worry if it isn't).

Given this approach:

If only case-insensitive apps use a linux filesystem, they will see
exactly the semantics they expect, with minimal performance impact.

If case-sensitive and case-insensitive apps use a linux filesystem,
they will each see a consistent view and though they may not see the
same view, there will be well-defined mechanisms which can work at a
user-space level to resolve or highlight any issues.

The biggest cost I see with this is with large directories. The
"readdir" equivalent would need to read the whole directory before it
could reliably return any of it.
However dropping the "guarantee to preserve case" semantic on really
large directories probably isn't an enormous cost (and could be
configurable).

NeilBrown

Robert White

Feb 18, 2004, 12:22:50 AM

to tri...@samba.org, linux-...@vger.kernel.org, Linus Torvalds, Kernel Mailing List, Al Viro, Neil Brown

OK, so I wrote the below, but then in the summary I realized that there was
a significant factor that doesn't fit in with the rest of the post. Case
insensitivity, and more generally locale equivalence rules, is a security
nightmare. Consider the number of different file names that "su" could map
to if you apply case insensitivity (4) and/or worse yet the various accents
and umlauts (?, etc.) that are sort-equivalent to "u" in some locales. The user
types "su" and runs "S(u-umlaut)" etc.

====

In point of fact (ok in point of "technically abstract truth"), it is a "bad
thing" that Windows (and seemingly only Windows these days) is case
insensitive. It is sometimes said that windows is really an application and
not an OS. If you ignore the occasionally snide *way* it is said you can
find some technical truth to the matter.

In point of fact the entire windows application space has a singular active
locale at any one time and there is a well-defined but horrible layer of
indirection where "long names" like "My Documents" become "real names" like
"MYDOCU~1". Essentially every windows file name is subject to a
double-indirect file name translation. The first pass is the strcasecmp()
locale-dependent traversal of the "long name" list. The second is the
strcasecmp() frozen-locale-spec-dependent traversal of (US Latin?) 8.3 file
naming standard list of media elements (files/directories).

In point of fact, Windows is *not* "properly" case insensitive at the file
system level. Use "dir /x" more often on your windows box to relive the
experience. The "real" file names are mangled to good old 8.3 uppercase
internally(1). You don't usually have to think about this, but if you have
ever lost the long-to-short file name mapping on a drive you know the hell
that ensues. (see also iso9660.)

So the application file naming interface wedge thingy (in windows) creates
and maintains the mixed case names as an illusion. It just happens to be an
illusion planted so deeply in the application space that it appears to be
coming up from the "operating system level".

OK, as time has moved on, some later versions of later file systems *may* (I
honestly don't know) have modified the double-indirection model, but if they
have, they must have done so in a guaranteed-to-look-the-same way. Either
way it ends up being quite costly.

Further, the model only really works because a DOS (and therefore windows)
based program invariably and individually takes responsibility for doing all
sorts of tasks like wildcard expansions (etc) in the application space
(often "free" through comctl32.dll). [This tends to be foreign to Linux
(UNIX) programmers where shells and such do the expansion.]

The line is then blurred further by the subsequent steady creep of
wildcarding and file selection back into common DLLs. (more comctl32.dll
and friends.)

The thing is, to match this ersatz "functionality" on a system where more
than one locale may be used at the same time, you end up with a kind of
Cartesian product of user locales and filesystem native locales. The cost
could get extreme and can only really be amortized if Linux were to declare
our own 8.3 style pronouncement for the character classes used for the
"real" file name storage (etc).

Late stage case insensitivity isn't that hard to put in a linux application,
just crack open your file selection dialog boxes and have them use
strcasecmp() in all their select/sort logic. Also then replace open() with
CaseOpen() which does a find/search operation before daring to creat().
That is, in every practical way, how Windows handles these problems. It
just happens in some fairly interesting and hard-to-predict places depending
on context.

It is easier, IMHO, to bring the users into the 20th century (let alone the
21st 8-) by making them mean what they say (if they deign to step out from
behind their GUIs).

So what was I saying... Oh yea...

-- Single Locale storage standard required to prevent multiplicative cost.
-- Not that hard to fake case insensitivity "when necessary".
-- Cheaper in CPU/Space to mix case.
-- Native file names in native locales simplifies administration and
expectations. (not elaborated above, but true.)
-- Case insensitivity and locale equivalence leads to uncertainties about
what/which file may be intended in a given context, which could often lead
to exploitable error.

Rob.

(1) The actual truth is a tad uglier than this, the media can have the 8.3
names stored in interesting ways, but essentially a "toupper()" is done on
every file name as it is retrieved and processed. This cuts out a lot of
possibilities and leads to a lot of "tildes of shame" in even some of the
more harmless seeming name conflicts.

Linus Torvalds

Feb 18, 2004, 12:29:54 AM

to Robert White, tri...@samba.org, Kernel Mailing List, Al Viro, Neil Brown

On Tue, 17 Feb 2004, Robert White wrote:
>
> OK, so I wrote the below, but then in the summary I realized that there was
> a significant factor that doesn't fit in with the rest of the post. Case
> insensitivity, and more generally locale equivalence rules, is a security
> nightmare. Consider the number of different file names that "su" could map
> to if you apply case insensitivity (4) and/or worse yet the various accents
> and umlauts (?, etc.) that are sort-equivalent to "u" in some locales. The user
> types "su" and runs "S(u-umlaut)" etc.

This is but one reason why I will _refuse_ to make case insensitivity
magically start happening on regular "open()" etc calls.

You'd literally have to use a _different_ system call to do a
case-insensitive file open. Exactly because anything else would be very
confusing to existing apps (and thus be potential security holes).

Linus

Robert White

Feb 18, 2004, 1:06:28 AM

to tri...@samba.org, Kernel Mailing List, Al Viro, Neil Brown, Linus Torvalds

P.S. Given that the GUI libraries (almost invariably) already deal with
displaying things in a case insensitive way, the "best place to cut" to add
case insensitivity to the user command-line experience would be adding a
flag to file name completion in bash. Bash is already doing file name finds
and lookups when you press tab; and the user is actively looking at the
correctness and singularity/duality of the results.

So the proverbial "vi makef{tab}" would, if the flag was set, show you
makefile, Makefile, and MakeFile (etc) as existent or just switch makef to
"Makefile" if the name were unique.

It doesn't make lives easier for the API level project programmer people
(c.f. samba), but it could uber-happy the incoming newbies, and people like
me who have to interoperate within a vast wasteland of directories full of
inconsistently named files created by windows programmers (like SOCKET.C,
Socket.H, constants.h, and ss_switch.c all in one directory tree with
hundreds of their friends. 8-)

I would, however, be forced to throttle myself with my own intestine if
the kernel started doing this magic mapping "for me", especially "in some
calls/contexts but not in others". (Not that I want to provide my possible
death as a strong motivation for adding the feature. 8-)

Rob.

Andy Lutomirski

Feb 18, 2004, 1:12:33 AM

to Kernel Mailing List, Andrew Tridgell, Linus Torvalds, Al Viro

Linus Torvalds wrote:
> int magic_open(
> /* Input arguments */
> const char *pathname,
> unsigned long flags,
> mode_t mode,
>
> /* output arguments */
> int *fd,
> struct stat *st,
> int *successful_path_length);
>
> ie the system call would:
>
> - look up as far into the pathname (using _exact_ lookup) as possible
> - return the error code of the last failure
> - the "flags" could be extended so that you can specify that you mustn't
> traverse ".." or symlinks (ie those would count as failures)
>
> but also:
>
> - fill in the "struct stat" information for the last _successful_
> pathname component.
> - fill in the "fd" with a fd of the last _successful_ pathname component.
> - tell how much of the pathname it could traverse.
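
For readers who want to see the shape of those semantics, here is a rough
user-space emulation. It is only a sketch: `magic_open_emulated` is a
made-up name, no such system call exists, and the "flags" and "mode"
arguments are left out. It also differs from a real lookup in that each
component is opened for reading, so it needs read permission where the
kernel would only need search permission, and it follows ".." and symlinks.

```c
#define _POSIX_C_SOURCE 200809L
#include <assert.h>
#include <errno.h>
#include <fcntl.h>
#include <string.h>
#include <sys/stat.h>
#include <unistd.h>

/* Walks pathname one component at a time with exact lookups.
 * On return *fd refers to the last component that could be opened, *st
 * describes it, and *ok_len is the offset just past the last component
 * that resolved (0 if none did). Returns 0 if the whole path resolved,
 * otherwise -errno from the first failing step. If nothing could be
 * opened at all, *fd is -1. */
static int magic_open_emulated(const char *pathname, int *fd,
                               struct stat *st, int *ok_len)
{
    assert(pathname && fd && st && ok_len);

    const char *p = pathname;
    int err = 0;

    *fd = -1;
    *ok_len = 0;
    int cur = open(*p == '/' ? "/" : ".", O_RDONLY | O_DIRECTORY);
    if (cur < 0)
        return -errno;

    for (;;) {
        while (*p == '/')               /* skip separators */
            p++;
        if (*p == '\0')
            break;                      /* whole path traversed */

        size_t len = strcspn(p, "/");
        char comp[256];
        if (len >= sizeof(comp)) {
            err = ENAMETOOLONG;
            break;
        }
        memcpy(comp, p, len);
        comp[len] = '\0';

        int next = openat(cur, comp, O_RDONLY | O_NONBLOCK);
        if (next < 0) {
            err = errno;                /* remember the first failure */
            break;
        }
        close(cur);
        cur = next;
        p += len;
        *ok_len = (int)(p - pathname);
    }

    if (fstat(cur, st) < 0) {
        int e = errno;
        close(cur);
        return -e;
    }
    *fd = cur;
    return -err;
}
```

A caller that gets -ENOENT back can then do its case-insensitive search
only in the directory behind *fd and only for the part of the path after
*ok_len, instead of restarting from the root.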

Aside from just case-insensitivity, I imagine this could give lots of other
benefits:

- file servers that don't want to follow symlinks can do it quickly.
- Apache could serve things like http://www.foo.com/a/b/c/d.php/e/f/g a lot
faster.
- a flag to avoid traversing mountpoints could help someone
- a flag for root to see _through_ mountpoints would make it possible to clean
up initramfs and such that got mounted over, or to do other useful and currently
impossible tasks. (e.g. I could see what's under my devfs mount...)

It would be nice to see this added even if it's not the perfect solution for samba :)

BTW, here's a thought for solving samba's negative lookup problem:

int ugly_stat(char *pattern, struct stat *st, char *match_out)

Pattern would be some description of what the filename should look like.
Something like:

- pattern is an array of slash-delimited groups of characters separated by nulls
and terminated by two nulls. For example, ugly_stat("F/f\0O/o\0O/o\0\0", ...)
finds a file called foo, case-insensitively in English, while
ugly_stat("F\0i\0l\0e\011/22/33") finds "File" followed by either 11, 22, or 33.
- the dcache problem is easy: don't use it. All Andrew wants (I think) is proof
that there is no such file or the name if there is one. Samba can cache it
itself; I don't think the kernel should involve itself in trying to cache this.
- ugly_stat does not traverse directories -- that's why the slash trick is safe.
- st gets the stat data, and match_out gets the filename if any
- if there are multiple matches, one is arbitrarily selected.
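
The matching rule in the list above can be sketched as follows. This
covers only the test of one candidate filename against a pattern, not the
stat or the directory walk, and the names `match_from` and `ugly_match` are
made up for the example. One practical wrinkle: in a C string literal
"\011" is read as a single octal escape, so the second example has to be
written as two adjacent literals, "F\0i\0l\0e\0" "11/22/33\0".

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* group points at one NUL-terminated group of '/'-separated alternatives;
 * the next group starts right after that NUL. An empty group ends the
 * pattern. Returns 1 if name matches the rest of the pattern exactly. */
static int match_from(const char *group, const char *name)
{
    if (*group == '\0')
        return *name == '\0';           /* pattern done: name must be too */

    const char *next = group + strlen(group) + 1;
    const char *alt = group;
    for (;;) {
        const char *slash = strchr(alt, '/');
        size_t alen = slash ? (size_t)(slash - alt) : strlen(alt);

        /* Try this alternative as a prefix, then match the remainder. */
        if (alen > 0 && strncmp(name, alt, alen) == 0 &&
            match_from(next, name + alen))
            return 1;
        if (slash == NULL)
            return 0;                   /* no alternatives left */
        alt = slash + 1;
    }
}

/* pattern must end with an empty group, i.e. two consecutive NULs. */
static int ugly_match(const char *pattern, const char *name)
{
    assert(pattern != NULL && name != NULL);
    return match_from(pattern, name);
}
```

A string literal already ends in one NUL, so writing "F/f\0O/o\0O/o\0"
gives the required double-NUL terminator; with it, ugly_match() accepts
"foo", "FOO" and "fOo" and rejects "fo" and "fooo".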

If the file-system doesn't have specific support for this, then either VFS or
the caller could emulate it (probably VFS -- it would avoid lots of syscalls).

Would ugly_stat + magic_open be sufficient?

--Andy

H. Peter Anvin

Feb 18, 2004, 2:39:19 AM

to linux-...@vger.kernel.org

Followup to: <16434.41376....@samba.org>

By author: tri...@samba.org
In newsgroup: linux.dev.kernel
>

> > I don't know how Windows does it, so maybe this thing is hardcoded, and
> > you don't even want "true" case insensitivity.
>
> NTFS has a 128k table on disk, created at mkfs time and indexed by the
> UCS2 character.

So you're hosed if anyone uses characters outside the UCS-2 character
set...

> The interesting thing about this table is that it doesn't seem to
> vary between different locales as one might expect. I have checked 3
> locales so far (Swedish, Japanese and English) and all have the same
> 128k table. I should check a few more locales to see if it really is
> the same everywhere. Contact me off-list if you have a NTFS
> filesystem created in a different locale and would be willing to run
> a test program against it to see if the table is different from the
> one we have in Samba.

There is a "standard" table, which is published by the Unicode
consortium. However, the "standard" table isn't what you want in
certain locales, e.g. Turkish.

> There is stuff in the charset handling of every locale that does vary
> in windows, but it isn't the case table, it's the "valid characters"
> map used to determine what characters are allowed when converting
> strings into legacy multi-byte encodings. Even I don't think that the
> kernel will ever have to deal with that crap unless someone is foolish
> enough to port Samba into the kernel (several people have actually
> done that despite the insanity of the idea, but they all did an
> absolutely terrible job of it and certainly didn't take care to get
> all the charset handling right).
>
> > How "correct" is Windows?
>
> from my rather limited point of view I always have to assume that
> windows is "correct", unless I can show that its behaviour leads to
> data loss, a security hole or something equally extreme.

Well, we don't want to support a bunch of hacks to make it behave like
Windows if what Windows does doesn't make sense. If so you should use
a metalayer where you canonicalize the filenames and don't store
"Makefile" on the disk; store "makefile" and keep the "real" filename
stashed elsewhere, perhaps an EA.

-hpa

tri...@samba.org

Feb 18, 2004, 2:52:21 AM

to Robert White, linux-...@vger.kernel.org, Linus Torvalds, Al Viro, Neil Brown

Robert,

Just about everything in your posting is either years out of date or
just totally wrong.

> OK, so I wrote the below, but then in the summary I realized that there was
> a significant factor that doesn't fit in with the rest of the post. Case
> insensitivity, and more generally locale equivalence rules, is a security
> nightmare. Consider the number of different file names that "su" could map
> to if you apply case insensitivity (4) and/or worse yet the various accents
> and umlauts (ü, etc.) that sort-equivalent for "u" in some locales. The user
> types "su" and runs "S(u-umlaut)" etc.

This is no different from the "stupid admin puts . in $PATH"
problem. Simple solutions:

1) don't mount your root filesystem with case insensitive naming
2) use a sane $PATH
3) don't allow untrusted users to create files in your $PATH
4) don't run bash in case insensitive mode if for some reason
you can't do (1) or (2) or (3)

any of (1), (2) or (3) solves this.

> In point of fact the entire windows application space has a
> singular active locale at any one time and there is a well-defined
> but horrible layer of indirection where "long names" like "My
> Documents" become "real names" like "MYDOCU~1". Essentially every
> windows file name is subject to a double-indirect file name
> translation. The first pass is the strcasecmp() locale-dependent
> traversal of the "long name" list. The second is the strcasecmp()
> frozen-locale-spec-dependent traversal of (US Latin?) 8.3 file
> naming standard list of media elements (files/directories).

this is just total crap. That might have been true for msdos and even
possibly win9x, but it's totally untrue for NTFS. There are enough
stupidities in windows without having to invent more.

NTFS is case insensitive at the filesystem level. In fact, it's
selectable whether it's case sensitive or case insensitive per-process
(a process can switch between the two models). The case mapping table
is built into the filesystem itself. That mapping has absolutely
*zero* to do with US Latin or any other legacy multi-byte encoding.

What you have done is the equivalent of stating that Linux can only do
14 character filenames, because once upon a time Linux had a
filesystem called minix. We've moved beyond that and so has windows.

> In point of fact, Windows is *not* "properly" case insensitive at the file
> system level. Use "dir /x" more often on your windows box to relive the
> experience. The "real" file names are mangled to good old 8.3 uppercase
> internally(1). You don't usually have to think about this, but if you have
> ever lost the long-to-short file name mapping on a drive you know the hell
> that ensues. (see also iso9660.)

again, this is just complete crap. NTFS has had the ability to
completely disable 8.3 "alternative name" support for ages. Microsoft
is even starting to use this switch in their published benchmark
results, and I suspect it will become the default in a couple of
years.

We've been through the same transition in Samba:

- Samba 0.x only supported 8.3
- Samba 1.x was oriented towards 8.3, but also supported long names
- Samba 2.x and 3.x are oriented towards long names, and can disable 8.3
names to some extent

by the time Samba 4.x comes out (I am working on it now) we may see a
significant number of sites disabling 8.3 completely.

> The thing is, to match this ersatz "functionality" on a system where more
> than one locale may be used at the same time, you end up with a kind of
> Cartesian product of user locales and filesystem native locales. The cost
> could get extreme and can only really be amortized if Linux were to declare
> our own 8.3 style pronouncement for the character classes used for the
> "real" file name storage (etc).

you are *way* out of date here. All recent windows apps use the UCS-2
interfaces which provide a single charset encoding across all
locales. I've heard that they may be redefining this as UCS-16 to
allow for an even larger range of characters, although I haven't seen
this popping up on the wire yet (then again, I just might not have
noticed). I wish they had chosen UTF-8 instead of UCS-2, but at least
they chose something and got it into every part of the OS years ago.

> Late stage case insensitivity isn't that hard to put in a linux application,
> just crack open your file selection dialog boxes and have them use
> strcasecmp() in all their select/sort logic. Also then replace open() with
> CaseOpen() which does a find/search operation before daring to
> creat().

Have you read *any* of what I've been saying about how expensive this is??
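
To make the cost concrete, here is a rough sketch of what a userspace case-insensitive lookup has to do with today's APIs. The function name case_lookup is hypothetical, and strcasecmp() only folds ASCII in the C locale, so this is a simplification of what a real server would need.

```c
#include <assert.h>
#include <dirent.h>
#include <stdio.h>
#include <string.h>
#include <strings.h>

/*
 * Hypothetical userspace emulation: scan `dirpath` for an entry whose name
 * equals `wanted` ignoring case.  On a hit, copy the on-disk name into
 * `found` and return 1; return 0 if nothing matches, -1 on error.
 */
static int case_lookup(const char *dirpath, const char *wanted,
                       char *found, size_t found_size)
{
    assert(wanted != NULL && found != NULL && found_size > 0);

    DIR *dir = opendir(dirpath);
    if (dir == NULL)
        return -1;

    int result = 0;
    struct dirent *entry;

    /* Proving that a name is absent means reading every entry. */
    while ((entry = readdir(dir)) != NULL) {
        if (strcasecmp(entry->d_name, wanted) != 0)
            continue;
        if (strlen(entry->d_name) >= found_size) {
            result = -1;               /* caller's buffer is too small */
        } else {
            snprintf(found, found_size, "%s", entry->d_name);
            result = 1;
        }
        break;
    }
    closedir(dir);

    /* Race: another process can create or rename a file before the caller
     * acts on this answer. */
    return result;
}
```

Every miss is a full directory scan, and nothing stops another process from creating a conflicting name between the scan and the following open() or creat().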

> That is, in every practical way, how Windows handles these problems. It
> just happens in some fairly interesting and hard-to-predict places depending
> on context.

No, that is *not* how current versions of windows do things.

> So what was I saying... Oh yea...
>
> -- Single Locale storage standard required to prevent multiplicative cost.

windows has this. Linux doesn't.

> -- Not that hard to fake case insensitivity "when necessary".

ditto

> -- Cheaper in CPU/Space to mix case.

ditto

> -- Native file names in native locales simplifies administration and
> expectations. (not elaborated above, but true.)

?? single locale storage makes this just a no-op

> -- Case insensitivity and locale equivalence leads to uncertainties about
> what/which file may be intended in a given context, which could often lead
> to exploitable error.

and that is just a complete load of crap. Windows has had exploitable
bugs due to case insensitivity, but the cause was things like leaving
directories in the search path writeable by unprivileged users. It was
*not* due to anything fundamentally insecure about case-insensitive
names in filesystems.

> (1) The actual truth is a tad uglier than this, the media can have the 8.3
> names stored in interesting ways, but essentially a "toupper()" is done on
> every file name as it is retrieved and processed. This cuts out a lot of
> possibilities and leads to a lot of "tildes of shame" in even some of the
> more harmless seeming name conflicts.

oh i get it, you're just a troll ....

Cheers, Tridge

tri...@samba.org

Feb 18, 2004, 3:06:22 AM

to Robin Rosenberg, Linus Torvalds, Kernel Mailing List, Al Viro

Robin,

> Having to put up with the existence of Windows day in and out is
> the reason I'm still on an eight-bit encoding. Sorry for not
> explaining the REAL problem, but only a partial problem. I need to
> support all kinds of clients on Windows with protocols that convey
> no character set info. With samba that's no problem. Having to put
> up with a Unix world running ISO-8859-1 (or ISO-8859-15) is
> another. Of course that means Linux machines also add to the
> disturbance by not storing things as unicode. The real obstacle is
> file names, everything else including content of files, I can
> handle (I think). Maybe I'll find a solution for the filenames too,
> but usually some hot discussions are needed for the brain to kick
> into the right gear.

I suspect you are running Samba 2.x, which negotiated all that
multi-byte stuff on the wire. Samba 3.x does the same as windows
servers have done for years and negotiates UCS-2, which means that
every windows box that connects to it no matter what locale it is in
uses the same charset encoding as every other windows box.

There are still some legacy interfaces on the wire that use the old
encodings, but they are rare and getting rarer. To support these,
Samba3 juggles 4 character set encodings internally:

* the unix-charset, which it uses to talk to the OS, and defaults to
UTF-8

* the windows wire charset, which is always UCS-2

* the dos-charset for legacy parts of the protocol, which you have
to configure in the samba config if you care about these legacy
parts of the protocol (for example if you have older apps). It
defaults to either CP850 or ASCII depending on what autoconf
discovers.

* the display-charset which is used to put stuff on an admin's
terminal for utilities like smbclient. The default depends on your
LOCALE setting, or if nothing is set it uses ASCII.

Internally Samba3 only ever stores stuff in the "unix-charset"
encoding, which is usually UTF-8. It converts to the others as needed
when talking on the wire or to terminals.

> I want to switch to UTF-8 to work better with the outside world,
> but as things are people will start to take notice of what OS is
> running in the shadows when they see the filename problems, and
> start demanding Windows, and ... You see; I'm not mean; I don't
> want to do that to them (or myself),

If you use Samba3 then they will not notice what charset you are using
on your Linux filesystems. The windows clients will just see UCS-2.

Cheers, Tridge

Linus Torvalds

Feb 18, 2004, 3:07:42 AM

to H. Peter Anvin, linux-...@vger.kernel.org

On Wed, 18 Feb 2004, H. Peter Anvin wrote:
>
> Well, we don't want to support a bunch of hacks to make it behave like
> Windows if what Windows does doesn't make sense.

I'd disagree, for a very simple reason: case-insensitivity itself simply
does not make sense, so the _only_ reason for having a bunch of hacks is
literally to support windows file exports and nothing else.

I obviously agree with the fact that we should _not_ put those hacks into
the VFS layer proper - we should keep them as a separate thing, and we
should make it clear that it makes no sense _except_ for Windows
compatibility.

Think of it as nothing more than a binary compatibility layer, the same
way we have hooks to support "lcall 7,0" for binary compatibility with
some silly (and much less interesting) x86 OSes through external modules.

Linus

H. Peter Anvin

Feb 18, 2004, 3:17:49 AM

to Linus Torvalds, linux-...@vger.kernel.org

Linus Torvalds wrote:
>
> On Wed, 18 Feb 2004, H. Peter Anvin wrote:
>
>>Well, we don't want to support a bunch of hacks to make it behave like
>>Windows if what Windows does doesn't make sense.
>
>
> I'd disagree, for a very simple reason: case-insensitivity itself simply
> does not make sense, so the _only_ reason for having a bunch of hacks is
> literally to support windows file exports and nothing else.
>
> I obviously agree with the fact that we should _not_ put those hacks into
> the VFS layer proper - we should keep them as a separate thing, and we
> should make it clear that it makes no sense _except_ for Windows
> compatibility.
>
> Think of it as nothing more than a binary compatibility layer, the same
> way we have hooks to support "lcall 7,0" for binary compatibility with
> some silly (and much less interesting) x86 OSes through external modules.
>

Well, this is also true :) I still say it belongs in userspace.

For 100% bug-compatibility with Windows, though, it is probably
worthwhile to have the filename in the native filesystem be not what a
Windows user would see, but rather the normalized filename. That makes
a userspace implementation much easier.

-hpa

tri...@samba.org

Feb 18, 2004, 3:30:05 AM

to Linus Torvalds, Kernel Mailing List, Al Viro

Linus,

> For example, if you only want to be compatible with Windows, you don't
> have to worry about UCS-4, you only have the UCS-2 part, which means that
> you can do a silly array-lookup based thing or something.

Even within UCS-2 land the case-mapping table is sparse as only some
characters have an upper/lower mapping. In fact, there are just 636
characters out of 64k that have an upper/lower case mapping that isn't
the identity. That is across *all* languages that windows uses for
UCS-2.

In Samba that's not sparse enough that it's worth saving the single
mmap of 128k to encode it sparsely in memory, but in UCS-4 land you
would obviously use a sparse mapping, and that mapping table would
probably be just a few k in size. If you allow for extents then I
expect you could encode it in a couple of hundred bytes.

(I experimented with using a sparse mapping in Samba, and it was a
slight loss on the machine I was testing on compared to just doing the
mmap, so I went with the mmap. Maybe someone else can do a better
sparse encoding than I did and actually get a win due to better cache
behaviour.)
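
A small sketch of the "extents" idea, under stated assumptions: each extent covers a run of code points that all map by the same offset, and lookup is a binary search over the sorted extents. The struct layout and the function name sparse_toupper are illustrative, and the handful of entries shown are ordinary Unicode case pairs chosen for the example, not the table that Windows or Samba uses.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* One extent: code points first..last map to (code point + delta). */
struct case_extent {
    uint32_t first;
    uint32_t last;
    int32_t  delta;
};

/* Illustrative subset only, sorted by `first`. */
static const struct case_extent to_upper_extents[] = {
    { 0x0061, 0x007A, -32 },   /* U+0061..U+007A: a-z          */
    { 0x00E0, 0x00F6, -32 },   /* U+00E0..U+00F6: Latin-1      */
    { 0x00F8, 0x00FE, -32 },   /* U+00F8..U+00FE: Latin-1      */
    { 0x03B1, 0x03C1, -32 },   /* U+03B1..U+03C1: Greek        */
    { 0x0430, 0x044F, -32 },   /* U+0430..U+044F: Cyrillic     */
    { 0x0450, 0x045F, -80 },   /* U+0450..U+045F: Cyrillic ext */
};

/* Binary search over the extents; unmapped code points map to themselves. */
static uint32_t sparse_toupper(uint32_t cp)
{
    assert(cp <= 0x10FFFF);

    size_t lo = 0;
    size_t hi = sizeof to_upper_extents / sizeof to_upper_extents[0];

    while (lo < hi) {
        size_t mid = lo + (hi - lo) / 2;
        const struct case_extent *e = &to_upper_extents[mid];

        if (cp < e->first)
            hi = mid;
        else if (cp > e->last)
            lo = mid + 1;
        else
            return (uint32_t)((int32_t)cp + e->delta);
    }
    return cp;
}
```

The trade-off is the one described above: the table shrinks to a few bytes per extent, but every lookup pays for a search instead of a single indexed load.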

> Ugh. What a horrible kludge, and it won't work without "preparing" the
> filesystem at mount-time. I'd much rather leave the translation table in
> user space, and just give it as an argument to the "look up case
> insensitive" special thing.

The case mapping table must remain the same for the lifetime of the
mounted filesystem, otherwise you'd get chaos. That's why tying it to
the filesystem (ie. hanging it off the superblock) makes sense.

> The hard part would be negative dentries. We'd have to invalidate all
> "case-insensitive" negative dentries when creating any new file in a
> directory, and that would be something the generic VFS layer would have to
> know about

Right, the handling of negative dentries is the key. I don't think it's
quite as bad as you say though, as you can do this:

1) use a filesystem provided case-insensitive hash in the dcache. If
the filesystem provided hash isn't case-insensitive then don't try
to do case-insensitive lookups on this filesystem.

2) you only need to potentially invalidate entries in the same hash
bucket as the name you are creating.

3) Even better, you don't need to invalidate entries that don't have
the same hash value (presuming your hash values are larger than
your truncated hash keys).
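
The three steps can be sketched with a toy negative-entry cache. This is a userspace illustration only: the structure, the bucket count and all function names are hypothetical, and a case-insensitive FNV-1a over ASCII stands in for whatever hash a filesystem would provide. It is not how the real dcache is laid out.

```c
#include <assert.h>
#include <ctype.h>
#include <stdint.h>
#include <stdlib.h>
#include <string.h>

#define NBUCKETS 8   /* tiny on purpose so that buckets collide */

struct neg_entry {
    uint32_t hash;             /* full case-insensitive hash value */
    char name[32];             /* name that was looked up and not found */
    struct neg_entry *next;
};

static struct neg_entry *buckets[NBUCKETS];

/* Case-insensitive FNV-1a over ASCII; a stand-in for the filesystem hash. */
static uint32_t ci_hash(const char *name)
{
    uint32_t h = 2166136261u;
    for (; *name; name++) {
        h ^= (uint32_t)tolower((unsigned char)*name);
        h *= 16777619u;
    }
    return h;
}

/* Remember that `name` does not exist. */
static void add_negative(const char *name)
{
    struct neg_entry *e = malloc(sizeof *e);
    assert(e != NULL);
    e->hash = ci_hash(name);
    strncpy(e->name, name, sizeof e->name - 1);
    e->name[sizeof e->name - 1] = '\0';
    e->next = buckets[e->hash % NBUCKETS];
    buckets[e->hash % NBUCKETS] = e;
}

/* Count all cached negative entries. */
static int count_negative(void)
{
    int n = 0;
    for (int b = 0; b < NBUCKETS; b++)
        for (struct neg_entry *e = buckets[b]; e; e = e->next)
            n++;
    return n;
}

/* `name` was just created: walk only its bucket and drop the negative
 * entries whose full hash matches.  Returns how many were dropped. */
static int invalidate_on_create(const char *name)
{
    uint32_t h = ci_hash(name);
    struct neg_entry **link = &buckets[h % NBUCKETS];
    int dropped = 0;

    while (*link) {
        struct neg_entry *e = *link;
        if (e->hash == h) {
            *link = e->next;
            free(e);
            dropped++;
        } else {
            link = &e->next;
        }
    }
    return dropped;
}
```

Creating "makefile" drops cached misses for "Makefile" and "MAKEFILE" and leaves unrelated names alone. An unrelated name that happens to share the full hash value would also be dropped, which is harmless: it only costs one extra real lookup later.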

> and that might be unacceptable to Al.

yes, and I'm quite sympathetic to that point of view. I just want to
make sure that if we don't do this then we use honest reasons for not
doing it, not "that's impossible" reasons which are bogus when you
examine them.

Cheers, Tridge

Linus Torvalds

Feb 18, 2004, 3:31:04 AM

to H. Peter Anvin, linux-...@vger.kernel.org

On Tue, 17 Feb 2004, H. Peter Anvin wrote:
>
> Well, this is also true :) I still say it belongs in userspace.

The thing is, I do agree with Tridge on one simple fact: it's very hard
indeed to do atomic file operations from user space.

That's not necessarily a problem if samba is the only process accessing
the directories in question, since then samba could do all locking
internally and make sure that it never does anything inconsistent.

However, clearly people who run samba on a machine want to potentially
_also_ export that same filesystem as a NFS volume, as a way to have both
Windows and UNIX clients access the same data. And that pretty much means
that other people _will_ access the directories, and that samba can't do
its internal locking in that kind of environment.

This is why I am sympathetic to the need to add _some_ kind of support
for this. And the only common place ends up being the kernel.

> For 100% bug-compatibility with Windows, though, it is probably
> worthwhile to have the filename in the native filesystem be not what a
> Windows user would see, but rather the normalized filename. That makes
> a userspace implementation much easier.

Oh, absolutely. But that's something that samba can easily do internally:
it can choose to just entirely ignore filenames that aren't normalized, or
it can export it on the wire (obviously in the normalized UCS-2 format),
and just consider non-normalized names to be another "case". In fact,
that's what the naive implementation would do anyway, so that's not any
added complexity.

(And samba clearly _cannot_ show the client a non-normalized name per se,
since the smb protocol ends up using UCS-2).

Linus

tri...@samba.org

Feb 18, 2004, 4:14:27 AM

to h...@zytor.com, Kernel Mailing List

Hpa,

> So you're hosed if anyone uses characters outside the UCS-2 character
> set...

I've heard they are re-defining all those 16 bit numbers to be UCS-16
instead of UCS-2 for exactly that reason. This is rather similar to
the move in the Unix community to start using UTF-8.

Note that I am not at all proposing that we use UCS-2 in the Linux
kernel (except in places where you have to, like the NTFS
filesystem). I am proposing that the filesystems be able to offer a
case-insensitive hash function to the dcache, and I would expect that
this function would be based on UTF-8.

The function might operate internally by converting UTF-8 to UCS-2, or
it might use a sparse mapping table. It would almost certainly have a
fast-path that looked first to see if there are any bytes with the top
bit set, and if there are none then it can do a really easy 7 bit
table based hash which would make this really fast for most users.
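
A minimal sketch of that fast path, assuming a hypothetical function name ci_hash_fast and using FNV-1a purely as a placeholder hash. The 7-bit fold is done arithmetically here; a 128-entry lookup table would give the same result. Names containing any byte with the top bit set are handed back to the caller for the full UTF-8 path, which is not shown.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* 7-bit fold: map 'A'..'Z' to 'a'..'z', leave everything else alone. */
static unsigned char fold7(unsigned char c)
{
    return (c >= 'A' && c <= 'Z') ? (unsigned char)(c + 32) : c;
}

/*
 * Hypothetical fast path: if `s` is pure 7-bit ASCII, store its
 * case-insensitive hash in *out and return 1.  If any byte has the top
 * bit set, return 0 so the caller can fall back to a UTF-8 aware routine.
 */
static int ci_hash_fast(const unsigned char *s, size_t len, uint32_t *out)
{
    assert(s != NULL && out != NULL);

    uint32_t h = 2166136261u;          /* FNV-1a offset basis */
    for (size_t i = 0; i < len; i++) {
        if (s[i] & 0x80)
            return 0;                  /* multi-byte UTF-8: slow path */
        h ^= fold7(s[i]);
        h *= 16777619u;                /* FNV-1a prime */
    }
    *out = h;
    return 1;
}
```

The check and the hash share one pass over the name, so pure-ASCII names never touch the larger mapping table at all.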

The point is that the kernel proper (the VFS and dcache in particular)
won't have to care how this hash works. They're just consumers of it.

> There is a "standard" table, which is published by the Unicode
> consortium.

The table used in windows is not exactly the same as the one on
unicode.org. Which is "correct" I will leave up to the pedants to
discuss, as all that Samba cares about is that it uses the same table
as w2k.

> However, the "standard" table isn't what you want in certain
> locales, e.g. Turkish.

I'd really like someone to confirm this for me by volunteering to run
a tool I provide on a Turkish NTFS filesystem or sending me a
compressed empty Turkish NTFS volume (please ask first by email - I
only need one of these). Up to now I have only ever seen the one 128k
table used across all windows locales. If this table really *is*
different in some locales then I need to know.

Cheers, Tridge

Dave Jones

Feb 18, 2004, 4:18:04 AM

to Linus Torvalds, Linux Kernel, gr...@kroah.com

On Tue, Feb 17, 2004 at 08:00:35PM -0800, Linus Torvalds wrote:

> > I felt masochistic, so decided to 'see what would happen' when I ran this..
> Whee. Fun. Do you actually have the hardware for it, or did it blow up
> even without any supported hardware?

-ENOHARDWARE.
Judging by the number of bugs this has thrown up, not many folks test the
'try a driver without the hardware' paths too often.

Dave

vi...@parcelfarce.linux.theplanet.co.uk

Feb 18, 2004, 4:36:42 AM

to Dave Jones, Linus Torvalds, Linux Kernel, gr...@kroah.com

On Wed, Feb 18, 2004 at 04:02:52AM +0000, Dave Jones wrote:
> On Tue, Feb 17, 2004 at 08:00:35PM -0800, Linus Torvalds wrote:
>
> > > I felt masochistic, so decided to 'see what would happen' when I ran this..
> > Whee. Fun. Do you actually have the hardware for it, or did it blow up
> > even without any supported hardware?
>
> -ENOHARDWARE.
> Judging by the number of bugs this has thrown up, not many folks test the
> 'try a driver without the hardware' paths too often.

Hell, yes. You should've seen the crap fixed by NE* patches in drivers/net/* -
it certainly looks like the common mindset is "if this miserable box can't,
for whatever reason, run My Driver(tm) - why the fuck does it dare to exist?"

H. Peter Anvin

Feb 18, 2004, 5:35:56 AM

to linux-...@vger.kernel.org

Followup to: <16434.56190....@samba.org>

By author: tri...@samba.org
In newsgroup: linux.dev.kernel
>

> In Samba that's not sparse enough that its worth saving the single
> mmap of 128k to encode it sparsely in memory, but in UCS-4 land you
> would obviously use a sparse mapping, and that mapping table would
> probably be just a few k in size. If you allow for extents then I
> expect you could encode it in a couple of hundred bytes.
>

If all you care about is the UTF-16-compatible range, you only need
1088K entries in your table; small enough that it can be reasonably
had in userspace.

> (I experimented with using a sparse mapping in Samba, and it was a
> slight loss on the machine I was testing on compared to just doing the
> mmap, so I went with the mmap. Maybe someone else can do a better
> sparse encoding than I did and actually get a win due to better cache
> behaviour.)

The thing is, you're probably only touching small parts of your table,
so the kernel and the CPU cache work quite well on the large table as
it is.

Wouldn't work in kernel space, though.

-hpa

Marc Lehmann

Feb 18, 2004, 7:56:42 AM

to linux-...@vger.kernel.org

On Wed, Feb 18, 2004 at 02:26:54PM +1100, tri...@samba.org wrote:
> Even within UCS-2 land the case-mapping table is sparse as only some
> characters have a upper/lower mapping. In fact, there are just 636
> characters out of 64k that have an upper/lower case mapping that isn't
> the identity. That is across *all* languages that windows uses for
> UCS-2.

This is because scripts differentiating between upper and lower case are
rare exceptions in the world.

Unfortunately, commonly used exceptions, and still locale dependent.

Having a samba-helper kernel module that would contain this table (I am
confident that it's only a single table in existing versions of windows,
but maybe they improve that in future versions) could solve this problem.

I still wonder whether it ever can be made efficient, though.

--
-----==- |
----==-- _ |
---==---(_)__ __ ____ __ Marc Lehmann +--
--==---/ / _ \/ // /\ \/ / p...@goof.com |e|
-=====/_/_//_/\_,_/ /_/\_\ XX11-RIPE --+
The choice of a GNU generation |

Helge Hafting

Feb 18, 2004, 9:36:27 AM

to Neil Brown, linux-...@vger.kernel.org

Neil Brown wrote:

> 1/ case gets lost, so if I save "My File", I will find "my file"
> has been created (unless the application pretty-cases things, in
> which case I can expect case to change anyway).
>
> 2/ Files created by posix apps might be invisible.
>
>
> To answer 2/, I'd say "tough". If you want posix files to be

This is a bit worse than just "tough".
win32: rmdir foo
directory not empty!
win32: there are _no_ files there?

Helge Hafting

Robin Rosenberg

Feb 18, 2004, 10:07:17 AM

to tri...@samba.org, h...@zytor.com, Kernel Mailing List

On Wednesday 18 February 2004 05.08, tri...@samba.org wrote:
> Hpa,
>
> > So you're hosed if anyone uses characters outside the UCS-2 character
> > set...
>
> I've heard they are re-defining all those 16 bit numbers to be UCS-16
> instead of UCS-2 for exactly that reason. This is rather similar to
> the move in the Unix community to start using UTF-8.

I've read it also: http://www.microsoft.com/globaldev/getwr/steps/wrg_unicode.mspx
"The fundamental representation of text in Windows NT-based operating systems is UTF-16"

-- robin

tri...@samba.org

Feb 18, 2004, 11:45:36 AM

to Robin Rosenberg, h...@zytor.com, Kernel Mailing List

Robin,

> I've read it also:
> http://www.microsoft.com/globaldev/getwr/steps/wrg_unicode.mspx
> "The fundamental representation of text in Windows NT-based
> operating systems is UTF-16"

yep, in this thread I've been mistakenly using the term UCS-16 when I
should have said UTF-16 (ie. the variable length, 2 byte encoding).

Samba currently treats the bytes on the wire from windows as UCS-2 (a
2 byte fixed width encoding), whereas perhaps it should be treating
them as UTF-16. I should write a smbtorture test to detect the
difference and see what different versions of windows actually use.
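
The difference such a test would look for can be sketched as follows. The function name utf16_decode is hypothetical and the input is assumed to be already converted to host byte order. For code units outside 0xD800..0xDFFF the result is the same as a UCS-2 reading; only surrogate pairs differ.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/*
 * Decode one code point from `units` (UTF-16 code units in host byte
 * order).  Stores the code point in *cp and returns the number of units
 * consumed (1 or 2), or 0 for empty input or an unpaired surrogate.
 */
static size_t utf16_decode(const uint16_t *units, size_t n, uint32_t *cp)
{
    assert(units != NULL && cp != NULL);

    if (n == 0)
        return 0;

    uint16_t u = units[0];
    if (u < 0xD800 || u > 0xDFFF) {    /* ordinary unit: same as UCS-2 */
        *cp = u;
        return 1;
    }
    if (u <= 0xDBFF && n >= 2 &&
        units[1] >= 0xDC00 && units[1] <= 0xDFFF) {
        /* High surrogate followed by low surrogate: 10 bits from each. */
        *cp = 0x10000u + (((uint32_t)(u - 0xD800) << 10) |
                          (uint32_t)(units[1] - 0xDC00));
        return 2;
    }
    return 0;                          /* lone surrogate: invalid UTF-16 */
}
```

A fixed-width UCS-2 reader would treat the pair 0xD834 0xDD1E as two separate 16-bit characters, while this decoder yields the single code point U+1D11E. The largest value a pair can encode is U+10FFFF, which gives 17 * 65536 = 1114112 code points, the 1088K figure mentioned earlier in the thread.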

luckily the new charset handling stuff in samba3 and samba4 will make
this easy to fix :-)

Robin Rosenberg

Feb 18, 2004, 12:33:31 PM

to tri...@samba.org, h...@zytor.com, Kernel Mailing List

On Wednesday 18 February 2004 12.43, tri...@samba.org wrote:
> Robin,
> > I've read it also:
> > http://www.microsoft.com/globaldev/getwr/steps/wrg_unicode.mspx
> > "The fundamental representation of text in Windows NT-based
> > operating systems is UTF-16"

I believe (please correct me if this is wrong) that Windows never actually
supported any of the UCS-2 code points that were in conflict with UTF-16. The cost
of this operation was that some of the "private" code blocks of unicode 2.0, i.e.
U+D800..U+DFFF were redefined as "surrogates" in Unicode 3.0 making the
UTF-16 encoding more or less backwards compatible with UCS-2. And it's
UTF-16LE and UCS-2LE, but I suspect you knew that :-)

> yep, in this thread I've been mistakenly using the term UCS-16 when I
> should have said UTF-16 (ie. the variable length, 2 byte encoding).
>
> Samba currently treats the bytes on the wire from windows as UCS-2 (a
> 2 byte fixed width encoding), whereas perhaps it should be treating
> them as UTF-16. I should write a smbtorture test to detect the
> difference and see what different versions of windows actually use.

See above, and most importantly the definition in Amendment 1 of the unicode
3.0 standard.

> luckily the new charset handling stuff in samba3 and samba4 will make
> this easy to fix :-)

Happy man!

-- robin

H. Peter Anvin

Feb 18, 2004, 4:53:56 PM

to Robin Rosenberg, tri...@samba.org, Kernel Mailing List

Robin Rosenberg wrote:
>
> I believe (please correct me if this is wrong) that Windows never actually
> supported any of the UCS-2 code that were in conflict with UTF-16. The cost
> of this operation was that some of the "private" code blocks of unicode 2.0, i.e.
> U+D800..U+DFFF were redefined as "surrogates" in Unicode 3.0 making the
> UTF-16 encoding more or less backwards compatible with UCS-2. And it's
> UTF-16LE and UCS-2LE, but I suspect you knew that :-)
>

Make that Unicode 1.0 and 1.1, and you're correct.

-hpa

H. Peter Anvin

Feb 18, 2004, 8:03:56 PM

to linux-...@vger.kernel.org

Followup to: <4033974F...@zytor.com>
By author: "H. Peter Anvin" <h...@zytor.com>
In newsgroup: linux.dev.kernel

>
> Robin Rosenberg wrote:
> >
> > I believe (please correct me if this is wrong) that Windows never actually
> > supported any of the UCS-2 code that were in conflict with UTF-16. The cost
> > of this operation was that some of the "private" code blocks of unicode 2.0, i.e.
> > U+D800..U+DFFF were redefined as "surrogates" in Unicode 3.0 making the
> > UTF-16 encoding more or less backwards compatible with UCS-2. And it's
> > UTF-16LE and UCS-2LE, but I suspect you knew that :-)
> >
>
> Make that Unicode 1.0 and 1.1, and you're correct.
>

Err, that was supposed to be 1.1 and 2.0.

Unicode 1.1 reshuffled the private use range from Unicode 1.0, in
order to make room for surrogates in Unicode 2.0.

UTF-16, what a horrible ugly hack.

Robert White

Feb 18, 2004, 9:00:39 PM

to tri...@samba.org, linux-...@vger.kernel.org, Linus Torvalds, Al Viro, Neil Brown

I guess I don't get it...

tri...@samba.org [mailto:tri...@samba.org] said:

> NTFS is case insensitive at the filesystem level. In fact, its
> selectable whether its case sensitive or case insensitive per-process
> (a process can switch between the two models). The case mapping table
> is built into the filesystem itself. That mapping has absolutely
> *zero* to do with US Latin or any other legacy multi-byte encoding.

If the process selects whether it wants to be case insensitive or not how is
NTFS case insensitive "at the file-system level"? Let me guess, they have
two complete paths through the logic? Lots of DLLs? Redundant conflicting
access semantics^Wfeatures?

> you are *way* out of date here. All recent windows apps use the UCS-2
> interfaces which provides a single charset encoding across all locales.

Which kind of directly supports where I said that to amortize the expense
Linux would have to set up its *own* canon about all file systems using the
same encoding. The fact that I kept bringing up 8.3 was out of date. Point
to you. The point that picking an arbitrary encoding will lead to Linux
getting out of date, or at least require a catastrophic realignment of every
program that deigns to open() any file anywhere, remains germane.

> Have you read *any* of what I've been saying about how expensive this is??

Yes, I understand the expense. I have *paid* that expense in excruciating
detail on several occasions. You want to have the kernel pay that expense
(in place of the application) as a fixed (amortized) cost or you want to
codify the file names with a standard encoding which would penalize the
entire system uniformly by raising the base cost to localize.

I appreciate the unbounded regex-like expense of iteratively applying
case/encoding insensitivity to a list of files. I really don't want to pay
that cost in every application when I only need it at the front end. Sue
me.

I also understand the pain of having to load any/each entire directory into
memory one blasted dirent at a time, and appreciate that since the kernel is
bulk loading them at the filesystem interface it seems (is) wasteful to have
to spoon them across the kernel/user-space interface. I really do
understand. (ASIDE: a bulk-fetch-directory-into-buffer call might be nice,
I haven't looked lately, but I presume none such exists.)

Your proposed "single locale storage" would penalize all us embedded systems
types with our space sensitive embedded file systems and low-powered CPUs so
that the larger systems that _can_ afford to pay the cost only when necessary
don't have to. Two-bytes for one in every file name isn't a good trade off
when you are dealing with a 32k file system image.

I kind of tried (and apparently utterly failed) to make the points about how
the Windows model worked and what it would cost by describing the basis for
the model, not the current implementation. That is kind of why I *started*
the message with "(ok in point of "technically abstract truth")" and
mentioned later that what I was saying may have changed, but if so, it
changed in a way consistent with the model as described.

Windows has been digging themselves steadily out of the deep hole of
case-insensitive file name handling for years; which does nothing to entice
me to jump in and join them. So bully for windows that they have, iteration
after iteration, managed to reduce the cost of their mistake.

Even *with* a standardized file name character set/encoding case
insensitivity would still be very bad-off in some important areas. Consider
a simple security log. "[date] user command xx satisfied with executive
Xx." etc. I can think of *lots* of times when I would have to open a file
and then have to ask what the real name of the file I opened actually was.
"I asked for 'Bob', what did I get?" isn't a fun question to have to answer
*after* an open. Yes, all this *can* be addressed by scrubbing paths, but
history suggests that this doesn't happen and the more the system does for
you, the more likely you are to miss something.

At the application level, since I have to sort file names for a picklist
anyway, I'd rather pay the case insensitivity cost while I was sorting.
It's actually cleaner and I am already paying to sort.

I used to write SMB based applications (yes, I'm still way out of date) and
I appreciate the painful tit-for-tat non-streaming ugliness. I feel your
pain at having to read a whole directory and doing the sort/search. I
understand the race condition that occurs between the directory read and the
actual open where the file could be renamed or replaced. I really do.

But "fixing" Linux so that it can share Windows' pain doesn't seem wise.

I can imagine a mod/module that would graft a localized and/or
case-insensitive companion hash onto the dirent(s) as the central facility
was doing its work. I can imagine an alternate open that traversed this
alternate tree. Creating sort of a giant look-aside into the current file
information tree. But I can't imagine any winning scenario that came from
making that alternate hash the normal access method. Too many people and
projects would suddenly break.

{And I try not to troll, but I apparently have a knack for getting people's
dander in a bunch when I write. I think it is because I write as I speak,
and the loss of tone and inflection in writing makes my turn-of-phrase come
off very priggish. I'm not sure how to fix that. /sigh 8-)

Rob.

tri...@samba.org

Feb 18, 2004, 9:33:29 PM

to Linus Torvalds, H. Peter Anvin, linux-...@vger.kernel.org

Linus,

> The thing is, I do agree with Tridge on one simple fact: it's very hard
> indeed to do atomic file operations from user space.

I'm glad I'm making progress :)

The second basic fact that I think is relevant is that it's not
possible to do case-insensitive filesystem operations efficiently
without the filesystem having knowledge of the fact that you want a
case-insensitive lookup.

The reason for this is that modern filesystems do much better than an
O(n) linear scan for lookups in directories. They use a hash, or a
tree or whatever you like to take advantage of an ordering function on
the names in the directory. The days of linear scans in directories
are fast dwindling.

The only way you are going to avoid the linear scan for a
case-insensitive lookup is to make that ordering function
case-insensitive. The question really is whether we are willing to pay
the price in terms of complexity for doing that. I've tried to make
the claim in this thread that the code complexity cost of doing this
isn't really all that high, but it is definitely non-zero.

So your magic_open() proposal would probably be a help, and would
certainly reduce the amount of code we would need in userspace, but it
doesn't change the fundamental linear scan of directories problem at
all.

That doesn't mean I won't take you up on the magic_open() proposal,
it's just that I'd need to try it to see if it's a sufficient win to
justify using it given the limitations.

Cheers, Tridge

Ville Herva

Feb 18, 2004, 9:50:16 PM

to Linus Torvalds, Robert White, tri...@samba.org, Kernel Mailing List, Al Viro, Neil Brown

On Tue, Feb 17, 2004 at 04:20:26PM -0800, you [Linus Torvalds] wrote:
>
> This is but one reason why I will _refuse_ to make case insensitivity
> magically start happening on regular "open()" etc calls.
>
> You'd literally have to use a _different_ system call to do a
> case-insensitive file open.

Tongue-in-cheek:

int Open(const char *pathname, int flags); ?

-- v --

v...@iki.fi

Linus Torvalds

Feb 18, 2004, 10:27:10 PM

to tri...@samba.org, H. Peter Anvin, linux-...@vger.kernel.org

On Thu, 19 Feb 2004 tri...@samba.org wrote:
>
> The second basic fact that I think is relevant is that it's not
> possible to do case-insensitive filesystem operations efficiently
> without the filesystem having knowledge of the fact that you want a
> case-insensitive lookup.

That's not my problem. That is _your_ problem, and I don't care. I
disagree violently with the notion that we would push this down to a
filesystem level.

Sorry, but there are limits to how much we care about broken operating
systems.

Linus

Linus Torvalds

Feb 18, 2004, 10:31:09 PM

to tri...@samba.org, H. Peter Anvin, linux-...@vger.kernel.org

On Wed, 18 Feb 2004, Linus Torvalds wrote:
>
> That's not my problem. That is _your_ problem, and I don't care. I
> disagree violently with the notion that we would push this down to a
> filesystem level.
>
> Sorry, but there are limits to how much we care about broken operating
> systems.

Side note: this only matters for cold cache entries anyway, so I doubt
you'll see any performance improvement on a file server from passing the
brain damage down to the lower levels.

And I bet the performance advantages of _not_ doing native case
insensitivity are likely to dominate hugely.

tri...@samba.org

Feb 18, 2004, 10:53:26 PM

to Linus Torvalds, H. Peter Anvin, linux-...@vger.kernel.org

Linus,

> And I bet the performance advantages of _not_ doing native case
> insensitivity are likely to dominate hugely.

This part I just don't understand at all. The proposed changes would
be extremely cheap performance wise as you are just replacing one hash
with another, and dealing with one extra context bit in the
dcache. There is no way that this could come anywhere near the cost of
doing linear directory scans.

The hash function would be slightly more expensive (when enabled), but
not much, especially when you put in the obvious optimisation for 7
bit characters. The string comparison function in a couple of places
would also become more expensive, but once again it would only be
expensive for case-insensitive processes and benefits from the 7 bit
optimisation so that the average case will only be very slightly more
expensive than the current function.
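
As a rough illustration of the 7-bit fast path described above, here is a minimal sketch in C. The names fold_ascii, ci_name_hash and ci_name_equal are invented for this example, FNV-1a merely stands in for whatever hash a filesystem or the dcache really uses, and bytes of 0x80 and above are passed through unfolded where a real implementation would have to decode UTF-8 and consult a case table.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Fold one byte. Only ASCII 'A'..'Z' is mapped; anything else, including
 * bytes >= 0x80, is returned unchanged (assumption: no UTF-8 slow path). */
unsigned char fold_ascii(unsigned char c)
{
    if (c >= 'A' && c <= 'Z')
        return (unsigned char)(c + ('a' - 'A'));
    return c;
}

/* FNV-1a over the folded bytes, so "Foo" and "foo" hash identically. */
uint32_t ci_name_hash(const char *name, size_t len)
{
    uint32_t h = 2166136261u;

    assert(name != NULL);
    for (size_t i = 0; i < len; i++) {
        h ^= fold_ascii((unsigned char)name[i]);
        h *= 16777619u;
    }
    return h;
}

/* Case-insensitive equality using the same folding; returns 1 on a match. */
int ci_name_equal(const char *a, size_t alen, const char *b, size_t blen)
{
    if (alen != blen)
        return 0;
    for (size_t i = 0; i < alen; i++)
        if (fold_ascii((unsigned char)a[i]) != fold_ascii((unsigned char)b[i]))
            return 0;
    return 1;
}
```

The extra work per byte over a case-sensitive hash is one range check, which is the sense in which the common ASCII case stays cheap. The hash and the comparison must use the same folding, otherwise two names could compare equal yet land in different buckets.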

Fair enough that you don't want to do this for code complexity
reasons, but please don't tell me it would be slower than what we have
to do now.

Try an strace of Samba trying to unlink() a nonexistent file in a
large directory. It's enough to make you want to curl up and die :)

Cheers, Tridge

Linus Torvalds

Feb 18, 2004, 11:04:42 PM

to tri...@samba.org, H. Peter Anvin, linux-...@vger.kernel.org

On Thu, 19 Feb 2004 tri...@samba.org wrote:
>
> > And I bet the performance advantages of _not_ doing native case
> > insensitivity are likely to dominate hugely.
>
> This part I just don't understand at all. The proposed changes would
> be extremely cheap performance wise as you are just replacing one hash
> with another, and dealing with one extra context bit in the
> dcache. There is no way that this could come anywhere near the cost of
> doing linear directory scans.

Why do you focus on linear directory scans?

They simply do not happen under any reasonable IO patterns. You look up
names under the same name that they are on the disk. So the _only_ thing
that should matter is the exact match.

The inexact matches should be a case of "make them correct". Screw
performance. And tell people that they are slower.

Sure, I can imagine that MS would make some benchmark to show that case,
but at that point I just don't care.

Linus

tri...@samba.org

Feb 18, 2004, 11:14:33 PM

to Linus Torvalds, H. Peter Anvin, linux-...@vger.kernel.org

Linus,

> Why do you focus on linear directory scans?

Because a large number of file operations are on filenames that don't
exist. I have to *prove* they don't exist. That includes:

* every file create. I have to prove there wasn't an existing file
under a different case combination.

* every rename. Again, I have to prove that the destination name
doesn't exist.

* every open of a nonexistent name (*very* common, it's what MS
Office does all the time).

etc etc.

If I had a single function that could quickly tell me that a file does
not exist in any case combination then I would be much better off.

> They simply do not happen under any reasonable IO patterns. You look up
> names under the same name that they are on the disk. So the _only_ thing
> that should matter is the exact match.

nope, see above. The most common pattern of accesses involves doing a
full directory scan on every access.

> Sure, I can imagine that MS would make some benchmark to show that case,
> but at that point I just don't care.

It's not just "some benchmark". It's the normal use case.

Cheers, Tridge

Linus Torvalds

Feb 18, 2004, 11:21:50 PM

to tri...@samba.org, H. Peter Anvin, linux-...@vger.kernel.org

On Thu, 19 Feb 2004 tri...@samba.org wrote:
>
> > Why do you focus on linear directory scans?
>
> Because a large number of file operations are on filenames that don't
> exist. I have to *prove* they don't exist.

And you only need to do that ONCE per name.

There is zero reason to do it over and over again, and there is zero
reason to push case insensitivity deep into the filesystem.

Have you checked how many filesystems we have? Hint:

ls -l fs/ | grep '^d' | wc

The thing is, you have to realize that Windows-compatibility is very very
much second-class. If you refuse to realize that, you can't argue
effectively, because you are arguing for things that simply WILL NOT
happen.

So instead of having this crazy windows-centric idea, I would suggest you
try to come up with ways to make it easier for you. I can tell you already
that it won't be everything you want or need, but quite frankly, your
choice is between _nada_ and something reasonable.

So give it up. We're not making the same STUPID mistakes that Microsoft
has done.

Linus

Pascal Schmidt

Feb 19, 2004, 12:09:17 AM

to tri...@samba.org, linux-...@vger.kernel.org

On Thu, 19 Feb 2004 00:40:21 +0100, you wrote in linux.kernel:

> Because a large number of file operations are on filenames that don't
> exist. I have to *prove* they don't exist. That includes:

Evil question: do you need to be case-preserving? 'Cause if not, you
could simply squash all incoming filenames from case-insensitive clients
to some canonical form (say, all lower-case) and use that.

--
Ciao,
Pascal

tri...@samba.org

Feb 19, 2004, 1:03:39 AM

to Pascal Schmidt, linux-...@vger.kernel.org

Pascal,

> Evil question: do you need to be case-preserving? 'Cause if not, you
> could simply squash all incoming filenames from case-insensitive clients
> to some canonical form (say, all lower-case) and use that.

yes, we have to be case preserving, but that's not the problem. Keeping
some name mapping in user space or xattrs is tedious but conceptually
easy and potentially quite efficient.

The problem is that Samba isn't the only program to be accessing these
directories. Multi-protocol file servers and file servers where users
also have local access are common. That means we can't assume that
some other filesystem user hasn't created a file which matches in a
case-insensitive manner. That means we need to do an awful lot of
directory scans.

I also understand the decision Linus has made that we won't be doing
anything fundamental at the filesystem level to fix this, so we will
just have to live with it.

Cheers, Tridge

Hua Zhong

Feb 19, 2004, 1:09:47 AM

to tri...@samba.org, Pascal Schmidt, linux-...@vger.kernel.org

> The problem is that Samba isn't the only program to be accessing these
> directories. Multi-protocol file servers and file servers where users
> also have local access are common. That means we can't assume that
> some other filesystem user hasn't created a file which matches in a
> case-insensitive manner. That means we need to do an awful lot of
> directory scans.

Do you also require NFSD or other file daemons to do the same
case-insensitivity check? Say you create a foo, how do you prevent NFSD
from creating FOO? What could you do about that?

tri...@samba.org

Feb 19, 2004, 1:48:29 AM

to hzh...@cisco.com, Pascal Schmidt, linux-...@vger.kernel.org

Hua,

> Do you also require NFSD or other file daemons to do the same
> case-insensitivity check?

no. That's the point of the per-process check. Only Samba needs to pay
the price.

> Say you create a foo, how do you prevent NFSD from creating FOO?
> What could you do about that?

You don't need to do anything in particular about it. I did explain
this earlier in this thread, but here goes again:

* samba always tries the name exactly as given by the client. If that
succeeds then we are done.

* if it doesn't find an exact match then it does a directory scan. It
uses the first case-insensitive matching name it finds, or if it
reaches the end of the directory then it concludes that the file
doesn't exist.

So if FOO and foo both exist in the filesystem, and someone asks for
FoO then it's pretty much random which one they get (ok, not exactly
random, but close enough for this argument). The thing is that just
making an arbitrary choice is a perfectly fine set of semantics. You
can't deal with this situation any more sanely, so don't even try.
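
The two steps above can be sketched in user space roughly as follows. This is not Samba's code: ci_resolve is a hypothetical helper written for this illustration, strcasecmp() folds case byte by byte according to the current locale (ASCII only in the "C" locale) rather than doing real UTF-8 case folding, and nothing here closes the race between the scan and whatever the caller does with the name afterwards.

```c
#define _POSIX_C_SOURCE 200809L
#include <assert.h>
#include <dirent.h>
#include <stdio.h>
#include <string.h>
#include <strings.h>
#include <sys/stat.h>

/* Resolve `name` inside `dir`. Returns 1 and stores the on-disk spelling in
 * `out` when found, 0 when no case-insensitive match exists, -1 on error. */
int ci_resolve(const char *dir, const char *name, char *out, size_t outlen)
{
    char path[4096];
    struct stat st;
    struct dirent *de;
    DIR *d;
    int n, found = 0;

    assert(dir != NULL && name != NULL && out != NULL);
    if (strlen(name) >= outlen)
        return -1;

    /* Step 1: try the name exactly as the client gave it. */
    n = snprintf(path, sizeof(path), "%s/%s", dir, name);
    if (n < 0 || (size_t)n >= sizeof(path))
        return -1;
    if (stat(path, &st) == 0) {
        strcpy(out, name);
        return 1;
    }

    /* Step 2: linear scan; the first case-insensitive match wins. */
    d = opendir(dir);
    if (d == NULL)
        return -1;
    while ((de = readdir(d)) != NULL) {
        if (strcasecmp(de->d_name, name) == 0) {
            /* A byte-wise match has the same length as `name`, so it fits. */
            strcpy(out, de->d_name);
            found = 1;
            break;
        }
    }
    closedir(d);
    return found;
}
```

Which of FOO and foo is returned for FoO depends only on the order in which readdir() happens to deliver the entries, which is the "pick one" behaviour described above. The return value 0 is the expensive case: it is only reached after every entry in the directory has been read.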

well, actually, there is something you could do that we don't do. We
could have some special marker that distinguishes files created by
windows clients and files created by unix clients, and preferentially
return the one created by windows clients. I just don't think this is
worth doing. Nobody has even complained (within earshot of me anyway)
of the current "pick one" method.

Cheers, Tridge

Theodore Ts'o

Feb 19, 2004, 2:47:23 AM

to tri...@samba.org, Pascal Schmidt, linux-...@vger.kernel.org

On Thu, Feb 19, 2004 at 12:01:53PM +1100, tri...@samba.org wrote:
> The problem is that Samba isn't the only program to be accessing these
> directories. Multi-protocol file servers and file servers where users
> also have local access are common. That means we can't assume that
> some other filesystem user hasn't created a file which matches in a
> case-insensitive manner. That means we need to do an awful lot of
> directory scans.

Actually, not necessarily. What if Samba gets notifications of all
filename renames and creates in the directory, so that after the
initial directory scan, it can keep track of what filenames are
present in the directory? It can then "prove the negative", as you
put it, without having to continuously do directory scans.

Yeah, there can be some race conditions, but Samba already has to deal
with the race condition where it tries to create "MaKeFiLe" either
just before or just after a Posix process creates "Makefile".

- Ted

Daniel Newby

Feb 19, 2004, 3:00:53 AM

to Linus Torvalds, Andrew Tridgell, Kernel Mailing List

Linus Torvalds wrote:
> So some variation of the interface
>
> int magic_open(
> /* Input arguments */
> const char *pathname,
> unsigned long flags,
> mode_t mode,

What about making the pathname hold the alternative cases for each
character, not just an exact string? If Samba wanted to open
"A File.txt", it would do

magic_open( "[a|A][ ][f|F][i|I][l|L][e|E][.][t|T][x|X][t|T]", ... )

The syntax shown is conceptual; the actual code would use binary
packing. Characters would be variable length to support UTF-8 and
the like.
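
For the conceptual text syntax only, the userland side might look like the sketch below. build_alt_pattern is a made-up name, the code handles ASCII letters only, and it makes no attempt at the binary packing or the variable-length UTF-8 characters mentioned above. Note also that the text form is ambiguous for names that themselves contain '[', '|' or ']', which is one more reason a real interface would want a packed encoding.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Write the alternatives pattern for `name` into `out`.
 * ASCII letters become "[x|X]", every other byte becomes "[c]".
 * Returns the pattern length, or -1 if `out` might be too small. */
int build_alt_pattern(const char *name, char *out, size_t outlen)
{
    size_t n = 0;

    assert(name != NULL && out != NULL);
    /* Worst case is 5 output bytes per input byte, plus the terminator. */
    if (strlen(name) * 5 + 1 > outlen)
        return -1;

    for (; *name != '\0'; name++) {
        unsigned char c = (unsigned char)*name;

        out[n++] = '[';
        if ((c >= 'A' && c <= 'Z') || (c >= 'a' && c <= 'z')) {
            out[n++] = (char)(c | 0x20);    /* lower-case form */
            out[n++] = '|';
            out[n++] = (char)(c & ~0x20);   /* upper-case form */
        } else {
            out[n++] = (char)c;             /* no alternatives */
        }
        out[n++] = ']';
    }
    out[n] = '\0';
    return (int)n;
}
```

The up-front size check is deliberately conservative: it reserves five bytes for every input byte even though non-letters only need three.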

Userland would be responsible for making a useful pathname. If it
tried something like "[aL|P|#][m|m]", the kernel would cheerfully
use it. The only sanity checking would be that special characters
like "/" and ":" cannot have alternatives.

Pros:

1. Filesystem names are looked up in kernel mode, where it might be
efficient. (Less grossly slow at least.)

2. But the kernel doesn't care about encodings and character sets.

3. No new kernel infrastructure needed. (I hope?) The case-
insensitive system calls don't take a performance hit.

4. The kernel can detect name collisions and decide what to do
based on a flag.

5. Lookup tables are totally in userland and outside locks. Each
app can use the table it finds appropriate.

6. A naughty app can't deadlock the filesystem.

7. Case-insensitive calls can be atomic, if you're willing to pay
the performance price. It's straightforward for magic_creat() to
refuse to create collisions.

Cons:

1. Looking up multiple alternatives is hairy. (Not that the other
approaches are much prettier.)

2. Massive filenames would get turned into something *really*
massive (five times as many bytes for a simple packing). Does this
break anything?

-- Daniel Newby

tri...@samba.org

Feb 19, 2004, 3:23:52 AM

to Theodore Ts'o, Pascal Schmidt, linux-...@vger.kernel.org

Ted,

> Actually, not necessarily. What if Samba gets notifications of all
> filename renames and creates in the directory, so that after the
> initial directory scan, it can keep track of what filenames are
> present in the directory? It can then "prove the negative", as you
> put it, without having to continuously do directory scans.

Currently dnotify doesn't give you the filename that is being
added/deleted/renamed. It just tells you that something has happened,
but not enough to actually maintain a name cache in user space.

That could be changed, so that on a dnotify event you do a fcntl() to
ask for the name of the file. Or perhaps we could cram it into the
structure the signal handler gets passed? I doubt that would make
sense, but maybe some signal guru can tell me otherwise. Maybe we
could even invent a new dnotify system where you do a read on a file
descriptor to get details on what event happened, and give some
"everything has changed" error when you run out of buffers.

If that happened then we could build our own dcache in user space, but
it will be a very second rate dcache, with a racy and slow update
mechanism that will in itself chew cpu. Maybe that's the best we can
do, or maybe I should be asking distro vendors if they would consider
a case-insensitive patch, especially the vendors aiming for
"enterprise" scalability which might include serving windows clients.

> Yeah, there can be some race conditions, but Samba already has to deal
> with the race condition where it tries to create "MaKeFiLe" either
> just before or just after a Posix process creates "Makefile".

yes, that's true.

The races aren't my primary concern really. I've spent the last week
doing profiling of a large Samba install, and after fixing a
horrendous scalability problem to do with fcntl locking (more on that
later) the next thing on the profile is stat() and directory
scans. That's why the efficiency of this stuff is a hot topic for me
right now.

It's not all as bleak as perhaps I make it seem though. I suspect
there is still quite a bit of improvement that can be made in Samba
just because our code is so messy that sometimes we do a stat() call
or a directory scan when perhaps we can prove that we don't need
to. The Samba4 code is much cleaner, and maybe we have room to keep
improving things for a couple of years by finding those inefficiencies
and fixing them. We will eventually hit a wall, but it could be a fair
way off. Maybe windows will be dead by then.

Cheers, Tridge

Jamie Lokier

Feb 19, 2004, 8:13:05 AM

to Linus Torvalds, tri...@samba.org, H. Peter Anvin, linux-...@vger.kernel.org

Linus Torvalds wrote:
> > > Why do you focus on linear directory scans?
> >
> > Because a large number of file operations are on filenames that don't
> > exist. I have to *prove* they don't exist.
>
> And you only need to do that ONCE per name.
>
> There is zero reason to do it over and over again, and there is zero
> reason to push case insensitivity deep into the filesystem.

Linus, while I agree with you wholeheartedly on everything else in
this thread - how can Samba only do that lookup ONCE per name if a
client is issuing many requests for non-existent opens or stats?

Example: A client has a search path for executables or libraries.

Each time SomeThing.DLL is looked up by the client, it will issue an
open() for each entry in the path, until it finds the file it wants.

For each request, Samba must readdir() every directory in the path
until the file is found.

If a directory doesn't change between requests, Samba can use dnotify
to cache the negative lookups.

However, if any change occurs in a directory, or if the directory is
not dnotify-capable, Samba is not allowed to cache these negative
results: It has to do the readdir() for _every_ request.

-- Jamie

Helge Hafting

Feb 19, 2004, 10:08:01 AM

to tri...@samba.org, Theodore Ts'o, Pascal Schmidt, linux-...@vger.kernel.org

tri...@samba.org wrote:
> Ted,
>
> > Actually, not necessarily. What if Samba gets notifications of all
> > filename renames and creates in the directory, so that after the
> > initial directory scan, it can keep track of what filenames are
> > present in the directory? It can then "prove the negative", as you
> > put it, without having to continuously do directory scans.
>
> Currently dnotify doesn't give you the filename that is being
> added/deleted/renamed. It just tells you that something has happened,
> but not enough to actually maintain a name cache in user space.
>

You can still keep per-directory caches that you simply invalidate on each
dnotify, and rebuild when necessary. At least it would help the "repeated
lookup of nonexistent filenames" case.
Path searches for executables usually happen on directories that don't
see much writing.
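
A minimal sketch of such an invalidate-and-rebuild cache, assuming the notification only says "something changed": struct dir_cache and the dc_* functions are hypothetical names, the lookup inside the cache is a plain loop where a real server would use a hash, and dc_invalidate() is called by hand here where a real program would call it in response to the dnotify signal.

```c
#define _POSIX_C_SOURCE 200809L
#include <assert.h>
#include <dirent.h>
#include <stdio.h>
#include <string.h>
#include <strings.h>
#include <sys/stat.h>

#define DC_MAX_NAMES 256
#define DC_NAME_LEN  256

struct dir_cache {
    char path[1024];
    char names[DC_MAX_NAMES][DC_NAME_LEN];
    int  count;
    int  valid;     /* cleared whenever the directory is reported changed */
    int  rescans;   /* number of full directory scans paid for so far */
};

void dc_init(struct dir_cache *c, const char *path)
{
    assert(c != NULL && path != NULL && strlen(path) < sizeof(c->path));
    memset(c, 0, sizeof(*c));
    strcpy(c->path, path);
}

/* What the change notification would trigger. */
void dc_invalidate(struct dir_cache *c)
{
    c->valid = 0;
}

static int dc_rebuild(struct dir_cache *c)
{
    DIR *d = opendir(c->path);
    struct dirent *de;

    if (d == NULL)
        return -1;
    c->count = 0;
    while ((de = readdir(d)) != NULL) {
        if (c->count == DC_MAX_NAMES || strlen(de->d_name) >= DC_NAME_LEN) {
            closedir(d);
            return -1;   /* too big to cache: caller must scan instead */
        }
        strcpy(c->names[c->count++], de->d_name);
    }
    closedir(d);
    c->valid = 1;
    c->rescans++;
    return 0;
}

/* 1 = some entry matches case-insensitively, 0 = proven absent, -1 = error. */
int dc_exists(struct dir_cache *c, const char *name)
{
    if (!c->valid && dc_rebuild(c) != 0)
        return -1;
    for (int i = 0; i < c->count; i++)
        if (strcasecmp(c->names[i], name) == 0)
            return 1;
    return 0;
}
```

Any number of negative lookups between two changes costs a single directory scan, which is the gain for rarely written directories. The cache is still only as fresh as the last notification that has been processed, so a lookup can race with a change that has not been reported yet.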

Helge Hafting

Paulo Marques

Feb 19, 2004, 12:14:04 PM

to tri...@samba.org, Theodore Ts'o, Pascal Schmidt, linux-...@vger.kernel.org

tri...@samba.org wrote:

> Currently dnotify doesn't give you the filename that is being
> added/deleted/renamed. It just tells you that something has happened,
> but not enough to actually maintain a name cache in user space.

This might be a crazy / stupid idea, so flame at will :)

Wouldn't it be possible to do a samba "super-server" mode, in which samba would
assume that it controlled the directories it is exporting?

In this mode a "corporate" Samba server, serving Windows clients, could improve
performance by assuming that its cache was always up-to-date.

If we wanted to access the directory locally we could always mount locally
using samba, and access the files anyway, albeit a lot slower and without linux
permissions, etc.

What we would gain was the ability to say "I want to give priority to my samba
server" (and set it to "super-server" mode) or "my priority is to the linux
native filesystem, and just want to share my files with windows users anyway"
(and keep using samba as always).

--
Paulo Marques - www.grupopie.com

"In a world without walls and fences who needs windows and gates?"

Theodore Ts'o

Feb 19, 2004, 2:10:56 PM

to tri...@samba.org, Pascal Schmidt, linux-...@vger.kernel.org

On Thu, Feb 19, 2004 at 02:20:44PM +1100, tri...@samba.org wrote:
> Currently dnotify doesn't give you the filename that is being
> added/deleted/renamed. It just tells you that something has happened,
> but not enough to actually maintain a name cache in user space.
>
> That could be changed, so that on a dnotify event you do a fcntl() to
> ask for the name of the file. Or perhaps we could cram it into the
> structure the signal handler gets passed? I doubt that would make
> sense, but maybe some signal guru can tell me otherwise. Maybe we
> could even invent a new dnotify system where you do a read on a file
> descriptor to get details on what event happened, and give some
> "everything has changed" error when you run out of buffers.

Yes, that's what I was suggesting. One advantage of such a scheme is
that it's not just for Windows compatibility. A more rich directory
change notification scheme would also be useful for graphical file
managers, automatic indexing tools, and many, many other applications.

No, it's not everything you were requesting, but it may very well
represent three-quarters of a loaf, instead of nothing.

> If that happened then we could build our own dcache in user space, but
> it will be a very second rate dcache, with a racy and slow update
> mechanism that will in itself chew cpu. Maybe that's the best we can
> do, or maybe I should be asking distro vendors if they would consider
> a case-insensitive patch, especially the vendors aiming for
> "enterprise" scalability which might include serving windows clients.

I don't know that the update mechanism has to seriously chew that much
CPU. It can certainly be designed to minimize the amount of CPU
that is consumed, especially if it is read via a file descriptor so
that multiple updates can be sent via a single read() system call,
instead of sending a signal every single time a directory entry is
created, renamed, or deleted.
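
To make the batching idea concrete, here is a small user-space model of the queue that would sit behind such a file descriptor. Everything in it is hypothetical (struct ev_queue, evq_post, evq_read, the event kinds, the tiny queue size); it only shows the two properties discussed: several events are drained by one read-like call, and running out of slots collapses into a single "everything has changed" record.

```c
#include <assert.h>
#include <string.h>

#define EVQ_SLOTS    4
#define EVQ_NAME_MAX 64

enum ev_kind { EV_CREATE, EV_DELETE, EV_RENAME, EV_OVERFLOW };

struct dir_event {
    enum ev_kind kind;
    char name[EVQ_NAME_MAX];
};

struct ev_queue {
    struct dir_event slot[EVQ_SLOTS];
    int head;        /* index of the oldest queued event */
    int count;       /* number of queued events */
    int overflowed;  /* set once an event had to be dropped */
};

void evq_init(struct ev_queue *q)
{
    memset(q, 0, sizeof(*q));
}

/* Producer side (the kernel, in the proposal). */
void evq_post(struct ev_queue *q, enum ev_kind kind, const char *name)
{
    struct dir_event *e;

    assert(q != NULL && name != NULL);
    if (q->count == EVQ_SLOTS) {
        q->overflowed = 1;          /* out of buffers: remember the loss */
        return;
    }
    e = &q->slot[(q->head + q->count) % EVQ_SLOTS];
    e->kind = kind;
    strncpy(e->name, name, EVQ_NAME_MAX - 1);
    e->name[EVQ_NAME_MAX - 1] = '\0';
    q->count++;
}

/* Consumer side: copy up to `max` events into `out`, like one read().
 * After a loss, report one EV_OVERFLOW and discard the rest, because the
 * reader has to rescan the directory anyway. Returns the event count. */
int evq_read(struct ev_queue *q, struct dir_event *out, int max)
{
    int n = 0;

    if (max <= 0)
        return 0;
    if (q->overflowed) {
        out[0].kind = EV_OVERFLOW;
        out[0].name[0] = '\0';
        q->head = q->count = q->overflowed = 0;
        return 1;
    }
    while (n < max && q->count > 0) {
        out[n++] = q->slot[q->head];
        q->head = (q->head + 1) % EVQ_SLOTS;
        q->count--;
    }
    return n;
}
```

A reader that receives EV_OVERFLOW falls back to the full rescan it would have done without names, so the worst case is no worse than plain dnotify while the common case avoids the scan entirely.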

The problem with a case-insensitive patch is that for most modern
filesystems (i.e., any filesystem that does better than O(n) directory
searches), it will have to involve a format change, since the case
insensitivity has to be built into the hash function or the tree
comparison function, or both. At this point, the filesystem author has
to make the choice of whether to try to solve the Windows-specific
problem, in which case the fundamental filesystem format would have to
be tailored to the Windows case mapping table, or try to solve the
more general I18N case mapping problem. (Lots of luck! It's
constantly changing over time as new character sets are added or
modified...) Yes, a few such filesystems might have this support
already, but I doubt distributions would be willing to accept patches
that make filesystem format-incompatible changes just for the sake of
accelerating Samba operations.

I don't know if the distributions would be willing to accept a
case-insensitive patch, but my suspicion is that it would be
difficult, and I would argue that it might be more efficient to get a
richer directory change notification system, for the reasons I argued
above.

- Ted

Linus Torvalds

Feb 19, 2004, 4:06:50 PM

to Jamie Lokier, tri...@samba.org, H. Peter Anvin, linux-...@vger.kernel.org

On Thu, 19 Feb 2004, Jamie Lokier wrote:
>
> Linus, while I agree with you wholeheartedly on everything else in
> this thread - how can Samba only do that lookup ONCE per name if a
> client is issuing many requests for non-existent opens or stats?

While I'm not willing to push case insensitivity deep into the
filesystems, I _am_ willing to entertain the notion of an extra flag to a
dcache entry that the regular VFS operations ignore (apart from clearing
it when they change anything and having to flush them under some
circumstances), which would basically be "this dentry has been judged
unique in a case-insensitive environment".

So assuming nobody else is touching the directory, the case-insensitive
special module could create these kinds of dentries to its heart's content
when it does a lookup.

> Example: A client has a search path for executables or libraries.
>
> Each time SomeThing.DLL is looked up by the client, it will issue an
> open() for each entry in the path, until it finds the file it wants.
>
> For each request, Samba must readdir() every directory in the path
> until the file is found.
>
> If a directory doesn't change between requests, Samba can use dnotify
> to cache the negative lookups.
>
> However, if any change occurs in a directory, or if the directory is
> not dnotify-capable, Samba is not allowed to cache these negative
> results: It has to do the readdir() for _every_ request.

But this is exactly what I _am_ willing to entertain: have some limited
special logic inside the kernel (but outside the VFS layer proper), that
allows samba to use special interfaces that avoid this.

For example, the rule can be that _any_ regular dentry create will
invalidate all the "case-insensitive" dentries. Just to be simple about
it. But if samba is the only thing that accesses a certain directory (or
the directory is not written to, like / and /usr etc usually behave), the
"windows hack" interface will be able to populate it with its fake
dentries all it wants.
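
A toy user-space model of that invalidation rule, for illustration only. It is not kernel code and none of these names exist in the VFS; the per-directory generation counter is simply one cheap way to make "any regular create invalidates all of them" cost a single increment instead of a walk over the marked entries.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical per-directory state. */
struct dir_state {
    unsigned long generation;   /* bumped by every regular create */
};

/* Hypothetical "judged unique in a case-insensitive environment" mark. */
struct ci_mark {
    unsigned long seen_generation;  /* directory generation when judged */
    int judged;
};

/* A regular create through the normal VFS path. */
void regular_create(struct dir_state *d)
{
    assert(d != NULL);
    d->generation++;            /* implicitly invalidates every mark */
}

/* The helper interface records its verdict after doing the full scan. */
void ci_mark_unique(const struct dir_state *d, struct ci_mark *m)
{
    m->seen_generation = d->generation;
    m->judged = 1;
}

/* May the cached verdict still be trusted? */
int ci_mark_valid(const struct dir_state *d, const struct ci_mark *m)
{
    return m->judged && m->seen_generation == d->generation;
}
```

In a directory that is never written through the normal path, the generation never moves and every verdict stays valid, which matches the / and /usr case mentioned above.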

Or something like this. Basically, I'm convinced that the problem _can_ be
solved without going deep into the VFS layer. Maybe I'm wrong. But I'd
better not be, because we're definitely not going to screw up the VFS
layer for Windows.

Linus

Jamie Lokier

Feb 19, 2004, 4:41:15 PM

to Linus Torvalds, tri...@samba.org, H. Peter Anvin, linux-...@vger.kernel.org

Linus Torvalds wrote:
> For example, the rule can be that _any_ regular dentry create will
> invalidate all the "case-insensitive" dentries. Just to be simple about
> it.

If that's the rule, then with exactly the same algorithmic efficiency,
readdir+dnotify can be used to maintain the cache in userspace
instead. There is nothing gained by using the helper module in that case.

It follows that a helper module is only useful if readdir+dnotify
isn't fast enough, and the invalidation rule has to be more selective.

(Although, maybe there are atomicity concerns I haven't thought of).

-- Jamie

Linus Torvalds

Feb 19, 2004, 4:51:48 PM

to Jamie Lokier, tri...@samba.org, H. Peter Anvin, linux-...@vger.kernel.org

On Thu, 19 Feb 2004, Jamie Lokier wrote:

> Linus Torvalds wrote:
> > For example, the rule can be that _any_ regular dentry create will
> > invalidate all the "case-insensitive" dentries. Just to be simple about
> > it.
>
> If that's the rule, then with exactly the same algorithmic efficiency,
> readdir+dnotify can be used to maintain the cache in userspace
> instead. There is nothing gained by using the helper module in that case.

Wrong.

Because the dnotify would trigger EVEN FOR SAMBA OPERATIONS.

Think about it. Think about samba doing a "rename()" within the directory.

Linus

Jamie Lokier

Feb 19, 2004, 6:34:30 PM

to Linus Torvalds, tri...@samba.org, H. Peter Anvin, linux-...@vger.kernel.org

Linus Torvalds wrote:
> > > For example, the rule can be that _any_ regular dentry create will
> > > invalidate all the "case-insensitive" dentries. Just to be simple about
> > > it.
> >
> > If that's the rule, then with exactly the same algorithmic efficiency,
> > readdir+dnotify can be used to maintain the cache in userspace
> > instead. There is nothing gained by using the helper module in that case.
>
> Wrong.
> Because the dnotify would trigger EVEN FOR SAMBA OPERATIONS.

Ah, I didn't know you meant "_any_ regular dentry create (except for
Samba operations)".

To apply that rule, you either need alternate versions of rename() and
other file syscalls, or something akin to a process-specific flag (set
by the helper module) saying that this is a Samba process and dentry
creation _by this process_ shouldn't invalidate case-insensitive
dentries.

And if you have either of those, the bit of code which says "don't
invalidate case-insensitive dentries because this is a Samba process"
can just as easily say "don't send dnotify events to the current
process".

And once you've done that, it's easier just to add a DN_IGNORE_SELF
flag to dnotify meaning to ignore events caused by the current
process, and forget about the helper module. That'd be useful for
other programs, too.
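
The filtering itself is tiny, as this user-space model suggests. DN_IGNORE_SELF is only the proposal made here, so the flag value, struct watcher and deliver_change below are all invented for the illustration; the point is just that the check is one comparison against the process that caused the change.

```c
#include <assert.h>
#include <stddef.h>

#define WATCH_IGNORE_SELF 0x1u  /* hypothetical, modelled on DN_IGNORE_SELF */

struct watcher {
    int      owner_pid;  /* process that registered the watch */
    unsigned flags;
    int      pending;    /* notifications queued for the owner */
};

/* Called for each change in the watched directory; `actor_pid` is the
 * process that made the change. */
void deliver_change(struct watcher *w, int actor_pid)
{
    assert(w != NULL);
    if ((w->flags & WATCH_IGNORE_SELF) && actor_pid == w->owner_pid)
        return;          /* our own change: the owner's cache already knows */
    w->pending++;        /* someone else's change: owner must invalidate */
}
```

With the flag set, a server's own rename() leaves its cache untouched, while a change made by any other process still produces a notification.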

-- Jamie

Helge Hafting

Feb 19, 2004, 6:55:39 PM

to Paulo Marques, tri...@samba.org, Theodore Ts'o, Pascal Schmidt, linux-...@vger.kernel.org

On Thu, Feb 19, 2004 at 12:11:32PM +0000, Paulo Marques wrote:
> tri...@samba.org wrote:
>
> >Currently dnotify doesn't give you the filename that is being
> >added/deleted/renamed. It just tells you that something has happened,
> >but not enough to actually maintain a name cache in user space.
>
> This might be a crazy / stupid idea, so flame at will :)
>
> Wouldn't it be possible to do a samba "super-server" mode, in which samba
> would assume that it controlled the directories it is exporting?
>
> In this mode a "corporate" Samba server, serving Windows clients, could
> improve performance by assuming that its cache was always up-to-date.
>
> If we wanted to access the directory locally we could always mount
> locally using samba, and access the files anyway, albeit a lot slower and
> without linux permissions, etc.
>

You don't really need to go to such extremes. Samba can use dnotify,
and run with caching and great performance as long as nobody touches
the files in other ways. There is no need to _enforce_ it though,
samba can cope by invalidating the cache on those rare occasions
the files are accessed in other ways. It won't happen often, because:

1. Linux/nfs people have no business in a directory full of
windows .dll's and .exe's
2. On a corporate server you simply tell people to stay out.
nfs may export another set of homedirs for the unix people.

> What we would gain was the ability to say "I want to give priority to my
> samba server" (and set it to "super-server" mode) or "my priority is to the
> linux native filesystem, and just want to share my files with windows users
> anyway" (and keep using samba as always).
>

Thanks to dnotify even the "linux priority" setup will be able to benefit
from a cache. Particularly if we can get a dnotify that doesn't trip
when samba is the one making changes.

Helge Hafting

Feb 19, 2004, 7:04:15 PM2/19/04

to Linus Torvalds, Jamie Lokier, tri...@samba.org, H. Peter Anvin, linux-...@vger.kernel.org

On Thu, Feb 19, 2004 at 08:54:51AM -0800, Linus Torvalds wrote:
>
>
> On Thu, 19 Feb 2004, Jamie Lokier wrote:
> > Linus Torvalds wrote:
> > > For example, the rule can be that _any_ regular dentry create will
> > > invalidate all the "case-insensitive" dentries. Just to be simple about
> > > it.
> >
> > If that's the rule, then with exactly the same algorithmic efficiency,
> > readdir+dnotify can be used to maintain the cache in userspace
> > instead. There is nothing gained by using the helper module in that case.
>
> Wrong.
>
> Because the dnotify would trigger EVEN FOR SAMBA OPERATIONS.
>
> Think about it. Think about samba doing a "rename()" within the directory.

Avoiding its own operations is a nice one. Could dnotify pass
some information, such as the inode number involved to samba?
samba could then look up the filename in its cache and take a
closer look at that file only. That would avoid losing the cache,
even in case of other processes intruding.

Helge Hafting

Linus Torvalds

Feb 19, 2004, 7:46:01 PM2/19/04

to Tridge, Al Viro, Jamie Lokier, H. Peter Anvin, Kernel Mailing List

Ok,
I think I've got it. Here's an algorithm that will have "perfect"
behaviour under normal circumstances as long as you've got enough memory.

Admittedly the "you've got enough memory" part is a downside, but it's so
_damn_ clean and simple that it is, I think, a reasonable trade-off.
Besides, if you want good file serving numbers, you'd better have enough
memory anyway.

Basic approach: add two bits to the VFS dentry flags. That's all that is
needed. Then you have two new system calls:

- set_bit_one(dirfd)
- set_bit_two_if_one_is_set(dirfd);
- check_or_create_name(dirfd, name, case_table_pointer, newfd);

The VFS rule is:
- all new dentries start off with the two magic bits clear
- whenever we shrink a dentry, we clear the two magic bits in the parent

and that is _all_ the VFS layer ever does. Even Al won't find this
obnoxious (yeah, we might clear the bits after a timeout on things that
need re-validation, but that's in the noise).

The "set_bit_one()" system call will set one of the magic bits (with the
dcache lock held) in the dentry that is pointed to by the file descriptor.
Nothing more.

The "set_bit_two_if_one_is_set()" system call will set the _other_ magic
bit (with the dcache lock held) in the dentry, if the first bit is set.
Otherwise it will just return.

Let's leave the "check_or_create_name()" thing for now, and see how we can
use this in user space (and realize that we only do this on cache failure,
so this is the "slow case"):

    set_bit_one(dirfd);
    lseek(dirfd, 0, SEEK_SET);
    while (readdir(dirfd, de)) {
        stat(de->d_name);
        /* .. might also compare the name here with whatever it is
           working on right now.. */
    }
    set_bit_two_if_one_is_set(dirfd);

Notice what the above does? After the above loop, bit two will be set IFF
the dentry cache now contains every single name in the directory.
Otherwise it will be clear. Bit two will basically be a "dcache complete"
bit.
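
The bit rules above can be sketched as a tiny userspace model. This is only an illustration under stated assumptions: the struct, the flag values and every `model_` name are hypothetical, and the real proposal would keep the bits in the kernel's dentry flags and set them with the dcache lock held.

```c
#include <stdbool.h>

/* Hypothetical flag values; the mail only says "two bits". */
#define BIT_ONE 0x1u    /* a fill pass has been started */
#define BIT_TWO 0x2u    /* "dcache complete" */

/* Stands in for the directory's dentry; not a kernel structure. */
struct model_dir {
    unsigned flags;
};

/* set_bit_one(): mark that a fill pass is in progress. */
static void model_set_bit_one(struct model_dir *d)
{
    d->flags |= BIT_ONE;
}

/* VFS rule: shrinking any child clears both bits in the parent. */
static void model_shrink_child(struct model_dir *parent)
{
    parent->flags &= ~(BIT_ONE | BIT_TWO);
}

/* set_bit_two_if_one_is_set(): succeeds only if bit one survived,
 * i.e. nothing was shrunk since the fill pass started. */
static bool model_set_bit_two_if_one_is_set(struct model_dir *d)
{
    if (!(d->flags & BIT_ONE))
        return false;
    d->flags |= BIT_TWO;
    return true;
}

/* True while the cache is known to hold every name in the directory. */
static bool model_dcache_complete(const struct model_dir *d)
{
    return (d->flags & BIT_TWO) != 0;
}
```

If a child is shrunk anywhere between `model_set_bit_one()` and `model_set_bit_two_if_one_is_set()`, the second call returns false and the directory never gets marked complete, which is the property the readdir loop relies on.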

Now, let's go to "check_or_create_name()", which can thus do:

- for each name in the dcache name list, compare the dang thing
without case.
- return "lookup succeeded" (the file descriptor of the thing it
successfully looked up) if a match with a positive dentry occurs.
- check bit two, return -ENOCACHE if it was clear.
- create the new dentry with the new name and the new file descriptor
inode, and return success.

Notice? Basically _ZERO_ changes to the VFS layer, together with basically
perfect hot-cache-case behaviour.
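
The steps of "check_or_create_name()" can be sketched in userspace roughly as follows. Everything here is an assumption made for illustration: the fixed-size name array stands in for the dcache name list, `MODEL_ENOCACHE` is a made-up value (the mail's -ENOCACHE is not an existing errno), POSIX `strcasecmp()` stands in for the case table, and negative dentries and file descriptors are left out.

```c
#include <stddef.h>
#include <strings.h>    /* strcasecmp (POSIX) */

#define MODEL_ENOCACHE  1000    /* hypothetical error value */
#define MODEL_MAX_NAMES 16

/* Stand-in for one directory's cached names; not a kernel structure. */
struct model_dir {
    int complete;                        /* "bit two" */
    size_t count;
    const char *names[MODEL_MAX_NAMES];  /* names present in the cache */
};

/*
 * Returns the index of a case-insensitive match, or -MODEL_ENOCACHE if
 * nothing matched and the cache is not known to be complete. Otherwise
 * creates the entry, sets *created to 1 and returns its index
 * (-1 if the model's array is full).
 */
static int model_check_or_create_name(struct model_dir *d, const char *name,
                                      int *created)
{
    *created = 0;

    for (size_t i = 0; i < d->count; i++)
        if (strcasecmp(d->names[i], name) == 0)
            return (int)i;               /* lookup succeeded */

    if (!d->complete)
        return -MODEL_ENOCACHE;          /* caller falls back to readdir */

    if (d->count == MODEL_MAX_NAMES)
        return -1;

    d->names[d->count] = name;           /* create with the new name */
    *created = 1;
    return (int)d->count++;
}
```

The point of the ordering is that a miss only turns into a create when the cache is complete; otherwise a matching name could exist on disk without being in the list.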

Yeah, yeah, the above is probably glossing over a lot of issues (there's a
race if somebody does both the "readdir loop" and the "create" case at the
same time, so that would need a lock around it in user space, but please
realize that the readdir loop only happens if the "check_or_create()"
thing fails, so the readdir loop should basically never happen in the
hot-cache case).

And the above allows perfect behaviour even for new filenames that we have
never seen before (ie a create of a new file with a random name). At least
as long as the dcache for that directory remains "complete" (which it will
do, until the kernel needs to throw something out).

Am I a super-intelligent bastard, or am I a complete nincompoop? You
decide.

Linus

Linus Torvalds

Feb 19, 2004, 7:50:21 PM2/19/04

to Tridge, Al Viro, Jamie Lokier, H. Peter Anvin, Kernel Mailing List

On Thu, 19 Feb 2004, Linus Torvalds wrote:
>
> Basic approach: add two bits to the VFS dentry flags. That's all that is
> needed. Then you have two new system calls:

^^^

> - set_bit_one(dirfd)
> - set_bit_two_if_one_is_set(dirfd);
> - check_or_create_name(dirfd, name, case_table_pointer, newfd);

[ deletia ]

> Am I a super-intelligent bastard, or am I a complete nincompoop? You
> decide.

I think my lack of counting ability basically answers that question.

Damn.

Linus "complete nincompoop" Torvalds

H. Peter Anvin

Feb 19, 2004, 7:54:11 PM2/19/04

to Linus Torvalds, Tridge, Al Viro, Jamie Lokier, Kernel Mailing List

Linus Torvalds wrote:
>
> On Thu, 19 Feb 2004, Linus Torvalds wrote:
>
>>Basic approach: add two bits to the VFS dentry flags. That's all that is
>>needed. Then you have two new system calls:
>
> ^^^
>
>> - set_bit_one(dirfd)
>> - set_bit_two_if_one_is_set(dirfd);
>> - check_or_create_name(dirfd, name, case_table_pointer, newfd);
>
>
> [ deletia ]
>
>
>>Am I a super-intelligent bastard, or am I a complete nincompoop? You
>>decide.
>
>
> I think my lack of counting ability basically answers that question.
>
> Damn.
>

How about a compromise - super-intelligent complete nincompoop bastard?

[:^)

-hpa

Linus Torvalds

Feb 19, 2004, 8:01:50 PM2/19/04

to H. Peter Anvin, Tridge, Al Viro, Jamie Lokier, Kernel Mailing List

On Thu, 19 Feb 2004, H. Peter Anvin wrote:
>
> How about a compromise - super-intelligent complete nincompoop bastard?

Ok, but in the meantime I think I can save face by saying that you only
need two system calls, by simply making a "lseek(fd, 0, SEEK_SET)"
implicitly set the first bit. So then the "set second bit if first is set"
just becomes a "dcache fill complete" notifier.

So I'll take half credit.

Linus "super-complete bastard" Torvalds

vi...@parcelfarce.linux.theplanet.co.uk

Feb 19, 2004, 8:08:58 PM2/19/04

to Linus Torvalds, Tridge, Jamie Lokier, H. Peter Anvin, Kernel Mailing List

On Thu, Feb 19, 2004 at 11:48:50AM -0800, Linus Torvalds wrote:
> The VFS rule is:
> - all new dentries start off with the two magic bits clear
> - whenever we shrink a dentry, we clear the two magic bits in the parent
>
> and that is _all_ the VFS layer ever does. Even Al won't find this
> obnoxious (yeah, we might clear the bits after a timeout on things that
> need re-validation, but that's in the noise).

> Notice what the above does? After the above loop, bit two will be set IFF
> the dentry cache now contains every single name in the directory.
> Otherwise it will be clear. Bit two will basically be a "dcache complete"
> bit.

What about dentry getting dropped in the middle of that loop _and_
another task setting the first bit again before the loop ends?

Robert White

Feb 19, 2004, 8:15:42 PM2/19/04

to tri...@samba.org, Theodore Ts'o, Pascal Schmidt, linux-...@vger.kernel.org

(I may, of course, be overly naive... but a thought occurs... 8-)

It would seem that there is a moment of opportunity at the
dentry_operations invocation point to harvest all the information you would
need to maintain a specialized dcache in a separate module. Unfortunately,
since the individual file systems get to tweak their own pointer(s) to
this/these struct-of-calls it could get hard to hijack things at that level.

With two changes to core Linux behavior, which could easily be implemented
as a configurable kernel option, you could create an advisory hook.

1) add a usually-null pointer(*) to dentry_operations structure to the
superblock data structure in vfs (and, of course, an install/remove
structure call pair) as a look-aside mechanism, and

2) if-not-null "parallel" invocations of these "advisory" calls are then
added to the fixed vfs invocation points along side the normal dentry
notices...

You could then add any imaginable advisory behavior to any file system. A
well crafted module could then attach to file systems, flag directories (+),
and get low-level advisory service at core dentry action time.

A module so attached could answer all your negative enquiries quickly and
yet remain nicely segregated. You could probably create the magic_open
dream logic of your choice and net near, if not absolute, race elimination.

You still might have to readdir a whole directory from time to time just to
clean up a partially aged cache, but there would be no need for the stepwise
transfer of this information into the user context.

100% of the native function of each file system is preserved and there are
probably other applications for this look-aside feature like low-level
security auditing or semantic mirroring (a-la real-time rdist).

But, you know, just a thought...

Rob.

(*) this should, if enabled, be arranged as a linked list of structures so
that multiple modules could be installed for different purposes.

(+) flagging and un-flagging directories of interest ad-hoc is needed to
prevent saturation of resources.

Linus Torvalds

Feb 19, 2004, 8:20:17 PM2/19/04

to vi...@parcelfarce.linux.theplanet.co.uk, Tridge, Jamie Lokier, H. Peter Anvin, Kernel Mailing List

On Thu, 19 Feb 2004 vi...@parcelfarce.linux.theplanet.co.uk wrote:

> On Thu, Feb 19, 2004 at 11:48:50AM -0800, Linus Torvalds wrote:
> > The VFS rule is:
> > - all new dentries start off with the two magic bits clear
> > - whenever we shrink a dentry, we clear the two magic bits in the parent
> >
> > and that is _all_ the VFS layer ever does. Even Al won't find this
> > obnoxious (yeah, we might clear the bits after a timeout on things that
> > need re-validation, but that's in the noise).
>
> > Notice what the above does? After the above loop, bit two will be set IFF
> > the dentry cache now contains every single name in the directory.
> > Otherwise it will be clear. Bit two will basically be a "dcache complete"
> > bit.
>
> What about dentry getting dropped in the middle of that loop _and_
> another task setting the first bit again before the loop ends?

Hey, you snipped the part where I said that the application has to have
its own locking around the loop and around the lookup to avoid races.

We can avoid that requirement by using sequence numbers and making it a
bit more complex, but the simple version was for samba only (ie "only one
app that wants this").
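
One possible reading of the sequence-number idea, sketched as a userspace model. The mail gives no details, so the counter, the struct and the `model_` names are all assumptions: the directory keeps a counter that is bumped on every shrink, and a fill pass may only mark the directory complete if the counter has not moved since that pass began.

```c
#include <stdbool.h>

/* Stand-in for the directory's dentry; not a kernel structure. */
struct model_dir {
    unsigned long shrink_seq;   /* bumped whenever a child is shrunk */
    bool complete;              /* the "dcache complete" state */
};

/* Start of a fill pass: remember the current sequence number. */
static unsigned long model_fill_begin(const struct model_dir *d)
{
    return d->shrink_seq;
}

/* A child dentry was dropped: the cache is no longer complete. */
static void model_shrink_child(struct model_dir *d)
{
    d->shrink_seq++;
    d->complete = false;
}

/* End of a fill pass: only valid if no shrink happened in between. */
static bool model_fill_end(struct model_dir *d, unsigned long start_seq)
{
    if (d->shrink_seq != start_seq)
        return false;
    d->complete = true;
    return true;
}
```

Unlike a single shared bit, each pass carries its own starting value, so a second task starting a pass in the middle cannot make a stale first pass look valid.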

Realize that none of this makes the internal kernel (or filesystem) data
structures be wrong, so even if the app has a bug and doesn't do the right
locking, at worst that just results in problems for that application, not
for the rest of the system.

But yes, if we want to make others use this, we'd need to have the kernel
actually support some kind of locking, probably by just making the whole
readdir loop be inside the kernel itself (at which point we can use the
inode semaphore for this).

The "dcache full" bit could be potentially useful regardless of any
case-ignorant operating system emulation crap, although I don't see any
really obvious applications (we could speed up regular "readdir()", but we
don't have the d_offset thing, so..)

Linus

Linus Torvalds

Feb 19, 2004, 8:32:23 PM2/19/04

to vi...@parcelfarce.linux.theplanet.co.uk, Tridge, Jamie Lokier, H. Peter Anvin, Kernel Mailing List

On Thu, 19 Feb 2004, Linus Torvalds wrote:
> >
> > What about dentry getting dropped in the middle of that loop _and_
> > another task setting the first bit again before the loop ends?
>
> Hey, you snipped the part where I said that the application has to have
> its own locking around the loop and around the lookup to avoid races.

[ That, btw, implies that we do need to make the "set bit one" a system
call of its own, so that somebody else's "lseek(fd, 0, SEEK_SET)" wouldn't
mess up. Mea culpa. ]

Anyway, if we're willing to make some other changes to the VFS layer, we
could make all of this a bit more efficient by _not_ requiring the actual
filesystem lookup to take place.

If we had a flag that allowed a dentry to not have a d_inode pointer, but
still _not_ be considered automatically negative, then we could just make
a loop that fills the dcache directly from the readdir() data inside the
kernel, without calling down to the filesystem to look up the inode.

That would save a _lot_ of memory - quite often we'd only need the dentry
itself.

That would require a third bit in the VFS dentry flags (something like
D_DENTRY_LIKELY_POSITIVE), and would require that "d_lookup()" not just
assume that a dentry without an inode is always negative (check the new
flag, and if so, do the filesystem lookup when the lookup actually
happens).

Doesn't look _too_ bad, and considering the potential memory savings (and
not having to seek around the disk to look up the inode data), it would
probably be worth thinking about at least as a "second stage".

So then we could have a dcache that is fully populated, even though the
actual inode data hasn't been loaded yet.

Comments?

vi...@parcelfarce.linux.theplanet.co.uk

Feb 19, 2004, 8:48:31 PM2/19/04

to Linus Torvalds, Tridge, Jamie Lokier, H. Peter Anvin, Kernel Mailing List

On Thu, Feb 19, 2004 at 12:32:55PM -0800, Linus Torvalds wrote:
> Anyway, if we're willing to make some other changes to the VFS layer, we
> could make all of this a bit more efficient by _not_ requiring the actual
> filesystem lookup to take place.
>
> If we had a flag that allowed a dentry to not have a d_inode pointer, but
> still _not_ be considered automatically negative, then we could just make
> a loop that fills the dcache directly from the readdir() data inside the
> kernel, without calling down to the filesystem to look up the inode.
>
> That would save a _lot_ of memory - quite often we'd only need the dentry
> itself.

> So then we could have a dcache that is fully populated, even though the
> actual inode data hasn't been loaded yet.
>
> Comments?

*Ugh*

That will cause all sorts of nastiness for filesystems that _have_
case-insensitive lookups. Remember the crap we had to deal with to avoid
multiple dentries for directory? It will come back, AFAICS.

Another thing I really don't like is that we now get real lookups
on hashed dentry. That potentially changes a lot and can lead to very
interesting results for some filesystems.

Jamie Lokier

Feb 19, 2004, 8:53:15 PM2/19/04

to Linus Torvalds, vi...@parcelfarce.linux.theplanet.co.uk, Tridge, H. Peter Anvin, Kernel Mailing List

Linus Torvalds wrote:
> Comments?

Yes: The slow part of my brain thinks dnotify with a new flag
DN_IGNORE_SELF, meaning don't notify for things done by the process
which is watching, would provide equivalent functionality.

That is:

Samba looks up a name:

1. Look up cache entry in Samba's cache; fails.
2. Try exact name; fails.
3. Open directory.
4. Register dnotify (DN_IGNORE_SELF | DN_CREATE | DN_RENAME | DN_DELETE).
5. readdir(); no case-insensitive match.
6. Stores negative cache entry in Samba.

Future lookups just succeed in Samba's cache.

Negative cache entries are simply invalidated whenever a dnotify is
received for that directory.

Samba already maintains a cache for positive entries, so this would be
very little logic to add.
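
The negative-cache half of this can be sketched as below. It is a toy model under stated assumptions: the `neg_cache` names are hypothetical and are not Samba code, and ASCII `strcasecmp()` stands in for the Windows comparison. With real dnotify the registration in step 4 would be done via fcntl(fd, F_NOTIFY, ...) on the directory fd; DN_IGNORE_SELF is only the flag proposed here, not an existing one.

```c
#include <stdbool.h>
#include <stddef.h>
#include <strings.h>    /* strcasecmp (POSIX) */

#define NEG_MAX 8

/* Per-directory list of names known not to exist (hypothetical). */
struct neg_cache {
    size_t count;
    const char *names[NEG_MAX];
};

/* Step 6: remember that no case-insensitive match exists. */
static void neg_cache_store(struct neg_cache *c, const char *name)
{
    if (c->count < NEG_MAX)
        c->names[c->count++] = name;
}

/* Later lookups: true means "known absent, no readdir needed". */
static bool neg_cache_hit(const struct neg_cache *c, const char *name)
{
    for (size_t i = 0; i < c->count; i++)
        if (strcasecmp(c->names[i], name) == 0)
            return true;
    return false;
}

/* Run when a dnotify event arrives for the directory: drop everything. */
static void neg_cache_on_dnotify(struct neg_cache *c)
{
    c->count = 0;
}
```

Without something like DN_IGNORE_SELF, the server's own creates and renames would also trigger the invalidation, which is the objection raised earlier in the thread.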

In what way is your two bit proposal better?

- Jamie

Linus Torvalds

Feb 19, 2004, 9:26:29 PM2/19/04

to vi...@parcelfarce.linux.theplanet.co.uk, Tridge, Jamie Lokier, H. Peter Anvin, Kernel Mailing List

On Thu, 19 Feb 2004 vi...@parcelfarce.linux.theplanet.co.uk wrote:
>

> > So then we could have a dcache that is fully populated, even though the
> > actual inode data hasn't been loaded yet.
> >
> > Comments?
>
> *Ugh*
>
> That will cause all sorts of nastiness for filesystems that _have_
> case-insensitive lookups. Remember the crap we had to deal with to avoid
> multiple dentries for directory? It will come back, AFAICS.

No no. Look at how this works:
- only one dentry actually exists. It is marked "tentative", which means
that nobody will use it as-such without doing a lookup on it. It has
zero impact on aliases etc, because it's really just a place-holder: it
doesn't point to any inodes at all, it only says "there may or may not
be a file here"

NOTE! This dentry is in no way case-insensitive. It happens to have
_exactly_ the contents (and hash) that the readdir entry had, but it
has no meaning outside of that.

- each caller of "__d_lookup()" will have to check if it's a tentative
dentry and basically ignore it if so.

There aren't that many of them, and I think it all comes together in
"do_lookup()", which may be the _only_ place that actually cares right
now. Look at how that works right now:

    dentry = __d_lookup(..);
    if (!dentry)
        goto needs_lookup;  /* This case will allocate a whole
                               new dentry and use that for lookup */

    /* NEW CASE! */
    if (dentry->d_flags & D_TENTATIVE)
        goto needs_lookup_with_this_dentry;
  done:
    path->mnt = mnt;
    path->dentry = dentry;
    return 0;

    /*
     * NEW CASE!!
     *
     * Unhash the tentative one, and look up a real one.
     */
  needs_lookup_with_this_dentry:
    d_drop(dentry);
    dentry = NULL;

    /* OLD REGULAR CASE */
  needs_lookup:
    ...

In other words, neither the low-level filesystem NOR anything else really
ever sees the tentative dentry (the above is the really stupid approach: a
slightly more clever one will avoid the "real_lookup()" alloc_dentry()
thing and just use the tentative dentry after having unhashed it and
verified that it's the only user).

See? Nobody actually ever sees the "raw dentry". They all go through
__d_lookup(), and the rule would be:

- if "d_lookup()" sees a tentative dentry, it will just unhash it and
drop it (it has the dcache lock, so it can do that)
- all callers of "__d_lookup()" will have to check for D_TENTATIVE, and
decide what to do with it. I think there are exactly _three_ callers,
and one of them is d_lookup() itself.

See? Very very minimal impact that I can see (really, the biggest part
would be to do the dentry re-use in the better version of "do_lookup()" -
that would mean some re-organization, but maybe that optimization isn't
even worth it).

Or did I miss anything?

Linus

Linus Torvalds

Feb 19, 2004, 9:33:21 PM2/19/04

to Jamie Lokier, vi...@parcelfarce.linux.theplanet.co.uk, Tridge, H. Peter Anvin, Kernel Mailing List

On Thu, 19 Feb 2004, Jamie Lokier wrote:
>
> Yes: The slow part of my brain thinks dnotify with a new flag
> DN_IGNORE_SELF, meaning don't notify for things done by the process
> which is watching, would provide equivalent functionality.

Basically, yes. However, I can tell you that directory name caching is
damn hard, and the kernel does it better than anybody else.

The hardest part of caching is not filling the cache - it's knowing when
to release it. In other words, forget the filling part, and think about
the replacement policy (balancing between the page cache, the directory
cache, and regular pages). The kernel already has that.

Besides, I really think that we can do this with basically just a few
lines of code in the kernel (apart from the actual case comparison, which
I'm not even going to worry about - that's totally independent of the
cache handling itself, and I don't care about how to write a
"windows_equivalent_strncasecmp()").

Linus

Linus Torvalds

Feb 19, 2004, 9:42:57 PM2/19/04

to vi...@parcelfarce.linux.theplanet.co.uk, Tridge, Jamie Lokier, H. Peter Anvin, Kernel Mailing List

On Thu, 19 Feb 2004, Linus Torvalds wrote:
>
> No no. Look at how this works:
> - only one dentry actually exists.

That was really badly phrased. There can be _millions_ of these things,
but they are all "unique" - they have zero impact on each other, and have
no linkages. They never shadow any existing dentries (ie when we create
these, we'd obviously never create a tentative dentry with the same name
as an existing _valid_ dentry), and they are never visible to the
filesystem.

So it's not that "only one dentry" exists, but that this tentative
dentry only exists as a unique marker of "a dentry of this name _may_
exist".

Linus Torvalds

Feb 19, 2004, 9:47:23 PM2/19/04

to vi...@parcelfarce.linux.theplanet.co.uk, Tridge, Jamie Lokier, H. Peter Anvin, Kernel Mailing List

On Thu, 19 Feb 2004, Linus Torvalds wrote:
>

> See? Nobody actually ever sees the "raw dentry". They all go through
> __d_lookup(), and the rule would be:
>
> - if "d_lookup()" sees a tentative dentry, it will just unhash it and
> drop it (it has the dcache lock, so it can do that)
> - all callers of "__d_lookup()" will have to check for D_TENTATIVE, and
> decide what to do with it. I think there are exactly _three_ callers,
> and one of them is d_lookup() itself.

Actually, I've got a better setup: instead of having a D_TENTATIVE flag in
the dentry flags, just do

#define TENTATIVE_INODE ((struct inode *) 1)

and just have "dentry->d_inode = TENTATIVE_INODE" for the dentries that
were filled directly from "readdir()" data.

This not only avoids using a bit in the dentry flags, but it pretty much
guarantees that everybody is forced to use them correctly. It would be
very hard to have a buggy user: the dentry will clearly not be a negative
dentry (since d_inode is not NULL), but if anybody ever uses it as a
positive dentry, you'll get a nice and immediate oops.

So we'd see very quickly if these tentative dentries were to escape
outside of __d_lookup().
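
A small userspace sketch of the sentinel idea, for illustration only. The `struct inode` and `struct dentry` here are stand-ins rather than the kernel's definitions, and `classify_dentry()` is a hypothetical helper that is not part of the proposal.

```c
#include <stddef.h>

struct inode { int i_ino; };                /* stand-in, not the kernel's */
struct dentry { struct inode *d_inode; };   /* stand-in, not the kernel's */

/* Sentinel from the mail: non-NULL, but never a valid inode address. */
#define TENTATIVE_INODE ((struct inode *) 1)

enum dentry_state { DENTRY_NEGATIVE, DENTRY_TENTATIVE, DENTRY_POSITIVE };

/* Three-way classification that every __d_lookup() caller would need. */
static enum dentry_state classify_dentry(const struct dentry *d)
{
    if (d->d_inode == NULL)
        return DENTRY_NEGATIVE;     /* name is known not to exist */
    if (d->d_inode == TENTATIVE_INODE)
        return DENTRY_TENTATIVE;    /* name came from readdir, no lookup yet */
    return DENTRY_POSITIVE;         /* real inode attached */
}
```

Code that skips the tentative check and dereferences d_inode would fault on address 1 right away instead of silently treating the entry as negative, which is the "nice and immediate oops" being described.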

vi...@parcelfarce.linux.theplanet.co.uk

Feb 19, 2004, 9:48:42 PM2/19/04

to Linus Torvalds, Tridge, Jamie Lokier, H. Peter Anvin, Kernel Mailing List

On Thu, Feb 19, 2004 at 01:45:32PM -0800, Linus Torvalds wrote:
> So we'd see very quickly if these tentative dentries were to escape
> outside of __d_lookup().

Ahem... You'll see them (at least) in dcache pruning codepaths. And
those will dereference inodes...

Linus Torvalds

Feb 19, 2004, 9:54:31 PM2/19/04

to vi...@parcelfarce.linux.theplanet.co.uk, Tridge, Jamie Lokier, H. Peter Anvin, Kernel Mailing List

On Thu, 19 Feb 2004 vi...@parcelfarce.linux.theplanet.co.uk wrote:
>

> On Thu, Feb 19, 2004 at 01:45:32PM -0800, Linus Torvalds wrote:
> > So we'd see very quickly if these tentative dentries were to escape
> > outside of __d_lookup().
>
> Ahem... You'll see them (at least) in dcache pruning codepaths. And
> those will dereference inodes...

Yea, you be right. Many of those paths would not need to care about
TENTATIVE at all, so using the d_inode thing would make them uglier, I
agree. Maybe the flag is better after all (and it really should be pretty
well contained by just checking all __d_lookup callers, so it should be
hard to get it wrong, but maybe I've forgotten some path).

We could do it both ways - do the TENTATIVE_INODE thing as a debugging
thing at first to make sure none of these dentries escape, and then remove
it (and the unnecessary tests in the pruning paths) once everybody is
convinced that it is working correctly.

Linus

David Lang

Feb 19, 2004, 10:26:01 PM2/19/04

to Linus Torvalds, vi...@parcelfarce.linux.theplanet.co.uk, Tridge, Jamie Lokier, H. Peter Anvin, Kernel Mailing List

surfacing from normal lurk mode.

it looks to me like this could also end up allowing pruning lots of the
existing negative dcache entries.

if you fully cache the directory (and set the new flag) then any lookup
that isn't found is known to not exist.

is it worth freeing the negative dcache entries when you set the flag to
say that you have things fully cached? if so this could end up being a
significant memory savings.

David Lang

On Thu, 19 Feb 2004, Linus Torvalds wrote:

> Date: Thu, 19 Feb 2004 13:53:43 -0800 (PST)
> From: Linus Torvalds <torv...@osdl.org>
> To: vi...@parcelfarce.linux.theplanet.co.uk
> Cc: Tridge <tri...@samba.org>, Jamie Lokier <ja...@shareable.org>,
> H. Peter Anvin <h...@zytor.com>,
> Kernel Mailing List <linux-...@vger.kernel.org>
> Subject: Re: Eureka! (was Re: UTF-8 and case-insensitivity)

--
"Debugging is twice as hard as writing the code in the first place.
Therefore, if you write the code as cleverly as possible, you are,
by definition, not smart enough to debug it." - Brian W. Kernighan

tri...@samba.org

Feb 19, 2004, 11:40:21 PM2/19/04

to Linus Torvalds, Al Viro, Jamie Lokier, H. Peter Anvin, Kernel Mailing List

Linus,

I'm probably just thicker than a complete set of superman comics, and
probably haven't had enough coffee this morning, but I'm still trying
to understand exactly how much this is going to gain us.

If I understand it, your suggestion gives us:

- a way of telling if a directory is fully cached in the dcache
- a way of scanning that full cache with whatever braindead
comparison algorithm we want

At first I didn't understand the scanning part at all, because I
didn't realise that you could scan just the dentries associated with a
single directory. Al was kind enough to correct me on that.

What your proposal doesn't give us is case-insensitive indexing into
the dcache. The reason the dcache is such a great thing in Linux is
that it is indexed by name, so you rarely do any scanning at all, and
even in the case where you have never seen the name before we avoid
scanning because fast filesystems also use an "indexed by name"
scheme. Now maybe I'm just over-obsessed about this scanning stuff and
I'd need some profiles to see how much it would cost (although the
cost as the directory size gets really large seems obvious).

The really interesting part of your proposal is that it opens up the
possibility of a coherence mechanism between a cache that is indexed
by some Windows-like scheme and the real dcache. If those two bits
could be used by the windows_braindead module to determine if its own
separately indexed cache was current then we'd really be getting
somewhere.

If we didn't do the separate cache at all, then your proposal still
should hugely reduce the number of times we ask the filesystem for a
list of files in the directory, although as those calls are already
cached at the block device level what I suppose it does is move the
cache up a level. I don't have a clear idea of how much faster it is
to do this scanning in the dcache versus in the filesystem in the
hot-cache case, so I am not clear on how much this wins us. I'm
prepared to believe it could be quite significant though.

I really need more coffee-and-think time on this, plus maybe some
quick and dirty profiling tests to see what the various costs are
like.

While I'm here I should point out that I'm thinking of the 2.7/2.8
kernel (or even 3.0) for any change, not 2.6. Maybe that's obvious
anyway, but the corresponding userspace changes in Samba definitely
won't be happening in Samba 3.0, so this is a Samba 4.0 thing, which
is a fair way off. This means we've got plenty of time to try some
experiments and see what schemes really help.

Cheers, Tridge

Linus Torvalds

Feb 19, 2004, 11:59:34 PM2/19/04

to Andrew Tridgell, Al Viro, Jamie Lokier, H. Peter Anvin, Kernel Mailing List

On Fri, 20 Feb 2004 tri...@samba.org wrote:
>
> What your proposal doesn't give us is case-insensitive indexing into
> the dcache.

Correct.

And I've told you OVER AND OVER again that you have a choice: better than
what you do now, or nothing. Whining about the fact that Windows is
stupid will only make me convinced that there is no point to even helping
samba, since what you really want is WNT.

If what you want is WNT, then go away. That's not what I'm offering. And
it's not going to _be_ that I offer.

I offer you _sane_ VFS semantics, with some accelerators for your insane
needs. If that isn't enough, then please just stop bothering me.

Comprende?

> The reason the dcache is such a great thing in Linux is
> that it is indexed by name, so you rarely do any scanning at all

And that is still true of any exact matches.

If you have a fuzzy lookup of a name that does exist, but doesn't match,
or you have a new name that simply doesn't _exist_ in the dcache, then you
will have to scan all dentries. But now you can scan them in-memory by
following the pointers directly without having to index through the
filesystem data structures and worrying about disk reads. And you can
optimize that to do a fast mismatch (ie in most cases you can probably
look at the first one or two characters and determine immediately that
there is no match).
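
The "fast mismatch" scan can be sketched like this. It is a userspace illustration under stated assumptions: a plain array stands in for the directory's in-memory dentry list, and `tolower()` plus POSIX `strcasecmp()` stand in for whatever case table the real comparison would use. The function name is hypothetical.

```c
#include <ctype.h>
#include <stddef.h>
#include <strings.h>    /* strcasecmp (POSIX) */

/* Returns the index of the first case-insensitive match, or -1. */
static int scan_names_nocase(const char *const *names, size_t count,
                             const char *wanted)
{
    int first = tolower((unsigned char)wanted[0]);

    for (size_t i = 0; i < count; i++) {
        /* Fast mismatch: reject on the first character alone. */
        if (tolower((unsigned char)names[i][0]) != first)
            continue;
        /* Only now pay for the full comparison. */
        if (strcasecmp(names[i], wanted) == 0)
            return (int)i;
    }
    return -1;
}
```

The scan is still linear in the number of names, so this does not answer the concern about very large directories; it only makes each rejected entry cheap.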

The only way to avoid that is to make the hash weaker. Which I'm not
willing to do: I'm not willing to make the _proper_ lookups go slower
because of some insane crap generated by Microsoft.

In other words, put up or shut up. If you are only going to repeat your
whine about how you want the Linux VFS layer to look like Windows, I'm
simply NOT INTERESTED.

Linus

Jamie Lokier

Feb 20, 2004, 12:03:23 AM2/20/04

to Linus Torvalds, vi...@parcelfarce.linux.theplanet.co.uk, Tridge, H. Peter Anvin, Kernel Mailing List

Linus Torvalds wrote:
> The hardest part of caching is not filling the cache - it's knowing when
> to release it. In other words, forget the filling part, and think about
> the replacement policy (balancing between the page cache, the directory
> cache, and regular pages). The kernel already has that.

It's worth noting that Samba already has a dcache in userspace: tridge
mentioned that positive case-insensitive lookups are cached, so the
replacement policy is already skewed by that.

Will your proposal eliminate Samba's positive cache as well?

> Besides, I really think that we can do this with basically just a few
> lines of code in the kernel (apart from the actual case comparison, which
> I'm not even going to worry about - that's totally independent of the
> cache handling itself, and I don't care about how to write a
> "windows_equivalent_strncasecmp()").

What I like about my idea is that no windows_equivalent_strncasecmp()
needs to go into the kernel. I.e. no need for a Samba-specific module.

The other thing I like is that DN_IGNORE_SELF would be useful for
other applications too.

What I like about your idea is that it'll be a bit faster, the dcache
replacement policy will be nicer, and if there are atomicity
conditions we haven't thought of, it'll be easier to handle them.

-- Jamie

Linus Torvalds

Feb 20, 2004, 12:18:36 AM2/20/04

to Jamie Lokier, vi...@parcelfarce.linux.theplanet.co.uk, Tridge, H. Peter Anvin, Kernel Mailing List

On Fri, 20 Feb 2004, Jamie Lokier wrote:
>
> Will your proposal eliminate Samba's positive cache as well?

Samba has to work on different kernels, so they'll have to have their own
code anyway. Whether they want to turn it off or not if better
alternatives are found is up to them. Right now it appears that what
Tridge wants is a WNT dcache, and since he's not going to get it, I guess
the whole discussion is moot.

> What I like about my idea is that no windows_equivalent_strncasecmp()
> needs to go into the kernel. I.e. no need for a Samba-specific module.
>
> The other thing I like is that DN_IGNORE_SELF would be useful for
> other applications too.

I agree. It might even be acceptable not as a new flag, but as a
modification to existing behaviour. I can't imagine that a file manager is
all that interested in seeing the changes it itself does be reported back
to it. And I don't really know of any other uses of dnotify.

(That said, clearly it's better to just have a new flag, since that way
there is no possibility of anything breaking).

On the other hand, even with a nice dnotify infrastructure, you simply
_cannot_ get absolute atomicity guarantees. Because by the time you
actually execute the "mv" operation, another process may create a new file
with the "same" name (ie different name, but comparing the same ignoring
case) on another CPU. By the time you get the dnotify, it's too late, and
the move will have happened, and undoing the operation (and hiding it from
the client) may well be impossible - possibly because another process
has created a file with the old name.

NOTE! Even an in-kernel implementation fundamentally cannot fix this race
on something like NFS. So the in-kernel version would only help for local
filesystems that the kernel has exclusive write access to.
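
The check-then-rename window described above can be sketched in user space. This is a minimal illustration, not Samba's code: ci_lookup() and ci_rename() are hypothetical names, and strcasecmp() is only a stand-in for whatever Windows-compatible comparison a real server would need.

```c
#include <dirent.h>
#include <stdio.h>
#include <strings.h>

/* Scan `dir` for an entry equal to `name` ignoring case (strcasecmp is an
 * assumption; it only handles the current locale's single-byte case rules).
 * Returns 1 and copies the on-disk name into `found` on a match,
 * 0 if there is no match, -1 if the directory cannot be read. */
int ci_lookup(const char *dir, const char *name, char *found, size_t len)
{
    DIR *d = opendir(dir);
    struct dirent *e;
    int hit = 0;

    if (!d)
        return -1;
    while ((e = readdir(d)) != NULL) {
        if (strcasecmp(e->d_name, name) == 0) {
            snprintf(found, len, "%s", e->d_name);
            hit = 1;
            break;
        }
    }
    closedir(d);
    return hit;
}

/* Rename dir/from to dir/to only if no name matching `to` (ignoring case)
 * is present. Returns 0 on success, -1 on a clash or error.
 * RACE: another process can create a clashing name after ci_lookup()
 * returns and before rename() runs. */
int ci_rename(const char *dir, const char *from, const char *to)
{
    char src[4096], dst[4096], clash[256];

    if (ci_lookup(dir, to, clash, sizeof(clash)) != 0)
        return -1;
    snprintf(src, sizeof(src), "%s/%s", dir, from);
    snprintf(dst, sizeof(dst), "%s/%s", dir, to);
    return rename(src, dst);
}
```

Nothing ties the directory scan to the rename() call, so a clashing name created in between goes unnoticed. The sketch also pays for a full readdir() scan on every operation.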

Linus

Linus Torvalds

Feb 20, 2004, 12:26:43 AM

to Jamie Lokier, vi...@parcelfarce.linux.theplanet.co.uk, Tridge, H. Peter Anvin, Kernel Mailing List

On Thu, 19 Feb 2004, Linus Torvalds wrote:
>

> I agree. It might even be acceptable not as a new flag, but as a
> modification to existing behaviour. I can't imagine that a file manager is
> all that interested in seeing the changes it itself does be reported back
> to it. And I don't really know of any other uses of dnotify.

I take that back. Even a file manager may very well be interested in moves
that it does itself - most of them have some sort of multi-window view
capability, and if they use dnotify, they might well be using it to keep
the different views coherent.

So yes, a new flag would likely be required.

That said, who actually _uses_ dnotify? The only time dnotify seems to
come up in discussions is when people complain how badly designed it is,
and I don't think I've ever heard anybody say that they use it and
that they liked it ;)

tri...@samba.org

Feb 20, 2004, 12:35:49 AM

to Linus Torvalds, Al Viro, Jamie Lokier, H. Peter Anvin, Kernel Mailing List

Linus,

> And I've told you OVER AND OVER again that you have a choice: better than
> what you do now, or nothing. Whining about the fact that Windows is
> stupid will only make me convinced that there is no point to even helping
> samba, since what you really want is WNT.

yes, I've acknowledged that. I know you aren't going to give me the
ideal solution, I'm just exploring how far this is from the ideal and
trying to get a feel for how much it actually gains us compared to
what we do now.

If I understand things correctly then I think that your suggestion
probably does gain us a fair bit, but I think that biting my head off
for exploring just how much the gain is versus the current code and
the "ideal" code is a bit much.

Cheers, Tridge

Trond Myklebust

Feb 20, 2004, 12:44:41 AM

to Linus Torvalds, Kernel Mailing List

On Thu, 19/02/2004 at 16:24, Linus Torvalds wrote:
> That said, who actually _uses_ dnotify? The only time dnotify seems to
> come up in discussions is when people complain how badly designed it is,
> and I don't think I've ever heard anybody say that they use it and
> that they liked it ;)

We use it in the idmapper and RPCSEC_GSS userland daemons in order to
track which NFS clients are up and running (by peeking inside the
rpc_pipefs). Works fine there...

Cheers,
Trond

Linus Torvalds

Feb 20, 2004, 12:48:17 AM

to Andrew Tridgell, Al Viro, Jamie Lokier, H. Peter Anvin, Kernel Mailing List

On Fri, 20 Feb 2004 tri...@samba.org wrote:
>

> yes, I've acknowledged that. I know you aren't going to give me the
> ideal solution, I'm just exploring how far this is from the ideal and
> trying to get a feel for how much it actually gains us compared to
> what we do now.

I suspect the only way to know that is to code something up.

The kernel side (with the full "readdir()" loop and a TENTATIVE flag etc)
is not likely to be that many lines of code, but it's definitely something
where the person who writes those lines needs to really understand the
kernel code to get anywhere at all. And it's in an "interesting" area of
the kernel, so you have to be really careful. And you'd need somebody who
is used to samba too, in order to make the path component walk side in user
space work right with the new interface. So..

I can try to see if I can write something - I'd not do the actual
comparison function, but I have the rough framework in my mind. I won't
get to it for another day or two, at _least_, though.

With that set up, getting numbers and doing a kernel profile to see where
the time goes is probably not hard - again, if you have a samba setup with
benchmarks already set up. I just don't know anybody who knows both pieces
of the puzzle..

(This, btw, was the big problem with pthreads too. The 2.6.x threading
improvements were things that had been discussed for years, but it took
until Ingo, Uli and Roland actually sat down and looked at both the user
side and the kernel side before anything really happened).

Linus

Jamie Lokier

Feb 20, 2004, 12:55:29 AM

to Linus Torvalds, vi...@parcelfarce.linux.theplanet.co.uk, Tridge, H. Peter Anvin, Kernel Mailing List

Linus Torvalds wrote:

> I can't imagine that a file manager is all that interested in seeing
> the changes it itself does be reported back to it.

No, but any file manager that's made of libraries where one thread is
showing the window and another thread is doing operations will care -
unless they explicitly communicate. Right now they might, or they
might not.

> (That said, clearly it's better to just have a new flag, since that way
> there is no possibility of anything breaking).

Quite.

> And I don't really know of any other uses of dnotify.

High performance web template cache:
dnotify is used to invalidate cached info about prerequisite files,
so that quite a lot of files can be used to create a page, the
output is cached, and validating the cache for each page request
is actually zero cost (because dnotify is a signal, so validating is
just checking that you didn't receive the signal).

Accelerated Make:
dnotify is used to invalidate cached stat() results between runs.
A daemon runs in the background to retain the information.
(Communicating with the daemon is only faster than calling stat()
if the retained information includes precomputed dependencies,
pre-parsed Makefiles and such.)

Java VM accelerator:
Let the JVM precompile class files to machine-specific code and
keep that in a mmaped file between invocations. When a new JVM
process is started, it checks that all the class files for a
particular program haven't changed; a daemon using dnotify can
speed up this check, or even provide a stronger guarantee, if you
don't trust stat() mtimes.

Fontconfig accelerator:
When a program using fontconfig (e.g. any GNOME program and many
others) starts, it calls stat() on every font file in ~/.fonts.
This is lovely to use because you just drop font files in there,
but the stat() calls are slow when you have a very large number.
A daemon using dnotify can monitor this and allow a program to
skip those calls.

Maildir accelerator:
Similar to fontconfig, but on mail directories for validating
the cached summary information about all mails in a folder.

Shared cache directory:
A program stores files in a shared cache, e.g. like a web browser.
dnotify can be used to monitor the cached files, to invalidate
in-memory data structures parsed from them if other programs are
modifying the same cached file data structures.

Shared database in a file (like Berkeley DB et al):
dnotify is used to notice when another process modifies the file.
You still need to lock and write updates, but you can avoid reading
and parsing the database file between queries and use calculated
in-memory data for the queries, if you know the file hasn't been
changed by another process.

One thing you can't do is real-time updatedb+locate, because of the
need to have an open file descriptor for every directory that's monitored.
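
The "validating is just checking that you didn't receive the signal" pattern common to the examples above can be sketched with the dnotify fcntl() interface. This is a minimal sketch, assuming Linux with glibc; watch_dir(), cache_is_valid() and the choice of SIGRTMIN are illustrative, not taken from any of the programs mentioned.

```c
#define _GNU_SOURCE   /* exposes F_NOTIFY, F_SETSIG and the DN_* flags */
#include <fcntl.h>
#include <signal.h>
#include <string.h>
#include <unistd.h>

/* Set by the handler when the watched directory changes. */
static volatile sig_atomic_t cache_stale = 0;

static void on_dir_change(int sig)
{
    (void)sig;
    cache_stale = 1;
}

/* Start watching `path`. Returns the directory fd, which has to stay
 * open for as long as notifications are wanted, or -1 on failure. */
int watch_dir(const char *path)
{
    struct sigaction sa;
    int fd;

    memset(&sa, 0, sizeof(sa));
    sa.sa_handler = on_dir_change;
    sigemptyset(&sa.sa_mask);
    if (sigaction(SIGRTMIN, &sa, NULL) < 0)
        return -1;

    fd = open(path, O_RDONLY | O_DIRECTORY);
    if (fd < 0)
        return -1;
    /* Deliver SIGRTMIN instead of the default SIGIO, and keep the
     * watch armed after the first event (DN_MULTISHOT). */
    if (fcntl(fd, F_SETSIG, SIGRTMIN) < 0 ||
        fcntl(fd, F_NOTIFY, DN_MODIFY | DN_CREATE | DN_DELETE |
                            DN_RENAME | DN_MULTISHOT) < 0) {
        close(fd);
        return -1;
    }
    return fd;
}

/* The per-request check: one memory read, no system call. */
int cache_is_valid(void)
{
    return !cache_stale;
}
```

The one-fd-per-directory cost is visible here: every call to watch_dir() holds a descriptor open, which is why monitoring a whole tree this way does not scale.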

-- Jamie

Jamie Lokier

Feb 20, 2004, 1:03:12 AM

to Linus Torvalds, vi...@parcelfarce.linux.theplanet.co.uk, Tridge, H. Peter Anvin, Kernel Mailing List

Linus Torvalds wrote:
> That said, who actually _uses_ dnotify? The only time dnotify seems to
> come up in discussions is when people complain how badly designed it is,
> and I don't think I've ever heard anybody say that they use it and
> that they liked it ;)

I've not used it, but I have plenty of ideas (see the other email),
and one big project I'm working on that intends to use it, which isn't
a file manager.

I must say it is badly designed and I don't like it :)

Actually the design is ok because it's easy to understand. It is just
a bit limiting for more adventurous purposes than a file manager.

Something that fitted nicely into the epoll style of event queue, and
also allowed whole directory trees to be monitored, and told exactly
what changed, and let you take out leases on files that caught writing
as well as opens, and worked even across reboots or with no program
running (using generation numbers of some kind).... that I'd like a
tiny bit more :)

-- Jamie

tri...@samba.org

Feb 20, 2004, 1:11:13 AM

to Linus Torvalds, Jamie Lokier, vi...@parcelfarce.linux.theplanet.co.uk, H. Peter Anvin, Kernel Mailing List

Linus,

> That said, who actually _uses_ dnotify? The only time dnotify seems to
> come up in discussions is when people complain how badly designed it is,
> and I don't think I've ever heard anybody say that they use it and
> that they liked it ;)

This may not be the example you want, but Samba uses it and it is
absolutely vital to good performance.

The common situation is this:

- 1000 windows drones sitting in an office with their windows
explorer windows open on their home directory on the server, but
not doing any real work.

- all those windows boxes ask the Samba server "let me know when the
directory changes so I can refresh this window that nobody is
looking at anyway"

- before we had dnotify samba had to continuously poll all those
directories, looking for a change in a checksum of the directory
contents. We had tunable parameters for how often to poll, whether
to poll etc, but basically it sucked, because windows users with
nothing better to do ask "why doesn't it behave just like NT"

- now samba just watches for dnotify events
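
For comparison, the pre-dnotify polling approach can be sketched as a checksum over the directory contents that is recomputed on every poll. This is a rough illustration under stated assumptions: dir_checksum() is a hypothetical name, and hashing each entry's name, size and mtime with FNV-1a is a guess at what such a checksum might cover, not Samba's actual algorithm.

```c
#include <dirent.h>
#include <stdint.h>
#include <stdio.h>
#include <string.h>
#include <sys/stat.h>

/* FNV-1a over a byte range, continuing from hash state `h`. */
static uint64_t mix(uint64_t h, const void *p, size_t n)
{
    const unsigned char *b = p;

    while (n--) {
        h ^= *b++;
        h *= 1099511628211ULL;
    }
    return h;
}

/* Checksum the entries of `path` (name, size, mtime of each).
 * Stores the result in *out and returns 0, or returns -1 if the
 * directory cannot be read. Costs one lstat() per entry per poll. */
int dir_checksum(const char *path, uint64_t *out)
{
    DIR *d = opendir(path);
    struct dirent *e;
    uint64_t sum = 0;

    if (!d)
        return -1;
    while ((e = readdir(d)) != NULL) {
        char full[4096];
        struct stat st;
        uint64_t h = 14695981039346656037ULL;

        if (!strcmp(e->d_name, ".") || !strcmp(e->d_name, ".."))
            continue;
        snprintf(full, sizeof(full), "%s/%s", path, e->d_name);
        if (lstat(full, &st) < 0)
            continue;   /* entry vanished between readdir and lstat */
        h = mix(h, e->d_name, strlen(e->d_name));
        h = mix(h, &st.st_size, sizeof(st.st_size));
        h = mix(h, &st.st_mtime, sizeof(st.st_mtime));
        sum += h;       /* addition makes the result independent of readdir order */
    }
    closedir(d);
    *out = sum;
    return 0;
}
```

A poller would call this on a timer for every watched directory and compare against the previous value, so the cost grows with both the number of idle clients and the directory sizes, and changes are only seen at the next poll.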

The other situation where it really sucked was for windows developers
using visual C. The builtin make-like system in that braindead tool
actually got compilations wrong if the file server didn't tell it that
a file in its directory had changed. It would say "nothing to do" when
you do a build and we hadn't polled recently enough. Cue the angry
windows developers and people screaming to put a real NT box in
instead of Samba.

So dnotify has been a huge bonus for Samba, I just wish a few more
non-Samba tools would use it so it doesn't run the risk of being
removed because only Samba cares. It sucks being the ugly duckling,
and knowing that nobody is ever going to tell you you're really a
swan :)

Cheers, Tridge

H. Peter Anvin

Feb 20, 2004, 1:20:28 AM

to Linus Torvalds, Andrew Tridgell, Al Viro, Jamie Lokier, Kernel Mailing List

Linus Torvalds wrote:
>
> The only way to avoid that is to make the hash weaker. Which I'm not
> willing to do: I'm not willing to make the _proper_ lookups go slower
> because of some insane crap generated by Microsoft.
>

Or, to be fair, have a secondary set of hash entries (effectively a
parallel dcache, which would optimize on normalized names instead of
true names.)

A multi-dcache approach seems scary as hell, though...
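
The difference between hashing true names and hashing normalized names can be sketched as follows. This is only an illustration of the idea: hash_exact() and hash_folded() are hypothetical names, FNV-1a is a stand-in for the real dcache hash, and ASCII tolower() is a stand-in for proper case normalization.

```c
#include <ctype.h>
#include <stddef.h>
#include <stdint.h>

/* Hash over the true name: "README" and "readme" land in unrelated
 * buckets, which is what keeps exact lookups fast. */
uint32_t hash_exact(const char *name, size_t len)
{
    uint32_t h = 2166136261u;

    for (size_t i = 0; i < len; i++) {
        h ^= (unsigned char)name[i];
        h *= 16777619u;
    }
    return h;
}

/* Hash over the normalized name: all case variants of a name share a
 * bucket, so a case-insensitive lookup only has to search one chain.
 * ASCII-only folding is an assumption made for brevity. */
uint32_t hash_folded(const char *name, size_t len)
{
    uint32_t h = 2166136261u;

    for (size_t i = 0; i < len; i++) {
        h ^= (unsigned char)tolower((unsigned char)name[i]);
        h *= 16777619u;
    }
    return h;
}
```

A parallel table keyed by hash_folded() would leave the exact-match table untouched, at the price of keeping two sets of entries consistent.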

-hpa

Paul Wagland

Feb 20, 2004, 1:23:58 AM

to Linus Torvalds, Jamie Lokier, vi...@parcelfarce.linux.theplanet.co.uk, Tridge, H. Peter Anvin, Kernel Mailing List

On Fri, 2004-02-20 at 01:24, Linus Torvalds wrote:

> That said, who actually _uses_ dnotify? The only time dnotify seems to
> come up in discussions is when people complain how badly designed it is,
> and I don't think I've ever heard anybody say that they use it and
> that they liked it ;)

Well, in the desktop land both kde and gnome use fam, and fam can use
dnotify as its backend to watch files. Server side, courier can use fam
as well, so although there are not a lot of programs that use dnotify
directly, there are a lot that can use it indirectly, and will fall back
to polling on a non-dnotify system. I don't know if the famd people like
it or not though ;-)

Cheers,
Paul
