-mm -> 2.6.13 merge status

Andrew Morton

Jun 21, 2005, 5:40:22 AM

This summarises my current thinking on various patches which are presently
in -mm. I cover large things and small-but-controversial things. Anything
which isn't covered here (and that's a lot of material) is probably a "will
merge", unless it obviously isn't.

(If you reply to this email it would be a good idea to alter the Subject:
to reflect which feature you are discussing)

git-ocfs

The OCFS2 filesystem. OK by me, although I'm not sure it's had enough
review.

sparsemem

OK by me for a merge. Need to poke arch maintainers first, check that
they've looked at it sufficiently closely.

vm-early-zone-reclaim

Needs some convincing benchmark numbers to back it up. Otherwise OK.

avoiding-mmap-fragmentation

Tricky. Addresses vm area fragmentation issues due to recent
optimisations to the free-area lookup code. Will merge.

periodically-drain-non-local-pagesets

Will merge

pcibus_to_node and users

Will merge

CONFIG_HZ for x86 and ia64: changes default HZ to 250, make HZ Kconfigurable.

Will merge (will switch default to 1000 Hz later if that seems necessary)

dmi-*.patch

Will merge. I have a comment "The below break x440". Maybe it got
fixed. We'll doubtless hear if not.

xen-*.patch

These are little cleanups and abstractions which make a Xen merge
easier. May as well merge them.

CPU hotplug for x86 and x86_64

Not really useful on current hardware, but these provide
infrastructure which some power management patches need, and it seems
sensible to make the reference architecture support hotplug. Will merge.

swsusp-on-SMP

Will merge.

cfq version 3

Not sure. Jens seems to be setting up a few git trees. On hold.

RCUification of the key management code

Don't know - dhowells seemed diffident last time we discussed this.

timers-fixes-improvements.patch

SMP speedups for the core timer code. It was bumpy, but this seems
stable now. Will merge.

kprobes-*

Will merge

rapidio-*

Will merge.

namespace*.patch

Awaiting viro ack.

xtensa architecture

Is xtensa now, or will it be in the future a sufficiently popular
architecture to justify the cost of having this code in the tree?

Heaven knows. Will merge.

dlm-*.patch: Red Hat distributed lock manager

Hard. Right now it seems that no in-kernel projects will use this and
only one out-of-kernel project will use it. Shelve the problem until
after Kernel Summit, where some light may be shed.

Opinions are sought...

connector.patch

Nice idea IMO, but there are still questions around the
implementation. More dialogue needed ;)

connector-add-a-fork-connector.patch

OK, but needs connector.

inotify

There are still concerns about the userspace API and internal
implementation details. More slogging needed.

pcmcia-*.patch

Makes the pcmcia layer generate hotplug events and deprecates cardmgr.
Will merge.

NUMA-aware slab allocator

Seems stable now, but it needs some ifdef reduction work before
merging, please.

CPU scheduler

Will merge some of these patches. We're still discussing which ones.

perfctr

Not yet, but getting closer. The PPC64 guys still need to sort out a
few interface issues with Mikael. We might be able to fit this into
2.6.13 if people get a move on.

cachefs

This is a ton of code which knows rather a lot about pagecache
internals. It allows the AFS client to cache file contents on a local
blockdev.

I don't think it's a justified addition for only AFS and I'd prefer to
see it proven for NFS as well.

Issues around add-page-becoming-writable-notification.patch need to
be resolved.

cachefs-for-nfs

A recent addition. Needs review from NFS developers and considerably
more testing.

These things aren't looking likely for 2.6.13.

kexec and kdump

I guess we should merge these.

I'm still concerned that the various device shutdown problems will
mean that the success rate for crashing kernels is not high enough for
kdump to be considered a success. In which case in six months time we'll
hear rumours about vendors shipping wholly different crashdump
implementations, which would be quite bad.

But I think this has gone as far as it can go in -mm, so it's a bit of
a punt.

reiser4

Merge it, I guess.

The patches still contain all the reiser4-specific namespace
enhancements, only it is disabled, so it is effectively dead code. Maybe
we should ask that it actually be removed?

v9fs

I'm not sure that this has a sufficiently high
usefulness-to-maintenance-cost ratio.

fuse

This is useful, but there are, AFAIK, two issues:

- We're still deadlocked over some permission-checking hacks in there

- It has an NFS server implementation which only works if the
to-be-served file happens to be in dcache.

It has been said that a userspace NFS server can be used to get
full NFS server functionality with FUSE. I think the half-assed kernel
implementation should be done away with.

execute-in-place

Will merge. Have the embedded guys commented on the usefulness of
this for execute-out-of-ROM?

-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majo...@vger.kernel.org
More majordomo info at http://vger.kernel.org/majordomo-info.html
Please read the FAQ at http://www.tux.org/lkml/

Miklos Szeredi

Jun 21, 2005, 6:30:07 AM

> fuse
>
> This is useful, but there are, AFAIK, two issues:
>
> - We're still deadlocked over some permission-checking hacks in there

Oh, god. Let me try to explain this again:

- This is a security issue with unprivileged mounts

- Since no other filesystem currently offers secure unprivileged
mounts in Linux, this is something "new"

- Since it's something new, there's a big resistance to acceptance.
I understand this; I only ask people to please read
Documentation/filesystems/fuse.txt, before arguing against it

- IMO it's not a hack, and it's not something that can be solved
otherwise (no, private namespaces are NOT a solution, they are
mostly orthogonal to this).

So I welcome constructive discussion. However bear in mind, that I
definitely don't want to disable unprivileged mounts. For me that is
_the_ most important feature of FUSE.

> - It has an NFS server implementation which only works if the
> to-be-served file happens to be in dcache.

More precisely, it relies on the icache.

> It has been said that a userspace NFS server can be used to
> get full NFS server functionality with FUSE. I think the
> half-assed kernel implementation should be done away with.

I won't shed many tears if you drop fuse-nfs-export.patch. It would
at least give the userspace solution some boost.

However the patch is pretty small, and despite its flaws, I know it's
used by a number of people.

Thanks,
Miklos

Andi Kleen

Jun 21, 2005, 8:10:11 AM

Andrew Morton <ak...@osdl.org> writes:

> perfctr
>
> Not yet, but getting closer. The PPC64 guys still need to sort out a
> few interface issues with Mikael. We might be able to fit this into
> 2.6.13 if people get a move on.

So the problems IA64 had with this are resolved now?

Unfortunately there is a perfmon implementation for i386/x86-64
floating around now (with some unmergeable stuff, though it might be
fixable), which is kind of competing now :/

> reiser4
>
> Merge it, I guess.
>
> The patches still contain all the reiser4-specific namespace
> enhancements, only it is disabled, so it is effectively dead code. Maybe
> we should ask that it actually be removed?

Has there actually been any serious review of this?
Last time I looked there was a lot of very ugly code in there.

Also, I'm not sure things like coming with its own profiler
and spinlock debugger are really acceptable. At least this stuff
should be removed too.

-Andi

Andrey Panin

Jun 21, 2005, 8:20:14 AM

Fixed, patch merged in -mm as dmi-move-acpi-sleep-quirk-fix.patch

http://marc.theaimsgroup.com/?l=linux-kernel&m=111829134708641&w=2
http://marc.theaimsgroup.com/?l=linux-kernel&m=111832375203467&w=2

--
Andrey Panin | Linux and UNIX system administrator
pa...@donpac.ru | PGP key: wwwkeys.pgp.net


Carsten Otte

Jun 21, 2005, 9:00:10 AM

On 6/21/05, Andrew Morton <ak...@osdl.org> wrote:
> execute-in-place
>
> Will merge. Have the embedded guys commented on the usefulness of
> this for execute-out-of-ROM?

All right. Going to push our test-team to run their tests on the xip
code that is in -mm.

Alan Cox

Jun 21, 2005, 9:00:16 AM

On Maw, 2005-06-21 at 07:54, Andrew Morton wrote:
> CONFIG_HZ for x86 and ia64: changes default HZ to 250, make HZ Kconfigurable.
> Will merge (will switch default to 1000 Hz later if that seems necessary)

This has been in Fedora for a while. DaveJ can probably give you more
info. From my own testing, 100Hz is how far down you want to go to avoid
random clock slew on laptops and to see power improvements.

Alan Cox

Jun 21, 2005, 9:00:17 AM

On Maw, 2005-06-21 at 11:22, Miklos Szeredi wrote:
> So I welcome constructive discussion. However bear in mind, that I
> definitely don't want to disable unprivileged mounts. For me that is
> _the_ most important feature of FUSE.

If the choice was "merge FUSE without unpriv mounts for now" or "discard
fuse completely" which is preferable?

It seems to me (just IMHO) that it would be better to merge FUSE without
that feature and then spend the time getting that feature right _in
parallel_ with people using, breaking and reviewing FUSE a lot more.

Nigel Cunningham

Jun 21, 2005, 9:20:14 AM

Hi.

(Marcelo: Copied for issue at the bottom).

On Tue, 2005-06-21 at 16:54, Andrew Morton wrote:
> CPU hotplug for x86 and x86_64
>
> Not really useful on current hardware, but these provide
> infrastructure which some power management patches need, and it seems
> sensible to make the reference architecture support hotplug. Will merge.

Yay. I'm not going to use it yet, but know Pavel wants it for the next
one.

> swsusp-on-SMP
>
> Will merge.

>
> kexec and kdump
>
> I guess we should merge these.
>
> I'm still concerned that the various device shutdown problems will
> mean that the success rate for crashing kernels is not high enough for
> kdump to be considered a success. In which case in six months time we'll
> hear rumours about vendors shipping wholly different crashdump
> implementations, which would be quite bad.
>
> But I think this has gone as far as it can go in -mm, so it's a bit of
> a punt.

No potential clashes with suspend code, I assume?

> execute-in-place
>
> Will merge. Have the embedded guys commented on the usefulness of
> this for execute-out-of-ROM?

Switch roles for a mo and put my Cyclades hat on. Probably not useful to
us at the moment, at least in the case of the products I work on.
Marcelo?

Regards,

Nigel

Jörn Engel

Jun 21, 2005, 9:20:21 AM

On Mon, 20 June 2005 23:54:58 -0700, Andrew Morton wrote:
>
> execute-in-place
>
> Will merge. Have the embedded guys commented on the usefulness of
> this for execute-out-of-ROM?

It looks fairly useful, but XIP for NOR flashes still needs additional
work. No objections from my side.

Jörn

--
Optimizations always bust things, because all optimizations are, in
the long haul, a form of cheating, and cheaters eventually get caught.
-- Larry Wall

Arjan van de Ven

Jun 21, 2005, 9:30:24 AM

On Tue, 2005-06-21 at 13:35 +0100, Alan Cox wrote:
> On Maw, 2005-06-21 at 07:54, Andrew Morton wrote:
> > CONFIG_HZ for x86 and ia64: changes default HZ to 250, make HZ Kconfigurable.
> > Will merge (will switch default to 1000 Hz later if that seems necessary)
>
> This has been in Fedora for a while. DaveJ can probably give you more
> info. From own testing 100Hz is how far down you want to go to avoid
> random clock slew on laptops and to see power improvements.

Actually, 250Hz is a not-so-fun value. 300 is a lot nicer (a multiple of
both 50Hz and 60Hz, and thus covers most TV standards).
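
A quick sketch of the arithmetic behind the 300Hz suggestion. It assumes nominal field rates of exactly 50Hz and 60Hz, and the helper names are made up for this illustration; nothing here comes from the CONFIG_HZ patch itself.

```python
from fractions import Fraction

# Tick rates mentioned in the thread, plus the 300 suggested above.
CANDIDATE_HZ = [100, 250, 300, 1000]
# Nominal video field rates in Hz (assumed exact for this sketch).
FIELD_RATES = [50, 60]


def ticks_per_field(hz, field_rate):
    """Timer ticks per video field, as an exact fraction."""
    return Fraction(hz, field_rate)


def divides_evenly(hz):
    """True if hz is a whole multiple of every field rate."""
    return all(hz % rate == 0 for rate in FIELD_RATES)


for hz in CANDIDATE_HZ:
    per_field = {rate: str(ticks_per_field(hz, rate)) for rate in FIELD_RATES}
    print(hz, per_field, divides_evenly(hz))
```

Of the four candidates, only 300 gives a whole number of ticks per field at both rates (6 and 5 respectively).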


Miklos Szeredi

Jun 21, 2005, 9:30:23 AM

> > So I welcome constructive discussion. However bear in mind, that I
> > definitely don't want to disable unprivileged mounts. For me that is
> > _the_ most important feature of FUSE.
>
> If the choice was "merge FUSE without unpriv mounts for now" or "discard
> fuse completely" which is preferable.

FUSE is doing fine outside mainline, so discard wouldn't be such a big
setback. Including it without unpriv mounts would effectively fork
FUSE into an out-of-tree and an in-tree version, which is sure to
cause confusion.

So yes, I'd prefer not merging to merging without unpriv mounts. But
it's GPL, so obviously I don't have any legal control over it.

> It seems to me (just IMHO) that it would be better to merge FUSE without
> that feature and then spend the time getting that feature right _in
> parallel_ with people using, breaking and reviewing FUSE a lot more.

The security measure in question is actually very simple (10 lines or
so). So it's not the implementation that people have problems with
but the concept. The concept itself is hard to swallow, because it
does something unexpected, but what it does is in fact very logical.

That's why I ask people to read the documentation, think about it and
_then_ argue. Up till now the discussion with Christoph Hellwig about
this hasn't been on the level of rational arguments (and he's the only
definite naysayer).

Thanks,
Miklos

Eric Van Hensbergen

Jun 21, 2005, 10:00:16 AM

On 6/21/05, Andrew Morton <ak...@osdl.org> wrote:
>

> v9fs
>
> I'm not sure that this has a sufficiently high
> usefulness-to-maintenance-cost ratio.
>

I think v9fs/9P has some unique aspects which differentiate it from
the other distributed system protocols integrated into Linux:
a) it presents a unified distributed resource sharing protocol. It
will be able to distribute devices, file systems, system services, and
application interfaces.
b) it provides non-caching RPC-style access to synthetic file systems
which could be used with in-kernel file systems such as sysfs or with
user-space synthetics such as those provided by FUSE
c) its implementation supports transport independence enabling easy
support for different interconnects (shared memory, Xen device
channels, RDMA, Infiniband, etc.)

v9fs-2.0 has a somewhat limited audience at the moment - but now that
the initial implementation is more or less complete we are working to
build applications on top of it (and provide a better server). It's
being integrated into cluster projects at LANL and being looked at wrt
virtualization I/O at IBM. It's our hope that these improvements and
cluster applications will motivate more wide-spread use of the v9fs
module.

-eric

Martin Hicks

Jun 21, 2005, 10:20:10 AM

On Mon, Jun 20, 2005 at 11:54:58PM -0700, Andrew Morton wrote:
>
> vm-early-zone-reclaim
>
> Needs some convincing benchmark numbers to back it up. Otherwise OK.

The only benchmarks I have for this were included in my last mail to
linux-mm:

http://marc.theaimsgroup.com/?l=linux-mm&m=111763597218177&w=2

Are they convincing? Well, the patch doesn't seem to make the memory
thrashing case much worse ("make -j" kernbench run) which is a good
thing since the VM is trying to reclaim much earlier.

In the same e-mail I mention that there is a fairly good performance
gain in the optimal case, where processes are tied to a single node and
the node's memory is filled with page cache. With zone reclaim turned
on the "make -j8" kernel build runs in 700 seconds; 735 seconds with
no reclaim.

mh

--
Martin Hicks Wild Open Source Inc.
mo...@wildopensource.com 613-266-2296
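
For scale, the two "make -j8" timings quoted in the mail above work out as follows. This is plain arithmetic on the two numbers given there; the function name is only illustrative.

```python
def relative_saving(baseline_seconds, improved_seconds):
    """Fraction of the baseline wall-clock time that was saved."""
    return (baseline_seconds - improved_seconds) / baseline_seconds


no_reclaim = 735.0    # "make -j8" without zone reclaim, from the mail above
with_reclaim = 700.0  # same build with zone reclaim turned on

saved = relative_saving(no_reclaim, with_reclaim)
print(f"{saved:.1%} less wall-clock time")
```

That is a saving of 35 seconds, a little under 5% of the no-reclaim build time.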

Pavel Machek

Jun 21, 2005, 10:50:35 AM

Hi!

> > This is useful, but there are, AFAIK, two issues:
> >
> > - We're still deadlocked over some permission-checking hacks in there
>
> Oh, god. Let me try to explain this again:
>
> - This is a security issue with unprivileged mounts

Pretty please, just merge it without unprivileged mounts. I see they are
useful, but they are too strange for now. If a user tries mounting by
themselves, they get -EPERM and apply the 10-liner from you. That does not
look like a "fork" or anything serious.
Pavel

--
64 bytes from 195.113.31.123: icmp_seq=28 ttl=51 time=448769.1 ms

Pavel Machek

Jun 21, 2005, 10:50:36 AM

Hi!

> > kexec and kdump
> >
> > I guess we should merge these.
> >

...

> No potential clashes with suspend code, I assume?
>

I test suspend in -mm series from time to time, and it seems ok;
so this one should be safe.

Pavel
--
64 bytes from 195.113.31.123: icmp_seq=28 ttl=51 time=448769.1 ms


Pavel Machek

Jun 21, 2005, 10:50:38 AM

Hi!

> This summarises my current thinking on various patches which are presently
> in -mm. I cover large things and small-but-controversial things. Anything
> which isn't covered here (and that's a lot of material) is probably a "will
> merge", unless it obviously isn't.

I'd like to ask about 802.11 stack and ipw2100 in particular... Is it
"small enough that it did not need mentioning"?
Working wireless in mainline would be great...

Pavel
--
64 bytes from 195.113.31.123: icmp_seq=28 ttl=51 time=448769.1 ms


John Stoffel

Jun 21, 2005, 11:00:20 AM

I'd like to see FUSE merged too, even without user mounts, if only to
get more motivated to actually play with this and see how it works.

John

Marcelo Tosatti

Jun 21, 2005, 11:20:07 AM

> > execute-in-place
> >
> > Will merge. Have the embedded guys commented on the usefulness of
> > this for execute-out-of-ROM?
>
> Switch roles for a mo and put my Cyclades hat on. Probably not useful to
> us at the moment, at least in the case of the products I work on.
> Marcelo?

Well yes, it's definitely very useful for embedded folks where RAM is a
precious resource (not our case at the moment).

I'm not aware of any users of this XIP implementation, maybe Tim Bird or
Russell have reviewed/tested it?

It went through review by the filesystem folks (and I'm pretty sure akpm
knows about that already)...

Hope to be helpful.

Miklos Szeredi

Jun 21, 2005, 11:30:14 AM

> I'd like to see FUSE merged too, even without user mounts, if only to
> get more motivated to actually play with this and see how it works.

You can try it out now. Download from fuse.sf.net, ./configure; make;
make install. It's not as if it was harder than compiling a kernel :)

Miklos

Jeff Garzik

Jun 21, 2005, 11:40:19 AM

Andrew Morton wrote:
> git-ocfs
>
> The OCFS2 filesystem. OK by me, although I'm not sure it's had enough
> review.

Every time I come up with a complaint about ocfs2, the Oracle guys
manage to shoot me down. I think it's OK to merge.

> sparsemem
>
> OK by me for a merge. Need to poke arch maintainers first, check that
> they've looked at it sufficiently closely.

seems sane, though there are some whitespace niggles that should be
cleaned up

> rapidio-*
>
> Will merge.

send through netdev, as is proper

> dlm-*.patch: Red Hat distributed lock manager
>
> Hard. Right now it seems that no in-kernel projects will use this and
> only one out-of-kernel project will use it. Shelve the problem until
> after Kernel Summit, where some light may be shed.
>
> Opinions are sought...

I hate to say it, since it's my own employer, but I agree: wait for
in-kernel users, at the very least.

> connector.patch
>
> Nice idea IMO, but there are still questions around the
> implementation. More dialogue needed ;)
>
> connector-add-a-fork-connector.patch
>
> OK, but needs connector.

I don't like connector

> inotify
>
> There are still concerns about the userspace API and internal
> implementation details. More slogging needed.

We should ask hpa what he needs for kernel.org. Ideally kernel.org
probably wants <something> that facilitates listening to <something> for
a list of files being changed. That would greatly speed up the robots,
and possibly rsync-like activities too.

> pcmcia-*.patch
>
> Makes the pcmcia layer generate hotplug events and deprecates cardmgr.
> Will merge.

Testing? The goal behind the patch is certainly good, but I worry about
exposure.

> cachefs
>
> This is a ton of code which knows rather a lot about pagecache
> internals. It allows the AFS client to cache file contents on a local
> blockdev.
>
> I don't think it's a justified addition for only AFS and I'd prefer to
> see it proven for NFS as well.
>
> Issues around add-page-becoming-writable-notification.patch need to
> be resolved.
>
> cachefs-for-nfs
>
> A recent addition. Needs review from NFS developers and considerably
> more testing.
>
> These things aren't looking likely for 2.6.13.

If I could vote more than once, I would! I really like cachefs, and
have been pushing for its inclusion for a while.

> kexec and kdump
>
> I guess we should merge these.
>
> I'm still concerned that the various device shutdown problems will
> mean that the success rate for crashing kernels is not high enough for
> kdump to be considered a success. In which case in six months time we'll
> hear rumours about vendors shipping wholly different crashdump
> implementations, which would be quite bad.
>
> But I think this has gone as far as it can go in -mm, so it's a bit of
> a punt.

I'm not particularly pleased with these, and indeed vendors ARE shipping
other crashdump methods.

> reiser4
>
> Merge it, I guess.
>
> The patches still contain all the reiser4-specific namespace
> enhancements, only it is disabled, so it is effectively dead code. Maybe
> we should ask that it actually be removed?

The plugin stuff is crap. This is not a filesystem but a filesystem +
new layer. IMO considered in that light, it duplicates functionality
elsewhere.

> v9fs
>
> I'm not sure that this has a sufficiently high
> usefulness-to-maintenance-cost ratio.

agreed (though I think 9P is neat)

> It has been said that a userspace NFS server can be used to get
> full NFS server functionality with FUSE. I think the half-assed kernel
> implementation should be done away with.

"It has been said" -- its true. A userspace NFS server can do 100% of
userspace FS functionality.

Jeff

Miklos Szeredi

Jun 21, 2005, 11:40:16 AM

> > > This is useful, but there are, AFAIK, two issues:
> > >
> > > - We're still deadlocked over some permission-checking hacks in there
> >
> > Oh, god. Let me try to explain this again:
> >
> > - This is a security issue with unprivileged mounts
>
> Pretty please, just merge it without unpriviledged mounts. I see
> they are usefull, but they are too strange for now.

An emotional argument again. What's "strange" about it?

You have a choice of:

1) believe me that the current solution is fine

2) get down and try to understand the damn thing, and then come up
with technical arguments for/against it

I know that 2) takes time and energy, and not a lot of people are
interested enough to go through it, but why on earth do you think it
will _ever_ be easier than now?

Thanks,
Miklos

Avuton Olrich

Jun 21, 2005, 11:40:12 AM

On 6/21/05, Miklos Szeredi <mik...@szeredi.hu> wrote:
> I won't shed many tears if you drop fuse-nfs-export.patch. It would
> at least give the userspace solution some boost.
>
> However the patch is pretty small, and despite it's flaws, I know it's
> used by a number of people.

Why not leave it up to the user as an option, for the time being at
least. Does this somehow break things?

thanks,
avuton

--
Anyone who quotes me in their sig is an idiot. -- Rusty Russell.

Uriel

Jun 21, 2005, 11:50:18 AM

On Tue, Jun 21, 2005 at 08:51:27AM -0500, Eric Van Hensbergen wrote:
> On 6/21/05, Andrew Morton <ak...@osdl.org> wrote:
> >
> > v9fs
> >
> > I'm not sure that this has a sufficiently high
> > usefulness-to-maintenance-cost ratio.

The 9P protocol implemented by v9fs is the result of over a decade of
research in distributed systems at Bell Labs by the original Unix team,
and it has various implementations for other operating systems that have
been used in production systems for many years.

9P is designed to be portable across systems and transport protocols,
it's network transparent, and it gives us interoperability with
Inferno (which can run hosted under Linux already), Plan 9, and p9p, and
implementations for *BSD and other systems are in the works.

9P has the potential to become the standard protocol for distributed
resources and I don't think any of the alternatives come anywhere near
being as well designed, well proven and encompassing.

uriel

Robert Love

Jun 21, 2005, 11:50:09 AM

On Tue, 2005-06-21 at 11:26 -0400, Jeff Garzik wrote:

> > inotify
> >
> > There are still concerns about the userspace API and internal
> > implementation details. More slogging needed.
>
> We should ask hpa what he needs for kernel.org. Ideally kernel.org
> probably wants <something> that facilitates listening to <something> for
> a list of files being changed. That would greatly speed up the robots,
> and possibly rsync-like activities too.

I've talked to some people who've hooked inotify into rsync
successfully. Cool hack.

Robert Love

Matt Porter

Jun 21, 2005, 12:00:41 PM

On Tue, Jun 21, 2005 at 11:26:44AM -0400, Jeff Garzik wrote:
> Andrew Morton wrote:
> > rapidio-*
> >
> > Will merge.
>
> send through netdev, as is proper

rapidio-support-net-driver.patch is the only netdev portion.

-Matt

Lee Revell

Jun 21, 2005, 12:00:38 PM

On Mon, 2005-06-20 at 23:54 -0700, Andrew Morton wrote:
> CONFIG_HZ for x86 and ia64: changes default HZ to 250, make HZ
> Kconfigurable.
>
> Will merge (will switch default to 1000 Hz later if that seems
> necessary)

Are you serious? You're changing the *default* HZ in a stable kernel
series?!?

This is a big regression, it degrades the resolution of system calls.

Lee

Miklos Szeredi

Jun 21, 2005, 12:10:07 PM

> > I won't shed many tears if you drop fuse-nfs-export.patch. It would
> > at least give the userspace solution some boost.
> >
> > However the patch is pretty small, and despite it's flaws, I know it's
> > used by a number of people.
>
> Why not leave it up to the user as an option, for the time being at
> least.

Making it a config option could make sense, yes.

> Does this somehow break things?

You mean outside NFS export? No, it's completely harmless.

NFS export itself is slightly broken (random ESTALE errors), but it's
still useful.

Thanks,
Miklos

Lee Revell

Jun 21, 2005, 12:50:11 PM

On Mon, 2005-06-20 at 23:54 -0700, Andrew Morton wrote:
> CONFIG_HZ for x86 and ia64: changes default HZ to 250, make HZ
> Kconfigurable.
>
> Will merge (will switch default to 1000 Hz later if that seems
> necessary)

How about delaying this until the high res timers patches are ready?
That way you can save power and avoid the latency regression, in fact it
would be a huge improvement from user POV.

Consider a program with a 5ms RT constraint, like a game or mplayer.
Currently it uses the RTC on 2.4/HZ=100 systems and usleep() on
2.6/HZ=1000. Allowing HZ to regress to 250 would force us to handle
2.4, 2.6.1 - 2.6.12, and 2.6.13+ separately. It would be a huge mess.

Lee
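
To make the 5ms concern concrete, here is a rough model of a tick-driven sleep. It assumes (as a simplification, not something stated in the thread) that the request is rounded up to whole ticks and that one extra tick is added so the sleeper never wakes early, which puts the actual delay somewhere between `ticks` and `ticks + 1` tick periods. The function names are hypothetical.

```python
def tick_us(hz):
    """Length of one timer tick in microseconds (exact for the HZ values used here)."""
    return 1_000_000 // hz


def sleep_bounds_us(request_us, hz):
    """Return (best, worst) wake-up delay in microseconds under the model above."""
    # Round the request up to a whole number of ticks.
    ticks = (request_us * hz + 999_999) // 1_000_000
    # The timer is armed for ticks + 1; depending on where in the current
    # tick the call lands, the real delay falls between these two bounds.
    return ticks * tick_us(hz), (ticks + 1) * tick_us(hz)


for hz in (100, 250, 1000):
    best, worst = sleep_bounds_us(5_000, hz)
    print(f"HZ={hz}: a 5 ms sleep takes between {best / 1000} and {worst / 1000} ms")
```

Under this model a 5 ms request lands in 5 to 6 ms at HZ=1000, 8 to 12 ms at HZ=250, and 10 to 20 ms at HZ=100, which is why an application tuned for HZ=1000 would need a different strategy at 250.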

Chris Zankel

Jun 21, 2005, 1:30:20 PM

Andrew Morton wrote:
> xtensa architecture
>
> Is xtensa now, or will it be in the future a sufficiently popular
> architecture to justify the cost of having this code in the tree?
>
> Heaven knows. Will merge.

Andrew,

I understand your concern and am glad that you give Xtensa and the other
smaller non-mainstream architectures a chance.

The Xtensa architecture is highly configurable and usually buried inside
an SOC device. So, if you buy a new printer, digital camera, or cell
phone, there is a chance that there is an Xtensa inside even though you
don't know it (sometimes as a small audio-engine or as a control CPU).
Linux hasn't been adopted widely with Xtensa yet, but with Linux growing
in the embedded space, I am sure it will become much more important;
at least that is what I am betting my time (and spare time) on.

To minimize the impact on other developers, I do understand
that changes that affect all architectures will only be applied to the
mainstream architectures and that the maintainers of the non-mainstream
architectures then have to pick it up. Luckily, the architecture
dependent files have their own confined space in the arch and asm
directories.

In my opinion, as long as an architecture, driver, etc. is maintained
and not obviously obsolete, it should be allowed to remain in the
kernel.

I do have a few small patches in the queue but am struggling with some
changes I want to make to the syscalls that might break some older code.

Thanks,
~Chris

Alan Cox

Jun 21, 2005, 2:20:07 PM

On Maw, 2005-06-21 at 17:26, Lee Revell wrote:
> Consider a program with a 5ms RT constraint, like a game or mplayer.
> Currently it uses the RTC on 2.4/HZ=100 systems and usleep() on
> 2.6/HZ=1000. Allowing HZ to regress to 250 would force us to handle
> 2.4, 2.6.1 - 2.6.12, and 2.6.13+ separately. It would be a huge mess.

Vendors already ship 100Hz and 1KHz kernels. 2.4 and 2.6 are different
already. I can see the argument for not picking another new value
though.

Hans Reiser

Jun 21, 2005, 2:50:13 PM

vs and zam, please comment on what we get from our profiler and spinlock
debugger that the standard tools don't supply. I am sure you have a
reason, but now is the time to articulate it.

We would like to keep the disabled code in there until we have a chance
to prove (or fail to prove) that cycle detection can be resolved
effectively, and then with a solution in hand argue its merits.

Hans

Andi Kleen wrote:

> Also I'm not sure things like coming with its own profiler
> and spinlock debugger are really acceptable. At least this stuff
> should be removed too.
>
> -Andi

Nish Aravamudan

Jun 21, 2005, 3:00:24 PM

On 6/21/05, Lee Revell <rlre...@joe-job.com> wrote:
> On Mon, 2005-06-20 at 23:54 -0700, Andrew Morton wrote:
> > CONFIG_HZ for x86 and ia64: changes default HZ to 250, make HZ
> > Kconfigurable.
> >
> > Will merge (will switch default to 1000 Hz later if that seems
> > necessary)
>
> Are you serious? You're changing the *default* HZ in a stable kernel
> series?!?
>
> This is a big regression, it degrades the resolution of system calls.

Not that my opinion should sway anybody else, but I really would
prefer more of the in-kernel sleep callers were converted to use
human-time units (and thus were verified to be correct) so that
switching HZ will result in the *same* latencies. How much of the gain
from moving to lower HZ values is due to the fact that everything is
requesting 10ms for 1 jiffy of sleep instead of 1ms? It's hard to tell
whether the gain comes from there or from the lower frequency of interrupts.

I've sent out a lot of patches in this direction (using msleep() and
msleep_interruptible(); my soft-timer rework on top of John
Stultz's timeofday rework converts the entire soft-timer subsystem to
use human-time instead of jiffies as its unit of expiration), but
there is still *a lot* of work left to do :( I will keep sending
patches, but am being pulled in other directions currently.

Just my $.02.

Thanks,
Nish
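
The difference between a sleep hard-coded in jiffies and one expressed in milliseconds can be sketched like this. The two helpers are illustrative stand-ins written for this note, not the kernel's own conversion routines, and the round-up rule is an assumption.

```python
def jiffies_to_ms(jiffies, hz):
    """Wall-clock length of a number of jiffies at a given HZ."""
    return jiffies * 1000 / hz


def ms_to_jiffies(ms, hz):
    """Convert milliseconds to jiffies, rounding up so the sleep is never short."""
    return (ms * hz + 999) // 1000


for hz in (100, 250, 1000):
    # A caller that sleeps "1 jiffy" gets a duration that depends on HZ.
    hard_coded_ms = jiffies_to_ms(1, hz)
    # A caller that asks for 10 ms gets roughly the same duration everywhere.
    converted = ms_to_jiffies(10, hz)
    print(f"HZ={hz}: 1 jiffy = {hard_coded_ms} ms; "
          f"10 ms = {converted} jiffies = {jiffies_to_ms(converted, hz)} ms")
```

A one-jiffy sleep shrinks from 10 ms to 4 ms to 1 ms as HZ goes from 100 to 250 to 1000, while the millisecond-based request stays at 10 to 12 ms. That is the sense in which converting callers to human-time units keeps latencies the same across HZ values.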

Christoph Lameter

Jun 21, 2005, 3:30:19 PM

On Tue, 21 Jun 2005, Robert Love wrote:

> > We should ask hpa what he needs for kernel.org. Ideally kernel.org
> > probably wants <something> that facilitates listening to <something> for
> > a list of files being changed. That would greatly speed up the robots,
> > and possibly rsync-like activities too.
>
> I've talked to some people who've hooked inotify into rsync
> successfully. Cool hack.

I noticed that select() is not working on real files. Could inotify
be used to fix select()?

randy_dunlap

Jun 21, 2005, 3:50:10 PM

On Tue, 21 Jun 2005 11:26:44 -0400 Jeff Garzik wrote:

can you be more specific, like you did with reiser4?

any specifics on the "not particularly pleased" part?

I don't think that r4 is just a filesystem either, but you know more
about that than I do.

thanks,
---
~Randy

Robert Love

Jun 21, 2005, 3:50:12 PM

On Tue, 2005-06-21 at 12:22 -0700, Christoph Lameter wrote:

> I noticed that select() is not working on real files. Could inotify
> be used to fix select()?

Select the system call? It should work fine. ;-)

Who is confused?

Robert Love

Christoph Lameter

Jun 21, 2005, 4:00:18 PM

On Tue, 21 Jun 2005, Robert Love wrote:

> On Tue, 2005-06-21 at 12:22 -0700, Christoph Lameter wrote:
>
> > I noticed that select() is not working on real files. Could inotify
> > be used to fix select()?
>
> Select the system call? It should work fine. ;-)

Hmmm. I just wrote an app that uses select to do essentially a "tail"
waiting for new content in a log file. The file descriptors for real disk
files are always ready even if there is no content available for the
application.

The file is positioned at the end of the file after open via lseek.
select tells me that data is available but the read() returns zero bytes.

The current fix on the app level is to check whether useful work was
done as a result of "READY" file descriptors. If the read() operations
do not return any data then the app will simply sleep for a couple of
seconds. So the app degenerates to a kind of poll mode if disk files are
used.
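
The behaviour described above can be reproduced from user space with a few
lines. The following is a minimal sketch in Python rather than the original
application (which is not shown in this thread); the file contents and the
helper name are made up for illustration.

```python
import os
import select
import tempfile

def ready_and_read(path):
    """Open path, seek to EOF, then ask select() whether it is readable."""
    fd = os.open(path, os.O_RDONLY)
    try:
        os.lseek(fd, 0, os.SEEK_END)                     # position at end of file
        readable, _, _ = select.select([fd], [], [], 0)  # zero timeout: just poll
        data = os.read(fd, 4096)                         # nothing new was written
        return (fd in readable), data
    finally:
        os.close(fd)

# Create a small "log file" to test against.
with tempfile.NamedTemporaryFile(delete=False) as tmp:
    tmp.write(b"old log line\n")
    log_path = tmp.name

is_ready, data = ready_and_read(log_path)
os.unlink(log_path)
print("select says ready:", is_ready, "bytes read:", len(data))
```

On Linux the descriptor is expected to be reported as ready even though the
subsequent read() returns no data, which is the situation the application
has to work around.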

Andrew Morton

Jun 21, 2005, 4:00:17 PM

Martin Hicks <mo...@wildopensource.com> wrote:
>
> On Mon, Jun 20, 2005 at 11:54:58PM -0700, Andrew Morton wrote:
> >
> > vm-early-zone-reclaim
> >
> > Needs some convincing benchmark numbers to back it up. Otherwise OK.
>
> The only benchmarks I have for this were included in my last mail to
> linux-mm:
>
> http://marc.theaimsgroup.com/?l=linux-mm&m=111763597218177&w=2
>
> Are they convincing? Well, the patch doesn't seem to make the memory
> thrashing case much worse ("make -j" kernbench run) which is a good
> thing since the VM is trying to reclaim much earlier.
>
> In the same e-mail I mention that there is a fairly good performance
> gain in the optimal case, where processes are tied to a single node and
> the node's memory is filled with page cache. With zone reclaim turned
> on the "make -j8" kernel build runs in 700 seconds; 735 seconds with
> no reclaim.

Ah, OK, I failed to capture that info. (I always have to move the info in
the [patch 0/n] email into the first real patch, and this time I didn't)

Thanks.

Andi Kleen

Jun 21, 2005, 4:10:16 PM

On Tue, Jun 21, 2005 at 11:44:55AM -0700, Hans Reiser wrote:
> vs and zam, please comment on what we get from our profiler and spinlock
> debugger that the standard tools don't supply. I am sure you have a
> reason, but now is the time to articulate it.
>
> We would like to keep the disabled code in there until we have a chance
> to prove (or fail to prove) that cycle detection can be resolved
> effectively, and then with a solution in hand argue its merits.

How about the review of your code base? Has reiser4 ever been
fully reviewed by people outside your group?

Normally full review is a requirement for merging.

Martin Hicks

Jun 21, 2005, 4:10:09 PM

On Tue, Jun 21, 2005 at 12:54:57PM -0700, Andrew Morton wrote:
>
> Ah, OK, I failed to capture that info. (I always have to move the info in
> the [patch 0/n] email into the first real patch, and this time I didn't)

Oops. I'll try to remember to stick the benchmark info into one of the
real patches next time.

mh

--
Martin Hicks Wild Open Source Inc.
mo...@wildopensource.com 613-266-2296

Hans Reiser

Jun 21, 2005, 4:20:11 PM

Jeff Garzik wrote:

>> reiser4
>
> The plugin stuff is crap. This is not a filesystem but a filesystem +
> new layer. IMO considered in that light, it duplicates functionality
> elsewhere.

What functionality where? Please remember that this is per file, per
item, per node, per attribute, per disk format, per bitmap, per super
block, etc., abstracting, not per filesystem abstracting.

Plugins allow a number of things:

1) They allow us to never pay the cost to change the disk format again,
no matter how much we add in future years. This really matters: the
prohibitive cost of disk format changes is the number one impediment to
filesystem improvements, and the primary reason why most filesystems
ossify after time has passed.

2) They allow us to easily structure code for reuse. If we want to
create a new kind of file that is like some other kind of file except
for one thing, we just write the one thing, and then easily reuse all
the other code, and create a new plugin id.

The use of plugins forced all the programmers to think about reusability
at every layer of design. V3 of reiserfs is way too hard to work on and
modify. If you ask one of the team to code something for V3 instead of
V4, they quietly groan at the thought. It is just so much easier to do
in V4.

When I asked one of our team to completely change the key assignment
algorithm for V4 (which controls what things get packed near what in the
tree), he complained that it would take 6 weeks to do it. Under V3 it
would have taken 3 months. It took him 3 days, and now to change it
again would take him 3 hours I think. Oh, by the way, the change
boosted performance dramatically.

If you take the time to become familiar with coding within our plugin
layer, I think you will find yourself wanting the same to exist for
other filesystems. Of course, no other filesystem needs to be impacted
by our plugin layer, and there is no way reiser4 could easily be
rewritten to exist without it (it would be like requiring that the
kernel not use function calls and only use gotos).

Reiser4's plugin layer has as its explicit objective making it possible
for the weekend hacker to add something useful to reiser4 and send it in
for inclusion. We want to democratize filesystem innovation so that
random bright guys who usually work on something other than filesystems
can act on their bright ideas without spending 3 years in the filesystem
field to do it. I believe that most great filesystem innovations are
envisioned by persons not working on filesystems, and go nowhere because
of the especially high cost of entry into our field.

I am not as bright as other filesystem developers, and so we must tinker
with three ideas and keep one of them. The speed of tinkering is
crucial to us, and the plugin layer increases that speed several fold.

Please reconsider your view.

Christoph Lameter

Jun 21, 2005, 4:20:07 PM

On Tue, 21 Jun 2005, Zan Lynx wrote:

> I've never tried doing that. It might work, for all I know.

I was told that Linux cannot do this. It always returns the filehandles as
ready for disk files.

Andrew Morton

Jun 21, 2005, 4:20:07 PM

Pavel Machek <pa...@ucw.cz> wrote:
>
> Hi!
>
> > This summarises my current thinking on various patches which are presently
> > in -mm. I cover large things and small-but-controversial things. Anything
> > which isn't covered here (and that's a lot of material) is probably a "will
> > merge", unless it obviously isn't.
>
> I'd like to ask about 802.11 stack and ipw2100 in particular... Is it
> "small enough that it did not need mentioning"?
> Working wireless in mainline would be great...

That's up to Jeff.

Christoph Lameter

Jun 21, 2005, 4:30:11 PM

On Tue, 21 Jun 2005, Robert Love wrote:

> > I was told that Linux cannot do this. It always returns the filehandles as
> > ready for disk files.
>

> Inotify would definitely work.

Well we could use it in kernel to make select() work correctly. For disk
files set up a notification for write and then only return from select if
new data is available.

Zan Lynx

Jun 21, 2005, 4:30:18 PM

On Tue, 2005-06-21 at 13:10 -0700, Christoph Lameter wrote:
> On Tue, 21 Jun 2005, Robert Love wrote:
>
> > > I was told that Linux cannot do this. It always returns the filehandles as
> > > ready for disk files.
> >
> > Inotify would definitely work.
>
> Well we could use it in kernel to make select() work correctly. For disk
> files set up a notification for write and then only return from select if
> new data is available.

You could do it inside glibc.
--
Zan Lynx <zl...@acm.org>

Robert Love

Jun 21, 2005, 4:30:15 PM

On Tue, 2005-06-21 at 13:06 -0700, Christoph Lameter wrote:
> On Tue, 21 Jun 2005, Zan Lynx wrote:
>
> > I've never tried doing that. It might work, for all I know.
>
> I was told that Linux cannot do this. It always returns the filehandles as
> ready for disk files.

Inotify would definitely work.

Robert Love

Andrew Morton

Jun 21, 2005, 4:40:13 PM

Jeff Garzik <jga...@pobox.com> wrote:
>
> > sparsemem
> >
> > OK by me for a merge. Need to poke arch maintainers first, check that
> > they've looked at it sufficiently closely.
>
> seems sane, though there are some whitespace niggles that should be
> cleaned up
>

There are? I thought I fixed most of them.

*general sigh*. I wish people would absorb CodingStyle. It's not hard,
and fixing the style post-facto creates a real mess. I now have a great
string of kexec patches followed by a "kexec-code-cleanup.patch" which
totally buggers up the patch sequencing and really needs to be split into
18 parts and sprinkled back over the entire series.

> > rapidio-*
> >
> > Will merge.
>
> send through netdev, as is proper
>

OK. But then the master version vanishes into the jgarzik git forest and I
won't know how to get it ;)

> > connector.patch
> >
> > Nice idea IMO, but there are still questions around the
> > implementation. More dialogue needed ;)
> >
> > connector-add-a-fork-connector.patch
> >
> > OK, but needs connector.
>
> I don't like connector
>

How come?

>
> > pcmcia-*.patch
> >
> > Makes the pcmcia layer generate hotplug events and deprecates cardmgr.
> > Will merge.
>
> Testing? The goal behind the patch is certainly good, but I worry about
> exposure.
>

Yes, there will be a few problems I guess. But people are testing it - we
know, because we've had lots of bug reports which were actually due to
greg-pci breakage...

>
> > cachefs
> >
> > This is a ton of code which knows rather a lot about pagecache
> > internals. It allows the AFS client to cache file contents on a local
> > blockdev.
> >
> > I don't think it's a justified addition for only AFS and I'd prefer to
> > see it proven for NFS as well.
> >
> > Issues around add-page-becoming-writable-notification.patch need to
> > be resolved.
> >
> > cachefs-for-nfs
> >
> > A recent addition. Needs review from NFS developers and considerably
> > more testing.
> >
> > These things aren't looking likely for 2.6.13.
>
> If I could vote more than once, I would! I really like cachefs, and
> have been pushing for its inclusion for a while.
>

You've been using it?

> > kexec and kdump
> >
> > I guess we should merge these.
> >
> > I'm still concerned that the various device shutdown problems will
> > mean that the success rate for crashing kernels is not high enough for
> > kdump to be considered a success. In which case in six months time we'll
> > hear rumours about vendors shipping wholly different crashdump
> > implementations, which would be quite bad.
> >
> > But I think this has gone as far as it can go in -mm, so it's a bit of
> > a punt.
>
> I'm not particularly pleased with these,

How come?

> and indeed vendors ARE shipping
> other crashdump methods.

Which ones?

>
> > reiser4
> >
> > Merge it, I guess.
> >
> > The patches still contain all the reiser4-specific namespace
> > enhancements, only it is disabled, so it is effectively dead code. Maybe
> > we should ask that it actually be removed?
>

> The plugin stuff is crap. This is not a filesystem but a filesystem +
> new layer. IMO considered in that light, it duplicates functionality
> elsewhere.
>

hm.

Christoph Hellwig

Jun 21, 2005, 4:40:12 PM

On Tue, Jun 21, 2005 at 09:56:43PM +0200, Andi Kleen wrote:
> On Tue, Jun 21, 2005 at 11:44:55AM -0700, Hans Reiser wrote:
> > vs and zam, please comment on what we get from our profiler and spinlock
> > debugger that the standard tools don't supply. I am sure you have a
> > reason, but now is the time to articulate it.
> >
> > We would like to keep the disabled code in there until we have a chance
> > to prove (or fail to prove) that cycle detection can be resolved
> > effectively, and then with a solution in hand argue its merits.
>
> How about the review of your code base? Has reiser4 ever been
> fully reviewed by people outside your group?

I don't think so. Everyone used the previous criteria of the broken
core changes, broken filesystem semantics and its own useless abstraction
layer (*) as an excuse not to look deeply at this huge mess yet.

(*) which isn't gone yet. and I need to look again if the core changes
are okay yet. And the broken semantics should go completely as well, if
you want to reintroduce them it needs to happen at the VFS level anyway.

Andrew Morton

Jun 21, 2005, 4:50:10 PM

Lee Revell <rlre...@joe-job.com> wrote:
>
> On Mon, 2005-06-20 at 23:54 -0700, Andrew Morton wrote:
> > CONFIG_HZ for x86 and ia64: changes default HZ to 250, make HZ
> > Kconfigurable.
> >
> > Will merge (will switch default to 1000 Hz later if that seems
> > necessary)
>
> Are you serious? You're changing the *default* HZ in a stable kernel
> series?!?
>
> This is a big regression, it degrades the resolution of system calls.
>

Well we'll see what happens. As I said, if it's determined to be a real
problem we'll put the default back to 1000Hz prior to 2.6.13 release.

Zan Lynx

Jun 21, 2005, 4:50:11 PM

On Tue, 2005-06-21 at 15:38 -0400, Robert Love wrote:
> On Tue, 2005-06-21 at 12:22 -0700, Christoph Lameter wrote:
>
> > I noticed that select() is not working on real files. Could inotify
> > be used to fix select()?
>
> Select the system call? It should work fine. ;-)
>
> Who is confused?
>
> Robert Love

Sounds interesting. tail -f could use it. Instead of sleep 1, seek to
current position, read to eof; just select() for read on the file and
sleep in select() until someone else writes to that file.

I've never tried doing that. It might work, for all I know.
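
For reference, the polling approach mentioned above (sleep, seek to the
current position, read to EOF) can be sketched roughly as follows. This is
an illustrative Python sketch, not the actual tail source; the function
names and the one-second interval are assumptions.

```python
import os
import time

def read_appended(fd, offset):
    """Seek to the last known offset and read everything up to EOF."""
    os.lseek(fd, offset, os.SEEK_SET)
    chunks = []
    while True:
        chunk = os.read(fd, 4096)
        if not chunk:                 # empty read means we reached EOF
            break
        chunks.append(chunk)
    data = b"".join(chunks)
    return data, offset + len(data)

def follow(path, rounds, interval=1.0):
    """Yield newly appended data, checking the file once per interval."""
    fd = os.open(path, os.O_RDONLY)
    try:
        offset = os.lseek(fd, 0, os.SEEK_END)   # start at the current end
        for _ in range(rounds):
            data, offset = read_appended(fd, offset)
            if data:
                yield data
            time.sleep(interval)                # the "sleep 1" step
    finally:
        os.close(fd)
```

The loop wakes up every interval whether or not anything was written, which
is the cost a blocking wait on the file would avoid.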

--
Zan Lynx <zl...@acm.org>

Christoph Hellwig

Jun 21, 2005, 4:50:08 PM

Hans, we had this discussion before. And the consensus was pretty simple:
the disk structure plugins are your business and fine to keep. The
higher-level plugins that just add another useless abstraction below
file_operation/inode_operation/etc. are not. keep the former and kill
the latter and you're one step further.

Lee Revell

Jun 21, 2005, 5:10:09 PM

On Tue, 2005-06-21 at 13:32 -0700, Andrew Morton wrote:
> Lee Revell <rlre...@joe-job.com> wrote:
> >
> > On Mon, 2005-06-20 at 23:54 -0700, Andrew Morton wrote:
> > > CONFIG_HZ for x86 and ia64: changes default HZ to 250, make HZ
> > > Kconfigurable.
> > >
> > > Will merge (will switch default to 1000 Hz later if that seems
> > > necessary)
> >
> > Are you serious? You're changing the *default* HZ in a stable kernel
> > series?!?
> >
> > This is a big regression, it degrades the resolution of system calls.
> >
>
> Well we'll see what happens. As I said, if it's determined to be a real
> problem we'll put the default back to 1000Hz prior to 2.6.13 release.
>

I just think it's silly to merge CONFIG_HZ this late in the game, when
dynamic tick and high res timers are right around the corner. Seems
like more trouble than it's worth.

Lee

Gerrit Huizenga

Jun 21, 2005, 5:20:05 PM

On Tue, 21 Jun 2005 13:22:04 PDT, Andrew Morton wrote:
> Jeff Garzik <jga...@pobox.com> wrote:
> > > kexec and kdump
> > >
> > > I guess we should merge these.
> > >
> > > I'm still concerned that the various device shutdown problems will
> > > mean that the success rate for crashing kernels is not high enough for
> > > kdump to be considered a success. In which case in six months time we'll
> > > hear rumours about vendors shipping wholly different crashdump
> > > implementations, which would be quite bad.
> > >
> > > But I think this has gone as far as it can go in -mm, so it's a bit of
> > > a punt.
> >
> > I'm not particularly pleased with these,
>
> How come?
>
> > and indeed vendors ARE shipping
> > other crashdump methods.
>
> Which ones?

And which ones that __WORK__? We have a few customers out there from
both distros and all the crash dump methods that I've seen fail either
ALWAYS or ALMOST ALWAYS on customer sites. And yes, we hear about them
and I believe that our partners understand the pain that this causes
us and our customers.

Kexec/kdump has a chance of working reliably. The others are complete
crap.

gerrit

Andrew Morton

Jun 21, 2005, 5:20:08 PM

Gerrit Huizenga <g...@us.ibm.com> wrote:
>
> Kexec/kdump has a chance of working reliably.

IOW: Kexec/kdump has a chance of not working reliably.

Worried.

Ronald G. Minnich

Jun 21, 2005, 5:30:17 PM

On Tue, 21 Jun 2005, Eric Van Hensbergen wrote:

> On 6/21/05, Andrew Morton <ak...@osdl.org> wrote:
> >
> > v9fs
> >
> > I'm not sure that this has a sufficiently high
> > usefulness-to-maintenance-cost ratio.
> >

I got pointed at this discussion. Here are my $.02 on why we at LANL are
interested in v9fs.

We build clusters on the order of 2000 machines at present, with larger
systems coming along. The system which we use to run these clusters is
bproc. While bproc has proven to be very powerful to date, it does have
its limits:
- requires a homogeneous system
- the network protocols it uses, while simple, are somewhat ad-hoc
(as is common in this type of system)
- if you are on a bproc system as user x, using 25% of the system,
you still see 100% of the processes. This is a bit of a security issue.

We have a desire to build single-system-image looking clusters along the
bproc model, but at the same time compose those clusters of, e.g.,
Opterons and G5s. This mixing is highly desirable for computations that
have phases, some of which belong on one type of a machine, and some on
another.

We are going to use v9fs as the glue for our next-generation cluster
software, called 'xcpu'. Xcpu has been implemented on Plan 9 and works
there. I have ported xcpu to Linux, using v9fs as the client side and Russ
Cox's plan9ports server to write servers.

xcpu presents a remote execution service as a 9p server. xcpu has been
tested across architectures and it works very well. By summer 2006, we
hope to have cut over our bproc systems to xcpu.

That's one use for v9fs. We also plan to use v9fs to provide us with
servers for global /proc, monitoring, and control systems for our
clusters.

The global /proc is interesting. bproc provides a global /proc, but it is
incomplete; entries for, e.g., exe and maps are not filled in. bproc also
caches part of the /proc, but the rules about what is cached and what the
timeouts are, are set in the kernel module and not easily changed. We are
going to have an "aggregating" user level 9p server based on
Mirtchovski's aggrfs, which will both aggregate all the cluster nodes,
and have caching rules that make sense in clusters of 1000s of nodes (for
example, it is ok to cache /proc/x/status; there is no need to cache
/proc/x/maps, and you probably don't want to anyway).

A neat capability is that if we give a user, e.g., 25% of the cluster, we
can tailor that user's name space so that they only see their procs and
the 25% of the cluster they own. This is good for security, but also good
for convenience: most users don't really care that some other user is on
75% of the cluster. Global pid spaces are neat in theory, messy in
practice at large scale. I want my global pid space to be global to *me*,
meaning I see the global space of the nodes I care about. The sysadmin,
of course, wants to see everything. All this is possible. V9fs, along with
Linux private name spaces, will allow us to provide this model: users can
see some or all of the global pid space, depending on need; users can be
constrained to only see part of the global pid space, depending on other
issues.

9p will also replace the Supermon protocol, allowing people to easily view
status information in a file system.

In addition to the cluster usage, there is also grid usage. The 9grid,
composed of plan 9 systems, is connected by 9p servers. Linux systems can
join the 9grid with no problem, once Linux has v9fs.

Were v9fs just a file system, I would not really be interested in it one
way or another; we have NFS, after all. But v9fs is really the key piece
of a new model of cluster services we are building at LANL. 9p will be the
glue, and v9fs will be the needed client side for hooking 9p servers into
the file system name space.

I'm hoping we can see v9fs in the kernel someday.

thanks

ron

Gerrit Huizenga

Jun 21, 2005, 5:40:10 PM

On Tue, 21 Jun 2005 14:04:41 PDT, Andrew Morton wrote:
> Gerrit Huizenga <g...@us.ibm.com> wrote:
> >
> > Kexec/kdump has a chance of working reliably.
>
> IOW: Kexec/kdump has a chance of not working reliably.
>
> Worried.

No worries. Machine locks up hard, hardware failures, etc., there
is a possibility that nothing but a hard reset can unlock a machine.
But that is rare and outside the scope of the simple locking problems
that today prevent crash dumps. There are still some rough edges in
PCI shutdown code, reinitialization, etc. that will need to be shaken
out over time with more experience, but those at least can be addressed
in the fundamental architecture of kexec/kdump.

About the only possible solution that *might* be fail proof (and even
that case I doubt) would be an external dump program under control
of the firmware (assuming the firmware can still run) which does a
copy of memory off to some device without any assistance from the
kernel.

Kexec/kdump needs much wider exposure at this point and there will be
a few bumps along the way. They should be isolated to cases where
the machine is already crashing and the only thing that doesn't work
is the crash dump/restart. And those we will fix like any other bugs.

gerrit

Carsten Otte

Jun 21, 2005, 5:40:17 PM

On 6/21/05, Andrew Morton <ak...@osdl.org> wrote:

> > and indeed vendors ARE shipping
> > other crashdump methods.
>
> Which ones?

For 390, we ship standalone bootable crashdump tools with both sles
and rhel. As for kexec, I'd like to see a kexec based 390 bootloader
in the future which would be more flexible than our current one. So
I'd like to vote for merging kexec/kdump.

Stephen Hemminger

Jun 21, 2005, 6:10:10 PM

POSIX requires that select() on regular files always return true:
http://www.opengroup.org/onlinepubs/009695399/functions/select.html

File descriptors associated with regular files shall always select true for ready to read,
ready to write, and error conditions.

--
Stephen Hemminger <shemm...@osdl.org>

Arjan van de Ven

Jun 21, 2005, 6:50:07 PM

On Tue, 2005-06-21 at 14:04 -0700, Andrew Morton wrote:
> Gerrit Huizenga <g...@us.ibm.com> wrote:
> >
> > Kexec/kdump has a chance of working reliably.
>
> IOW: Kexec/kdump has a chance of not working reliably.
>
> Worried.

it's about rates... you can hose your system so bad that nothing can fix
it.

the "distro" stuff probably succeeds 10% of the time in non-artificial
cases, the kexec one has the potential at least to go 90% (so yes I
admit that I pull these numbers out of my backside but I'm with Gerrit
on this one). The stuff the distros ship is also not so nice in that you
can't dump to LVM or SAN or ... etc; kexec gets all that right. It's
also far more an all-or-nothing thing; if you make the transition you
know you're going to get a good dump; most of the other dumps die
halfway in the process at some point.

Alan Cox

Jun 21, 2005, 7:10:11 PM

On Maw, 2005-06-21 at 21:06, Christoph Lameter wrote:
> On Tue, 21 Jun 2005, Zan Lynx wrote:
>
> > I've never tried doing that. It might work, for all I know.
>
> I was told that Linux cannot do this. It always returns the filehandles as
> ready for disk files.

That is because disk files are always ready - select/poll are for waits
for data (or space) to become available not for events in the sense of
inotify.
That said there *is* scope in the poll() API [but not select()] to add a
new kind of poll notification type.

Jeff Garzik

Jun 21, 2005, 7:30:12 PM

Christoph Lameter wrote:
> On Tue, 21 Jun 2005, Robert Love wrote:
>
>
>>>We should ask hpa what he needs for kernel.org. Ideally kernel.org
>>>probably wants <something> that facilitates listening to <something> for
>>>a list of files being changed. That would greatly speed up the robots,
>>>and possibly rsync-like activities too.
>>
>>I've talked to some people who've hooked inotify into rsync
>>successfully. Cool hack.

>
>
> I noticed that select() is not working on real files. Could inotify
> be used to fix select()?

Non-blocking file I/O is an open issue.

AIO is probably a better path.

Jeff

Pavel Machek

Jun 21, 2005, 7:40:11 PM

Hi!

> > > > This is useful, but there are, AFAIK, two issues:
> > > >
> > > > - We're still deadlocked over some permission-checking hacks in there
> > >
> > > Oh, god. Let me try to explain this again:
> > >
> > > - This is a security issue with unprivileged mounts
> >
> > Pretty please, just merge it without unprivileged mounts. I see
> > they are useful, but they are too strange for now.
>
> An emotional argument again. What's "strange" about it?

Not so emotional argument...

System where users can mount their own filesystems should not be
called "Unix" any more. It introduces new mechanism, similar to
ptrace. It restricts root in ways not seen before. How is
updatedb/locate supposed to work on system with this? How is it going
to interact with backup tools?

Add this to your A): "by tricking some interpreter to think script is
setuid".

> You have a choice of: 1) believe me that the current solution is
> fine

> 2) get down and try to understand the damn thing, and then come up
> with technical arguments for/against it

Argument is "it is **** ugly".

Your fuse.txt explains why it is not security hole. It does not
explain why your interface is the best possible, and what alternative
ways of "not security hole" exist.

Pavel
--
teflon -- maybe it is a trademark, but it should not be.

Christoph Lameter

Jun 21, 2005, 7:50:05 PM

On Tue, 21 Jun 2005, Jeff Garzik wrote:

> Non-blocking file I/O is an open issue.
>
> AIO is probably a better path.

AIO is requiring you to poll and check if I/O is complete. select() does
not require any polling and just needs to be made to work the way it was
intended to.

Jeff Garzik

Jun 21, 2005, 8:10:12 PM

Christoph Lameter wrote:
> On Tue, 21 Jun 2005, Jeff Garzik wrote:
>
>
>>Non-blocking file I/O is an open issue.
>>
>>AIO is probably a better path.
>
>
> AIO is requiring you to poll and check if I/O is complete. select() does

Incorrect. The entire point of AIO is that it's an async callback
system, when the I/O is complete... just like the kernel's internal I/O
request queue system.

Jeff

Hans Reiser

Jun 21, 2005, 9:10:07 PM

Christoph,

Reiser4 users love the plugin concept, and all audiences which have
listened to a presentation on plugins have been quite positive about
it. Many users think it is the best thing about reiser4. Can you
articulate why you are opposed to plugins in more detail? Perhaps you
are simply not as familiar with it as the audiences I have presented
to. Perhaps persons on our mailing list can comment.....

In particular, what is wrong with having a plugin id associated with
every file, storing the pluginid on disk in permanent storage in the
stat data, and having that plugin id define the set of methods that
implement the vfs operations associated with a particular file, rather
than defining VFS methods only at filesystem granularity?
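
To make the idea concrete, here is a toy sketch of per-file method dispatch
keyed by a stored plugin id. It is written in Python purely for
illustration and does not reflect reiser4's actual code or on-disk format;
the plugin ids, class names and the use of zlib are all assumptions.

```python
import zlib

class PlainFile:
    """Plugin that stores file data unchanged."""
    def write(self, data):
        return data
    def read(self, stored):
        return stored

class CompressedFile:
    """Plugin that compresses on write and decompresses on read."""
    def write(self, data):
        return zlib.compress(data)
    def read(self, stored):
        return zlib.decompress(stored)

# Hypothetical plugin table: id -> set of methods for that kind of file.
PLUGINS = {0: PlainFile(), 1: CompressedFile()}

class Inode:
    def __init__(self, plugin_id):
        self.plugin_id = plugin_id   # would be kept in the file's stat data
        self.stored = b""            # stands in for the on-disk contents

def file_write(inode, data):
    # Dispatch per file, not per filesystem.
    inode.stored = PLUGINS[inode.plugin_id].write(data)

def file_read(inode):
    return PLUGINS[inode.plugin_id].read(inode.stored)
```

Two files in the same filesystem can then behave differently: one created
with plugin id 1 is stored compressed, one with id 0 is not, and callers of
file_read() see the same bytes either way.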

What is wrong with having an encryption plugin implemented in this
manner? What is wrong with being able to have some files implemented
using a compression plugin, and others in the same filesystem not?

What is wrong with having one file in the FS use a write only plugin, in
which the encryption key is changed with every append in a forward but
not backward computable manner, and in order to read a file you must
either have a key that is stored on another computer or be reading what
was written after the moment of cracking root?

What is wrong with having a set of critical data files use a CRC
checking file plugin?

What we have hurts no one but us. I have never seen an audience for one
of my talks that thought it hurt us..... most audiences think it is
great.

Let us tinker with our FS, and you tinker with yours, and so long as
what we do does not affect your FS, let the users choose.

In the end, somebody will write a new fs that steals the good ideas from
both of us, and obsoletes us both. They can only do this though if we
are left to be both free to implement differing filesystem designs.

I do not tell you how to design XFS, why are you making my life unpleasant?

Jeff Garzik

Jun 21, 2005, 9:20:06 PM

Hans Reiser wrote:
> Christoph,
>
> Reiser4 users love the plugin concept, and all audiences which have
> listened to a presentation on plugins have been quite positive about
> it. Many users think it is the best thing about reiser4. Can you
> articulate why you are opposed to plugins in more detail? Perhaps you
> are simply not as familiar with it as the audiences I have presented
> to. Perhaps persons on our mailing list can comment.....
>
> In particular, what is wrong with having a plugin id associated with
> every file, storing the pluginid on disk in permanent storage in the
> stat data, and having that plugin id define the set of methods that
> implement the vfs operations associated with a particular file, rather
> than defining VFS methods only at filesystem granularity?

You're basically implementing another VFS layer inside of reiser4, which
is a big layering violation.

This sort of feature should -not- be done at the low-level filesystem level.

What happens if people decide plugins are a good idea, and they want
them in ext3? We need massive surgery to extract the guts from reiser4.

Jeff

Andrew Morton

Jun 21, 2005, 9:30:17 PM

Hans Reiser <rei...@namesys.com> wrote:
>
> What is wrong with having an encryption plugin implemented in this
> manner? What is wrong with being able to have some files implemented
> using a compression plugin, and others in the same filesystem not?
>
> What is wrong with having one file in the FS use a write only plugin, in
> which the encryption key is changed with every append in a forward but
> not backward computable manner, and in order to read a file you must
> either have a key that is stored on another computer or be reading what
> was written after the moment of cracking root?
>
> What is wrong with having a set of critical data files use a CRC
> checking file plugin?

I think the concern here is that this is implemented at the wrong level.

In Linux, a filesystem is some dumb thing which implements
address_space_operations, filesystem_operations, etc.

Advanced features such as those which you describe are implemented on top
of the filesystem, not within it. reiser4 turns it all upside down.

Now, some of the features which you envision are not amenable to
above-the-fs implementations. But some will be, and that's where we should
implement those.
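
As a rough illustration of "on top of the filesystem, not within it", the sketch below layers a checksum-verifying store over a lower store that only knows how to read and write bytes. It is a userspace toy under stated assumptions: store_ops, mem_store and checked_store are hypothetical names, not kernel interfaces, and the checksum is a simple polynomial hash rather than a real CRC.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical interface; not the kernel's VFS. */
struct store_ops {
    int (*write)(void *ctx, const unsigned char *buf, size_t len);
    int (*read)(void *ctx, unsigned char *buf, size_t len);
};

/* "Dumb" lower layer: a fixed in-memory buffer. */
struct mem_store {
    unsigned char data[64];
};

static int mem_write(void *ctx, const unsigned char *buf, size_t len)
{
    struct mem_store *m = ctx;
    if (len > sizeof(m->data))
        return -1;
    memcpy(m->data, buf, len);
    return 0;
}

static int mem_read(void *ctx, unsigned char *buf, size_t len)
{
    struct mem_store *m = ctx;
    if (len > sizeof(m->data))
        return -1;
    memcpy(buf, m->data, len);
    return 0;
}

static const struct store_ops mem_ops = { mem_write, mem_read };

/* Upper layer: remembers a checksum and works over any store_ops. */
struct checked_store {
    const struct store_ops *lower;
    void *lower_ctx;
    unsigned int sum;
};

/* Simple polynomial hash; a stand-in for a real CRC. */
static unsigned int checksum(const unsigned char *buf, size_t len)
{
    unsigned int s = 0;
    for (size_t i = 0; i < len; i++)
        s = s * 31 + buf[i];
    return s;
}

static int checked_write(struct checked_store *c, const unsigned char *buf, size_t len)
{
    c->sum = checksum(buf, len);
    return c->lower->write(c->lower_ctx, buf, len);
}

/* Returns 0 on success, -1 on lower-layer error or checksum mismatch. */
static int checked_read(struct checked_store *c, unsigned char *buf, size_t len)
{
    if (c->lower->read(c->lower_ctx, buf, len) != 0)
        return -1;
    return checksum(buf, len) == c->sum ? 0 : -1;
}
```

Because the upper layer only talks to store_ops, the same checking code would work over any lower store that fills in the two function pointers. That reuse is the property the layering argument is about.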

Andi Kleen

Jun 21, 2005, 9:40:08 PM

First, Hans, let me remind you that this discussion is not XFS vs
reiser4. Christoph does a lot of reviewing and your child definitely
is in serious need of that to be mergeable. I'm sure Christoph is able
to review impartially even when he is involved with other FS.

Jeff Garzik <jga...@pobox.com> writes:
>
> You're basically implementing another VFS layer inside of reiser4,
> which is a big layering violation.
>
> This sort of feature should -not- be done at the low-level filesystem level.
>
> What happens if people decide plugins are a good idea, and they want
> them in ext3? We need massive surgery to extract the guts from
> reiser4.

We already kind of have them, there are toolkits to do stackable FS with
the existing VFS.

However, I suspect Hans is unwilling to redo his file system on this
point. While it looks quite unnecessary, there might be other
areas which deserve more attention. In general, all the code
needs review, even if it is not as controversial as the reinvented VFS.

-Andi

Hans Reiser

Jun 21, 2005, 9:50:12 PM

Andi Kleen wrote:

>On Tue, Jun 21, 2005 at 11:44:55AM -0700, Hans Reiser wrote:
>
>
>>vs and zam, please comment on what we get from our profiler and spinlock
>>debugger that the standard tools don't supply. I am sure you have a
>>reason, but now is the time to articulate it.
>>
>>We would like to keep the disabled code in there until we have a chance
>>to prove (or fail to prove) that cycle detection can be resolved
>>effectively, and then with a solution in hand argue its merits.
>>
>>
>
>How about the review of your code base? Has reiser4 ever been
>fully reviewed by people outside your group?
>

>Normally full review is a requirement for merging.
>
>
V4 has a mailing list, and a large number of testers who read the code
and comment on it. V4 has been reviewed and tested much more than V3
was before merging. Given that we sent it in quite some time ago, your
suggestion that an additional review by unspecified additional others be
a requirement for merging seems untimely. Do you see my point of view
on this?

I would however enjoy receiving coding suggestions at ANY time. We
don't get as much of that as I would like. I would in particular love
to have you Andi Kleen do a full review of V4 if you could be that
generous with your time, as I liked much of the advice you gave us on V3.

Unspecified others doing a review, well, who knows, I will surely take
the time to consider what is said by them though.....

I would prefer to not get reviews from authors of other filesystems who
prefer their own code, skim through our code without taking the time to
grok our philosophy and approach in depth, and then complain that our
code is different from what they chose to write, and think that our
changing to be like them should be mandated. I will not name names here....

Some of the suggestions on our mailing list are great, some reflect a
lack of 5 years working with our code, perhaps I should feed our mailing
list into the linux kernel mailing list so that people on the kernel
mailing list are more aware that we exist and are active?

Jeff Garzik

Jun 21, 2005, 10:10:06 PM

Hans Reiser wrote:
> V4 has a mailing list, and a large number of testers who read the code
> and comment on it. V4 has been reviewed and tested much more than V3
> was before merging. Given that we sent it in quite some time ago, your
> suggestion that an additional review by unspecified additional others be
> a requirement for merging seems untimely. Do you see my point of view
> on this?
>
> I would however enjoy receiving coding suggestions at ANY time. We
> don't get as much of that as I would like. I would in particular love
> to have you Andi Kleen do a full review of V4 if you could be that
> generous with your time, as I liked much of the advice you gave us on V3.
>
> Unspecified others doing a review, well, who knows, I will surely take
> the time to consider what is said by them though.....
>
> I would prefer to not get reviews from authors of other filesystems who
> prefer their own code, skim through our code without taking the time to
> grok our philosophy and approach in depth, and then complain that our
> code is different from what they chose to write, and think that our
> changing to be like them should be mandated. I will not name names here....

The Linux system isn't the greatest in the world, but here's reality:
when a merge is imminent, a lot more attention is paid.

Andrew regularly uses this knowledge of human psychology to his (and
Linux's) benefit :)

The MAJOR downside is that merge-stopping issues are sometimes ignored
until an upstream merge is imminent.

If you want to get your code merged, you gotta work with the system, and
LISTEN to the feedback.

Jeff, who doesn't have a filesystem axe to grind

Andi Kleen

Jun 21, 2005, 10:10:07 PM

On Tue, Jun 21, 2005 at 06:38:07PM -0700, Hans Reiser wrote:
> V4 has a mailing list, and a large number of testers who read the code
> and comment on it. V4 has been reviewed and tested much more than V3
> was before merging. Given that we sent it in quite some time ago, your
> suggestion that an additional review by unspecified additional others be
> a requirement for merging seems untimely. Do you see my point of view
> on this?

The point of the merge review is that people who are familiar with the existing
Linux code double check that the way your code interacts
with the rest of the kernel is sane, does not have obvious bugs and follows the
existing good practice.

Once the code is in mainline it will get maintained and fixed when needed,
but to make this possible without undue work for the mainline hackers, a
full review is needed first.

AFAIK reiserfs has not gotten such a full review yet.

Also good reviewers are rare so it is not a good idea to be picky here.

> Unspecified others doing a review, well, who knows, I will surely take
> the time to consider what is said by them though.....
>
> I would prefer to not get reviews from authors of other filesystems who
> prefer their own code, skim through our code without taking the time to
> grok our philosophy and approach in depth, and then complain that our
> code is different from what they chose to write, and think that our
> changing to be like them should be mandated. I will not name names here....

Someone qualified to review a new file system for inclusion will necessarily
need some experience with Linux file systems, and that can hardly be gotten
without ever having touched one. So you will have to live with other file
system authors commenting on your code.

The main philosophy that is of concern here is also the philosophy of the
core VFS and other kernel interfaces.

> Some of the suggestions on our mailing list are great, some reflect a
> lack of 5 years working with our code, perhaps I should feed our mailing
> list into the linux kernel mailing list so that people on the kernel
> mailing list are more aware that we exist and are active?

Nobody doubts that you are active. There are just doubts that your
code follows the Linux coding standards enough to not be an undue
maintenance burden in mainline. A review, and the changes that follow
from it, could probably fix that.

Hans Reiser

Jun 21, 2005, 10:50:06 PM

Andi Kleen wrote:

> Christoph does a lot of reviewing
>

and he is notorious for making needed linux contributors go away and not
come back, and I won't say which famous person on this mailing list told
me that....

>and your child definitely
>is in serious need of that to be mergeable. I'm sure Christoph is able
>to review impartially even when he is involved with other FS.
>
>

As impartial as a puppy on PCP....

Christoph is aggressive about things he does not take the time to
understand or ask about first. I hate that. I wish he would go away
please. He is not exactly an Ousterhout, Rob Pike, Granger, Mazieres,
Frans Kaashoek, etc., in his accomplishments, so why is he reviewing
other people's filesystems? Reviews are great, how about finding
persons who have created filesystem innovations (and thus are less
likely to reject innovations without understanding them) to do them?

How about review by benchmark instead?

It works, it runs faster than the competition, users like it, we
addressed the core kernel patch complaints, it should go in and receive
the exposure that will result in lots of useful improvements and
suggestions. It seems like we are getting an unusual review process.

I used a bunch of algorithms for which there was a consensus among those
consulted that the algorithms were a bad idea, only the fact that I paid
for the code to be written and required that it be done my way (ignoring
the "you just use me as a typewriter" remarks as best I could) caused
the algorithms to get implemented at all, and the benchmarks then proved
the algorithms were a good idea. V3 performance sucks in major part
because I listened to the consensus when I knew better. I don't really
care for consensus on my FS anymore. If you guys want to understand
what I am doing I am happy to talk about it extensively, but please
don't require that I groupthink. I frankly think that with my
benchmarks, I should be allowed to tinker on my own.

Hans The Mad

Hans Reiser

Jun 21, 2005, 11:00:15 PM

Andi Kleen wrote:

>
> Just there are doubts that your
>code follows the Linux coding standards enough to not be an undue
>maintenance burden in mainline.
>

We get only a few bugfixes from outsiders, and the rest are done by us.
The customers who buy licenses in addition to the GPL from us for
hundreds of thousands of dollars tend to make remarks to the effect of
"we licensed your code for more money in part because it was way easier
to work on than XXX linux filesystem".

I like feedback on our code, and I particularly like feedback from a Mr.
Andi Kleen, but there is no need to tie it to merging. If, however, it
serves as an effective excuse to get some of your time allocated by SuSE
management, sure, go for it.;-)

Hans

Jeff Garzik

Jun 21, 2005, 11:10:05 PM

Hans Reiser wrote:
> I like feedback on our code, and I particularly like feedback from a Mr.
> Andi Kleen, but there is no need to tie it to merging. If, however, it
> serves as an effective excuse to get some of your time allocated by SuSE
> management, sure, go for it.;-)

All merges of new code go like this. You've been around here for a
while, this should not be a shock.

"Hans' team says it's good stuff" is not a criterion for merging.

Jeff

Kyle Moffett

Jun 21, 2005, 11:30:09 PM

On Jun 21, 2005, at 22:47:13, Hans Reiser wrote:

> Andi Kleen wrote:
>> and your child definitely
>> is in serious need of that to be mergeable. I'm sure Christoph is able
>> to review impartially even when he is involved with other FS.
> As impartial as a puppy on PCP....

So where else are you planning to get a competent reviewer who is fluent
in the internals of filesystems? Good reviewers don't grow on trees, and
in order to be able to understand filesystem issues, one must generally
have worked with them before... Besides, what good is insulting others
going to do?

> Christoph is aggressive about things he does not take the time to
> understand or ask about first.

[rant snipped]

From my objective re-reading of his posts, I can see that he is critical
of things that are difficult to understand not just to be critical, but
to provoke additional thought over those portions of the code. In many
cases this leads to better abstractions and simpler code than otherwise.

> How about review by benchmark instead?

/dev/sda is a great filesystem with awesome benchmarks, assuming one only
needs to store a single file. Besides, benchmarks aren't the only thing
important about code. If the interface consists of:

void clear_current_filename(void);
void add_char_to_current_filename(char x);
void read_bytes_from_current_file(char *byte, unsigned long size);
void write_bytes_to_current_file(const char *byte, unsigned long size);

then this is clearly not a good API, regardless of how well or poorly it
may perform.

> It works, it runs faster than the competition, users like it, we
> addressed the core kernel patch complaints, it should go in and receive
> the exposure that will result in lots of useful improvements and
> suggestions. It seems like we are getting an unusual review process.

If you look over other patches in -mm, you will see that your review
process is not unusual, especially given the number of concerns that
other developers have raised over Reiser4.

[irrelevant algorithm rant snipped]

> If you guys want to understand
> what I am doing I am happy to talk about it extensively, but please
> don't require that I groupthink. I frankly think that with my
> benchmarks, I should be allowed to tinker on my own.

Yes, you can tinker on your own all you want. Another project that has
taken that route is GrSecurity, which was alive and well last I checked.

If you don't like others' criticisms, take up your marbles and go home,
just don't expect them to accept your work when you've not fixed it to
community standards.

Cheers,
Kyle Moffett

--
Someone asked me why I work on this free (http://www.fsf.org/philosophy/)
software stuff and not get a real job. Charles Schulz had the best answer:

"Why do musicians compose symphonies and poets write poems? They do it
because life wouldn't have any meaning for them if they didn't. That's why
I draw cartoons. It's my life."
-- Charles Schulz

Jeff Garzik

Jun 21, 2005, 11:40:17 PM

Andrew Morton wrote:
> Pavel Machek <pa...@ucw.cz> wrote:
>
>>Hi!
>>
>>
>>>This summarises my current thinking on various patches which are presently
>>>in -mm. I cover large things and small-but-controversial things. Anything
>>>which isn't covered here (and that's a lot of material) is probably a "will
>>>merge", unless it obviously isn't.
>>
>>I'd like to ask about 802.11 stack and ipw2100 in particular... Is it
>>"small enough that it did not need mentioning"?
>>Working wireless in mainline would be great...
>
>
> That's up to Jeff.

802.11 stack is still too ipw-specific.

Someone needs to get together another driver using 802.11 stack (such as
HostAP, the original location of much of the code).

So, the merge criterion is: something other than ipw uses it.

Otherwise, it'll never be generic...

Jeff, who has several SuSE wireless patches to merge still

Rik Van Riel

Jun 21, 2005, 11:50:06 PM

On Mon, 20 Jun 2005, Andrew Morton wrote:

> git-ocfs
>
> The OCFS2 filesystem. OK by me, although I'm not sure it's had enough
> review.

The only problem I can see with this is that people will want
to use OCFS together with CLVM, and each uses a different cluster
infrastructure.

IMHO it would be good if they both used the same underlying
cluster infrastructure...

--
The Theory of Escalating Commitment: "The cost of continuing mistakes is
borne by others, while the cost of admitting mistakes is borne by yourself."
-- Joseph Stiglitz, Nobel Laureate in Economics

David Teigland

Jun 22, 2005, 12:20:06 AM

On 6/21/05, Andrew Morton <ak...@osdl.org> wrote:
> git-ocfs
>
> The OCFS2 filesystem. OK by me, although I'm not sure it's had enough
> review.

Does this include configfs? I'd especially like to see that sooner
rather than later.

Dave

Andrew Morton

Jun 22, 2005, 12:30:06 AM

David Teigland <dtei...@gmail.com> wrote:
>
> On 6/21/05, Andrew Morton <ak...@osdl.org> wrote:
> > git-ocfs
> >
> > The OCFS2 filesystem. OK by me, although I'm not sure it's had enough
> > review.
>
> Does this include configfs? I'd especially like to see that sooner
> rather than later.

There's not a lot of point in adding a fs which has no in-kernel users.

David Masover

Jun 22, 2005, 12:30:11 AM

-----BEGIN PGP SIGNED MESSAGE-----
Hash: SHA1

Jeff Garzik wrote:
> Hans Reiser wrote:
>
>> Christoph,
>>
>> Reiser4 users love the plugin concept, and all audiences which have
>> listened to a presentation on plugins have been quite positive about
>> it. Many users think it is the best thing about reiser4. Can you
>> articulate why you are opposed to plugins in more detail? Perhaps you
>> are simply not as familiar with it as the audiences I have presented
>> to. Perhaps persons on our mailing list can comment.....
>>
>> In particular, what is wrong with having a plugin id associated with
>> every file, storing the pluginid on disk in permanent storage in the
>> stat data, and having that plugin id define the set of methods that
>> implement the vfs operations associated with a particular file, rather
>> than defining VFS methods only at filesystem granularity?
>
>
> You're basically implementing another VFS layer inside of reiser4, which
> is a big layering violation.

There's been sloppy code in the kernel before. I remember one bit in
particular which was commented "Fuck me gently with a chainsaw." If I
remember correctly, this had all of the PCI ids and the names and
manufacturers of the corresponding devices -- in a data structure -- in
C source code. It was something like one massive array definition, or
maybe it was a structure. I don't remember exactly, but...

The point is, this was in the kernel for quite awhile, and it was so
ugly that someone would rather be fucked with a chainsaw. If something
that bad can make it in the kernel and stay for awhile because it
worked, and no one wanted to replace it -- nowadays, that database is
kept in userland as some nice text format -- then I vote for putting
Reiser4 in the kernel now, and ironing the sloppiness ("violation") out
later. It runs now.

> This sort of feature should -not- be done at the low-level filesystem
> level.

I agree there, too. In fact, some people have suggested that all
"legacy" (read: non-reiser) filesystems should be implemented as Reiser4
plugins, effectively killing VFS.*

So, Reiser4 may eventually take over VFS and be the only Linux
filesystem, but if so, it will have to do it much more slowly. Take the
good ideas -- things like plugins -- and make them at least look like
incremental updates to the current VFS, and make them available to all
filesystems.

Eventually, this would result in a full merge of Reiser and Linux, such
that the only thing left of "Reiser4" are a few plugins -- things like
storage plugins and maybe some more exotic things like fibration that I
don't really understand.

> What happens if people decide plugins are a good idea, and they want
> them in ext3? We need massive surgery to extract the guts from reiser4.

And here is the crucial point. Reiser4 is usable and useful NOW, not
after it has undergone massive surgery, and Namesys is bankrupt, and
users have given up and moved on to XFS. But the massive surgery should
happen eventually, partly to make all filesystems better (see below),
and partly to make the transition easier and more palatable for those
who don't work for Namesys.

* Imagine ext3 as a storage-level plugin for reiser4. You get one
benefit immediately -- lazy allocation. Lazy allocation is nice both
for fragmentation and for busy systems writing and nuking a lot of
temporary files. Imagine a file which is written and then deleted
before it hits disk -- it should never touch the disk, regardless of the
underlying storage layout.

Other benefits are subtler but inevitable. Right now, it would be an
understatement to say that there's duplication of effort between ext3,
xfs, and reiser4. And yet, I don't think there are many core design
decisions that influence my decision as to which I pick up. I just want
it to be fast, stable, and have a stable on-disk format so I don't have
to reformat too often. I honestly don't care about how XFS is
segmenting the disk, or Reiser4 packs really well, or ext3 can be read
as ext2 and flushes to disk every five seconds. These are the kinds of
things which should be set to sane defaults, tunable for enterprise
users, but are not really enough to warrant completely separate code bases.

I would imagine that it wouldn't be too long after this before an
uber-fs rose, something which combined enough of the strong points of
all the existing Linux filesystems that no one would use anything else.
But, Linux still needs support for FAT32, ISO9660, UDF, and all the
other filesystems which won't go away as easily as XFS and ext3, and it
would be nice if these could all share as much code as possible.

I don't know if storage plugins really work that way, but they should.

I think. I don't work here.

-----BEGIN PGP SIGNATURE-----
Version: GnuPG v1.4.1 (GNU/Linux)
Comment: Using GnuPG with Thunderbird - http://enigmail.mozdev.org

iQIVAwUBQrjoNHgHNmZLgCUhAQIYYw/7BWZ0gVvy0ln5tRo405yUoRHJ/jVFBGyP
5pQ610ECMZORVWRO1qP/NXbvGZwDjEggM5iIeahsGqnBWzNGDsB6RslMUZAxzCYy
iAovi0881zoARf3dmhKrDgbGkvNLPTx+/ypa20oRcHLnyI+NAjerUxNuvGc79PJn
Ybm9JhX6tQsqGIKjZye9uNDHCifLqzA1gdxucPwWo9sz4ymzM9FgsMGvb+IxrU50
2a2G2K6AHcH+nkomEHw3xgY3PmUZUy0s2hQDSkJx6dCido7fGZwwykaSXm4IZs9s
xZqBxKx32G/LDnDV3W8gpj8ZisUE+58kefRbIjo4Ml6IzgXvQqpRjaQOuz8JoKDX
9KUV43tcZkPpK+neIYPQYpXCrdMQqB5+ISpbNKuz/Z/abkcVw1042sy0EKM+/VnD
n3Jr7PpSyk0lfCyADr+33qnqPQRFq02gQTohg9FheoMthhV01aaeGW5ExmTM/Wwa
6Dmv/qnn2oNImi+Ebz5u3Lbnz7lL3MVpRL4aoMmEOVUB3xhehnxesf//yacBmwj9
M/3KVae9epwX4K6yi8qQXzH4160IBFHpWUxBLc9LnOZlHQuZ+Fz3BIrbKQBvmaHT
7lrwi0Os6TmiBGMSFd6OUWHcYs4p97aMw30NG33fCL6e8X6oNVFwwnJXZpBPN1Mr
+lrsVAywKEI=
=rHdK
-----END PGP SIGNATURE-----

Jeff Garzik

Jun 22, 2005, 2:10:11 AM

David Masover wrote:
> There's been sloppy code in the kernel before. I remember one bit in
> particular which was commented "Fuck me gently with a chainsaw." If I
> remember correctly, this had all of the PCI ids and the names and
> manufacturers of the corresponding devices -- in a data structure -- in
> C source code. It was something like one massive array definition, or
> maybe it was a structure. I don't remember exactly, but...
>
> The point is, this was in the kernel for quite awhile, and it was so
> ugly that someone would rather be fucked with a chainsaw. If something
> that bad can make it in the kernel and stay for awhile because it
> worked, and no one wanted to replace it -- nowdays, that database is
> kept in userland as some nice text format -- then I vote for putting
> Reiser4 in the kernel now, and ironing the sloppiness ("violation") out
> later. It runs now.

Existence of ugly code is not an excuse to add more.

We have to maintain said ugly code for decades. Maintainability is a
big deal when you deal with the timeframes we deal with.

> So, Reiser4 may eventually take over VFS and be the only Linux
> filesystem, but if so, it will have to do it much more slowly. Take the
> good ideas -- things like plugins -- and make them at least look like
> incremental updates to the current VFS, and make them available to all
> filesystems.

This is how all Linux development is done.

The code evolves over time.

You have just described the path ReiserFS needs to take, in order to get
this stuff into the kernel, in fact.

> And here is the crucial point. Reiser4 is usable and useful NOW, not
> after it has undergone massive surgery, and Namesys is bankrupt, and
> users have given up and moved on to XFS. But the massive surgery should
> happen eventually, partly to make all filesystems better (see below),
> and partly to make the transition easier and more palatable for those
> who don't work for Namesys.

We care about technical merit, not some random company's financial
solvency. Reiser4 has layering violations, and doesn't work with the
current security layer. Those are two biggies.

There is no entitlement to be merged in the upstream kernel. If people
don't like how the Linux kernel is managed, they are free to maintain
their own fork. If it's better, people will use that instead.

> I would imagine that it wouldn't be too long after this before an
> uber-fs rose, something which combined enough of the strong points of
> all the existing Linux filesystems that no one would use anything else.
> But, Linux still needs support for FAT32, ISO9660, UDF, and all the
> other filesystems which won't go away as easily as XFS and ext3, and it
> would be nice if these could all share as much code as possible.
>
>
> I don't know if storage plugins really work that way, but they should.

No, they shouldn't.

> I think. I don't work here.

Obviously.

Jeff

Christoph Lameter

Jun 22, 2005, 4:00:46 AM

On Tue, 21 Jun 2005, Jeff Garzik wrote:

> > AIO is requiring you to poll and check if I/O is complete. select() does
>
> Incorrect. The entire point of AIO is that it's an async callback system, when
> the I/O is complete... just like the kernel's internal I/O request queue
> system.

Hmmm.. Okay it may work like dnotify. You get some signal and
then it's up to you to figure out what was going on. Traditionally select()
does that all for you and tells you which stream got input.

Miklos Szeredi

Jun 22, 2005, 4:00:30 AM

> Not so emotional argument...
>
> System where users can mount their own filesystems should not be
> called "Unix" any more.

It's not. It's "Linux". And anyway, sysadmin may set whatever
owner/group/permissions on '/dev/fuse' to disallow or selectively
allow users to be able to mount FUSE filesystems.

> It introduces new mechanism, similar to ptrace. It restricts root in
> ways not seen before.

Not true. Root squash in NFS has similar effect.

> How is updatedb/locate supposed to work on system with this? How is
> it going to interact with backup tools?

I assure you that it will cause no problems whatsoever. These programs
are able to gracefully handle errors.

> Add this to your A): "by tricking some interpreter to think script is
> setuid".

How would you do that?

> > You have a choice of: 1) believe me that the current solution is
> > fine
>
> > 2) get down and try to understand the damn thing, and then come up
> > with technical arguments for/against it
>
> Argument is "it is **** ugly".

Yeah, that's your opinion. Mine is that it's f****** beautiful ;).

There are plenty of ugly things in Unix/Linux that you've become so
accustomed to, that they no longer seem ugly. Think about the sticky
bit on directories for example. That one was breaking assumptions
left and right when it got introduced, but people came to accept it,
because it's useful.

> Your fuse.txt explains why it is not security hole. It does not
> explain why your interface is the best possible, and what alternative
> ways of "not security hole" exist.

That's because I don't see any alternative. The "preventing user from
tracing root" and "preventing access to user's filesystem by root" must
come together. There doesn't seem to be any other way.

BTW, thanks for reading through fuse.txt :)

Miklos

Matthias Urlichs

Jun 22, 2005, 4:00:38 AM

Hi, Christoph Lameter wrote:

> Well we could use it in kernel to make select() work correctly.

select() already works correctly. It answers the "will I not block if I
call read()/write() on this" question, and since you never block on files
(assuming infinite disk speed ;-) select() will always return True on it.

You can't change this, it's in POSIX.

... or maybe I misunderstood your comment.
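
The behaviour described above can be checked with a small helper that asks select() whether a descriptor is readable right now, using a zero timeout so the call only polls. This is a minimal sketch; readable_now is a made-up name, and the comparison with an empty pipe in the note below is my own illustration rather than something from this thread.

```c
#include <assert.h>
#include <fcntl.h>
#include <stddef.h>
#include <sys/select.h>
#include <unistd.h>

/*
 * Returns 1 if select() reports fd readable without waiting,
 * 0 if not, -1 on error. A zero timeout makes select() poll.
 */
static int readable_now(int fd)
{
    fd_set rfds;
    struct timeval tv = { 0, 0 };

    FD_ZERO(&rfds);
    FD_SET(fd, &rfds);
    if (select(fd + 1, &rfds, NULL, NULL, &tv) < 0)
        return -1;
    return FD_ISSET(fd, &rfds) ? 1 : 0;
}
```

Called on a descriptor for a regular file, this should report the file as readable even when it is empty, since read() would return at once. Called on the read end of an empty pipe, it should report not readable, because a read() there would block.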

--
Matthias Urlichs | {M:U} IT Design @ m-u-it.de | sm...@smurf.noris.de
Disclaimer: The quote was selected randomly. Really. | http://smurf.noris.de
- -
"I don't really miss God
but i sure miss Santa Claus!"
[Courtney Love]

David Masover

Jun 22, 2005, 4:10:16 AM

-----BEGIN PGP SIGNED MESSAGE-----
Hash: SHA1

Christoph Hellwig wrote:

> On Tue, Jun 21, 2005 at 11:25:24PM -0500, David Masover wrote:
>
>>>You're basically implementing another VFS layer inside of reiser4, which
>>>is a big layering violation.
>>
>>There's been sloppy code in the kernel before. I remember one bit in
>>particular which was commented "Fuck me gently with a chainsaw." If I
>>remember correctly, this had all of the PCI ids and the names and
>>manufacturers of the corresponding devices -- in a data structure -- in
>>C source code. It was something like one massive array definition, or
>>maybe it was a structure. I don't remember exactly, but...
>
>
> Every device driver has a big array of corresponding device ids as an
> array in C code - oh my god we're doomed .. not.

I could throw the same sarcasm back at you. We must be doomed because
Reiser does some stuff that VFS already does! Or am I misunderstanding
the complaint?

>>I agree there, too. In fact, some people have suggested that all
>>"legacy" (read: non-reiser) filesystems should be implemented as Reiser4
>>plugins, effectively killing VFS.*
>>
>>So, Reiser4 may eventually take over VFS and be the only Linux
>>filesystem, but if so, it will have to do it much more slowly. Take the
>>good ideas -- things like plugins -- and make them at least look like
>>incremental updates to the current VFS, and make them available to all
>>filesystems.
>
>
> And why exactly would we replace a stable, working abstraction with an
> unproven mess?

How does it get proven if you won't give it a chance as a *separate*
unproven mess, with a big fat EXPERIMENTAL flag, for users to play with?

I know, it exists as a separate patch. But it works now, and I think
the best way to "prove" it would be to package it with the kernel.

-----BEGIN PGP SIGNATURE-----
Version: GnuPG v1.4.1 (GNU/Linux)
Comment: Using GnuPG with Thunderbird - http://enigmail.mozdev.org

iQIVAwUBQrkXY3gHNmZLgCUhAQLQUw//ZFN1KS+2wS/yDMa+/oWXVemZ690sMCLx
ZlKGg82bnv2XxqMXQwuPG9V02oN/D+1bkPmZzr8rD/tm5WshxpAHroIhnp3ZVpRi
lbMwULFQ8Z8fcsY3+YUag4XAUYGK+tmIeZc47FJGL0avsRa3RFJsFm+Kb6E/fi2f
H4wda43rt2CJYD5GqCtMsqyxzHzPclKHq25betIcPWBOqvE5NzQbc2tFTo0n3KMb
vmyZc4B34kiKhrrW/7pZCxDpiGjoxr87F19Tk8IIltM9kAuSVLXgtY/T+2DA2vJE
2N/Offr1rZh9zSq8PGkGoI+K41AaY3CkeYGjUF2eiZd4qwE624/1jUSEg685Puse
091EuIMzdndJYM0H+OsaFtvH9Rc67Hv6yR7aucNF5j8sIam37y7Fl+MToRgJK1+E
7YSpm/Ld61RaPqbJ4mqv4f0fHLTa2SpbFI8vmA1ARuiA+/YtUz9jBjLrPtMo4ppj
VvNTVMmftUgRr1NlQ+MKJO4Kxt4kKQnt1OtUe2y4bjCqO7ldUvPWLKGhsY0EsS0k
9yymlBbhsjTFrY9CsyrThshyHe9ikBVSLY7i16W+KhjLF/FKaq9k93nHd4B5Shni
Km9zyd0DlCUr3Y20SpBDITCWtM0CL0YQzeEW0JJTxVpHIDjh6s65XcBfrlWwEUiw
j/GJZA5h+bw=
=fBov
-----END PGP SIGNATURE-----

Andrew Morton

Jun 22, 2005, 4:10:19 AM

Miklos Szeredi <mik...@szeredi.hu> wrote:
>
> > It would be helpful if we could have a brief description of the feature
> > which you're discussing here. We discussed this a couple of months back,
> > but I've forgotten most of it and it was off-list I think.
> >
> > Doing `grep uid fs/fuse/*.c' gets us to the implementation, yes?
> >
> > Which parts are controversial?
>
> The controversial part is fuse_allow_task() called from
> fuse_permission() and fuse_revalidate() (fs/fuse/dir.c).
>
> This function (as explained by the header comment) disallows access to
> the filesystem for any task, which the filesystem owner (the user who
> did the mount) is not allowed to ptrace.

That's fairly weird. Overloading ptraceability is awkward, but also the
*direction* is wrong. It's saying "if I can ptrace you, you can read my
data". I'd have expected to see "if you can ptrace me, you can access my
data".

> The rationale is that accessing the filesystem gives the filesystem
> implementor ptrace like capabilities (detailed in
> Documentation/filesystems/fuse.txt)

hrm. Makes some sense.

> It is controversial, because obviously root owned tasks are not
> ptrace-able by the user, and so these tasks will be denied access to
> the user mounted filesystem (-EACCES is returned on stat() or any
> other file operation).
>
> However nobody raised _any_ concrete technical problem associated with
> this, and the 4 years of widespread use didn't turn up any either. So
> IMO it's "ugly" only in people's heads and not in reality.

It's ugly ;)

But the problem you're addressing here largely revolves around the fact that
the filesystem implementation is a userspace process which is potentially
owned by a different user. So you need to prevent the mount owner from
peeking at the fs user's activity. That problem is unique to FUSE and so a
solution within fuse is appropriate.

This security feature doesn't sound terribly important to me. So the fuse
server can find out what files I'm looking at. But I've already
deliberately given the fuse server the ability to ptrace my process?

Can we enhance private namespaces so they can squash setuid/setgid? If so,
is that adequate?

Andrew Morton

Jun 22, 2005, 4:10:32 AM

Miklos Szeredi <mik...@szeredi.hu> wrote:
>
> > Not so emotional argument...
> >
> > System where users can mount their own filesystems should not be
> > called "Unix" any more.
>
> It's not. It's "Linux".

It would be helpful if we could have a brief description of the feature
which you're discussing here. We discussed this a couple of months back,
but I've forgotten most of it and it was off-list I think.

Doing `grep uid fs/fuse/*.c' gets us to the implementation, yes?

Which parts are controversial?

How _should_ we implement unprivileged mounts, if not this way?

Martin J. Bligh

Jun 22, 2005, 4:20:09 AM

--Andrew Morton <ak...@osdl.org> wrote (on Tuesday, June 21, 2005 14:04:41 -0700):

> Gerrit Huizenga <g...@us.ibm.com> wrote:
>>
>> Kexec/kdump has a chance of working reliably.
>
> IOW: Kexec/kdump has a chance of not working reliably.
>
> Worried.

Personally I'm more concerned about the design issues - I can't see how
any of the other options are sustainable / workable. Things that require
maintaining their own driver base are just insane. Things that dump from
the panicking kernel are just broken. People want to be able to dump to
disk / network / flash-ram card / god-knows-what, so we need something
that's flexible.

I don't think kdump is perfect and bug-free yet, but at least it has a
design that looks like it'll be workable and sustainable through the future.
Plus it's a small patch on top of kexec, which is useful in its own right
(for fast reboot, etc) so we get to reuse a lot of code.

We could go into how crashdump itself is important (eg. first time failure
capture is critical for customers, less downtime, I can ship you better
data on bugs I find in test, etc, etc) but I kind of assumed most people
were convinced of that by now. Even Linus seemed to think kdump was the
sensible way forward (at KS last year), and he seems to be one of the
most ardent sceptics of crashdump I've ever met ;-)

M.

Hans Reiser

Jun 22, 2005, 4:30:20 AM

Jeff Garzik wrote:

>
>
> "Hans' team says its good stuff" is not a criteria for merging.
>
>

Try benchmarking it. Maybe benchmarks mean more than our
chattering..... at least to the users.....

Eric Van Hensbergen

Jun 22, 2005, 1:20:06 PM

On 6/22/05, Alan Cox <al...@lxorguk.ukuu.org.uk> wrote:
> > 1) only allow users to mount/bind on directories/files where they
> > have unconditional write access.
>
> Like say /tmp. Build a bizarre behaving /tmp and I can do funky stuff
> with some third party suid apps. It's a good start but you probably want
> a stronger policy and one enforced by the user space side not kernel (eg
> "Below ~")
>

Well in the original discussions Miklos had classified directories
that had the sticky bit set (such as /tmp) as out-of-bounds for
user-mounts. However, it's a point well taken. I had originally
proposed having some sort of a policy file (sort of like an extended
fstab with regular expressions) to give more granular control over
where users could and couldn't mount (along with what types of
devices, network servers, and file systems they could mount from).
However, this leans more towards the "super-mount" suid-application
which I think many found undesirable. An alternative would be some
way for the kernel to consult with an application about different
mount policies. I don't know what the right thing is here.

> > 2) enforce NOSUID mount options on user-mounts
>
> 2 is unnecessarily crude. Just enforce suid owner/owner group.
>

I'm dense this morning, not sure what you mean here.

-eric

Eric W. Biederman

Jun 22, 2005, 1:20:05 PM

Andrew Morton <ak...@osdl.org> writes:

> Jeff Garzik <jga...@pobox.com> wrote:
> >
> > > sparsemem
> > >
> > > OK by me for a merge. Need to poke arch maintainers first, check that
> > > they've looked at it sufficiently closely.
> >
> > seems sane, though there are some whitespace niggles that should be
> > cleaned up
> >
>
> There are? I thought I fixed most of them.
>
> *general sigh*. I wish people would absorb CodingStyle. It's not hard,
> and fixing the style post-facto creates a real mess. I now have a great
> string of kexec patches followed by a "kexec-code-cleanup.patch" which
> totally buggers up the patch sequencing and really needs to be split into
> 18 parts and sprinkled back over the entire series.

It looks like I have another hole in my schedule where I can put some
work into kexec so I will see what I can do.

If you want people to absorb CodingStyle it needs to be more explicit.
Most of the things that patch fixes you almost have to read between
the lines of CodingStyle to see, if there is anything backing
them up at all. Until the problems were pointed out to me I simply
could not see them, and reading CodingStyle was not enlightening.
I point this out not to complain but more so people know which
part of the process needs fixing.

> > > kexec and kdump
> > >
> > > I guess we should merge these.
> > >
> > > I'm still concerned that the various device shutdown problems will
> > > mean that the success rate for crashing kernels is not high enough for
> > > kdump to be considered a success. In which case in six months time we'll
> > > hear rumours about vendors shipping wholly different crashdump
> > > implementations, which would be quite bad.
> > >
> > > But I think this has gone as far as it can go in -mm, so it's a bit of
> > > a punt.
> >
> > I'm not particularly pleased with these,
>
> How come?

Please tell.

With respect to users of crashdumps there are at least two groups
converging on kexec based crashdumps. Is there active development
on any of the rest of the tools?

On to the practical response. Kexec has effectively zero
impact on the kernel, except when it is invoked, and the
code is small. Kexec is also useful for a lot more than
just crash dumps. It happens that crashdumps seem to be the only
case where the other alternatives are noticeably less sane.

There is also another important piece about kexec based crashdumps
that is not usually envisioned. The kexec based solution is much more
flexible. For example on a cluster the worst case scenario for
a network based crashdump is all 1000+ nodes will output a crashdump
simultaneously. Poor crashdump server. Whereas with the kexec based
approach it is possible to have the machines send an SNMP alert
and then you can come in and have the machine dump only when you are
ready. Or you might even start a gdb-stubs server and interact
with a live core dump. :) And all of this happens to fall out
naturally from using a kernel after the crash.

There are a few associated bug fixes that are problematic but I think
they are fixable. For the crashdump case the work really is going
into hardening Linux so it can run on imperfectly behaving hardware.
I.e. things that are bugs in any event but that using the kernel to
take a crashdump exacerbates.

Andrew, the good news is that unless something comes up I will have time
starting about Monday to help with the 2.6.13 merge. It looks like
the first thing I should do is split up the indent patch so it pairs
well with the rest. And then I have a few pending patches for the user
space and I think I can fix a number of the device_shutdown problems,
for at least the normal kexec path.

Eric

Theodore Ts'o

Jun 22, 2005, 1:40:16 PM

On Wed, Jun 22, 2005 at 09:16:34AM +0200, Miklos Szeredi wrote:
> The controversial part is fuse_allow_task() called from
> fuse_permission() and fuse_revalidate() (fs/fuse/dir.c).
>
> This function (as explained by the header comment) disallows access to
> the filesystem for any task, which the filesystem owner (the user who
> did the mount) is not allowed to ptrace.
>
> The rationale is that accessing the filesystem gives the filesystem
> implementor ptrace like capabilities (detailed in
> Documentation/filesystems/fuse.txt)
>
> It is controversial, because obviously root owned tasks are not
> ptrace-able by the user, and so these tasks will be denied access to
> the user mounted filesystem (-EACCES is returned on stat() or any
> other file operation).

I don't think this should be objectionable, since we already have
times when root-owned tasks can get EACCES when accessing some
filesystem. This can happen with any distributed filesystem that
enforces real security --- whether it be nfs-root-squash, or the
Andrew Filesystem, or NFSv4. Root can get "permission denied" when
some other userid with appropriate credentials would be allowed to
access the file/directory.

On the other hand, sometimes a root process, or some other user's
process, might _want_ to give permission to allow a trusted FUSE
filesystem the potential to monkey with it (return potentially
untrusted information, or stop it entirely), in exchange for access to
the filesystem. So it would be nice if there was some way that a
process could tell the kernel that it is willing to give permission to
allow certain FUSE filesystems to potentially affect it. Say, via a
fcntl() call, perhaps.

- Ted

Miklos Szeredi

Jun 22, 2005, 1:40:13 PM

> > It's related to the problem of a suid program accessing synthetic
> > filesystem, and filesystem doing something bad to suid program (make
> > it hang, supply bogus data ...). This can be solved by "squashing"
> > suid for the whole namespace (basically the Plan 9 solution).
> > Unfortunately this is not really practical in Linux/Unix.
> >
>
> Just to make sure I understand you - if I don't squash suid for the
> entire name space, a user could mount a malicious synthetic (even with
> NOSUID) and then launch an SUID app from an inherited mount which
> would then traverse to the malicious synthetic.

Yes.

> That's a nasty case I hadn't considered before -- however, what's the
> potential damage there? The user could hold up progress of the SUID
> app that they launched, but that wouldn't necessarily impede system
> progress since system-critical suid apps wouldn't be typically
> launched by a user. I suppose there is the possibility that if
> multiple instances of such an SUID app share a global lock you could
> get into trouble -- do we have any concrete example apps that would
> exhibit this kind of behavior?

I don't know any. But with 'sudo' the potential set of SUID apps is
basically infinite, so you'd have a hard time proving that this sort
of situation won't arise.

> Are there other vulnerabilities that I'm missing?

Another theoretical possibility is that you make the SUID app consume
some resource by feeding it a large-file/deep-directory/etc that quota
would otherwise prevent (you can't do quota on a synthetic filesystem,
without the filesystem's cooperation).

Miklos

Horst von Brand

Jun 22, 2005, 1:50:10 PM

Artem B. Bityuckiy <dede...@yandex.ru> wrote:
> Markus Törnqvist wrote:
> > So merge it as it is

Fix it first. The "merge as it stands" just gives rise to stuff that is
/never/ fixed properly.

> > and move the stuff to the VFS as needed or
> > deemed necessary. And enable the pseudo interface, or at least
> > set it in menuconfig and enable by default, it needs testing too.

Then test it out of the standard tree...

> Reiser4 has a number of great (IMO) things like file as directory,

Urgh.

> atomic operations,

What is atomic that isn't in the standard filesystems? How do you guarantee
it doesn't stop the system dead in its tracks waiting for some very long
transaction to finish?

> different kinds of stat data,

Usefulness? Sounds like kernel bloat leading to userspace bloat and
applications/users wondering what the heck goes on when they don't grok the
particular stat format.

> fibretions, etc,

???

> etc. Some things are not yet ready - doesn't matter. Some of this is of
> general interest, some is Reiser4-dedicated.

I don't see anything that would interest me at least, so you can safely
scratch the "general interest" part.

> New interfaces are needed to allow users to utilize all that.

That is a quite strong argument /against/ it all in my book. It means bloat
in /every/ filesystem, and they have shown to be able to do without for
some 30 years now. I'd need /very/ strong reasons for adding something.

> My point
> is that the things that are of general interest must not be
> Reiser4-only.

Reiser4-only stuff is of very limited use, if it isn't just internal
stuff. And that doesn't need any changes.

> For example, I should have a possibility to implement
> files-like-dir in _another_ FS using the same interfaces that Reiser4
> uses. That's all I wanted to say.

It has been argued over and over that that particular feature /can't/ be
implemented sanely anyway, so it has to stay out. Besides not making any
sense. "You've got files and directories" is a bit asymmetrical and so not
quite nice; "all you have is directories" is symmetrical, aesthetic, and
completely useless; "some files are directories, some aren't; files in
file-directories are different than regular files in directory-directories"
is just a mindless jumble.

> The other question is that it may be difficult to foresee everything and
> it may make sense to move some things upper in future.

Another good reason to keep it out ;-)
--
Dr. Horst H. von Brand User #22616 counter.li.org
Departamento de Informatica Fono: +56 32 654431
Universidad Tecnica Federico Santa Maria +56 32 654239
Casilla 110-V, Valparaiso, Chile Fax: +56 32 797513

Nish Aravamudan

Jun 22, 2005, 2:10:11 PM

On 6/21/05, Nish Aravamudan <nish.ar...@gmail.com> wrote:
> On 6/21/05, Lee Revell <rlre...@joe-job.com> wrote:
> > On Mon, 2005-06-20 at 23:54 -0700, Andrew Morton wrote:
> > > CONFIG_HZ for x86 and ia64: changes default HZ to 250, make HZ
> > > Kconfigurable.
> > >
> > > Will merge (will switch default to 1000 Hz later if that seems
> > > necessary)
> >
> > Are you serious? You're changing the *default* HZ in a stable kernel
> > series?!?
> >
> > This is a big regression, it degrades the resolution of system calls.
>
> Not that my opinion should sway anybody else, but I really would
> prefer more of the in-kernel sleep callers were converted to use
> human-time units (and thus were verified to be correct) so that
> switching HZ will result in the *same* latencies. How much of moving
> to lower HZ values is due to the fact that everything is requesting 10ms
> for 1 jiffy of sleep instead of 1 ms? It's hard to tell if the gain is
> there or from the lower frequency of interrupts.

After some further consideration, I don't think that my patches would
be at all changed by adjusting HZ's default value. I just want to make
sure maintainers are still responsive to appropriate patches to split
time-based delays from tick-based delays. So, CONFIG_HZ is ok by me,
but I consider it a band-aid.
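
The arithmetic behind the tick-versus-time distinction can be sketched in userspace C. Both helpers below are hypothetical stand-ins written for illustration (inside the kernel, msecs_to_jiffies() plays the conversion role), and the round-up rule is an assumption chosen so that a sleep is never shorter than requested.

```c
/* Length of one tick (jiffy) in microseconds for a given HZ. */
unsigned long tick_usec(unsigned long hz)
{
    return 1000000UL / hz;
}

/*
 * Convert a delay in milliseconds to ticks, rounding up so the caller
 * never sleeps for less time than it asked for.
 */
unsigned long ms_to_ticks(unsigned long ms, unsigned long hz)
{
    return (ms * hz + 999UL) / 1000UL;
}
```

Under these assumptions a caller that sleeps for "1 jiffy" waits 10 ms at HZ=100, 4 ms at HZ=250 and 1 ms at HZ=1000, while a caller that asks for 10 ms gets 1, 3 and 10 ticks respectively, that is 10, 12 and 10 ms. Only the time-based caller sees roughly the same latency across HZ values.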

Thanks,
Nish

Miklos Szeredi

Jun 22, 2005, 2:10:11 PM

> I don't think this should be objectionable, since we already have
> times when root-owned tasks can get EACCES when accessing some
> filesystem. This can happen with any distributed filesystem that
> enforces real security --- whether it be nfs-root-squash, or the
> Andrew Filesystem, or NFSv4. Root can get "permission denied" when
> some other userid with appropriate credentials would be allowed to
> access the file/directory.

Right.

> On the other hand, sometimes a root process, or some other user's
> process, might _want_ to give permission to allow a trusted FUSE
> filesystem the potential to monkey with it (return potentially
> untrusted information, or stop it entirely), in exchange for access to
> the filesystem. So it would be nice if there was some way that a
> process could tell the kernel that it is willing to give permission to
> allow certain FUSE filesystems to potentially affect it. Say, via a
> fcntl() call, perhaps.

Hmm. 'su' works for root.

How do you think fcntl() could be used? I think a task flag settable
via prctl() would be more appropriate.

Miklos

Miklos Szeredi

Jun 22, 2005, 5:20:04 AM

> But the problem you're addressing here largely revolves around the fact that
> the filesystem implementation is a userspace process which is potentially
> owned by a different user. So you need to prevent the mount owner from
> peeking at the fs user's activity. That problem is unique to FUSE and so a
> solution within fuse is appropriate.

It's in fact not so unique to FUSE. It would equally well apply to
v9fs or even samba, since both want to allow unprivileged mounts, and
synthetic (or at least user-controlled) file serving.

> This security feature doesn't sound terribly important to me. So the fuse
> server can find out what files I'm looking at. But I've already
> deliberately given the fuse server the ability to ptrace my process?

If it's deliberate, then OK.

However with suid/sgid, this is not a deliberate action of the user
under whose capabilities the process runs. Nor is it in the case where
a daemon is doing some recursive directory traversal.

And it's not just peeking at the filesystem access patterns. A much
more dangerous aspect is controlling _when_ an operation returns
(e.g. delaying it forever), and _what_ it returns (e.g. huge
files/directories).

Of course, this is only truly relevant for systems with untrusted
users. But I do want to make FUSE work securely in those cases too.

For the single user system, the sysadmin can turn this feature off,
and be done with it.

> Can we enhance private namespaces so they can squash setuid/setgid? If so,
> is that adequate?

We could. But that would again be overly restrictive. The goal is to
make the use of FUSE filesystems for users as simple as possible. If
the user has to manage multiple namespaces, each with its own
restrictions, it's becoming a very un-user-friendly environment.

Thanks,
Miklos

Christoph Hellwig

Jun 22, 2005, 6:00:22 AM

> git-ocfs
>
> The OCFS2 filesystem. OK by me, although I'm not sure it's had enough
> review.

I'll try to take a look next week. A major blocker is that it's not
endian-clean yet. Even if other review items were perfect that's something
preventing it from going to mainline completely.

Andrew Morton

Jun 22, 2005, 6:00:24 AM

Miklos Szeredi <mik...@szeredi.hu> wrote:
>
> > > We could. But that would again be overly restrictive. The goal is to
> > > make the use of FUSE filesystems for users as simple as possible. If
> > > the user has to manage multiple namespaces, each with its own
> > > restrictions, it's becoming a very un-user-friendly environment.
> >
> > I'd have thought that it would be possible to offer the same user interface
> > as you currently have with private namespaces. Hide any complexity in the
> > userspace tools? Where's the problem?
>
> Sorry, I don't get it.

I'm asking you to expand on what the problems would be if we were to
enhance the namespace code as suggested. What's the "very un-user-friendly
environment", and why cannot it be made more friendly with appropriate
support tools?

Andrew Morton

Jun 22, 2005, 6:40:17 AM

Miklos Szeredi <mik...@szeredi.hu> wrote:
>
> > > > > We could. But that would again be overly restrictive. The goal is to
> > > > > make the use of FUSE filesystems for users as simple as possible. If
> > > > > the user has to manage multiple namespaces, each with its own
> > > > > restrictions, it's becoming a very un-user-friendly environment.
> > > >
> > > > I'd have thought that it would be possible to offer the same user interface
> > > > as you currently have with private namespaces. Hide any complexity in the
> > > > userspace tools? Where's the problem?
> > >
> > > Sorry, I don't get it.
> >
> > I'm asking you to expand on what the problems would be if we were to
> > enhance the namespace code as suggested.
>
> OK, what I was thinking, is that the user could create a new
> namespace, that has all the filesystems remounted 'nosuid'. This
> wouldn't need any new kernel infrastructure, just a suid-root program
> (e.g. newns_nosuid), that would do a clone(CLONE_NEWNS), then
> recursively remount everything 'nosuid' in the new namespace. Then
> restore the user's privileges, and exec a shell.
>
> In this namespace the user could mount things to his heart's content.
> He could mount over system directories or even the root directory,
> without being able to do any harm.
>
> This is very nice, but a bit impractical, since now all the other
> programs of the user, his desktop environment, login shells etc. won't
> be able to see the userspace filesystems mounted in the private
> namespace.

Yup, that's useless. That makes the whole CLONE_NEWNS idea unworkable,
yes?

Have we exhausted the alternatives?

(If, as you say, v9fs and samba (did you mean smbfs/cifs?) want
unprivileged mounts, shouldn't the code which you have there be moved out
of fs/fuse/ and into fs/?)
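
For reference, the newns_nosuid helper quoted above might look roughly like the sketch below. It is only an illustration under stated assumptions: it uses unshare(CLONE_NEWNS) in place of the clone(CLONE_NEWNS) mentioned in the quote, relies on present-day Linux mount semantics (MS_REC | MS_PRIVATE to stop propagation, MS_REMOUNT | MS_BIND to change per-mount flags), and the function names nosuid_remount_flags() and run_newns_nosuid() are made up for this example. It has to run as root, e.g. from a suid-root binary.

```c
#define _GNU_SOURCE
#include <mntent.h>
#include <sched.h>
#include <stdio.h>
#include <sys/mount.h>
#include <unistd.h>

/*
 * Work out remount flags for one existing mount: keep the restrictive
 * per-mount options it already has (ro, nodev, noexec) and add nosuid.
 */
unsigned long nosuid_remount_flags(const struct mntent *e)
{
    unsigned long flags = MS_REMOUNT | MS_BIND | MS_NOSUID;

    if (hasmntopt(e, "ro"))
        flags |= MS_RDONLY;
    if (hasmntopt(e, "nodev"))
        flags |= MS_NODEV;
    if (hasmntopt(e, "noexec"))
        flags |= MS_NOEXEC;
    return flags;
}

/* New namespace, everything remounted nosuid, privileges dropped, shell. */
int run_newns_nosuid(void)
{
    struct mntent *e;
    FILE *mounts;

    if (unshare(CLONE_NEWNS) != 0) {
        perror("unshare");
        return -1;
    }
    /* Keep the remounts below from propagating to the parent namespace. */
    if (mount(NULL, "/", NULL, MS_REC | MS_PRIVATE, NULL) != 0) {
        perror("make-private");
        return -1;
    }

    mounts = setmntent("/proc/self/mounts", "r");
    if (mounts == NULL) {
        perror("setmntent");
        return -1;
    }
    while ((e = getmntent(mounts)) != NULL) {
        /* Best effort: report failures but carry on with the rest. */
        if (mount(NULL, e->mnt_dir, NULL, nosuid_remount_flags(e), NULL) != 0)
            perror(e->mnt_dir);
    }
    endmntent(mounts);

    /* Restore the invoking user's privileges before handing over control. */
    if (setgid(getgid()) != 0 || setuid(getuid()) != 0) {
        perror("drop privileges");
        return -1;
    }
    execl("/bin/sh", "sh", (char *)NULL);
    perror("execl");
    return -1;
}
```

The shell started at the end sees the nosuid mounts, but processes outside the new namespace do not, which is the limitation raised in the thread.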

Eric Van Hensbergen

Jun 22, 2005, 12:50:07 PM

On 6/22/05, Eric Van Hensbergen <eri...@gmail.com> wrote:
...
>
> If you combine these two restrictions with only allowing unprivileged
> mounts in private name space I think you get 90% there. The only
> thing left to resolve is the best way to allow sharing private name
> spaces between threads/users -- and I still view this as more of
> extended functionality than a hard-requirement.
>

Reviewing my notes, there were a few subtle restrictions I forgot
(most of which were originally suggested by Miklos):

(a) Users can't mount file system types not deemed "safe" (via flag
in the file system type) -- this should help mitigate users
exploiting bugs in existing file systems to interfere with the system.
Judging when a file system type is "safe" is a nasty kettle of fish
though...
(b) Enforce NODEV along with NOSUID so that user-based synthetics
can't have device inodes with compromised permissions, etc.

-eric