[PATCH] thread wakeup fix for 2.4.0-test7

Rusty Russell

unread,

Aug 26, 2000, 3:00:00 AM8/26/00

to Linus Torvalds

In message <Pine.LNX.4.10.100082...@penguin.transmeta.com> you
write:
> On Fri, 25 Aug 2000, Rusty Russell wrote:
> >
> > BUG: thread 1 is doing poll() on an fd, thread 2 closes it,
> > and thread 1 doesn't wake up. Included is a small test program and
> > Makefile.
>
> Why do you call this a bug?
>
> Thread 1 still has the fd open (it's used by the poll()).
>
> The fact that thread 2 removed it from the fd table is immaterial.

Your implementation disagrees with you: poll already returns POLLNVAL.
With the patch it just returns it immediately, rather than returning
it when/if it times out.

Single Unix Specification also disagrees with you:

The close() function will deallocate the file descriptor
indicated by fildes. To deallocate means to make the file
descriptor available for return by subsequent calls to open()
or other functions that allocate file descriptors.

So we could sleep in close (forever, maybe) until the polls are
finished if you prefer.

close doesn't close. This would be a ``bug'', Linus.

Rusty.
--
Hacking time.
-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majo...@vger.kernel.org
Please read the FAQ at http://www.tux.org/lkml/

Linus Torvalds

unread,

Aug 26, 2000, 3:00:00 AM8/26/00

to Rusty Russell

On Sat, 26 Aug 2000, Rusty Russell wrote:
>
> > Why do you call this a bug?
> >
> > Thread 1 still has the fd open (it's used by the poll()).
> >
> > The fact that thread 2 removed it from the fd table is immaterial.
>
> Your implementation disagrees with you: poll already returns POLLNVAL.
> With the patch it just returns it immediately, rather than returning
> it when/if it times out.

Tough. It's still because YOUR program is buggy. You closed a file
descriptor that was in use.

> Single Unix Specification also disagrees with you:
>
> The close() function will deallocate the file descriptor
> indicated by fildes. To deallocate means to make the file
> descriptor available for return by subsequent calls to open()
> or other functions that allocate file descriptors.

No. SuS agrees with me 100%.

The above is exactly what Linux does. The file descriptor is the _integer
number_allocated_to_that_process_ to describe the file. It has got NOTHING
to do with the file itself - think about dup() etc.

The _file_ is kept open by the poll(). The file descriptor is closed.

Think about mmap(): you can do

mmap(fd)
close(fd);
... the _file_ is still active through the mapping ...

and the mmap doesn't go away. Nobody expects the mmap() to go away just
because you closed the fd. It's still there - and in fact SuS _requires_
that it is still there.

The poll case is 100% the same.

As is, btw, read() and write() and just about everything else. Try it. Do

thread 0

read(fd, xxxxx) /* blocks */

close(fd);

... read continues to work ...

and the above is obviously the only "sane" semantics.

"fd" is just a handle. Once you've looked it up, it's not used any more.
Closing it after the fact is a non-issue - you just got rid of the handle
that we've already used.

> So we could sleep in close (forever, maybe) until the polls are
> finished if you prefer.

Nope.

Should close() wait until all memory mappings are gone too? Obviously not.

Linus

Alexander Viro

unread,

Aug 26, 2000, 3:00:00 AM8/26/00

to Linus Torvalds

On Sat, 26 Aug 2000, Linus Torvalds wrote:

> On Sat, 26 Aug 2000, Rusty Russell wrote:
> >
> > > Why do you call this a bug?
> > >
> > > Thread 1 still has the fd open (it's used by the poll()).
> > >
> > > The fact that thread 2 removed it from the fd table is immaterial.
> >
> > Your implementation disagrees with you: poll already returns POLLNVAL.
> > With the patch it just returns it immediately, rather than returning
> > it when/if it times out.

Sigh... poll() API is _not_ thread-safe. It would make a lot of sense if
you would get _new_ (duplicate) descriptors as if you've received an
SCM_RIGHTS packet. The way it is defined you are getting an answer in
terms of descriptors you had when you had called poll(). Unless you've
done the protection in userland you are fscked, because they may have
completely different meanings when you return. Yes, it's asking for
trouble. Yes, inside the kernel such API would be fixed, just to avoid a
shitload of potential races. Unfortunately it's exported to userland, so
changing it is out of question.

> Tough. It's still because YOUR program is buggy. You closed a file
> descriptor that was in use.

Yup.

> The above is exactly what Linux does. The file descriptor is the _integer
> number_allocated_to_that_process_ to describe the file. It has got NOTHING
> to do with the file itself - think about dup() etc.

Not even to the process - to fdesc process group. I.e. descriptor-to-file
binding is a shared resource exposed to userland and should be treated as
such - with userland taking care about not stepping on its own toes.

> The _file_ is kept open by the poll(). The file descriptor is closed.
>
> Think about mmap(): you can do
>
> mmap(fd)
> close(fd);
> ... the _file_ is still active through the mapping ...
>
> and the mmap doesn't go away. Nobody expects the mmap() to go away just
> because you closed the fd. It's still there - and in fact SuS _requires_
> that it is still there.

Erm... Actually *BSD _does_ munmap() on close(). OTOH their handling of
shared descriptor tables is not just broken, it's dancing around with a
huge "screw me" sign on its arse (you can panic the box with UID==-2 and
_no_ permissions on any file or directory - having stdin opened is enough
to work with if you can call pipe(), rfork(), dup2() and close()).

> "fd" is just a handle. Once you've looked it up, it's not used any more.
> Closing it after the fact is a non-issue - you just got rid of the handle
> that we've already used.

Which is the reason why poll() turns out to be a dangerous thing in case
of shared file descriptor tables. Use with care. Kernel can't do anything
about that - if thread A call poll() and gets scheduled away immediately
after the return from the system call... just what could kernel do to stop
the thread B from closing one of descriptors and opening a new file?
Woops, A is thinking that descriptor 69 has something to read from, but
now 69 is not what it used to be when we returned from poll(). Same goes
for pipe(), open(), dup2(), socket() - whatever.

Rusty Russell

unread,

Aug 27, 2000, 3:00:00 AM8/27/00

to Linus Torvalds

In message <Pine.LNX.4.10.100082...@penguin.transmeta.com> yo

u write:
> No. SuS agrees with me 100%.

My braino. You are completely correct about the close() semantics.

> mmap(fd)
> close(fd);
> ... the _file_ is still active through the mapping ...
>
> and the mmap doesn't go away. Nobody expects the mmap() to go away just
> because you closed the fd. It's still there - and in fact SuS _requires_
> that it is still there.
>

> The poll case is 100% the same.

Um, I think not. read/write/mmap/etc. all refer to `the file
associated with the open file descriptor'. poll and select actually
refer to the file descriptor itself. Usually this semantic difference
doesn't matter. In this case it does: any arguments involving
congruence with mmap/read are flawed.

Tested 2.2.15 (fd closed, data becomes available)
1) read returns as normal
2) poll returns POLLNVAL, even if it's readable.
3) select will not return EBADF, nor mark fd readable, even if it's
readable.

AFAICT our behaviour doesn't violate POSIX (although it differs from
Solaris), so the question is: is it OK to leave it as undefined
behavior, and say any program relying on it is buggy?

Rusty.
--
Hacking time.

Alexander Viro

unread,

Aug 27, 2000, 3:00:00 AM8/27/00

to Rusty Russell

On Sun, 27 Aug 2000, Rusty Russell wrote:

> Tested 2.2.15 (fd closed, data becomes available)
> 1) read returns as normal
> 2) poll returns POLLNVAL, even if it's readable.

... unless ufds had been swapped out and close() happens from another
thread while we are paging it in before copying the data out.

> 3) select will not return EBADF, nor mark fd readable, even if it's
> readable.

Ditto.

> AFAICT our behaviour doesn't violate POSIX (although it differs from
> Solaris), so the question is: is it OK to leave it as undefined
> behavior, and say any program relying on it is buggy?

About the only case when program's behaviour is _not_ undefined with such
scenario is the following:

thread A:
poll() on array including descriptor n
thread B:
close(n);do something that leads to event on file that used to be
refered by n
everything else:
make damn sure that no events on that file will happen until B
will do its thing.

Subtle, britlle and nonportable.

kuz...@ms2.inr.ac.ru

unread,

Aug 27, 2000, 3:00:00 AM8/27/00

to Alexander Viro

Hello!

> About the only case when program's behaviour is _not_ undefined with such
> scenario is the following:
>
> thread A:
> poll() on array including descriptor n
> thread B:
> close(n);do something that leads to event on file that used to be
> refered by n

"Something" is already done. It is close() itself.
And this is exactly the thing, which does not work in Linux.

> everything else:
> make damn sure that no events on that file will happen until B
> will do its thing.
>
> Subtle, britlle and nonportable.

It is nonportable _only_ to Linux (and to old freebsds btw,
its user level pthread library had the same bug).
This is de facto standard programming trick in pure MT programs.
It works in solaris and dux. Its direct analogue works in nt and macos x.
I have about dozen of bug reports blaming on this linux feature.
Actually, people who used it are in big troubles because absence
of possibility to shutdown thread sleeping in read/write/connect/poll
forces to add to programs new logic, required only for linux.

Indeed, the trick looks dirty from unix viewpoint, but it is valid,
when "B" is master thread, controlling state machine and timeouts
and "A"s are pure service threads. I.e. usual model in MT programming.

Alexey

Alexander Viro

unread,

Aug 27, 2000, 3:00:00 AM8/27/00

to kuz...@ms2.inr.ac.ru

On Sun, 27 Aug 2000 kuz...@ms2.inr.ac.ru wrote:

> > Subtle, britlle and nonportable.
>
> It is nonportable _only_ to Linux (and to old freebsds btw,
> its user level pthread library had the same bug).

Huh?
<doing cvs update just in case>
<reading sys_generic.c::read() and sys_generic.c::dofileread()>

Nope, world didn't change - read(9) does getfp(9) [== our fget(9)] and
then calls dofileread(9). The former returns the same on all dup()-created
copies of descriptor, the latter doesn't use fd at all unless you build
with KTRACE. Even if you do, the call that might, in principle, depend on
fd is done after fo_read(9).

_If_ somebody had changed FreeBSD so that close() on dup()'d descriptor
aborts read() on the original - he needs severe LARTing. I'm not on a
-CURRENT box right now, so I can't check it immediately, but you bet that
if it really does that idiocy it's going to be a send-pr time. Big way.
And AFAICS the idiocy above is the only way it can do what you've
described - it doesn't distinguish the dup()'d descriptors until the very
end of operation, way past any waiting for incoming data.

> This is de facto standard programming trick in pure MT programs.
> It works in solaris and dux. Its direct analogue works in nt and macos x.
> I have about dozen of bug reports blaming on this linux feature.
> Actually, people who used it are in big troubles because absence
> of possibility to shutdown thread sleeping in read/write/connect/poll
> forces to add to programs new logic, required only for linux.

And *BSD, unless they are using some userland wrappers around the read(2).
Which may be the case, but then I don't see where it becomes a kernel
problem on our end of things.

Alexander Viro

unread,

Aug 27, 2000, 3:00:00 AM8/27/00

to kuz...@ms2.inr.ac.ru

On Sun, 27 Aug 2000, Alexander Viro wrote:

> On Sun, 27 Aug 2000 kuz...@ms2.inr.ac.ru wrote:
>
> > > Subtle, britlle and nonportable.
> >
> > It is nonportable _only_ to Linux (and to old freebsds btw,
> > its user level pthread library had the same bug).
>
> Huh?
> <doing cvs update just in case>
> <reading sys_generic.c::read() and sys_generic.c::dofileread()>
>
> Nope, world didn't change - read(9) does getfp(9) [== our fget(9)] and
> then calls dofileread(9). The former returns the same on all dup()-created
> copies of descriptor, the latter doesn't use fd at all unless you build
> with KTRACE. Even if you do, the call that might, in principle, depend on
> fd is done after fo_read(9).
>
> _If_ somebody had changed FreeBSD so that close() on dup()'d descriptor
> aborts read() on the original - he needs severe LARTing. I'm not on a
> -CURRENT box right now, so I can't check it immediately, but you bet that
> if it really does that idiocy it's going to be a send-pr time. Big way.
> And AFAICS the idiocy above is the only way it can do what you've
> described - it doesn't distinguish the dup()'d descriptors until the very
> end of operation, way past any waiting for incoming data.

PS: after checking uipc_syscall.c it looks like they can't distinguish
between the different dup()'d copies for recvfrom() too - not until the
->pru_soreceive() returns.

And that's -CURRENT, so I really wonder whether the thing is done in the
kernel at all - I don't think that they went for such a violation of
normal UNIX semantics. close(dup(fd)) should have _no_ effect. POSIX locks
violate that rule, but that's one of the best reasons to avoid them like a
plague. Breaking it for read(2)... <shudder>

kuz...@ms2.inr.ac.ru

unread,

Aug 27, 2000, 3:00:00 AM8/27/00

to Alexander Viro

Hello!

> > It is nonportable _only_ to Linux (and to old freebsds btw,
> > its user level pthread library had the same bug).
>
> Huh?

Al, I specially wrote "old" freebsd, because I do not know anything
about new ones. At least, FreeBSD folks promised to fix this, when
they will implement their own linux-like threads. I even do not know,
did they imlemented them eventually or they did not. 8)

Their user-level thread library is surely buggy and this
is not the only bug there.

> And *BSD, unless they are using some userland wrappers around the read(2).
> Which may be the case, but then I don't see where it becomes a kernel
> problem on our end of things.

When kernel leaves process frozen, the bug is in kernel.

Did process close file? Yes. Does user have some references to this file
to control it? No! P...ts. All the arguments sort of "user is fool"
are pure demagogy. This thing is called "leakage" usually and for user
level it looks exactly as leakage. Kernel is not self-consistent. Period.

If you are still in doubts, look at analogies in FS.
You make "ls somedir" in one shell continuosuly.
You make "rmdir somedir; mkdir somedir" in another one.
You get lots of errors "No such directory", but processes
do not hang and unlinked but "still referenced by ls" directories
do not overflow file tables. User is surely makes some crap, but
it is his right 8).

Another analogy, look at mm_struct. It has mm_users and mm_count.
Why? Following situation with files, mm_count is enough. And bug
poor users, when mm is not cleared occasionally.

You can find the same logic in any object maintanance system.

Alexey

kuz...@ms2.inr.ac.ru

unread,

Aug 27, 2000, 3:00:00 AM8/27/00

to Alexander Viro

Hello!

> normal UNIX semantics. close(dup(fd)) should have _no_ effect.

Grrr... No doubts!

File is closed, when all the references from user disappear.

I have already tried to clean this issue times in the past.

File must have two refcounts: f_users and f_count.
f_users counts references from user i.e. from file descriptor
tables (and from a few of descriptor-less files inside kernel).

fget() and fput() work on f_count.

When f_users becomes zero, file is closed, but not destroyed.
Method fops->flush() is called to shutdown it. It wakes up
processes holding f_count on this file, particualry.

When f_count becomes zero, it is destroyed. Method fos->close()
is called before destruction.

Alexander Viro

unread,

Aug 27, 2000, 3:00:00 AM8/27/00

to kuz...@ms2.inr.ac.ru

On Sun, 27 Aug 2000 kuz...@ms2.inr.ac.ru wrote:

> > Which may be the case, but then I don't see where it becomes a kernel
> > problem on our end of things.
>
> When kernel leaves process frozen, the bug is in kernel.

WTF "frozen"? Does
% cat
leave the process frozen? Not? If your read() is uninterruptably blocked -
too fscking bad, but close() is unlikely to get you out of that. If it
isn't - what are you complaining about?

> Did process close file? Yes. Does user have some references to this file
> to control it? No! P...ts. All the arguments sort of "user is fool"

OK, should unlink("foo"); terminate truncate("foo", 1);?

Kstati, eto ne p..ts, a h...nya. I my takogo ne lechim...

> are pure demagogy. This thing is called "leakage" usually and for user
> level it looks exactly as leakage. Kernel is not self-consistent. Period.

> If you are still in doubts, look at analogies in FS.
> You make "ls somedir" in one shell continuosuly.
> You make "rmdir somedir; mkdir somedir" in another one.
> You get lots of errors "No such directory", but processes
> do not hang and unlinked but "still referenced by ls" directories
> do not overflow file tables. User is surely makes some crap, but
> it is his right 8).

Alexey, I'm afraid that I've got a surprise for you. Say rmdir `pwd`
and see if the inode gets released. It doesn' and it shouldn't. Nothing
is getting hung, indeed.

> Another analogy, look at mm_struct. It has mm_users and mm_count.
> Why? Following situation with files, mm_count is enough. And bug
> poor users, when mm is not cleared occasionally.

Excuse me? Sorry, but you've totally misread that code. If the last owner
of mm_struct exits while the thing is in use by some processor running
lazy-TLB process... Guess what - mm_struct hangs around until the next
context switch to non-lazy beast. On all CPUs that had it.

> You can find the same logic in any object maintanance system.
>

Alexander Viro

unread,

Aug 27, 2000, 3:00:00 AM8/27/00

to kuz...@ms2.inr.ac.ru

On Sun, 27 Aug 2000 kuz...@ms2.inr.ac.ru wrote:

> Hello!
>
> > normal UNIX semantics. close(dup(fd)) should have _no_ effect.
>
> Grrr... No doubts!
>
> File is closed, when all the references from user disappear.

Yes? So how about
fd2 = dup(fd1);
read(fd1,...);
close(fd1);

Is the file closed in that case? If no - you've got completely weird
semantics (having dup() changes rules for operations on original fd), if
yes - what happens with your arguments about leaks?

> I have already tried to clean this issue times in the past.
>
> File must have two refcounts: f_users and f_count.
> f_users counts references from user i.e. from file descriptor
> tables (and from a few of descriptor-less files inside kernel).
>
> fget() and fput() work on f_count.
>
> When f_users becomes zero, file is closed, but not destroyed.
> Method fops->flush() is called to shutdown it. It wakes up
> processes holding f_count on this file, particualry.

Wonderful. Now tell me what to do with SCM_RIGHTS cookies, OK? And while
you are at it, what to do with mmap()? Besides, what processes hold
->f_count on /dev/zero and how many lists poor task_struct should be on?

kuz...@ms2.inr.ac.ru

unread,

Aug 27, 2000, 3:00:00 AM8/27/00

to Alexander Viro

Hello!

> too fscking bad, but close() is unlikely to get you out of that. If it
> isn't - what are you complaining about?

Sorry. close() is the only way to do this. 8)

I had to break semantics of shutdown() making it working
on not-connected sockets to help these folks.

> Excuse me? Sorry, but you've totally misread that code. If the last owner
> of mm_struct exits while the thing is in use by some processor running
> lazy-TLB process... Guess what - mm_struct hangs around until the next
> context switch to non-lazy beast. On all CPUs that had it.

The situation is the same _exactly_.

The only thing, why f_count, fget() and fput() are required
is to hold temporary references, when kernel cannot allow
destruction.

User references must be counted separately.

Actually, Linus seems to forget about his own idea
of returning valid error from close(). Think, how it is possible
to return valid error, when fput() is used to close file.

Also, lingering (i.e. waiting for looong time) in fput() smells
interesting. 8)

Alexey

kuz...@ms2.inr.ac.ru

unread,

Aug 27, 2000, 3:00:00 AM8/27/00

to Alexander Viro

Hello!

> > File is closed, when all the references from user disappear.
>
> Yes? So how about
> fd2 = dup(fd1);
> read(fd1,...);
> close(fd1);
>
> Is the file closed in that case? If no - you've got completely weird
> semantics (having dup() changes rules for operations on original fd), if
> yes - what happens with your arguments about leaks?

Dup gets another descriptor and another reference.

Stop for a moment and think. dup() is orthogonal to this siutation.

> Wonderful. Now tell me what to do with SCM_RIGHTS cookies, OK? And while
> you are at it, what to do with mmap()? Besides, what processes hold
> ->f_count on /dev/zero and how many lists poor task_struct should be on?

mmap and SCM_RIGHTS hold f_users of course.

Al, do not pretend, that you do not understand the difference
between user count and reference count. Man working on VFS knows this
by defintion. Seems, you jest.

fget()/fput() are to be used only to prevent destruction.

Alexander Viro

unread,

Aug 27, 2000, 3:00:00 AM8/27/00

to kuz...@ms2.inr.ac.ru

On Sun, 27 Aug 2000 kuz...@ms2.inr.ac.ru wrote:

> Hello!
>
> > too fscking bad, but close() is unlikely to get you out of that. If it
> > isn't - what are you complaining about?
>
> Sorry. close() is the only way to do this. 8)

What had happened with kill()? Sorry, I don't buy that argument - "we
shouldn't leave the process in hanging state - we don't, they are not
hung - oh, but if we put a process into such situation we will have to
use signals [or sendto(), or... depending on the case in question] and
we'ld like to use close()". Kernel doesn't leave them hung. Not more than
it does on read() from the pipe where writer doesn't want to write().
Should we declare read() broken because of (while true; do; done)|grep?

> I had to break semantics of shutdown() making it working
> on not-connected sockets to help these folks.
>
> > Excuse me? Sorry, but you've totally misread that code. If the last owner
> > of mm_struct exits while the thing is in use by some processor running
> > lazy-TLB process... Guess what - mm_struct hangs around until the next
> > context switch to non-lazy beast. On all CPUs that had it.
>
> The situation is the same _exactly_.
>
> The only thing, why f_count, fget() and fput() are required
> is to hold temporary references, when kernel cannot allow
> destruction.

?

> User references must be counted separately.
>
> Actually, Linus seems to forget about his own idea
> of returning valid error from close(). Think, how it is possible
> to return valid error, when fput() is used to close file.

SCM_RIGHTS cookies. mmap. fork(), for that matter.

> Also, lingering (i.e. waiting for looong time) in fput() smells
> interesting. 8)

Yes? Receiving process doesn't care to do recvmsg() on SCM_RIGHTS
datagram. What should happen on close(), again?

Alexander Viro

unread,

Aug 27, 2000, 3:00:00 AM8/27/00

to kuz...@ms2.inr.ac.ru

On Sun, 27 Aug 2000 kuz...@ms2.inr.ac.ru wrote:

> Hello!
>

> > > File is closed, when all the references from user disappear.
> >
> > Yes? So how about
> > fd2 = dup(fd1);
> > read(fd1,...);
> > close(fd1);
> >
> > Is the file closed in that case? If no - you've got completely weird
> > semantics (having dup() changes rules for operations on original fd), if
> > yes - what happens with your arguments about leaks?
>
> Dup gets another descriptor and another reference.

So _which_ case should abort read()?

> Al, do not pretend, that you do not understand the difference
> between user count and reference count. Man working on VFS knows this
> by defintion. Seems, you jest.

I don't understand why you _need_ user count in the first place.

> fget()/fput() are to be used only to prevent destruction.

Alexey, you are mixing files and descriptors. Look: you want to add some
strange state to the whole thing. Fine. Where should it live? If it's
associated with file - too bad, dup() will prevent aborting read() by
close(). If it's on descriptor - WTF does it have struct file?

Linus Torvalds

unread,

Aug 28, 2000, 3:00:00 AM8/28/00

to linux-...@vger.kernel.org

In article <2000082719...@ms2.inr.ac.ru>,

<kuz...@ms2.inr.ac.ru> wrote:
>
>User references must be counted separately.

Sorry, Alexey, but you're just wrong.

The behaviour you seem to advocate is stupid.

End of story.

>Actually, Linus seems to forget about his own idea
>of returning valid error from close(). Think, how it is possible
>to return valid error, when fput() is used to close file.

Bzzt. Wrong answer.

My claim is that a user program that does a close() on a file that is in
use is inherently racy. Is that so hard to understand?

I'm basically saying: if you want reliable behaviour, you should make
sure in user land that you don't close a file descriptor that is in use
somewhere else. Linux _will_ give you reliable and dependable semantics
even if you close() on an fd while it is in use, but I still claim that
you should not do it.

EXACTLY because you lose the error information, if for no other reason.

Imagine that another thread is doing a write() on the fd at the same
time as your thread is doing a close(). I claim that that is a stupid
thing to do, and that you shouldn't do it. Linux makes certain that the
kernel is internally consistent, and you might even choose to depend on
that internal consistency, but your program is doing things it shouldn't
be doing.

You'll lose the errors from the operation, for example. Tough. I'm not
"forgetting" anything. I'm claiming that it's a "Don't do that then"
case.

And I'm right. I'm always right, but in this case I'm just a bit more
right than I usually am.

Just don't close a file descriptor while it is in use. It's that simple.

(I'm also claiming that Linux will give the most reasonable possible
semantics for the case where you _do_ close the file desciptor when it
is in use. I still claim that you shouldn't do it.)

Linus

kuz...@ms2.inr.ac.ru

unread,

Aug 28, 2000, 3:00:00 AM8/28/00

to Alexander Viro

Hello!

> Should we declare read() broken because of (while true; do; done)|grep?

[ Side note. Al, I've said, it is poor demagogy.

I argue particularly to get some sane arguments _against_ f_users,
which could assure myself. I am not interested in propaganda,
I am enough skillfull in it myself. 8) I learned this skill
because I had to assure people that a bug is their one,
rather than not mine.
]

So, let's use technical terms yet.

> SCM_RIGHTS cookies. mmap. fork(), for that matter.

Nope. They cannot use fput(), exactly by the same reason why
fork() increases mm_users in addition to mm_count.

> > Also, lingering (i.e. waiting for looong time) in fput() smells
> > interesting. 8)
>
> Yes? Receiving process doesn't care to do recvmsg() on SCM_RIGHTS
> datagram. What should happen on close(), again?

Nothing. close() decreases f_users and exits, because f_users is not zero.

> > > Yes? So how about
> > > fd2 = dup(fd1);
> > > read(fd1,...);
> > > close(fd1);
> > >
> > > Is the file closed in that case? If no - you've got completely weird
> > > semantics (having dup() changes rules for operations on original fd), if
> > > yes - what happens with your arguments about leaks?
> >
> > Dup gets another descriptor and another reference.
>
> So _which_ case should abort read()?

No case. f_users is not zero, nothing happens.

> Alexey, you are mixing files and descriptors. Look: you want to add some
> strange state to the whole thing. Fine. Where should it live? If it's
> associated with file - too bad, dup() will prevent aborting read() by
> close(). If it's on descriptor - WTF does it have struct file?

Wow! You missed the whole problem. 8)

The problem happens only when threads share file table.

thread A thread B

read(fd);
sleeps...
close(fd);
read returns -EBADF
(poll returns POLLNVAL), if this file has no more users.

What happens now is race window: if close(fd) happens before
read(fd) grabbed fd or after it fput() it, we get EBADF,
otherwise read() hangs forever. (Actually, even absence of invariance
can be considered as bug.)

Note that problem does not happen with normal files, which
are stateless. It is problem of stateful files: sockets, pipes etc.
They need fops->close() to be called to be closed and it is not
called because read() etc. hold file. Dead loop. And it is evident
how to break this loop.

What's about implementation, it is very easy. Actually code
becomes more clean than it is now. Each time, when you create
new client reference to file increase f_users. Client references
are those references, which mean that file is still alive:
open(), socket(), dup() etc. etc. There are a few of places,
when file is kept alive without any file descriptor f.e. sockets
used by nfs, also mmaped files and SCM_RIGHTS.

When closing a descriptor, you mark it unused and
do cleanups associated to descriptor: locks, flushing signals
(or sending special signal, instead), then:

error = 0;
if (atomic_dec_and_test(&file->f_users)) {
if (fops->shutdown)
error = fops->shutdown(file);
}
fput(file)
return error;

where fput() is:

if (atomic_dec_and_test(&file->f_count)) {
if (fops->close)
fops->close(file);
free_filp(file);
}

fops->shutdown is better name for fops->flush().
fops->flush() in its current form seems to be meaningless.

This method is nil for stateless files. For sockets
it refers to shutdown(2) followed by linger, if linger
is enabled. It may block and may return error.

fops->close() does not block, though may sleep.

That's all. fget() does lookup in file table and increases both f_users
and f_count under fd table read_lock. Actually, it is better to name
file_lookup(), but the name is selected.

file_hold(struct file *) does simply atomic_inc(&file->f_count),
it is used when file is already held by caller to get more references.

It is usual scheme, it is used _everywhere_ over the kernel.
Frequently "users" count is implicit, because object is referenced
from fixed places (some hash tables etc.) and scheduled for death,
when it is removed from these tables. But it is required for
objects referenced from multiple varying places, which is the case
for struct file.

Alexey

Alexander Viro

unread,

Aug 28, 2000, 3:00:00 AM8/28/00

to kuz...@ms2.inr.ac.ru

On Mon, 28 Aug 2000 kuz...@ms2.inr.ac.ru wrote:

> > > > Yes? So how about
> > > > fd2 = dup(fd1);
> > > > read(fd1,...);
> > > > close(fd1);

> > So _which_ case should abort read()?
>
> No case. f_users is not zero, nothing happens.

In other words, with three threads we have

read(fd1...) fd2=dup(fd1);
close(fd1);
...
[here read() would succeed]
...
close(fd2);
_not_ aborting read() and in absence of the third thread we _do_ abort
read(). QED: close(dup(fd1)) changes the behaviour of program.

IIRC, you've agreed that it was wrong just a couple of posts upthread.

> What happens now is race window: if close(fd) happens before
> read(fd) grabbed fd or after it fput() it, we get EBADF,
> otherwise read() hangs forever. (Actually, even absence of invariance
> can be considered as bug.)

Yes? How about close() right _after_ read() returns? Should it
retroactively change the return value to -EBADF? You have the race
anyway.

Look: you are asking for completely new mechanism for aborting blocked
IO operations. Occam's Razor applies.

Arguments against that mechanism:
* doesn't work unless caller has all references to file at hands
and remembers about them.
* doesn't provide anything new compared to kill(2).
* in the case when read would succeed it introduces a new effect -
close(dup(fd)) changes the program behaviour.

What are your arguments for the inclusion of this mechanism, aside of the
fact that Solaris has it?

It's not fixing a bug. Application does blocking operation and then uses a
mechanism that would abort that operation on Solaris. This mechanism
doesn't work on Linux. There is another mechanism that works on both; as
the matter of fact it works on every UNIX. You are asking to add the
former to Linux. Question: what for? It's not a demagogy, I honestly don't
see the rationale for that change. That's my problem with your suggestion
and I would really like to hear the explantion on that point.

kuz...@ms2.inr.ac.ru

unread,

Aug 28, 2000, 3:00:00 AM8/28/00

to Alexander Viro

Hello!

> _not_ aborting read() and in absence of the third thread we _do_ abort
> read(). QED: close(dup(fd1)) changes the behaviour of program.
>
> IIRC, you've agreed that it was wrong just a couple of posts upthread.

Read is aborted as soon as file closed. File is closed,
when nobody has access to it. While third thread holds it open,
hothing happens. As soon as all the access points are deleted,
file is closed.

Alive object cannot be orphan. That's invariant to hold.

> Yes? How about close() right _after_ read() returns? Should it
> retroactively change the return value to -EBADF? You have the race
> anyway.

Moreover! If read() has some _data_ read while close() (due to race or
due to SO_RCVLOWAT) it does not return error. It returns data
despite of neither descriptor nor file itself exist to this time.
It is normal.

> Look: you are asking for completely new mechanism for aborting blocked
> IO operations. Occam's Razor applies.

Nope. "New" mechanism already exists, at least for sockets.
It is shutdown(2). But it works only in Linux.

> Arguments against that mechanism:
> * doesn't work unless caller has all references to file at hands
> and remembers about them.
> * doesn't provide anything new compared to kill(2).
> * in the case when read would succeed it introduces a new effect -
> close(dup(fd)) changes the program behaviour.
>
> What are your arguments for the inclusion of this mechanism, aside of the
> fact that Solaris has it?

No ones. Except for ecological ones: I prefer when children
have parents. Orphaned files must be closed.

What a pity, I have already used these your arguments. 8)

Only third one sounds different: not close(dup(fd)) changes
behaviour of program, but close() in thread is poorly defined
and its result is unpredictable, when more than one thread use
this descriptor. So, you (not you, but that poor programmer 8))
are fool, we are all in the white. 8)

The problem is that it is unpredictable in theory.
In practice kernel is able (and does in other unces) predictably
not to lose files and threads and not to hang in sleeps.

> former to Linux. Question: what for?

Do you want really honest answer? Keep it!

Because I am tired to prove that it is not a bug in TCP. 8)

As you can guess, the problem is really essential only for TCP,
because it has to make an active action (FIN) to start close
and because it has another syscalls (connect, accept), which
hang really frequently. And because no OSes provide tools to do this,
so that people learned and use close().

Also, because it is easy to argue to you (and even to Linus 8)),
I know that after some sleep you will calm down, understand my point,
switch to something constructive, and reject or accept it.
But it is impossible to argue to guy, who is porting his (bogus) software
from nt,solaris,dux in hope to make some more money (or PR, if he is more
clever) on linux boom and sees that his app has no chances to work. 8)
If you tell hime about kill(), he will kill you; he learned
MT programming exactly to forget about signals forever. 8)

Alexey

kuz...@ms2.inr.ac.ru

unread,

Aug 28, 2000, 3:00:00 AM8/28/00

to Alexander Viro

Hello!

> get -EBADF, in some - data. And that race is going to be a hell to hunt
> down.

Nothing to hunt. If master closes descriptor, he knows what he does.
If timeout is 5 minutes and data arrived exactly after
5 minutes+-0 microseconds, connection may die, may stay alive.
It is normal race condition with principially undefined result.

The only thing, which kernel could provide, is to avoid race
resulting in infinite sleep.

> Yes, programs may need a way to abort blocked IO. But overloading close()
> is a *very* bad idea.

Al, did I ever say that it is good idea? It is *very* dirty. 8)

But, alas, it is not an abstract _idea_, it is plain fucking _life_.

My idea is to change behaviour to _not_ less sane
(more sane to my veiwpoint, but it is difficult to argue about tastes),
and to live in peace with this cursed life after this. 8)

Alexander Viro

unread,

Aug 28, 2000, 3:00:00 AM8/28/00

to kuz...@ms2.inr.ac.ru

On Mon, 28 Aug 2000 kuz...@ms2.inr.ac.ru wrote:

> Hello!
>
> > _not_ aborting read() and in absence of the third thread we _do_ abort
> > read(). QED: close(dup(fd1)) changes the behaviour of program.
> >
> > IIRC, you've agreed that it was wrong just a couple of posts upthread.
>
> Read is aborted as soon as file closed. File is closed,

read() may succeed _before_ you close the dup()'d descriptor. You are
thinking about the case when the thing is never going to receive anything
_and_ kernel has no way to recognize that, right? (otherwise kernel should
just return 0 and be done with that). Now, look at the case where read()
_will_ eventually succeed (since we are in "no way to tell" area, we can't
exclude all these cases, right?) Look at the scenario above for that case.
You've got close(dup()) racing with read() and close(); in some cases you

get -EBADF, in some - data. And that race is going to be a hell to hunt
down.

Yes, programs may need a way to abort blocked IO. But overloading close()

is a *very* bad idea.

> Moreover! If read() has some _data_ read while close() (due to race or

> due to SO_RCVLOWAT) it does not return error. It returns data
> despite of neither descriptor nor file itself exist to this time.
> It is normal.

See above.

Alan Cox

unread,

Aug 28, 2000, 3:00:00 AM8/28/00

to Rusty Russell

> The close() function will deallocate the file descriptor
> indicated by fildes. To deallocate means to make the file
> descriptor available for return by subsequent calls to open()
> or other functions that allocate file descriptors.
>

> So we could sleep in close (forever, maybe) until the polls are
> finished if you prefer.
>

> close doesn't close. This would be a ``bug'', Linus.

Nope. The fd can be closed and a new one opened while the poll sleeps on the
existing one. The current code is compliant

Rusty Russell

unread,

Aug 29, 2000, 3:00:00 AM8/29/00

to Linus Torvalds

In message <8oci4c$b0l$1...@penguin.transmeta.com> you write:
> >Actually, Linus seems to forget about his own idea
> >of returning valid error from close(). Think, how it is possible
> >to return valid error, when fput() is used to close file.
>
> Bzzt. Wrong answer.
>
> My claim is that a user program that does a close() on a file that is in
> use is inherently racy. Is that so hard to understand?

Easy to understand: you're wrong, that's all 8). It makes perfect
sense for the master thread to close the socket every thread is
waiting on, they all wake up, get -EBADF whether they be in read(),
poll(), etc. They then shut down. Neat, huh?

> (I'm also claiming that Linux will give the most reasonable possible
> semantics for the case where you _do_ close the file desciptor when it
> is in use. I still claim that you shouldn't do it.)

No, Solaris is more programmer-friendly here. That doesn't mean Linux
is wrong, just that the above claim is dubious.

Hope that helps,
Rusty.
--
Hacking time.

Rusty Russell

unread,

Aug 29, 2000, 3:00:00 AM8/29/00

to kuz...@ms2.inr.ac.ru

In message <2000082816...@ms2.inr.ac.ru> you write:
> > > Also, lingering (i.e. waiting for looong time) in fput() smells
> > > interesting. 8)
> >
> > Yes? Receiving process doesn't care to do recvmsg() on SCM_RIGHTS
> > datagram. What should happen on close(), again?
>
> Nothing. close() decreases f_users and exits, because f_users is not zero.

I didn't use second reference count because it needs to be associated
with each fd in struct files_struct. NOT like f_count, which is in
`struct file'.

Consider:

fd2 = dup(fd1);
pthread_create(thread1);
pthread_create(thread2);
fork() => child1;

Thread 0 Thread 1 Thread 2 Child 1

read(fd1); read(fd2); read(fd1);

close(fd1);
MUST -EBADF MUST NOT -EBADF MUST NOT -EBADF

And then it becomes obvious that count is not needed, it is always 1
or 0; ie. only threads waiting on same fd in same struct files_struct
must be woken. My patch (at head of thread) was cc:'d to
vger.rutgers.edu by mistake. See below for original post.

Hope this clarifies,
Rusty.
--
Hacking time.

================
Hi all,

BUG: thread 1 is doing poll() on an fd, thread 2 closes it,
and thread 1 doesn't wake up. Included is a small test program and
Makefile.

Simply waking up everyone polling on a filp on every close
would suck for performance, so I only wake up those waiting with the
same struct files_struct (ie. those sharing the file table).
Unfortunately, the wait queue is not in the common part of the inode,
so this needs to be implemented for every pollable device. I have
only implemented it for sockets, pipes and fifos.

Before patch:
rusty@unpatched$ sudo make
UDP test
unwoken child: SUCCESS
FAILED: didn't wake me
unwoken: SUCCESS
TCP test
unwoken child: SUCCESS
FAILED: didn't wake me
unwoken: SUCCESS
RAW test
unwoken child: SUCCESS
FAILED: didn't wake me
unwoken: SUCCESS
PIPE test
unwoken child: SUCCESS
FAILED: didn't wake me
unwoken: SUCCESS
FIFO test
unwoken child: SUCCESS
FAILED: didn't wake me
unwoken: SUCCESS
CHARDEV test
unwoken child: SUCCESS
FAILED: didn't wake me
unwoken: SUCCESS

After patch:
rusty@patched $ sudo make
cc -Wall -W -o polltest polltest.c -lpthread
UDP test
woken: SUCCESS
unwoken: SUCCESS
unwoken child: SUCCESS
TCP test
woken: SUCCESS
unwoken: SUCCESS
unwoken child: SUCCESS
RAW test
woken: SUCCESS
unwoken: SUCCESS
unwoken child: SUCCESS
PIPE test
woken: SUCCESS
unwoken: SUCCESS
unwoken child: SUCCESS
FIFO test
woken: SUCCESS
unwoken: SUCCESS
unwoken child: SUCCESS
CHARDEV test
unwoken: SUCCESS
unwoken child: SUCCESS
FAILED: didn't wake me

Cheers,
Rusty.
--
Hacking time.

diff -ur -X /tmp/filexiytUc --minimal linux-2.4.0-test7-official/fs/open.c working-2.4.0-test7/fs/open.c
--- linux-2.4.0-test7-official/fs/open.c Fri Aug 25 13:11:36 2000
+++ working-2.4.0-test7/fs/open.c Fri Aug 25 16:18:17 2000
@@ -791,6 +791,8 @@
unlock_kernel();
}
locks_remove_posix(filp, id);
+ if (filp->f_op && filp->f_op->fdclosed)
+ filp->f_op->fdclosed(filp, id);
fput(filp);
return retval;
}
diff -ur -X /tmp/filexiytUc --minimal linux-2.4.0-test7-official/fs/pipe.c working-2.4.0-test7/fs/pipe.c
--- linux-2.4.0-test7-official/fs/pipe.c Fri Aug 25 13:11:36 2000
+++ working-2.4.0-test7/fs/pipe.c Fri Aug 25 15:39:50 2000
@@ -375,6 +375,26 @@
return 0;
}

+static void pipe_fdclosed(struct file *filp, struct files_struct *files)
+{
+ struct inode *inode = filp->f_dentry->d_inode;
+ struct list_head *i;
+ unsigned long flags;
+
+ /* Wake up everyone using same file table. */
+ down(PIPE_SEM(*inode));
+ wq_write_lock_irqsave(&PIPE_WAIT(*inode)->lock, flags);
+ for (i = PIPE_WAIT(*inode)->task_list.next;
+ i != &PIPE_WAIT(*inode)->task_list;
+ i = i->next) {
+ wait_queue_t *curr = list_entry(i, wait_queue_t, task_list);
+ if (curr->task->files == files)
+ wake_up_process(curr->task);
+ }
+ wq_write_unlock_irqrestore(&PIPE_WAIT(*inode)->lock, flags);
+ up(PIPE_SEM(*inode));
+}
+
/*
* The file_operations structs are not static because they
* are also used in linux/fs/fifo.c to do operations on FIFOs.
@@ -387,6 +407,7 @@
ioctl: pipe_ioctl,
open: pipe_read_open,
release: pipe_read_release,
+ fdclosed: pipe_fdclosed,
};

struct file_operations write_fifo_fops = {
@@ -397,6 +418,7 @@
ioctl: pipe_ioctl,
open: pipe_write_open,
release: pipe_write_release,
+ fdclosed: pipe_fdclosed,
};

struct file_operations rdwr_fifo_fops = {
@@ -407,6 +429,7 @@
ioctl: pipe_ioctl,
open: pipe_rdwr_open,
release: pipe_rdwr_release,
+ fdclosed: pipe_fdclosed,
};

struct file_operations read_pipe_fops = {
@@ -417,6 +440,7 @@
ioctl: pipe_ioctl,
open: pipe_read_open,
release: pipe_read_release,
+ fdclosed: pipe_fdclosed,
};

struct file_operations write_pipe_fops = {
@@ -427,6 +451,7 @@
ioctl: pipe_ioctl,
open: pipe_write_open,
release: pipe_write_release,
+ fdclosed: pipe_fdclosed,
};

struct file_operations rdwr_pipe_fops = {
@@ -437,6 +462,7 @@
ioctl: pipe_ioctl,
open: pipe_rdwr_open,
release: pipe_rdwr_release,
+ fdclosed: pipe_fdclosed,
};

struct inode* pipe_new(struct inode* inode)
diff -ur -X /tmp/filexiytUc --minimal linux-2.4.0-test7-official/include/linux/fs.h working-2.4.0-test7/include/linux/fs.h
--- linux-2.4.0-test7-official/include/linux/fs.h Fri Aug 25 13:11:43 2000
+++ working-2.4.0-test7/include/linux/fs.h Fri Aug 25 13:38:43 2000
@@ -746,6 +746,7 @@
int (*lock) (struct file *, int, struct file_lock *);
ssize_t (*readv) (struct file *, const struct iovec *, unsigned long, loff_t *);
ssize_t (*writev) (struct file *, const struct iovec *, unsigned long, loff_t *);
+ void (*fdclosed) (struct file *, struct files_struct *);
};

struct inode_operations {
diff -ur -X /tmp/filexiytUc --minimal linux-2.4.0-test7-official/include/linux/net.h working-2.4.0-test7/include/linux/net.h
--- linux-2.4.0-test7-official/include/linux/net.h Fri Aug 25 12:10:21 2000
+++ working-2.4.0-test7/include/linux/net.h Fri Aug 25 13:38:43 2000
@@ -82,6 +82,7 @@

struct scm_cookie;
struct vm_area_struct;
+struct files_struct;

struct proto_ops {
int family;
@@ -108,6 +109,7 @@
int (*sendmsg) (struct socket *sock, struct msghdr *m, int total_len, struct scm_cookie *scm);
int (*recvmsg) (struct socket *sock, struct msghdr *m, int total_len, int flags, struct scm_cookie *scm);
int (*mmap) (struct file *file, struct socket *sock, struct vm_area_struct * vma);
+ void (*fdclosed) (struct socket *sock, struct files_struct *files);
};

struct net_proto_family
diff -ur -X /tmp/filexiytUc --minimal linux-2.4.0-test7-official/include/net/sock.h working-2.4.0-test7/include/net/sock.h
--- linux-2.4.0-test7-official/include/net/sock.h Fri Aug 25 13:11:48 2000
+++ working-2.4.0-test7/include/net/sock.h Fri Aug 25 13:39:12 2000
@@ -708,6 +708,8 @@
int *addr_len);
int (*bind)(struct sock *sk,
struct sockaddr *uaddr, int addr_len);
+ void (*fdclosed)(struct sock *sk,
+ struct files_struct *files);

int (*backlog_rcv) (struct sock *sk,
struct sk_buff *skb);
@@ -836,6 +838,8 @@
extern void sock_kfree_s(struct sock *sk, void *mem, int size);

extern int copy_and_csum_toiovec(struct iovec *iov, struct sk_buff *skb, int hlen);
+extern void sock_fdclosed(struct socket *sock, struct files_struct *files);
+

/*
* Functions to fill in entries in struct proto_ops when a protocol
diff -ur -X /tmp/filexiytUc --minimal linux-2.4.0-test7-official/net/core/sock.c working-2.4.0-test7/net/core/sock.c
--- linux-2.4.0-test7-official/net/core/sock.c Fri Aug 25 13:11:48 2000
+++ working-2.4.0-test7/net/core/sock.c Fri Aug 25 13:36:53 2000
@@ -107,6 +107,7 @@
#include <linux/interrupt.h>
#include <linux/poll.h>
#include <linux/init.h>
+#include <linux/wait.h>

#include <asm/uaccess.h>
#include <asm/system.h>
@@ -1062,6 +1063,28 @@
{
/* Mirror missing mmap method error code */
return -ENODEV;
+}
+
+void sock_fdclosed(struct socket *sock, struct files_struct *files)
+{
+ struct sock *sk = sock->sk;
+ struct list_head *i;
+
+ if (sk->sleep && waitqueue_active(sk->sleep)) {
+ unsigned long flags;
+
+ /* Wake up everyone using same file table. */
+ wq_write_lock_irqsave(&sk->sleep->lock, flags);
+ for (i = sk->sleep->task_list.next;
+ i != &sk->sleep->task_list;
+ i = i->next) {
+ wait_queue_t *curr = list_entry(i, wait_queue_t,
+ task_list);
+ if (curr->task->files == files)
+ wake_up_process(curr->task);
+ }
+ wq_write_unlock_irqrestore(&sk->sleep->lock, flags);
+ }
}

/*
diff -ur -X /tmp/filexiytUc --minimal linux-2.4.0-test7-official/net/ipv4/af_inet.c working-2.4.0-test7/net/ipv4/af_inet.c
--- linux-2.4.0-test7-official/net/ipv4/af_inet.c Fri Aug 25 13:11:48 2000
+++ working-2.4.0-test7/net/ipv4/af_inet.c Fri Aug 25 13:36:53 2000
@@ -910,7 +910,8 @@
getsockopt: inet_getsockopt,
sendmsg: inet_sendmsg,
recvmsg: inet_recvmsg,
- mmap: sock_no_mmap
+ mmap: sock_no_mmap,
+ fdclosed: sock_fdclosed
};

struct proto_ops inet_dgram_ops = {
@@ -931,6 +932,7 @@
sendmsg: inet_sendmsg,
recvmsg: inet_recvmsg,
mmap: sock_no_mmap,
+ fdclosed: sock_fdclosed
};

struct net_proto_family inet_family_ops = {
diff -ur -X /tmp/filexiytUc --minimal linux-2.4.0-test7-official/net/socket.c working-2.4.0-test7/net/socket.c
--- linux-2.4.0-test7-official/net/socket.c Fri Aug 25 13:11:50 2000
+++ working-2.4.0-test7/net/socket.c Fri Aug 25 13:36:53 2000
@@ -104,7 +104,7 @@
unsigned long count, loff_t *ppos);
static ssize_t sock_writev(struct file *file, const struct iovec *vector,
unsigned long count, loff_t *ppos);
-
+static void sock_do_fdclosed(struct file *file, struct files_struct *files);

/*
* Socket files have a set of 'special' operations as well as the generic file ones. These don't appear
@@ -122,7 +122,8 @@
release: sock_close,
fasync: sock_fasync,
readv: sock_readv,
- writev: sock_writev
+ writev: sock_writev,
+ fdclosed: sock_do_fdclosed
};

/*
@@ -706,6 +707,15 @@
sock_fasync(-1, filp, 0);
sock_release(socki_lookup(inode));
return 0;
+}
+
+static void sock_do_fdclosed(struct file *file, struct files_struct *files)
+{
+ struct socket *sock;
+
+ sock = socki_lookup(file->f_dentry->d_inode);
+ if (sock->ops->fdclosed)
+ sock->ops->fdclosed(sock, files);
}

/*
================================================================
#include <sys/types.h>
#include <sys/stat.h>
#include <sys/socket.h>
#include <sys/errno.h>
#include <fcntl.h>
#include <netinet/in.h>
#include <netinet/udp.h>
#include <netinet/ip.h>
#include <stddef.h>
#include <unistd.h>
#include <stdio.h>
#include <poll.h>
#include <pthread.h>
#include <sys/wait.h>

/* We expect to be woken within 10 seconds. */
void *woken(void *arg)
{
struct pollfd pfd;
char buf[10];
int r;
time_t start = time(NULL);

pfd.fd = (int)arg;
pfd.events = POLLIN;

/* 10 seconds. */
r = poll(&pfd, 1, 10000);
if (r == 1 && (pfd.revents & POLLNVAL)) {
if (recv(pfd.fd, buf, 10, 0) >= 0)
printf("FAILED: recv didn't fail\n");
else if (errno != EBADF)
printf("FAILED: errno %u != EBADF\n", errno);
else if (time(NULL) >= start + 10)
printf("FAILED: didn't wake me\n");
else
printf("woken: SUCCESS\n");
} else {
printf("FAILED: Poll didn't %s...\n",
r == 0 ? "wake me" : "mark fd");
}
return NULL;
}

/* We expect NOT to be woken (dup'ed fd) */
void *unwoken(void *arg)
{
struct pollfd pfd;

pfd.fd = (int)arg;
pfd.events = POLLIN;

/* 10 seconds. */
if (poll(&pfd, 1, 10000) != 0)
printf("FAILED: Poll woke me!\n");
else
printf("unwoken: SUCCESS\n");

return NULL;
}

/* We're a child: same fd, different file table, should not be woken */
void unwoken_child(int fd)
{
struct pollfd pfd;

pfd.fd = fd;
pfd.events = POLLIN;

/* 10 seconds. */
if (poll(&pfd, 1, 10000) != 0)
printf("FAILED: Poll woke child!\n");
else
printf("unwoken child: SUCCESS\n");
exit(0);
}

int main(int argc, const char *argv[])
{
pthread_t waker, sleeper;
int fd, fd2;

if (argc != 2) {
fprintf(stderr, "Usage: polltest [pipe|fifo|udp|tcp|raw|<filename>]\n");
exit(1);
}
if (strcmp(argv[1], "udp") == 0) {
struct sockaddr_in name;
if ((fd = socket(AF_INET, SOCK_DGRAM, 0)) < 0) {
perror("opening datagram socket");
exit(1);
}

name.sin_family = AF_INET;
name.sin_addr.s_addr = INADDR_ANY;
name.sin_port = 0;
if (bind(fd, &name, sizeof(name)) != 0) {
perror("binding datagram socket");
exit(1);
}
} else if (strcmp(argv[1], "tcp") == 0) {
struct sockaddr_in name;
if ((fd = socket(AF_INET, SOCK_STREAM, 0)) < 0) {
perror("opening stream socket");
exit(1);
}

name.sin_family = AF_INET;
name.sin_addr.s_addr = INADDR_ANY;
name.sin_port = 0;
if (bind(fd, &name, sizeof(name)) != 0) {
perror("binding stream socket");
exit(1);
}
if (listen(fd, 5) < 0) {
perror("listening on stream socket");
exit(1);
}
} else if (strcmp(argv[1], "raw") == 0) {
if ((fd = socket(AF_INET, SOCK_RAW, IPPROTO_RAW)) < 0) {
perror("opening raw socket");
exit(1);
}
} else if (strcmp(argv[1], "pipe") == 0) {
int fds[2];

if (pipe(fds) != 0) {
perror("creating pipe");
exit(1);
}
fd = fds[0];
} else if (strcmp(argv[1], "fifo") == 0) {
unlink("/tmp/fifotest");
/* Linux doesn't block on mkfifo O_RDWR, so save a fork */
if ((mkfifo("/tmp/fifotest", O_RDWR)) < 0) {
perror("creating fifo");
exit(1);
}
if ((fd = open("/tmp/fifotest", O_RDWR)) < 0) {
perror("opening fifo");
exit(1);
}
} else {
/* Some character device, I presume */
if ((fd = open(argv[1], O_RDONLY)) < 0) {
perror("opening file");
exit(1);
}
}

if ((fd2 = dup(fd)) < 0) {
perror("doing dup");
exit(1);
}

switch (fork()) {
case -1:
perror("doing fork");
exit(1);
case 0:
unwoken_child(fd);
break;
}

if (pthread_create(&waker, NULL, woken, (void *)fd) != 0) {
perror("Creating waker thread");
exit(1);
}

if (pthread_create(&sleeper, NULL, unwoken, (void *)fd2) != 0) {
perror("Creating sleeper thread");
exit(1);
}

sleep(4);
close(fd);
pthread_join(waker, NULL);
pthread_join(sleeper, NULL);
wait(NULL);
return 0;
}
================
#! /usr/bin/make

# This needs to be a quiescent device (ie. won't wake up things on poll).
CHARDEV:=/dev/ttyS1

CFLAGS:=-Wall -W

default: alltests

alltests: polltest udptest tcptest rawtest pipetest fifotest chardevtest

udptest:
@echo UDP test; ./polltest udp

tcptest:
@echo TCP test; ./polltest tcp

rawtest:
@echo RAW test; ./polltest raw

pipetest:
@echo PIPE test; ./polltest pipe

fifotest:
@echo FIFO test; ./polltest fifo

chardevtest:
@echo CHARDEV test; ./polltest $(CHARDEV)

polltest: polltest.c
$(CC) $(CFLAGS) -o $@ $< -lpthread

clean:
@rm -f polltest

Dan Kegel

unread,

Aug 29, 2000, 3:00:00 AM8/29/00

to linux-...@vger.kernel.org

Linus wrote:
> My claim is that a user program that does a close() on a file that is in
> use is inherently racy. Is that so hard to understand?

Unfortunately, Java programs do this a lot, as it's the
only reliable way to break a thread out of a blocking I/O call
there.

> Just don't close a file descriptor while it is in use. It's that simple.
>

> (I'm also claiming that Linux will give the most reasonable possible
> semantics for the case where you _do_ close the file desciptor when it
> is in use. I still claim that you shouldn't do it.)

I'll forward your note to the folks in JSR-51 (Sun's "new I/O for Java"
group), fwiw.
- Dan

Alan Cox

unread,

Aug 29, 2000, 3:00:00 AM8/29/00

to Dan Kegel

> Linus wrote:
> > My claim is that a user program that does a close() on a file that is in
> > use is inherently racy. Is that so hard to understand?
>
> Unfortunately, Java programs do this a lot, as it's the
> only reliable way to break a thread out of a blocking I/O call
> there.

Then either the Java program is broken and its not a portable java guarantee
or the java runtime is broken and assumes things that are not guaranteed by
the underlying environment

Tigran Aivazian

unread,

Aug 29, 2000, 3:00:00 AM8/29/00

to Linus Torvalds

On 27 Aug 2000, Linus Torvalds wrote:
> I'm basically saying: if you want reliable behaviour, you should make

> sure in user land that you don't close a file descriptor that is in use
~~~~~~~~~

> ...

> And I'm right. I'm always right, but in this case I'm just a bit more
> right than I usually am.
>

> Just don't close a file descriptor while it is in use. It's that simple.
>
> (I'm also claiming that Linux will give the most reasonable possible
> semantics for the case where you _do_ close the file desciptor when it
> is in use. I still claim that you shouldn't do it.)
>

Dear Linus,

I have read your email but would you care to say if you also object to
the idea of doing that in kernel land, please?

As you say, the kernel side is consistent and not-racey. So, is there
anything wrong with the idea of closing a file descriptor currently
(possibly) in use in the kernel, as required for the forced umount? This
is the safest (to the best of my knowledge) way of doing the forced umount
in the present Linux/VFS framework.

To make things more specific, I am talking about the patch below.

Regards,
Tigran

diff -urN -X dontdiff linux/fs/Makefile badfs/fs/Makefile
--- linux/fs/Makefile Thu Aug 10 06:51:11 2000
+++ badfs/fs/Makefile Tue Aug 15 10:27:09 2000
@@ -18,9 +18,9 @@
ALL_SUB_DIRS := coda minix ext2 fat msdos vfat proc isofs nfs umsdos ntfs \
hpfs sysv smbfs ncpfs ufs efs affs romfs autofs hfs lockd \
nfsd nls devpts devfs adfs partitions qnx4 udf bfs cramfs \
- openpromfs autofs4 ramfs jffs
+ openpromfs autofs4 ramfs jffs badfs

-SUB_DIRS :=
+SUB_DIRS := badfs

ifeq ($(CONFIG_QUOTA),y)
O_OBJS += dquot.o
diff -urN -X dontdiff linux/fs/badfs/Makefile badfs/fs/badfs/Makefile
--- linux/fs/badfs/Makefile Thu Jan 1 01:00:00 1970
+++ badfs/fs/badfs/Makefile Tue Aug 15 10:27:09 2000
@@ -0,0 +1,9 @@
+#
+# Makefile for badfs filesystem.
+#
+
+O_TARGET := badfs.o
+O_OBJS := inode.o
+M_OBJS := $(O_TARGET)
+
+include $(TOPDIR)/Rules.make
diff -urN -X dontdiff linux/fs/badfs/inode.c badfs/fs/badfs/inode.c
--- linux/fs/badfs/inode.c Thu Jan 1 01:00:00 1970
+++ badfs/fs/badfs/inode.c Tue Aug 15 10:27:09 2000
@@ -0,0 +1,272 @@
+/*
+ * badfs - the Bad Filesystem
+ *
+ * Author - Tigran Aivazian <tig...@veritas.com>
+ *
+ * Thanks to:
+ * Manfred Spraul <man...@colorfullife.com>, for useful comments.
+ *
+ * This file is released under the GPL.
+ *
+ * The badfs filesystem is used by forced umount ('umount -f' command)
+ * to move inodes that keep the filesystem being umounted busy to it.
+ *
+ * The entry point into this module is via quiesce_filesystem() called
+ * from fs/super.c:do_umount() if MNT_FORCE is passed.
+ */
+
+#include <linux/module.h>
+#include <linux/init.h>
+#include <linux/sched.h>
+#include <linux/mm.h>
+#include <linux/file.h>
+
+#define BADFS_MAGIC 0xBADF0001
+
+static struct super_block *badfs_read_super(struct super_block *,void *,int);
+
+#define FS_FLAGS_BADFS (FS_NOMOUNT | FS_SINGLE)
+static DECLARE_FSTYPE(badfs_fs_type,"badfs",badfs_read_super,FS_FLAGS_BADFS);
+
+static struct vfsmount *badfs_mnt; /* returned by kern_mount() */
+static struct super_block *badfs_sb; /* badfs_mnt->mnt_sb */
+static struct dentry *badfs_root; /* badfs_sb->s_root */
+
+static int __init init_badfs_fs(void)
+{
+ int err = register_filesystem(&badfs_fs_type);
+
+ if (!err) {
+ badfs_mnt = kern_mount(&badfs_fs_type);
+ if (IS_ERR(badfs_mnt)) {
+ err = PTR_ERR(badfs_mnt);
+ unregister_filesystem(&badfs_fs_type);
+ } else {
+ badfs_sb = badfs_mnt->mnt_sb;
+ err = 0;
+ }
+ }
+ return err;
+}
+
+module_init(init_badfs_fs);
+
+static struct inode *badfs_get_inode(struct super_block *sb, int mode)
+{
+ struct inode *inode = get_empty_inode();
+
+ if (inode) {
+ make_bad_inode(inode);
+ inode->i_sb = sb;
+ inode->i_dev = sb->s_dev;
+ inode->i_mode = mode;
+ inode->i_nlink = 1;
+ inode->i_size = 0;
+ inode->i_blocks = 0;
+ }
+ return inode;
+}
+
+/* VFS ->read_super() method */
+static struct super_block *badfs_read_super(struct super_block * sb,
+ void * data, int silent)
+{
+ static struct super_operations badfs_ops = {};
+ struct inode * root = badfs_get_inode(sb, S_IFDIR|S_IRUSR|S_IWUSR);
+
+ if (!root)
+ return NULL;
+ sb->s_blocksize = 1024;
+ sb->s_blocksize_bits = 10;
+ sb->s_magic = BADFS_MAGIC;
+ sb->s_op = &badfs_ops;
+ badfs_root = sb->s_root = d_alloc(NULL,
+ &(const struct qstr){ "bad:", 5, 0});
+ if (!badfs_root) {
+ iput(root);
+ return NULL;
+ }
+ sb->s_root->d_sb = sb;
+ sb->s_root->d_parent = sb->s_root;
+ d_instantiate(sb->s_root, root);
+ return sb;
+}
+
+static void disable_pwd(struct fs_struct *fs)
+{
+ struct inode *inode;
+ struct dentry *dentry;
+
+ inode = badfs_get_inode(badfs_sb, S_IFDIR|0755);
+ if (!inode) {
+ printk(KERN_ERR "disable_pwd(): can't allocate inode\n");
+ return;
+ }
+ dentry = d_alloc(badfs_root, &(const struct qstr){"dead_pwd", 8, 0});
+ if (!dentry) {
+ iput(inode);
+ printk(KERN_ERR "disable_pwd(): can't allocate dentry\n");
+ return;
+ }
+ d_instantiate(dentry, inode);
+ dget(dentry);
+ set_fs_pwd(fs, badfs_mnt, dentry);
+}
+
+static void disable_root(struct fs_struct *fs)
+{
+ struct inode *inode;
+ struct dentry *dentry;
+
+ inode = badfs_get_inode(badfs_sb, S_IFDIR|0755);
+ if (!inode) {
+ printk(KERN_ERR "disable_root(): can't allocate inode\n");
+ return;
+ }
+ dentry = d_alloc(badfs_root, &(const struct qstr){"dead_root", 9, 0});
+ if (!dentry) {
+ iput(inode);
+ printk(KERN_ERR "disable_root(): can't allocate dentry\n");
+ return;
+ }
+ d_instantiate(dentry, inode);
+ dget(dentry);
+ set_fs_root(fs, badfs_mnt, dentry);
+}
+
+/* called from do_umount() if MNT_FORCE is specified */
+void quiesce_filesystem(struct vfsmount *mnt)
+{
+ struct task_struct *p;
+ struct file *file;
+ struct inode *inode;
+
+ /* we do three passes through the task list, examining:
+ * 1. p->fs{->pwd,root} that can keep this mnt busy
+ * 2. p->files, i.e. open files (we do_close them)
+ * 3. p->mm, i.e. mmaped files (we simply do_munmap them)
+ * There is no guarantee that by the time we restart the loop
+ * the amount of work to do in the loop has not increased.
+ */
+repeat1:
+ read_lock(&tasklist_lock);
+ for_each_task(p) {
+ struct fs_struct *fs;
+
+ /* get a reference to p->fs */
+ task_lock(p);
+ fs = p->fs;
+ if (!fs) {
+ task_unlock(p);
+ continue;
+ } else
+ atomic_inc(&fs->count);
+ task_unlock(p);
+
+ if (fs->pwdmnt == mnt) {
+ read_unlock(&tasklist_lock);
+ disable_pwd(fs); /* may sleep */
+ put_fs_struct(fs);
+ goto repeat1;
+ }
+ if (fs->rootmnt == mnt) {
+ read_unlock(&tasklist_lock);
+ disable_root(fs); /* may sleep */
+ put_fs_struct(fs);
+ goto repeat1;
+ }
+ put_fs_struct(fs);
+ }
+ read_unlock(&tasklist_lock);
+
+repeat2:
+ read_lock(&tasklist_lock);
+ for_each_task(p) {
+ unsigned int fd, j = 0;
+ struct files_struct *files;
+
+ /* get a reference to p->files */
+ task_lock(p);
+ files = p->files;
+ if (!files) {
+ task_unlock(p);
+ continue;
+ } else {
+ atomic_inc(&files->count);
+ write_lock(&files->file_lock);
+ }
+ task_unlock(p);
+
+ /* check if this process has open files here */
+ while (1) {
+ unsigned long set;
+
+ fd = j * __NFDBITS;
+ if (fd >= files->max_fdset || fd >= files->max_fds)
+ break;
+ set = files->open_fds->fds_bits[j++];
+ while (set) {
+ if (set & 1) {
+ file = files->fd[fd];
+ inode = file->f_dentry->d_inode;
+ if (inode && (file->f_vfsmnt==mnt)) {
+ files->fd[fd] = NULL;
+ FD_CLR(fd, files->close_on_exec);
+ __put_unused_fd(files, fd);
+ write_unlock(&files->file_lock);
+ read_unlock(&tasklist_lock);
+ put_files_struct(files);
+ filp_close(file, files);
+ goto repeat2;
+ }
+ }
+ fd++;
+ set >>= 1;
+ }
+ }
+ write_unlock(&files->file_lock);
+ put_files_struct(files);
+ }
+ read_unlock(&tasklist_lock);
+
+repeat3:
+ read_lock(&tasklist_lock);
+ for_each_task(p) {
+ struct mm_struct *mm;
+ struct vm_area_struct *vma;
+
+ /* get a reference to p->mm */
+ task_lock(p);
+ mm = p->mm;
+ if (!mm) {
+ task_unlock(p);
+ continue;
+ } else
+ atomic_inc(&mm->mm_users);
+ task_unlock(p);
+
+ /* check for mmap'd files and unmap them */
+ vmlist_modify_lock(mm);
+ for (vma = mm->mmap; vma; vma=vma->vm_next) {
+ file = vma->vm_file;
+ if (!file)
+ continue;
+ inode = file->f_dentry->d_inode;
+ if (!inode || !inode->i_sb)
+ continue;
+ if (file->f_vfsmnt == mnt) {
+ vmlist_modify_unlock(mm);
+ read_unlock(&tasklist_lock);
+ down(&mm->mmap_sem);
+ do_munmap(mm, vma->vm_start,
+ vma->vm_end - vma->vm_start);
+ up(&mm->mmap_sem);
+ mmput(mm);
+ goto repeat3;
+ }
+ }
+ vmlist_modify_unlock(mm);
+ mmput(mm);
+ }
+ read_unlock(&tasklist_lock);
+}
diff -urN -X dontdiff linux/fs/super.c badfs/fs/super.c
--- linux/fs/super.c Tue Aug 15 10:25:23 2000
+++ badfs/fs/super.c Tue Aug 15 10:27:09 2000
@@ -1033,6 +1033,9 @@
return retval;
}

+ if (flags&MNT_FORCE)
+ quiesce_filesystem(mnt);
+
spin_lock(&dcache_lock);
if (atomic_read(&mnt->mnt_count) > 2) {
spin_unlock(&dcache_lock);
diff -urN -X dontdiff linux/include/linux/fs.h badfs/include/linux/fs.h
--- linux/include/linux/fs.h Tue Aug 15 10:25:23 2000
+++ badfs/include/linux/fs.h Tue Aug 15 10:37:49 2000
@@ -1189,6 +1189,8 @@
extern kdev_t ROOT_DEV;
extern char root_device_name[];

+/* fs/badfs/inode.c - used by forced umount */
+extern void quiesce_filesystem(struct vfsmount *);

extern void show_buffers(void);
extern void mount_root(void);