
[PATCH] SMP race in ext2 - metadata corruption.


Alexander Viro

Apr 26, 2001, 11:50:35
To Linus Torvalds, Alan Cox, linux-...@vger.kernel.org
Ext2 does getblk+wait_on_buffer for new metadata blocks before
filling them with zeroes. While that is enough for single-processor,
on SMP we have the following race:

getblk gives us unlocked, non-uptodate bh
wait_on_buffer() does nothing
read from device locks it and starts IO
we zero it out.
on-disk data overwrites our zeroes.
we mark it dirty
bdflush writes the old data (_not_ zeroes) back to disk.

Result: crap in metadata block. Proposed fix: lock_buffer()/unlock_buffer()
around memset()/mark_buffer_uptodate() instead of wait_on_buffer() before
them.

Patch against 2.4.4-pre7 follows. Please, apply.
Al


--- S4-pre7/fs/ext2/inode.c	Wed Apr 25 20:43:08 2001
+++ S4-pre7-ext2/fs/ext2/inode.c	Thu Apr 26 11:36:11 2001
@@ -397,13 +397,13 @@
 		 * the pointer to new one, then send parent to disk.
 		 */
 		bh = getblk(inode->i_dev, parent, blocksize);
-		if (!buffer_uptodate(bh))
-			wait_on_buffer(bh);
+		lock_buffer(bh);
 		memset(bh->b_data, 0, blocksize);
 		branch[n].bh = bh;
 		branch[n].p = (u32*) bh->b_data + offsets[n];
 		*branch[n].p = branch[n].key;
 		mark_buffer_uptodate(bh, 1);
+		unlock_buffer(bh);
 		mark_buffer_dirty_inode(bh, inode);
 		if (IS_SYNC(inode) || inode->u.ext2_i.i_osync) {
 			ll_rw_block (WRITE, 1, &bh);
@@ -587,10 +587,10 @@
 		struct buffer_head *bh;
 		bh = getblk(dummy.b_dev, dummy.b_blocknr, inode->i_sb->s_blocksize);
 		if (buffer_new(&dummy)) {
-			if (!buffer_uptodate(bh))
-				wait_on_buffer(bh);
+			lock_buffer(bh);
 			memset(bh->b_data, 0, inode->i_sb->s_blocksize);
 			mark_buffer_uptodate(bh, 1);
+			unlock_buffer(bh);
 			mark_buffer_dirty_inode(bh, inode);
 		}
 		return bh;

-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majo...@vger.kernel.org
More majordomo info at http://vger.kernel.org/majordomo-info.html
Please read the FAQ at http://www.tux.org/lkml/

Andrea Arcangeli

Apr 26, 2001, 14:21:32
To Alexander Viro, Linus Torvalds, Alan Cox, linux-...@vger.kernel.org
On Thu, Apr 26, 2001 at 11:45:47AM -0400, Alexander Viro wrote:
> Ext2 does getblk+wait_on_buffer for new metadata blocks before
> filling them with zeroes. While that is enough for single-processor,
> on SMP we have the following race:
>
> getblk gives us unlocked, non-uptodate bh
> wait_on_buffer() does nothing
> read from device locks it and starts IO
> we zero it out.
> on-disk data overwrites our zeroes.
> we mark it dirty
> bdflush writes the old data (_not_ zeroes) back to disk.
>
> Result: crap in metadata block. Proposed fix: lock_buffer()/unlock_buffer()
> around memset()/mark_buffer_uptodate() instead of wait_on_buffer() before
> them.
>
> Patch against 2.4.4-pre7 follows. Please, apply.

correct. I bet other fs are affected as well btw.

Andrea

Alexander Viro

Apr 26, 2001, 14:29:42
To Andrea Arcangeli, Linus Torvalds, Alan Cox, linux-...@vger.kernel.org

On Thu, 26 Apr 2001, Andrea Arcangeli wrote:

> correct. I bet other fs are affected as well btw.

If only... block_read() vs. block_write() has the same race. I'm going
through the list of all wait_on_buffer() users right now.

Linus Torvalds

Apr 26, 2001, 14:55:04
To Andrea Arcangeli, Alexander Viro, Alan Cox, linux-...@vger.kernel.org

On Thu, Apr 26, 2001 at 11:45:47AM -0400, Alexander Viro wrote:
>
> Ext2 does getblk+wait_on_buffer for new metadata blocks before
> filling them with zeroes. While that is enough for single-processor,
> on SMP we have the following race:
>
> getblk gives us unlocked, non-uptodate bh
> wait_on_buffer() does nothing
> read from device locks it and starts IO

I see the race, but I don't see how you can actually trigger it.

Exactly _who_ does the "read from device" part? Somebody doing a
"fsck" while the filesystem is mounted read-write and actively written
to? Yeah, you'd get disk corruption that way, but you'll get it regardless
of this bug.

There's nothing else that should be using that block at that stage. And if
there were, that would be a bug in itself, as far as I can tell. We've
just allocated it, and we're the only and exclusive owners of that block
on the disk. Anybody else who touches it is seriously broken.

Now, I don't disagree with your patch (it's just obviously cleaner to lock
it properly), but I don't think this is a real bug. I suspect that even
the wait-on-buffer is not strictly necessary: it's probably there to make
sure old write-backs have completed, but that doesn't really matter
either.

We used to have "breada()" do physical read-ahead that could have
triggered this, but we've long since gotten rid of that.

Or am I overlooking something?

Linus

Chris Mason

Apr 26, 2001, 15:09:52
To Alexander Viro, Andrea Arcangeli, Linus Torvalds, Alan Cox, linux-...@vger.kernel.org

On Thursday, April 26, 2001 02:24:26 PM -0400 Alexander Viro
<vi...@math.psu.edu> wrote:

>
>
> On Thu, 26 Apr 2001, Andrea Arcangeli wrote:
>
>> correct. I bet other fs are affected as well btw.
>
> If only... block_read() vs. block_write() has the same race. I'm going
> through the list of all wait_on_buffer() users right now.
>

Looks like reiserfs has it too when allocating tree blocks, but it should
be harder to hit. The fix should be small but it will take me a bit to
make sure it doesn't affect the rest of the balancing code.

-chris

Alexander Viro

Apr 26, 2001, 15:13:12
To Linus Torvalds, Andrea Arcangeli, Alan Cox, linux-...@vger.kernel.org

On Thu, 26 Apr 2001, Linus Torvalds wrote:

> I see the race, but I don't see how you can actually trigger it.
>
> Exactly _who_ does the "read from device" part? Somebody doing a
> "fsck" while the filesystem is mounted read-write and actively written
> to? Yeah, you'd get disk corruption that way, but you'll get it regardless
> of this bug.

> There's nothing else that should be using that block at that stage. And if
> there were, that would be a bug in itself, as far as I can tell. We've
> just allocated it, and we're the only and exclusive owners of that block
> on the disk. Anybody else who touches it is seriously broken.

> Now, I don't disagree with your patch (it's just obviously cleaner to lock
> it properly), but I don't think this is a real bug. I suspect that even
> the wait-on-buffer is not strictly necessary: it's probably there to make
> sure old write-backs have completed, but that doesn't really matter
> either.
>
> We used to have "breada()" do physical read-ahead that could have
> triggered this, but we've long since gotten rid of that.
>
> Or am I overlooking something?

Somebody doing dd(1) _from_ that disk. Sure, he's bound to get crap.
But I really don't think that opening device for read should be able
to affect its contents in any way.

BTW, same race exists between block_read() and block_write(). And that
one is even more obviously wrong:

xterm A:                            xterm B:

dd if=/dev/hda of=/dev/hdb          dd if=/dev/hdb of=/dev/null

result: some blocks on hdb retaining their old contents.

IMO "no matter what you read, you don't affect the contents" is a good
general principle. Sure, you can get crap if you read in the middle of a
write. That's expected and sane. However, the final contents of the file
depend only on the things done by writers.

Al

Alexander Viro

Apr 26, 2001, 15:22:19
To Linus Torvalds, Andrea Arcangeli, Alan Cox, linux-...@vger.kernel.org

On Thu, 26 Apr 2001, I wrote:

> On Thu, 26 Apr 2001, Linus Torvalds wrote:
>
> > I see the race, but I don't see how you can actually trigger it.
> >
> > Exactly _who_ does the "read from device" part? Somebody doing a
> > "fsck" while the filesystem is mounted read-write and actively written
> > to? Yeah, you'd get disk corruption that way, but you'll get it regardless
> > of this bug.

OK, I think I've a better explanation now:

Suppose /dev/hda1 is owned by root.disks and permissions are 640.
It is mounted read-write.

Process foo belongs to pfy.staff. PFY is included into disks, but doesn't
have root. I claim that he should be unable to cause fs corruption on
/dev/hda1.

Currently foo _can_ cause such corruption, even though it has nothing
resembling write permissions for device in question.

IMO it is wrong. I'm not saying that it's a real security problem. I'm
not saying that PFY is not idiot or that his actions make any sense.
However, I think that situation when he can do that without write
access to device is just plain wrong.

Does the above make sense?

Andrea Arcangeli

Apr 26, 2001, 15:25:12
To Linus Torvalds, Alexander Viro, Alan Cox, linux-...@vger.kernel.org
On Thu, Apr 26, 2001 at 11:49:14AM -0700, Linus Torvalds wrote:
>
> On Thu, Apr 26, 2001 at 11:45:47AM -0400, Alexander Viro wrote:
> >
> > Ext2 does getblk+wait_on_buffer for new metadata blocks before
> > filling them with zeroes. While that is enough for single-processor,
> > on SMP we have the following race:
> >
> > getblk gives us unlocked, non-uptodate bh
> > wait_on_buffer() does nothing
> > read from device locks it and starts IO
>
> I see the race, but I don't see how you can actually trigger it.
>
> Exactly _who_ does the "read from device" part? Somebody doing a

/sbin/dump

> the wait-on-buffer is not strictly necessary: it's probably there to make

maybe not but I need to check some more bit to be sure.

Andrea

Alexander Viro

Apr 26, 2001, 15:38:15
To Andrea Arcangeli, Linus Torvalds, Alan Cox, linux-...@vger.kernel.org

On Thu, 26 Apr 2001, Andrea Arcangeli wrote:

> > the wait-on-buffer is not strictly necessary: it's probably there to make
>
> maybe not but I need to check some more bit to be sure.

Same scenario, but with read-in-progress started before we do getblk(). BTW,
old writeback is harmless - we will overwrite anyway. And _that_ can happen
without direct access to device - truncate() doesn't terminate writeout of
the indirect blocks it frees (IMO it should, but that's another story).

Andrea Arcangeli

Apr 26, 2001, 15:47:44
To Linus Torvalds, Alexander Viro, Alan Cox, linux-...@vger.kernel.org
On Thu, Apr 26, 2001 at 09:15:57PM +0200, Andrea Arcangeli wrote:
> maybe not but I need to check some more bit to be sure.

yes we probably don't need it for fs against fs in 2.4 because we make
the new metadata block visible to a reader (splice) only after they're
all uptodate and all directory operations are serialized by the vfs
(with the i_zombie).

Andrea Arcangeli

Apr 26, 2001, 15:53:40
To Alexander Viro, Linus Torvalds, Alan Cox, linux-...@vger.kernel.org
On Thu, Apr 26, 2001 at 03:34:00PM -0400, Alexander Viro wrote:
> Same scenario, but with read-in-progress started before we do getblk(). BTW,

how can the read in progress see a branch that we didn't spliced yet? We
clear and mark uptodate the new part of the branch before it's visible
to any reader no? then in splice we write the key into the where->p and
the branch become visible to the readers but by that time the reader
won't start I/O because the buffer are just uptodate. I only had a short
look now and to verify Ingo's fix, so maybe I overlooked something.

> without direct access to device - truncate() doesn't terminate writeout of
> the indirect blocks it frees (IMO it should, but that's another story).

If the block is under I/O or dirty that's another story, the only issue
here is when the buffer block is new and not uptodate.

Andrea

Alexander Viro

Apr 26, 2001, 16:00:45
To Andrea Arcangeli, Linus Torvalds, Alan Cox, linux-...@vger.kernel.org

On Thu, 26 Apr 2001, Andrea Arcangeli wrote:

> On Thu, Apr 26, 2001 at 03:34:00PM -0400, Alexander Viro wrote:
> > Same scenario, but with read-in-progress started before we do getblk(). BTW,
>
> how can the read in progress see a branch that we didn't spliced yet? We

fd = open("/dev/hda1", O_RDONLY);
read(fd, buf, sizeof(buf));

Linus Torvalds

Apr 26, 2001, 16:44:05
To Alexander Viro, Andrea Arcangeli, Alan Cox, linux-...@vger.kernel.org

On Thu, 26 Apr 2001, Alexander Viro wrote:
> On Thu, 26 Apr 2001, Andrea Arcangeli wrote:
> >
> > how can the read in progress see a branch that we didn't spliced yet? We
>
> fd = open("/dev/hda1", O_RDONLY);
> read(fd, buf, sizeof(buf));

Note that I think all these arguments are fairly bogus. Doing things like
"dump" on a live filesystem is stupid and dangerous (in my opinion it is
stupid and dangerous to use "dump" at _all_, but that's a whole 'nother
discussion in itself), and there really are no valid uses for opening a
block device that is already mounted. More importantly, I don't think
anybody actually does.

The fact that you _can_ do so makes the patch valid, and I do agree with
Al on the "least surprise" issue. I've already applied the patch, in fact.
But the fact is that nobody should ever do the thing that could cause
problems.

Linus

Andrea Arcangeli

Apr 26, 2001, 16:44:08
To Alexander Viro, Linus Torvalds, Alan Cox, linux-...@vger.kernel.org
On Thu, Apr 26, 2001 at 03:17:54PM -0400, Alexander Viro wrote:
>
>
> On Thu, 26 Apr 2001, I wrote:
>
> > On Thu, 26 Apr 2001, Linus Torvalds wrote:
> >
> > > I see the race, but I don't see how you can actually trigger it.
> > >
> > > Exactly _who_ does the "read from device" part? Somebody doing a
> > > "fsck" while the filesystem is mounted read-write and actively written
> > > to? Yeah, you'd get disk corruption that way, but you'll get it regardless
> > > of this bug.
>
> OK, I think I've a better explanation now:
>
> Suppose /dev/hda1 is owned by root.disks and permissions are 640.
> It is mounted read-write.
>
> Process foo belongs to pfy.staff. PFY is included into disks, but doesn't
> have root. I claim that he should be unable to cause fs corruption on
> /dev/hda1.
>
> Currently foo _can_ cause such corruption, even though it has nothing
> resembling write permissions for device in question.
>
> IMO it is wrong. I'm not saying that it's a real security problem. I'm
> not saying that PFY is not idiot or that his actions make any sense.
> However, I think that situation when he can do that without write
> access to device is just plain wrong.
>
> Does the above make sense?

Sure. And as said `dump` has the same issues.

Andrea

Linus Torvalds

Apr 26, 2001, 16:46:57
To Andrea Arcangeli, Alexander Viro, Alan Cox, linux-...@vger.kernel.org

On Thu, 26 Apr 2001, Andrea Arcangeli wrote:
>
> What I'm saying above is that even without the wait_on_buffer ext2 can
> screwup itself because the splice happens after the buffer are just all
> uptodate so any "reader" (I mean any reader through ext2 not through
> block_dev) will never try to do a bread on that blocks before they're
> just zeroed and uptodate.

I assume you meant "..can _not_ screw up itself..", otherwise the rest of
the sentence doesn't seem to make much sense.

Linus

Richard B. Johnson

Apr 26, 2001, 16:48:16
To Alexander Viro, Andrea Arcangeli, Linus Torvalds, Alan Cox, linux-...@vger.kernel.org
On Thu, 26 Apr 2001, Alexander Viro wrote:

>
>
> On Thu, 26 Apr 2001, Andrea Arcangeli wrote:
>
> > > the wait-on-buffer is not strictly necessary: it's probably there to make
> >
> > maybe not but I need to check some more bit to be sure.
>
> Same scenario, but with read-in-progress started before we do getblk(). BTW,
> old writeback is harmless - we will overwrite anyway. And _that_ can happen
> without direct access to device - truncate() doesn't terminate writeout of
> the indirect blocks it frees (IMO it should, but that's another story).
>

This seems to be the problem reported about a year ago, but never fixed.
It exists, even in early kernels.

mke2fs /dev/fd0
mount /dev/fd0 /mnt
cp stuff /mnt

lilo -C - <<EOF
boot = /dev/fd0
map = /mnt/map
backup = /dev/null
install=/mnt/boot.b
image=/mnt/vmlinuz
initrd=/mnt/initrd
root=/dev/ram0
EOF

umount /dev/fd0
cp /dev/fd0 raw.bin

The disk image, raw.bin, does NOT contain the image of the floppy.
Most of boot stuff added by lilo is missing. It will eventually
get there, but it's not there now, even though the floppy was
un-mounted!

A work-around was to do:

ioctl(fd, FDFLUSH, NULL);

... from a program before copying the image.


Cheers,
Dick Johnson

Penguin : Linux version 2.4.1 on an i686 machine (799.53 BogoMips).

"Memory is like gasoline. You use it up when you are running. Of
course you get it all back when you reboot..."; Actual explanation
obtained from the Micro$oft help desk.

Alexander Viro

Apr 26, 2001, 16:54:57
To Linus Torvalds, Andrea Arcangeli, Alan Cox, linux-...@vger.kernel.org

On Thu, 26 Apr 2001, Linus Torvalds wrote:

> Note that I think all these arguments are fairly bogus. Doing things like
> "dump" on a live filesystem is stupid and dangerous (in my opinion it is
> stupid and dangerous to use "dump" at _all_, but that's a whole 'nother
> discussion in itself), and there really are no valid uses for opening a
> block device that is already mounted. More importantly, I don't think
> anybody actually does.

Agreed.

> The fact that you _can_ do so makes the patch valid, and I do agree with
> Al on the "least surprise" issue. I've already applied the patch, in fact.
> But the fact is that nobody should ever do the thing that could cause
> problems.

IMO the real issue is in fuzzy rules for use of wait_on_buffer(). There is
one case when it's 100% correct - when we had done ll_rw_block() on that
bh at earlier point and want to make sure that it's completed. And there
is a lot of uses that are kinda-sorta correct for UP, but break on SMP.
unmap_buffer() was similar to that race. So are races in minix, sysvfs and
ufs. So is one in block_write() and here the problem is quite real - it's
not as idiotic as device/mounted fs races.

Basically, all legitimate cases are ones where we would be very unhappy
about buffer being not uptodate afterwards.
getblk(); if (!buffer_uptodate) wait_on_buffer();
is not in that class. It _is_ OK on UP as long as we don't block, but on
SMP it doesn't guarantee anything - buffer_head can be in any state
when we are done. IMO all such places require fixing.

How about adding
	if (!buffer_uptodate(bh)) {
		printk(KERN_ERR "IO error or racy use of wait_on_buffer()");
		show_task(current);
	}
in the end of wait_on_buffer() for a while?
Al

Alexander Viro

Apr 26, 2001, 16:59:52
To Richard B. Johnson, Andrea Arcangeli, Linus Torvalds, Alan Cox, linux-...@vger.kernel.org

On Thu, 26 Apr 2001, Richard B. Johnson wrote:

> The disk image, raw.bin, does NOT contain the image of the floppy.
> Most of boot stuff added by lilo is missing. It will eventually
> get there, but it's not there now, even though the floppy was
> un-mounted!

I doubt that you can reproduce that on anything remotely recent.
All buffers are flushed when last user closes device.

Andrea Arcangeli

Apr 26, 2001, 17:08:16
To Linus Torvalds, Alexander Viro, Alan Cox, linux-...@vger.kernel.org
On Thu, Apr 26, 2001 at 01:08:25PM -0700, Linus Torvalds wrote:
> But the fact is that nobody should ever do the thing that could cause
> problems.

dump in 2.4 also gets an incoherent view of the data, which makes things
even worse than in 2.2 (to change that we would have to hash into the
buffer hashtable all the bhs overlapped in the pagecache, and no, I'm not
suggesting that, relax). The only reason it has a chance to work with
ext2 is that ext2 is very dumb: it lacks an inode map, so inodes are at a
predictable location on disk and it cannot run totally out of control.

Andrea

Andrea Arcangeli

Apr 26, 2001, 17:09:46
To Alexander Viro, Linus Torvalds, Alan Cox, linux-...@vger.kernel.org
On Thu, Apr 26, 2001 at 03:55:19PM -0400, Alexander Viro wrote:
>
>
> On Thu, 26 Apr 2001, Andrea Arcangeli wrote:
>
> > On Thu, Apr 26, 2001 at 03:34:00PM -0400, Alexander Viro wrote:
> > > Same scenario, but with read-in-progress started before we do getblk(). BTW,
> >
> > how can the read in progress see a branch that we didn't spliced yet? We
>
> fd = open("/dev/hda1", O_RDONLY);
> read(fd, buf, sizeof(buf));

You misunderstood the context of what I said, I perfectly know the race
you are talking about, I was answering Linus's question "the
wait_on_buffer isn't even necessary to protect ext2 against ext2". You
are talking about the other race that is "ext2" against "block_dev", and
I obviously agree on that one since the first place as I immediatly
answered you "correct".

What I'm saying above is that even without the wait_on_buffer ext2 can
screwup itself because the splice happens after the buffer are just all
uptodate so any "reader" (I mean any reader through ext2 not through
block_dev) will never try to do a bread on that blocks before they're
just zeroed and uptodate.

Andrea

Andrea Arcangeli

Apr 26, 2001, 17:12:44
To Alexander Viro, Linus Torvalds, Alan Cox, linux-...@vger.kernel.org
On Thu, Apr 26, 2001 at 10:11:09PM +0200, Andrea Arcangeli wrote:
> On Thu, Apr 26, 2001 at 03:55:19PM -0400, Alexander Viro wrote:
> >
> >
> > On Thu, 26 Apr 2001, Andrea Arcangeli wrote:
> >
> > > On Thu, Apr 26, 2001 at 03:34:00PM -0400, Alexander Viro wrote:
> > > > Same scenario, but with read-in-progress started before we do getblk(). BTW,
> > >
> > > how can the read in progress see a branch that we didn't spliced yet? We
> >
> > fd = open("/dev/hda1", O_RDONLY);
> > read(fd, buf, sizeof(buf));
>
> You misunderstood the context of what I said, I perfectly know the race
> you are talking about, I was answering Linus's question "the
> wait_on_buffer isn't even necessary to protect ext2 against ext2". You
> are talking about the other race that is "ext2" against "block_dev", and
> I obviously agree on that one since the first place as I immediatly
> answered you "correct".
>
> What I'm saying above is that even without the wait_on_buffer ext2 can
^^^ "cannot" of course

Andrea Arcangeli

Apr 26, 2001, 17:16:35
To Alexander Viro, Linus Torvalds, Alan Cox, linux-...@vger.kernel.org
On Thu, Apr 26, 2001 at 04:49:20PM -0400, Alexander Viro wrote:
> getblk(); if (!buffer_uptodate) wait_on_buffer();
> is not in that class. It _is_ OK on UP as long as we don't block, but on
> SMP it doesn't guarantee anything - buffer_head can be in any state
> when we are done. IMO all such places require fixing.

Yes, actually it's probably ok for most of other "fs" against "fs" cases
because those fses still hold the big lock while handling metadata but
they should really use the lock_buffer way if they want to protect
against the block_dev accesses too.

> How about adding
> if (!buffer_uptodate(bh)) {
> printk(KERN_ERR "IO error or racy use of wait_on_buffer()");
> show_task(current);
> }
> in the end of wait_on_buffer() for a while?

At the _top_ of wait_on_buffer would be better than at the end.

Andrea

Andrea Arcangeli

Apr 26, 2001, 17:18:26
To Linus Torvalds, Alexander Viro, linux-...@vger.kernel.org
On Thu, Apr 26, 2001 at 01:26:15PM -0700, Linus Torvalds wrote:
>
>
> On Thu, 26 Apr 2001, Andrea Arcangeli wrote:
> >
> > What I'm saying above is that even without the wait_on_buffer ext2 can
> > screwup itself because the splice happens after the buffer are just all
> > uptodate so any "reader" (I mean any reader through ext2 not through
> > block_dev) will never try to do a bread on that blocks before they're
> > just zeroed and uptodate.
>
> I assume you meant "..can _not_ screw up itself..", otherwise the rest of

yes, it was a typo sorry.

Andrea

Andrzej Krzysztofowicz

Apr 26, 2001, 17:24:34
To Linus Torvalds, Alexander Viro, Andrea Arcangeli, Alan Cox, kufel!vger.kernel...@green.mif.pg.gda.pl
>
> Note that I think all these arguments are fairly bogus. Doing things like
> "dump" on a live filesystem is stupid and dangerous (in my opinion it is
> stupid and dangerous to use "dump" at _all_, but that's a whole 'nother
> discussion in itself), and there really are no valid uses for opening a
> block device that is already mounted. More importantly, I don't think
> anybody actually does.

I know a few people that often do:

dd if=/dev/hda1 of=/dev/hdc1
e2fsck /dev/hdc1

to make an "exact" copy of a currently working system.

Maybe it is stupid, but they do.
Fortunately, their systems are not SMP...

Andrzej

Alexander Viro

Apr 26, 2001, 19:30:15
To Andrea Arcangeli, Linus Torvalds, Alan Cox, linux-...@vger.kernel.org

On Thu, 26 Apr 2001, Andrea Arcangeli wrote:

> > How about adding
> > if (!buffer_uptodate(bh)) {
> > printk(KERN_ERR "IO error or racy use of wait_on_buffer()");
> > show_task(current);
> > }
> > in the end of wait_on_buffer() for a while?
>
> At the _top_ of wait_on_buffer would be better than at the end.

In that case ll_rw_block() + wait_on_buffer() (absolutely legitimate
combination) will scream at you.

Andrea Arcangeli

Apr 26, 2001, 20:15:44
To Alexander Viro, Linus Torvalds, Alan Cox, linux-...@vger.kernel.org
On Thu, Apr 26, 2001 at 07:25:23PM -0400, Alexander Viro wrote:
>
>
> On Thu, 26 Apr 2001, Andrea Arcangeli wrote:
>
> > > How about adding
> > > if (!buffer_uptodate(bh)) {
> > > printk(KERN_ERR "IO error or racy use of wait_on_buffer()");
> > > show_task(current);
> > > }
> > > in the end of wait_on_buffer() for a while?
> >
> > At the _top_ of wait_on_buffer would be better than at the end.
>
> In that case ll_rw_block() + wait_on_buffer() (absolutely legitimate
> combination) will scream at you.

--- 2.4.4pre7/include/linux/locks.h	Thu Apr 26 05:22:11 2001
+++ 2.4.4pre7aa1/include/linux/locks.h	Fri Apr 27 01:52:31 2001
@@ -18,6 +18,11 @@
 {
 	if (test_bit(BH_Lock, &bh->b_state))
 		__wait_on_buffer(bh);
+	else if (!buffer_uptodate(bh)) {
+		__label__ here;
+	here:
+		printk(KERN_ERR "IO error or racy use of wait_on_buffer() from %p\n", &&here);
+	}
 }
 
 extern inline void lock_buffer(struct buffer_head * bh)

Andrea

Linus Torvalds

Apr 26, 2001, 20:25:33
To Andrea Arcangeli, Alexander Viro, Alan Cox, linux-...@vger.kernel.org

On Fri, 27 Apr 2001, Andrea Arcangeli wrote:
> + __label__ here;
> + here:
> + printk(KERN_ERR "IO error or racy use of wait_on_buffer() from %p\n", &&here);

Detail nit: don't do this. Use "current_text_addr()" instead. Simpler to
read, and gcc will actually do the right thing wrt inlining etc.

Linus

Andrea Arcangeli

Apr 26, 2001, 20:44:38
To Linus Torvalds, Alexander Viro, Alan Cox, linux-...@vger.kernel.org
On Thu, Apr 26, 2001 at 05:19:26PM -0700, Linus Torvalds wrote:
> Detail nit: don't do this. Use "current_text_addr()" instead. Simpler to
> read, and gcc will actually do the right thing wrt inlining etc.

Agreed, thanks for the info.

Andrea

Richard B. Johnson

Apr 27, 2001, 07:48:20
To Alexander Viro, Andrea Arcangeli, Linus Torvalds, Alan Cox, linux-...@vger.kernel.org
On Thu, 26 Apr 2001, Alexander Viro wrote:

>
>
> On Thu, 26 Apr 2001, Richard B. Johnson wrote:
>
> > The disk image, raw.bin, does NOT contain the image of the floppy.
> > Most of boot stuff added by lilo is missing. It will eventually
> > get there, but it's not there now, even though the floppy was
> > un-mounted!
>
> I doubt that you can reproduce that on anything remotely recent.
> All buffers are flushed when last user closes device.
>

2.4.3

Buffers are not flushed (actually written) to disk. The floppy continues
to be written for 20 to 30 seconds after `umount` returns to
the shell. A program like `cp`, accessing the raw device during this time,
does not get what will eventually be written.

Cheers,
Dick Johnson

Penguin : Linux version 2.4.1 on an i686 machine (799.53 BogoMips).

"Memory is like gasoline. You use it up when you are running. Of
course you get it all back when you reboot..."; Actual explanation
obtained from the Micro$oft help desk.

Vojtech Pavlik

Apr 27, 2001, 09:08:49
To Linus Torvalds, Alexander Viro, Andrea Arcangeli, Alan Cox, linux-...@vger.kernel.org
On Thu, Apr 26, 2001 at 01:08:25PM -0700, Linus Torvalds wrote:

> On Thu, 26 Apr 2001, Alexander Viro wrote:
> > On Thu, 26 Apr 2001, Andrea Arcangeli wrote:
> > >
> > > how can the read in progress see a branch that we didn't spliced yet? We
> >
> > fd = open("/dev/hda1", O_RDONLY);
> > read(fd, buf, sizeof(buf));
>
> Note that I think all these arguments are fairly bogus. Doing things like
> "dump" on a live filesystem is stupid and dangerous (in my opinion it is
> stupid and dangerous to use "dump" at _all_, but that's a whole 'nother
> discussion in itself), and there really are no valid uses for opening a
> block device that is already mounted. More importantly, I don't think
> anybody actually does.

Actually this is done quite often, even on mounted fs's:

hdparm -t /dev/hda

> The fact that you _can_ do so makes the patch valid, and I do agree with
> Al on the "least surprise" issue. I've already applied the patch, in fact.
> But the fact is that nobody should ever do the thing that could cause
> problems.

--
Vojtech Pavlik
SuSE Labs

Alexander Viro

Apr 27, 2001, 09:28:50
To Vojtech Pavlik, Linus Torvalds, Andrea Arcangeli, Alan Cox, linux-...@vger.kernel.org

On Fri, 27 Apr 2001, Vojtech Pavlik wrote:

> Actually this is done quite often, even on mounted fs's:
>
> hdparm -t /dev/hda

You would need either hdparm -t /dev/hda<something> or mounting the
whole /dev/hda.

Buffer cache for the disk is unrelated to buffer cache for partitions.

Andi Kleen

Apr 27, 2001, 10:31:59
To Linus Torvalds, Alexander Viro, Andrea Arcangeli, Alan Cox, linux-...@vger.kernel.org
On Thu, Apr 26, 2001 at 01:08:25PM -0700, Linus Torvalds wrote:
> Note that I think all these arguments are fairly bogus. Doing things like
> "dump" on a live filesystem is stupid and dangerous (in my opinion it is
> stupid and dangerous to use "dump" at _all_, but that's a whole 'nother
> discussion in itself), and there really are no valid uses for opening a
> block device that is already mounted. More importantly, I don't think
> anybody actually does.

You can use LVM snapshot volumes to do it safely.

-Andi

Ville Herva

Apr 27, 2001, 10:35:08
To Alexander Viro, Linus Torvalds, linux-...@vger.kernel.org
On Fri, Apr 27, 2001 at 09:23:57AM -0400, you [Alexander Viro] claimed:

>
>
> On Fri, 27 Apr 2001, Vojtech Pavlik wrote:
>
> > Actually this is done quite often, even on mounted fs's:
> >
> > hdparm -t /dev/hda
>
> You would need either hdparm -t /dev/hda<something> or mounting the
> whole /dev/hda.
>
> Buffer cache for the disk is unrelated to buffer cache for parititions.

Well, I for one have been running

hdparm -t /dev/md0
or
time head -c 1000m /dev/md0 > /dev/null

while /dev/md0 was mounted without realizing that this could be "stupid" or
that it could eat my data.

/dev/md0 on /backup-versioned type ext2 (rw)

I often cat(1) or head(1) partitions or devices (even mounted ones) if I
need dummy randomish test data for compression or tape drives (that I've
been having trouble with).

BTW: is 2.2 affected? 2.0?


-- v --

v...@iki.fi

Linus Torvalds

Apr 27, 2001, 12:54:44
To Vojtech Pavlik, Alexander Viro, Andrea Arcangeli, Alan Cox, linux-...@vger.kernel.org

On Fri, 27 Apr 2001, Vojtech Pavlik wrote:
>
> Actually this is done quite often, even on mounted fs's:
>
> hdparm -t /dev/hda

Note that this one happens to be ok.

The buffer cache is "virtual" in the sense that /dev/hda is a completely
separate name-space from /dev/hda1, even if there is some physical
overlap.

Linus

Linus Torvalds

Apr 27, 2001, 13:02:07
To Neil Conway, Kernel Mailing List

[ linux-kernel added back as a cc ]

On Fri, 27 Apr 2001, Neil Conway wrote:
>
> I'm surprised that dump is deprecated (by you at least ;-)). What to
> use instead for backups on machines that can't umount disks regularly?

Note that dump simply won't work reliably at all even in 2.4.x: the buffer
cache and the page cache (where all the actual data is) are not
coherent. This is only going to get even worse in 2.5.x, when the
directories are moved into the page cache as well.

So anybody who depends on "dump" getting backups right is already playing
Russian roulette with their backups. It's not at all guaranteed to get the
right results - you may end up having stale data in the buffer cache that
ends up being "backed up".

Dump was a stupid program in the first place. Leave it behind.

> I've always thought "tar" was a bit undesirable (updates atimes or
> ctimes for example).

Right now, the cpio/tar/xxx solutions are definitely the best ones, and
will work on multiple filesystems (another limitation of "dump"). Whatever
problems they have, they are still better than the _guaranteed_(*) data
corruptions of "dump".

However, it may be that in the long run it would be advantageous to have a
"filesystem maintenance interface" for doing things like backups and
defragmentation..

Linus

(*) Dump may work fine for you a thousand times. But it _will_ fail under
the right circumstances. And there is nothing you can do about it.

Martin Dalecki

2001-04-27 13:18:11
To: Linus Torvalds, Neil Conway, Kernel Mailing List
Linus Torvalds wrote:

> Dump was a stupid program in the first place. Leave it behind.

Not quite, Linus - dump/restore are nice tools for creating, for example,
automatic over-network installation servers, i.e. efficient system images
and the like. tar/cpio and friends don't deal properly with

a. holes inside of files.
b. hardlinks between files.

Really, they are not useless. However, I wouldn't recommend them
for backup purposes either.

Please see for example:

http://www.systime-solutions.de/index.php?topic=produkte&subtopic=setupserver

Well yes, if you understand german...

Jeff Garzik

2001-04-27 13:38:24
To: Linus Torvalds, Neil Conway, Kernel Mailing List
Linus Torvalds wrote:
> On Fri, 27 Apr 2001, Neil Conway wrote:
> >
> > I'm surprised that dump is deprecated (by you at least ;-)). What to
> > use instead for backups on machines that can't umount disks regularly?
>
> Note that dump simply won't work reliably at all even in 2.4.x: the buffer
> cache and the page cache (where all the actual data is) are not
> coherent. This is only going to get even worse in 2.5.x, when the
> directories are moved into the page cache as well.

> Dump was a stupid program in the first place. Leave it behind.

Dump/restore are useful, on-line dump is silly. I am personally amazed
that on-line, mounted dump was -ever- supported. I guess it would work
if mounted ro...

dump is still the canonical solution, IMHO, for saving and restoring
filesystem metadata OFFLINE. tar/cpio can be taught to do stuff like
security ACLs and EAs and such, but such code and formats are not yet
standardized, and they do not approach dump when it comes to taking an
accurate snapshot of the filesystem.

--
Jeff Garzik | Disbelief, that's why you fail.
Building 1024 |
MandrakeSoft |

LA Walsh

2001-04-27 14:06:43
To: Andrzej Krzysztofowicz, linux-...@vger.kernel.org
Andrzej Krzysztofowicz wrote:

> I know a few people that often do:
>
> dd if=/dev/hda1 of=/dev/hdc1
> e2fsck /dev/hdc1
>
> to make an "exact" copy of a currently working system.

---
Presumably this isn't a problem if the source disks are either unmounted or mounted 'read-only'?


--
The above thoughts and | They may have nothing to do with
writings are my own. | the opinions of my employer. :-)
L A Walsh | Trust Technology, Core Linux, SGI
l...@sgi.com | Voice: (650) 933-5338

dek...@konerding.com

2001-04-27 14:19:32
To: LA Walsh, linux-...@vger.kernel.org
On Fri, Apr 27, 2001 at 11:02:17AM -0700, LA Walsh wrote:
> Andrzej Krzysztofowicz wrote:
>
> > I know a few people that often do:
> >
> > dd if=/dev/hda1 of=/dev/hdc1
> > e2fsck /dev/hdc1
> >
> > to make an "exact" copy of a currently working system.
>
> ---
> Presumably this isn't a problem if the source disks are either unmounted or mounted 'read-only'?
>
>

I thought the best known solution for this was to use COW snapshots,
because then you copy the filesystem in exactly the state it was in when
the snapshot was made, without impacting the writability of the filesystem
while the (potentially very long) dump is made?

I tried using this on LVM, but after seeing a few messages on the list about
kernel oopses happening with snapshots of filesystems with heavy write
activities, as well as experiencing serious problems with the LVM userspace
tools (they would core dump on startup if the LVM filesystem had any sort
of corruption or integrity failure) I decided to put it away until the LVM
folks managed to get a production version ready.

Shane Wegner

2001-04-27 15:24:02
To: linux-...@vger.kernel.org
On Fri, Apr 27, 2001 at 09:52:19AM -0700, Linus Torvalds wrote:
>
> On Fri, 27 Apr 2001, Vojtech Pavlik wrote:
> >
> > Actually this is done quite often, even on mounted fs's:
> >
> > hdparm -t /dev/hda
>
> Note that this one happens to be ok.
>
> The buffer cache is "virtual" in the sense that /dev/hda is a completely
> separate name-space from /dev/hda1, even if there is some physical
> overlap.

Wouldn't something like "hdparm -t /dev/md0" trigger it,
though? It is the same device that gets mounted, as md
devices aren't partitioned.

Shane


--
Shane Wegner: sh...@cm.nu
http://www.cm.nu/~shane/
PGP: 1024D/FFE3035D
A0ED DAC4 77EC D674 5487
5B5C 4F89 9A4E FFE3 035D

Albert D. Cahalan

2001-04-28 00:58:56
To: Linus Torvalds, Vojtech Pavlik, Alexander Viro, Andrea Arcangeli, Alan Cox, linux-...@vger.kernel.org
Linus Torvalds writes:

> The buffer cache is "virtual" in the sense that /dev/hda is a
> completely separate name-space from /dev/hda1, even if there
> is some physical overlap.

So the aliasing problems and elevator algorithm confusion remain?
Is this ever likely to change, and what is with the 1 kB assumptions?
(Hmmm, cruft left over from the 1 kB Minix filesystem blocks?)

Matthias Urlichs

2001-04-28 04:34:24
To: Martin Dalecki, linux-...@vger.kernel.org
Martin Dalecki :

> tar/cpio and friends don't deal properly with
>
> a. holes inside of files.
> b. hardlinks between files.
>
??? GNU tar does both. The only thing it currently cannot handle is Not
Changing Anything: either atime or ctime _will_ be updated.

Jens Axboe

2001-04-28 06:12:56
To: Albert D. Cahalan, Linus Torvalds, Vojtech Pavlik, Alexander Viro, Andrea Arcangeli, Alan Cox, linux-...@vger.kernel.org
On Sat, Apr 28 2001, Albert D. Cahalan wrote:
> Linus Torvalds writes:
>
> > The buffer cache is "virtual" in the sense that /dev/hda is a
> > completely separate name-space from /dev/hda1, even if there
> > is some physical overlap.
>
> So the aliasing problems and elevator algorithm confusion remain?

At least for the I/O scheduler confusion, requests to partitions will
remap the buffer location and this problem disappears nicely. It's not a
big issue, really.

> Is this ever likely to change, and what is with the 1 kB assumptions?
> (Hmmm, cruft left over from the 1 kB Minix filesystem blocks?)

What 1kB assumption?

--
Jens Axboe

Olaf Titz

2001-04-28 09:36:59
To: linux-...@vger.kernel.org
> or such. tar/cpio and friends don't deal properly with
> a. holes inside of files.
> b. hardlinks between files.

GNU tar handles both of these. (Not particularly efficiently in the
case of sparse files, but that's a minor issue in this case.) See the -S flag.

Perhaps more importantly, for a _robust_ backup solution which can
deal with partially unreadable tapes, you have pretty much no option
other than tar for the actual storage.

Olaf

Neil Conway

2001-04-30 04:53:19
To: Linus Torvalds, linux-...@vger.kernel.org
Hiya.

Linus Torvalds wrote:
> So anybody who depends on "dump" getting backups right is already playing
> Russian roulette with their backups. It's not at all guaranteed to get the
> right results - you may end up having stale data in the buffer cache that
> ends up being "backed up".
>
> Dump was a stupid program in the first place. Leave it behind.

Ouch. I just re-read the man page and it doesn't caution (*) against
using it on mounted filesystems. That probably means that there are
thousands of other losers like me using it on production machines.
Volunteers to (a) change the man page, (b) talk to the distros about
dumping "dump"?

> However, it may be that in the long run it would be advantageous to have a
> "filesystem maintenance interface" for doing things like backups and
> defragmentation..

Yup, sounds good.

Neil

(*) The KNOWNBUGS file mentions "possible" problems while dumping active
mounted filesystems, but I've (elsewhere) seen these characterised as no
real problem; also, this falls a long way short of discouraging use in
this fashion.

vol...@mindspring.com

2001-05-03 02:26:02
To: Linus Torvalds, Alexander Viro, Andrea Arcangeli, Alan Cox, linux-...@vger.kernel.org

On Thu, 26 Apr 2001, Linus Torvalds wrote:

>
>
> On Thu, 26 Apr 2001, Alexander Viro wrote:
> > On Thu, 26 Apr 2001, Andrea Arcangeli wrote:
> > >
> > > how can the read in progress see a branch that we didn't spliced yet? We
> >
> > fd = open("/dev/hda1", O_RDONLY);
> > read(fd, buf, sizeof(buf));
>

> Note that I think all these arguments are fairly bogus. Doing things like
> "dump" on a live filesystem is stupid and dangerous (in my opinion it is
> stupid and dangerous to use "dump" at _all_, but that's a whole 'nother
> discussion in itself), and there really are no valid uses for opening a
> block device that is already mounted. More importantly, I don't think
> anybody actually does.

Actually I did. I might do it again :) The point was to get the kernel to
cache certain blocks in the RAM.

Vladimir Dergachev

>
> The fact that you _can_ do so makes the patch valid, and I do agree with
> Al on the "least surprise" issue. I've already applied the patch, in fact.
> But the fact is that nobody should ever do the thing that could cause
> problems.
>

> Linus

Alan Cox

2001-05-03 05:29:21
To: vol...@mindspring.com, Linus Torvalds, Alexander Viro, Andrea Arcangeli, Alan Cox, linux-...@vger.kernel.org
> > discussion in itself), and there really are no valid uses for opening a
> > block device that is already mounted. More importantly, I don't think
> > anybody actually does.
>
> Actually I did. I might do it again :) The point was to get the kernel to
> cache certain blocks in the RAM.

Ditto for some CD based stuff. You burn the important binaries to the front
of the CD, then at boot dd 64Mb to /dev/null to prime the libraries and
avoid a lot of seeking during boot up from the CD-ROM.

However I could do that from an initrd before mounting

Alan

Linus Torvalds

2001-05-03 13:24:15
To: Alan Cox, vol...@mindspring.com, Alexander Viro, Andrea Arcangeli, linux-...@vger.kernel.org

On Thu, 3 May 2001, Alan Cox wrote:
>
> > > discussion in itself), and there really are no valid uses for opening a
> > > block device that is already mounted. More importantly, I don't think
> > > anybody actually does.
> >
> > Actually I did. I might do it again :) The point was to get the kernel to
> > cache certain blocks in the RAM.
>
> Ditto for some CD based stuff. You burn the important binaries to the front
> of the CD, then at boot dd 64Mb to /dev/null to prime the libraries and
> avoid a lot of seeking during boot up from the CD-ROM.
>
> However I could do that from an initrd before mounting

Ehh. Doing that would be extremely stupid, and would slow down your boot
and nothing more.

The page cache is _not_ coherent with the buffer cache. For any filesystem
that uses the page cache for data caching (which pretty much all of them
do, because it's the only way to get sane mmap semantics, and it's a lot
faster than the old buffer cache ever was), the above will do _nothing_
but spend time doing IO that the page cache will just end up doing again.

Currently it can help to pre-load the meta-data, but quite frankly, even
that is suspect, and won't work in 2.5.x when Al's metadata page-cache
stuff is merged (at least directories, and likely inodes too).

In short, don't do it. It doesn't work reliably (and hasn't since 2.0.x),
and it will only get more and more unreliable.

Linus

Rogier Wolff

2001-05-04 07:43:36
To: Linus Torvalds, Alan Cox, vol...@mindspring.com, Alexander Viro, Andrea Arcangeli, linux-...@vger.kernel.org
Linus Torvalds wrote:
>
> On Thu, 3 May 2001, Alan Cox wrote:
> > Ditto for some CD based stuff. You burn the important binaries to the front
> > of the CD, then at boot dd 64Mb to /dev/null to prime the libraries and
> > avoid a lot of seeking during boot up from the CD-ROM.
> >
> > However I could do that from an initrd before mounting
>
> Ehh. Doing that would be extremely stupid, and would slow down your boot
> and nothing more.

Ehhh, Linus, linearly reading my harddisk goes at 26Mb per second. By
analyzing my boot process I determine that 50M of my disk is used
during boot. I can then reshuffle my disk to have that 50M of data at
the beginning, and by reading all that into 50M of cache I can save
thousands of 10ms seeks. Boot time would go from several tens of
seconds to 2 seconds worth of DISK IO plus several seconds of pure CPU
time.

This doesn't work if I don't have the memory to cache 50M of
disk-blocks.

Is this simply: Don't try this then?

Roger.


--
** R.E....@BitWizard.nl ** http://www.BitWizard.nl/ ** +31-15-2137555 **
*-- BitWizard writes Linux device drivers for any device you may have! --*
* There are old pilots, and there are bold pilots.
* There are also old, bald pilots.

Jens Axboe

2001-05-04 08:02:13
To: Rogier Wolff, Linus Torvalds, Alan Cox, vol...@mindspring.com, Alexander Viro, Andrea Arcangeli, linux-...@vger.kernel.org
On Fri, May 04 2001, Rogier Wolff wrote:
> > On Thu, 3 May 2001, Alan Cox wrote:
> > > Ditto for some CD based stuff. You burn the important binaries to the front
> > > of the CD, then at boot dd 64Mb to /dev/null to prime the libraries and
> > > avoid a lot of seeking during boot up from the CD-ROM.
> > >
> > > However I could do that from an initrd before mounting
> >
> > Ehh. Doing that would be extremely stupid, and would slow down your boot
> > and nothing more.
>
> Ehhh, Linus, Linearly reading my harddisk goes at 26Mb per second. By
> analyzing my boot process I determine that 50M of my disk is used
> during boot. I can then reshuffle my disk to have that 50M of data at
> the beginning and reading all that into 50M of cache, I can save
> thousands of 10ms seeks. Boot time would go from several tens of
> seconds to 2 seconds worth of DISK IO plus several seconds of pure CPU
> time.

Provided that the buffer cache and page cache are coherent, which they
are not. So at most you'll cache fs meta data by doing the dd trick,
which is hardly worth the effort.

Or you can rewrite block_read/write to use the page cache, in which case
you'd have more luck doing the above.

--
Jens Axboe

Marc SCHAEFER

2001-05-04 08:21:29
To: linux-...@vger.kernel.org
Rogier Wolff <R.E....@BitWizard.nl> wrote:
> during boot. I can then reshuffle my disk to have that 50M of data at
> the beginning and reading all that into 50M of cache, I can save

Wasn't that one of the goals of the LVM project, along with snapshots and
block-level HSM?

Andrea Arcangeli

2001-05-04 11:37:17
To: Jens Axboe, Rogier Wolff, Linus Torvalds, Alan Cox, vol...@mindspring.com, Alexander Viro, linux-...@vger.kernel.org
On Fri, May 04, 2001 at 01:56:14PM +0200, Jens Axboe wrote:
> Or you can rewrite block_read/write to use the page cache, in which case
> you'd have more luck doing the above.

once block_dev is in pagecache there will obviously be no way to share
cache between the block device and the filesystem, because all the
caches will be in completely different address spaces.

Andrea

Linus Torvalds

2001-05-04 13:30:15
To: Rogier Wolff, Alan Cox, vol...@mindspring.com, Alexander Viro, Andrea Arcangeli, linux-...@vger.kernel.org

On Fri, 4 May 2001, Rogier Wolff wrote:

>
> Linus Torvalds wrote:
> >
> > Ehh. Doing that would be extremely stupid, and would slow down your boot
> > and nothing more.
>
> Ehhh, Linus, Linearly reading my harddisk goes at 26Mb per second.

You obviously didn't read my explanation of _why_ it is stupid.

> By analyzing my boot process I determine that 50M of my disk is used
> during boot. I can then reshuffle my disk to have that 50M of data at
> the beginning and reading all that into 50M of cache, I can save
> thousands of 10ms seeks.

No. Have you _tried_ this?

What the above would do is to move 50M of the disk into the buffer cache.

Then, a second later, when the boot proceeds, Linux would start filling
the page cache.

BY READING THE CONTENTS FROM DISK AGAIN!

In short, by doing a "dd" from the disk, you would _not_ help anything at
all. You would only make things slower, by reading things twice.

The Linux buffer cache and page cache are two separate entities. They are
not synchronized, and they are indexed through totally different
means. The page cache is virtually indexed by <inode,pagenr>, while the
buffer cache is indexed by <dev,blocknr,blocksize>.

> Is this simply: Don't try this then?

Try it. You will see.

You _can_ actually try to optimize certain things with 2.4.x: all
meta-data is still in the buffer cache in 2.4.x, so what you could do is
to lay out the image so that the metadata is at the front of the disk,
and do the "dd" to cache just the metadata. Even then you need to be
careful, and make sure that the "dd" uses the same block size as the
filesystem will use.

And even that will largely stop working very early in 2.5.x when the
directory contents and possibly inode and bitmap metadata moves into the
page cache.

Now, you may ask "why use the page cache at all then"? The answer is that
the page cache is a _lot_ faster to look up, exactly because of the
virtual indexing (and also because the data structure is much better
designed - fixed-size entities with none of the complexities of the buffer
cache. The buffer cache needs to be able to do IO, while the page cache is
_only_ a cache and does that one thing really well - doing IO is a
completely separate issue with the page cache).

Now, if you want to speed up accesses, there are things you can do. You
can lay out the filesystem in the access order - trace the IO accesses at
bootup ("which file, which offset, which metadata block?") and lay out the
blocks of the files in exactly the right order. Then you will get linear
reads _without_ doing any "dd" at all.

Now, laying out the filesystem that way is _hard_. No question about it.
It's kind of equivalent to doing a filesystem "defragment" operation,
except you use a different sorting function (instead of sorting blocks
linearly within each file, you sort according to access order).

Linus

Alexander Viro

2001-05-04 13:41:45
To: Linus Torvalds, Rogier Wolff, Alan Cox, vol...@mindspring.com, Andrea Arcangeli, linux-...@vger.kernel.org

On Fri, 4 May 2001, Linus Torvalds wrote:

> Now, if you want to speed up accesses, there are things you can do. You
> can lay out the filesystem in the access order - trace the IO accesses at
> bootup ("which file, which offset, which metadata block?") and lay out the
> blocks of the files in exactly the right order. Then you will get linear
> reads _without_ doing any "dd" at all.
>
> Now, laying out the filesystem that way is _hard_. No question about it.
> It's kind of equivalent to doing a filesystem "defragment" operation,
> except you use a different sorting function (instead of sorting blocks
> linearly within each file, you sort according to access order).

Ehh... There _is_ a way to deal with that, but it's deeply Albertesque:
* add pagecache access for block device
* put your "real" root on /dev/loop0 (setup from initrd)
* dd
The last step will populate pagecache for underlying device and later
access to root fs will ultimately hit said pagecache, be it from page
cache of files or buffer cache of /dev/loop0 - loop_make_request() will
take care of that, by copying data from pagecache of /dev/<real_device>.

Al, feeling sadistic today...

Linus Torvalds

2001-05-04 13:43:34
To: Andrea Arcangeli, Jens Axboe, Rogier Wolff, Alan Cox, vol...@mindspring.com, Alexander Viro, linux-...@vger.kernel.org

On Fri, 4 May 2001, Andrea Arcangeli wrote:

> On Fri, May 04, 2001 at 01:56:14PM +0200, Jens Axboe wrote:
> > Or you can rewrite block_read/write to use the page cache, in which case
> > you'd have more luck doing the above.
>
> once block_dev is in pagecache there will obviously be no way to share
> cache between the block device and the filesystem, because all the
> caches will be in completely different address spaces.

They already pretty much are.

I do want to re-write block_read/write to use the page cache, but not
because it would impact anything in this discussion. I want to do it early
in 2.5.x, because:

- it will speed up accesses
- it will re-use existing code better and conceptualize things more
cleanly (ie it would turn a disk into a _really_ simple filesystem with
just one big file ;).
- it will make MM handling much better for things like fsck - the memory
pressure is designed to work on page cache things.
- it will be one less thing that uses the buffer cache as a "cache" (I
want people to think of, and use, the buffer cache as an _IO_ entity,
not a cache).

It will not make the "cache at bootup" thing change at all (because even
in the page cache, there is no commonality between a virtual mapping of a
_file_ (or metadata) and a virtual mapping of a _disk_).

It would have hidden the problem with "dd" or "dump" touching buffer cache
blocks that the filesystem was using, so the original metadata corruption
that started this thread would not happen. But that's not a design issue
or a design goal, that would just have been a random result.

Linus

Linus Torvalds

2001-05-04 13:58:05
To: Alexander Viro, Rogier Wolff, Alan Cox, vol...@mindspring.com, Andrea Arcangeli, linux-...@vger.kernel.org

On Fri, 4 May 2001, Alexander Viro wrote:
>
> Ehh... There _is_ a way to deal with that, but it's deeply Albertesque:
> * add pagecache access for block device
> * put your "real" root on /dev/loop0 (setup from initrd)
> * dd

You're one sick puppy.

Now, the above is basically equivalent to using and populating a
dynamically sized ramdisk.

If you really want to go this way, I'd much rather see you using a real
ram-disk (that you populate at startup with something like a compressed
tar-file). THAT is definitely going to speed up booting - thanks to
compression you'll not only get linear reads, but you will get fewer reads
than the amount of data you need would imply.

Couple that with tmpfs, or possibly something like coda (to dynamically
move things between the ramdisk and the "backing store" filesystem), and
you can get a ramdisk approach that actually shrinks (and, in the case of
coda or whatever, truly grows) dynamically.

Think of it as an exercise in multi-level filesystems and filesystem
management. Others have done it before (usually between disk and tape, or
disk and network), and in these days of ever-growing memory it might just
make sense to do it on that level too.

(No, I don't seriously think it makes sense today. But if RAM keeps
growing and becoming ever cheaper, it might some day. At the point where
everybody has multi-gigabyte memories, and don't really need it for
anything but caching, you could think of it as just moving the caching to
a higher level - you don't cache blocks, you cache parts of the
filesystem).

> Al, feeling sadistic today...

Sadistic you are.

Linus

Richard Gooch

2001-05-04 14:22:40
To: Linus Torvalds, Rogier Wolff, Alan Cox, vol...@mindspring.com, Alexander Viro, Andrea Arcangeli, linux-...@vger.kernel.org
Linus Torvalds writes:
> Now, if you want to speed up accesses, there are things you can
> do. You can lay out the filesystem in the access order - trace the
> IO accesses at bootup ("which file, which offset, which metadata
> block?") and lay out the blocks of the files in exactly the right
> order. Then you will get linear reads _without_ doing any "dd" at
> all.

A year ago I came up with an alternative approach for cache warming,
but I see that it wouldn't work with our current infrastructure.
However, maybe there is still a way to use the basic technique. If so,
please make suggestions.

The idea I had (motivated by the desire to eliminate random disc
seeks, which is the limiting factor in how fast my boxes boot) was:

- init(8) issues an ioctl(2) on the root FS block device which turns
on recording of block reads (it records block numbers)

- at the end of the bootup process, init(8) issues another ioctl(2) to
grab the buffered block numbers, and turn off recording

- init(8) then sorts this list in ascending order and saves the result
in a file

- next boot, init(8) checks the file, and if it exists, opens the root
FS block device and reads in each block listed in the file. The
effect is to warm the buffer cache extremely quickly. The head will
move in one direction, grabbing data as it flies by. I expect this
will take around 1 second

- init(8) now continues the boot process (starting the magic ioctl(2)
again so as to get a fresh list of blocks, in case something has
changed)

- booting is now super fast, thanks to no disc activity.

The advantage of this scheme over blindly reading the first 50 MB is
that it only reads in what you *need*, and thus will work better on
low memory systems. It's also useful for other applications, not just
speeding up the boot process.

However, doing an ioctl(2) on the block device won't help. So the
question is, where to add the hook? One possibility is the FS, and
record inum,bnum pairs. But of course we don't have a way of accessing
via inum in user-space. So that's no good. Besides, we want to get
block numbers on the block device, because that's the only meaningful
number to resort.

So, what, then? Some kind of hook on the page cache? Ideas?

Regards,

Richard....
Permanent: rgo...@atnf.csiro.au
Current: rgo...@ras.ucalgary.ca

Alexander Viro

2001-05-04 14:31:32
To: Linus Torvalds, Rogier Wolff, Alan Cox, vol...@mindspring.com, Andrea Arcangeli, linux-...@vger.kernel.org

On Fri, 4 May 2001, Linus Torvalds wrote:

>
> On Fri, 4 May 2001, Alexander Viro wrote:
> >
> > Ehh... There _is_ a way to deal with that, but it's deeply Albertesque:

^^^^^^^^^^^^^^^^^^^^^^^^^^^


> > * add pagecache access for block device
> > * put your "real" root on /dev/loop0 (setup from initrd)
> > * dd
>
> You're one sick puppy.

[snip]
/me bows

Nice to see that the imitation was good enough ;-) Seriously, I half-expected
Albert to show up at that point of the thread and tried to anticipate what
he'd produce.

ObProcfs: I don't think that walking the page tables is a good way to
compute RSS, especially since VM maintains the thing. Mind if I rip
it out? In effect, the implementation of /proc/<pid>/statm
* produces extremely bogus values (VMA is from a library if it goes
beyond 0x60000000? Might even have been true 7 years ago...) and nobody
has cared about them for 6-7 years
* makes stuff like top(1) _walk_ _whole_ _page_ _tables_ _of_ _all_
_processes_ each 5 seconds. No wonder it's slow like hell and eats
tons of CPU time.

Alexander Viro

2001-05-04 14:38:50
To: Richard Gooch, Linus Torvalds, Rogier Wolff, Alan Cox, vol...@mindspring.com, Andrea Arcangeli, linux-...@vger.kernel.org

On Fri, 4 May 2001, Richard Gooch wrote:

> However, doing an ioctl(2) on the block device won't help. So the
> question is, where to add the hook? One possibility is the FS, and
> record inum,bnum pairs. But of course we don't have a way of accessing
> via inum in user-space. So that's no good. Besides, we want to get
> block numbers on the block device, because that's the only meaningful
> number to resort.
>
> So, what, then? Some kind of hook on the page cache? Ideas?

Two of them: use less bloated shell (and link it statically) and clean
your rc scripts.

Richard Gooch

2001-05-04 14:54:34
To: Alexander Viro, Linus Torvalds, Rogier Wolff, Alan Cox, vol...@mindspring.com, Andrea Arcangeli, linux-...@vger.kernel.org
Alexander Viro writes:
>
>
> On Fri, 4 May 2001, Richard Gooch wrote:
>
> > However, doing an ioctl(2) on the block device won't help. So the
> > question is, where to add the hook? One possibility is the FS, and
> > record inum,bnum pairs. But of course we don't have a way of accessing
> > via inum in user-space. So that's no good. Besides, we want to get
> > block numbers on the block device, because that's the only meaningful
> > number to resort.
> >
> > So, what, then? Some kind of hook on the page cache? Ideas?
>
> Two of them: use less bloated shell (and link it statically) and
> clean your rc scripts.

No, because I'm not using the latest bloated version of bash, and I'm
not using the slow and bloated RedHat boot scripts. My boot scripts
are lean and mean. Oh. And I already have init(8) warming the cache
with these scripts.

The problem is all the various daemons and system utilities (mount,
hwclock, ifconfig and so on) that turn a kernel into a useful system.
And then of course there's X...

Sorry. A "don't do that then" answer isn't appropriate for this
problem space.

Regards,

Richard....
Permanent: rgo...@atnf.csiro.au
Current: rgo...@ras.ucalgary.ca

Jens Axboe

2001-05-04 15:07:12
To: Richard Gooch, Linus Torvalds, Rogier Wolff, Alan Cox, vol...@mindspring.com, Alexander Viro, Andrea Arcangeli, linux-...@vger.kernel.org
On Fri, May 04 2001, Richard Gooch wrote:
> The idea I had (motivated by the desire to eliminate random disc
> seeks, which is the limiting factor in how fast my boxes boot) was:
>
> - init(8) issues an ioctl(2) on the root FS block device which turns
> on recording of block reads (it records block numbers)
>
> - at the end of the bootup process, init(8) issues another ioctl(2) to
> grab the buffered block numbers, and turn off recording
>
> - init(8) then sorts this list in ascending order and saves the result
> in a file
>
> - next boot, init(8) checks the file, and if it exists, opens the root
> FS block device and reads in each block listed in the file. The
> effect is to warm the buffer cache extremely quickly. The head will
> move in one direction, grabbing data as it flys by. I expect this
> will take around 1 second
>
> - init(8) now continues the boot process (starting the magic ioctl(2)
> again so as to get a fresh list of blocks, in case something has
> changed)
>
> - booting is now super fast, thanks to no disc activity.

I did 95% of what you need sometime last year, to do I/O scheduler
profiling (blocks requested, merge stats, request sent to disk). It was
a pretty gross hack, requiring a pretty big ring buffer of kernel memory
to be able to log at a sufficiently fast rate (you'd be amazed how much
output a single dbench 48 run produces :-). A user space app would read
data from a simple char device and save it for later inspection.

A better approach would be to map the ring buffer from the user app, but
it was just a quick fix.

--
Jens Axboe

Linus Torvalds

2001-05-04 15:16:34
To: Alexander Viro, Rogier Wolff, Alan Cox, vol...@mindspring.com, Andrea Arcangeli, linux-...@vger.kernel.org

On Fri, 4 May 2001, Alexander Viro wrote:
>
> ObProcfs: I don't think that walking the page tables is a good way to
> compute RSS, especially since VM maintains the thing.

Well, the VM didn't always use to maintain the stuff it does now, so I bet
that most of the code is just old code that still works.

Feel free to rip it out.

Linus

Alexander Viro

2001-05-04 15:19:50
To: Richard Gooch, Linus Torvalds, Rogier Wolff, Alan Cox, vol...@mindspring.com, Andrea Arcangeli, linux-...@vger.kernel.org

On Fri, 4 May 2001, Richard Gooch wrote:

> > Two of them: use less bloated shell (and link it statically) and
> > clean your rc scripts.
>
> No, because I'm not using the latest bloated version of bash, and I'm

Umm... The last version of bash I could call not bloated was a _long_ time
ago. Something like ash(1) might be a better idea for /bin/sh.

> The problem is all the various daemons and system utilities (mount,
> hwclock, ifconfig and so on) that turn a kernel into a useful system.
> And then of course there's X...

How do you partition the thing? I.e. what's the size of your root partition?
I'm usually doing something from 10Mb to 30Mb - that may be the reason for
the difference.

Richard Gooch

Unread,
May 4, 2001, 15:39:42
To Alexander Viro, Linus Torvalds, Rogier Wolff, Alan Cox, vol...@mindspring.com, Andrea Arcangeli, linux-...@vger.kernel.org
Alexander Viro writes:
>
>
> On Fri, 4 May 2001, Richard Gooch wrote:
>
> > > Two of them: use less bloated shell (and link it statically) and
> > > clean your rc scripts.
> >
> > No, because I'm not using the latest bloated version of bash, and I'm
>
> Umm... Last version of bash I could call not bloated was _long_ time
> ago. Something like ash(1) might be a better idea for /bin/sh.

The shell is irrelevant. I can easily preload that too, if I wanted
to, since it's just one thing. But it's not practical to preload all
files used by name, because it's just too hard to find out all that is
needed. Too much people time required, and it is specific to one
distribution (and a particular revision at that).

> > The problem is all the various daemons and system utilities (mount,
> > hwclock, ifconfig and so on) that turn a kernel into a useful system.
> > And then of course there's X...
>
> How do you partition the thing? I.e. what's the size of your root
> partition? I'm usually doing something from 10Mb to 30Mb - that may
> be the reason of differences.

I don't bother splitting /usr off /. I gave up doing that when disc
became cheap. There's no point anymore. And since I have a lightweight
distribution (500 MiB and I get X, LaTeX, emacs, compilers, netscrap
and a pile of other things), it makes even less sense to split /usr
off. Sorry, I don't have those fancy desktops. Don't need 'em. I spend
most of my day in emacs and xterm.

And even if I did split /usr off, that would just mean I'd want to
record block accesses for that device as well. This is because part of
my boot process requires stuff in /usr. And after that, firing up xdm.

Regards,

Richard....
Permanent: rgo...@atnf.csiro.au
Current: rgo...@ras.ucalgary.ca

Alexander Viro

Unread,
May 4, 2001, 16:02:46
To Richard Gooch, Linus Torvalds, Rogier Wolff, Alan Cox, vol...@mindspring.com, Andrea Arcangeli, linux-...@vger.kernel.org

On Fri, 4 May 2001, Richard Gooch wrote:

> I don't bother splitting /usr off /. I gave up doing that when disc
> became cheap. There's no point anymore. And since I have a lightweight

Yes, there is. Locality. Resistance to fs fuckups. Resistance to disk
fuckups. Easier to restore from tape. Different tunefs optimum (higher
inodes/blocks ratio, for one thing). Ability to keep /usr read-only.
Enough?

> distribution (500 MiB and I get X, LaTeX, emacs, compilers, netscrap
> and a pile of other things), it makes even less sense to split /usr
> off. Sorry, I don't have those fancy desktops. Don't need 'em. I spend
> most of my day in emacs and xterm.

What desktops? None of that crap on my boxen either. EMACS? What EMACS?
LaTeX is unfortunately needed (I prefer troff and AMSTeX on the TeX side).
Netrape? No chance in hell. bash <spit> is there, but I prefer to use
rc.

I don't see what it has to do with keeping root on a separate filesystem,
though - the reasons have nothing to do with bloat in /usr/bin.

Richard Gooch

Unread,
May 4, 2001, 16:07:20
To Alexander Viro, Linus Torvalds, Rogier Wolff, Alan Cox, vol...@mindspring.com, Andrea Arcangeli, linux-...@vger.kernel.org
Alexander Viro writes:
>
>
> On Fri, 4 May 2001, Richard Gooch wrote:
>
> > I don't bother splitting /usr off /. I gave up doing that when disc
> > became cheap. There's no point anymore. And since I have a lightweight
>
> Yes, there is. Locality. Resistance to fs fuckups. Resistance to
> disk fuckups. Easier to restore from tape. Different tunefs optimum
> (higher inodes/blocks ratio, for one thing). Ability to keep /usr
> read-only. Enough?

The correct solution to avoiding fs fuckups is to keep /tmp, /var and
/home separate. Basically, anything that gets written to for reasons
other than sysadmin/upgrades.

However, my point is not that it's always a bad idea to split /usr,
simply that the converse is not true. IOW, it is not true to say that
/usr *should* be split off. For a generic workstation, splitting /usr
is not useful. Importantly, it is most certainly entirely valid to
keep /usr on /.

> > distribution (500 MiB and I get X, LaTeX, emacs, compilers, netscrap
> > and a pile of other things), it makes even less sense to split /usr
> > off. Sorry, I don't have those fancy desktops. Don't need 'em. I spend
> > most of my day in emacs and xterm.
>
> What desktops? None of that crap on my boxen either. EMACS? What EMACS?
> LaTeX is unfortunately needed (I prefer troff and AMSTeX on the TeX side).
> Netrape? No chance in hell. bash <spit> is there, but I prefer to use
> rc.
>
> I don't see what does it have to keeping root on a separate
> filesystem, though - the reasons have nothing to bloat in /usr/bin.

In any case, my point is that splitting /usr wouldn't help, because
I'd want to preload stuff from there as well. Splitting /usr doesn't
address the problem.

Regards,

Richard....
Permanent: rgo...@atnf.csiro.au
Current: rgo...@ras.ucalgary.ca

Alan Cox

Unread,
May 4, 2001, 17:12:08
To Linus Torvalds, Rogier Wolff, Alan Cox, vol...@mindspring.com, Alexander Viro, Andrea Arcangeli, linux-...@vger.kernel.org
> Now, if you want to speed up accesses, there are things you can do. You
> can lay out the filesystem in the access order - trace the IO accesses at
> bootup ("which file, which offset, which metadata block?") and lay out the
> blocks of the files in exactly the right order. Then you will get linear
> reads _without_ doing any "dd" at all.

iso9660 alas doesn't allow you to do that. You can speed it up by reading
the entire file into memory rather than paging it in (or reading it in and
then executing it). iso9660 layout is pretty constrained and designed for
linear file reads

Linus Torvalds

Unread,
May 4, 2001, 17:52:53
To Alan Cox, Rogier Wolff, vol...@mindspring.com, Alexander Viro, Andrea Arcangeli, linux-...@vger.kernel.org

On Fri, 4 May 2001, Alan Cox wrote:
>
> iso9660 alas doesn't allow you to do that. You can speed it up by reading
> the entire file into memory rather than paging it in (or reading it in and
> then executing it). iso9660 layout is pretty constrained and designed for
> linear file reads

Note that this you can do for any filesystem, including ext2. If, instead
of trying to remember what _blocks_ the bootup process reads, you
keep the trace at a higher level, then sort the _high_level_ trace and
re-do it with some program, you can obviously populate the virtual
caches properly with any filesystem.

The advantage of that approach is that it will continue to work forever,
because there will never be any cache aliasing issues. You're always
"pre-caching" using the same operation that you'll actually use when you
do the real reads..

Now, that still leaves the question of how to sort the virtual cache
accesses, and you might want to know what the low-level layout of the
filesystem is to actually create the "sort". You might not want to sort
alphabetically on the file-name, but use a "where on the disk is this
file" function, and use _that_ as the sort order function.

That's easy to do, actually. Just use the "bmap()" ioctl.

Now, you won't be able to use "dd" to populate the caches: you'd have to
have your own program that walks your sorted action list and populates the
caches that way (and you might want to take kernel read-ahead etc
heuristics into account).

SMOP.

Linus

Albert D. Cahalan

Unread,
May 5, 2001, 07:24:10
To Alexander Viro, linux-...@vger.kernel.org
Alexander Viro writes:
>> On Fri, 4 May 2001, Alexander Viro wrote:

>>> Ehh... There _is_ a way to deal with that, but it's deeply Albertesque:
> ^^^^^^^^^^^^^^^^^^^^^^^^^^^

Ah, you learn from the master.

> ObProcfs: I don't think that walking the page tables is a good way to
> compute RSS, especially since VM maintains the thing. Mind if I rip

Handling of mapped device memory should not change. For example
there is the X server with mapped video memory. There is another
RSS value provided elsewhere in case one does not want to include
mapped device memory.

Currently top uses the statm file in the following manner:

case P_SIZE:
	sprintf(tmp, "%5.5s ", scale_k((task->size << CL_pg_shift), 5, 1));
	break;
case P_TRS:
	sprintf(tmp, "%4.4s ", scale_k((task->trs << CL_pg_shift), 4, 1));
	break;
case P_SWAP:
	sprintf(tmp, "%4.4s ",
		scale_k(((task->size - task->resident) << CL_pg_shift), 4, 1));
	break;
case P_SHARE:
	sprintf(tmp, "%5.5s ", scale_k((task->share << CL_pg_shift), 5, 1));
	break;
case P_DT:
	sprintf(tmp, "%3.3s ", scale_k(task->dt, 3, 0));
	break;
case P_RSS: /* rss, not resident (which includes IO memory) */
	sprintf(tmp, "%4.4s ",
		scale_k((task->rss << CL_pg_shift), 4, 1));


> it out? In effect, implementation of /proc/<pid>/statm
> * produces extremely bogus values (VMA is from library if it goes
> beyond 0x60000000? Might be even true 7 years ago...) and nobody
> had cared about them for 6-7 years

One could count pages that are mapped executable and do not come
from the main executable... but this is pretty worthless and does
not consider non-executable library sections.

The latest "top" does not bother to display this value.

> * makes stuff like top(1) _walk_ _whole_ _page_ _tables_ _of_ _all_
> _processes_ each 5 seconds. No wonder it's slow like hell and eats
> tons of CPU time.

On my system, "statm" takes 50% longer than "stat" or "status".
Maybe there is a significant difference with Oracle on a 32 GB box?

I'd rather top didn't have to read the file at all.

Rogier Wolff

Unread,
May 5, 2001, 09:55:33
To Richard Gooch, Linus Torvalds, Rogier Wolff, Alan Cox, vol...@mindspring.com, Alexander Viro, Andrea Arcangeli, linux-...@vger.kernel.org
Richard Gooch wrote:
>
> - next boot, init(8) checks the file, and if it exists, opens the root
> FS block device and reads in each block listed in the file. The
> effect is to warm the buffer cache extremely quickly. The head will
> move in one direction, grabbing data as it flies by. I expect this
> will take around 1 second

FYI:

Around 1992 or 1993, I rewrote Minix-fsck to do this instead of
seeking all over the place.

Cut the total time to fsck my filesystem from around 30 to 28
seconds. (remember the days of small filesystems?)

That's when I decided that this was NOT an interesting project: there
was very little to be gained.

The explanation is: A seek over a few tracks isn't much slower than a
seek over hundreds of tracks. Almost any "skip" in linear access
incurs the average 6ms rotational latency anyway.

Roger.
--
** R.E....@BitWizard.nl ** http://www.BitWizard.nl/ ** +31-15-2137555 **
*-- BitWizard writes Linux device drivers for any device you may have! --*
* There are old pilots, and there are bold pilots.
* There are also old, bald pilots.

Alexander Viro

Unread,
May 5, 2001, 13:14:29
To Albert D. Cahalan, linux-...@vger.kernel.org

On Sat, 5 May 2001, Albert D. Cahalan wrote:

> case P_SWAP:
> sprintf(tmp, "%4.4s ",
> scale_k(((task->size - task->resident) << CL_pg_shift), 4, 1));
> break;

Albert, you can't be serious. The system has had demand-loading for almost
ten years. ->size - ->resident can be huge with no swap at all. As in,
"box had never been subjected to swapon(8)".

That value is a mix of amount of stuff we hadn't paged in,
amount of stuff we had paged in but then dropped (e.g. code that
had never been touched for two weeks, since application only uses
it on startup) and amount of stuff that had been swapped out _and_
wasn't swapped in (it may very well stay in swap).

BTW, "shared" is also bogus - page_count(page) can be raised
by any number of things.

> > * makes stuff like top(1) _walk_ _whole_ _page_ _tables_ _of_ _all_
> > _processes_ each 5 seconds. No wonder it's slow like hell and eats
> > tons of CPU time.
>
> On my system, "statm" takes 50% longer than "stat" or "status".
> Maybe there is a significant difference with Oracle on a 32 GB box?

Depends on the application mix.

Richard Gooch

Unread,
May 5, 2001, 14:16:23
To Rogier Wolff, Linus Torvalds, Alan Cox, vol...@mindspring.com, Alexander Viro, Andrea Arcangeli, linux-...@vger.kernel.org
Rogier Wolff writes:
> Richard Gooch wrote:
> >
> > - next boot, init(8) checks the file, and if it exists, opens the root
> > FS block device and reads in each block listed in the file. The
> > effect is to warm the buffer cache extremely quickly. The head will
> > move in one direction, grabbing data as it flies by. I expect this
> > will take around 1 second
>
> FYI:
>
> Around 1992 or 1993, I rewrote Minix-fsck to do this instead of
> seeking all over the place.
>
> Cut the total time to fsck my filesystem from around 30 to 28
> seconds. (remember the days of small filesystems?)
>
> That's when I decided that this was NOT an interesting project: there
> was very little to be gained.
>
> The explanation is: A seek over a few tracks isn't much slower than a
> seek over hundreds of tracks. Almost any "skip" in linear access
> incurs the average 6ms rotational latency anyway.

Hm. I think the access patterns between boot-up and fsck are quite
different. An fsck has to seek to a large number of tracks. During
bootup, I think the number of tracks accessed is much lower, and there
is probably more data locality as well. Still, only one way to be
sure.

I haven't had time to look closely at this, but one thing that bothers
me is how to find out what is being accessed in the first place. A
C-library wrapper to intercept read(2) calls isn't any good, because a
lot of stuff is memory-mapped (in particular shared libraries). Anyone
have a clean way to do this?

Regards,

Richard....
Permanent: rgo...@atnf.csiro.au
Current: rgo...@ras.ucalgary.ca

Andrea Arcangeli

Unread,
May 5, 2001, 20:42:41
To Chris Wedgwood, Jens Axboe, Rogier Wolff, Linus Torvalds, Alan Cox, vol...@mindspring.com, Alexander Viro, linux-...@vger.kernel.org
On Sat, May 05, 2001 at 03:18:08PM +1200, Chris Wedgwood wrote:

> On Fri, May 04, 2001 at 05:29:40PM +0200, Andrea Arcangeli wrote:
>
> once block_dev is in pagecache there will obviously be no way to
> share cache between the block device and the filesystem, because
> all the caches will be in completely different address spaces.
>
> Once we are at this point... will there be any use in having block
> devices? FreeBSD appears to have done without them completely about a

moving block_dev in pagecache won't change anything from the userspace point
of view, it's a transparent change (if we ignore the total loss of
cache coherency between block_dev and fs metadata that it implies, but
as Linus said such loss of coherency will happen anyway eventually
because metadata will go into its own address space too). Basically there
will still be a use for block devices as long as there are fsck and
other userspace applications that want to use them.

Andrea SYNAPSE (very amusing movie ;)

Andrea Arcangeli

Unread,
May 5, 2001, 22:52:18
To Chris Wedgwood, Jens Axboe, Rogier Wolff, Linus Torvalds, Alan Cox, vol...@mindspring.com, Alexander Viro, linux-...@vger.kernel.org
On Sun, May 06, 2001 at 02:14:37PM +1200, Chris Wedgwood wrote:
> You don't need block device for fsck, in fact some OS require you use
> character devices (e.g. Solaris).

Moving e2fsck into the kernel is a completely different matter from
caching the blockdevice accesses with pagecache instead of buffercache.

And even if you move e2fsck or reiserfsck into the kernel (you could
technically do that just now regardless of where the block_dev cache
lives) there will still be parted, which wants to mmap the blockdevice to
get rid of part of a fat32 partition (right now it uses read/write of
course because buffer cache cannot be mapped in userspace); there will
still be mtools, non-self-caching dbms, od </dev/hda, dd of=/dev/hda
etc..etc..etc.. that make block_dev still *very* useful.

> I'm not saying we don't need block devices, but I really don't see
> much of a use for them once everything in in the page cache... I
> assume this is why others have got rid of them completely.

I have no idea why/if others got rid of them completely, but the fact that
block_dev is useful has nothing to do with whether it's in pagecache or in
buffercache, really. It's just that by doing it in pagecache you can mmap
it as well, and it will provide overall better performance and it's
probably a cleaner design. The only visible change is that you will be
able to mmap a blockdevice as well.

About a kernel based fsck: Alexander told me he likes it. I personally
don't care about it that much because I believe there's not that much to
share at the source level; fsck and a real fs are quite different
problems, and what can be shared can be copied, and by not sharing we get
the flexibility of not breaking fsck every time we change the kernel and,
more generally, the flexibility of doing it in userspace; sharing such
bytecode at runtime definitely doesn't matter. It also partly depends
on the fs, but the current ext2 situation is really fine to me and I
wouldn't consider moving e2fsck into the kernel a worthwhile project.

Andrea

Alexander Viro

Unread,
May 5, 2001, 23:18:24
To Chris Wedgwood, Andrea Arcangeli, Jens Axboe, Rogier Wolff, Linus Torvalds, Alan Cox, vol...@mindspring.com, linux-...@vger.kernel.org

On Sun, 6 May 2001, Chris Wedgwood wrote:

> On Sun, May 06, 2001 at 04:50:01AM +0200, Andrea Arcangeli wrote:

> About a kernel based fsck Alexander told me he likes it, I

> personally don't care about it that much because I believe...
>
> As I said, I'm not talking about kernel based fsck, although for
> _VERY_ large filesystems even with journalling I suspect it will be
> required one day (so it can run in the background and do consistency
> checking when the machine is idle).

It's not exactly "kernel-based fsck". What I've been talking about is
secondary filesystem providing coherent access to primary fs metadata.
I.e. mount -t ext2meta -o master=/usr none /mnt and then access through
/mnt/super, /mnt/block_bitmap, etc.

Andreas Dilger

Unread,
May 5, 2001, 23:42:55
To Chris Wedgwood, Andrea Arcangeli, Jens Axboe, Rogier Wolff, Linus Torvalds, Alan Cox, vol...@mindspring.com, Alexander Viro, linux-...@vger.kernel.org
Chris Wedgewood writes:
> As I said, I'm not talking about kernel based fsck, although for
> _VERY_ large filesystems even with journalling I suspect it will be
> required one day (so it can run in the background and do consistency
> checking when the machine is idle).

Actually, I was talking with Ted about this, and we agreed that:
a) kernel-based e2fsck is a pain in the a** (locking issues, etc)
b) you can do an LVM snapshot of your live filesystem and do a read-only
fsck on that to check if the filesystem is still OK. For journaled
filesystems like reiserfs and ext3, they need to use the super method
write_super_lockfs() to block I/O and flush everything to disk at the
time of the snapshot, to ensure that they don't need recovery on a
read-only device. This makes the LVM snapshot equivalent to unmounting
the filesystem, copying the contents to a new device, and remounting it.

While (b) doesn't let you fix a filesystem online, you should not need to
unless there is a kernel bug or hardware problem. If you have either
of those, then fixing the filesystem online is just asking for more problems
in the future.
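A minimal command sketch of (b); the volume group "vg0" and volume "home"
are illustrative names, and this needs root plus an LVM setup with free
extents for the snapshot:

```shell
# Snapshot the live volume; the fs driver's write_super_lockfs() hook is
# what makes the snapshot clean, i.e. recovery-free:
lvcreate --snapshot --size 256M --name home-snap /dev/vg0/home

# Read-only check of the frozen image; -n answers "no" to every question,
# -f forces the check even though the fs looks clean:
e2fsck -fn /dev/vg0/home-snap

# Throw the snapshot away afterwards:
lvremove -f /dev/vg0/home-snap
```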

Cheers, Andreas
--
Andreas Dilger \ "If a man ate a pound of pasta and a pound of antipasto,
\ would they cancel out, leaving him still hungry?"
http://www-mddsp.enel.ucalgary.ca/People/adilger/ -- Dogbert

Andrea Arcangeli

Unread,
May 5, 2001, 23:54:06
To Chris Wedgwood, Jens Axboe, Rogier Wolff, Linus Torvalds, Alan Cox, vol...@mindspring.com, Alexander Viro, linux-...@vger.kernel.org
On Sun, May 06, 2001 at 03:00:58PM +1200, Chris Wedgwood wrote:
> On Sun, May 06, 2001 at 04:50:01AM +0200, Andrea Arcangeli wrote:
>
> Moving e2fsck into the kernel is a completly different matter
> than caching the blockdevice accesses with pagecache instead of
> buffercache.
>
> No, I was talking about user space fsck using character devices.

I misread your previous email, sorry - I think you meant to fsck using
rawio (not to move fsck into the kernel). You can do that just now, but
to get decent performance fsck would have to do its own caching, and
changing fsck to do self caching doesn't sound worthwhile either. Note
also that rawio has nothing to do with the pagecache. In fact both rawio
and O_DIRECT bypass all the pagecache and its smp locks for example.

> I'm not claiming it is... what I'm asking is _why_ do we need block
> devices once 'everything' lives in the page cache?

Where the cache of the blockdevice lives is a completely orthogonal
problem to "why cached blockdevices are useful", which I addressed in
the previous email.

> It's just that by doing it in pagecache you can mmap it as well
> and it will provide overall better performance and it's probably
> cleaner design. The only visible change is that you will be able
> to mmap a blockdevice as well.
>

> Why? What needs to mmap a block device? Since these are typically
> larger than that you can mmap into a 32-bit address space (yes, I'm
> ignoring the 5% or so of cases where this isn't true) I'm not aware
> on many applications that do it.

Last time I talked with the parted maintainer he was asking for that
feature so that parted won't need to do its own anti-oom management in
userspace: he can simply mmap(MAP_SHARED) a quite large region of
metadata of the blockdevice, read/write to the mmapped region, and the
kernel will take care of doing paging when it runs low on memory. Right
now it allocates the metadata in anonymous memory and loads it via
read(). This memory will need to be swapped out if the working set
doesn't fit in ram (and swap may not be available ;).

> As I said, I'm not takling about kernel based fsck, although for
> _VERY_ large filesystems even with journalling I suspect it will be
> required one day (so it can run in the background and do consistency
> checking when the machine is idle).

Being able to fsck a live filesystem is yet another exotic feature and
yes for that you would certainly need some additional kernel support.

Andrea

Alexander Viro

Unread,
May 6, 2001, 00:01:36
To Chris Wedgwood, Andrea Arcangeli, Jens Axboe, Rogier Wolff, Linus Torvalds, Alan Cox, vol...@mindspring.com, linux-...@vger.kernel.org

On Sun, 6 May 2001, Chris Wedgwood wrote:

> It's not exactly "kernel-based fsck". What I've been talking about
> is secondary filesystem providing coherent access to primary fs
> metadata. I.e. mount -t ext2meta -o master=/usr none /mnt and
> then access through /mnt/super, /mnt/block_bitmap, etc.
>

> Call me stupid --- but what exactly does the above actually achieve?
> Why would you do this?

Coherent access to metadata? Well, for one thing, it allows stuff like
tunefs and friends on a mounted fs. What's more useful, it allows things
like access to boot code, which is _not_ safe to do through
device access - usually you have the superblock in the vicinity and no
guarantees about the things that will be overwritten on umount. Same for
debugging stuff, IO stats, etc. - access through a secondary tree is much
saner than inventing tons of ioctls for dealing with that. Moreover, it
allows fsck and friends to get rid of code duplication - while the repair
logic, etc. stays in userland (where it belongs), layout information
is already handled in the kernel. No need to duplicate it in userland...

Besides, with moving bitmaps, etc. into pagecache it becomes trivial
to implement.

BTW, we have another ugly chunk of code - duplicated between kernel
and userland and nasty in both incarnations. I mean handling of the
partition tables. Kernel should be able to read and parse them -
otherwise they are useless, right? Now, we have a bunch of userland
utilities that do the same. Various fdisks, that is. If you look at
how they work you'll see that on the read side they duplicate
kernel code and on the write side... To put it quite mildly, they are
not doing it in a graceful way. They write the relevant sectors to disk and
use BLKRRPART to tell the kernel that it should forget about all partitions
on that disk and reread the partition tables. _Not_ a nice thing to do,
since creation of a new partition out of unused space on /dev/sda becomes
an interesting task when your root lives on /dev/sda1. Ditto for destroying
a single partition (not mounted/used by swap/etc.) while you have some
other partition in use. IWBNI we had a decent API for handling partition
tables...

Alan Cox

Unread,
May 6, 2001, 08:48:11
To Alexander Viro, Chris Wedgwood, Andrea Arcangeli, Jens Axboe, Rogier Wolff, Linus Torvalds, Alan Cox, vol...@mindspring.com, linux-...@vger.kernel.org
> an interesting task when your root lives on /dev/sda1. Ditto for destroying
> a single partition (not mounted/used by swap/etc.) while you have some
> other partition in use. IWBNI we had a decent API for handling partition
> tables...

Partitions are just very crude logical volumes, and ultimately I believe
should be handled exactly that way.

Andreas Dilger

Unread,
May 6, 2001, 15:51:01
To Alan Cox, Alexander Viro, Chris Wedgwood, Andrea Arcangeli, Jens Axboe, Rogier Wolff, Linus Torvalds, vol...@mindspring.com, linux-...@vger.kernel.org
Alan writes:
> > an interesting task when your root lives on /dev/sda1. Ditto for destroying
> > a single partition (not mounted/used by swap/etc.) while you have some
> > other partition in use. IWBNI we had a decent API for handling partition
> > tables...
>
> Partitions are just very crude logical volumes, and ultimately I believe
> should be handled exactly that way

Actually, the EVMS project does exactly this. All I/O is done on a
full-disk basis, and EVMS essentially does block remapping for each
partition. This also solves the problem of cache inconsistency when
accessing the parent device vs. accessing the partition.

Cheers, Andreas
--
Andreas Dilger \ "If a man ate a pound of pasta and a pound of antipasto,
\ would they cancel out, leaving him still hungry?"
http://www-mddsp.enel.ucalgary.ca/People/adilger/ -- Dogbert

Alan Cox

Unread,
May 6, 2001, 16:59:50
To Andreas Dilger, Alan Cox, Alexander Viro, Chris Wedgwood, Andrea Arcangeli, Jens Axboe, Rogier Wolff, Linus Torvalds, vol...@mindspring.com, linux-...@vger.kernel.org
> Actually, the EVMS project does exactly this. All I/O is done on a full
> disk basis, and essentially does block remapping for each partition. This
> also solves the problem of cache inconsistency if accessing the parent
> device vs. accessing the partition.

Interesting. Can EVMS handle the partition labels used by the LVM layer -
i.e. could it replace it as well?

Andreas Dilger

Unread,
May 7, 2001, 00:16:57
To Alan Cox, Andreas Dilger, Alexander Viro, Chris Wedgwood, Andrea Arcangeli, Jens Axboe, Rogier Wolff, Linus Torvalds, vol...@mindspring.com, linux-...@vger.kernel.org
Alan writes:
> > Actually, the EVMS project does exactly this. All I/O is done on a full
> > disk basis, and essentially does block remapping for each partition. This
> > also solves the problem of cache inconsistency if accessing the parent
> > device vs. accessing the partition.
>
> Interesting. Can EVMS handle the partition labels used by the LVM layer - ie
> could it replace it as well ?

Yes, they already support all current LVM volumes (including snapshots).
However, the user-space tools to set up new LVM volumes and manage existing
ones are not ready yet. The last I talked with the IBM folks (a week ago),
they said they were starting to work on the user-space tools.

Because the whole partition/volume code is modular in EVMS, they will be able
to handle AIX LVM, HP/UX LVM, etc. volumes in addition to the normal DOS or
other partitions.

Cheers, Andreas
--
Andreas Dilger \ "If a man ate a pound of pasta and a pound of antipasto,
\ would they cancel out, leaving him still hungry?"
http://www-mddsp.enel.ucalgary.ca/People/adilger/ -- Dogbert

vol...@mindspring.com

Unread,
May 8, 2001, 09:48:53
To Richard Gooch, Linus Torvalds, Rogier Wolff, Alan Cox, Alexander Viro, Andrea Arcangeli, linux-...@vger.kernel.org

I have tried this approach too, a couple of years ago. I came to the
conclusion that I want some kind of "event reporting" mechanism to know
when an application faults and when other events (like I/O) occur. Booting
is just the tip of the iceberg. MOST big apps are seeking on startup because
a) their code is spread out all over the executable
b) don't forget shared libraries..
c) the practice of keeping configuration files in ~/.filename
implies - read a little, seek a little.
d) GUI apps tend to have a ton of icons.

I wonder - is it possible to get this via ptrace? I could not find this
in the manpage.
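One answer is that strace is exactly a ptrace-based tracer, so something
like the following captures the startup I/O without any library wrappers
(the application name and log path are just examples):

```shell
# Log every open/read/mmap the app and its children perform on startup;
# -f follows forks, -o writes the trace to a file instead of stderr:
strace -f -e trace=open,read,mmap,old_mmap -o /tmp/app-io.log netscape

# The log then shows exactly which files (and, via read offsets,
# which parts of them) were touched:
grep '^open' /tmp/app-io.log
```

This catches memory-mapped files too, since the mmap() calls themselves
are logged, though not the individual page faults within a mapping.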

Vladimir Dergachev

On Fri, 4 May 2001, Richard Gooch wrote:

> Linus Torvalds writes:
> > Now, if you want to speed up accesses, there are things you can
> > do. You can lay out the filesystem in the access order - trace the
> > IO accesses at bootup ("which file, which offset, which metadata
> > block?") and lay out the blocks of the files in exactly the right
> > order. Then you will get linear reads _without_ doing any "dd" at
> > all.
>

> A year ago I came up with an alternative approach for cache warming,
> but I see that it wouldn't work with our current infrastructure.
> However, maybe there is still a way to use the basic technique. If so,
> please make suggestions.


>
> The idea I had (motivated by the desire to eliminate random disc
> seeks, which is the limiting factor in how fast my boxes boot) was:
>
> - init(8) issues an ioctl(2) on the root FS block device which turns
> on recording of block reads (it records block numbers)
>
> - at the end of the bootup process, init(8) issues another ioctl(2) to
> grab the buffered block numbers, and turn off recording
>
> - init(8) then sorts this list in ascending order and saves the result
> in a file
>

> - next boot, init(8) checks the file, and if it exists, opens the root
> FS block device and reads in each block listed in the file. The
> effect is to warm the buffer cache extremely quickly. The head will
> > move in one direction, grabbing data as it flies by. I expect this
> will take around 1 second
>

> - init(8) now continues the boot process (starting the magic ioctl(2)
> again so as to get a fresh list of blocks, in case something has
> changed)
>
> - booting is now super fast, thanks to no disc activity.
>

> The advantage of this scheme over blindly reading the first 50 MB is
> that it only reads in what you *need*, and thus will work better on
> low memory systems. It's also useful for other applications, not just
> speeding up the boot process.


>
> However, doing an ioctl(2) on the block device won't help. So the
> question is, where to add the hook? One possibility is the FS, and
> record inum,bnum pairs. But of course we don't have a way of accessing
> via inum in user-space. So that's no good. Besides, we want to get
> block numbers on the block device, because that's the only meaningful
> number to resort.
>
> So, what, then? Some kind of hook on the page cache? Ideas?
>

> Regards,
>
> Richard....
> Permanent: rgo...@atnf.csiro.au
> Current: rgo...@ras.ucalgary.ca

Helge Hafting

Unread,
May 9, 2001, 06:32:44
To vol...@mindspring.com, linux-...@vger.kernel.org
vol...@mindspring.com wrote:
>
> I have tried this approach too a couple of years ago. I came to the idea
> that I want some kind of "event reporting" mechanism to know when
> application faults and when other events (like I/O) occurs. Booting is
> just the tip of the iceberg. MOST big apps are seeking on startup because
> a) their code is spread out all over executable
Page tuning can fix that. (Have the compiler & linker increase locality
by stuffing related code in the same page. You want fast paths
stuffed into as few pages as possible, regardless of which functions
the instructions belong to.) This also cuts down on swapping and TLB
misses.
OS/2 gained some nice speedups by doing this.

> b) don't forget shared libraries..

They can be page tuned too, and they're often partially in
memory already when starting apps.

> c) the practice of keeping configuration files in ~/.filename
> implies - read a little, seek a little.
> d) GUI apps tend to have a ton of icons.

Putting several icons in a single file, or even in the executable
itself, will help here.

Helge Hafting

Pavel Machek

unread,
May 11, 2001, 03:10:32
To Alexander Viro, Chris Wedgwood, Andrea Arcangeli, Jens Axboe, Rogier Wolff, Linus Torvalds, Alan Cox, vol...@mindspring.com, linux-...@vger.kernel.org
Hi!

> > It's not exactly "kernel-based fsck". What I've been talking about
> > is secondary filesystem providing coherent access to primary fs
> > metadata. I.e. mount -t ext2meta -o master=/usr none /mnt and
> > then access through /mnt/super, /mnt/block_bitmap, etc.
> >
> > Call me stupid --- but what exactly does the above actually achieve?
> > Why would you do this?
>
> Coherent access to metadata? Well, for one thing, it allows stuff like
> tunefs and friends on mounted fs. What's more useful, it allows one to
> do things like access to boot code, which is _not_ safe to do through
> device access - usually you have the superblock in the vicinity and no
> guarantees about the things that will be overwritten on umount. Same for
> debugging stuff, IO stats, etc. - access through a secondary tree is much
> saner than inventing tons of ioctls for dealing with that. Moreover, it
> allows fsck and friends to get rid of code duplication - while the repair
> logic, etc. stays in userland (where it belongs), layout information
> is already handled in the kernel. No need to duplicate it in userland...

OTOH, with the current way, if you make a mistake in the kernel, fsck
will not automatically inherit it; therefore fsck is likely to work even
if kernel ext2 is b0rken [and that's fairly important]

> Besides, with moving bitmaps, etc. into pagecache it becomes trivial
> to implement.
>
> BTW, we have another ugly chunk of code - duplicated between kernel
> and userland and nasty in both incarnations. I mean handling of the
> partition tables. Kernel should be able to read and parse them -
> otherwise they are useless, right? Now, we have a bunch of userland

No. You might want to see (via fdisk) the partition table, even though
*your* kernel cannot read it.
Pavel
--
Philips Velo 1: 1"x4"x8", 300gram, 60, 12MB, 40bogomips, linux, mutt,
details at http://atrey.karlin.mff.cuni.cz/~pavel/velo/index.html.

Alexander Viro

unread,
May 11, 2001, 16:07:52
To Pavel Machek, Chris Wedgwood, Andrea Arcangeli, Jens Axboe, Rogier Wolff, Linus Torvalds, Alan Cox, vol...@mindspring.com, linux-...@vger.kernel.org

On Mon, 7 May 2001, Pavel Machek wrote:
> OTOH with current way if you make mistake in kernel, fsck will not
> automatically inherit it; therefore fsck is likely to work even if
> kernel ext2 is b0rken [and that's fairly important]

... and by the same logic you should make fsck implement its own
drivers - after all, right now a b0rken driver affects both the kernel
ext2 and fsck ;-)

I'm not sure that fsck of fs mounted read/write is worth doing in the
first place, but I'd rather do that via fs/ext2 exporting its metadata
explicitly than by playing silly buggers with device/fs coherency.

Stephen C. Tweedie

unread,
May 18, 2001, 10:48:49
To Daniel Phillips, Pavel Machek, Alexander Viro, Chris Wedgwood, Andrea Arcangeli, Jens Axboe, Rogier Wolff, Linus Torvalds, Alan Cox, vol...@mindspring.com, linux-...@vger.kernel.org, Stephen Tweedie
Hi,

On Fri, May 11, 2001 at 04:54:44PM +0200, Daniel Phillips wrote:

> The only reasonable way I can think of getting a block-coherent view
> underneath a mounted fs is to have a reverse map, and update it each
> time we map block into the page cache or unmap it.

It's called the "buffer cache", and Ingo's early page-cache code in
2.3 actually did install page-cache backing buffers into the buffer
cache as aliases, mainly for debugging purposes.

Even without that, though, an application can achieve almost-coherency
via invalidation of the buffer cache before reading it. And yes, this
won't necessarily remain coherent over the lifetime of the application
process, but then unless the filesystem is 100% quiescent then you
don't get that on 2.2 either.

Which is rather the point. If the filesystem is active, then
coherency cannot be obtained at the block-device level in any case
without knowledge of the fs transaction activity. If the filesystem
is quiescent, then you can sync it and flush the buffer cache and you
already get the coherency that you need.

Cheers,
Stephen
