
XFS crashed twice, once in 2.6.16.20, next in 2.6.17, reproducible

Avuton Olrich

Jun 19, 2006, 3:45:29 AM
to Linux Kernel Mailing List, linu...@oss.sgi.com
This didn't make it to the linux-xfs list (it's not working), so I'll try
sending it to both the LKML and linux-xfs again.

Hello, when trying to recursively delete a directory (same directory
twice) from my 500gb hard drive I get a problem. It crashed first in
2.6.16.20, then I upgraded to try to get rid of the issue. This one is
from 2.6.17:

xfs_da_do_buf: bno 16777216
dir: inode 1507133580
Filesystem "sda1": XFS internal error xfs_da_do_buf(1) at line 2119 of
file /usr/src/linux-stable-cold/fs/xfs/xfs_da_btree.c. Caller 0xb01d9b63
<b01d9720> xfs_da_do_buf+0x40e/0x7c7 <b01d9b63> xfs_da_read_buf+0x30/0x35
<b01e43d5> xfs_dir2_leafn_lookup_int+0x2f3/0x453 <b01d9b63>
xfs_da_read_buf+0x30/0x35
<b01e2ba5> xfs_dir2_node_removename+0x288/0x47f <b01e2ba5>
xfs_dir2_node_removename+0x288/0x47f
<b01ddbd3> xfs_dir2_removename+0xce/0xd5 <b020ff5d> kmem_zone_alloc+0x4d/0x98
<b020d0ef> xfs_remove+0x2ac/0x444 <b0215e7f> xfs_vn_unlink+0x17/0x3b
<b020a32b> xfs_lookup+0x6e/0x78 <b011e734> __capable+0xc/0x1f
<b0155827> generic_permission+0x93/0xcc <b01558f8> permission+0x98/0xa4
<b0155da0> may_delete+0x32/0xe9 <b0156243> vfs_unlink+0x6d/0xa3
<b0157c7a> do_unlinkat+0x92/0x125 <b0159a0d> sys_getdents64+0x9c/0xa6
<b0102b67> sysenter_past_esp+0x54/0x75
Filesystem "sda1": XFS internal error xfs_trans_cancel at line 1150 of
file /usr/src/linux-stable-cold/fs/xfs/xfs_trans.c. Caller 0xb020d25e
<b0204b48> xfs_trans_cancel+0x59/0xe5 <b020d25e> xfs_remove+0x41b/0x444
<b020d25e> xfs_remove+0x41b/0x444 <b0215e7f> xfs_vn_unlink+0x17/0x3b
<b020a32b> xfs_lookup+0x6e/0x78 <b011e734> __capable+0xc/0x1f
<b0155827> generic_permission+0x93/0xcc <b01558f8> permission+0x98/0xa4
<b0155da0> may_delete+0x32/0xe9 <b0156243> vfs_unlink+0x6d/0xa3
<b0157c7a> do_unlinkat+0x92/0x125 <b0159a0d> sys_getdents64+0x9c/0xa6
<b0102b67> sysenter_past_esp+0x54/0x75
xfs_force_shutdown(sda1,0x8) called from line 1151 of file
/usr/src/linux-stable-cold/fs/xfs/xfs_trans.c. Return address =
0xb0218b68
Filesystem "sda1": Corruption of in-memory data detected. Shutting
down filesystem: sda1

While trying to xfs_repair I get the following:
fatal error -- can't read block 16777216 for directory inode 1507133580

Badblocks has been run on this machine and it was successful.

I did find an old thread with this, but no solution:
http://oss.sgi.com/archives/xfs/2005-02/msg00067.html

config:
http://olricha.homelinux.net:8080/config.gz

Thanks for any help. If I can help at all please let me know.
--
avuton
--
Anyone who quotes me in their sig is an idiot. -- Rusty Russell.
-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majo...@vger.kernel.org
More majordomo info at http://vger.kernel.org/majordomo-info.html
Please read the FAQ at http://www.tux.org/lkml/

daniel+deve...@flexserv.de

Jun 19, 2006, 6:36:05 AM
to Avuton Olrich, Linux Kernel Mailing List, linu...@oss.sgi.com
"Avuton Olrich" <avu...@gmail.com> writes:

The same here.
After a complete mkfs.xfs under 2.6.17-rc6 it was solved.

The same happens if I boot 2.6.8, run mkfs.xfs, and then boot into 2.6.16: the
XFS filesystem gets "shredded". Booting directly from .8 into .17-rc6 works. So
I think there was a bug in .16 in the XFS transition which got solved somewhere
in the .17-rc? time.

> Filesystem "sda1": Corruption of in-memory data detected. Shutting
> down filesystem: sda1

Daniel

Nathan Scott

Jun 20, 2006, 2:11:02 AM
to Avuton Olrich, linux-...@vger.kernel.org, x...@oss.sgi.com
On Mon, Jun 19, 2006 at 12:44:58AM -0700, Avuton Olrich wrote:
> ..

> Hello, when trying to recursively delete a directory (same directory
> twice) from my 500gb hard drive I get a problem. It crashed first in
> 2.6.16.20, then I upgraded to try to get rid of the issue. This one is
> from 2.6.17:

How reproducible is it? Is it reproducible even after xfs_repair?

If so, can you try Mandy's patch below, to see if it is addressing
the root cause of your problem? If problems persist, a reproducible
test case would be wonderful, if one can be found..

cheers.

--
Nathan


Fix nused counter. It's currently getting set to -1 rather than getting
decremented by 1. Since nused never reaches 0, the "if (!free->hdr.nused)"
check in xfs_dir2_leafn_remove() fails every time and xfs_dir2_shrink_inode()
doesn't get called when it should. This causes extra blocks to be left on
an empty directory and the directory is unable to be converted back to
inline extent mode.

Signed-off-by: Mandy Kirkconnell <alki...@sgi.com>
Signed-off-by: Nathan Scott <nat...@sgi.com>

--- a/fs/xfs/xfs_dir2_node.c 2006-06-20 16:00:45.000000000 +1000
+++ b/fs/xfs/xfs_dir2_node.c 2006-06-20 16:00:45.000000000 +1000
@@ -972,7 +972,7 @@ xfs_dir2_leafn_remove(
 			/*
 			 * One less used entry in the free table.
 			 */
-			free->hdr.nused = cpu_to_be32(-1);
+			be32_add(&free->hdr.nused, -1);
 			xfs_dir2_free_log_header(tp, fbp);
 			/*
 			 * If this was the last entry in the table, we can

Avuton Olrich

Jun 20, 2006, 2:40:24 AM
to Nathan Scott, linux-...@vger.kernel.org, x...@oss.sgi.com
On 6/19/06, Nathan Scott <nat...@sgi.com> wrote:
> On Mon, Jun 19, 2006 at 12:44:58AM -0700, Avuton Olrich wrote:
> > ..
> > Hello, when trying to recursively delete a directory (same directory
> > twice) from my 500gb hard drive I get a problem. It crashed first in
> > 2.6.16.20, then I upgraded to try to get rid of the issue. This one is
> > from 2.6.17:
>
> How reproducible is it? Is it reproducible even after xfs_repair?

Happens every time I try to remove that inode (directory). xfs_repair
ends with a fatal error:

Phase 6 - check inode connectivity...
- resetting contents of realtime bitmap and summary inodes
- ensuring existence of lost+found directory
- traversing filesystem starting at / ...
rebuilding directory inode 128

fatal error -- can't read block 16777216 for directory inode 1507133580

> If so, can you try Mandy's patch below, to see if it is addressing
> the root cause of your problem? If problems persist, a reproducible
> test case would be wonderful, if one can be found..

I'm sorry, the patch doesn't change anything. It never makes it through
the xfs_repair due to the above error. If there's any information I
can get for you please let me know.

I'm not sure if it changes anything, but here's the message after the patch:


xfs_da_do_buf: bno 16777216
dir: inode 1507133580
Filesystem "sda1": XFS internal error xfs_da_do_buf(1) at line 2119 of
file /usr/src/linux-stable-cold/fs/xfs/xfs_da_btree.c. Caller 0xb01d9b63
<b01d9720> xfs_da_do_buf+0x40e/0x7c7 <b01d9b63> xfs_da_read_buf+0x30/0x35
<b01e43d9> xfs_dir2_leafn_lookup_int+0x2f3/0x453 <b01d9b63>
xfs_da_read_buf+0x30/0x35
<b01e2ba5> xfs_dir2_node_removename+0x288/0x483 <b01e2ba5>
xfs_dir2_node_removename+0x288/0x483
<b01ddbd3> xfs_dir2_removename+0xce/0xd5 <b020ff61> kmem_zone_alloc+0x4d/0x98
<b020d0f3> xfs_remove+0x2ac/0x444 <b0215e83> xfs_vn_unlink+0x17/0x3b
<b016190c> mntput_no_expire+0x11/0x7e <b01575f1> link_path_walk+0xaf/0xb9
<b011e734> __capable+0xc/0x1f <b0155827> generic_permission+0x93/0xcc
<b01558f8> permission+0x98/0xa4 <b0155da0> may_delete+0x32/0xe9
<b0156243> vfs_unlink+0x6d/0xa3 <b0157c7a> do_unlinkat+0x92/0x125
<b0159a0d> sys_getdents64+0x9c/0xa6 <b0102b67> sysenter_past_esp+0x54/0x75
Filesystem "sda1": XFS internal error xfs_trans_cancel at line 1150 of
file /usr/src/linux-stable-cold/fs/xfs/xfs_trans.c. Caller 0xb020d262
<b0204b4c> xfs_trans_cancel+0x59/0xe5 <b020d262> xfs_remove+0x41b/0x444
<b020d262> xfs_remove+0x41b/0x444 <b0215e83> xfs_vn_unlink+0x17/0x3b
<b016190c> mntput_no_expire+0x11/0x7e <b01575f1> link_path_walk+0xaf/0xb9
<b011e734> __capable+0xc/0x1f <b0155827> generic_permission+0x93/0xcc
<b01558f8> permission+0x98/0xa4 <b0155da0> may_delete+0x32/0xe9
<b0156243> vfs_unlink+0x6d/0xa3 <b0157c7a> do_unlinkat+0x92/0x125
<b0159a0d> sys_getdents64+0x9c/0xa6 <b0102b67> sysenter_past_esp+0x54/0x75
xfs_force_shutdown(sda1,0x8) called from line 1151 of file
/usr/src/linux-stable-cold/fs/xfs/xfs_trans.c. Return address =
0xb0218b6c
Filesystem "sda1": Corruption of in-memory data detected. Shutting
down filesystem: sda1

Please umount the filesystem, and rectify the problem(s)
xfs_force_shutdown(sda1,0x1) called from line 338 of file
/usr/src/linux-stable-cold/fs/xfs/xfs_rw.c. Return address =
0xb0218b6c


--
avuton
--
Anyone who quotes me in their sig is an idiot. -- Rusty Russell.

Avuton Olrich

Jun 20, 2006, 2:40:56 AM
to Nathan Scott, linux-...@vger.kernel.org, x...@oss.sgi.com
On 6/19/06, Nathan Scott <nat...@sgi.com> wrote:
> How reproducible is it? Is it reproducible even after xfs_repair?
It happens every time I try to delete the directory.

Also, I forgot to mention that I ran xfs_check on it, and it gave me more
information than I had before:
missing free index for data block 0 in dir ino 1507133580
missing free index for data block 2 in dir ino 1507133580
missing free index for data block 3 in dir ino 1507133580
missing free index for data block 4 in dir ino 1507133580
missing free index for data block 5 in dir ino 1507133580
missing free index for data block 6 in dir ino 1507133580
missing free index for data block 7 in dir ino 1507133580
missing free index for data block 8 in dir ino 1507133580
missing free index for data block 9 in dir ino 1507133580

--
avuton
--
Anyone who quotes me in their sig is an idiot. -- Rusty Russell.

Nathan Scott

Jun 20, 2006, 2:44:19 AM
to Avuton Olrich, linux-...@vger.kernel.org, x...@oss.sgi.com
On Mon, Jun 19, 2006 at 11:38:58PM -0700, Avuton Olrich wrote:
> On 6/19/06, Nathan Scott <nat...@sgi.com> wrote:
> > If so, can you try Mandy's patch below, to see if it is addressing
> > the root cause of your problem? If problems persist, a reproducible
> > test case would be wonderful, if one can be found..
>
> I'm sorry, the patch doesn't change anything. It never makes it through
> the xfs_repair due to the above error. If there's any information I
> can get for you please let me know.

Oh - that's a kernel patch, not a repair patch. I was more interested
in whether the initial corruption could be reproduced. Which version
of xfs_repair are you running? (xfs_repair -V) xfsprogs-2.7.18 will
resolve your problem, I suspect.

cheers.

--
Nathan

Avuton Olrich

Jun 20, 2006, 2:50:58 AM
to Nathan Scott, linux-...@vger.kernel.org, x...@oss.sgi.com
On 6/19/06, Nathan Scott <nat...@sgi.com> wrote:
> Oh - that's a kernel patch, not a repair patch. I was more interested
> in whether the initial corruption could be reproduced. Which version
> of xfs_repair are you running? (xfs_repair -V) xfsprogs-2.7.18 will
> resolve your problem, I suspect.

OK, I'm running Gentoo's latest: 2.7.11, I can't find 2.7.18
_anywhere_ although 2.7.13 is in the pre directory on the ftp, is that
the one you're referring to?


--
avuton
--
Anyone who quotes me in their sig is an idiot. -- Rusty Russell.

Nathan Scott

Jun 20, 2006, 2:52:46 AM
to Avuton Olrich, linux-...@vger.kernel.org, x...@oss.sgi.com
On Mon, Jun 19, 2006 at 11:50:37PM -0700, Avuton Olrich wrote:
> On 6/19/06, Nathan Scott <nat...@sgi.com> wrote:
> > Oh - that's a kernel patch, not a repair patch. I was more interested
> > in whether the initial corruption could be reproduced. Which version
> > of xfs_repair are you running? (xfs_repair -V) xfsprogs-2.7.18 will
> > resolve your problem, I suspect.
>
> OK, I'm running Gentoo's latest: 2.7.11, I can't find 2.7.18
> _anywhere_ although 2.7.13 is in the pre directory on the ftp, is that
> the one you're referring to?

No - it's in CVS (for a long time); I'll go get the ftp area updated,
looks like that's been forgotten about again.

cheers.

--
Nathan

Avuton Olrich

Jun 20, 2006, 4:21:15 AM
to Nathan Scott, linux-...@vger.kernel.org, x...@oss.sgi.com
On 6/19/06, Nathan Scott <nat...@sgi.com> wrote:
> On Mon, Jun 19, 2006 at 11:50:37PM -0700, Avuton Olrich wrote:
> > On 6/19/06, Nathan Scott <nat...@sgi.com> wrote:
> > > Oh - that's a kernel patch, not a repair patch. I was more interested
> > > in whether the initial corruption could be reproduced. Which version
> > > of xfs_repair are you running? (xfs_repair -V) xfsprogs-2.7.18 will
> > > resolve your problem, I suspect.
> >
> > OK, I'm running Gentoo's latest: 2.7.11, I can't find 2.7.18
> > _anywhere_ although 2.7.13 is in the pre directory on the ftp, is that
> > the one you're referring to?
>
> No - it's in CVS (for a long time); I'll go get the ftp area updated,
> looks like that's been forgotten about again.

OK, I just compiled from CVS HEAD (xfs_repair 2.8.2) and it still fails.
If this fix is not yet in 2.8.x I will wait for 2.7.18 to get on the ftp.
The output:

Phase 1 - find and verify superblock...
Phase 2 - using internal log
- zero log...
- scan filesystem freespace and inode maps...
- found root inode chunk
Phase 3 - for each AG...
- scan and clear agi unlinked lists...
- process known inodes and perform inode discovery...
- agno = 0
entry "/ost+found" at block 0 offset 448 in directory inode 128
references invalid inode 18374686479671623679
clearing inode number in entry at offset 448...
entry at block 0 offset 448 in directory inode 128 has illegal name
"/ost+found": imap claims a free inode 859505 is in use, correcting
imap and clearing inode
- agno = 1
- agno = 2
- agno = 3
- agno = 4
- agno = 5
- agno = 6
- agno = 7
- agno = 8
- agno = 9
- agno = 10
- agno = 11
- agno = 12
- agno = 13
- agno = 14
- agno = 15
- process newly discovered inodes...
Phase 4 - check for duplicate blocks...
- setting up duplicate extent list...
- clear lost+found (if it exists) ...
- check for inodes claiming duplicate blocks...
- agno = 0
- agno = 1
- agno = 2
- agno = 3
- agno = 4
- agno = 5
- agno = 6
- agno = 7
- agno = 8
- agno = 9
- agno = 10
- agno = 11
- agno = 12
- agno = 13
- agno = 14
- agno = 15
Phase 5 - rebuild AG headers and trees...
- reset superblock...


Phase 6 - check inode connectivity...
- resetting contents of realtime bitmap and summary inodes
- ensuring existence of lost+found directory
- traversing filesystem starting at / ...
rebuilding directory inode 128

fatal error -- can't read block 16777216 for directory inode
1507133580

--

avuton
--
Anyone who quotes me in their sig is an idiot. -- Rusty Russell.

Duncan Sands

Jun 20, 2006, 4:39:42 AM
to Avuton Olrich, Nathan Scott, linux-...@vger.kernel.org, x...@oss.sgi.com
> fatal error -- can't read block 16777216 for directory inode 1507133580

This looks to be the same problem as http://oss.sgi.com/bugzilla/show_bug.cgi?id=631
Note that the block numbers are identical in both reports: 16777216 = 0x1000000.
A very suspicious block number, wouldn't you say?

Best wishes,

Duncan.

Justin Piszcz

Jun 20, 2006, 4:57:42 AM
to Avuton Olrich, Nathan Scott, linux-...@vger.kernel.org, x...@oss.sgi.com
Have you checked to make sure you don't have a bad disk?

Avuton Olrich

Jun 20, 2006, 1:02:27 PM
to Justin Piszcz, Nathan Scott, linux-...@vger.kernel.org, x...@oss.sgi.com
On 6/20/06, Justin Piszcz <jpi...@lucidpixels.com> wrote:
> Have you checked to make sure you don't have a bad disk?

In the initial email I do state that I have run badblocks on this disk
successfully.

Justin Piszcz

Jun 20, 2006, 1:16:07 PM
to Avuton Olrich, Nathan Scott, linux-...@vger.kernel.org, x...@oss.sgi.com
What options did you pass to badblocks?

Avuton Olrich

Jun 20, 2006, 1:21:58 PM
to Justin Piszcz, Nathan Scott, linux-...@vger.kernel.org, x...@oss.sgi.com
On 6/20/06, Justin Piszcz <jpi...@lucidpixels.com> wrote:
> What options did you pass to badblocks?

Just the defaults, but it doesn't matter: someone else is having the
exact same issue I am, according to the bugzilla entry earlier in this thread.

Nathan Scott

Jun 21, 2006, 10:57:16 PM
to Avuton Olrich, linux-...@vger.kernel.org, x...@oss.sgi.com
On Tue, Jun 20, 2006 at 01:20:39AM -0700, Avuton Olrich wrote:
> On 6/19/06, Nathan Scott <nat...@sgi.com> wrote:
> > No - it's in CVS (for a long time); I'll go get the ftp area updated,
> > looks like that's been forgotten about again.

FWIW, I've updated the ftp area now.

> OK, just compiled from CVS HEAD (xfs_repair 2.8.2) and it still fails:

Is this a large filesystem? Any chance we can get access to
it somehow (e.g. xfs_copy to a sparse file, then send me a
pointer to it) to reproduce the problem locally?

> fatal error -- can't read block 16777216 for directory inode
> 1507133580

Once you save a copy of it for further analysis of xfs_repair,
if you can, you can clear out this problem by directly poking at
the device using xfs_db in expert mode. "xfs_db -x /dev/xxx";
then "inode 1507133580"; then "write core.mode 0"; and then try
another xfs_repair run. Please try to capture the fs for us first
though (if possible) else we're going to struggle to improve on
this aspect of xfs_repair. Send me some private mail if you do
manage to grab the fs and put it someplace for me.

thanks.

--
Nathan

Duncan Sands

Jun 25, 2006, 6:09:58 AM
to Avuton Olrich, Nathan Scott, linux-...@vger.kernel.org, x...@oss.sgi.com
I just got a new XFS crash running 2.6.17, again with problems at block
16777216 - I'll try to make a copy of the corrupted filesystem available.
Interestingly enough, I'm also seeing ext3 corruption. The usual
manifestation is that a program fails to run, with a message about it
not being in executable format (if it happens again I will take a note of
the exact message). I've had no problems at all with 2.6.17. It seems
to be happening randomly, which makes me suspect a race condition
(uniprocessor machine, but preemptable kernel), or memory corruption.
I will rebuild the kernel with all kernel debugging options turned
on, once I recover the filesystem.

Ciao,

D.

Duncan Sands

Jun 25, 2006, 9:55:52 AM
to Avuton Olrich, Nathan Scott, linux-...@vger.kernel.org, x...@oss.sgi.com
On Sunday 25 June 2006 12:09, Duncan Sands wrote:
> I just got a new XFS crash running 2.6.17, again with problems at block
> 16777216 - I'll try to make a copy of the corrupted filesystem available.
> Interestingly enough, I'm also seeing ext3 corruption. The usual
> manifestation is that a program fails to run, with a message about it
> not being in executable format (if it happens again I will take a note of
> the exact message). I've had no problems at all with 2.6.17. It seems
> to be happening randomly, which makes me suspect a race condition
> (uniprocessor machine, but preemptable kernel), or memory corruption.
> I will rebuild the kernel with all kernel debugging options turned
> on, once I recover the filesystem.

Sorry, that should say: "I've had no problems at all with 2.6.15".
Also, xfs_repair successfully repaired the filesystem this time.
I've kept a copy of the filesystem in case anyone is interested.

Duncan.
