XFS crashed twice, once in 2.6.16.20, next in 2.6.17, reproducible

Avuton Olrich

Jun 19, 2006, 3:45:29 AM

to Linux Kernel Mailing List, linu...@oss.sgi.com

This didn't make it to the linux-xfs list (it's not working), so I'll try
again by sending it to both LKML and linux-xfs.

Hello, when trying to recursively delete a directory (same directory
twice) from my 500gb hard drive I get a problem. It crashed first in
2.6.16.20, then I upgraded to try to get rid of the issue. This one is
from 2.6.17:

xfs_da_do_buf: bno 16777216
dir: inode 1507133580
Filesystem "sda1": XFS internal error xfs_da_do_buf(1) at line 2119 of
file /usr/src/linux-stable-cold/fs/xfs/xfs_da_btree.c. Caller 0xb01d9b63
<b01d9720> xfs_da_do_buf+0x40e/0x7c7 <b01d9b63> xfs_da_read_buf+0x30/0x35
<b01e43d5> xfs_dir2_leafn_lookup_int+0x2f3/0x453 <b01d9b63>
xfs_da_read_buf+0x30/0x35
<b01e2ba5> xfs_dir2_node_removename+0x288/0x47f <b01e2ba5>
xfs_dir2_node_removename+0x288/0x47f
<b01ddbd3> xfs_dir2_removename+0xce/0xd5 <b020ff5d> kmem_zone_alloc+0x4d/0x98
<b020d0ef> xfs_remove+0x2ac/0x444 <b0215e7f> xfs_vn_unlink+0x17/0x3b
<b020a32b> xfs_lookup+0x6e/0x78 <b011e734> __capable+0xc/0x1f
<b0155827> generic_permission+0x93/0xcc <b01558f8> permission+0x98/0xa4
<b0155da0> may_delete+0x32/0xe9 <b0156243> vfs_unlink+0x6d/0xa3
<b0157c7a> do_unlinkat+0x92/0x125 <b0159a0d> sys_getdents64+0x9c/0xa6
<b0102b67> sysenter_past_esp+0x54/0x75
Filesystem "sda1": XFS internal error xfs_trans_cancel at line 1150 of
file /usr/src/linux-stable-cold/fs/xfs/xfs_trans.c. Caller 0xb020d25e
<b0204b48> xfs_trans_cancel+0x59/0xe5 <b020d25e> xfs_remove+0x41b/0x444
<b020d25e> xfs_remove+0x41b/0x444 <b0215e7f> xfs_vn_unlink+0x17/0x3b
<b020a32b> xfs_lookup+0x6e/0x78 <b011e734> __capable+0xc/0x1f
<b0155827> generic_permission+0x93/0xcc <b01558f8> permission+0x98/0xa4
<b0155da0> may_delete+0x32/0xe9 <b0156243> vfs_unlink+0x6d/0xa3
<b0157c7a> do_unlinkat+0x92/0x125 <b0159a0d> sys_getdents64+0x9c/0xa6
<b0102b67> sysenter_past_esp+0x54/0x75
xfs_force_shutdown(sda1,0x8) called from line 1151 of file
/usr/src/linux-stable-cold/fs/xfs/xfs_trans.c. Return address =
0xb0218b68
Filesystem "sda1": Corruption of in-memory data detected. Shutting
down filesystem: sda1

While trying to xfs_repair I get the following:
fatal error -- can't read block 16777216 for directory inode 1507133580

Badblocks has been run on this machine and it was successful.

I did find an old thread with this, but no solution:
http://oss.sgi.com/archives/xfs/2005-02/msg00067.html

config:
http://olricha.homelinux.net:8080/config.gz

Thanks for any help. If I can help at all please let me know.
--
avuton
--
Anyone who quotes me in their sig is an idiot. -- Rusty Russell.
-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majo...@vger.kernel.org
More majordomo info at http://vger.kernel.org/majordomo-info.html
Please read the FAQ at http://www.tux.org/lkml/

daniel+deve...@flexserv.de

Jun 19, 2006, 6:36:05 AM

to Avuton Olrich, Linux Kernel Mailing List, linu...@oss.sgi.com

"Avuton Olrich" <avu...@gmail.com> writes:

The same here.
After a complete mkfs.xfs under 2.6.17-rc6 it was solved.

The same happens if I boot 2.6.8, run mkfs.xfs, and then boot into 2.6.16:
the xfs gets "shredded". Booting directly from .8 to .17-rc6 works. So I
think there was a bug in .16 in the transition of the xfs, which got solved
somewhere in the .17-rc timeframe.

> Filesystem "sda1": Corruption of in-memory data detected. Shutting
> down filesystem: sda1

Daniel

Nathan Scott

Jun 20, 2006, 2:11:02 AM

to Avuton Olrich, linux-...@vger.kernel.org, x...@oss.sgi.com

On Mon, Jun 19, 2006 at 12:44:58AM -0700, Avuton Olrich wrote:
> ..

> Hello, when trying to recursively delete a directory (same directory
> twice) from my 500gb hard drive I get a problem. It crashed first in
> 2.6.16.20, then I upgraded to try to get rid of the issue. This one is
> from 2.6.17:

How reproducible is it? Is it reproducible even after xfs_repair?

If so, can you try Mandy's patch below, to see if it is addressing
the root cause of your problem? If problems persist, a reproducible
test case would be wonderful, if one can be found..

cheers.

--
Nathan

Fix nused counter. It's currently getting set to -1 rather than getting
decremented by 1. Since nused never reaches 0, the "if (!free->hdr.nused)"
check in xfs_dir2_leafn_remove() fails every time and xfs_dir2_shrink_inode()
doesn't get called when it should. This causes extra blocks to be left on
an empty directory and the directory is unable to be converted back to
inline extent mode.

Signed-off-by: Mandy Kirkconnell <alki...@sgi.com>
Signed-off-by: Nathan Scott <nat...@sgi.com>

--- a/fs/xfs/xfs_dir2_node.c 2006-06-20 16:00:45.000000000 +1000
+++ b/fs/xfs/xfs_dir2_node.c 2006-06-20 16:00:45.000000000 +1000
@@ -972,7 +972,7 @@ xfs_dir2_leafn_remove(
/*
* One less used entry in the free table.
*/
- free->hdr.nused = cpu_to_be32(-1);
+ be32_add(&free->hdr.nused, -1);
xfs_dir2_free_log_header(tp, fbp);
/*
* If this was the last entry in the table, we can
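
To see why the one-line change matters, here is a minimal user-space sketch of the two updates. It is not kernel code: htonl()/ntohl() stand in for cpu_to_be32()/be32_to_cpu(), and the function names (nused_dec_buggy, nused_dec_fixed, free_block_is_empty) are hypothetical, chosen only for this illustration. The point it tries to show is that overwriting the counter with -1 can never produce 0, so a "!nused" test can never succeed.

```c
#include <arpa/inet.h>  /* htonl/ntohl: user-space stand-ins for cpu_to_be32/be32_to_cpu */
#include <assert.h>
#include <stdint.h>

/* Buggy update: overwrites the big-endian counter with -1 (0xFFFFFFFF). */
static void nused_dec_buggy(uint32_t *nused_be)
{
    *nused_be = htonl((uint32_t)-1);
}

/* Fixed update: convert to CPU order, subtract one, convert back.
 * This mirrors what a be32_add(&x, -1) style helper is expected to do. */
static void nused_dec_fixed(uint32_t *nused_be)
{
    assert(ntohl(*nused_be) > 0);   /* the counter must not underflow */
    *nused_be = htonl(ntohl(*nused_be) - 1);
}

/* Stand-in for the "if (!free->hdr.nused)" check: zero is zero in any byte order. */
static int free_block_is_empty(uint32_t nused_be)
{
    return nused_be == 0;
}
```

Starting from a counter of 1, the fixed version reaches 0 and the emptiness check passes. The buggy version leaves 0xFFFFFFFF, so the check fails and the shrink path described in the patch text is never taken.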

Avuton Olrich

Jun 20, 2006, 2:40:24 AM

to Nathan Scott, linux-...@vger.kernel.org, x...@oss.sgi.com

On 6/19/06, Nathan Scott <nat...@sgi.com> wrote:
> On Mon, Jun 19, 2006 at 12:44:58AM -0700, Avuton Olrich wrote:
> > ..
> > Hello, when trying to recursively delete a directory (same directory
> > twice) from my 500gb hard drive I get a problem. It crashed first in
> > 2.6.16.20, then I upgraded to try to get rid of the issue. This one is
> > from 2.6.17:
>
> How reproducible is it? Is it reproducible even after xfs_repair?

Happens every time I try to remove that inode (directory). xfs_repair
ends with a fatal error:

Phase 6 - check inode connectivity...
- resetting contents of realtime bitmap and summary inodes
- ensuring existence of lost+found directory
- traversing filesystem starting at / ...
rebuilding directory inode 128

fatal error -- can't read block 16777216 for directory inode 1507133580

> If so, can you try Mandy's patch below, to see if it is addressing
> the root cause of your problem? If problems persist, a reproducible
> test case would be wonderful, if one can be found..

I'm sorry, the patch doesn't change anything. It never makes it through
the xfs_repair due to the above error. If there's any information I
can get for you please let me know.

I'm not sure if it changes anything, but here's the message after the patch:

xfs_da_do_buf: bno 16777216
dir: inode 1507133580
Filesystem "sda1": XFS internal error xfs_da_do_buf(1) at line 2119 of
file /usr/src/linux-stable-cold/fs/xfs/xfs_da_btree.c. Caller 0xb01d9b63
<b01d9720> xfs_da_do_buf+0x40e/0x7c7 <b01d9b63> xfs_da_read_buf+0x30/0x35
<b01e43d9> xfs_dir2_leafn_lookup_int+0x2f3/0x453 <b01d9b63>
xfs_da_read_buf+0x30/0x35
<b01e2ba5> xfs_dir2_node_removename+0x288/0x483 <b01e2ba5>
xfs_dir2_node_removename+0x288/0x483
<b01ddbd3> xfs_dir2_removename+0xce/0xd5 <b020ff61> kmem_zone_alloc+0x4d/0x98
<b020d0f3> xfs_remove+0x2ac/0x444 <b0215e83> xfs_vn_unlink+0x17/0x3b
<b016190c> mntput_no_expire+0x11/0x7e <b01575f1> link_path_walk+0xaf/0xb9
<b011e734> __capable+0xc/0x1f <b0155827> generic_permission+0x93/0xcc
<b01558f8> permission+0x98/0xa4 <b0155da0> may_delete+0x32/0xe9
<b0156243> vfs_unlink+0x6d/0xa3 <b0157c7a> do_unlinkat+0x92/0x125
<b0159a0d> sys_getdents64+0x9c/0xa6 <b0102b67> sysenter_past_esp+0x54/0x75
Filesystem "sda1": XFS internal error xfs_trans_cancel at line 1150 of
file /usr/src/linux-stable-cold/fs/xfs/xfs_trans.c. Caller 0xb020d262
<b0204b4c> xfs_trans_cancel+0x59/0xe5 <b020d262> xfs_remove+0x41b/0x444
<b020d262> xfs_remove+0x41b/0x444 <b0215e83> xfs_vn_unlink+0x17/0x3b
<b016190c> mntput_no_expire+0x11/0x7e <b01575f1> link_path_walk+0xaf/0xb9
<b011e734> __capable+0xc/0x1f <b0155827> generic_permission+0x93/0xcc
<b01558f8> permission+0x98/0xa4 <b0155da0> may_delete+0x32/0xe9
<b0156243> vfs_unlink+0x6d/0xa3 <b0157c7a> do_unlinkat+0x92/0x125
<b0159a0d> sys_getdents64+0x9c/0xa6 <b0102b67> sysenter_past_esp+0x54/0x75
xfs_force_shutdown(sda1,0x8) called from line 1151 of file
/usr/src/linux-stable-cold/fs/xfs/xfs_trans.c. Return address =
0xb0218b6c
Filesystem "sda1": Corruption of in-memory data detected. Shutting
down filesystem: sda1
Please umount the filesystem, and rectify the problem(s)
xfs_force_shutdown(sda1,0x1) called from line 338 of file
/usr/src/linux-stable-cold/fs/xfs/xfs_rw.c. Return address =
0xb0218b6c

--
avuton
--
Anyone who quotes me in their sig is an idiot. -- Rusty Russell.

Avuton Olrich

Jun 20, 2006, 2:40:56 AM

to Nathan Scott, linux-...@vger.kernel.org, x...@oss.sgi.com

On 6/19/06, Nathan Scott <nat...@sgi.com> wrote:

> How reproducible is it? Is it reproducible even after xfs_repair?

It happens every time I try to delete the directory.

Also, I forgot to mention that I ran xfs_check on it, and it gave me more
information than I had before:
missing free index for data block 0 in dir ino 1507133580
missing free index for data block 2 in dir ino 1507133580
missing free index for data block 3 in dir ino 1507133580
missing free index for data block 4 in dir ino 1507133580
missing free index for data block 5 in dir ino 1507133580
missing free index for data block 6 in dir ino 1507133580
missing free index for data block 7 in dir ino 1507133580
missing free index for data block 8 in dir ino 1507133580
missing free index for data block 9 in dir ino 1507133580

--
avuton
--
Anyone who quotes me in their sig is an idiot. -- Rusty Russell.

Nathan Scott

Jun 20, 2006, 2:44:19 AM

to Avuton Olrich, linux-...@vger.kernel.org, x...@oss.sgi.com

On Mon, Jun 19, 2006 at 11:38:58PM -0700, Avuton Olrich wrote:
> On 6/19/06, Nathan Scott <nat...@sgi.com> wrote:
> > If so, can you try Mandy's patch below, to see if it is addressing
> > the root cause of your problem? If problems persist, a reproducible
> > test case would be wonderful, if one can be found..
>
> I'm sorry, the patch doesn't change anything. It never makes it through
> the xfs_repair due to the above error. If there's any information I
> can get for you please let me know.

Oh - that's a kernel patch, not a repair patch, I was more interested
in whether the initial corruption could be reproduced. Which version
of xfs_repair are you running? (xfs_repair -V) xfsprogs-2.7.18 will
resolve your problem, I suspect.

cheers.

--
Nathan

Avuton Olrich

Jun 20, 2006, 2:50:58 AM

to Nathan Scott, linux-...@vger.kernel.org, x...@oss.sgi.com

On 6/19/06, Nathan Scott <nat...@sgi.com> wrote:
> Oh - that's a kernel patch, not a repair patch, I was more interested
> in whether the initial corruption could be reproduced. Which version
> of xfs_repair are you running? (xfs_repair -V) xfsprogs-2.7.18 will
> resolve your problem, I suspect.

OK, I'm running Gentoo's latest: 2.7.11, I can't find 2.7.18
_anywhere_ although 2.7.13 is in the pre directory on the ftp, is that
the one you're referring to?

--
avuton
--
Anyone who quotes me in their sig is an idiot. -- Rusty Russell.

Nathan Scott

Jun 20, 2006, 2:52:46 AM

to Avuton Olrich, linux-...@vger.kernel.org, x...@oss.sgi.com

On Mon, Jun 19, 2006 at 11:50:37PM -0700, Avuton Olrich wrote:
> On 6/19/06, Nathan Scott <nat...@sgi.com> wrote:
> > Oh - that's a kernel patch, not a repair patch, I was more interested
> > in whether the initial corruption could be reproduced. Which version
> > of xfs_repair are you running? (xfs_repair -V) xfsprogs-2.7.18 will
> > resolve your problem, I suspect.
>
> OK, I'm running Gentoo's latest: 2.7.11, I can't find 2.7.18
> _anywhere_ although 2.7.13 is in the pre directory on the ftp, is that
> the one you're referring to?

No - it's in CVS (for a long time); I'll go get the ftp area updated,
looks like that's been forgotten about again.

cheers.

--
Nathan

Avuton Olrich

Jun 20, 2006, 4:21:15 AM

to Nathan Scott, linux-...@vger.kernel.org, x...@oss.sgi.com

On 6/19/06, Nathan Scott <nat...@sgi.com> wrote:
> On Mon, Jun 19, 2006 at 11:50:37PM -0700, Avuton Olrich wrote:
> > On 6/19/06, Nathan Scott <nat...@sgi.com> wrote:
> > > Oh - that's a kernel patch, not a repair patch, I was more interested
> > > in whether the initial corruption could be reproduced. Which version
> > > of xfs_repair are you running? (xfs_repair -V) xfsprogs-2.7.18 will
> > > resolve your problem, I suspect.
> >
> > OK, I'm running Gentoo's latest: 2.7.11, I can't find 2.7.18
> > _anywhere_ although 2.7.13 is in the pre directory on the ftp, is that
> > the one you're referring to?
>
> No - it's in CVS (for a long time); I'll go get the ftp area updated,
> looks like that's been forgotten about again.

OK, I just compiled from CVS HEAD (xfs_repair 2.8.2) and it still fails.
If this fix is not yet in 2.8.x, I will wait for 2.7.18 to get on the ftp.
Here is the output:

Phase 1 - find and verify superblock...
Phase 2 - using internal log
- zero log...
- scan filesystem freespace and inode maps...
- found root inode chunk
Phase 3 - for each AG...
- scan and clear agi unlinked lists...
- process known inodes and perform inode discovery...
- agno = 0
entry "/ost+found" at block 0 offset 448 in directory inode 128
references invalid inode 18374686479671623679
clearing inode number in entry at offset 448...
entry at block 0 offset 448 in directory inode 128 has illegal name
"/ost+found": imap claims a free inode 859505 is in use, correcting
imap and clearing inode
- agno = 1
- agno = 2
- agno = 3
- agno = 4
- agno = 5
- agno = 6
- agno = 7
- agno = 8
- agno = 9
- agno = 10
- agno = 11
- agno = 12
- agno = 13
- agno = 14
- agno = 15
- process newly discovered inodes...
Phase 4 - check for duplicate blocks...
- setting up duplicate extent list...
- clear lost+found (if it exists) ...
- check for inodes claiming duplicate blocks...
- agno = 0
- agno = 1
- agno = 2
- agno = 3
- agno = 4
- agno = 5
- agno = 6
- agno = 7
- agno = 8
- agno = 9
- agno = 10
- agno = 11
- agno = 12
- agno = 13
- agno = 14
- agno = 15
Phase 5 - rebuild AG headers and trees...
- reset superblock...

Phase 6 - check inode connectivity...
- resetting contents of realtime bitmap and summary inodes
- ensuring existence of lost+found directory
- traversing filesystem starting at / ...
rebuilding directory inode 128

fatal error -- can't read block 16777216 for directory inode
1507133580

--
avuton
--
Anyone who quotes me in their sig is an idiot. -- Rusty Russell.

Duncan Sands

Jun 20, 2006, 4:39:42 AM

to Avuton Olrich, Nathan Scott, linux-...@vger.kernel.org, x...@oss.sgi.com

> fatal error -- can't read block 16777216 for directory inode 1507133580

This looks to be the same problem as http://oss.sgi.com/bugzilla/show_bug.cgi?id=631
Note that the block numbers are identical in both reports: 16777216 = 0x1000000.
A very suspicious block number, wouldn't you say?

Best wishes,

Duncan.
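
One possible reading of that block number, offered as an assumption rather than a diagnosis: in the version 2 directory format, a directory's logical address space is, as far as I recall, split into fixed 32 GiB regions for data, leaf and free-index blocks, in that order. If that recollection and a 4096-byte directory block size both hold (neither is stated in this thread), the first free-index block sits at logical block 2^24 = 16777216. That would fit the "missing free index" messages from xfs_check quoted earlier. The constant and function names below are hypothetical stand-ins, not copied from the XFS headers.

```c
#include <assert.h>
#include <stdint.h>

/* Assumed layout constants, recalled from the XFS dir2 format; not from this thread. */
#define DIR2_DATA_ALIGN_LOG 3   /* directory entries are 8-byte aligned */
#define DIR2_SPACE_SIZE     (1ULL << (32 + DIR2_DATA_ALIGN_LOG))  /* 32 GiB per region */
#define DIR2_FREE_SPACE     2ULL  /* regions: 0 = data, 1 = leaf, 2 = free index */

/* Logical block number of the first free-index block for a given directory block size. */
static uint64_t dir2_first_free_block(uint32_t dir_block_size)
{
    assert(dir_block_size != 0);    /* guard the division below */
    uint64_t free_offset = DIR2_FREE_SPACE * DIR2_SPACE_SIZE;  /* byte offset of the region */
    return free_offset / dir_block_size;
}
```

Under these assumptions, dir2_first_free_block(4096) evaluates to 16777216, the block that both the kernel and xfs_repair fail to read. With a different block size the number would differ, so seeing the same value in two reports is consistent with both filesystems using 4096-byte blocks.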

Justin Piszcz

Jun 20, 2006, 4:57:42 AM

to Avuton Olrich, Nathan Scott, linux-...@vger.kernel.org, x...@oss.sgi.com

Have you checked to make sure you don't have a bad disk?

Avuton Olrich

Jun 20, 2006, 1:02:27 PM

to Justin Piszcz, Nathan Scott, linux-...@vger.kernel.org, x...@oss.sgi.com

On 6/20/06, Justin Piszcz <jpi...@lucidpixels.com> wrote:
> Have you checked to make sure you don't have a bad disk?

In the initial email I stated that I ran badblocks on this disk
successfully.

Justin Piszcz

Jun 20, 2006, 1:16:07 PM

to Avuton Olrich, Nathan Scott, linux-...@vger.kernel.org, x...@oss.sgi.com

What options did you pass to badblocks?

Avuton Olrich

Jun 20, 2006, 1:21:58 PM

to Justin Piszcz, Nathan Scott, linux-...@vger.kernel.org, x...@oss.sgi.com

On 6/20/06, Justin Piszcz <jpi...@lucidpixels.com> wrote:
> What options did you pass to badblocks?

Just the defaults, but it doesn't matter: someone else is having the
exact same issue I am, according to the bugzilla entry earlier in this thread.

Nathan Scott

Jun 21, 2006, 10:57:16 PM

to Avuton Olrich, linux-...@vger.kernel.org, x...@oss.sgi.com

On Tue, Jun 20, 2006 at 01:20:39AM -0700, Avuton Olrich wrote:
> On 6/19/06, Nathan Scott <nat...@sgi.com> wrote:
> > No - it's in CVS (for a long time); I'll go get the ftp area updated,
> > looks like that's been forgotten about again.

FWIW, I've updated the ftp area now.

> OK, just compiled from CVS HEAD (xfs_repair 2.8.2) and it still fails:

Is this a large filesystem? Any chance we can get access to
it somehow (e.g. xfs_copy to a sparse file, then send me a
pointer to it) to reproduce the problem locally?

> fatal error -- can't read block 16777216 for directory inode
> 1507133580

Once you save a copy of it for further analysis of xfs_repair,
if you can, you can clear out this problem by directly poking at
the device using xfs_db in expert mode. "xfs_db -x /dev/xxx";
then "inode 1507133580"; then "write core.mode 0"; and then try
another xfs_repair run. Please try to capture the fs for us first
though (if possible) else we're going to struggle to improve on
this aspect of xfs_repair. Send me some private mail if you do
manage to grab the fs and put it someplace for me.

thanks.

--
Nathan

Duncan Sands

Jun 25, 2006, 6:09:58 AM

to Avuton Olrich, Nathan Scott, linux-...@vger.kernel.org, x...@oss.sgi.com

I just got a new XFS crash running 2.6.17, again with problems at block
16777216 - I'll try to make a copy of the corrupted filesystem available.
Interestingly enough, I'm also seeing ext3 corruption. The usual
manifestation is that a program fails to run, with a message about it
not being in executable format (if it happens again I will take a note of
the exact message). I've had no problems at all with 2.6.17. It seems
to be happening randomly, which makes me suspect a race condition
(uniprocessor machine, but preemptable kernel), or memory corruption.
I will rebuild the kernel with all kernel debugging options turned
on, once I recover the filesystem.

Ciao,

D.

Duncan Sands

Jun 25, 2006, 9:55:52 AM

to Avuton Olrich, Nathan Scott, linux-...@vger.kernel.org, x...@oss.sgi.com

On Sunday 25 June 2006 12:09, Duncan Sands wrote:
> I just got a new XFS crash running 2.6.17, again with problems at block
> 16777216 - I'll try to make a copy of the corrupted filesystem available.
> Interestingly enough, I'm also seeing ext3 corruption. The usual
> manifestation is that a program fails to run, with a message about it
> not being in executable format (if it happens again I will take a note of
> the exact message). I've had no problems at all with 2.6.17. It seems
> to be happening randomly, which makes me suspect a race condition
> (uniprocessor machine, but preemptable kernel), or memory corruption.
> I will rebuild the kernel with all kernel debugging options turned
> on, once I recover the filesystem.

Sorry, that should say: "I've had no problems at all with 2.6.15".
Also, xfs_repair successfully repaired the filesystem this time.
I've kept a copy of the filesystem in case anyone is interested.

Duncan.