Bug#964494: File system corruption with ext3 + kernel-4.19.0-9-amd64

Sarah Newman

Jul 7, 2020, 8:40:02 PM

Package: linux-signed-amd64
Version: 4.19.0-9-amd64

We've now had two separate reports from Debian buster users running 4.19.0-9-amd64 who experienced serious file system corruption.

- Both were using ext3
- Both are running Xen HVM, but I do not have reason to believe this to be related
- Both are on distinct physical hosts
- Both had upgraded from an older, non-4.19 kernel within the last two or three weeks

One user had the error:

ext4-fs error (device xvda1): ext4_validate_block_bitmap:393: comm cat: bg 812: block 26607617: invalid block bitmap
aborting journal on device xvda1-8
ext4-fs error (device xvda1): ext4_journal_check_start:61: Detected abnormal journal
ext4-fs (xvda1): Remounting filesystem read-only
ext4-fs (xvda1): Remounting filesystem read-only
ext4-fs error (device xvda1) in ext4_orphan_add:2863: Journal has aborted

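As a rough aid for reading the first log line, the sketch below pulls the group ("bg") and block numbers out of the message and checks where the block falls within its group. It assumes 32768 blocks per group and a first block of 0, which are the values shown in the other user's tune2fs output further down; the geometry of the failing filesystem itself was not reported, so treat this as illustrative only. The function name `locate_block` is made up for the example.

```python
import re

# Assumed geometry: NOT reported for the failing filesystem.
BLOCKS_PER_GROUP = 32768
FIRST_BLOCK = 0

LOG_LINE = ("ext4-fs error (device xvda1): ext4_validate_block_bitmap:393: "
            "comm cat: bg 812: block 26607617: invalid block bitmap")


def locate_block(line, blocks_per_group=BLOCKS_PER_GROUP, first_block=FIRST_BLOCK):
    """Return (reported_group, computed_group, offset_in_group) for a log line."""
    m = re.search(r"bg (\d+): block (\d+):", line)
    if m is None:
        raise ValueError("no 'bg N: block M:' fields found")
    group, block = int(m.group(1)), int(m.group(2))
    # Group index and offset follow from integer division by the group size.
    computed_group, offset = divmod(block - first_block, blocks_per_group)
    return group, computed_group, offset


if __name__ == "__main__":
    print(locate_block(LOG_LINE))
```

Under those assumptions the reported block number is consistent with the reported group: it sits at offset 1 from the first block of group 812.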
The other gave us the output of tune2fs -l:

tune2fs 1.44.5 (15-Dec-2018)
Last mounted on: /
Filesystem magic number: 0xEF53
Filesystem revision #: 1 (dynamic)
Filesystem features: has_journal ext_attr resize_inode dir_index filetype needs_recovery sparse_super large_file
Filesystem flags: signed_directory_hash
Default mount options: (none)
Filesystem state: clean
Errors behavior: Continue
Filesystem OS type: Linux
Inode count: 4700160
Block count: 9437183
Reserved block count: 471048
Free blocks: 6164372
Free inodes: 4479367
First block: 0
Block size: 4096
Fragment size: 4096
Reserved GDT blocks: 730
Blocks per group: 32768
Fragments per group: 32768
Inodes per group: 16320
Inode blocks per group: 510
Filesystem created: Thu Apr 26 19:55:21 2012
Last mount time: Tue Jul 7 15:11:46 2020
Last write time: Tue Jul 7 15:11:45 2020
Mount count: 1
Maximum mount count: 26
Last checked: Tue Jul 7 15:10:50 2020
Check interval: 15552000 (6 months)
Next check after: Sun Jan 3 14:10:50 2021
Lifetime writes: 10 TB
Reserved blocks uid: 0 (user root)
Reserved blocks gid: 0 (group root)
First inode: 11
Inode size: 128
Journal inode: 8
First orphan inode: 1224109
Default directory hash: tea
Directory Hash Seed: 77ef7ea3-5e01-4e55-b3da-3036769fb64b
Journal backup: inode blocks
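
For anyone triaging similar reports, here is a minimal sketch of reading `tune2fs -l` output programmatically. It treats a filesystem as "ext3-format" when it has a journal but lacks the `extent` feature; that rule of thumb is an assumption of this example, not an official definition, and the helper names are hypothetical. The sample text is a subset of the output above.

```python
SAMPLE = """\
tune2fs 1.44.5 (15-Dec-2018)
Filesystem features: has_journal ext_attr resize_inode dir_index filetype needs_recovery sparse_super large_file
Filesystem state: clean
Block count: 9437183
Block size: 4096
Filesystem created: Thu Apr 26 19:55:21 2012
"""


def parse_tune2fs(text):
    """Parse 'Key: value' lines from `tune2fs -l` output into a dict."""
    fields = {}
    for line in text.splitlines():
        # Split on the first colon only, so timestamps keep their colons.
        key, sep, value = line.partition(":")
        if sep:
            fields[key.strip()] = value.strip()
    return fields


def looks_like_ext3(fields):
    """Heuristic (an assumption of this sketch): journal present, no extents."""
    features = set(fields.get("Filesystem features", "").split())
    return "has_journal" in features and "extent" not in features


def size_in_bytes(fields):
    """Total filesystem size as block count times block size."""
    return int(fields["Block count"]) * int(fields["Block size"])


if __name__ == "__main__":
    info = parse_tune2fs(SAMPLE)
    print(looks_like_ext3(info), size_in_bytes(info))
```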

--Sarah

Ben Hutchings

Jul 7, 2020, 11:20:03 PM

Control: reassign -1 src:linux
Control: tag -1 moreinfo

On Tue, 2020-07-07 at 17:30 -0700, Sarah Newman wrote:
> Package: linux-signed-amd64
> Version: 4.19.0-9-amd64
>
> We've had two separate reports now of debian buster users running
> 4.19.0-9-amd64 who experienced serious file system corruption.

Which version? (I.e. what does "uname -v" or
"dpkg -s linux-image-4.19.0-9-amd64" say?)

> - Both were using ext3
> - Both are running Xen HVM, but I do not have reason to believe this to be related

I have no reason to assume that this is unrelated to the hypervisor, so
please report the version of Xen and whatever provides the back-end
block driver.

> - Both are on distinct physical hosts
> - Both had upgraded from an older non 4.19 kernel within the last two or three weeks

From which older versions?

> One user had the error:
>
> ext4-fs error (device xvda1): ext4_validate_block_bitmap:393: comm cat: bg 812: block 26607617: invalid block bitmap
> aborting journal on device xvda1-8
> ext4-fs error (device xvda1): ext4_journal_check_start:61: Detected abnormal journal
> ext4-fs (xvda1): Remounting filesystem read-only
> ext4-fs (xvda1): Remounting filesystem read-only
> ext4-fs error (device xvda1) in ext4_orphan_add:2863: Journal has aborted

And were there any other error messages, e.g. relating to I/O errors,
around the same time? How about in the back-end domain?

> The other gave us the output of tune2fs -l:

[...]

Looks like a fairly ordinary ext3 filesystem. It doesn't tell us
anything about what went wrong.

In general I would advise against continued use of the ext3 format. It
should continue to be supported by the ext4 code, but it is inevitably
going to be less well-tested than the ext4 format. So far as I can
remember, it is easy to upgrade in-place.
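
A sketch of what the in-place upgrade usually looks like, written as a dry run that only prints commands and never executes them. The `tune2fs -O extent,uninit_bg,dir_index` and `e2fsck -f -D` invocations follow the commonly documented ext3-to-ext4 procedure; the device name is taken from the log above purely as a placeholder, and the function name is hypothetical. This assumes the filesystem is unmounted and backed up first. Existing files keep their old block mapping, and only newly written files use extents.

```python
import shlex


def ext3_to_ext4_commands(device):
    """Build (but do not run) the commonly documented conversion commands."""
    return [
        # Enable ext4 on-disk features on the existing filesystem.
        ["tune2fs", "-O", "extent,uninit_bg,dir_index", device],
        # A forced check is needed after changing features; -D optimizes directories.
        ["e2fsck", "-f", "-D", device],
    ]


def render(commands):
    """Render argv lists as shell-quoted strings for review."""
    return [" ".join(shlex.quote(arg) for arg in argv) for argv in commands]


if __name__ == "__main__":
    # /dev/xvda1 is a placeholder; substitute the real (unmounted) device.
    for line in render(ext3_to_ext4_commands("/dev/xvda1")):
        print(line)
```

After such a conversion the fstab entry would also need its type changed from ext3 to ext4.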

Ben.

--
Ben Hutchings
The most exhausting thing in life is being insincere.
- Anne Morrow Lindbergh

Debian Bug Tracking System

Jul 7, 2020, 11:20:03 PM

Processing control commands:

> reassign -1 src:linux
Bug #964494 [linux-signed-amd64] File system corruption with ext3 + kernel-4.19.0-9-amd64
Bug reassigned from package 'linux-signed-amd64' to 'src:linux'.
No longer marked as found in versions 4.19.0-9-amd64.
Ignoring request to alter fixed versions of bug #964494 to the same values previously set
> tag -1 moreinfo
Bug #964494 [src:linux] File system corruption with ext3 + kernel-4.19.0-9-amd64
Added tag(s) moreinfo.

--
964494: https://bugs.debian.org/cgi-bin/bugreport.cgi?bug=964494
Debian Bug Tracking System
Contact ow...@bugs.debian.org with problems

Sarah Newman

Jul 8, 2020, 6:50:03 PM

On 7/7/20 8:13 PM, Ben Hutchings wrote:
> Control: reassign -1 src:linux
> Control: tag -1 moreinfo
>
> On Tue, 2020-07-07 at 17:30 -0700, Sarah Newman wrote:
>> Package: linux-signed-amd64
>> Version: 4.19.0-9-amd64
>>
>> We've had two separate reports now of debian buster users running
>> 4.19.0-9-amd64 who experienced serious file system corruption.
>
> Which version? (I.e. what does "uname -v" or
> "dpkg -s linux-image-4.19.0-9-amd64" say?)

One is version: 4.19.118-2+deb10u1

>> - Both were using ext3
>> - Both are running Xen HVM, but I do not have reason to believe this to be related
>
> I have no reason to assume that this is unrelated to the hypervisor, so
> please report the version of Xen and whatever provides the back-end
> block driver.

For the failures there are two different Xen hypervisor versions involved, the most recent being 4.9.4.45.g8d2a6880, with various patches for security
issues applied.

For Linux, the base version is 4.9.197. That's missing the xen blockback patches "xen/blkback: Avoid unmapping unmapped grant pages" and "xen-blkback:
prevent premature module unload" but I don't think either of those are relevant here based on the descriptions for those patches.

Nothing in the backend has been updated within the last few weeks.

We believe that we have positively identified around 90 VMs running Debian Buster under the same backend versions, though we can't say for certain
what kernel version or file system. I would guess at least 15 of them to be running ext4 + linux-image-4.19.0-8-amd64/4.19.98-1 or later.

Some of our own test systems on the exact same kernel and hypervisor as the ones with failures are running:

4.19.0-5-amd64 #1 SMP Debian 4.19.37-5+deb10u2 (2019-08-08) x86_64 GNU/Linux
4.19.0-6-amd64 #1 SMP Debian 4.19.67-2+deb10u1 (2019-09-20) x86_64 GNU/Linux

They are running on top of ext4, with file system options:

rw,relatime,nobarrier,errors=remount-ro,stripe=XX

They are not heavily loaded, so if load is related then they would not exhibit issues.
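
Since the thread distinguishes the kernel ABI name (e.g. 4.19.0-9-amd64) from the Debian package version (e.g. 4.19.118-2+deb10u1), here is a small sketch that splits kernel strings like the two listed above into those parts. The regular expression is an assumption based only on the format of those two lines and may not match other builds; the names are hypothetical.

```python
import re

# Pattern inferred from the two example lines in this message only.
UNAME_RE = re.compile(
    r"^(?P<abi>\S+) #\d+ SMP Debian (?P<package>\S+) "
    r"\((?P<built>\d{4}-\d{2}-\d{2})\)"
)


def split_kernel_version(line):
    """Return (abi_name, package_version, build_date) from a kernel string."""
    m = UNAME_RE.match(line)
    if m is None:
        raise ValueError("unrecognized kernel version string: %r" % line)
    return m.group("abi"), m.group("package"), m.group("built")


if __name__ == "__main__":
    for line in (
        "4.19.0-5-amd64 #1 SMP Debian 4.19.37-5+deb10u2 (2019-08-08) x86_64 GNU/Linux",
        "4.19.0-6-amd64 #1 SMP Debian 4.19.67-2+deb10u1 (2019-09-20) x86_64 GNU/Linux",
    ):
        print(split_kernel_version(line))
```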

What we don't have personal knowledge of is 4.19.0-9-amd64, either with or without ext3.

Normally we would gather more data before making an upstream report, but given the severity I thought it best to do this sooner rather than later.

>
>> - Both are on distinct physical hosts
>> - Both had upgraded from an older non 4.19 kernel within the last two or three weeks
>
> From which older versions?

In one case:

"the upgrade was from Debian 9 Stretch and the system was up to date before running the upgrade."

For the other, linux-image-4.9.0-11-amd64.

>
>> One user had the error:
>>
>> ext4-fs error (device xvda1): ext4_validate_block_bitmap:393: comm cat: bg 812: block 26607617: invalid block bitmap
>> aborting journal on device xvda1-8
>> ext4-fs error (device xvda1): ext4_journal_check_start:61: Detected abnormal journal
>> ext4-fs (xvda1): Remounting filesystem read-only
>> ext4-fs (xvda1): Remounting filesystem read-only
>> ext4-fs error (device xvda1) in ext4_orphan_add:2863: Journal has aborted
>
> And were there any other error messages, e.g. relating to I/O errors,
> around the same time? How about in the back-end domain?

For the backend, I do not see errors around that time or for several weeks previous on either physical host.

One user, the one who gave us that report, reports no other errors. They say:

After the live recovery fsck completed, I was able to use the partition and it reported clean, but it was clearly still pretty damaged. Grub2 for
example wouldn't install, insisting unknown filesystem. I copied all the data to a new ext4 filesystem and was able to boot into that, but later saw
there was pretty significant file corruption, including files that had not been modified in weeks or months. PHP files had random strings inserted in
them. Debsums reported probably 10% of packages having invalid sumchecks in some of the installed files. And a few mysql database tables had
corruption. I was able to restore the database and replace pretty much everything else from backups that had been made about 10 minutes prior to the
filesystem corruption, and then re-installed every package. So far things seem to be working fine, since I've more or less replaced every file.

The other user I am not sure about.

>
>> The other gave us the output of tune2fs -l:
> [...]
>
> Looks like a fairly ordinary ext3 filesystem. It doesn't tell us
> anything about what went wrong.
>
> In general I would advise against continued use of the ext3 format. It
> should continue to be supported by the ext4 code, but it is inevitably
> going to be less well-tested than the ext4 format. So far as I can
> remember, it is easy to upgrade in-place.

Thank you. One user has already converted to ext4 and the other plans to.

--Sarah

Sarah Newman

Jul 27, 2020, 6:20:04 PM

On 7/20/20 11:24 AM, Hans van Kranenburg wrote:
> Hi,
>
> On Wed, 15 Jul 2020 20:52:40 -0700 Sarah Newman <s...@prgmr.com> wrote:
>> On 7/7/20 8:13 PM, Ben Hutchings wrote:
>>> Control: reassign -1 src:linux
>>> Control: tag -1 moreinfo
>>>
>>> On Tue, 2020-07-07 at 17:30 -0700, Sarah Newman wrote:
>>>> Package: linux-signed-amd64
>>>> Version: 4.19.0-9-amd64
>>>>
>>>> We've had two separate reports now of debian buster users running
>>>> 4.19.0-9-amd64 who experienced serious file system corruption.
>>>
>>> Which version? (I.e. what does "uname -v" or
>>> "dpkg -s linux-image-4.19.0-9-amd64" say?)
>>>
>>>> - Both were using ext3
>>>> - Both are running Xen HVM, but I do not have reason to believe this to be related
>>
>> [...]
>
> I have servers which run 4.19.118-2 as dom0 kernel and a Xen 4.11.4-1
> rebuild for Buster.
>
> One example is a smallish 6-server cluster that got a reboot cycle 48
> days ago.
>
> It contains a few heavily loaded domUs with 4.19.118 or 4.19.131 based
> kernels.
>
> No problems or disk corruption or anything is seen yet. dom0 filesystem
> is ext4, domUs use a mix of ext4 and btrfs (over iscsi). So, no ext3
> anywhere.
>
> We haven't got bug reports against Debian Xen packages in the BTS about
> this.
>
> I have not yet tried to make an ext3 fs on a block device in a test domU
> and then have it do things with the fs and reboot it now and then. If
> wanted, I can do that and see if there's any problem after a week or
> two. Just to add chaos to help correlating.
>
> FWIW,
> Hans
>

In total we've had 9 reports of users running 4.19.0-9-amd64, only 2 of them with ext3, and no uptimes longer than maybe 5 weeks for the ext3 + 4.19.0-9-amd64 users.
There have been no more reported failures.

Our attempts to reproduce so far have not been successful, but that might be because we were running a debug kernel build and I think there may have
been a deadlock between depot_lock and console_owner which killed the tests.

I'm not sure what the next steps are here, but it sounds like ext4 has not shown any problems.

--Sarah

Sarah Newman

Aug 19, 2020, 1:10:03 AM

We haven't had any further reports of file system corruption. I would guess that converting to ext4 is sufficient to avoid the issue.

Salvatore Bonaccorso

Apr 18, 2021, 11:40:04 AM

Should this bug be closed or is there anything we still can/should do
about it?

Regards,
Salvatore

Debian Bug Tracking System

Apr 18, 2021, 1:10:03 PM

Your message dated Sun, 18 Apr 2021 19:01:41 +0200
with message-id <YHxl9a5n...@eldamar.lan>
and subject line Re: Bug#964494: Info received (Bug#964494: File system corruption with ext3 + kernel-4.19.0-9-amd64)
has caused the Debian Bug report #964494,
regarding File system corruption with ext3 + kernel-4.19.0-9-amd64
to be marked as done.

This means that you claim that the problem has been dealt with.
If this is not the case it is now your responsibility to reopen the
Bug report if necessary, and/or fix the problem forthwith.

(NB: If you are a system administrator and have no idea what this
message is talking about, this may indicate a serious mail system
misconfiguration somewhere. Please contact ow...@bugs.debian.org
immediately.)