[syzbot] [ext4?] BUG: sleeping function called from invalid context in ext4_update_super


syzbot

Jun 10, 2023, 9:52:57 AM
to adilger...@dilger.ca, linux...@vger.kernel.org, linux-...@vger.kernel.org, linux-...@vger.kernel.org, syzkall...@googlegroups.com, ty...@mit.edu
Hello,

syzbot found the following issue on:

HEAD commit: f8dba31b0a82 Merge tag 'asym-keys-fix-for-linus-v6.4-rc5' ..
git tree: upstream
console output: https://syzkaller.appspot.com/x/log.txt?x=11b5d0dd280000
kernel config: https://syzkaller.appspot.com/x/.config?x=7474de833c217bf4
dashboard link: https://syzkaller.appspot.com/bug?extid=4acc7d910e617b360859
compiler: Debian clang version 15.0.7, GNU ld (GNU Binutils for Debian) 2.35.2

Unfortunately, I don't have any reproducer for this issue yet.

Downloadable assets:
disk image: https://storage.googleapis.com/syzbot-assets/92008b448c84/disk-f8dba31b.raw.xz
vmlinux: https://storage.googleapis.com/syzbot-assets/25e27132216c/vmlinux-f8dba31b.xz
kernel image: https://storage.googleapis.com/syzbot-assets/72466b6c1237/bzImage-f8dba31b.xz

IMPORTANT: if you fix the issue, please add the following tag to the commit:
Reported-by: syzbot+4acc7d...@syzkaller.appspotmail.com

EXT4-fs error (device loop4): ext4_get_group_info:331: comm syz-executor.4: invalid group 4294819419
BUG: sleeping function called from invalid context at include/linux/buffer_head.h:404
in_atomic(): 1, irqs_disabled(): 0, non_block: 0, pid: 21305, name: syz-executor.4
preempt_count: 1, expected: 0
RCU nest depth: 0, expected: 0
5 locks held by syz-executor.4/21305:
#0: ffff8880292c8460 (sb_writers#4){.+.+}-{0:0}, at: do_sendfile+0x5fb/0xff0 fs/read_write.c:1253
#1: ffff8880391da200 (&sb->s_type->i_mutex_key#7){++++}-{3:3}, at: inode_lock include/linux/fs.h:775 [inline]
#1: ffff8880391da200 (&sb->s_type->i_mutex_key#7){++++}-{3:3}, at: ext4_buffered_write_iter+0xaf/0x3a0 fs/ext4/file.c:283
#2: ffff8880391d9ec8 (&ei->xattr_sem){++++}-{3:3}, at: ext4_write_lock_xattr fs/ext4/xattr.h:155 [inline]
#2: ffff8880391d9ec8 (&ei->xattr_sem){++++}-{3:3}, at: ext4_convert_inline_data_to_extent fs/ext4/inline.c:584 [inline]
#2: ffff8880391d9ec8 (&ei->xattr_sem){++++}-{3:3}, at: ext4_try_to_write_inline_data+0x51d/0x1360 fs/ext4/inline.c:740
#3: ffff8880391da088 (&ei->i_data_sem){++++}-{3:3}, at: ext4_map_blocks+0x980/0x1cf0 fs/ext4/inode.c:616
#4: ffff88803944f018 (&bgl->locks[i].lock){+.+.}-{2:2}, at: spin_trylock include/linux/spinlock.h:360 [inline]
#4: ffff88803944f018 (&bgl->locks[i].lock){+.+.}-{2:2}, at: ext4_lock_group fs/ext4/ext4.h:3407 [inline]
#4: ffff88803944f018 (&bgl->locks[i].lock){+.+.}-{2:2}, at: ext4_mb_try_best_found+0x1ca/0x5a0 fs/ext4/mballoc.c:2166
Preemption disabled at:
[<0000000000000000>] 0x0
CPU: 0 PID: 21305 Comm: syz-executor.4 Not tainted 6.4.0-rc5-syzkaller-00002-gf8dba31b0a82 #0
Hardware name: Google Google Compute Engine/Google Compute Engine, BIOS Google 05/25/2023
Call Trace:
<TASK>
__dump_stack lib/dump_stack.c:88 [inline]
dump_stack_lvl+0x1e7/0x2d0 lib/dump_stack.c:106
__might_resched+0x5cf/0x780 kernel/sched/core.c:10153
lock_buffer include/linux/buffer_head.h:404 [inline]
ext4_update_super+0x93/0x1230 fs/ext4/super.c:6039
ext4_commit_super+0xd0/0x4c0 fs/ext4/super.c:6117
ext4_handle_error+0x5ee/0x8b0 fs/ext4/super.c:676
__ext4_error+0x277/0x3b0 fs/ext4/super.c:776
ext4_get_group_info+0x382/0x3e0 fs/ext4/balloc.c:331
ext4_mb_new_inode_pa+0x89c/0x1300 fs/ext4/mballoc.c:4915
ext4_mb_try_best_found+0x3a1/0x5a0 fs/ext4/mballoc.c:2171
ext4_mb_regular_allocator+0x3511/0x3c20 fs/ext4/mballoc.c:2784
ext4_mb_new_blocks+0xe5f/0x44a0 fs/ext4/mballoc.c:5843
ext4_alloc_branch fs/ext4/indirect.c:340 [inline]
ext4_ind_map_blocks+0x10d7/0x29e0 fs/ext4/indirect.c:635
ext4_map_blocks+0x9e7/0x1cf0 fs/ext4/inode.c:625
_ext4_get_block+0x238/0x6a0 fs/ext4/inode.c:779
__block_write_begin_int+0x548/0x1a50 fs/buffer.c:2064
ext4_try_to_write_inline_data+0x7ed/0x1360 fs/ext4/inline.c:740
ext4_write_begin+0x290/0x10b0 fs/ext4/inode.c:1147
ext4_da_write_begin+0x300/0xa40 fs/ext4/inode.c:2893
generic_perform_write+0x300/0x5e0 mm/filemap.c:3923
ext4_buffered_write_iter+0x122/0x3a0 fs/ext4/file.c:289
ext4_file_write_iter+0x1d6/0x1930
do_iter_write+0x7b1/0xcb0 fs/read_write.c:860
iter_file_splice_write+0x843/0xfe0 fs/splice.c:795
do_splice_from fs/splice.c:873 [inline]
direct_splice_actor+0xe7/0x1c0 fs/splice.c:1039
splice_direct_to_actor+0x4c4/0xbd0 fs/splice.c:994
do_splice_direct+0x283/0x3d0 fs/splice.c:1082
do_sendfile+0x620/0xff0 fs/read_write.c:1254
__do_sys_sendfile64 fs/read_write.c:1322 [inline]
__se_sys_sendfile64+0x17c/0x1e0 fs/read_write.c:1308
do_syscall_x64 arch/x86/entry/common.c:50 [inline]
do_syscall_64+0x41/0xc0 arch/x86/entry/common.c:80
entry_SYSCALL_64_after_hwframe+0x63/0xcd
RIP: 0033:0x7f0ff0c8c169
Code: 28 00 00 00 75 05 48 83 c4 28 c3 e8 f1 19 00 00 90 48 89 f8 48 89 f7 48 89 d6 48 89 ca 4d 89 c2 4d 89 c8 4c 8b 4c 24 08 0f 05 <48> 3d 01 f0 ff ff 73 01 c3 48 c7 c1 b8 ff ff ff f7 d8 64 89 01 48
RSP: 002b:00007f0ff1944168 EFLAGS: 00000246 ORIG_RAX: 0000000000000028
RAX: ffffffffffffffda RBX: 00007f0ff0dabf80 RCX: 00007f0ff0c8c169
RDX: 0000000000000000 RSI: 0000000000000007 RDI: 0000000000000006
RBP: 00007f0ff0ce7ca1 R08: 0000000000000000 R09: 0000000000000000
R10: 0001000000201005 R11: 0000000000000246 R12: 0000000000000000
R13: 00007ffe35f5084f R14: 00007f0ff1944300 R15: 0000000000022000
</TASK>
BUG: scheduling while atomic: syz-executor.4/21305/0x00000002
5 locks held by syz-executor.4/21305:
#0: ffff8880292c8460 (sb_writers#4){.+.+}-{0:0}, at: do_sendfile+0x5fb/0xff0 fs/read_write.c:1253
#1: ffff8880391da200 (&sb->s_type->i_mutex_key#7){++++}-{3:3}, at: inode_lock include/linux/fs.h:775 [inline]
#1: ffff8880391da200 (&sb->s_type->i_mutex_key#7){++++}-{3:3}, at: ext4_buffered_write_iter+0xaf/0x3a0 fs/ext4/file.c:283
#2: ffff8880391d9ec8 (&ei->xattr_sem){++++}-{3:3}, at: ext4_write_lock_xattr fs/ext4/xattr.h:155 [inline]
#2: ffff8880391d9ec8 (&ei->xattr_sem){++++}-{3:3}, at: ext4_convert_inline_data_to_extent fs/ext4/inline.c:584 [inline]
#2: ffff8880391d9ec8 (&ei->xattr_sem){++++}-{3:3}, at: ext4_try_to_write_inline_data+0x51d/0x1360 fs/ext4/inline.c:740
#3: ffff8880391da088 (&ei->i_data_sem){++++}-{3:3}, at: ext4_map_blocks+0x980/0x1cf0 fs/ext4/inode.c:616
#4: ffff88803944f018 (&bgl->locks[i].lock){+.+.}-{2:2}, at: spin_trylock include/linux/spinlock.h:360 [inline]
#4: ffff88803944f018 (&bgl->locks[i].lock){+.+.}-{2:2}, at: ext4_lock_group fs/ext4/ext4.h:3407 [inline]
#4: ffff88803944f018 (&bgl->locks[i].lock){+.+.}-{2:2}, at: ext4_mb_try_best_found+0x1ca/0x5a0 fs/ext4/mballoc.c:2166
Modules linked in:
Preemption disabled at:
[<0000000000000000>] 0x0


---
This report is generated by a bot. It may contain errors.
See https://goo.gl/tpsmEJ for more information about syzbot.
syzbot engineers can be reached at syzk...@googlegroups.com.

syzbot will keep track of this issue. See:
https://goo.gl/tpsmEJ#status for how to communicate with syzbot.

If the bug is already fixed, let syzbot know by replying with:
#syz fix: exact-commit-title

If you want to change bug's subsystems, reply with:
#syz set subsystems: new-subsystem
(See the list of subsystem names on the web dashboard)

If the bug is a duplicate of another bug, reply with:
#syz dup: exact-subject-of-another-report

If you want to undo deduplication, reply with:
#syz undup

Fabio M. De Francesco

Jun 10, 2023, 4:41:23 PM
to adilger...@dilger.ca, linux...@vger.kernel.org, linux-...@vger.kernel.org, linux-...@vger.kernel.org, syzkall...@googlegroups.com, ty...@mit.edu, syzbot
On Saturday 10 June 2023 15:52:55 CEST syzbot wrote:
> Hello,
>
> syzbot found the following issue on:
>
> HEAD commit: f8dba31b0a82 Merge tag 'asym-keys-fix-for-linus-v6.4-rc5' ..
> git tree: upstream
>
> [...]
>
> Unfortunately, I don't have any reproducer for this issue yet.

Unfortunately :-(

> Downloadable assets:
>
> [...]
Well, I'm new to filesystems. However, I'd like to test a change in
ext4_handle_error().

Currently I see that errors are handled according to the following snippet of
code from the above-mentioned function (please note that we are in atomic context):

	if (continue_fs && journal)
		schedule_work(&EXT4_SB(sb)->s_error_work);
	else
		ext4_commit_super(sb);

If the condition evaluates false, we directly call ext4_commit_super(),
forgetting that, AFAICS, we are in atomic context.

Obviously, we know that ext4_update_super() calls lock_buffer(), which
might_sleep().

As I said, I have only a little experience with filesystems, so my question is:
despite the overhead, can we delete the check and do the following?

diff --git a/fs/ext4/super.c b/fs/ext4/super.c
index 05fcecc36244..574b096de059 100644
--- a/fs/ext4/super.c
+++ b/fs/ext4/super.c
@@ -662,19 +662,8 @@ static void ext4_handle_error(struct super_block *sb, bool force_ro, int error,
 		jbd2_journal_abort(journal, -EIO);
 	}
 
-	if (!bdev_read_only(sb->s_bdev)) {
-		save_error_info(sb, error, ino, block, func, line);
-		/*
-		 * In case the fs should keep running, we need to writeout
-		 * superblock through the journal. Due to lock ordering
-		 * constraints, it may not be safe to do it right here so we
-		 * defer superblock flushing to a workqueue.
-		 */
-		if (continue_fs && journal)
-			schedule_work(&EXT4_SB(sb)->s_error_work);
-		else
-			ext4_commit_super(sb);
-	}
+	if (!bdev_read_only(sb->s_bdev))
+		schedule_work(&EXT4_SB(sb)->s_error_work);
 
 	/*
 	 * We force ERRORS_RO behavior when system is rebooting. Otherwise we

Am I missing something here?
If not, I'll try this diff if and when Syzkaller provides a reproducer.

Thanks,

Fabio
[...]


Fabio M. De Francesco

Jun 10, 2023, 4:49:35 PM
to adilger...@dilger.ca, linux...@vger.kernel.org, linux-...@vger.kernel.org, linux-...@vger.kernel.org, syzkall...@googlegroups.com, ty...@mit.edu, syzbot
Oh, sorry. I forgot something...

diff --git a/fs/ext4/super.c b/fs/ext4/super.c
index 05fcecc36244..ac3a734b7d4d 100644
--- a/fs/ext4/super.c
+++ b/fs/ext4/super.c
@@ -664,16 +664,7 @@ static void ext4_handle_error(struct super_block *sb, bool force_ro, int error,
 
 	if (!bdev_read_only(sb->s_bdev)) {
 		save_error_info(sb, error, ino, block, func, line);
-		/*
-		 * In case the fs should keep running, we need to writeout
-		 * superblock through the journal. Due to lock ordering
-		 * constraints, it may not be safe to do it right here so we
-		 * defer superblock flushing to a workqueue.
-		 */
-		if (continue_fs && journal)
-			schedule_work(&EXT4_SB(sb)->s_error_work);
-		else
-			ext4_commit_super(sb);
+		schedule_work(&EXT4_SB(sb)->s_error_work);
 	}
 
 	/*

Theodore Ts'o

Jun 10, 2023, 11:20:38 PM
to Fabio M. De Francesco, adilger...@dilger.ca, linux...@vger.kernel.org, syzkall...@googlegroups.com, syzbot
(Dropping linux-fsdevel and linux-kernel from the cc list.)

On Sat, Jun 10, 2023 at 10:41:18PM +0200, Fabio M. De Francesco wrote:
> Well, I'm a new to filesystems. However, I'd like to test a change in
> ext4_handle_error().
>
> Currently I see that errors are handled according to the next snippet of code
> from the above-mentioned function (please note that we are in atomic context):
>
> if (continue_fs && journal)
> schedule_work(&EXT4_SB(sb)->s_error_work);
> else
> ext4_commit_super(sb);
>
> If evaluates false, we directly call ext4_commit_super(), forgetting that,
> AFAICS we are in atomic context.
>
> As I said I have only little experience with filesystems, so my question is:
> despite the overhead, can we delete the check and do the following?
>
> [ Unconditionally call schedule_work(&EXT4_SB(sb)->s_error_work) ]

That doesn't work, for the simple reason that it's possible that the file
system might be configured to immediately panic on an error. (See
later in the ext4_handle_error() function, after the check for
test_opt(sb, ERRORS_PANIC).) If that happens, the workqueue will never
have a chance to run. In that case, we have to call
ext4_commit_super().

The real answer here is that ext4_error() must never be called from an
atomic context, and a recent commit 5354b2af3406 ("ext4: allow
ext4_get_group_info() to fail") added a call to ext4_error() which is
problematic since some callers of the ext4_get_group_info() function
may be holding a spinlock. And so the best solution is simply to drop
the call to ext4_error(), since it's not strictly necessary.
If there is an antagonist process which is actively corrupting the
superblock, some other code path will report the fact that the file
system is corrupted soon enough.
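
For illustration, the kind of change described above might look roughly like
this in ext4_get_group_info() (fs/ext4/balloc.c); the surrounding code is
reconstructed from memory and this is not the actual committed fix:

	struct ext4_group_info *ext4_get_group_info(struct super_block *sb,
						     ext4_group_t group)
	{
		struct ext4_group_info **grp_info;
		long indexv, indexh;

		if (unlikely(group >= EXT4_SB(sb)->s_groups_count)) {
			/*
			 * Callers may hold a spinlock (e.g. the block group
			 * lock), so don't call ext4_error() here; just return
			 * NULL and let a caller in sleepable context report
			 * the corruption.
			 */
			return NULL;
		}
		indexv = group >> (EXT4_DESC_PER_BLOCK_BITS(sb));
		indexh = group & ((EXT4_DESC_PER_BLOCK(sb)) - 1);
		grp_info = sbi_array_rcu_deref(EXT4_SB(sb), s_group_info, indexv);
		return grp_info[indexh];
	}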

- Ted

P.S. There is an exception to what I've described above, and that's the
special ext4_grp_locked_error() which is used in fs/ext4/mballoc.c.
But that's a special case which requires very careful handling. In
general, you simply must not be in atomic context when you want to
report a problem.

Fabio M. De Francesco

Jun 11, 2023, 3:05:35 AM
to Theodore Ts'o, adilger...@dilger.ca, linux...@vger.kernel.org, syzkall...@googlegroups.com, syzbot
On Sunday 11 June 2023 05:20:32 CEST Theodore Ts'o wrote:
> (Dropping linux-fsdevel and linux-kernel from the cc list.)
>
> On Sat, Jun 10, 2023 at 10:41:18PM +0200, Fabio M. De Francesco wrote:
> > Well, I'm new to filesystems. However, I'd like to test a change in
> > ext4_handle_error().
> >
> > Currently I see that errors are handled according to the following
> > snippet of code from the above-mentioned function (please note that we
> > are in atomic context):
> >
> > 	if (continue_fs && journal)
> > 		schedule_work(&EXT4_SB(sb)->s_error_work);
> > 	else
> > 		ext4_commit_super(sb);
> >
> > If the condition evaluates false, we directly call ext4_commit_super(),
> > forgetting that, AFAICS, we are in atomic context.
> >
> > As I said, I have only a little experience with filesystems, so my question is:
> > despite the overhead, can we delete the check and do the following?
> >
> > [ Unconditionally call schedule_work(&EXT4_SB(sb)->s_error_work) ]
>
> That doesn't work, for the simple reason that it's possible that file
> system might be configured to immediately panic on an error. (See
> later in the ext4_handle_error() function after the check for
> test_opt(sb, ERRORS_PANIC).

Theodore,

Thanks for pointing out this "detail". I had completely overlooked it due to
lack of experience and because I just spent a few minutes on this. I should have
read the entire function. The end result was that I didn't look at the code in
the final part of ext4_handle_error() :-(

Are you okay with me submitting a patch with your "Suggested-by:" tag? Or
maybe you prefer to take care of it yourself? For now I await your kind reply.

> If that happens, the workqueue will never
> have a chance to run. In that case, we have to call
> ext4_commit_super().
>
> The real answer here is that ext4_error() must never be called from an
> atomic context, and a recent commit 5354b2af3406 ("ext4: allow
> ext4_get_group_info() to fail") added a call to ext4_error() which is
> problematic since some callers of the ext4_get_group_info() function
> may be holding a spinlock. And so the best solution is to just simply
> to drop the call to ext4_error(), since it's not strictly necessary.
> If there is an antagonist process which is actively corrupting the
> superblock, some other code path will report the fact that the file
> system is corrupted soon enough.
>
> - Ted
>
> P.S. There is an exception to what I've described above, and that's
> special ext4_grp_locked_error() which is used in fs/ext4/mballoc.c.
> But that's a special case which requires very careful handling, In
> general, you simply must not be in atomic context when you want to
> report a problem.

Yes, I can understand that we must not be in atomic context to report a
problem.

Can we "reliably" test !in_atomic() and act accordingly? I remember that the
in_atomic() helper cannot always detect atomic contexts.

Anyway, I suppose that this "exception" can be addressed later. Am I wrong to
look at these problems as unrelated, so that we are not forced to fix both of
them at the same time? If you have any suggestions you want to share, I'd be
happy to help with the implementation.

Again thanks,

Fabio



Fabio M. De Francesco

Jun 11, 2023, 5:38:11 AM
to Theodore Ts'o, adilger...@dilger.ca, linux...@vger.kernel.org, syzkall...@googlegroups.com, syzbot
On Sunday 11 June 2023 05:20:32 CEST Theodore Ts'o wrote:
Ted,

I just responded with another message, so I'm waiting to hear from you whether
or not you are okay with me submitting a patch with your "Suggested-by:" tag.

In the meantime, I have had time to think of a different solution that allows
the workqueue the chance to run even if the file system is configured to
immediately panic on error (sorry, no code - I'm in a hurry)...

This can let you leave in that call to ext4_error() that commit 5354b2af3406
("ext4: allow ext4_get_group_info() to fail") introduced (if it works - please
keep on reading).

1) Init a global variable ("glob") and set it to 0.
2) Modify the code of the error handling workqueue to set "glob" to 1 just
before the task is done.
3) Change the fragment that panics the system to call mdelay() in a loop (it
doesn't sleep - correct?) for an adequate amount of time and periodically
check READ_ONCE(glob) == 1. If true, break and then panic; otherwise
reiterate the loop.
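
Very roughly, and completely untested (the identifiers are invented;
flush_stashed_error_work() is only assumed to be the existing error-work
function), I mean something like:

	static int glob;	/* step 1: global flag, initially 0 */

	static void flush_stashed_error_work(struct work_struct *work)
	{
		/* ... existing superblock write-out ... */
		WRITE_ONCE(glob, 1);	/* step 2: signal that the work is done */
	}

	/* step 3: in the ERRORS_PANIC path, busy-wait without sleeping */
	static void ext4_wait_error_work_then_panic(struct super_block *sb)
	{
		unsigned int i;

		for (i = 0; i < 2000; i++) {	/* ~2 seconds at most */
			if (READ_ONCE(glob) == 1)
				break;
			mdelay(1);	/* busy-wait, does not sleep */
		}
		panic("EXT4-fs (device %s): panic forced after error\n", sb->s_id);
	}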

Could it be useful?

> If there is an antagonist process which is actively corrupting the
> superblock, some other code path will report the fact that the file
> system is corrupted soon enough.
>
> - Ted
>
> P.S. There is an exception to what I've described above, and that's
> special ext4_grp_locked_error() which is used in fs/ext4/mballoc.c.
> But that's a special case which requires very careful handling, In
> general, you simply must not be in atomic context when you want to
> report a problem.

Haven't yet had time to look at this. So last answer is the only one I have at
the moment.

Thanks for your time,

Fabio


Theodore Ts'o

Jun 11, 2023, 9:18:34 AM
to Fabio M. De Francesco, adilger...@dilger.ca, linux...@vger.kernel.org, syzkall...@googlegroups.com, syzbot
On Sun, Jun 11, 2023 at 09:05:31AM +0200, Fabio M. De Francesco wrote:
>
> Are you okay with me submitting a patch with your "Suggested by:" tag? Or
> maybe you prefer to take care of it yourself? For now I await your kind reply.

Sure, feel free to create such a patch.

I would strongly recommend that you use gce-xfstests or kvm-xfstests
before submitting ext4 patches. In this particular case, it's a
relatively simple patch, but it's a good habit to get into. See [1]
for more details.

[1] https://thunk.org/gce-xfstests

At the bare minimum, it would be useful for you to run
"{kvm,gce}-xfstests smoke" or, more preferably, "{kvm,gce}-xfstests -c
ext4/4k -g auto". If it's a particularly complex patch series, running
"gce-xfstests -c ext4/all -g auto" is nice, since it can save me a lot
of time when you've introduced a subtle corner case bug that doesn't
show up with lighter testing, and I have to track it down myself. The
latter only costs about $2 --- you could do "kvm-xfstests -c ext4/all
-g auto", but it will take a day or so, and it's not much fun unless
you have dedicated test hardware. So if you can afford the $2, I
strongly recommend using gce-xfstests for the full set of tests. (The
smoke test using gce-xfstests costs a penny or two, last time I priced
it out. But it's not too bad to run it using kvm-xfstests.)


> Can we "reliably" test !in_atomic() and act accordingly? I remember that the
> in_atomic() helper cannot always detect atomic contexts.

No; we can do something like BUG_ON(in_atomic()), but it will only work
if LOCKDEP is enabled, and that generally is not enabled on
production systems.


On Sun, Jun 11, 2023 at 11:38:07AM +0200, Fabio M. De Francesco wrote:
> In the meantime, I have had time to think of a different solution that allows
> the workqueue the chance to run even if the file system is configured to
> immediately panic on error (sorry, no code - I'm in a hurry)...
>
> This can let you leave that call to ext4_error() that commit 5354b2af3406
> ("ext4: allow ext4_get_group_info()") had introduced (if it works - please
> keep on reading).
>
> 1) Init a global variable ("glob") and set it to 0.
> 2) Modify the code of the error handling workqueue to set "glob" to 1, soon
> before the task is done.
> 3) Change the fragment that panics the system to call mdelay() in a loop (it
> doesn't sleep - correct?) for an adequate amount of time and periodically
> check READ_ONCE(global) == 1. If true, break and then panic, otherwise
> reiterate the loop.

Well, it's more than a bit ugly, and it's not necessarily guaranteed
to work. After all, we're doing this to allow ext4_error() to be
called in critical sections. But that means that while we're doing
this mdelay loop, we're holding on to a spinlock. While lockdep isn't
smart enough to catch this particular deadlock, it's still a real
potential problem, which means such a solution is also fragile.

It's better off simply prohibiting ext4_error() to be called while
holding a lock, and in most places this isn't all that hard. Most of
the time, you don't want to hold spinlocks across a long period of
time, because this can become a scalability bottleneck. This means
very often the pattern is something like this:

	spin_lock(..);
	...
	ret = do_stuff();
	spin_unlock(..);

So it's enough to check the error return after you've unlocked the
spinlock. And you can also just _not_ call ext4_error() inside
do_stuff(), but have do_stuff() return an error code, possibly with a
call to ext4_warning() if you want to get some context for the problem
into the kernel log, and then have the top-level function call
ext4_error().
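
A minimal sketch of that pattern (do_stuff(), some_lock, and the message
text are invented; only the structure matters):

	static int do_stuff(struct super_block *sb, ext4_group_t group)
	{
		if (group >= EXT4_SB(sb)->s_groups_count) {
			/* ext4_warning() just logs, so it should be fine here */
			ext4_warning(sb, "invalid group %u", group);
			return -EFSCORRUPTED;
		}
		/* ... the real work, still under the caller's spinlock ... */
		return 0;
	}

	/* caller */
	spin_lock(&some_lock);
	ret = do_stuff(sb, group);
	spin_unlock(&some_lock);
	if (ret)
		ext4_error(sb, "corrupted group %u", group);	/* safe: no lock held */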

Cheers,

- Ted

Fabio M. De Francesco

Jun 11, 2023, 3:16:00 PM
to Theodore Ts'o, adilger...@dilger.ca, linux...@vger.kernel.org, syzkall...@googlegroups.com, syzbot
On Sunday 11 June 2023 15:18:29 CEST Theodore Ts'o wrote:
> On Sun, Jun 11, 2023 at 09:05:31AM +0200, Fabio M. De Francesco wrote:
> > Are you okay with me submitting a patch with your "Suggested by:" tag? Or
> > maybe you prefer to take care of it yourself? For now I await your kind
> > reply.
> Sure, feel free to create such a patch.

Thanks!

Let me summarize, just to be sure we don't misunderstand each other...

To start off, I'll send out _only_ the patch for the bug reported by Syzbot,
the one about dropping the call to ext4_error() in ext4_get_group_info().

I'll do this by Tuesday. (Sorry, I cannot do it by Monday because I must pass
an exam and an interview for a job).

However, for the other problems with ext4_grp_locked_error() that you noticed
in the final part of your first message in this thread, I'll need some more
days to better understand the context I'm working in.

> I would strongly recommend that you use gce-xfstests or kvm-xfstests
> before submitting ext4 patches. In this particular case, it's a
> relatively simple patch, but it's a good habit to get into. See [1]
> for more details.
>
> [1] https://thunk.org/gce-xfstests

Thanks also for this information.

In the last year I have been sending several patches for filesystems,
several little things that comprise conversions (sometimes with additional
related work, like a series of 5 patches to fs/sysv).

The only tools I've ever needed were running (x)fstests on a QEMU/KVM x86_32
VM, 6GB RAM, booting a Kernel with HIGHMEM64G enabled (for PAE).

I'd never heard about {kvm,gce}-xfstests. I'll have a look at the link and try
these tests. I can afford $2, but at the moment I don't want to assemble
dedicated hardware for tests.

What's the problem with running those tests in QEMU/KVM VMs???

Ah, I just recalled that in December 2022 I sent you three conversions from the
old kmap{,_atomic}() that were reviewed by Jan K. and Ira W. They were tested
with fstests as I explained above. That patch got lost; in fact, I still see at
least one kmap_atomic() call site in ext4. One call site is no longer in ext4.
The third today uses kmap_local_folio().

Well, everybody dislikes seeing their patches completely ignored
(https://lore.kernel.org/lkml/20221231174439.85...@gmail.com/). If you
wanted, you could at least apply the part regarding the kmap_atomic()
conversion in ext4_journalled_write_inline_data().

Or I can convert the call to kmap_atomic() to kmap_local_folio(), if you'd
like. It's up to you; please let me know.

> At the bare minimum, it would be useful for you to run
> "{kvm,gce}-xfstests smoke" or more preferably, "{kvm,gce}-xfstests -c
> ext4/4k -g auto". If it's a particular complex patch series, running
> "gce-xfstests -c ext4/all -g auto" is nice, since it can save me a lot
> of time when y ou've introduced a subtle corner case bug that doesn't
> show up with lighter testing, and I have to track it down myself. The
> latter only costs about $2 --- you could do "kvm-xfstests -c ext4/all
> -g auto", but it will take a day or so, and it's not much fun unless
> you have dedicated test hardware. So if you can afford the $2, I
> strongly recommend using gce-xfstests for the full set of tests. (The
> smoke test using gce-xfstests costs a penny or two, last time I priced
> it out. But it's not too bad to run it using kvm-xfstests.)

OK, again thanks for this additional information. I'll surely use all of it.

>
> > Can we "reliably" test !in_atomic() and act accordingly? I remember that
> > the in_atomic() helper cannot always detect atomic contexts.
>
> No; we can do something like BUG_ON(in_atomic(),

OK, it's the same check, just seen from the other side.

> but it will only work
> if LOCKDEP is enabled,

I don't know what others do. I always enable LOCKDEP, even in production
systems.

> and that's generally is not enabled on
> production systems.

Ah, OK.

> On Sun, Jun 11, 2023 at 11:38:07AM +0200, Fabio M. De Francesco wrote:
> > In the meantime, I have had time to think of a different solution that
> > allows
> > the workqueue the chance to run even if the file system is configured to
> > immediately panic on error (sorry, no code - I'm in a hurry)...
> >
> > This can let you leave that call to ext4_error() that commit 5354b2af3406
> > ("ext4: allow ext4_get_group_info()") had introduced (if it works - please
> > keep on reading).
> >
> > 1) Init a global variable ("glob") and set it to 0.
> > 2) Modify the code of the error handling workqueue to set "glob" to 1
> >    just before the task is done.
> > 3) Change the fragment that panics the system to call mdelay() in a loop
> >    (it doesn't sleep - correct?) for an adequate amount of time and
> >    periodically check READ_ONCE(glob) == 1. If true, break and then
> >    panic; otherwise reiterate the loop.
>
> Well, it's more than a bit ugly,

I fully agree. I'm still in search of a reliable way to let atomic context run
idle while waiting for a status change. I am talking about atomic code in paths
where we don't mind if they loop for a few seconds (like the case at hand - we
don't care whether we panic immediately or within a second or two). Do we care?

> and it's not necessarily guaranteed
> to work.

Why not? AFAIK, mdelay() doesn't trigger SAC (sleep-in-atomic-context) bugs
because it doesn't sleep, very differently from msleep(). What's the problem
with panicking a little later without causing bugs? Is there any other means
for a workqueue to signal, in atomic context, that it is done with the
assigned work?

> After all we're doing this to allow ext4_error() to be
> called in critical sections. But that means that while we're doing
> this mdelay loop, we're holding on to a spinlock. While lockdep isn't
> smart enough to catch this particular deadlock, it's still a real
> potential problem, which means such a solution is also fragile.
>
> It's better off simply prohibiting ext4_error() to be called while
> holding a lock, and in most places this isn't all that hard.

I'll do as you asked, but (out of curiosity, and especially for the purpose of
stealing knowledge from very experienced kernel hackers like you) can you
please set aside some more minutes to answer all my questions above?

Thanks in advance,

Fabio

> Most of
> the time, you don't want to hold spinlocks across a long period of
> time, because this can become a scalability bottleneck. This means
> very often the pattern is something like this:
>
> spin_lock(..);
> ...
> ret = do_stuff();
> spin_unlock(..);
>
> So it's enough to check the error return after you've unlocked the
> spinlock. And you can also just _not_ call ext4_error() inside
> do_stuff(), but have do_stuff return an error code, possibly with a
> call to ext4_warning() if you want to get some context for the problem
> into the kernel log, and then have the top-level function call
> ext4_error().

OK, yours is the right solution. Furthermore, I didn't know about the
importance of reporting the errors exactly where they currently occur. As said,
I'll follow your suggestion, but I'd like to know the answers to my questions,
please :-)

>
> Cheers,
>
> - Ted

Again, thanks for all the time you are devoting to a newcomer like I am.

Fabio


Theodore Ts'o

Jun 11, 2023, 8:19:26 PM
to Fabio M. De Francesco, adilger...@dilger.ca, linux...@vger.kernel.org, syzkall...@googlegroups.com, syzbot
On Sun, Jun 11, 2023 at 09:15:56PM +0200, Fabio M. De Francesco wrote:
>
> Thanks!
>
> Let me summarize, just to be sure we don't misunderstand each other...
>
> To start off, I'll send out _only_ the patch for the bug reported by Syzbot,
> the one about dropping the call to ext_error() in ext4_get_group_info().
>
> I'll do this by Tuesday. (Sorry, I cannot do it by Monday because I must pass
> an exam and an interview for a job).

Sure, that'll be fine.

> However, on the other problems with ext4_grp_locked_error() that you noticed
> in the final part of your first message in this thread I'll need some days
> more to better understand the context I'm working in.

Um, I'm not sure what problems you're referring to. What I said is
that it works, but you just have to be careful in how you use it (and
the current callers in mballoc.c are careful).

And similarly, I don't think it's a problem that you need to be
careful not to call ext4_error() from an atomic context. You need to
be careful, and sometimes we screw up. But in this particular case,
it's pretty obvious how to fix it, and we don't even need a syzkaller
reproducer. :-)

>
> > I would strongly recommend that you use gce-xfstests or kvm-xfstests
> > before submitting ext4 patches. In this particular case, it's a
> > relatively simple patch, but it's a good habit to get into. See [1]
> > for more details.
> >
> > [1] https://thunk.org/gce-xfstests
>
> Thanks also for these information.
>
> In the last year I have been sending several patches patches for filesystems,
> several little things that comprise conversions (sometimes with additional
> related work, like a series of 5 patches to fs/sysv).
>
> The only tools I've ever needed were running (x)fstests on a QEMU/KVM x86_32
> VM, 6GB RAM, booting a Kernel with HIGHMEM64G enabled (for PAE).
>
> Never heard about {kvm,gce}-xfstests. I'll have a look at the link and try
> these tests. I can afford $2, but at the moment I don't want to assemble a
> dedicated hardware for tests.
>
> What's the problem with running those tests in QEMU/KVM VMs???

kvm-xfstests is essentially running the tests using qemu. Its
advantage is that it's a pre-packaged test appliance, so people don't
need to go through extra effort to set it up. Also, there are pre-set
configurations which are useful for thoroughly testing a particular
file system, with multiple configurations (e.g., 4k block size, 1k
block size, with fscrypt enabled, with bigalloc enabled, etc., etc.)
There are a dozen configs that I run by default, which is what "-c
ext4/all" does. From test-appliance/files/root/fs/ext4/cfg/all.list,
you'll see that the following configs are run:

4k, 1k, ext3, encrypt, nojournal, ext3conv, adv, dioread_nolock,
data_journal, bigalloc_4k, bigalloc_1k, dax

"kvm-xfstests smoke" is syntactic sugar for "kvm-xfstests -c ext4/4k -g quick".

The advantage of using gce-xfstests is that when you give a list of
configs to run, such as "gce-xfstests -c ext4/all -g auto", it will
start up separate VM's for each configuration, and so the test runs
for these configs are run in parallel. This takes about an hour and
45 minutes (which is the long pole time for running the ext4/1k
config). If you use kvm-xfstests it will run the tests sequentially
and so "kvm-xfstests -c ext4/all -g auto" will take about 24 hours,
plus or minus.

The other advantage of gce-xfstests is that you can also do test runs
on arm64 platforms. Leah, one of the XFS stable backports
maintainers, uses gce-xfstests to test backports of various xfs bug
fixes to the LTS kernels, and it's handy to be able to test on both
x86_64 and arm64 to find potential bugs that might work on Intel/AMD
platforms, but which might be problematic on say arm64. You can
actually use kvm-xfstests to test an arm64 kernel, but it will have to
use qemu to emulate the arm64 platforms, and so this can be slow.

(You can also use kvm-xfstests to test i386 kernels, if you care about
testing compat API's on 32-bit kernels. I used to do xfstests runs
using 32-bit x86, but I've gotten too busy, so I've stopped doing
that.)

There are also some useful bits for automatically configuring a
kernel, using the install-kconfig script, and for building the kernel for
use with kvm- and gce-xfstests automatically, using the kbuild script.
There's nothing "magic" about these scripts, but the fact that you can do
things like "install-kconfig --lockdep" or "install-kconfig --kasan",
etc., and that you can also build for arm64 automatically using "kbuild
--arch arm64", is handy.

Basically, these have all been done to optimize for developer velocity,
since I use them every day in my developer workflow.

> Ah, I just recalled that in December 2022 I sent you three conversion from the
> old kmap{,_atomic} that went reviewed by Jan K. and Ira W.. It was tested with
> fstests as I explained above. That patch got lost, in fact I still see at
> least a kmap_atomic() call site in ext4. One call site is no more in ext4. The
> third today uses kmap_local_folio().
>
> Well, everybody dislike to see their patches completely ignored (https://
> lore.kernel.org/lkml/20221231174439.85...@gmail.com/). If you
> wanted you could at least apply the part regarding the kmap_atomic()
> conversion in ext4_journalled_write_inline_data().

Sorry, I can get pretty overloaded. (For example, I've spent most of
my free time this weekend hunting down bugs in other people's patch
sets because they hadn't run "gce-xfstests -c ext4/all -g auto". :-)
In this particular case, you sent it while I was on a Panama Canal
cruise, and while I did see the commit, it was lower priority because
kmap_atomic(), kmap_local_page(), etc., are all no-ops on 64-bit
systems, and quite frankly, issues which are i386-specific are the
first to get load-shed when I have high priority tasks on my plate.

As it turns out, there is a patch in the dev branch (the dev branch
contains those changes that will be sent to Linus at the next merge
window, assuming I can flush out all of the bugs that were missed
during code review) that completely gets rid of
ext4_journalled_write_inline_data(), and a quick check of the dev
branch at the moment shows that we don't have any kmap_atomic() calls
in the tree.

> > but it will only work
> > if LOCKDEP is enabled,
>
> I don't know what others do. I always enable LOCKDEP, also in production
> systems.

Enabling LOCKDEP doubles the run time for running "gce-xfstests -c
ext4/all -g auto". It *TRIPLES* the run time for running
"gce-xfstests -c xfs/all -g auto". This is why production kernels
generally don't enable lockdep. Most people (especially the finance
folks who are trying to optimize costs for either bare metal hardware
or cloud VM's) don't like giving away 50% of their performance (or
potentially even more) on their production servers....

> I fully agree. I'm still in search of a reliable way to let atomic context run
> idle waiting for a status change. I am talking about atomic code in paths that
> we don't mind if they loop for few seconds (like the case at hand - we don't
> care to panic immediately or within a sec or two. Do we care???

The problem is that you might be holding a spinlock that is needed for
the workqueue function to complete. In that case, you could deadlock
**forever**, so it's not a matter of a second or two, but waiting
until the heat death of the universe. :-)

It's probably safe given that the workqueue function doesn't need to
take the group lock, so for this particular case, it's probably OK.
But as a general solution, if you happened to be holding the j_state
lock when you called ext4_error(), then when the workqueue function
tried to commit the superblock change, it would need the j_state lock,
and it would wait forever, while your mdelay() loop would be waiting for
the workqueue function to complete... and deadlock.
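
To illustrate (the call sites here are hypothetical):

	/*
	 *   CPU 0 (error path)                  CPU 1 (error workqueue)
	 *   ------------------                  -----------------------
	 *   write_lock(&journal->j_state_lock);
	 *   ext4_error(...)
	 *     -> mdelay() loop, waiting for      work function runs, needs
	 *        the workqueue to set the flag   j_state_lock to write out
	 *                                        the superblock -> blocks
	 *
	 * CPU 0 never releases j_state_lock, and CPU 1 never sets the flag:
	 * deadlock.
	 */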

Worse, lockdep won't actually detect this problem, which is why I
called it inherently fragile. Moreover, using mdelay() loops to
"solve" this kind of locking problem is considered "in poor taste".
For an experienced developer, this sort of thing will make them throw
up in their mouth a little, because it's not terribly clean, even when
it works. So when Linus says that he wants maintainers to have "good
taste", that's the sort of thing that he's talking about.

Remember, spinlocks are supposed to be held for a very short amount of
time. If you are waiting for something to complete while holding a
spinlock, even if it's in a "legal" way (for example, by trying to get
another spinlock), this may cause a lot of other CPUs running on behalf
of other threads to also block waiting for the spinlock to be
released. Consider that modern machines can have 128 or 256 or more
cores. If you have a large number of threads just twiddling their
thumbs, spinning waiting on a spinlock, while an mdelay loop is
waiting hundreds of milliseconds for I/O to complete... your computer
will be more like a space heater than a device which allows users to
get their work done.

So the question is not how to find a "reliable way to let atomic
context run idle waiting for a status change". That's the wrong
question. The better question is: "how do you restructure code
running in an atomic context so it doesn't need to wait for a status
change"?

Cheers,

- Ted

Fabio M. De Francesco

Jun 14, 2023, 3:38:04 PM
to Theodore Ts'o, adilger...@dilger.ca, linux...@vger.kernel.org, syzkall...@googlegroups.com, syzbot
On Monday 12 June 2023 02:19:21 CEST Theodore Ts'o wrote:
> On Sun, Jun 11, 2023 at 09:15:56PM +0200, Fabio M. De Francesco wrote:
> > Thanks!
> >
> > Let me summarize, just to be sure we don't misunderstand each other...
> >
> > To start off, I'll send out _only_ the patch for the bug reported by
> > Syzbot, the one about dropping the call to ext4_error() in
> > ext4_get_group_info().
> >
> > I'll do this by Tuesday. (Sorry, I cannot do it by Monday because I must
> > pass an exam and an interview for a job).

Ted,

Sorry, I sent the patch this morning (local time), that is, one day late :-(

It's at https://lore.kernel.org/lkml/20230614100446.143...@gmail.com/

> Sure, that'll be fine.
>
> > However, for the other problems with ext4_grp_locked_error() that you
> > noticed in the final part of your first message in this thread, I'll need
> > some more days to better understand the context I'm working in.
>
> Um, I'm not sure what problems you're referring to. What I said is
> that it works, but you just have to be careful in how you use it (and
> the current callers in mballoc.c are careful).

My poor English made me misunderstand what you wrote in the final part of
your first email. My fault, again sorry.

> And similarly, I don't think it's a problem that you need to be
> careful not to call ext4_error() from an atomic context. You need to
> be careful, and sometimes we screw up. But in this particular case,
> it's pretty obvious how to fix it, and we don't even need a syzkaller
> reproducer. :-)

Sure.

> > > I would strongly recommend that you use gce-xfstests or kvm-xfstests
> > > before submitting ext4 patches.

I'll surely do so next time, but this was too trivial to necessitate any
testing. Do you agree?

> > > In this particular case, it's a
> > > relatively simple patch, but it's a good habit to get into. See [1]
> > > for more details.
> > >
> > > [1] https://thunk.org/gce-xfstests
> >
> > Thanks also for these information.
> >

Well, I think that this means that you indeed agree for this particular case
:-)

> > I'm still in search of a reliable way to let atomic context
> > run idle waiting for a status change.

[...]

> So the question is not how to find a "reliable way to let atomic
> context run idle waiting for a status change". That's the wrong
> question. The better question is: "how do you restructure code
> running in an atomic context so it doesn't need to wait for a status
> change"?
>
> Cheers,
>
> - Ted

Very interesting discussion.
I trimmed the quoted details only to shorten this email.

Again thanks for your precious help,

Fabio


