
Re: btrfs bio linked list corruption.


Al Viro
Oct 11, 2016, 11:20:06 AM
On Tue, Oct 11, 2016 at 10:45:08AM -0400, Dave Jones wrote:
> This is from Linus' current tree, with Al's iovec fixups on top.

Those iovec fixups are in the current tree... TBH, I don't see anything
in splice-related stuff that could come anywhere near that (short of
some general memory corruption having random effects of that sort).

Could you try to bisect that sucker, or is it too hard to reproduce?

Dave Jones
Oct 11, 2016, 11:30:06 AM
On Tue, Oct 11, 2016 at 04:11:39PM +0100, Al Viro wrote:
> On Tue, Oct 11, 2016 at 10:45:08AM -0400, Dave Jones wrote:
> > This is from Linus' current tree, with Al's iovec fixups on top.
>
> Those iovec fixups are in the current tree...

ah yeah, git quietly dropped my local copy when I rebased so I didn't notice.

> TBH, I don't see anything
> in splice-related stuff that could come anywhere near that (short of
> some general memory corruption having random effects of that sort).
>
> Could you try to bisect that sucker, or is it too hard to reproduce?

Only hit it the once overnight so far. Will see if I can find a better way to
reproduce today.

Dave

Dave Jones
Oct 11, 2016, 11:30:06 AM
This is from Linus' current tree, with Al's iovec fixups on top.

------------[ cut here ]------------
WARNING: CPU: 1 PID: 3673 at lib/list_debug.c:33 __list_add+0x89/0xb0
list_add corruption. prev->next should be next (ffffe8ffff806648), but was ffffc9000067fcd8. (prev=ffff880503878b80).
CPU: 1 PID: 3673 Comm: trinity-c0 Not tainted 4.8.0-think+ #13
ffffc90000d87458 ffffffff8d32007c ffffc90000d874a8 0000000000000000
ffffc90000d87498 ffffffff8d07a6c1 0000002100000246 ffff88050388e880
ffff880503878b80 ffffe8ffff806648 ffffe8ffffc06600 ffff880502808008
Call Trace:
[<ffffffff8d32007c>] dump_stack+0x4f/0x73
[<ffffffff8d07a6c1>] __warn+0xc1/0xe0
[<ffffffff8d07a73a>] warn_slowpath_fmt+0x5a/0x80
[<ffffffff8d33e689>] __list_add+0x89/0xb0
[<ffffffff8d30a1c8>] blk_sq_make_request+0x2f8/0x350
[<ffffffff8d2fd9cc>] ? generic_make_request+0xec/0x240
[<ffffffff8d2fd9d9>] generic_make_request+0xf9/0x240
[<ffffffff8d2fdb98>] submit_bio+0x78/0x150
[<ffffffff8d349c05>] ? __percpu_counter_add+0x85/0xb0
[<ffffffffc03627de>] btrfs_map_bio+0x19e/0x330 [btrfs]
[<ffffffffc03289ca>] btree_submit_bio_hook+0xfa/0x110 [btrfs]
[<ffffffffc034ff15>] submit_one_bio+0x65/0xa0 [btrfs]
[<ffffffffc0358cb0>] read_extent_buffer_pages+0x2f0/0x3d0 [btrfs]
[<ffffffffc0327020>] ? free_root_pointers+0x60/0x60 [btrfs]
[<ffffffffc03283c8>] btree_read_extent_buffer_pages.constprop.55+0xa8/0x110 [btrfs]
[<ffffffffc0328bcd>] read_tree_block+0x2d/0x50 [btrfs]
[<ffffffffc03080a4>] read_block_for_search.isra.33+0x134/0x330 [btrfs]
[<ffffffff8d7c2d6c>] ? _raw_write_unlock+0x2c/0x50
[<ffffffffc0302fec>] ? unlock_up+0x16c/0x1a0 [btrfs]
[<ffffffffc030a3d0>] btrfs_search_slot+0x450/0xa40 [btrfs]
[<ffffffffc0324983>] btrfs_del_csums+0xe3/0x2e0 [btrfs]
[<ffffffffc03134fd>] __btrfs_free_extent.isra.82+0x32d/0xc90 [btrfs]
[<ffffffffc03178b3>] __btrfs_run_delayed_refs+0x4d3/0x1010 [btrfs]
[<ffffffff8d33e5d7>] ? debug_smp_processor_id+0x17/0x20
[<ffffffff8d0c6109>] ? get_lock_stats+0x19/0x50
[<ffffffffc031b32c>] btrfs_run_delayed_refs+0x9c/0x2d0 [btrfs]
[<ffffffffc033d628>] btrfs_truncate_inode_items+0x888/0xda0 [btrfs]
[<ffffffffc033dc25>] btrfs_truncate+0xe5/0x2b0 [btrfs]
[<ffffffffc033e569>] btrfs_setattr+0x249/0x360 [btrfs]
[<ffffffff8d1f4092>] notify_change+0x252/0x440
[<ffffffff8d1d164e>] do_truncate+0x6e/0xc0
[<ffffffff8d1d1a4c>] do_sys_ftruncate.constprop.19+0x10c/0x170
[<ffffffff8d33e5f3>] ? __this_cpu_preempt_check+0x13/0x20
[<ffffffff8d1d1ad9>] SyS_ftruncate+0x9/0x10
[<ffffffff8d00259c>] do_syscall_64+0x5c/0x170
[<ffffffff8d7c2f8b>] entry_SYSCALL64_slow_path+0x25/0x25
---[ end trace 906673a2f703b373 ]---

Dave Jones
Oct 11, 2016, 12:00:06 PM
On Tue, Oct 11, 2016 at 11:20:41AM -0400, Chris Mason wrote:
> This call trace is reading metadata so we can finish the truncate. I'd
> say adding more memory pressure would make it happen more often.

That story checks out. There were a bunch of OOMs in the log before this.

Dave

Dave Jones
Oct 11, 2016, 12:30:05 PM
On Tue, Oct 11, 2016 at 11:54:09AM -0400, Chris Mason wrote:
>
>
> On 10/11/2016 10:45 AM, Dave Jones wrote:
> > This is from Linus' current tree, with Al's iovec fixups on top.
> >
> > ------------[ cut here ]------------
> > WARNING: CPU: 1 PID: 3673 at lib/list_debug.c:33 __list_add+0x89/0xb0
> > list_add corruption. prev->next should be next (ffffe8ffff806648), but was ffffc9000067fcd8. (prev=ffff880503878b80).
> > CPU: 1 PID: 3673 Comm: trinity-c0 Not tainted 4.8.0-think+ #13
> > ffffc90000d87458 ffffffff8d32007c ffffc90000d874a8 0000000000000000
> > ffffc90000d87498 ffffffff8d07a6c1 0000002100000246 ffff88050388e880
> > ffff880503878b80 ffffe8ffff806648 ffffe8ffffc06600 ffff880502808008
> > Call Trace:
> > [<ffffffff8d32007c>] dump_stack+0x4f/0x73
> > [<ffffffff8d07a6c1>] __warn+0xc1/0xe0
> > [<ffffffff8d07a73a>] warn_slowpath_fmt+0x5a/0x80
> > [<ffffffff8d33e689>] __list_add+0x89/0xb0
> > [<ffffffff8d30a1c8>] blk_sq_make_request+0x2f8/0x350
>
> /*
> * A task plug currently exists. Since this is completely lockless,
> * utilize that to temporarily store requests until the task is
> * either done or scheduled away.
> */
> plug = current->plug;
> if (plug) {
> blk_mq_bio_to_request(rq, bio);
> if (!request_count)
> trace_block_plug(q);
>
> blk_mq_put_ctx(data.ctx);
>
> if (request_count >= BLK_MAX_REQUEST_COUNT) {
> blk_flush_plug_list(plug, false);
> trace_block_plug(q);
> }
>
> list_add_tail(&rq->queuelist, &plug->mq_list);
> ^^^^^^^^^^^^^^^^^^^^^^
>
> Dave, is this where we're crashing? This seems strange.

According to objdump -S ..


ffffffff8130a1b7: 48 8b 70 50 mov 0x50(%rax),%rsi
list_add_tail(&rq->queuelist, &ctx->rq_list);
ffffffff8130a1bb: 48 8d 50 48 lea 0x48(%rax),%rdx
ffffffff8130a1bf: 48 89 45 a8 mov %rax,-0x58(%rbp)
ffffffff8130a1c3: e8 38 44 03 00 callq ffffffff8133e600 <__list_add>
blk_mq_hctx_mark_pending(hctx, ctx);
ffffffff8130a1c8: 48 8b 45 a8 mov -0x58(%rbp),%rax
ffffffff8130a1cc: 4c 89 ff mov %r15,%rdi

That looks like the list_add_tail from __blk_mq_insert_req_list

Dave

Dave Jones
Oct 12, 2016, 9:50:06 AM
On Tue, Oct 11, 2016 at 11:54:09AM -0400, Chris Mason wrote:
>
>
> On 10/11/2016 10:45 AM, Dave Jones wrote:
> > This is from Linus' current tree, with Al's iovec fixups on top.
> >
> > ------------[ cut here ]------------
> > WARNING: CPU: 1 PID: 3673 at lib/list_debug.c:33 __list_add+0x89/0xb0
> > list_add corruption. prev->next should be next (ffffe8ffff806648), but was ffffc9000067fcd8. (prev=ffff880503878b80).
> > CPU: 1 PID: 3673 Comm: trinity-c0 Not tainted 4.8.0-think+ #13
> > ffffc90000d87458 ffffffff8d32007c ffffc90000d874a8 0000000000000000
> > ffffc90000d87498 ffffffff8d07a6c1 0000002100000246 ffff88050388e880

I hit this again overnight, it's the same trace, the only difference
being slightly different addresses in the list pointers:

[42572.777196] list_add corruption. prev->next should be next (ffffe8ffff806648), but was ffffc90000647cd8. (prev=ffff880503a0ba00).

I'm actually a little surprised that ->next was the same across two
reboots on two different kernel builds. That might be a sign this is
more repeatable than I'd thought, even if it does take hours of runtime
right now to trigger it. I'll try and narrow the scope of what trinity
is doing to see if I can make it happen faster.

Dave

Dave Jones
Oct 12, 2016, 10:50:06 AM
.. and of course the first thing that happens is a completely different
btrfs trace..


WARNING: CPU: 1 PID: 21706 at fs/btrfs/transaction.c:489 start_transaction+0x40a/0x440 [btrfs]
CPU: 1 PID: 21706 Comm: trinity-c16 Not tainted 4.8.0-think+ #14
ffffc900019076a8 ffffffffb731ff3c 0000000000000000 0000000000000000
ffffc900019076e8 ffffffffb707a6c1 000001e9f5806ce0 ffff8804f74c4d98
0000000000000801 ffff880501cfa2a8 000000000000008a 000000000000008a

Call Trace:
[<ffffffffb731ff3c>] dump_stack+0x4f/0x73
[<ffffffffb707a6c1>] __warn+0xc1/0xe0
[<ffffffffb707a7e8>] warn_slowpath_null+0x18/0x20
[<ffffffffc01d312a>] start_transaction+0x40a/0x440 [btrfs]
[<ffffffffc01a6215>] ? btrfs_alloc_path+0x15/0x20 [btrfs]
[<ffffffffc01d31b2>] btrfs_join_transaction+0x12/0x20 [btrfs]
[<ffffffffc01d92cf>] cow_file_range_inline+0xef/0x830 [btrfs]
[<ffffffffc01d9d75>] cow_file_range.isra.64+0x365/0x480 [btrfs]
[<ffffffffb77c24cc>] ? _raw_spin_unlock+0x2c/0x50
[<ffffffffc01f229f>] ? release_extent_buffer+0x9f/0x110 [btrfs]
[<ffffffffc01da299>] run_delalloc_nocow+0x409/0xbd0 [btrfs]
[<ffffffffb70c6109>] ? get_lock_stats+0x19/0x50
[<ffffffffc01dadea>] run_delalloc_range+0x38a/0x3e0 [btrfs]
[<ffffffffc01f4aba>] writepage_delalloc.isra.47+0x10a/0x190 [btrfs]
[<ffffffffc01f7678>] __extent_writepage+0xd8/0x2c0 [btrfs]
[<ffffffffc01f7b2e>] extent_write_cache_pages.isra.44.constprop.63+0x2ce/0x430 [btrfs]
[<ffffffffb733e497>] ? debug_smp_processor_id+0x17/0x20
[<ffffffffb70c6109>] ? get_lock_stats+0x19/0x50
[<ffffffffc01f8278>] extent_writepages+0x58/0x80 [btrfs]
[<ffffffffc01d7a80>] ? btrfs_releasepage+0x40/0x40 [btrfs]
[<ffffffffc01d4a63>] btrfs_writepages+0x23/0x30 [btrfs]
[<ffffffffb7162e9c>] do_writepages+0x1c/0x30
[<ffffffffb71550f1>] __filemap_fdatawrite_range+0xc1/0x100
[<ffffffffb71551ee>] filemap_fdatawrite_range+0xe/0x10
[<ffffffffc01eb2fb>] btrfs_fdatawrite_range+0x1b/0x50 [btrfs]
[<ffffffffc01f0820>] btrfs_wait_ordered_range+0x40/0x100 [btrfs]
[<ffffffffc01eb5d5>] btrfs_sync_file+0x285/0x390 [btrfs]
[<ffffffffb7207626>] vfs_fsync_range+0x46/0xa0
[<ffffffffb72076d8>] do_fsync+0x38/0x60
[<ffffffffb720795b>] SyS_fsync+0xb/0x10
[<ffffffffb700259c>] do_syscall_64+0x5c/0x170
[<ffffffffb77c2e4b>] entry_SYSCALL64_slow_path+0x25/0x25

Dave Jones
Oct 13, 2016, 2:20:07 PM
On Wed, Oct 12, 2016 at 10:42:46AM -0400, Chris Mason wrote:
> This isn't even IO. Uuughhhh. We're going to need a fast enough test
> that we can bisect.

Progress...
I've found that this combination of syscalls..

./trinity -C64 -q -l off -a64 --enable-fds=testfile -c fsync -c fsetxattr -c lremovexattr -c pwritev2

hits one of these two bugs in a few minutes runtime.

Just the xattr syscalls + fsync isn't enough, neither is just pwrite + fsync.
Mix them together though, and something goes awry.

Dave

Dave Jones
Oct 13, 2016, 6:30:05 PM
On Thu, Oct 13, 2016 at 05:18:46PM -0400, Chris Mason wrote:

> > > > WARNING: CPU: 1 PID: 21706 at fs/btrfs/transaction.c:489 start_transaction+0x40a/0x440 [btrfs]
> > > > CPU: 1 PID: 21706 Comm: trinity-c16 Not tainted 4.8.0-think+ #14
> > > > ffffc900019076a8 ffffffffb731ff3c 0000000000000000 0000000000000000
> > > > ffffc900019076e8 ffffffffb707a6c1 000001e9f5806ce0 ffff8804f74c4d98
> > > > 0000000000000801 ffff880501cfa2a8 000000000000008a 000000000000008a
> > >
> > > This isn't even IO. Uuughhhh. We're going to need a fast enough test
> > > that we can bisect.
> >
> > Progress...
> > I've found that this combination of syscalls..
> >
> > ./trinity -C64 -q -l off -a64 --enable-fds=testfile -c fsync -c fsetxattr -c lremovexattr -c pwritev2
> >
> > hits one of these two bugs in a few minutes runtime.
> >
> > Just the xattr syscalls + fsync isn't enough, neither is just pwrite + fsync.
> > Mix them together though, and something goes awry.
> >
>
> Hasn't triggered here yet. I'll leave it running though.

With that combo of params I triggered it 3-4 times in a row within minutes.. Then
as soon as I posted, it stopped being so easy to repro.

There's some other variable I haven't figured out yet (maybe the random way that
files get opened in fds/testfiles.c), but it does seem to point at the xattr changes.

I'll poke at it some more tomorrow.

Dave

Dave Jones
Oct 15, 2016, 8:50:05 PM
On Thu, Oct 13, 2016 at 05:18:46PM -0400, Chris Mason wrote:

> > > > .. and of course the first thing that happens is a completely different
> > > > btrfs trace..
> > > >
> > > >
> > > > WARNING: CPU: 1 PID: 21706 at fs/btrfs/transaction.c:489 start_transaction+0x40a/0x440 [btrfs]
> > > > CPU: 1 PID: 21706 Comm: trinity-c16 Not tainted 4.8.0-think+ #14
> > > > ffffc900019076a8 ffffffffb731ff3c 0000000000000000 0000000000000000
> > > > ffffc900019076e8 ffffffffb707a6c1 000001e9f5806ce0 ffff8804f74c4d98
> > > > 0000000000000801 ffff880501cfa2a8 000000000000008a 000000000000008a
> > >
> > > This isn't even IO. Uuughhhh. We're going to need a fast enough test
> > > that we can bisect.
> >
> > Progress...
> > I've found that this combination of syscalls..
> >
> > ./trinity -C64 -q -l off -a64 --enable-fds=testfile -c fsync -c fsetxattr -c lremovexattr -c pwritev2
> >
> > hits one of these two bugs in a few minutes runtime.
> >
> > Just the xattr syscalls + fsync isn't enough, neither is just pwrite + fsync.
> > Mix them together though, and something goes awry.
> >
> Hasn't triggered here yet. I'll leave it running though.

The hits keep coming..

BUG: Bad page state in process kworker/u8:12 pfn:4988fa
page:ffffea0012623e80 count:0 mapcount:0 mapping:ffff8804450456e0 index:0x9

flags: 0x400000000000000c(referenced|uptodate)
page dumped because: non-NULL mapping
CPU: 2 PID: 1388 Comm: kworker/u8:12 Not tainted 4.8.0-think+ #18
Workqueue: writeback wb_workfn (flush-btrfs-1)

ffffc90000aef7e8 ffffffff81320e7c ffffea0012623e80 ffffffff819fe6ec
ffffc90000aef810 ffffffff81159b3f 0000000000000000 ffffea0012623e80
400000000000000c ffffc90000aef820 ffffffff81159bfa ffffc90000aef868

Call Trace:
[<ffffffff81320e7c>] dump_stack+0x4f/0x73
[<ffffffff81159b3f>] bad_page+0xbf/0x120
[<ffffffff81159bfa>] free_pages_check_bad+0x5a/0x70
[<ffffffff8115c0fb>] free_hot_cold_page+0x20b/0x270
[<ffffffff8115c41b>] free_hot_cold_page_list+0x2b/0x50
[<ffffffff81165062>] release_pages+0x2d2/0x380
[<ffffffff811665d2>] __pagevec_release+0x22/0x30
[<ffffffffa009f810>] extent_write_cache_pages.isra.48.constprop.63+0x350/0x430 [btrfs]
[<ffffffff8133f487>] ? debug_smp_processor_id+0x17/0x20
[<ffffffff810c6999>] ? get_lock_stats+0x19/0x50
[<ffffffffa009fce8>] extent_writepages+0x58/0x80 [btrfs]
[<ffffffffa007f150>] ? btrfs_releasepage+0x40/0x40 [btrfs]
[<ffffffffa007c0d3>] btrfs_writepages+0x23/0x30 [btrfs]
[<ffffffff8116370c>] do_writepages+0x1c/0x30
[<ffffffff81202d63>] __writeback_single_inode+0x33/0x180
[<ffffffff8120357b>] writeback_sb_inodes+0x2cb/0x5d0
[<ffffffff8120390d>] __writeback_inodes_wb+0x8d/0xc0
[<ffffffff81203c03>] wb_writeback+0x203/0x210
[<ffffffff81204197>] wb_workfn+0xe7/0x2a0
[<ffffffff810c8b7f>] ? __lock_acquire.isra.32+0x1cf/0x8c0
[<ffffffff8109458a>] process_one_work+0x1da/0x4b0
[<ffffffff8109452a>] ? process_one_work+0x17a/0x4b0
[<ffffffff810948a9>] worker_thread+0x49/0x490
[<ffffffff81094860>] ? process_one_work+0x4b0/0x4b0
[<ffffffff81094860>] ? process_one_work+0x4b0/0x4b0

Dave Jones
Oct 18, 2016, 6:50:05 PM
So Chris had me do a run on ext4 just for giggles. It took a while, but
eventually this fell out...


WARNING: CPU: 3 PID: 21324 at lib/list_debug.c:33 __list_add+0x89/0xb0
list_add corruption. prev->next should be next (ffffe8ffffc05648), but was ffffc9000028bcd8. (prev=ffff880503a145c0).
CPU: 3 PID: 21324 Comm: modprobe Not tainted 4.9.0-rc1-think+ #1
ffffc90000a6b7b8 ffffffff81320e3c ffffc90000a6b808 0000000000000000
ffffc90000a6b7f8 ffffffff8107a711 0000002100000246 ffff8805039f1740
ffff880503a145c0 ffffe8ffffc05648 ffffe8ffffa05600 ffff880502c39548
Call Trace:
[<ffffffff81320e3c>] dump_stack+0x4f/0x73
[<ffffffff8107a711>] __warn+0xc1/0xe0
[<ffffffff8107a78a>] warn_slowpath_fmt+0x5a/0x80
[<ffffffff8133f499>] __list_add+0x89/0xb0
[<ffffffff8130af88>] blk_sq_make_request+0x2f8/0x350
[<ffffffff812fe6dc>] ? generic_make_request+0xec/0x240
[<ffffffff812fe6e9>] generic_make_request+0xf9/0x240
[<ffffffff812fe8a8>] submit_bio+0x78/0x150
[<ffffffff8120bde6>] ? __find_get_block+0x126/0x130
[<ffffffff8120cbff>] submit_bh_wbc+0x16f/0x1e0
[<ffffffff8120a400>] ? __end_buffer_read_notouch+0x20/0x20
[<ffffffff8120d958>] ll_rw_block+0xa8/0xb0
[<ffffffff8120da0f>] __breadahead+0x3f/0x70
[<ffffffff81264ffc>] __ext4_get_inode_loc+0x37c/0x3d0
[<ffffffff8126806d>] ext4_iget+0x8d/0xb90
[<ffffffff811f0759>] ? d_alloc_parallel+0x329/0x700
[<ffffffff81268b9a>] ext4_iget_normal+0x2a/0x30
[<ffffffff81273cd6>] ext4_lookup+0x136/0x250
[<ffffffff811e118d>] lookup_slow+0x12d/0x220
[<ffffffff811e3897>] walk_component+0x1e7/0x310
[<ffffffff811e33f8>] ? path_init+0x4d8/0x520
[<ffffffff811e4022>] path_lookupat+0x62/0x120
[<ffffffff811e4f22>] ? getname_flags+0x32/0x180
[<ffffffff811e5278>] filename_lookup+0xa8/0x130
[<ffffffff81352526>] ? strncpy_from_user+0x46/0x170
[<ffffffff811e4f3e>] ? getname_flags+0x4e/0x180
[<ffffffff811e53d1>] user_path_at_empty+0x31/0x40
[<ffffffff811d9df1>] vfs_fstatat+0x61/0xc0
[<ffffffff810c8b9f>] ? __lock_acquire.isra.32+0x1cf/0x8c0
[<ffffffff811da30e>] SYSC_newstat+0x2e/0x60
[<ffffffff8133f403>] ? __this_cpu_preempt_check+0x13/0x20
[<ffffffff811da499>] SyS_newstat+0x9/0x10
[<ffffffff8100259c>] do_syscall_64+0x5c/0x170
[<ffffffff817c27cb>] entry_SYSCALL64_slow_path+0x25/0x25

So this one isn't a btrfs specific problem as I first thought.

This sometimes reproduces within minutes, sometimes hours, which makes
it a pain to bisect. It only started showing up this merge window though.

Dave

Linus Torvalds
Oct 18, 2016, 7:40:05 PM
On Tue, Oct 18, 2016 at 4:31 PM, Chris Mason <c...@fb.com> wrote:
>
> Jens, not sure if you saw the whole thread. This has triggered bad page
> state errors, and also corrupted a btrfs list. It hurts me to say, but it
> might not actually be your fault.

Where is that thread, and what is the "this" that triggers problems?

Looking at the "->mq_list" users, I'm not seeing any changes there in
the last year or so. So I don't think it's the list itself.

Linus

Linus Torvalds
Oct 18, 2016, 8:20:05 PM
On Tue, Oct 18, 2016 at 4:42 PM, Chris Mason <c...@fb.com> wrote:
>
> Seems to be the whole thing:

Ahh. On lkml, so I do have it in my mailbox, but Dave changed the
subject line when he tested on ext4 rather than btrfs..

Anyway, the corrupted address is somewhat interesting. As Dave Jones
said, he saw

list_add corruption. prev->next should be next (ffffe8ffff806648),
but was ffffc9000067fcd8. (prev=ffff880503878b80).
list_add corruption. prev->next should be next (ffffe8ffffc05648),
but was ffffc9000028bcd8. (prev=ffff880503a145c0).

and Dave Chinner reports

list_add corruption. prev->next should be next (ffffe8ffffc02808),
but was ffffc90005f6bda8. (prev=ffff88013363bb80).

and it's worth noting that the "but was" is a remarkably consistent
vmalloc address (the ffffc9000.. pattern gives it away). In fact, it's
identical across two boots for DaveJ in the low 14 bits, and fairly
high up in those low 14 bits (0x3cd8).

DaveC has a different address, but it's also in the vmalloc space, and
also looks like it is fairly high up in 14 bits (0x3da8). So in both
cases it's almost certainly a stack address with a fairly empty stack.
The differences are presumably due to different kernel configurations
and/or just different filesystems calling the same function that does
the same bad thing but now at different depths in the stack.

Adding Andy to the cc, because this *might* be triggered by the
vmalloc stack code itself. Maybe the re-use of stacks showing some
problem? Maybe Chris (who can't see the problem) doesn't have
CONFIG_VMAP_STACK enabled?

Andy - this is on lkml, under

Dave Chinner:
[regression, 4.9-rc1] blk-mq: list corruption in request queue

Dave Jones:
btrfs bio linked list corruption.
Re: bio linked list corruption.

and they are definitely the same thing across three different
filesystems (xfs, btrfs and ext4), and they are consistent enough that
there is almost certainly a single very specific memory corrupting
issue that overwrites something with a stack pointer.

Linus

Linus Torvalds
Oct 18, 2016, 8:30:05 PM
On Tue, Oct 18, 2016 at 5:10 PM, Linus Torvalds
<torv...@linux-foundation.org> wrote:
>
> Adding Andy to the cc, because this *might* be triggered by the
> vmalloc stack code itself. Maybe the re-use of stacks showing some
> problem? Maybe Chris (who can't see the problem) doesn't have
> CONFIG_VMAP_STACK enabled?

I bet it's the plug itself that is the stack address. In fact, it's
probably that mq_list head pointer.

I think every single users of block plugging uses the pattern

struct blk_plug plug;

blk_start_plug(&plug);

and then we'll have

INIT_LIST_HEAD(&plug->mq_list);

which initializes that mq_list head with the stack addresses pointing to itself.

So when we see something like this:

list_add corruption. prev->next should be next (ffffe8ffff806648),
but was ffffc9000067fcd8. (prev=ffff880503878b80)

and it comes from

list_add_tail(&rq->queuelist, &plug->mq_list);

which will expand to

__list_add(new, head->prev, head)

which in this case *should* be:

__list_add(&rq->queuelist, plug->mq_list.prev, &plug->mq_list);

so in fact we *should* have "next" be a stack address.

So that debug message is really really odd. I would expect that "next"
is the stack address (because we're adding to the tail of the list, so
"next" is the list head itself), but the debug message corruption
printout says that "was" is the stack address, but next isn't.

Weird. The "but was" value actually looks like the right address should
look, but the actual address (which *should* be just "&plug->mq_list"
and really should be on the stack) looks bogus.

I'm now very confused.

Linus

Andy Lutomirski
Oct 18, 2016, 9:10:05 PM
Wouldn't this cause the exact opposite problem? If the warning is to be
believed, then prev is *not* on the stack but somehow prev->next ended
up pointing to the stack. If stack reuse caused something to corrupt a
value on the stack, then how would this cause a stack address to be
written to a non-stack location? All I can think of is that "prev"
itself is corrupted somehow.

One possible debugging approach would be to change:

#define NR_CACHED_STACKS 2

to

#define NR_CACHED_STACKS 0

in kernel/fork.c and to set CONFIG_DEBUG_PAGEALLOC=y. The latter will
force an immediate TLB flush after vfree.

Also, CONFIG_DEBUG_VIRTUAL=y can be quite helpful for debugging stack
issues. I'm tempted to do something equivalent to hardwiring that
option on for a while if CONFIG_VMAP_STACK=y.

Philipp Hahn
Oct 19, 2016, 1:10:06 PM
Hello,

Am 19.10.2016 um 01:42 schrieb Chris Mason:
> Seems to be the whole thing:
>
> http://www.gossamer-threads.com/lists/linux/kernel/2545792
>
> My guess is xattr, but I don't have a good reason for that.

Nearly a month ago I also reported a "list_add corruption", but with 4.1.6:
<http://marc.info/?l=linux-kernel&m=147508265316854&w=2>

That server runs Samba4, which is also a heavy user of xattr.

It might be related.

Philipp Hahn

Linus Torvalds
Oct 19, 2016, 1:50:06 PM
On Wed, Oct 19, 2016 at 10:09 AM, Philipp Hahn <pmh...@pmhahn.de> wrote:
>
> Nearly a month ago I also reported a "list_add corruption", but with 4.1.6:
> <http://marc.info/?l=linux-kernel&m=147508265316854&w=2>
>
> That server runs Samba4, which is also a heavy user of xattr.

That one looks very different. In fact, the list that got corrupted
for you has since been changed to a hlist (which is *similar* to our
doubly-linked list, but has a smaller head and does not allow adding
to the end of the list).

Also, the "should be" and "was" values are very close, and switched:

should be ffffffff81ab3ca8, but was ffffffff81ab3cc8
should be ffffffff81ab3cc8, but was ffffffff81ab3ca8

so it actually looks like it was the same data structure. In
particular, it looks like enqueue_timer() ended up racing on adding an
entry to one index in the "base->vectors[]" array, while hitting an
entry that was pointing to another index near-by.

So I don't think it's related. Yours looks like some subtle timer base
race. It smells like a locking problem with timers. I'm not seeing
what it might be, but it *might* have been fixed by doing the
TIMER_MIGRATING bit right in add_timer_on() (commit 22b886dd1018).

Adding some timer people just in case, but I don't think your 4.1
report is related.

Linus

Ingo Molnar
Oct 20, 2016, 3:00:05 AM
Side note: in case timer callback related corruption is suspected, a very
efficient debugging method is to enable debugobjects tracking+checking:

CONFIG_DEBUG_OBJECTS=y
CONFIG_DEBUG_OBJECTS_SELFTEST=y
CONFIG_DEBUG_OBJECTS_FREE=y
CONFIG_DEBUG_OBJECTS_TIMERS=y
CONFIG_DEBUG_OBJECTS_WORK=y
CONFIG_DEBUG_OBJECTS_RCU_HEAD=y
CONFIG_DEBUG_OBJECTS_PERCPU_COUNTER=y
CONFIG_DEBUG_KOBJECT_RELEASE=y

( Appending these to the .config and running 'make oldconfig' should enable all of
these. )

If the problem is in any of these areas then a debug warning could trigger at a
more convenient place than some later 'some unrelated bits got corrupted' warning.

Thanks,

Ingo

Thomas Gleixner
Oct 20, 2016, 3:30:05 AM
On Thu, 20 Oct 2016, Ingo Molnar wrote:
> * Linus Torvalds <torv...@linux-foundation.org> wrote:
> > So I don't think it's related. Yours looks like some subtle timer base
> > race. It smells like a locking problem with timers. I'm not seeing
> > what it might be, but it *might* have been fixed by doing the
> > TIMER_MIGRATING bit right in add_timer_on() (commit 22b886dd1018).

4.1 does not have the new fangled timer wheel. So 22b886dd1018 is
irrelevant.

> > Adding some timer people just in case, but I don't think your 4.1
> > report is related.

This looks like one of the classic issues where an enqueued timer gets
freed/reinitialized or otherwise manipulated. So yes, Ingo's recommendation
for debugging is the right thing to do.

Thanks,

tglx

Dave Jones
Oct 20, 2016, 6:50:05 PM
On Tue, Oct 18, 2016 at 05:28:44PM -0700, Linus Torvalds wrote:
> On Tue, Oct 18, 2016 at 5:10 PM, Linus Torvalds
> <torv...@linux-foundation.org> wrote:
> >
> > Adding Andy to the cc, because this *might* be triggered by the
> > vmalloc stack code itself. Maybe the re-use of stacks showing some
> > problem? Maybe Chris (who can't see the problem) doesn't have
> > CONFIG_VMAP_STACK enabled?
>
> I bet it's the plug itself that is the stack address. In fact, it's
> probably that mq_list head pointer

So I've done a few experiments the last couple days.

1, I see some kind of disaster happen with every filesystem
ext4, btrfs, xfs. For some reason I can repro it faster on btrfs
(though xfs blew up pretty quickly too, but I don't know if it's
the same as this list corruption bug).

2, I ran for 24 hours with VMAP_STACK turned off. I saw some
_different_ btrfs problems, but I never hit that list debug corruption
once.

3, I turned vmap stacks back on, and got this pretty quickly.
Another new flavor of crash, but Chris recommended I post this one
because it looks interesting.

[ 3943.514961] BUG: Bad page state in process kworker/u8:14 pfn:482244
[ 3943.532400] page:ffffea0012089100 count:0 mapcount:0 mapping:ffff8804c40d6ae0 index:0x2f
[ 3943.551865] flags: 0x4000000000000008(uptodate)
[ 3943.561652] page dumped because: non-NULL mapping
[ 3943.587698] CPU: 2 PID: 26823 Comm: kworker/u8:14 Not tainted 4.9.0-rc1-think+ #9
[ 3943.607409] Workqueue: writeback wb_workfn (flush-btrfs-2)
[ 3943.617260] ffffc90001bf7870 ffffffff8130c93c ffffea0012089100 ffffffff819ff37c
[ 3943.627149] ffffc90001bf7898 ffffffff81150fef 0000000000000000 ffffea0012089100
[ 3943.637059] 4000000000000008 ffffc90001bf78a8 ffffffff811510aa ffffc90001bf78f0
[ 3943.647126] Call Trace:
[ 3943.657068] [<ffffffff8130c93c>] dump_stack+0x4f/0x73
[ 3943.666996] [<ffffffff81150fef>] bad_page+0xbf/0x120
[ 3943.676839] [<ffffffff811510aa>] free_pages_check_bad+0x5a/0x70
[ 3943.686646] [<ffffffff8115355b>] free_hot_cold_page+0x20b/0x270
[ 3943.696402] [<ffffffff8115387b>] free_hot_cold_page_list+0x2b/0x50
[ 3943.706092] [<ffffffff8115c1fd>] release_pages+0x2bd/0x350
[ 3943.715726] [<ffffffff8115d732>] __pagevec_release+0x22/0x30
[ 3943.725358] [<ffffffffa00a0d4e>] extent_write_cache_pages.isra.48.constprop.63+0x32e/0x400 [btrfs]
[ 3943.735126] [<ffffffffa00a1199>] extent_writepages+0x49/0x60 [btrfs]
[ 3943.744808] [<ffffffffa0081840>] ? btrfs_releasepage+0x40/0x40 [btrfs]
[ 3943.754457] [<ffffffffa007e993>] btrfs_writepages+0x23/0x30 [btrfs]
[ 3943.764085] [<ffffffff8115a91c>] do_writepages+0x1c/0x30
[ 3943.773667] [<ffffffff811f65f3>] __writeback_single_inode+0x33/0x180
[ 3943.783233] [<ffffffff811f6de8>] writeback_sb_inodes+0x2a8/0x5b0
[ 3943.792870] [<ffffffff811f733b>] wb_writeback+0xeb/0x1f0
[ 3943.802326] [<ffffffff811f7972>] wb_workfn+0xd2/0x280
[ 3943.811673] [<ffffffff810906e5>] process_one_work+0x1d5/0x490
[ 3943.821044] [<ffffffff81090685>] ? process_one_work+0x175/0x490
[ 3943.830447] [<ffffffff810909e9>] worker_thread+0x49/0x490
[ 3943.839756] [<ffffffff810909a0>] ? process_one_work+0x490/0x490
[ 3943.849074] [<ffffffff810909a0>] ? process_one_work+0x490/0x490
[ 3943.858264] [<ffffffff81095b5e>] kthread+0xee/0x110
[ 3943.867451] [<ffffffff81095a70>] ? kthread_park+0x60/0x60
[ 3943.876616] [<ffffffff81095a70>] ? kthread_park+0x60/0x60
[ 3943.885624] [<ffffffff81095a70>] ? kthread_park+0x60/0x60
[ 3943.894580] [<ffffffff81790492>] ret_from_fork+0x22/0x30

This feels like chasing a moving target, because the crash keeps changing..
I'm going to spend some time trying to at least pin down a selection
of syscalls that trinity can reproduce this with quickly.

Early-on, it seemed like this was xattr related, but now I'm not so sure.
Once or twice, I was able to repro it within a few minutes using just
writev, fsync, lsetxattr and lremovexattr. Then a day later, I found I
could run for a day before seeing it. Position of the moon or something.
Or it could have been entirely unrelated to the actual syscalls being run,
and based just on how contended the cpu/memory was.

Dave

Dave Jones
Oct 20, 2016, 7:00:06 PM
On Tue, Oct 18, 2016 at 06:05:57PM -0700, Andy Lutomirski wrote:

> One possible debugging approach would be to change:
>
> #define NR_CACHED_STACKS 2
>
> to
>
> #define NR_CACHED_STACKS 0
>
> in kernel/fork.c and to set CONFIG_DEBUG_PAGEALLOC=y. The latter will
> force an immediate TLB flush after vfree.

I can give that idea some runtime, but it sounds like this is a case where
we're trying to prove a negative, and that'll just run and run? In which case I
might do this when I'm travelling on Sunday.

> Also, CONFIG_DEBUG_VIRTUAL=y can be quite helpful for debugging stack
> issues. I'm tempted to do something equivalent to hardwiring that
> option on for a while if CONFIG_VMAP_STACK=y.

This one I had on. Nothing interesting jumped out.

Dave

Andy Lutomirski
Oct 20, 2016, 7:10:05 PM
On Thu, Oct 20, 2016 at 3:50 PM, Dave Jones <da...@codemonkey.org.uk> wrote:
> On Tue, Oct 18, 2016 at 06:05:57PM -0700, Andy Lutomirski wrote:
>
> > One possible debugging approach would be to change:
> >
> > #define NR_CACHED_STACKS 2
> >
> > to
> >
> > #define NR_CACHED_STACKS 0
> >
> > in kernel/fork.c and to set CONFIG_DEBUG_PAGEALLOC=y. The latter will
> > force an immediate TLB flush after vfree.
>
> I can give that idea some runtime, but it sounds like this a case where
> we're trying to prove a negative, and that'll just run and run ? In which case I
> might do this when I'm travelling on Sunday.

The idea is that the stack will be free and unmapped immediately upon
process exit if configured like this so that bogus stack accesses (by
the CPU, not DMA) would OOPS immediately.

Dave Jones

Oct 20, 2016, 7:10:05 PM
oh, misparsed. ok, I can definitely get behind that idea then.
I'll do that next.

Dave

Andy Lutomirski

Oct 20, 2016, 7:30:05 PM
It could be worth trying this, too:

https://git.kernel.org/cgit/linux/kernel/git/luto/linux.git/commit/?h=x86/vmap_stack&id=174531fef4e8

It occurred to me that the current code is a little bit fragile.

--Andy

Dave Jones

Oct 21, 2016, 4:10:05 PM
It's been nearly 24hrs with the above changes, and it's been pretty much
silent the whole time.

The only thing of note over that time period has been a btrfs lockdep
warning that's been around for a while, and occasional btrfs checksum
failures, which I've been seeing for a while, but seem to have gotten
worse since 4.8.

I'm pretty confident in the disk being ok in this machine, so I think
the checksum warnings are bogus. Chris suggested they may be the result
of memory corruption, but there's little else going on.


BTRFS warning (device sda3): csum failed ino 130654 off 0 csum 2566472073 expected csum 3008371513
BTRFS warning (device sda3): csum failed ino 131057 off 4096 csum 3563910319 expected csum 738595262
BTRFS warning (device sda3): csum failed ino 131176 off 4096 csum 1344477721 expected csum 441864825
BTRFS warning (device sda3): csum failed ino 131241 off 245760 csum 3576232181 expected csum 2566472073
BTRFS warning (device sda3): csum failed ino 131429 off 0 csum 1494450239 expected csum 2646577722
BTRFS warning (device sda3): csum failed ino 131471 off 0 csum 3949539320 expected csum 3828807800
BTRFS warning (device sda3): csum failed ino 131471 off 4096 csum 3475108475 expected csum 2566472073
BTRFS warning (device sda3): csum failed ino 131471 off 958464 csum 142982740 expected csum 2566472073
BTRFS warning (device sda3): csum failed ino 131471 off 0 csum 3949539320 expected csum 3828807800
BTRFS warning (device sda3): csum failed ino 131532 off 270336 csum 3138898528 expected csum 2566472073
BTRFS warning (device sda3): csum failed ino 131532 off 1249280 csum 2169165042 expected csum 2566472073
BTRFS warning (device sda3): csum failed ino 131649 off 16384 csum 2914965650 expected csum 1425742005


A curious thing: the expected csum 2566472073 turns up a number of times for different inodes, and gets
differing actual csums each time. I suppose this could be something like a block of all zeros in multiple files,
but it struck me as surprising.

btrfs people: is there an easy way to map those inodes to a filename ? I'm betting those are the
test files that trinity generates. If so, it might point to a race somewhere.

Dave

Dave Jones

Oct 21, 2016, 4:30:05 PM
On Fri, Oct 21, 2016 at 04:17:48PM -0400, Chris Mason wrote:

> > BTRFS warning (device sda3): csum failed ino 130654 off 0 csum 2566472073 expected csum 3008371513
> > BTRFS warning (device sda3): csum failed ino 131057 off 4096 csum 3563910319 expected csum 738595262
> > BTRFS warning (device sda3): csum failed ino 131176 off 4096 csum 1344477721 expected csum 441864825
> > BTRFS warning (device sda3): csum failed ino 131241 off 245760 csum 3576232181 expected csum 2566472073
> > BTRFS warning (device sda3): csum failed ino 131429 off 0 csum 1494450239 expected csum 2646577722
> > BTRFS warning (device sda3): csum failed ino 131471 off 0 csum 3949539320 expected csum 3828807800
> > BTRFS warning (device sda3): csum failed ino 131471 off 4096 csum 3475108475 expected csum 2566472073
> > BTRFS warning (device sda3): csum failed ino 131471 off 958464 csum 142982740 expected csum 2566472073
> > BTRFS warning (device sda3): csum failed ino 131471 off 0 csum 3949539320 expected csum 3828807800
> > BTRFS warning (device sda3): csum failed ino 131532 off 270336 csum 3138898528 expected csum 2566472073
> > BTRFS warning (device sda3): csum failed ino 131532 off 1249280 csum 2169165042 expected csum 2566472073
> > BTRFS warning (device sda3): csum failed ino 131649 off 16384 csum 2914965650 expected csum 1425742005
> >
> >
> > A curious thing: the expected csum 2566472073 turns up a number of times for different inodes, and gets
> > differing actual csums each time. I suppose this could be something like a block of all zeros in multiple files,
> > but it struck me as surprising.
> >
> > btrfs people: is there an easy way to map those inodes to a filename ? I'm betting those are the
> > test files that trinity generates. If so, it might point to a race somewhere.
>
> btrfs inspect inode 130654 mntpoint

Interesting, they all return

ERROR: ino paths ioctl: No such file or directory

So these files got deleted perhaps ?

Dave

Dave Jones

Oct 21, 2016, 5:20:05 PM
On Fri, Oct 21, 2016 at 04:41:09PM -0400, Josef Bacik wrote:

> >> >
> >> > btrfs inspect inode 130654 mntpoint
> >>
> >> Interesting, they all return
> >>
> >> ERROR: ino paths ioctl: No such file or directory
> >>
> >> So these files got deleted perhaps ?
> >>
> > Yeah, they must have.
> >
>
> So one thing that will cause spurious csum errors is if you do things like
> change the memory while it is in flight during O_DIRECT. Does trinity do that?
> If so then that would explain it. If not we should probably dig into it. Thanks,

Yeah, that's definitely possible. And it wasn't that long ago I added
some code to always open testfiles multiple times with different modes,
so the likelihood of O_DIRECT went up. That would explain why I've started
seeing this more.

Dave

Dave Jones

Oct 22, 2016, 11:30:05 AM
On Fri, Oct 21, 2016 at 04:02:45PM -0400, Dave Jones wrote:

> > It could be worth trying this, too:
> >
> > https://git.kernel.org/cgit/linux/kernel/git/luto/linux.git/commit/?h=x86/vmap_stack&id=174531fef4e8
> >
> > It occurred to me that the current code is a little bit fragile.
>
> It's been nearly 24hrs with the above changes, and it's been pretty much
> silent the whole time.
>
> The only thing of note over that time period has been a btrfs lockdep
> warning that's been around for a while, and occasional btrfs checksum
> failures, which I've been seeing for a while, but seem to have gotten
> worse since 4.8.
>
> I'm pretty confident in the disk being ok in this machine, so I think
> the checksum warnings are bogus. Chris suggested they may be the result
> of memory corruption, but there's little else going on.

The only interesting thing last nights run was this..

BUG: Bad page state in process kworker/u8:1 pfn:4e2b70
page:ffffea00138adc00 count:0 mapcount:0 mapping:ffff88046e9fc2e0 index:0xdf0
flags: 0x400000000000000c(referenced|uptodate)
page dumped because: non-NULL mapping
CPU: 3 PID: 24234 Comm: kworker/u8:1 Not tainted 4.9.0-rc1-think+ #11
Workqueue: writeback wb_workfn (flush-btrfs-2)
ffffc90001f97828 ffffffff8130d07c ffffea00138adc00 ffffffff819ff524
ffffc90001f97850 ffffffff8115117f 0000000000000000 ffffea00138adc00
400000000000000c ffffc90001f97860 ffffffff8115123a ffffc90001f978a8
Call Trace:
[<ffffffff8130d07c>] dump_stack+0x4f/0x73
[<ffffffff8115117f>] bad_page+0xbf/0x120
[<ffffffff8115123a>] free_pages_check_bad+0x5a/0x70
[<ffffffff81153b38>] free_hot_cold_page+0x248/0x290
[<ffffffff81153e3b>] free_hot_cold_page_list+0x2b/0x50
[<ffffffff8115c84d>] release_pages+0x2bd/0x350
[<ffffffff8115dd82>] __pagevec_release+0x22/0x30
[<ffffffffa009cd4e>] extent_write_cache_pages.isra.48.constprop.63+0x32e/0x400 [btrfs]
[<ffffffffa009d199>] extent_writepages+0x49/0x60 [btrfs]
[<ffffffffa007d840>] ? btrfs_releasepage+0x40/0x40 [btrfs]
[<ffffffffa007a993>] btrfs_writepages+0x23/0x30 [btrfs]
[<ffffffff8115af6c>] do_writepages+0x1c/0x30
[<ffffffff811f6d33>] __writeback_single_inode+0x33/0x180
[<ffffffff811f7528>] writeback_sb_inodes+0x2a8/0x5b0
[<ffffffff811f78bd>] __writeback_inodes_wb+0x8d/0xc0
[<ffffffff811f7b73>] wb_writeback+0x1e3/0x1f0
[<ffffffff811f80b2>] wb_workfn+0xd2/0x280
[<ffffffff81090875>] process_one_work+0x1d5/0x490
[<ffffffff81090815>] ? process_one_work+0x175/0x490
[<ffffffff81090b79>] worker_thread+0x49/0x490
[<ffffffff81090b30>] ? process_one_work+0x490/0x490
[<ffffffff81095cee>] kthread+0xee/0x110
[<ffffffff81095c00>] ? kthread_park+0x60/0x60
[<ffffffff81790bd2>] ret_from_fork+0x22/0x30

Dave Jones

Oct 24, 2016, 12:50:06 AM
On Sun, Oct 23, 2016 at 05:32:21PM -0400, Chris Mason wrote:
>
>
> On 10/22/2016 11:20 AM, Dave Jones wrote:
> > On Fri, Oct 21, 2016 at 04:02:45PM -0400, Dave Jones wrote:
> >
> > > > It could be worth trying this, too:
> > > >
> > > > https://git.kernel.org/cgit/linux/kernel/git/luto/linux.git/commit/?h=x86/vmap_stack&id=174531fef4e8
> > > >
> > > > It occurred to me that the current code is a little bit fragile.
> > >
> > > It's been nearly 24hrs with the above changes, and it's been pretty much
> > > silent the whole time.
> > >
> > > The only thing of note over that time period has been a btrfs lockdep
> > > warning that's been around for a while, and occasional btrfs checksum
> > > failures, which I've been seeing for a while, but seem to have gotten
> > > worse since 4.8.
> > >
> > > I'm pretty confident in the disk being ok in this machine, so I think
> > > the checksum warnings are bogus. Chris suggested they may be the result
> > > of memory corruption, but there's little else going on.
> >
> > The only interesting thing last nights run was this..
> >
> > BUG: Bad page state in process kworker/u8:1 pfn:4e2b70
> > page:ffffea00138adc00 count:0 mapcount:0 mapping:ffff88046e9fc2e0 index:0xdf0
> > flags: 0x400000000000000c(referenced|uptodate)
> > page dumped because: non-NULL mapping
> > CPU: 3 PID: 24234 Comm: kworker/u8:1 Not tainted 4.9.0-rc1-think+ #11
> > Workqueue: writeback wb_workfn (flush-btrfs-2)
>
> Well crud, we're back to wondering if this is Btrfs or the stack
> corruption. Since the pagevecs are on the stack and this is a new
> crash, my guess is you'll be able to trigger it on xfs/ext4 too. But we
> should make sure.

Here's an interesting one from today, pointing the finger at xattrs again.


[69943.450108] Oops: 0003 [#1] PREEMPT SMP DEBUG_PAGEALLOC
[69943.454452] CPU: 1 PID: 21558 Comm: trinity-c60 Not tainted 4.9.0-rc1-think+ #11
[69943.463510] task: ffff8804f8dd3740 task.stack: ffffc9000b108000
[69943.468077] RIP: 0010:[<ffffffff810c3f6b>] __lock_acquire.isra.32+0x6b/0x8c0
[69943.477489] RSP: 0018:ffffc9000b10b9e8 EFLAGS: 00010086
[69943.482368] RAX: ffffffff81789b90 RBX: ffff8804f8dd3740 RCX: 0000000000000000
[69943.487410] RDX: 0000000000000000 RSI: 0000000000000000 RDI: 0000000000000000
[69943.492515] RBP: ffffc9000b10ba18 R08: 0000000000000001 R09: 0000000000000000
[69943.497666] R10: 0000000000000001 R11: 00003f9cfa7f4e73 R12: 0000000000000000
[69943.502880] R13: 0000000000000000 R14: ffffc9000af7bd48 R15: ffff8804f8dd3740
[69943.508163] FS: 00007f64904a2b40(0000) GS:ffff880507a00000(0000) knlGS:0000000000000000
[69943.513591] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[69943.518917] CR2: ffffffff81789d28 CR3: 00000004a8f16000 CR4: 00000000001406e0
[69943.524253] DR0: 00007f5b97fd4000 DR1: 0000000000000000 DR2: 0000000000000000
[69943.529488] DR3: 0000000000000000 DR6: 00000000ffff0ff0 DR7: 0000000000000600
[69943.534771] Stack:
[69943.540023] ffff880507bd74c0 ffff8804f8dd3740 0000000000000046 0000000000000286
[69943.545456] ffffc9000af7bd08 0000000000000100 ffffc9000b10ba50 ffffffff810c4b68
[69943.551069] ffffffff810ba40c ffff880400000000 0000000000000000 ffffc9000af7bd48
[69943.556796] Call Trace:
[69943.562465] [<ffffffff810c4b68>] lock_acquire+0x58/0x70
[69943.568354] [<ffffffff810ba40c>] ? finish_wait+0x3c/0x70
[69943.574306] [<ffffffff8178fef2>] _raw_spin_lock_irqsave+0x42/0x80
[69943.580335] [<ffffffff810ba40c>] ? finish_wait+0x3c/0x70
[69943.586237] [<ffffffff810ba40c>] finish_wait+0x3c/0x70
[69943.591992] [<ffffffff81169727>] shmem_fault+0x167/0x1b0
[69943.597807] [<ffffffff810ba6c0>] ? prepare_to_wait_event+0x100/0x100
[69943.603741] [<ffffffff8117b46d>] __do_fault+0x6d/0x1b0
[69943.609743] [<ffffffff8117f168>] handle_mm_fault+0xc58/0x1170
[69943.615822] [<ffffffff8117e553>] ? handle_mm_fault+0x43/0x1170
[69943.621971] [<ffffffff81044982>] __do_page_fault+0x172/0x4e0
[69943.628184] [<ffffffff81044d10>] do_page_fault+0x20/0x70
[69943.634449] [<ffffffff8132a897>] ? debug_smp_processor_id+0x17/0x20
[69943.640784] [<ffffffff81791f3f>] page_fault+0x1f/0x30
[69943.647170] [<ffffffff8133d69c>] ? strncpy_from_user+0x5c/0x170
[69943.653480] [<ffffffff8133d686>] ? strncpy_from_user+0x46/0x170
[69943.659632] [<ffffffff811f22a7>] setxattr+0x57/0x170
[69943.665846] [<ffffffff8132a897>] ? debug_smp_processor_id+0x17/0x20
[69943.672172] [<ffffffff810c1f09>] ? get_lock_stats+0x19/0x50
[69943.678558] [<ffffffff810a58f6>] ? sched_clock_cpu+0xb6/0xd0
[69943.685007] [<ffffffff810c40cf>] ? __lock_acquire.isra.32+0x1cf/0x8c0
[69943.691542] [<ffffffff8132a8b3>] ? __this_cpu_preempt_check+0x13/0x20
[69943.698130] [<ffffffff8109b9bc>] ? preempt_count_add+0x7c/0xc0
[69943.704791] [<ffffffff811ecda1>] ? __mnt_want_write+0x61/0x90
[69943.711519] [<ffffffff811f2638>] SyS_fsetxattr+0x78/0xa0
[69943.718300] [<ffffffff8100255c>] do_syscall_64+0x5c/0x170
[69943.724949] [<ffffffff81790a4b>] entry_SYSCALL64_slow_path+0x25/0x25
[69943.731521] Code:
[69943.738124] 00 83 fe 01 0f 86 0e 03 00 00 31 d2 4c 89 f7 44 89 45 d0 89 4d d4 e8 75 e7 ff ff 8b 4d d4 48 85 c0 44 8b 45 d0 0f 84 d8 02 00 00 <f0> ff 80 98 01 00 00 8b 15 e0 21 8f 01 45 8b 8f 50 08 00 00 85
[69943.745506] RIP [<ffffffff810c3f6b>] __lock_acquire.isra.32+0x6b/0x8c0
[69943.752432] RSP <ffffc9000b10b9e8>
[69943.766856] CR2: ffffffff81789d28

Andy Lutomirski

Oct 24, 2016, 4:10:05 PM
This is an unhandled kernel page fault. The string "Oops" is so helpful :-/
That's lock incl 0x198(%rax). I think this is:

atomic_inc((atomic_t *)&class->ops);

I suppose this could be stack corruption at work, but after a fair
amount of staring, I still haven't found anything in the vmap_stack
code that would cause stack corruption.

Linus Torvalds

Oct 24, 2016, 4:50:06 PM
On Mon, Oct 24, 2016 at 1:06 PM, Andy Lutomirski <lu...@amacapital.net> wrote:
>>
>> [69943.450108] Oops: 0003 [#1] PREEMPT SMP DEBUG_PAGEALLOC
>
> This is an unhandled kernel page fault. The string "Oops" is so helpful :-/

I think there was a line above it that DaveJ just didn't include.
Well, it is intriguing that what faults is this:

finish_wait(shmem_falloc_waitq, &shmem_fault_wait);

where 'shmem_fault_wait' is a on-stack wait queue. So it really looks
very much like stack corruption.

What strikes me is that "finish_wait()" does this optimistic "has my
entry been removed" without holding the waitqueue lock (and uses
list_empty_careful() to make sure it does that "safely").

It has that big comment too:

/*
* shmem_falloc_waitq points into the shmem_fallocate()
* stack of the hole-punching task: shmem_falloc_waitq
* is usually invalid by the time we reach here, but
* finish_wait() does not dereference it in that case;
* though i_lock needed lest racing with wake_up_all().
*/

the stack it comes from is the wait queue head from shmem_fallocate(),
which will do "wake_up_all()" under the inode lock.

On the face of it, the inode lock should make that safe and serialize
everything. And yes, finish_wait() does not touch the unsafe stuff if
the wait-queue (in the local stack) is empty, which wake_up_all()
*should* have guaranteed. It's just a regular wait-queue entry (that
DEFINE_WAIT() does that), so it uses the normal
autoremove_wake_function() that removes things on successful wakeup:

int autoremove_wake_function(wait_queue_t *wait, unsigned mode, int
sync, void *key)
{
int ret = default_wake_function(wait, mode, sync, key);

if (ret)
list_del_init(&wait->task_list);
return ret;
}

So the only issue is "did default_wake_function() return true"? That's
try_to_wake_up(TASK_NORMAL, 0), and I note that it can return zero
(and thus *not* remove the entry - leaving the invalid entry there)
if

if (!(p->state & state))
goto out;

but "prepare_to_wait()" (which also ran with the inode->i_lock held,
and also takes the wait-queue lock) did set p->state to
TASK_UNINTERRUPTIBLE.

So this is all some really subtle code, but I'm not seeing that it
would be wrong.

Linus

Linus Torvalds

Oct 24, 2016, 5:20:06 PM
On Mon, Oct 24, 2016 at 1:46 PM, Linus Torvalds
<torv...@linux-foundation.org> wrote:
>
> So this is all some really subtle code, but I'm not seeing that it
> would be wrong.

Ahh... Except maybe..

The vmalloc/vfree code itself is a bit scary. In particular, we have a
rather insane model of TLB flushing. We leave the virtual area on a
lazy purge-list, and we delay flushing the TLB and actually freeing
the virtual memory for it so that we can batch things up.

But we've free'd the physical pages that are *mapped* by that area
when we do the vfree(). So there can be stale TLB entries that point
to pages that have gotten re-used. They shouldn't matter, because
nothing should be writing to those pages, but it strikes me that this
may also be hurting the DEBUG_PAGEALLOC thing. Maybe we're not getting
the page faults that we *should* be getting, and there are hidden
reads and writes to those pages that already got freed.

There was some nasty reason why we did that insane thing. I think it
was just that there are a few high-frequency vmalloc/vfree users and
the TLB flushing was killing some performance.

But it does strike me that we are playing very fast and loose with the
TLB on the vmalloc area.

So maybe all the new VMAP code is fine, and it's really vmalloc/vfree
that has been subtly broken but nobody has ever cared before?

Linus

Linus Torvalds

Oct 24, 2016, 6:00:05 PM
On Mon, Oct 24, 2016 at 2:17 PM, Linus Torvalds
<torv...@linux-foundation.org> wrote:
>
> The vmalloc/vfree code itself is a bit scary. In particular, we have a
> rather insane model of TLB flushing. We leave the virtual area on a
> lazy purge-list, and we delay flushing the TLB and actually freeing
> the virtual memory for it so that we can batch things up.

Never mind. If DaveJ is running with DEBUG_PAGEALLOC, then the code in
vmap_debug_free_range() should have forced a synchronous TLB flush for
vmalloc ranges too, so that doesn't explain it either.

Linus

Andy Lutomirski

Oct 24, 2016, 6:50:19 PM
Here's my theory: I think you're looking at the right code but the
wrong stack. shmem_fault_wait is fine, but shmem_falloc_waitq looks
really dicey. Consider:

fallocate calls wake_up_all(), which calls autoremove_wait_function().
That wakes up the shmem_fault thread. Suppose that the shmem_fault
thread gets moving really quickly (presumably because it never went to
sleep in the first place -- suppose it hadn't made it to schedule(),
so schedule() turns into a no-op). It calls finish_wait() *before*
autoremove_wake_function() does the list_del_init(). finish_wait()
gets to the list_empty_careful() call and it returns true.

Now the fallocate thread catches up and *exits*. Dave's test makes a
new thread that reuses the stack (the vmap area or the backing store).

Now the shmem_fault thread continues on its merry way and takes
q->lock. But oh crap, q->lock is pointing at some random spot on some
other thread's stack. Kaboom!

If I'm right, actually blowing up due to this bug would have been
very, very unlikely on 4.8 and below, and it has nothing to do with
vmap stacks per se -- it was the RCU removal. We used to wait for an
RCU grace period before reusing a stack, and there couldn't have been
an RCU grace period in the middle of finish_wait(), so we were safer.
(Not totally safe, because the thread that called fallocate() could
have made it somewhere else that used the same stack depth, but I can
see that being less likely.)

IOW, I think that autoremove_wake_function() is buggy. It should
remove the wait entry, then try to wake the thread, then re-add the
wait entry if the wakeup fails. Or maybe it should use some atomic
flag in the wait entry or have some other locking to make sure that
finish_wait does not touch q->lock if autoremove_wake_function
succeeds.

--Andy

Linus Torvalds

Oct 24, 2016, 8:10:05 PM
On Mon, Oct 24, 2016 at 3:42 PM, Andy Lutomirski <lu...@amacapital.net> wrote:
>
> Here's my theory: I think you're looking at the right code but the
> wrong stack. shmem_fault_wait is fine, but shmem_falloc_waitq looks
> really dicey.

Hmm.

> Consider:
>
> fallocate calls wake_up_all(), which calls autoremove_wait_function().
> That wakes up the shmem_fault thread. Suppose that the shmem_fault
> thread gets moving really quickly (presumably because it never went to
> sleep in the first place -- suppose it hadn't made it to schedule(),
> so schedule() turns into a no-op). It calls finish_wait() *before*
> autoremove_wake_function() does the list_del_init(). finish_wait()
> gets to the list_empty_careful() call and it returns true.

All of this happens under inode->i_lock, so the different parts are
serialized. So if this happens before the wakeup, then finish_wait()
will se that the wait-queue entry is not empty (it points to the wait
queue head in shmem_falloc_waitq.

But then it will just remove itself carefully with list_del_init()
under the waitqueue lock, and things are fine.

Yes, it uses the waitqueue lock on the other stack, but that stack is
still ok since
(a) we're serialized by inode->i_lock
(b) this code ran before the fallocate thread catches up and exits.

In other words, your scenario seems to be dependent on those two
threads being able to race. But they both hold inode->i_lock in the
critical region you are talking about.

> Now the fallocate thread catches up and *exits*. Dave's test makes a
> new thread that reuses the stack (the vmap area or the backing store).
>
> Now the shmem_fault thread continues on its merry way and takes
> q->lock. But oh crap, q->lock is pointing at some random spot on some
> other thread's stack. Kaboom!

Note that q->lock should be entirely immaterial, since inode->i_lock
nests outside of it in all uses.

Now, if there is some code that runs *without* the inode->i_lock, then
that would be a big bug.

But I'm not seeing it.

I do agree that some race on some stack data structure could easily be
the cause of these issues. And yes, the vmap code obviously starts
reusing the stack much earlier, and would trigger problems that would
essentially be hidden by the fact that the kernel stack used to stay
around not just until exit(), but until the process was reaped.

I just think that in this case i_lock really looks like it should
serialize things correctly.

Or are you seeing something I'm not?

Linus

Andy Lutomirski

Oct 24, 2016, 9:20:04 PM
No, I missed that.

--Andy

Dave Jones

Oct 25, 2016, 8:30:05 PM
On Mon, Oct 24, 2016 at 09:42:39AM -0400, Chris Mason wrote:

> > > Well crud, we're back to wondering if this is Btrfs or the stack
> > > corruption. Since the pagevecs are on the stack and this is a new
> > > crash, my guess is you'll be able to trigger it on xfs/ext4 too. But we
> > > should make sure.
> >
> > Here's an interesting one from today, pointing the finger at xattrs again.
> >
> >
> > [69943.450108] Oops: 0003 [#1] PREEMPT SMP DEBUG_PAGEALLOC
> > [69943.454452] CPU: 1 PID: 21558 Comm: trinity-c60 Not tainted 4.9.0-rc1-think+ #11
> > [69943.463510] task: ffff8804f8dd3740 task.stack: ffffc9000b108000
> > [69943.468077] RIP: 0010:[<ffffffff810c3f6b>]
>
> Was this btrfs?

I already told you elsewhere, but for the benefit of everyone else, yes, it was.

At Chris' behest, I gave ext4 some more air-time with this workload.
It ran for 1 day 6 hrs without incident before I got bored and tried
something else. I threw XFS on the test partition, restarted the test,
and got the warnings below across two reboots.

DaveC: Do these look like real problems, or is this more "looks like
random memory corruption"? It's been a while since I did some stress
testing on XFS, so these might not be new.



XFS: Assertion failed: oldlen > newlen, file: fs/xfs/libxfs/xfs_bmap.c, line: 2938
------------[ cut here ]------------
kernel BUG at fs/xfs/xfs_message.c:113!
invalid opcode: 0000 [#1] PREEMPT SMP
CPU: 1 PID: 6227 Comm: trinity-c9 Not tainted 4.9.0-rc1-think+ #6
task: ffff8804f4658040 task.stack: ffff88050568c000
RIP: 0010:[<ffffffffa02d3e2b>] assfail+0x1b/0x20 [xfs]
RSP: 0000:ffff88050568f9e8 EFLAGS: 00010282
RAX: 00000000ffffffea RBX: 0000000000000046 RCX: 0000000000000001
RDX: 00000000ffffffc0 RSI: 000000000000000a RDI: ffffffffa02fe34d
RBP: ffff88050568f9e8 R08: 0000000000000000 R09: 0000000000000000
R10: 000000000000000a R11: f000000000000000 R12: ffff88050568fb44
R13: 00000000000000f3 R14: ffff8804f292bf88 R15: 000ffffffffe0046
FS: 00007fe2ddfdfb40(0000) GS:ffff88050a000000(0000) knlGS:0000000000000000
CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
CR2: 00007fe2dbabd000 CR3: 00000004f461f000 CR4: 00000000001406e0
Stack:
ffff88050568fa88 ffffffffa027ccee fffffffffffffff9 ffff8804f16fd8b0
0000000000003ffa 0000000000000032 ffff8804f292bf40 0000000000004976
000ffffffffe0008 00000000000004fd ffff880400000000 0000000000005107
Call Trace:
[<ffffffffa027ccee>] xfs_bmap_add_extent_hole_delay+0x54e/0x620 [xfs]
[<ffffffffa027f2d4>] xfs_bmapi_reserve_delalloc+0x2b4/0x400 [xfs]
[<ffffffffa02cadd7>] xfs_file_iomap_begin_delay.isra.12+0x247/0x3c0 [xfs]
[<ffffffffa02cb0d1>] xfs_file_iomap_begin+0x181/0x270 [xfs]
[<ffffffffa02ca13e>] ? xfs_file_iomap_end+0x9e/0xe0 [xfs]
[<ffffffff8122e573>] iomap_apply+0x53/0x100
[<ffffffff8122df10>] ? iomap_write_end+0x70/0x70
[<ffffffff8122e68b>] iomap_file_buffered_write+0x6b/0x90
[<ffffffff8122df10>] ? iomap_write_end+0x70/0x70
[<ffffffffa02c1dd8>] xfs_file_buffered_aio_write+0xe8/0x1d0 [xfs]
[<ffffffff810c3b7f>] ? __lock_acquire.isra.32+0x1cf/0x8c0
[<ffffffffa02c1f45>] xfs_file_write_iter+0x85/0x120 [xfs]
[<ffffffff811c8c98>] do_iter_readv_writev+0xa8/0x100
[<ffffffff811c9622>] do_readv_writev+0x172/0x210
[<ffffffffa02c1ec0>] ? xfs_file_buffered_aio_write+0x1d0/0x1d0 [xfs]
[<ffffffff811e9794>] ? __fdget_pos+0x44/0x50
[<ffffffff8178b2f2>] ? mutex_lock_nested+0x272/0x3f0
[<ffffffff811e9794>] ? __fdget_pos+0x44/0x50
[<ffffffff811e9794>] ? __fdget_pos+0x44/0x50
[<ffffffff811c98ea>] vfs_writev+0x3a/0x50
[<ffffffff811c9950>] do_writev+0x50/0xd0
[<ffffffff811ca9fb>] SyS_writev+0xb/0x10
[<ffffffff8100255c>] do_syscall_64+0x5c/0x170
[<ffffffff8178ff4b>] entry_SYSCALL64_slow_path+0x25/0x25
Code: 48 c7 c7 65 e3 2f a0 e8 74 37 da e0 5d c3 66 90 55 48 89 f1 41 89 d0 48 c7 c6 18 93 30 a0 48 89 fa 48 89 e5 31 ff e8 65 fa ff ff <0f> 0b 0f 1f 00 55 48 63 f6 49 89 f9 41 b8 01 00 00 00 48 89 e5
RIP [<ffffffffa02d3e2b>] assfail+0x1b/0x20 [xfs]
RSP <ffff88050568f9e8>



XFS: Assertion failed: tp->t_blk_res_used <= tp->t_blk_res, file: fs/xfs/xfs_trans.c, line: 309
kernel BUG at fs/xfs/xfs_message.c:113!
invalid opcode: 0000 [#1] PREEMPT SMP DEBUG_PAGEALLOC
CPU: 0 PID: 7309 Comm: kworker/u8:1 Not tainted 4.9.0-rc1-think+ #11
Workqueue: writeback wb_workfn (flush-8:0)
task: ffff88025eb98040 task.stack: ffffc9000a914000
RIP: 0010:[<ffffffffa0571e2b>] assfail+0x1b/0x20 [xfs]
RSP: 0018:ffffc9000a917410 EFLAGS: 00010282
RAX: 00000000ffffffea RBX: ffff8804538d22b8 RCX: 0000000000000001
RDX: 00000000ffffffc0 RSI: 000000000000000a RDI: ffffffffa059c34d
RBP: ffffc9000a917410 R08: 0000000000000000 R09: 0000000000000000
R10: 000000000000000a R11: f000000000000000 R12: ffffffffffffffff
R13: ffff88047c765698 R14: 0000000000000001 R15: 0000000000000000
FS: 0000000000000000(0000) GS:ffff880507800000(0000) knlGS:0000000000000000
CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
CR2: 0000000000000008 CR3: 00000004c56e7000 CR4: 00000000001406f0
DR0: 00007fec5e3c9000 DR1: 0000000000000000 DR2: 0000000000000000
DR3: 0000000000000000 DR6: 00000000ffff0ff0 DR7: 0000000000000600
Stack:
ffffc9000a917438 ffffffffa057abe1 ffffc9000a917510 ffffc9000a917510
ffffc9000a917510 ffffc9000a917460 ffffffffa0548eff ffffc9000a917510
0000000000000001 ffffc9000a917510 ffffc9000a917480 ffffffffa050aa3d
Call Trace:
[<ffffffffa057abe1>] xfs_trans_mod_sb+0x241/0x280 [xfs]
[<ffffffffa0548eff>] xfs_ag_resv_alloc_extent+0x4f/0xc0 [xfs]
[<ffffffffa050aa3d>] xfs_alloc_ag_vextent+0x23d/0x300 [xfs]
[<ffffffffa050bb1b>] xfs_alloc_vextent+0x5fb/0x6d0 [xfs]
[<ffffffffa051c1b4>] xfs_bmap_btalloc+0x304/0x8e0 [xfs]
[<ffffffffa054648e>] ? xfs_iext_bno_to_ext+0xee/0x170 [xfs]
[<ffffffffa051c8db>] xfs_bmap_alloc+0x2b/0x40 [xfs]
[<ffffffffa051dc30>] xfs_bmapi_write+0x640/0x1210 [xfs]
[<ffffffffa0569326>] xfs_iomap_write_allocate+0x166/0x350 [xfs]
[<ffffffffa05540b0>] xfs_map_blocks+0x1b0/0x260 [xfs]
[<ffffffffa0554beb>] xfs_do_writepage+0x23b/0x730 [xfs]
[<ffffffff81159ef8>] ? clear_page_dirty_for_io+0x128/0x210
[<ffffffff81159e71>] ? clear_page_dirty_for_io+0xa1/0x210
[<ffffffff8115a1b6>] write_cache_pages+0x1d6/0x4a0
[<ffffffffa05549b0>] ? xfs_aops_discard_page+0x140/0x140 [xfs]
[<ffffffffa0554419>] xfs_vm_writepages+0x59/0x80 [xfs]
[<ffffffff8115af6c>] do_writepages+0x1c/0x30
[<ffffffff811f6d33>] __writeback_single_inode+0x33/0x180
[<ffffffff811f7528>] writeback_sb_inodes+0x2a8/0x5b0
[<ffffffff811f78bd>] __writeback_inodes_wb+0x8d/0xc0
[<ffffffff811f7b73>] wb_writeback+0x1e3/0x1f0
[<ffffffff811f80b2>] wb_workfn+0xd2/0x280
[<ffffffff81090875>] process_one_work+0x1d5/0x490
[<ffffffff81090815>] ? process_one_work+0x175/0x490
[<ffffffff81090b79>] worker_thread+0x49/0x490
[<ffffffff81090b30>] ? process_one_work+0x490/0x490
[<ffffffff81090b30>] ? process_one_work+0x490/0x490
[<ffffffff81095cee>] kthread+0xee/0x110
[<ffffffff81095c00>] ? kthread_park+0x60/0x60
[<ffffffff81790bd2>] ret_from_fork+0x22/0x30
Code: 48 c7 c7 65 c3 59 a0 e8 c4 5c b0 e0 5d c3 66 90 55 48 89 f1 41 89 d0 48 c7 c6 18 73 5a a0 48 89 fa 48 89 e5 31 ff e8 65 fa ff ff <0f> 0b 0f 1f 00 55 48 63 f6 49 89 f9 41 b8 01 00 00 00 48 89 e5
RIP [<ffffffffa0571e2b>] assfail+0x1b/0x20 [xfs]
RSP <ffffc9000a917410>

Linus Torvalds

Oct 25, 2016, 9:40:05 PM
On Tue, Oct 25, 2016 at 5:27 PM, Dave Jones <da...@codemonkey.org.uk> wrote:
>
> DaveC: Do these look like real problems, or is this more "looks like
> random memory corruption" ? It's been a while since I did some stress
> testing on XFS, so these might not be new..

Andy, do you think we could just do some poisoning of the stack as we
free it, to see if that catches anything?

Something truly stupid like just

--- a/kernel/fork.c
+++ b/kernel/fork.c
@@ -218,6 +218,7 @@ static inline void free_thread_stack(struct task_struct *tsk)
unsigned long flags;
int i;

+ memset(tsk->stack_vm_area->addr, 0xd0, THREAD_SIZE);
local_irq_save(flags);
for (i = 0; i < NR_CACHED_STACKS; i++) {
if (this_cpu_read(cached_stacks[i]))

or similar?

It seems like DaveJ had an easier time triggering these problems with
the stack cache, but they clearly didn't go away when the stack cache
was disabled. So maybe the stack cache just made the reuse more likely
and faster, making the problem show up faster too. But if we actively
poison things, we'll corrupt the free'd stack *immediately* if there
is some stale use..

Completely untested. Maybe there's some reason we can't write to the
whole thing like that?

Linus

Linus Torvalds

Oct 25, 2016, 9:40:05 PM
On Tue, Oct 25, 2016 at 6:33 PM, Linus Torvalds
<torv...@linux-foundation.org> wrote:
>
> Completely untested. Maybe there's some reason we can't write to the
> whole thing like that?

That hack boots and seems to work for me, but doesn't show anything.

Dave, mind just trying that oneliner?

Linus

Dave Jones

Oct 26, 2016, 12:40:06 PM
I gave this a go last thing last night. It crashed within 5 minutes,
but it was one we've already seen (the bad page map trace) with nothing
additional that looked interesting. I rebooted, and tried again and
went to bed.

12 hours later, it's still doing its thing.

Heisenbugs man, literally the worst.

Dave

Linus Torvalds · Oct 26, 2016, 12:50:06 PM
On Wed, Oct 26, 2016 at 9:30 AM, Dave Jones <da...@codemonkey.org.uk> wrote:
>
> I gave this a go last thing last night. It crashed within 5 minutes,
> but it was one we've already seen (the bad page map trace) with nothing
> additional that looked interesting.

Did the bad page map trace have any registers that looked like they
had 0xd0d0d0d0d0d0 in them?

I assume not, but worth checking.

> Heisenbugs man, literally the worst.

I know you already had this in some email, but I lost it. I think you
narrowed it down to a specific set of system calls that seems to
trigger this best. fallocate and xattrs or something?

Linus

Dave Jones · Oct 26, 2016, 2:20:07 PM
On Wed, Oct 26, 2016 at 09:48:39AM -0700, Linus Torvalds wrote:
> On Wed, Oct 26, 2016 at 9:30 AM, Dave Jones <da...@codemonkey.org.uk> wrote:
> >
> > I gave this a go last thing last night. It crashed within 5 minutes,
> > but it was one we've already seen (the bad page map trace) with nothing
> > additional that looked interesting.
>
> Did the bad page map trace have any registers that looked like they
> had 0xd0d0d0d0d0d0 in them?
>
> I assume not, but worth checking.

sadly not. In case I did overlook something, here's last night's..

BUG: Bad page state in process kworker/u8:13 pfn:4dae31
page:ffffea00136b8c40 count:0 mapcount:0 mapping:ffff8804f011d6e0 index:0xd1530
flags: 0x400000000000000c(referenced|uptodate)
page dumped because: non-NULL mapping
CPU: 3 PID: 1207 Comm: kworker/u8:13 Not tainted 4.9.0-rc2-think+ #3
Workqueue: writeback wb_workfn (flush-btrfs-1)
ffffc900006fb870 ffffffff8130cf3c ffffea00136b8c40 ffffffff819ff54c
ffffc900006fb898 ffffffff811511af 0000000000000000 ffffea00136b8c40
400000000000000c ffffc900006fb8a8 ffffffff8115126a ffffc900006fb8f0
Call Trace:
[<ffffffff8130cf3c>] dump_stack+0x4f/0x73
[<ffffffff811511af>] bad_page+0xbf/0x120
[<ffffffff8115126a>] free_pages_check_bad+0x5a/0x70
[<ffffffff81153b68>] free_hot_cold_page+0x248/0x290
[<ffffffff81153e6b>] free_hot_cold_page_list+0x2b/0x50
[<ffffffff8115c87d>] release_pages+0x2bd/0x350
[<ffffffff8115ddb2>] __pagevec_release+0x22/0x30
[<ffffffffa00a0d4e>] extent_write_cache_pages.isra.48.constprop.63+0x32e/0x400 [btrfs]
[<ffffffffa00a1199>] extent_writepages+0x49/0x60 [btrfs]
[<ffffffffa0081840>] ? btrfs_releasepage+0x40/0x40 [btrfs]
[<ffffffffa007e993>] btrfs_writepages+0x23/0x30 [btrfs]
[<ffffffff8115af9c>] do_writepages+0x1c/0x30
[<ffffffff811f6d73>] __writeback_single_inode+0x33/0x180
[<ffffffff811f7568>] writeback_sb_inodes+0x2a8/0x5b0
[<ffffffff811f7abb>] wb_writeback+0xeb/0x1f0
[<ffffffff811f80f2>] wb_workfn+0xd2/0x280
[<ffffffff810908a5>] process_one_work+0x1d5/0x490
[<ffffffff81090845>] ? process_one_work+0x175/0x490
[<ffffffff81090ba9>] worker_thread+0x49/0x490
[<ffffffff81090b60>] ? process_one_work+0x490/0x490
[<ffffffff81095d1e>] kthread+0xee/0x110
[<ffffffff81095c30>] ? kthread_park+0x60/0x60
[<ffffffff81790a52>] ret_from_fork+0x22/0x30


> > Heisenbugs man, literally the worst.
>
> I know you already had this in some email, but I lost it. I think you
> narrowed it down to a specific set of system calls that seems to
> trigger this best. fallocate and xattrs or something?

I did. Or so I thought. Then iirc, it ran for a really long time
with those. I'll revisit that just to be sure.

Something else I tried: Trinity is very heavily stressing
the fork path, with new children forking off and doing crazy shit
all the time, segfaulting, repeat. I wrote a small test app that
models all of that sans the "do crazy shit with syscalls", and that
didn't do anything useful in terms of reproducing this.
I hoped that had panned out, because I could totally see the relationship
between reusing vmap'ing stacks and heavy fork() use.
Perhaps I'm still missing something non-obvious.

Dave

Dave Jones · Oct 26, 2016, 2:50:05 PM
On Wed, Oct 26, 2016 at 09:48:39AM -0700, Linus Torvalds wrote:

> I know you already had this in some email, but I lost it. I think you
> narrowed it down to a specific set of system calls that seems to
> trigger this best. fallocate and xattrs or something?

So I was about to give that a shot again. That this had been running
for 24hrs was bugging me. I ctrl-c'd the current run, and trinity
just sat there, because all of its child processes are stuck in D state.

The stacks show nearly all of them are stuck in sync_inodes_sb

iotop & vmstat show there is _zero_ io actually happening.
So it's spent most of the night doing next to nothing useful afaict.

Chris ?

Here's the /proc/pid/stack's of those children..


[<ffffffff811f5806>] sync_inodes_sb+0xc6/0x300
[<ffffffff811fbd37>] sync_filesystem+0x57/0xa0
[<ffffffff811fbebc>] SyS_syncfs+0x3c/0x70
[<ffffffff8100255c>] do_syscall_64+0x5c/0x170
[<ffffffff817908cb>] entry_SYSCALL64_slow_path+0x25/0x25
[<ffffffffffffffff>] 0xffffffffffffffff

[<ffffffff811f5806>] sync_inodes_sb+0xc6/0x300
[<ffffffff811fbac0>] sync_inodes_one_sb+0x10/0x20
[<ffffffff811cdd3f>] iterate_supers+0xaf/0x100
[<ffffffff811fbdb0>] sys_sync+0x30/0x90
[<ffffffff8100255c>] do_syscall_64+0x5c/0x170
[<ffffffff817908cb>] entry_SYSCALL64_slow_path+0x25/0x25
[<ffffffffffffffff>] 0xffffffffffffffff

[<ffffffff811f5806>] sync_inodes_sb+0xc6/0x300
[<ffffffff811fbac0>] sync_inodes_one_sb+0x10/0x20
[<ffffffff811cdd3f>] iterate_supers+0xaf/0x100
[<ffffffff811fbdb0>] sys_sync+0x30/0x90
[<ffffffff8100255c>] do_syscall_64+0x5c/0x170
[<ffffffff817908cb>] entry_SYSCALL64_slow_path+0x25/0x25
[<ffffffffffffffff>] 0xffffffffffffffff

[<ffffffff811f5806>] sync_inodes_sb+0xc6/0x300
[<ffffffff811fbac0>] sync_inodes_one_sb+0x10/0x20
[<ffffffff811cdd3f>] iterate_supers+0xaf/0x100
[<ffffffff811fbdb0>] sys_sync+0x30/0x90
[<ffffffff8100255c>] do_syscall_64+0x5c/0x170
[<ffffffff817908cb>] entry_SYSCALL64_slow_path+0x25/0x25
[<ffffffffffffffff>] 0xffffffffffffffff

[<ffffffffa009554f>] btrfs_wait_ordered_roots+0x3f/0x200 [btrfs]
[<ffffffffa00470d1>] btrfs_sync_fs+0x31/0xc0 [btrfs]
[<ffffffff811fbd4e>] sync_filesystem+0x6e/0xa0
[<ffffffff811fbebc>] SyS_syncfs+0x3c/0x70
[<ffffffff8100255c>] do_syscall_64+0x5c/0x170
[<ffffffff817908cb>] entry_SYSCALL64_slow_path+0x25/0x25
[<ffffffffffffffff>] 0xffffffffffffffff

[<ffffffff8131ae87>] call_rwsem_down_write_failed+0x17/0x30
[<ffffffff811fc294>] utimes_common+0xd4/0x190
[<ffffffff811fc457>] do_utimes+0x107/0x120
[<ffffffff811fc641>] SyS_futimesat+0xa1/0xd0
[<ffffffff8100255c>] do_syscall_64+0x5c/0x170
[<ffffffff817908cb>] entry_SYSCALL64_slow_path+0x25/0x25
[<ffffffffffffffff>] 0xffffffffffffffff

[<ffffffff811f5806>] sync_inodes_sb+0xc6/0x300
[<ffffffff811fbac0>] sync_inodes_one_sb+0x10/0x20
[<ffffffff811cdd3f>] iterate_supers+0xaf/0x100
[<ffffffff811fbdb0>] sys_sync+0x30/0x90
[<ffffffff8100255c>] do_syscall_64+0x5c/0x170
[<ffffffff817908cb>] entry_SYSCALL64_slow_path+0x25/0x25
[<ffffffffffffffff>] 0xffffffffffffffff

[<ffffffff811f5806>] sync_inodes_sb+0xc6/0x300
[<ffffffff811fbd37>] sync_filesystem+0x57/0xa0
[<ffffffff811fbebc>] SyS_syncfs+0x3c/0x70
[<ffffffff8100255c>] do_syscall_64+0x5c/0x170
[<ffffffff817908cb>] entry_SYSCALL64_slow_path+0x25/0x25
[<ffffffffffffffff>] 0xffffffffffffffff

[<ffffffff811f5806>] sync_inodes_sb+0xc6/0x300
[<ffffffff811fbd37>] sync_filesystem+0x57/0xa0
[<ffffffff811fbebc>] SyS_syncfs+0x3c/0x70
[<ffffffff8100255c>] do_syscall_64+0x5c/0x170
[<ffffffff817908cb>] entry_SYSCALL64_slow_path+0x25/0x25
[<ffffffffffffffff>] 0xffffffffffffffff

[<ffffffff811f5806>] sync_inodes_sb+0xc6/0x300
[<ffffffff811fbd37>] sync_filesystem+0x57/0xa0
[<ffffffff811fbebc>] SyS_syncfs+0x3c/0x70
[<ffffffff8100255c>] do_syscall_64+0x5c/0x170
[<ffffffff817908cb>] entry_SYSCALL64_slow_path+0x25/0x25
[<ffffffffffffffff>] 0xffffffffffffffff

[<ffffffff811f5806>] sync_inodes_sb+0xc6/0x300
[<ffffffff811fbac0>] sync_inodes_one_sb+0x10/0x20
[<ffffffff811cdd3f>] iterate_supers+0xaf/0x100
[<ffffffff811fbdb0>] sys_sync+0x30/0x90
[<ffffffff8100255c>] do_syscall_64+0x5c/0x170
[<ffffffff817908cb>] entry_SYSCALL64_slow_path+0x25/0x25
[<ffffffffffffffff>] 0xffffffffffffffff

[<ffffffff811f5806>] sync_inodes_sb+0xc6/0x300
[<ffffffff811fbd37>] sync_filesystem+0x57/0xa0
[<ffffffff811fbebc>] SyS_syncfs+0x3c/0x70
[<ffffffff8100255c>] do_syscall_64+0x5c/0x170
[<ffffffff817908cb>] entry_SYSCALL64_slow_path+0x25/0x25
[<ffffffffffffffff>] 0xffffffffffffffff

[<ffffffff811f5806>] sync_inodes_sb+0xc6/0x300
[<ffffffff811fbd37>] sync_filesystem+0x57/0xa0
[<ffffffff811fbebc>] SyS_syncfs+0x3c/0x70
[<ffffffff8100255c>] do_syscall_64+0x5c/0x170
[<ffffffff817908cb>] entry_SYSCALL64_slow_path+0x25/0x25
[<ffffffffffffffff>] 0xffffffffffffffff

[<ffffffff811f5806>] sync_inodes_sb+0xc6/0x300
[<ffffffff811fbac0>] sync_inodes_one_sb+0x10/0x20
[<ffffffff811cdd3f>] iterate_supers+0xaf/0x100
[<ffffffff811fbdb0>] sys_sync+0x30/0x90
[<ffffffff8100255c>] do_syscall_64+0x5c/0x170
[<ffffffff817908cb>] entry_SYSCALL64_slow_path+0x25/0x25
[<ffffffffffffffff>] 0xffffffffffffffff

[<ffffffff811f5806>] sync_inodes_sb+0xc6/0x300
[<ffffffff811fbd37>] sync_filesystem+0x57/0xa0
[<ffffffff811fbebc>] SyS_syncfs+0x3c/0x70
[<ffffffff8100255c>] do_syscall_64+0x5c/0x170
[<ffffffff817908cb>] entry_SYSCALL64_slow_path+0x25/0x25
[<ffffffffffffffff>] 0xffffffffffffffff

[<ffffffff811f5806>] sync_inodes_sb+0xc6/0x300
[<ffffffff811fbac0>] sync_inodes_one_sb+0x10/0x20
[<ffffffff811cdd3f>] iterate_supers+0xaf/0x100
[<ffffffff811fbdb0>] sys_sync+0x30/0x90
[<ffffffff8100255c>] do_syscall_64+0x5c/0x170
[<ffffffff817908cb>] entry_SYSCALL64_slow_path+0x25/0x25
[<ffffffffffffffff>] 0xffffffffffffffff

[<ffffffff811f5806>] sync_inodes_sb+0xc6/0x300
[<ffffffff811fbd37>] sync_filesystem+0x57/0xa0
[<ffffffff811fbebc>] SyS_syncfs+0x3c/0x70
[<ffffffff8100255c>] do_syscall_64+0x5c/0x170
[<ffffffff817908cb>] entry_SYSCALL64_slow_path+0x25/0x25
[<ffffffffffffffff>] 0xffffffffffffffff

[<ffffffff811f5806>] sync_inodes_sb+0xc6/0x300
[<ffffffff811fbd37>] sync_filesystem+0x57/0xa0
[<ffffffff811fbebc>] SyS_syncfs+0x3c/0x70
[<ffffffff8100255c>] do_syscall_64+0x5c/0x170
[<ffffffff817908cb>] entry_SYSCALL64_slow_path+0x25/0x25
[<ffffffffffffffff>] 0xffffffffffffffff

[<ffffffff811f5806>] sync_inodes_sb+0xc6/0x300
[<ffffffff811fbd37>] sync_filesystem+0x57/0xa0
[<ffffffff811fbebc>] SyS_syncfs+0x3c/0x70
[<ffffffff8100255c>] do_syscall_64+0x5c/0x170
[<ffffffff817908cb>] entry_SYSCALL64_slow_path+0x25/0x25
[<ffffffffffffffff>] 0xffffffffffffffff

[<ffffffff811f5806>] sync_inodes_sb+0xc6/0x300
[<ffffffff811fbd37>] sync_filesystem+0x57/0xa0
[<ffffffff811fbebc>] SyS_syncfs+0x3c/0x70
[<ffffffff8100255c>] do_syscall_64+0x5c/0x170
[<ffffffff817908cb>] entry_SYSCALL64_slow_path+0x25/0x25
[<ffffffffffffffff>] 0xffffffffffffffff

[<ffffffff811f5806>] sync_inodes_sb+0xc6/0x300
[<ffffffff811fbac0>] sync_inodes_one_sb+0x10/0x20
[<ffffffff811cdd3f>] iterate_supers+0xaf/0x100
[<ffffffff811fbdb0>] sys_sync+0x30/0x90
[<ffffffff8100255c>] do_syscall_64+0x5c/0x170
[<ffffffff817908cb>] entry_SYSCALL64_slow_path+0x25/0x25
[<ffffffffffffffff>] 0xffffffffffffffff

[<ffffffff8131ae87>] call_rwsem_down_write_failed+0x17/0x30
[<ffffffffa008ed32>] btrfs_fallocate+0xb2/0xfd0 [btrfs]
[<ffffffff811c6c3e>] vfs_fallocate+0x13e/0x220
[<ffffffff811c79f3>] SyS_fallocate+0x43/0x80
[<ffffffff8100255c>] do_syscall_64+0x5c/0x170
[<ffffffff817908cb>] entry_SYSCALL64_slow_path+0x25/0x25
[<ffffffffffffffff>] 0xffffffffffffffff

[<ffffffff811f5806>] sync_inodes_sb+0xc6/0x300
[<ffffffff811fbac0>] sync_inodes_one_sb+0x10/0x20
[<ffffffff811cdd3f>] iterate_supers+0xaf/0x100
[<ffffffff811fbdb0>] sys_sync+0x30/0x90
[<ffffffff8100255c>] do_syscall_64+0x5c/0x170
[<ffffffff817908cb>] entry_SYSCALL64_slow_path+0x25/0x25
[<ffffffffffffffff>] 0xffffffffffffffff

[<ffffffffa009554f>] btrfs_wait_ordered_roots+0x3f/0x200 [btrfs]
[<ffffffffa00470d1>] btrfs_sync_fs+0x31/0xc0 [btrfs]
[<ffffffff811fbcdb>] sync_fs_one_sb+0x1b/0x20
[<ffffffff811cdd3f>] iterate_supers+0xaf/0x100
[<ffffffff811fbdd0>] sys_sync+0x50/0x90
[<ffffffff8100255c>] do_syscall_64+0x5c/0x170
[<ffffffff817908cb>] entry_SYSCALL64_slow_path+0x25/0x25
[<ffffffffffffffff>] 0xffffffffffffffff

[<ffffffff811f5806>] sync_inodes_sb+0xc6/0x300
[<ffffffff811fbac0>] sync_inodes_one_sb+0x10/0x20
[<ffffffff811cdd3f>] iterate_supers+0xaf/0x100
[<ffffffff811fbdb0>] sys_sync+0x30/0x90
[<ffffffff8100255c>] do_syscall_64+0x5c/0x170
[<ffffffff817908cb>] entry_SYSCALL64_slow_path+0x25/0x25
[<ffffffffffffffff>] 0xffffffffffffffff

[<ffffffff811f5806>] sync_inodes_sb+0xc6/0x300
[<ffffffff811fbac0>] sync_inodes_one_sb+0x10/0x20
[<ffffffff811cdd3f>] iterate_supers+0xaf/0x100
[<ffffffff811fbdb0>] sys_sync+0x30/0x90
[<ffffffff8100255c>] do_syscall_64+0x5c/0x170
[<ffffffff817908cb>] entry_SYSCALL64_slow_path+0x25/0x25
[<ffffffffffffffff>] 0xffffffffffffffff

[<ffffffff811f5806>] sync_inodes_sb+0xc6/0x300
[<ffffffff811fbd37>] sync_filesystem+0x57/0xa0
[<ffffffff811fbebc>] SyS_syncfs+0x3c/0x70
[<ffffffff8100255c>] do_syscall_64+0x5c/0x170
[<ffffffff817908cb>] entry_SYSCALL64_slow_path+0x25/0x25
[<ffffffffffffffff>] 0xffffffffffffffff

[<ffffffff811f5806>] sync_inodes_sb+0xc6/0x300
[<ffffffff811fbac0>] sync_inodes_one_sb+0x10/0x20
[<ffffffff811cdd3f>] iterate_supers+0xaf/0x100
[<ffffffff811fbdb0>] sys_sync+0x30/0x90
[<ffffffff8100255c>] do_syscall_64+0x5c/0x170
[<ffffffff817908cb>] entry_SYSCALL64_slow_path+0x25/0x25
[<ffffffffffffffff>] 0xffffffffffffffff

[<ffffffff8131ae87>] call_rwsem_down_write_failed+0x17/0x30
[<ffffffffa008c0f4>] btrfs_file_llseek+0x34/0x290 [btrfs]
[<ffffffff811c9ab5>] SyS_lseek+0x85/0xa0
[<ffffffff8100255c>] do_syscall_64+0x5c/0x170
[<ffffffff817908cb>] entry_SYSCALL64_slow_path+0x25/0x25
[<ffffffffffffffff>] 0xffffffffffffffff

[<ffffffff811f5806>] sync_inodes_sb+0xc6/0x300
[<ffffffff811fbac0>] sync_inodes_one_sb+0x10/0x20
[<ffffffff811cdd3f>] iterate_supers+0xaf/0x100
[<ffffffff811fbdb0>] sys_sync+0x30/0x90
[<ffffffff8100255c>] do_syscall_64+0x5c/0x170
[<ffffffff817908cb>] entry_SYSCALL64_slow_path+0x25/0x25
[<ffffffffffffffff>] 0xffffffffffffffff

[<ffffffff811f5806>] sync_inodes_sb+0xc6/0x300
[<ffffffff811fbd37>] sync_filesystem+0x57/0xa0
[<ffffffff811fbebc>] SyS_syncfs+0x3c/0x70
[<ffffffff8100255c>] do_syscall_64+0x5c/0x170
[<ffffffff817908cb>] entry_SYSCALL64_slow_path+0x25/0x25
[<ffffffffffffffff>] 0xffffffffffffffff

[<ffffffff8131ae87>] call_rwsem_down_write_failed+0x17/0x30
[<ffffffffa0090a0e>] btrfs_sync_file+0x7e/0x360 [btrfs]
[<ffffffff811fbbe6>] vfs_fsync_range+0x46/0xa0
[<ffffffffa00910e0>] btrfs_file_write_iter+0x3f0/0x550 [btrfs]
[<ffffffff811c97d8>] do_iter_readv_writev+0xa8/0x100
[<ffffffff811ca162>] do_readv_writev+0x172/0x210
[<ffffffff811ca42a>] vfs_writev+0x3a/0x50
[<ffffffff811ca5c0>] do_pwritev+0xb0/0xd0
[<ffffffff811cb592>] SyS_pwritev2+0x12/0x20
[<ffffffff8100255c>] do_syscall_64+0x5c/0x170
[<ffffffff817908cb>] entry_SYSCALL64_slow_path+0x25/0x25
[<ffffffffffffffff>] 0xffffffffffffffff

[<ffffffff811ea2d4>] __fdget_pos+0x44/0x50
[<ffffffff811de8ec>] SyS_getdents+0x6c/0x110
[<ffffffff8100255c>] do_syscall_64+0x5c/0x170
[<ffffffff817908cb>] entry_SYSCALL64_slow_path+0x25/0x25
[<ffffffffffffffff>] 0xffffffffffffffff

[<ffffffff8131ae87>] call_rwsem_down_write_failed+0x17/0x30
[<ffffffff811c752a>] do_truncate+0x4a/0x90
[<ffffffff811c790c>] do_sys_ftruncate.constprop.19+0x10c/0x170
[<ffffffff811c7999>] SyS_ftruncate+0x9/0x10
[<ffffffff8100255c>] do_syscall_64+0x5c/0x170
[<ffffffff817908cb>] entry_SYSCALL64_slow_path+0x25/0x25
[<ffffffffffffffff>] 0xffffffffffffffff

[<ffffffff811f5806>] sync_inodes_sb+0xc6/0x300
[<ffffffff811fbac0>] sync_inodes_one_sb+0x10/0x20
[<ffffffff811cdd3f>] iterate_supers+0xaf/0x100
[<ffffffff811fbdb0>] sys_sync+0x30/0x90
[<ffffffff8100255c>] do_syscall_64+0x5c/0x170
[<ffffffff817908cb>] entry_SYSCALL64_slow_path+0x25/0x25
[<ffffffffffffffff>] 0xffffffffffffffff

[<ffffffff811f5806>] sync_inodes_sb+0xc6/0x300
[<ffffffff811fbac0>] sync_inodes_one_sb+0x10/0x20
[<ffffffff811cdd3f>] iterate_supers+0xaf/0x100
[<ffffffff811fbdb0>] sys_sync+0x30/0x90
[<ffffffff8100255c>] do_syscall_64+0x5c/0x170
[<ffffffff817908cb>] entry_SYSCALL64_slow_path+0x25/0x25
[<ffffffffffffffff>] 0xffffffffffffffff

[<ffffffffa0095413>] btrfs_wait_ordered_extents+0x1e3/0x2e0 [btrfs]
[<ffffffffa0095647>] btrfs_wait_ordered_roots+0x137/0x200 [btrfs]
[<ffffffffa00470d1>] btrfs_sync_fs+0x31/0xc0 [btrfs]
[<ffffffff811fbcdb>] sync_fs_one_sb+0x1b/0x20
[<ffffffff811cdd3f>] iterate_supers+0xaf/0x100
[<ffffffff811fbdd0>] sys_sync+0x50/0x90
[<ffffffff8100255c>] do_syscall_64+0x5c/0x170
[<ffffffff817908cb>] entry_SYSCALL64_slow_path+0x25/0x25
[<ffffffffffffffff>] 0xffffffffffffffff

[<ffffffff811f5806>] sync_inodes_sb+0xc6/0x300
[<ffffffff811fbd37>] sync_filesystem+0x57/0xa0
[<ffffffff811fbebc>] SyS_syncfs+0x3c/0x70
[<ffffffff8100255c>] do_syscall_64+0x5c/0x170
[<ffffffff817908cb>] entry_SYSCALL64_slow_path+0x25/0x25
[<ffffffffffffffff>] 0xffffffffffffffff

[<ffffffff81149fbf>] wait_on_page_bit+0xaf/0xc0
[<ffffffff8114a121>] __filemap_fdatawait_range+0x151/0x170
[<ffffffff8114d79c>] filemap_fdatawait_keep_errors+0x1c/0x20
[<ffffffff811f59b3>] sync_inodes_sb+0x273/0x300
[<ffffffff811fbd37>] sync_filesystem+0x57/0xa0
[<ffffffff811fbebc>] SyS_syncfs+0x3c/0x70
[<ffffffff8100255c>] do_syscall_64+0x5c/0x170
[<ffffffff817908cb>] entry_SYSCALL64_slow_path+0x25/0x25
[<ffffffffffffffff>] 0xffffffffffffffff

[<ffffffff811f5806>] sync_inodes_sb+0xc6/0x300
[<ffffffff811fbd37>] sync_filesystem+0x57/0xa0
[<ffffffff811fbebc>] SyS_syncfs+0x3c/0x70
[<ffffffff8100255c>] do_syscall_64+0x5c/0x170
[<ffffffff817908cb>] entry_SYSCALL64_slow_path+0x25/0x25
[<ffffffffffffffff>] 0xffffffffffffffff

[<ffffffff8131ae87>] call_rwsem_down_write_failed+0x17/0x30
[<ffffffffa0090d50>] btrfs_file_write_iter+0x60/0x550 [btrfs]
[<ffffffff811c97d8>] do_iter_readv_writev+0xa8/0x100
[<ffffffff811ca162>] do_readv_writev+0x172/0x210
[<ffffffff811ca42a>] vfs_writev+0x3a/0x50
[<ffffffff811ca5c0>] do_pwritev+0xb0/0xd0
[<ffffffff811cb57c>] SyS_pwritev+0xc/0x10
[<ffffffff8100255c>] do_syscall_64+0x5c/0x170
[<ffffffff817908cb>] entry_SYSCALL64_slow_path+0x25/0x25
[<ffffffffffffffff>] 0xffffffffffffffff

[<ffffffff811ea2d4>] __fdget_pos+0x44/0x50
[<ffffffff811c9a48>] SyS_lseek+0x18/0xa0
[<ffffffff8100255c>] do_syscall_64+0x5c/0x170
[<ffffffff817908cb>] entry_SYSCALL64_slow_path+0x25/0x25
[<ffffffffffffffff>] 0xffffffffffffffff

[<ffffffff811f5806>] sync_inodes_sb+0xc6/0x300
[<ffffffff811fbd37>] sync_filesystem+0x57/0xa0
[<ffffffff811fbebc>] SyS_syncfs+0x3c/0x70
[<ffffffff8100255c>] do_syscall_64+0x5c/0x170
[<ffffffff817908cb>] entry_SYSCALL64_slow_path+0x25/0x25
[<ffffffffffffffff>] 0xffffffffffffffff

[<ffffffff811f5806>] sync_inodes_sb+0xc6/0x300
[<ffffffff811fbd37>] sync_filesystem+0x57/0xa0
[<ffffffff811fbebc>] SyS_syncfs+0x3c/0x70
[<ffffffff8100255c>] do_syscall_64+0x5c/0x170
[<ffffffff817908cb>] entry_SYSCALL64_slow_path+0x25/0x25
[<ffffffffffffffff>] 0xffffffffffffffff

[<ffffffff811f5806>] sync_inodes_sb+0xc6/0x300
[<ffffffff811fbd37>] sync_filesystem+0x57/0xa0
[<ffffffff811fbebc>] SyS_syncfs+0x3c/0x70
[<ffffffff8100255c>] do_syscall_64+0x5c/0x170
[<ffffffff817908cb>] entry_SYSCALL64_slow_path+0x25/0x25
[<ffffffffffffffff>] 0xffffffffffffffff

[<ffffffff811f5806>] sync_inodes_sb+0xc6/0x300
[<ffffffff811fbd37>] sync_filesystem+0x57/0xa0
[<ffffffff811fbebc>] SyS_syncfs+0x3c/0x70
[<ffffffff8100255c>] do_syscall_64+0x5c/0x170
[<ffffffff817908cb>] entry_SYSCALL64_slow_path+0x25/0x25
[<ffffffffffffffff>] 0xffffffffffffffff

[<ffffffff811f5806>] sync_inodes_sb+0xc6/0x300
[<ffffffff811fbd37>] sync_filesystem+0x57/0xa0
[<ffffffff811fbebc>] SyS_syncfs+0x3c/0x70
[<ffffffff8100255c>] do_syscall_64+0x5c/0x170
[<ffffffff817908cb>] entry_SYSCALL64_slow_path+0x25/0x25
[<ffffffffffffffff>] 0xffffffffffffffff

[<ffffffff811f5806>] sync_inodes_sb+0xc6/0x300
[<ffffffff811fbd37>] sync_filesystem+0x57/0xa0
[<ffffffff811fbebc>] SyS_syncfs+0x3c/0x70
[<ffffffff8100255c>] do_syscall_64+0x5c/0x170
[<ffffffff817908cb>] entry_SYSCALL64_slow_path+0x25/0x25
[<ffffffffffffffff>] 0xffffffffffffffff

[<ffffffff811f5806>] sync_inodes_sb+0xc6/0x300
[<ffffffff811fbac0>] sync_inodes_one_sb+0x10/0x20
[<ffffffff811cdd3f>] iterate_supers+0xaf/0x100
[<ffffffff811fbdb0>] sys_sync+0x30/0x90
[<ffffffff8100255c>] do_syscall_64+0x5c/0x170
[<ffffffff817908cb>] entry_SYSCALL64_slow_path+0x25/0x25
[<ffffffffffffffff>] 0xffffffffffffffff

[<ffffffff811f5806>] sync_inodes_sb+0xc6/0x300
[<ffffffff811fbac0>] sync_inodes_one_sb+0x10/0x20
[<ffffffff811cdd3f>] iterate_supers+0xaf/0x100
[<ffffffff811fbdb0>] sys_sync+0x30/0x90
[<ffffffff8100255c>] do_syscall_64+0x5c/0x170
[<ffffffff817908cb>] entry_SYSCALL64_slow_path+0x25/0x25
[<ffffffffffffffff>] 0xffffffffffffffff

[<ffffffff811f5806>] sync_inodes_sb+0xc6/0x300
[<ffffffff811fbd37>] sync_filesystem+0x57/0xa0
[<ffffffff811fbebc>] SyS_syncfs+0x3c/0x70
[<ffffffff8100255c>] do_syscall_64+0x5c/0x170
[<ffffffff817908cb>] entry_SYSCALL64_slow_path+0x25/0x25
[<ffffffffffffffff>] 0xffffffffffffffff

[<ffffffff8131ae87>] call_rwsem_down_write_failed+0x17/0x30
[<ffffffffa0090d50>] btrfs_file_write_iter+0x60/0x550 [btrfs]
[<ffffffff811c8e64>] __vfs_write+0xc4/0x120
[<ffffffff811c9f03>] vfs_write+0xb3/0x1a0
[<ffffffff811cb3d4>] SyS_pwrite64+0x74/0x90
[<ffffffff8100255c>] do_syscall_64+0x5c/0x170
[<ffffffff817908cb>] entry_SYSCALL64_slow_path+0x25/0x25
[<ffffffffffffffff>] 0xffffffffffffffff

[<ffffffff811f5806>] sync_inodes_sb+0xc6/0x300
[<ffffffff811fbd37>] sync_filesystem+0x57/0xa0
[<ffffffff811fbebc>] SyS_syncfs+0x3c/0x70
[<ffffffff8100255c>] do_syscall_64+0x5c/0x170
[<ffffffff817908cb>] entry_SYSCALL64_slow_path+0x25/0x25
[<ffffffffffffffff>] 0xffffffffffffffff

[<ffffffff811f5806>] sync_inodes_sb+0xc6/0x300
[<ffffffff811fbd37>] sync_filesystem+0x57/0xa0
[<ffffffff811fbebc>] SyS_syncfs+0x3c/0x70
[<ffffffff8100255c>] do_syscall_64+0x5c/0x170
[<ffffffff817908cb>] entry_SYSCALL64_slow_path+0x25/0x25
[<ffffffffffffffff>] 0xffffffffffffffff

[<ffffffff811f5806>] sync_inodes_sb+0xc6/0x300
[<ffffffff811fbac0>] sync_inodes_one_sb+0x10/0x20
[<ffffffff811cdd3f>] iterate_supers+0xaf/0x100
[<ffffffff811fbdb0>] sys_sync+0x30/0x90
[<ffffffff8100255c>] do_syscall_64+0x5c/0x170
[<ffffffff817908cb>] entry_SYSCALL64_slow_path+0x25/0x25
[<ffffffffffffffff>] 0xffffffffffffffff

[<ffffffff811f5806>] sync_inodes_sb+0xc6/0x300
[<ffffffff811fbd37>] sync_filesystem+0x57/0xa0
[<ffffffff811fbebc>] SyS_syncfs+0x3c/0x70
[<ffffffff8100255c>] do_syscall_64+0x5c/0x170
[<ffffffff817908cb>] entry_SYSCALL64_slow_path+0x25/0x25
[<ffffffffffffffff>] 0xffffffffffffffff

[<ffffffff811f5806>] sync_inodes_sb+0xc6/0x300
[<ffffffff811fbac0>] sync_inodes_one_sb+0x10/0x20
[<ffffffff811cdd3f>] iterate_supers+0xaf/0x100
[<ffffffff811fbdb0>] sys_sync+0x30/0x90
[<ffffffff8100255c>] do_syscall_64+0x5c/0x170
[<ffffffff817908cb>] entry_SYSCALL64_slow_path+0x25/0x25
[<ffffffffffffffff>] 0xffffffffffffffff

[<ffffffff811f5806>] sync_inodes_sb+0xc6/0x300
[<ffffffff811fbac0>] sync_inodes_one_sb+0x10/0x20
[<ffffffff811cdd3f>] iterate_supers+0xaf/0x100
[<ffffffff811fbdb0>] sys_sync+0x30/0x90
[<ffffffff8100255c>] do_syscall_64+0x5c/0x170
[<ffffffff817908cb>] entry_SYSCALL64_slow_path+0x25/0x25
[<ffffffffffffffff>] 0xffffffffffffffff

[<ffffffff811f5806>] sync_inodes_sb+0xc6/0x300
[<ffffffff811fbac0>] sync_inodes_one_sb+0x10/0x20
[<ffffffff811cdd3f>] iterate_supers+0xaf/0x100
[<ffffffff811fbdb0>] sys_sync+0x30/0x90
[<ffffffff8100255c>] do_syscall_64+0x5c/0x170
[<ffffffff817908cb>] entry_SYSCALL64_slow_path+0x25/0x25
[<ffffffffffffffff>] 0xffffffffffffffff

[<ffffffff811f5806>] sync_inodes_sb+0xc6/0x300
[<ffffffff811fbd37>] sync_filesystem+0x57/0xa0
[<ffffffff811fbebc>] SyS_syncfs+0x3c/0x70
[<ffffffff8100255c>] do_syscall_64+0x5c/0x170
[<ffffffff817908cb>] entry_SYSCALL64_slow_path+0x25/0x25
[<ffffffffffffffff>] 0xffffffffffffffff

[<ffffffff811f5806>] sync_inodes_sb+0xc6/0x300
[<ffffffff811fbac0>] sync_inodes_one_sb+0x10/0x20
[<ffffffff811cdd3f>] iterate_supers+0xaf/0x100
[<ffffffff811fbdb0>] sys_sync+0x30/0x90
[<ffffffff8100255c>] do_syscall_64+0x5c/0x170
[<ffffffff817908cb>] entry_SYSCALL64_slow_path+0x25/0x25
[<ffffffffffffffff>] 0xffffffffffffffff

[<ffffffffa009576b>] btrfs_start_ordered_extent+0x5b/0xb0 [btrfs]
[<ffffffffa008bf5d>] lock_and_cleanup_extent_if_need+0x22d/0x290 [btrfs]
[<ffffffffa008d1e8>] __btrfs_buffered_write+0x1b8/0x6e0 [btrfs]
[<ffffffffa0090e60>] btrfs_file_write_iter+0x170/0x550 [btrfs]
[<ffffffff811c97d8>] do_iter_readv_writev+0xa8/0x100
[<ffffffff811ca162>] do_readv_writev+0x172/0x210
[<ffffffff811ca42a>] vfs_writev+0x3a/0x50
[<ffffffff811ca5c0>] do_pwritev+0xb0/0xd0
[<ffffffff811cb57c>] SyS_pwritev+0xc/0x10
[<ffffffff8100255c>] do_syscall_64+0x5c/0x170
[<ffffffff817908cb>] entry_SYSCALL64_slow_path+0x25/0x25
[<ffffffffffffffff>] 0xffffffffffffffff

[<ffffffff8131ae87>] call_rwsem_down_write_failed+0x17/0x30
[<ffffffffa0090d50>] btrfs_file_write_iter+0x60/0x550 [btrfs]
[<ffffffff811c8e64>] __vfs_write+0xc4/0x120
[<ffffffff811c998d>] __kernel_write+0x4d/0xf0
[<ffffffff811f995d>] write_pipe_buf+0x6d/0x80
[<ffffffff811f9cdf>] __splice_from_pipe+0x12f/0x1b0
[<ffffffff811fabdc>] splice_from_pipe+0x4c/0x70
[<ffffffff811fac34>] default_file_splice_write+0x14/0x20
[<ffffffff811f9081>] direct_splice_actor+0x31/0x40
[<ffffffff811f972c>] splice_direct_to_actor+0xcc/0x1e0
[<ffffffff811f98d0>] do_splice_direct+0x90/0xb0
[<ffffffff811cad30>] do_sendfile+0x1b0/0x390
[<ffffffff811cb7cf>] SyS_sendfile64+0x5f/0xd0
[<ffffffff8100255c>] do_syscall_64+0x5c/0x170
[<ffffffff817908cb>] entry_SYSCALL64_slow_path+0x25/0x25
[<ffffffffffffffff>] 0xffffffffffffffff

[<ffffffff811f5806>] sync_inodes_sb+0xc6/0x300
[<ffffffff811fbac0>] sync_inodes_one_sb+0x10/0x20
[<ffffffff811cdd3f>] iterate_supers+0xaf/0x100
[<ffffffff811fbdb0>] sys_sync+0x30/0x90
[<ffffffff8100255c>] do_syscall_64+0x5c/0x170
[<ffffffff817908cb>] entry_SYSCALL64_slow_path+0x25/0x25
[<ffffffffffffffff>] 0xffffffffffffffff

[<ffffffff81086eee>] do_sigtimedwait+0x16e/0x260
[<ffffffff81087084>] SYSC_rt_sigtimedwait+0xa4/0x100
[<ffffffff810870e9>] SyS_rt_sigtimedwait+0x9/0x10
[<ffffffff8100255c>] do_syscall_64+0x5c/0x170
[<ffffffff817908cb>] entry_SYSCALL64_slow_path+0x25/0x25
[<ffffffffffffffff>] 0xffffffffffffffff

Linus Torvalds · Oct 26, 2016, 3:10:06 PM
On Wed, Oct 26, 2016 at 11:42 AM, Dave Jones <da...@codemonkey.org.uk> wrote:
>
> The stacks show nearly all of them are stuck in sync_inodes_sb

That's just wb_wait_for_completion(), and it means that some IO isn't
completing.

There's also a lot of processes waiting for inode_lock(), and a few
waiting for mnt_want_write()

Ignoring those, we have

> [<ffffffffa009554f>] btrfs_wait_ordered_roots+0x3f/0x200 [btrfs]
> [<ffffffffa00470d1>] btrfs_sync_fs+0x31/0xc0 [btrfs]
> [<ffffffff811fbd4e>] sync_filesystem+0x6e/0xa0
> [<ffffffff811fbebc>] SyS_syncfs+0x3c/0x70
> [<ffffffff8100255c>] do_syscall_64+0x5c/0x170
> [<ffffffff817908cb>] entry_SYSCALL64_slow_path+0x25/0x25
> [<ffffffffffffffff>] 0xffffffffffffffff

Don't know this one. There's a couple of them. Could there be some
ABBA deadlock on the ordered roots waiting?

> [<ffffffff8131ae87>] call_rwsem_down_write_failed+0x17/0x30
> [<ffffffffa008ed32>] btrfs_fallocate+0xb2/0xfd0 [btrfs]
> [<ffffffff811c6c3e>] vfs_fallocate+0x13e/0x220
> [<ffffffff811c79f3>] SyS_fallocate+0x43/0x80
> [<ffffffff8100255c>] do_syscall_64+0x5c/0x170
> [<ffffffff817908cb>] entry_SYSCALL64_slow_path+0x25/0x25
> [<ffffffffffffffff>] 0xffffffffffffffff

This one is also inode_lock(), and is interesting only because it's
fallocate(), which has shown up so many times before..

But there are other threads blocked on do_truncate, or
btrfs_file_write_iter instead, or on lseek, so this is not different
for any other reason.

> [<ffffffff81149fbf>] wait_on_page_bit+0xaf/0xc0
> [<ffffffff8114a121>] __filemap_fdatawait_range+0x151/0x170
> [<ffffffff8114d79c>] filemap_fdatawait_keep_errors+0x1c/0x20
> [<ffffffff811f59b3>] sync_inodes_sb+0x273/0x300
> [<ffffffff811fbd37>] sync_filesystem+0x57/0xa0
> [<ffffffff811fbebc>] SyS_syncfs+0x3c/0x70
> [<ffffffff8100255c>] do_syscall_64+0x5c/0x170
> [<ffffffff817908cb>] entry_SYSCALL64_slow_path+0x25/0x25
> [<ffffffffffffffff>] 0xffffffffffffffff

This is actually waiting on the page. Possibly this is the IO that is
never completing, and keeps the inode lock.

> [<ffffffffa009576b>] btrfs_start_ordered_extent+0x5b/0xb0 [btrfs]
> [<ffffffffa008bf5d>] lock_and_cleanup_extent_if_need+0x22d/0x290 [btrfs]
> [<ffffffffa008d1e8>] __btrfs_buffered_write+0x1b8/0x6e0 [btrfs]
> [<ffffffffa0090e60>] btrfs_file_write_iter+0x170/0x550 [btrfs]
> [<ffffffff811c97d8>] do_iter_readv_writev+0xa8/0x100
> [<ffffffff811ca162>] do_readv_writev+0x172/0x210
> [<ffffffff811ca42a>] vfs_writev+0x3a/0x50
> [<ffffffff811ca5c0>] do_pwritev+0xb0/0xd0
> [<ffffffff811cb57c>] SyS_pwritev+0xc/0x10
> [<ffffffff8100255c>] do_syscall_64+0x5c/0x170
> [<ffffffff817908cb>] entry_SYSCALL64_slow_path+0x25/0x25

Hmm. This is the one that *started* the ordered extents (as opposed to
the ones waiting for it).

I dunno. There might be a lost IO. More likely it's the same
corruption that causes it, it just didn't result in an oops this time.

Linus

Linus Torvalds · Oct 26, 2016, 6:10:05 PM
On Wed, Oct 26, 2016 at 1:00 PM, Chris Mason <c...@fb.com> wrote:
>
> Today I turned off every CONFIG_DEBUG_* except for list debugging, and
> ran dbench 2048:
>
> [ 2759.118711] WARNING: CPU: 2 PID: 31039 at lib/list_debug.c:33 __list_add+0xbe/0xd0
> [ 2759.119652] list_add corruption. prev->next should be next (ffffe8ffffc80308), but was ffffc90000ccfb88. (prev=ffff880128522380).
> [ 2759.121039] Modules linked in: crc32c_intel i2c_piix4 aesni_intel aes_x86_64 virtio_net glue_helper i2c_core lrw floppy gf128mul serio_raw pcspkr button ablk_helper cryptd sch_fq_codel autofs4 virtio_blk
> [ 2759.124369] CPU: 2 PID: 31039 Comm: dbench Not tainted 4.9.0-rc1-15246-g4ce9206-dirty #317
> [ 2759.125077] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 1.9.0-1.fc24 04/01/2014
> [ 2759.125077] ffffc9000f6fb868 ffffffff814fe4ff ffffffff8151cb5e ffffc9000f6fb8c8
> [ 2759.125077] ffffc9000f6fb8c8 0000000000000000 ffffc9000f6fb8b8 ffffffff81064bbf
> [ 2759.127444] ffff880128523680 0000002139968000 ffff880138b7a4a0 ffff880128523540
> [ 2759.127444] Call Trace:
> [ 2759.127444] [<ffffffff814fe4ff>] dump_stack+0x53/0x74
> [ 2759.127444] [<ffffffff8151cb5e>] ? __list_add+0xbe/0xd0
> [ 2759.127444] [<ffffffff81064bbf>] __warn+0xff/0x120
> [ 2759.127444] [<ffffffff81064c99>] warn_slowpath_fmt+0x49/0x50
> [ 2759.127444] [<ffffffff8151cb5e>] __list_add+0xbe/0xd0
> [ 2759.127444] [<ffffffff814df338>] blk_sq_make_request+0x388/0x580

Ok, that's definitely the same one that Dave started out seeing.

The fact that it is that reliable - two different machines, two very
different loads (dbench looks nothing like trinity) really makes me
think that maybe the problem really is in the block plugging after
all.

It very much does not smell like random stack corruption. It's simply
not random enough.

And I just noticed something: I originally thought that this is the
"list_add_tail()" to the plug - which is the only "list_add()" variant
in that function.

But that never made sense, because the whole "but was" isn't a stack
address, and "next" in "list_add_tail()" is basically fixed, and would
have to be the stack.

But I now notice that there's actually another "list_add()" variant
there, and it's the one from __blk_mq_insert_request() that gets
inlined into blk_mq_insert_request(), which then gets inlined into
blk_mq_make_request().

And that actually makes some sense just looking at the offsets too:

blk_sq_make_request+0x388/0x580

so it's somewhat at the end of blk_sq_make_request(). So it's not unlikely.

And there it makes perfect sense that the "next should be" value is
*not* on the stack.

Chris, if you built with debug info, you can try

./scripts/faddr2line /boot/vmlinux blk_sq_make_request+0x388

to get what line that blk_sq_make_request+0x388 address actually is. I
think it's the

list_add_tail(&rq->queuelist, &ctx->rq_list);

in __blk_mq_insert_req_list() (when it's inlined from
blk_sq_make_request(), "at_head" will be false).

So it smells like "&ctx->rq_list" might be corrupt.

And that actually seems much more likely than the "plug" list, because
while the plug is entirely thread-local (and thus shouldn't have any
races), the ctx->rq_list very much is not.

Jens?

For example, should we have a

BUG_ON(ctx != rq->mq_ctx);

in blk_mq_merge_queue_io()? Because it locks ctx->lock, but then
__blk_mq_insert_request() will insert things onto the queues of
rq->mq_ctx.

blk_mq_insert_requests() has similar issues, but there has that BUG_ON().

The locking there really is *very* messy. All the lockers do

spin_lock(&ctx->lock);
...
spin_unlock(&ctx->lock);

but then __blk_mq_insert_request() and __blk_mq_insert_req_list don't
act on "ctx", but on "ctx = rq->mq_ctx", so if you ever get those
wrong, you're completely dead.

Now, I'm not seeing why they'd be wrong, and why they'd be associated
with the VMAP_STACK thing, but it could just be an unlucky timing
thing.

Linus

Linus Torvalds, Oct 26, 2016, 6:30:05 PM
On Wed, Oct 26, 2016 at 2:52 PM, Chris Mason <c...@fb.com> wrote:
>
> This one is special because CONFIG_VMAP_STACK is not set. Btrfs triggers in < 10 minutes.
> I've done 30 minutes each with XFS and Ext4 without luck.

Ok, see the email I wrote that crossed yours - if it's really some
list corruption on ctx->rq_list due to some locking problem, I really
would expect CONFIG_VMAP_STACK to be entirely irrelevant, except
perhaps from a timing standpoint.

> WARNING: CPU: 6 PID: 4481 at lib/list_debug.c:33 __list_add+0xbe/0xd0
> list_add corruption. prev->next should be next (ffffe8ffffd80b08), but was ffff88012b65fb88. (prev=ffff880128c8d500).
> Modules linked in: crc32c_intel aesni_intel aes_x86_64 glue_helper lrw gf128mul ablk_helper i2c_piix4 cryptd i2c_core virtio_net serio_raw floppy button pcspkr sch_fq_codel autofs4 virtio_blk
> CPU: 6 PID: 4481 Comm: dbench Not tainted 4.9.0-rc2-15419-g811d54d #319
> Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 1.9.0-1.fc24 04/01/2014
> ffff880104eff868 ffffffff814fde0f ffffffff8151c46e ffff880104eff8c8
> ffff880104eff8c8 0000000000000000 ffff880104eff8b8 ffffffff810648cf
> ffff880128cab2c0 000000213fc57c68 ffff8801384e8928 ffff880128cab180
> Call Trace:
> [<ffffffff814fde0f>] dump_stack+0x53/0x74
> [<ffffffff8151c46e>] ? __list_add+0xbe/0xd0
> [<ffffffff810648cf>] __warn+0xff/0x120
> [<ffffffff810649a9>] warn_slowpath_fmt+0x49/0x50
> [<ffffffff8151c46e>] __list_add+0xbe/0xd0
> [<ffffffff814dec38>] blk_sq_make_request+0x388/0x580
> [<ffffffff814d5444>] generic_make_request+0x104/0x200

Well, it's very consistent, I have to say. So I really don't think
this is random corruption.

Could you try the attached patch? It adds a couple of sanity tests:

- a number of tests to verify that 'rq->queuelist' isn't already on
some queue when it is added to a queue

- one test to verify that rq->mq_ctx is the same ctx that we have locked.

I may be completely full of shit, and this patch may be pure garbage
or "obviously will never trigger", but humor me.

Linus
patch.diff

Dave Jones, Oct 26, 2016, 6:50:05 PM
On Wed, Oct 26, 2016 at 03:21:53PM -0700, Linus Torvalds wrote:

> Could you try the attached patch? It adds a couple of sanity tests:
>
> - a number of tests to verify that 'rq->queuelist' isn't already on
> some queue when it is added to a queue
>
> - one test to verify that rq->mq_ctx is the same ctx that we have locked.
>
> I may be completely full of shit, and this patch may be pure garbage
> or "obviously will never trigger", but humor me.

I gave it a shot too for shits & giggles.
This falls out during boot.

[ 9.244030] EXT4-fs (sda4): mounted filesystem with ordered data mode. Opts: (null)
[ 9.271391] ------------[ cut here ]------------
[ 9.278420] WARNING: CPU: 0 PID: 1 at block/blk-mq.c:1181 blk_sq_make_request+0x465/0x4a0
[ 9.285613] CPU: 0 PID: 1 Comm: init Not tainted 4.9.0-rc2-think+ #4
[ 9.300106] ffffc90000013848
[ 9.307353] ffffffff8130d27c
[ 9.307420] 0000000000000000
[ 9.307456] 0000000000000000
[ 9.307494] ffffc90000013888
[ 9.314780] ffffffff81077a41
[ 9.314847] 0000049d00013898
[ 9.314884] 0000000000000001
[ 9.314922] 0000000000000002
[ 9.322213] ffffe8ffffc06600
[ 9.322280] ffffe8ffff606600
[ 9.322317] ffff8805015c2e80
[ 9.322355] Call Trace:
[ 9.329689] [<ffffffff8130d27c>] dump_stack+0x4f/0x73
[ 9.337208] [<ffffffff81077a41>] __warn+0xc1/0xe0
[ 9.344730] [<ffffffff81077b18>] warn_slowpath_null+0x18/0x20
[ 9.352282] [<ffffffff812f7975>] blk_sq_make_request+0x465/0x4a0
[ 9.359816] [<ffffffff812eb44a>] ? generic_make_request+0xca/0x210
[ 9.367367] [<ffffffff812eb457>] generic_make_request+0xd7/0x210
[ 9.374931] [<ffffffff812eb5f9>] submit_bio+0x69/0x120
[ 9.382488] [<ffffffff812004ef>] ? submit_bh_wbc+0x16f/0x1e0
[ 9.390083] [<ffffffff812004ef>] submit_bh_wbc+0x16f/0x1e0
[ 9.397673] [<ffffffff81200bfe>] submit_bh+0xe/0x10
[ 9.405205] [<ffffffff81255eac>] __ext4_get_inode_loc+0x1ac/0x3d0
[ 9.412795] [<ffffffff81258f4b>] ext4_iget+0x6b/0xb70
[ 9.420366] [<ffffffff811d6018>] ? lookup_slow+0xf8/0x1f0
[ 9.427947] [<ffffffff81259a7a>] ext4_iget_normal+0x2a/0x30
[ 9.435541] [<ffffffff81264734>] ext4_lookup+0x114/0x230

Linus Torvalds, Oct 26, 2016, 7:00:05 PM
On Wed, Oct 26, 2016 at 3:51 PM, Linus Torvalds
<torv...@linux-foundation.org> wrote:
>
> Dave: it might be a good idea to split that "WARN_ON_ONCE()" in
> blk_mq_merge_queue_io() into two

I did that myself too, since Dave sees this during boot.

But I'm not getting the warning ;(

Dave gets it with ext4, and that's what I have too, so I'm not sure
what the required trigger would be.

Linus

Linus Torvalds, Oct 26, 2016, 7:00:05 PM
On Wed, Oct 26, 2016 at 3:40 PM, Dave Jones <da...@codemonkey.org.uk> wrote:
>
> I gave it a shot too for shits & giggles.
> This falls out during boot.
>
> [ 9.278420] WARNING: CPU: 0 PID: 1 at block/blk-mq.c:1181 blk_sq_make_request+0x465/0x4a0

Hmm. That's the

WARN_ON_ONCE(rq->mq_ctx != ctx);

that I added to blk_mq_merge_queue_io(), and I really think that
warning is valid, and the fact that it triggers shows that something
is wrong with locking.

We just did a

spin_lock(&ctx->lock);

and that lock is *supposed* to protect the __blk_mq_insert_request(),
but that uses rq->mq_ctx.

So if rq->mq_ctx != ctx, then we're locking the wrong context.

Jens - please explain to me why I'm wrong.

Or maybe I actually might have found the problem? In which case please
send me a patch that fixes it ;)

Dave: it might be a good idea to split that "WARN_ON_ONCE()" in
blk_mq_merge_queue_io() into two, since right now it can trigger both
for the

blk_mq_bio_to_request(rq, bio);

path _and_ for the

if (!blk_mq_attempt_merge(q, ctx, bio)) {
blk_mq_bio_to_request(rq, bio);
goto insert_rq;

path. If you split it into two: one before that "insert_rq:" label,
and one before the "goto insert_rq" thing, then we could see if it is
just one of the blk_mq_merge_queue_io() cases (or both) that is
broken..

Linus

Dave Jones, Oct 26, 2016, 7:10:05 PM
On Wed, Oct 26, 2016 at 05:03:45PM -0600, Jens Axboe wrote:
> Actually, I think I see what might trigger it. You are on nvme, iirc,
> and that has a deep queue. Dave, are you testing on a sata drive or
> something similar with a shallower queue depth? If we end up sleeping
> for a request, I think we could trigger data->ctx being different.

yeah, just regular sata. I've been meaning to put an ssd in that box for
a while, but now it sounds like my procrastination may have paid off.

> Dave, can you hit the warnings with this? Totally untested...

Coming up..

Dave

Linus Torvalds, Oct 26, 2016, 7:10:05 PM
On Wed, Oct 26, 2016 at 4:03 PM, Jens Axboe <ax...@fb.com> wrote:
>
> Actually, I think I see what might trigger it. You are on nvme, iirc,
> and that has a deep queue.

Yes. I have long since moved on from slow disks, so all my systems are
not just flash, but m.2 nvme ssd's.

So at least that could explain why Dave sees it at bootup but I don't.

Linus

Dave Jones, Oct 26, 2016, 7:10:05 PM
On Wed, Oct 26, 2016 at 03:51:01PM -0700, Linus Torvalds wrote:
> Dave: it might be a good idea to split that "WARN_ON_ONCE()" in
> blk_mq_merge_queue_io() into two, since right now it can trigger both
> for the
>
> blk_mq_bio_to_request(rq, bio);
>
> path _and_ for the
>
> if (!blk_mq_attempt_merge(q, ctx, bio)) {
> blk_mq_bio_to_request(rq, bio);
> goto insert_rq;
>
> path. If you split it into two: one before that "insert_rq:" label,
> and one before the "goto insert_rq" thing, then we could see if it is
> just one of the blk_mq_merge_queue_io() cases (or both) that is
> broken..

It's the latter of the two.

[ 12.302392] WARNING: CPU: 3 PID: 272 at block/blk-mq.c:1191
blk_sq_make_request+0x320/0x4d0

Dave

Dave Jones, Oct 26, 2016, 7:50:05 PM
On Wed, Oct 26, 2016 at 07:38:08PM -0400, Chris Mason wrote:

> >- hctx->queued++;
> >- data->hctx = hctx;
> >- data->ctx = ctx;
> >+ data->hctx = alloc_data.hctx;
> >+ data->ctx = alloc_data.ctx;
> >+ data->hctx->queued++;
> > return rq;
> > }
>
> This made it through an entire dbench 2048 run on btrfs. My script has
> it running in a loop, but this is farther than I've gotten before.
> Looking great so far.

Fixed the splat during boot for me too.
Now the fun part, let's see if it fixed the 'weird shit' that Trinity
was stumbling on.

Dave

Dave Chinner, Oct 27, 2016, 1:50:05 AM
> [<ffffffff8122e573>] iomap_apply+0x53/0x100
> [<ffffffff8122e68b>] iomap_file_buffered_write+0x6b/0x90

All this iomap code is new, so it's entirely possible that this is a
new bug. The assert failure is indicative that the delalloc extent's
metadata reservation grew when we expected it to stay the same or
shrink.

> XFS: Assertion failed: tp->t_blk_res_used <= tp->t_blk_res, file: fs/xfs/xfs_trans.c, line: 309
...
> [<ffffffffa057abe1>] xfs_trans_mod_sb+0x241/0x280 [xfs]
> [<ffffffffa0548eff>] xfs_ag_resv_alloc_extent+0x4f/0xc0 [xfs]
> [<ffffffffa050aa3d>] xfs_alloc_ag_vextent+0x23d/0x300 [xfs]
> [<ffffffffa050bb1b>] xfs_alloc_vextent+0x5fb/0x6d0 [xfs]
> [<ffffffffa051c1b4>] xfs_bmap_btalloc+0x304/0x8e0 [xfs]
> [<ffffffffa054648e>] ? xfs_iext_bno_to_ext+0xee/0x170 [xfs]
> [<ffffffffa051c8db>] xfs_bmap_alloc+0x2b/0x40 [xfs]
> [<ffffffffa051dc30>] xfs_bmapi_write+0x640/0x1210 [xfs]
> [<ffffffffa0569326>] xfs_iomap_write_allocate+0x166/0x350 [xfs]
> [<ffffffffa05540b0>] xfs_map_blocks+0x1b0/0x260 [xfs]
> [<ffffffffa0554beb>] xfs_do_writepage+0x23b/0x730 [xfs]

And that's indicative of a delalloc metadata reservation being
too small and so we're allocating unreserved blocks.

Different symptoms, same underlying cause, I think.

I see the latter assert from time to time in my testing, but it's
not common (maybe once a month) and I've never been able to track it
down. However, it doesn't affect production systems unless they hit
ENOSPC hard enough that it causes the critical reserve pool to be
exhausted and so the allocation fails. That's extremely rare -
usually takes several hundred processes all trying to write as hard
as they can concurrently, all slipping through the ENOSPC
detection without the correct metadata reservations, and all requiring
multiple metadata blocks to be allocated during writeback...

If you've got a way to trigger it quickly and reliably, that would
be helpful...

Cheers,

Dave.
--
Dave Chinner
da...@fromorbit.com

Christoph Hellwig, Oct 27, 2016, 2:40:07 AM
> Dave, can you hit the warnings with this? Totally untested...

Can we just kill off the unhelpful blk_map_ctx structure, e.g.:


diff --git a/block/blk-mq.c b/block/blk-mq.c
index ddc2eed..d74a74a 100644
--- a/block/blk-mq.c
+++ b/block/blk-mq.c
@@ -1190,21 +1190,15 @@ static inline bool blk_mq_merge_queue_io(struct blk_mq_hw_ctx *hctx,
}
}

-struct blk_map_ctx {
- struct blk_mq_hw_ctx *hctx;
- struct blk_mq_ctx *ctx;
-};
-
static struct request *blk_mq_map_request(struct request_queue *q,
struct bio *bio,
- struct blk_map_ctx *data)
+ struct blk_mq_alloc_data *data)
{
struct blk_mq_hw_ctx *hctx;
struct blk_mq_ctx *ctx;
struct request *rq;
int op = bio_data_dir(bio);
int op_flags = 0;
- struct blk_mq_alloc_data alloc_data;

blk_queue_enter_live(q);
ctx = blk_mq_get_ctx(q);
@@ -1214,12 +1208,10 @@ static struct request *blk_mq_map_request(struct request_queue *q,
op_flags |= REQ_SYNC;

trace_block_getrq(q, bio, op);
- blk_mq_set_alloc_data(&alloc_data, q, 0, ctx, hctx);
- rq = __blk_mq_alloc_request(&alloc_data, op, op_flags);
+ blk_mq_set_alloc_data(data, q, 0, ctx, hctx);
+ rq = __blk_mq_alloc_request(data, op, op_flags);

- hctx->queued++;
- data->hctx = hctx;
- data->ctx = ctx;
+ data->hctx->queued++;
return rq;
}

@@ -1267,7 +1259,7 @@ static blk_qc_t blk_mq_make_request(struct request_queue *q, struct bio *bio)
{
const int is_sync = rw_is_sync(bio_op(bio), bio->bi_opf);
const int is_flush_fua = bio->bi_opf & (REQ_PREFLUSH | REQ_FUA);
- struct blk_map_ctx data;
+ struct blk_mq_alloc_data data;
struct request *rq;
unsigned int request_count = 0;
struct blk_plug *plug;
@@ -1363,7 +1355,7 @@ static blk_qc_t blk_sq_make_request(struct request_queue *q, struct bio *bio)
const int is_flush_fua = bio->bi_opf & (REQ_PREFLUSH | REQ_FUA);
struct blk_plug *plug;
unsigned int request_count = 0;
- struct blk_map_ctx data;
+ struct blk_mq_alloc_data data;
struct request *rq;
blk_qc_t cookie;

Linus Torvalds, Oct 27, 2016, 12:40:05 PM
On Wed, Oct 26, 2016 at 11:33 PM, Christoph Hellwig <h...@infradead.org> wrote:
>> Dave, can you hit the warnings with this? Totally untested...
>
> Can we just kill off the unhelpful blk_map_ctx structure, e.g.:

Yeah, I found that hard to read too. The difference between
blk_map_ctx and blk_mq_alloc_data is minimal, might as well just use
the latter everywhere.

But please separate that patch from the patch that fixes the ctx list
corruption.

And Jens - can I have at least the corruption fix asap? Due to KS
travel, I'm going to do rc3 on Saturday, and I'd like to have the fix
in at least a day before that..

Linus

Dave Jones, Oct 27, 2016, 1:30:04 PM
On Thu, Oct 27, 2016 at 04:41:33PM +1100, Dave Chinner wrote:

> And that's indicative of a delalloc metadata reservation being
> too small and so we're allocating unreserved blocks.
>
> Different symptoms, same underlying cause, I think.
>
> I see the latter assert from time to time in my testing, but it's
> not common (maybe once a month) and I've never been able to track it
> down. However, it doesn't affect production systems unless they hit
> ENOSPC hard enough that it causes the critical reserve pool to be
> exhausted and so the allocation fails. That's extremely rare -
> usually takes several hundred processes all trying to write as hard
> as they can concurrently, all slipping through the ENOSPC
> detection without the correct metadata reservations, and all requiring
> multiple metadata blocks to be allocated during writeback...
>
> If you've got a way to trigger it quickly and reliably, that would
> be helpful...

Seems pretty quickly reproducible for me in some shape or form.
Run trinity with --enable-fds=testfile and create enough children
to create a fair bit of contention (for me, -C64 seems a good fit on
spinning rust, but if you're running on shiny nvme you might have to pump it up a bit).

Dave

Dave Jones, Oct 31, 2016, 3:00:05 PM
It took a while, but.. bad news.


BUG: Bad page state in process kworker/u8:12 pfn:4e0e39
page:ffffea0013838e40 count:0 mapcount:0 mapping:ffff8804a20310e0 index:0x100c
flags: 0x400000000000000c(referenced|uptodate)
page dumped because: non-NULL mapping
CPU: 3 PID: 1586 Comm: kworker/u8:12 Not tainted 4.9.0-rc3-think+ #1
Workqueue: writeback wb_workfn
(flush-btrfs-2)
ffffc90000777828
ffffffff8130d04c
ffffea0013838e40
ffffffff819ff654
ffffc90000777850
ffffffff81150e5f
0000000000000000
ffffea0013838e40
400000000000000c
ffffc90000777860
ffffffff81150f1a
ffffc900007778a8
Call Trace:
[<ffffffff8130d04c>] dump_stack+0x4f/0x73
[<ffffffff81150e5f>] bad_page+0xbf/0x120
[<ffffffff81150f1a>] free_pages_check_bad+0x5a/0x70
[<ffffffff81153818>] free_hot_cold_page+0x248/0x290
[<ffffffff81153b1b>] free_hot_cold_page_list+0x2b/0x50
[<ffffffff8115c52d>] release_pages+0x2bd/0x350
[<ffffffff8115da62>] __pagevec_release+0x22/0x30
[<ffffffffa00a0d4e>] extent_write_cache_pages.isra.48.constprop.63+0x32e/0x400 [btrfs]
[<ffffffffa00a1199>] extent_writepages+0x49/0x60 [btrfs]
[<ffffffffa0081840>] ? btrfs_releasepage+0x40/0x40 [btrfs]
[<ffffffffa007e993>] btrfs_writepages+0x23/0x30 [btrfs]
[<ffffffff8115ac4c>] do_writepages+0x1c/0x30
[<ffffffff811f69f3>] __writeback_single_inode+0x33/0x180
[<ffffffff811f71e8>] writeback_sb_inodes+0x2a8/0x5b0
[<ffffffff811f757d>] __writeback_inodes_wb+0x8d/0xc0
[<ffffffff811f7833>] wb_writeback+0x1e3/0x1f0
[<ffffffff811f7d72>] wb_workfn+0xd2/0x280
[<ffffffff810908d5>] process_one_work+0x1d5/0x490
[<ffffffff81090875>] ? process_one_work+0x175/0x490
[<ffffffff81090bd9>] worker_thread+0x49/0x490
[<ffffffff81090b90>] ? process_one_work+0x490/0x490
[<ffffffff81095d4e>] kthread+0xee/0x110
[<ffffffff81095c60>] ? kthread_park+0x60/0x60
[<ffffffff81790b12>] ret_from_fork+0x22/0x30

Linus Torvalds, Oct 31, 2016, 3:40:05 PM
On Mon, Oct 31, 2016 at 11:55 AM, Dave Jones <da...@codemonkey.org.uk> wrote:
>
> BUG: Bad page state in process kworker/u8:12 pfn:4e0e39
> page:ffffea0013838e40 count:0 mapcount:0 mapping:ffff8804a20310e0 index:0x100c
> flags: 0x400000000000000c(referenced|uptodate)
> page dumped because: non-NULL mapping

Hmm. So this seems to be btrfs-specific, right?

I searched for all your "non-NULL mapping" cases, and they all seem to
have basically the same call trace, with some work thread doing
writeback and going through btrfs_writepages().

Sounds like it's a race with either fallocate hole-punching or
truncate. I'm not seeing it, but I suspect it's btrfs, since DaveJ
clearly ran other filesystems too but I am not seeing this backtrace
for anything else.

Linus

Dave Jones, Nov 6, 2016, 12:00:05 PM
<subject changed, hopefully we're done with bio corruption for now>

On Mon, Oct 31, 2016 at 01:44:55PM -0600, Chris Mason wrote:
> Agreed, I think this is a separate bug, almost certainly btrfs specific.
> I'll work with Dave on a better reproducer.

Still refining my 'capture ftrace when trinity detects taint' feature,
but in the meantime, here's a variant I don't think we've seen before:

general protection fault: 0000 [#1] PREEMPT SMP DEBUG_PAGEALLOC
CPU: 3 PID: 1913 Comm: trinity-c51 Not tainted 4.9.0-rc3-think+ #3
task: ffff880503350040 task.stack: ffffc90000240000
RIP: 0010:[<ffffffffa007c2f6>]
[<ffffffffa007c2f6>] write_ctree_super+0x96/0xb30 [btrfs]
RSP: 0018:ffffc90000243c90 EFLAGS: 00010286
RAX: dae05adadadad000 RBX: 0000000000000000 RCX: 0000000000000002
RDX: ffff8804fdfcc000 RSI: ffff8804edcee313 RDI: ffff8804edcee1c3
RBP: ffffc90000243d00 R08: 0000000000000003 R09: ffff880000000000
R10: 0000000000000001 R11: 0000000000000100 R12: ffff88045151c548
R13: 0000000000000000 R14: ffff8804ee5122a8 R15: ffff8804572267e8
FS: 00007f25c3e0eb40(0000) GS:ffff880507e00000(0000) knlGS:0000000000000000
CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
CR2: 00007f25c1560d44 CR3: 0000000454e20000 CR4: 00000000001406e0
DR0: 00007fee93506000 DR1: 0000000000000000 DR2: 0000000000000000
DR3: 0000000000000000 DR6: 00000000ffff0ff0 DR7: 0000000000000600
Stack:
0000000000000001
ffff88050227b3f8
ffff8804fff01b28
00000001810b7f35
ffffffffa007c265
0000000000000001
ffffc90000243cb8
000000002c9a8645
ffff8804fff01b28
ffff8804fff01b28
ffff88045151c548
0000000000000000
Call Trace:
[<ffffffffa007c265>] ? write_ctree_super+0x5/0xb30 [btrfs]
[<ffffffffa00d2956>] btrfs_sync_log+0x886/0xa60 [btrfs]
[<ffffffffa009f4f9>] btrfs_sync_file+0x479/0x4d0 [btrfs]
[<ffffffff812789ab>] vfs_fsync_range+0x4b/0xb0
[<ffffffff81260755>] ? __fget_light+0x5/0x60
[<ffffffff81278a6d>] do_fsync+0x3d/0x70
[<ffffffff81278a35>] ? do_fsync+0x5/0x70
[<ffffffff81278d20>] SyS_fsync+0x10/0x20
[<ffffffff81002d81>] do_syscall_64+0x61/0x170
[<ffffffff81894a8b>] entry_SYSCALL64_slow_path+0x25/0x25
Code: c7 48 8b 42 30 4c 8b 08 48 b8 00 00 00 00 00 16 00 00 49 03 81 a0 01 00 00 49 b9 00 00 00 00 00 88 ff ff 48 c1 f8 06 48 c1 e0 0c <4a> 8b 44 08 50 48 39 46 08 0f 84 8d 08 00 00 49 63 c0 48 8d 0c
RIP
[<ffffffffa007c2f6>] write_ctree_super+0x96/0xb30 [btrfs]
RSP <ffffc90000243c90>

All code
========
0: c7 (bad)
1: 48 8b 42 30 mov 0x30(%rdx),%rax
5: 4c 8b 08 mov (%rax),%r9
8: 48 b8 00 00 00 00 00 movabs $0x160000000000,%rax
f: 16 00 00
12: 49 03 81 a0 01 00 00 add 0x1a0(%r9),%rax
19: 49 b9 00 00 00 00 00 movabs $0xffff880000000000,%r9
20: 88 ff ff
23: 48 c1 f8 06 sar $0x6,%rax
27: 48 c1 e0 0c shl $0xc,%rax
2b:* 4a 8b 44 08 50 mov 0x50(%rax,%r9,1),%rax <-- trapping instruction
30: 48 39 46 08 cmp %rax,0x8(%rsi)
34: 0f 84 8d 08 00 00 je 0x8c7
3a: 49 63 c0 movslq %r8d,%rax
3d: 48 rex.W
3e: 8d .byte 0x8d
3f: 0c .byte 0xc

Code starting with the faulting instruction
===========================================
0: 4a 8b 44 08 50 mov 0x50(%rax,%r9,1),%rax
5: 48 39 46 08 cmp %rax,0x8(%rsi)
9: 0f 84 8d 08 00 00 je 0x89c
f: 49 63 c0 movslq %r8d,%rax
12: 48 rex.W
13: 8d .byte 0x8d
14: 0c .byte 0xc


According to objdump -S, it looks like this is an inlined copy of backup_super_roots

root_backup = info->super_for_commit->super_roots + last_backup;
2706: 48 8d b8 2b 0b 00 00 lea 0xb2b(%rax),%rdi
270d: 48 63 c1 movslq %ecx,%rax
2710: 48 8d 34 80 lea (%rax,%rax,4),%rsi
2714: 48 8d 04 b0 lea (%rax,%rsi,4),%rax
2718: 48 8d 34 c7 lea (%rdi,%rax,8),%rsi
btrfs_header_generation(info->tree_root->node))
271c: 48 8b 42 30 mov 0x30(%rdx),%rax
2720: 4c 8b 08 mov (%rax),%r9
2723: 48 b8 00 00 00 00 00 movabs $0x160000000000,%rax
272a: 16 00 00
272d: 49 03 81 a0 01 00 00 add 0x1a0(%r9),%rax
if (btrfs_backup_tree_root_gen(root_backup) ==
2734: 49 b9 00 00 00 00 00 movabs $0xffff880000000000,%r9
273b: 88 ff ff
273e: 48 c1 f8 06 sar $0x6,%rax
2742: 48 c1 e0 0c shl $0xc,%rax
2746: 4a 8b 44 08 50 mov 0x50(%rax,%r9,1),%rax <-- trapping instruction
274b: 48 39 46 08 cmp %rax,0x8(%rsi)
274f: 0f 84 8d 08 00 00 je 2fe2 <write_ctree_super+0x932>

Dave Jones, Nov 8, 2016, 10:00:07 AM
On Sun, Nov 06, 2016 at 11:55:39AM -0500, Dave Jones wrote:
> <subject changed, hopefully we're done with bio corruption for now>
>
> On Mon, Oct 31, 2016 at 01:44:55PM -0600, Chris Mason wrote:
> > On Mon, Oct 31, 2016 at 12:35:16PM -0700, Linus Torvalds wrote:
> > >On Mon, Oct 31, 2016 at 11:55 AM, Dave Jones <da...@codemonkey.org.uk> wrote:
> > >>
> > >> BUG: Bad page state in process kworker/u8:12 pfn:4e0e39
> > >> page:ffffea0013838e40 count:0 mapcount:0 mapping:ffff8804a20310e0 index:0x100c
> > >> flags: 0x400000000000000c(referenced|uptodate)
> > >> page dumped because: non-NULL mapping
> > >
> > >Hmm. So this seems to be btrfs-specific, right?
> > >
> > >I searched for all your "non-NULL mapping" cases, and they all seem to
> > >have basically the same call trace, with some work thread doing
> > >writeback and going through btrfs_writepages().
> > >
> > >Sounds like it's a race with either fallocate hole-punching or
> > >truncate. I'm not seeing it, but I suspect it's btrfs, since DaveJ
> > >clearly ran other filesystems too but I am not seeing this backtrace
> > >for anything else.
> >
> > Agreed, I think this is a separate bug, almost certainly btrfs specific.
> > I'll work with Dave on a better reproducer.
>
> Still refining my 'capture ftrace when trinity detects taint' feature,
> but in the meantime, here's a variant I don't think we've seen before:

And another new one:

kernel BUG at fs/btrfs/ctree.c:3172!
invalid opcode: 0000 [#1] PREEMPT SMP DEBUG_PAGEALLOC
CPU: 0 PID: 22702 Comm: trinity-c40 Not tainted 4.9.0-rc4-think+ #1
task: ffff8804ffde37c0 task.stack: ffffc90002188000
RIP: 0010:[<ffffffffa00576b9>]
[<ffffffffa00576b9>] btrfs_set_item_key_safe+0x179/0x190 [btrfs]
RSP: 0000:ffffc9000218b8a8 EFLAGS: 00010246
RAX: 0000000000000000 RBX: ffff8804fddcf348 RCX: 0000000000001000
RDX: 0000000000000000 RSI: ffffc9000218b9ce RDI: ffffc9000218b8c7
RBP: ffffc9000218b908 R08: 0000000000004000 R09: ffffc9000218b8c8
R10: 0000000000000000 R11: 0000000000000001 R12: ffffc9000218b8b6
R13: ffffc9000218b9ce R14: 0000000000000001 R15: ffff880480684a88
FS: 00007f7c7f998b40(0000) GS:ffff880507800000(0000) knlGS:0000000000000000
CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
CR2: 0000000000000000 CR3: 000000044f15f000 CR4: 00000000001406f0
DR0: 00007f4ce439d000 DR1: 0000000000000000 DR2: 0000000000000000
DR3: 0000000000000000 DR6: 00000000ffff0ff0 DR7: 0000000000000600
Stack:
ffff880501430000 d305ffffa00a2245 006c000000000002 0500000000000010
6c000000000002d3 0000000000001000 000000006427eebb ffff880480684a88
0000000000000000 ffff8804fddcf348 0000000000002000 0000000000000000
Call Trace:
[<ffffffffa009cff0>] __btrfs_drop_extents+0xb00/0xe30 [btrfs]
[<ffffffff8116c80c>] ? function_trace_call+0x13c/0x190
[<ffffffffa009c4f5>] ? __btrfs_drop_extents+0x5/0xe30 [btrfs]
[<ffffffff810e2b00>] ? do_raw_write_lock+0xb0/0xc0
[<ffffffffa00cd43d>] btrfs_log_changed_extents+0x35d/0x630 [btrfs]
[<ffffffffa00a6a74>] ? release_extent_buffer+0xa4/0x110 [btrfs]
[<ffffffffa00cd0e5>] ? btrfs_log_changed_extents+0x5/0x630 [btrfs]
[<ffffffffa00d1085>] btrfs_log_inode+0xb05/0x11d0 [btrfs]
[<ffffffff8116536c>] ? trace_function+0x6c/0x80
[<ffffffffa00d0580>] ? log_directory_changes+0xc0/0xc0 [btrfs]
[<ffffffffa00d1a20>] ? btrfs_log_inode_parent+0x240/0x940 [btrfs]
[<ffffffff8116c80c>] ? function_trace_call+0x13c/0x190
[<ffffffffa00d1a20>] btrfs_log_inode_parent+0x240/0x940 [btrfs]
[<ffffffffa00d17e5>] ? btrfs_log_inode_parent+0x5/0x940 [btrfs]
[<ffffffff81259131>] ? dget_parent+0x71/0x150
[<ffffffffa00d3102>] btrfs_log_dentry_safe+0x62/0x80 [btrfs]
[<ffffffffa009f404>] btrfs_sync_file+0x344/0x4d0 [btrfs]
[<ffffffff81278a1b>] vfs_fsync_range+0x4b/0xb0
[<ffffffff812607c5>] ? __fget_light+0x5/0x60
[<ffffffff81278add>] do_fsync+0x3d/0x70
[<ffffffff81278aa5>] ? do_fsync+0x5/0x70
[<ffffffff81278db3>] SyS_fdatasync+0x13/0x20
[<ffffffff81002d81>] do_syscall_64+0x61/0x170
[<ffffffff81894ccb>] entry_SYSCALL64_slow_path+0x25/0x25
Code: 48 8b 45 b7 48 8d 7d bf 4c 89 ee 48 89 45 c8 0f b6 45 b6 88 45 c7 48 8b 45 ae 48 89 45 bf e8 af f2 ff ff 85 c0 0f 8f 43 ff ff ff <0f> 0b 0f 0b e8 ee f3 02 e1 0f 1f 40 00 66 2e 0f 1f 84 00 00 00



Unfortunately, because this was a BUG_ON, it locked up the box so it
didn't save any additional debug info. Tempted to see if making BUG_ON
a no-op will at least let it live long enough to save the ftrace buffer.

Given this seems to be mutating every time I see something go wrong,
I'm wondering if this is fallout from memory corruption again.

Dave

Dave Jones, Nov 10, 2016, 9:40:05 AM
On Tue, Nov 08, 2016 at 10:08:04AM -0500, Chris Mason wrote:

> > And another new one:
> >
> > kernel BUG at fs/btrfs/ctree.c:3172!
> >
> > Call Trace:
> > [<ffffffffa009cff0>] __btrfs_drop_extents+0xb00/0xe30 [btrfs]
>
> We've been hunting this one for at least two years. It's the white
> whale of btrfs bugs. Josef has a semi-reliable reproducer now, but I
> think it's not the same as the pagevec based problems you reported earlier.

Great, now for whatever reason, I'm hitting this over and over.

Even better, after the last time I hit it, it rebooted and this happened during boot...

BTRFS info (device sda6): disk space caching is enabled
BTRFS info (device sda6): has skinny extents
BTRFS info (device sda3): disk space caching is enabled
------------[ cut here ]------------
WARNING: CPU: 1 PID: 443 at fs/btrfs/file.c:546 btrfs_drop_extent_cache+0x411/0x420 [btrfs]
CPU: 1 PID: 443 Comm: mount Not tainted 4.9.0-rc4-think+ #1
ffffc90000c4b468 ffffffff813b66bc 0000000000000000 0000000000000000
ffffc90000c4b4a8 ffffffff81086d2b 0000022200c4b488 000000000002f265
40c8dded1afd6000 ffff8804ff5cddc8 ffff8804ef26f2b8 40c8dded1afd5000
Call Trace:
[<ffffffff813b66bc>] dump_stack+0x4f/0x73
[<ffffffff81086d2b>] __warn+0xcb/0xf0
[<ffffffff81086e5d>] warn_slowpath_null+0x1d/0x20
[<ffffffffa009c0f1>] btrfs_drop_extent_cache+0x411/0x420 [btrfs]
[<ffffffff81215923>] ? alloc_debug_processing+0x73/0x1b0
[<ffffffffa009c93f>] __btrfs_drop_extents+0x44f/0xe30 [btrfs]
[<ffffffffa005426a>] ? btrfs_alloc_path+0x1a/0x20 [btrfs]
[<ffffffffa005426a>] ? btrfs_alloc_path+0x1a/0x20 [btrfs]
[<ffffffff8121842a>] ? kmem_cache_alloc+0x2aa/0x330
[<ffffffffa005426a>] ? btrfs_alloc_path+0x1a/0x20 [btrfs]
[<ffffffffa009e399>] btrfs_drop_extents+0x79/0xa0 [btrfs]
[<ffffffffa00ce4d1>] replay_one_extent+0x1e1/0x710 [btrfs]
[<ffffffffa00cec6d>] replay_one_buffer+0x26d/0x7e0 [btrfs]
[<ffffffff8121732c>] ? ___slab_alloc.constprop.83+0x27c/0x5c0
[<ffffffffa005426a>] ? btrfs_alloc_path+0x1a/0x20 [btrfs]
[<ffffffff813d5b87>] ? debug_smp_processor_id+0x17/0x20
[<ffffffffa00ca3db>] walk_up_log_tree+0xeb/0x240 [btrfs]
[<ffffffffa00ca5d6>] walk_log_tree+0xa6/0x1d0 [btrfs]
[<ffffffffa00d32fc>] btrfs_recover_log_trees+0x1dc/0x460 [btrfs]
[<ffffffffa00cea00>] ? replay_one_extent+0x710/0x710 [btrfs]
[<ffffffffa0081f65>] open_ctree+0x2575/0x2670 [btrfs]
[<ffffffffa005144b>] btrfs_mount+0xd0b/0xe10 [btrfs]
[<ffffffff811da804>] ? pcpu_alloc+0x2d4/0x660
[<ffffffff810dce41>] ? lockdep_init_map+0x61/0x200
[<ffffffff810d39eb>] ? __init_waitqueue_head+0x3b/0x50
[<ffffffff81243794>] mount_fs+0x14/0xa0
[<ffffffff8126305b>] vfs_kern_mount+0x6b/0x150
[<ffffffffa0050a08>] btrfs_mount+0x2c8/0xe10 [btrfs]
[<ffffffff811da804>] ? pcpu_alloc+0x2d4/0x660
[<ffffffff810dce41>] ? lockdep_init_map+0x61/0x200
[<ffffffff810dce41>] ? lockdep_init_map+0x61/0x200
[<ffffffff810d39eb>] ? __init_waitqueue_head+0x3b/0x50
[<ffffffff81243794>] mount_fs+0x14/0xa0
[<ffffffff8126305b>] vfs_kern_mount+0x6b/0x150
[<ffffffff81265bd2>] do_mount+0x1c2/0xda0
[<ffffffff811d41c0>] ? memdup_user+0x60/0x90
[<ffffffff81266ac3>] SyS_mount+0x83/0xd0
[<ffffffff81002d81>] do_syscall_64+0x61/0x170
[<ffffffff81894ccb>] entry_SYSCALL64_slow_path+0x25/0x25
---[ end trace d3fa03bb9c115bbe ]---
BTRFS: error (device sda3) in btrfs_replay_log:2491: errno=-17 Object already exists (Failed to recover log tree)
BTRFS error (device sda3): cleaner transaction attach returned -30
BTRFS error (device sda3): open_ctree failed


Guess I'll hit it with btrfsck and hope for the best..

Dave

Dave Jones, Nov 23, 2016, 2:40:05 PM
On Mon, Oct 31, 2016 at 01:44:55PM -0600, Chris Mason wrote:
> On Mon, Oct 31, 2016 at 12:35:16PM -0700, Linus Torvalds wrote:
> Agreed, I think this is a separate bug, almost certainly btrfs specific.
> I'll work with Dave on a better reproducer.

Ok, let's try this..
Here's the usual dump, from the current tree:


[ 317.689216] BUG: Bad page state in process kworker/u8:8 pfn:4d8fd4
[ 317.702235] page:ffffea001363f500 count:0 mapcount:0 mapping:ffff8804eef8e0e0 index:0x29
[ 317.718068] flags: 0x400000000000000c(referenced|uptodate)
[ 317.731086] page dumped because: non-NULL mapping
[ 317.744118] CPU: 3 PID: 1558 Comm: kworker/u8:8 Not tainted 4.9.0-rc6-think+ #1
[ 317.770404] Workqueue: writeback wb_workfn
[ 317.783490] (flush-btrfs-3)
[ 317.783581] ffffc90000ff3798
[ 317.796651] ffffffff813b69bc
[ 317.796742] ffffea001363f500
[ 317.796798] ffffffff81c474ee
[ 317.796854] ffffc90000ff37c0
[ 317.809910] ffffffff811b30c4
[ 317.810001] 0000000000000000
[ 317.810055] ffffea001363f500
[ 317.810112] 0000000000000003
[ 317.823230] ffffc90000ff37d0
[ 317.823320] ffffffff811b318f
[ 317.823376] ffffc90000ff3818
[ 317.823431] Call Trace:
[ 317.836645] [<ffffffff813b69bc>] dump_stack+0x4f/0x73
[ 317.850063] [<ffffffff811b30c4>] bad_page+0xc4/0x130
[ 317.863426] [<ffffffff811b318f>] free_pages_check_bad+0x5f/0x70
[ 317.876805] [<ffffffff811b61f5>] free_hot_cold_page+0x305/0x3b0
[ 317.890132] [<ffffffff811b660e>] free_hot_cold_page_list+0x7e/0x170
[ 317.903410] [<ffffffff811c08b7>] release_pages+0x2e7/0x3a0
[ 317.916600] [<ffffffff811c2067>] __pagevec_release+0x27/0x40
[ 317.929817] [<ffffffffa00b0a55>] extent_write_cache_pages.isra.44.constprop.59+0x355/0x430 [btrfs]
[ 317.943073] [<ffffffffa00b0fcd>] extent_writepages+0x5d/0x90 [btrfs]
[ 317.956033] [<ffffffffa008ea60>] ? btrfs_releasepage+0x40/0x40 [btrfs]
[ 317.968902] [<ffffffffa008b6a8>] btrfs_writepages+0x28/0x30 [btrfs]
[ 317.981744] [<ffffffff811be6d1>] do_writepages+0x21/0x30
[ 317.994524] [<ffffffff812727cf>] __writeback_single_inode+0x7f/0x6e0
[ 318.007333] [<ffffffff81894cc1>] ? _raw_spin_unlock+0x31/0x50
[ 318.020090] [<ffffffff8127356d>] writeback_sb_inodes+0x31d/0x6c0
[ 318.032844] [<ffffffff812739a2>] __writeback_inodes_wb+0x92/0xc0
[ 318.045623] [<ffffffff81273e97>] wb_writeback+0x3f7/0x530
[ 318.058304] [<ffffffff81274a38>] wb_workfn+0x148/0x620
[ 318.070949] [<ffffffff810a4b54>] ? process_one_work+0x194/0x690
[ 318.083549] [<ffffffff812748f5>] ? wb_workfn+0x5/0x620
[ 318.096160] [<ffffffff810a4bf2>] process_one_work+0x232/0x690
[ 318.108722] [<ffffffff810a4b54>] ? process_one_work+0x194/0x690
[ 318.121225] [<ffffffff810a509e>] worker_thread+0x4e/0x490
[ 318.133680] [<ffffffff810a5050>] ? process_one_work+0x690/0x690
[ 318.146126] [<ffffffff810a5050>] ? process_one_work+0x690/0x690
[ 318.158507] [<ffffffff810aa7e2>] kthread+0x102/0x120
[ 318.170880] [<ffffffff810aa6e0>] ? kthread_park+0x60/0x60
[ 318.183254] [<ffffffff81895822>] ret_from_fork+0x22/0x30


This time, I managed to get my ftrace magic working, so we have this
trace from just before this happened. Does this shed any light?

https://codemonkey.org.uk/junk/trace.txt

(note that some of the spinlock/rcu calls will be missing here. I chose
to add them to ftrace's exclude list because they dominate the trace
buffer so much, especially in debug builds)
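For reference, the exclusion described above goes through ftrace's set_ftrace_notrace file. A minimal sketch, assuming tracefs is mounted at the usual debugfs path (the helper name ftrace_exclude is mine, not trinity's):

```c
#include <stdio.h>

/* Append a glob pattern to ftrace's exclude list so that matching
 * functions never land in the trace buffer. The path assumes tracefs
 * is mounted under debugfs; returns 0 on success, -1 otherwise. */
int ftrace_exclude(const char *pattern)
{
	FILE *f = fopen("/sys/kernel/debug/tracing/set_ftrace_notrace", "a");

	if (!f)
		return -1;	/* not root, or tracefs not mounted here */
	fprintf(f, "%s\n", pattern);
	return fclose(f) ? -1 : 0;
}
```

Something like ftrace_exclude("_raw_spin_*") followed by ftrace_exclude("rcu_*") would keep the lock and RCU helpers from dominating the buffer, as described.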

Dave

Dave Jones

Nov 23, 2016, 3:00:05 PM
On Wed, Nov 23, 2016 at 02:34:19PM -0500, Dave Jones wrote:

> [ 317.689216] BUG: Bad page state in process kworker/u8:8 pfn:4d8fd4
> trace from just before this happened. Does this shed any light ?
>
> https://codemonkey.org.uk/junk/trace.txt

crap, I just noticed the timestamps in the trace come from quite a bit
later. I'll tweak the code to do the taint checking/ftrace stop after
every syscall, that should narrow the window some more.
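The taint check mentioned here boils down to polling /proc/sys/kernel/tainted and freezing the trace buffer the moment the value changes. A rough sketch of that logic (helper names are mine; trinity's actual implementation differs):

```c
#include <stdio.h>

/* Read the kernel taint bitmask; 0 means the kernel is clean. */
long read_tainted(void)
{
	long t = 0;
	FILE *f = fopen("/proc/sys/kernel/tainted", "r");

	if (f) {
		if (fscanf(f, "%ld", &t) != 1)
			t = 0;
		fclose(f);
	}
	return t;
}

/* Freeze the ftrace ring buffer so the events leading up to the
 * problem are preserved (path assumes tracefs under debugfs). */
void stop_tracing(void)
{
	FILE *f = fopen("/sys/kernel/debug/tracing/tracing_on", "w");

	if (f) {
		fputs("0", f);
		fclose(f);
	}
}

/* Called after every syscall: returns 1 (and stops tracing) as soon
 * as the taint value differs from the baseline taken at startup. */
int check_taint(long baseline)
{
	if (read_tainted() != baseline) {
		stop_tracing();
		return 1;
	}
	return 0;
}
```

Checking after every syscall rather than on a timer is what narrows the window between the triggering call and the end of the captured trace.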

Getting closer..

Dave

Dave Jones

Dec 1, 2016, 10:40:07 AM
Ok, this is getting more like it.
http://codemonkey.org.uk/junk/btrfs-destroy-inode-outstanding-extents.txt

Also same bug, different run, but a different traceview http://codemonkey.org.uk/junk/btrfs-destroy-inode-outstanding-extents-function-graph.txt

(function-graph screws up the RIP for some reason, 'return_to_handler'
should actually be btrfs_destroy_inode)


Anyways, I've got some code that works pretty well for dumping the
ftrace buffer now when things go awry. I just need to run it enough
times that I hit that bad page state instead of this, or a lockdep bug first.
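Dumping the buffer after the fact is essentially a copy of the tracing/trace file to stable storage before the box goes down. A minimal sketch, again assuming the usual tracefs mount (the function name is mine):

```c
#include <stdio.h>

/* Snapshot the current ftrace buffer into a regular file for later
 * analysis. Returns the number of bytes copied, or -1 on error. */
long dump_ftrace_buffer(const char *dest)
{
	char buf[4096];
	size_t n;
	long total = 0;
	FILE *in = fopen("/sys/kernel/debug/tracing/trace", "r");
	FILE *out;

	if (!in)
		return -1;
	out = fopen(dest, "w");
	if (!out) {
		fclose(in);
		return -1;
	}
	while ((n = fread(buf, 1, sizeof(buf), in)) > 0) {
		fwrite(buf, 1, n, out);
		total += n;
	}
	fclose(in);
	fclose(out);
	return total;
}
```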

Dave

Dave Jones

Dec 3, 2016, 12:00:05 PM
On Thu, Dec 01, 2016 at 10:32:09AM -0500, Dave Jones wrote:
> http://codemonkey.org.uk/junk/btrfs-destroy-inode-outstanding-extents.txt
>
> Also same bug, different run, but a different traceview http://codemonkey.org.uk/junk/btrfs-destroy-inode-outstanding-extents-function-graph.txt
>
> (function-graph screws up the RIP for some reason, 'return_to_handler'
> should actually be btrfs_destroy_inode)

Chris pointed me at a pending patch that took care of this warning.

> Anyways, I've got some code that works pretty well for dumping the
> ftrace buffer now when things go awry. I just need to run it enough
> times that I hit that bad page state instead of this, or a lockdep bug first.

Which allowed me to run long enough to get this trace..
http://codemonkey.org.uk/junk/bad-page-state.txt

Does this shed any light?
The interesting process here seems to be kworker/u8:17, and the trace
captures some of what that was doing before that bad page was hit.

I've got another run going now, so I'll compare that trace when it
happens to see if it matches my current theory that it's something to
do with that btrfs_scrubparity_helper. I've seen that show up in stack
traces a few times while chasing this.

Dave

Vegard Nossum

Dec 4, 2016, 6:10:05 PM
FWIW I hit this as well:

BUG: unable to handle kernel paging request at ffffffff81ff08b7
IP: [<ffffffff8135f2ea>] __lock_acquire.isra.32+0xda/0x1a30
PGD 461e067 PUD 461f063
PMD 1e001e1
Oops: 0003 [#1] PREEMPT SMP KASAN
Dumping ftrace buffer:
(ftrace buffer empty)
CPU: 0 PID: 21744 Comm: trinity-c56 Tainted: G B 4.9.0-rc7+ #217
Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS
rel-1.9.3-0-ge2fc41e-prebuilt.qemu-project.org 04/01/2014
task: ffff8801ee924080 task.stack: ffff8801bab88000
RIP: 0010:[<ffffffff8135f2ea>] [<ffffffff8135f2ea>]
__lock_acquire.isra.32+0xda/0x1a30
RSP: 0018:ffff8801bab8f730 EFLAGS: 00010082
RAX: ffffffff81ff071f RBX: 0000000000000000 RCX: 0000000000000000
RDX: 0000000000000003 RSI: 0000000000000000 RDI: ffffffff85ae7d00
RBP: ffff8801bab8f7b0 R08: 0000000000000001 R09: 0000000000000000
R10: ffff8801e727fc40 R11: fffffbfff0b54ced R12: ffffffff85ae7d00
R13: ffffffff84912920 R14: ffff8801ee924080 R15: 0000000000000000
FS: 00007f37ee653700(0000) GS:ffff8801f6a00000(0000) knlGS:0000000000000000
CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
CR2: ffffffff81ff08b7 CR3: 00000001daa70000 CR4: 00000000000006f0
DR0: 00007f37ee465000 DR1: 0000000000000000 DR2: 0000000000000000
DR3: 0000000000000000 DR6: 00000000ffff0ff0 DR7: 0000000000000600
Stack:
ffff8801ee9247d0 0000000000000000 0000000100000000 ffff8801ee924080
ffff8801f6a201c0 ffff8801f6a201c0 0000000000000000 0000000000000001
ffff880100000000 ffff880100000000 ffff8801e727fc40 ffff8801ee924080
Call Trace:
[<ffffffff81361751>] lock_acquire+0x141/0x2b0
[<ffffffff813530c0>] ? finish_wait+0xb0/0x180
[<ffffffff83c95b29>] _raw_spin_lock_irqsave+0x49/0x60
[<ffffffff813530c0>] ? finish_wait+0xb0/0x180
[<ffffffff813530c0>] finish_wait+0xb0/0x180
[<ffffffff81576227>] shmem_fault+0x4c7/0x6b0
[<ffffffff83a9b7cd>] ? p9_client_rpc+0x13d/0xf40
[<ffffffff81575d60>] ? shmem_getpage_gfp+0x1c90/0x1c90
[<ffffffff81fbe777>] ? radix_tree_next_chunk+0x4f7/0x840
[<ffffffff81352150>] ? wake_atomic_t_function+0x210/0x210
[<ffffffff815ad316>] __do_fault+0x206/0x410
[<ffffffff815ad110>] ? do_page_mkwrite+0x320/0x320
[<ffffffff815b9bcf>] handle_mm_fault+0x1cef/0x2a60
[<ffffffff815b8012>] ? handle_mm_fault+0x132/0x2a60
[<ffffffff815b7ee0>] ? __pmd_alloc+0x370/0x370
[<ffffffff81692e7e>] ? inode_add_bytes+0x10e/0x160
[<ffffffff8162de11>] ? memset+0x31/0x40
[<ffffffff815cba10>] ? find_vma+0x30/0x150
[<ffffffff812373a2>] __do_page_fault+0x452/0x9f0
[<ffffffff81ff071f>] ? iov_iter_init+0xaf/0x1d0
[<ffffffff81237bf5>] trace_do_page_fault+0x1e5/0x3a0
[<ffffffff8122a007>] do_async_page_fault+0x27/0xa0
[<ffffffff83c97618>] async_page_fault+0x28/0x30
[<ffffffff82059341>] ? strnlen_user+0x91/0x1a0
[<ffffffff8205931e>] ? strnlen_user+0x6e/0x1a0
[<ffffffff8157e038>] strndup_user+0x28/0xb0
[<ffffffff81d83c17>] SyS_add_key+0xc7/0x370
[<ffffffff81d83b50>] ? key_get_type_from_user.constprop.6+0xd0/0xd0
[<ffffffff815143ea>] ? __context_tracking_exit.part.4+0x3a/0x1e0
[<ffffffff81d83b50>] ? key_get_type_from_user.constprop.6+0xd0/0xd0
[<ffffffff8100524f>] do_syscall_64+0x1af/0x4d0
[<ffffffff83c96534>] entry_SYSCALL64_slow_path+0x25/0x25
Code: 89 4d b8 44 89 45 c0 89 4d c8 4c 89 55 d0 e8 ee c3 ff ff 48 85
c0 4c 8b 55 d0 8b 4d c8 44 8b 45 c0 4c 8b 4d b8 0f 84 c6 01 00 00 <3e>
ff 80 98 01 00 00 49 8d be 48 07 00 00 48 ba 00 00 00 00 00
RIP [<ffffffff8135f2ea>] __lock_acquire.isra.32+0xda/0x1a30

I didn't read all the emails in this thread, but the crash site looks
identical to one of the earlier traces, although the caller may be
different.

I think you can rule out btrfs in any case, probably block layer as
well, since it looks like this comes from shmem.


Vegard

Vegard Nossum

Dec 5, 2016, 6:20:06 AM
On 5 December 2016 at 00:04, Vegard Nossum <vegard...@gmail.com> wrote:
> FWIW I hit this as well:
>
> BUG: unable to handle kernel paging request at ffffffff81ff08b7
> IP: [<ffffffff8135f2ea>] __lock_acquire.isra.32+0xda/0x1a30
> CPU: 0 PID: 21744 Comm: trinity-c56 Tainted: G B 4.9.0-rc7+ #217
[...]

> I think you can rule out btrfs in any case, probably block layer as
> well, since it looks like this comes from shmem.

I should rather say that the VM runs on a 9p root filesystem and it
doesn't use/mount any block devices or disk-based filesystems.

I have all the trinity logs for the crash if that's useful. I tried a
couple of runs with just the (at the time) in-progress syscalls but it
didn't turn up anything interesting. Otherwise it seems like a lot of
data to go through by hand.

The crashing child seems to have just started, though, if that's relevant:

[child56:21744] [0]
open(filename="/sys/block/loop2/power/runtime_active_time",
flags=0x777b01, mode=666) = -1 (Not a directory)
[child56:21744] [1] [32BIT] sched_getattr(pid=1, param=0x7f37ec26c000,
size=3415) = -1 (Invalid argument)
[child56:21744] [2] [32BIT]
access(filename="/proc/2/task/2/net/stat/arp_cache", mode=2000) = -1
(Invalid argument)
[child56:21744] [3] getegid() = 0xfffe
[child56:21744] [4]
swapoff(path="/proc/721/task/721/net/dev_snmp6/tunl0") = -1 (Operation
not permitted)
[child56:21744] [5] timerfd_create(clockid=0x0, flags=0x0) = 439
[child56:21744] [6] pkey_mprotect(start=0x7f37ee656000, len=0,
prot=0x1000008, key=0x600000000000000) = 0
[child56:21744] [7] msync(start=0x7f37ee657000, len=0, flags=0x6) = 0
[child56:21744] [8] flock(fd=168, cmd=0xffffffc191f30b0c) = -1
(Invalid argument)
[child56:21744] [9] add_key(_type=0x437a15,
_description=0x7f37ec06c000, _payload=0x7f37ec06c000, plen=0,
ringid=0xfffffffffffffff8)

The other logfiles end thusly:

==> trinity-child0.log <==
[child0:21593] [311] faccessat(dfd=246,
filename="/proc/983/task/983/net/protocols", mode=2000) = -1 (Invalid
argument)
[child0:21593] [312] renameat(olddfd=246,
oldname="/proc/13/task/13/attr/sockcreate", newdfd=377,
newname="/proc/16/task/16/net/stat/rt_cache") = -1 (Permission denied)
[child0:21593] [313] [32BIT] readv(fd=289, vec=0x2e1a3d0, vlen=215) = 0

==> trinity-child100.log <==
[child100:21536] [439] setgid(gid=0x2a000200) = -1 (Operation not permitted)
[child100:21536] [440] waitid(which=175, upid=21587, infop=0x4,
options=3542, ru=0x7f37ec76c000) = -1 (Invalid argument)
[child100:21536] [441]
getxattr(pathname="/proc/980/task/980/net/ptype", name=0x7f37ee466000,
value=0x7f37ec26c000, size=49) = -1 (Operation not supported)

==> trinity-child101.log <==
[child101:21537] [55] getcwd(buf=0x7f37ee466000, size=4096) = 39
[child101:21537] [56] [32BIT] munlock(addr=0x7f37ee658000, len=0) = 0
[child101:21537] [57] semctl(semid=0xbd851e2b40e7df,
semnum=0x1b1b1b1b1b, cmd=0x2000000000, arg=0xcacacacaca) = -1 (Invalid
argument)

==> trinity-child102.log <==
[child102:21542] [11] readahead(fd=353, offset=2, count=249) = -1
(Invalid argument)
[child102:21542] [12] add_key(_type=0x43793f,
_description=0x7f37ec46c000, _payload=0x7f37ee658000, plen=32,
ringid=0xfffffffffffffffa) = -1 (Invalid argument)
[child102:21542] [13] time(tloc=0x7f37ee466000) = 0x584474e0

==> trinity-child103.log <==
[child103:21543] [45] dup(fildes=183) = 512
[child103:21543] [46] rt_sigpending(set=0x7f37ec86c000, sigsetsize=32)
= -1 (Invalid argument)
[child103:21543] [47] newstat(filename="/proc/587/task/587/gid_map",
statbuf=0x7f37ee466000) = 0

==> trinity-child104.log <==
[child104:21546] [49] getdents(fd=162, dirent=0x0, count=127) = -1
(Not a directory)
[child104:21546] [50] [32BIT] clock_adjtime(which_clock=0, utx=0x4) =
-1 (Bad address)
[child104:21546] [51] setsid() = 0x542a

==> trinity-child105.log <==
[child105:21547] [523] epoll_wait(epfd=244, events=0x8, maxevents=246,
timeout=-1) = -1 (Invalid argument)
[child105:21547] [524] dup2(oldfd=244, newfd=244) = 244
[child105:21547] [525] acct(name=0x7f37ec26c000) = -1 (Operation not permitted)

==> trinity-child106.log <==
[child106:19910] [136] getegid() = 0xfffe
[child106:19910] [137] munmap(addr=0x7f37ee65a000, len=4096) = 0
[child106:19910] [138] clock_nanosleep(which_clock=0x1, flags=0x1,
rqtp=0x7f37ec06c000, rmtp=0x7f37ee466000)
==> trinity-child107.log <==
[child107:21224] [994] copy_file_range(fd_in=373, off_in=0x2400e210,
fd_out=373, off_out=8, len=8, flags=0x0) = -1 (Bad file descriptor)
[child107:21224] [995] kcmp(pid1=1, pid2=21453, type=0x5,
idx1=0x787878787878, idx2=0xffffff6060606060) = -1 (Operation not
permitted)
[child107:21224] [996] [32BIT] readv(fd=365, vec=0x2e27e10, vlen=36) = 0

==> trinity-child108.log <==
[child108:21226] [759] recvfrom(fd=219, ubuf=0x7f37ec26c000, size=8,
flags=0x0, addr=0x2e1ed80, addr_len=110) = -1 (Bad file descriptor)
[child108:21226] [760] shmat(shmid=-4097, shmaddr=0x7f37ee465000,
shmflg=-195) = -1 (Invalid argument)
[child108:21226] [761] [32BIT] seccomp(op=0x0, flags=0x0, uargs=0x0)
==> trinity-child109.log <==
[child109:21552] [241] fremovexattr(fd=133, name=0x4) = -1 (Bad address)
[child109:21552] [242] sched_setattr(pid=0, uattr=0x1) = -1 (Invalid argument)
[child109:21552] [243] msgget(key=0x71717171717171,
msgflg=0xec00000000) = 0x2d805a

==> trinity-child10.log <==
[child10:21619] [120] creat(pathname="/proc/6/attr/fscreate", mode=0) = 521
[child10:21619] [121] [32BIT] sethostname(name=0x7f37ec06c000,
len=4096) = -1 (Operation not permitted)
[child10:21619] [122] add_key(_type=0x43793f,
_description=0x7f37ec66c000, _payload=0x7f37ec06c000, plen=8,
ringid=0xfffffffffffffffe) = -1 (Invalid argument)

==> trinity-child110.log <==
[child110:21554] [289] accept(fd=366, upeer_sockaddr=0x2e09f10,
upeer_addrlen=12) = -1 (Socket operation on non-socket)
[child110:21554] [290] unlinkat(dfd=147,
pathname="/proc/16/task/16/net/atm/br2684", flag=-62) = -1 (Invalid
argument)
[child110:21554] [291] [32BIT] seccomp(op=0x0, flags=0x0, uargs=0x0)
==> trinity-child111.log <==
[child111:21556] [102] getsockname(fd=292, usockaddr=0x2e10190,
usockaddr_len=110) = -1 (Socket operation on non-socket)
[child111:21556] [103] munmap(addr=0x7f37ee65a000, len=4096) = 0
[child111:21556] [104] personality(personality=1) = 0

==> trinity-child112.log <==
[child112:21241] [848] setpgid(pid=21456, pgid=9) = -1 (No such process)
[child112:21241] [849] splice(fd_in=182, off_in=0x7f37ec76c000,
fd_out=182, off_out=0x7f37ec76c008, len=4, flags=0x7) = -1 (Bad file
descriptor)
[child112:21241] [850] setgid(gid=0) = -1 (Operation not permitted)

==> trinity-child113.log <==
[child113:21560] [453] memfd_create(uname=0x7f37ee657000, flag=0x1) = 530
[child113:21560] [454] pwritev2(fd=342, vec=0x2e1e000, vlen=133,
pos_l=0xffffff1d, pos_h=-4, flags=0x0) = -1 (Illegal seek)
[child113:21560] [455] readahead(fd=338, offset=0xe000, count=4096) =
-1 (Invalid argument)

==> trinity-child114.log <==
[child114:21562] [27] [32BIT] getgroups16(gidsetsize=0xf12ec06dec,
grouplist=0x7f37ee465000) = -1 (Bad address)
[child114:21562] [28] keyctl(cmd=0x14, arg2=0xcecececece, arg3=8192,
arg4=0x7cc57bd0d8cf577, arg5=0x898989898989) = -1 (Invalid argument)
[child114:21562] [29] readlink(path="/sys/block/loop1/mq/0/pending",
buf=0x7f37ee467000, bufsiz=0x1565a3aa) = -1 (Invalid argument)

==> trinity-child115.log <==
[child115:21563] [392] eventfd(count=8) = 523
[child115:21563] [393] getegid() = 0xfffe
[child115:21563] [394] open(filename="/proc/587/stat", flags=0x795ac0,
mode=666) = -1 (Invalid argument)

==> trinity-child116.log <==
[child116:21567] [80] mbind(start=0x7f37ee65a000, len=0, mode=0x3,
nmask=0x0, maxnode=0x8000, flags=0x5) = -1 (Operation not permitted)
[child116:21567] [81] flock(fd=357, cmd=0xffffff0fe2220862) = 0
[child116:21567] [82] clock_adjtime(which_clock=15, utx=0x8) = -1
(Invalid argument)

==> trinity-child117.log <==
[child117:21570] [333] pipe(fildes=0x7f37ee466000) = 0
[child117:21570] [334] get_robust_list(pid=21439, head_ptr=0x1,
len_ptr=0x7f37ebc18000) = -1 (Operation not permitted)
[child117:21570] [335] execve(name="/proc/550/net/rt_cache",
argv=0x2e230d0, envp=0x2ba92a0)
==> trinity-child118.log <==
[child118:21572] [455] fstatat64(dfd=244,
filename="/proc/1297/task/1297/fdinfo/110", statbuf=0x8, flag=97) = -1
(Invalid argument)
[child118:21572] [456] flistxattr(fd=244, list=0x7f37ec26c000, size=8) = 0
[child118:21572] [457] ioprio_set(which=0x5d5d5d5d5d5d,
who=0x222734221582, ioprio=0xffffff5b5b5b5b5b) = -1 (Invalid argument)

==> trinity-child119.log <==
[child119:21574] [400] timerfd_create(clockid=0x0, flags=0x0) = 522
[child119:21574] [401] execveat(fd=338, name="/proc/1024/net/can-bcm",
argv=0x2bda9f0, envp=0x2e21130, flags=0x100) = 522
[child119:21574] [402] execveat(fd=244,
name="/sys/devices/platform/serial8250/tty/ttyS3/power/autosuspend_delay_ms",
argv=0x2e09b30, envp=0x2e1d0f0, flags=0x1000)
==> trinity-child11.log <==
[child11:21621] [352]
symlink(oldname="/sys/devices/virtual/sound/timer/uevent",
newname="/proc/549/task/549/net/rpc/auth.unix.gid") = -1 (File exists)
[child11:21621] [353] [32BIT] setgroups16(gidsetsize=0xffffff18,
grouplist=0x8) = -1 (Operation not permitted)
[child11:21621] [354] semtimedop(semid=0x100000de63c0d6e9, tsops=0x4,
nsops=0, timeout=0x8)
==> trinity-child120.log <==
[child120:21578] [141] execveat(fd=147,
name="/proc/2/task/2/net/llc/socket", argv=0x2e0a250, envp=0x2e10190,
flags=0x1000) = -1 (Permission denied)
[child120:21578] [142] readlinkat(dfd=330,
pathname="/sys/devices/virtual/misc/ocfs2_control/power/runtime_active_kids",
buf=0x7f37ee656000, bufsiz=4) = -1 (Invalid argument)
[child120:21578] [143] fremovexattr(fd=330, name=0x7f37ec26c000) = -1
(Permission denied)

==> trinity-child121.log <==
[child121:21263] [586] pread64(fd=237, buf=0x7f37ec76c000, count=3746,
pos=75) = 0
[child121:21263] [587] inotify_add_watch(fd=237,
pathname="/proc/974/task/974/net/atm/pvc", mask=0x5) = -1 (Invalid
argument)
[child121:21263] [588]
removexattr(pathname="/proc/974/task/974/net/sco",
name=0x7f37ec26c000) = -1 (Permission denied)

==> trinity-child122.log <==
[child122:21580] [57] getrusage(who=0x3a2a97385da52b25, ru=0x0) = -1
(Invalid argument)
[child122:21580] [58] getdents(fd=373, dirent=0x1, count=0x6cfbd4f4) =
-1 (Bad file descriptor)
[child122:21580] [59] msgsnd(msqid=-9, msgp=0x8, msgsz=59, msgflg=0x0)
= -1 (Bad address)

==> trinity-child123.log <==
[child123:21582] [289] recvmsg(fd=244, msg=0x7f37ec26c000,
flags=0x5f058) = -1 (Socket operation on non-socket)
[child123:21582] [290] poll(ufds=0x2d25880, nfds=7, timeout_msecs=0) = 5
[child123:21582] [291] getppid() = 1297

==> trinity-child124.log <==
[child124:20960] [493] uname(name=0x7f37ee466000) = 0
[child124:20960] [494] sysfs(option=0x1, arg1=-3,
arg2=0x5982b49806138008) = -1 (Bad address)
[child124:20960] [495]
execve(name="/proc/sys/net/ipv4/neigh/sit0/base_reachable_time_ms",
argv=0x2e122d0, envp=0x2bc95a0)
==> trinity-child125.log <==
[child125:21586] [15] setgroups(gidsetsize=12, grouplist=0x0) = -1
(Operation not permitted)
[child125:21586] [16] [32BIT] oldumount(name=0x7f37ec46c000) = -1
(Operation not permitted)
[child125:21586] [17] utimes(filename="/proc/5/net/anycast6",
utimes=0x8) = -1 (Bad address)

==> trinity-child126.log <==
[child126:21587] [29] shmdt(shmaddr=0x7f37ec66c000) = -1 (Invalid argument)
[child126:21587] [30]
setxattr(pathname="/sys/devices/pci0000:00/0000:00:03.0/net/eth0/power/runtime_active_kids",
name=0x4, value=0x7f37ec46c000, size=0xfffd, flags=0x1) = -1 (Bad
address)
[child126:21587] [31] chroot(filename="/proc/1240/net/hci") = -1 (Not
a directory)

==> trinity-child127.log <==
[child127:21588] [290] clock_gettime(which_clock=6156, tp=0x0) = -1
(Invalid argument)
[child127:21588] [291] statfs(pathname="/proc/1252/net/ipv6_route",
buf=0x7f37ee659000) = 0
[child127:21588] [292] msync(start=0x7f37ee658000, len=0, flags=0x6) = 0

==> trinity-child12.log <==
[child12:21625] [54] getsid(pid=21655) = 0
[child12:21625] [55] mlock2(start=0x7f37ec06c000, len=0x136000) = -1
(Cannot allocate memory)
[child12:21625] [56] dup3(oldfd=180, newfd=180, flags=0x80000) = -1
(Invalid argument)

==> trinity-child13.log <==
[child13:21627] [32] add_key(_type=0x437a15,
_description=0x7f37ee466001, _payload=0x7f37ee466000, plen=0x40000,
ringid=0xfffffffffffffffe) = -1 (Invalid argument)
[child13:21627] [33] mknodat(dfd=243,
filename="/proc/553/task/553/attr/fscreate", mode=0xf63d64273f4d42,
dev=-10) = -1 (Operation not permitted)
[child13:21627] [34] mq_timedreceive(mqdes=328,
u_msg_ptr=0x7f37ee466000, msg_len=0xd1d1d1, u_msg_prio=0x7f37ec06c000,
u_abs_timeout=0x7f37ee466000) = -1 (Invalid argument)

==> trinity-child14.log <==
[child14:21633] [194] msgsnd(msqid=0x4e251795f0, msgp=0x0, msgsz=8,
msgflg=0x0) = -1 (Bad address)
[child14:21633] [195] ppoll(ufds=0x2c4b530, nfds=7, tsp=0x2e09fc0,
sigmask=0x7f37ee465000, sigsetsize=128) = -1 (Invalid argument)
[child14:21633] [196] faccessat(dfd=251,
filename="/proc/15/task/15/net/stat/nf_conntrack", mode=40) = -1
(Invalid argument)

==> trinity-child15.log <==
[child15:21634] [210] timer_gettime(timer_id=1, setting=0x4) = -1
(Invalid argument)
[child15:21634] [211] lgetxattr(pathname="/proc/551/setgroups",
name=0x4, value=0xc, size=1) = -1 (Bad address)
[child15:21634] [212] splice(fd_in=265, off_in=0x7f37ec26c000,
fd_out=265, off_out=0x7f37ec26c008, len=1, flags=0x4) = -1 (Bad file
descriptor)

==> trinity-child16.log <==
[child16:21640] [307]
sched_get_priority_max(policy=0x7fffffff08000000) = -1 (Invalid
argument)
[child16:21640] [308] getcpu(cpup=0x7f37ee465000,
nodep=0x7f37ee465004, unused=0x7f37ee465008) = 0
[child16:21640] [309] sendfile(out_fd=367, in_fd=367, offset=0x0,
count=0) = -1 (Bad file descriptor)

==> trinity-child17.log <==
[child17:21642] [227]
llistxattr(pathname="/proc/16/task/16/net/route", list=0x8, size=4) = 0
[child17:21642] [228] msync(start=0x7f37ec66c000, len=16384, flags=0x6) = 0
[child17:21642] [229] [32BIT]
execve(name="/proc/1024/net/atm/devices", argv=0x2c32e20,
envp=0x2dfb5a0) = 0

==> trinity-child18.log <==
[child18:21644] [55] execve(name="/proc/971/task/971/net/icmp",
argv=0x2d25880, envp=0x2b996a0) = -1 (Permission denied)
[child18:21644] [56] utime(filename="/proc/974/net/raw",
times=0x7f37ee465000) = -1 (Operation not permitted)
[child18:21644] [57] umask(mask=0xfffffff7) = 18

==> trinity-child19.log <==
[child19:21651] [208] [32BIT] move_pages(pid=21604, nr_pages=145,
pages=0x2e1a3d0, nodes=0x2e1b3e0, status=0x2e1b630, flags=0x2) = -1
(Bad address)
[child19:21651] [209] timer_getoverrun(timer_id=-2) = -1 (Invalid argument)
[child19:21651] [210] rename(oldname="/proc/14/cmdline",
newname="/proc/1041/task/1041/net/can/rcvlist_all") = -1 (No such file
or directory)

==> trinity-child1.log <==
[child1:19992] [1361]
lsetxattr(pathname="/proc/1297/task/1297/fdinfo/96", name=0x1,
value=0x7f37ee466000, size=0, flags=0x0) = -1 (Bad address)
[child1:19992] [1362] setgid(gid=0xf0f0f0f0f) = -1 (Operation not permitted)
[child1:19992] [1363] clock_nanosleep(which_clock=0x1, flags=0x0,
rqtp=0x7f37ee659000, rmtp=0x7f37ee465000)
==> trinity-child20.log <==
[child20:21027] [1852] sendto(fd=244, buff=0x2df7410, len=1,
flags=0x800436c8, addr=0x2df7460, addr_len=16) = -1 (Socket operation
on non-socket)
[child20:21027] [1853] setresgid(rgid=0, egid=0x4848, sgid=0xdddddddd)
= -1 (Operation not permitted)
[child20:21027] [1854] poll(ufds=0x2e1fb80, nfds=8, timeout_msecs=0) = 6

==> trinity-child21.log <==
[child21:21330] [663]
statfs(pathname="/sys/block/md0/queue/scheduler", buf=0x7f37ec66c000)
= 0
[child21:21330] [664] sched_setaffinity(pid=1, len=1028,
user_mask_ptr=0x7f37ec46c000) = -1 (Operation not permitted)
[child21:21330] [665] [32BIT] seccomp(op=0x0, flags=0x0, uargs=0x0)
==> trinity-child22.log <==
[child22:21652] [61] pkey_alloc(flags=0, init_val=0x0) = -1 (No space
left on device)
[child22:21652] [62] fsync(fd=330) = 0
[child22:21652] [63] [32BIT] seccomp(op=0x0, flags=0x0, uargs=0x0)
==> trinity-child23.log <==
[child23:21655] [237] sync_file_range(fd=244, offset=191, nbytes=1,
flags=0xffffffffffffff9) = -1 (Invalid argument)
[child23:21655] [238] io_submit(ctx_id=4088, nr=1,
iocbpp=0x7f37ee658000) = -1 (Invalid argument)
[child23:21655] [239] flistxattr(fd=367, list=0x7f37ee656000, size=8)
==> trinity-child24.log <==
[child24:21657] [86] [32BIT] personality(personality=0x3fde3e4912ec) =
0x82604841
[child24:21657] [87]
mkdir(pathname="/proc/924/task/924/net/atalk/interface", mode=0) = -1
(File exists)
[child24:21657] [88] [32BIT] fchownat(dfd=381,
filename="/sys/module/nf_conntrack/parameters/nf_conntrack_helper",
user=0x1804802345, group=0xffffffff, flag=-3) = -1 (Invalid argument)

==> trinity-child25.log <==
[child25:21658] [157] timerfd_settime(ufd=201, flags=0x80800,
utmr=0x1, otmr=0x9) = -1 (Bad address)
[child25:21658] [158] rt_tgsigqueueinfo(tgid=0xc0da20044837,
pid=21330, sig=7099, uinfo=0x7f37ee466000) = -1 (Bad address)
[child25:21658] [159] pkey_mprotect(start=0x7f37ec76c000, len=0x79000,
prot=0x0, key=0x5120a54c676961) = -1 (Invalid argument)

==> trinity-child26.log <==
[child26:21665] [20] fdatasync(fd=217) = -1 (Invalid argument)
[child26:21665] [21] fsetxattr(fd=137, name=0x4, value=0x7f37ec26c000,
size=-9, flags=0x2) = -1 (Bad address)
[child26:21665] [22] open_by_handle_at(mountdirfd=137,
handle=0x7f37ee465000, flags=0x5b33c0) = -1 (Operation not permitted)

==> trinity-child27.log <==
[child27:21668] [190] preadv(fd=381, vec=0x2e1b3e0, vlen=233,
pos_l=0x710702af1dfa, pos_h=0xc976c19d4f) = 0
[child27:21668] [191] [32BIT] mlock(addr=0x7f37ec76c000, len=0xce000)
= -1 (Cannot allocate memory)
[child27:21668] [192] [32BIT] getsockopt(fd=381,
level=0x54100210202640, optname=0x9a9a9a9a9a9a, optval=0x7f37ee466000,
optlen=8) = -1 (Socket operation on non-socket)

==> trinity-child28.log <==
[child28:1813] [202] sched_setscheduler(pid=0, policy=0x5,
param=0x7f37ee465000) = 0
[child28:1813] [203] setresuid(ruid=0x80020293a4aa482, euid=247,
suid=0xffffffff81000002) = -1 (Operation not permitted)
[child28:1813] [204] epoll_ctl(epfd=332, op=0x3, fd=332,
event=0x2e0a150) [child28:1813] [204]
newstat(filename="/proc/9/task/9/net/atalk", statbuf=0x8)
[child28:1813] [204] newstat(filename="/proc/9/task/9/net/atalk",
statbuf=0x8) [child28:1813] [204]
statfs(pathname="/proc/989/task/989/attr/prev", buf=0x7f37ee466000)
[child28:1813] [204] newstat(filename="/proc/9/task/9/net/atalk",
statbuf=0x8) [child28:1813] [204]
statfs(pathname="/proc/989/task/989/attr/prev", buf=0x7f37ee466000)
[child28:1813] [204] timerfd_create(clockid=0x1, flags=0x800)
[child28:1813] [204] newstat(filename="/proc/9/task/9/net/atalk",
statbuf=0x8) [child28:1813] [204]
statfs(pathname="/proc/989/task/989/attr/prev", buf=0x7f37ee466000)
[child28:1813] [204] timerfd_create(clockid=0x1, flags=0x800)
[child28:1813] [204] process_vm_readv(pid=1, lvec=0x2e21d30,
liovcnt=48, rvec=0x2e22040, riovcnt=23, flags=0x0)
==> trinity-child29.log <==
= 0
[child29:1329] [68] [32BIT] getitimer(which=0xfffffffb,
value=0x7f37ee465000) [child29:1329] [68]
setpriority(which=0x8000000000, who=4, niceval=-2) [child29:1329] [68]
rt_sigprocmask(how=-9, set=0x0, oset=0x7f37ee466000, sigsetsize=128) =
-1 (Invalid argument)
[child29:1329] [69] shmat(shmid=0xfffffff9, shmaddr=0x1, shmflg=0x55d10db3)
==> trinity-child2.log <==
[child2:21596] [87] accept4(fd=189, upeer_sockaddr=0x2bda970,
upeer_addrlen=110, flags=0x80800) = -1 (Bad file descriptor)
[child2:21596] [88]
access(filename="/proc/1134/task/1134/net/llc/socket", mode=2060) = -1
(Invalid argument)
[child2:21596] [89] sendfile(out_fd=189, in_fd=172,
offset=0x7f37ee465000, count=4096) = -1 (Bad file descriptor)

==> trinity-child30.log <==
[child30:9344] [334] sendmsg(fd=0, msg=0x2c4b530, flags=0x5ba83) = -1
(Socket operation on non-socket)
[child30:9344] [335] signalfd(ufd=382, user_mask=0x7f37ee65a000,
sizemask=4) = -1 (Invalid argument)
[child30:9344] [336] listen(fd=206, backlog=0x4649119809)
[child30:9344] [336] io_getevents(ctx_id=0x372b076d67, min_nr=0xfff8,
nr=4096, events=0x7f37ec86c000, timeout=0x7f37ee65a000)
==> trinity-child31.log <==
[child31:2333] [416] munlockall() = 0
[child31:2333] [417] fgetxattr(fd=189, name=0x0, value=0x7f37ec76c000,
size=252) = -1 (Bad file descriptor)
[child31:2333] [418] fgetxattr(fd=308, name=0x4, value=0x7f37ec26c000,
size=8) [child31:2333] [418] clock_adjtime(which_clock=15, utx=0x1)
[child31:2333] [418] clock_adjtime(which_clock=15, utx=0x1)
[child31:2333] [418] timer_gettime(timer_id=0xf9f9f9f9f9f9,
setting=0x1) [child31:2333] [418] socket(family=32, type=0x80800,
protocol=10) [child31:2333] [418] socket(family=32, type=0x80800,
protocol=10) [child31:2333] [418] dup2(oldfd=371, newfd=363)
[child31:2333] [418] socket(family=32, type=0x80800, protocol=10)
[child31:2333] [418] dup2(oldfd=371, newfd=363) [child31:2333] [418]
unlink(pathname="/proc/1209/task/1209/net/atm/pvc") [child31:2333]
[418] openat(dfd=213, filename="/proc/971/task/971/net/atm",
flags=0x141a00, mode=712) [child31:2333] [418]
timer_gettime(timer_id=0x900040400186, setting=0x4) [child31:2333]
[418] getxattr(pathname="/proc/1023/net/stat/ndisc_cache",
name=0x7f37ec86c000, value=0x7f37ec86c001, size=259) [child31:2333]
[418] getxattr(pathname="/proc/1023/net/stat/ndisc_cache",
name=0x7f37ec86c000, value=0x7f37ec86c001, size=259) [child31:2333]
[418] open(filename="/sys/module/l2tp_ip", flags=0x80400, mode=211)
[child31:2333] [418] umask(mask=0xa10000004d483090) [child31:2333]
[418] umask(mask=0xa10000004d483090) [child31:2333] [418]
setgroups(gidsetsize=0, grouplist=0x4) [child31:2333] [418]
getgroups(gidsetsize=0xd4d4d4d4d4d4, grouplist=0x8) [child31:2333]
[418] getgroups(gidsetsize=0xd4d4d4d4d4d4, grouplist=0x8)
[child31:2333] [418] io_cancel(ctx_id=0xffffffff00000004,
iocb=0x7f37ee465000, result=0x4fc0000000) [child31:2333] [418]
getgroups(gidsetsize=0xd4d4d4d4d4d4, grouplist=0x8) [child31:2333]
[418] io_cancel(ctx_id=0xffffffff00000004, iocb=0x7f37ee465000,
result=0x4fc0000000) [child31:2333] [418] ioperm(from=221,
num=0x1786c7a03485c3c0, turn_on=176) [child31:2333] [418]
fadvise64(fd=271, offset=0x2afb1db, len=449, advice=0x5)
[child31:2333] [418] getpriority(which=0x1, who=2622) [child31:2333]
[418] signalfd(ufd=285, user_mask=0x7f37ee466000, sizemask=0)
[child31:2333] [418] setresuid(ruid=40, euid=0xffff, suid=-7)
==> trinity-child32.log <==
[child32:21672] [199] getresuid(ruid=0x7f37ec26c000,
euid=0x7f37ee65a000, suid=0x7f37ee465000) = 0
[child32:21672] [200] getrlimit(resource=0xa, rlim=0x4) = -1 (Bad address)
[child32:21672] [201] [32BIT] signalfd(ufd=367,
user_mask=0x7f37ec46c000, sizemask=32)
==> trinity-child33.log <==
[child33:21677] [231] mq_getsetattr(mqdes=244,
u_mqstat=0x7f37ee658000, u_omqstat=0x7f37ee658008) = -1 (Invalid
argument)
[child33:21677] [232] gettid() = 0x54ad
[child33:21677] [233] timer_create(which_clock=161,
timer_event_spec=0x7f37ec86c000, create_timer_id=0x7f37ec86c008)
==> trinity-child34.log <==
[child34:21679] [186] sched_rr_get_interval(pid=0, interval=0x7f37ee466000) = 0
[child34:21679] [187] [32BIT] epoll_pwait(epfd=244,
events=0x7f37ee657000, maxevents=0x40300000a, timeout=562) = -1
(Invalid argument)
[child34:21679] [188] getpriority(which=0x0, who=1)
==> trinity-child35.log <==
[child35:14725] [92] getrandom(buf=0x7f37ee656000, count=0, flags=0x3)
= -1 (Invalid argument)
[child35:14725] [93] setgid(gid=-4) = -1 (Operation not permitted)
[child35:14725] [94] execveat(fd=210, name="/proc/942/net/hidp",
argv=0x2bda970, envp=0x2e10190, flags=0x0) = -1 (Operation not
permitted)

==> trinity-child36.log <==
[child36:21681] [167] sethostname(name=0x7f37ec76c000, len=227) = -1
(Operation not permitted)
[child36:21681] [168] munlock(addr=0x7f37ec86c000, len=0x48000) = 0
[child36:21681] [169] execve(name="/proc/980/task/980/net/udplite",
argv=0x2e0a1a0, envp=0x2e10190)
==> trinity-child37.log <==
[child37:21682] [33] setgid(gid=0x6c6c6c6c) = -1 (Operation not permitted)
[child37:21682] [34] syslog(type=0x3, buf=0x7f37ee656000, len=0) = 0
[child37:21682] [35] pwrite64(fd=335, buf=0x2e204b0, count=1220,
pos=56) = -1 (Bad file descriptor)

==> trinity-child38.log <==
[child38:21689] [107] setpriority(which=0xafafafafafafafaf,
who=0xffffffffffff0000, niceval=0x400000000) = -1 (Invalid argument)
[child38:21689] [108] newstat(filename="/proc/983/cmdline",
statbuf=0x7f37ec66c000) = 0
[child38:21689] [109] mq_getsetattr(mqdes=367, u_mqstat=0x0, u_omqstat=0x0)
==> trinity-child39.log <==
[child39:20422] [556] adjtimex(txc_p=0x7f37ee466000) = -1 (Operation
not permitted)
[child39:20422] [557] ioperm(from=0, num=0xd000, turn_on=55) = -1
(Operation not permitted)
[child39:20422] [558] clock_nanosleep(which_clock=0x2, flags=0x0,
rqtp=0x7f37ec66c000, rmtp=0x7f37ec66c000)
==> trinity-child3.log <==
[child3:21597] [146] prlimit64(pid=1, resource=12, new_rlim=0x4,
old_rlim=0xc) = -1 (Bad address)
[child3:21597] [147] mq_unlink(u_name=0x7f37ec46c000) = -1 (No such
file or directory)
[child3:21597] [148] [32BIT] rt_sigprocmask(how=0x1616161616161617,
set=0x7f37ec76c000, oset=0x7f37ee465000, sigsetsize=128) = -1 (Invalid
argument)

==> trinity-child40.log <==
[child40:21692] [101] getgid() = 0xfffe
[child40:21692] [102]
lsetxattr(pathname="/proc/14/task/14/net/udplite6",
name=0x7f37ee657000, value=0x7f37ee465000, size=4096, flags=0x0) = -1
(Permission denied)
[child40:21692] [103] pwritev2(fd=367, vec=0x2e1a3d0, vlen=92,
pos_l=15, pos_h=0xffa4cdcdd0ee0601, flags=0x5)
==> trinity-child41.log <==
[child41:3589] [16] sched_get_priority_max(policy=0x34000000000000c3)
[child41:3589] [16] timerfd_settime(ufd=308, flags=0x80800,
utmr=0x7f37ec66c000, otmr=0x7f37ee466000) [child41:3589] [16]
timerfd_settime(ufd=326, flags=0x0, utmr=0x7f37ee465000,
otmr=0x7f37ec86c000) = -1 (File exists)
[child41:3589] [16] sched_get_priority_max(policy=0x34000000000000c3)
[child41:3589] [16] timerfd_settime(ufd=308, flags=0x80800,
utmr=0x7f37ec66c000, otmr=0x7f37ee466000) [child41:3589] [16]
timerfd_settime(ufd=326, flags=0x0, utmr=0x7f37ee465000,
otmr=0x7f37ec86c000) [child41:3589] [16]
getresuid(ruid=0x7f37ee467000, euid=0x7f37ee467000,
suid=0x7f37ec06c000) [child41:3589] [16] getrusage(who=0x703a9e50fc,
ru=0x8) [child41:3589] [16] munlockall() = 0
[child41:3589] [17]
chmod(filename="/proc/sys/net/ipv6/route/gc_min_interval", mode=4)
==> trinity-child42.log <==
[child42:21696] [165] getrusage(who=-5, ru=0x0) = -1 (Invalid argument)
[child42:21696] [166] fremovexattr(fd=168, name=0x7f37ee465000) = -1
(Numerical result out of range)
[child42:21696] [167] madvise(start=0x7f37ec06c000, len_in=0xdb000,
advice=0x9) = 0

==> trinity-child43.log <==
[child43:21700] [106] readv(fd=244, vec=0x2e1a3d0, vlen=246) = 0
[child43:21700] [107]
swapoff(path="/sys/devices/virtual/tty/tty26/power/runtime_usage") =
-1 (Operation not permitted)
[child43:21700] [108] linkat(olddfd=244,
oldname="/sys/devices/virtual/workqueue/writeback/nice", newdfd=244,
newname="/proc/1014/net/nfsfs/volumes", flags=0x0) = -1 (File exists)

==> trinity-child44.log <==
[child44:21703] [94] mkdir(pathname="/sys/module/pstore/parameters",
mode=0) = -1 (File exists)
[child44:21703] [95] sendfile(out_fd=367, in_fd=137, offset=0x1,
count=8) = -1 (Bad address)
[child44:21703] [96] [32BIT] sync_file_range(fd=367,
offset=0x707070707, nbytes=0x8000, flags=0xffffefff)
==> trinity-child45.log <==
[child45:21704] [1] fdatasync(fd=381) = -1 (Invalid argument)
[child45:21704] [2] munmap(addr=0x7f37ee656000, len=4096) = 0
[child45:21704] [3] pipe2(fildes=0x7f37ec46c000, flags=0x4800) = 0

==> trinity-child46.log <==
[child46:21705] [82]
rmdir(pathname="/sys/devices/tracepoint/power/runtime_active_kids") =
-1 (Permission denied)
[child46:21705] [83] chroot(filename="/proc/15/task/15/net/tcp") = -1
(Not a directory)
[child46:21705] [84] [32BIT] ioprio_set(which=-2, who=2048,
ioprio=0x646464646464) = -1 (Invalid argument)

==> trinity-child47.log <==
[child47:21710] [127] setrlimit(resource=0x7f37ee465000, rlim=16384) =
-1 (Bad address)
[child47:21710] [128] setns(fd=244, nstype=0x0) = -1 (Invalid argument)
[child47:21710] [129] pkey_alloc(flags=0, init_val=0x2)
==> trinity-child48.log <==
[child48:21717] [39]
access(filename="/proc/10/task/10/net/ip_vs_stats", mode=700) = -1
(Invalid argument)
[child48:21717] [40] [32BIT] sigprocmask(how=233, set=0x7f37ec46c000,
oset=-9) = -1 (Bad address)
[child48:21717] [41] [32BIT] seccomp(op=0x0, flags=0x0, uargs=0x0)
==> trinity-child49.log <==
[child49:21718] [81] sendto(fd=168, buff=0x2e0a540, len=1,
flags=0x41110, addr=0x2ba92a0, addr_len=28) = -1 (Socket operation on
non-socket)
[child49:21718] [82] clock_gettime(which_clock=0x430d4f678edd, tp=0x8)
= -1 (Invalid argument)
[child49:21718] [83] removexattr(pathname="/proc/13/attr/prev",
name=0x7f37ec06c000)
==> trinity-child4.log <==
[child4:21602] [155] getgroups(gidsetsize=0, grouplist=0x1) = 1
[child4:21602] [156] sendmsg(fd=0, msg=0x2c22c50, flags=0x40010883) =
-1 (Socket operation on non-socket)
[child4:21602] [157] fchown(fd=330, user=0x20240098000041,
group=0x20000) = -1 (Operation not permitted)

==> trinity-child50.log <==
[child50:21719] [41] shmctl(shmid=0x4c4c4c4c4c4c4c, cmd=0x2, buf=0x0)
= -1 (Invalid argument)
[child50:21719] [42] timer_delete(timer_id=0x80000000) = -1 (Invalid argument)
[child50:21719] [43] madvise(start=0x7f37ec86c000, len_in=0x20000, advice=0x2)
==> trinity-child51.log <==
[child51:21723] [91] msgctl(msqid=0x9f054e8b21, cmd=0x3,
buf=0x7f37ee466000) = 1763
[child51:21723] [92] time(tloc=0x7f37ee466000) = 0x584474e5
[child51:21723] [93] timer_gettime(timer_id=1, setting=0x7f37ee466000)
==> trinity-child52.log <==
[child52:21728] [70] epoll_ctl(epfd=244, op=0x3, fd=244,
event=0x2e0a2f0) = -1 (Operation not permitted)
[child52:21728] [71] [32BIT] statfs(pathname="/proc/548/net/atm/vc",
buf=0x7f37ec46c000) = -1 (Bad address)
[child52:21728] [72] rt_sigaction(sig=52, act=0x7f37ee465000,
oact=0x7f37ec06c000, sigsetsize=128) = -1 (Invalid argument)

==> trinity-child53.log <==
[child53:21734] [64] lseek(fd=244, offset=0, whence=0x1) = 33
[child53:21734] [65] preadv(fd=244, vec=0x2e10190, vlen=17,
pos_l=0x2714319808dd0822, pos_h=128) = 0
[child53:21734] [66] epoll_create1(flags=0x0) = 516

==> trinity-child54.log <==
[child54:21738] [50] ioprio_get(which=0, who=10023) = -1 (Invalid argument)
[child54:21738] [51] [32BIT] setregid(rgid=0x1000000000000,
egid=0xffff) = -1 (Operation not permitted)
[child54:21738] [52]
execve(name="/proc/1240/net/rpc/auth.rpcsec.context", argv=0x2e10190,
envp=0x2c6bd50)
==> trinity-child55.log <==
[child55:21741] [44] recvmsg(fd=381, msg=0x1, flags=0x20000000) = -1
(Socket operation on non-socket)
[child55:21741] [45] [32BIT] setgroups(gidsetsize=0xf9f9f9f9f9f9f9,
grouplist=0x0) = -1 (Operation not permitted)
[child55:21741] [46] execveat(fd=168, name="/proc/irq/4/smp_affinity",
argv=0x2e10190, envp=0x2e00900, flags=0x100)
==> trinity-child56.log <==
[child56:21744] [7] msync(start=0x7f37ee657000, len=0, flags=0x6) = 0
[child56:21744] [8] flock(fd=168, cmd=0xffffffc191f30b0c) = -1
(Invalid argument)
[child56:21744] [9] add_key(_type=0x437a15,
_description=0x7f37ec06c000, _payload=0x7f37ec06c000, plen=0,
ringid=0xfffffffffffffff8)
==> trinity-child57.log <==
[child57:21746] [12] [32BIT] mq_open(u_name=0x7f37ec86c000,
oflag=8192, mode=2000, u_attr=0x7f37ec86c004) = -1 (Bad address)
[child57:21746] [13] getpgid(pid=21226) = 0x52ea
[child57:21746] [14] execveat(fd=168, name="/proc/1297/task/1297",
argv=0x2bda970, envp=0x2ba92a0, flags=0x0)
==> trinity-child58.log <==
[child58:21749] [33] eventfd2(count=4, flags=0x0) = 441
[child58:21749] [34] mknodat(dfd=244,
filename="/sys/devices/pci0000:00/0000:00:03.0/net/eth0/queues",
mode=0x17104124, dev=0xffffffff0000567f) = -1 (Operation not
permitted)
[child58:21749] [35] clock_getres(which_clock=158, tp=0x4) = -1
(Invalid argument)

==> trinity-child59.log <==
[child59:21754] [21] setfsgid(gid=0x40000000000000) = 0xfffe
[child59:21754] [22] getpgrp() = 0
[child59:21754] [23] ppoll(ufds=0x2e0a5d0, nfds=0, tsp=0x2e0a590,
sigmask=0x0, sigsetsize=128)
==> trinity-child5.log <==
[child5:21604] [287] seccomp(op=0x1, flags=0x0, uargs=0x1) = 0
[child5:21604] [288] rt_sigpending(set=0x7f37ee466000,
sigsetsize=0x2323232323232323) = -1 (Invalid argument)
[child5:21604] [289] getuid() = 0xfffe

==> trinity-child60.log <==
[child60:21758] [0] getsid(pid=1) = 0
[child60:21758] [1] io_cancel(ctx_id=255, iocb=0x7f37ec46c000,
result=0x7f37ec46c000) = -1 (Invalid argument)
[child60:21758] [2] [32BIT] recvmmsg(fd=367, mmsg=0x1, vlen=4,
flags=0x0, timeout=0x7f37ec46c000)
==> trinity-child61.log <==
[child61:21760] [5] open(filename="/proc/1240/net/nfsfs",
flags=0x331702, mode=666) = 444
[child61:21760] [6] recvmmsg(fd=244, mmsg=0x7f37ec66c000, vlen=1,
flags=0x0, timeout=0x7f37ee466000) = -1 (Socket operation on
non-socket)
[child61:21760] [7] swapon(path=".//proc/3/task/3/net/ip_vs_conn/",
swap_flags=0x0)
==> trinity-child62.log <==
[child62:21428] [668] memfd_create(uname=0x7f37ee465000, flag=0x1) = 546
[child62:21428] [669]
newstat(filename="/sys/devices/virtual/net/sit0/statistics/rx_dropped",
statbuf=0x8) = -1 (Bad address)
[child62:21428] [670] [32BIT] sigpending(set=0x7f37ee467000) = -1 (Bad address)

==> trinity-child63.log <==
[child63:21431] [151] getgid() = 0xfffe
[child63:21431] [152] set_mempolicy(mode=0xffffffff, nmask=0x0,
maxnode=0xbbbbbbbbbbbbbbbb) = -1 (Invalid argument)
[child63:21762] [0] ioperm(from=0xb5b5b5b5b5b5b5b6, num=514, turn_on=135)
==> trinity-child64.log <==
[child64:21432] [754] utime(filename="/proc/552/net/can/rcvlist_eff",
times=0x4) = -1 (Bad address)
[child64:21432] [755] fcntl(fd=244, cmd=0x406, arg=244) = 540
[child64:21432] [756] mlock2(start=0x7f37ec06c000, len=0x9d000) = -1
(Cannot allocate memory)

==> trinity-child65.log <==
[child65:21435] [140] timer_create(which_clock=0x10000000,
timer_event_spec=0x1, create_timer_id=0x7f37ec46c000) = -1 (Invalid
argument)
[child65:21435] [141] getpeername(fd=152, usockaddr=0x2e0a1a0,
usockaddr_len=16) = -1 (Socket operation on non-socket)
[child65:21435] [142] ioperm(from=0x54ee6fb85625, num=1, turn_on=128)
= -1 (Invalid argument)

==> trinity-child66.log <==
[child66:21438] [310] msgctl(msqid=0x12121212121212, cmd=0x1,
buf=0x7f37ee465000) = -1 (Invalid argument)
[child66:21438] [311] set_robust_list(head=0x7f37ee658000, len=24) = 0
[child66:21438] [312] msgget(key=0x10000000,
msgflg=0x80000000000000d1) = -1 (Permission denied)

==> trinity-child67.log <==
[child67:21439] [184] pkey_alloc(flags=0, init_val=0x0) = -1 (Invalid argument)
[child67:21439] [185] getpgrp() = 0
[child67:21439] [186] [32BIT] getpgrp() = 0

==> trinity-child68.log <==
[child68:21443] [90] shmdt(shmaddr=0x7f37ee465000) = 0
[child68:21443] [91] remap_file_pages(start=0x7f37ec86c000, size=4096,
prot=0, pgoff=0, flags=0x10000) = 0
[child68:21443] [92] [32BIT] get_mempolicy(policy=0x7f37ee467000,
nmask=0x7f37ee467001, maxnode=147, addr=0x7f37ee466000, flags=0x0) =
-1 (Invalid argument)

==> trinity-child69.log <==
[child69:21444] [283] signalfd(ufd=139, user_mask=0x1, sizemask=8) =
-1 (Invalid argument)
[child69:21444] [284]
utimes(filename="/sys/devices/virtual/tty/tty50/power/runtime_status",
utimes=0x7f37ee465000) = -1 (Invalid argument)
[child69:21444] [285] fremovexattr(fd=139, name=0x7f37ee465000) = -1
(Permission denied)

==> trinity-child6.log <==
[child6:21605] [7] msync(start=0x7f37ee656000, len=0, flags=0x1) = 0
[child6:21605] [8] mq_timedreceive(mqdes=324,
u_msg_ptr=0x7f37ec06c000, msg_len=0x41da, u_msg_prio=0x7f37ee656000,
u_abs_timeout=0x7f37ee657000) = -1 (Bad file descriptor)
[child6:21605] [9] [32BIT] sethostname(name=0x7f37ec76c000,
len=0x85b19) = -1 (Operation not permitted)

==> trinity-child70.log <==
[child70:21447] [654] inotify_init1(flags=0x80000) = 547
[child70:21447] [655] setns(fd=244, nstype=0x0) = -1 (Invalid argument)
[child70:21447] [656] getdents(fd=367, dirent=0x4, count=0)
==> trinity-child71.log <==
[child71:21449] [86] fremovexattr(fd=344, name=0x1) = -1 (Bad address)
[child71:21449] [87] fadvise64(fd=344, offset=0, len=1, advice=0x2) = 0
[child71:21449] [88] [32BIT] fstat64(fd=285, statbuf=0x7f37ee467000) =
-1 (Bad address)

==> trinity-child72.log <==
[child72:21138] [88] [32BIT] pread64(fd=210, buf=0x7f37ec06c000,
count=1361, pos=0) = -1 (Invalid argument)
[child72:21138] [89] clock_settime(which_clock=512, tp=0x7f37ec06c000)
= -1 (Invalid argument)
[child72:21138] [90] mbind(start=0x7f37ec86c000, len=0xf5000,
mode=0x2, nmask=0x1, maxnode=0x8000, flags=0x6) = -1 (Bad address)

==> trinity-child73.log <==
[child73:21141] [706] setrlimit(resource=0x7f37ec26c000,
rlim=0xfffffffd) = -1 (Bad address)
[child73:21141] [707] pwrite64(fd=351, buf=0x2e091a0, count=1,
pos=0xfffd) = -1 (Bad file descriptor)
[child73:21141] [708] time(tloc=0x7f37ee65a000) = 0x584474dd

==> trinity-child74.log <==
[child74:21456] [429] setfsgid(gid=-5) = 0xfffe
[child74:21456] [430] sethostname(name=0x7f37ec26c000, len=8) = -1
(Operation not permitted)
[child74:21456] [431] open(filename="/sys/block/loop2/mq/0/cpu_list",
flags=0x333401, mode=666) = -1 (Not a directory)

==> trinity-child75.log <==
[child75:21458] [578] accept(fd=316, upeer_sockaddr=0x2e240e0,
upeer_addrlen=3729) = -1 (Socket operation on non-socket)
[child75:21458] [579] setxattr(pathname="/sys/block/loop5",
name=0x7f37ec06c000, value=0x7f37ec86c000, size=8, flags=0x1) = -1
(Permission denied)
[child75:21458] [580] sched_setscheduler(pid=0, policy=0x3, param=0x0)
= -1 (Invalid argument)

==> trinity-child76.log <==
[child76:21461] [546] gettimeofday(tv=0x7f37ee465000,
tz=0x7f37ee465004) = -1 (Bad address)
[child76:21461] [547] [32BIT] rt_sigaction(sig=32, act=0x7f37ec46c000,
oact=0x0, sigsetsize=128) = -1 (Invalid argument)
[child76:21461] [548] clock_nanosleep(which_clock=0x2, flags=0x1,
rqtp=0x7f37ee656000, rmtp=0x7f37ec06c000)
==> trinity-child77.log <==
[child77:21465] [73]
swapoff(path="/proc/865/task/865/net/rpc/auth.rpcsec.context") = -1
(Operation not permitted)
[child77:21465] [74] sched_getscheduler(pid=21253) = 0
[child77:21465] [75] [32BIT]
rt_sigtimedwait(uthese=0xba501c1fc77f1f57, uinfo=0x0,
uts=0x7f37ee466000, sigsetsize=0x7f37ee466008) = -1 (Invalid argument)

==> trinity-child78.log <==
[child78:21469] [216] mknod(filename="/proc/1023/net/ip_vs_conn",
mode=1007, dev=0x1b9008e90080) = -1 (No such file or directory)
[child78:21469] [217] geteuid() = 0xfffe
[child78:21469] [218] setsockopt(fd=169, level=0xa49a302c26ea,
optname=51, optval=0x2e1afd0, optlen=0x5bf09058529076) = -1 (Bad file
descriptor)

==> trinity-child79.log <==
[child79:21475] [696] setresuid(ruid=-2, euid=43,
suid=0x7272727272727272) = -1 (Operation not permitted)
[child79:21475] [697] msync(start=0x7f37ee659000, len=0, flags=0x6) = 0
[child79:21475] [698] io_submit(ctx_id=118, nr=8, iocbpp=0x8) = -1
(Invalid argument)

==> trinity-child7.log <==
[child7:21609] [33] prctl(option=22, arg2=2, arg3=0, arg4=0, arg5=0) =
-1 (Bad address)
[child7:21609] [34] userfaultfd(flags=0x800) = 511
[child7:21609] [35] getuid() = 0xfffe

==> trinity-child80.log <==
[child80:21478] [101] settimeofday(tv=0x7f37ec26c000,
tz=0x7f37ec46c000) = -1 (Operation not permitted)
[child80:21478] [102] getdents(fd=330, dirent=0x0, count=56) = -1 (Not
a directory)
[child80:21478] [103] epoll_ctl(epfd=330, op=0x2, fd=330,
event=0x2e0a390) = -1 (Invalid argument)

==> trinity-child81.log <==
[child81:21482] [33] [32BIT] move_pages(pid=0, nr_pages=304,
pages=0x2e19fc0, nodes=0x2e1afd0, status=0x2e1b4a0, flags=0x0) = -1
(Bad address)
[child81:21482] [34] getpriority(which=0x1, who=21008) = 1
[child81:21482] [35] sethostname(name=0x7f37ee659000, len=0x6aeaf512)
= -1 (Operation not permitted)

==> trinity-child82.log <==
[child82:21163] [963] keyctl(cmd=0x2, arg2=0xff00000, arg3=0xc928d000,
arg4=0x400000, arg5=1) = -1 (Invalid argument)
[child82:21163] [964] [32BIT] uselib(library=0x7f37ec46c000) = -1 (Bad address)
[child82:21163] [965] mq_getsetattr(mqdes=226, u_mqstat=0x1,
u_omqstat=0x2) = -1 (Bad address)

==> trinity-child83.log <==
[child83:21486] [531] getrlimit(resource=0xb, rlim=0x7f37ee465000) =
-1 (Bad address)
[child83:21486] [532] gettid() = 0x53ee
[child83:21486] [533] renameat(olddfd=244,
oldname="/proc/1252/net/atalk/socket", newdfd=267,
newname=".//proc/721/net/irda/ircomm") = -1 (Not a directory)

==> trinity-child84.log <==
[child84:21488] [238] mkdir(pathname="/proc/1161/task/1161/mounts",
mode=222) = -1 (File exists)
[child84:21488] [239] capget(header=0x67002320,
dataptr=0xfffaffb7effdfffe) = -1 (Bad address)
[child84:21488] [240]
rmdir(pathname="/sys/devices/virtual/misc/ocfs2_control/power/runtime_active_time")
= -1 (Permission denied)

==> trinity-child85.log <==
[child85:21489] [151] settimeofday(tv=0x7f37ec76c000,
tz=0x7f37ec76c008) = -1 (Invalid argument)
[child85:21489] [152] fstatat64(dfd=365,
filename=".//proc/551/net/ip_vs_conn", statbuf=0x7f37ec26c000,
flag=-2) = -1 (Invalid argument)
[child85:21489] [153] read(fd=365, buf=0x7f37ee466000, count=4096) = 0

==> trinity-child86.log <==
[child86:21496] [39] sched_getaffinity(pid=21349, len=217,
user_mask_ptr=0x4) = -1 (Invalid argument)
[child86:21496] [40] semtimedop(semid=149, tsops=0x7f37ec86c000,
nsops=-1, timeout=0x7f37ec86c000) = -1 (Argument list too long)
[child86:21496] [41] get_robust_list(pid=0, head_ptr=0x7f37ee465000,
len_ptr=0x7f37ec66c000) = 0

==> trinity-child87.log <==
[child87:21500] [517] select(n=0, inp=0x2e12fa0, outp=0x2e13030,
exp=0x2e130c0, tvp=0x2e09740) = 0
[child87:21500] [518] readlink(path="/proc/14/comm", buf=0x1,
bufsiz=223) = -1 (Invalid argument)
[child87:21500] [519] setgid(gid=0) = -1 (Operation not permitted)

==> trinity-child88.log <==
[child88:21502] [523] [32BIT] socketcall(call=1, args=0x2e1b390) = -1
(Address family not supported by protocol)
[child88:21502] [524]
newlstat(filename="/proc/722/task/722/net/rpc/nfs",
statbuf=0x7f37ec76c000) = 0
[child88:21502] [525] mq_unlink(u_name=0x7f37ec06c000)
==> trinity-child89.log <==
[child89:21507] [132] [32BIT] getresuid(ruid=0x7f37ec06c000,
euid=0x7f37ec06c001, suid=0x7f37ec06c005) = -1 (Bad address)
[child89:21507] [133] uname(name=0x7f37ec26c000) = 0
[child89:21507] [134] shutdown(fd=213, how=0x0) = -1 (Bad file descriptor)

==> trinity-child8.log <==
[child8:21610] [211] mlock2(start=0x7f37ec26c000, len=0x95000) = -1
(Cannot allocate memory)
[child8:21610] [212]
removexattr(pathname="/sys/module/zswap/parameters/compressor",
name=0x7f37ec46c000) = -1 (Permission denied)
[child8:21610] [213] socket(family=25, type=2048, protocol=5) = -1
(Address family not supported by protocol)

==> trinity-child90.log <==
[child90:21508] [36] [32BIT]
llistxattr(pathname="/proc/1045/net/protocols", list=0x0, size=-6) = 0
[child90:21508] [37] pipe(fildes=0x7f37ee467000) = 0
[child90:21508] [38] ftruncate(fd=183, length=0xffffffffca1f2000) = -1
(Invalid argument)

==> trinity-child91.log <==
[child91:21514] [28] unlink(pathname="/proc/924/net/netstat") = -1
(Permission denied)
[child91:21514] [29] pivot_root(new_root=0x7f37ee465000,
put_old=0x7f37ec06c000) = -1 (Operation not permitted)
[child91:21514] [30] sched_setscheduler(pid=21410, policy=0x3,
param=0x4) = -1 (Bad address)

==> trinity-child92.log <==
[child92:21517] [444] [32BIT]
chroot(filename="/proc/980/task/980/net/dev_snmp6/sit0") = -1 (Not a
directory)
[child92:21517] [445] syslog(type=0x1, buf=0x7f37ec26c000,
len=0x48000) = -1 (Operation not permitted)
[child92:21517] [446] getrandom(buf=0x7f37ec76c000, count=0x2a000,
flags=0x1) = -1 (Invalid argument)

==> trinity-child93.log <==
[child93:21518] [477] waitid(which=0xffd9, upid=1,
infop=0x7f37ec06c000, options=0x100000, ru=0x7f37ec06c000) = -1
(Invalid argument)
[child93:21518] [478] fchmod(fd=168, mode=0) = -1 (Operation not permitted)
[child93:21518] [479] clock_nanosleep(which_clock=0x2, flags=0x1,
rqtp=0x7f37ec76c000, rmtp=0x7f37ee466000)
==> trinity-child94.log <==
[child94:20875] [1515] access(filename="/proc/567/net/atm/pvc",
mode=5000) = -1 (Invalid argument)
[child94:20875] [1516] seccomp(op=0x1, flags=0x1, uargs=0x4) = -1
(Invalid argument)
[child94:20875] [1517] [32BIT] getuid() = 0xfffe

==> trinity-child95.log <==
[child95:21520] [94] userfaultfd(flags=0x0) = 516
[child95:21520] [95] eventfd(count=153) = 517
[child95:21520] [96] shmat(shmid=0x4000000, shmaddr=0x7f37ee466000,
shmflg=0x3000000000400000) = -1 (Invalid argument)

==> trinity-child96.log <==
[child96:21522] [225] pwritev2(fd=324, vec=0x2e19fc0, vlen=70,
pos_l=0x2727272727, pos_h=255, flags=0x2) = -1 (Illegal seek)
[child96:21522] [226] setfsuid(uid=-8) = 0xfffe
[child96:21522] [227] getdents(fd=324, dirent=0x7f37ec06c000,
count=4096) = -1 (Not a directory)

==> trinity-child97.log <==
[child97:21526] [119] [32BIT]
statfs64(pathname="/proc/545/task/545/net/can/rcvlist_all",
sz=0x8500000000000088) = -1 (Invalid argument)
[child97:21526] [120] pipe2(fildes=0x7f37ee465000, flags=0x0) = 0
[child97:21526] [121]
utimes(filename="/sys/devices/virtual/tty/ircomm4", utimes=0x0) = -1
(Permission denied)

==> trinity-child98.log <==
[child98:21529] [163] setresgid(rgid=20, egid=256,
sgid=0xffffffffffff5252) = -1 (Operation not permitted)
[child98:21529] [164]
llistxattr(pathname="/proc/986/net/dev_snmp6/ip_vti0", list=0x0,
size=4) = 0
[child98:21529] [165] getcpu(cpup=0x7f37ee659000,
nodep=0x7f37ee659008, unused=0x7f37ee659009) = 0

==> trinity-child99.log <==
[child99:21532] [223] sigaltstack(uss=0x7f37ec86c000,
uoss=0x7f37ec86c008, regs=0x7f37ee465000) = -1 (Invalid argument)
[child99:21532] [224] vmsplice(fd=320, iov=0x2e1afd0, nr_segs=1016,
flags=0xb) = -1 (Bad file descriptor)
[child99:21532] [225] [32BIT] setfsgid16(gid=0x2209024001252024) = 0xfffe

==> trinity-child9.log <==
[child9:21616] [294] getuid() = 0xfffe
[child9:21616] [295] fstatfs(fd=244, buf=0x8) = -1 (Bad address)
[child9:21616] [296] mlock2(start=0x7f37ec76c000, len=0xe000) = 0


Vegard

Vegard Nossum

Dec 5, 2016, 12:20:05 PM
On 5 December 2016 at 12:10, Vegard Nossum <vegard...@gmail.com> wrote:
> On 5 December 2016 at 00:04, Vegard Nossum <vegard...@gmail.com> wrote:
>> FWIW I hit this as well:
>>
>> BUG: unable to handle kernel paging request at ffffffff81ff08b7
>> IP: [<ffffffff8135f2ea>] __lock_acquire.isra.32+0xda/0x1a30
>> CPU: 0 PID: 21744 Comm: trinity-c56 Tainted: G B 4.9.0-rc7+ #217
> [...]
>
>> I think you can rule out btrfs in any case, probably block layer as
>> well, since it looks like this comes from shmem.
>
> I should rather say that the VM runs on a 9p root filesystem and it
> doesn't use/mount any block devices or disk-based filesystems.
>
> I have all the trinity logs for the crash if that's useful. I tried a
> couple of runs with just the (at the time) in-progress syscalls but it
> didn't turn up anything interesting. Otherwise it seems like a lot of
> data to go through by hand.

I've hit this another 7 times in the past ~3 hours.

Three times the address being dereferenced has pointed to
iov_iter_init+0xaf (even across a kernel rebuild), three times it has
pointed to put_prev_entity+0x55, once to 0x800000008, and twice to
0x292. The fact that it would hit even one of those more than once
across runs is pretty suspicious to me, although the ones that point
to iov_iter_init and put_prev_entity point to "random" instructions in
the sense that they are neither entry points nor return addresses.

shmem_fault() was always on the stack, but it came from different
syscalls: add_key(), newuname(), pipe2(), newstat(), fstat(),
clock_settime(), mount(), etc.

I also got this warning which is related:

------------[ cut here ]------------
WARNING: CPU: 9 PID: 25045 at lib/list_debug.c:59 __list_del_entry+0x14f/0x1d0
list_del corruption. prev->next should be ffff88014bdc79e8, but was
ffff88014bfbfc60
Kernel panic - not syncing: panic_on_warn set ...

CPU: 9 PID: 25045 Comm: trinity-c22 Not tainted 4.9.0-rc7+ #219
Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS
Ubuntu-1.8.2-1ubuntu1 04/01/2014
ffff88014bdc7700 ffffffff81fb0861 ffffffff83e74b60 ffff88014bdc77d8
ffffffff84006c00 ffffffff847103e0 ffff88014bdc77c8 ffffffff81515244
0000000041b58ab3 ffffffff844e21c2 ffffffff81515061 ffffffff00000054
Call Trace:
[<ffffffff81fb0861>] dump_stack+0x83/0xb2
[<ffffffff81515244>] panic+0x1e3/0x3ad
[<ffffffff81515061>] ? percpu_up_read_preempt_enable.constprop.45+0xcb/0xcb
[<ffffffff82016f7f>] ? __list_del_entry+0x14f/0x1d0
[<ffffffff812708bf>] __warn+0x1bf/0x1e0
[<ffffffff8135f2d2>] ? __lock_acquire.isra.32+0xc2/0x1a30
[<ffffffff8127098c>] warn_slowpath_fmt+0xac/0xd0
[<ffffffff812708e0>] ? __warn+0x1e0/0x1e0
[<ffffffff813530c0>] ? finish_wait+0xb0/0x180
[<ffffffff82016f7f>] __list_del_entry+0x14f/0x1d0
[<ffffffff813530c0>] ? finish_wait+0xb0/0x180
[<ffffffff813530cb>] finish_wait+0xbb/0x180
[<ffffffff81576227>] shmem_fault+0x4c7/0x6b0
[<ffffffff81574743>] ? shmem_getpage_gfp+0x673/0x1c90
[<ffffffff81575d60>] ? shmem_getpage_gfp+0x1c90/0x1c90
[<ffffffff81352150>] ? wake_atomic_t_function+0x210/0x210
[<ffffffff815ad316>] __do_fault+0x206/0x410
[<ffffffff815ad110>] ? do_page_mkwrite+0x320/0x320
[<ffffffff815b80ac>] ? handle_mm_fault+0x1cc/0x2a60
[<ffffffff815b8fd7>] handle_mm_fault+0x10f7/0x2a60
[<ffffffff815b8012>] ? handle_mm_fault+0x132/0x2a60
[<ffffffff81310a7f>] ? thread_group_cputime+0x49f/0x6e0
[<ffffffff815b7ee0>] ? __pmd_alloc+0x370/0x370
[<ffffffff81310a9c>] ? thread_group_cputime+0x4bc/0x6e0
[<ffffffff81310d2d>] ? thread_group_cputime_adjusted+0x6d/0xe0
[<ffffffff81237170>] ? __do_page_fault+0x220/0x9f0
[<ffffffff815cba10>] ? find_vma+0x30/0x150
[<ffffffff812373a2>] __do_page_fault+0x452/0x9f0
[<ffffffff81237bf5>] trace_do_page_fault+0x1e5/0x3a0
[<ffffffff8122a007>] do_async_page_fault+0x27/0xa0
[<ffffffff83c97618>] async_page_fault+0x28/0x30
[<ffffffff81fdec7c>] ? copy_user_generic_string+0x2c/0x40
[<ffffffff812b0303>] ? SyS_times+0x93/0x110
[<ffffffff812b0270>] ? do_sys_times+0x2b0/0x2b0
[<ffffffff812b0270>] ? do_sys_times+0x2b0/0x2b0
[<ffffffff8100524f>] do_syscall_64+0x1af/0x4d0
[<ffffffff83c96534>] entry_SYSCALL64_slow_path+0x25/0x25
------------[ cut here ]------------

The warning shows that it made it past the list_empty_careful() check
in finish_wait() but then bugs out on the &wait->task_list
dereference.
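For context, the finish_wait() path being discussed looks roughly like this (paraphrased from memory of 4.9-era kernel/sched/wait.c; treat it as an approximation, not verbatim kernel source). The lockless list_empty_careful() check is exactly where a wait entry still linked to an already-freed on-stack waitqueue head goes wrong:

```c
/* Approximate shape of 4.9-era finish_wait() -- a sketch, not the
 * literal kernel source. */
void finish_wait(wait_queue_head_t *q, wait_queue_t *wait)
{
	unsigned long flags;

	__set_current_state(TASK_RUNNING);
	/*
	 * Locklessly check whether the entry is still queued; only then
	 * take q->lock to unlink it.  If q was an on-stack head whose
	 * frame has already been freed, both this check and the unlink
	 * below dereference dead memory.
	 */
	if (!list_empty_careful(&wait->task_list)) {
		spin_lock_irqsave(&q->lock, flags);
		list_del_init(&wait->task_list);
		spin_unlock_irqrestore(&q->lock, flags);
	}
}
```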

Anything stick out?


Vegard

Dave Jones

Dec 5, 2016, 12:30:14 PM
> ------------[ cut here ]------------
> The warning shows that it made it past the list_empty_careful() check
> in finish_wait() but then bugs out on the &wait->task_list
> dereference.

I just pushed out the ftrace changes I made to Trinity that might help
you gather more clues.

Right now it's hardcoded to dump a trace to /boot/trace.txt when it
detects the kernel has become tainted.

Before a trinity run, I run this as root..

#!/bin/sh

cd /sys/kernel/debug/tracing/

# Enlarge the per-cpu ring buffer and enable the function tracer.
echo 10000 > buffer_size_kb
echo function >> current_tracer

# Filter out the noisy symbols that would otherwise swamp the trace.
for i in $(cat /home/davej/blacklist-symbols)
do
	echo $i >> set_ftrace_notrace
done

echo 1 >> tracing_on


blacklist-symbols is the more noisy stuff that pollutes traces.
Right now I use these: https://paste.fedoraproject.org/499794/14809582/
You may need to add some more.

(I'll get around to making all this scripting go away, and have trinity
just set this stuff up itself eventually)

Oh, and if you're not running as root, you might need a diff like below
so that trinity can stop the trace when it detects tainting.

Dave

diff --git a/kernel/trace/trace.c b/kernel/trace/trace.c
index 8696ce6bf2f6..2d6c97e871e0 100644
--- a/kernel/trace/trace.c
+++ b/kernel/trace/trace.c
@@ -7217,7 +7217,7 @@ init_tracer_tracefs(struct trace_array *tr, struct dentry *d_tracer)
 	trace_create_file("trace_clock", 0644, d_tracer, tr,
 			  &trace_clock_fops);
 
-	trace_create_file("tracing_on", 0644, d_tracer,
+	trace_create_file("tracing_on", 0666, d_tracer,
 			  tr, &rb_simple_fops);
 
 	create_trace_options_dir(tr);

Linus Torvalds

Dec 5, 2016, 1:00:06 PM
On Mon, Dec 5, 2016 at 9:09 AM, Vegard Nossum <vegard...@gmail.com> wrote:
>
> The warning shows that it made it past the list_empty_careful() check
> in finish_wait() but then bugs out on the &wait->task_list
> dereference.
>
> Anything stick out?

I hate that shmem waitqueue garbage. It's really subtle.

I think the problem is that "wake_up_all()" in shmem_fallocate()
doesn't necessarily wake up everything. It wakes up TASK_NORMAL -
which does include TASK_UNINTERRUPTIBLE, but doesn't actually mean
"everything on the list".

I think that what happens is that the waiters somehow move from
TASK_UNINTERRUPTIBLE to TASK_RUNNING early, and this means that
wake_up_all() will ignore them, leave them on the list, and now that
list on stack is no longer empty at the end.

And the way *THAT* can happen is that the task is on some *other*
waitqueue as well, and that other waitqueue wakes it up. That's not
impossible, you can certainly have people on wait-queues that still
take faults.

Or somebody just uses a directed wake_up_process() or something.

Since you apparently can recreate this fairly easily, how about trying
this stupid patch?

NOTE! This is entirely untested. I may have screwed this up entirely.
You get the idea, though - just remove the wait queue head from the
list - the list entries stay around, but nothing points to the stack
entry (that we're going to free) any more.

And add the warning to see if this actually ever triggers (and because
I'd like to see the callchain when it does, to see if it's another
waitqueue somewhere or what..)

Linus
patch.diff

Andy Lutomirski

Dec 5, 2016, 1:20:07 PM
On Sun, Dec 4, 2016 at 3:04 PM, Vegard Nossum <vegard...@gmail.com> wrote:
> On 23 November 2016 at 20:58, Dave Jones <da...@codemonkey.org.uk> wrote:
>> On Wed, Nov 23, 2016 at 02:34:19PM -0500, Dave Jones wrote:
>>
>> > [ 317.689216] BUG: Bad page state in process kworker/u8:8 pfn:4d8fd4
>> > trace from just before this happened. Does this shed any light ?
>> >
>> > https://codemonkey.org.uk/junk/trace.txt
>>
>> crap, I just noticed the timestamps in the trace come from quite a bit
>> later. I'll tweak the code to do the taint checking/ftrace stop after
>> every syscall, that should narrow the window some more.
>
> FWIW I hit this as well:
>
> BUG: unable to handle kernel paging request at ffffffff81ff08b7

We really ought to improve this message. If nothing else, it should
say whether it was a read, a write, or an instruction fetch.

> IP: [<ffffffff8135f2ea>] __lock_acquire.isra.32+0xda/0x1a30
> PGD 461e067 PUD 461f063
> PMD 1e001e1

Too lazy to manually decode this right now, but I don't think it matters.

> Oops: 0003 [#1] PREEMPT SMP KASAN

Note this is SMP, but that just means CONFIG_SMP=y. Vegard, how many
CPUs does your kernel think you have?

> RIP: 0010:[<ffffffff8135f2ea>] [<ffffffff8135f2ea>]
> __lock_acquire.isra.32+0xda/0x1a30
> RSP: 0018:ffff8801bab8f730 EFLAGS: 00010082
> RAX: ffffffff81ff071f RBX: 0000000000000000 RCX: 0000000000000000

RAX points to kernel text.

> Code: 89 4d b8 44 89 45 c0 89 4d c8 4c 89 55 d0 e8 ee c3 ff ff 48 85
> c0 4c 8b 55 d0 8b 4d c8 44 8b 45 c0 4c 8b 4d b8 0f 84 c6 01 00 00 <3e>
> ff 80 98 01 00 00 49 8d be 48 07 00 00 48 ba 00 00 00 00 00

  2b:	3e ff 80 98 01 00 00	incl   %ds:*0x198(%rax)		<-- trapping instruction

That's very strange. I think this is:

atomic_inc((atomic_t *)&class->ops);

but my kernel contains:

3cb4: f0 ff 80 98 01 00 00 lock incl 0x198(%rax)

So your kernel has been smp-alternatived. That 3e comes from
alternatives_smp_unlock. If you're running on SMP with UP
alternatives, things will break.

What's your kernel command line? Can we have your entire kernel log from boot?

Adding Borislav, since he's the guru for this code.

Linus Torvalds

Dec 5, 2016, 1:30:11 PM
On Mon, Dec 5, 2016 at 10:11 AM, Andy Lutomirski <lu...@kernel.org> wrote:
>
> So your kernel has been smp-alternatived. That 3e comes from
> alternatives_smp_unlock. If you're running on SMP with UP
> alternatives, things will break.

I'm assuming he's just running in a VM with a single CPU.

The problem that I pointed out with assuming wake_up_all() actually
removes all wait queue entries does not depend on SMP. The race is
much more fundamental and long-lived.

Linus

Vegard Nossum

Dec 5, 2016, 3:20:06 PM
On 5 December 2016 at 20:11, Vegard Nossum <vegard...@gmail.com> wrote:
> On 5 December 2016 at 18:55, Linus Torvalds
> ------------[ cut here ]------------
> WARNING: CPU: 22 PID: 14012 at mm/shmem.c:2668 shmem_fallocate+0x9a7/0xac0
> Kernel panic - not syncing: panic_on_warn set ...

So I noticed that panic_on_warn just after sending the email and I've
been waiting for it to trigger again.

The warning has triggered twice more without panic_on_warn set and I
haven't seen any crash yet.


Vegard

Linus Torvalds

Dec 5, 2016, 3:20:06 PM
On Mon, Dec 5, 2016 at 11:11 AM, Vegard Nossum <vegard...@gmail.com> wrote:
>
> ------------[ cut here ]------------
> WARNING: CPU: 22 PID: 14012 at mm/shmem.c:2668 shmem_fallocate+0x9a7/0xac0

Ok, good. So that's confirmed as the cause of this problem.

And the call chain that I wanted is obviously completely
uninteresting, because it's the call chain on the other side (the page
fault side) that would show the nested wake queue behavior. I was just
being stupid about it.

I wonder if we have any other places where we just blithely assume
that "wake_up_all()" will actually empty the whole wait queue. It's
_usually_ true, but as noted, nested waiting does happen.

Anyway, can you try this patch instead? It should actually cause the
wake_up_all() to always remove all entries, and thus the WARN_ON()
should no longer happen (and I removed the "list_del()" hackery).

Linus
patch.diff

Linus Torvalds

Dec 5, 2016, 3:40:05 PM
Adding the scheduler people to the participants list, and re-attaching
the patch, because while this patch is internal to the VM code, the
issue itself is not.

There might well be other cases where somebody goes "wake_up_all()"
will wake everybody up, so I can put the wait queue head on the stack,
and then after I've woken people up I can return".

Ingo/PeterZ: the reason that does *not* work is that "wake_up_all()"
does make sure that everybody is woken up, but the usual autoremove
wake function only removes the wakeup entry if the process was woken
up by that particular wakeup. If something else had woken it up, the
entry remains on the list, and the waker in this case returned with
the wait head still containing entries.

Which is deadly when the wait queue head is on the stack.

So I'm wondering if we should make that "synchronous_wake_function()"
available generally, and maybe introduce a DEFINE_WAIT_SYNC() helper
that uses it.

Of course, I'm really hoping that this shmem.c use is the _only_ such
case. But I doubt it.

Comments?

Note for Ingo and Peter: this patch has not been tested at all. But
Vegard did test an earlier patch of mine that just verified that yes,
the issue really was that wait queue entries remained on the wait
queue head just as we were about to return and free it.

Linus
patch.diff

Vegard Nossum

Dec 5, 2016, 4:40:10 PM
On 5 December 2016 at 21:35, Linus Torvalds
<torv...@linux-foundation.org> wrote:
> Note for Ingo and Peter: this patch has not been tested at all. But
> Vegard did test an earlier patch of mine that just verified that yes,
> the issue really was that wait queue entries remained on the wait
> queue head just as we were about to return and free it.

The second patch has been running for 1h+ without any problems of any
kind. I should typically have seen 2 crashes by now. I'll let it run
overnight to be sure.


Vegard

Peter Zijlstra

Dec 6, 2016, 3:20:04 AM
On Mon, Dec 05, 2016 at 12:35:52PM -0800, Linus Torvalds wrote:
> Adding the scheduler people to the participants list, and re-attaching
> the patch, because while this patch is internal to the VM code, the
> issue itself is not.
>
> There might well be other cases where somebody goes "wake_up_all()"
> will wake everybody up, so I can put the wait queue head on the stack,
> and then after I've woken people up I can return".
>
> Ingo/PeterZ: the reason that does *not* work is that "wake_up_all()"
> does make sure that everybody is woken up, but the usual autoremove
> wake function only removes the wakeup entry if the process was woken
> up by that particular wakeup. If something else had woken it up, the
> entry remains on the list, and the waker in this case returned with
> the wait head still containing entries.
>
> Which is deadly when the wait queue head is on the stack.

Yes, very much so.

> So I'm wondering if we should make that "synchronous_wake_function()"
> available generally, and maybe introduce a DEFINE_WAIT_SYNC() helper
> that uses it.

We could also do some debug code that tracks the ONSTACK ness and warns
on autoremove. Something like the below, equally untested.

>
> Of course, I'm really hoping that this shmem.c use is the _only_ such
> case. But I doubt it.

$ git grep DECLARE_WAIT_QUEUE_HEAD_ONSTACK | wc -l
28

---


diff --git a/include/linux/wait.h b/include/linux/wait.h
index 2408e8d5c05c..199faaa89847 100644
--- a/include/linux/wait.h
+++ b/include/linux/wait.h
@@ -39,6 +39,9 @@ struct wait_bit_queue {
struct __wait_queue_head {
spinlock_t lock;
struct list_head task_list;
+#ifdef CONFIG_DEBUG_WAITQUEUE
+ int onstack;
+#endif
};
typedef struct __wait_queue_head wait_queue_head_t;

@@ -56,9 +59,18 @@ struct task_struct;
#define DECLARE_WAITQUEUE(name, tsk) \
wait_queue_t name = __WAITQUEUE_INITIALIZER(name, tsk)

-#define __WAIT_QUEUE_HEAD_INITIALIZER(name) { \
+#ifdef CONFIG_DEBUG_WAITQUEUE
+#define ___WAIT_QUEUE_ONSTACK(onstack) .onstack = (onstack),
+#else
+#define ___WAIT_QUEUE_ONSTACK(onstack)
+#endif
+
+#define ___WAIT_QUEUE_HEAD_INITIALIZER(name, onstack) { \
.lock = __SPIN_LOCK_UNLOCKED(name.lock), \
- .task_list = { &(name).task_list, &(name).task_list } }
+ .task_list = { &(name).task_list, &(name).task_list }, \
+ ___WAIT_QUEUE_ONSTACK(onstack) }
+
+#define __WAIT_QUEUE_HEAD_INITIALIZER(name) ___WAIT_QUEUE_HEAD_INITIALIZER(name, 0)

#define DECLARE_WAIT_QUEUE_HEAD(name) \
wait_queue_head_t name = __WAIT_QUEUE_HEAD_INITIALIZER(name)
@@ -80,11 +92,12 @@ extern void __init_waitqueue_head(wait_queue_head_t *q, const char *name, struct

#ifdef CONFIG_LOCKDEP
# define __WAIT_QUEUE_HEAD_INIT_ONSTACK(name) \
- ({ init_waitqueue_head(&name); name; })
+ ({ init_waitqueue_head(&name); (name).onstack = 1; name; })
# define DECLARE_WAIT_QUEUE_HEAD_ONSTACK(name) \
wait_queue_head_t name = __WAIT_QUEUE_HEAD_INIT_ONSTACK(name)
#else
-# define DECLARE_WAIT_QUEUE_HEAD_ONSTACK(name) DECLARE_WAIT_QUEUE_HEAD(name)
+# define DECLARE_WAIT_QUEUE_HEAD_ONSTACK(name) \
+ wait_queue_head_t name = ___WAIT_QUEUE_HEAD_INITIALIZER(name, 1)
#endif

static inline void init_waitqueue_entry(wait_queue_t *q, struct task_struct *p)
diff --git a/kernel/sched/wait.c b/kernel/sched/wait.c
index 9453efe9b25a..746d00117d08 100644
--- a/kernel/sched/wait.c
+++ b/kernel/sched/wait.c
@@ -156,6 +156,13 @@ void __wake_up_sync(wait_queue_head_t *q, unsigned int mode, int nr_exclusive)
}
EXPORT_SYMBOL_GPL(__wake_up_sync); /* For internal use only */

+static inline void prepare_debug(wait_queue_head_t *q, wait_queue_t *wait)
+{
+#ifdef CONFIG_DEBUG_WAITQUEUE
+ WARN_ON_ONCE(q->onstack && wait->func == autoremove_wake_function);
+#endif
+}
+
/*
* Note: we use "set_current_state()" _after_ the wait-queue add,
* because we need a memory barrier there on SMP, so that any
@@ -178,6 +185,7 @@ prepare_to_wait(wait_queue_head_t *q, wait_queue_t *wait, int state)
if (list_empty(&wait->task_list))
__add_wait_queue(q, wait);
set_current_state(state);
+ prepare_debug(q, wait);
spin_unlock_irqrestore(&q->lock, flags);
}
EXPORT_SYMBOL(prepare_to_wait);
@@ -192,6 +200,7 @@ prepare_to_wait_exclusive(wait_queue_head_t *q, wait_queue_t *wait, int state)
if (list_empty(&wait->task_list))
__add_wait_queue_tail(q, wait);
set_current_state(state);
+ prepare_debug(q, wait);
spin_unlock_irqrestore(&q->lock, flags);
}
EXPORT_SYMBOL(prepare_to_wait_exclusive);
@@ -235,6 +244,7 @@ long prepare_to_wait_event(wait_queue_head_t *q, wait_queue_t *wait, int state)
}
set_current_state(state);
}
+ prepare_debug(q, wait);
spin_unlock_irqrestore(&q->lock, flags);

return ret;
diff --git a/lib/Kconfig.debug b/lib/Kconfig.debug
index 9bb7d825ba14..af2ef22a5b2b 100644
--- a/lib/Kconfig.debug
+++ b/lib/Kconfig.debug
@@ -1235,6 +1235,14 @@ config DEBUG_PI_LIST

If unsure, say N.

+config DEBUG_WAITQUEUE
+ bool "Debug waitqueue"
+ depends on DEBUG_KERNEL
+ help
+ Enable this to do sanity checking on waitqueue usage
+
+ If unsure, say N.
+
config DEBUG_SG
bool "Debug SG table operations"
depends on DEBUG_KERNEL

Ingo Molnar

Dec 6, 2016, 3:40:05 AM

* Peter Zijlstra <pet...@infradead.org> wrote:

> $ git grep DECLARE_WAIT_QUEUE_HEAD_ONSTACK | wc -l
> 28

This debug facility looks sensible. A couple of minor suggestions:

> --- a/include/linux/wait.h
> +++ b/include/linux/wait.h
> @@ -39,6 +39,9 @@ struct wait_bit_queue {
> struct __wait_queue_head {
> spinlock_t lock;
> struct list_head task_list;
> +#ifdef CONFIG_DEBUG_WAITQUEUE
> + int onstack;
> +#endif

The structure will pack better in the debug-enabled case if 'onstack' is next to
'lock', as spinlock_t is 4 bytes on many architectures.

> -#define __WAIT_QUEUE_HEAD_INITIALIZER(name) { \
> +#ifdef CONFIG_DEBUG_WAITQUEUE
> +#define ___WAIT_QUEUE_ONSTACK(onstack) .onstack = (onstack),
> +#else
> +#define ___WAIT_QUEUE_ONSTACK(onstack)
> +#endif

Please help readers by showing the internal structure of the definition:

#ifdef CONFIG_DEBUG_WAITQUEUE
# define ___WAIT_QUEUE_ONSTACK(onstack) .onstack = (onstack),
#else
# define ___WAIT_QUEUE_ONSTACK(onstack)
#endif


> +static inline void prepare_debug(wait_queue_head_t *q, wait_queue_t *wait)
> +{
> +#ifdef CONFIG_DEBUG_WAITQUEUE
> + WARN_ON_ONCE(q->onstack && wait->func == autoremove_wake_function);
> +#endif
> +}

I'd name this debug_waitqueue_check() or such - as the 'prepare' is a bit
misleadig (we don't prepare debugging, we do the debug check here).

> +config DEBUG_WAITQUEUE
> + bool "Debug waitqueue"
> + depends on DEBUG_KERNEL

I'd name it DEBUG_SCHED_WAITQUEUE=y and I'd also make it depend on
CONFIG_DEBUG_SCHED.

LGTM otherwise!

Thanks,

Ingo

Vegard Nossum

Dec 6, 2016, 3:50:04 AM
Alright, so nearly 12 hours later I don't see either the new warning
or the original crash at all, so feel free to add:

Tested-by: Vegard Nossum <vegard...@oracle.com>.

That said, my 8 VMs had all panicked in some way due to OOMs (which is
new since v4.8), although some got page allocation stalls for >20s and
died because "khugepaged blocked for more than 120 seconds", others
got "Out of memory and no killable processes".


Vegard

Linus Torvalds

Dec 6, 2016, 12:00:06 PM
On Tue, Dec 6, 2016 at 12:16 AM, Peter Zijlstra <pet...@infradead.org> wrote:
>>
>> Of course, I'm really hoping that this shmem.c use is the _only_ such
>> case. But I doubt it.
>
> $ git grep DECLARE_WAIT_QUEUE_HEAD_ONSTACK | wc -l
> 28

Hmm. Most of them seem to be ok, because they use "wait_event()",
which will always remove itself from the wait-queue. And they do it
from the place that allocates the wait-queue.

IOW, the mm/shmem.c case really was fairly special, because it just
did "prepare_to_wait()", and then did a finish_wait() - and not in the
thread that allocated it on the stack.

So it's really that "some _other_ thread allocated the waitqueue on
the stack, and now we're doing a wait on it" that is bad.

So the normal pattern seems to be:

- allocate wq on the stack
- pass it on to a waker
- wait for it

and that's ok, because as part of "wait for it" we will also be
cleaning things up.

The reason mm/shmem.c was buggy was that it did

- allocate wq on the stack
- pass it on to somebody else to wait for
- wake it up

and *that* is buggy, because it's the waiter, not the waker, that
normally cleans things up.

Is there some good way to find this kind of pattern automatically, I wonder....

Linus

Dave Jones

Dec 7, 2016, 12:00:06 PM
On Sat, Dec 03, 2016 at 11:48:33AM -0500, Dave Jones wrote:

> The interesting process here seems to be kworker/u8:17, and the trace
> captures some of what that was doing before that bad page was hit.

I'm travelling next week, so I'm trying to braindump the stuff I've
found so far and summarise so I can pick it back up later if no-one else
figures it out first.

I've hit the bad page map spew with enough regularity that I've now got a handful of
good traces.

http://codemonkey.org.uk/junk/btrfs/bad-page-state1.txt
http://codemonkey.org.uk/junk/btrfs/bad-page-state2.txt
http://codemonkey.org.uk/junk/btrfs/bad-page-state3.txt
http://codemonkey.org.uk/junk/btrfs/bad-page-state4.txt
http://codemonkey.org.uk/junk/btrfs/bad-page-state5.txt
http://codemonkey.org.uk/junk/btrfs/bad-page-state6.txt

It smells to me like a race between truncate and the writeback
workqueue. The variety of traces here seem to show both sides
of the race, sometimes it's kworker, sometimes a trinity child process.

bad-page-state3.txt onwards have some bonus trace_printk's from
btrfs_setsize as I was curious what sizes we were passing down to
truncate. The only patterns I see are going from very large to very
small sizes. Perhaps that causes truncate to generate so much
writeback that it makes the race apparent ?



Other stuff I keep hitting:

Start transaction spew:
http://codemonkey.org.uk/junk/btrfs/start_transaction.txt
That's the WARN_ON(h->use_count > 2);
I hit this with enough regularity that I had to comment it out.
It's not clear to me whether this is related at all.

Lockdep spew:
http://codemonkey.org.uk/junk/btrfs/register_lock_class1.txt
http://codemonkey.org.uk/junk/btrfs/register_lock_class2.txt
This stuff has been around for a while (4.6ish iirc)

Sometimes the fs got into a screwed up state that needed btrfscking.
http://codemonkey.org.uk/junk/btrfs/replay-log-fail.txt

Dave

Steven Rostedt

Dec 9, 2016, 4:20:05 PM
On Thu, Dec 01, 2016 at 10:32:09AM -0500, Dave Jones wrote:
>
> (function-graph screws up the RIP for some reason, 'return_to_handler'
> should actually be btrfs_destroy_inode)

That's because function_graph hijacks the return address and replaces it with
return_to_handler. The back trace has code to show what that handler is
suppose to be. But I should see why the WARNING shows it instead, and fix
that too.

Thanks,

-- Steve