kernel BUG at mm/page-writeback.c:LINE!

syzbot

unread,

Jan 3, 2021, 9:19:16 AM1/3/21

to ak...@linux-foundation.org, linux-...@vger.kernel.org, linu...@kvack.org, syzkall...@googlegroups.com

Hello,

syzbot found the following issue on:

HEAD commit: 139711f0 Merge branch 'akpm' (patches from Andrew)
git tree: upstream
console output: https://syzkaller.appspot.com/x/log.txt?x=11648e93500000
kernel config: https://syzkaller.appspot.com/x/.config?x=f3e74a3fb99ae2c2
dashboard link: https://syzkaller.appspot.com/bug?extid=2fc0712f8f8b8b8fa0ef
compiler: gcc (GCC) 10.1.0-syz 20200507

Unfortunately, I don't have any reproducer for this issue yet.

IMPORTANT: if you fix the issue, please add the following tag to the commit:
Reported-by: syzbot+2fc071...@syzkaller.appspotmail.com

kernel BUG at mm/page-writeback.c:2241!
invalid opcode: 0000 [#1] PREEMPT SMP KASAN
CPU: 1 PID: 26311 Comm: syz-executor.2 Not tainted 5.11.0-rc1-syzkaller #0
Hardware name: Google Google Compute Engine/Google Compute Engine, BIOS Google 01/01/2011
RIP: 0010:write_cache_pages+0xd06/0x1190 mm/page-writeback.c:2241
Code: 02 00 00 e8 dc 6e da ff 48 c7 c6 80 64 54 89 4c 89 ef e8 0d 04 0b 00 0f 0b 49 c7 c4 c0 77 f7 8e e9 ce f9 ff ff e8 ba 6e da ff <0f> 0b e8 b3 6e da ff 49 8d 5c 24 ff e9 c5 f8 ff ff e8 a4 6e da ff
RSP: 0018:ffffc9000931f850 EFLAGS: 00010212
RAX: 000000000000ac05 RBX: 0000000000008000 RCX: ffffc9000d7c1000
RDX: 0000000000040000 RSI: ffffffff819805d6 RDI: 0000000000000003
RBP: 0000000000000000 R08: 0000000000000000 R09: ffffea0000c0ce07
R10: ffffffff8197fee3 R11: 0000000000000000 R12: ffffea0002762cc8
R13: ffffea0000c0ce00 R14: dffffc0000000000 R15: 0000000000000000
FS: 00007f653a526700(0000) GS:ffff8880b9f00000(0000) knlGS:0000000000000000
CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
CR2: 000000000052e94c CR3: 0000000015251000 CR4: 0000000000350ee0
Call Trace:
mpage_writepages+0xd8/0x230 fs/mpage.c:714
do_writepages+0xec/0x290 mm/page-writeback.c:2352
__filemap_fdatawrite_range+0x2a1/0x380 mm/filemap.c:422
fat_cont_expand+0x169/0x230 fs/fat/file.c:235
fat_setattr+0xac2/0xf40 fs/fat/file.c:501
notify_change+0xb60/0x10a0 fs/attr.c:336
do_truncate+0x134/0x1f0 fs/open.c:64
do_sys_ftruncate+0x703/0x860 fs/open.c:195
do_syscall_64+0x2d/0x70 arch/x86/entry/common.c:46
entry_SYSCALL_64_after_hwframe+0x44/0xa9
RIP: 0033:0x45e299
Code: 0d b4 fb ff c3 66 2e 0f 1f 84 00 00 00 00 00 66 90 48 89 f8 48 89 f7 48 89 d6 48 89 ca 4d 89 c2 4d 89 c8 4c 8b 4c 24 08 0f 05 <48> 3d 01 f0 ff ff 0f 83 db b3 fb ff c3 66 2e 0f 1f 84 00 00 00 00
RSP: 002b:00007f653a525c68 EFLAGS: 00000246 ORIG_RAX: 000000000000004d
RAX: ffffffffffffffda RBX: 0000000000000002 RCX: 000000000045e299
RDX: 0000000000000000 RSI: 0000000000000800 RDI: 0000000000000003
RBP: 000000000119c060 R08: 0000000000000000 R09: 0000000000000000
R10: 0000000000000000 R11: 0000000000000246 R12: 000000000119c034
R13: 00007fff76bf2e8f R14: 00007f653a5269c0 R15: 000000000119c034
Modules linked in:
---[ end trace 23c7a881902c6de9 ]---
RIP: 0010:write_cache_pages+0xd06/0x1190 mm/page-writeback.c:2241
Code: 02 00 00 e8 dc 6e da ff 48 c7 c6 80 64 54 89 4c 89 ef e8 0d 04 0b 00 0f 0b 49 c7 c4 c0 77 f7 8e e9 ce f9 ff ff e8 ba 6e da ff <0f> 0b e8 b3 6e da ff 49 8d 5c 24 ff e9 c5 f8 ff ff e8 a4 6e da ff
RSP: 0018:ffffc9000931f850 EFLAGS: 00010212
RAX: 000000000000ac05 RBX: 0000000000008000 RCX: ffffc9000d7c1000
RDX: 0000000000040000 RSI: ffffffff819805d6 RDI: 0000000000000003
RBP: 0000000000000000 R08: 0000000000000000 R09: ffffea0000c0ce07
R10: ffffffff8197fee3 R11: 0000000000000000 R12: ffffea0002762cc8
R13: ffffea0000c0ce00 R14: dffffc0000000000 R15: 0000000000000000
FS: 00007f653a526700(0000) GS:ffff8880b9f00000(0000) knlGS:0000000000000000
CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
CR2: 0000001b2f228000 CR3: 0000000015251000 CR4: 0000000000350ee0

---
This report is generated by a bot. It may contain errors.
See https://goo.gl/tpsmEJ for more information about syzbot.
syzbot engineers can be reached at syzk...@googlegroups.com.

syzbot will keep track of this issue. See:
https://goo.gl/tpsmEJ#status for how to communicate with syzbot.

Andrew Morton

unread,

Jan 4, 2021, 3:41:56 PM1/4/21

to syzbot, linux-...@vger.kernel.org, linu...@kvack.org, syzkall...@googlegroups.com, Linus Torvalds

Well that's exciting. write_cache_pages() does:

if (PageWriteback(page)) {
if (wbc->sync_mode != WB_SYNC_NONE)
wait_on_page_writeback(page);
else
goto continue_unlock;
}

bang -->> BUG_ON(PageWriteback(page));

and filemap_fdatawrite_range() passed WB_SYNC_ALL to
__filemap_fdatawrite_range().

So either wait_on_page_writeback() simply failed to work (returned
without waiting) or someone came in and unexpectedly set PG_writeback a
second time.

And because it isn't reproducible, we don't know if it's a -mm patch or
it it's a mainline bug or whatever. There isn't much material in -mm
at present and I can't see any which would affect these things, so is
it reasonable to suspect that this is a mainline bug?

Linus, how confident are you in those wait_on_page_bit_common()
changes?

Linus Torvalds

unread,

Jan 4, 2021, 4:52:49 PM1/4/21

to Andrew Morton, Hugh Dickins, Matthew Wilcox, syzbot, Linux Kernel Mailing List, Linux-MM, syzkaller-bugs

On Mon, Jan 4, 2021 at 12:41 PM Andrew Morton <ak...@linux-foundation.org> wrote:
>
> >
> > kernel BUG at mm/page-writeback.c:2241!

> > Call Trace:
> > mpage_writepages+0xd8/0x230 fs/mpage.c:714
> > do_writepages+0xec/0x290 mm/page-writeback.c:2352
> > __filemap_fdatawrite_range+0x2a1/0x380 mm/filemap.c:422
> > fat_cont_expand+0x169/0x230 fs/fat/file.c:235
> > fat_setattr+0xac2/0xf40 fs/fat/file.c:501
> > notify_change+0xb60/0x10a0 fs/attr.c:336
> > do_truncate+0x134/0x1f0 fs/open.c:64
> > do_sys_ftruncate+0x703/0x860 fs/open.c:195
> > do_syscall_64+0x2d/0x70 arch/x86/entry/common.c:46
> > entry_SYSCALL_64_after_hwframe+0x44/0xa9
>

> Well that's exciting. write_cache_pages() does:
>
> if (PageWriteback(page)) {
> if (wbc->sync_mode != WB_SYNC_NONE)
> wait_on_page_writeback(page);
> else
> goto continue_unlock;
> }
>
> bang -->> BUG_ON(PageWriteback(page));

This is the same situation that was discussed earlier, and that we
thought Hugh commit 073861ed77b6 ("mm: fix VM_BUG_ON(PageTail) and
BUG_ON(PageWriteback)") should fix.

Basically, the theory was that the sequence

if (PageWriteback(page))
wait_on_page_writeback(page);
BUG_ON(PageWriteback(page));

could have the wake-up of the _previous_ owner of the page happen, and
wake up the wait_on_page_writeback() waiter, but then a new owner of
the page would re-allocate it and mark it for writeback.

> So either wait_on_page_writeback() simply failed to work (returned
> without waiting) or someone came in and unexpectedly set PG_writeback a
> second time.

So Hugh found at least _one_ way that that "someone came in and
unexpectedly set PG_writeback a second time" could happen.

But that fix by Hugh is already in that HEAD commit that syzbot is
now reporting, so there's something else going on.

> Linus, how confident are you in those wait_on_page_bit_common()
> changes?

Pretty confident. The atomicity of the bitops themselves is fairly simple.

But in the writeback bit? No. The old code would basically _loop_ if
it was woken up and the writeback bit was set again, and would hide
any problems with it.

The new code basically goes "ok, the writeback bit was clear at one
point, so I've waited enough".

We could easily turn the "if ()" in wait_on_page_writeback() into a "while()".

But honestly, it does smell to me like the bug is always in the caller
not having serialized with whatever actually starts writeback. High
figured out one such case.

This code holds the page lock, but I don't see where
set_page_writeback() would always be done with the page lock held. So
what really protects against PG_writeback simply being set again?

The whole BUG_ON() seems entirely buggy to me.

In fact, even if you hold the page lock while doing
set_page_writeback(), since the actual IO does *NOT* hold the page
lock, the unlock happens without it. So even if every single case of
setting the page writeback were to hold the page lock, what keeps this
from happening:

CPU1 = end previous writeback
CPU2 = start new writeback under page lock
CPU3 = write_cache_pages()

CPU1 CPU2 CPU3
---- ---- ----

end_page_writeback()
test_clear_page_writeback(page)
... delayed...

lock_page();
set_page_writeback()
unlock_page()

lock_page()
wait_on_page_writeback();

wake_up_page(page, PG_writeback);
.. wakes up CPU3 ..

BUG_ON(PageWriteback(page));

IOW, that BUG_ON() really feels entirely bogus to me. Notice how it
wasn't actually serialized with the waking up of the _previous_ bit.

Could we make the wait_on_page_writeback() just loop if it sees the
page under writeback again? Sure.

Could we make the wait_on_page_bit_common() code say "if this is
PG_writeback, I won't wake it up after all, because the bit is set
again?" Sure.

But I feel it's really that end_page_writeback() itself is
fundamentally buggy, because the "wakeup" is not atomic with the bit
clearing _and_ it doesn't actually hold the page lock that is
allegedly serializing this all.

That raciness was what caused the "stale wakeup from previous owner"
thing too. And I think that Hugh fixed the page re-use case, but the
fundamental problem of end_page_writeback() kind of remained.

And yes, I think this was all hidden by wait_on_page_writeback()
effectively looping over the "PageWriteback(page)" test because of how
wait_on_page_bit() worked.

So the one-liner of changing the "if" to "while" in
wait_on_page_writeback() should get us back to what we used to do.

Except I still get the feeling that the bug really is not in
wait_on_page_writeback(), but in the end_page_writeback() side.

Comments? I'm perfectly happy doing the one-liner. I would just be
_happier_ with end_page_writeback() having the serialization..

The real problem is that "wake_up_page(page, bit)" is not the thing
that actually clears the bit. So there's a fundamental race between
clearing the bit and waking something up.

Which makes me think that the best option would actually be to move
the bit clearing _into_ wake_up_page(). But that looks like a very big
change.

Linus

Hugh Dickins

unread,

Jan 4, 2021, 10:29:00 PM1/4/21

to Linus Torvalds, Andrew Morton, Hugh Dickins, Matthew Wilcox, syzbot, Linux Kernel Mailing List, Linux-MM, syzkaller-bugs

On Mon, 4 Jan 2021, Linus Torvalds wrote:
> On Mon, Jan 4, 2021 at 12:41 PM Andrew Morton <ak...@linux-foundation.org> wrote:
>
> > Linus, how confident are you in those wait_on_page_bit_common()
> > changes?
>
> Pretty confident. The atomicity of the bitops themselves is fairly simple.
>
> But in the writeback bit? No. The old code would basically _loop_ if
> it was woken up and the writeback bit was set again, and would hide
> any problems with it.
>
> The new code basically goes "ok, the writeback bit was clear at one
> point, so I've waited enough".
>
> We could easily turn the "if ()" in wait_on_page_writeback() into a "while()".
>
> But honestly, it does smell to me like the bug is always in the caller
> not having serialized with whatever actually starts writeback. High
> figured out one such case.
>
> This code holds the page lock, but I don't see where
> set_page_writeback() would always be done with the page lock held. So
> what really protects against PG_writeback simply being set again?
>
> The whole BUG_ON() seems entirely buggy to me.
>
> In fact, even if you hold the page lock while doing
> set_page_writeback(), since the actual IO does *NOT* hold the page
> lock, the unlock happens without it. So even if every single case of
> setting the page writeback were to hold the page lock,

I did an audit when this came up before, and though not 100% confident
in my diligence, it certainly looked that way; and others looked too
(IIRC Matthew had a patch to add a WARN_ON_ONCE or whatever, but that
didn't go upstream).

> what keeps this from happening:
>
> CPU1 = end previous writeback
> CPU2 = start new writeback under page lock
> CPU3 = write_cache_pages()
>
> CPU1 CPU2 CPU3
> ---- ---- ----
>
> end_page_writeback()
> test_clear_page_writeback(page)
> ... delayed...
>
>
> lock_page();
> set_page_writeback()
> unlock_page()
>
>
> lock_page()
> wait_on_page_writeback();
>
> wake_up_page(page, PG_writeback);
> .. wakes up CPU3 ..
>
> BUG_ON(PageWriteback(page));
>
> IOW, that BUG_ON() really feels entirely bogus to me. Notice how it
> wasn't actually serialized with the waking up of the _previous_ bit.

Well. That looks so obvious now you suggest it, that I feel very
stupid for not seeing it before, so have tried hard to disprove you.
But I think you're right.

>
> Could we make the wait_on_page_writeback() just loop if it sees the
> page under writeback again? Sure.
>
> Could we make the wait_on_page_bit_common() code say "if this is
> PG_writeback, I won't wake it up after all, because the bit is set
> again?" Sure.
>
> But I feel it's really that end_page_writeback() itself is
> fundamentally buggy, because the "wakeup" is not atomic with the bit
> clearing _and_ it doesn't actually hold the page lock that is
> allegedly serializing this all.

And we won't be adding a lock_page() into end_page_writeback()!

>
> That raciness was what caused the "stale wakeup from previous owner"
> thing too. And I think that Hugh fixed the page re-use case, but the
> fundamental problem of end_page_writeback() kind of remained.
>
> And yes, I think this was all hidden by wait_on_page_writeback()
> effectively looping over the "PageWriteback(page)" test because of how
> wait_on_page_bit() worked.
>
> So the one-liner of changing the "if" to "while" in
> wait_on_page_writeback() should get us back to what we used to do.

I think that is the realistic way to go.

>
> Except I still get the feeling that the bug really is not in
> wait_on_page_writeback(), but in the end_page_writeback() side.
>
> Comments? I'm perfectly happy doing the one-liner. I would just be
> _happier_ with end_page_writeback() having the serialization..
>
> The real problem is that "wake_up_page(page, bit)" is not the thing
> that actually clears the bit. So there's a fundamental race between
> clearing the bit and waking something up.
>
> Which makes me think that the best option would actually be to move
> the bit clearing _into_ wake_up_page(). But that looks like a very big
> change.

I'll be surprised if that direction is even possible, without unpleasant
extra locking. If there were only one wakeup to be done, perhaps, but
potentially there are many. When I looked before, it seemed that the
clear bit needs to come before the wakeup, and the wakeup needs to come
before the clear bit. And the BOOKMARK case drops q->lock.

It should be possible to rely on the XArray's i_pages lock rather than
the page lock for serialization, much as I did in one variant of the
patch I sent originally. Updated version appended below for show
(most of it rearrangement+cleanup rather than the functional change);
but I think it's slightly incomplete (__test_set_page_writeback()
should take i_pages lock even in the !mapping_use_writeback_tags case);
and last time around you were worried by the NULL mapping case - which
I believe just comes from an abundance of caution (who writes back
without any backing? and truncation/eviction wait_on_page_writeback()).

But I expect you to take one look and quickly opt instead for the "while"
in wait_on_page_writeback(): which is not so bad now that you've shown a
scenario which justifies it.

--- 5.11-rc2/include/linux/page-flags.h 2020-12-27 20:39:36.819923814 -0800
+++ linux/include/linux/page-flags.h 2021-01-04 17:27:32.771920607 -0800
@@ -549,7 +549,6 @@ static __always_inline void SetPageUptod

CLEARPAGEFLAG(Uptodate, uptodate, PF_NO_TAIL)

-int test_clear_page_writeback(struct page *page);
int __test_set_page_writeback(struct page *page, bool keep_write);

#define test_set_page_writeback(page) \
--- 5.11-rc2/include/linux/pagemap.h 2020-12-13 14:41:30.000000000 -0800
+++ linux/include/linux/pagemap.h 2021-01-04 17:27:32.775920636 -0800
@@ -660,6 +660,7 @@ static inline int lock_page_or_retry(str
*/
extern void wait_on_page_bit(struct page *page, int bit_nr);
extern int wait_on_page_bit_killable(struct page *page, int bit_nr);
+extern void wake_up_page_bit(struct page *page, int bit_nr);

/*
* Wait for a page to be unlocked.
--- 5.11-rc2/kernel/sched/wait_bit.c 2020-03-29 15:25:41.000000000 -0700
+++ linux/kernel/sched/wait_bit.c 2021-01-04 17:27:32.779920665 -0800
@@ -90,9 +90,8 @@ __wait_on_bit_lock(struct wait_queue_hea
ret = action(&wbq_entry->key, mode);
/*
* See the comment in prepare_to_wait_event().
- * finish_wait() does not necessarily takes wwq_head->lock,
- * but test_and_set_bit() implies mb() which pairs with
- * smp_mb__after_atomic() before wake_up_page().
+ * finish_wait() does not necessarily take
+ * wwq_head->lock, but test_and_set_bit() implies mb().
*/
if (ret)
finish_wait(wq_head, &wbq_entry->wq_entry);
--- 5.11-rc2/mm/filemap.c 2020-12-27 20:39:37.659932212 -0800
+++ linux/mm/filemap.c 2021-01-04 17:28:59.464560914 -0800
@@ -1093,7 +1093,7 @@ static int wake_page_function(wait_queue
return (flags & WQ_FLAG_EXCLUSIVE) != 0;
}

-static void wake_up_page_bit(struct page *page, int bit_nr)
+void wake_up_page_bit(struct page *page, int bit_nr)
{
wait_queue_head_t *q = page_waitqueue(page);
struct wait_page_key key;
@@ -1147,13 +1147,6 @@ static void wake_up_page_bit(struct page
spin_unlock_irqrestore(&q->lock, flags);
}

-static void wake_up_page(struct page *page, int bit)
-{
- if (!PageWaiters(page))
- return;
- wake_up_page_bit(page, bit);
-}
-
/*
* A choice of three behaviors for wait_on_page_bit_common():
*/
@@ -1466,40 +1459,6 @@ void unlock_page(struct page *page)
}
EXPORT_SYMBOL(unlock_page);

-/**
- * end_page_writeback - end writeback against a page
- * @page: the page
- */
-void end_page_writeback(struct page *page)
-{
- /*
- * TestClearPageReclaim could be used here but it is an atomic
- * operation and overkill in this particular case. Failing to
- * shuffle a page marked for immediate reclaim is too mild to
- * justify taking an atomic operation penalty at the end of
- * ever page writeback.
- */
- if (PageReclaim(page)) {
- ClearPageReclaim(page);
- rotate_reclaimable_page(page);
- }
-
- /*
- * Writeback does not hold a page reference of its own, relying
- * on truncation to wait for the clearing of PG_writeback.
- * But here we must make sure that the page is not freed and
- * reused before the wake_up_page().
- */
- get_page(page);
- if (!test_clear_page_writeback(page))
- BUG();
-
- smp_mb__after_atomic();
- wake_up_page(page, PG_writeback);
- put_page(page);
-}
-EXPORT_SYMBOL(end_page_writeback);
-
/*
* After completing I/O on a page, call this routine to update the page
* flags appropriately
--- 5.11-rc2/mm/page-writeback.c 2020-12-13 14:41:30.000000000 -0800
+++ linux/mm/page-writeback.c 2021-01-04 17:28:59.464560914 -0800
@@ -589,7 +589,7 @@ static void wb_domain_writeout_inc(struc

/*
* Increment @wb's writeout completion count and the global writeout
- * completion count. Called from test_clear_page_writeback().
+ * completion count. Called from end_page_writeback().
*/
static inline void __wb_writeout_inc(struct bdi_writeback *wb)
{
@@ -2719,49 +2719,69 @@ int clear_page_dirty_for_io(struct page
}
EXPORT_SYMBOL(clear_page_dirty_for_io);

-int test_clear_page_writeback(struct page *page)
+/**
+ * end_page_writeback - end writeback against a page
+ * @page: the page
+ */
+void end_page_writeback(struct page *page)
{
- struct address_space *mapping = page_mapping(page);
+ struct address_space *mapping;
struct mem_cgroup *memcg;
struct lruvec *lruvec;
- int ret;
+ unsigned long flags;
+ int writeback;
+
+ /*
+ * TestClearPageReclaim could be used here but it is an atomic
+ * operation and overkill in this particular case. Failing to
+ * shuffle a page marked for immediate reclaim is too mild to
+ * justify taking an atomic operation penalty at the end of
+ * every page writeback.
+ */
+ if (PageReclaim(page)) {
+ ClearPageReclaim(page);
+ rotate_reclaimable_page(page);
+ }

+ mapping = page_mapping(page);
+ WARN_ON_ONCE(!mapping);
memcg = lock_page_memcg(page);
lruvec = mem_cgroup_page_lruvec(page, page_pgdat(page));
+
+ dec_lruvec_state(lruvec, NR_WRITEBACK);
+ dec_zone_page_state(page, NR_ZONE_WRITE_PENDING);
+ inc_node_page_state(page, NR_WRITTEN);
+
+ if (mapping)
+ xa_lock_irqsave(&mapping->i_pages, flags);
+
+ writeback = TestClearPageWriteback(page);
+ /* No need for smp_mb__after_atomic() after TestClear */
+ if (PageWaiters(page))
+ wake_up_page_bit(page, PG_writeback);
+
if (mapping && mapping_use_writeback_tags(mapping)) {
struct inode *inode = mapping->host;
struct backing_dev_info *bdi = inode_to_bdi(inode);
- unsigned long flags;

- xa_lock_irqsave(&mapping->i_pages, flags);
- ret = TestClearPageWriteback(page);
- if (ret) {
- __xa_clear_mark(&mapping->i_pages, page_index(page),
+ __xa_clear_mark(&mapping->i_pages, page_index(page),
PAGECACHE_TAG_WRITEBACK);
- if (bdi->capabilities & BDI_CAP_WRITEBACK_ACCT) {
- struct bdi_writeback *wb = inode_to_wb(inode);
+ if (bdi->capabilities & BDI_CAP_WRITEBACK_ACCT) {
+ struct bdi_writeback *wb = inode_to_wb(inode);

- dec_wb_stat(wb, WB_WRITEBACK);
- __wb_writeout_inc(wb);
- }
+ dec_wb_stat(wb, WB_WRITEBACK);
+ __wb_writeout_inc(wb);
}
+ if (inode && !mapping_tagged(mapping, PAGECACHE_TAG_WRITEBACK))
+ sb_clear_inode_writeback(inode);
+ }

- if (mapping->host && !mapping_tagged(mapping,
- PAGECACHE_TAG_WRITEBACK))
- sb_clear_inode_writeback(mapping->host);
-
+ if (mapping)
xa_unlock_irqrestore(&mapping->i_pages, flags);
- } else {
- ret = TestClearPageWriteback(page);
- }
- if (ret) {
- dec_lruvec_state(lruvec, NR_WRITEBACK);
- dec_zone_page_state(page, NR_ZONE_WRITE_PENDING);
- inc_node_page_state(page, NR_WRITTEN);
- }
__unlock_page_memcg(memcg);
- return ret;
+ BUG_ON(!writeback);
}
+EXPORT_SYMBOL(end_page_writeback);

int __test_set_page_writeback(struct page *page, bool keep_write)
{

Linus Torvalds

unread,

Jan 5, 2021, 2:31:57 PM1/5/21

to Hugh Dickins, Andrew Morton, Matthew Wilcox, syzbot, Linux Kernel Mailing List, Linux-MM, syzkaller-bugs

On Mon, Jan 4, 2021 at 7:29 PM Hugh Dickins <hu...@google.com> wrote:
>
> > But I feel it's really that end_page_writeback() itself is
> > fundamentally buggy, because the "wakeup" is not atomic with the bit
> > clearing _and_ it doesn't actually hold the page lock that is
> > allegedly serializing this all.
>
> And we won't be adding a lock_page() into end_page_writeback()!

Right.

However, the more I think about this, the more I feel
end_page_writeback() is kind of ok, and the real problem is that we
should have had a "lock/unlock" pattern for the PG_writeback bit
instead.

Instead, what we have is basically a special "wait for bit to clear",
and then external sychronization - even if lacking - for set/clear
bit.

And the "wait for bit to clear" and "set bit" aren't even adjacent -
the waiting happens in write_cache_pages(), but then the actual
setting happens later in the actual "(*writepage)()" function.

What _would_ be nicer, I feel, is if write_cache_pages() simply did
the equivalent of "lock_page()" except for setting the PG_writeback
bit. The whole "wait for bit to clear" is fundamentally a much more
ambiguous thing for that whole "what if somebody else got it" reason,
but it's also nasty because it can be very unfair (the same way our
"lock_page()" itself used to be very unfair).

IOW, you can have that situation where one actor continually waits for
the bit to clear, but then somebody else comes in and gets the bit
instead.

Of course, writeback is much less likely than lock_page(), so it
probably doesn't happen much in practice.

And honestly, while I think it would be good to make the rule be
"writepage function was entered with the writeback bit already held",
that kind of change would be a major pain. We have at leastr 49
different 'writepage' implementations, and while I think a few of them
end up just being block_write_full_page(), it means that we have a lot
of that

BUG_ON(PageWriteback(page));
set_page_writeback(page);

pattern in various random filesystem code that all would have to be changed.

So I think I know what I'd _like_ the code to be like, but I don't
think it's worth it.

> > So the one-liner of changing the "if" to "while" in
> > wait_on_page_writeback() should get us back to what we used to do.
>
> I think that is the realistic way to go.

Yeah, that's what I'll do.

> > Which makes me think that the best option would actually be to move
> > the bit clearing _into_ wake_up_page(). But that looks like a very big
> > change.

See above - having thought about this overnight, I don't think this is
even the best change.

> But I expect you to take one look and quickly opt instead for the "while"
> in wait_on_page_writeback(): which is not so bad now that you've shown a
> scenario which justifies it.

[ patch removed - I agree, this pain isn't worth it ]

Linus

Linus Torvalds

unread,

Jan 5, 2021, 2:54:08 PM1/5/21

to Hugh Dickins, Andrew Morton, Matthew Wilcox, syzbot, Linux Kernel Mailing List, Linux-MM, syzkaller-bugs

On Tue, Jan 5, 2021 at 11:31 AM Linus Torvalds
<torv...@linux-foundation.org> wrote:
>
> On Mon, Jan 4, 2021 at 7:29 PM Hugh Dickins <hu...@google.com> wrote:
> >
> > > So the one-liner of changing the "if" to "while" in
> > > wait_on_page_writeback() should get us back to what we used to do.
> >
> > I think that is the realistic way to go.
>
> Yeah, that's what I'll do.

I took your "way to go" statement as an ack, and made it all be commit
c2407cf7d22d ("mm: make wait_on_page_writeback() wait for multiple
pending writebacks").

Linus

Hugh Dickins

unread,

Jan 5, 2021, 4:13:53 PM1/5/21

to Linus Torvalds, Hugh Dickins, Andrew Morton, Matthew Wilcox, syzbot, Linux Kernel Mailing List, Linux-MM, syzkaller-bugs

Great, thanks, I see it now.

I was going to raise a question, whether you should now revert
073861ed77b6 ("mm: fix VM_BUG_ON(PageTail) and BUG_ON(PageWriteback)"):
which would not have gone in like that if c2407cf7d22d were already in.

But if it were reverted, we'd need some other fix for the PageTail part
of it; and I still cannot think of anywhere else where we knowingly
operated on a struct page without holding a reference; and there have
been no adverse reports on its extra get_page+put_page.

So I think it's safest to leave it in.

Hugh

Linus Torvalds

unread,

Jan 5, 2021, 4:23:08 PM1/5/21

to Hugh Dickins, Andrew Morton, Matthew Wilcox, syzbot, Linux Kernel Mailing List, Linux-MM, syzkaller-bugs

On Tue, Jan 5, 2021 at 1:13 PM Hugh Dickins <hu...@google.com> wrote:
>
> I was going to raise a question, whether you should now revert
> 073861ed77b6 ("mm: fix VM_BUG_ON(PageTail) and BUG_ON(PageWriteback)"):
> which would not have gone in like that if c2407cf7d22d were already in.

Honestly, even if it wasn't for that PageTail issue, I think
073861ed77b6 is just the right thing to do anyway. It just feels _so_
much safer to not have the possibility of that page wait thing
following while the page is possibly then being free'd and re-used at
the same time.

So I think the only reason to revert that commit would be if we were
to find that it's a huge performance problem to raise the page
refcount temporarily. Which I think is very unlikely (since we already
dirty the page structure due to the page flags modification - although
they are far enough apart that it might be a different cacheline).

Linus

Matthew Wilcox

unread,

Jan 5, 2021, 4:35:08 PM1/5/21

to Linus Torvalds, Hugh Dickins, Andrew Morton, syzbot, Linux Kernel Mailing List, Linux-MM, syzkaller-bugs

struct pages _tend_ to be 64 bytes on 64-bit platforms (and i suspect
you're long past caring about performance on 32-bit platforms).

Linus Torvalds

unread,

Jan 8, 2021, 9:04:41 PM1/8/21

to Hugh Dickins, Andrew Morton, Matthew Wilcox, syzbot, Linux Kernel Mailing List, Linux-MM, syzkaller-bugs

Oh, and Michael Larabel (of phoronix) reports that that one-liner does
something bad to a few PostgreSQL tests, on the order of 5-10%
regression on some machines (but apparently not others).

I suspect that's a sign of instability in the benchmark numbers, but
it probably also means that we have some silly condition where
multiple threads want to clean the same page.

I sent him a patch to try if it ends up being better to just not wake
things up early at all (instead of the "if" -> "while") conversion.
That trivial patch appended here in case anybody has comments.

Just the fact that that one-liner made a performance impact makes me
go "hmm", though. Michael didn't see the BUG_ON(), so it's presumably
some _other_ user of wait_on_page_writeback() than the
write_cache_pages() one that causes issues.

Anybody got any suspicions? Honestly, when working on the page wait
queues, I was working under the assumption that it's really just the
page lock that truly matters.

I'm thinking things like __filemap_fdatawait_range(), which doesn't
hold the page lock at all, so it's all kinds of non-serialized, and
could now be waiting for any number of IO's ro complete..

Oh well. This email doesn't really have a point, it's more of a
heads-up that that "wait to see one or multiple writebacks" thing
seems to matter more than I would have expected for some loads..

Linus

patch

Jan Kara

unread,

Jan 12, 2021, 5:44:27 AM1/12/21

to Linus Torvalds, Hugh Dickins, Andrew Morton, Matthew Wilcox, syzbot, Linux Kernel Mailing List, Linux-MM, syzkaller-bugs

On Fri 08-01-21 18:04:21, Linus Torvalds wrote:
> On Tue, Jan 5, 2021 at 11:53 AM Linus Torvalds
> <torv...@linux-foundation.org> wrote:
> >
> > I took your "way to go" statement as an ack, and made it all be commit
> > c2407cf7d22d ("mm: make wait_on_page_writeback() wait for multiple
> > pending writebacks").
>
> Oh, and Michael Larabel (of phoronix) reports that that one-liner does
> something bad to a few PostgreSQL tests, on the order of 5-10%
> regression on some machines (but apparently not others).

Do you have more details? From my experience (we do regular pgbench runs
for various kernels in various configs in SUSE) PostgreSQL numbers tend to
be somewhat noisy and more dependent on CPU scheduling and NUMA locality
than anything else. But it very much depends on the exact config passed to
pgbench so that's why I'm asking...

> I suspect that's a sign of instability in the benchmark numbers, but
> it probably also means that we have some silly condition where
> multiple threads want to clean the same page.
>
> I sent him a patch to try if it ends up being better to just not wake
> things up early at all (instead of the "if" -> "while") conversion.
> That trivial patch appended here in case anybody has comments.
>
> Just the fact that that one-liner made a performance impact makes me
> go "hmm", though. Michael didn't see the BUG_ON(), so it's presumably
> some _other_ user of wait_on_page_writeback() than the
> write_cache_pages() one that causes issues.
>
> Anybody got any suspicions? Honestly, when working on the page wait
> queues, I was working under the assumption that it's really just the
> page lock that truly matters.
>
> I'm thinking things like __filemap_fdatawait_range(), which doesn't
> hold the page lock at all, so it's all kinds of non-serialized, and
> could now be waiting for any number of IO's ro complete..
>
> Oh well. This email doesn't really have a point, it's more of a
> heads-up that that "wait to see one or multiple writebacks" thing
> seems to matter more than I would have expected for some loads..

Honestly I'm surprised your patch made a difference as well. It is pretty
common a page gets redirtied while it's being written back but usually it
takes time before next writeback of the page is started. But I guess with
the DB load it is possible e.g. if we frequently flush out some page for
data consistency reasons (I know PostgreSQL is using sync_file_range(2)
interface to start flushing pages early and then uses fsync(2) when it
really needs the pages written which could create a situation with unfair
treatment of PageWriteback bit).

Honza
--
Jan Kara <ja...@suse.com>
SUSE Labs, CR

Linus Torvalds

unread,

Jan 12, 2021, 12:35:55 PM1/12/21

to Jan Kara, Hugh Dickins, Andrew Morton, Matthew Wilcox, syzbot, Linux Kernel Mailing List, Linux-MM, syzkaller-bugs

On Tue, Jan 12, 2021 at 2:44 AM Jan Kara <ja...@suse.cz> wrote:
>
> On Fri 08-01-21 18:04:21, Linus Torvalds wrote:
> >
> > Oh, and Michael Larabel (of phoronix) reports that that one-liner does
> > something bad to a few PostgreSQL tests, on the order of 5-10%
> > regression on some machines (but apparently not others).
>
> Do you have more details? From my experience (we do regular pgbench runs
> for various kernels in various configs in SUSE) PostgreSQL numbers tend to
> be somewhat noisy and more dependent on CPU scheduling and NUMA locality
> than anything else. But it very much depends on the exact config passed to
> pgbench so that's why I'm asking...

No, I don't really have many more details. I don't have things like
raw numbers or exact configurations, but Michael has been very
responsive if you ask, so if you are interested, just email him at
Michael Larabel <Mic...@michaellarabel.com>.

It wasn't NUMA - apparently the machines he saw this on were just
plain consumer setups, and his larger machines didn't actually show
the effect. But yes, I suspect it was some scheduling artifact, and
probably just fairly random noise from just changing the scheduling
pattern a bit.

Linus

Reply all

Reply to author

Forward