mm: WARNING in __delete_from_page_cache

瀏覽次數:38 次
跳到第一則未讀訊息

Dmitry Vyukov

未讀,
2016年1月24日 清晨5:48:412016/1/24
收件者:Matthew Wilcox、Alexander Viro、linux-...@vger.kernel.org、LKML、Andrew Morton、Michal Hocko、Jan Kara、Vlastimil Babka、Kirill A. Shutemov、Matthew Wilcox、Junichi Nomura、Greg Thelen、Dave Hansen、linu...@kvack.org、syzkaller、Kostya Serebryany、Alexander Potapenko、Sasha Levin
Hello,

The following program triggers WARNING in __delete_from_page_cache:

------------[ cut here ]------------
WARNING: CPU: 0 PID: 7676 at mm/filemap.c:217
__delete_from_page_cache+0x9f6/0xb60()
Modules linked in:
CPU: 0 PID: 7676 Comm: a.out Not tainted 4.4.0+ #276
Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS Bochs 01/01/2011
00000000ffffffff ffff88006d3f7738 ffffffff82999e2d 0000000000000000
ffff8800620a0000 ffffffff86473d20 ffff88006d3f7778 ffffffff81352089
ffffffff81658d36 ffffffff86473d20 00000000000000d9 ffffea0000009d60
Call Trace:
[< inline >] __dump_stack lib/dump_stack.c:15
[<ffffffff82999e2d>] dump_stack+0x6f/0xa2 lib/dump_stack.c:50
[<ffffffff81352089>] warn_slowpath_common+0xd9/0x140 kernel/panic.c:482
[<ffffffff813522b9>] warn_slowpath_null+0x29/0x30 kernel/panic.c:515
[<ffffffff81658d36>] __delete_from_page_cache+0x9f6/0xb60 mm/filemap.c:217
[<ffffffff81658fb2>] delete_from_page_cache+0x112/0x200 mm/filemap.c:244
[<ffffffff818af369>] __dax_fault+0x859/0x1800 fs/dax.c:487
[<ffffffff8186f4f6>] blkdev_dax_fault+0x26/0x30 fs/block_dev.c:1730
[< inline >] wp_pfn_shared mm/memory.c:2208
[<ffffffff816e9145>] do_wp_page+0xc85/0x14f0 mm/memory.c:2307
[< inline >] handle_pte_fault mm/memory.c:3323
[< inline >] __handle_mm_fault mm/memory.c:3417
[<ffffffff816ecec3>] handle_mm_fault+0x2483/0x4640 mm/memory.c:3446
[<ffffffff8127eff6>] __do_page_fault+0x376/0x960 arch/x86/mm/fault.c:1238
[<ffffffff8127f738>] trace_do_page_fault+0xe8/0x420 arch/x86/mm/fault.c:1331
[<ffffffff812705c4>] do_async_page_fault+0x14/0xd0 arch/x86/kernel/kvm.c:264
[<ffffffff86338f78>] async_page_fault+0x28/0x30 arch/x86/entry/entry_64.S:986
[<ffffffff86336c36>] entry_SYSCALL_64_fastpath+0x16/0x7a
arch/x86/entry/entry_64.S:185
---[ end trace dae21e0f85f1f98c ]---


// autogenerated by syzkaller (http://github.com/google/syzkaller)
#include <pthread.h>
#include <stdint.h>
#include <string.h>
#include <sys/syscall.h>
#include <unistd.h>
#include <fcntl.h>

int main()
{
syscall(SYS_mmap, 0x20000000ul, 0x10000ul, 0x3ul, 0x32ul, -1, 0x0ul);
int fd = syscall(SYS_open, "/dev/ram1", O_RDWR);
syscall(SYS_mmap, 0x20a31000ul, 0x3000ul, 0x3ul, 0xb011ul, fd, 0x0ul);
*(uint64_t*)0x20003000 = 1;
syscall(SYS_write, fd, 0x20003000ul, 0x78ul, 0, 0, 0);
syscall(SYS_getresuid, 0x20000688ul, 0x200008f2ul, 0x20a31000ul, 0, 0, 0);
return 0;
}

On commit 30f05309bde49295e02e45c7e615f73aa4e0ccc2.

Kirill A. Shutemov

未讀,
2016年1月24日 下午6:04:242016/1/24
收件者:Dmitry Vyukov、Matthew Wilcox、Alexander Viro、linux-...@vger.kernel.org、LKML、Andrew Morton、Michal Hocko、Jan Kara、Vlastimil Babka、Kirill A. Shutemov、Matthew Wilcox、Junichi Nomura、Greg Thelen、Dave Hansen、linu...@kvack.org、syzkaller、Kostya Serebryany、Alexander Potapenko、Sasha Levin
Reduced and human readable test case:

#include <fcntl.h>
#include <unistd.h>
#include <sys/mman.h>

int main()
{
int fd;
char *p;

fd = open("/dev/ram0", O_RDWR);
p = mmap(NULL, 4096, PROT_READ | PROT_WRITE, MAP_SHARED, fd, 0);
write(fd, "1", 1);
*p = 1;
return 0;
}

Looks like DAX doesn't expect to see something except hole-page in the radix
tree. This expectation is [probably] true for files on DAX-enabled
filesystems, but it seems broken for ramdisks.

Matthew?

--
Kirill A. Shutemov

Jan Kara

未讀,
2016年1月25日 清晨7:21:572016/1/25
收件者:Kirill A. Shutemov、Dmitry Vyukov、Matthew Wilcox、Alexander Viro、linux-...@vger.kernel.org、LKML、Andrew Morton、Michal Hocko、Jan Kara、Vlastimil Babka、Kirill A. Shutemov、Matthew Wilcox、Junichi Nomura、Greg Thelen、Dave Hansen、linu...@kvack.org、syzkaller、Kostya Serebryany、Alexander Potapenko、Sasha Levin、Dan Williams
Thanks. Despite the huge list of recipients the author of the changes
hasn't been CCed :) I've added Dan to CC since he wrote DAX support for
block devices. It seems somehow the write didn't go through the DAX path
but through the standard page cache write path. Ah, I see, only
file->f_mapping->host has S_DAX set but io_is_direct() which decides
whether DAX or pagecache path should be used for writes uses file->f_inode
which is something different for block devices...

Honza
--
Jan Kara <ja...@suse.com>
SUSE Labs, CR

Williams, Dan J

未讀,
2016年1月25日 晚上10:42:372016/1/25
收件者:kir...@shutemov.name、ja...@suse.cz、syzk...@googlegroups.com、linux-...@vger.kernel.org、kirill....@linux.intel.com、linu...@kvack.org、dave....@linux.intel.com、k...@google.com、vba...@suse.cz、vi...@zeniv.linux.org.uk、dvy...@google.com、ak...@linux-foundation.org、gth...@google.com、wi...@linux.intel.com、linux-...@vger.kernel.org、mho...@suse.com、ja...@suse.com、gli...@google.com、sasha...@oracle.com、Wilcox, Matthew R、j-no...@ce.jp.nec.com
On Mon, 2016-01-25 at 13:22 +0100, Jan Kara wrote:
[..]
> Thanks. Despite the huge list of recipients the author of the changes
> hasn't been CCed :) I've added Dan to CC since he wrote DAX support
> for
> block devices. It seems somehow the write didn't go through the DAX
> path
> but through the standard page cache write path. Ah, I see, only
> file->f_mapping->host has S_DAX set but io_is_direct() which decides
> whether DAX or pagecache path should be used for writes uses file-
> >f_inode
> which is something different for block devices... 

Thanks, yes, the following silences the warning for me:

8<----- (git am --scissors)
Subject: fs, block: force direct-I/O for dax-enabled block devices

From: Dan Williams <dan.j.w...@intel.com>

Similar to the file I/O path, re-direct all I/O to the DAX path for I/O
to a block-device special file.

Otherwise, we confuse the DAX code that does not expect to find live
data in the page cache:
Cc: Matthew Wilcox <wi...@linux.intel.com>
Cc: Ross Zwisler <ross.z...@linux.intel.com>
Fixes: 5a023cdba50c ("block: enable dax for raw block devices")
Reported-by: Dmitry Vyukov <dvy...@google.com>
Reported-by: Kirill A. Shutemov <kir...@shutemov.name>
Suggested-by: Jan Kara <ja...@suse.cz>
Signed-off-by: Dan Williams <dan.j.w...@intel.com>
---
 fs/block_dev.c     |    5 -----
 include/linux/fs.h |   12 +++++++++++-
 2 files changed, 11 insertions(+), 6 deletions(-)

diff --git a/fs/block_dev.c b/fs/block_dev.c
index 7b9cd49622b1..277008617b2d 100644
--- a/fs/block_dev.c
+++ b/fs/block_dev.c
@@ -156,11 +156,6 @@ blkdev_get_block(struct inode *inode, sector_t iblock,
  return 0;
 }
 
-static struct inode *bdev_file_inode(struct file *file)
-{
- return file->f_mapping->host;
-}
-
 static ssize_t
 blkdev_direct_IO(struct kiocb *iocb, struct iov_iter *iter, loff_t offset)
 {
diff --git a/include/linux/fs.h b/include/linux/fs.h
index 1a2046275cdf..a4c4314eed48 100644
--- a/include/linux/fs.h
+++ b/include/linux/fs.h
@@ -1237,6 +1237,11 @@ static inline struct inode *file_inode(const struct file *f)
  return f->f_inode;
 }
 
+static inline struct inode *bdev_file_inode(struct file *file)
+{
+ return file->f_mapping->host;
+}
+
 static inline int locks_lock_file_wait(struct file *filp, struct file_lock *fl)
 {
  return locks_lock_inode_wait(file_inode(filp), fl);
@@ -2907,7 +2912,12 @@ extern void replace_mount_options(struct super_block *sb, char *options);
 
 static inline bool io_is_direct(struct file *filp)
 {
- return (filp->f_flags & O_DIRECT) || IS_DAX(file_inode(filp));
+ struct inode *inode = file_inode(filp);
+
+ if (S_ISBLK(inode->i_mode))
+ inode = bdev_file_inode(filp);
+
+ return (filp->f_flags & O_DIRECT) || IS_DAX(inode);
 }
 
 static inline int iocb_flags(struct file *file)

Jan Kara

未讀,
2016年1月26日 凌晨3:28:552016/1/26
收件者:Williams, Dan J、kir...@shutemov.name、ja...@suse.cz、syzk...@googlegroups.com、linux-...@vger.kernel.org、kirill....@linux.intel.com、linu...@kvack.org、dave....@linux.intel.com、k...@google.com、vba...@suse.cz、vi...@zeniv.linux.org.uk、dvy...@google.com、ak...@linux-foundation.org、gth...@google.com、wi...@linux.intel.com、linux-...@vger.kernel.org、mho...@suse.com、ja...@suse.com、gli...@google.com、sasha...@oracle.com、Wilcox, Matthew R、j-no...@ce.jp.nec.com
The patch looks good to me. You can add:

Reviewed-by: Jan Kara <ja...@suse.cz>

Honza

Matthew Wilcox

未讀,
2016年1月26日 清晨7:55:002016/1/26
收件者:Williams, Dan J、kir...@shutemov.name、ja...@suse.cz、syzk...@googlegroups.com、linux-...@vger.kernel.org、kirill....@linux.intel.com、linu...@kvack.org、dave....@linux.intel.com、k...@google.com、vba...@suse.cz、vi...@zeniv.linux.org.uk、dvy...@google.com、ak...@linux-foundation.org、gth...@google.com、linux-...@vger.kernel.org、mho...@suse.com、ja...@suse.com、gli...@google.com、sasha...@oracle.com、Wilcox, Matthew R、j-no...@ce.jp.nec.com
On Tue, Jan 26, 2016 at 03:42:34AM +0000, Williams, Dan J wrote:
> @@ -2907,7 +2912,12 @@ extern void replace_mount_options(struct super_block *sb, char *options);
>  
>  static inline bool io_is_direct(struct file *filp)
>  {
> - return (filp->f_flags & O_DIRECT) || IS_DAX(file_inode(filp));

I think this should just be a one-liner:

- return (filp->f_flags & O_DIRECT) || IS_DAX(file_inode(filp));
+ return (filp->f_flags & O_DIRECT) || IS_DAX(filp->f_mapping->host);

This does the right thing for block device inodes and filesystem inodes.
(see the opening stanzas of __dax_fault for an example).

Jan Kara

未讀,
2016年1月26日 上午8:36:272016/1/26
收件者:Matthew Wilcox、Williams, Dan J、kir...@shutemov.name、ja...@suse.cz、syzk...@googlegroups.com、linux-...@vger.kernel.org、kirill....@linux.intel.com、linu...@kvack.org、dave....@linux.intel.com、k...@google.com、vba...@suse.cz、vi...@zeniv.linux.org.uk、dvy...@google.com、ak...@linux-foundation.org、gth...@google.com、linux-...@vger.kernel.org、mho...@suse.com、ja...@suse.com、gli...@google.com、sasha...@oracle.com、Wilcox, Matthew R、j-no...@ce.jp.nec.com
Ah, right. This looks indeed better.

Williams, Dan J

未讀,
2016年1月26日 上午11:59:312016/1/26
收件者:ja...@suse.cz、wi...@linux.intel.com、syzk...@googlegroups.com、linux-...@vger.kernel.org、kirill....@linux.intel.com、linu...@kvack.org、k...@google.com、dave....@linux.intel.com、vba...@suse.cz、vi...@zeniv.linux.org.uk、dvy...@google.com、ak...@linux-foundation.org、gth...@google.com、kir...@shutemov.name、linux-...@vger.kernel.org、mho...@suse.com、ja...@suse.com、gli...@google.com、sasha...@oracle.com、Wilcox, Matthew R、j-no...@ce.jp.nec.com
Oh, yeah, looks good.

8<---- (git am --scissors)
Subject: fs, block: force direct-I/O for dax-enabled block devices

From: Dan Williams <dan.j.w...@intel.com>

Similar to the file I/O path, re-direct all I/O to the DAX path for I/O
to a block-device special file.  Both regular files and device special
files can use the common filp->f_mapping->host lookup to determing is
DAX is enabled.
Cc: Ross Zwisler <ross.z...@linux.intel.com>
Fixes: 5a023cdba50c ("block: enable dax for raw block devices")
Reported-by: Dmitry Vyukov <dvy...@google.com>
Reported-by: Kirill A. Shutemov <kir...@shutemov.name>
Suggested-by: Jan Kara <ja...@suse.cz>
Reviewed-by: Jan Kara <ja...@suse.cz>
Suggested-by: Matthew Wilcox <wi...@linux.intel.com>
Signed-off-by: Dan Williams <dan.j.w...@intel.com>
---
 include/linux/fs.h |    2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/include/linux/fs.h b/include/linux/fs.h
index 1a2046275cdf..b10002d4a5f5 100644
--- a/include/linux/fs.h
+++ b/include/linux/fs.h
@@ -2907,7 +2907,7 @@ extern void replace_mount_options(struct super_block *sb, char *options);
 
 static inline bool io_is_direct(struct file *filp)
 {
- return (filp->f_flags & O_DIRECT) || IS_DAX(file_inode(filp));
+ return (filp->f_flags & O_DIRECT) || IS_DAX(filp->f_mapping->host);
 }
 

Ross Zwisler

未讀,
2016年1月27日 下午1:02:582016/1/27
收件者:Dmitry Vyukov、Matthew Wilcox、Alexander Viro、linux-...@vger.kernel.org、LKML、Andrew Morton、Michal Hocko、Jan Kara、Vlastimil Babka、Kirill A. Shutemov、Matthew Wilcox、Junichi Nomura、Greg Thelen、Dave Hansen、linu...@kvack.org、syzkaller、Kostya Serebryany、Alexander Potapenko、Sasha Levin
Having inline functions represented in the stack trace and having file
names with line numbers seems really useful - how did you get this
output? Is this a feature of some kernel patch applied for syzkaller?

Dmitry Vyukov

未讀,
2016年1月27日 下午1:07:322016/1/27
收件者:syzkaller、Matthew Wilcox、Alexander Viro、linux-...@vger.kernel.org、LKML、Andrew Morton、Michal Hocko、Jan Kara、Vlastimil Babka、Kirill A. Shutemov、Matthew Wilcox、Junichi Nomura、Greg Thelen、Dave Hansen、linu...@kvack.org、Kostya Serebryany、Alexander Potapenko、Sasha Levin
I pipe normal kernel output through this script:
https://github.com/google/sanitizers/blob/master/address-sanitizer/tools/kasan_symbolize.py

If you are in linux source dir with vmlinux and modules, then you just do:
$ cat crash | kasan_symbolize.py

Ross Zwisler

未讀,
2016年1月27日 下午5:04:462016/1/27
收件者:Williams, Dan J、ja...@suse.cz、wi...@linux.intel.com、syzk...@googlegroups.com、linux-...@vger.kernel.org、kirill....@linux.intel.com、linu...@kvack.org、k...@google.com、dave....@linux.intel.com、vba...@suse.cz、vi...@zeniv.linux.org.uk、dvy...@google.com、ak...@linux-foundation.org、gth...@google.com、kir...@shutemov.name、linux-...@vger.kernel.org、mho...@suse.com、ja...@suse.com、gli...@google.com、sasha...@oracle.com、Wilcox, Matthew R、j-no...@ce.jp.nec.com
I had a test case where I was hitting a warning while inserting into the page
cache when the inode was supposed to be DAX, and this clears up my issue as
well.

Tested-by: Ross Zwisler <ross.z...@linux.intel.com>

> ---
>  include/linux/fs.h |    2 +-
>  1 file changed, 1 insertion(+), 1 deletion(-)
>
> diff --git a/include/linux/fs.h b/include/linux/fs.h
> index 1a2046275cdf..b10002d4a5f5 100644
> --- a/include/linux/fs.h
> +++ b/include/linux/fs.h
> @@ -2907,7 +2907,7 @@ extern void replace_mount_options(struct super_block *sb, char *options);
>  
>  static inline bool io_is_direct(struct file *filp)
>  {
> - return (filp->f_flags & O_DIRECT) || IS_DAX(file_inode(filp));
> + return (filp->f_flags & O_DIRECT) || IS_DAX(filp->f_mapping->host);
>  }
>  
>  static inline int iocb_flags(struct file *file)
> --
> To unsubscribe from this list: send the line "unsubscribe linux-fsdevel" in
> the body of a message to majo...@vger.kernel.org
> More majordomo info at http://vger.kernel.org/majordomo-info.html

Ross Zwisler

未讀,
2016年1月28日 下午2:51:202016/1/28
收件者:Ross Zwisler、Williams, Dan J、ja...@suse.cz、wi...@linux.intel.com、syzk...@googlegroups.com、linux-...@vger.kernel.org、kirill....@linux.intel.com、linu...@kvack.org、k...@google.com、dave....@linux.intel.com、vba...@suse.cz、vi...@zeniv.linux.org.uk、dvy...@google.com、ak...@linux-foundation.org、gth...@google.com、kir...@shutemov.name、linux-...@vger.kernel.org、mho...@suse.com、ja...@suse.com、gli...@google.com、sasha...@oracle.com、Wilcox, Matthew R、j-no...@ce.jp.nec.com
My testing has turned up another case where we can end up doing both page
cache I/O and DAX I/O to the same raw block device. Here is the stack trace
of the error, passed through kasan_symbolize.py:

------------[ cut here ]------------
WARNING: CPU: 1 PID: 598 at mm/filemap.c:590 __add_to_page_cache_locked+0x30d/0x3c0()
Modules linked in: nd_pmem nd_btt nd_e820 libnvdimm
CPU: 1 PID: 598 Comm: systemd-udevd Tainted: G W 4.5.0-rc1+ #1
Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 1.8.2-20150714_191134- 04/01/2014
0000000000000000 0000000001e02977 ffff88040ff9fa88 ffffffff81576db2
0000000000000000 ffff88040ff9fac0 ffffffff810a65a2 ffff8804103d9020
0000000000000000 ffff8804103d9008 00000000ffffffea ffffea0002ebf080
Call Trace:
[< inline >] __dump_stack lib/dump_stack.c:15
[<ffffffff81576db2>] dump_stack+0x44/0x62 lib/dump_stack.c:50
[<ffffffff810a65a2>] warn_slowpath_common+0x82/0xc0 kernel/panic.c:482
[<ffffffff810a66ea>] warn_slowpath_null+0x1a/0x20 kernel/panic.c:515
[< inline >] page_cache_tree_insert mm/filemap.c:590
[<ffffffff811ca32d>] __add_to_page_cache_locked+0x30d/0x3c0 mm/filemap.c:649
[<ffffffff811ca449>] add_to_page_cache_lru+0x49/0xd0 mm/filemap.c:697
[< inline >] __read_cache_page mm/filemap.c:2300
[<ffffffff811ca87b>] do_read_cache_page+0x6b/0x300 mm/filemap.c:2330
[<ffffffff811cab2c>] read_cache_page+0x1c/0x20 mm/filemap.c:2377
[< inline >] read_mapping_page include/linux/pagemap.h:391
[<ffffffff81558d74>] read_dev_sector+0x34/0xf0 block/partition-generic.c:558
[< inline >] read_part_sector block/partitions/check.h:37
[<ffffffff81560c4e>] read_lba+0x18e/0x290 block/partitions/efi.c:264
[< inline >] find_valid_gpt block/partitions/efi.c:610
[<ffffffff81561542>] efi_partition+0xf2/0x7d0 block/partitions/efi.c:692
[<ffffffff8155b47e>] check_partition+0x13e/0x220 block/partitions/check.c:166
[<ffffffff81559670>] rescan_partitions+0xc0/0x2b0 block/partition-generic.c:434
[<ffffffff81553aeb>] __blkdev_reread_part+0x6b/0xa0 block/ioctl.c:171
[<ffffffff81553b45>] blkdev_reread_part+0x25/0x40 block/ioctl.c:191
[<ffffffff815547b9>] blkdev_ioctl+0x5a9/0xab0 block/ioctl.c:624
[<ffffffff812a2303>] block_ioctl+0x43/0x50 fs/block_dev.c:1624
[< inline >] vfs_ioctl fs/ioctl.c:43
[<ffffffff81274452>] do_vfs_ioctl+0xa2/0x6a0 fs/ioctl.c:674
[< inline >] SYSC_ioctl fs/ioctl.c:689
[<ffffffff81274ac9>] SyS_ioctl+0x79/0x90 fs/ioctl.c:680
[<ffffffff81a6a732>] entry_SYSCALL_64_fastpath+0x12/0x76 arch/x86/entry/entry_64.S:185
---[ end trace 96171b39eee8b31b ]---

Here is my minimal reproducer:

#define _GNU_SOURCE
#include <sys/mman.h>
#include <sys/types.h>
#include <sys/stat.h>
#include <fcntl.h>
#include <linux/falloc.h>
#include <stdio.h>
#include <string.h>
#include <errno.h>
#include <unistd.h>

#define PAGE(a) ((a)*0x1000)

int main(int argc, char *argv[])
{
int i, fd;
char *data_array = (char*) 0x10200000; /* request a 2MiB aligned address with mmap() */
int a;

fd = open("/dev/pmem0", O_RDWR);
if (fd < 0) {
perror("fd");
return 1;
}

data_array = mmap(data_array, PAGE(0x300), PROT_READ|PROT_WRITE,
MAP_SHARED, fd, PAGE(0));

data_array[PAGE(0x0)] = 1;
close(fd);
return 0;
}

I believe what is happening is that I am doing a DAX PMD fault, and that fault
is happening at the same time that an IOCTL is doing a scan for partition
tables. The above stack trace is for a scan of an efi_partition(), but I get a
similar stack trace going through sgi_partition(), ldm_partition(),
msdos_partition(), etc. This looping is happening in check_partition(), as it
loops through the various partition types.

I think the issue essentially is that the partition scanning code doesn't have
a direct I/O code path, at least from what I have found. This means that it's
not calling into DAX, and instead is just using the page cache.
回覆所有人
回覆作者
轉寄
0 則新訊息