scsi: use-after-free in bio_copy_from_iter

90 views
Skip to first unread message

Dmitry Vyukov

unread,
Nov 25, 2016, 2:08:38 PM11/25/16
to Doug Gilbert, je...@linux.vnet.ibm.com, Martin K. Petersen, linux-scsi, LKML, ax...@kernel.dk, linux...@vger.kernel.org, syzkaller
Hello,

The following program triggers use-after-free in bio_copy_from_iter:
https://gist.githubusercontent.com/dvyukov/80cd94b4e4c288f16ee4c787d404118b/raw/10536069562444da51b758bb39655b514ff93b45/gistfile1.txt


==================================================================
BUG: KASAN: use-after-free in copy_from_iter+0xf30/0x15e0 at addr
ffff880062c6e02a
Read of size 4096 by task a.out/8529
page:ffffea00018b1b80 count:2 mapcount:0 mapping:ffff88006c80e9d0 index:0x1695
flags: 0x5fffc0000000864(referenced|lru|active|private)
page dumped because: kasan: bad access detected
page->mem_cgroup:ffff88003ebcea00
CPU: 1 PID: 8529 Comm: a.out Not tainted 4.9.0-rc6+ #55
Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS Bochs 01/01/2011
ffff880039ca6770 ffffffff834c3bb9 ffffffff00000001 1ffff10007394c81
ffffed0007394c79 0000000041b58ab3 ffffffff89575c30 ffffffff834c38cb
0000000000000000 0000000000000000 0000000000000001 0000000000000000
Call Trace:
[< inline >] __dump_stack lib/dump_stack.c:15
[<ffffffff834c3bb9>] dump_stack+0x2ee/0x3f5 lib/dump_stack.c:51
[< inline >] print_address_description mm/kasan/report.c:211
[< inline >] kasan_report_error mm/kasan/report.c:285
[<ffffffff819f18b8>] kasan_report+0x418/0x440 mm/kasan/report.c:305
[< inline >] check_memory_region_inline mm/kasan/kasan.c:308
[<ffffffff819f0509>] check_memory_region+0x139/0x190 mm/kasan/kasan.c:315
[<ffffffff819f0a43>] memcpy+0x23/0x50 mm/kasan/kasan.c:350
[<ffffffff83525d10>] copy_from_iter+0xf30/0x15e0 lib/iov_iter.c:559
[<ffffffff83526834>] copy_page_from_iter+0x474/0x860 lib/iov_iter.c:614
[< inline >] bio_copy_from_iter block/bio.c:1025
[<ffffffff833dfc10>] bio_copy_user_iov+0xb50/0xf10 block/bio.c:1226
[< inline >] __blk_rq_map_user_iov block/blk-map.c:56
[<ffffffff83421c15>] blk_rq_map_user_iov+0x295/0x930 block/blk-map.c:130
[<ffffffff834223e9>] blk_rq_map_user+0x139/0x1e0 block/blk-map.c:159
[< inline >] sg_start_req drivers/scsi/sg.c:1757
[<ffffffff8470269a>] sg_common_write.isra.20+0x12da/0x1b20
drivers/scsi/sg.c:772
[<ffffffff8470774a>] sg_write+0x78a/0xda0 drivers/scsi/sg.c:675
[<ffffffff81a6fabe>] __vfs_write+0x65e/0x830 fs/read_write.c:510
[<ffffffff81a6fd7c>] __kernel_write+0xec/0x340 fs/read_write.c:532
[<ffffffff81b41e7c>] write_pipe_buf+0x19c/0x260 fs/splice.c:814
[< inline >] splice_from_pipe_feed fs/splice.c:519
[<ffffffff81b426af>] __splice_from_pipe+0x31f/0x750 fs/splice.c:643
[<ffffffff81b45dcc>] splice_from_pipe+0x1dc/0x300 fs/splice.c:678
[<ffffffff81b45f95>] default_file_splice_write+0x45/0x90 fs/splice.c:826
[< inline >] do_splice_from fs/splice.c:868
[<ffffffff81b3e2ba>] direct_splice_actor+0x12a/0x190 fs/splice.c:1035
[<ffffffff81b40706>] splice_direct_to_actor+0x2c6/0x840 fs/splice.c:990
[<ffffffff81b40f4e>] do_splice_direct+0x2ce/0x470 fs/splice.c:1078
[<ffffffff81a7459f>] do_sendfile+0x73f/0x1060 fs/read_write.c:1401
[< inline >] SYSC_sendfile64 fs/read_write.c:1456
[<ffffffff81a7677e>] SyS_sendfile64+0xee/0x1a0 fs/read_write.c:1448
[<ffffffff8814cf85>] entry_SYSCALL_64_fastpath+0x23/0xc6
arch/x86/entry/entry_64.S:209
Memory state around the buggy address:
ffff880062c6ef00: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
ffff880062c6ef80: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
>ffff880062c6f000: ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff
^
ffff880062c6f080: ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff
ffff880062c6f100: ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff
==================================================================
Disabling lock debugging due to kernel taint
BUG: unable to handle kernel paging request at ffff880062c6f000
IP: [<ffffffff835051c2>] __memcpy+0x12/0x20 arch/x86/lib/memcpy_64.S:37
PGD c53d067 [ 494.351750] PUD c540067
PMD 7fdea067 [ 494.351750] PTE 8000000062c6f060

Oops: 0000 [#1] SMP DEBUG_PAGEALLOC KASAN
Modules linked in:
CPU: 1 PID: 8529 Comm: a.out Tainted: G B 4.9.0-rc6+ #55
Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS Bochs 01/01/2011
task: ffff88003c54e3c0 task.stack: ffff880039ca0000
RIP: 0010:[<ffffffff835051c2>] [<ffffffff835051c2>]
__memcpy+0x12/0x20 arch/x86/lib/memcpy_64.S:37
RSP: 0018:ffff880039ca6838 EFLAGS: 00010246
RAX: ffff880031b99000 RBX: 0000000000001000 RCX: 0000000000000006
RDX: 0000000000000000 RSI: ffff880062c6effa RDI: ffff880031b99fd0
RBP: ffff880039ca6858 R08: 0000000000000000 R09: 0000000000000000
R10: 0000000000000200 R11: ffffed00063733ff R12: ffff880031b99000
R13: ffff880062c6e02a R14: ffff880039ca6f70 R15: ffff880039ca6d18
FS: 00007f7a78013700(0000) GS:ffff88003ed00000(0000) knlGS:0000000000000000
CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
CR2: ffff880062c6f000 CR3: 000000006411f000 CR4: 00000000000006e0
Stack:
ffffffff819f0a65 ffff880039ca71a0 0000000000001000 0000000000002000
ffff880039ca6d40 ffffffff83525d10 0000000041b58ab3 ffffffff89576240
ffffffff00000001 0000000000000000 0000000041b58ab3 ffffffff894d07b0
Call Trace:
[<ffffffff83525d10>] copy_from_iter+0xf30/0x15e0 lib/iov_iter.c:559
[<ffffffff83526834>] copy_page_from_iter+0x474/0x860 lib/iov_iter.c:614
[< inline >] bio_copy_from_iter block/bio.c:1025
[<ffffffff833dfc10>] bio_copy_user_iov+0xb50/0xf10 block/bio.c:1226
[< inline >] __blk_rq_map_user_iov block/blk-map.c:56
[<ffffffff83421c15>] blk_rq_map_user_iov+0x295/0x930 block/blk-map.c:130
[<ffffffff834223e9>] blk_rq_map_user+0x139/0x1e0 block/blk-map.c:159
[< inline >] sg_start_req drivers/scsi/sg.c:1757
[<ffffffff8470269a>] sg_common_write.isra.20+0x12da/0x1b20
drivers/scsi/sg.c:772
[<ffffffff8470774a>] sg_write+0x78a/0xda0 drivers/scsi/sg.c:675
[<ffffffff81a6fabe>] __vfs_write+0x65e/0x830 fs/read_write.c:510
[<ffffffff81a6fd7c>] __kernel_write+0xec/0x340 fs/read_write.c:532
[<ffffffff81b41e7c>] write_pipe_buf+0x19c/0x260 fs/splice.c:814
[< inline >] splice_from_pipe_feed fs/splice.c:519
[<ffffffff81b426af>] __splice_from_pipe+0x31f/0x750 fs/splice.c:643
[<ffffffff81b45dcc>] splice_from_pipe+0x1dc/0x300 fs/splice.c:678
[<ffffffff81b45f95>] default_file_splice_write+0x45/0x90 fs/splice.c:826
[< inline >] do_splice_from fs/splice.c:868
[<ffffffff81b3e2ba>] direct_splice_actor+0x12a/0x190 fs/splice.c:1035
[<ffffffff81b40706>] splice_direct_to_actor+0x2c6/0x840 fs/splice.c:990
[<ffffffff81b40f4e>] do_splice_direct+0x2ce/0x470 fs/splice.c:1078
[<ffffffff81a7459f>] do_sendfile+0x73f/0x1060 fs/read_write.c:1401
[< inline >] SYSC_sendfile64 fs/read_write.c:1456
[<ffffffff81a7677e>] SyS_sendfile64+0xee/0x1a0 fs/read_write.c:1448
[<ffffffff8814cf85>] entry_SYSCALL_64_fastpath+0x23/0xc6
arch/x86/entry/entry_64.S:209
Code: 4e fe e9 4d ff ff ff e8 9d c7 4e fe eb 8f e8 96 c7 4e fe e9 66
ff ff ff 90 0f 1f 44 00 00 48 89 f8 48 89 d1 48 c1 e9 03 83 e2 07 <f3>
48 a5 89 d1 f3 a4 c3 66 0f 1f 44 00 00 48 89 f8 48 89 d1 f3
RIP [<ffffffff835051c2>] __memcpy+0x12/0x20 arch/x86/lib/memcpy_64.S:37
RSP <ffff880039ca6838>
CR2: ffff880062c6f000
---[ end trace 13b61a6864c2c008 ]---
==================================================================


There are no alloc/free stacks in the report, so maybe it is not
use-after-free but rather an out-of-bounds.

On commit 16ae16c6e5616c084168740990fc508bda6655d4 (Nov 24).

Dmitry Vyukov

unread,
Dec 2, 2016, 11:51:01 AM12/2/16
to Doug Gilbert, je...@linux.vnet.ibm.com, Martin K. Petersen, linux-scsi, LKML, ax...@kernel.dk, linux...@vger.kernel.org, David Rientjes, syzkaller
+David did some debugging of a similar case. His 0x400 at location
0x2000efdc refers to 0xffff at 0x20012fdc in the provided reproducer:
NONFAILING(*(uint32_t*)0x20012fdc = (uint32_t)0xffff);
Here is his explanation:
=====
That's not nice, it's passing a reply_len mismatch of 0x400 which is going
to cause an sg_write warning (in 988 bytes, out 38 bytes). input_size
will be 74 since count == 80, the mxsize is 0x400 (!) so after subtracting
SZ_SG_HEADER we get the 988 bytes in. This forces SG_DXFER_TO_FROM_DEV
when we really want SG_DXFER_TO_DEV. If you set the value at 0x2000efdc
to 0x24 rather than 0x400 so there's a legitimate reply_len equal to
SG_DXFER_TO_FROM_DEV, I'd assume this passes just fine, assuming your
container can get enough memory.
=====

I've tried to replace 0xffff in the provided reproducer with 0x24 and
it indeed does not crash.

Johannes Thumshirn

unread,
Dec 3, 2016, 5:38:09 AM12/3/16
to Dmitry Vyukov, Doug Gilbert, je...@linux.vnet.ibm.com, Martin K. Petersen, linux-scsi, LKML, ax...@kernel.dk, linux...@vger.kernel.org, David Rientjes, syzkaller, Hannes Reinecke
On Fri, Dec 02, 2016 at 05:50:39PM +0100, Dmitry Vyukov wrote:
> On Fri, Nov 25, 2016 at 8:08 PM, Dmitry Vyukov <dvy...@google.com> wrote:

[...]

>
> +David did some debugging of a similar case. His 0x400 at location
> 0x2000efdc refers to 0xffff at 0x20012fdc in the provided reproducer:
> NONFAILING(*(uint32_t*)0x20012fdc = (uint32_t)0xffff);
> Here is his explanation:
> =====
> That's not nice, it's passing a reply_len mismatch of 0x400 which is going
> to cause an sg_write warning (in 988 bytes, out 38 bytes). input_size
> will be 74 since count == 80, the mxsize is 0x400 (!) so after subtracting
> SZ_SG_HEADER we get the 988 bytes in. This forces SG_DXFER_TO_FROM_DEV
> when we really want SG_DXFER_TO_DEV. If you set the value at 0x2000efdc
> to 0x24 rather than 0x400 so there's a legitimate reply_len equal to
> SG_DXFER_TO_FROM_DEV, I'd assume this passes just fine, assuming your
> container can get enough memory.
> =====
>
> I've tried to replace 0xffff in the provided reproducer with 0x24 and
> it indeed does not crash.

This is somewhat expected, AFAICS the value at 0x20012fdc ends up as
hp->dxfer_len in the SG driver. I did a lot of debugging (actually I spend the
whole last work week soley on this) but I can't find where it does the actual
use-after-free, so if you have anything to share I'd be glad.

Byte,
Johannes

--
Johannes Thumshirn Storage
jthum...@suse.de +49 911 74053 689
SUSE LINUX GmbH, Maxfeldstr. 5, 90409 Nürnberg
GF: Felix Imendörffer, Jane Smithard, Graham Norton
HRB 21284 (AG Nürnberg)
Key fingerprint = EC38 9CAB C2C4 F25D 8600 D0D0 0393 969D 2D76 0850

Dmitry Vyukov

unread,
Dec 3, 2016, 10:23:01 AM12/3/16
to Johannes Thumshirn, Doug Gilbert, je...@linux.vnet.ibm.com, Martin K. Petersen, linux-scsi, LKML, ax...@kernel.dk, linux...@vger.kernel.org, David Rientjes, syzkaller, Hannes Reinecke
On Sat, Dec 3, 2016 at 11:38 AM, Johannes Thumshirn <jthum...@suse.de> wrote:
> On Fri, Dec 02, 2016 at 05:50:39PM +0100, Dmitry Vyukov wrote:
>> On Fri, Nov 25, 2016 at 8:08 PM, Dmitry Vyukov <dvy...@google.com> wrote:
>
> [...]
>
>>
>> +David did some debugging of a similar case. His 0x400 at location
>> 0x2000efdc refers to 0xffff at 0x20012fdc in the provided reproducer:
>> NONFAILING(*(uint32_t*)0x20012fdc = (uint32_t)0xffff);
>> Here is his explanation:
>> =====
>> That's not nice, it's passing a reply_len mismatch of 0x400 which is going
>> to cause an sg_write warning (in 988 bytes, out 38 bytes). input_size
>> will be 74 since count == 80, the mxsize is 0x400 (!) so after subtracting
>> SZ_SG_HEADER we get the 988 bytes in. This forces SG_DXFER_TO_FROM_DEV
>> when we really want SG_DXFER_TO_DEV. If you set the value at 0x2000efdc
>> to 0x24 rather than 0x400 so there's a legitimate reply_len equal to
>> SG_DXFER_TO_FROM_DEV, I'd assume this passes just fine, assuming your
>> container can get enough memory.
>> =====
>>
>> I've tried to replace 0xffff in the provided reproducer with 0x24 and
>> it indeed does not crash.
>
> This is somewhat expected, AFAICS the value at 0x20012fdc ends up as
> hp->dxfer_len in the SG driver. I did a lot of debugging (actually I spend the
> whole last work week soley on this) but I can't find where it does the actual
> use-after-free, so if you have anything to share I'd be glad.

Hi Johannes,

Thanks for looking into this!

As I noted I don't think this is use-after-free, more likely it is an
out-of-bounds access against non-slab range.

Report says that we are copying 0x1000 bytes starting at 0xffff880062c6e02a.
The first bad address is 0xffff880062c6f000, this address was freed
previously and that's why KASAN reports UAF.
But this is already next page, and KASAN does not insert redzones
around pages (only around slab allocations).
So most likely the code should have not touch 0xffff880062c6f000 as it
is not his memory.
Also I noticed that the report happens after few minutes of repeatedly
running this program, so I would expect that this is some kind of race
-- either between kernel threads, or maybe between user space threads
and kernel. Or maybe it's just that the next page is not always marked
as free, so we just don't detect the bad access.

Does it all make any sense to you?
Can you think of any additional sanity checks that will ensure that
this code copies only memory it owns?

Johannes Thumshirn

unread,
Dec 3, 2016, 1:20:02 PM12/3/16
to Dmitry Vyukov, Doug Gilbert, je...@linux.vnet.ibm.com, Martin K. Petersen, linux-scsi, LKML, ax...@kernel.dk, linux...@vger.kernel.org, David Rientjes, syzkaller, Hannes Reinecke
On Sat, Dec 03, 2016 at 04:22:39PM +0100, Dmitry Vyukov wrote:
> On Sat, Dec 3, 2016 at 11:38 AM, Johannes Thumshirn <jthum...@suse.de> wrote:
> > On Fri, Dec 02, 2016 at 05:50:39PM +0100, Dmitry Vyukov wrote:
> >> On Fri, Nov 25, 2016 at 8:08 PM, Dmitry Vyukov <dvy...@google.com> wrote:

[...]

Hi Dmitry,

>
> Thanks for looking into this!
>
> As I noted I don't think this is use-after-free, more likely it is an
> out-of-bounds access against non-slab range.
>
> Report says that we are copying 0x1000 bytes starting at 0xffff880062c6e02a.
> The first bad address is 0xffff880062c6f000, this address was freed
> previously and that's why KASAN reports UAF.

We're copying 65499 bytes (65535 - sizeof(sg_header)) and we've got 2 order 3
page allocations to do this. It fails somewhere in there. I have seen fails at
0x2000, 0xe000 and all (0x1000 aligned) offsets inbetween.

> But this is already next page, and KASAN does not insert redzones
> around pages (only around slab allocations).
> So most likely the code should have not touch 0xffff880062c6f000 as it
> is not his memory.
> Also I noticed that the report happens after few minutes of repeatedly
> running this program, so I would expect that this is some kind of race
> -- either between kernel threads, or maybe between user space threads
> and kernel.

I somehow think it's a race as well, especially as I have to run the
reproducer in an endless loop and break out of it once I have the 1st
stacktrace in dmesg. This takes between some minutes up to one hour on my
setup.

But the race against a userspace thread... Could it be that the reproducer has
already exited it's threads while the copy_from_iter() is still running?
Normally I'd say no, as user-space shouldn't run while the kernel is doing
things in it's address space, but this is highly suspicious.

> Or maybe it's just that the next page is not always marked
> as free, so we just don't detect the bad access.

Could be, but I lack the memory management knowledge to say more than a 'could
be'.

>
> Does it all make any sense to you?
> Can you think of any additional sanity checks that will ensure that
> this code copies only memory it owns?

Given that we pass the 0xffff as dxfer_len it thinks it owns all memory, so
this is OK, kinda. All that could be would be that user-space has already
exited and thus it's memory is already freed.

Dmitry Vyukov

unread,
Dec 5, 2016, 9:32:05 AM12/5/16
to syzkaller, Doug Gilbert, je...@linux.vnet.ibm.com, Martin K. Petersen, linux-scsi, LKML, ax...@kernel.dk, linux...@vger.kernel.org, David Rientjes, Hannes Reinecke, Johannes Thumshirn
The crash happens in the context of sendfile syscall, we the address
space should be alive. But the crash happens on address
0xffff880062c6f000 which is not a user-space address, right? This is
something that kernel has allocated previously.
Do you have CONFIG_DEBUG_PAGEALLOC enabled? I have it enabled. Maybe
it increases changes of triggering the bug.

Do we know where that memory that we are copying was allocated? Is it
slab or page alloc? We could extend KASAN output with more details.
E.g. print allocation stack for the _first_ byte of memcpy, or
memorize page alloc/free stacks in page struct using lib/stackdepot.c.

Johannes Thumshirn

unread,
Dec 5, 2016, 10:17:59 AM12/5/16
to Dmitry Vyukov, syzkaller, Doug Gilbert, je...@linux.vnet.ibm.com, Martin K. Petersen, linux-scsi, LKML, ax...@kernel.dk, linux...@vger.kernel.org, David Rientjes, Hannes Reinecke, mho...@suse.cz
It comes in this way:

drivers/scsi/sg.c:
sg_write(struct file *filp, const char __user *buf, size_t count, loff_t * ppos)
580 struct sg_header old_hdr;
581 sg_io_hdr_t *hp;
582 unsigned char cmnd[SG_MAX_CDB_SIZE];
[...]
598 if (__copy_from_user(&old_hdr, buf, SZ_SG_HEADER))
599 return -EFAULT;
[...]
612 buf += SZ_SG_HEADER;
613 __get_user(opcode, buf);
[...]
614 if (sfp->next_cmd_len > 0) {
615 cmd_size = sfp->next_cmd_len;
616 sfp->next_cmd_len = 0; /* reset so only this write() effected */
617 } else {
618 cmd_size = COMMAND_SIZE(opcode); /* based on SCSI command group */
619 if ((opcode >= 0xc0) && old_hdr.twelve_byte)
620 cmd_size = 12;
621 }
[...]
625 input_size = count - cmd_size;
626 mxsize = (input_size > old_hdr.reply_len) ? input_size : old_hdr.reply_len;
627 mxsize -= SZ_SG_HEADER;
633 hp = &srp->header;
[...]
646 hp->dxferp = (char __user *)buf + cmd_size;
[...]
654 if (__copy_from_user(cmnd, buf, cmd_size))
655 return -EFAULT;
[...]
675 k = sg_common_write(sfp, srp, cmnd, sfp->timeout, blocking);
[...]
752 static int
753 sg_common_write(Sg_fd * sfp, Sg_request * srp,
754 unsigned char *cmnd, int timeout, int blocking)
755 {
[...]
772 k = sg_start_req(srp, cmnd);
[...]
1653 static int
1654 sg_start_req(Sg_request *srp, unsigned char *cmd)
1655 {
[...]
1757 res = blk_rq_map_user(q, rq, md, hp->dxferp,
1758 hp->dxfer_len, GFP_ATOMIC);
[...]

block/blk-map.c:
148 int blk_rq_map_user(struct request_queue *q, struct request *rq,
149 struct rq_map_data *map_data, void __user *ubuf,
150 unsigned long len, gfp_t gfp_mask)
151 {
[...]
154 int ret = import_single_range(rq_data_dir(rq), ubuf, len, &iov, &i);

lib/iov_iter.c:
1209 int import_single_range(int rw, void __user *buf, size_t len,
1210 struct iovec *iov, struct iov_iter *i)
1211 {
1217 iov->iov_base = buf;

block/blk-map.c:
148 int blk_rq_map_user(struct request_queue *q, struct request *rq,
149 struct rq_map_data *map_data, void __user *ubuf,
150 unsigned long len, gfp_t gfp_mask)
151 {
[...]
159 return blk_rq_map_user_iov(q, rq, map_data, &i, gfp_mask);

and so on....

So the memory for hp->dxferp comes from:
633 hp = &srp->header;
a.k.a.
sg_add_sfp()'s:
2151 sfp = kzalloc(sizeof(*sfp), GFP_ATOMIC | __GFP_NOWARN);
And then taken out of the pool by sg_add_request() methinks. So yes it is a
kernel pointer (and the address proves it as well).

From my debug instrumentation I see that the dxferp ends up in the
iovec_iter's kvec->iov_base and the faulting address is always dxferp + n *
4k with n in [1, 16] (and we're copying 16 4k pages from the iovec into the
bio).

HTH,

Al Viro

unread,
Dec 5, 2016, 2:03:45 PM12/5/16
to Johannes Thumshirn, Dmitry Vyukov, syzkaller, Doug Gilbert, je...@linux.vnet.ibm.com, Martin K. Petersen, linux-scsi, LKML, ax...@kernel.dk, linux...@vger.kernel.org, David Rientjes, Hannes Reinecke, mho...@suse.cz
On Mon, Dec 05, 2016 at 04:17:53PM +0100, Johannes Thumshirn wrote:
> 633 hp = &srp->header;
> [...]
> 646 hp->dxferp = (char __user *)buf + cmd_size;

> So the memory for hp->dxferp comes from:
> 633 hp = &srp->header;

????

> >From my debug instrumentation I see that the dxferp ends up in the
> iovec_iter's kvec->iov_base and the faulting address is always dxferp + n *
> 4k with n in [1, 16] (and we're copying 16 4k pages from the iovec into the
> bio).

_Address_ of hp->dxferp comes from that assignment; the value is 'buf'
argument of sg_write() + small offset. In this case, it should point
inside a pipe buffer, which is, indeed, at a kernel address. Who'd
allocated srp is irrelevant.

And if you end up dereferencing more than one page worth there, you do have
a problem - pipe buffers are not going to be that large. Could you slap
WARN_ON((size_t)input_size > count);
right after the calculation of input_size in sg_write() and see if it triggers
on your reproducer?

Johannes Thumshirn

unread,
Dec 6, 2016, 4:33:01 AM12/6/16
to Al Viro, Dmitry Vyukov, syzkaller, Doug Gilbert, je...@linux.vnet.ibm.com, Martin K. Petersen, linux-scsi, LKML, ax...@kernel.dk, linux...@vger.kernel.org, David Rientjes, Hannes Reinecke, mho...@suse.cz
On Mon, Dec 05, 2016 at 07:03:39PM +0000, Al Viro wrote:
> On Mon, Dec 05, 2016 at 04:17:53PM +0100, Johannes Thumshirn wrote:
> > 633 hp = &srp->header;
> > [...]
> > 646 hp->dxferp = (char __user *)buf + cmd_size;
>
> > So the memory for hp->dxferp comes from:
> > 633 hp = &srp->header;
>
> ????
>
> > >From my debug instrumentation I see that the dxferp ends up in the
> > iovec_iter's kvec->iov_base and the faulting address is always dxferp + n *
> > 4k with n in [1, 16] (and we're copying 16 4k pages from the iovec into the
> > bio).
>
> _Address_ of hp->dxferp comes from that assignment; the value is 'buf'
> argument of sg_write() + small offset. In this case, it should point
> inside a pipe buffer, which is, indeed, at a kernel address. Who'd
> allocated srp is irrelevant.

Yes I realized that as well when I had enough distance between me and the
code...

>
> And if you end up dereferencing more than one page worth there, you do have
> a problem - pipe buffers are not going to be that large. Could you slap
> WARN_ON((size_t)input_size > count);
> right after the calculation of input_size in sg_write() and see if it triggers
> on your reproducer?

I did and it didn't trigger. What triggers is (as expected) a
WARN_ON((size_t)mxsize > count);
We have count at 80 and mxsize (which ends in hp->dxfer_len) at 65499. But the
65499 bytes are the len of the data we're suppost to be copying in via the
iov. I'm still rather confused what's happening here, sorry.

Dmitry Vyukov

unread,
Dec 6, 2016, 4:44:19 AM12/6/16
to Johannes Thumshirn, Al Viro, syzkaller, Doug Gilbert, je...@linux.vnet.ibm.com, Martin K. Petersen, linux-scsi, LKML, ax...@kernel.dk, linux...@vger.kernel.org, David Rientjes, Hannes Reinecke, Michal Hocko
I think the critical piece here is some kind of race or timing
condition. Note that the test program executes all of
memfd_create/write/open/sendfile twice. Second time the calls race
with each other, but they also can race with the first execution of
the calls.

Johannes Thumshirn

unread,
Dec 6, 2016, 10:38:45 AM12/6/16
to Dmitry Vyukov, Al Viro, syzkaller, Doug Gilbert, je...@linux.vnet.ibm.com, Martin K. Petersen, linux-scsi, LKML, ax...@kernel.dk, linux...@vger.kernel.org, David Rientjes, Hannes Reinecke, Michal Hocko
FWIW I've just run the reproducer once instead of looping it to check how it
would normally behave and it bailes out at:

604 if (count < (SZ_SG_HEADER + 6))
605 return -EIO; /* The minimum scsi command length is 6 bytes. */

That means, weren't going down the copy_form_iter() road at all. Usually, but
sometimes we do. And then we try to copy 16 pages from the pipe buffer (is
this correct?).
The reproducer does: sendfile("/dev/sg0", memfd, offset_in_memfd, 0x10000);

I don't see how we get there? Could it be random data from the mmap() we point
the memfd to?

This bug is confusing to be honest.

Dmitry Vyukov

unread,
Dec 6, 2016, 10:46:29 AM12/6/16
to syzkaller, Al Viro, Doug Gilbert, je...@linux.vnet.ibm.com, Martin K. Petersen, linux-scsi, LKML, ax...@kernel.dk, linux...@vger.kernel.org, David Rientjes, Hannes Reinecke, Michal Hocko
Where does this count come from? What address in the user program? Is
it 0x20012fxx?
One possibility for non-deterministically changing inputs is that this part:

case 2:
NONFAILING(*(uint32_t*)0x20012fd8 = (uint32_t)0x28);
NONFAILING(*(uint32_t*)0x20012fdc = (uint32_t)0xffff);
NONFAILING(*(uint64_t*)0x20012fe0 = (uint64_t)0x0);
NONFAILING(*(uint64_t*)0x20012fe8 = (uint64_t)0xffffffffffff993f);
NONFAILING(*(uint64_t*)0x20012ff0 = (uint64_t)0xa8b);
NONFAILING(*(uint32_t*)0x20012ff8 = (uint32_t)0xff);
r[9] = syscall(__NR_write, r[2], 0x20012fd8ul, 0x28ul, 0, 0,
0, 0, 0, 0);

runs concurrently with this part:

case 0:
r[0] =
syscall(__NR_mmap, 0x20000000ul, 0x13000ul, 0x3ul,
0x32ul, 0xfffffffffffffffful, 0x0ul, 0, 0, 0);

So all of the input data to the write, or a subset of the input data,
can be zeros.
Reply all
Reply to author
Forward
0 new messages