general protection fault in oom_unkillable_task


syzbot

Jun 14, 2019, 9:08:06 PM
to ak...@linux-foundation.org, ebie...@xmission.com, gu...@fb.com, han...@cmpxchg.org, jgl...@redhat.com, linux-...@vger.kernel.org, linu...@kvack.org, mho...@kernel.org, penguin...@i-love.sakura.ne.jp, shak...@google.com, syzkall...@googlegroups.com, yuzho...@didichuxing.com
Hello,

syzbot found the following crash on:

HEAD commit: 3f310e51 Add linux-next specific files for 20190607
git tree: linux-next
console output: https://syzkaller.appspot.com/x/log.txt?x=15ab8771a00000
kernel config: https://syzkaller.appspot.com/x/.config?x=5d176e1849bbc45
dashboard link: https://syzkaller.appspot.com/bug?extid=d0fc9d3c166bc5e4a94b
compiler: gcc (GCC) 9.0.0 20181231 (experimental)

Unfortunately, I don't have any reproducer for this crash yet.

IMPORTANT: if you fix the bug, please add the following tag to the commit:
Reported-by: syzbot+d0fc9d...@syzkaller.appspotmail.com

kasan: CONFIG_KASAN_INLINE enabled
kasan: GPF could be caused by NULL-ptr deref or user memory access
general protection fault: 0000 [#1] PREEMPT SMP KASAN
CPU: 0 PID: 28426 Comm: syz-executor.5 Not tainted 5.2.0-rc3-next-20190607
#11
Hardware name: Google Google Compute Engine/Google Compute Engine, BIOS
Google 01/01/2011
RIP: 0010:__read_once_size include/linux/compiler.h:194 [inline]
RIP: 0010:has_intersects_mems_allowed mm/oom_kill.c:84 [inline]
RIP: 0010:oom_unkillable_task mm/oom_kill.c:168 [inline]
RIP: 0010:oom_unkillable_task+0x180/0x400 mm/oom_kill.c:155
Code: c1 ea 03 80 3c 02 00 0f 85 80 02 00 00 4c 8b a3 10 07 00 00 48 b8 00
00 00 00 00 fc ff df 4d 8d 74 24 10 4c 89 f2 48 c1 ea 03 <80> 3c 02 00 0f
85 67 02 00 00 49 8b 44 24 10 4c 8d a0 68 fa ff ff
RSP: 0018:ffff888000127490 EFLAGS: 00010a03
RAX: dffffc0000000000 RBX: ffff8880a4cd5438 RCX: ffffffff818dae9c
RDX: 100000000c3cc602 RSI: ffffffff818dac8d RDI: 0000000000000001
RBP: ffff8880001274d0 R08: ffff888000086180 R09: ffffed1015d26be0
R10: ffffed1015d26bdf R11: ffff8880ae935efb R12: 8000000061e63007
R13: 0000000000000000 R14: 8000000061e63017 R15: 1ffff11000024ea6
FS: 00005555561f5940(0000) GS:ffff8880ae800000(0000) knlGS:0000000000000000
CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
CR2: 0000000000607304 CR3: 000000009237e000 CR4: 00000000001426f0
DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000600
Call Trace:
oom_evaluate_task+0x49/0x520 mm/oom_kill.c:321
mem_cgroup_scan_tasks+0xcc/0x180 mm/memcontrol.c:1169
select_bad_process mm/oom_kill.c:374 [inline]
out_of_memory mm/oom_kill.c:1088 [inline]
out_of_memory+0x6b2/0x1280 mm/oom_kill.c:1035
mem_cgroup_out_of_memory+0x1ca/0x230 mm/memcontrol.c:1573
mem_cgroup_oom mm/memcontrol.c:1905 [inline]
try_charge+0xfbe/0x1480 mm/memcontrol.c:2468
mem_cgroup_try_charge+0x24d/0x5e0 mm/memcontrol.c:6073
mem_cgroup_try_charge_delay+0x1f/0xa0 mm/memcontrol.c:6088
do_huge_pmd_wp_page_fallback+0x24f/0x1680 mm/huge_memory.c:1201
do_huge_pmd_wp_page+0x7fc/0x2160 mm/huge_memory.c:1359
wp_huge_pmd mm/memory.c:3793 [inline]
__handle_mm_fault+0x164c/0x3eb0 mm/memory.c:4006
handle_mm_fault+0x3b7/0xa90 mm/memory.c:4053
do_user_addr_fault arch/x86/mm/fault.c:1455 [inline]
__do_page_fault+0x5ef/0xda0 arch/x86/mm/fault.c:1521
do_page_fault+0x71/0x57d arch/x86/mm/fault.c:1552
page_fault+0x1e/0x30 arch/x86/entry/entry_64.S:1156
RIP: 0033:0x400590
Code: 06 e9 49 01 00 00 48 8b 44 24 10 48 0b 44 24 28 75 1f 48 8b 14 24 48
8b 7c 24 20 be 04 00 00 00 e8 f5 56 00 00 48 8b 74 24 08 <89> 06 e9 1e 01
00 00 48 8b 44 24 08 48 8b 14 24 be 04 00 00 00 8b
RSP: 002b:00007fff7bc49780 EFLAGS: 00010206
RAX: 0000000000000001 RBX: 0000000000760000 RCX: 0000000000000000
RDX: 0000000000000000 RSI: 000000002000cffc RDI: 0000000000000001
RBP: fffffffffffffffe R08: 0000000000000000 R09: 0000000000000000
R10: 0000000000000075 R11: 0000000000000246 R12: 0000000000760008
R13: 00000000004c55f2 R14: 0000000000000000 R15: 00007fff7bc499b0
Modules linked in:
---[ end trace a65689219582ffff ]---
RIP: 0010:__read_once_size include/linux/compiler.h:194 [inline]
RIP: 0010:has_intersects_mems_allowed mm/oom_kill.c:84 [inline]
RIP: 0010:oom_unkillable_task mm/oom_kill.c:168 [inline]
RIP: 0010:oom_unkillable_task+0x180/0x400 mm/oom_kill.c:155
Code: c1 ea 03 80 3c 02 00 0f 85 80 02 00 00 4c 8b a3 10 07 00 00 48 b8 00
00 00 00 00 fc ff df 4d 8d 74 24 10 4c 89 f2 48 c1 ea 03 <80> 3c 02 00 0f
85 67 02 00 00 49 8b 44 24 10 4c 8d a0 68 fa ff ff
RSP: 0018:ffff888000127490 EFLAGS: 00010a03
RAX: dffffc0000000000 RBX: ffff8880a4cd5438 RCX: ffffffff818dae9c
RDX: 100000000c3cc602 RSI: ffffffff818dac8d RDI: 0000000000000001
RBP: ffff8880001274d0 R08: ffff888000086180 R09: ffffed1015d26be0
R10: ffffed1015d26bdf R11: ffff8880ae935efb R12: 8000000061e63007
R13: 0000000000000000 R14: 8000000061e63017 R15: 1ffff11000024ea6
FS: 00005555561f5940(0000) GS:ffff8880ae800000(0000) knlGS:0000000000000000
CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
CR2: 0000001b2f823000 CR3: 000000009237e000 CR4: 00000000001426f0
DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000600


---
This bug is generated by a bot. It may contain errors.
See https://goo.gl/tpsmEJ for more information about syzbot.
syzbot engineers can be reached at syzk...@googlegroups.com.

syzbot will keep track of this bug report. See:
https://goo.gl/tpsmEJ#status for how to communicate with syzbot.

Tetsuo Handa

Jun 14, 2019, 9:10:48 PM
to syzbot, ak...@linux-foundation.org, mho...@kernel.org, ebie...@xmission.com, gu...@fb.com, han...@cmpxchg.org, jgl...@redhat.com, linux-...@vger.kernel.org, linu...@kvack.org, shak...@google.com, syzkall...@googlegroups.com, yuzho...@didichuxing.com
I'm not sure this patch is correct/safe. Can you try a memcg OOM torture
test (with memcg group OOM killing enabled) with this patch applied?
----------------------------------------
From a436624c73d106fad9b880a6cef5abd83b2329a2 Mon Sep 17 00:00:00 2001
From: Tetsuo Handa <penguin...@I-love.SAKURA.ne.jp>
Date: Sat, 15 Jun 2019 10:06:00 +0900
Subject: [PATCH] memcg: Protect whole mem_cgroup_scan_tasks() using RCU.

syzbot is reporting a GPF at for_each_thread() from a memcg OOM event [1].
While select_bad_process() in a global OOM event traverses all threads
under RCU, select_bad_process() in a memcg OOM event traverses threads
without RCU, and I guess that this can result in traversing a bogus pointer.

Suppose a process containing three threads T1, T2 and T3 is in a memcg.
T3 invokes the memcg OOM killer and starts traversing from T1. T3
elevates the refcount on T1, but is preempted before the
oom_unkillable_task(T1) check. Then T1 reaches do_exit() and does
list_del_rcu(&T1->thread_node):

do_exit() {
  cgroup_exit() {
    css_set_move_task(tsk, cset, NULL, false);
  }
  exit_notify() {
    release_task() {
      __exit_signal() {
        __unhash_process() {
          list_del_rcu(&p->thread_node);
        }
      }
    }
  }
}

Then T2 also reaches do_exit() and does list_del_rcu(&T2->thread_node).
Since T3 kept T1's refcount elevated, T1 cannot be freed. But since T3
did not elevate T2's refcount, T2 can complete do_exit() and is freed as
soon as an RCU grace period elapses. At this point, because T1 was
removed from the thread group before T2 was, T1's next-thread pointer
still refers to the already-freed T2. If the memory used for T2 is
reallocated before T3 resumes execution, accessing T1's next thread will
not be reported as a use-after-free, but the memory referenced as T1's
next thread contains bogus values.

Thus, I think the rule is: when traversing threads inside a section
between css_task_iter_start() and css_task_iter_end(), no thread may be
expanded via e.g. for_each_thread() unless the whole section is
protected by RCU.

[1] https://syzkaller.appspot.com/bug?id=4559bc383e7c73a35bc6c8336557635459fb7a62

Signed-off-by: Tetsuo Handa <penguin...@I-love.SAKURA.ne.jp>
Reported-by: syzbot <syzbot+d0fc9d...@syzkaller.appspotmail.com>
---
mm/memcontrol.c | 2 ++
1 file changed, 2 insertions(+)

diff --git a/mm/memcontrol.c b/mm/memcontrol.c
index ba9138a..8e01f01 100644
--- a/mm/memcontrol.c
+++ b/mm/memcontrol.c
@@ -1159,6 +1159,7 @@ int mem_cgroup_scan_tasks(struct mem_cgroup *memcg,

 	BUG_ON(memcg == root_mem_cgroup);
 
+	rcu_read_lock();
 	for_each_mem_cgroup_tree(iter, memcg) {
 		struct css_task_iter it;
 		struct task_struct *task;
@@ -1172,6 +1173,7 @@ int mem_cgroup_scan_tasks(struct mem_cgroup *memcg,
 			break;
 		}
 	}
+	rcu_read_unlock();
 	return ret;
 }

--
1.8.3.1

Shakeel Butt

Jun 14, 2019, 11:15:43 PM
to syzbot, Andrew Morton, Eric W. Biederman, Roman Gushchin, Johannes Weiner, Jérôme Glisse, LKML, Linux MM, Michal Hocko, Tetsuo Handa, syzkaller-bugs, yuzho...@didichuxing.com
On Fri, Jun 14, 2019 at 6:08 PM syzbot
<syzbot+d0fc9d...@syzkaller.appspotmail.com> wrote:
>
> Hello,
>
> syzbot found the following crash on:
>
> HEAD commit: 3f310e51 Add linux-next specific files for 20190607
> git tree: linux-next
> console output: https://syzkaller.appspot.com/x/log.txt?x=15ab8771a00000
> kernel config: https://syzkaller.appspot.com/x/.config?x=5d176e1849bbc45
> dashboard link: https://syzkaller.appspot.com/bug?extid=d0fc9d3c166bc5e4a94b
> compiler: gcc (GCC) 9.0.0 20181231 (experimental)
>
> Unfortunately, I don't have any reproducer for this crash yet.
>
> IMPORTANT: if you fix the bug, please add the following tag to the commit:
> Reported-by: syzbot+d0fc9d...@syzkaller.appspotmail.com
>
> kasan: CONFIG_KASAN_INLINE enabled
> kasan: GPF could be caused by NULL-ptr deref or user memory access
> general protection fault: 0000 [#1] PREEMPT SMP KASAN
> CPU: 0 PID: 28426 Comm: syz-executor.5 Not tainted 5.2.0-rc3-next-20190607
> #11
> Hardware name: Google Google Compute Engine/Google Compute Engine, BIOS
> Google 01/01/2011
> RIP: 0010:__read_once_size include/linux/compiler.h:194 [inline]
> RIP: 0010:has_intersects_mems_allowed mm/oom_kill.c:84 [inline]

It seems like oom_unkillable_task() is broken for memcg OOMs. It
should not be calling has_intersects_mems_allowed() for memcg OOMs.
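
To make the failure mode concrete, the pre-fix logic looks roughly like
this (a sketch reconstructed from the trace above and the diffs later in
this thread, not the verbatim linux-next source):

static bool oom_unkillable_task(struct task_struct *p,
				struct mem_cgroup *memcg,
				const nodemask_t *nodemask)
{
	if (is_global_init(p))
		return true;
	if (p->flags & PF_KTHREAD)
		return true;

	/* When mem_cgroup_out_of_memory() and p is not member of the group */
	if (memcg && !task_in_mem_cgroup(p, memcg))
		return true;

	/*
	 * oom_evaluate_task() passes memcg == NULL even for a memcg OOM,
	 * so every caller falls through to the NUMA check below; the
	 * for_each_thread() walk inside it is where the GPF fires.
	 */
	if (!has_intersects_mems_allowed(p, nodemask))
		return true;

	return false;
}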

Michal Hocko

Jun 15, 2019, 9:49:59 AM
to Shakeel Butt, syzbot, Andrew Morton, Eric W. Biederman, Roman Gushchin, Johannes Weiner, Jérôme Glisse, LKML, Linux MM, Tetsuo Handa, syzkaller-bugs, yuzho...@didichuxing.com
You are right. It doesn't really make much sense to check for the NUMA
policy/cpusets when the memcg OOM is NUMA-agnostic. Now that I am
looking at the code, I am really wondering why we even call
oom_unkillable_task() from oom_badness(); proc_oom_score shouldn't care
about NUMA either.

In other words, the following should fix this unless I am missing
something (task_in_mem_cgroup() seems to be a relic from before group
OOM handling). But please note that I am still not fully operational
and am writing this from bed.

diff --git a/mm/oom_kill.c b/mm/oom_kill.c
index 5a58778c91d4..43eb479a5dc7 100644
--- a/mm/oom_kill.c
+++ b/mm/oom_kill.c
@@ -161,8 +161,8 @@ static bool oom_unkillable_task(struct task_struct *p,
 		return true;
 
 	/* When mem_cgroup_out_of_memory() and p is not member of the group */
-	if (memcg && !task_in_mem_cgroup(p, memcg))
-		return true;
+	if (memcg)
+		return false;
 
 	/* p may not have freeable memory in nodemask */
 	if (!has_intersects_mems_allowed(p, nodemask))
@@ -318,7 +318,7 @@ static int oom_evaluate_task(struct task_struct *task, void *arg)
 	struct oom_control *oc = arg;
 	unsigned long points;
 
-	if (oom_unkillable_task(task, NULL, oc->nodemask))
+	if (oom_unkillable_task(task, oc->memcg, oc->nodemask))
 		goto next;

--
Michal Hocko
SUSE Labs

Tetsuo Handa

Jun 15, 2019, 12:00:15 PM
to syzbot, ak...@linux-foundation.org, mho...@kernel.org, ebie...@xmission.com, gu...@fb.com, han...@cmpxchg.org, jgl...@redhat.com, linux-...@vger.kernel.org, linu...@kvack.org, shak...@google.com, syzkall...@googlegroups.com, yuzho...@didichuxing.com
On 2019/06/15 10:10, Tetsuo Handa wrote:
> I'm not sure this patch is correct/safe. Can you try a memcg OOM torture
> test (with memcg group OOM killing enabled) with this patch applied?

Well, I guess this patch was wrong. The ordering of removing threads
does not matter as long as we start traversing via signal_struct.
The reason why it crashed at for_each_thread() remains unknown...

Shakeel Butt

Jun 15, 2019, 12:11:50 PM
to Michal Hocko, syzbot, Andrew Morton, Eric W. Biederman, Roman Gushchin, Johannes Weiner, Jérôme Glisse, LKML, Linux MM, Tetsuo Handa, syzkaller-bugs, yuzho...@didichuxing.com
Yes, we need something like this, but not exactly this.

> diff --git a/mm/oom_kill.c b/mm/oom_kill.c
> index 5a58778c91d4..43eb479a5dc7 100644
> --- a/mm/oom_kill.c
> +++ b/mm/oom_kill.c
> @@ -161,8 +161,8 @@ static bool oom_unkillable_task(struct task_struct *p,
>  		return true;
>
>  	/* When mem_cgroup_out_of_memory() and p is not member of the group */
> -	if (memcg && !task_in_mem_cgroup(p, memcg))
> -		return true;
> +	if (memcg)
> +		return false;

This will break the dump_tasks() usage of oom_unkillable_task(). We
can change dump_tasks() to traverse processes like
mem_cgroup_scan_tasks() for memcg OOMs.
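
Roughly along these lines (a hypothetical sketch: dump_task() as a
per-process printing helper is assumed here, not an existing function):

static int dump_task_fn(struct task_struct *p, void *arg)
{
	dump_task(p, arg);	/* print one oom-report line for p */
	return 0;		/* 0 == keep iterating */
}

static void dump_tasks(struct oom_control *oc)
{
	if (is_memcg_oom(oc)) {
		/* targeted scan: only tasks in the oom memcg */
		mem_cgroup_scan_tasks(oc->memcg, dump_task_fn, oc);
	} else {
		struct task_struct *p;

		rcu_read_lock();
		for_each_process(p)
			dump_task(p, oc);
		rcu_read_unlock();
	}
}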

Tetsuo Handa

Jun 15, 2019, 12:49:05 PM
to Shakeel Butt, Michal Hocko, syzbot, Andrew Morton, Eric W. Biederman, Roman Gushchin, Johannes Weiner, Jérôme Glisse, LKML, Linux MM, syzkaller-bugs, yuzho...@didichuxing.com
On 2019/06/16 1:11, Shakeel Butt wrote:
> On Sat, Jun 15, 2019 at 6:50 AM Michal Hocko <mho...@kernel.org> wrote:
>> diff --git a/mm/oom_kill.c b/mm/oom_kill.c
>> index 5a58778c91d4..43eb479a5dc7 100644
>> --- a/mm/oom_kill.c
>> +++ b/mm/oom_kill.c
>> @@ -161,8 +161,8 @@ static bool oom_unkillable_task(struct task_struct *p,
>>  		return true;
>>
>>  	/* When mem_cgroup_out_of_memory() and p is not member of the group */
>> -	if (memcg && !task_in_mem_cgroup(p, memcg))
>> -		return true;
>> +	if (memcg)
>> +		return false;
>
> This will break the dump_tasks() usage of oom_unkillable_task(). We
> can change dump_tasks() to traverse processes like
> mem_cgroup_scan_tasks() for memcg OOMs.

While dump_tasks() traverses only thread group leaders,
mem_cgroup_scan_tasks() traverses every thread. To avoid printk()ing all
threads in a thread group, would moving that check to

	if (memcg && !task_in_mem_cgroup(p, memcg))
		continue;

in dump_tasks() be better?
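
Roughly like this, assuming the current shape of dump_tasks() (an
untested sketch with the per-task printing elided into a comment):

static void dump_tasks(struct mem_cgroup *memcg, const nodemask_t *nodemask)
{
	struct task_struct *p;

	rcu_read_lock();
	for_each_process(p) {
		/* membership check moved here from oom_unkillable_task() */
		if (memcg && !task_in_mem_cgroup(p, memcg))
			continue;
		if (oom_unkillable_task(p, NULL, nodemask))
			continue;
		/* ... print one line for this thread group leader ... */
	}
	rcu_read_unlock();
}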

Shakeel Butt

Jun 15, 2019, 2:50:44 PM
to Tetsuo Handa, Michal Hocko, syzbot, Andrew Morton, Eric W. Biederman, Roman Gushchin, Johannes Weiner, Jérôme Glisse, LKML, Linux MM, syzkaller-bugs, yuzho...@didichuxing.com
On Sat, Jun 15, 2019 at 9:49 AM Tetsuo Handa
<penguin...@i-love.sakura.ne.jp> wrote:
>
> On 2019/06/16 1:11, Shakeel Butt wrote:
> > On Sat, Jun 15, 2019 at 6:50 AM Michal Hocko <mho...@kernel.org> wrote:
> >> diff --git a/mm/oom_kill.c b/mm/oom_kill.c
> >> index 5a58778c91d4..43eb479a5dc7 100644
> >> --- a/mm/oom_kill.c
> >> +++ b/mm/oom_kill.c
> >> @@ -161,8 +161,8 @@ static bool oom_unkillable_task(struct task_struct *p,
> >>  		return true;
> >>
> >>  	/* When mem_cgroup_out_of_memory() and p is not member of the group */
> >> -	if (memcg && !task_in_mem_cgroup(p, memcg))
> >> -		return true;
> >> +	if (memcg)
> >> +		return false;
> >
> > This will break the dump_tasks() usage of oom_unkillable_task(). We
> > can change dump_tasks() to traverse processes like
> > mem_cgroup_scan_tasks() for memcg OOMs.
>
> While dump_tasks() traverses only thread group leaders,
> mem_cgroup_scan_tasks() traverses every thread.

I think mem_cgroup_scan_tasks() traversing threads is not intentional,
and css_task_iter_start() in it should use CSS_TASK_ITER_PROCS, as the
OOM killer only cares about processes, or more specifically about the
mm_struct (two different thread groups can share the same mm_struct,
but that is fine).

Tetsuo Handa

Jun 15, 2019, 5:34:33 PM
to Shakeel Butt, Michal Hocko, syzbot, Andrew Morton, Eric W. Biederman, Roman Gushchin, Johannes Weiner, Jérôme Glisse, LKML, Linux MM, syzkaller-bugs, yuzho...@didichuxing.com
On 2019/06/16 3:50, Shakeel Butt wrote:
>> While dump_tasks() traverses only thread group leaders,
>> mem_cgroup_scan_tasks() traverses every thread.
>
> I think mem_cgroup_scan_tasks() traversing threads is not intentional,
> and css_task_iter_start() in it should use CSS_TASK_ITER_PROCS, as the
> OOM killer only cares about processes, or more specifically about the
> mm_struct (two different thread groups can share the same mm_struct,
> but that is fine).

We can't use CSS_TASK_ITER_PROCS from mem_cgroup_scan_tasks(). I've tried
CSS_TASK_ITER_PROCS in an attempt to evaluate only one thread from each
thread group, but I found that CSS_TASK_ITER_PROCS skips all threads in a
thread group (trivially allowing an "Out of memory and no killable
processes...\n" flood) if the thread group leader has already exited.

If we can agree on using a flag in mm_struct to track whether each
mm_struct was already evaluated during a given out_of_memory() call, we
could printk() only one thread from all thread groups sharing that
mm_struct...
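
As a rough illustration (a hypothetical sketch: MMF_OOM_EVALUATED is a
made-up bit that does not exist in the tree, and it would have to be
cleared on every mm before each scan):

/*
 * Returns true if another thread sharing p->mm was already visited
 * during this scan, using an (invented) MMF_OOM_EVALUATED bit in
 * mm->flags, which must be accessed with atomic bitops.
 */
static bool mm_already_evaluated(struct task_struct *p)
{
	struct mm_struct *mm = p->mm;

	if (!mm)
		return false;
	return test_and_set_bit(MMF_OOM_EVALUATED, &mm->flags);
}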

Tetsuo Handa

Jun 16, 2019, 3:38:38 AM
to Shakeel Butt, Michal Hocko, syzbot, Andrew Morton, Eric W. Biederman, Roman Gushchin, Johannes Weiner, Jérôme Glisse, LKML, Linux MM, syzkaller-bugs, yuzho...@didichuxing.com
On 2019/06/16 6:33, Tetsuo Handa wrote:
> On 2019/06/16 3:50, Shakeel Butt wrote:
>>> While dump_tasks() traverses only thread group leaders,
>>> mem_cgroup_scan_tasks() traverses every thread.
>>
>> I think mem_cgroup_scan_tasks() traversing threads is not intentional,
>> and css_task_iter_start() in it should use CSS_TASK_ITER_PROCS, as the
>> OOM killer only cares about processes, or more specifically about the
>> mm_struct (two different thread groups can share the same mm_struct,
>> but that is fine).
>
> We can't use CSS_TASK_ITER_PROCS from mem_cgroup_scan_tasks(). I've tried
> CSS_TASK_ITER_PROCS in an attempt to evaluate only one thread from each
> thread group, but I found that CSS_TASK_ITER_PROCS skips all threads in a
> thread group (trivially allowing an "Out of memory and no killable
> processes...\n" flood) if the thread group leader has already exited.

It seems that CSS_TASK_ITER_PROCS from mem_cgroup_scan_tasks() is now
working. Maybe I was confused because I was testing without commit
7775face207922ea ("memcg: killed threads should not invoke memcg OOM
killer"). We can scan one thread from each thread group and remove the

	/* Prefer thread group leaders for display purposes */
	if (points == oc->chosen_points && thread_group_leader(oc->chosen))
		goto next;

check, since with CSS_TASK_ITER_PROCS the iterator already returns only
one task per thread group.

Tetsuo Handa

Jun 16, 2019, 11:14:29 AM
to Shakeel Butt, Michal Hocko, syzbot, Andrew Morton, Eric W. Biederman, Roman Gushchin, Johannes Weiner, Jérôme Glisse, LKML, Linux MM, syzkaller-bugs, yuzho...@didichuxing.com
On 2019/06/16 16:37, Tetsuo Handa wrote:
> On 2019/06/16 6:33, Tetsuo Handa wrote:
>> On 2019/06/16 3:50, Shakeel Butt wrote:
>>>> While dump_tasks() traverses only thread group leaders,
>>>> mem_cgroup_scan_tasks() traverses every thread.
>>>
>>> I think mem_cgroup_scan_tasks() traversing threads is not intentional,
>>> and css_task_iter_start() in it should use CSS_TASK_ITER_PROCS, as the
>>> OOM killer only cares about processes, or more specifically about the
>>> mm_struct (two different thread groups can share the same mm_struct,
>>> but that is fine).
>>
>> We can't use CSS_TASK_ITER_PROCS from mem_cgroup_scan_tasks(). I've tried
>> CSS_TASK_ITER_PROCS in an attempt to evaluate only one thread from each
>> thread group, but I found that CSS_TASK_ITER_PROCS skips all threads in a
>> thread group (trivially allowing an "Out of memory and no killable
>> processes...\n" flood) if the thread group leader has already exited.
>
> It seems that CSS_TASK_ITER_PROCS from mem_cgroup_scan_tasks() is now
> working.


I found a reproducer and the commit that changed this behavior.

----------------------------------------
#define _GNU_SOURCE
#include <stdio.h>
#include <stdlib.h>
#include <sys/types.h>
#include <sys/stat.h>
#include <fcntl.h>
#include <unistd.h>
#include <sched.h>
#include <sys/mman.h>
#include <asm/unistd.h>

static const unsigned long size = 1048576 * 200;

static int thread(void *unused)
{
	int fd = open("/dev/zero", O_RDONLY);
	char *buf = mmap(NULL, size, PROT_WRITE | PROT_READ,
			 MAP_ANONYMOUS | MAP_SHARED, -1, 0);

	sleep(1); /* let the thread group leader exit first */
	read(fd, buf, size); /* fault in 200MB and exceed the memcg limit */
	return syscall(__NR_exit, 0);
}

int main(int argc, char *argv[])
{
	FILE *fp;

	/* put this process into a memcg whose limit equals the mapping size */
	mkdir("/sys/fs/cgroup/memory/test1", 0755);
	fp = fopen("/sys/fs/cgroup/memory/test1/memory.limit_in_bytes", "w");
	fprintf(fp, "%lu\n", size);
	fclose(fp);
	fp = fopen("/sys/fs/cgroup/memory/test1/tasks", "w");
	fprintf(fp, "%u\n", getpid());
	fclose(fp);
	clone(thread, malloc(8192) + 4096,
	      CLONE_SIGHAND | CLONE_THREAD | CLONE_VM, NULL);
	/* exit only this thread (not the group), leaving a dead leader */
	return syscall(__NR_exit, 0);
}
----------------------------------------
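
The trick in the reproducer is that the thread group leader moves itself
into the memcg, spawns a CLONE_THREAD sibling, and then exits via the raw
exit syscall rather than exit_group(), so the leader is already dead by
the time the sibling faults in the 200MB mapping and triggers the memcg
OOM killer. Without commit c03cd7738a83b137, a CSS_TASK_ITER_PROCS
iteration then finds no candidate process.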

Here is a patch to use CSS_TASK_ITER_PROCS.

From 415e52cf55bc4ad931e4f005421b827f0b02693d Mon Sep 17 00:00:00 2001
From: Tetsuo Handa <penguin...@I-love.SAKURA.ne.jp>
Date: Mon, 17 Jun 2019 00:09:38 +0900
Subject: [PATCH] mm: memcontrol: Use CSS_TASK_ITER_PROCS at mem_cgroup_scan_tasks().

Since commit c03cd7738a83b137 ("cgroup: Include dying leaders with live
threads in PROCS iterations") corrected how CSS_TASK_ITER_PROCS works,
mem_cgroup_scan_tasks() can use CSS_TASK_ITER_PROCS in order to check
only one thread from each thread group.

Signed-off-by: Tetsuo Handa <penguin...@I-love.SAKURA.ne.jp>
---
mm/memcontrol.c | 2 +-
1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/mm/memcontrol.c b/mm/memcontrol.c
index ba9138a..b09ff45 100644
--- a/mm/memcontrol.c
+++ b/mm/memcontrol.c
@@ -1163,7 +1163,7 @@ int mem_cgroup_scan_tasks(struct mem_cgroup *memcg,
 		struct css_task_iter it;
 		struct task_struct *task;
 
-		css_task_iter_start(&iter->css, 0, &it);
+		css_task_iter_start(&iter->css, CSS_TASK_ITER_PROCS, &it);
 		while (!ret && (task = css_task_iter_next(&it)))
 			ret = fn(task, arg);
 		css_task_iter_end(&it);
--
1.8.3.1

Michal Hocko

Jun 17, 2019, 2:31:22 AM
to Tetsuo Handa, Shakeel Butt, syzbot, Andrew Morton, Eric W. Biederman, Roman Gushchin, Johannes Weiner, Jérôme Glisse, LKML, Linux MM, syzkaller-bugs, yuzho...@didichuxing.com
On Mon 17-06-19 00:13:47, Tetsuo Handa wrote:
[...]
> From 415e52cf55bc4ad931e4f005421b827f0b02693d Mon Sep 17 00:00:00 2001
> From: Tetsuo Handa <penguin...@I-love.SAKURA.ne.jp>
> Date: Mon, 17 Jun 2019 00:09:38 +0900
> Subject: [PATCH] mm: memcontrol: Use CSS_TASK_ITER_PROCS at mem_cgroup_scan_tasks().
>
> Since commit c03cd7738a83b137 ("cgroup: Include dying leaders with live
> threads in PROCS iterations") corrected how CSS_TASK_ITER_PROCS works,
> mem_cgroup_scan_tasks() can use CSS_TASK_ITER_PROCS in order to check
> only one thread from each thread group.

O(#threads) is definitely much worse than O(#processes)

> Signed-off-by: Tetsuo Handa <penguin...@I-love.SAKURA.ne.jp>

Acked-by: Michal Hocko <mho...@suse.com>

Thanks!

> ---
> mm/memcontrol.c | 2 +-
> 1 file changed, 1 insertion(+), 1 deletion(-)
>
> diff --git a/mm/memcontrol.c b/mm/memcontrol.c
> index ba9138a..b09ff45 100644
> --- a/mm/memcontrol.c
> +++ b/mm/memcontrol.c
> @@ -1163,7 +1163,7 @@ int mem_cgroup_scan_tasks(struct mem_cgroup *memcg,
>  		struct css_task_iter it;
>  		struct task_struct *task;
>
> -		css_task_iter_start(&iter->css, 0, &it);
> +		css_task_iter_start(&iter->css, CSS_TASK_ITER_PROCS, &it);
>  		while (!ret && (task = css_task_iter_next(&it)))
>  			ret = fn(task, arg);
>  		css_task_iter_end(&it);
> --
> 1.8.3.1

Michal Hocko

Jun 17, 2019, 2:33:21 AM
to Shakeel Butt, syzbot, Andrew Morton, Eric W. Biederman, Roman Gushchin, Johannes Weiner, Jérôme Glisse, LKML, Linux MM, Tetsuo Handa, syzkaller-bugs, yuzho...@didichuxing.com
On Sat 15-06-19 09:11:37, Shakeel Butt wrote:
> On Sat, Jun 15, 2019 at 6:50 AM Michal Hocko <mho...@kernel.org> wrote:
[...]
> > diff --git a/mm/oom_kill.c b/mm/oom_kill.c
> > index 5a58778c91d4..43eb479a5dc7 100644
> > --- a/mm/oom_kill.c
> > +++ b/mm/oom_kill.c
> > @@ -161,8 +161,8 @@ static bool oom_unkillable_task(struct task_struct *p,
> >  		return true;
> >
> >  	/* When mem_cgroup_out_of_memory() and p is not member of the group */
> > -	if (memcg && !task_in_mem_cgroup(p, memcg))
> > -		return true;
> > +	if (memcg)
> > +		return false;
>
> This will break the dump_tasks() usage of oom_unkillable_task(). We
> can change dump_tasks() to traverse processes like
> mem_cgroup_scan_tasks() for memcg OOMs.

Right you are. Doing a trick similar to the OOM victim selection is
indeed better. We should really strive to avoid a global process
iteration when we can do a targeted scan. Care to send a patch?

Thanks!

Tetsuo Handa

Jun 17, 2019, 5:57:34 AM
to Michal Hocko, Shakeel Butt, syzbot, Andrew Morton, Eric W. Biederman, Roman Gushchin, Johannes Weiner, Jérôme Glisse, LKML, Linux MM, syzkaller-bugs, yuzho...@didichuxing.com
I posted a patch that (as a side effect) avoids calling oom_unkillable_task() from dump_tasks() at
https://lore.kernel.org/linux-mm/1558519686-16057-2-git-s...@I-love.SAKURA.ne.jp/ .
What do you think?

Michal Hocko

Jun 17, 2019, 6:13:14 AM
to Tetsuo Handa, Shakeel Butt, syzbot, Andrew Morton, Eric W. Biederman, Roman Gushchin, Johannes Weiner, Jérôme Glisse, LKML, Linux MM, syzkaller-bugs, yuzho...@didichuxing.com
I am sorry, but I didn't get to look at this series yet. Anyway, changing
dump_tasks to use a cgroup iterator for the memcg OOM sounds like a
straightforward thing to do without many other changes around it.
Global task-list iteration under a single RCU section is a more complex
problem that is not limited to the OOM path.

Shakeel Butt

Jun 17, 2019, 9:23:19 AM
to Tetsuo Handa, Michal Hocko, syzbot, Andrew Morton, Eric W. Biederman, Roman Gushchin, Johannes Weiner, Jérôme Glisse, LKML, Linux MM, syzkaller-bugs, yuzho...@didichuxing.com
Reviewed-by: Shakeel Butt <shak...@google.com>

Why not add the reproducer in the commit message?

Andrew Morton

Jun 17, 2019, 9:45:50 PM
to Shakeel Butt, Tetsuo Handa, Michal Hocko, syzbot, Eric W. Biederman, Roman Gushchin, Johannes Weiner, Jérôme Glisse, LKML, Linux MM, syzkaller-bugs, yuzho...@didichuxing.com
On Mon, 17 Jun 2019 06:23:07 -0700 Shakeel Butt <shak...@google.com> wrote:

> > Here is a patch to use CSS_TASK_ITER_PROCS.
> >
> > From 415e52cf55bc4ad931e4f005421b827f0b02693d Mon Sep 17 00:00:00 2001
> > From: Tetsuo Handa <penguin...@I-love.SAKURA.ne.jp>
> > Date: Mon, 17 Jun 2019 00:09:38 +0900
> > Subject: [PATCH] mm: memcontrol: Use CSS_TASK_ITER_PROCS at mem_cgroup_scan_tasks().
> >
> > Since commit c03cd7738a83b137 ("cgroup: Include dying leaders with live
> > threads in PROCS iterations") corrected how CSS_TASK_ITER_PROCS works,
> > mem_cgroup_scan_tasks() can use CSS_TASK_ITER_PROCS in order to check
> > only one thread from each thread group.
> >
> > Signed-off-by: Tetsuo Handa <penguin...@I-love.SAKURA.ne.jp>
>
> Reviewed-by: Shakeel Butt <shak...@google.com>
>
> Why not add the reproducer in the commit message?

That would be nice.

Nicer still would be, as always, a description of the user-visible impact
of the patch.

As I understand it, it's just a bit of a cleanup against current
mainline, but without this patch in place, Shakeel's "mm, oom: refactor
dump_tasks for memcg OOMs" will cause kernel crashes. Correct?

Shakeel Butt

Jun 18, 2019, 12:21:32 AM
to Andrew Morton, Tetsuo Handa, Michal Hocko, syzbot, Eric W. Biederman, Roman Gushchin, Johannes Weiner, Jérôme Glisse, LKML, Linux MM, syzkaller-bugs, yuzho...@didichuxing.com
On Mon, Jun 17, 2019 at 6:45 PM Andrew Morton <ak...@linux-foundation.org> wrote:
>
> On Mon, 17 Jun 2019 06:23:07 -0700 Shakeel Butt <shak...@google.com> wrote:
>
> > > Here is a patch to use CSS_TASK_ITER_PROCS.
> > >
> > > From 415e52cf55bc4ad931e4f005421b827f0b02693d Mon Sep 17 00:00:00 2001
> > > From: Tetsuo Handa <penguin...@I-love.SAKURA.ne.jp>
> > > Date: Mon, 17 Jun 2019 00:09:38 +0900
> > > Subject: [PATCH] mm: memcontrol: Use CSS_TASK_ITER_PROCS at mem_cgroup_scan_tasks().
> > >
> > > Since commit c03cd7738a83b137 ("cgroup: Include dying leaders with live
> > > threads in PROCS iterations") corrected how CSS_TASK_ITER_PROCS works,
> > > mem_cgroup_scan_tasks() can use CSS_TASK_ITER_PROCS in order to check
> > > only one thread from each thread group.
> > >
> > > Signed-off-by: Tetsuo Handa <penguin...@I-love.SAKURA.ne.jp>
> >
> > Reviewed-by: Shakeel Butt <shak...@google.com>
> >
> > Why not add the reproducer in the commit message?
>
> That would be nice.
>
> Nicer still would be, as always, a description of the user-visible impact
> of the patch.
>

This is just a cleanup and an optimization: instead of traversing all
the threads in a memcg, we traverse only one thread for each thread
group in the memcg. There is no user-visible impact.

> As I understand it, it's just a bit of a cleanup against current
> mainline, but without this patch in place, Shakeel's "mm, oom: refactor
> dump_tasks for memcg OOMs" will cause kernel crashes. Correct?

No, the patch "mm, oom: refactor dump_tasks for memcg OOMs" makes
dump_tasks() not depend on the memcg check within
oom_unkillable_task().

"mm, oom: fix oom_unkillable_task for memcg OOMs" is the actual fix,
which makes oom_unkillable_task() correctly handle the memcg OOM
code paths.

Shakeel