
3.10.16 cgroup_mutex deadlock


Shawn Bohrer

Nov 11, 2013, 5:10:01 PM
Hello,

This morning I had a machine running 3.10.16 go unresponsive, but
before we killed it we were able to get the information below. I'm
not an expert here, but it looks like most of the tasks below are
blocked waiting on the cgroup_mutex. You can see that the
resource_alloca:16502 task is holding the cgroup_mutex and that task
appears to be waiting on a lru_add_drain_all() to complete.

Initially I thought the deadlock might simply be that the per-cpu
workqueue work from lru_add_drain_all() is stuck waiting on the
cgroup_free_fn to complete. However, I've read
Documentation/workqueue.txt and it sounds like the current workqueue
has multiple kworker threads per cpu and thus this should not happen.
Both the cgroup_free_fn work and the lru_add_drain_all() work run on
the system_wq, which has max_active set to 0 (i.e. the default), so I
believe multiple kworker threads should run. This also appears to be
true since all of the cgroup_free_fn work items are running on
kworker/12 threads and multiple of them are blocked.
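
(For reference, my rough reading of the 3.10 workqueue code - paraphrased
from memory of include/linux/workqueue.h and kernel/workqueue.c, so treat
it as a sketch rather than exact source - is that max_active 0 simply
selects the default limit:

    /* include/linux/workqueue.h, 3.10-era, abridged */
    enum {
            WQ_MAX_ACTIVE   = 512,                  /* upper bound for max_active */
            WQ_DFL_ACTIVE   = WQ_MAX_ACTIVE / 2,    /* 256, used when max_active == 0 */
    };

    /* kernel/workqueue.c: init_workqueues(), abridged */
    system_wq = alloc_workqueue("events", 0, 0);    /* 0 => WQ_DFL_ACTIVE in-flight items per CPU */

so "max_active set to 0" should still allow up to 256 concurrent work
items per CPU.)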

Perhaps someone with more experience in the cgroup and workqueue code
can look at the stacks below and identify the problem, or explain why
the lru_add_drain_all() work has not completed:


[694702.013850] INFO: task systemd:1 blocked for more than 120 seconds.
[694702.015794] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
[694702.018217] systemd D ffffffff81607820 0 1 0 0x00000000
[694702.020505] ffff88041dcc1d78 0000000000000086 ffff88041dc7f100 ffffffff8110ad54
[694702.023006] 0000000000000001 ffff88041dc78000 ffff88041dcc1fd8 ffff88041dcc1fd8
[694702.025508] ffff88041dcc1fd8 ffff88041dc78000 ffff88041a1e8698 ffffffff81a417c0
[694702.028011] Call Trace:
[694702.028788] [<ffffffff8110ad54>] ? vma_merge+0x124/0x330
[694702.030468] [<ffffffff814b8eb9>] schedule+0x29/0x70
[694702.032011] [<ffffffff814b918e>] schedule_preempt_disabled+0xe/0x10
[694702.033982] [<ffffffff814b75b2>] __mutex_lock_slowpath+0x112/0x1b0
[694702.035926] [<ffffffff8112a2bd>] ? kmem_cache_alloc_trace+0x12d/0x160
[694702.037948] [<ffffffff814b742a>] mutex_lock+0x2a/0x50
[694702.039546] [<ffffffff81095b77>] proc_cgroup_show+0x67/0x1d0
[694702.041330] [<ffffffff8115925b>] seq_read+0x16b/0x3e0
[694702.042927] [<ffffffff811383d0>] vfs_read+0xb0/0x180
[694702.044498] [<ffffffff81138652>] SyS_read+0x52/0xa0
[694702.046042] [<ffffffff814c2182>] system_call_fastpath+0x16/0x1b
[694702.047917] INFO: task kworker/12:1:203 blocked for more than 120 seconds.
[694702.050044] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
[694702.052467] kworker/12:1 D 0000000000000000 0 203 2 0x00000000
[694702.054756] Workqueue: events cgroup_free_fn
[694702.056139] ffff88041bc1fcf8 0000000000000046 ffff88038e7b46a0 0000000300000001
[694702.058642] ffff88041bc1fd84 ffff88041da6e9f0 ffff88041bc1ffd8 ffff88041bc1ffd8
[694702.061144] ffff88041bc1ffd8 ffff88041da6e9f0 0000000000000087 ffffffff81a417c0
[694702.063647] Call Trace:
[694702.064423] [<ffffffff814b8eb9>] schedule+0x29/0x70
[694702.065966] [<ffffffff814b918e>] schedule_preempt_disabled+0xe/0x10
[694702.067936] [<ffffffff814b75b2>] __mutex_lock_slowpath+0x112/0x1b0
[694702.069879] [<ffffffff814b742a>] mutex_lock+0x2a/0x50
[694702.071476] [<ffffffff810930ec>] cgroup_free_fn+0x2c/0x120
[694702.073209] [<ffffffff81057c54>] process_one_work+0x174/0x490
[694702.075019] [<ffffffff81058d0c>] worker_thread+0x11c/0x370
[694702.076748] [<ffffffff81058bf0>] ? manage_workers+0x2c0/0x2c0
[694702.078560] [<ffffffff8105f0b0>] kthread+0xc0/0xd0
[694702.080078] [<ffffffff8105eff0>] ? flush_kthread_worker+0xb0/0xb0
[694702.081995] [<ffffffff814c20dc>] ret_from_fork+0x7c/0xb0
[694702.083671] [<ffffffff8105eff0>] ? flush_kthread_worker+0xb0/0xb0
[694702.085595] INFO: task systemd-logind:2885 blocked for more than 120 seconds.
[694702.087801] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
[694702.090225] systemd-logind D ffffffff81607820 0 2885 1 0x00000000
[694702.092513] ffff88041ac6fd88 0000000000000082 ffff88041dd8aa60 ffff88041d9bc1a8
[694702.095014] ffff88041ac6fda0 ffff88041cac9530 ffff88041ac6ffd8 ffff88041ac6ffd8
[694702.097517] ffff88041ac6ffd8 ffff88041cac9530 0000000000000c36 ffffffff81a417c0
[694702.100019] Call Trace:
[694702.100793] [<ffffffff814b8eb9>] schedule+0x29/0x70
[694702.102338] [<ffffffff814b918e>] schedule_preempt_disabled+0xe/0x10
[694702.104309] [<ffffffff814b75b2>] __mutex_lock_slowpath+0x112/0x1b0
[694702.198316] [<ffffffff814b742a>] mutex_lock+0x2a/0x50
[694702.292456] [<ffffffff8108fa6d>] cgroup_lock_live_group+0x1d/0x40
[694702.386833] [<ffffffff810946c8>] cgroup_mkdir+0xa8/0x4b0
[694702.480679] [<ffffffff81145ea4>] vfs_mkdir+0x84/0xd0
[694702.574124] [<ffffffff8114791e>] SyS_mkdirat+0x5e/0xe0
[694702.666986] [<ffffffff811479b9>] SyS_mkdir+0x19/0x20
[694702.758969] [<ffffffff814c2182>] system_call_fastpath+0x16/0x1b
[694702.848295] INFO: task kworker/12:2:11512 blocked for more than 120 seconds.
[694702.935749] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
[694703.023603] kworker/12:2 D ffffffff816079c0 0 11512 2 0x00000000
[694703.109993] Workqueue: events cgroup_free_fn
[694703.193213] ffff88041b9dfcf8 0000000000000046 ffff88041da6e9f0 ffffea00106fd240
[694703.278353] ffff88041f803c00 ffff8803824254c0 ffff88041b9dffd8 ffff88041b9dffd8
[694703.363757] ffff88041b9dffd8 ffff8803824254c0 0000001f17887bb1 ffffffff81a417c0
[694703.448550] Call Trace:
[694703.531773] [<ffffffff814b8eb9>] schedule+0x29/0x70
[694703.615316] [<ffffffff814b918e>] schedule_preempt_disabled+0xe/0x10
[694703.698298] [<ffffffff814b75b2>] __mutex_lock_slowpath+0x112/0x1b0
[694703.780456] [<ffffffff810931cc>] ? cgroup_free_fn+0x10c/0x120
[694703.861813] [<ffffffff814b742a>] mutex_lock+0x2a/0x50
[694703.942719] [<ffffffff810930ec>] cgroup_free_fn+0x2c/0x120
[694704.023785] [<ffffffff81057c54>] process_one_work+0x174/0x490
[694704.104080] [<ffffffff81058d0c>] worker_thread+0x11c/0x370
[694704.184000] [<ffffffff81058bf0>] ? manage_workers+0x2c0/0x2c0
[694704.264027] [<ffffffff8105f0b0>] kthread+0xc0/0xd0
[694704.343761] [<ffffffff8105eff0>] ? flush_kthread_worker+0xb0/0xb0
[694704.423868] [<ffffffff814c20dc>] ret_from_fork+0x7c/0xb0
[694704.503734] [<ffffffff8105eff0>] ? flush_kthread_worker+0xb0/0xb0
[694704.583766] INFO: task resource_alloca:16502 blocked for more than 120 seconds.
[694704.664964] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
[694704.747361] resource_alloca D ffffffff81607820 0 16502 16467 0x00000000
[694704.831088] ffff8803a43cda18 0000000000000086 ffffffff81a10440 ffffffff810287fe
[694704.916440] ffff8803a43cd9c8 ffff88041ba9b170 ffff8803a43cdfd8 ffff8803a43cdfd8
[694705.002030] ffff8803a43cdfd8 ffff88041ba9b170 ffff8803a43cda38 ffff8803a43cdb50
[694705.086265] Call Trace:
[694705.168664] [<ffffffff810287fe>] ? physflat_send_IPI_mask+0xe/0x10
[694705.251780] [<ffffffff814b8eb9>] schedule+0x29/0x70
[694705.334283] [<ffffffff814b6d65>] schedule_timeout+0x195/0x1f0
[694705.415962] [<ffffffff814b8012>] ? __schedule+0x2a2/0x740
[694705.497267] [<ffffffff814b8715>] wait_for_completion+0xa5/0x110
[694705.578786] [<ffffffff8106c6f0>] ? try_to_wake_up+0x270/0x270
[694705.659533] [<ffffffff81057743>] flush_work+0xe3/0x150
[694705.739934] [<ffffffff810558b0>] ? pool_mayday_timeout+0x100/0x100
[694705.820840] [<ffffffff810ec7a0>] ? __pagevec_release+0x40/0x40
[694705.901615] [<ffffffff810592c3>] schedule_on_each_cpu+0xc3/0x110
[694705.982441] [<ffffffff810ec7c5>] lru_add_drain_all+0x15/0x20
[694706.063268] [<ffffffff8112df6e>] migrate_prep+0xe/0x20
[694706.143986] [<ffffffff81120d7b>] do_migrate_pages+0x2b/0x220
[694706.224840] [<ffffffff8106839b>] ? task_rq_lock+0x5b/0xa0
[694706.305695] [<ffffffff8125e016>] ? cpumask_next_and+0x36/0x50
[694706.386735] [<ffffffff81096f88>] cpuset_migrate_mm+0x78/0xa0
[694706.467666] [<ffffffff81097936>] cpuset_attach+0x296/0x310
[694706.548253] [<ffffffff810928de>] cgroup_attach_task+0x47e/0x7a0
[694706.628732] [<ffffffff814b741d>] ? mutex_lock+0x1d/0x50
[694706.708308] [<ffffffff81092e87>] attach_task_by_pid+0x167/0x1a0
[694706.787271] [<ffffffff81092ef3>] cgroup_tasks_write+0x13/0x20
[694706.864902] [<ffffffff8108fe13>] cgroup_file_write+0x143/0x2e0
[694706.941469] [<ffffffff8113a113>] ? __sb_start_write+0x53/0x110
[694707.018036] [<ffffffff810f910d>] ? vm_mmap_pgoff+0x7d/0xb0
[694707.094629] [<ffffffff8113820e>] vfs_write+0xce/0x1e0
[694707.170158] [<ffffffff811386f2>] SyS_write+0x52/0xa0
[694707.244994] [<ffffffff814bda2e>] ? do_page_fault+0xe/0x10
[694707.319991] [<ffffffff814c2182>] system_call_fastpath+0x16/0x1b
[694707.395118] INFO: task kworker/12:0:24144 blocked for more than 120 seconds.
[694707.471258] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
[694707.548490] kworker/12:0 D 0000000000000000 0 24144 2 0x00000000
[694707.626683] Workqueue: events cgroup_free_fn
[694707.704578] ffff88040d8a5cf8 0000000000000046 ffff88038e7b4db0 0000000000012c80
[694707.784819] 0000000000002e7b ffff88038e7b46a0 ffff88040d8a5fd8 ffff88040d8a5fd8
[694707.866017] ffff88040d8a5fd8 ffff88038e7b46a0 000000000000000c ffffffff81a417c0
[694707.947350] Call Trace:
[694708.027061] [<ffffffff814b8eb9>] schedule+0x29/0x70
[694708.106747] [<ffffffff814b918e>] schedule_preempt_disabled+0xe/0x10
[694708.185737] [<ffffffff814b75b2>] __mutex_lock_slowpath+0x112/0x1b0
[694708.264354] [<ffffffff814b742a>] mutex_lock+0x2a/0x50
[694708.341905] [<ffffffff810930ec>] cgroup_free_fn+0x2c/0x120
[694708.419431] [<ffffffff81057c54>] process_one_work+0x174/0x490
[694708.496983] [<ffffffff81058d0c>] worker_thread+0x11c/0x370
[694708.573522] [<ffffffff81058bf0>] ? manage_workers+0x2c0/0x2c0
[694708.649929] [<ffffffff8105f0b0>] kthread+0xc0/0xd0
[694708.726017] [<ffffffff8105eff0>] ? flush_kthread_worker+0xb0/0xb0
[694708.802451] [<ffffffff814c20dc>] ret_from_fork+0x7c/0xb0
[694708.878645] [<ffffffff8105eff0>] ? flush_kthread_worker+0xb0/0xb0
[694708.955132] INFO: task kworker/12:3:24145 blocked for more than 120 seconds.
[694709.032683] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
[694709.111511] kworker/12:3 D 0000000000000000 0 24145 2 0x00000000
[694709.191327] Workqueue: events cgroup_free_fn
[694709.270794] ffff880399e6fcf8 0000000000000046 ffff88038e7b54c0 0000000000012c80
[694709.352338] 0000000000002e7b ffff88038e7b4db0 ffff880399e6ffd8 ffff880399e6ffd8
[694709.434361] ffff880399e6ffd8 ffff88038e7b4db0 000000000000000c ffffffff81a417c0
[694709.516544] Call Trace:
[694709.597775] [<ffffffff814b8eb9>] schedule+0x29/0x70
[694709.679168] [<ffffffff814b918e>] schedule_preempt_disabled+0xe/0x10
[694709.759888] [<ffffffff814b75b2>] __mutex_lock_slowpath+0x112/0x1b0
[694709.839675] [<ffffffff814b742a>] mutex_lock+0x2a/0x50
[694709.918397] [<ffffffff810930ec>] cgroup_free_fn+0x2c/0x120
[694709.996989] [<ffffffff81057c54>] process_one_work+0x174/0x490
[694710.076272] [<ffffffff81058d0c>] worker_thread+0x11c/0x370
[694710.154597] [<ffffffff81058bf0>] ? manage_workers+0x2c0/0x2c0
[694710.232443] [<ffffffff8105f0b0>] kthread+0xc0/0xd0
[694710.309787] [<ffffffff8105eff0>] ? flush_kthread_worker+0xb0/0xb0
[694710.387874] [<ffffffff814c20dc>] ret_from_fork+0x7c/0xb0
[694710.465882] [<ffffffff8105eff0>] ? flush_kthread_worker+0xb0/0xb0
[694710.544284] INFO: task kworker/12:4:24146 blocked for more than 120 seconds.
[694710.623592] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
[694710.704151] kworker/12:4 D 0000000000000000 0 24146 2 0x00000000
[694710.785538] Workqueue: events cgroup_free_fn
[694710.866442] ffff88041ce81cf8 0000000000000046 ffff88038e7b5bd0 0000000000012c80
[694710.949265] 0000000000002e7b ffff88038e7b54c0 ffff88041ce81fd8 ffff88041ce81fd8
[694711.032620] ffff88041ce81fd8 ffff88038e7b54c0 000000000000000c ffffffff81a417c0
[694711.116268] Call Trace:
[694711.199224] [<ffffffff814b8eb9>] schedule+0x29/0x70
[694711.283085] [<ffffffff814b918e>] schedule_preempt_disabled+0xe/0x10
[694711.367344] [<ffffffff814b75b2>] __mutex_lock_slowpath+0x112/0x1b0
[694711.451765] [<ffffffff814b742a>] mutex_lock+0x2a/0x50
[694711.536157] [<ffffffff810930ec>] cgroup_free_fn+0x2c/0x120
[694711.620825] [<ffffffff81057c54>] process_one_work+0x174/0x490
[694711.705459] [<ffffffff81058d0c>] worker_thread+0x11c/0x370
[694711.789879] [<ffffffff81058bf0>] ? manage_workers+0x2c0/0x2c0
[694711.874566] [<ffffffff8105f0b0>] kthread+0xc0/0xd0
[694711.959229] [<ffffffff8105eff0>] ? flush_kthread_worker+0xb0/0xb0
[694712.044502] [<ffffffff814c20dc>] ret_from_fork+0x7c/0xb0
[694712.129694] [<ffffffff8105eff0>] ? flush_kthread_worker+0xb0/0xb0
[694712.215339] INFO: task kworker/12:5:24147 blocked for more than 120 seconds.
[694712.302557] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
[694712.391077] kworker/12:5 D 0000000000000000 0 24147 2 0x00000000
[694712.480423] Workqueue: events cgroup_free_fn
[694712.568568] ffff88040fc3fcf8 0000000000000046 ffff88038e7b62e0 0000000000012c80
[694712.657755] 0000000000002e7b ffff88038e7b5bd0 ffff88040fc3ffd8 ffff88040fc3ffd8
[694712.746434] ffff88040fc3ffd8 ffff88038e7b5bd0 000000000000000c ffffffff81a417c0
[694712.834847] Call Trace:
[694712.921531] [<ffffffff814b8eb9>] schedule+0x29/0x70
[694713.008695] [<ffffffff814b918e>] schedule_preempt_disabled+0xe/0x10
[694713.096069] [<ffffffff814b75b2>] __mutex_lock_slowpath+0x112/0x1b0
[694713.182672] [<ffffffff814b742a>] mutex_lock+0x2a/0x50
[694713.268848] [<ffffffff810930ec>] cgroup_free_fn+0x2c/0x120
[694713.355186] [<ffffffff81057c54>] process_one_work+0x174/0x490
[694713.441498] [<ffffffff81058d0c>] worker_thread+0x11c/0x370
[694713.527569] [<ffffffff81058bf0>] ? manage_workers+0x2c0/0x2c0
[694713.613723] [<ffffffff8105f0b0>] kthread+0xc0/0xd0
[694713.699558] [<ffffffff8105eff0>] ? flush_kthread_worker+0xb0/0xb0
[694713.785921] [<ffffffff814c20dc>] ret_from_fork+0x7c/0xb0
[694713.872205] [<ffffffff8105eff0>] ? flush_kthread_worker+0xb0/0xb0
[694713.958940] INFO: task kworker/12:6:24148 blocked for more than 120 seconds.
[694714.046607] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
[694714.135393] kworker/12:6 D 0000000000000000 0 24148 2 0x00000000
[694714.224979] Workqueue: events cgroup_free_fn
[694714.313390] ffff88040cf13cf8 0000000000000046 ffff88038e7b69f0 0000000000012c80
[694714.402790] 0000000000002e7b ffff88038e7b62e0 ffff88040cf13fd8 ffff88040cf13fd8
[694714.491601] ffff88040cf13fd8 ffff88038e7b62e0 000000000000000c ffffffff81a417c0
[694714.580174] Call Trace:
[694714.667046] [<ffffffff814b8eb9>] schedule+0x29/0x70
[694714.754314] [<ffffffff814b918e>] schedule_preempt_disabled+0xe/0x10
[694714.841716] [<ffffffff814b75b2>] __mutex_lock_slowpath+0x112/0x1b0
[694714.928347] [<ffffffff814b742a>] mutex_lock+0x2a/0x50
[694715.014577] [<ffffffff810930ec>] cgroup_free_fn+0x2c/0x120
[694715.100968] [<ffffffff81057c54>] process_one_work+0x174/0x490
[694715.187306] [<ffffffff81058d0c>] worker_thread+0x11c/0x370
[694715.273404] [<ffffffff81058bf0>] ? manage_workers+0x2c0/0x2c0
[694715.359553] [<ffffffff8105f0b0>] kthread+0xc0/0xd0
[694715.445411] [<ffffffff8105eff0>] ? flush_kthread_worker+0xb0/0xb0
[694715.531773] [<ffffffff814c20dc>] ret_from_fork+0x7c/0xb0
[694715.618058] [<ffffffff8105eff0>] ? flush_kthread_worker+0xb0/0xb0

--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majo...@vger.kernel.org
More majordomo info at http://vger.kernel.org/majordomo-info.html
Please read the FAQ at http://www.tux.org/lkml/

Li Zefan

Nov 12, 2013, 5:20:02 AM
Cc more people

On 2013/11/12 6:06, Shawn Bohrer wrote:
> Hello,
>
> This morning I had a machine running 3.10.16 go unresponsive but
> before we killed it we were able to get the information below. I'm
> not an expert here but it looks like most of the tasks below are
> blocking waiting on the cgroup_mutex. You can see that the
> resource_alloca:16502 task is holding the cgroup_mutex and that task
> appears to be waiting on a lru_add_drain_all() to complete.

Ouch, another bug report!

This looks like the same bug that Hugh saw.
(http://permalink.gmane.org/gmane.linux.kernel.cgroups/9351)

What's new in your report is that the lru_add_drain_all() comes from
cpuset_attach() instead of memcg. Moreover, I thought it was a
3.11-specific bug.

Michal Hocko

Nov 12, 2013, 9:40:02 AM
On Tue 12-11-13 18:17:20, Li Zefan wrote:
> Cc more people
>
> On 2013/11/12 6:06, Shawn Bohrer wrote:
> > Hello,
> >
> > This morning I had a machine running 3.10.16 go unresponsive but
> > before we killed it we were able to get the information below. I'm
> > not an expert here but it looks like most of the tasks below are
> > blocking waiting on the cgroup_mutex. You can see that the
> > resource_alloca:16502 task is holding the cgroup_mutex and that task
> > appears to be waiting on a lru_add_drain_all() to complete.

Do you have sysrq+l output as well by any chance? That would tell
us what the current CPUs are doing. Dumping all kworker stacks
might be helpful as well. We know that lru_add_drain_all waits for
schedule_on_each_cpu to return so it is waiting for workers to finish.
I would be really curious why some of lru_add_drain_cpu cannot finish
properly. The only reason would be that some work item(s) do not get CPU
or somebody is holding lru_lock.
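
For reference, the path in question looks roughly like this in 3.10
(abridged and paraphrased from mm/swap.c and kernel/workqueue.c, so not
exact source):

    /* mm/swap.c */
    int lru_add_drain_all(void)
    {
            return schedule_on_each_cpu(lru_add_drain_per_cpu);
    }

    /* kernel/workqueue.c, abridged */
    int schedule_on_each_cpu(work_func_t func)
    {
            int cpu;
            struct work_struct __percpu *works = alloc_percpu(struct work_struct);

            if (!works)
                    return -ENOMEM;

            get_online_cpus();
            for_each_online_cpu(cpu) {
                    struct work_struct *work = per_cpu_ptr(works, cpu);

                    INIT_WORK(work, func);
                    schedule_work_on(cpu, work);            /* queued on system_wq */
            }
            for_each_online_cpu(cpu)
                    flush_work(per_cpu_ptr(works, cpu));    /* sleeps until each item has run */
            put_online_cpus();
            free_percpu(works);
            return 0;
    }

So the caller blocks until every per-cpu work item has actually executed.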

--
Michal Hocko
SUSE Labs

Michal Hocko

Nov 12, 2013, 12:00:01 PM
On Tue 12-11-13 09:55:30, Shawn Bohrer wrote:
> On Tue, Nov 12, 2013 at 03:31:47PM +0100, Michal Hocko wrote:
> > On Tue 12-11-13 18:17:20, Li Zefan wrote:
> > > Cc more people
> > >
> > > On 2013/11/12 6:06, Shawn Bohrer wrote:
> > > > Hello,
> > > >
> > > > This morning I had a machine running 3.10.16 go unresponsive but
> > > > before we killed it we were able to get the information below. I'm
> > > > not an expert here but it looks like most of the tasks below are
> > > > blocking waiting on the cgroup_mutex. You can see that the
> > > > resource_alloca:16502 task is holding the cgroup_mutex and that task
> > > > appears to be waiting on a lru_add_drain_all() to complete.
> >
> > Do you have sysrq+l output as well by any chance? That would tell
> > us what the current CPUs are doing. Dumping all kworker stacks
> > might be helpful as well. We know that lru_add_drain_all waits for
> > schedule_on_each_cpu to return so it is waiting for workers to finish.
> > I would be really curious why some of lru_add_drain_cpu cannot finish
> > properly. The only reason would be that some work item(s) do not get CPU
> > or somebody is holding lru_lock.
>
> In fact the sys-admin did manage to fire off a sysrq+l, I've put all
> of the info from the syslog below. I've looked it over and I'm not
> sure it reveals anything. First looking at the timestamps it appears
> we ran the sysrq+l 19.2 hours after the cgroup_mutex lockup I
> previously sent.

I would expect sysrq+w would still show those kworkers blocked on the
same cgroup mutex?

> I also have atop logs over that whole time period
> that show hundreds of zombie processes which to me indicates that over
> that 19.2 hours systemd remained wedged on the cgroup_mutex. Looking
> at the backtraces from the sysrq+l it appears most of the CPUs were
> idle

Right, so either we managed to sleep with the lru_lock held, which sounds
a bit improbable - but who knows - or there is some other problem. I
would expect the latter to be true.

lru_add_drain executes per-cpu with preemption disabled, which means that
its work item cannot be preempted, so the only logical explanation seems
to be that the work item never got scheduled.

> except there are a few where ptpd is trying to step the clock
> with clock_settime. The ptpd process also appears to get stuck for a
> bit but it looks like it recovers because it moves CPUs and the
> previous CPUs become idle.

It gets a soft lockup because it is waiting for its own IPIs, which got
preempted by the NMI trace dumper. But this is unrelated.

> The fact that ptpd is stepping the clock
> at all at this time means that timekeeping is a mess at this point and
> the system clock is way out of sync. There are also a few of these
> NMI messages in there that I don't understand but at this point the
> machine was a sinking ship.
>
> Nov 11 07:03:29 sydtest0 kernel: [764305.327043] Uhhuh. NMI received for unknown reason 21 on CPU 26.
> Nov 11 07:03:29 sydtest0 kernel: [764305.327043] Do you have a strange power saving mode enabled?
> Nov 11 07:03:29 sydtest0 kernel: [764305.327043] Dazed and confused, but trying to continue
> Nov 11 07:03:29 sydtest0 kernel: [764305.327143] Uhhuh. NMI received for unknown reason 31 on CPU 27.
> Nov 11 07:03:29 sydtest0 kernel: [764305.327144] Do you have a strange power saving mode enabled?
> Nov 11 07:03:29 sydtest0 kernel: [764305.327144] Dazed and confused, but trying to continue
> Nov 11 07:03:29 sydtest0 kernel: [764305.327242] Uhhuh. NMI received for unknown reason 31 on CPU 28.
> Nov 11 07:03:29 sydtest0 kernel: [764305.327242] Do you have a strange power saving mode enabled?
> Nov 11 07:03:29 sydtest0 kernel: [764305.327243] Dazed and confused, but trying to continue
>
> Perhaps there is another task blocking somewhere holding the lru_lock, but at
> this point the machine has been rebooted so I'm not sure how we'd figure out
> what task that might be. Anyway here is the full output of sysrq+l plus
> whatever else ended up in the syslog.

OK. In case the issue happens again, it would be very helpful to get the
kworker and per-cpu stacks. Maybe Tejun can help with some waitqueue
debugging tricks.

Shawn Bohrer

Nov 14, 2013, 6:00:02 PM
Yes, I believe so.

> > I also have atop logs over that whole time period
> > that show hundreds of zombie processes which to me indicates that over
> > that 19.2 hours systemd remained wedged on the cgroup_mutex. Looking
> > at the backtraces from the sysrq+l it appears most of the CPUs were
> > idle
>
> Right so either we managed to sleep with the lru_lock held which sounds
> a bit improbable - but who knows - or there is some other problem. I
> would expect the later to be true.
>
> lru_add_drain executes per-cpu and preemption disabled this means that
> its work item cannot be preempted so the only logical explanation seems
> to be that the work item has never got scheduled.

Meaning you think there would be no kworker thread for the
lru_add_drain at this point? If so, you might be correct.

> OK. In case the issue happens again. It would be very helpful to get the
> kworker and per-cpu stacks. Maybe Tejun can help with some waitqueue
> debugging tricks.

I set up one of my test pools with two scripts trying to reproduce the
problem. One essentially puts tasks into several cpuset groups that
have cpuset.memory_migrate set, then takes them back out. It also
occasionally switches cpuset.mems in those groups to try to keep the
memory of those tasks migrating between nodes. The second script is:

$ cat /home/hbi/cgroup_mutex_cgroup_maker.sh
#!/bin/bash

session_group=$(ps -o pid,cmd,cgroup -p $$ | grep -E 'c[0-9]+' -o)
cd /sys/fs/cgroup/systemd/user/hbi/${session_group}
pwd

while true; do
    for x in $(seq 1 1000); do
        mkdir $x
        echo $$ > ${x}/tasks
        echo $$ > tasks
        rmdir $x
    done
    sleep .1
    date
done

After running both concurrently on 40 machines for about 12 hours I've
managed to reproduce the issue at least once, possibly more. One
machine looked identical to this reported issue. It has a bunch of
stuck cgroup_free_fn() kworker threads and one thread in cpuset_attach
waiting on lru_add_drain_all(). A sysrq+l shows all CPUs are idle
except for the one triggering the sysrq+l. The sysrq+w unfortunately
wrapped dmesg so we didn't get the stacks of all blocked tasks. We
did however also cat /proc/<pid>/stack of all kworker threads on the
system. There were 265 kworker threads that all have the following
stack:

[kworker/2:1]
[<ffffffff810930ec>] cgroup_free_fn+0x2c/0x120
[<ffffffff81057c54>] process_one_work+0x174/0x490
[<ffffffff81058d0c>] worker_thread+0x11c/0x370
[<ffffffff8105f0b0>] kthread+0xc0/0xd0
[<ffffffff814c20dc>] ret_from_fork+0x7c/0xb0
[<ffffffffffffffff>] 0xffffffffffffffff

And there were another 101 that had stacks like the following:

[kworker/0:0]
[<ffffffff81058daf>] worker_thread+0x1bf/0x370
[<ffffffff8105f0b0>] kthread+0xc0/0xd0
[<ffffffff814c20dc>] ret_from_fork+0x7c/0xb0
[<ffffffffffffffff>] 0xffffffffffffffff

That's it. Again, I'm not sure if that is helpful at all, but it seems
to imply that the lru_add_drain work was never scheduled.

I also managed to kill another two machines running my test. One of
them we didn't get anything out of, and the other looks like I
deadlocked on css_set_lock. I'll follow up with the css_set_lock
deadlock in another email since it doesn't look related to this one.
But it does seem that I can probably reproduce this if anyone has
some debugging ideas.

--
Shawn

Tejun Heo

Nov 15, 2013, 1:30:02 AM
Hello,

On Thu, Nov 14, 2013 at 04:56:49PM -0600, Shawn Bohrer wrote:
> After running both concurrently on 40 machines for about 12 hours I've
> managed to reproduce the issue at least once, possibly more. One
> machine looked identical to this reported issue. It has a bunch of
> stuck cgroup_free_fn() kworker threads and one thread in cpuset_attach
> waiting on lru_add_drain_all(). A sysrq+l shows all CPUs are idle
> except for the one triggering the sysrq+l. The sysrq+w unfortunately
> wrapped dmesg so we didn't get the stacks of all blocked tasks. We
> did however also cat /proc/<pid>/stack of all kworker threads on the
> system. There were 265 kworker threads that all have the following
> stack:

Umm... so, WQ_DFL_ACTIVE is 256. It's just an arbitrarily largish
number which is supposed to serve as protection against runaway
kworker creation. The assumption there is that there won't be a
dependency chain longer than that, and if there is, it should be
separated out into a separate workqueue. It looks like we *can* have
such a long chain of dependency with a high enough rate of cgroup
destruction. kworkers trying to destroy cgroups get blocked by an
earlier one which is holding cgroup_mutex. If the blocked ones
completely consume max_active and the earlier one then tries to
perform an operation which makes use of the system_wq, the forward
progress guarantee gets broken.
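
Putting that together with the stacks in the original report, my rough
picture of this particular instance is:

    attach_task_by_pid() takes cgroup_mutex
      -> cpuset_attach() -> lru_add_drain_all()
           -> queues one work item per CPU on system_wq and flush_work()s them
    meanwhile, on CPU 12, ~256 cgroup destruction work items are already in
    flight, all sleeping on cgroup_mutex, so that pool's max_active is exhausted
      -> the drain work item for CPU 12 never starts
      -> lru_add_drain_all() never returns
      -> cgroup_mutex is never released -> deadlock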

So, yeah, it makes sense now. We're just gonna have to separate out
cgroup destruction to a separate workqueue. Hugh's temp fix achieved
about the same effect by putting the affected part of destruction to a
different workqueue. I probably should have realized that we were
hitting max_active when I was told that moving some part to a
different workqueue makes the problem go away.

Will send out a patch soon.

Thanks.

--
tejun

Tejun Heo

Nov 15, 2013, 3:00:01 AM
Hello,

Shawn, Hugh, can you please verify whether the attached patch makes
the deadlock go away?

Thanks.

diff --git a/kernel/cgroup.c b/kernel/cgroup.c
index e0839bc..dc9dc06 100644
--- a/kernel/cgroup.c
+++ b/kernel/cgroup.c
@@ -90,6 +90,14 @@ static DEFINE_MUTEX(cgroup_mutex);
static DEFINE_MUTEX(cgroup_root_mutex);

/*
+ * cgroup destruction makes heavy use of work items and there can be a lot
+ * of concurrent destructions. Use a separate workqueue so that cgroup
+ * destruction work items don't end up filling up max_active of system_wq
+ * which may lead to deadlock.
+ */
+static struct workqueue_struct *cgroup_destroy_wq;
+
+/*
* Generate an array of cgroup subsystem pointers. At boot time, this is
* populated with the built in subsystems, and modular subsystems are
* registered after that. The mutable section of this array is protected by
@@ -871,7 +879,7 @@ static void cgroup_free_rcu(struct rcu_head *head)
struct cgroup *cgrp = container_of(head, struct cgroup, rcu_head);

INIT_WORK(&cgrp->destroy_work, cgroup_free_fn);
- schedule_work(&cgrp->destroy_work);
+ queue_work(cgroup_destroy_wq, &cgrp->destroy_work);
}

static void cgroup_diput(struct dentry *dentry, struct inode *inode)
@@ -4254,7 +4262,7 @@ static void css_free_rcu_fn(struct rcu_head *rcu_head)
* css_put(). dput() requires process context which we don't have.
*/
INIT_WORK(&css->destroy_work, css_free_work_fn);
- schedule_work(&css->destroy_work);
+ queue_work(cgroup_destroy_wq, &css->destroy_work);
}

static void css_release(struct percpu_ref *ref)
@@ -4544,7 +4552,7 @@ static void css_killed_ref_fn(struct percpu_ref *ref)
container_of(ref, struct cgroup_subsys_state, refcnt);

INIT_WORK(&css->destroy_work, css_killed_work_fn);
- schedule_work(&css->destroy_work);
+ queue_work(cgroup_destroy_wq, &css->destroy_work);
}

/**
@@ -5025,6 +5033,17 @@ int __init cgroup_init(void)
if (err)
return err;

+ /*
+ * There isn't much point in executing destruction path in
+ * parallel. Good chunk is serialized with cgroup_mutex anyway.
+ * Use 1 for @max_active.
+ */
+ cgroup_destroy_wq = alloc_workqueue("cgroup_destroy", 0, 1);
+ if (!cgroup_destroy_wq) {
+ err = -ENOMEM;
+ goto out;
+ }
+
for_each_builtin_subsys(ss, i) {
if (!ss->early_init)
cgroup_init_subsys(ss);
@@ -5062,9 +5081,11 @@ int __init cgroup_init(void)
proc_create("cgroups", 0, NULL, &proc_cgroupstats_operations);

out:
- if (err)
+ if (err) {
+ if (cgroup_destroy_wq)
+ destroy_workqueue(cgroup_destroy_wq);
bdi_destroy(&cgroup_backing_dev_info);
-
+ }
return err;

Hugh Dickins

Nov 17, 2013, 9:20:01 PM
On Fri, 15 Nov 2013, Tejun Heo wrote:

> Hello,
>
> Shawn, Hugh, can you please verify whether the attached patch makes
> the deadlock go away?

Thanks a lot, Tejun: report below.
Sorry for the delay: I was on the point of reporting success last
night, when I tried a debug kernel: and that didn't work so well
(got a spinlock bad magic report in pwq_adjust_max_active(), and
tests wouldn't run at all).

Even the non-early cgroup_init() is called well before the
early_initcall init_workqueues(): though only the debug (lockdep
and spinlock debug) kernel appeared to have a problem with that.

Here's the patch I ended up with successfully on a 3.11.7-based
kernel (though below I've rediffed it against 3.11.8): the
schedule_work->queue_work hunks are slightly different on 3.11
than in your patch against current, and I did alloc_workqueue()
from a separate core_initcall.

The interval between cgroup_init and that is a bit of a worry;
but we don't seem to have suffered from the interval between
cgroup_init and init_workqueues before (when system_wq is NULL)
- though you may have more courage than I to reorder them!

Initially I backed out my system_highpri_wq workaround, and
verified that it was still easy to reproduce the problem with
one of our cgroup stresstests. Yes, it was; then your modified
patch below convincingly fixed it.

I ran with Johannes's patch adding extra mem_cgroup_reparent_charges:
as I'd expected, that didn't solve this issue (though it's worth
our keeping it in to rule out another source of problems). And I
checked back on dumps of failures: they indeed show the tell-tale
256 kworkers doing cgroup_offline_fn, just as you predicted.

Thanks!
Hugh

---
kernel/cgroup.c | 30 +++++++++++++++++++++++++++---
1 file changed, 27 insertions(+), 3 deletions(-)

--- 3.11.8/kernel/cgroup.c 2013-11-17 17:40:54.200640692 -0800
+++ linux/kernel/cgroup.c 2013-11-17 17:43:10.876643941 -0800
@@ -89,6 +89,14 @@ static DEFINE_MUTEX(cgroup_mutex);
static DEFINE_MUTEX(cgroup_root_mutex);

/*
+ * cgroup destruction makes heavy use of work items and there can be a lot
+ * of concurrent destructions. Use a separate workqueue so that cgroup
+ * destruction work items don't end up filling up max_active of system_wq
+ * which may lead to deadlock.
+ */
+static struct workqueue_struct *cgroup_destroy_wq;
+
+/*
* Generate an array of cgroup subsystem pointers. At boot time, this is
* populated with the built in subsystems, and modular subsystems are
* registered after that. The mutable section of this array is protected by
@@ -890,7 +898,7 @@ static void cgroup_free_rcu(struct rcu_h
struct cgroup *cgrp = container_of(head, struct cgroup, rcu_head);

INIT_WORK(&cgrp->destroy_work, cgroup_free_fn);
- schedule_work(&cgrp->destroy_work);
+ queue_work(cgroup_destroy_wq, &cgrp->destroy_work);
}

static void cgroup_diput(struct dentry *dentry, struct inode *inode)
@@ -4205,7 +4213,7 @@ static void css_release(struct percpu_re
struct cgroup_subsys_state *css =
container_of(ref, struct cgroup_subsys_state, refcnt);

- schedule_work(&css->dput_work);
+ queue_work(cgroup_destroy_wq, &css->dput_work);
}

static void init_cgroup_css(struct cgroup_subsys_state *css,
@@ -4439,7 +4447,7 @@ static void cgroup_css_killed(struct cgr

/* percpu ref's of all css's are killed, kick off the next step */
INIT_WORK(&cgrp->destroy_work, cgroup_offline_fn);
- schedule_work(&cgrp->destroy_work);
+ queue_work(cgroup_destroy_wq, &cgrp->destroy_work);
}

static void css_ref_killed_fn(struct percpu_ref *ref)
@@ -4967,6 +4975,22 @@ out:
return err;
}

+static int __init cgroup_destroy_wq_init(void)
+{
+ /*
+ * There isn't much point in executing destruction path in
+ * parallel. Good chunk is serialized with cgroup_mutex anyway.
+ * Use 1 for @max_active.
+ *
+ * We would prefer to do this in cgroup_init() above, but that
+ * is called before init_workqueues(): so leave this until after.
+ */
+ cgroup_destroy_wq = alloc_workqueue("cgroup_destroy", 0, 1);
+ BUG_ON(!cgroup_destroy_wq);
+ return 0;
+}
+core_initcall(cgroup_destroy_wq_init);
+
/*
* proc_cgroup_show()
* - Print task's cgroup paths into seq_file, one line for each hierarchy

Shawn Bohrer

Nov 18, 2013, 3:20:02 PM
Thanks Tejun and Hugh. Sorry for my late entry in getting around to
testing this fix. On the surface it sounds correct; however, I'd like
to test this on top of 3.10.* since that is what we'll likely be
running. I've tried to apply Hugh's patch above on top of 3.10.19, but
it appears there are a number of conflicts. Looking over the changes
and my understanding of the problem, I believe on 3.10 only the
cgroup_free_fn needs to be run in a separate workqueue. Below is the
patch I've applied on top of 3.10.19, which I'm about to start
testing. If it looks like I botched the backport in any way, please
let me know so I can test a proper fix on top of 3.10.19.


---
kernel/cgroup.c | 28 ++++++++++++++++++++++++++--
1 files changed, 26 insertions(+), 2 deletions(-)

diff --git a/kernel/cgroup.c b/kernel/cgroup.c
index b6b26fa..113a522 100644
--- a/kernel/cgroup.c
+++ b/kernel/cgroup.c
@@ -92,6 +92,14 @@ static DEFINE_MUTEX(cgroup_mutex);
static DEFINE_MUTEX(cgroup_root_mutex);

/*
+* cgroup destruction makes heavy use of work items and there can be a lot
+* of concurrent destructions. Use a separate workqueue so that cgroup
+* destruction work items don't end up filling up max_active of system_wq
+* which may lead to deadlock.
+*/
+static struct workqueue_struct *cgroup_destroy_wq;
+
+/*
* Generate an array of cgroup subsystem pointers. At boot time, this is
* populated with the built in subsystems, and modular subsystems are
* registered after that. The mutable section of this array is protected by
@@ -873,7 +881,8 @@ static void cgroup_free_rcu(struct rcu_head *head)
{
struct cgroup *cgrp = container_of(head, struct cgroup, rcu_head);

- schedule_work(&cgrp->free_work);
+ INIT_WORK(&cgrp->free_work, cgroup_free_fn);
+ queue_work(cgroup_destroy_wq, &cgrp->free_work);
}

static void cgroup_diput(struct dentry *dentry, struct inode *inode)
@@ -1405,7 +1414,6 @@ static void init_cgroup_housekeeping(struct cgroup *cgrp)
INIT_LIST_HEAD(&cgrp->allcg_node);
INIT_LIST_HEAD(&cgrp->release_list);
INIT_LIST_HEAD(&cgrp->pidlists);
- INIT_WORK(&cgrp->free_work, cgroup_free_fn);
mutex_init(&cgrp->pidlist_mutex);
INIT_LIST_HEAD(&cgrp->event_list);
spin_lock_init(&cgrp->event_list_lock);
@@ -4686,6 +4694,22 @@ out:
return err;
}

+static int __init cgroup_destroy_wq_init(void)
+{
+ /*
+ * There isn't much point in executing destruction path in
+ * parallel. Good chunk is serialized with cgroup_mutex anyway.
+ * Use 1 for @max_active.
+ *
+ * We would prefer to do this in cgroup_init() above, but that
+ * is called before init_workqueues(): so leave this until after.
+ */
+ cgroup_destroy_wq = alloc_workqueue("cgroup_destroy", 0, 1);
+ BUG_ON(!cgroup_destroy_wq);
+ return 0;
+}
+core_initcall(cgroup_destroy_wq_init);
+
/*
* proc_cgroup_show()
* - Print task's cgroup paths into seq_file, one line for each hierarchy
--
1.7.7.6

Li Zefan

Nov 18, 2013, 10:00:02 PM
> Thanks Tejun and Hugh. Sorry for my late entry in getting around to
> testing this fix. On the surface it sounds correct however I'd like to
> test this on top of 3.10.* since that is what we'll likely be running.
> I've tried to apply Hugh's patch above on top of 3.10.19 but it
> appears there are a number of conflicts. Looking over the changes and
> my understanding of the problem I believe on 3.10 only the
> cgroup_free_fn needs to be run in a separate workqueue. Below is the
> patch I've applied on top of 3.10.19, which I'm about to start
> testing. If it looks like I botched the backport in any way please
> let me know so I can test a propper fix on top of 3.10.19.
>

You didn't move css free_work to the dedicated wq as Tejun's patch does.
css free_work won't acquire cgroup_mutex, but when destroying a lot of
cgroups, we can have a lot of css free_work items in the workqueue, so
I'd suggest you also use cgroup_destroy_wq for it.

Shawn Bohrer

Nov 20, 2013, 5:50:02 PM
On Tue, Nov 19, 2013 at 10:55:18AM +0800, Li Zefan wrote:
> > Thanks Tejun and Hugh. Sorry for my late entry in getting around to
> > testing this fix. On the surface it sounds correct however I'd like to
> > test this on top of 3.10.* since that is what we'll likely be running.
> > I've tried to apply Hugh's patch above on top of 3.10.19 but it
> > appears there are a number of conflicts. Looking over the changes and
> > my understanding of the problem I believe on 3.10 only the
> > cgroup_free_fn needs to be run in a separate workqueue. Below is the
> > patch I've applied on top of 3.10.19, which I'm about to start
> > testing. If it looks like I botched the backport in any way please
> > let me know so I can test a propper fix on top of 3.10.19.
> >
>
> You didn't move css free_work to the dedicate wq as Tejun's patch does.
> css free_work won't acquire cgroup_mutex, but when destroying a lot of
> cgroups, we can have a lot of css free_work in the workqueue, so I'd
> suggest you also use cgroup_destroy_wq for it.

Well, I didn't move the css free_work, but I did test the patch I
posted on top of 3.10.19 and I am unable to reproduce the lockup so it
appears my patch was sufficient for 3.10.*. Hopefully we can get this
fix applied and backported into stable.

Thanks,
Shawn

William Dauchy

Nov 22, 2013, 4:10:02 PM
Hugh, Tejun,

Do we have some news about this patch? I'm also hitting this bug on a 3.10.x

Thanks,
--
William

Tejun Heo

Nov 22, 2013, 5:20:01 PM
On Fri, Nov 22, 2013 at 09:59:37PM +0100, William Dauchy wrote:
> Hugh, Tejun,
>
> Do we have some news about this patch? I'm also hitting this bug on a 3.10.x

Just applied to cgroup/for-3.13-fixes w/ stable cc'd. Will push to
Linus next week.

Thanks.

--
tejun

Tejun Heo

Nov 22, 2013, 5:20:02 PM
Hello, Hugh.

I applied the following patch to cgroup/for-3.13-fixes. For the longer
term, I think it'd be better to pull workqueue init before the cgroup
one, but this one should be easier to backport for now.

Thanks!

----- >8 -----
From e5fca243abae1445afbfceebda5f08462ef869d3 Mon Sep 17 00:00:00 2001
From: Tejun Heo <t...@kernel.org>
Date: Fri, 22 Nov 2013 17:14:39 -0500

Since be44562613851 ("cgroup: remove synchronize_rcu() from
cgroup_diput()"), cgroup destruction path makes use of workqueue. css
freeing is performed from a work item from that point on and a later
commit, ea15f8ccdb430 ("cgroup: split cgroup destruction into two
steps"), moves css offlining to workqueue too.

As cgroup destruction isn't depended upon for memory reclaim, the
destruction work items were put on the system_wq; unfortunately, some
controller may block in the destruction path for considerable duration
while holding cgroup_mutex. As large part of destruction path is
synchronized through cgroup_mutex, when combined with high rate of
cgroup removals, this has potential to fill up system_wq's max_active
of 256.

Also, it turns out that memcg's css destruction path ends up queueing
and waiting for work items on system_wq through work_on_cpu(). If
such operation happens while system_wq is fully occupied by cgroup
destruction work items, work_on_cpu() can't make forward progress
because system_wq is full and other destruction work items on
system_wq can't make forward progress because the work item waiting
for work_on_cpu() is holding cgroup_mutex, leading to deadlock.

This can be fixed by queueing destruction work items on a separate
workqueue. This patch creates a dedicated workqueue -
cgroup_destroy_wq - for this purpose. As these work items shouldn't
have inter-dependencies and mostly serialized by cgroup_mutex anyway,
giving high concurrency level doesn't buy anything and the workqueue's
@max_active is set to 1 so that destruction work items are executed
one by one on each CPU.

Hugh Dickins: Because cgroup_init() is run before init_workqueues(),
cgroup_destroy_wq can't be allocated from cgroup_init(). Do it from a
separate core_initcall(). In the future, we probably want to reorder
so that workqueue init happens before cgroup_init().

Signed-off-by: Tejun Heo <t...@kernel.org>
Reported-by: Hugh Dickins <hu...@google.com>
Reported-by: Shawn Bohrer <shawn....@gmail.com>
Link: http://lkml.kernel.org/r/2013111122...@sbohrermbp13-local.rgmadvisors.com
Link: http://lkml.kernel.org/g/alpine.LNX.2.00.1...@eggly.anvils
Cc: sta...@vger.kernel.org # v3.9+
---
kernel/cgroup.c | 30 +++++++++++++++++++++++++++---
1 file changed, 27 insertions(+), 3 deletions(-)

diff --git a/kernel/cgroup.c b/kernel/cgroup.c
index 4c62513..a7b98ee 100644
--- a/kernel/cgroup.c
+++ b/kernel/cgroup.c
@@ -90,6 +90,14 @@ static DEFINE_MUTEX(cgroup_mutex);
static DEFINE_MUTEX(cgroup_root_mutex);

/*
+ * cgroup destruction makes heavy use of work items and there can be a lot
+ * of concurrent destructions. Use a separate workqueue so that cgroup
+ * destruction work items don't end up filling up max_active of system_wq
+ * which may lead to deadlock.
+ */
+static struct workqueue_struct *cgroup_destroy_wq;
+
+/*
* Generate an array of cgroup subsystem pointers. At boot time, this is
* populated with the built in subsystems, and modular subsystems are
* registered after that. The mutable section of this array is protected by
@@ -871,7 +879,7 @@ static void cgroup_free_rcu(struct rcu_head *head)
struct cgroup *cgrp = container_of(head, struct cgroup, rcu_head);

INIT_WORK(&cgrp->destroy_work, cgroup_free_fn);
- schedule_work(&cgrp->destroy_work);
+ queue_work(cgroup_destroy_wq, &cgrp->destroy_work);
}

static void cgroup_diput(struct dentry *dentry, struct inode *inode)
@@ -4249,7 +4257,7 @@ static void css_free_rcu_fn(struct rcu_head *rcu_head)
* css_put(). dput() requires process context which we don't have.
*/
INIT_WORK(&css->destroy_work, css_free_work_fn);
- schedule_work(&css->destroy_work);
+ queue_work(cgroup_destroy_wq, &css->destroy_work);
}

static void css_release(struct percpu_ref *ref)
@@ -4539,7 +4547,7 @@ static void css_killed_ref_fn(struct percpu_ref *ref)
container_of(ref, struct cgroup_subsys_state, refcnt);

INIT_WORK(&css->destroy_work, css_killed_work_fn);
- schedule_work(&css->destroy_work);
+ queue_work(cgroup_destroy_wq, &css->destroy_work);
}

/**
@@ -5063,6 +5071,22 @@ out:
return err;
}

+static int __init cgroup_wq_init(void)
+{
+ /*
+ * There isn't much point in executing destruction path in
+ * parallel. Good chunk is serialized with cgroup_mutex anyway.
+ * Use 1 for @max_active.
+ *
+ * We would prefer to do this in cgroup_init() above, but that
+ * is called before init_workqueues(): so leave this until after.
+ */
+ cgroup_destroy_wq = alloc_workqueue("cgroup_destroy", 0, 1);
+ BUG_ON(!cgroup_destroy_wq);
+ return 0;
+}
+core_initcall(cgroup_wq_init);
+
/*
* proc_cgroup_show()
* - Print task's cgroup paths into seq_file, one line for each hierarchy
--
1.8.4.2

William Dauchy

Nov 22, 2013, 6:00:01 PM
Hi Tejun,

On Fri, Nov 22, 2013 at 11:18 PM, Tejun Heo <t...@kernel.org> wrote:
> Just applied to cgroup/for-3.13-fixes w/ stable cc'd. Will push to
> Linus next week.

Thank you for your quick reply. Do you also have a backport for
v3.10.x already available?

Best regards,
--
William

Hugh Dickins

Nov 24, 2013, 1:30:01 PM
On Fri, 22 Nov 2013, Tejun Heo wrote:

> Hello, Hugh.
>
> I applied the following patch to cgroup/for-3.13-fixes.

Looks good, thanks a lot.

> For longer
> term, I think it'd be better to pull workqueue init before cgroup one
> but this one should be easier to backport for now.

Yes, that's the right direction, but this is the right fix for now.

Hugh

Li Zefan

Nov 24, 2013, 8:20:01 PM
Acked-by: Li Zefan <liz...@huawei.com>

Li Zefan

Nov 24, 2013, 8:30:01 PM
On 2013/11/23 6:54, William Dauchy wrote:
> Hi Tejun,
>
> On Fri, Nov 22, 2013 at 11:18 PM, Tejun Heo <t...@kernel.org> wrote:
>> Just applied to cgroup/for-3.13-fixes w/ stable cc'd. Will push to
>> Linus next week.
>
> Thank your for your quick reply. Do you also have a backport for
> v3.10.x already available?
>

I'll do this after the patch hits mainline, if Tejun doesn't plan to.

William Dauchy

Dec 2, 2013, 5:40:01 AM
Hi Li,

On Mon, Nov 25, 2013 at 2:20 AM, Li Zefan <liz...@huawei.com> wrote:
> I'll do this after the patch hits mainline, if Tejun doesn't plan to.

Do you have some news about it?
--
William

Li Zefan

Dec 2, 2013, 8:40:02 PM
On 2013/12/2 18:31, William Dauchy wrote:
> Hi Li,
>
> On Mon, Nov 25, 2013 at 2:20 AM, Li Zefan <liz...@huawei.com> wrote:
>> I'll do this after the patch hits mainline, if Tejun doesn't plan to.
>
> Do you have some news about it?
>

Tejun has already done the backport. :)

It has been included in 3.10.22, which will be released in a couple of days.

http://article.gmane.org/gmane.linux.kernel.stable/71292