Sleeping thread held mutex in vm_pageout

Ryan Stone

unread,

Jan 19, 2015, 5:56:36 PM1/19/15

to freebsd...@freebsd.org

I recently had a system where a DIMM failed and the OOM killer was
constantly kicking in due to a memory-hungry daemon being constantly
restarted. This ended up triggering a race condition in the OOM
killer leading to this panic:

Sleeping thread (tid 100075, pid 8) owns a non-sleepable lock
sched_switch() at 0xffffffff8048386d = sched_switch+0x16d
mi_switch() at 0xffffffff80469dd6 = mi_switch+0x186
sleepq_wait() at 0xffffffff80499204 = sleepq_wait+0x44
__lockmgr_args() at 0xffffffff8044b88b = __lockmgr_args+0x67b
vop_stdlock() at 0xffffffff804d3689 = vop_stdlock+0x39
---Type <return> to continue, or q <return> to quit---
VOP_LOCK1_APV() at 0xffffffff8069da42 = VOP_LOCK1_APV+0x52
_vn_lock() at 0xffffffff804ed627 = _vn_lock+0x47
vm_object_deallocate() at 0xffffffff8061eef3 = vm_object_deallocate+0x203
vm_map_entry_deallocate() at 0xffffffff80616d2c = vm_map_entry_deallocate+0x4c
vm_map_process_deferred() at 0xffffffff80616d62 = vm_map_process_deferred+0x32
vm_map_remove() at 0xffffffff806183ff = vm_map_remove+0x6f
vmspace_free() at 0xffffffff80619206 = vmspace_free+0x56
vm_pageout_oom() at 0xffffffff806230d1 = vm_pageout_oom+0x181
vm_pageout() at 0xffffffff8062410b = vm_pageout+0x90b
fork_exit() at 0xffffffff8043a382 = fork_exit+0x112
fork_trampoline() at 0xffffffff8063385e = fork_trampoline+0xe
--- trap 0, rip = 0, rsp = 0xffffff80c3be1d00, rbp = 0 ---
panic: sleeping thread
cpuid = 5
curthread = grep/grep (82989/100544)
cpu_ticks = 1848294656444
KDB: stack backtrace:
db_trace_self_wrapper() at 0xffffffff801e52ba = db_trace_self_wrapper+0x2a
panic() at 0xffffffff80461608 = panic+0x228
propagate_priority() at 0xffffffff8049cbde = propagate_priority+0x15e
turnstile_wait() at 0xffffffff8049d278 = turnstile_wait+0x1b8
_mtx_lock_sleep() at 0xffffffff80451af1 = _mtx_lock_sleep+0xf1
---Type <return> to continue, or q <return> to quit---
_mtx_lock_flags() at 0xffffffff80451c75 = _mtx_lock_flags+0x75
exit1() at 0xffffffff804367de = exit1+0x36e
sys_exit() at 0xffffffff8043731e = sys_exit+0xe
syscallenter() at 0xffffffff8049b324 = syscallenter+0x104
syscall() at 0xffffffff80649bfc = syscall+0x4c
Xfast_syscall() at 0xffffffff806335f2 = Xfast_syscall+0xe2
--- syscall (1, FreeBSD ELF64, sys_exit), rip = 0x300a2df9c, rsp =
0x7ffffffd40c8, rbp = 0x7ffffffd40e0 ---
Uptime: 7m19s

The root cause is that vm_pageout_oom() acquires a reference on a
process's vmspace while holding its PROC_LOCK(), then the process
exited. This left vm_pageout_oom() holding the only reference on the
vmspace, so when it dropped the reference it called into
vm_map_remove() and wound up sleeping while still holding the
PROC_LOCK(). This was under FreeBSD 8 but the code in head does not
seem to have changed here.

I'm not quite familiar with the lock mechanisms here so I'm not sure
how to fix it. Does vm_pageout_oom() need to _PHOLD() the process
while holding the PROC_LOCK(), then drop the lock, then acquire the
vmspace reference? It appears that's how other places that call
vmspace_acquire_ref() work.
_______________________________________________
freebsd...@freebsd.org mailing list
http://lists.freebsd.org/mailman/listinfo/freebsd-hackers
To unsubscribe, send any mail to "freebsd-hacke...@freebsd.org"

Konstantin Belousov

unread,

Jan 20, 2015, 3:32:33 AM1/20/15

to Ryan Stone, freebsd...@freebsd.org

Well, the root cause is that the vmspace reference is dropped while owning
the process lock (a mutex).

>
> I'm not quite familiar with the lock mechanisms here so I'm not sure
> how to fix it. Does vm_pageout_oom() need to _PHOLD() the process
> while holding the PROC_LOCK(), then drop the lock, then acquire the
> vmspace reference? It appears that's how other places that call
> vmspace_acquire_ref() work.

Yes, I think it is enough to keep a hold ref on the big process instead
of keeping it locked. This should also allow to change trylock for
the next iteration process to plain lock. On the other hand, it seems
reasonable to keep trylock for vm_map locking, since in OOM situation
usually some processes are stuck waiting for page while maps are locked.

Holding the process lock for bigproc prevents not only exit, but also
execve from executing while oom loop selected the victim. This makes
it possible for a race where large process, selected for oom kill,
performs exec meantime and becoming small process, and then being killed
at the end of oom loop. I think it is acceptable.

diff --git a/sys/vm/vm_pageout.c b/sys/vm/vm_pageout.c
index ca9d7f9..d9f28c3 100644
--- a/sys/vm/vm_pageout.c
+++ b/sys/vm/vm_pageout.c
@@ -1516,8 +1516,8 @@ vm_pageout_oom(int shortage)
FOREACH_PROC_IN_SYSTEM(p) {
int breakout;

- if (PROC_TRYLOCK(p) == 0)
- continue;
+ PROC_LOCK(p);
+
/*
* If this is a system, protected or killed process, skip it.
*/
@@ -1557,11 +1557,14 @@ vm_pageout_oom(int shortage)
PROC_UNLOCK(p);
continue;
}
+ _PHOLD(p);
if (!vm_map_trylock_read(&vm->vm_map)) {
- vmspace_free(vm);
+ _PRELE(p);
PROC_UNLOCK(p);
+ vmspace_free(vm);
continue;
}
+ PROC_UNLOCK(p);
size = vmspace_swap_count(vm);
vm_map_unlock_read(&vm->vm_map);
if (shortage == VM_OOM_MEM)
@@ -1573,16 +1576,19 @@ vm_pageout_oom(int shortage)
*/
if (size > bigsize) {
if (bigproc != NULL)
- PROC_UNLOCK(bigproc);
+ PRELE(bigproc);
bigproc = p;
bigsize = size;
- } else
- PROC_UNLOCK(p);
+ } else {
+ PRELE(p);
+ }
}
sx_sunlock(&allproc_lock);
if (bigproc != NULL) {
+ PROC_LOCK(bigproc);
killproc(bigproc, "out of swap space");
sched_nice(bigproc, PRIO_MIN);
+ _PRELE(bigproc);
PROC_UNLOCK(bigproc);
wakeup(&vm_cnt.v_free_count);

Ryan Stone

unread,

Jan 21, 2015, 9:05:04 PM1/21/15

to Konstantin Belousov, freebsd...@freebsd.org

Thanks for the patch. I tried it out this afternoon and got the
following panic:

panic: PHOLD of exiting process^M^M
cpuid = 9^M^M
curthread = pagedaemon/pagedaemon (8/100154)^M^M
cpu_ticks = 979566002403^M^M
KDB: stack backtrace:^M^M
db_trace_self_wrapper() at 0xffffffff801e672a = db_trace_self_wrapper+0x2a^M^M
panic() at 0xffffffff80480818 = panic+0x228^M^M
vm_pageout_oom() at 0xffffffff80669a9b = vm_pageout_oom+0x58b^M^M
vm_pageout() at 0xffffffff8066aa35 = vm_pageout+0x975^M^M
fork_exit() at 0xffffffff80454ada = fork_exit+0x12a^M^M
fork_trampoline() at 0xffffffff8067b37e = fork_trampoline+0xe^M^M

I believe that we need to check for P_WEXIT before considering p as a
candidate to be killed:

diff --git a/sys/vm/vm_pageout.c b/sys/vm/vm_pageout.c
index ca9d7f9..64d4277 100644
--- a/sys/vm/vm_pageout.c
+++ b/sys/vm/vm_pageout.c
@@ -1516,13 +1516,13 @@ vm_pageout_oom(int shortage)

FOREACH_PROC_IN_SYSTEM(p) {
int breakout;

- if (PROC_TRYLOCK(p) == 0)
- continue;
+ PROC_LOCK(p);

/*
* If this is a system, protected or killed process, skip it.
*/

if (p->p_state != PRS_NORMAL ||
- (p->p_flag & (P_INEXEC | P_PROTECTED | P_SYSTEM)) ||
+ (p->p_flag &
+ (P_INEXEC | P_PROTECTED | P_SYSTEM | P_WEXIT)) ||
(p->p_pid == 1) || P_KILLED(p) ||
((p->p_pid < 48) && (swap_pager_avail != 0))) {
PROC_UNLOCK(p);

@@ -1557,11 +1557,14 @@ vm_pageout_oom(int shortage)
PROC_UNLOCK(p);
continue;
}
+ _PHOLD(p);
if (!vm_map_trylock_read(&vm->vm_map)) {
- vmspace_free(vm);
+ _PRELE(p);
PROC_UNLOCK(p);
+ vmspace_free(vm);
continue;
}
+ PROC_UNLOCK(p);
size = vmspace_swap_count(vm);
vm_map_unlock_read(&vm->vm_map);
if (shortage == VM_OOM_MEM)

@@ -1573,16 +1576,18 @@ vm_pageout_oom(int shortage)

*/
if (size > bigsize) {
if (bigproc != NULL)
- PROC_UNLOCK(bigproc);
+ PRELE(bigproc);
bigproc = p;
bigsize = size;

} else
- PROC_UNLOCK(p);
+ PRELE(p);

Konstantin Belousov

unread,

Jan 22, 2015, 3:35:31 AM1/22/15

to Ryan Stone, freebsd...@freebsd.org

On Wed, Jan 21, 2015 at 09:04:51PM -0500, Ryan Stone wrote:
> Thanks for the patch. I tried it out this afternoon and got the
> following panic:
>
> panic: PHOLD of exiting process^M^M
> cpuid = 9^M^M
> curthread = pagedaemon/pagedaemon (8/100154)^M^M
> cpu_ticks = 979566002403^M^M
> KDB: stack backtrace:^M^M
> db_trace_self_wrapper() at 0xffffffff801e672a = db_trace_self_wrapper+0x2a^M^M
> panic() at 0xffffffff80480818 = panic+0x228^M^M
> vm_pageout_oom() at 0xffffffff80669a9b = vm_pageout_oom+0x58b^M^M
> vm_pageout() at 0xffffffff8066aa35 = vm_pageout+0x975^M^M
> fork_exit() at 0xffffffff80454ada = fork_exit+0x12a^M^M
> fork_trampoline() at 0xffffffff8067b37e = fork_trampoline+0xe^M^M
>
>
> I believe that we need to check for P_WEXIT before considering p as a
> candidate to be killed:

Sure, you are right. I also took opportunity to reformat the condition
and to remove excessive () there.

diff --git a/sys/vm/vm_pageout.c b/sys/vm/vm_pageout.c
index ca9d7f9..d2e0483 100644
--- a/sys/vm/vm_pageout.c
+++ b/sys/vm/vm_pageout.c
@@ -1516,15 +1542,15 @@ vm_pageout_oom(int shortage)

FOREACH_PROC_IN_SYSTEM(p) {
int breakout;

- if (PROC_TRYLOCK(p) == 0)
- continue;
+ PROC_LOCK(p);

+

/*
* If this is a system, protected or killed process, skip it.
*/

- if (p->p_state != PRS_NORMAL ||

- (p->p_flag & (P_INEXEC | P_PROTECTED | P_SYSTEM)) ||

- (p->p_pid == 1) || P_KILLED(p) ||
- ((p->p_pid < 48) && (swap_pager_avail != 0))) {
+ if (p->p_state != PRS_NORMAL || (p->p_flag & (P_INEXEC |
+ P_PROTECTED | P_SYSTEM | P_WEXIT)) != 0 ||
+ p->p_pid == 1 || P_KILLED(p) ||
+ (p->p_pid < 48 && swap_pager_avail != 0)) {
PROC_UNLOCK(p);
continue;
}
@@ -1557,11 +1583,14 @@ vm_pageout_oom(int shortage)

PROC_UNLOCK(p);
continue;
}
+ _PHOLD(p);
if (!vm_map_trylock_read(&vm->vm_map)) {
- vmspace_free(vm);
+ _PRELE(p);
PROC_UNLOCK(p);
+ vmspace_free(vm);
continue;
}
+ PROC_UNLOCK(p);
size = vmspace_swap_count(vm);
vm_map_unlock_read(&vm->vm_map);
if (shortage == VM_OOM_MEM)

@@ -1573,16 +1602,19 @@ vm_pageout_oom(int shortage)

*/
if (size > bigsize) {
if (bigproc != NULL)
- PROC_UNLOCK(bigproc);
+ PRELE(bigproc);
bigproc = p;
bigsize = size;

- } else
- PROC_UNLOCK(p);
+ } else {
+ PRELE(p);
+ }
}

Ryan Stone

unread,

Jan 22, 2015, 7:30:33 PM1/22/15

to Konstantin Belousov, freebsd...@freebsd.org

Other than the redundant braces on the else below, this looks good. I
tested it today and saw no further problems.

I do think that a WITNESS_WARN() in vmspace_free() would be
appropriate. I'll give that a sanity here to make sure that nothing
explodes and if everything looks ok, I'll commit it.

> @@ -1573,16 +1602,19 @@ vm_pageout_oom(int shortage)
> */
> if (size > bigsize) {
> if (bigproc != NULL)
> - PROC_UNLOCK(bigproc);
> + PRELE(bigproc);
> bigproc = p;
> bigsize = size;
> - } else
> - PROC_UNLOCK(p);
> + } else {
> + PRELE(p);
> + }

Konstantin Belousov

unread,

Jan 23, 2015, 3:37:12 AM1/23/15

to Ryan Stone, freebsd...@freebsd.org

On Thu, Jan 22, 2015 at 07:30:19PM -0500, Ryan Stone wrote:
> Other than the redundant braces on the else below, this looks good. I
> tested it today and saw no further problems.

The (possibly implicit) style rule is to have {} around both branches
in if () when one branch requires {}.

>
> I do think that a WITNESS_WARN() in vmspace_free() would be
> appropriate. I'll give that a sanity here to make sure that nothing
> explodes and if everything looks ok, I'll commit it.

Ok.

Ed Maste

unread,

Jan 23, 2015, 10:24:51 AM1/23/15

to Konstantin Belousov, freebsd...@freebsd.org, Ryan Stone

On 23 January 2015 at 03:36, Konstantin Belousov <kost...@gmail.com> wrote:
> On Thu, Jan 22, 2015 at 07:30:19PM -0500, Ryan Stone wrote:
>> Other than the redundant braces on the else below, this looks good. I
>> tested it today and saw no further problems.
> The (possibly implicit) style rule is to have {} around both branches
> in if () when one branch requires {}.

Unfortunately style(9) presents the opposite case as an example:

Closing and opening braces go on the same line as the else. Braces that
are not necessary may be left out.

if (test)
stmt;
else if (bar) {
stmt;
stmt;
} else
stmt;

It does say that the braces "may be" left out though, and I much
prefer the guideline as you state it.

Ryan Stone

unread,

Jan 23, 2015, 7:30:19 PM1/23/15

to Ed Maste, Konstantin Belousov, freebsd...@freebsd.org

On Fri, Jan 23, 2015 at 10:24 AM, Ed Maste <ema...@freebsd.org> wrote:
> It does say that the braces "may be" left out though, and I much
> prefer the guideline as you state it.

That works for me.

The WITNESS_WARN() survived a buildworld, so I'll commit that once kib
gets the vm_pageout_oom() fix in

Konstantin Belousov

unread,

Jan 24, 2015, 3:25:12 AM1/24/15

to Ryan Stone, freebsd...@freebsd.org, Ed Maste

On Fri, Jan 23, 2015 at 07:30:06PM -0500, Ryan Stone wrote:
> On Fri, Jan 23, 2015 at 10:24 AM, Ed Maste <ema...@freebsd.org> wrote:
> > It does say that the braces "may be" left out though, and I much
> > prefer the guideline as you state it.
>
> That works for me.
>
> The WITNESS_WARN() survived a buildworld, so I'll commit that once kib
> gets the vm_pageout_oom() fix in

I thought that you intend to commit the fix ? This is what I read from
your response to the latest patch, and I replied that I agree.

Sleeping thread held mutex in vm_pageout_oom()

Ryan Stone

Konstantin Belousov

Ryan Stone

Konstantin Belousov

Ryan Stone

Konstantin Belousov

Ed Maste

Ryan Stone

Konstantin Belousov