A uthread/pthread bug hunt

Barret Rhoden

Oct 20, 2015, 2:15:42 PM
to aka...@googlegroups.com
Hi -

(this bug is fixed, i'm emailing it out for those interested in such
things).


I had an app that uses eventfd and epoll, and it can trigger:

uthread.c:621: run_uthread: Assertion `uthread->state == 2'
failed.

and

pthread.c:246: pth_sched_entry: Assertion `new_thread->state ==
2' failed.

The first one is usually a sign of running a uthread more than once,
concurrently (2LS / parlib bug). The second is a similar catch.

When one fails for my app, they usually both do, which probably means
we're dealing with multiple cores (o/w one assert would kill the
process, and we couldn't get to the second one). It's also racy.

I poked around in uth_blockon_evqs to see if there is anything
obviously wrong that could lead to this. Like waking a thread multiple
times, etc. Nothing obvious, and a few printfs in the area didn't help.

After failure, I printed some pthread 2LS debugging info. Sometimes
uthreads get put onto the same list multiple times, or the lists
otherwise get corrupted. Result:

uth 0x3622300, type 2 (debugging crap from the uth assert location)
uth 0x3623700, type 2

PTH 0x3622300, state 8 (debugging crap from the pth assert location)
ready q
PTH 0x3623700, state 8 (BLK_MUTEX)
PTH 0x3622300, state 8
active q
PTH 0x3620f00, state 2 (RUNNABLE)
PTH 0x3621e00, state 8
PTH 0x3623700, state 8
PTH 0x3622300, state 8
[user] pthread.c:260, vcore 3, Assertion failed: 0

The same uth/pth pops up a few times. 3622300 should be on the active
list, but not the ready queue. that's messed up. likewise, anything
on the ready q should be pth state 2 (RUNNABLE), not 8.

likewise, anything on the active q should be PTH_RUNNING (3). the
latter is actually a minor bug in pthreads, where we don't set that
state at any point (except for thread0, once). it's not actually
critical that we do that, but most things in the pthread 2LS are for
debugging anyways, so we ought to do it.
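
The two invariants above (everything on the ready q is state 2, everything on the active q is state 3) can be written down as a small checker. This is only a sketch against a standalone toy model, not the Akaros pthread code: struct toy_pth, struct toy_queue, the TOY_* names, and toy_count_bad() are all hypothetical. Only the numeric state values (2, 3, 8) come from the dump and discussion above.

```c
#include <assert.h>
#include <stddef.h>
#include <sys/queue.h>

/* Numeric values are the ones quoted in the thread; the names are made up. */
enum toy_state { TOY_RUNNABLE = 2, TOY_RUNNING = 3, TOY_BLK_MUTEX = 8 };

/* Hypothetical stand-in for a pthread TCB: a state plus one list link. */
struct toy_pth {
	int state;
	TAILQ_ENTRY(toy_pth) link;
};
TAILQ_HEAD(toy_queue, toy_pth);

/* Count the threads on q whose state differs from what the queue expects. */
static int toy_count_bad(struct toy_queue *q, int expected)
{
	struct toy_pth *p;
	int bad = 0;

	assert(q != NULL);
	TAILQ_FOREACH(p, q, link) {
		if (p->state != expected)
			bad++;
	}
	return bad;
}
```

Applied to a ready q like the one in the dump (two entries, both state 8) with an expected state of 2, this would report two bad entries. Locking is left out of the sketch; a real check would need to hold whatever lock protects the queues.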

So it looks like we screwed up some of our 2LS callbacks, causing a
thread to be added to the same list a couple of times.

The most likely reason for this is pth_thread_runnable was called on
the same uthread repeatedly. though we check the state in there too,
so if someone called it twice, we should have had a printf at least
(which should be an assert/panic, really). perhaps someone is both
mucking with the pth->state (via the has_blocked callback) and calling
runnable around the same time. that could explain why the ready q has
entries that are not state 2 (runnable), though since it's racy, i'd
expect to have errored out once in a while.

Checked for that. It doesn't appear that pth_thread_runnable
is adding a pthread to the ready q that is already on the q.

On a hunch, let me make sure that it's not on the active queue
at that point. By the time we get to thread_runnable, it should have
been removed from the active q a long time ago.

While this test is running, I can already see the bug. The
only places we're removed from the active q are thread_paused,
thread_blockon_sysc, thread_refl_fault, and generic_yield. However,
pth_thread_has_blocked does *not* remove us from the active q.

The test (checking the active q during pth_thread_runnable)
quickly confirms the problem.
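
A check of this kind might look roughly like the following. It is a sketch on a toy model, not the actual test: toy_on_queue() and toy_thread_runnable() are hypothetical names, and the model assumes each thread carries a single link field shared by both queues, so enqueueing a thread that is still linked elsewhere would corrupt the lists.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <sys/queue.h>

/* Hypothetical stand-in for a pthread TCB with one shared list link. */
struct toy_pth {
	int id;
	TAILQ_ENTRY(toy_pth) link;
};
TAILQ_HEAD(toy_queue, toy_pth);

/* Linear scan: true if p is currently linked on q. */
static bool toy_on_queue(struct toy_queue *q, struct toy_pth *p)
{
	struct toy_pth *i;

	TAILQ_FOREACH(i, q, link) {
		if (i == p)
			return true;
	}
	return false;
}

/*
 * Model of the runnable callback with the extra check. If the thread is
 * still on the active q, refuse to enqueue it and return false. Otherwise
 * put it on the tail of the ready q and return true.
 */
static bool toy_thread_runnable(struct toy_queue *active,
                                struct toy_queue *ready,
                                struct toy_pth *p)
{
	assert(p != NULL);
	if (toy_on_queue(active, p))
		return false;
	TAILQ_INSERT_TAIL(ready, p, link);
	return true;
}
```

The sketch returns false so the caller can decide what to do; for a bug hunt, asserting or panicking at that point is probably more useful. The scan is linear in the length of the active q, which is acceptable for a debugging check.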

With that change (removing the pth from the active q during
has_blocked), the problem went away.
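
In toy form, the shape of that change is sketched below. This is not the actual patch: struct toy_sched and the toy_* functions are hypothetical, locking is omitted, and the model assumes one link field per thread shared by both queues. The numeric state values are the ones mentioned earlier in this mail. The sketch also sets the running state when a thread goes onto the active q, per the minor bug noted above.

```c
#include <assert.h>
#include <stddef.h>
#include <sys/queue.h>

/* Numeric values are the ones quoted in the thread; the names are made up. */
enum toy_state { TOY_RUNNABLE = 2, TOY_RUNNING = 3, TOY_BLK_MUTEX = 8 };

struct toy_pth {
	int state;
	TAILQ_ENTRY(toy_pth) link;	/* one link, shared by both queues */
};
TAILQ_HEAD(toy_queue, toy_pth);

struct toy_sched {
	struct toy_queue ready;
	struct toy_queue active;
	int nr_ready;
	int nr_active;
};

static void toy_sched_init(struct toy_sched *s)
{
	TAILQ_INIT(&s->ready);
	TAILQ_INIT(&s->active);
	s->nr_ready = 0;
	s->nr_active = 0;
}

/* Start running p: it goes onto the active q and is marked running. */
static void toy_run(struct toy_sched *s, struct toy_pth *p)
{
	p->state = TOY_RUNNING;
	TAILQ_INSERT_TAIL(&s->active, p, link);
	s->nr_active++;
}

/* p blocked (e.g. on a mutex): take it off the active q. This removal is
 * the step that was missing from the has_blocked path. */
static void toy_has_blocked(struct toy_sched *s, struct toy_pth *p)
{
	TAILQ_REMOVE(&s->active, p, link);
	s->nr_active--;
	p->state = TOY_BLK_MUTEX;
}

/* p was woken: it is on neither queue now, so the enqueue is safe. */
static void toy_runnable(struct toy_sched *s, struct toy_pth *p)
{
	assert(p->state != TOY_RUNNING);
	p->state = TOY_RUNNABLE;
	TAILQ_INSERT_TAIL(&s->ready, p, link);
	s->nr_ready++;
}
```

Without the TAILQ_REMOVE in toy_has_blocked(), the later TAILQ_INSERT_TAIL in toy_runnable() would overwrite the link of a thread that is still on the active q, which is one plausible way to end up with the same thread showing up on both lists.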

Barret

Kevin Klues

Oct 20, 2015, 2:22:41 PM
to aka...@googlegroups.com
We really need to backport the stuff from parlib-on-linux

https://github.com/klueska/upthread/blob/master/src/gq/upthread.c#L164
https://github.com/klueska/upthread/blob/master/src/gq/upthread.c#L325

--
~Kevin

Barret Rhoden

Oct 20, 2015, 2:26:51 PM
to aka...@googlegroups.com
On 2015-10-20 at 11:22 Kevin Klues <klu...@gmail.com> wrote:
> We really need to backport the stuff from parlib-on-linux
>
> https://github.com/klueska/upthread/blob/master/src/gq/upthread.c#L164
> https://github.com/klueska/upthread/blob/master/src/gq/upthread.c#L325

Yes, but I don't see how that would have helped in this case. The
point of the active queue is to catch bugs and to force our uthread
interfaces to accommodate schedulers that may want to do odd things.

Kevin Klues

Oct 20, 2015, 2:29:07 PM
to aka...@googlegroups.com
> Yes, but I don't see how that would have helped in this case. The
> point of the active queue is to catch bugs and to force our uthread
> interfaces to accommodate schedulers that may want to do odd things.

Well, for this specific case, I had already run into and solved this
bug in that port. This line links to it:

https://github.com/klueska/upthread/blob/master/src/gq/upthread.c#L164

Kevin Klues

Oct 20, 2015, 2:33:16 PM
to aka...@googlegroups.com
> Well, for this specific case, I had already run into and solved this
> bug in that port. This line links to it:
>
> https://github.com/klueska/upthread/blob/master/src/gq/upthread.c#L164

i.e. I call __upthread_generic_yield() (like the others you
mentioned), which does the removal from the active queue.