a bug hunt

Barret Rhoden

Sep 30, 2015, 9:19:59 PM
to aka...@googlegroups.com
Here's a little bug I noticed. People newer to Akaros debugging might
like to hear about the solution and how I got to it.

Problem:
- If you do two ifconfigs in series, the kernel crashes. This is with
lock debugging turned on, in case you want to repeat it.

Example:
/bin/ash: can't access tty; job control turned off
/ $ ash ifconfig
Welcome to QEMU!
bind #cons -> /dev flag 1
bind #proc -> /proc flag 1
bind #srv -> /srv flag 1
bind #ip -> /net flag 2
bind #ether.0 -> /net flag 2
I am 10.0.2.15, default route 10.0.2.2
cs hasn't created #srv/cs yet, sleeping until it does...
bind #kprof -> /prof/ flag 2
ifconfig complete
/ $ ash ifconfig
Welcome to QEMU!
bind #cons -> /dev flag 1
bind #proc -> /proc flag 1
bind #srv -> /srv flag 1
bind #ip -> /net flag 2
bind #ether.0 -> /net flag 2
echo failed: Unspecified, ipifc not yet bound to device
I am 10.0.2.15, default route 10.0.2.2
HW TRAP frame at 0xffff80013aa20a50 on core 0
rax 0x0000000000000001
rbx 0xffff80013bf8e558
rcx 0x0000000000000000
rdx 0x0000000000000000
rbp 0xffff80013aa20b28
rsi 0x0000000000000000
rdi 0x0000000000000000
r8 0x0000000000000000
r9 0x0000008000000000
r10 0x0000000000000020
r11 0xffff80013a376e30
r12 0x0000000000000000
r13 0xffff80013bf8e4b0
r14 0x0000000000000000
r15 0x0000000000000000
trap 0x0000000e Page Fault
gsbs 0xffffffffc5d5f8c0
fsbs 0x00000000deadbeef
err 0x--------00000000
rip 0xffffffffc2001720
cs 0x------------0008
flag 0x0000000000010202
rsp 0xffff80013aa20b10
ss 0x------------0010

Backtrace of kernel context on Core 0:
#01 [<0xffffffffc2001720>] in spin_unlock
#02 [<0xffffffffc20393a5>] in findmount
#03 [<0xffffffffc2039be6>] in walk
#04 [<0xffffffffc203a279>] in __namec_from
#05 [<0xffffffffc2043da3>] in sysopenat
#06 [<0xffffffffc2058200>] in sys_openat
#07 [<0xffffffffc2058df9>] in syscall
#08 [<0xffffffffc2058f51>] in run_local_syscall
#09 [<0xffffffffc21053b9>] in sysenter_callwrapper
kernel panic at kern/arch/x86/trap.c:254, from core 0: Proc-ful Page
Fault in the Kernel at 0x0000000000000010!
Entering Nanwan's Dungeon on Core 0 (Ints on):
Type 'help' for a list of commands.
ROS(Core 0)>


Let's see what was running:

ROS(Core 0)> ps
PID     Name            State           Parent
-------------------------------------------------
41      cs              WAITING         25
1       busybox         WAITING         0
42      mount           RUNNING_S       25
2       /bin/ash        WAITING         1
25      ash             WAITING         2
19      cs              RUNNABLE_S      0

Looks like mount (it was RUNNING_S), though that's not a guarantee.


Let's take a look at the backtrace. You'll want to use addr2line. I
have the following bash alias set up:

alias aa="addr2line -e obj/kern/akaros-kernel-64b"

$ aa 0xffffffffc2001720
/home/brho/akaros/ros-kernel/kern/src/atomic.c:21

That is:

static void decrease_lock_depth(uint32_t coreid)
{
	per_cpu_info[coreid].lock_depth--;
}

which is called from spin_unlock (when building with lock debugging on).

void spin_unlock(spinlock_t *lock)
{
	decrease_lock_depth(lock->calling_core);
	...

And from the trapframe and PF addr, it looks like lock is null.
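Why does a null lock fault at 0x10? Here's a hedged sketch of the debug
spinlock layout (the field names and padding are my assumptions, but
the debug fields line up with the calling_core deref above and with the
"last locked at" info the monitor prints):

struct spinlock {
	volatile uint32_t rlock;	/* 0x00: the lock word itself */
	uintptr_t call_site;		/* 0x08: "last locked at" (debug) */
	uint32_t calling_core;		/* 0x10: who locked it (debug) */
	bool irq_okay;
};

/* with lock == NULL, reading lock->calling_core dereferences address
 * 0x10 - exactly the fault address in the panic message above */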

Let's check out the caller:



$ aa 0xffffffffc20393a5
/home/brho/akaros/ros-kernel/kern/src/ns/chan.c:576

That is:

int
findmount(struct chan **cp,
          struct mhead **mp, int type, int dev, struct qid qid)
{
	struct pgrp *pg;
	struct mhead *m;

	pg = current->pgrp;
	rlock(&pg->ns);
	for (m = MOUNTH(pg, qid); m; m = m->hash) {

576 is the for loop, right after the rlock. So the rlock is our
culprit.


void rlock(struct rwlock *rw_lock)
{
	if (atomic_add_not_zero(&rw_lock->nr_readers, 1))
		return;
	spin_lock(&rw_lock->lock);
	if (rw_lock->writing) {
		cv_wait_and_unlock(&rw_lock->readers);
		return;
	}
	atomic_inc(&rw_lock->nr_readers);
	spin_unlock(&rw_lock->lock);
}

Hmm, that's a little weird. rlock doesn't show up in the backtrace,
but it's pointed to by the backtrace. The faulting bit is an unlock,
but if rw_lock->lock == 0, you'd think we'd notice on the spin_lock,
not the spin_unlock.

Let's put an assert in spin_unlock and see if we get a better bt. Be
careful to put it in the correct spin_unlock (based on whether or not
spinlock debugging is on).
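
Something like this (a sketch; with lock debugging on, the unlock that
matters is the one in kern/src/atomic.c):

void spin_unlock(spinlock_t *lock)
{
	/* catch the NULL lock before we touch its debug fields */
	assert(lock);
	decrease_lock_depth(lock->calling_core);
	...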

I am 10.0.2.15, default route 10.0.2.2
kernel panic at kern/src/atomic.c:96, from core 0: assertion failed: lock
Entering Nanwan's Dungeon on Core 0 (Ints on):
Type 'help' for a list of commands.
ROS(Core 0)> bt
Stack Backtrace on Core 0:
#01 [<0xffffffffc201bf4c>] in mon_bt
#02 [<0xffffffffc201c407>] in monitor
#03 [<0xffffffffc200b92a>] in _panic
#04 [<0xffffffffc200176a>] in spin_unlock
#05 [<0xffffffffc2016656>] in cv_wait_and_unlock
#06 [<0xffffffffc20393c5>] in findmount
#07 [<0xffffffffc2039c06>] in walk
#08 [<0xffffffffc203a299>] in __namec_from
#09 [<0xffffffffc204372b>] in bindmount
#10 [<0xffffffffc2043b1a>] in sysmount
#11 [<0xffffffffc2054f71>] in sys_nmount
#12 [<0xffffffffc2058e19>] in syscall
#13 [<0xffffffffc2058f71>] in run_local_syscall
#14 [<0xffffffffc21053d9>] in sysenter_callwrapper

That gave us a little more info. What the heck? We now see
cv_wait_and_unlock() in the bt. Checking the addr2lines for findmount:

$ aa 0xffffffffc20393c5
/home/brho/akaros/ros-kernel/kern/src/ns/chan.c:576

We're still at the same place. For whatever reason, we get a better
backtrace from the monitor after a panic than from the monitor after a
PF. But look more closely. The old bt came from an openat call. This
one came from a bindmount call. So we're actually triggering on a
different syscall? If this call is earlier, how did we not PF when
lock == 0? Hmm.
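
Here's a guess at what's going on in there. A sketch of
cv_wait_and_unlock, assuming the CV carries a pointer to the lock it
pairs with (we'll see a cv->lock field soon enough):

void cv_wait_and_unlock(struct cond_var *cv)
{
	...
	/* if cv->lock was never set, this spin_unlock is the first
	 * thing to dereference it - hence PFs and failed asserts in
	 * spin_unlock, with cv_wait_and_unlock as the real caller */
	spin_unlock(cv->lock);
	...
}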

Let's see what's running now (previously it was mount, it probably
still is).

kernel panic at kern/src/atomic.c:96, from core 0: assertion failed: lock
Entering Nanwan's Dungeon on Core 0 (Ints on):
Type 'help' for a list of commands.
ROS(Core 0)> ps
PID     Name            State           Parent
-------------------------------------------------
41      cs              WAITING         25
1       busybox         WAITING         0
42      mount           WAITING         25
2       /bin/ash        WAITING         1
25      ash             WAITING         2
19      cs              RUNNING_S       0

Nope. mount is just WAITING. Looks like we're on a syscall from cs
(it is RUNNING_S), though that's not guaranteed.

Let's see what mount is up to:

ROS(Core 0)> pip 42
Lock 0xffff80013bb86a20: currently unlocked. Last locked at
[<0xffffffffc204e5a5>] in proc_yield on core 0
struct proc: 0xffff80013bb86a00
Program name: mount
PID: 42
PPID: 25
State: WAITING (0x0000000000000008)
Is not an MCP
Refcnt: 3
Flags: 0x00000000
CR3(phys): 0x0000000139eac000
Num Vcores: 0
Vcore Lists (may be in flux w/o locking):
----------------------
Online:
Bulk Preempted:
Inactive / Yielded:
Vcore 0
Vcore 1
Vcore 2
Vcore 3
Vcore 4
Vcore 5
Vcore 6
Nsec Online, up to the last offlining:
------------------------
VC 0: 1087013   VC 1: 0   VC 2: 0   VC 3: 0   VC 4: 0   VC 5: 0   VC 6: 0
Total CPU-NSEC: 1087013
Resources:
------------------------
Res type: 00, amt wanted: 00000000, amt granted: 00000000
Res type: 01, amt wanted: 00000000, amt granted: 00000000
Res type: 02, amt wanted: 00000000, amt granted: 00000000
Open Files:
FD: 00, File: 0xffff80013ffab000, File name: stdin
FD: 01, File: 0xffff80013ffab0b8, File name: stdout
FD: 02, File: 0xffff80013ffab170, File name: stderr
FD: 03, Chan pathname: #pipe./data1 ref 6, Dev: pipe, Devinfo:
qid.path: 0x0000000000000122, qid.type: 01
Children: (PID (struct proc*))


It's an SCP, and it's WAITING. Usually that means it is waiting for a
syscall to complete, and it has a pipe open. So you'd expect to see
something in db sem (list of all semaphores that have sleeping
kthreads, and which syscalls are going on):


ROS(Core 0)> db sem (I manually removed the kernel-internal ones)
All sems with waiters:

Semaphore 0xffff80013a39c0e0 has -1 signals (neg = waiters), recently
downed on core 0 with pc/frame 0xffffffffc20152e5 0xffff80013a959bd8
Kthread 0xffff80013a6e43a0 (read on fd 7), proc 19
(0xffff80013bb88600), sysc 0x00007f7fffbfcb80

Semaphore 0xffff80013bb89578 has -1 signals (neg = waiters), recently
downed on core 0 with pc/frame 0xffffffffc20152e5 0xffff80013af23de8
Kthread 0xffff80013ffa8910 (), proc 25 (0xffff80013bb89400),
sysc 0x00007f7fffbfe980

Semaphore 0xffff80013bb8a378 has -1 signals (neg = waiters), recently
downed on core 0 with pc/frame 0xffffffffc20152e5 0xffff80013af26de8
Kthread 0xffff80013ffa8828 (), proc 2 (0xffff80013bb8a200), sysc
0x00007f7fffbfe990

Semaphore 0xffff80013bb8b178 has -1 signals (neg = waiters), recently
downed on core 0 with pc/frame 0xffffffffc20152e5 0xffff80013ba50de8
Kthread 0xffff80013ffa8740 (), proc 1 (0xffff80013bb8b000), sysc
0x00007f7fffbfe950

Nothing for proc 42 (mount). So what is it waiting on? Could be some
other event, though I'm not sure what. If it was the pipe, I'd expect
something like the original 'cs' (pid 19), which is sleeping on FD 7.
Here's cs / 19's info:

ROS(Core 0)> pip 19
Lock 0xffff80013bb88620: currently unlocked. Last locked at
[<0xffffffffc204c5e9>] in proc_run_s on core 0
struct proc: 0xffff80013bb88600
Program name: cs
PID: 19
PPID: 0
State: RUNNING_S (0x0000000000000004)
Is not an MCP
Refcnt: 3
Flags: 0x00000000
CR3(phys): 0x000000013aad7000
Num Vcores: 0
Vcore Lists (may be in flux w/o locking):
----------------------
Online:
Bulk Preempted:
Inactive / Yielded:
Vcore 0
Vcore 1
Vcore 2
Vcore 3
Vcore 4
Vcore 5
Vcore 6
Nsec Online, up to the last offlining:
------------------------
VC 0: 59392831  VC 1: 0   VC 2: 0   VC 3: 0   VC 4: 0   VC 5: 0   VC 6: 0
Total CPU-NSEC: 59392831
Resources:
------------------------
Res type: 00, amt wanted: 00000001, amt granted: 00000000
Res type: 01, amt wanted: 00000000, amt granted: 00000000
Res type: 02, amt wanted: 00000000, amt granted: 00000000
Open Files:
FD: 00, File: 0xffff80013ffabb80, File name: null
FD: 01, File: 0xffff80013ffab0b8, File name: stdout
FD: 02, File: 0xffff80013ffab170, File name: stderr
FD: 03, File: 0xffff800139f620b8, File name: local
FD: 04, File: 0xffff800139f62170, File name: local
FD: 05, File: 0xffff800139f62228, File name: common
FD: 06, Chan pathname: /net/ndb ref 1, Dev: ip, Devinfo: Qndb
FD: 07, Chan pathname: #pipe./data ref 2, Dev: pipe, Devinfo:
qid.path: 0x0000000000000121, qid.type: 01
FD: 09, Chan pathname: #srv./cs ref 1, Dev: srv, Devinfo:
qid.path: 0x0000000000000002, qid.type: 01

See FD 7? That's the pipe, and that's how cs can sleep on a pipe (read
on FD 7). I'd expect mount to do that too.

But maybe mount is waiting on something else. Or maybe it's been
removed from its semaphore and its kthread is currently running (and
then panicking). The syscall just hasn't finished yet, so the process
is still WAITING for the completion event.

Side note: it'd be nice to have a way to see what kthread is
running on a given core at any moment. Something to add to the
monitor.
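
A sketch of what that monitor command could look like (cur_kthread
and name are assumed field names, and num_cores is whatever the
per-cpu array is sized by):

void print_cur_kthreads(void)
{
	for (int i = 0; i < num_cores; i++) {
		struct kthread *kth = per_cpu_info[i].cur_kthread;

		if (kth)
			printk("Core %2d: kthread %p (%s)\n", i, kth, kth->name);
	}
}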

Another interesting thing is that cs pid 19 is RUNNING_S and it has a
blocking syscall. That's because it is an SCP with a 2LS (has
pthreads). That's totally legal. It's slightly interesting since we
have two separate CS programs running, both of which have attempted to
bind themselves onto /net/ (and serve the single file 'cs'
at /net/cs). Perhaps part of the problem is having multiple mounts at
a single point, though devmnt should just ask the first CS, then the
second CS, in bind order for 'cs' (which the first CS hits on always).

So maybe that's something.

Another tidbit is what ifconfig does:

cs&
if [ ! -e "#srv/cs" ]
then
	echo "cs hasn't created #srv/cs yet, sleeping until it does..."
	until [ -e "#srv/cs" ]
	do
		usleep 1000
	done
fi
mount -a \#srv/cs /net

The if waits until #srv/cs exists, though the second time through, it
already exists. Also, cs removes #srv/cs, then creates a new one. So
if we removed the first cs's #srv/cs, and it was still running and
waiting on that pipe, what happens? And is it a potential problem that
mount could be run before/during/after the removal and addition of a
new #srv/cs?

Anyway, that srv/cs bit is speculation. Something to keep in mind, but
let's try and work with what we have.

If you printk in certain places, the second ifconfig doesn't panic, so
there's a race involved. That hints back to the #srv/cs removal.

During one of those runs, you'll see stuff like this:


Unable to open chan for mounting: No such file or directory, file does not exist
cs: mount read: file does not exist

From ps, the old CS is gone, the new one is still there. Why did the
old CS disappear? Since its file was removed by someone else, did it
then wake up, get the error, and exit? That sounds right, and you'll
see it in the code for CS. The "mount read" from up above is from CS,
before it starts to bail out.

Moving on, I did some printking to see if the cv was initialized
(which would explain the lack of a lock). The printks were noisy, but
we know the pointer value (it's deterministic).
I printk'd this: no lock for cv 0xffff80003bf69558!

Every run from a fresh boot, it was the same pointer value. So I put
in manual asserts to limit the printing to just that pointer (and to
assert it was the same pointer value as before), and nothing else
popped up. There were a couple of options. One is that the cv was
init'd, but then zeroed - perhaps a refcnting problem. The other is
that it was never init'd. The latter seems more likely at this point.
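
For reference, the filter looked something like this (a sketch; the
magic number is whatever pointer value shows up on your boot):

	/* cut the noise: only report the one deterministic CV */
	if (!cv->lock) {
		assert(cv == (struct cond_var*)0xffff80003bf69558);
		printk("no lock for cv %p!\n", cv);
	}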

To get more info, I hexdumped the area around the cv. The cv itself is
all 0s. That doesn't answer too much (it could be either not init'd or
zeroed), but it does tell us that it wasn't partially initialized.
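
The dumper can be dead simple (a sketch, not the exact code I used):

static void dump_around(void *addr, size_t before, size_t after)
{
	uint8_t *p = (uint8_t*)addr - before;
	size_t total = before + after;

	for (size_t i = 0; i < total; i += 16) {
		printk("%p: ", &p[i]);
		for (size_t j = 0; j < 16 && i + j < total; j++)
			printk("%02x ", p[i + j]);
		printk("\n");
	}
}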

Here's the hexdump:

no lock for cv 0xffff80003bf69558!
ffff80003bf694d8: 01 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 ................
ffff80003bf694e8: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 ................
...
ffff80003bf69528: 00 00 00 00 00 00 00 00 01 00 00 00 00 00 00 00 ................
ffff80003bf69538: ca 08 05 c2 ff ff ff ff 00 00 00 00 00 00 00 00 ................
ffff80003bf69548: 00 00 00 00 00 00 00 00 01 00 00 00 00 00 00 00 ................
ffff80003bf69558: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 ................
...
kernel panic at kern/src/kthread.c:653, from core 0: assertion failed: cv->lock

There's some stuff before it. I was guessing it'd be a kmalloc tag,
but wasn't sure (in retrospect, I should have looked at the type where
the CV exists; in this case, it's not kmalloc'd directly). It looks
like there are two flags (the 01s) and a pointer.

What's the pointer (0xffffffffc20508ca)? It looks like a kernel
address.

$ aa 0xffffffffc20508ca
/home/brho/classes/ros/ros-kernel/kern/src/rwlock.c:56

That pointer is the line of code in rlock that grabs the spinlock.
This might be a dead end. It's not like it's a function pointer -
we're pointing into the middle of a function. If you look at the asm:

ffffffffc20508c6: e8 c5 0c fb ff callq ffffffffc2001590 <spin_lock>
ffffffffc20508cb: 41 80 7c 24 20 00 cmpb $0x0,0x20(%r12)

08ca isn't even the next instruction (as you'd expect from a retaddr on
the stack). But it is really close, and maybe we grabbed that addr in
software (like we used to do with backtraces, where we'd subtract 1
from the retaddr).

Ah, given that the function call is a spin_lock, the memory we are
looking at is likely a spinlock. The 01 means it is locked. Another 01
could be the IRQ checker. The address is the "last locked at"
location. That actually makes a lot of sense. This is part of the
spinlock debugger (it records the last locked location inside the lock).
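
In other words, with lock debugging on, spin_lock does roughly this (a
sketch; the helper for grabbing the caller's PC is an assumption):

void spin_lock(spinlock_t *lock)
{
	__spin_lock(lock);		/* the actual acquire */
	/* debug bookkeeping: the "last locked at" info we found
	 * sitting in memory right before the all-zero CV */
	lock->call_site = get_caller_pc();
	lock->calling_core = core_id();
}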

Anyway, at this point, it feels like there's a race that printks can
muck with. We're dealing with SCPs, but a printk takes long enough that
a ksched SCP timer tick could go off and change things up (as one
option). Either way, perhaps there's a race that triggers a rarely run
block of code that creates some object and fails to init it. A lot of
the old Plan 9 code had objects that were init'd by being zeroed. We
need a function call for many of our inits, so this could be the problem.

So let's go back to the code, see where the cv came from, and look for
whether or not it was init'd, or for other suspicious stuff.

Our BT:

#03 [<0xffffffffc200ba4a>] in _panic
#04 [<0xffffffffc201690c>] in cv_wait_and_unlock
#05 [<0xffffffffc2039671>] in findmount
#06 [<0xffffffffc2039eb6>] in walk
#07 [<0xffffffffc203a549>] in __namec_from
#08 [<0xffffffffc2044073>] in sysopenat
#09 [<0xffffffffc20584e0>] in sys_openat
#10 [<0xffffffffc20590d9>] in syscall
#11 [<0xffffffffc2059231>] in run_local_syscall
#12 [<0xffffffffc2100389>] in sysenter_callwrapper

findmount was at chan.c line 576. That's right after an rlock call, so
we're probably in rlock(), right before cv_wait_and_unlock. Let's move
our assert in there:

void rlock(struct rwlock *rw_lock)
{
	if (atomic_add_not_zero(&rw_lock->nr_readers, 1))
		return;
	spin_lock(&rw_lock->lock);
	if (rw_lock->writing) {
		assert(rw_lock->readers.lock);
		cv_wait_and_unlock(&rw_lock->readers);
		return;
	}
	atomic_inc(&rw_lock->nr_readers);
	spin_unlock(&rw_lock->lock);
}

Note we only try the cv_wait_and_unlock if 'writing' was set. That
should mean another kthread grabbed the rwlock, and *that* is
something that could be based on a race between multiple threads
(which is a piece of evidence).

As expected, the assert fails:

kernel panic at kern/src/rwlock.c:58, from core 0: assertion failed: rw_lock->readers.lock

#03 [<0xffffffffc200ba4a>] in _panic
#04 [<0xffffffffc205092a>] in rlock
#05 [<0xffffffffc2039671>] in findmount
#06 [<0xffffffffc2039eb6>] in walk
#07 [<0xffffffffc203a549>] in __namec_from
#08 [<0xffffffffc2044073>] in sysopenat
#09 [<0xffffffffc2058510>] in sys_openat
#10 [<0xffffffffc2059109>] in syscall
#11 [<0xffffffffc2059261>] in run_local_syscall
#12 [<0xffffffffc21003b9>] in sysenter_callwrapper

Also note the location of the panic is right after the address we saw
in the hexdump for the spinlock's 'last locked' location. This makes
sense, since we just locked
spin_lock(&rw_lock->lock);

and the CV we hexdumped is embedded in the rwlock:

struct rwlock {
	spinlock_t lock;
	atomic_t nr_readers;
	bool writing;
	struct cond_var readers;
	struct cond_var writers;
};
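
Lining that struct up against the hexdump gives a consistent story (a
hedged read; exact offsets depend on padding):

/* assuming the rwlock starts at ...530:
 *   ...530: lock.rlock     = 01                  (rwlock's spinlock held)
 *   ...538: lock.call_site = 0xffffffffc20508ca  (rlock's spin_lock)
 *   ...548: nr_readers     = 0                   (we took the slow path)
 *   ...550: writing        = 01                  (why we hit cv_wait)
 *   ...558: readers CV     = all zeros           (never initialized)
 */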


Anyway, the basic printk doesn't show an rwinit called for the rwlock
that we later panic on. Let's just cut to the chase. From the BT, we
know we're in findmount, right at:

	pg = current->pgrp;
	rlock(&pg->ns);


So the pgrp is probably not initializing its rwlock. After hopping
around with Gtags a bit, we find:

struct pgrp *newpgrp(void)
{
	struct pgrp *p;

	p = kzmalloc(sizeof(struct pgrp), KMALLOC_WAIT);
	kref_init(&p->ref, freepgrp, 1);
	p->pgrpid = NEXT_ID(pgrpid);
	p->progmode = 0644;
	qlock_init(&p->debug);
	qlock_init(&p->nsh);
	return p;
}

It's kzmalloc'd, with some init, but there's no rwinit for ns. There's
our problem.
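
The fix itself is a one-liner in newpgrp() (sketched; rwinit is the
rwlock init routine):

	qlock_init(&p->debug);
	qlock_init(&p->nsh);
	rwinit(&p->ns);		/* the missing init: sets up the lock and CVs */
	return p;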

With the basic fix in place, and a double-check to make sure the rest
of the pgrp struct is initialized, we can now at least attempt to
ifconfig multiple times without panicking the kernel.

The issue in the end was an uninitialized struct. Those usually cause
problems immediately, but this one needed contention on the lock to
show up, which was hard to catch with excessive prints. Due to the
raciness, it also felt like refcnting, removals from srv, or other
error conditions were the problem. But just working backwards with
asserts and conditional prints keyed on the pointer value let us work
it all out.

In retrospect, once I saw a cv without a lock field, I should have
pushed harder on the uninitialized option. After getting far enough
through the BT to see the rlock, it took about a minute to confirm the
problem. But I guess that's what happens when you break for dinner and
come back to it a day or so later.

Barret
