"Don't use an empty path for 9p mounts. Specify /."
But this breaks the 9p server I use, which I downloaded from
github.com/rminnich/go9p. I've just been reverting this commit locally
so that I can continue to use 9p. What is the long-term solution here?
--- FAIL: TestLookupPort (1.64 seconds)
port_test.go:50: LookupPort("udp", "echo") = 0, write /net/cs: cs:
can't translate service; want 7
port_test.go:50: LookupPort("udp", "tftp") = 0, write /net/cs: cs:
can't translate service; want 69
port_test.go:50: LookupPort("udp", "bootpc") = 0, write /net/cs: cs:
can't translate service; want 68
port_test.go:50: LookupPort("udp", "bootps") = 0, write /net/cs: cs:
can't translate service; want 67
port_test.go:50: LookupPort("udp", "domain") = 0, write /net/cs: cs:
can't translate service; want 53
port_test.go:50: LookupPort("udp", "ntp") = 0, write /net/cs: cs:
can't translate service; want 123
port_test.go:50: LookupPort("udp", "snmp") = 0, write /net/cs: cs:
can't translate service; want 161
port_test.go:50: LookupPort("udp", "syslog") = 0, write /net/cs: cs:
can't translate service; want 514
FAIL
FAIL net 14.852s
On Mon, Jul 27, 2015 at 9:53 PM, ron minnich <rmin...@gmail.com> wrote:
> Figure out why go 9p failed in that case as it should not. But for now we
> can revert that cl ... My go 9p may end up only being used by akaros.
Ron patched the upstream go9p server. Updating it on my workstation
was a minor pain in the ass. Things like go build and go get didn't
seem to do the trick. I ended up doing:
(set/export GOPATH)
$ go get -u github.com/rminnich/go9p/ufs
$ go install github.com/rminnich/go9p/ufs
I cd'd into src/github.com/rminnich/go9p and verified, before and after
the -u, that the git repo was updated.
With that fix applied, the new ufs should work with the current akaros.
I use something like:
$ ufs -addr=":1025" -akaros=true \
-root=/home/brho/classes/ros/ros-kernel/mnt/rootfs/ &
(can't use ~/ in the root path btw, though $HOME should be fine.)
Barret
My guess is that CS is failing ("can't translate service") in iplookup(),
probably due to ipserv(), or maybe ndbgetvalue() (user/ndblib).
I'll sync up with you later today. =)
Barret
If you were seeing the bug where you couldn't attach to a particular
root (which the "/" change in Akaros helped expose), you should just
need to have the commit:
f34ad4af "Constrain clients attaching to ufs to the ufs.Root."
in the git log.
Barret
The (now updated) script I'm referring to is here:
https://github.com/klueska/go-akaros/blob/akaros/misc/akaros/bin/go-akaros-9pserver.sh
Kevin
The cs failure could be due to some new bug, or it could be that the SCP 2LS
stuff and threading in CS opened up new paths or new races or something.
For processes that are stuck on wait, what does the pip of their children
look like? There could be some bugs with how close works or
something. I also fixed what looked like a bug in devpipe. If that's
not closing correctly, there could be issues where a parent is stuck
sleeping on a child's pipe.
I updated pip btw, so all of the FDs should be in the same table.
barret
var copyError error
for _ = range c.goroutine {
if err := <-c.errch; err != nil && copyError == nil {
copyError = err
}
}
Where it's trying to read errors out of a channel.
The relevant code is here:
https://github.com/klueska/go-akaros/blob/akaros/src/pkg/os/exec/exec.go#L360
Still digging...
--
~Kevin
git bisect narrowed it down to this commit:
commit f0f7ff9c68c90722ce2960f4cb0b1fa8ffc08969
Author: Barret Rhoden <br...@cs.berkeley.edu>
Date: Fri Jul 24 11:55:43 2015 -0400
Implement 9ns FD functions with fd tables
Now the 9ns and VFS share the FD table, instead of using separate tables
(e.g. the fgrp).
I'll keep digging, but if you're online, Barret, maybe you have some
hints. I'm going to start with the difference in fdclose.
--
~Kevin
The issue was when duping FDs: fdtochan increfs, and insert_obj_fdt will
incref when it succeeds (it stores a ref in the FD table), but regardless
of whether or not it succeeded, we need to drop the ref from fdtochan.
Here's the fix, which I amended in origin/net.
This also seems to have fixed the CS bug (cs also uses pipes, though I'm
not sure if that has anything to do with it).
diff --git a/kern/src/ns/sysfile.c b/kern/src/ns/sysfile.c
index 4767e7f9c9a9..fc15db57f9b5 100644
--- a/kern/src/ns/sysfile.c
+++ b/kern/src/ns/sysfile.c
@@ -237,10 +237,11 @@ int sys_dup_to(struct proc *from_proc, unsigned int from_fd,
error(Eperm);
}
ret = insert_obj_fdt(&to_proc->open_files, c, to_fd, 0, TRUE, FALSE);
- if (ret < 0) {
- cclose(c);
+ /* drop the ref from fdtochan. if insert succeeded, there is one other ref
+ * stored in the FDT */
+ cclose(c);
+ if (ret < 0)
error("Can't insert FD %d into FDG", to_fd);
- }
poperror();
return 0;
}
Barret
I'm still seeing the cs error though...
I can get the cs error to go away if I revert the commit about making
it a multithreaded SCP.
What was the output of the net test that made you think it was working?
Did you make sure to run make testclean and rebuild cs since the change?
Kevin
Sweet. Thanks for tracking it down to a commit and a problem case.
> I'm still seeing the cs error though...
> I can get the cs error to go away if i revert the commit about making
> it a multithreaded SCP.
> What was the output of the net test, that made you think it was
> working? Did you make sure and run make testclean and rebuild cs
> since the change?
It just said PASS when I ran it.
From the tip of origin/net, with make userclean, tests, fill-kfs.
/ $ ash ifconfig
Welcome to QEMU!
bind #c -> /dev flag 1
bind #p -> /proc flag 1
bind #s -> /srv flag 1
bind #I -> /net flag 2
bind #l0 -> /net flag 2
I am 10.0.2.15, default route 10.0.2.2
cs hasn't created #s/cs yet, sleeping until it does...
bind #K -> /prof/ flag 2
ifconfig complete
/ $ /bin/net.test -test.run="TestLookupPort"
PASS
/ $
Tomorrow I can try it on another machine.
Barret
At tip of net:
make userclean tests fill-kfs
ROS(Core 0)> bb
ash: can't access tty; job control turned off
/ $ mkdir tmp
/ $ ash ifconfig
Welcome to QEMU!
bind #c -> /dev flag 1
bind #p -> /proc flag 1
bind #s -> /srv flag 1
bind #I -> /net flag 2
bind #l0 -> /net flag 2
I am 10.0.2.15, default route 10.0.2.2
cs hasn't created #s/cs yet, sleeping until it does...
bind #K -> /prof/ flag 2
ifconfig complete
/ $ /bin/net.test -test.run="TestLookupPort"
--- FAIL: TestLookupPort (0.03 seconds)
port_test.go:50: LookupPort("udp", "echo") = 0, write /net/cs: cs:
can't translate service; want 7
port_test.go:50: LookupPort("udp", "tftp") = 0, write /net/cs: cs:
can't translate service; want 69
port_test.go:50: LookupPort("udp", "bootpc") = 0, write /net/cs: cs:
can't translate service; want 68
port_test.go:50: LookupPort("udp", "bootps") = 0, write /net/cs: cs:
can't translate service; want 67
port_test.go:50: LookupPort("udp", "domain") = 0, write /net/cs: cs:
can't translate service; want 53
port_test.go:50: LookupPort("udp", "ntp") = 0, write /net/cs: cs:
can't translate service; want 123
port_test.go:50: LookupPort("udp", "snmp") = 0, write /net/cs: cs:
can't translate service; want 161
port_test.go:50: LookupPort("udp", "syslog") = 0, write /net/cs: cs:
can't translate service; want 514
FAIL
With the revert commit applied on top of net:
commit a0a8d54259ea7a3b26282da4761f24010a7825a9
Author: Kevin Klues <klu...@cs.berkeley.edu>
Date: Tue Jul 28 17:16:04 2015 -0700
Revert "Cs is a multithreaded SCP"
This reverts commit 5b696b1fc6d4ff81812f54e4bd78d86b399e00d5.
make userclean tests fill-kfs
ROS(Core 0)> bb
ash: can't access tty; job control turned off
/ $ mkdir tmp
/ $ ash ifconfig
Welcome to QEMU!
bind #c -> /dev flag 1
bind #p -> /proc flag 1
bind #s -> /srv flag 1
bind #I -> /net flag 2
bind #l0 -> /net flag 2
I am 10.0.2.15, default route 10.0.2.2
cs hasn't created #s/cs yet, sleeping until it does...
bind #K -> /prof/ flag 2
ifconfig complete
/ $ /bin/net.test -test.run="TestLookupPort"
PASS
--
~Kevin
ROS(Core 0)> bb
ash: can't access tty; job control turned off
/ $ ash if
/ $ ash ifconfig
Welcome to QEMU!
bind #c -> /dev flag 1
bind #p -> /proc flag 1
bind #s -> /srv flag 1
bind #I -> /net flag 2
bind #l0 -> /net flag 2
I am 10.0.2.15, default route 10.0.2.2
cs hasn't created #s/cs yet, sleeping until it does...
bind #K -> /prof/ flag 2
ifconfig complete
/ $ /bin/net.test -test.run="TestLookupPort"
PASS
/ $
reboot. next time with mkdir
ROS(Core 0)> bb
ash: can't access tty; job control turned off
/ $ mkdir tmp
mkdir: can't create directory 'tmp': File exists
/ $ ash if
/ $ ash ifconfig
Welcome to QEMU!
bind #c -> /dev flag 1
bind #p -> /proc flag 1
bind #s -> /srv flag 1
bind #I -> /net flag 2
bind #l0 -> /net flag 2
I am 10.0.2.15, default route 10.0.2.2
cs hasn't created #s/cs yet, sleeping until it does...
bind #K -> /prof/ flag 2
ifconfig complete
/ $ /bin/net.test -test.run="TestLookupPort"
PASS
/ $
reboot, same thing:
ROS(Core 0)> bb
ash: can't access tty; job control turned off
/ $ mkdir tmp
mkdir: can't create directory 'tmp': File exists
/ $ ash if
/ $ ash ifconfig
Welcome to QEMU!
bind #c -> /dev flag 1
bind #p -> /proc flag 1
bind #s -> /srv flag 1
bind #I -> /net flag 2
bind #l0 -> /net flag 2
I am 10.0.2.15, default route 10.0.2.2
cs hasn't created #s/cs yet, sleeping until it does...
bind #K -> /prof/ flag 2
ifconfig complete
/ $ /bin/net.test -test.run="TestLookupPort"
--- FAIL: TestLookupPort (0.04 seconds)
port_test.go:50: LookupPort("udp", "echo") = 0, write /net/cs: cs: can't translate service; want 7
port_test.go:50: LookupPort("udp", "tftp") = 0, write /net/cs: cs: can't translate service; want 69
port_test.go:50: LookupPort("udp", "bootpc") = 0, write /net/cs: cs: can't translate service; want 68
port_test.go:50: LookupPort("udp", "bootps") = 0, write /net/cs: cs: can't translate service; want 67
port_test.go:50: LookupPort("udp", "domain") = 0, write /net/cs: cs: can't translate service; want 53
port_test.go:50: LookupPort("udp", "ntp") = 0, write /net/cs: cs: can't translate service; want 123
port_test.go:50: LookupPort("udp", "snmp") = 0, write /net/cs: cs: can't translate service; want 161
port_test.go:50: LookupPort("udp", "syslog") = 0, write /net/cs: cs: can't translate service; want 514
FAIL
/ $
Once it starts failing, it fails regularly for a given boot. Your host might be faster or otherwise different from mine, such that the race always occurs.
This also feels like an old bnx2x bug I had, where the race wasn't about concurrency as much as it was about getting different memory allocations. Perhaps doing a mkdir is important, or running it from the console manually matters for the timing. Running the test from init.sh hasn't tripped it yet (20+ tries or so).
Anyway, debugging that way can be a mess. Luckily, it always fails for you. =)
When it does fail, I'd be curious whether the problem persists if you start a new CS. You might be able to just run CS again, though you might have to kill the old one.
barret
Scroll all the way to the end. No spoilers!
Long version:
I managed to get syscall traces for good and bad cs, where good passes
the test and bad doesn't (attached). Both are from origin/net; the
difference between the runs was whether I did mkdir /tmp before or after
launching cs, and all of them were done manually (no init.sh).
The traces were run during the net.test, not during startup or anything.
I stripped down the traces a bit and diffed them, changing things like
mmaped addrs to 0x400000XXXXXXX and the PID to XX, so diff shows us
what is different. It is odd (and maybe important) that some mmap
happened to offset the addresses for certain data structures. If
everything was identical, those addrs would be too. (e.g., char *foo
should be at 0x400000f00bar for both good and bad, since the existence
of other processes shouldn't change our *virtual* mappings).
Some of the early differences are like this:
Syscall 100 ( read):(0x7, 0x7f7fffbfcd04, 0x13, 0x0, 0x0, 0x0)
ret: 0x13 proc: XX core: 0 vcore: 0 data:
- t\001\000\036\000
+ t\001\000\030\000
I think that's just the 9p FID being different for some reason.
Things don't differ until about halfway through the trace, when both good
and bad attempt to open lib/ndb/local.udp. Here's the thing: they both
fail to open that file. The file doesn't even exist. In fact, every
open call during the test from the good trace fails.
It makes me wonder how this thing is working at all.
There are also a few fixed MMAPs done at different places in the trace:
+mmap):(0x684000, 0x21000, 0x7, 0x30, 0xffffffffffffffff, 0x0)
(much later:)
-mmap):(0x6c5000, 0x21000, 0x7, 0x30, 0xffffffffffffffff, 0x0)
Anyway, the same pattern happens for each request; something like this
(minus is the good trace, plus is the bad one):
Syscall 102 ( open):(0x400000XXXXXX, 0x13, 0x0, 0x0, 0x0, 0x0) ret: 0xffffffffffffffff proc: XX core: 0 vcore: 0 data:
/lib/ndb/common.udp
-Syscall 102 ( open):(0x400000XXXXXX, 0x11, 0x0, 0x0, 0x0, 0x0) ret: 0xffffffffffffffff proc: XX core: 0 vcore: 0 data:
- lib/ndb/local.udp
-Syscall 102 ( open):(0x400000XXXXXX, 0x12, 0x0, 0x0, 0x0, 0x0) ret: 0xffffffffffffffff proc: XX core: 0 vcore: 0 data:
- /lib/ndb/local.udp
-Syscall 102 ( open):(0x400000XXXXXX, 0x13, 0x0, 0x0, 0x0, 0x0) ret: 0xffffffffffffffff proc: XX core: 0 vcore: 0 data:
- /lib/ndb/common.udp
-Syscall 101 ( write):(0x7, 0x400000XXXXXX, 0xb, 0x0, 0x0, 0x0) ret: 0xb proc: XX core: 0 vcore: 0 data:
- \013\000
-Syscall 19 ( munmap):(0x400000XXXXXX, 0x4000, 0x0, 0x0, 0x0, 0x0) ret: 0x0 proc: XX core: 0 vcore: 0 data:
-Syscall 13 ( proc_yield):(0x0, 0x0, 0x0, 0x0, 0x0, 0x0) ret: 0x0 proc: XX core: 0 vcore: 0 data:
-Syscall 100 ( read):(0x7, 0x7f7fffbfcd00, 0x4, 0x0, 0x0, 0x0) ret: 0x4 proc: XX core: 0 vcore: 0 data:
-
-Syscall 100 ( read):(0x7, 0x7f7fffbfcd04, 0x13, 0x0, 0x0, 0x0) ret: 0x13 proc: XX core: 0 vcore: 0 data:
- t\001\000\036\000
-Syscall 18 ( mmap):(0x0, 0x4000, 0x7, 0x8020, 0xffffffffffffffff, 0x0) ret: 0x400000XXXXXX proc: XX core: 0 vcore: 0 data:
-Syscall 101 ( write):(0x7, 0x400000XXXXXX, 0x26, 0x0, 0x0, 0x0) ret: 0x26 proc: XX core: 0 vcore: 0 data:
- &\000\000\000u\001\000\033\000\000
-Syscall 19 ( munmap):(0x400000XXXXXX, 0x4000, 0x0, 0x0, 0x0, 0x0) ret: 0x0 proc: XX core: 0 vcore: 0 data:
-Syscall 13 ( proc_yield):(0x0, 0x0, 0x0, 0x0, 0x0, 0x0) ret: 0x0 proc: XX core: 0 vcore: 0 data:
-Syscall 100 ( read):(0x7, 0x7f7fffbfcd00, 0x4, 0x0, 0x0, 0x0) ret: 0x4 proc: XX core: 0 vcore: 0 data:
-
-Syscall 100 ( read):(0x7, 0x7f7fffbfcd04, 0x13, 0x0, 0x0, 0x0) ret: 0x13 proc: XX core: 0 vcore: 0 data:
- t\001\000\036\000
-Syscall 18 ( mmap):(0x0, 0x4000, 0x7, 0x8020, 0xffffffffffffffff, 0x0) ret: 0x400000XXXXXX proc: XX core: 0 vcore: 0 data:
-Syscall 101 ( write):(0x7, 0x400000XXXXXX, 0xb, 0x0, 0x0, 0x0) ret: 0xb proc: XX core: 0 vcore: 0 data:
- \013\000
+Syscall 101 ( write):(0x7, 0x400000XXXXXX, 0x24, 0x0, 0x0, 0x0) ret: 0x24 proc: XX core: 0 vcore: 0 data:
+ $\000\000\000k\001\000\033\000cs: ca
You can see the beginning of the errstr message in the last write
syscall; that is the bad cs saying "cs: can't translate service".
So the good one tries a few more things, fails in all of its opens,
then somehow tells the test the right answer.
My hunch is that something is wrong in ndb. It might not be a race
related to multi-threading, since the go test is probably not making
concurrent requests (right?). Though I don't doubt that something in
ndb is racy.
Following up on NDB, I put a printf at ndbgetvalue (where cs calls into
NDB). Unfortunately, once I did that, I wasn't able to get the bad CS
again. Here's what they both look like:
/ $ /bin/net.test -test.run="TestLookupPort"
NDBGV: 0x624bc0 0x400000611c70 port 1 port 0x400000611c68
NDBGV: 0x624bc0 0x400000611d10 sys localhost ip 0x400000611cd8
NDBGV: 0x624bc0 0x400000611be0 port 1 port 0x400000611bd8
NDBGV: 0x624bc0 0x400000611c70 port 1 port 0x400000611c68
NDBGV: 0x624bc0 0x400000611d10 sys localhost ip 0x400000611cd8
NDBGV: 0x624bc0 0x400000611be0 port 1 port 0x400000611bd8
NDBGV: 0x624bc0 0x400000611c70 tcp echo port 0x400000611c68
NDBGV: 0x624bc0 0x400000611be0 tcp echo port 0x400000611bd8
NDBGV: 0x624bc0 0x400000611c70 tcp discard port 0x400000611c68
NDBGV: 0x624bc0 0x400000611be0 tcp discard port 0x400000611bd8
NDBGV: 0x624bc0 0x400000611c70 tcp systat port 0x400000611c68
NDBGV: 0x624bc0 0x400000611be0 tcp systat port 0x400000611bd8
NDBGV: 0x624bc0 0x400000611c70 tcp daytime port 0x400000611c68
NDBGV: 0x624bc0 0x400000611be0 tcp daytime port 0x400000611bd8
NDBGV: 0x624bc0 0x400000611c70 tcp chargen port 0x400000611c68
NDBGV: 0x624bc0 0x400000611be0 tcp chargen port 0x400000611bd8
NDBGV: 0x624bc0 0x400000611c70 tcp ftp-data port 0x400000611c68
NDBGV: 0x624bc0 0x400000611be0 tcp ftp-data port 0x400000611bd8
NDBGV: 0x624bc0 0x400000611c70 tcp ftp port 0x400000611c68
NDBGV: 0x624bc0 0x400000611be0 tcp ftp port 0x400000611bd8
NDBGV: 0x624bc0 0x400000611c70 tcp telnet port 0x400000611c68
NDBGV: 0x624bc0 0x400000611be0 tcp telnet port 0x400000611bd8
NDBGV: 0x624bc0 0x400000611c70 tcp smtp port 0x400000611c68
NDBGV: 0x624bc0 0x400000611be0 tcp smtp port 0x400000611bd8
NDBGV: 0x624bc0 0x400000611c70 tcp time port 0x400000611c68
NDBGV: 0x624bc0 0x400000611be0 tcp time port 0x400000611bd8
NDBGV: 0x624bc0 0x400000611c70 tcp domain port 0x400000611c68
NDBGV: 0x624bc0 0x400000611be0 tcp domain port 0x400000611bd8
NDBGV: 0x624bc0 0x400000611c70 tcp finger port 0x400000611c68
NDBGV: 0x624bc0 0x400000611be0 tcp finger port 0x400000611bd8
NDBGV: 0x624bc0 0x400000611c70 udp echo port 0x400000611c68
NDBGV: 0x624bc0 0x400000611be0 udp echo port 0x400000611bd8
NDBGV: 0x624bc0 0x400000611c70 udp tftp port 0x400000611c68
NDBGV: 0x624bc0 0x400000611be0 udp tftp port 0x400000611bd8
NDBGV: 0x624bc0 0x400000611c70 udp bootpc port 0x400000611c68
NDBGV: 0x624bc0 0x400000611be0 udp bootpc port 0x400000611bd8
NDBGV: 0x624bc0 0x400000611c70 udp bootps port 0x400000611c68
NDBGV: 0x624bc0 0x400000611be0 udp bootps port 0x400000611bd8
NDBGV: 0x624bc0 0x400000611c70 udp domain port 0x400000611c68
NDBGV: 0x624bc0 0x400000611be0 udp domain port 0x400000611bd8
NDBGV: 0x624bc0 0x400000611c70 udp ntp port 0x400000611c68
NDBGV: 0x624bc0 0x400000611be0 udp ntp port 0x400000611bd8
NDBGV: 0x624bc0 0x400000611c70 udp snmp port 0x400000611c68
NDBGV: 0x624bc0 0x400000611be0 udp snmp port 0x400000611bd8
NDBGV: 0x624bc0 0x400000611c70 udp syslog port 0x400000611c68
NDBGV: 0x624bc0 0x400000611be0 udp syslog port 0x400000611bd8
NDBGV: 0x624bc0 0x400000611c70 tcp --badport-- port 0x400000611c68
It's interesting that we got through the TCP part of the test
(presumably the bad cs would have failed for TCP too, and we would have
heard about it from net.test).
So next up, I put a printf in cs where I thought ndblib was failing.
In the good case, it only seems to fail at the very end:
(lots of good NDBGV prints)
NDBGV: 0x624bc0 0x400000611c70 tcp --badport-- port 0x400000611c68
NDBGV failed
Ipserve seemed to fail 1
PASS
In the bad case (I managed to trigger it again), NDBGV would fail for
all of the UDP tests:
NDBGV: 0x624bc0 0x400000613c70 udp echo port 0x400000613c68
NDBGV failed
Ipserve seemed to fail 1
NDBGV: 0x624bc0 0x400000613c70 udp tftp port 0x400000613c68
NDBGV failed
Ipserve seemed to fail 1
NDBGV: 0x624bc0 0x400000613c70 udp bootpc port 0x400000613c68
NDBGV failed
Ipserve seemed to fail 1
NDBGV: 0x624bc0 0x400000613c70 udp bootps port 0x400000613c68
NDBGV failed
Ipserve seemed to fail 1
NDBGV: 0x624bc0 0x400000613c70 udp domain port 0x400000613c68
NDBGV failed
Ipserve seemed to fail 1
NDBGV: 0x624bc0 0x400000613c70 udp ntp port 0x400000613c68
NDBGV failed
Ipserve seemed to fail 1
NDBGV: 0x624bc0 0x400000613c70 udp snmp port 0x400000613c68
NDBGV failed
Ipserve seemed to fail 1
NDBGV: 0x624bc0 0x400000613c70 udp syslog port 0x400000613c68
NDBGV failed
Ipserve seemed to fail 1
NDBGV: 0x624bc0 0x400000613c70 tcp --badport-- port 0x400000613c68
NDBGV failed
Ipserve seemed to fail 1
--- FAIL: TestLookupPort (0.23 seconds)
port_test.go:50: LookupPort("udp", "echo") = 0, write /net/cs: cs: can't translate service; want 7
port_test.go:50: LookupPort("udp", "tftp") = 0, write /net/cs: cs: can't translate service; want 69
port_test.go:50: LookupPort("udp", "bootpc") = 0, write /net/cs: cs: can't translate service; want 68
port_test.go:50: LookupPort("udp", "bootps") = 0, write /net/cs: cs: can't translate service; want 67
port_test.go:50: LookupPort("udp", "domain") = 0, write /net/cs: cs: can't translate service; want 53
port_test.go:50: LookupPort("udp", "ntp") = 0, write /net/cs: cs: can't translate service; want 123
port_test.go:50: LookupPort("udp", "snmp") = 0, write /net/cs: cs: can't translate service; want 161
port_test.go:50: LookupPort("udp", "syslog") = 0, write /net/cs: cs: can't translate service; want 514
FAIL
/ $
Looking at udp echo. Here's the good one (two calls):
NDBGV: 0x624bc0 0x400000611c70 udp echo port 0x400000611c68
NDBGV: 0x624bc0 0x400000611be0 udp echo port 0x400000611bd8
and the bad one (one call):
NDBGV: 0x624bc0 0x400000613c70 udp echo port 0x400000613c68
The pointers are different in the first call again; the bad one
is two pages ahead of the good one, implying different memory maps.
So the two questions are: why are the memory maps different (this seems
to pop up a lot), and why would ndbgetvalue() fail for the bad cs?
Here's the diff in the VMRs (good = 19, bad = 20)
/ $ m kfunc print_vmrs 0xffff80013bb8c600
-VM Regions for proc 19
+VM Regions for proc 20
00: (0x0000000000100000 - 0x0000000000120000): 0x00000007, 0x0000a012, 0xffff80013ffaccf0, 0x0000000000000000
01: (0x000000000031f000 - 0x0000000000320000): 0x00000001, 0x0000a012, 0xffff80013ffaccf0, 0x000000000001f000
02: (0x0000000000320000 - 0x0000000000321000): 0x00000003, 0x0000a012, 0xffff80013ffaccf0, 0x0000000000020000
@@ -15,15 +15,15 @@
12: (0x000040000028d000 - 0x000040000028e000): 0x00000001, 0x0000a812, 0xffff80013fface60, 0x000000000008c000
13: (0x000040000028e000 - 0x000040000028f000): 0x00000003, 0x0000a812, 0xffff80013fface60, 0x000000000008d000
14: (0x000040000028f000 - 0x0000400000290000): 0x00000003, 0x00000022, 0x0000000000000000, 0x0000000000000000
-15: (0x0000400000290000 - 0x00004000003ce000): 0x00000005, 0x0000a802, 0xffff800139f66000, 0x0000000000000000
-16: (0x00004000003ce000 - 0x00004000005ce000): 0x00000000, 0x0000a802, 0xffff800139f66000, 0x000000000013e000
-17: (0x00004000005ce000 - 0x00004000005d2000): 0x00000001, 0x0000a812, 0xffff800139f66000, 0x000000000013e000
-18: (0x00004000005d2000 - 0x00004000005d4000): 0x00000003, 0x0000a812, 0xffff800139f66000, 0x0000000000142000
+15: (0x0000400000290000 - 0x00004000003ce000): 0x00000005, 0x0000a802, 0xffff800139f65000, 0x0000000000000000
+16: (0x00004000003ce000 - 0x00004000005ce000): 0x00000000, 0x0000a802, 0xffff800139f65000, 0x000000000013e000
+17: (0x00004000005ce000 - 0x00004000005d2000): 0x00000001, 0x0000a812, 0xffff800139f65000, 0x000000000013e000
+18: (0x00004000005d2000 - 0x00004000005d4000): 0x00000003, 0x0000a812, 0xffff800139f65000, 0x0000000000142000
19: (0x00004000005d4000 - 0x00004000005d7000): 0x00000003, 0x00000032, 0x0000000000000000, 0x0000000000000000
20: (0x00004000005d7000 - 0x00004000005d8000): 0x00000003, 0x00000022, 0x0000000000000000, 0x0000000000000000
21: (0x00004000005d8000 - 0x00004000005da000): 0x00000007, 0x00008020, 0x0000000000000000, 0x0000000000000000
22: (0x00004000005da000 - 0x0000400000606000): 0x00000003, 0x00008020, 0x0000000000000000, 0x0000000000000000
-23: (0x0000400000606000 - 0x000040000060e000): 0x00000003, 0x00000022, 0x0000000000000000, 0x0000000000000000
+23: (0x0000400000606000 - 0x0000400000610000): 0x00000003, 0x00000022, 0x0000000000000000, 0x0000000000000000
24: (0x0000400004000000 - 0x0000400004021000): 0x00000003, 0x00004022, 0x0000000000000000, 0x0000000000000000
25: (0x0000400004021000 - 0x0000400008000000): 0x00000000, 0x00004022, 0x0000000000000000, 0x0000000000000000
26: (0x0000400008000000 - 0x0000400008021000): 0x00000003, 0x00004022, 0x0000000000000000, 0x0000000000000000
@@ -125,5 +125,4 @@
122: (0x00004000c8000000 - 0x00004000c8021000): 0x00000003, 0x00004022, 0x0000000000000000, 0x0000000000000000
123: (0x00004000c8021000 - 0x00004000cc000000): 0x00000000, 0x00004022, 0x0000000000000000, 0x0000000000000000
124: (0x00007f7fffaff000 - 0x00007f7fffbff000): 0x00000003, 0x00000030, 0x0000000000000000, 0x0000000000000000
VMRs 15-18 differ due to the file pointer (i.e., which kernel file); it's
easier to see in vimdiff. The difference with VMR 23 is the size: the bad
one wanted two more pages, which also happens to be the offset.
Perhaps there was a malloc of about two pages?
Let's look at the memory of the bad cs at that location:
/ $ m showmapping 20 0x0000400000606000
Virtual Physical Ps Dr Ac CD WT U W P EPTE
-----------------------------------------------------------------
0x0000400000606000 0x000000013a726000 0 0 1 0 0 1 1 1 0x13a726037
/ $ m kfunc pahexdump 0x000000013a726000 4096
ffff80013a726000: 0a 23 0a 23 20 20 66 69 6c 65 73 20 63 6f 6d 70 .#.# files comp
ffff80013a726010: 72 69 73 69 6e 67 20 74 68 65 20 64 61 74 61 62 rising the datab
ffff80013a726020: 61 73 65 2c 20 75 73 65 20 61 73 20 6d 61 6e 79 ase, use as many
ffff80013a726030: 20 61 73 20 79 6f 75 20 6c 69 6b 65 2c 20 73 65 as you like, se
ffff80013a726040: 65 20 6e 64 62 28 36 29 0a 23 0a 64 61 74 61 62 e ndb(6).#.datab
ffff80013a726050: 61 73 65 3d 0a 09 66 69 6c 65 3d 2f 6c 69 62 2f ase=..file=/lib/
ffff80013a726060: 6e 64 62 2f 6c 6f 63 61 6c 0a 09 66 69 6c 65 3d ndb/local..file=
Looks like ndb stuff. That's actually the start of /lib/ndb/local,
which is the "dbfile" in cs and the deffile in ndbopen. That file is
opened and stored into a calloc'd region in ndbopen.c. That all
happened in ndbinit() (cs.c).
So it looks like the reason for the different addresses is different
mallocs, and those mmaps came from glibc.
If we go back to the function call (bad):
NDBGV: 0x624bc0 0x400000613c70 udp echo port 0x400000613c68
0x400000613c70 is &s, which is a stack variable in ipserv (cs.c). That's
a little tricky: pthread stacks are mmap'd directly; they don't go
through glibc. But if you look at the syscall trace:
Syscall 18 ( mmap):(0x0, 0x4000, 0x7, 0x8020,
0xffffffffffffffff, 0x0) ret: 0x40000060e000
That's in VMR 23. The kernel coalesces VMRs with the same
attributes, so in VMR 23 we have some pthread stacks and some
glibc-malloc'd stuff. Incidentally, here's the same syscall from "good":
Syscall 18 ( mmap):(0x0, 0x4000, 0x7, 0x8020,
0xffffffffffffffff, 0x0) ret: 0x40000060a000
That's off by 4 pages, not by 2 (4 pages is the size of a pthread stack, btw).
It's hard to draw too much from this, since these aren't the same good
runs as the ones with the NDBGV prints.
So, we have some allocation that is different between when we
open /lib/ndb/local and when we create the pthread for the job. Or at
least that is likely. Another thing to keep in mind is that cs
services a couple NDBGV requests during ifconfig.
Maybe due to various races, multiple threads existed at some point
during ifconfig when we used to only have one, and that's thrown off
our allocs. Then some permanent structure was malloced at a slightly
different time, or whatever.
So let's look at why ndbgetvalue fails. When it fails, it's in
ndbsearch. There's a call to a very suspicious function in there:
struct ndbtuple*
ndbsearch(struct ndb *db, struct ndbs *s, char *attr, char *val)
{
	uint8_t *p;
	struct ndbtuple *t;
	struct ndbhf *hf;

	hf = hfopen(db, attr);
hfopen:
static struct ndbhf*
hfopen(struct ndb *db, char *attr)
{
	static int beenhere = 0; // XXX racy?
That static is a sign of potential trouble.
	/* if the database has changed, throw out hash files and reopen db */
#if 0
	if((d = dirfstat(Bfildes(&db->b))) == NULL || db->qid.path !=
	   d->qid.path || db->qid.vers != d->qid.vers){
#else
	if (! beenhere){
#endif
		if(ndbreopen(db) < 0){
			free(d);
			return 0;
		}
		beenhere = 1;
		db->mtime = 1;
	}
Looks like the code was doing something originally, then was hacked
up. It's a really narrow race, and it doesn't seem to be the problem this
time, since ndbsearch always (both good and bad) has hf = 0:
NDBGV: 0x624bc0 0x400000611c70 tcp finger port 0x400000611c68
NDBSEARCH: 0x624bc0 0x400000611c70 tcp finger, hf (nil)
NDBSEARCH: 0x622d10 0x400000611c70 tcp finger, hf (nil)
NDBSEARCH: 0x624380 0x400000611c70 tcp finger, hf (nil)
NDBSEARCH: 0x6247a0 0x400000611c70 tcp finger, hf (nil)
NDBGV: 0x624bc0 0x400000611c70 udp echo port 0x400000611c68
NDBSEARCH: 0x624bc0 0x400000611c70 udp echo, hf (nil)
NDBSEARCH: 0x622d10 0x400000611c70 udp echo, hf (nil)
NDBSEARCH: 0x624380 0x400000611c70 udp echo, hf (nil)
NDBSEARCH: 0x6247a0 0x400000611c70 udp echo, hf (nil)
NDBGV failed
Incidentally, I guess that hfopen hash crap doesn't work at all?
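(Not the bug this time, but for reference, here's a minimal sketch of a
non-racy version of that one-shot reopen, using a plain mutex. ndbreopen()
and struct ndb are the ones from the quoted code; everything else here is
made up for illustration.)

#include <pthread.h>

static pthread_mutex_t reopen_lock = PTHREAD_MUTEX_INITIALIZER;
static int reopened;

/* Serialize the one-time reopen instead of racing on a bare 'beenhere'. */
static int maybe_reopen(struct ndb *db)
{
	int ret = 0;

	pthread_mutex_lock(&reopen_lock);
	if (!reopened) {
		ret = ndbreopen(db);
		if (ret >= 0) {
			reopened = 1;
			db->mtime = 1;
		}
	}
	pthread_mutex_unlock(&reopen_lock);
	return ret;
}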
So I put in some printfs to catch when things were exiting, trying to
isolate the cause of ndbgetvalue's failure. Here's one:
NDBGV: 0x624bc0 0x400000612c70 udp echo port 0x400000612c68
NDBSEARCH: 0x624bc0 0x400000612c70 udp echo, hf (nil)
NDBSNEXT 0x400000612c70, udp, echo
NDBSEARCH: 0x622d10 0x400000612c70 udp echo, hf (nil)
NDBSNEXT 0x400000612c70, udp, echo
NDBSEARCH: 0x624380 0x400000612c70 udp echo, hf (nil)
NDBSNEXT 0x400000612c70, udp, echo
NDBSEARCH: 0x6247a0 0x400000612c70 udp echo, hf (nil)
NDBSNEXT 0x400000612c70, udp, echo
4: NDBSNEXT ret (nil)
7: NDBS ret (nil)
5: NDBSNEXT ret (nil)
7: NDBS ret (nil)
5: NDBSNEXT ret (nil)
7: NDBS ret (nil)
5: NDBSNEXT ret (nil)
7: NDBS ret (nil)
NDBGV failed
Hmmm, ndbsearch calls ndbsnext and vice versa, back and forth a bit.
Hmmm. So we have slight differences in memory layout, a reminder that
our stacks are only 4 pages, and some 9ns functions that call each
other back and forth. And we just recently switched to using pthreads
whose stacks are allocated from mmap inside pthread.c instead of the
*huge* stack allocated by the kernel.
Yeah, that sounds like we're running off the small pthread stack. I
cranked the stacksize from four pages to eight, and haven't been able
to recreate the bug.
I don't know which part was being corrupted, or by whom. If there were
multiple threads running at one point (like during ifconfig, though
I think not), then someone's stack could have been clobbered. Or
perhaps there was a nearby malloc that was getting overrun (more
likely).
Short version:
Pthread stacks from the 2LS were too small.
Barret
Instead of upping the default, we could have just upped it for cs
specifically via pthread_attr_setstacksize(). I personally like
keeping them small, and then forcing apps to up them if we find they
need more space. Alternatively, we could make it large by default,
and then force our high performance apps to drop it lower if they know
they don't need that much space.
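(As a rough sketch of what that would look like, assuming the usual
pthread attr API; the 8-page size and the spawn_worker() wrapper are just
illustrative, not anything in the tree:)

#include <pthread.h>

/* Illustrative only: create a thread with a larger-than-default stack
 * instead of raising the global 2LS default. */
static int spawn_worker(pthread_t *tid, void *(*fn)(void *), void *arg)
{
	pthread_attr_t attr;
	int ret;

	pthread_attr_init(&attr);
	/* e.g., 8 pages of 4KB instead of the old 4-page default */
	pthread_attr_setstacksize(&attr, 8 * 4096);
	ret = pthread_create(tid, &attr, fn, arg);
	pthread_attr_destroy(&attr);
	return ret;
}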
Either way, without a guard page, it does cause us to bump into
problems like this on occasion. I like your suggestion of:
Perhaps we can make guard pages an option, turned on by default for
'legacy' apps and turned off for high-perf apps, similar to how we deal
with TLS.
It seems like we can have one function that we call for
high-performance apps that does a number of things, like disable TLS,
turn off guard pages, lower the default stack size, etc.
Kevin
Yeah, the high-perf apps can pick and choose what they want. I was a
little reluctant to just crank it up for CS, since we happened to know
it called into ndb. Who knows what other apps we'll bring in that need
a little more stack (e.g., 'dns').
Another thing we can do is allocate, but not populate, very large stacks
for pthreads/uthreads. Then we'd have the room to grow, but would
probably rarely use the extra physical memory. We'd be using virtual
address space, which is basically free, and only sparsely populate it
with actual pages. Right now we page in the stacks at alloc time, which
is probably excessive anyways. There are a couple minor issues, like
running out of memory, but those are issues anyways.
I tried turning off the MAP_POPULATE in __pthread_allocate_stack and
haven't had any adverse effects. So how about we crank up the
stacksize to Linux's PTHREAD_STACK_MIN and just don't populate the
pages? At least until we have something more capable.
Barret
Speaking of not playing well with huge pages, my proposal from the
previous email to sparsely lay out thread stacks in a large VA space
would still end up 'wasting' space in chunks of 2MB, if those stacks
were backed by huge pages.
> The main problem Barret mentioned in his commit message is that it
> increases the number of VMRs in the kernel to O(nr_pthreads). Every
> stack would need two VMRs, one for the stack and one for the guard
> (since it has different permissions and needs to be virtually
> contiguous). As it stands now, we are able to coalesce VMRs from
> multiple threads into a single VMR because they essentially look the
> same across contiguous virtual addresses.
And to follow up on that, the issue with having more VMRs is that VMR
lookup (e.g., page fault handling) gets more expensive (especially on
Akaros, where we don't do anything smart yet).
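To make that concrete, here's a rough sketch (not Akaros code) of why a
guarded stack can't coalesce: the guard has to be virtually adjacent to
the stack but with no-access permissions, so every thread contributes a
second VMR that can't merge with its neighbors.

#include <stddef.h>
#include <sys/mman.h>

#define GUARD_SIZE	4096
#define STACK_SIZE	(4 * 1024 * 1024)

/* Illustrative: map guard + stack together, then revoke access to the
 * lowest page so an overrun faults instead of silently scribbling on a
 * neighbor.  The permission split is what forces the extra VMR. */
static void *alloc_guarded_stack(void)
{
	char *base = mmap(NULL, GUARD_SIZE + STACK_SIZE, PROT_READ | PROT_WRITE,
			  MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);

	if (base == MAP_FAILED)
		return NULL;
	mprotect(base, GUARD_SIZE, PROT_NONE);
	return base + GUARD_SIZE;	/* usable stack starts above the guard */
}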
Barret
I made this change. Pthread stacks are 4MB virtually (1024 pages),
with the page at the top of the stack faulted in at stack allocation
time.
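(For reference, a minimal sketch of that layout, assuming 4KB pages; this
is not the actual __pthread_allocate_stack, just an illustration of mapping
the full 4MB without MAP_POPULATE and touching only the top page up front:)

#include <string.h>
#include <sys/mman.h>

#define PGSIZE		4096
#define PTH_STACKSIZE	(4 * 1024 * 1024)

static void *alloc_lazy_stack(void)
{
	char *base = mmap(NULL, PTH_STACKSIZE, PROT_READ | PROT_WRITE,
			  MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);

	if (base == MAP_FAILED)
		return NULL;
	/* Fault in just the highest page; the rest of the 4MB stays virtual
	 * until the thread actually grows into it. */
	memset(base + PTH_STACKSIZE - PGSIZE, 0, PGSIZE);
	return base + PTH_STACKSIZE;	/* stacks grow down from the top */
}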
I tested it with the net.test that was failing earlier and
pthread_test. Seems to work alright.
Let me know if you see any issues. (origin/net, rebased that original
pthread commit "Increase pthread's default stack size").
Barret
Guess we should fix that. Drew and I got most of the way through the
extra data business, then put in the asserts and whatnot as a TODO list.
barret
kernel panic at kern/src/ns/qio.c:275, from core 7: can't pullup 57
bytes, no place to put it: bp->lim 0xffff80008e6fa32f, bp->rp
0xffff80008e6fa2f8, bp->lim-bp->rp 55
--
~Kevin