can't reboot after running a 5.0 kernel

Chuck Cranor

unread,

Jan 27, 2011, 10:56:18 AM1/27/11

to

Hi-

I've got a system that identifies itself as:

Digital AlphaStation 600 5/266, 266MHz, s/n
cpu0 at mainbus0: ID 0 (primary), 21164-5

It is currently running 4.0 fine (reboots cleanly, etc.).

If I boot at 5.0 kernel on it and attempt a reboot, I get the
following:

# halt
Jan 27 10:19:23 halt: halted by root
syncing disks... done
unmounting file systems... done
halted.

halted CPU 0

halt code = 5
HALT instruction executed
PC = fffffc0000300128
waiting for pka0.7.0.1001.0 to start...
waiting for pka0.7.0.1001.0 to start...
waiting for pka0.7.0.1001.0 to start...
waiting for pka0.7.0.1001.0 to start...
waiting for pka0.7.0.1001.0 to start...
waiting for pka0.7.0.1001.0 to start...
waiting for pka0.7.0.1001.0 to start...
waiting for pka0.7.0.1001.0 to start...
waiting for pkb0.7.0.1002.0 to start...
waiting for pkb0.7.0.1002.0 to start...
waiting for pkb0.7.0.1002.0 to start...
waiting for pkb0.7.0.1002.0 to start...
waiting for pkb0.7.0.1002.0 to start...
waiting for pkb0.7.0.1002.0 to start...
waiting for pkb0.7.0.1002.0 to start...
waiting for pkb0.7.0.1002.0 to start...
>>>
>>>
>>>boot
(boot dka100.1.0.1001.0 -flags a)
failed to open dka100.1.0.1001.0
>>>

If I cycle the power, it boots fine once again. My guess is that the
5.0 kernel is leaving the pka0 device in a bad state. I only get the
"waiting for pka0.7.0.1001.0 to start..." errors after running the 5.0
kernel (4.0 kernel is fine).

Any ideas on what is causing this problem?

chuck (dmesg output is below)

NetBSD 4.0.0_PATCH (GENERIC.MP) #0: Tue May 20 16:45:50 EDT 2008
Digital AlphaStation 600 5/266, 266MHz, s/n
8192 byte page size, 1 processor.
total memory = 704 MB
(2176 KB reserved for PROM, 701 MB used by NetBSD)
avail memory = 681 MB
mainbus0 (root)
cpu0 at mainbus0: ID 0 (primary), 21164-5
cia0 at mainbus0: DECchip 2117x Core Logic Chipset (ALCOR/ALCOR2), pass 2
pci0 at cia0 bus 0
pci0: i/o space, memory space enabled, rd/line, rd/mult, wr/inv ok
tlp0 at pci0 dev 7 function 0: DECchip 21140 Ethernet, pass 1.2
tlp0: broken MicroWire interface detected; setting SROM size to 1Kb
tlp0: interrupting at kn20aa irq 8
tlp0: DEC DE500-XA, Ethernet address 00:00:f8:04:67:93
tlp0: 10baseT, 100baseTX, 100baseTX-FDX, 10baseT-FDX
ppb0 at pci0 dev 8 function 0: Digital Equipment DC21050 PCI-PCI Bridge (rev. 0x02)
pci1 at ppb0 bus 1
pci1: memory space enabled, rd/line, wr/inv ok
tlp1 at pci1 dev 0 function 0: DECchip 21040 Ethernet, pass 2.4
tlp1: interrupting at kn20aa irq 16
tlp1: Ethernet address 00:00:f8:21:2c:2a
tlp1: 10baseT, 10baseT-FDX, 10base5, manual
isp0 at pci1 dev 1 function 0: QLogic 1020 Fast Wide SCSI HBA
isp0: interrupting at kn20aa irq 17
isp0: invalid NVRAM header
scsibus0 at isp0: 16 targets, 8 luns per target
isp1 at pci1 dev 2 function 0: QLogic 1020 Fast Wide SCSI HBA
isp1: interrupting at kn20aa irq 18
isp1: invalid NVRAM header
scsibus1 at isp1: 16 targets, 8 luns per target
pceb0 at pci0 dev 10 function 0: Intel 82375EB/SB PCI-EISA Bridge (rev. 0x05)
3Com 3CR990-TX-97 10/100 Ethernet with 3XP (ethernet network, revision 0x02) at pci0 dev 12 function 0 not configured
eisa0 at pceb0
isa0 at pceb0
lpt0 at isa0 port 0x3bc-0x3bf irq 7
com0 at isa0 port 0x3f8-0x3ff irq 4: ns16550a, working fifo
com0: console
com1 at isa0 port 0x2f8-0x2ff irq 3: ns16550a, working fifo
pckbc0 at isa0 port 0x60-0x64
attimer0 at isa0 port 0x40-0x43: AT Timer
pcppi0 at isa0 port 0x61
pcppi0: children must have an explicit unit
midi0 at pcppi0: PC speaker (CPU-intensive output)
spkr0 at pcppi0
isabeep0 at pcppi0
fdc0 at isa0 port 0x3f0-0x3f7 irq 6 drq 2
mcclock0 at isa0 port 0x70-0x71: mc146818 or compatible
pcppi0: attached to attimer0
fd0 at fdc0 drive 0: 1.44MB, 80 cyl, 2 head, 18 sec
Kernelized RAIDframe activated
scsibus0: waiting 2 seconds for devices to settle...
scsibus1: waiting 2 seconds for devices to settle...
sd0 at scsibus0 target 1 lun 0: <SEAGATE, ST373405LC, 0002> disk fixed
sd0: 70007 MB, 29550 cyl, 8 head, 606 sec, 512 bytes/sect x 143374741 sectors
sd0: sync (100.00ns offset 12), 16-bit (20.000MB/s) transfers, tagged queueing
sd1 at scsibus0 target 2 lun 0: <SEAGATE, ST373405LC, 0003> disk fixed
sd1: 70007 MB, 29550 cyl, 8 head, 606 sec, 512 bytes/sect x 143374741 sectors
sd1: sync (100.00ns offset 12), 16-bit (20.000MB/s) transfers, tagged queueing
cd0 at scsibus0 target 5 lun 0: <DEC, RRD45 (C) DEC, 1645> cdrom removable
cd0: async, 8-bit transfers
root on sd0a dumps on sd0b
root file system type: ffs

--
Posted automagically by a mail2news gateway at muc.de e.V.
Please direct questions, flames, donations, etc. to news-...@muc.de

Dustin Marquess

unread,

Jan 27, 2011, 11:09:27 AM1/27/11

to

I've seen exactly this on another Alpha. I believe this is PR #40077.

There's a workaround in the PR. Basically you have to disable the
code that disables PCI bus mastering on a shutdown.

-Dustin

On Thu, Jan 27, 2011 at 7:56 AM, Chuck Cranor <ch...@ece.cmu.edu> wrote:
> If I boot at 5.0 kernel on it and attempt a reboot, I get the
> following:
>
> # halt
> Jan 27 10:19:23 halt: halted by root
> syncing disks... done
> unmounting file systems... done
> halted.
>
>
> halted CPU 0
>
> halt code = 5
> HALT instruction executed
> PC = fffffc0000300128
> waiting for pka0.7.0.1001.0 to start...

--

Izumi Tsutsui

unread,

Jan 27, 2011, 11:40:12 AM1/27/11

to

> I've seen exactly this on another Alpha. I believe this is PR #40077.
>
> There's a workaround in the PR. Basically you have to disable the
> code that disables PCI bus mastering on a shutdown.

I wonder if this pmf(9) hack is acceptable...
(diff is for netbsd-5 branch)

Index: pci/sio.c
===================================================================
RCS file: /cvsroot/src/sys/arch/alpha/pci/sio.c,v
retrieving revision 1.43
diff -u -p -r1.43 sio.c
--- pci/sio.c 28 Apr 2008 20:23:11 -0000 1.43
+++ pci/sio.c 27 Jan 2011 16:33:08 -0000
@@ -108,6 +108,8 @@ struct sio_softc {
int siomatch __P((struct device *, struct cfdata *, void *));
void sioattach __P((struct device *, struct device *, void *));

+static bool sioshutdown(device_t, int);
+
CFATTACH_DECL(sio, sizeof(struct sio_softc),
siomatch, sioattach, NULL, NULL);

@@ -204,9 +206,21 @@ sioattach(parent, self, aux)
sc->sc_is82c693 = (PCI_VENDOR(pa->pa_id) == PCI_VENDOR_CONTAQ &&
PCI_PRODUCT(pa->pa_id) == PCI_PRODUCT_CONTAQ_82C693);

+ if (!pmf_device_register1(self, NULL, NULL, sioshutdown))
+ aprint_error_dev(self, "couldn't establish power handler\n");
+
config_defer(self, sio_bridge_callback);
}

+bool
+sioshutdown(device_t self, int howto)
+{
+
+ sio_intr_shutdown();
+
+ return true;
+}
+
void
sio_bridge_callback(self)
struct device *self;
Index: pci/sio_pic.c
===================================================================
RCS file: /cvsroot/src/sys/arch/alpha/pci/sio_pic.c,v
retrieving revision 1.36
diff -u -p -r1.36 sio_pic.c
--- pci/sio_pic.c 28 Apr 2008 20:23:11 -0000 1.36
+++ pci/sio_pic.c 27 Jan 2011 16:33:08 -0000
@@ -116,6 +116,7 @@ static struct alpha_shared_intr *sio_int
*/
u_int8_t initial_ocw1[2];
u_int8_t initial_elcr[2];
+bool sio_doshutdown;
#endif

void sio_setirqstat __P((int, int, int));
@@ -123,9 +124,6 @@ void sio_setirqstat __P((int, int, int)
u_int8_t (*sio_read_elcr) __P((int));
void (*sio_write_elcr) __P((int, u_int8_t));
static void specific_eoi __P((int));
-#ifdef BROKEN_PROM_CONSOLE
-void sio_intr_shutdown __P((void *));
-#endif

/******************** i82378 SIO ELCR functions ********************/

@@ -352,7 +350,8 @@ sio_intr_setup(pc, iot)
initial_ocw1[1] = bus_space_read_1(sio_iot, sio_ioh_icu2, 1);
initial_elcr[0] = (*sio_read_elcr)(0); /* XXX */
initial_elcr[1] = (*sio_read_elcr)(1); /* XXX */
- shutdownhook_establish(sio_intr_shutdown, 0);
+ /* shutdown hook will be established in sioattach() via pmf(9) */
+ sio_doshutdown = true;
#endif

sio_intr = alpha_shared_intr_alloc(ICU_LEN, 8);
@@ -407,20 +406,22 @@ sio_intr_setup(pc, iot)
}
}

-#ifdef BROKEN_PROM_CONSOLE
void
-sio_intr_shutdown(arg)
- void *arg;
+sio_intr_shutdown(void)
{
+#ifdef BROKEN_PROM_CONSOLE
+
/*
* Restore the initial values, to make the PROM happy.
*/
- bus_space_write_1(sio_iot, sio_ioh_icu1, 1, initial_ocw1[0]);
- bus_space_write_1(sio_iot, sio_ioh_icu2, 1, initial_ocw1[1]);
- (*sio_write_elcr)(0, initial_elcr[0]); /* XXX */
- (*sio_write_elcr)(1, initial_elcr[1]); /* XXX */
-}
+ if (sio_doshutdown) {
+ bus_space_write_1(sio_iot, sio_ioh_icu1, 1, initial_ocw1[0]);
+ bus_space_write_1(sio_iot, sio_ioh_icu2, 1, initial_ocw1[1]);
+ (*sio_write_elcr)(0, initial_elcr[0]); /* XXX */
+ (*sio_write_elcr)(1, initial_elcr[1]); /* XXX */
+ }
#endif
+}

const char *
sio_intr_string(v, irq)
Index: pci/siovar.h
===================================================================
RCS file: /cvsroot/src/sys/arch/alpha/pci/siovar.h,v
retrieving revision 1.10
diff -u -p -r1.10 siovar.h
--- pci/siovar.h 5 Jun 2000 21:47:29 -0000 1.10
+++ pci/siovar.h 27 Jan 2011 16:33:08 -0000
@@ -36,3 +36,5 @@ void *sio_intr_establish __P((void *, in
void *));
void sio_intr_disestablish __P((void *, void *));
int sio_intr_alloc __P((void *, int, int, int *));
+
+void sio_intr_shutdown(void);

---
Izumi Tsutsui

Chuck Cranor

unread,

Jan 27, 2011, 12:42:50 PM1/27/11

to

On Fri, Jan 28, 2011 at 01:40:12AM +0900, Izumi Tsutsui wrote:
> > I've seen exactly this on another Alpha. I believe this is PR #40077.
> >
> > There's a workaround in the PR. Basically you have to disable the
> > code that disables PCI bus mastering on a shutdown.
>
> I wonder if this pmf(9) hack is acceptable...
> (diff is for netbsd-5 branch)

Thanks for the info guys.

My alpha doesn't have an "sio" device so that patch didn't make a difference
(posted a dmesg in the other email). Tsutsui, is there another device that
we could try to patch to fix it?

Commenting out the body of pci_child_shutdown() as noted in PR 40077
did fix the problem!

chuck

Michael L. Hitch

unread,

Jan 27, 2011, 12:38:18 PM1/27/11

to

On Thu, 27 Jan 2011, Chuck Cranor wrote:

> Thanks for the info guys.
>
> My alpha doesn't have an "sio" device so that patch didn't make a difference
> (posted a dmesg in the other email). Tsutsui, is there another device that
> we could try to patch to fix it?
>
> Commenting out the body of pci_child_shutdown() as noted in PR 40077
> did fix the problem!

I think David Young provided an alternate fix, which saves and restores
the pci command register. It was on port-i386 last August in the thread
http://mail-index.netbsd.org/port-i386/2010/08/04/msg002089.html and a
patch specifically in
http://mail-index.netbsd.org/port-i386/2010/08/04/msg002091.html and
http://mail-index.netbsd.org/port-i386/2010/08/05/msg002098.html for a
netbsd-5 version.

Mike

--
Michael L. Hitch mhi...@montana.edu
Computer Consultant
Information Technology Center
Montana State University Bozeman, MT USA

Chuck Cranor

unread,

Jan 27, 2011, 9:32:26 PM1/27/11

to

On Thu, Jan 27, 2011 at 10:38:18AM -0700, Michael L. Hitch wrote:
> I think David Young provided an alternate fix, which saves and restores
> the pci command register.

> http://mail-index.netbsd.org/port-i386/2010/08/05/msg002098.html for a
> netbsd-5 version.

The netbsd-5 version of the patch from David works!! Thanks Mike.

chuck

Izumi Tsutsui

unread,

Jan 28, 2011, 4:25:42 AM1/28/11

to

chuck@ wrote:

> My alpha doesn't have an "sio" device so that patch didn't make a difference
> (posted a dmesg in the other email). Tsutsui, is there another device that
> we could try to patch to fix it?

Ah, my sio patch is for a different problem. Sorry for noise.

mhitch@ wrote:

> http://mail-index.netbsd.org/port-i386/2010/08/04/msg002091.html and

Seems reasonable. Commit?

---
Izumi Tsutsui

Paul Mather

unread,

Jan 27, 2011, 11:18:35 AM1/27/11

to

I also have this same problem on my AlphaServer 1000A and posted about it back in 2009 (http://mail-index.netbsd.org/port-alpha/2009/06/19/msg000314.html). See that thread for the response. The summary is that "this is a known problem, see PR/40077."

Cheers,

Paul.

Chuck Cranor

unread,

Jan 28, 2011, 12:56:53 PM1/28/11

to

On Fri, Jan 28, 2011 at 06:25:42PM +0900, Izumi Tsutsui wrote:
> > http://mail-index.netbsd.org/port-i386/2010/08/04/msg002091.html and
> Seems reasonable. Commit?

Yeah, I'm for it. It is a step forward. I wonder why David Young
didn't push it through?

But I'm still having bad kernel issues with the alpha:

netbsd 5.0: kernel boots multiuser, but eventually processes start
hanging and the load goes up.

netbsd 5.0.2_PATCH, _and_
netbsd 5.1: kernel hangs hard booting multiuser... it gets to
"Starting named." and then hangs. Sending a break
to the console has no effect (can't get DDB, have to
power cycle).

HEAD: on a whim I just tried a HEAD kernel. it booted multiuser, but
I didn't do anything beyond that.

I played a bit with CVS, and the hard hang I'm seeing with 5.0.2 and
5.1 was introduced to the netbsd-5 branch some time between 01-May-09
and 01-Jul-09.

Has anyone successfully booted a 5.1 kernel on the alpha?

chuck

here are the kernel's I tried (in order, as above):

ftp://ftp.netbsd.org/pub/NetBSD/NetBSD-5.0/alpha/binary/kernel/netbsd-GENERIC.gz
http://nyftp.netbsd.org/pub/NetBSD-daily/netbsd-5-0/201101252100Z/alpha/binary/kernel/netbsd-GENERIC.gz
http://nyftp.netbsd.org/pub/NetBSD-daily/netbsd-5-1/201101261300Z/alpha/binary/kernel/netbsd-GENERIC.gz
http://nyftp.netbsd.org/pub/NetBSD-daily/HEAD/201101271500Z/alpha/binary/kernel/netbsd-GENERIC.gz

Dustin Marquess

unread,

Jan 28, 2011, 1:35:47 PM1/28/11

to

I have, but it's been a while. I moved my Alpha to -current/HEAD, as
it's the only one that can reliably do SMP on Alpha. The only other
problem I ran into is eventually if the machine doesn't get cleanly
unmounted, it'll refuse to mount the drive saying that it's dirty
(even using WAPBL), and running fsck to fix that will move almost
everything on the drive to lost+found. It does mount cleanly after
that though :). I haven't opened a PR on it yet, since I'm not sure
if it's a hardware problem, some kind of interaction between WAPBL and
SMP on Alpha, or what.

-Dustin

Chuck Cranor

unread,

Jan 28, 2011, 4:34:06 PM1/28/11

to

On Fri, Jan 28, 2011 at 10:35:47AM -0800, Dustin Marquess wrote:
> I have, but it's been a while. I moved my Alpha to -current/HEAD, as
> it's the only one that can reliably do SMP on Alpha.

I found the changes that cause my netbsd-5 alpha to hard-hang on
boot after starting named (can't even break to DDB)! The changes
were pulled up on 09-Jun-2009 via ticket #798 from martin@netbsd.

http://releng.netbsd.org/cgi-bin/req-5.cgi?show=798

These changes are in NetBSD/alpha 5.0.2, but not in NetBSD/alpha 5.0
(which explains why I can boot 5.0, but not 5.0.2 or 5.1). It also
looks like that code may have got reworked in HEAD (which explains
why I can also boot HEAD kernel).

...
Configuring network interfaces: tlp0.
add net default: gateway 172.19.144.1
Adding interface aliases:
Building databases...
Starting syslogd.
Starting named.
<<< hard hang here, requires power cycle to clear >>>

But now, what to do about it? Can we fix the netbsd-5 branch?

chuck

Module Name: src
Committed By: martin
Date: Mon Jun 1 20:58:16 UTC 2009

Modified Files:
src/sys/arch/alpha/alpha: locore.s vm_machdep.c
src/sys/arch/alpha/include: alpha.h

Log Message:
Do not use lwp_trampoline for cpu_setfunc, but a simplified setfunc_trampoline
that does not call lwp_startup() instead.

To generate a diff of this commit:
cvs rdiff -u -r1.113 -r1.114 src/sys/arch/alpha/alpha/locore.s
cvs rdiff -u -r1.99 -r1.100 src/sys/arch/alpha/alpha/vm_machdep.c
cvs rdiff -u -r1.23 -r1.24 src/sys/arch/alpha/include/alpha.h

Martin Husemann

unread,

Jan 28, 2011, 6:43:12 PM1/28/11

to

On Fri, Jan 28, 2011 at 04:34:06PM -0500, Chuck Cranor wrote:
> These changes are in NetBSD/alpha 5.0.2, but not in NetBSD/alpha 5.0
> (which explains why I can boot 5.0, but not 5.0.2 or 5.1). It also
> looks like that code may have got reworked in HEAD (which explains
> why I can also boot HEAD kernel).

I'm not sure I see what you mean - the code in -current looks pretty much
identical to me.

Martin

Chuck Cranor

unread,

Jan 28, 2011, 7:27:14 PM1/28/11

to

On Sat, Jan 29, 2011 at 12:43:12AM +0100, Martin Husemann wrote:
> I'm not sure I see what you mean - the code in -current looks pretty much
> identical to me.

I could be wrong about that. I mainly looked at the CVS logs.

I did boil the problem down a bit more... I think the hard
hang happens when you run the first program linked with "pthreads"
(i.e. /usr/sbin/named). IF you don't run named, I bet the
system will boot to multiuser OK.

I just did the following test:
[1] boot to single user
[2] "mount -u /" and "mount -r /usr"
[3] run "ifconfig" and "route add default" to bring up network
[4] run "/usr/sbin/named" --- system hard hangs here, won't break to
DDB and requires a power cycle to recover.

I'm wondering if your patch failed to setup some register that
is required for threaded apps? I don't know the alpha architecture,
so I'm not clear on what to do. Maybe someone here who knows
alpha can look? I can easily test any proposed fixes.

My userland is a 4.0 one, in case that matters.

chuck

>>>boot -file testin -fl s
(boot dka100.1.0.1001.0 -file testin -flags s)
block 0 of dka100.1.0.1001.0 is a valid boot block
reading 14 blocks from dka100.1.0.1001.0
bootstrap code read in
base = 136000, image_start = 0, image_bytes = 1c00
initializing HWRPB at 2000
initializing page table at 128000
initializing machine state
setting affinity to the primary CPU
jumping to bootstrap code

NetBSD/alpha 2.0.2 FFS Primary Bootstrap
Jumping to entry point...

NetBSD/alpha 2.0.2 Secondary Bootstrap, Revision 1.13
(bui...@works.netbsd.org, Tue Mar 22 03:21:07 UTC 2005)

VMS PAL rev: 0x1000000010112
OSF PAL rev: 0x1000000020115
Switch to OSF PAL code succeeded.

Boot file: testin
Boot flags: s
9394176+480920 [533520+353943]=0xa44000

Entering testin at 0xfffffc0000301140...
Copyright (c) 1996, 1997, 1998, 1999, 2000, 2001, 2002, 2003, 2004, 2005,
2006, 2007, 2008
The NetBSD Foundation, Inc. All rights reserved.
Copyright (c) 1982, 1986, 1989, 1991, 1993
The Regents of the University of California. All rights reserved.

NetBSD 5.0_STABLE (GENERIC-$Revision: 1.325 $) #6: Fri Jan 28 16:22:21 EST 2011
ch...@xxx.pdl.cmu.edu:/.amd/flow/home/chuck/src/netbsd/cur/src/sys/arch/alpha/compile/GENERIC

Digital AlphaStation 600 5/266, 266MHz, s/n

8192 byte page size, 1 processor.
total memory = 704 MB
(2176 KB reserved for PROM, 701 MB used by NetBSD)
avail memory = 681 MB
mainbus0 (root)

cpu0 at mainbus0: ID 0 (primary), 21164-5

cia0 at mainbus0: DECchip 2117x Core Logic Chipset (ALCOR/ALCOR2), pass 2
pci0 at cia0 bus 0

tlp0 at pci0 dev 7 function 0: DECchip 21140 Ethernet, pass 1.2

tlp0: interrupting at kn20aa irq 8
tlp0: DEC DE500-XA, Ethernet address 00:00:f8:04:67:93
tlp0: 10baseT, 100baseTX, 100baseTX-FDX, 10baseT-FDX
ppb0 at pci0 dev 8 function 0: Digital Equipment DC21050 PCI-PCI Bridge (rev. 0x02)
pci1 at ppb0 bus 1

tlp1 at pci1 dev 0 function 0: DECchip 21040 Ethernet, pass 2.4
tlp1: interrupting at kn20aa irq 16
tlp1: Ethernet address 00:00:f8:21:2c:2a
tlp1: 10baseT, 10baseT-FDX, 10base5, manual
isp0 at pci1 dev 1 function 0: QLogic 1020 Fast Wide SCSI HBA
isp0: interrupting at kn20aa irq 17
isp0: invalid NVRAM header

isp1 at pci1 dev 2 function 0: QLogic 1020 Fast Wide SCSI HBA
isp1: interrupting at kn20aa irq 18
isp1: invalid NVRAM header

pceb0 at pci0 dev 10 function 0: Intel 82375EB/SB PCI-EISA Bridge (rev. 0x05)
3Com 3CR990-TX-97 10/100 Ethernet with 3XP (ethernet network, revision 0x02) at pci0 dev 12 function 0 not configured
eisa0 at pceb0
isa0 at pceb0
lpt0 at isa0 port 0x3bc-0x3bf irq 7
com0 at isa0 port 0x3f8-0x3ff irq 4: ns16550a, working fifo
com0: console
com1 at isa0 port 0x2f8-0x2ff irq 3: ns16550a, working fifo
pckbc0 at isa0 port 0x60-0x64
attimer0 at isa0 port 0x40-0x43: AT Timer
pcppi0 at isa0 port 0x61

midi0 at pcppi0: PC speaker (CPU-intensive output)
spkr0 at pcppi0
isabeep0 at pcppi0
fdc0 at isa0 port 0x3f0-0x3f7 irq 6 drq 2

mcclock0 at isa0 port 0x70-0x71: mc146818 compatible time-of-day clock
attimer0: attached to pcppi0

scsibus0 at isp0: 16 targets, 8 luns per target

scsibus1 at isp1: 16 targets, 8 luns per target

scsibus0: waiting 2 seconds for devices to settle...
scsibus1: waiting 2 seconds for devices to settle...

fd0 at fdc0 drive 0: 1.44MB, 80 cyl, 2 head, 18 sec

sd0 at scsibus0 target 1 lun 0: <SEAGATE, ST373405LC, 0002> disk fixed
sd0: 70007 MB, 29550 cyl, 8 head, 606 sec, 512 bytes/sect x 143374741 sectors
sd0: sync (100.00ns offset 12), 16-bit (20.000MB/s) transfers, tagged queueing
sd1 at scsibus0 target 2 lun 0: <SEAGATE, ST373405LC, 0003> disk fixed
sd1: 70007 MB, 29550 cyl, 8 head, 606 sec, 512 bytes/sect x 143374741 sectors
sd1: sync (100.00ns offset 12), 16-bit (20.000MB/s) transfers, tagged queueing
cd0 at scsibus0 target 5 lun 0: <DEC, RRD45 (C) DEC, 1645> cdrom removable
cd0: async, 8-bit transfers

Kernelized RAIDframe activated

root on sd0a dumps on sd0b
root file system type: ffs

Enter pathname of shell or RETURN for /bin/sh:
We recommend creating a non-root account and using su(1) for root access.
No entry for terminal type "dumb";
using dumb terminal settings.
# mount -u /
# mount -r /usr
# ifconfig tlp0 172.19.144.6 netmask 0xfffff000 media 100baseTX
# route add default 172.19.144.1

add net default: gateway 172.19.144.1

#
# /usr/sbin/named
<<< system hangs hard ... no DDB, requires power cycle to reboot >>>

Chuck Cranor

unread,

Jan 28, 2011, 8:50:18 PM1/28/11

to

On Fri, Jan 28, 2011 at 07:27:14PM -0500, Chuck Cranor wrote:
> I just did the following test:
> [1] boot to single user
> [2] "mount -u /" and "mount -r /usr"
> [3] run "ifconfig" and "route add default" to bring up network
> [4] run "/usr/sbin/named" --- system hard hangs here, won't break to
> DDB and requires a power cycle to recover.

In fact, you don't even need to that much. Just run "dig"
(which is linked to pthreads) and the alpha will hard hang:

Enter pathname of shell or RETURN for /bin/sh:
We recommend creating a non-root account and using su(1) for root access.
No entry for terminal type "dumb";
using dumb terminal settings.
# mount -u /
# mount -r /usr
#

#
#
# dig
<<< system hard hangs, power cycle required >>>

chuck

Chuck Cranor

unread,

Jan 28, 2011, 10:56:51 PM1/28/11

to

On Fri, Jan 28, 2011 at 07:27:14PM -0500, Chuck Cranor wrote:

> My userland is a 4.0 one, in case that matters.

I think userland version does matter, because the only place
cpu_setfunc() is called is from sys/compat/sa/compat_sa.c ... so
you need a binary that uses the SA version of threads in order to
hang your system... like one from 4.0.

I don't think a 5.0 threaded program would use the code in
compat_sa.c? So cpu_setfunc() wouldn't get called in that case.

Here's a static linked 4.0 version of dig that hangs my netbsd-5
alpha kernel:

http://yogi.pdl.cmu.edu/~chuck/tmp/dig.static.gz

in case anyone wants to try it.

chuck

Chuck Cranor

unread,

Jan 29, 2011, 11:03:23 PM1/29/11

to

On Fri, Jan 28, 2011 at 07:27:14PM -0500, Chuck Cranor wrote:

> On Sat, Jan 29, 2011 at 12:43:12AM +0100, Martin Husemann wrote:

> > I'm not sure I see what you mean - the code in -current looks pretty much
> > identical to me.

hi-

I looked a bit more at the changes you submitted for pullup
in netbsd-5 ticket 798:

http://releng.netbsd.org/cgi-bin/req-5.cgi?show=798

I don't know enough about that code to understand why you wanted
that change, but I was able to factor it out a bit anyway....

The changes to vm_machdep.c appear to have removed the
call to cpu_setfunc() from cpu_lwp_fork() and replaced it
with the actual content of the old cpu_setfunc() function.
The net result here is that the behavior of cpu_lwp_fork()
does not change, but it no longer calls cpu_setfunc().

The old cpu_setfunc() is now replace with a new stripped
down version that calls setfunc_trampoline() instead of
lwp_trampoline() [the s3 register is no longer setup or used]
The only thing that calls the cpu_setfunc() is now compat_sa.c
( cpu_lwp_fork() no longer calls it ).

The main difference between the lwp_trampoline() and the new
setfunc_trampoline() is that the setfunc_trampoline() no longer
calls lwp_startup(). Removin the call to lwp_startup() causes
the alpha to hang hard if you run a 4.0 threaded app like "dig"...

So, lwp_startup() does something that keeps the system from
hanging. To figure out what that was, I started adding in bits
of lpw_startup() into the setfunc_trampoline() until the system
stopped hanging. It turns out the two critical bits are:

void
xlwp_startup(struct lwp *prev, struct lwp *new)
{
if (prev != NULL) {
curcpu()->ci_mtx_count++; /*YES*/
prev->l_ctxswtch = 0; /*YES*/
}
}

Put that much of lwp_startup() back into setfunc_trampoline(), and
the system no longer hangs when you run "dig"... a complete diff
that applies to a netbsd-5 branch checked out on date 10-Jun-2009
(e.g. with "cvs -q update -r netbsd-5 -dP -D 10-Jun-2009") is included
at the end.

You need both the l_ctxswtch and ci_mtx_count statements.
If you comment out the "l_ctxswtch" statement, the system hangs
as soon as you run "dig". If you comment out the ci_mtx_count
statement, the system runs "dig" (it prints an error message to
console) but then hangs when "dig" exits. Couldn't get DDB in
either case.

What parts of lwp_startup() are you trying to avoid?

chuck

Index: arch/alpha/alpha/locore.s
===================================================================
RCS file: /cvsroot/src/sys/arch/alpha/alpha/locore.s,v
retrieving revision 1.113.10.1
diff -u -r1.113.10.1 locore.s
--- arch/alpha/alpha/locore.s 9 Jun 2009 17:38:38 -0000 1.113.10.1
+++ arch/alpha/alpha/locore.s 30 Jan 2011 03:47:33 -0000
@@ -752,6 +752,9 @@
* Simplified version of above: don't call lwp_startup()
*/
LEAF_NOPROFILE(setfunc_trampoline, 0)
+ mov v0, a0 /* NEW */
+ mov s3, a1 /* NEW */
+ CALL(xlwp_startup) /* NEW */
mov s0, pv
mov s1, ra
mov s2, a0
Index: arch/alpha/alpha/vm_machdep.c
===================================================================
RCS file: /cvsroot/src/sys/arch/alpha/alpha/vm_machdep.c,v
retrieving revision 1.96.30.1
diff -u -r1.96.30.1 vm_machdep.c
--- arch/alpha/alpha/vm_machdep.c 9 Jun 2009 17:38:39 -0000 1.96.30.1
+++ arch/alpha/alpha/vm_machdep.c 30 Jan 2011 03:47:33 -0000
@@ -228,6 +228,8 @@
(u_int64_t)exception_return; /* s1: ra */
up->u_pcb.pcb_context[2] =
(u_int64_t)arg; /* s2: arg */
+ up->u_pcb.pcb_context[3] =
+ (u_int64_t)l; /* s3: lwp */
up->u_pcb.pcb_context[7] =
(u_int64_t)setfunc_trampoline; /* ra: assembly magic */
}
Index: kern/kern_lwp.c
===================================================================
RCS file: /cvsroot/src/sys/kern/kern_lwp.c,v
retrieving revision 1.126.2.2
diff -u -r1.126.2.2 kern_lwp.c
--- kern/kern_lwp.c 8 Mar 2009 03:15:36 -0000 1.126.2.2
+++ kern/kern_lwp.c 30 Jan 2011 03:48:08 -0000
@@ -706,6 +706,22 @@
}
}

+
+/*
+ * Called by MD code when a new LWP begins execution. Must be called
+ * with the previous LWP locked (so at splsched), or if there is no
+ * previous LWP, at splsched.
+ */
+void xlwp_startup(struct lwp *prev, struct lwp *new);
+void
+xlwp_startup(struct lwp *prev, struct lwp *new)
+{
+ if (prev != NULL) {
+ curcpu()->ci_mtx_count++; /*YES*/
+ prev->l_ctxswtch = 0; /*YES*/
+ }
+}
+
/*
* Exit an LWP.
*/

Martin Husemann

unread,

Jan 30, 2011, 6:23:25 AM1/30/11

to

Hi Chuck,

thanks for looking into this!

On Sat, Jan 29, 2011 at 11:03:23PM -0500, Chuck Cranor wrote:
> The main difference between the lwp_trampoline() and the new
> setfunc_trampoline() is that the setfunc_trampoline() no longer
> calls lwp_startup().

Yes, that was the intention of it all - but unfortunately I can not
remember why, and where this was discussed. The original change happened
back in 2008 and I then tried to make sure all ports behave consistently
and filed a bunch of PRs.

I think it was about cpu_setfunc() being callable later, for already
running lwps again - but apparently that does not happen (anymore?) -
maybe somebody else remembers details?

I'll have to dig through old PRs and (unfortunately not very clear) commit
messages.

Can you please file a PR? We should make up our mind if cpu_setfunc is
supposed to call lwp_startup() or not, make sure all ports do it consistently,
and find out if the SA compat code needs fixing (e.g. arrange for a call to
lwp_startup by other means). Or backout all the changes to various ports
that introduced the separate trampoline.

Thanks,

Chuck Cranor

unread,

Jan 31, 2011, 4:04:59 PM1/31/11

to

On Sun, Jan 30, 2011 at 12:23:25PM +0100, Martin Husemann wrote:
> I'll have to dig through old PRs and (unfortunately not very clear) commit
> messages.
>
> Can you please file a PR? We should make up our mind if cpu_setfunc is
> supposed to call lwp_startup() or not, make sure all ports do it consistently,
> and find out if the SA compat code needs fixing (e.g. arrange for a call to
> lwp_startup by other means). Or backout all the changes to various ports
> that introduced the separate trampoline.

Hi-

I filed a PR entitled:

4.0 sa threaded apps hard hang netbsd-5 and HEAD kernels on some ports [cpu_setfunc() related]

http://gnats.netbsd.org/cgi-bin/query-pr-single.pl?number=44500

For this.

I got some additional information in the process...

First, the hard hang occurs in mi_switch() in the following loop (I
added the debug printf):

/*
* We may need to spin-wait for if 'newl' is still
* context switching on another CPU.
*/

if (newl->l_ctxswtch != 0) {
u_int count;
count = SPINLOCK_BACKOFF_MIN;
while (newl->l_ctxswtch) {
SPINLOCK_BACKOFF(count);
printf("POINTA\n"); /*XXXCDC*/
}
}

it just prints "POINTA" endlessly --- it never exits that loop. Note
my system only has one CPU (so the case the comment is looking for does
not apply). Because interrupts are disabled, it is not possible to break
to DDB if you are stuck in that while() loop, your system is hung (that's
why you have to power cycle).

I also did a survey of some of the ports in the tree, and it looks
some port's cpu_setfunc() still call lwp_startup() while other ports
have been modified (like the alpha) to not call it:

arch cpu_setfunc calls does it call lpw_startup? when changed?
------- ---------------------- ----------------------------------------
acorn26 lwp_trampoline yes
alpha setfunc_trampoline no (vm_machdep.1.100, 2009/06/01)
arm32 lwp_trampoline yes
hppa setfunc_trampoline no (vm_machdep.c 1.36, 2009/06/03)
m68k setfunc_trampoline no (vm_machdep.c 1.28, 2009/05/30)
mips setfunc_trampoline no (vm_machdep.c 1.123, 2009/05/30)
powerpc setfunc_trampoline no (vm_machdep.c 1.77, 2009/06/07)
sh3 lwp_setfunc_trampoline no (never called lpw_startup?)
sparc lwp_setfunc_trampoline no (vm_machdep.c 1.100, 2009/05/29)
sparc64 lwp_setfunc_trampoline no (vm_machep.c 1.89, 2009/05/30)
x86 lwp_trampoline yes

the "no" ports are likely to have problems with compat_sa binaries,
I think.

The most interesting one is the sh3 (because it didn't get the change
in 2009) and the commit comment from mrg on the sparc (because it
is the earliest instance of this change --- 2009/05/29):

----------------------------
revision 1.100
date: 2009/05/29 22:06:56; author: mrg; state: Exp; lines: +11 -5
fix up cpu_setfunc() as noted by uwe:

- don't call lwp_startup for cpu_setfunc() users
- introduce lwp_setfunc_trampoline instead
- no need to set the "new" lwp for setfunc
----------------------------

But I couldn't find where mrg said that uwe@netbsd noted it.

chuck

Valeriy E. Ushakov

unread,

Jan 31, 2011, 5:30:17 PM1/31/11

to

Probably on ICB.

As far as I can tell, sh3 never called lwp_startup for cpu_setfunc()
users, as cvs tells me for src/sys/arch/sh3/sh3/locore_subr.S

revision 1.30.4.3
date: 2007/03/25 01:59:02; author: uwe; state: Exp; lines: +47 -107
Adapt sh3 to yamt-idlelwp.

which introduced current code in its present form. I don't remember
any details, I guess I was just doing what I was told.

My vague recollection is that my understanding at the time was that
lwp_startup was intended for "new" lwps (cpu_lwp_fork), but
lwp_setfunc_trampoline (cpu_setfunc) was intended for lwps that
already exist, to make them run some new code for an upcall.

I guess that was what I "noted" to mrg.

-uwe

Valeriy E. Ushakov

unread,

Jan 31, 2011, 7:20:47 PM1/31/11

to

On Tue, Feb 01, 2011 at 01:30:17 +0300, Valeriy E. Ushakov wrote:

> I guess that was what I "noted" to mrg.

Actually, I think it was this mail:

http://mail-index.netbsd.org/source-changes-d/2009/05/27/msg000502.html

quoted below for convenience

| >
| >
| > To generate a diff of this commit:

| > cvs rdiff -u -r1.98 -r1.99 src/sys/arch/sparc/sparc/vm_machdep.c
|
| Is that correct? It used to be in 4.0, but in 5.0 lwp trampoline
| calls lwp_startup before calling the lwp function. For SA the
| trampoline used to be used for recycling lwps for upcalls. Is that
| correct to call lwp_startup in that case? There never was any
| official note on that when SA became undead.

-uwe

Valeriy E. Ushakov

unread,

Jan 31, 2011, 7:54:33 PM1/31/11

to

On Mon, Jan 31, 2011 at 16:04:59 -0500, Chuck Cranor wrote:

> I also did a survey of some of the ports in the tree, and it looks
> some port's cpu_setfunc() still call lwp_startup() while other ports
> have been modified (like the alpha) to not call it:

Ok, glancing at the code I think that cpu_setfunc must arrange for the
trampoline to call lwp_startup, since, even though we "recycle" an
existing lwp it still acts like a "new" one since it's not resumed in
mi_switch() - so (as you discovered) we need lwp_startup to run the
code that for an "old" lwp we do at kern_synch.c:791

I guess what happened is that

. at the time of yamt-idlelwp SA were already dead

. when I adapted sh3 to yamt-idlelwp I didn't bother to investigate
and just conservatively left cpu_setfunc as it was (i.e. w/out call
to lwp_startup)

. that didn't matter since SA were dead

. I got mrg@ confused with my question
http://mail-index.netbsd.org/source-changes-d/2009/05/27/msg000502.html

. mrg changed sparc and sparc64

. martin propagated that change to other ports

. all that still didn't matter since SA were dead

. SA were resurrected, but, I guess, not tested that much on !x86

Sorry :)

-uwe

Valeriy E. Ushakov

unread,

Jan 31, 2011, 9:11:33 PM1/31/11

to

On Tue, Feb 01, 2011 at 03:54:33 +0300, Valeriy E. Ushakov wrote:

> Ok, glancing at the code I think that cpu_setfunc must arrange for the
> trampoline to call lwp_startup, since, even though we "recycle" an
> existing lwp it still acts like a "new" one since it's not resumed in
> mi_switch() - so (as you discovered) we need lwp_startup to run the
> code that for an "old" lwp we do at kern_synch.c:791

I fixed sh3.

-uwe

Michael L. Hitch

unread,

Feb 1, 2011, 12:56:12 PM2/1/11

to

On Mon, 31 Jan 2011, Chuck Cranor wrote:

> I also did a survey of some of the ports in the tree, and it looks
> some port's cpu_setfunc() still call lwp_startup() while other ports
> have been modified (like the alpha) to not call it:
>
> arch cpu_setfunc calls does it call lpw_startup? when changed?
> ------- ---------------------- ----------------------------------------

...

> m68k setfunc_trampoline no (vm_machdep.c 1.28, 2009/05/30)

I can verify that m68k does hang the same way, and that reverting the
cpu_setfunc() change makes it work.

Mike
--
Michael L. Hitch mhi...@montana.edu
Computer Consultant
Information Technology Center
Montana State University Bozeman, MT USA

--

Chuck Cranor

unread,

Feb 1, 2011, 2:05:47 PM2/1/11

to

On Tue, Feb 01, 2011 at 03:54:33AM +0300, Valeriy E. Ushakov wrote:
> . mrg changed sparc and sparc64
> . martin propagated that change to other ports
> . all that still didn't matter since SA were dead
> . SA were resurrected, but, I guess, not tested that much on !x86
> Sorry :)

That's ok, I know the issues with SA threads were not easy to deal
with. Thanks for looking into it!

Martin has marked in kern/44500 that he'll cleanup and request pullups,
so we'll get it resolved.

chuck