SMP status ?

BERTRAND Joel

Dec 15, 2010, 7:41:56 AM

Hello,

Just a question: is sparc32 SMP usable? I have to install NetBSD 5 (or
current) on an SS20 (with SM71 or RT626). Last time I checked, SMP
support was partially broken, and I haven't seen any news for a long time.

Regards,

JKB


Martin Husemann

Dec 15, 2010, 9:21:55 AM

On Wed, Dec 15, 2010 at 01:41:56PM +0100, BERTRAND Joel wrote:
> Hello,
>
> Just a question. Is sparc32 smp usable ? I have to install NetBSD 5
> (or current) on a SS20 (with SM71 or RT626). Last time I have checked, smp
> support was partially broken and I haven't seen news for a long time.

Try -current, it should mostly work there. No chance for 5.x.

Martin

BERTRAND Joel

Dec 15, 2010, 9:48:50 AM

Martin Husemann wrote:

> On Wed, Dec 15, 2010 at 01:41:56PM +0100, BERTRAND Joel wrote:
>> Hello,
>>
>> Just a question. Is sparc32 smp usable ? I have to install NetBSD 5
>> (or current) on a SS20 (with SM71 or RT626). Last time I have checked, smp
>> support was partially broken and I haven't seen news for a long time.
>
> Try -current, it should mostly work there. No chance for 5.x.

Thanks for your answer. With SM71 or RT626? Last time I checked,
if I remember correctly, SuperSPARC ran better than HyperSPARC, but I
have some Ross processors.

JKB

BERTRAND Joel

Dec 20, 2010, 1:59:52 PM

Martin Husemann wrote:

> On Wed, Dec 15, 2010 at 01:41:56PM +0100, BERTRAND Joel wrote:
>> Hello,
>>
>> Just a question. Is sparc32 smp usable ? I have to install NetBSD 5
>> (or current) on a SS20 (with SM71 or RT626). Last time I have checked, smp
>> support was partially broken and I haven't seen news for a long time.
>
> Try -current, it should mostly work there. No chance for 5.x.

I'm trying NetBSD-current (up-to-date binary sets). The MP kernel is more
stable than the 5.99.24 I installed from source a long time ago.

But when I put more than one processor in this SS20, the system randomly
reboots under load, without any error on the serial console.

I'm now testing with only one processor, and I believe the system is stable
enough to build the system from source.

Test configuration:

SMCC SPARCstation 10/20 UP/MP POST version VRV3.45 (09/11/95)

CPU_#0 HyperSPARC ROSS RT620/RT626 0x00080000 Bytes ECache

CPU_#1 ******* NOT installed *******
CPU_#2 ******* NOT installed *******
CPU_#3 ******* NOT installed *******

>>>>> Power On Self Test (POST) is running .... <<<<<

SPARCstation 20 (1 X RT626), No Keyboard
ROM Rev. 2.25R hyperSPARC, 512 MB memory installed, Serial #7892611.
Ethernet address 8:0:20:78:6e:83, Host ID: 72786e83.

...

NetBSD riemann 5.99.41 NetBSD 5.99.41 (GENERIC.MP) #0: Fri Dec 17 22:07:08 UTC 2010 bui...@b8.netbsd.org:/home/builds/ab/HEAD/sparc/201012172000Z-obj/home/builds/ab/HEAD/src/sys/arch/sparc/compile/GENERIC.MP sparc

Regards,

JKB

matthew green

Jan 6, 2011, 4:09:50 AM

are you able to test an NFS world vs local disks? my nfs ss20
hasn't crashed or reset for months.

also, have you tested with supersparc cpus? i don't have a
HS system these days.

.mrg.

BERTRAND Joel

Jan 6, 2011, 4:26:58 AM

matthew green wrote:

> are you able to test an NFS world vs local disks?

This SS20 runs NetBSD from its local disk (without RAID). The
workstation uses NIS from a Linux NIS server (with some hand-made table
modifications so that the NetBSD NIS client can get information from the
Linux NIS server), and all normal accounts are exported by a Linux server
(NFSv3). But the system can reboot when only root is logged in on the
console (without any mounted NFS volume or other logged-in users). I don't
have an nfsroot image to test without the internal disk.

> my nfs ss20
> hasn't crashed or reset for months.
>
> also, have you tested with supersparc cpus? i don't have a
> HS system these.

I can remove the HS modules to test with dual SM71. I will test as soon as
possible. Maybe this bug is HS-specific. If someone has good knowledge
of these CPUs, I can open ssh access to this workstation.

Best regards,

JKB

matthew green

Jan 6, 2011, 7:36:05 AM

OK, my systems are not stable either. can you try this patch and let me
know how it goes? it's just a workaround, and may not apply cleanly.
(it's from a while back when i was more familiar with this problem.)

http://www.netbsd.org/~mrg/ipi_savefpstate.diff

BERTRAND Joel

Jan 6, 2011, 8:49:37 AM

matthew green wrote:

> OK, my systems are not stable either. can you try this patch and let me
> know how it goes? it's just a workaround, and may not apply cleanly.
> (it's from a while back when i was more familiar with this problem.)
>
> http://www.netbsd.org/~mrg/ipi_savefpstate.diff

With SS-II or HS ?

BERTRAND Joel

Jan 9, 2011, 9:09:45 AM

BERTRAND Joel wrote:
> matthew green wrote:

>> OK, my systems are not stable either. can you try this patch and let me
>> know how it goes? it's just a workaround, and may not apply cleanly.
>> (it's from a while back when i was more familiar with this problem.)
>>
>> http://www.netbsd.org/~mrg/ipi_savefpstate.diff
>
> With SS-II or HS ?

Hello,

I have applied your patch to an up-to-date NetBSD source tree. The only
thing I added is a prototype for ipi_savefpstate (void
ipi_savefpstate(struct lwp *);).

When the SS20 boots, I get:

NetBSD 5.99.43 (GENERIC.MP) #1: Sun Jan 9 14:59:20 CET 2011
root@riemann:/usr/src/obj/sys/arch/sparc/compile/GENERIC.MP
total memory = 511 MB
avail memory = 496 MB
timecounter: Timecounters tick every 10.000 msec
bootpath: /iommu@f,e0000000/sbus@f,e0001000/espdma@f,400000/esp@f,800000/sd@3,0
mainbus0 (root): SUNW,SPARCstation-20: hostid 72786e83
cpu0 at mainbus0: mid 8: RT620/625 @ 200 MHz, on-chip FPU
cpu0: 512K byte write-back, 32 bytes/line, sw flush: cache enabled
cpu1 at mainbus0: mid 10: RT620/625 @ 200 MHz, on-chip FPU
cpu1: 512K byte write-back, 32 bytes/line, sw flush: cache enabled
obio0 at mainbus0
...
eccmemctl0 at mainbus0 ioaddr 0x0: version 0x0/0x2
cpu0: booting secondary processors: cpu1
scsibus0: waiting 2 seconds for devices to settle...
wskbd0 at kbd0 mux 1
dbri0: speakerbox detected
dbri0: cs4215 rev E found at offset 8
stray interrupt cpu0 ipl 0xc pc=0xf013b24c npc=0xf014ec30 psr=0x1e4000c7<S,PS>
audio0 at dbri0: full duplex, playback, capture, mmap
sd0 at scsibus0 target 3 lun 0: <FUJITSU, MAW3073NC, 0104> disk fixed
sd0: 70136 MB, 78753 cyl, 2 head, 911 sec, 512 bytes/sect x 143638992 sectors
sd0: sync (100.00ns offset 15), 8-bit (10.000MB/s) transfers, tagged queueing
cd0 at scsibus0 target 6 lun 0: <TOSHIBA, XM-4101TASUNSLCD, 3424> cdrom removable
cd0: async, 8-bit transfers
kbd0: reset failed
ra dumps on sd0b
root file system type: ffs
xcall(cpu1,0xf000abdc): couldn't ping cpus: cpu0
trap type 0x7: pc=0xf01508e8 npc=0xf013ff74 psr=0x1e100bc0<S,PS>
kernel: alignment fault trap
Stopped in pid 11.1 (date) at netbsd:lwp_exit_switchaway+0x4c:  ld [%l2 + 0x10], %o0
db{1}> trace
lwp_exit_switchaway(0xf407b8c0, 0x0, 0xf42a19d0, 0xf03e4c08, 0x0, 0x16) at netbsd:exit1+0x40c
exit1(0xf407b8c0, 0x0, 0x20032464, 0xf4235fb0, 0x2, 0xf0502000) at netbsd:sys_exit+0x2c
sys_exit(0xf407b8c0, 0xf4235f20, 0xf4235f40, 0x0, 0x0, 0x6) at netbsd:syscall_plain+0xe0
syscall_plain(0x401, 0xf4235fb0, 0x20275da8, 0x1, 0x0, 0x27770) at 0xf0008844
db{1}>

Best regards,

JKB

matthew green

Jan 9, 2011, 9:30:54 AM

> http://www.netbsd.org/~mrg/ipi_savefpstate.diff

oops, i forgot to reply to myself last night and say that
this patch is obviously wrong.

i'm working on a "fixed" one (it's still only a workaround.)

BERTRAND Joel

Jan 9, 2011, 9:59:43 AM

matthew green wrote:

>> http://www.netbsd.org/~mrg/ipi_savefpstate.diff
>
> oops, i forgot to reply to myself last night and say that
> this patch is obviously wrong.
>
> i'm working on a "fixed" one (it's still only a workaround.)

Don't worry, I shall wait for your new one.

Regards,

JKB

matthew green

Jan 10, 2011, 5:47:41 AM

i'm curious if anyone else has success with the following change. it
has survived at least 3x longer than normal under load for me.

thanks.

Index: cpu.c
===================================================================
RCS file: /cvsroot/src/sys/arch/sparc/sparc/cpu.c,v
retrieving revision 1.223
diff -p -r1.223 cpu.c
*** cpu.c 22 Jun 2010 18:29:02 -0000 1.223
--- cpu.c 10 Jan 2011 10:46:41 -0000
*************** void
*** 500,506 ****
  cpu_init_system(void)
  {

!     mutex_init(&xpmsg_mutex, MUTEX_SPIN, IPL_VM);
  }

  /*
--- 500,506 ----
  cpu_init_system(void)
  {

!     mutex_init(&xpmsg_mutex, MUTEX_SPIN, IPL_SCHED);
  }

  /*

Michael

Jan 10, 2011, 10:00:07 AM

-----BEGIN PGP SIGNED MESSAGE-----
Hash: SHA1

Hello,

On Jan 10, 2011, at 5:47 AM, matthew green wrote:

> i'm curious if anyone else has success with the following change. it
> has survived at least 3x longer than normal under load for me.
>
> thanks.
>
>
> Index: cpu.c
> ===================================================================
> RCS file: /cvsroot/src/sys/arch/sparc/sparc/cpu.c,v
> retrieving revision 1.223
> diff -p -r1.223 cpu.c
> *** cpu.c 22 Jun 2010 18:29:02 -0000 1.223
> --- cpu.c 10 Jan 2011 10:46:41 -0000
> *************** void
> *** 500,506 ****
> cpu_init_system(void)
> {
>
> ! mutex_init(&xpmsg_mutex, MUTEX_SPIN, IPL_VM);
> }
>
> /*
> --- 500,506 ----
> cpu_init_system(void)
> {
>
> ! mutex_init(&xpmsg_mutex, MUTEX_SPIN, IPL_SCHED);
> }
>
> /*

Doesn't seem to make a difference here (2x 125MHz HyperSPARC in an
SS20).
Something completely different though: now the dbri driver causes a
locking-against-myself error when calling bus_dmamem_map() from
dbri_malloc() when attaching audio. That used to work, even on MP
kernels. The panic doesn't happen on UP kernels, and it is
always on cpu1.

have fun
Michael

-----BEGIN PGP SIGNATURE-----
Version: GnuPG v1.4.7 (Darwin)

iQEVAwUBTSse+MpnzkX8Yg2nAQIL+Af/eU2yrFTw3ecx6VQH/W6Z66RdY92eKM2a
9SkjFs8fvSRbrWrWTdf1NtGD+czQQrfNqhuITwt0eznPaC2n5gy+EkVJ3hbgfhaH
Cy/5Sa3SWms1XQybyY0B+3kMVGw5cpy5woj3760iUiABKcrUS1e8gVb9/EMMYxeC
jBV/wke3XP3lXIifUFDYZWWxyEKHbMNKVmN+cvXtk9+IZZaTgXVhI7B1+mLm9kEz
TG0p17603+1MUZOGGZtPFdfXJul3xlan9rywDeUE+4EFiWjYDtzhF9RQj8eJhAmN
tmz9LXmpxjQGd+DsGKezqfP+D5J0vZpavwTkoaHQok/RW9CfuED2zQ==
=LWH4
-----END PGP SIGNATURE-----

matthew green

Jan 10, 2011, 11:59:14 AM

> > i'm curious if anyone else has success with the following change. it
> > has survived at least 3x longer than normal under load for me.

my ss20 crashed after about 5 hours with the same problem, but
then the next boot crashed almost instantly.. in about 2 mins.

> > Index: cpu.c
> > ===================================================================
> > RCS file: /cvsroot/src/sys/arch/sparc/sparc/cpu.c,v
> > retrieving revision 1.223
> > diff -p -r1.223 cpu.c
> > *** cpu.c 22 Jun 2010 18:29:02 -0000 1.223
> > --- cpu.c 10 Jan 2011 10:46:41 -0000
> > *************** void
> > *** 500,506 ****
> > cpu_init_system(void)
> > {
> >
> > ! mutex_init(&xpmsg_mutex, MUTEX_SPIN, IPL_VM);
> > }
> >
> > /*
> > --- 500,506 ----
> > cpu_init_system(void)
> > {
> >
> > ! mutex_init(&xpmsg_mutex, MUTEX_SPIN, IPL_SCHED);
> > }
> >
> > /*
>
> Doesn't seem to make a difference here ( 2x 125MHz HyperSPARC in an
> SS20 )
> Something completely different though - now the dbri driver causes a
> locking-against-myself error when calling bus_dmamem_map() from
> dbri_malloc() when attaching audio - that used to work, even on MP
> kernels. THe panic doesn't happen on UP kernels and the panic is
> always on cpu1.

what are the addresses with the bug? ie last locker, what lock, etc.

thanks,

.mrg.

Michael

Jan 10, 2011, 6:30:32 PM

-----BEGIN PGP SIGNED MESSAGE-----
Hash: SHA1

Hello,

On Jan 10, 2011, at 11:59 AM, matthew green wrote:

>
>>> i'm curious if anyone else has success with the following change.
>>> it
>>> has survived at least 3x longer than normal under load for me.
>
> my ss20 crashed after about 5 hours with the same problem, but
> then the next boot crashed almost instantly.. in about 2 mins.

Ah, so it's not just me. Mine booted fine but crashed about 1 minute
into build.sh.

>>> Index: cpu.c
>>> ===================================================================
>>> RCS file: /cvsroot/src/sys/arch/sparc/sparc/cpu.c,v
>>> retrieving revision 1.223
>>> diff -p -r1.223 cpu.c
>>> *** cpu.c 22 Jun 2010 18:29:02 -0000 1.223
>>> --- cpu.c 10 Jan 2011 10:46:41 -0000
>>> *************** void
>>> *** 500,506 ****
>>> cpu_init_system(void)
>>> {
>>>
>>> ! mutex_init(&xpmsg_mutex, MUTEX_SPIN, IPL_VM);
>>> }
>>>
>>> /*
>>> --- 500,506 ----
>>> cpu_init_system(void)
>>> {
>>>
>>> ! mutex_init(&xpmsg_mutex, MUTEX_SPIN, IPL_SCHED);
>>> }
>>>
>>> /*
>>
>> Doesn't seem to make a difference here ( 2x 125MHz HyperSPARC in an
>> SS20 )
>> Something completely different though - now the dbri driver causes a
>> locking-against-myself error when calling bus_dmamem_map() from
>> dbri_malloc() when attaching audio - that used to work, even on MP
>> kernels. THe panic doesn't happen on UP kernels and the panic is
>> always on cpu1.
>
> what are the addresses with the bug? ie last locker, what lock, etc.

Let me check...
it's rw_vector_enter: locking against myself
lock address 0xf02b644c
owner and current_lwp point at config_interrupts_thread()
rw_enter was called from vm_map_lock
The only difference to other audio drivers is that dbri defers
attaching audio using config_interrupts() - the whole thing is very,
very DMA-centric ( only 8 registers in the whole chip, half of them
for testing ) and even making it talk to the codec without interrupts
enabled would be quite a pain. An earlier call to bus_dmamem_map()
succeeds without trouble, I'll check if pushing it further back makes
any difference.
And let's see if LOCKDEBUG helps.

have fun
Michael

-----BEGIN PGP SIGNATURE-----
Version: GnuPG v1.4.7 (Darwin)

iQEVAwUBTSuWmcpnzkX8Yg2nAQJ5YQf/dbDFlvZ0mWb8Z5aaiXVKEs3Z0cJFv+K8
Tg3kXTaOp7aNysvde1AoxLnTXcUHexFxpA7Ma4wXScVNEkFdhvwF6kWs0VkpAPnA
OCC/0ifDpjg0ko5ETEPZHcexriPGBo2Qc8f4SXX55kcfn0Zv+EaUc/c6x6qXuQXX
sGirCy/M2ZRmXpxE3E5rB/bEz7Pw4nbyXiOX6wzQZq7ILfUXplmWgWDk50O0ae2w
MPSivTnmX57+KYqe/T2IhKA4kyRDpkK2CTBsNd5k8qjVSIy+latePaq9Lu9grMH5
TBy+ovmQc3RjTdgwnCSE4RodICV9vUCLVRfmBYxACOZLsVTbYeKDAA==
=8MZ8
-----END PGP SIGNATURE-----

BERTRAND Joel

Jan 12, 2011, 3:40:17 AM

Michael wrote:

> -----BEGIN PGP SIGNED MESSAGE-----
> Hash: SHA1
>
> Hello,

Hello,

> On Jan 10, 2011, at 11:59 AM, matthew green wrote:
>
>>

>>>> i'm curious if anyone else has success with the following change. it
>>>> has survived at least 3x longer than normal under load for me.
>>
>> my ss20 crashed after about 5 hours with the same problem, but
>> then the next boot crashed almost instantly.. in about 2 mins.
>

> Ah, so it's not just me. Mine booted fine but crashed about 1 minute
> into build.sh.

Same observation with my SS20 (quad RT626; I cannot test a dual-CPU
configuration today, as my other SS20 is building an up-to-date NetBSD).
The system boots to a prompt and then quickly crashes without entering
ddb. It just reboots.

Regards,

JKB

matthew green

Jan 13, 2011, 3:25:11 AM

i've committed a change to -current that should work around this problem,
so any stability issues should be gone. it's just a workaround but it is
worth updating to since it makes things mostly work, at least on my SS20.

it still has the problem, but doesn't crash now, and the "savefp null ipi"
event counters will slowly increase.

if you're adventurous, please try -current sources and apply this patch:

http://www.netbsd.org/~mrg/sparc.kpreempt.diff

for me it's run for almost 2 hours without a problem reported. i chose
the various points to add these new kpreempt_disable() calls by reviewing
what the x86 platform does for it. it *probably* contains points where
preemption is disabled unnecessarily.

andy, can you have a look? i really am not sure why this matters, but
it seems to really help.

.mrg.

BERTRAND Joel

Jan 14, 2011, 12:01:28 PM

matthew green wrote:

> i've committed a change to -current that should work around this problem,
> so any stability issues should be gone. it's just a workaround but it is
> worth updating to since it makes things mostly work, at least on my SS20.
>
> it still has the problem, but doesn't crash now, and the "savefp null ipi"
> event counters will slowly increase.
>
> if you're adventurous, please try -current sources and apply this patch:
>
> http://www.netbsd.org/~mrg/sparc.kpreempt.diff
>
> for me it's run for almost 2 hours without a problem reported. i chose
> the various points to add these new kpreempt_disable() calls by reviewing
> what the x86 platform does for it. it *probably* contains points where
> preemption is disabled unnecessarily.

I have applied this patch on my SS20 (dual Ross only for now; I will test
a quad Ross configuration later).

dmesg shows:

NetBSD 5.99.43 (GENERIC.MP) #0: Fri Jan 14 17:28:22 CET 2011

root@riemann:/usr/src/obj/sys/arch/sparc/compile/GENERIC.MP
total memory = 511 MB
avail memory = 496 MB
timecounter: Timecounters tick every 10.000 msec
bootpath: /iommu@f,e0000000/sbus@f,e0001000/espdma@f,400000/esp@f,800000/sd@3,0
mainbus0 (root): SUNW,SPARCstation-20: hostid 72786e83
cpu0 at mainbus0: mid 8: RT620/625 @ 200 MHz, on-chip FPU
cpu0: 512K byte write-back, 32 bytes/line, sw flush: cache enabled
cpu1 at mainbus0: mid 10: RT620/625 @ 200 MHz, on-chip FPU
cpu1: 512K byte write-back, 32 bytes/line, sw flush: cache enabled

...

cpu0: booting secondary processors: cpu1

...

and I can see some stray interrupt messages:
stray interrupt cpu0 ipl 0xc pc=0xf0148abc npc=0xf0148ac0 psr=0x1e4000c6<S,PS>
stray interrupt cpu0 ipl 0xc pc=0xf0148b4c npc=0xf0148b50 psr=0x1e8000c6<S,PS>
stray interrupt cpu0 ipl 0xc pc=0xf0148b4c npc=0xf0148b50 psr=0x1e8000c6<S,PS>

This SS20 is now building a distribution with -j2, and your patch seems to
fix the stability issue. Thanks a lot.
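
The stray-interrupt messages quoted above can be tallied per CPU and interrupt level, which makes it easier to see whether they all come from one source (here ipl 0xc, that is, level 12). This is only a sketch: the function name count_stray is made up for illustration, and it assumes the message format shown in this thread.

```sh
# count_stray: hypothetical helper, not part of NetBSD.
# Reads kernel messages on stdin and prints "<count> <cpu> ipl <level>"
# for every distinct cpu/level pair seen in "stray interrupt" lines.
# Wrapped continuation lines (e.g. a lone "psr=..." line) are ignored.
count_stray() {
    awk '$1 == "stray" && $2 == "interrupt" { n[$3 " " $4 " " $5]++ }
         END { for (k in n) print n[k], k }' | sort -k 2
}
```

It could be fed from the kernel message buffer, for example with `dmesg | count_stray`.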

Regards,

JKB

BERTRAND Joel

Jan 14, 2011, 5:00:59 PM

Bad news...

The patched kernel randomly ends up in something like a deadlock. From the
serial console, I can still enter ddb:

Stopped in pid 0.1 (system) at netbsd:cpu_Debugger+0x4:  or %o7,%g0,%g1
db{0}> trace
cpu_Debugger(0xf41fd638, 0x0, 0x41, 0xf41fd6b0, 0x0, 0x7fef) at netbsd:zsc_intr_hard+0xf4
zsc_intr_hard(0x0, 0x7ffffc00, 0xf499f8, 0x7a59, 0xffff, 0x809bb000) at netbsd:zshard+0x8
zshard(0xf41fd5f8, 0xf03236b8, 0xf00, 0x1e8000e1, 0x0, 0x6ac) at netbsd:sparc_interrupt44c+0x150
sparc_interrupt44c(0x0, 0x40, 0xf0338a54, 0x1, 0x20000000, 0xf4049000) at netbsd:nullop+0x4
nullop(0xf03cf078, 0x7497, 0x4838, 0xf0502000, 0xf0502010, 0x1c) at netbsd:xcall+0x40
xcall(0xf000a0c4, 0x0, 0x0, 0x0, 0x0, 0x2) at netbsd:resetpriority+0x50
resetpriority(0xf5450000, 0x0, 0xf039d100, 0xf4067c00, 0x0, 0x0) at netbsd:sched_pstats+0x11c
sched_pstats(0xf406cc00, 0x0, 0x64, 0x0, 0x1e4000e6, 0xf03de3e0) at netbsd:uvm_scheduler+0x58
uvm_scheduler(0xf4067600, 0xf0377360, 0xf02e62c4, 0xf4067600, 0x7d, 0x0) at netbsd:main+0x870
main(0x0, 0xfffffff8, 0x0, 0x0, 0xffef2010, 0xf0002400) at netbsd:nmi_sun4m+0xd44

db{0}>

There is no other message on the serial console.

The test workstation was running with dual RT626. I'm now testing the same
kernel with dual SM71.

BERTRAND Joel

Jan 14, 2011, 5:58:13 PM

With the same kernel and two SM71, the system seems more stable than with
two RT626:

riemann:[~] > uptime
11:58PM up 1 hr, 3 users, load averages: 4.65, 4.62, 4.43
riemann:[~] > /sbin/dmesg | grep cpu
cpu0 at mainbus0: mid 8: TMS390Z50 v0 or TMS390Z55 @ 75 MHz, on-chip FPU
cpu0: physical 20K instruction (64 b/l), 16K data (32 b/l), 1024K external (32 b/l): cache enabled
cpu1 at mainbus0: mid 10: TMS390Z50 v0 or TMS390Z55 @ 75 MHz, on-chip FPU
cpu1: physical 20K instruction (64 b/l), 16K data (32 b/l), 1024K external (32 b/l): cache enabled
...

The RT626s run hotter than the SM71s, but I'm not sure that the RT626 core
temperature is higher in a dual RT626 configuration than in a single RT626
configuration (one core per MBus module). At first, I thought that the
RT626 trouble came from overheating. But if it came from temperature, I
should see exactly the same problem with only one CPU, and this SMP kernel
runs perfectly on a UP configuration, even with an RT626.

matthew green

Jan 16, 2011, 1:07:25 AM

yeah, i've seen a couple of crashes, i won't have time to
look at this for a few days, i'll see what happens then.

.mrg.

BERTRAND Joel

Jan 16, 2011, 4:19:54 AM

matthew green wrote:

> yeah, i've seen a couple of crashes, i won't have time to
> look at this for a few days, i'll see what happens then.

I don't know if it will help you, but I'm unable to reproduce the crashes
with the dual SM71 configuration, even with a high load average. Are your
crashes NMI-related?

Regards,

JKB

matthew green

Jan 16, 2011, 4:47:08 AM

> matthew green wrote:
> > yeah, i've seen a couple of crashes, i won't have time to
> > look at this for a few days, i'll see what happens then.
>
> I don't know if it shall help you, but I'm unable to reproduce crashes
> with dual SM71 configuration even with high load average. Are your
> crashes NMI related ?

they only appear with LOCKDEBUG enabled. a normal kernel runs very
well for me now, i haven't seen a problem.

i have a hack to not print strayintr messages for zs in MP, which
should avoid that problem.

.mrg.

Index: intr.c
===================================================================
RCS file: /cvsroot/src/sys/arch/sparc/sparc/intr.c,v
retrieving revision 1.108
diff -p -r1.108 intr.c
*** intr.c 5 Jan 2010 21:38:50 -0000 1.108
--- intr.c 16 Jan 2011 09:46:41 -0000
*************** strayintr(struct clockframe *fp)
*** 133,138 ****
--- 133,151 ----
      char bits[64];
      int timesince;

+ #if defined(MULTIPROCESSOR)
+     /*
+      * XXX
+      *
+      * Don't whine about zs interrupts on MP. We sometimes get
+      * stray interrupts when polled kernel output on cpu>0 eats
+      * the interrupt and cpu0 sees it.
+      */
+ #define ZS_INTR_IPL 12
+     if (fp->ipl == ZS_INTR_IPL)
+         return;
+ #endif
+
      snprintb(bits, sizeof(bits), PSR_BITS, fp->psr);
      printf("stray interrupt cpu%d ipl 0x%x pc=0x%x npc=0x%x psr=%s\n",
          cpu_number(), fp->ipl, fp->pc, fp->npc, bits);

Julian Coleman

Jan 16, 2011, 5:04:13 AM

Hi,

> i've committed a change to -current that should work around this problem,
> so any stability issues should be gone. it's just a workaround but it is
> worth updating to since it makes things mostly work, at least on my SS20.

> if you're adventurous, please try -current sources and apply this patch:
>
> http://www.netbsd.org/~mrg/sparc.kpreempt.diff

I've tried -current with and without this patch on two SS20's. One has two
SM71's, the other has three RT625's (100MHz). Both have 448MB of memory.
They both boot from disk. I tried `build.sh -j 4` on both of them, with
/usr/src and /usr/obj NFS mounted. I can't get past the "cleandir" phase
(there are no files to clean - obj is empty). The SS20 with the SM71's
locks solid and has to be power cycled. The SS20 with the RT625's crashes
with:

Mutex error: mutex_vector_enter: locking against myself

This is repeatable for both machines and the same with or without the patch
above. The kernel that I'm using has DIAGNOSTIC, DEBUG and LOCKDEBUG, so I
was able to look at the details of the crash. The backtrace is:

syscall_plain -> sys_write -> dofilewrite -> vn_write -> VOP_WRITE ->
ffs_write -> ufs_balloc_range -> mutex_enter -> lockdebug_abort1

The mutex enter call is from /usr/src/sys/ufs/ufs/ufs_inode.c:276

                goto out;
        }
        mutex_enter(&uobj->vmobjlock);
-276->  mutex_enter(&uvm_pageqlock);
        for (i = 0; i < npages; i++) {
                UVMHIST_LOG(ubchist, "got pgs[%d] %p", i, pgs[i],0,0);
                KASSERT((pgs[i]->flags & PG_RELEASED) == 0);

and the lock has "last locked : 0x00000000f00bbf8c", which is
/usr/src/sys/miscfs/genfs/genfs_io.c:698

out:
        UVMHIST_LOG(ubchist, "succeeding, npages %d", npages,0,0,0);
        error = 0;
-698->  mutex_enter(&uvm_pageqlock);
        for (i = 0; i < npages; i++) {
                struct vm_page *pg = pgs[i];
                if (pg == NULL) {

However, it's not obvious how the lock taken here could still be held (that is, never released).

Thanks,

J

--
My other computer also runs NetBSD / Sailing at Newbiggin
http://www.netbsd.org/ / http://www.newbigginsailingclub.org/

BERTRAND Joel

Jan 17, 2011, 5:14:08 AM

matthew green wrote:

>> matthew green wrote:
>>> yeah, i've seen a couple of crashes, i won't have time to
>>> look at this for a few days, i'll see what happens then.
>>
>> I don't know if it shall help you, but I'm unable to reproduce crashes
>> with dual SM71 configuration even with high load average. Are your
>> crashes NMI related ?
>
> they only appear with LOCKDEBUG enabled. a normal kernel runs very
> well for me now, i haven't seen a problem.

Just a question: I have built a GENERIC.MP kernel, and the DEBUG option is
not set. How can I check whether LOCKDEBUG is enabled? I haven't found any
LOCKDEBUG option in the config directory.

Regards,

JKB

Julian Coleman

Jan 17, 2011, 8:48:37 AM

Hi

> Just a question: I have built a GENERIC.MP kernel, and the DEBUG option
> is not set. How can I check whether LOCKDEBUG is enabled? I haven't found
> any LOCKDEBUG option in the config directory.

The LOCKDEBUG (and SYSCALL_DEBUG) options were not present in any of the sparc
kernel configuration files. I have just added them (commented out) to
GENERIC, KRUPS, MRCOFFEE and TADPOLE3GX.

You need to add (or uncomment):

options LOCKDEBUG

to your kernel configuration to enable LOCKDEBUG.
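
As an illustration of the advice above, a small derived configuration can pull in GENERIC.MP and switch debugging options on, instead of editing GENERIC itself. This is a sketch under assumptions: the function name, the DEBUG.MP file name used below, and the include path (taken to follow the usual sys/arch/sparc/conf layout) are illustrative, and config(1) may complain if an option is already set by the included file.

```sh
# debug_kernel_config: hypothetical helper, not part of NetBSD.
# Prints a kernel config that includes an existing one and enables
# the given options. $1 is the base config name; the remaining
# arguments are option names.
debug_kernel_config() {
    base=$1
    shift
    printf 'include "arch/sparc/conf/%s"\n' "$base"
    for opt in "$@"; do
        printf 'options %s\n' "$opt"
    done
}
```

For example, `debug_kernel_config GENERIC.MP LOCKDEBUG > sys/arch/sparc/conf/DEBUG.MP` run from the top of the source tree, followed by `./build.sh -m sparc kernel=DEBUG.MP`, should build such a kernel once the tools have been built.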

Thanks,

J


BERTRAND Joel

Jan 17, 2011, 9:19:58 AM

Julian Coleman wrote:

> Hi
>
>> Just a question: I have built a GENERIC.MP kernel, and the DEBUG option
>> is not set. How can I check whether LOCKDEBUG is enabled? I haven't found
>> any LOCKDEBUG option in the config directory.
>
> The LOCKDEBUG (and SYSCALL_DEBUG) options were not present in any of the sparc
> kernel configuration files. I have just added them (commented out) to
> GENERIC, KRUPS, MRCOFFEE and TADPOLE3GX.
>
> You need to add (or uncomment):
>
> options LOCKDEBUG
>
> to your kernel configuration to enable LOCKDEBUG.

So the GENERIC.MP kernel seems to be built without LOCKDEBUG, and my SS20
(with dual Ross) still crashes.

Regards,

JKB

Andrew Doran

Jan 17, 2011, 10:00:20 AM

My general sense is that it's a bad idea to call printf() above IPL_VM
anywhere, since the synchronization around that is still unclear
(what happens in the drivers in particular).

Andrew Doran

Jan 17, 2011, 10:05:06 AM

On Thu, Jan 13, 2011 at 07:25:11PM +1100, matthew green wrote:
>
> i've committed a change to -current that should work around this problem,
> so any stability issues should be gone. it's just a workaround but it is
> worth updating to since it makes things mostly work, at least on my SS20.
>
> it still has the problem, but doesn't crash now, and the "savefp null ipi"
> event counters will slowly increase.
>
> if you're adventurous, please try -current sources and apply this patch:
>
> http://www.netbsd.org/~mrg/sparc.kpreempt.diff
>
> for me it's run for almost 2 hours without a problem reported. i chose
> the various points to add these new kpreempt_disable() calls by reviewing
> what the x86 platform does for it. it *probably* contains points where
> preemption is disabled unnecessarily.
>
> andy, can you have a look? i really am not sure why this matters, but
> it seems to really help.

Hey Matt. Some thoughts. First, whenever we are > IPL_NONE, preemption is off
already (so splvm or whatever does the job). Since interrupts raise the IPL,
you've got the same thing there implicitly while in an interrupt path.

Second, the kernel preemption gunk should do almost nothing on a
non-preemptive machine. So I'd be really surprised if this is having any
effect other than changing timing or something wacky like
code/register/cache layout. evcounters for preemption should always report
zero on sparc (unless somebody went and implemented preemption there!).

>
>
>
> .mrg.

matthew green

Jan 22, 2011, 9:00:32 AM

> Just a question: I have built a GENERIC.MP kernel, and the DEBUG option is
> not set. How can I check whether LOCKDEBUG is enabled? I haven't found any
> LOCKDEBUG option in the config directory.

you can just add, eg.

options DEBUG,LOCKDEBUG,DIAGNOSTIC

to your kernel config to turn them on.

"config -x /netbsd" will display the config embedded in the named
kernel. it's useful for other things, too.
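
The check described above can be scripted by filtering the embedded config for an enabled "options" entry. This is a minimal sketch, not a definitive tool: the function name has_kernel_option is made up, and it assumes the kernel actually carries an embedded config and that enabled options appear on lines starting with "options", possibly comma-separated as in the example above.

```sh
# has_kernel_option: hypothetical helper, not part of NetBSD.
# Reads a kernel config on stdin and succeeds (exit status 0) if the
# option named in $1 is enabled on some "options" line.
has_kernel_option() {
    awk -v want="$1" '
        { sub(/#.*/, "") }              # drop comments, so "#options X" is ignored
        $1 == "options" {
            line = $0
            sub(/^[ \t]*options[ \t]+/, "", line)
            n = split(line, opts, /[ \t,]+/)
            for (i = 1; i <= n; i++) {
                split(opts[i], kv, "=") # tolerate NAME=value forms
                if (kv[1] == want) found = 1
            }
        }
        END { exit (found ? 0 : 1) }
    '
}
```

A possible use is `config -x /netbsd | has_kernel_option LOCKDEBUG && echo "LOCKDEBUG is on"`.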

please update to the latest -current, with no other patches applied.
i'm curious what happens then.

thanks

.mrg.

matthew green

Jan 27, 2011, 1:26:51 AM

i've just committed a workaround that should get rid of all the
xcall() failed to ping messages. i've also committed a change
to the way we handle interrupt evcnt(9) to be more per-cpu.

please update and see if it helps this. you can monitor if this
workaround has triggered by monitoring the kernel evcnt's that
match "IPI mutex_trylock fail" (see: vmstat -e.)
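
To watch whether the workaround is being hit, the matching counters can be summed across CPUs and compared between runs. The following is a sketch only: sum_evcnt is a made-up name, and it assumes that each line of "vmstat -e" output ends with three columns (total, rate, type), so that the total is the third field from the end.

```sh
# sum_evcnt: hypothetical helper, not part of NetBSD.
# Reads "vmstat -e" style output on stdin and prints the summed total
# of all event counters whose line contains the string given in $1.
# Assumes the last three columns are: total, rate, type.
sum_evcnt() {
    awk -v name="$1" 'NF >= 4 && index($0, name) { sum += $(NF - 2) }
                      END { print sum + 0 }'
}
```

For example, `vmstat -e | sum_evcnt "IPI mutex_trylock fail"`, run a few minutes apart under load: a growing number would mean the workaround has triggered in between.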

BERTRAND Joel

Jan 30, 2011, 3:48:40 PM

matthew green wrote:

> i've just committed a workaround that should get rid of all the
> xcall() failed to ping messages. i've also committed a change
> to the way we handle interrupt evcnt(9) to be more per-cpu.
>
> please update and see if it helps this. you can monitor if this
> workaround has triggered by monitoring the kernel evcnt's that
> match "IPI mutex_trylock fail" (see: vmstat -e.)

I won't have enough time to test your patch until next Saturday. I'll
report back once I have tested it.

Regards,

JKB