One of the CPUs is all bound up with system usage. I *know* this is a job
for DTrace, but I'm not yet savvy enough to work out exactly what to do
with it. (I've only previously studied the behaviour of applications.)
Here's what the standard tools tell me. First vmstat:
0 0 0 2220160 1387472 0 73 0 0 0 0 0 0 0 0 0 26000 1436 25997 2 39
60
0 0 0 2220160 1387476 0 0 0 0 0 0 0 0 0 0 0 25945 1589 25952 3 39
58
0 0 0 2220160 1387476 0 0 0 0 0 0 0 0 0 0 0 26070 752 25932 1 38
61
and now mpstat.
CPU minf mjf xcal intr ithr csw icsw migr smtx srw syscl usr sys wt idl
0 3 0 4 13119 411 25292 211 111 61 0 273 0 75 0
24
1 0 0 10 12784 12635 593 29 109 52 0 1135 4 2 0
94
0 0 0 2 13078 424 25171 129 124 116 0 244 1 75 0
25
1 0 0 14 12810 12637 807 75 126 100 0 1232 3 3 0
94
I've shut down all the apps. When I run prstat, prstat is the top task.
Obviously the kernel is doing something, but what? This is a problem which
gets worse over time - a reboot fixes it, but I want to know how to track
down the problem. Hopefully someone can help.
TIA
Rob
Many interrupts. My ideas:
Network traffic? Disconnect while watching.
Keyboard defective? Disconnect while watching.
--
Michael Tosch @ hp : com
--
Thobias Vakayil
Alcatel Development India (ADI)
PH: 2349961/72/86 EXTN :7018
# intrstat
device | cpu0 %tim
-------------+---------------
audioi810#0 | 70 0.0
ehci#0 | 1 0.0
gani#0 | 193 0.4
nvidia#0 | 193 0.1
ohci#0 | 70 0.1
device | cpu0 %tim
-------------+---------------
ata#0 | 1 0.0
audioi810#0 | 168 0.2
ehci#0 | 1 0.0
gani#0 | 122 0.3
nvidia#0 | 122 0.0
ohci#0 | 168 0.1
> This is on an idle system, a W2100z running an OpenSolaris build-24 kernel.
> (DEBUG). I think I had the same problem with the Nevada build that was on
> the box before.
>
> One of the CPUs is all bound up with system usage. I *know* this is a job
> for DTrace, but I'm not yet savvy enough to work out exactly what to do
> with it. (I've only previously studied the behaviour of applications.)
>
> Here's what the standard tools tell me. First vmstat:
>
> 0 0 0 2220160 1387472 0 73 0 0 0 0 0 0 0 0 0 26000 1436 25997 2 39
> 60
The problem is the 26000 interrupts / second.
Try to find out which device is sending the 26000 interrupts using
the "intrstat" command.
It is possible that the number of interrupts reported by "intrstat" is
nowhere near the 26000 interrupts / second. If that is the case, try
the following dtrace script (save as "inststat.d", "chmod +x intrstat.d",
run intrstat.d):
========================================================================
#!/usr/sbin/dtrace -s
interrupt-start
{
self->ts = vtimestamp;
}
interrupt-complete
/arg0 == 0/
{
this->dn_name = "<unknown>";
@counts[this->dn_name, arg1] = count();
@times[this->dn_name, arg1] = sum(vtimestamp - self->ts);
self->ts = 0;
}
interrupt-complete
/arg0 != 0/
{
this->devi = (struct dev_info*)arg0;
this->dn_name = stringof(`devnamesp[this->devi->devi_major].dn_name);
@counts[this->dn_name, this->devi->devi_instance] = count();
@times[this->dn_name, this->devi->devi_instance] =
sum(vtimestamp - self->ts);
self->ts = 0;
}
tick-1sec
{
printf("\n%-10s %-16s %s\n---------------------------------------\n",
"driver", "instance", "intr-count");
printa("%-10s %-16a %@8u\n", @counts);
/* printa(@times); */
clear(@counts);
clear(@times);
}
========================================================================
See also (the yahoo thread is probably about exactly the same problem):
http://www.opensolaris.org/jive/thread.jspa?threadID=2942&tstart=0
http://groups.yahoo.com/group/solarisx86/message/29841
http://groups.yahoo.com/group/solarisx86/message/29998
I forgot to mention that I'd run intrstat and seen nothing untoward. Of
course, now I've got something to try, the machine is behaving perfectly!
> It is possible that the number of interrupts reported by "intrstat" is
> nowhere near the 26000 interrupts / second. If that is the case, try
> the following dtrace script (save as "inststat.d", "chmod +x intrstat.d",
> run intrstat.d):
Cool. That's just what I was looking for.
> See also (the yahoo thread is probably about exactly the same problem):
Looks just the same as mine. Thanks a lot Juergen - lots of interesting
reading in that group. I'll report back if and when I solve my problem.
R
i'm seeing the same symptoms you do. juergen already referenced my thread
on the solarisx86 list.
rdf <li...@thecatflap.co.uk> writes:
> I forgot to mention that I'd run intrstat and seen nothing untoward. Of
> course, now I've got something to try, the machine is behaving perfectly!
hihi, i know that... took me a week of waiting (until today) to see that happen
again.
> > It is possible that the number of interrupts reported by "intrstat" is
> > nowhere near the 26000 interrupts / second. If that is the case, try
> > the following dtrace script (save as "inststat.d", "chmod +x intrstat.d",
> > run intrstat.d):
>
> Cool. That's just what I was looking for.
for me it's the acpi interrupts running up after about a week.
using juergens dtrace script i see:
bernd.charon ~/build/autovec # ./acpi3.d
dtrace: script './acpi3.d' matched 4 probes
CPU ID FUNCTION:NAME
0 48441 :tick-1sec
driver instance intr-count
---------------------------------------
adpu320 0x1 4
bge 0x0 4
hci1394 0x0 106
ohci 0x1 106
<unknown> unix`xc_serv 140
ehci 0x0 172
nvidia 0x0 172
<unknown> unix`cbe_fire 207
<unknown> acpica`acpi_wrapper_isr 16024
--
As they say, never underestimate the bandwidth of a 747 full of CDROMs!
>adpu320 0x1 4
>bge 0x0 4
>hci1394 0x0 106
>ohci 0x1 106
><unknown> unix`xc_serv 140
>ehci 0x0 172
>nvidia 0x0 172
><unknown> unix`cbe_fire 207
><unknown> acpica`acpi_wrapper_isr 16024
Ah, interesting; so ACPI gives you all these interrupts.
This is somewhat surprising unless we programmed the interrupt incorrectly;
acpica generally disables interrupts for devices which have no handler defined.
Casper
--
Expressed in this posting are my opinions. They are in no way related
to opinions held by my employer, Sun Microsystems.
Statements on Sun products included here are not gospel and may
be fiction rather than truth.
> Bernd Markgraf <mark...@mail.cs.uni-magdeburg.de> writes:
>
>><unknown> unix`xc_serv 140
>>ehci 0x0 172
>>nvidia 0x0 172
>><unknown> unix`cbe_fire 207
>><unknown> acpica`acpi_wrapper_isr 16024
>
> Ah, interesting; so ACPI gives you all these interrupts.
>
> This is somewhat surprising unless we programmed the interrupt
> incorrectly;
That has been my theory so far; bug 6284164 might be related and it
seems the acpi sci interrupt was installed on the wrong vector in the
past. But bug 6284164 is supposed to be fixed in solaris
nevada(snv_22); the original poster that started this usenet thread
reported the problem for a debug build 24 kernel. OTOH, Bernd is using
snv_20, ...
Bernd has just mailed me the result of dtrace'ing acpi_wrapper_isr()
return values using the following script, and it returns "1" all the
time (== DDI_INTR_CLAIMED):
#!/usr/sbin/dtrace -s
fbt:acpica:acpi_wrapper_isr:return
{
printf("return %d", arg1);
}
So "acpica" is apparently working hard, handling all of the
acpi_wrapper_isr interrupts.
> acpica generally disables interrupts for devices which
> have no handler defined.
But isn't this an acpi "sci" interrupt? What kind of interrupts
are send to that acpi "sci" interrupt handler? Things like
"cpu is running too hot", "fan is running too slow", "battery
needs charging" (on laptops), and things like that, perhaps ?
> Ah, interesting; so ACPI gives you all these interrupts.
yes. juergen suspected this a while ago...
> This is somewhat surprising unless we programmed the interrupt incorrectly;
> acpica generally disables interrupts for devices which have no handler defined.
how do look further into this? juergen sent me an dtrace script, which
gives the return values of the acpi_wrapper_isr function. this one always
returns 1 (interrupt claimed). is there any way to fid out, where the interrupts
come from?
bernd
--
Du beginnst mit einem vollen Sack Glück und einem leeren Sack
Erfahrung. Der Trick ist, den Erfahrungssack aufzufüllen bevor
Du den Glück-Sack geleert hast.
> But isn't this an acpi "sci" interrupt? What kind of interrupts
> are send to that acpi "sci" interrupt handler? Things like
> "cpu is running too hot", "fan is running too slow", "battery
> needs charging" (on laptops), and things like that, perhaps ?
i was wondering what triggers this. as it happens after about a
week i suspected that something might get too hot. but that boxed
is in an airconditioned room (21ged C all the time) so i don't
think it's temperature related.
>But isn't this an acpi "sci" interrupt? What kind of interrupts
>are send to that acpi "sci" interrupt handler? Things like
>"cpu is running too hot", "fan is running too slow", "battery
>needs charging" (on laptops), and things like that, perhaps ?
Yes, but those would be the result of GPEs and such and when there
are no interested parties, ACPICA disables those GPEs.
What is more likely is that the SCI is incorrectly inserted and that
there's another device which is claiming interrupts.
(The way b21 and before work are such that the interrupt cannot be shared
and if another device attaches before ACPI then it loses the vector once
acpica attached and vice versa).
So upgrade to b24 or 25 and see if that fixes the issue.
Looks like this is the same issue I'm having. Here's the output of Juergen's
dtrace script. I'm afraid this is on build-25.
driver instance intr-count
---------------------------------------
ata 0x1 0
bge 0x0 2
adpu320 0x1 21
ehci 0x0 152
nvidia 0x0 152
<unknown> unix`xc_serv 172
<unknown> unix`cbe_fire 206
<unknown> acpica`acpi_wrapper_isr 12232
I've tried disabling acpi with eeprom,
[tap:/home/rob]$ eeprom acpi-user-options
acpi-user-options=0x2
and rebooted, but having read these pages:
http://blogs.sun.com/roller/page/danasblog
http://blogs.sun.com/roller/page/chewmyblog/20050614
I'm less certain that was the right thing to do. I've also just upgraded the
BIOS on the W2100z.
Rob
> Looks like this is the same issue I'm having. Here's the output of Juergen's
> dtrace script. I'm afraid this is on build-25.
>
> driver instance intr-count
> ---------------------------------------
> ata 0x1 0
> bge 0x0 2
> adpu320 0x1 21
> ehci 0x0 152
> nvidia 0x0 152
> <unknown> unix`xc_serv 172
> <unknown> unix`cbe_fire 206
> <unknown> acpica`acpi_wrapper_isr 12232
i also updated to b25 and still see the same symptoms:
bernd.charon ~/build/autovec # uptime
2:55pm up 2:33, 1 user, load average: 2.08, 2.04, 1.77
bernd.charon ~/build/autovec # uname -a
SunOS charon 5.11 snv_25 i86pc i386 i86pc
driver instance intr-count
---------------------------------------
ata 0x1 0
ohci 0x0 0
adpu320 0x1 1
bge 0x0 4
<unknown> unix`xc_serv 111
hci1394 0x0 112
ohci 0x1 112
ehci 0x0 176
nvidia 0x0 176
<unknown> unix`cbe_fire 204
<unknown> acpica`acpi_wrapper_isr 12579
bernd
I set acpi-user-options to 0x8 (legacy mode) which disables the acpi
interpreter. This worked fine for the last week. No symptoms whatsoever.
Juergen sent me another dtrace script which catches the ACPI events.
So after a week of no trouble i switched back to the default acpi
mechanism (acpi-user-options=0x0) and five minutes later i had the
same symptoms again and the script indicated GPE26 to be the source.
Just a quick note. Maybe 'rdf' could check if acpi-user-options=0x8
make his box symptom free as well.