
Very high load average - nothing appears to use much CPU time


Dave

2009/08/21 20:34:28
I believe I asked for help on this issue once before, and I think
someone had a useful reply, but I can't find the message.

I have a dual processor SPARC running Solaris 10 update 7. The machine
is really slow and unresponsive. prstat shows a very high load average,
yet nothing appears to be using a lot of CPU time.


PID USERNAME SIZE RSS STATE PRI NICE TIME CPU PROCESS/NLWP
28932 drkirkby 30M 24M run 23 0 0:37:42 5.2% maxima/1
10118 drkirkby 34M 29M cpu0 13 0 1:56:58 4.9% maxima/1
3755 drkirkby 30M 19M run 23 0 3:56:16 4.8% maxima/1
1 root 2896K 1760K sleep 35 0 1:23:30 2.9% init/1
1030 drkirkby 294M 248M sleep 49 0 0:32:04 2.1% thunderbird-bin/9
692 drkirkby 387M 95M sleep 59 0 3:30:14 1.8% Xsun/1
18929 drkirkby 147M 88M sleep 49 0 0:12:07 0.7% python/1
9738 drkirkby 147M 63M sleep 39 0 0:21:53 0.7% python/1
24642 drkirkby 167M 88M sleep 39 0 0:19:51 0.7% python/1
890 drkirkby 99M 42M sleep 59 0 0:05:58 0.6% metacity/1
912 drkirkby 129M 65M sleep 58 0 0:17:18 0.5% gnome-terminal/2
944 drkirkby 3976K 3360K sleep 59 0 0:00:36 0.2% prstat/1
24629 drkirkby 122M 88M sleep 59 0 0:00:20 0.2% firefox-bin/6
1129 drkirkby 20M 12M sleep 49 0 0:06:34 0.1% sunpcbinary/3
950 drkirkby 232M 70M sleep 49 0 0:06:21 0.1% java/18
Total: 113 processes, 354 lwps, load averages: 8.36, 8.32, 8.21


It's only a workstation with me using it, so there are not thousands of
processes all trying to get some time on the CPU.

The biggest consumer of CPU time is using 5.2% and the top 10 only sum
to about 25% of the available CPU. Yet the load average is high. I
suspect something is rapidly creating processes which live for a short
time then die, only for another to be created.

I suspect it's possible to do something with dtrace to debug this, but
can anyone give me any hints on where to start?


--
I respectfully request that this message is not archived by companies as
unscrupulous as 'Experts Exchange' . In case you are unaware,
'Experts Exchange' take questions posted on the web and try to find
idiots stupid enough to pay for the answers, which were posted freely
by others. They are leeches.

Zfs..

2009/08/21 20:53:18

We had problems with short-lived PIDs wreaking havoc on a system long
ago. I used prstat -vm (I think) to see some of the short-lived
PIDs, but I know that a colleague of mine had a better command (using
prstat) to pinpoint the culprit. He is away on leave so I cannot ask.
I'm sure some of the gurus will be able to help you better.

hume.sp...@bofh.ca

2009/08/21 22:39:22
Dave <f...@coo.com> wrote:
> I suspect it's possible to do something with dtrace to debug this, but
> can anyone give me any hints on where to start?

Try execsnoop.d: http://www.brendangregg.com/DTrace/execsnoop.d
That might show something... also, statsnoop.d and opensnoop.d might be
valuable.

Also, you could turn on process accounting.
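
If you just want a quick look without the full script, a one-liner in
the same spirit (using the stock proc provider) is
dtrace -n 'proc:::exec-success { trace(curpsinfo->pr_psargs); }'
which prints the arguments of each command as it gets exec'd.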

--
Brandon Hume - hume -> BOFH.Ca, http://WWW.BOFH.Ca/

Greg Andrews

2009/08/21 23:11:31
Dave <f...@coo.com> writes:
>
>I have a dual processor SPARC running Solaris 10 update 7. The machine
>is really slow and unresponsive. prstat shows a very high load average,
>yet nothing appears to be using a lot of CPU time.
>
>
> PID USERNAME SIZE RSS STATE PRI NICE TIME CPU PROCESS/NLWP
> 28932 drkirkby 30M 24M run 23 0 0:37:42 5.2% maxima/1
> 10118 drkirkby 34M 29M cpu0 13 0 1:56:58 4.9% maxima/1
> 3755 drkirkby 30M 19M run 23 0 3:56:16 4.8% maxima/1
> 1 root 2896K 1760K sleep 35 0 1:23:30 2.9% init/1
> 1030 drkirkby 294M 248M sleep 49 0 0:32:04 2.1%
>....

>Total: 113 processes, 354 lwps, load averages: 8.36, 8.32, 8.21
>
>
>It's only a workstation with me using it, so there are not thousands of
>processes all trying to get some time on the CPU.
>
>The biggest consumer of CPU time is using 5.2% and the top 10 only sum
>to about 25% of the available CPU. Yet the load average is high. I
>suspect something is rapidly creating processes which live for a short
>time then die, only for another to be created.
>

Before you go to dtrace, you should look at the output of another
tool: vmstat.

I would suggest taking the output of 'vmstat 15' or 'vmstat 30' and
posting 5 minutes' worth here.

-Greg
--
Do NOT reply via e-mail.
Reply in the newsgroup.

Tony Curtis

2009/08/21 23:15:54
>> On Sat, 22 Aug 2009 02:39:22 +0000 (UTC),
>> hume.sp...@bofh.ca said:

> Dave <f...@coo.com> wrote:
>> I suspect it's possible to do something with dtrace to
>> debug this, but can anyone give me any hints on where
>> to start?

> Try execsnoop.d:
> http://www.brendangregg.com/DTrace/execsnoop.d That
> might show something... also, statsnoop.d and
> opensnoop.d might be valuable.

Good things to look at. If I read correctly, the problem
only shows up when you launch your "maxima" processes?

Are they doing a lot of I/O? I've seen large parallel
programs doing parallel I/O make even big Sun machines
spiral up out of control on their load...

hth
t

Ewald Ertl

2009/08/22 3:24:44
Hi,

I had a similar problem on a server containing zones.
The problem was tons of files stored in /tmp.
/tmp is simply backed by virtual memory, which
was then missing for the processes in my case.
After cleaning up /tmp everything was responsive again.
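
(/tmp on Solaris is tmpfs, so "df -k /tmp" will show it mounted against
swap; filling it up eats the same virtual memory the processes need.)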

Ewald

Dave

2009/08/22 5:54:30


Thank you. That is not the issue here.

Stefaan A Eeckels

2009/08/22 5:05:09
On Sat, 22 Aug 2009 02:39:22 +0000 (UTC)
hume.sp...@bofh.ca wrote:

> Dave <f...@coo.com> wrote:
> > I suspect it's possible to do something with dtrace to debug this,
> > but can anyone give me any hints on where to start?
>
> Try execsnoop.d: http://www.brendangregg.com/DTrace/execsnoop.d
> That might show something... also, statsnoop.d and opensnoop.d might
> be valuable.

That version of execsnoop is deprecated. Better visit Brendan's site
and download the latest version of his DTraceToolkit. It's a real
sysadmin treasure trove.

--
Stefaan A Eeckels
--
Being right is one more reason not to have any success.
--Nicolás Dávila

Dave

2009/08/22 6:32:25
Tony Curtis wrote:
>>> On Sat, 22 Aug 2009 02:39:22 +0000 (UTC),
>>> hume.sp...@bofh.ca said:
>
>> Dave <f...@coo.com> wrote:
>>> I suspect it's possible to do something with dtrace to
>>> debug this, but can anyone give me any hints on where
>>> to start?
>
>> Try execsnoop.d:
>> http://www.brendangregg.com/DTrace/execsnoop.d That
>> might show something... also, statsnoop.d and
>> opensnoop.d might be valuable.
>
> Good things to look at. If I read correctly, the problem
> only shows up when you launch your "maxima" processes?

I'm not 100% sure on this, though there are certainly some maxima
processes running when there should not be.

A test suite in Sage is launching Maxima. Unfortunately, the test suite
takes ages to run and I'm not too sure exactly what bits will launch
maxima. I was unable last night to kill the maxima processes until the
python processes were killed.

> Are they doing a lot of I/O? I've seen large parallel
> programs doing parallel I/O make even big Sun machines
> spiral up out of control on their load...
>
> hth
> t

No, Maxima is a computer algebra system, so it has no real need to do
any IO. Being a workstation, with me sitting next to it, I would soon
hear a lot more noise if there were a lot of disk IO.

It is annoying as the machine becomes unresponsive, so it's not a nice
environment for debugging. I might try binding the dodgy processes to
one CPU, hopefully leaving the other CPU free for other things. That
will hopefully make debugging a bit less painful.

Dave

2009/08/23 4:26:09

Here's the vmstat output at 15-second intervals, for a duration equal
to the length of time it took me to make a coffee! Below that is my
current output from 'prstat', which shows the problem, but it can't be
directly compared to what I posted earlier, as that was not taken at
the same time.


# vmstat 15
kthr memory page disk faults cpu
r b w swap free re mf pi po fr de sr f0 s5 s1 s1 in sy cs us sy id
2 0 0 3737080 2832736 535 10382 1 0 0 0 0 -0 2 0 0 716 23423 1390 22 28 50
9 0 0 1359992 511064 1144 49116 0 0 0 0 0 0 9 0 0 595 97836 1122 33 67 0
10 0 0 1337120 488072 1099 50678 0 0 0 0 0 0 0 0 0 467 101447 1136 31 69 0
9 0 0 1327752 478240 1068 45129 0 0 0 0 0 0 2 0 0 541 89854 1100 36 64 0
8 0 0 1322408 472752 1047 27661 0 0 0 0 0 0 0 0 0 511 54479 1072 53 47 0
9 0 0 1322016 472408 1071 51116 0 0 0 0 0 0 2 0 0 519 101145 1049 33 67 0
8 0 0 1321184 470184 1089 51596 0 0 0 0 0 0 0 0 0 457 102024 1025 32 68 0
9 0 0 1393704 536760 1082 51582 0 0 0 0 0 0 2 0 0 528 102873 1072 31 69 0
9 0 0 1368728 518864 1146 49675 0 0 0 0 0 0 0 0 0 463 98968 1099 32 68 0
9 0 0 1338832 489856 1135 50687 0 0 0 0 0 0 3 0 0 582 100838 1164 31 69 0
9 0 0 1334192 484936 1078 52391 0 0 0 0 0 0 0 0 0 451 103807 1100 31 69 0
9 0 0 1320472 470480 1096 52112 0 0 0 0 0 0 3 0 0 568 103304 1117 31 69 0
9 0 0 1350560 494328 1105 49994 0 0 0 0 0 0 0 0 0 474 99526 1123 32 68 0
9 0 0 1379504 523344 1121 51749 0 0 0 0 0 0 3 0 0 591 103009 1157 30 70 0
9 0 0 1353656 505360 1114 51845 0 0 0 0 0 0 0 0 0 468 102817 1118 32 68 0
9 0 0 1339368 489568 1093 51857 0 0 0 0 0 0 3 0 0 552 103249 1123 31 69 0
9 0 0 1331568 481424 1119 52130 0 0 0 0 0 0 0 0 0 461 103143 1135 30 70 0
8 0 0 1318320 467688 1078 49873 0 0 0 0 0 0 4 0 0 578 99403 1141 32 68 0
kthr memory page disk faults cpu
r b w swap free re mf pi po fr de sr f0 s5 s1 s1 in sy cs us sy id
9 0 0 1311000 459016 1065 51256 0 0 0 0 0 0 0 0 0 429 108771 7105 31 69 0
9 0 0 1301552 452272 1100 52220 0 0 0 0 0 0 4 0 0 559 103457 1107 31 69 0
10 0 0 1299920 449680 1087 52323 0 0 0 0 0 0 0 0 0 441 103597 1084 31 69 0
10 0 0 1298920 447808 1089 51721 0 0 0 0 0 0 2 0 0 502 102267 1095 32 68 0
9 0 0 1295568 443640 1071 51905 0 0 0 0 0 0 0 0 0 455 102774 1104 32 68 0
9 0 0 1293144 441520 1098 51147 0 0 0 0 0 0 2 0 0 525 101162 1093 32 68 0
11 0 0 1252000 411472 1135 51405 0 0 0 0 0 0 0 0 0 453 101977 1106 31 69 0
9 0 0 1247328 405056 1089 50315 0 0 0 0 0 0 2 0 0 507 100615 1104 32 68 0
9 0 0 1240496 398208 1029 51519 0 0 0 0 0 0 0 0 0 465 107240 1099 31 69 0
9 0 0 1237512 395040 1060 51896 0 0 0 0 0 0 2 0 0 512 104193 1045 32 68 0
9 0 0 1237368 393992 1082 51476 0 0 0 0 0 0 0 0 0 547 102123 1192 32 68 0
9 0 0 1236504 392432 1048 45764 0 0 0 0 0 0 3 0 0 1088 91950 1823 36 64 0


prstat output


PID USERNAME SIZE RSS STATE PRI NICE TIME CPU PROCESS/NLWP

10633 drkirkby 34M 23M run 21 0 0:29:40 14% maxima/1
7903 drkirkby 34M 23M run 31 0 0:51:26 13% maxima/1
10444 drkirkby 30M 19M run 31 0 0:36:35 13% maxima/1
1030 drkirkby 322M 246M sleep 49 0 1:06:11 4.2% thunderbird-bin/9
3566 drkirkby 113M 67M run 53 0 0:00:03 3.8% python/1
24629 drkirkby 290M 180M sleep 59 0 2:00:28 2.6% firefox-bin/8
692 drkirkby 514M 158M sleep 59 0 4:15:25 1.9% Xsun/1
1 root 2896K 1760K run 49 0 1:26:13 1.7% init/1
14615 drkirkby 147M 72M sleep 59 0 0:01:29 1.2% python/1
890 drkirkby 99M 39M sleep 59 0 0:10:08 0.7% metacity/1
17198 drkirkby 255M 168M sleep 59 0 0:07:25 0.4% acroread/1
912 drkirkby 131M 62M sleep 59 0 0:41:27 0.3% gnome-terminal/2
909 drkirkby 101M 36M run 49 0 0:02:52 0.3% wnck-applet/1
3363 drkirkby 8176K 5664K sleep 59 0 0:00:00 0.1% python/1
17566 drkirkby 263M 57M sleep 49 0 0:05:06 0.1% Mathematica/5
Total: 155 processes, 381 lwps, load averages: 11.37, 10.78, 10.54

Bruce Esquibel

2009/08/23 8:23:47
Dave <f...@coo.com> wrote:

> prstat output

> PID USERNAME SIZE RSS STATE PRI NICE TIME CPU PROCESS/NLWP
> 10633 drkirkby 34M 23M run 21 0 0:29:40 14% maxima/1
> 7903 drkirkby 34M 23M run 31 0 0:51:26 13% maxima/1
> 10444 drkirkby 30M 19M run 31 0 0:36:35 13% maxima/1
> 1030 drkirkby 322M 246M sleep 49 0 1:06:11 4.2% thunderbird-bin/9

> 24629 drkirkby 290M 180M sleep 59 0 2:00:28 2.6% firefox-bin/8
> 692 drkirkby 514M 158M sleep 59 0 4:15:25 1.9% Xsun/1
> 1 root 2896K 1760K run 49 0 1:26:13 1.7% init/1

> 912 drkirkby 131M 62M sleep 59 0 0:41:27 0.3% gnome-terminal/2


You know, just as an observation, when was the last time you rebooted?

The thing that bothers me is that those CPU "TIME" readings are really
large, especially for something like Xsun.

I'm just saying maybe it's time to give it a swift kick in the ass.

What exactly is it anyway? You only mentioned "2 cpu sparc workstation",
like an SS20 or something newer?

-bruce
b...@ripco.com

Dave

2009/08/23 9:12:33
Bruce Esquibel wrote:
> Dave <f...@coo.com> wrote:
>
>> prstat output
>
>> PID USERNAME SIZE RSS STATE PRI NICE TIME CPU PROCESS/NLWP
>> 10633 drkirkby 34M 23M run 21 0 0:29:40 14% maxima/1
>> 7903 drkirkby 34M 23M run 31 0 0:51:26 13% maxima/1
>> 10444 drkirkby 30M 19M run 31 0 0:36:35 13% maxima/1
>> 1030 drkirkby 322M 246M sleep 49 0 1:06:11 4.2% thunderbird-bin/9
>> 24629 drkirkby 290M 180M sleep 59 0 2:00:28 2.6% firefox-bin/8
>> 692 drkirkby 514M 158M sleep 59 0 4:15:25 1.9% Xsun/1
>> 1 root 2896K 1760K run 49 0 1:26:13 1.7% init/1
>> 912 drkirkby 131M 62M sleep 59 0 0:41:27 0.3% gnome-terminal/2
>
>
> You know, just as an observation, when was the last time you rebooted?

Only 3 days ago - we had a scheduled power cut, so I shut the machine
down.

> The thing that bothers me is that those CPU "TIME" readings are really
> large, especially for something like Xsun.


Does 4 hours of CPU time on Xsun seem excessive? I must admit, I've
never known it that high and often the machine is up for a couple of
months at a time.

> I'm just saying maybe it's time to give it a swift kick in the ass.
>
> What exactly is it anyway? You only mentioned "2 cpu sparc workstation",
> like an SS20 or something newer?

It's a Blade 1000 with 2 x 1200 MHz CPUs. I own a Blade 2000 which had
exactly the same issue: Sage (which launches maxima and python) making
the load average high.

> -bruce
> b...@ripco.com

Bruce Esquibel

2009/08/23 13:12:20
Dave <f...@coo.com> wrote:

> Does 4 hours of CPU time on Xsun seem excessive? I must admit, I've
> never known it that high and often the machine is up for a couple of
> months at a time.

I really don't know and only have apples to compare to your oranges.

Most of my machines were rebooted last Sunday due to electrical work and
aren't running any of the software you are, but yeah, it seems excessive.
There is one in the group that is running apache/mysql for about 120
domains and it's around 0:03:20 and 0:05:00 respectively.

There is one 280R which I think is similar to your box; it escaped the
power-down and has been up since May 17th, but it only runs mail
services (sendmail, the smf-sav/smf-zombie stuff) and the largest CPU
time I see on that is the main sendmail process at 4:37:40. But keep in
mind, that is 3+ months, not 3 days.

Maybe it doesn't mean anything; I never kept track of that "time" stuff
anyway, and it might be normal for the applications you are running.

Not that it'll provide an answer but if you have "top" (rather than prstat),
try running that and see what that 2nd or 3rd line from the top says, CPU
States. If the kernel % is high, you might have some kind of denial of
service attack going on if you are sure there isn't excessive i/o on the
disks.

It is a strange problem, no doubt.

-bruce
b...@ripco.com

Dave

2009/08/23 14:04:55
Bruce Esquibel wrote:

> Not that it'll provide an answer but if you have "top" (rather than prstat),
> try running that and see what that 2nd or 3rd line from the top says, CPU
> States. If the kernel % is high, you might have some kind of denial of
> service attack going on if you are sure there isn't excessive i/o on the
> disks.
>
> It is a strange problem, no doubt.
>
> -bruce
> b...@ripco.com

I've rebooted the machine. It's been up 3 hours and has used 0:03:49 of
CPU time for Xsun, which is around 1 minute/hour, so perhaps the other
data was not unusual.

I don't believe the issue is a DoS attack - it only happens when I run
this particular bit of software (Sage) which I'm helping develop. (It
already runs OK on Linux and OS X, but Solaris is tricky at best).

I suspect there is something going on in Sage which some linux user
added, which is screwing it up on Solaris.

I don't know if the Unix gurus can make any sense of the vmstat output.
I should take a look at that a bit more closely myself, after reading
about how to interpret it.

Richard B. Gilbert

2009/08/23 15:31:21
Bruce Esquibel wrote:
> Dave <f...@coo.com> wrote:
>
>> prstat output
>
>> PID USERNAME SIZE RSS STATE PRI NICE TIME CPU PROCESS/NLWP
>> 10633 drkirkby 34M 23M run 21 0 0:29:40 14% maxima/1
>> 7903 drkirkby 34M 23M run 31 0 0:51:26 13% maxima/1
>> 10444 drkirkby 30M 19M run 31 0 0:36:35 13% maxima/1
>> 1030 drkirkby 322M 246M sleep 49 0 1:06:11 4.2% thunderbird-bin/9
>> 24629 drkirkby 290M 180M sleep 59 0 2:00:28 2.6% firefox-bin/8
>> 692 drkirkby 514M 158M sleep 59 0 4:15:25 1.9% Xsun/1
>> 1 root 2896K 1760K run 49 0 1:26:13 1.7% init/1
>> 912 drkirkby 131M 62M sleep 59 0 0:41:27 0.3% gnome-terminal/2
>
>
> You know, just as an observation, when was the last time you rebooted?

That SHOULD not be significant! Unless there is a software or hardware
problem, the machine should be able to run for weeks or months between
reboots. It's not Windoze!!! I have occasionally had uptimes of
nearly a year.
<snip>

Mike Gerdts

2009/08/23 23:30:12

[snip]

You quite regularly have 9+ processes waiting for some CPU time, with
lots of minor faults. Most of the time is in %sys. This could be
caused by excessive process creation. As others have mentioned,
execsnoop is a good tool to try. Even without dtrace, you can use
"sar -c 5 10" to see the fork/exec rate.

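(In the sar -c output, fork/s and exec/s are the columns to watch.)
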
If that doesn't show anything, look deeper into what is causing so
much memory activity, resulting in minor faults. When looking into
this, be sure that you are using prstat in a way that allows you to
see what each thread is doing for only the reported interval. "prstat
-mL" or "prstat -mLc" tends to be my preferred way to use prstat.
When you run prstat without -m, the CPU time displayed is not the CPU
time for that interval - it is a time-decayed average over a rather
long period. Additionally, it breaks the usage out by %usr, %sys, and
other states that may indicate something useful. Note that 100% with
microstate accounting is 100% of one CPU.

My guess is that there is something that is using a lot of memory with
tiny pages, thus thrashing the TLB. Using "trapstat -T 5 10" should
help understand what sizes of pages are responsible for TLB misses. If
you see more than a couple percent TLB miss for a particular page size,
my guess is that you will find that the process(es) showing the highest
%sys in "prstat -mL" also have a large number of pages of the same size
as the missed pages. Use "pmap -s $pid" to find the page sizes and
counts in use.
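"pagesize -a" lists the page sizes the platform supports, for reference.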

Assuming this points somewhere useful, ppgsz(1) may be a good place to
start looking. Note that ppgsz will only affect pages that are
allocated after it is used.

> prstat output

Because the CPU column below is "recent CPU" time - not for the
specific interval - if you have processes with spikey CPU activity,
normal prstat output will not be terribly useful. Again, "prstat -mL"
is your friend. On an 8 processor system, I observed that it took
about 8 5 second intervals before a process that was spinning on CPU
indicated that it was using 11% (100 / 8 = 12) of the available CPU.
In contrast, prstat -mL will show that it is using 100% of a single
CPU (100 %usr + 0 %sys + 0 %tfl + ... etc.) immediately.

Dave

2009/08/24 12:03:00
Stefaan A Eeckels wrote:
> On Sat, 22 Aug 2009 02:39:22 +0000 (UTC)
> hume.sp...@bofh.ca wrote:
>
>> Dave <f...@coo.com> wrote:
>>> I suspect it's possible to do something with dtrace to debug this,
>>> but can anyone give me any hints on where to start?
>> Try execsnoop.d: http://www.brendangregg.com/DTrace/execsnoop.d
>> That might show something... also, statsnoop.d and opensnoop.d might
>> be valuable.
>
> That version of execsnoop is deprecated. Better visit Brendan's site
> and download the latest version of his DTraceToolkit. It's a real
> sysadmin treasure trove.
>


Thank you both. I've downloaded it, and see that it is python that is
killing the system. Maxima has gone wild and is using a lot of CPU time,
but it is not creating lots of processes. Below is the output from
execsnoop. At the bottom you can see where I kill python; after that,
the process creation dropped to sensible levels.

Sage is written in python, so that is the problem, but exactly where it
lies I have yet to investigate.

89954125921 100 16164 16163 grep ^ *8512
89954131267 100 16166 8512 sh -c top -b -n 65635 |grep "^ *8512 "
89954138885 100 16167 16166 grep ^ *8512
89954144121 100 16169 8512 sh -c top -b -n 65635 |grep "^ *8512 "
89954151844 100 16170 16169 grep ^ *8512
89954157107 100 16172 8512 sh -c top -b -n 65635 |grep "^ *8512 "
89954164796 100 16173 16172 grep ^ *8512
89954170175 100 16175 8512 sh -c top -b -n 65635 |grep "^ *8512 "
89954177974 100 16176 16175 grep ^ *8512
89954183355 100 16178 8512 sh -c top -b -n 65635 |grep "^ *8512 "
89954191064 100 16179 16178 grep ^ *8512
89954196172 100 16181 8512 sh -c top -b -n 65635 |grep "^ *8512 "
89954203909 100 16182 16181 grep ^ *8512
89954209230 100 16184 8512 sh -c top -b -n 65635 |grep "^ *8512 "
89954216945 100 16185 16184 grep ^ *8512
89954222267 100 16187 8512 sh -c top -b -n 65635 |grep "^ *8512 "
89954229768 100 16188 16187 grep ^ *8512
89954235055 100 16190 8512 sh -c top -b -n 65635 |grep "^ *8512 "
89954242804 100 16191 16190 grep ^ *8512
89954248175 100 16193 8512 sh -c top -b -n 65635 |grep "^ *8512 "
89954255933 100 16194 16193 grep ^ *8512
89954261326 100 16196 8512 sh -c top -b -n 65635 |grep "^ *8512 "
89954269103 100 16197 16196 grep ^ *8512
89954274449 100 16199 8512 sh -c top -b -n 65635 |grep "^ *8512 "
89954282172 100 16200 16199 grep ^ *8512
89954287142 100 16202 8512 sh -c top -b -n 65635 |grep "^ *8512 "
89954295109 100 16203 16202 grep ^ *8512
89954300482 100 16205 8512 sh -c top -b -n 65635 |grep "^ *8512 "
89955301750 100 16361 1229 pkill -9 python
89954441806 100 16206 16205 grep ^ *8512
89954447604 100 16208 8512 sh -c top -b -n 65635 |grep "^ *8512 "
89954532904 100 16209 16208 grep ^ *8512
89954538513 100 16211 8512 sh -c top -b -n 65635 |grep "^ *8512 "
89954546237 100 16212 16211 grep ^ *8512
89954551418 100 16214 8512 sh -c top -b -n 65635 |grep "^ *8512 "
89954558943 100 16215 16214 grep ^ *8512
89954564435 100 16217 8512 sh -c top -b -n 65635 |grep "^ *8512 "
89954572299 100 16218 16217 grep ^ *8512
89954577433 100 16220 8512 sh -c top -b -n 65635 |grep "^ *8512 "
89954584944 100 16221 16220 grep ^ *8512
89954590099 100 16223 8512 sh -c top -b -n 65635 |grep "^ *8512 "
89954597643 100 16224 16223 grep ^ *8512
89954602756 100 16226 8512 sh -c top -b -n 65635 |grep "^ *8512 "
89954610226 100 16227 16226 grep ^ *8512
89954615365 100 16229 8512 sh -c top -b -n 65635 |grep "^ *8512 "
89954622889 100 16230 16229 grep ^ *8512
89954628033 100 16232 8512 sh -c top -b -n 65635 |grep "^ *8512 "
89954635528 100 16233 16232 grep ^ *8512
89954640608 100 16235 8512 sh -c top -b -n 65635 |grep "^ *8512 "
89954648239 100 16236 16235 grep ^ *8512
89954653360 100 16238 8512 sh -c top -b -n 65635 |grep "^ *8512 "
89954660868 100 16239 16238 grep ^ *8512
89954666029 100 16241 8512 sh -c top -b -n 65635 |grep "^ *8512 "
89954673520 100 16242 16241 grep ^ *8512
89954678622 100 16244 8512 sh -c top -b -n 65635 |grep "^ *8512 "
89954686121 100 16245 16244 grep ^ *8512
89954691273 100 16247 8512 sh -c top -b -n 65635 |grep "^ *8512 "
89954698787 100 16248 16247 grep ^ *8512
89954704074 100 16250 8512 sh -c top -b -n 65635 |grep "^ *8512 "
89954713357 100 16251 16250 grep ^ *8512
89954718573 100 16253 8512 sh -c top -b -n 65635 |grep "^ *8512 "
89954726139 100 16254 16253 grep ^ *8512
89954802024 100 16256 8512 sh -c top -b -n 65635 |grep "^ *8512 "
89954811697 100 16257 16256 grep ^ *8512
89954817203 100 16259 8512 sh -c top -b -n 65635 |grep "^ *8512 "
89954825047 100 16260 16259 grep ^ *8512
89954830707 100 16262 8512 sh -c top -b -n 65635 |grep "^ *8512 "
89954838299 100 16263 16262 grep ^ *8512
89954843433 100 16265 8512 sh -c top -b -n 65635 |grep "^ *8512 "
89954850982 100 16266 16265 grep ^ *8512
89954856080 100 16268 8512 sh -c top -b -n 65635 |grep "^ *8512 "
89954863861 100 16269 16268 grep ^ *8512
89954869016 100 16271 8512 sh -c top -b -n 65635 |grep "^ *8512 "
89954876595 100 16272 16271 grep ^ *8512
89954881909 100 16274 8512 sh -c top -b -n 65635 |grep "^ *8512 "
89954889315 100 16275 16274 grep ^ *8512
89954894379 100 16277 8512 sh -c top -b -n 65635 |grep "^ *8512 "
89954901745 100 16278 16277 grep ^ *8512
89954906700 100 16280 8512 sh -c top -b -n 65635 |grep "^ *8512 "
89954914073 100 16281 16280 grep ^ *8512
89954919139 100 16283 8512 sh -c top -b -n 65635 |grep "^ *8512 "
89954926676 100 16284 16283 grep ^ *8512
89954931825 100 16286 8512 sh -c top -b -n 65635 |grep "^ *8512 "
89954939280 100 16287 16286 grep ^ *8512
89954944457 100 16289 8512 sh -c top -b -n 65635 |grep "^ *8512 "
89954951883 100 16290 16289 grep ^ *8512
89954956937 100 16292 8512 sh -c top -b -n 65635 |grep "^ *8512 "
89954964591 100 16293 16292 grep ^ *8512
89954969856 100 16295 8512 sh -c top -b -n 65635 |grep "^ *8512 "
89954977547 100 16296 16295 grep ^ *8512
89954982916 100 16298 8512 sh -c top -b -n 65635 |grep "^ *8512 "
89954990624 100 16299 16298 grep ^ *8512
89954996093 100 16301 8512 sh -c top -b -n 65635 |grep "^ *8512 "
89955003657 100 16302 16301 grep ^ *8512
89955008689 100 16304 8512 sh -c top -b -n 65635 |grep "^ *8512 "
89955016172 100 16305 16304 grep ^ *8512
89955021337 100 16307 8512 sh -c top -b -n 65635 |grep "^ *8512 "
89955028887 100 16308 16307 grep ^ *8512
89955034009 100 16310 8512 sh -c top -b -n 65635 |grep "^ *8512 "
89955041604 100 16311 16310 grep ^ *8512
89955046746 100 16313 8512 sh -c top -b -n 65635 |grep "^ *8512 "
89955054767 100 16314 16313 grep ^ *8512
89955061051 100 16316 8512 sh -c top -b -n 65635 |grep "^ *8512 "
89955068562 100 16317 16316 grep ^ *8512
89955073669 100 16319 8512 sh -c top -b -n 65635 |grep "^ *8512 "
89955081137 100 16320 16319 grep ^ *8512
89955086275 100 16322 8512 sh -c top -b -n 65635 |grep "^ *8512 "
89955093721 100 16323 16322 grep ^ *8512
89955098831 100 16325 8512 sh -c top -b -n 65635 |grep "^ *8512 "
89955106461 100 16326 16325 grep ^ *8512
89955111602 100 16328 8512 sh -c top -b -n 65635 |grep "^ *8512 "
89955119128 100 16329 16328 grep ^ *8512
89955124216 100 16331 8512 sh -c top -b -n 65635 |grep "^ *8512 "
89955132803 100 16332 16331 grep ^ *8512
89955138069 100 16334 8512 sh -c top -b -n 65635 |grep "^ *8512 "
89955145697 100 16335 16334 grep ^ *8512
89955150875 100 16337 8512 sh -c top -b -n 65635 |grep "^ *8512 "
89955158407 100 16338 16337 grep ^ *8512
89955163512 100 16340 8512 sh -c top -b -n 65635 |grep "^ *8512 "
89955171111 100 16341 16340 grep ^ *8512
89955176200 100 16343 8512 sh -c top -b -n 65635 |grep "^ *8512 "
89955183935 100 16344 16343 grep ^ *8512
89955189256 100 16346 8512 sh -c top -b -n 65635 |grep "^ *8512 "
89955197060 100 16347 16346 grep ^ *8512
89955202402 100 16349 8512 sh -c top -b -n 65635 |grep "^ *8512 "
89955210086 100 16350 16349 grep ^ *8512
89955215419 100 16352 8512 sh -c top -b -n 65635 |grep "^ *8512 "
89955222968 100 16353 16352 grep ^ *8512
89955228085 100 16355 8512 sh -c top -b -n 65635 |grep "^ *8512 "
89955235490 100 16356 16355 grep ^ *8512
89955240596 100 16358 8512 sh -c top -b -n 65635 |grep "^ *8512 "
89955346959 100 16359 16358 grep ^ *8512
89955610235 100 16362 875 grep ERROR occurred
/export/home/drkirkby/sage/sage-4.1.1/tmp/test-dsage.log
89955622552 100 16363 875 tee -a
/export/home/drkirkby/sage/sage-4.1.1/tmp/test.log
89955636569 100 16364 875 cat
/export/home/drkirkby/sage/sage-4.1.1/tmp/test.log
90051484511 0 16365 580 fptest -f 1200 -p 108 -d 1
90075609183 0 16367 16366 /bin/sh /usr/local/bin/aol2
90075604564 0 16366 299 sh -c /usr/local/bin/aol2
90075633367 0 16368 16367 date
90075672470 0 16369 16367 /usr/sbin/ipf -Fa

hume.sp...@bofh.ca

2009/08/24 13:11:27
Dave <f...@coo.com> wrote:
> 89954125921 100 16164 16163 grep ^ *8512
> 89954131267 100 16166 8512 sh -c top -b -n 65635 |grep "^ *8512 "

It looks like something is using a fairly dumb way of figuring out its
own CPU usage.

Dave

2009/08/24 15:29:52
hume.sp...@bofh.ca wrote:
> Dave <f...@coo.com> wrote:
>> 89954125921 100 16164 16163 grep ^ *8512
>> 89954131267 100 16166 8512 sh -c top -b -n 65635 |grep "^ *8512 "
>
> It looks like something is using a fairly dumb way of figuring out its
> own CPU usage.
>


I'm puzzled what the hell it is supposed to be doing.

Whatever it is, the thing succeeds in bringing my system almost to a
standstill.

Thanks for the help. Hopefully, now that I know what the problem is, a
look at the source code will show the dodgy code.

Colin B.

2009/08/24 18:29:41
Dave <f...@coo.com> wrote:
> Bruce Esquibel wrote:
>> Dave <f...@coo.com> wrote:
>>
>>> prstat output
>>
>>> PID USERNAME SIZE RSS STATE PRI NICE TIME CPU PROCESS/NLWP
>>> 10633 drkirkby 34M 23M run 21 0 0:29:40 14% maxima/1
>>> 7903 drkirkby 34M 23M run 31 0 0:51:26 13% maxima/1
>>> 10444 drkirkby 30M 19M run 31 0 0:36:35 13% maxima/1
>>> 1030 drkirkby 322M 246M sleep 49 0 1:06:11 4.2% thunderbird-bin/9
>>> 24629 drkirkby 290M 180M sleep 59 0 2:00:28 2.6% firefox-bin/8
>>> 692 drkirkby 514M 158M sleep 59 0 4:15:25 1.9% Xsun/1
>>> 1 root 2896K 1760K run 49 0 1:26:13 1.7% init/1
>>> 912 drkirkby 131M 62M sleep 59 0 0:41:27 0.3% gnome-terminal/2
>>
>>
>> You know, just as an observation, when was the last time you rebooted?
>
> Only 3 days ago - we had a scheduled power cut, so I shut the machine
> down.
>
>> The thing that bothers me is that those CPU "TIME" readings are really
>> large, especially for something like Xsun.
>
>
> Does 4 hours of CPU time on Xsun seem excessive? I must admit, I've
> never known it that high and often the machine is up for a couple of
> months at a time.

Doesn't seem like it to me. From my Ultra45...

gir_$ prstat


PID USERNAME SIZE RSS STATE PRI NICE TIME CPU PROCESS/NLWP

8695 cbigam 1440M 327M sleep 56 0 26:49:13 8.7% firefox-bin/11
1982 cbigam 751M 166M sleep 59 0 8:23:28 2.8% Xsun/1
2265 cbigam 175M 104M sleep 59 0 0:18:38 0.8% gnome-terminal/2
2443 cbigam 68M 22M sleep 58 0 1:01:19 0.3% gnome-netstatus/1
...
gir_$ uptime
4:26pm up 13 day(s), 2:29, 2 users, load average: 0.20, 0.20, 0.24

So eight hours in under two weeks (and almost 27 hours for firefox in
the same period - yikes!).

Colin

Dave

2009/08/24 18:32:03
Dave wrote:
> hume.sp...@bofh.ca wrote:
>> Dave <f...@coo.com> wrote:
>>> 89954125921 100 16164 16163 grep ^ *8512
>>> 89954131267 100 16166 8512 sh -c top -b -n 65635 |grep "^ *8512 "
>>
>> It looks like something is using a fairly dumb way of figuring out its
>> own CPU usage.
>>
>
>
> I'm puzzled what the hell it is supposed to be doing.
>
> Whatever it is, the thing succeeds in bringing my system almost to a
> standstill.
>
> Thanks for the help. Hopefully, now that I know what the problem is, a
> look at the source code will show the dodgy code.
>
I finally tracked it down. It was python code trying to get the memory
used by a process.


U = os.uname()[0].lower()
pid = os.getpid()

if U == 'linux':
    cmd = 'top -b -n 1 -p %s'%pid
elif U == 'darwin':
    cmd = 'top -l 1 |grep "^ *%s "'%pid
elif U == 'sunos':
    cmd = 'top -b -n 65635 |grep "^ *%s "'%pid
else:
    raise NotImplementedError, "top not implemented on platform %s"%U

<SNIP>

elif U == 'sunos':
    # An evil and ugly workaround some Solaris race condition.
    # (top() runs the cmd built above and returns its output.)
    while True:
        try:
            m = float(top().split()[-5].strip('M'))
            break
        except:
            pass
else:

The *evil and ugly* code was, I believe, calling top in a loop as it
kept failing. It was that which wreaked havoc on my system. I've written
some C code to do this, which will mean top will not be needed.
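
The replacement boils down to reading the binary psinfo_t that Solaris
exposes under /proc - something along these lines (a sketch of the idea
rather than the exact code; pr_size and pr_rssize are in kilobytes, per
proc(4)):

#include <stdio.h>
#include <stdlib.h>
#include <fcntl.h>
#include <unistd.h>
#include <procfs.h>     /* psinfo_t, the structured /proc interface */

/* Print the virtual size and RSS of a process, in KB, via /proc. */
int main(int argc, char **argv)
{
    char path[64];
    psinfo_t info;
    int fd;
    pid_t pid = (argc > 1) ? (pid_t)atoi(argv[1]) : getpid();

    (void)snprintf(path, sizeof(path), "/proc/%d/psinfo", (int)pid);
    fd = open(path, O_RDONLY);
    if (fd < 0) {
        perror("open");
        return 1;
    }
    if (read(fd, &info, sizeof(info)) != sizeof(info)) {
        perror("read");
        (void)close(fd);
        return 1;
    }
    (void)close(fd);

    /* pr_size and pr_rssize are reported by the kernel in kilobytes. */
    (void)printf("size=%luK rss=%luK\n",
        (unsigned long)info.pr_size, (unsigned long)info.pr_rssize);
    return 0;
}

One open() and one read() per sample - no fork, no top, no grep.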

Dave

Rob Kouwenberg

2009/08/25 16:34:04
Dave <f...@coo.com> wrote:
> I suspect there is something going on in Sage which some linux user
> added, which is screwing it up on Solaris.

Pfft, a wild guess: if you're writing GNU software in C, do note that
malloc and the like might perform differently across platforms. Some
just reserve the memory, others zero all the requested memory space (a
performance hit). Testing and reading the source code might find the
culprit.

Locks might also perform differently. Does Sage use I/O with file
handles, with resource starvation possibilities? Depending on OS and
tuning, devices, ttys, etc. might be used. If I'm not mistaken, Solaris
has a default max of 256 ptys.

If you really want to do some reading, find a copy of 'The Magic Garden
Explained' for Solarisms, read some McKusick books for BSDisms, and read
Stevens' books for network programming. Sun also has high-performance
and troubleshooting courses (ST350 is a fun one).

Good luck!

PS do port to OSX ;)

David Kirkby

2009/08/26 0:51:32
On Aug 25, 9:34 pm, rob_AT_badeend_DOT...@nowhere.nowhere (Rob
Kouwenberg) wrote:
> Dave <f...@coo.com> wrote:
> > I suspect there is something going on in Sage which some linux user
> > added, which is screwing it up on Solaris.
>
> Pfft, a wild guess: if you're writing GNU software in C, do note that
> malloc and the like might perform differently across platforms. Some
> just reserve the memory, others zero all the requested memory space (a
> performance hit). Testing and reading the source code might find the
> culprit.

Much of Sage is written in python, as python is used to glue lots of
other bits of software together. Some of that software is written in C,
other parts in C++, Fortran, LISP - and probably others too.

> Locks might also perform differently. Does Sage use I/O with file
> handles, with resource starvation possibilities? Depending on OS and
> tuning, devices, ttys, etc. might be used. If I'm not mistaken, Solaris
> has a default max of 256 ptys.

I now know what the problem was. Sage was calling 'top' to find the
memory usage by a process. That was failing as top was not in my path.
The code then repeatedly tried calling top, so it was in an infinite
loop.

> If you really want to do some reading, find a copy of 'The Magic Garden
> Explained' for Solarisms, read some McKusick books for BSDisms, and read
> Stevens' books for network programming. Sun also has high-performance
> and troubleshooting courses (ST350 is a fun one).
>
> Good luck!
>
> PS do port to OSX ;)

Why would we bother porting Sage to OS X? It already runs on OS X. So
give it a try if you want. Sage runs on

* Linux
* OS X
* Solaris (but difficult to build, only 32-bit and a bit buggy)


There are ports in progress for Windows (sponsored by Microsoft) and
FreeBSD. My attempts to drum up some interest from AIX and HP-UX
people have not had much response. I might one day try the latter two
systems, as I do have computers at home which run AIX, tru64, HP-UX
and IRIX. But the IBM RS6000 which runs AIX uses too much power to
keep it on for long periods. It's also too slow to be used as a
platform for development for something as large as Sage. I suspect it
would take several days to compile the source code.

Dave

Rob Kouwenberg

2009/08/29 6:34:57
David Kirkby <drki...@gmail.com> wrote:
> I now know what the problem was. Sage was calling 'top' to find the
> memory usage by a process. That was failing as top was not in my path.
> The code then repeatedly tried calling top, so it was in an infinite
> loop.

OMG ! you're calling this a commercial application !? Why would you fork
an external application that compares snapshot statistics and parse
output ?? Do you know how top functions !? This is bananarepublic
quality !!

claus...@googlemail.com

2009/08/29 7:19:53
*snip*

> OMG ! you're calling this a commercial application !? Why would you fork
> an external application that compares snapshot statistics and parse
> output ?? Do you know how top functions !? This is bananarepublic
> quality !!

Oh, I have seen awk one-liners being called every other millisecond or
so in order to parse a log file and then trigger other scripts from it.
This is a commercial, expensive CMS system ...

Gary R. Schmidt

2009/08/29 9:03:58
And don't forget that the now senior idiot who implemented the dumb idea
has defended it against all-comers for so long now that inside the
company heesh has a reputation for being a "really, really good
designer," and is relied on to do all the bleeding-edge stuff that
no-one else can understand. (Or gets a chance to!!)

Next question - how many of us have had to work around that idiot?

Cheers,
Gary B-)

Tim Bradshaw

2009/08/29 9:43:31
On 2009-08-29 11:34:57 +0100, rob_AT_bad...@nowhere.nowhere
(Rob Kouwenberg) said:

> OMG ! you're calling this a commercial application !? Why would you fork
> an external application that compares snapshot statistics and parse
> output ?? Do you know how top functions !? This is bananarepublic
> quality !!

There is no limit to how bad commercial applications (or free
applications) can be (or how resistant people can be to fixing the
cretinism). Sun's patch utilities are a nice case in point: it's
really *not* a good idea to write the whole package database for every
file you change, yet they both do this (dtrace is your friend to
confirm things like this) and people will deny that this is the
problem. Now add zones to the mix (copy of the package database per
zone) and you're looking at machines which can essentially not be
patched at all (though Live Upgrade helps: at least you don't need an
outage while the patching is happening).

--tim

Marc

2009/08/29 20:58:52
Tim Bradshaw wrote:

> cretinism). Sun's patch utilities are a nice case in point: it's
> really *not* a good idea to write the whole package database for every
> file you change, yet they both do this (dtrace is your friend to
> confirm things like this) and people will deny that this is the
> problem. Now add zones to the mix (copy of the package database per
> zone) and you're looking at machines which can essentially not be
> patched at all (though Live Upgrade helps: at least you don't need an
> outage while the patching is happening).

OpenSolaris moved to another packaging system (I don't know how well it
works), and I don't think Sun employees are denying that slow patching in
Solaris is an issue.

Tim Bradshaw

2009/08/30 6:50:55
On 2009-08-30 01:58:52 +0100, Marc <marc....@gmail.com> said:

> OpenSolaris moved to another packaging system (I don't know how well it
> works), and I don't think Sun employees are denying that slow patching in
> Solaris is an issue.

Depends who you ask, but people will tell you "it's slow because x"
where "x" varies but is generally not "because we screwed up the design
and can't be bothered to fix it". There are some honourable exceptions
to this, but they clearly don't have the ability to actually get
someone to fix it, sadly.

Note it does not require moving to a new patching / packaging system to
fix the problem, or a change in format to the contents file (people
have told me both of these).

OpenSolaris's packaging system does seem to be better, but helps not at
all for older releases.

Stefaan A Eeckels

2009/08/31 12:09:05
On Mon, 24 Aug 2009 23:32:03 +0100
Dave <f...@coo.com> wrote:

> The *evil and ugly* code was, I believe, calling top in a loop as it
> kept failing. It was that which wreaked havoc on my system. I've
> written some C code to do this, which will mean top will not be
> needed.

Way to go!!

--
Stefaan A Eeckels
--

You don't have to spend the rest of your life
exercising yourself to death.
-- SPAM can be profound :)

Dave

2009/08/31 17:12:11


Hold your horses!

* I did not write the code
* It is not commercial.

I will write something to get the information needed, though. (I believe
it was used for testing memory leaks, which rather makes the use of 1 MB
resolution a bit silly.)

I'm well aware the code is bad. I'm also aware how to sort it out.

Rob Kouwenberg

2009/09/06 17:12:08
Dave <f...@coo.com> wrote:
> Hold your horses!
>
> * I did not write the code

:) Yeah, that's what usually happens. Do note that IBM code is worse:
they hire programmers, sack them, and when enough customers complain,
dust the bugs out.

> * It is not commercial.
> I will write something to get the information needed, though. (I believe
> it was used for testing memory leaks, which rather makes the use of 1 MB
> resolution a bit silly.)

> I'm well aware the code is bad. I'm also aware how to sort it out.

Code can always be improved; it's more a matter of functionality and of
commercial and functional requirements.

FWIW, have a look at the /bin/ps source code in C: it's got all you
need. Or mount your favorite /proc solution and parse the appropriate
files.
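
(On Solaris the file to parse is /proc/<pid>/psinfo, a binary psinfo_t
as defined in <procfs.h>, rather than text.)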
