Any Known Issue with SMP Ver 1.1.1Ga for OpenServer 5.0.6?

JP

Nov 1, 2002, 10:05:44 PM
I have posted some messages about unexplained %sys activity on SCO
even when the system was not running any tasks. It's been a few months but
the problem is still not resolved.

The system has two Xeon processors running SMP 1.1.1Ga. The problem will
disappear if I keep rebooting and run sar to verify. There is another way
to bring the %idle back to 100. The trick is to deactivate the second CPU.
That way I can see %sys shown as zero and %idle go back to 100.

I have physically swapped the 2 CPUs but it did not make any difference.
No matter which CPU takes its turn as the "second" CPU, every time I disable the
second CPU, I can see a better performance from the sar. I cannot help
wondering if there is any bug or incompatibility between SMP and the
hardware. I am using HP NetServer LH3000 U3 (HP part no. P2482b).

Joe

Bela Lubkin

Nov 2, 2002, 1:30:36 AM
to sco...@xenitec.on.ca
JP wrote:

I'm having a lot of trouble following the discussion since you keep
posting from different accounts, with different subject lines, and
including either way too much or no context at all... Some of what I'm
going to say here might not be accurate due to lack of that context.

The "smoking gun" symptom that you're seeing -- a couple of percent of
sys time on the 2nd CPU -- is rather subtle. Most people would never
notice it, and you wouldn't be concerned about it either if there wasn't
a secondary symptom. You say that your users complain about performance
when the system is in the "bad" state. You need to characterize _that_
performance problem, because it is _not_ caused by the 2% sys time. I
am not going to accept that your users are so performance-sensitive that
they notice and complain about a 1-2% difference.

I'm not saying that the two aren't tied together. But you must write a
description of the actual performance issues your users see, which must
be something other than "their jobs only run at 98-99% of normal speed".

No, there is no known issue that fits the description you've been giving,
as far as anyone can tell.

You've also made inconsistent statements which are making it very hard
to follow the problem. Previously you've shown how you reboot and
sometimes see 2% sys time, sometimes see 0%. You said that the users do
not complain on a "good" boot where %sys is 0. In fact you seem to say
so in this message, but then you also say "every time I disable the
second CPU, I can see a better performance from the sar". That's
confusing the matter. If you mean "with two CPUs I see some bad boots
and some good; with one, all boots are good" -- say that.

From the symptoms you describe, I am _guessing_ at a possible cause. It
_sounds_ like the 2nd CPU is running at a greatly reduced speed. The
"smoking gun" symptom would happen because the CPU is never entirely
idle. Both CPUs take 100 timer ticks per second, for instance. With a
normally functioning full speed CPU, handling those ticks probably takes
less than .1 % of a CPU. Now suppose the CPU was running 20x slower,
for some reason. It would still take as many clock cycles to handle
each timer tick, but now there are 1/20 as many total clock cycles per
realtime second, so now the CPU is 2% busy.
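
The arithmetic in this guess can be checked with a small sketch. The function name `busy_pct` and its two arguments are made up for illustration, and the 0.1% baseline is only the rough upper bound mentioned above, not a measurement.

```sh
#!/bin/sh
# busy_pct BASE_PCT SLOWDOWN  (hypothetical helper, not an SCO utility)
# Fixed per-tick work costs the same number of clock cycles at any clock
# speed, so the share of the CPU it consumes scales with the slowdown factor.
busy_pct() {
    awk -v base="$1" -v slow="$2" 'BEGIN { printf "%.1f\n", base * slow }'
}

busy_pct 0.1 20    # 0.1% at full speed, CPU running 20x slower: prints 2.0
```

The same function shows why the symptom is invisible on a healthy CPU: `busy_pct 0.1 1` stays at 0.1, which `sar` rounds down to 0.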

To users, this would show up as some sort of erratic slow execution.
Each process starts out on one CPU or the other and tends to stick to
the same CPU, but can also migrate depending on system activity.
Running the same job repeatedly would result in varying runtimes.

And of course this is only happening during some boots.

To summarize, this is my _guess_ based on the symptoms:

- some boots are fine, both CPUs work at full speed
- some boots are bad, CPU 2 runs at a significantly reduced speed
  (for an as-yet unknown reason)
- when CPU 2 is running slowly, your users complain about performance,
  and you notice it in `sar` because just handling the timer ticks
  takes enough CPU to be noticeable

Here's a shell script which may help diagnose this. It starts up one
process per CPU, then in each process, times a simple spin loop several
times. If the system is otherwise idle, each process will end up
running on a separate CPU. It should be quite obvious in the output if
one process is running significantly faster than the other. In that
case, _which_ process runs faster may change over the course of a run,
but you'll still see that something weird is happening.

On an idle system, what you _should_ see is that each loop takes about
the same time (with maybe up to 10% variation), and the entire set of
loops run by each process should end at about the same time. If one CPU
is running significantly faster then you'll see some loops that take a
lot less time, and one process may finish long before the other.

VERY IMPORTANT: run this at least once on a "good" 2-CPU boot and once
on a "bad" 2-CPU boot. The point is to compare behavior between the two
states.

>Bela<

=============================================================================

#!/bin/sh

LOOPS=10        # how many times to run the outer loop
SPINS=2000000   # adjust this manually so each loop takes about 1 second

procs=--
# on hangup/interrupt/quit/terminate, signal the background loops and exit
trap 'kill -1 $procs >/dev/null 2>&1; exit' 1 2 3 15
# number of CPUs as reported by uname -X
ncpu=`LANG=C uname -X | awk '/NumCPU/ { print $3 }'`
proc=0
while [ $proc != $ncpu ]; do
    proc=`expr $proc + 1`
    loop=1
    # one background timing loop per CPU
    while [ $loop -le $LOOPS ]; do
        echo Process $proc loop $loop: `/bin/time awk 'BEGIN { for(i=0; i<'$SPINS'; i++) ; }' 2>&1`
        loop=`expr $loop + 1`
    done &
    procs="$procs $!"
done
wait
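
Comparing a "good" run with a "bad" run by eye can get tedious, so the output of the script above could be condensed with something like the following sketch. It is not part of the original post. The function name `summarize_spin` is invented here, and the line format it parses is an assumption about how `/bin/time` output looks once `echo` has flattened it onto one line.

```sh
#!/bin/sh
# summarize_spin: read the spin-loop script's output on stdin and print,
# per process, the loop count and the mean and max "real" time.
# ASSUMPTION: each line looks like
#   Process 1 loop 3: real 1.0 user 0.9 sys 0.0
# Adjust the parsing if /bin/time prints something different.
summarize_spin() {
    awk '
    $1 == "Process" {
        for (i = 1; i < NF; i++)
            if ($i == "real") {
                p = $2                  # process number
                t = $(i + 1) + 0        # elapsed time, forced numeric
                n[p]++
                sum[p] += t
                if (t > max[p]) max[p] = t
            }
    }
    END {
        for (p in n)
            printf "process %s: loops=%d mean_real=%.2f max_real=%.2f\n",
                   p, n[p], sum[p] / n[p], max[p]
    }' | sort
}

# Usage (spin.sh and run.good are hypothetical file names):
#   sh spin.sh > run.good
#   summarize_spin < run.good
```

If one CPU is running slowly, the two processes should show clearly different mean times on a "bad" boot and nearly identical ones on a "good" boot.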

Mike Brown

Nov 2, 2002, 3:55:38 PM

There are some issues with new Xeons that may cause slow performance.

- Make sure HyperThreading/Jackson Technology is turned off in the BIOS.

- Load OSS648a

A consistent answer to your problem would be a CPU initialization problem
that causes the CPUs to run slower. The minor amount of SYS load would
not show up on a fast CPU, but may take up 1-2% on a slow CPU.

Mike
--
Michael Brown

The Kingsway Group

Bela Lubkin

Nov 2, 2002, 5:13:31 PM
to sco...@xenitec.on.ca
Mike Brown wrote:

> JP wrote:

> > The system has two Xeon processors running SMP 1.1.1Ga. The problem will
> > disappear if I keep rebooting and run sar to verify. There is another way
> > to bring the %idle back to 100. The trick is to deactivate the second CPU.
> > That way I can see %sys shown as zero and %idle go back to 100.

> There are some issues with new Xeons that may cause slow performance.
>
> - Make sure HyperThreading/Jackson Technology is turned off in the BIOS.
>
> - Load OSS648a
>
> A consistent answer to your problem would be a CPU initialization problem
> that causes the CPUs to run slower. The minor amount of SYS load would
> not show up on a fast CPU, but may take up 1-2% on a slow CPU.

That's what I think the problem is, too. But in his earlier posts, the
`sar` output shows the CPU as "PentIII". I don't think we've ever
announced Pentium 4 family CPUs (including today's "Xeon with no mention
of Pentium" P4 family CPUs) as "PentIII". The problems fixed by oss648a
do not have any effect on Pentium 3 family CPUs, including Pentium 3
Xeons. Also, he posted "%clock" outputs showing that the first CPU is
1.0GHz, a speed which I believe is slower than any production P4-family
Xeon that has been shipped.

On the gripping hand, "JP" has resisted all attempts to give a good
description of his system, so who knows.

I'd like to see output of:

hwconfig -h
hw -v

from the system. I'd also like him to run those once on a "good" boot
and once on a "bad", diff the results, and report any interesting
differences. (Do _not_ particularly want to see two entire sets of the
output.) When posting that, make sure your news posting program does
not spuriously wordwrap the output.
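
One way to collect and compare the two sets of output is sketched below. Only `hwconfig -h` and `hw -v` come from the request above. The function names, the file names, and the "good"/"bad" labels are illustrative choices.

```sh
#!/bin/sh
# snapshot LABEL: save both hardware reports for the current boot.
# LABEL is a free-form tag; compare_boots below expects "good" and "bad".
snapshot() {
    hwconfig -h > "hwconfig.$1" 2>&1
    hw -v       > "hw.$1"       2>&1
}

# compare_boots: print only the lines that differ between the two snapshots.
compare_boots() {
    for report in hwconfig hw; do
        echo "== $report =="
        diff "$report.good" "$report.bad"
    done
}

# Usage:
#   snapshot good      # after a boot where idle %sys is 0
#   snapshot bad       # after a boot where idle %sys is around 2
#   compare_boots
```

Posting just the `compare_boots` output keeps the message short and avoids sending two full copies of the reports.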

>Bela<

Mike Brown

Nov 3, 2002, 8:04:10 PM

Somehow the posts got separated at some point by the ISP or my newsreader.

The 1.0 Ghz Xeons do not have Hyperthreading or any of the other problems
I mentioned, sorry I did not tie in the previous post.

I have used the following as a quick test of raw CPU speed:

timex bc <<!!
2^4096
!!

On a Xeon 1 Ghz I get

real 0.10
user 0.06
sys 0.00


What numbers do you get when the system is "slow"?

Bela Lubkin

Nov 4, 2002, 2:34:13 AM
to sco...@xenitec.on.ca
Mike Brown wrote:

> Somehow the posts got separated at some point by the ISP or my newsreader.

He's been posting under different subjects and even from different
account names and hosts -- makes it rather hard to follow.

> The 1.0 Ghz Xeons do not have Hyperthreading or any of the other problems
> I mentioned, sorry I did not tie in the previous post.
>
> I have used the following as a quick test of raw CPU speed:
>
> timex bc <<!!
> 2^4096
> !!
>
> On a Xeon 1 Ghz I get
>
> real 0.10
> user 0.06
> sys 0.00
>
>
> What numbers do you get when the system is "slow"?

I don't have a "slow" system to test, but tried it on a 200MHz Pentium
Pro:

real 0.45
user 0.33
sys 0.06

Couple of problems with this benchmark: on current CPUs it's just too
fast (you only got 10 units of measurement granularity); and it produces
a lot of output. The output can slow it down, but that's measuring
completely different resources (video speed etc.). When I ran it with
output directed to /dev/null, I got results like .36/.33/.03,
.36/.34/.02.
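
A possible way around both problems is to repeat the calculation enough times to get a measurable runtime and to throw the output away. The sketch below is only a suggestion. `gen_workload` is an invented helper and the repeat count of 50 is a guess that would need tuning on the machine in question.

```sh
#!/bin/sh
# gen_workload N: print N copies of the bc expression, one per line,
# so that the timed run lasts long enough to measure.
gen_workload() {
    awk -v n="$1" 'BEGIN { for (i = 0; i < n; i++) print "2^4096" }'
}

# Usage (tune 50 until the run takes a few seconds):
#   gen_workload 50 | timex bc > /dev/null
```

The timing figures are still visible with the redirect in place because `timex` writes its report to standard error.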

Let's not try to focus on whether his machine is slow in an absolute
sense. He has behavior where sometimes it is perceived as fast and
sometimes as slow. Let's compare benchmark results on the same machine
in its two states. Trying to bring in absolute comparisons will just
confuse things further.

>Bela<

JP

Nov 10, 2002, 10:25:04 AM
"Bela Lubkin" <be...@caldera.com> wrote in message
news:2002110122...@mammoth.ca.caldera.com...

Bela,

Just got back and can't wait to apologize for causing some confusion
in the past. I am really sorry for wasting your efforts with the
misleading or inconsistent information given. I did not work full time for
the site but just provided them with network support. Some of the
information given did not come from my direct observation. Also, I accessed
the newsgroup from different machines, at work or at the client's site. As a
result, the names and email addresses on the threads are quite confusing.
My news reader removes read subjects regularly so I was tempted to start a
new thread. But now I know that would cause more problems and will avoid
doing this.

To quickly answer some of your questions, I have the following points to
make:

1. I have verified the configuration of the system. Instead of being a
dual Xeon as the owner had described it to me, it is actually a dual
Pentium 3 1 GHz system.

2. I know the 2-4% of SYS may be negligible but I notice that my keyboard
entry sometimes does not echo immediately after a "bad boot". It hesitates
for a brief moment and then all the characters come out at once. I
have had that experience, but of a more severe nature, during initial
installation. That problem was overcome by disabling the amiraid monitor on
the HP server. It was a known issue that was caused by the HP raid driver.
This time, I am not sure what the problem is. I try to get rid of it by
rebooting repeatedly instead of leaving it to see if it really causes grief
to the users. It is a busy server which serves offices in different time
zones. I really cannot bring down the server during weekdays.

3. Yes. If I disable the second CPU, I can see that the 2% SYS disappears
and %idle shows 100. With both CPUs running, the hourly average of CPU
utilization reported by sar never gives 100 %idle. I am tempted to leave
it disabled to see if users would notice any difference.

> confusing the matter. If you mean "with two CPUs I see some bad boots
> and some good; with one, all boots are good" -- say that.

4. I started to leave the system running after a "bad boot". Here is the
status on a Sunday morning with no application loaded.
# sar 1 10

SCO_SV sco_srv 3.2v5.0.6 PentIII    11/10/2002

10:06:11    %usr    %sys    %wio   %idle (-u)
10:06:12       0       0       0     100
10:06:13       0       5       0      95
10:06:14       0       0       0     100
10:06:15       0       0       0     100
10:06:16       0       0       0     100
10:06:17       0       3       1      96
10:06:18       0       1       0      99
10:06:19       0       0       0     100
10:06:20       0       4       0      96
10:06:21       0       1       0      99

Average        0       1       0      99

5. I have swapped the slots of the 2 CPUs. No matter how the 2 CPUs are
seated, 'cpuonoff -i 2' eliminates the unknown %sys activity even though
the actual "CPU2" has changed.

6. I am sure that we did not have this problem during initial setup. Even
a few months after installation, I recall that the daily sar reports
still showed 100 in %idle during some night hours. I never noticed any
consistent %sys utilization during weekends.

7. I have to admit that I might be worrying too early about the performance
impact of the "bad boot" as it is only 2-4% (sometimes 5%). But I have seen
them suffer from the AMIRAID monitor before, which showed a very similar
symptom in sar. Therefore, I tend to hide it away by rebooting and turn to
other experts in the newsgroup for troubleshooting.

8. Your speculation makes total sense. I cannot wait to schedule a time to
go on site and test the script.

Thank you for your patience, Bela.

Joe
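
A one-second `sar` sample like the one in point 4 is easier to compare between boots when reduced to a few numbers. The sketch below does that. `sar_sys_summary` is an invented name, and the parsing assumes the column layout shown in that sample (timestamp, %usr, %sys, %wio, %idle).

```sh
#!/bin/sh
# sar_sys_summary: read `sar -u` style output on stdin and summarize %sys.
# Only rows that start with a timestamp and carry a numeric %sys are counted,
# which skips the column-header line and the "Average" line.
sar_sys_summary() {
    awk '
    $1 ~ /^[0-9][0-9]:[0-9][0-9]:[0-9][0-9]$/ && $3 ~ /^[0-9]+$/ {
        n++
        sum += $3
        if ($3 + 0 > 0) busy++
        if ($3 + 0 > max) max = $3 + 0
    }
    END {
        if (n) printf "samples=%d nonzero_sys=%d max_sys=%d mean_sys=%.1f\n",
                      n, busy, max, sum / n
    }'
}

# Usage:
#   sar 1 10 | sar_sys_summary
```

Fed the ten samples from point 4, it prints `samples=10 nonzero_sys=5 max_sys=5 mean_sys=1.4`, which shows more than the rounded "Average" row does.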
