
What exactly is 'load average'?


Dr. David Kirkby
Jun 15, 2003, 9:55:53 PM
I've been trying to measure the load average on a quad-processor Sun
Ultra 80 running Solaris 9 using a UNIX system call. I seem to be able
to do this, but I'm baffled by the numbers I get back. Perhaps someone
can enlighten me on exactly what the 'load average' is on Solaris.

Consider this C program (abbreviated for clarity), started on a system
with little else running. It's a home computer, with X running, but
doing nothing - not even editing a file.

#include <sys/loadavg.h>
#include <stdio.h>
#include <unistd.h>  /* for sleep() */

int main()
{
  double loadavg[3]; /* Array to store the load averages */

  do_CPU_intensive_task_for_3_minutes();
  sleep(70);
  getloadavg(loadavg, 1); /* Sample only the 1 minute load average */
  printf("1 minute load average is %f\n", loadavg[LOADAVG_1MIN]);
  return 0;
}
This typically prints a value of 0.4 to 0.6.

What I can't understand is that after the system has gone to sleep
(basically doing nothing) for 70 s, the one minute load average is not
close to zero, but typically around 0.4.

I don't expect the 5 minute or 15 minute load averages to be close to
zero, since averaged over 5 or 15 minutes, the system has been busy.
But over the last 1 minute it has been sleeping, so why is the load
average not close to zero??

I know 'top' uses resources, but using it to display the load
averages gives about the same value as my program, so I think the
code is working sensibly. The return value from getloadavg() is as
expected.


--
Dr. David Kirkby,
Senior Research Fellow,
Department of Medical Physics,
University College London,
11-20 Capper St, London, WC1E 6JA.
Tel: 020 7679 6408 Fax: 020 7679 6269
Internal telephone: ext 46408
e-mail da...@medphys.ucl.ac.uk

Tom Hamilton
Jun 15, 2003, 10:46:29 PM


load is the number of processes in the run queue (waiting for a cpu slice)
all other requirements (io, etc) having been satisfied. Output from my ps
-elf reveals numerous processes running independent of MY using the
machine. vold, sendmail, cron, inetd, syslog, sshd, xntpd. I suspect
these need attention at some point.

Frank Cusack
Jun 15, 2003, 11:58:09 PM
On Mon, 16 Jun 2003 02:46:29 GMT "Tom Hamilton" <thamil...@snet.net> wrote:
> On Mon, 16 Jun 2003 02:55:53 +0100, Dr. David Kirkby wrote:
> load is the number of processes in the run queue (waiting for a cpu slice)
> all other requirements (io, etc) having been satisfied.

McKusick says load is the length of the run queue AND the number of
processes waiting for disk i/o to complete, for 4.4BSD anyway. I
suspect that really means, waiting on vnode pager to complete a pagein
(so as to account for NFS i/o as well as local i/o). At least on
Linux, by observation load avg does seem to take NFS activity into
account.

hmm ... This page is interesting:
<http://www.teamquest.com/html/gunther/ldavg1.shtml>, found in 30
seconds with Google.

It disagrees with McKusick in its text (although it doesn't say 4.4BSD
specifically), saying load is the runq length (only), and citing
sources (including Adrian Cockcroft), but then goes on to show how
it's calculated in Linux, which clearly is not just the runq length.
Sigh.

Anyway, that page describes it pretty well, I think. Just add in the
waiting on disk i/o factor ...

/fc

Rich Teer
Jun 16, 2003, 1:21:34 AM
On Mon, 16 Jun 2003, Dr. David Kirkby wrote:

> I've been trying to measure the load average on quad processor Sun
> Ultra 80 running Solaris 9 using a UNIX system call. I seem to be able
> to do this, but I'm baffled by the numbers I get back. Perhaps someone
> can enlighten me on exactly what is the 'load average' on Solaris.

The run queue is the number of processes running plus the
number of processes that are runnable but waiting for CPU
resource. The load average is the length of the
run queue, averaged over a given time period.

See, e.g., the uptime man page.

--
Rich Teer, SCNA, SCSA

President,
Rite Online Inc.

Voice: +1 (250) 979-1638
URL: http://www.rite-online.net

Valentin Nechayev
Jun 16, 2003, 2:43:04 AM
>>> Dr. David Kirkby wrote:

DDK> What I can't understand is that after the system has gone to sleep
DDK> (basically doing nothing) for 70 s, the one minute load average is not
DDK> close to zero, but typically around 0.4.

It may be calculated not with a circular buffer of loads (such a buffer would
be too long), but with an accumulating technique similar to the following:

la[x+1] = 0.999*la[x] + current_queue_length*0.001

The coefficients in the formula above are arbitrary; the real ones should be
determined from the measuring rate and the "target interval" (1, 5, 15 min)
of the load average.
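
Purely as an illustration of that relationship (this is generic theory, not
lifted from any particular kernel), the coefficient is usually chosen as
exp(-sampling_interval / window); a small C example, assuming a 5 second
sampling interval:

/* Illustration only: how the decay coefficient in an update like
 *   la = c*la + (1-c)*runq
 * relates to the sampling interval and the nominal 1/5/15 minute window.
 * Compile with -lm. */
#include <math.h>
#include <stdio.h>

int main(void)
{
    const double dt = 5.0;                           /* assumed sampling interval, seconds */
    const double window[3] = { 60.0, 300.0, 900.0 }; /* 1, 5 and 15 minute targets */
    int i;

    for (i = 0; i < 3; i++) {
        /* the old value decays with the window as its time constant */
        double c = exp(-dt / window[i]);
        printf("%4.0f s window: c = %.4f\n", window[i], c);
    }
    return 0;
}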

I don't know the Solaris implementation; the above is just general theory.


-netch-

Greg Andrews
Jun 16, 2003, 1:52:16 PM
"Dr. David Kirkby" <drki...@ntlworld.com> writes:
>
>I don't expect the 5 minute or 15 minute load averages to be close to
>zero, since averaged over 5 or 15 minutes, the system has been busy.
>But over the last 1 minute it has been sleeping, so why is the load
>average not close to zero??
>
>I know 'top' uses resources, but using that to display the load
>averages, gives about the same value my program gives, so I think the
>code is working sensibly. The return value from getloadavg() is as
>expected.
>

So we have a syllogism here: Your program's readout matches top's
readout, and top uses resources, therefore your program...

-Greg
--
Do NOT reply via e-mail.
Reply in the newsgroup.

fob
Jun 16, 2003, 6:36:41 PM
In article <pan.2003.06.16...@snet.net>,
thamil...@snet.net says...

> On Mon, 16 Jun 2003 02:55:53 +0100, Dr. David Kirkby wrote:
>
> > I've been trying to measure the load average on quad processor Sun
> > Ultra 80 running Solaris 9 using a UNIX system call.

Speaking of multiprocessor systems, is it true that in an N-processor
system Solaris only runs threads on N-1 processors and dedicates the Nth
processor to handling interrupts? If so, then is that true for all
versions of Solaris from 2.5.1 through 2.9? (Where N > 1, of course.)

Darren Dunham
Jun 16, 2003, 6:45:13 PM

Nope. That would make a 2 CPU system a big waste.

Certain hardware interrupts may be directed at a particular CPU, but no
cpu is "dedicated" to serving interrupts.

--
Darren Dunham ddu...@taos.com
Unix System Administrator Taos - The SysAdmin Company
Got some Dr Pepper? San Francisco, CA bay area
< This line left intentionally blank to confuse you. >

Dr. David Kirkby
Jun 16, 2003, 8:57:00 PM

No, running 'top' or not did not make much difference to the
numbers returned by my program. I was just commenting that the
two programs agreed roughly on the numbers, i.e. my program
seemed to be correctly coded.

This system has been pretty quite pretty much all day, with
the 1 minute load average at 0.07 now and the 5 and 15 minute ones at
0.05.

last pid: 26162;  load averages: 0.07, 0.05, 0.05    01:25:06
108 processes: 106 sleeping, 1 zombie, 1 on cpu
CPU states: % idle, % user, % kernel, % iowait, % swap
Memory: 4096M real, 3322M free, 325M swap in use, 4818M swap free

  PID USERNAME THR PRI NICE  SIZE   RES STATE    TIME    CPU COMMAND
26162 davek      1  10    0 2016K 1296K cpu/3    0:00  0.40% top
  863 root       1  59    0  141M   50M sleep   12:21  0.39% Xsun
 4839 davek      1  49    0   51M   45M sleep    5:09  0.07% .netscape.bin
 1862 davek      5  59    0 9208K 7560K sleep    0:15  0.04% dtwm
26137 root       1  59    0 1072K  832K sleep    0:00  0.02% sh

But clearly it takes a lot longer than one minute for
the one-minute load average to fall to anywhere near these levels.

Dr. David Kirkby
Jun 16, 2003, 9:01:59 PM
Rich Teer wrote:

> The run queue is the number processes running plus the
> number of process that are runnable, but waiting for CPU
> resource. The load average is the average length of the
> run queue, averaged over a given time period.
>
> See, e.g., the uptime man page.


As the URL
http://www.teamquest.com/html/gunther/ldavg1.shtml
(thanks to Frank Cusack <fcu...@fcusack.com>)
shows, it is exponentially damped, which explains why the 1 minute
load average is not low 70 s after the system has been 'quite'. Their
graph shows the 1 minute load average taking about 5 minutes to fall
to near zero after one cpu bound job is stopped.
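
(As a rough check, if one assumes the 1 minute figure decays with a time
constant of about 60 s, then after my 70 s sleep it should still retain
exp(-70/60), i.e. roughly 30%, of whatever value it had while the CPU-bound
task was running - which is consistent with the 0.4 to 0.6 I measured.)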

So I can't help feeling the man pages are not telling the whole truth.

Rich Teer
Jun 16, 2003, 9:51:05 PM
On Tue, 17 Jun 2003, Dr. David Kirkby wrote:

> So I can't help feeling the man pages are not telling the whole truth.

You're a scientist, so you know the answer to that one:
blind 'em with science, and baffle 'em with bullshit! :-)

Valentin Nechayev
Jun 17, 2003, 2:56:24 AM
>>> fob wrote:

> >> I've been trying to measure the load average on quad processor Sun
> >> Ultra 80 running Solaris 9 using a UNIX system call.

f> Speaking of multiprocessor systems, is it true that in an N-processor
f> system Solaris only runs threads on N-1 processors and dedicates the Nth
f> processor to handling interrupts?

I think it's false. Generally MP systems distribute interrupts evenly
to all processors; this is flexible enough to cope with most kinds of load.


-netch-

Tony Walton
Jun 17, 2003, 4:43:17 AM
fob wrote:
>
>
> Speaking of multiprocessor systems, is it true that in an N-processor
> system Solaris only runs threads on N-1 processors and dedicates the Nth
> processor to handling interrupts?

No.

--
Tony

Kurtis D. Rader
Jun 16, 2003, 10:44:41 PM
On Mon, 16 Jun 2003 02:55:53 +0100, Dr. David Kirkby wrote:

> I've been trying to measure the load average on quad processor Sun Ultra 80
> running Solaris 9 using a UNIX system call. I seem to be able to do this, but
> I'm baffled by the numbers I get back. Perhaps someone can enlighten me on
> exactly what is the 'load average' on Solaris.

Frank Cusack came closest to providing the correct answer. On the UNIXes
to which I have source code access (Linux and a couple of proprietary
versions based on a melding of System V and BSD) it is calculated as

processes running + runnable + fast wait
---------------------------------------
number of CPUs

The part that typically confuses people is the "fast wait" term. That is
the number of processes sleeping at a non-interruptible priority. This
is typically because they are waiting for a disk I/O operation to
complete, but it could also be due to any number of other reasons; e.g.,
sleeping for a kernel memory allocation to complete. The reason that
term is included is the premise that processes sleeping at a
non-interruptible priority will do so for a very short duration (e.g.,
a few milliseconds). Because they are likely to become runnable
again in the very near future (possibly before the current timeslice
completes), it is reasonable to include them in the calculation of the
CPU "load".

Of course, the actual calculation is slightly more complex to handle the
aging, but that doesn't materially affect the definition. The above
calculation is typically performed at each timer tick (e.g., 1/100 of a
second) and the current value factored into the 1, 5, and 15 minute
averages using an exponential decay equation executed using integer
arithmetic for speed.

On many systems there are a few system daemons that sleep at a
non-interruptible priority. In many cases the daemons rarely execute,
yet they still skew the load average. The reason they sleep
non-interruptibly is that it simplifies the code.

Dr. David Kirkby
Jun 17, 2003, 8:52:49 AM
"Kurtis D. Rader" wrote:
>
> On Mon, 16 Jun 2003 02:55:53 +0100, Dr. David Kirkby wrote:
>
> > I've been trying to measure the load average on quad processor Sun Ultra 80
> > running Solaris 9 using a UNIX system call. I seem to be able to do this, but
> > I'm baffled by the numbers I get back. Perhaps someone can enlighten me on
> > exactly what is the 'load average' on Solaris.
>
> Frank Cusack came closest to providing the correct answer. On the UNIXes
> to which I have source code access (Linux and a couple of proprietary
> versions based on a melding of System V and BSD) it is calculated as
>
> processes running + runnable + fast wait
> ---------------------------------------
> number of CPUs
>
> Of course, the actual calculation is slightly more complex to handle the
> aging, but that doesn't materially affect the definition. The above
> calculation is typically performed at each timer tick (e.g., 1/100 of a
> second) and the current value factored into the 1, 5, and 15 minute
> averages using an exponential decay equation executed using integer
> arithmetic for speed.

Those three little words, 'exponential decay equation', make a huge
difference to the definition, and they are completely missing from the
uptime(1) man page. To quote from that man page on my Solaris 9
system:

"The uptime command prints the current time, the length of time
the system has been up, and the average number of jobs in the run
queue over the last 1, 5 and 15 minutes"

In theory at least, ignoring rounding errors, and being very pedantic,
the 1 minute load average on my system now depends on what the system
has done ever since it was booted!! So it's hardly an average of
'anything' over the last 1 minute.

The other issue you brought up is the number of CPUs. Whilst I can't
find the page, somewhere in Adrian Cockcroft's book "Sun Performance
and Tuning" it says something like (it's not a quote):

"You need more cpu power if the load average is greater than 3x the
number of CPUs". That would suggest to me that a load average of 20 is
fine if you have 100 CPUs, but not if you have 1 CPU.

If you throw "divided by the number of CPUs" into the definition of
load average - as your equation shows - that suggests an acceptable
load average is independent of the number of CPUs.

I just checked the man pages for an HP-UX 11 system:

HP-UX $ man uptime
"uptime prints the current time, the length of time the system has
been up, the number of users logged on to the system, and the average
number of jobs in the run queue over the last 1, 5, and 15 minutes for
the active processors."

and for a Tru64 5.1B system, which has a -m option to show the 'Mach
factor'

Tru64 $ man uptime

"The load average numbers give the number of jobs in the run queue for
the
last 5 seconds, the last 30 seconds, and the last 60 seconds. The Mach
factor is a variant of the load average, given for the same intervals;
the closer its value to 0, the higher the load."

So Solaris and HP-UX at least agree on 1, 5 and 15 minutes, whereas
Tru64 uses 5, 30 and 60s. HP-UX mentions the number of processors,
Solaris and Tru64 don't.

Perhaps later, just for interest's sake, I'll load this quad-processor
Ultra 80, give the load averages time to stabilise and I'll take some
of the CPUs off-line and see if that affects the load average or not.

I don't have a Sun contract so can't file a bug report, but I can't
help but feel there is a bug in the man page for uptime(1) if it
omits to tell users of this exponential weighting factor, or omits
to mention a scaling factor such as the number of CPUs. Either factor,
if it exists, should be documented.

I know I'm a scientist, but am I really being too pedantic to expect
such information in a man page? It would have saved me the hassle of
writing a program to read the load average, because this exponential
weighting factor meant it was unsuitable for my needs.

El Toro
Jun 17, 2003, 11:50:00 AM
> processes running + runnable + fast wait
> ---------------------------------------
> number of CPUs

Does the load average reported on Linux, FreeBSD, and say, Solaris take
the number of CPUs into account, as this formula dictates? IIRC, I've
always had to factor SMP into the picture after obtaining the load average
number reported by 'uptime'.

I've had this discussion on load average with a few different groups, and
it has always been a confusing topic given the variety of definitions one
would receive from various sources. However, at least the one provided
here seems to be the one that is gaining more ground than the others on
the most commonly used Unices.

Frank Cusack
Jun 17, 2003, 7:10:50 PM
On Tue, 17 Jun 2003 13:52:49 +0100 "Dr. David Kirkby" <drki...@ntlworld.com> wrote:
> I know I'm a scientist, but am I really being too pedantic to expect
> such information in a man page?

I don't think so.

/fc

Dr. David Kirkby
Jun 17, 2003, 8:03:53 PM

Me neither.

Perhaps someone from Sun, or someone who has a Sun contract, would
consider filing the contents of the uptime(1) man page as a bug
report.

John D Groenveld
Jun 17, 2003, 8:04:28 PM
In article <3EEF0F21...@ntlworld.com>,

Dr. David Kirkby <drki...@ntlworld.com> wrote:
>I don't have a Sun contract so can't file a bug report, but I can't

Feedback via the comments and feedback link on docs.sun.com has yielded
manpage fixes for me previously.

John
groe...@acm.org

Kurtis D. Rader
Jun 17, 2003, 9:55:24 PM
On Tue, 17 Jun 2003 09:56:24 +0300, Valentin Nechayev wrote:

> I think it's false. Generally MP systems distribute interrupts evenly to
> all processors; this is well elastic to satisfy most kinds of load.

That's true of implementations commonly seen today. However, it has not
always been true. When I joined Sequent Computer Systems (now part of IBM)
over thirteen years ago we had a few competitors selling
multiprocessing systems. Some even claimed their systems were SMP.
However most of them (at that time) did in fact dedicate CPUs to specific
functions (e.g., interrupt handling).

Also, distributing interrupts in a truly random fashion amongst the CPUs is
often a bad thing since you're decreasing the probability of an L1 or L2
cache hit. So some UNIXes (e.g., Linux) attempt to "steer" the interrupt to
the best CPU, where "best" is normally defined as the CPU which most
recently handled an interrupt for the device.

Kurtis D. Rader
Jun 17, 2003, 10:32:07 PM
On Tue, 17 Jun 2003 13:52:49 +0100, Dr. David Kirkby wrote:

> Those three little words 'exponential decay equation' make a huge impact
> on the definition,

No, it does not. At least as I define "huge impact."

> In theory at least, ignoring rounding errors, and being very pedantic,
> the 1 minute load average on my system now depends on what the system has
> done ever since it been booted!! So it's hardly an average of 'anything'
> over the last 1 minute.

True, but in practice it's a non-issue. The cost to calculate true one,
five, and fifteen minute moving averages would be prohibitively expensive
relative to the value of the metric.

> The other issue you brought up is the number of CPUs. Whilst I can't find
> the page, somewhere in Adrian Cockcroft's book "Sun Performance and
> Tuning" it says something like (it's not a quote):
>
> "You need more cpu power if the load average is greater than 3x the
> number of CPUs".
>
> That would suggest to me that a load average of 20 is fine if you have
> 100 CPUs, but not if you have 1 CPU.

Quotes like the above are why I have relatively low regard for most tuning
books. At this point I should probably state my qualifications. I don't
hold a doctorate in Computer Science. However, I've been earning my living
solving performance problems involving UNIX for the past twelve years.
I've also been flown halfway around the world to help a customer with a
performance problem more times than I can remember.

There is no threshold that inherently demarcates "good" from "bad" load
averages (i.e., indicates whether or not there is a performance problem).
This is due to the inclusion of processes in a fast wait state in the
calculation. I've seen systems with load averages greater than three which
had perfectly acceptable performance. I've seen others with a load average
of 0.2 that had unacceptable application performance.

When I talk to customers I tell them to not worry about the specific value.
Rather, pay attention to deviations from the "normal" load average for the
system and workload.

> So Solaris and HP-UX at least agree on 1, 5 and 15 minutes, whereas Tru64
> uses 5, 30 and 60s. HP-UX mentions the number of processors, Solaris and
> Tru64 don't.

I can't speak to those operating systems since I don't have access to their
source code. However, I would be very surprised if they calculated it any
differently than I described. That is, I'm confident they all use the
number of online CPUs as the denominator.

> Perhaps later, just for interest sake, I'll load this quad processor
> Ultra 80, give the load averages time to stabilise and I'll take some of
> the CPUs off-line and see if that affects the load average or not.

If it doesn't the operating system's calculation of load average is broken.
The load average isn't meaningful if it doesn't factor in the number of
online CPUs.

> I don't have a Sun contract so can't file a bug report, but I can't help
> but feel there is a bug in the man page from uptime(1) if it omits to
> tell users of this exponential weighting factor, or it omits to tell of a
> scaling factor such as the number of CPUs. Either factor, if they exist,
> should be documented.
>
> I know I'm a scientist, but am I really being too pedantic to expect such
> information in a man page? It would have saved me the hassle of writing a
> program to read the load average, because this exponential weighting
> factor meant it was unsuitable for my needs.

I agree this needs to be better documented by each OS, preferably by
the inclusion of mathematical equations and/or pseudo-code. However,
I disagree with your statement that the exponential weighting is the
reason the metric was unsuitable for your needs. That has nothing to do
with it. The exponential weighting is simply used to implement a digital
decay such that prior measurements are forgotten in a reasonable period
(e.g., 5 times the load average interval). So, you are correct that it
isn't a true 1, 5, and 15 minute moving average. But in practice the
difference is inconsequential.

The reason it is unsuitable is that the inclusion of processes
sleeping at a non-interruptible priority skews the load average in a
manner that does not accurately reflect the true "CPU load" on your
system, which is why I tell customers to ignore the load average.

As an example: I recently handled a call from a large stock brokerage
in the USA involving this very topic. They were concerned that the load
average on their Linux system was greater than eleven. It turned out
the reason for the unusually high load average was a bunch of kernel
threads sleeping non-interruptibly. Those threads were associated with
"housekeeping" chores of a storage array vendor's software. Those
threads were seldom awakened and therefore consumed almost no CPU
cycles. But because they chose to simplify the implementation by sleeping
non-interruptibly those processes artificially skewed the load average
to higher values. The customer had no cause for alarm.

Richard Pettit [SE Toolkit Author]
Jun 17, 2003, 11:46:32 PM
Dr. David Kirkby wrote:
> "Kurtis D. Rader" wrote:
>
>>On Mon, 16 Jun 2003 02:55:53 +0100, Dr. David Kirkby wrote:
>>
>>
>>>I've been trying to measure the load average on quad processor Sun Ultra 80
>>>running Solaris 9 using a UNIX system call. I seem to be able to do this, but
>>>I'm baffled by the numbers I get back. Perhaps someone can enlighten me on
>>>exactly what is the 'load average' on Solaris.

There are 3 load averages. 1 minute, 5 minutes and 15 minutes. The definition
that you are supposed to just stand back and take for granted is that they are
the average number of processes in the run queue over the last interval
defined by its name. Since the runque value is updated every second, one
might suspect that "average" means (for the 1 minute value) that for each
second in a minute, the number of processes in the run queue is accumulated
and then at the end of the minute, the value is divided by 60. (BTW, the
once every second is a Solaris-ism. Linux seems to use LOAD_FREQ which is
every 5 seconds). The actual computation is typical operating system fixed
point mumbo-jumbo which I will allow the motivated in the audience to
explain. Personally, I really don't care, since looking at the load average
as a sign of system performance is like taking a person's pulse to see if
they have cancer.

The code of interest in Linux is in kernel/timer.c and include/linux/sched.h.
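
Reconstructed from memory of the 2.4-era sources (so check the real files
before trusting the exact constants), the fixed-point version looks roughly
like this:

/* Rough sketch of the Linux load average update, from memory of the
 * 2.4-era kernel/timer.c and include/linux/sched.h. Standalone demo. */
#include <stdio.h>

#define HZ        100                /* assumed tick rate, illustration only */
#define FSHIFT    11                 /* bits of fractional precision */
#define FIXED_1   (1 << FSHIFT)      /* 1.0 in fixed point */
#define LOAD_FREQ (5*HZ)             /* recompute every 5 seconds' worth of ticks */
#define EXP_1     1884               /* ~ FIXED_1 * exp(-5/60)  */
#define EXP_5     2014               /* ~ FIXED_1 * exp(-5/300) */
#define EXP_15    2037               /* ~ FIXED_1 * exp(-5/900) */

#define CALC_LOAD(load, exp, n)  \
    load *= exp;                 \
    load += n * (FIXED_1 - exp); \
    load >>= FSHIFT;

static unsigned long avenrun[3];

/* n is (runnable + uninterruptible tasks), scaled up by FIXED_1 */
static void calc_load(unsigned long n)
{
    CALC_LOAD(avenrun[0], EXP_1,  n);
    CALC_LOAD(avenrun[1], EXP_5,  n);
    CALC_LOAD(avenrun[2], EXP_15, n);
}

int main(void)
{
    int i;

    /* simulate 5 minutes of one runnable task, sampled every 5 seconds */
    for (i = 0; i < 60; i++)
        calc_load(1 * FIXED_1);

    printf("1/5/15 min: %.2f %.2f %.2f\n",
           (double)avenrun[0] / FIXED_1,
           (double)avenrun[1] / FIXED_1,
           (double)avenrun[2] / FIXED_1);
    return 0;
}

The 1884/2048, 2014/2048 and 2037/2048 ratios are just exp(-5/60),
exp(-5/300) and exp(-5/900) expressed in 11-bit fixed point.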

> The other issue you brought up is the number of CPUs. Whilst I can't
> find the page, somewhere in Adrian Cockcroft's book "Sun Performance
> and Tuning" it says something like (it's not a quote):
>
> "You need more cpu power if the load average is greater than 3x the
> number of CPUs". That would suggest to me that a load average of 20 is
> fine if you have 100 CPUs, but not if you have 1 CPU.

That's correct. A run queue length of 10 on a single CPU box means you're
in need of, well, something. A run queue length of 10 on a 20 way box puts
you in tall clover. Note, I use the term run queue length which is the
value (from kstat: unix:0:sysinfo:runque) divided by unix:0:sysinfo:updates.
That value will tell you what the length of the run queue is RIGHT NOW.
Not what you should take for granted that it averaged out to over some
long previous interval.

> If you throw into the equation for the definition of load average
> "divided by number of CPUs" - as your equation shows, that suggests an
> acceptable load average is independent of the number of CPUs.

You should not make a rule that says "if the run queue length is greater than 2"
because it doesn't scale. If you have a run queue for each processor, that's
a different problem. But when you have one queue for multiple servers, you
need to divide by the number of servers (CPUs).

But don't use the avenrun numbers. If you're using Solaris, write a kstat Perl
script that uses the runque and updates values. Or, if you use SE 3.3, this code
tells you:

#include <unistd.se>
#include <kstat.se>

int main()
{
  ks_sysinfo si;                /* unix:0:sysinfo kstat */
  double r;
  double u;
  uint32_t old_r = 0;
  uint32_t old_u = 0;

  for (;;) {
    refresh$(si);
    r = si.runque - old_r;      /* delta of accumulated run queue length */
    u = si.updates - old_u;     /* delta of update count */
    if (old_r != 0) {
      printf("%.2f\n", r / u);  /* current run queue length */
    }
    old_r = si.runque;
    old_u = si.updates;
    sleep(1);
  }

  return 0;
}
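
For anyone without SE installed, an untested C sketch against libkstat should
be roughly equivalent, assuming the unix:0:sysinfo raw kstat exposes the same
runque and updates counters (compile with -lkstat):

/* Untested sketch: print the current run queue length from the
 * unix:0:sysinfo kstat, much like the SE script above. */
#include <stdio.h>
#include <unistd.h>
#include <sys/types.h>
#include <kstat.h>
#include <sys/sysinfo.h>

int main(void)
{
    kstat_ctl_t *kc = kstat_open();
    kstat_t *ksp;
    sysinfo_t si;
    uint_t old_r = 0, old_u = 0;

    if (kc == NULL || (ksp = kstat_lookup(kc, "unix", 0, "sysinfo")) == NULL) {
        perror("kstat");
        return 1;
    }
    for (;;) {
        if (kstat_read(kc, ksp, &si) == -1) {
            perror("kstat_read");
            return 1;
        }
        if (old_u != 0)   /* skip the first pass; we need a delta */
            printf("%.2f\n",
                   (double)(si.runque - old_r) / (double)(si.updates - old_u));
        old_r = si.runque;
        old_u = si.updates;
        sleep(1);
    }
    /* NOTREACHED */
}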

Rich

Frank Cusack
Jun 18, 2003, 7:34:23 AM
On Mon, 16 Jun 2003 19:44:41 -0700 "Kurtis D. Rader" <kra...@skepticism.us> wrote:
> The part that typically confuses people is the "fast wait" term. That is
...

Very interesting stuff, and very well described!

thanks
/fc

Dr. David Kirkby
Jun 18, 2003, 9:28:54 AM
"Kurtis D. Rader" wrote:
>
> On Tue, 17 Jun 2003 13:52:49 +0100, Dr. David Kirkby wrote:
>
> > Those three little words 'exponential decay equation' make a huge impact
> > on the definition,
>
> No, it does not. At least as I define "huge impact."

Well I disagree - for me it made a ***huge*** impact.

I was trying to use the metric to determine if a system was very
'quite' - suitable for benchmarking purposes. If I compiled the
benchmark, slept for 70 s, measured the 1 minute load average and
found it much above zero, I could be sure the system was quite, so any
results from that benchmark would be invalid.

But the fact that the man page omits any mention of exponential weighting
meant I wasted my time finding the system calls to measure the load
average.


> > "You need more cpu power if the load average is greater than 3x the
> > number of CPUs".
> >
> > That would suggest to me that a load average of 20 is fine if you have
> > 100 CPUs, but not if you have 1 CPU.
>
> Quotes like the above are why I have relatively low regard for most tuning
> books.

Perhaps you should look at the book by Adrian Cockcroft and Richard
Pettit - sorry I omitted your name Richard. Perhaps you might learn
something from that book. I have the 2nd edition (1998), which is
showing its age. There may be a later edition, but I feel it's still
useful. You might well learn something.

> At this point I should probably state my qualifications. I don't
> hold a doctorate in Computer Science.

Me neither. In fact I have no CS qualifications at all. A BSc in
electrical and electronic engineering, MSc in Microwaves and
Optoelectronics and a PhD in Medical Physics.

> However, I've been earning my living
> solving performance problems involving UNIX for the past twelve years.
> I've also been flown halfway around the world to help a customer with a
> performance problem more times than I can remember.

You should know your stuff then! Far more than me, I would accept,
since I don't earn my living from IT at all. I worked in the IT
industry for 6 months once, as a fill-in job, before something better
came up.

> > So Solaris and HP-UX at least agree on 1, 5 and 15 minutes, whereas Tru64
> > uses 5, 30 and 60s. HP-UX mentions the number of processors, Solaris and
> > Tru64 don't.
>
> I can't speak to those operating systems since I don't have access to their
> source code. However, I would be very surprised if they calculated it any
> differently than I described. That is, I'm confident they all use the
> number of online CPUs as the denominator.

Given your wealth of international experience, I'm surprised you don't
know that Solaris does not factor in the CPUs - because it does not!

There's some data I show at the end of the email which I just
collected on my U80, running 8 CPU intensive tasks as the number of
processors is changed. Load averages remain at about 8. Note the load
averages are approximately equal to the number of CPU intensive tasks,
not twice that number as stated in that web page referenced earlier.

I only have uni-processor boxes running Tru64 and HP-UX, so can't make
any measurements there.

> The load average isn't meaningful if it doesn't factor in the number of
> online CPUs.

I beg to differ. As long as you know the definition, it's just as
meaningful.

If someone says "The temperature is 35 degrees", that statement
contains no useful information. If they state it in Celsius,
Fahrenheit, Kelvin or whatever, it has meaning. I might mentally
convert from one to another, but that is not hard.

The load average will have a lot more meaning if it ever gets properly
documented - the fact it does or does not factor in the number of CPUs
is immaterial, as long as we know what it does.


> I agree this needs to be better documented by each OS.

We agree on something !!!

> Preferrably by
> the inclusion of mathematical equations and/or psuedo-code.

Yes agreed.

> However,
> I disagree with your statement that the exponential weighting is the
> reason the metric was unsuitable for you needs.

You have made an assumption (an incorrect one) about my needs. The reason I
was trying to use the metric was to determine if a system was "very
quite". There may have been better ways (I'm open to suggestions), but
based on the contents of the uptime(1) man page, the 1 minute load
average looked a good candidate. I then found out it was not, so
raised the question "What exactly is the load average".

> As an example: I recently handled a call from a large stock brokerage
> in the USA involving this very topic.

I hope you ask your customers what their needs are, rather than
jumping to conclusions as you have done with me.

Dr. David Kirkby.


*****Data measured on U80. Partially annotated.***
// System has remained in this state for a long time.
// 8 CPU intensive tasks, load averages ~ 8.
# uptime
12:38pm up 2 day(s), 19:22, 1 user, load average: 8.00, 8.07, 8.09
// Verify only one CPU (#0) is online.
# psrinfo
0 on-line since 06/15/2003 17:16:48
1 off-line since 06/18/2003 11:11:43
2 off-line since 06/18/2003 11:11:49
3 off-line since 06/18/2003 11:12:03
// Put CPU #1 online.
# psradm -n 1
// verify CPU #1 is now online
# psrinfo
0 on-line since 06/15/2003 17:16:48
1 on-line since 06/18/2003 12:39:15
2 off-line since 06/18/2003 11:11:49
3 off-line since 06/18/2003 11:12:03
// Despite having two CPUs online, the load average remains at about 8.
// I check this 3 times, at about 1 minute intervals, and see no change.
# uptime
12:39pm up 2 day(s), 19:23, 1 user, load average: 8.00, 8.06, 8.09
# uptime
12:40pm up 2 day(s), 19:23, 1 user, load average: 8.00, 8.05, 8.09
# uptime
12:41pm up 2 day(s), 19:25, 1 user, load average: 8.00, 8.04, 8.08
// Bring another processor online.
# psradm -n 2
// Verify 3 processors are now online.
# psrinfo
0 on-line since 06/15/2003 17:16:48
1 on-line since 06/18/2003 12:39:15
2 on-line since 06/18/2003 12:42:08
3 off-line since 06/18/2003 11:12:03
// Check the load average a few times at about 1 minute spacing.
# uptime
12:42pm up 2 day(s), 19:26, 1 user, load average: 8.02, 8.04, 8.08
# uptime
12:43pm up 2 day(s), 19:26, 1 user, load average: 8.01, 8.03, 8.07
# uptime
12:44pm up 2 day(s), 19:28, 1 user, load average: 8.01, 8.03, 8.07

// Bring all CPUs online.
# psradm -n 3
# psrinfo
0 on-line since 06/15/2003 17:16:48
1 on-line since 06/18/2003 12:39:15
2 on-line since 06/18/2003 12:42:08
3 on-line since 06/18/2003 12:45:09

// Check the load averages again at ~ 2 minute intervals.
# uptime
12:45pm up 2 day(s), 19:28, 1 user, load average: 8.00, 8.03, 8.07
# uptime
12:47pm up 2 day(s), 19:31, 1 user, load average: 8.00, 8.02, 8.05
# uptime
12:52pm up 2 day(s), 19:36, 1 user, load average: 8.00, 8.01, 8.04
# psrinfo
0 on-line since 06/15/2003 17:16:48
1 on-line since 06/18/2003 12:39:15
2 on-line since 06/18/2003 12:42:08
3 on-line since 06/18/2003 12:45:09
// Disable cpus 0 1 and 2.
# psradm -f 0 1 2
# uptime
12:53pm up 2 day(s), 19:36, 1 user, load average: 8.02, 8.01, 8.04
# psrinfo
0 off-line since 06/18/2003 12:53:00
1 off-line since 06/18/2003 12:53:00
2 off-line since 06/18/2003 12:53:00
3 on-line since 06/18/2003 12:45:09

// Yes, I did finally remember to switch all the CPUs back online!!

Dr. David Kirkby
Jun 18, 2003, 9:34:11 AM
"Dr. David Kirkby" wrote:
> found it much above zero, I could be sure the system was quite, so any

I omitted a rather important 'not' there! I meant to say

"If I compiled the benchmark, slept for 70 s, measured the 1 minute
load average and found it much above zero, I could be sure the system
was NOT quite, so any results from that benchmark would be invalid."

Sorry.

Darren Dunham
Jun 18, 2003, 12:19:22 PM
In comp.unix.solaris Kurtis D. Rader <kra...@skepticism.us> wrote:
> I can't speak to those operating systems since I don't have access to their
> source code. However, I would be very surprised if they calculated it any
> differently than I described. That is, I'm confident they all use the
> number of online CPUs as the denominator.

The Solaris CPU average (vmstat, iostat, sar) does, the load average
(uptime) does not.

With 4 CPU intensive threads running on an 8 CPU machine, you may
reasonably expect to see something near 50% usr CPU time and a load
average near 4.

Dr. David Kirkby
Jun 18, 2003, 1:57:18 PM

Well, it might be interesting, but it's certainly not accurate.

Kurtis D. Rader
Jun 18, 2003, 9:53:24 PM
On Wed, 18 Jun 2003 16:19:22 +0000, Darren Dunham wrote:

> In comp.unix.solaris Kurtis D. Rader <kra...@skepticism.us> wrote:
>> I can't speak to those operating systems since I don't have access to
>> their source code. However, I would be very surprised if they calculated
>> it any differently than I described. That is, I'm confident they all use
>> the number of online CPUs as the denominator.
>
> The Solaris CPU average (vmstat, iostat, sar) does, the load average
> (uptime) does not.
>
> With 4 CPU intensive threads running on an 8 CPU machine, you may
> reasonably expect to see something near 50% usr CPU time and a load
> average near 4.

I would have to consider the Solaris definition of load average more bogus
than usual. If it isn't scaled by the number of online CPUs it doesn't tell
you anything useful. Of course, given the inclusion of "fast wait"
processes, it sometimes isn't a very good indicator in any event.

Kurtis D. Rader
Jun 18, 2003, 9:57:56 PM
On Wed, 18 Jun 2003 18:57:18 +0100, Dr. David Kirkby wrote:

> Well, it might be interesting, but it's certainly not accurate.

What part of my answer wasn't accurate? I'll accept that we disagree about
the magnitude of the impact a digital decay versus a true moving average
has on the meaning of the metric but I don't see how that difference of
opinion makes my description inaccurate.

Kurtis D. Rader
Jun 18, 2003, 10:34:26 PM
On Wed, 18 Jun 2003 14:34:11 +0100, Dr. David Kirkby wrote:

> "Dr. David Kirkby" wrote:
>> found it much above zero, I could be sure the system was quite, so any
>
> I ommited a rather important 'not' there! I meant to say
>
> "If I compiled the benchmark, slept for 70 s, measured the 1 minute load
> average and found it much above zero, I could be sure the system was NOT
> quite, so any results from that benchmark would be invalid."

How many CPUs are in your system? The kernel isn't responsible
for dividing by the number of online CPUs; the user process which
retrieves and displays the load average is. So, if your system has
ten CPUs the actual load average would have been between 0.04 and 0.06
which is typical for a system at run-level two or higher. Remember,
at multi-user run levels there are lots of tasks that periodically wake
up, look for work, then go back to sleep. So even if your load average
isn't being skewed by processes sleeping at a non-interruptible priority
it is still unlikely to be zero unless you've taken steps to stop all
unnecessary daemons (e.g., cron).

The digital decay (implemented by an exponential equation) is not
the reason the load average never reaches zero (unless the kernel code
for maintaining the load average metrics is broken). Of course, each
implementation is free to use different coefficients. For example,
DYNIX/ptx calculates the load average every five seconds and uses
coefficients that result in 90% of the existing value being forgotten
within five times the load average interval (e.g., five minutes for
the one minute load average).

Also, your basic premise is flawed. Just because the system is idle
subsequent to the completion of your benchmark does not in any way
guarantee it was idle during the benchmark. A better approach might
be to measure the CPU time accrued by your benchmark processes, divide
by the number of CPUs and compare that to the elapsed run time. If the
CPU time is less than or equal to the elapsed time then your CPU bound
process may not have been materially impacted by other processes. I
say "may not have" because, given a multi-processor system, context
switches and L1 and L2 cache pollution caused by other processes could
still adversely affect your benchmark yet not cause the elapsed time
to exceed the CPU time.
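
A minimal sketch of that kind of check (the busy loop is just a stand-in for
the real benchmark, and the comparison simply follows the suggestion above):

/* Illustrative only: compare accrued CPU time (scaled by the number of
 * online CPUs) with the elapsed wall-clock time around a workload. */
#include <stdio.h>
#include <unistd.h>
#include <sys/time.h>
#include <sys/resource.h>

static double tv_sec(struct timeval tv)
{
    return tv.tv_sec + tv.tv_usec / 1e6;
}

int main(void)
{
    struct timeval t0, t1;
    struct rusage ru;
    long ncpus = sysconf(_SC_NPROCESSORS_ONLN);
    volatile double x = 0.0;
    long i;
    double elapsed, cpu;

    if (ncpus < 1)
        ncpus = 1;

    gettimeofday(&t0, NULL);
    for (i = 0; i < 100000000L; i++)   /* stand-in CPU-bound workload */
        x += i * 1e-9;
    gettimeofday(&t1, NULL);

    getrusage(RUSAGE_SELF, &ru);
    elapsed = tv_sec(t1) - tv_sec(t0);
    cpu = tv_sec(ru.ru_utime) + tv_sec(ru.ru_stime);

    printf("elapsed %.2f s, CPU %.2f s, %ld CPUs online\n", elapsed, cpu, ncpus);
    if (cpu / ncpus <= elapsed)
        printf("CPU time per CPU <= elapsed time (may not have been crowded out)\n");
    return 0;
}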

I've done a lot of computer system benchmarking over the past
decade. Rule #1 is to tightly control the environment of the test. This
means, among other things, disabling all non-essential services (e.g.,
cron). Even then it is important to perform multiple runs and perform
the appropriate statistical analysis on the results. Not once have I
ever relied upon a post-run measurement of load average (or anything
else) to tell me whether or not I can believe the benchmark results.

Lastly, I realize that British English differs from American English
(based on several trips to the UK to solve performance problems for
customers and regularly working with my UK counterparts). Nonetheless
I think you probably meant "quiet" not "quite".

David Schwartz
Jun 19, 2003, 8:07:56 AM

"Kurtis D. Rader" <kra...@skepticism.us> wrote in message
news:pan.2003.06.19....@skepticism.us...

> I would have to consider the Solaris definition of load average more bogus
> than usual. If it isn't scaled by the number of online CPUs it doesn't tell
> you anything useful. Of course, given the inclusion of "fast wait"
> processes, it sometimes isn't a very good indicator in any event.

The usual gestalt understanding of 'load average' is 'the average number
of scheduling entities that either are running or want to be running'. This
is a bit vague on what it means for a process to 'want to be running', but
it's very useful conceptually.

DS

Dr. David Kirkby
Jun 19, 2003, 6:09:20 PM

How about the effect of multiple processors?

David Schwartz
Jun 19, 2003, 6:18:47 PM

"Dr. David Kirkby" <drki...@ntlworld.com> wrote in message
news:3EF23490...@ntlworld.com...

> > What part of my answer wasn't accurate? I'll accept that we disagree about
> > the magnitude of the impact a digital decay versus a true moving average
> > has on the meaning of the metric but I don't see how that difference of
> > opinion makes my description inaccurate.
>
> How about the effect of multiple processors ?

There is no effect. The load average is the same regardless of the
number of processors. If you divided the load average by the number of
processors, you'd underestimate the effect of I/O bound loads. How are 20
processes all waiting for disk I/O any less load on an 8 CPU machine than a
4 CPU machine?

DS


Kurtis D. Rader
Jun 19, 2003, 10:59:25 PM
On Thu, 19 Jun 2003 05:07:56 +0000, David Schwartz wrote:

> The usual gestalt understanding of 'load average' is 'the average number
> of scheduling entities that either are running or want to be running'.
> This is a bit vague on what it means for a process to 'want to be
> running', but it's very useful conceptually.

Agreed. Conceptually the load average tells you whether the CPUs are
underloaded, at capacity, or overloaded; and if over or underutilized
by how much. The original definition included processes in a
non-interruptible sleep in the numerator[1]. The DYNIX/ptx operating
system, where most of my practical tuning experience was gained and
for which I have studied the source extensively, was based on the BSD
4.3 sources and hence inherited that definition.

Note that the definition is typically described as the number of
processes waiting for a disk I/O operation to complete. That statement is
often equivalent to the number of processes sleeping non-interruptibly,
but not always. The general idea is that a process waiting for a disk I/O
operation to complete will become runable within a few milliseconds;
quite likely within the next timeslice quanta. A process in that
state should therefore be treated as if it was currently runable when
calculating the CPU "load".

Which is fine unless there are processes blocked at non-interruptible
priorities for extended periods of time. When that occurs the
CPU load-average is artificially inflated. Today it's not unusual
to see systems running daemons (e.g., kernel threads) which sleep
non-interruptibly for extended periods. Which is why I tell my customers
not to worry about whether or not the load average is greater than 1.0,
but rather whether it has deviated significantly from the typical load
average for the system when its performance is acceptable.

Finally, note that the load average has to be scaled by the number
of online CPUs to be useful (and match the original definition). In
particular its use to calculate process priority[2] requires that it
be relative to a single processor system. When the code to calculate
this value was originally written SMP machines weren't in widespread
use and I don't believe were available to the BSD authors. So it's
understandable that the original implementation did not scale the
value. The Symmetry computer line produced by Sequent was the first
widely deployed SMP system to the best of my knowledge. The majority
of our initial customers were universities excited at being able to buy
an affordable multi-processor system for evaluating parallel execution
design issues.


[1] See page 429 of "The Design and Implementation of the 4.3BSD UNIX
Operating System".

[2] See pages 87-89 of "The Design and Implementation of the 4.3BSD UNIX
Operating System".

Kurtis D. Rader
Jun 19, 2003, 11:03:13 PM
On Thu, 19 Jun 2003 23:09:20 +0100, Dr. David Kirkby wrote:

> How about the effect of multiple processors ?

From my original reply:

processes running + runnable + fast wait
---------------------------------------
number of CPUs

However, in rereading that original post I admit that I did not make it
clear the values in the avenrun[] array are not scaled, and that it is
up to the user process to do so. So, I'll agree that my original answer
was not as accurate as either of us would have liked.
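
In user space that scaling is trivial; for example, a small sketch for
Solaris using getloadavg(3C) and sysconf(3C) (illustrative only):

/* Sketch: fetch the raw load averages and also show them scaled by the
 * number of online CPUs, since the kernel does not do the scaling. */
#include <stdio.h>
#include <unistd.h>
#include <sys/loadavg.h>

int main(void)
{
    double la[3];
    long ncpus = sysconf(_SC_NPROCESSORS_ONLN);

    if (getloadavg(la, 3) == -1)
        return 1;
    if (ncpus < 1)
        ncpus = 1;

    printf("raw:     %.2f %.2f %.2f\n",
           la[LOADAVG_1MIN], la[LOADAVG_5MIN], la[LOADAVG_15MIN]);
    printf("per CPU: %.2f %.2f %.2f\n",
           la[LOADAVG_1MIN] / ncpus, la[LOADAVG_5MIN] / ncpus,
           la[LOADAVG_15MIN] / ncpus);
    return 0;
}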

David Schwartz
Jun 20, 2003, 12:21:58 AM

"Kurtis D. Rader" <kra...@skepticism.us> wrote in message
news:pan.2003.06.20....@skepticism.us...

However, scaling it in this way makes no sense unless most processes are
CPU bound. Perhaps it should be:

(processes running + runnable) / (number of CPUs) + (fast wait)

I don't see why 10 processes waiting for disk I/O are any less load when
there's 20 CPUs than when there's 4. Remember, this is already an average,
so the number of processes waiting for disk I/O is how many processes are
typically waiting for disk I/O -- they don't need CPUs.

DS


Dr. David Kirkby
Jun 20, 2003, 7:02:58 AM

I think you misunderstood me. I was just saying the original
definition given by Kurtis D. Rader was wrong, since it did have
'number of CPUs' in the denominator, when this is not the case.

I don't have sufficient knowledge to judge whether such an
algorithm should include the number of CPUs, fast wait, runnable
processes, the temperature of the room or the number of keys missing
on the keyboard. I don't know, so don't wish to comment. I suspect
there is no single number that sums up such a complex issue, but don't
care to comment any further.

I'd just like to see whatever algorithm is used properly documented.
At the suggestion of John Groenveld, I've submitted feedback via the
comment link on docs.sun.com about the inaccuracy of the uptime(1) man
page. Whether anything ever gets changed is up to Sun.

Dan Foster
Aug 5, 2003, 2:51:36 AM
In article <ZLrHa.3459$y16....@newssvr19.news.prodigy.com>, Darren Dunham <ddu...@redwood.taos.com> wrote:
> In comp.unix.solaris fob <t...@fob.knob> wrote:
>> In article <pan.2003.06.16...@snet.net>,
>> thamil...@snet.net says...

>>> On Mon, 16 Jun 2003 02:55:53 +0100, Dr. David Kirkby wrote:
>>>
>>> > I've been trying to measure the load average on quad processor Sun
>>> > Ultra 80 running Solaris 9 using a UNIX system call.
>
>> Speaking of multiprocessor systems, is it true that in an N-processor
>> system Solaris only runs threads on N-1 processors and dedicates the Nth
>> processor to handling interrupts? If so, then is that true for all
>> versions of Solaris from 2.5.1 through 2.9? (Where N > 1, of course.)
>
> Nope. That would make a 2 CPU system a big waste.
>
> Certain hardware interrupts may be directed at a particular CPU, but no
> cpu is "dedicated" to serving interrupts.

That's a decent approach... especially since with historical design of VMS
prior to 7.2-1 or so, CPU 0 was the one that handled interrupts. It wasn't
ordinarily a problem... but then people got fast systems with a pretty fast
interconnect (Memory Channel) which had the interesting side effect of
totally pegging CPU 0 with its interrupt load when it was really busy.

End result? The system ended up waiting on CPU 0 to do servicing, so the
interconnect wasn't as good as a slower and older cluster interconnect (CI)
under load despite MC's superior bandwidth and latency.

Subsequently, VMS Engineering designed a fast path for the MC code and
added better support to avoid saturating only CPU 0... but point being
that dedicating a sole CPU for interrupt servicing isn't always the panacea
that it may sound like at first blush.

Hence, I find Sun's approach reasonable in this respect. And, no, not a
slam on VMS, either.

-Dan

Dan Foster
Aug 5, 2003, 5:37:15 AM
In article <slrnbiukv...@gaia.roc2.gblx.net>, Dan Foster <d...@globalcrossing.net> wrote:
> That's a decent approach... especially since with historical design of VMS
> prior to 7.2-1 or so, CPU 0 was the one that handled interrupts. It wasn't
> ordinarily a problem... but then people got fast systems with a pretty fast
> interconnect (Memory Channel) which had the interesting side effect of
> totally pegging CPU 0 with its interrupt load when it was really busy.

Hmm, I think I *meant* to say Gigabit Ethernet, rather than Memory Channel
as being the interconnect in question.

I haven't been following the SCS interconnects that closely since I don't
currently run a VMS cluster ;)

As of 7.3, Fibre Channel adapters got Fast_Path support... 7.3-1 or later
(current is 7.3-1, 7.3-2 to be released shortly) will have Fast_Path
support for other LAN adapters (e.g. Gigabit Ethernet).

Until that time, Gig-E was the one with half the latency of the CI and far
greater bandwidth but about 2x the CPU load since it's got to deal with so
many packets (and resultant interrupts) per second esp. when under load.

It was so bad that folks were being told to use Fast Ethernet rather than
Gigabit Ethernet as their SCS LAN-based interconnect for this reason. Some
shops even went back to the older CI stuff -- while old, at least CI
processing is offloaded to dedicated controllers rather than the host
system's CPU 0, and I have vague recollections of a more consistent and
better-balanced response even under load.

Why is the issue relatively recent? Mostly a combination of blazing fast
SMP Alphas (they may be no POWER4, but a well designed Alpha server of
today is still no slouch) plus the relatively recent wide availability of
Gigabit Ethernet technology.

At any rate, I'm sure that a current VMS admin or a comp.os.vms regular
will keep me straight if necessary. ;)

-Dan
