Long pauses

Scott

Apr 16, 2002, 5:20:23 AM

Hi,

Sorry this is a bit vague!
We have a Dell PowerEdge server running SCO OpenServer 5.0.5a.
Often there is a long pause when writing even the smallest file. It also
tends to pause for 5-10 seconds when logging in; I'm not sure if this is
related.
I have tried reducing BDFLUSHR to 20 and NAUTOUP to 5, but this seems to
have made no difference, and may have made things worse.
There is also an Informix IDS database engine on the server which uses
'cooked' regular files.
There is plenty of disk space free, and swap is rarely touched.

Thanks for any help,
Scott.

Bela Lubkin

Apr 16, 2002, 6:02:33 AM

to sco...@xenitec.on.ca

Scott wrote:

In general I would expect reducing BDFLUSHR to improve this, and
reducing NAUTOUP to make it worse. The configuration you tested says:
every time a block is written, keep it around in cache until it's at
least 5 seconds old; sweep the cache for blocks old enough to write out
every 20 seconds. Since it's only being swept every 20 seconds, 20
seconds' worth of writes suddenly "come due" at once (blocks written
between 5 and 25 seconds ago). If something was busy writing during
that period, the performance hit is large.

I have been experimenting recently with BDFLUSHR=1. This sweeps the
cache every second, so any one load of blocks to be written should be
reasonably small. This should work with any reasonable NAUTOUP value;
e.g. BDFLUSHR=1, NAUTOUP=1 will sweep blocks out to disk no more than 2
seconds after they're made dirty. BDFLUSHR=1, NAUTOUP=30 keeps most
writes around for a full 30 seconds, then sweeps them to disk in small
chunks.
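
To make the effect of the two tunables concrete, here is a minimal Python sketch of the behaviour described above. It is a toy model, not the kernel's algorithm: it assumes a writer that dirties one block per second, steps time in whole seconds, and needs BDFLUSHR to be at least 1. The function name `simulate` and its arguments are made up for illustration.

```python
def simulate(bdflushr, nautoup, duration=100, blocks_per_sec=1):
    """Toy model: return (largest burst, longest time a block stayed dirty)."""
    dirty = []          # write times of blocks still dirty in the cache
    largest_burst = 0
    longest_wait = 0
    for now in range(duration + 1):
        # A steady writer dirties a few blocks every second.
        dirty.extend([now] * blocks_per_sec)
        # The flusher only looks at the cache every bdflushr seconds.
        if now > 0 and now % bdflushr == 0:
            due = [t for t in dirty if now - t >= nautoup]
            dirty = [t for t in dirty if now - t < nautoup]
            largest_burst = max(largest_burst, len(due))
            if due:
                longest_wait = max(longest_wait, now - min(due))
    return largest_burst, longest_wait


for bdflushr, nautoup in [(20, 5), (1, 1), (1, 30)]:
    burst, wait = simulate(bdflushr, nautoup)
    print(f"BDFLUSHR={bdflushr:2} NAUTOUP={nautoup:2}: "
          f"largest burst {burst} blocks, longest wait {wait}s")
```

In this model the burst size tracks BDFLUSHR (20 blocks per sweep versus 1), while NAUTOUP only shifts how long blocks wait. Because time moves in whole seconds, the longest wait comes out one second under the NAUTOUP + BDFLUSHR upper bound.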

Modern CPUs are fast enough that it isn't terribly costly to run the
sweep every second. A test box which is a busy multiuser system with 2
Pentium III-1000 CPUs has been up for 37 days and bdflush has
accumulated about 50 minutes of CPU time, or about 1/1000 of one CPU
over the time it's been up.
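
The "about 1/1000" figure can be checked with a couple of lines of arithmetic. This sketch uses only the numbers quoted above (37 days of uptime, roughly 50 minutes of CPU time); the variable names are just for illustration.

```python
uptime_minutes = 37 * 24 * 60      # 37 days of uptime, in minutes
bdflush_cpu_minutes = 50           # CPU time charged to bdflush

fraction = bdflush_cpu_minutes / uptime_minutes
print(f"bdflush used {fraction:.5f} of one CPU "
      f"(about 1/{round(1 / fraction)})")
```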

Setting BDFLUSHR to 1 does have a noticeable effect on the system. You
can _hear_ it writing to the disk every second, like clockwork.
Sometimes it's a little write, sometimes big. The overall amount of
work being done by your disks is about the same, but it's distributed
differently and you will notice the difference.

Even with BDFLUSHR=1, you can have big disk hits (but not quite as big).
The problem is, the CPU is _much_ faster than the disk. Since the
buffer cache decouples write performance from disk performance, a
process can "write" data much faster than the disk could accept it. It
sits in cache for the specified time (NAUTOUP + cycle time until the
next BDFLUSHR), then all comes due at once. Even though it's delayed,
it's still faster than the disk can accept it, so the disk gets very
busy for a while. What's needed is a way to tell the buffer cache to
tone it down: that getting writes out exactly when NAUTOUP and BDFLUSHR
conspire to isn't _that_ important -- after all, hardly anyone even
knows that they exist, much less how to tune them for a specific
application! Even when the buffer cache has a lot of "due" write
buffers, it should take time out to let a lot more read requests
through.

Well, it currently doesn't. Musings for future development. I need to
find papers on the subject and read up on what's been done elsewhere.
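
To put rough numbers on the burst problem described above (a process filling the cache faster than the disk can drain it), here is a back-of-the-envelope sketch. The rates are hypothetical, chosen only to show the shape of the problem; `drain_seconds` is not a real tool or kernel function.

```python
def drain_seconds(burst_seconds, write_mb_per_s, disk_mb_per_s):
    """How long the disk stays saturated once a cached burst comes due."""
    cached_mb = burst_seconds * write_mb_per_s
    return cached_mb / disk_mb_per_s


# Hypothetical rates: a process dirties 50 MB/s into the cache for one
# second, and the disk sustains 10 MB/s.
print(drain_seconds(1, 50, 10))
```

With those assumed rates, one second of "writing" turns into five seconds of solid disk activity, however the sweep is scheduled.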

>Bela<

Scott

Apr 16, 2002, 8:12:02 AM

"Bela Lubkin" <be...@caldera.com> wrote in message
news:2002041603...@mammoth.ca.caldera.com...

Thank you very much for your comprehensive reply!
I will try bringing BDFLUSHR down much lower, and putting NAUTOUP back to 10.
Hopefully, I'll be able to reboot tonight (if I can manage to kick all the
users off for long enough!), and I'll let you know what happens tomorrow!

Many thanks,
Scott.

Bela Lubkin

Apr 16, 2002, 2:59:01 PM

to sco...@xenitec.on.ca

[possible FAQ material in this msg and its grandparent?]

Scott wrote:

> Thank you very much for your comprehensive reply!
> I will try bringing BDFLUSHR down much lower, and putting NAUTOUP back to 10.
> Hopefully, I'll be able to reboot tonight (if I can manage to kick all the
> users off for long enough!), and I'll let you know what happens tomorrow!

Ok. FYI, both of these parameters can safely be tuned on a running
system. (This is not _generally_ true of all parameters, but I've
looked carefully at the kernel source that uses these two parameters and
it both re-acquires the values from the main tunable variables, and uses
them in a safe manner which won't be harmed by dynamic changes.)

To change them, you can use /etc/scodb:

    # scodb -w
    scodb> tune.t_bdflushr=1
    scodb> v.v_autoup=A
    scodb> q

(Use whatever values you want in place of 1 and A; note that scodb takes
them in hexadecimal!)

You could also use /etc/pat, which is in tls613 at
ftp://stage.caldera.com/TLS/:

    # pat -n /unix tune+20 ........ = 00000001 # tune.t_bdflushr=1: run bdflush often
    # pat -n /unix v+4c ........ = 0000000A # v.v_autoup=10: flush dirty bufs after 10s

`pat` has the advantage that you can put comments into the flow of
control. It has the fairly large disadvantage that it doesn't know the
shapes of the structures, so you have to manually compute them and could
mistakenly patch the wrong field.
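
Since both tools take hexadecimal, it is easy to type a decimal value by mistake. The sketch below converts a number of seconds into the two forms used in the examples above. The helper names `scodb_value` and `pat_word` are invented for illustration, and it only formats values; the structure offsets still have to come from your own kernel.

```python
def scodb_value(seconds):
    """Hex digits as typed at the scodb prompt, e.g. 10 -> 'A'."""
    return format(seconds, "X")


def pat_word(seconds):
    """The same value as the 8-digit hex word used in the pat lines."""
    return format(seconds, "08X")


for seconds in (1, 10, 20, 30):
    print(seconds, scodb_value(seconds), pat_word(seconds))
```

Note that 20 seconds is hex 14. Typing 20 where hex is expected would give 32 seconds instead.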

If you patch BDFLUSHR, bdflush will wake up after the last sleep at the
old BDFLUSHR, then start sleeping at the new rate. So for instance you
may not notice the once-a-second write cycle until up to old-BDFLUSHR
seconds have passed. I believe you could also force the issue by doing
a `sync` immediately after tuning it.

If you patch NAUTOUP, buffers already in the buffer cache _will_ be
affected -- it is checked by bdflush() whenever it wakes up -- the
timeout is a property considered by the bdflush algorithm rather than a
property stored in each individual buffer. (I suppose you could
dynamically tune these parameters to deal with external conditions --
set both to 1 during a lightning storm, trading performance for minimal
data loss in the event of a catastrophic failure?)

>Bela<

Brian K. White

Apr 16, 2002, 5:28:00 PM

"Bela Lubkin" <be...@caldera.com> wrote in message

news:2002041611...@mammoth.ca.caldera.com...

wow, neat.

what do you think a raid card with 128 megs of ram means to this?
go ahead and set both to 1 or 5 or so, and let the raid card smooth out
the actual writes?

(adaptec 2100s, "dpti" driver)

should have basically no performance hit at the os since writing isn't
really writing most times with all that cache

--
Brian K. White -- br...@aljex.com -- http://www.aljex.com/bkw/
+++++[>+++[>+++++>+++++++<<-]<-]>>+.>.+++++.+++++++.-.[>+<---]>++.
filePro BBx Linux SCO Prosper/FACTS AutoCAD #callahans Satriani

Scott

Apr 17, 2002, 4:17:03 AM

"Brian K. White" <br...@aljex.com> wrote in message
news:AH0v8.67769$K5.61...@bin5.nnrp.aus1.giganews.com...

No problems so far this morning, but things haven't got too heavy yet!
BTW, should have said, we have a 64meg RAID card, does this mean I could set
both to 1 with almost no performance hit at all?

Thanks again for all your help!

Bela Lubkin

Apr 17, 2002, 6:01:14 AM

to sco...@xenitec.on.ca

Scott wrote:

> "Brian K. White" <br...@aljex.com> wrote in message
> news:AH0v8.67769$K5.61...@bin5.nnrp.aus1.giganews.com...
> >

> > wow, neat.
> >
> > what do you think a raid card with 128 megs of ram means to this?
> > go ahead and set both to 1 or 5 or so, and let the raid card smooth out
> > the actual writes?
> >
> > (adaptec 2100s, "dpti" driver)
> >
> > should have basically no performance hit at the os since writing isn't
> > really writing most times with all that cache
>

> No problems so far this morning, but things haven't got too heavy yet!
> BTW, should have said, we have a 64meg RAID card, does this mean I could set
> both to 1 with almost no performance hit at all?

(Are "Scott" and "Brian K. White" the same person?!)

I don't have enough experience with RAID systems to give you a confident
response. But, as a general rule, if the RAID and its cache are going
to apply their own delayed writeback algorithm then it would probably be
best if the OS got its delayed writeback the heck out of the way.

Setting both to 1 would do this as well as possible.

Hmmm. I've studied the kernel code that handles these parameters and I
can now report that it _looks like_ both parameters handle the
degenerate case (0) correctly. If BDFLUSHR is 0, bdflush() wakes up on
every timer tick, i.e. 100 times a second, to scan for dirty blocks to
be flushed. If NAUTOUP is 0, bdflush() will find that all dirty blocks
are ready to be flushed. So in effect no block will stay dirty in the
OS buffer cache for more than 10ms. WARNING: I have not tested either
of the degenerate cases. Also, this might have a significant cost in
CPU time. If you try it, monitor bdflush's CPU consumption (`ps -fp3`
will show it).
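
As a rough check on the "no more than 10ms" figure, the worst-case time a block stays dirty is NAUTOUP plus the gap between bdflush wakeups. This sketch assumes the behaviour described above (BDFLUSHR=0 means a wakeup on every one of 100 timer ticks per second); the function names are illustrative only, and like the degenerate case itself it is untested against a real kernel.

```python
HZ = 100  # timer ticks per second, per the description above


def sweep_interval(bdflushr):
    """Seconds between bdflush wakeups; 0 means every timer tick."""
    return bdflushr if bdflushr > 0 else 1 / HZ


def max_dirty_seconds(bdflushr, nautoup):
    """Upper bound on how long a block stays dirty in the OS cache."""
    return nautoup + sweep_interval(bdflushr)


print(max_dirty_seconds(1, 1))   # both tunables set to 1
print(max_dirty_seconds(0, 0))   # the degenerate case
```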

In any case, this is a silly way to disable the OS's write buffering.
What you really want is a way to tell the buffer cache "don't bother,
just write things immediately and my hardware will deal with it".
Unfortunately that doesn't exist, and maybe can't. It's possible for a
filesystem driver to create a buffer, mark it for delayed write, but
also mark it with an "I'm not done with this" flag that prevents the
delayed write from actually happening. Then the filesystem continues to
make changes to the buffer until it's ready, then it turns off the "not
done" flag, and the delayed write happens in due course. So a simple
change which simply immediately wrote all buffers would damage
filesystem semantics.
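
The "not done" flag can be illustrated with a small toy model. This is only a sketch of the idea: the class, field names and flush function are hypothetical and do not correspond to the real kernel buffer structures.

```python
from dataclasses import dataclass


@dataclass
class Buffer:
    data: str
    delayed_write: bool = False   # scheduled to be written later
    not_done: bool = False        # filesystem is still changing it


def flush_due(buffers, disk):
    """Delayed-write sweep: skip buffers the filesystem hasn't released."""
    for buf in buffers:
        if buf.delayed_write and not buf.not_done:
            disk.append(buf.data)
            buf.delayed_write = False


disk = []
buf = Buffer("half-updated", delayed_write=True, not_done=True)
flush_due([buf], disk)            # nothing written: buffer is held
buf.data = "complete"
buf.not_done = False              # filesystem releases the buffer
flush_due([buf], disk)
print(disk)
```

A policy that wrote every buffer the moment it was marked for delayed write would have put "half-updated" on disk, which is the semantic damage described above.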

>Bela<

Mike Copp

Apr 17, 2002, 10:43:04 AM

Scott,

What have you set your RAID controller cache to do: write thru or
write back?

Write thru writes data directly to disk and cache - this does not
improve write performance but if a subsequent read requires data that
is still in the cache it will be a quicker return than getting it from
HDD.

Write back writes data only to the cache which then writes this to HDD
during idle cycles - this increases write performance but has an
element of risk in that, if there is a power loss, corruption of the
file system is a possibility.

I'm not sure if Bela's idea would conflict with write back or not.
With multiple reads & writes to the same file (such as in a database)
there may be a situation where 'bad' data is returned due to conflicts
between OS and the RAID controller cache.

We have a PowerEdge 4300 here, and have also suffered poor
performance. At first I used sarcheck (*highly* recommended) to show
details of systems performance and what fine tuning may be needed.
After a while, it reckoned that the problem was probably an I/O
bottleneck caused by our RAID controller. To cut a long story short,
it turns out that RAID5 is a poor way to manage a database and so I'm
in the process of changing RAID level to 1+0 (and faster disks) which
should speed everything up.

Mike

Bela Lubkin <be...@caldera.com> wrote in message news:<2002041703...@mammoth.ca.caldera.com>...

Scott

Apr 24, 2002, 8:27:27 AM

Hi,

Setting BDFLUSHR and NAUTOUP to 1 has definitely improved things!
But there is still the occasional long pause when logging in or saving a
(even small) file.

I'm not sure how the RAID controller has been set up, I only inherited this
system a couple of months ago, and no one here seems to know how it was
done.
I can find no RAID software or configuration files on the system!
All I know is that it's a 64meg card.

Thanks for your help,
Scott.

"Mike Copp" <sys...@essential-trading.co.uk> wrote in message
news:eed3bff0.02041...@posting.google.com...

Bill Vermillion

Apr 24, 2002, 10:27:28 AM

In article <ucd93ql...@corp.supernews.com>,
Scott <this...@myemailaddress.com> wrote:

>Setting BDFLUSHR and NAUTOUP to 1 has definitely improved things!
>But there is still the occasional long pause when logging in or
>saving a (even small) file.

>I'm not sure how the RAID controller has been set up, I only
>inherited this system a couple of months ago, and no one here
>seems to know how it was done.

Gut feeling is that someone set the controller to write-thru and
not write-back. Many choose the former on the grounds that you
won't lose data in a power failure, but you should have a good UPS
that would preclude that.

--
Bill Vermillion - bv @ wjv . com

Bela Lubkin

Apr 25, 2002, 6:26:06 AM

to sco...@xenitec.on.ca

Scott wrote:

> Setting BDFLUSHR and NAUTOUP to 1 has definitely improved things!
> But there is still the occasional long pause when logging in or saving a
> (even small) file.

A process doing continuous writes for even one second could probably
store up several seconds' worth of write activity. You could experiment
with setting BDFLUSHR to 0, which (if my reading of the code is right)
will cause it to be executed on every timer tick. Then you should have
no more than 1/100s worth of writes come due at once.

And likewise, try setting NAUTOUP to 0. The combination means "do all
writes pretty much immediately (but waste a fair amount of CPU time
figuring out what to do)".

NOTE: to any current or future readers, we're talking about a RAID
system where presumably the RAID caching controller handles write
scheduling. You would _not_ want to configure a system with either of
these parameters set to 0 if that meant actually performing all writes
immediately. Performance would suffer terribly!

>Bela<