Raw Devices: Increased Performance?


Jo Manna

Jun 21, 1996
to jgm

Hello there,

I am fairly new to Oracle and am
trying to get some info on what performance
increases (if any) I can expect by using
a raw device as opposed to a Unix file system.

We are using Oracle 7.2.3 on Digital UNIX 3.2D-1

Any figures welcome.

Thanks in Advance

-----------------------------------------
Jo

msa...@magi.com

Jun 22, 1996

Might get 5 to 10%, stressing the MIGHT.

Michael

Steve Long

Jun 22, 1996

Jo,

I strongly recommend you DO NOT use raw devices. Although you will see
a minimal performance gain ( < 10%), you will create a significant
increase in administrative overhead managing back-up and recovery as
well as considerably more time performing a recovery in the event of a
device failure. Economically speaking, it is cheaper to buy more
memory and processors to gain considerably more performance than to
spend the additional time on administrative tasks (personnel time) when
raw devices are used for minimal performance gains.

Steve
804-262-6332

-----------------------------------------------------------------

Jo Manna

Jun 25, 1996
to jgm

Steve Long wrote:
>
> Jo,
>
> I strongly recommend you DO NOT use raw devices. Although you will see
> a minimal performance gain ( < 10%), you will create a significant
> increase in administrative overhead managing back-up and recovery as
> well as considerably more time performing a recovery in the event of a
> device failure. Economically speaking, it is cheaper to buy more
> memory and processors to gain considerably more performance than to
> spend the additional time on administrative tasks (personnel time) when
> raw devices are used for minimal performance gains.
>
> Steve
> 804-262-6332
>
> -----------------------------------------------------------------

Steve,

Thanks for the reply. This is quite interesting. I have previously
used Sybase and I remember that using Unix File Systems instead of
Raw Devices was not recommended. Just off the top of my head, the
reasons for this were something like...

... the Sybase Server being in 'charge' of the actual I/O and if using a
Unix File System could not guarantee that a 'commit was a commit',
due to the OS buffering..... and so on.

Obviously we are talking about Oracle here and not Sybase, but I am
just wondering how Oracle gets around this if at all, or have I missed
something?


Thanks
Jo

Willy Klotz

Jun 25, 1996
ans...@ix.netcom.com (Steve Long) wrote:

>Jo,

>I strongly recommend you DO NOT use raw devices. Although you will see
>a minimal performance gain ( < 10%), you will create a significant
>increase in administrative overhead managing back-up and recovery as
>well as considerably more time performing a recovery in the event of a
>device failure. Economically speaking, it is cheaper to buy more
>memory and processors to gain considerably more performance than to
>spend the additional time on administrative tasks (personnel time) when
>raw devices are used for minimal performance gains.

This is an endless discussion.....

What I cannot understand, however, is the argument about
"administrative overhead" that always gets brought up.

At our site, we constantly use raw devices and we had a performance
gain of 20%; this is because our file cache is used heavily by small
reports, sometimes dozens per minute.

Back to the topic: where is the administrative overhead? When you
need a new datafile, you have to define a file in a filesystem
or, using raw devices, a slice of disk; in both scenarios you have to
make sure that you have enough free space on your system.

The same goes, IMHO, in a recovery scenario - or what did I miss here?

Regarding backup, it is simply a question of your backup procedure -
if you use cpio or tar, you surely are out of luck...

>Steve
>804-262-6332

Willy Klotz
======================================================================
Willys Mail FidoNet 2:2474/117 2:2474/118
Mailbox: analog 06297 910104
ISDN 06297 910105
Internet: 06297...@t-online.de
-> No Request from 06.00 to 08.00 <-
======================================================================

Halina Monka

Jun 26, 1996

Jo Manna wrote:

>
> Steve Long wrote:
> >
> > Jo,
> >
> > I strongly recommend you DO NOT use raw devices. Although you will see
> > a minimal performance gain ( < 10%), you will create a significant
> > increase in administrative overhead managing back-up and recovery as
> > well as considerably more time performing a recovery in the event of a
> > device failure. Economically speaking, it is cheaper to buy more
> > memory and processors to gain considerably more performance than to
> > spend the additional time on administrative tasks (personnel time) when
> > raw devices are used for minimal performance gains.
> >
> > Steve
> > 804-262-6332
> >
> > -----------------------------------------------------------------
>
> Steve,
>
> Thanks for the reply. This is quite interesting. I have previously
> used Sybase and I remember that using Unix File Systems instead of
> Raw Devices was not recommended. Just off the top of me head the
> reasons for this was something like...
>
> ... the Sybase Server being in 'charge' of the actual I/O and if using a
> Unix File System could not guarantee that a 'commit was a commit',
> due to the OS buffering..... and so on.
>
> Obviously we are talking about Oracle here and not Sybase, but I am
> just wondering how Oracle gets around this if at all, or have I missed
> something?
>
> Thanks
> Jo

Jo,

Yes, using raw devices for your datafiles makes administration more
complex. It does give you some performance gain, < 10% as Steve pointed
out. However, you do not have to worry about commits when using a
file system. Oracle makes use of redo logs and rollback segments
and (in the case of a distributed environment) two-phase commit to ensure the
integrity of your transactions.
Note that if you are using Oracle Parallel Server, you HAVE TO
use raw devices. Hope this explains it.
Halina Monka
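The durability point above (Oracle forces the redo record to stable storage before acknowledging a commit, so OS buffering cannot lose it) comes down to the fsync() call. A minimal sketch in modern Python, offered as an illustration rather than Oracle's actual code; a throwaway temp file stands in for a real redo log:

```python
import os
import tempfile

# Sketch of commit durability on a buffered filesystem: the redo
# record is forced to disk with fsync() before the commit is
# acknowledged. The temp file stands in for a real redo log.
fd, path = tempfile.mkstemp()
try:
    os.write(fd, b"redo: commit txn 42\n")  # sits in the OS page cache
    os.fsync(fd)                            # force it to stable storage
    print("commit acknowledged")            # only after the flush succeeds
finally:
    os.close(fd)
    os.unlink(path)
```

If the machine crashes after fsync() returns, the redo record survives; that is how "a commit is a commit" holds even on a cooked file.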

Joel Garry

Jun 26, 1996

In article <31C9F8...@tpg.tpg.oz.au> Jo Manna <j...@tpg.tpg.oz.au> writes:
>Hello there,
>
>I am fairly new to Oracle and am
>trying get some info on what performance
>increases (if any) I can expect by using
>a raw device as apposed to a unix file system.
>
>We are using Oracle 7.2.3 on DEC Digital 3.2D-1

There was a thread on this a while ago, with some folks claiming minimal
increase, while others claimed up to 300%.

Personally, I think the one time that someone unthinkingly writes over
the raw device will more than wipe out any increase. In other words,
it requires greater control over administration than any place can be
expected to maintain over a long period of time.

On the other hand, some of the high performance options require raw files.

>
>Any figures welcome.
>
>Thanks in Advance
>
>-----------------------------------------
>Jo


--
Joel Garry joe...@rossinc.com Compuserve 70661,1534
These are my opinions, not necessarily those of Ross Systems, Inc. <> <>
%DCL-W-SOFTONEDGEDONTPUSH, Software On Edge - Don't Push. \ V /
panic: ifree: freeing free inodes... O

Neil Greene

Jun 26, 1996
to Joel Garry
Joel Garry wrote:
>
> In article <31C9F8...@tpg.tpg.oz.au> Jo Manna <j...@tpg.tpg.oz.au> writes:
> Personally, I think the one time that someone unthinkingly writes over
> the raw system will more than make up for any increase. In other words,
> it requires greater control over administration than any place can be
> expected to have over a long period of time.

How does someone simply write over a raw partition? Change the owner of the raw device to oracle, group dba.
Your raw disks certainly should not be world-writable, so don't leave the raw partitions that way. Of
course, if you have any special account that needs to read the device (/dev/rdsk/blah) for backup purposes,
then allow it. But regular users... no way.

--
Neil Greene "Pinky, you make me worry. You are too
Sr. Oracle DBA / Unix Administrator close to being a poster child for
SHL Systemhouse, Inc. - L.A. cheese wiz," Brain
<URL:mailto:ngr...@laoc.SHL.com>

Kurtis D. Rader

Jun 27, 1996
This makes me think of the favorite TLA of the sales force:

FUD: Fear, Uncertainty, and Doubt

joe...@rossinc.com (Joel Garry) writes:

>There was a thread on this a while ago, with some folks claiming minimal
>increase, while others claimed up to 300%.

>Personally, I think the one time that someone unthinkingly writes over


>the raw system will more than make up for any increase. In other words,
>it requires greater control over administration than any place can be
>expected to have over a long period of time.

This is true only for raw disk partitions. If you are using a
logical volume manager (a la Veritas VxVM or ptx/SVM) then the risk
of overwriting your raw database is no greater than overwriting a
cooked database file. And in fact I have never taken a call from
a customer who has accidentally destroyed his raw database. Yet
I have taken four calls (that I can remember) in the past six years
from customers who have accidentally blown away their cooked
database. For that matter, I ran both raw and cooked databases
back when I was a sysadmin/DBA and never had a problem with either.

This is something I have studied extensively in my role designing
system architectures for some of the largest Oracle sites in the
world. Performance is actually pretty low on my list of reasons
for favoring raw over cooked databases. A matrix I use when
discussing this with clients shows the number of factors in favor
of raw outweighing the number favoring cooked. That does not mean
I always choose a raw implementation. It simply means that you
shouldn't let FUD (see the start of the post) drive a decision.
--
Kurtis D. Rader, Senior Consultant kra...@sequent.com (email)
Sequent Computer Systems +1 503/578-3714 (voice) +65 223-5116 (fax)
80 Robinson Road, #18-03 Currently on assignment in the
Singapore, 0106 Asia-Pacific region

Gary Walsh

Jun 28, 1996
joe...@rossinc.com (Joel Garry) wrote:
>In article <31C9F8...@tpg.tpg.oz.au> Jo Manna <j...@tpg.tpg.oz.au> writes:
>>I am fairly new to Oracle and am
>>trying get some info on what performance
>>increases (if any) I can expect by using
>>a raw device as apposed to a unix file system.
>Personally, I think the one time that someone unthinkingly writes over
>the raw system will more than make up for any increase. In other words,
>it requires greater control over administration than any place can be
>expected to have over a long period of time.

Whereas, of course, people never accidentally remove files from a UNIX
file system.... Provided your sys admin is 'fit for purpose' then I
don't see that files are any more or less likely to be trashed than raw
devices. Of course, it helps if your OS has some form of LVM so that
you can give 'raw devices' meaningful names.

Gary

--
_____________________________________________________________________
Gary Walsh E-mail: Gary_...@spire.hds.co.uk
Open Systems Product Specialist Phone: +44 1753 618806
Hitachi Data Systems Europe Disclaimer: My views, not HDS'

Christoph Torlinsky

Jun 29, 1996
I don't mean to sound too ignorant... but what exactly are cooked and
raw databases? We are talking about logical volume manager databases,
correct? The only one I have seen on SVM is the one that is used along
with raw device vols (which are not newfs'ed, to my understanding) and regular
filesystem-type vols. These are usually placed in slice 11 (for us) as
partition type 6, which I assume is always raw, no?

-chris

ps: I have never hosed an SVM database myself. I did get them out of sync, say
if a pbay went down and it didn't come on during bootup, but I always
have two good databases which make up for the error (usually placed on qcic 0);
my understanding is that SVM does an rdcp to replicate those DBs, which
means that they are always raw, no?


Kurtis D. Rader (kra...@crg8.sequent.com) wrote:
: This makes me think of the favorite TLA of the sales force:

: FUD: Fear, Uncertainty, and Doubt

: joe...@rossinc.com (Joel Garry) writes:

: >There was a thread on this a while ago, with some folks claiming minimal
: >increase, while others claimed up to 300%.

: >Personally, I think the one time that someone unthinkingly writes over


: >the raw system will more than make up for any increase. In other words,
: >it requires greater control over administration than any place can be
: >expected to have over a long period of time.

: This is true only for raw disk partitions. If you are using a

Erik Walthinsen (Omega)

Jun 30, 1996

On 29 Jun 1996, Christoph Torlinsky wrote:

> I dont mean to sound too ignorant..but what exactly is a cooked and
> raw database (?). We are talking about Logical volume manager databases
> correct? The only one I have seen on svm is the one that is used also
> with raw device vols (which are not newfsed, my understanding)and regular
> filesystem type vols. These are usually placed in slice 11 (for us) as
> partition type 6, which I assume is always raw ? no?

Actually, I believe that they are talking about raw/cooked databases, as
in Oracle or Informix. The basic difference is as follows:

Raw: DB engine manages /dev/rdsk/qdX or qdXsX directly, handles buffers
itself, buffers being similar to DOS's smartdrv (yeah, whatever...)
Covered in System Administration 1.
Cooked: OS manages partition, and actually has a filesystem on it. The
database engine just has a BIG file sitting on it that hold data.
The OS manages the buffers.

This is why when you run a system with Oracle or whatever you have to
carefully decide what BUFPCT to use. If you use raw partitions, you want
to give more memory to Oracle for buffering, which does a much better job
than the OS, because it understands the structure of the database a
helluva lot better than the OS does.

With a low BUFPCT, the database can allocate more memory for its buffers.
If you use cooked databases, the OS needs a higher BUFPCT so it can do the
buffering itself.


> ps: I have never hosed an svm database myself, I did get them out of
> synch, say if a pbay went down , and it didnt come on during bootup,
> but I always have two good database which make for the error
> (usually placed on qcic 0) which my understanding svm does an rdcp
> to replicate those db's, which means that they are always raw? no?

Under SVM 1.x, databases are on a partition of type 6. I'm not sure of
the particulars of those versions, but I believe that the database sits on
a partition autonomously. I know that in the next version it distributes
it across private sectors intelligently, so if there's a failure it always
has copies of it floating around. For all I know, 1.x does the same. I
certainly hope so... ;-)

But the real issue is the difference between raw and cooked. Basically,
raw partitions are where the application (either Oracle, Informix, etc.,
or SVM) accesses the partition/disk directly, via the /dev/rdsk entry.
The OS doesn't care what it does. With cooked partitions, the OS sits
there and translates the traffic back and forth.
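The raw/cooked distinction above is visible in the file type itself: a raw device entry is a character-special file, while a cooked database file is an ordinary file on a mounted filesystem. A small check in modern Python, as an illustration (/dev/null serves as the example character device since it exists on any UNIX):

```python
import os
import stat
import tempfile

def is_raw_device(path: str) -> bool:
    """True if path is a character-special file, i.e. a raw device entry."""
    return stat.S_ISCHR(os.stat(path).st_mode)

# /dev/null is character-special on any UNIX; a regular file is not.
fd, regular_file = tempfile.mkstemp()
os.close(fd)
print(is_raw_device("/dev/null"), is_raw_device(regular_file))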


> : FUD: Fear, Uncertainty, and Doubt

Quite a useful thing, I'd think... ;-)


> : >Personally, I think the one time that someone unthinkingly writes over
> : >the raw system will more than make up for any increase. In other words,
> : >it requires greater control over administration than any place can be
> : >expected to have over a long period of time.

Wait a minute... I thought that when a program takes over control of a
partition it locks it somehow. I don't know the specifics, but I know
that such things are possible (what exactly I probably shouldn't say, just
one of those Sequent internal things...) Nonetheless, I know it is
possible to lock a partition from anything else touching it, especially if
it's the disk itself. I know that when you have a VTOC on a disk and run
devbuild on it, you can't rdcp -z to the /dev/rdsk/qdX, and if you have a
partition newfs'd and mounted, you can't write to the /dev/rdsk/qdXsX.


> : Yet I have taken four calls (that I can remember) in the past six


> : years from customers who have accidentally blown away their cooked
> : database.

From what I understand of cooked databases, this seems quite easy. After
all, the DB engine just puts a big file on the filesystem and plays around
with it. If the DB engine hasn't been started, it seems quite easy to
"accidentally" remove that file, especially with an incompetent at the
controls...

__ Erik Walthinsen - Programmer, webmaster, 3D artist
/ \
| | M E G A Teleport ISP: om...@teleport.com
_\ /_ Sequent Computers: om...@sequent.com

Sequent Symmetry Archive: http://www.teleport.com/~omega/sequent (temp loc.)

Ted Do

Jul 1, 1996
to Gary Walsh

Gary Walsh wrote:
> DELETED

> devices. Of course, it helps if your OS has some form of LVM so that
> you can give 'raw devices' meaningful names.

In any UNIX, you can always use soft/hard links to create meaningful
names. Another UNIX feature, which is less well known to non-sysadmin
UNIX users, is the mknod(1M) command. To use it, you must provide
the major and minor numbers of the raw device.

-Ted.

--
=============================================
= Theodore Do =
= Technical Consultant, UNIX/ORACLE =
= TRW Information Technology Services =
= theod...@trw.com =
=============================================
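Ted's symlink suggestion can be sketched in a few lines of modern Python (as an illustration; the device and link names here are made up, and a temp file stands in for the real /dev/rdsk entry so the sketch runs without root, which mknod with real major/minor numbers would require):

```python
import os
import tempfile

# Hypothetical names for illustration only. A temp file stands in for
# the character-special /dev/rdsk entry; creating a real one needs
# mknod with the device's major/minor numbers, and root privileges.
tmpdir = tempfile.mkdtemp()
device = os.path.join(tmpdir, "c0t1d0s4")      # stand-in for /dev/rdsk/c0t1d0s4
open(device, "w").close()

# A symlink gives the slice a name that says what it holds.
link = os.path.join(tmpdir, "oracle_system01")
os.symlink(device, link)
assert os.path.realpath(link) == os.path.realpath(device)
```

The DBA then refers to oracle_system01 everywhere, and repointing the link is all it takes if the underlying slice moves.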

Joel Garry

Jul 2, 1996

In article <4quudo$e...@scel.sequent.com> kra...@crg8.sequent.com (Kurtis D. Rader) writes:
>This makes me think of the favorite TLA of the sales force:
>
> FUD: Fear, Uncertainty, and Doubt
>
>joe...@rossinc.com (Joel Garry) writes:
>
>>There was a thread on this a while ago, with some folks claiming minimal
>>increase, while others claimed up to 300%.
>
>>Personally, I think the one time that someone unthinkingly writes over
>>the raw system will more than make up for any increase. In other words,
>>it requires greater control over administration than any place can be
>>expected to have over a long period of time.
>
>This is true only for raw disk partitions. If you are using a
>logical volume manager (ala Veritas VxVm or ptx/SVM) then the risk

OK, if you are. And OK if that is all you are using. However, if you are
using a mix of machines, some of which have it and some of which don't,
that is even worse.

>of overwriting your raw database is no greater than overwriting a
>cooked database file. And in fact I have never taken a call from
>a customer who has accidentally destroyed his raw database. Yet

Well, I have.

>I have taken four calls (that I can remember) in the past six years
>from customers who have accidentally blown away their cooked

>database. For that matter, I ran both raw and cooked databases
>back when I was a sysadmin/DBA and never had a problem with either.

Good for you. Now how that applies to someone who would ask the
question, "Should I use a raw partition?" is another question.
They don't have as much experience as you. Noting your sequent.com
address, I wonder how much DEC Unix volume manager experience you
really have. I don't have any, so maybe you are right. My only
experience with Veritas was on SCO, and it didn't work right. It
did appeal to the perverse side of my sense of humor that the software
that was supposed to increase reliability core dumped. I don't
assume any customer has an LVM (except on AIX, and even that only
from all the off-point flames, considering the original poster asked
about DEC) and don't assume that it, like any software, is bug- and
user-error-free.

>
>This is something I have studied extensively in my role designing
>system architectures for some of the largest Oracle sites in the

Let us hope that the largest Oracle sites in the world have
some sophisticated and stable administration. Now how does that
matter for some little place with a DEC box? From what I've seen,
a lot of smaller shops say "You - Yeah, You. You get to learn
unix and oracle now." Of course, my view is probably skewed, 'cause
then they call me with "I haven't taken the courses yet, but we
have this new computer..."

>world. Performance is actually pretty low on my list of reasons
>for favoring raw over cooked databases. A matrix I use when
>discussing this with clients shows the number of factors in favor
>of raw outweighing the number favoring cooked. That des not mean
>I always choose a raw implementation. It simply means that you
>shouldn't let FUD (see the start of the post) drive a decision.

Please share the matrix.

You didn't answer my point, which is that administration can deteriorate
over time, and it can be worse with raw devices. You merely
intimated that I am a salesperson or something.

>--
>Kurtis D. Rader, Senior Consultant kra...@sequent.com (email)
>Sequent Computer Systems +1 503/578-3714 (voice) +65 223-5116 (fax)
>80 Robinson Road, #18-03 Currently on assignment in the
>Singapore, 0106 Asia-Pacific region

Eugene Freydenzon

Jul 7, 1996

Jo Manna wrote:
>
> Steve Long wrote:
> >
> > Jo,
> >

> > I strongly recommend you DO NOT use raw devices. Although you will see
> > a minimal performance gain ( < 10%), you will create a significant
> > increase in administrative overhead managing back-up and recovery as
> > well as considerably more time performing a recovery in the event of a
> > device failure. Economically speaking, it is cheaper to buy more
> > memory and processors to gain considerably more performance than to
> > spend the additional time on administrative tasks (personnel time) when
> > raw devices are used for minimal performance gains.
> >
> > Steve
> > 804-262-6332
> >
> > -----------------------------------------------------------------
>
> Steve,
>
> Thanks for the reply. This is quite interesting. I have previously
> used Sybase and I remember that using Unix File Systems instead of
> Raw Devices was not recommended. Just off the top of me head the
> reasons for this was something like...
>
> ... the Sybase Server being in 'charge' of the actual I/O and if using a
> Unix File System could not guarantee that a 'commit was a commit',
> due to the OS buffering..... and so on.
>
> Obviously we are talking about Oracle here and not Sybase, but I am
> just wondering how Oracle gets around this if at all, or have I missed
> something?
>
> Thanks
> Jo
Sometimes I used raw devices for temp tablespaces.

Paul Zola

Jul 8, 1996
to Joel Garry

} >joe...@rossinc.com (Joel Garry) writes:
} >
} >>There was a thread on this a while ago, with some folks claiming minimal
} >>increase, while others claimed up to 300%.

Let me paraphrase from Cary Millsap's paper "The OFA Standard Oracle7
for Open Systems", (part number A19308) which everyone in this thread
should read before posting anything on this subject.

(1) If disk I/O is not the bottleneck, then going to raw devices will
have *no* *performance* *impact* *at* *all*.
(2) If disk I/O is the bottleneck, then going to raw devices may
sometimes gain up to a 10% performance improvement relative to
the same database using filesystem files.
(3) Under very common circumstances, going to raw devices can actually
*decrease* database performance.
(4) Anyone contemplating going to raw devices should benchmark their
application on both raw and filesystem devices to see if there is
a significant performance increase in using raw devices.
(5) Anyone who does not have the time, expertise, or resources to
perform the raw-vs-filesystem benchmark *should* *not* *consider*
using raw devices.
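The benchmark in point (4) need not be elaborate to be honest: same workload, both configurations, wall-clock time. A minimal timing harness in modern Python showing the shape of such a test (timing buffered vs. O_SYNC writes on a temp file is only a stand-in; a real comparison would run the actual application against raw and filesystem datafiles):

```python
import os
import tempfile
import time

def timed_writes(extra_flags: int, n: int = 100, size: int = 8192) -> float:
    """Time n sequential writes of `size` bytes with the given open() flags."""
    fd, path = tempfile.mkstemp()
    os.close(fd)
    fd = os.open(path, os.O_WRONLY | extra_flags)
    buf = b"\0" * size
    start = time.perf_counter()
    for _ in range(n):
        os.write(fd, buf)
    elapsed = time.perf_counter() - start
    os.close(fd)
    os.unlink(path)
    return elapsed

buffered = timed_writes(0)          # writes absorbed by the buffer cache
synced = timed_writes(os.O_SYNC)    # each write forced to disk
print(f"buffered {buffered:.4f}s, O_SYNC {synced:.4f}s")
```

Whatever the numbers come out to, the discipline is the point: measure both configurations under the same load before deciding.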

Finally: a word about those 300% speedups that people report. Often,
when you look at the changes that they've made to go to raw devices
you'll find that they have done one of:
(1) Export/Import the database, and thereby remove fragmentation;
(2) Move control or redo log files onto separate devices or
controllers;
(3) Move database files onto separate devices or controllers.

If you adjust for all of the performance improvements that they've
gotten from all the other optimizations, then you'll see that *just*
going raw hasn't really bought them that much. Defragmenting, in
particular, can buy you a *lot* of speed.

Of course, they could always have made the same performance
improvements without going raw, and gotten most, if not all, of the
performance gains that they're attributing to raw devices.

-p

==============================================================================
Paul Zola Technical Specialist World-Wide Technical Support
------------------------------------------------------------------------------
GCS H--- s:++ g++ au+ !a w+ v++ C+++ UAV++$ UUOC+++$ UHS++++$ P+>++ E-- N++ n+
W+(++)$ M+ V- po- Y+ !5 !j R- G? !tv b++(+++) !D B-- e++ u** h f-->+ r*
==============================================================================
Disclaimer: Opinions and statements are mine, and do not necessarily
reflect the opinions of Oracle Corporation.

MarkP28665

Jul 8, 1996

My employer has had two Oracle dbms experts on site to look over our
system. We paid a very high hourly rate (rumored to be 275.00/hour) for
these people and we kept them each for a minimum of five days. Both
recommended that we run on raw partitions and not UNIX file systems.

Raw partitions usually result in a 50% performance improvement for an
individual physical I/O, and this generally results in a 10% performance
improvement for the database as a whole. Going to raw partitions will NOT
help much if the problem is bad code. Most performance improvements will
come from rewriting SQL and changing how applications work, not from
changing the database.

Raw partitions are not any harder to manage than UNIX file system files if
you plan your system out in advance and follow a few simple rules. You
can move raw partitions around and redefine where they are located via
UNIX without having to rename them via Oracle. Oracle should be stopped
at the time, but we have done it several times to move files to new disks.
Switch from 'cpio' and 'tar' to 'dd' or the vendor-provided fast
character-special data set copy utility for your backups, and what real
difficulty do raw partitions present?
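The dd-style backup suggested above is just a fixed-block copy of the character-special file. Sketched in modern Python as an illustration (the block size and names are made up; on the real system you would point this, or dd itself, at the /dev/rdsk entry):

```python
import os
import tempfile

def dd_copy(src: str, dst: str, bs: int = 64 * 1024) -> int:
    """dd-style backup: copy src to dst in fixed-size blocks; return bytes copied."""
    copied = 0
    with open(src, "rb") as fin, open(dst, "wb") as fout:
        while True:
            block = fin.read(bs)
            if not block:
                break
            fout.write(block)
            copied += len(block)
    return copied

# Demo on a temp file standing in for a raw partition.
tmpdir = tempfile.mkdtemp()
src = os.path.join(tmpdir, "slice")
dst = os.path.join(tmpdir, "backup")
with open(src, "wb") as f:
    f.write(os.urandom(200_000))
assert dd_copy(src, dst) == 200_000
```

Unlike cpio or tar, which walk a filesystem, this reads the device front to back, which is why it works on a raw partition at all.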

In the old days raw partitions were superior to UNIX files because UNIX
controls the buffering of the I/O, and blocks recorded as written by
Oracle could in fact not yet be on disk. With raw partitions the I/O is
unbuffered by the OS. Most, but not all, modern UNIX systems provide a
write-through-the-buffer function which tells UNIX not to buffer the I/O.
Oracle development uses the unbuffered call when available, or so one of
the experts told me after talking to a private internal support resource.
If your UNIX system does not support the write-through method you
may want to switch to protect your data.
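On most systems the write-through call referred to above is the O_SYNC flag to open(2): the write does not return until the data is on stable storage. A minimal sketch in modern Python, on a temp file standing in for a datafile (an illustration, not Oracle's actual code path):

```python
import os
import tempfile

# O_SYNC asks the OS not to complete the write until the data is on
# stable storage, approximating raw-device write semantics on a
# filesystem file. The temp file stands in for a datafile.
fd, path = tempfile.mkstemp()
os.close(fd)
fd = os.open(path, os.O_WRONLY | os.O_SYNC)
written = os.write(fd, b"data block")
os.close(fd)
os.unlink(path)
print("synchronous write of", written, "bytes")
```

Without O_SYNC (or a later fsync), that write could sit in the buffer cache and be lost in a crash even though Oracle had recorded it as done.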

Some Oracle options like parallel server (not query) require the use of
raw partitions, so you may not have an option. And as far as the Millsap
paper goes, I was advised by a friend with an inside contact in development
to read it with a grain of salt, because it was written to address the
needs of Oracle support, who mostly support small installations where the
depth of knowledge is shallow, and because application developers always
point to the database as the problem and not their code.

UNIX files systems work fine for most shops, but if you need every bit of
performance that you can get then you should switch to raw partitions.
This could be an endless discussion.

David Williams

Jul 8, 1996

In article <4rpsh3$s...@inet-nntp-gw-1.us.oracle.com>, Paul Zola
<pz...@us.oracle.com> writes

>
>} >joe...@rossinc.com (Joel Garry) writes:
>} >
>} >>There was a thread on this a while ago, with some folks claiming minimal
>} >>increase, while others claimed up to 300%.
>
>Let me paraphrase from Cary Millsap's paper "The OFA Standard Oracle7
>for Open Systems", (part number A19308) which everyone in this thread
>should read before posting anything on this subject.
>
>(1) If disk I/O is not the bottleneck, then going to raw devices will
> have *no* *performance* *impact* *at* *all*.

Correct

>(2) If disk I/O is the bottleneck, then going to raw devices may
> sometimes gain up to a 10% performance improvement relative to
> the same database using filesystem files.

Agreed

>(3) Under very common circumstances, going to raw devices can actually
> *decrease* database performance.

??? Explain - not going through the UNIX buffer cache, and also copying
(DMAing) directly into user space rather than via kernel space, i.e.
one memory write rather than a memory write plus a memory copy, is
SLOWER??

>(4) Anyone contemplating going to raw devices should benchmark their
> application on both raw and filesystem devices to see if there is
> a significant performance increase in using raw devices.

Not sure it's necessary.

>(5) Anyone who does not have the time, expertise, or resources to
> perform the raw-vs-filesystem benchmark *should* *not* *consider*
> using raw devices.
>
>Finally: a word about those 300% speedups that people report. Often,
>when you look at the changes that they've made to go to raw devices
>you'll find that they have done one of:
> (1) Export/Import the database, and thereby remove fragmentation;
> (2) Move control or redo log files onto separate devices or
> controllers;
> (3) Move database files onto separate devices or controllers.
>
>If you adjust for all of the performance improvements that they've
>gotten from all the other optimizations, then you'll see that *just*
>going raw hasn't really bought them that much. Defragmenting, in
>particular, can buy you a *lot* of speed.
>
>Of course, they could always have made the same performance
>improvements without going raw, and gotten most, if not all, of the
>performance gains that they're attributing to raw devices.
>

Agreed - fragmentation does decrease performance dramatically, but how
can filesystems be faster - loading inodes into memory and
following inodes means more seeking, since inodes are not that close to
the data even with cylinder groups in use. You would expect at least
one seek from the inode table to the data, which is not required when
using raw devices.

The inode table deliberately stores control information, i.e. disk
layout, away from the data (in a separate cylinder).

How can it be faster???


> -p
>
>==============================================================================
>Paul Zola Technical Specialist World-Wide Technical Support
>------------------------------------------------------------------------------
>GCS H--- s:++ g++ au+ !a w+ v++ C+++ UAV++$ UUOC+++$ UHS++++$ P+>++ E-- N++ n+
> W+(++)$ M+ V- po- Y+ !5 !j R- G? !tv b++(+++) !D B-- e++ u** h f-->+ r*
>==============================================================================
>Disclaimer: Opinions and statements are mine, and do not necessarily
> reflect the opinions of Oracle Corporation.

--
David Williams

Joel Garry

Jul 11, 1996

In article <$x2BSJAQ...@smooth1.demon.co.uk> David Williams <d...@smooth1.demon.co.uk> writes:
>In article <4rpsh3$s...@inet-nntp-gw-1.us.oracle.com>, Paul Zola
><pz...@us.oracle.com> writes
>>
>>} >joe...@rossinc.com (Joel Garry) writes:
>>} >
>>} >>There was a thread on this a while ago, with some folks claiming minimal
>>} >>increase, while others claimed up to 300%.
>>
>>Let me paraphrase from Cary Millsap's paper "The OFA Standard Oracle7
>>for Open Systems", (part number A19308) which everyone in this thread
>>should read before posting anything on this subject.
>>
>>(1) If disk I/O is not the bottleneck, then going to raw devices will
>> have *no* *performance* *impact* *at* *all*.
>
> Correct

Agree, just curious how often in the real world it isn't the bottleneck.
I have seen people make ridiculous attempts to use underpowered boxes,
but... well, I guess you can't tell a priori what is wrong.

>
>>(2) If disk I/O is the bottleneck, then going to raw devices may
>> sometimes gain up to a 10% performance improvement relative to
>> the same database using filesystem files.
>
> Agreed
>
>>(3) Under very common circumstances, going to raw devices can actually
>> *decrease* database performance.
>
> ??? Explain - Not going with the UNIX buffer cache and also copying
> (DMAing) directly into user space rather than via kernel space i.e.
> one memory write rather than a memory write and a memory copy is
> SLOWER??

Perhaps we are now getting into the area of placement of the files on
disk and so forth.

>
>>(4) Anyone contemplating going to raw devices should benchmark their
>> application on both raw and filesystem devices to see if there is
>> a significant performance increase in using raw devices.
>
> Not sure it's necessary.

A stress test may show the OS vendor's more complicated handling of
file I/O may stand up better (or fall down more gracefully) than Oracle's.
Since I've been burned on several platforms with DBWR pulling its dress
over its head, I don't quite trust Oracle as much as the OS. Maybe that's
unfair. But if you care about 10% performance, you may have some
capacity planning issues to address.

>
>>(5) Anyone who does not have the time, expertise, or resources to
>> perform the raw-vs-filesystem benchmark *should* *not* *consider*
>> using raw devices.

I shoulda said that. But if it's a "free" 10%, why not? (For those
who are confused now, I had said the possible administrative problems
make a small gain not worth it, this is mild sarcasm).

>>
>>Finally: a word about those 300% speedups that people report. Often,
>>when you look at the changes that they've made to go to raw devices
>>you'll find that they have done one of:
>> (1) Export/Import the database, and thereby remove fragmentation;
>> (2) Move control or redo log files onto separate devices or
>> controllers;
>> (3) Move database files onto separate devices or controllers.

I had those thoughts about the person who claimed that, as well as
that he might have changed hardware.

>>
>>If you adjust for all of the performance improvements that they've
>>gotten from all the other optimizations, then you'll see that *just*
>>going raw hasn't really bought them that much. Defragmenting, in
>>particular, can buy you a *lot* of speed.
>>
>>Of course, they could always have made the same performance
>>improvements without going raw, and gotten most, if not all, of the
>>performance gains that they're attributing to raw devices.
>>
>
> Agreed - fragmentation does decrease performance dramatically but how
> can filesystems be faster - loading inodes into memory and
> following inodes means more seeking since inodes are not that close to
> the data even with cylinder groups in use. You would expect at least
> one seek from inode table to the data which is not required when
> using raw devices.
>

Assuming Oracle doesn't do the same things... and if it doesn't,
how do we know things won't get corrupted in a crash? Assuming
everything Oracle does is as good as could be done is a mistake.
At least with fsck you might figure out what happened...

> The inode table deliberately stores control information (i.e. disk
> layout) away from the data, in a separate cylinder.
>
> How can it be faster???
>> -p
>>
>>==============================================================================
>>Paul Zola Technical Specialist World-Wide Technical Support
>>------------------------------------------------------------------------------
>>GCS H--- s:++ g++ au+ !a w+ v++ C+++ UAV++$ UUOC+++$ UHS++++$ P+>++ E-- N++ n+
>> W+(++)$ M+ V- po- Y+ !5 !j R- G? !tv b++(+++) !D B-- e++ u** h f-->+ r*
>>==============================================================================
>>Disclaimer: Opinions and statements are mine, and do not necessarily
>> reflect the opinions of Oracle Corporation.
>
>--
>David Williams

Frank Bommarito

Jul 12, 1996

All true - with one exception.

Going to raw can allow a significant amount of existing memory to be
reallocated, perhaps to database block buffers. Given that this is
where I have seen the most significant performance improvements (the
same paper mentions up to 3000%), maybe the 300% was not raw, but
memory.

Note: raw devices are usually faster on random reads and slower on
sequential reads, because the OS is not buffering data.

Frank

Christoph Torlinsky

Jul 15, 1996

If you are going to use unix file systems with oracle, what would
you set the UNIX file system buffers to? 10%? 15%? Just curious.
It sounds interesting that the unix file system can actually
be just as quick as raw devices.

Also, what are some tricks to avoid high seek times, and what should
be monitored to pinpoint such a problem?

Thanks a bunch.

ps: this is one of the most interesting threads on this group in a long
time...

Paul Zola

Jul 15, 1996

} In article <4rpsh3$s...@inet-nntp-gw-1.us.oracle.com>, Paul Zola
} <pz...@us.oracle.com> writes
} >

[snip]


} >(3) Under very common circumstances, going to raw devices can actually
} > *decrease* database performance.
}
} ??? Explain - Not going with the UNIX buffer cache and also copying
} (DMAing) directly into user space rather than via kernel space i.e.
} one memory write rather than a memory write and a memory copy is
} SLOWER??

Short answer: yep.

Flippant answer: Trying to reduce expected disk read times by
reducing the number of memory copies is like trying to optimize a
program by moving code from the initialization modules into the
main loop.

Medium-length serious answer: the UNIX kernel is very good at
optimizing file I/O. In particular, the kernel has the ability
to optimize I/O by scheduling read and write requests in a way
that minimizes disk seeks and rotational latency, which are the
big bottlenecks in any I/O system.

Before answering this in detail, I want to address another issue you
raised, in part because my response to it will help explain how the
buffer cache can win against raw devices.

You write:

} Agreed - fragmentation does decrease performance dramatically but how
} can filesystems be faster - loading inodes into memory and
} following inodes means more seeking since inodes are not that close to
} the data even with cylinder groups in use. You would expect at least
} one seek from inode table to the data which is not required when
} using raw devices.

[snip]


} How can it be faster???

I'm afraid that this passage betrays a certain lack of understanding of
the UNIX kernel.

The inodes for open files are cached in memory, for as long as the file
is open. Since the Oracle background processes open the database files
when the database is mounted, and keep them open until the database is
un-mounted, all kernel accesses to the inode are through the in-core
copy (which, of course, requires no seek time).

A somewhat more relevant objection (which you did not raise) is that
Oracle database files are big enough to require indirect blocks, and
that accessing the data in the indirect blocks would require an extra
disk access.

In practice, this extra disk access occurs very rarely. Since the
indirect block is a filesystem block, the normal LRU caching
algorithms apply to it as well. This means that the indirect blocks
for any reasonably frequently-accessed database file will already be in
the cache, and will not require an extra I/O. On a typical BSD
filesystem, using 8k blocks & 4-byte disk addresses, any access to a
disk within a 16-meg region will cause the indirect block to move to
the head of the LRU queue.

(Why a 16 meg region? Here's the math: A single indirect block, 8k
long, can contain 2048 4-byte disk addresses. Each one of these disk
addresses points to an 8k disk block. This means that a single
indirect block contains disk addresses for 8 * 1024 * 2048 bytes, or 16
meg.)

Finally, you seem to be under the impression that the UNIX kernel
schedules disk I/O sequentially: first it reads the indirect block
from disk, then it reads the pointed-to block from disk, and then it
goes on to service another I/O request. This is not the case.
Instead, the UNIX kernel is multi-threaded internally: disk I/O
processing occurs asynchronously with respect to user processing, in a
separate thread of control. (Note to any kernel hackers reading this:
yes, I do know that the traditional UNIX implementation isn't really
threaded; I do believe this to be the cleanest conceptual explanation.)

The disk I/O "thread" within UNIX schedules I/O using an "up/down
elevator algorithm". The disk scheduler works something like this:
when the kernel determines that a request has been made to read or
write a disk block, it places a request for that I/O operation into a
queue of requests for that particular disk device. Requests are kept
in order, sorted by *disk block address*, and not by time of request.
The disk driver then services the requests based on the location of the
block on disk, and not on the order in which the requests were made.

The "up/down elevator algorithm" maintains a conceptual "hand", much
like the hand on a wristwatch. This hand continually traverses the
list of I/O requests: first moving from low disk addresses to high
disk addresses, and then from high addresses down to low -- hence,
"up/down elevator". As the "hand" moves up and down the list of
requests, it services the requests in an order which minimizes disk
seek times.

It's perhaps easier to conceptualize this based on a disk with a single
platter: the "hand" moving up and down the list of disk block addresses
exactly corresponds to the disk head seeking in and out on the drive.
(On real-life multi-head drives, the sorting algorithm needs to be
tuned so that all disk blocks on the same cylinder sort together.
This, incidentally, is why mkfs and tunefs care about the number of
sectors per track and the number of tracks per cylinder.)

Here's an example based on your (incorrect) example above. Say that
we're opening a file for the first time. The kernel gets a request to
read the inode into core, and the inode is at absolute disk block
number 21345. This request goes into the request queue, and the
requesting process is put to sleep. Some time later, the "hand" passes
this disk block number, going from higher to lower disk addresses. The
process that requested to read inode number 21345 is woken up and tries
to read() the first block of that file. The kernel, acting on behalf
of that process, determines that the first block of the file is located
on disk block 21643, puts the request into the request queue, and
puts the requesting process to sleep again. Since the "hand" is moving
towards lower addresses, the read request for block 21643 won't be
serviced until the "hand" has serviced all the requests in the queue
for blocks between 0 and 21643 twice -- once while traversing the
request queue going down, and once while going up.

Why use this algorithm? It's possible to prove that in the presence of
multiple processes making random I/O requests, some variant of the
elevator algorithm will result in the shortest average seek time, and
thus the greatest average throughput.

}
} How can it be faster???
}

So after having gone through all that, here's the long answer to why
using raw devices can be slower than the filesystem.

The important fact about disk optimization is this: disk seek times and
rotational delays are measured in milliseconds, while RAM access times
are measured in nanoseconds. This means that to maximize disk I/O the
three most important optimizations are (in order):
(1) Avoid I/O altogether (through caching or other mechanism).
(2) Reduce seek times.
(3) Reduce rotational delays.
Indeed, if we perform 500 memory-to-memory copies, and thereby avoid
1 seek, we're still ahead of the game.

Let's consider an Oracle database with the datafiles installed on raw
devices. Let us say (to avoid the effects of datafile placement) that
there's only one tablespace on this disk, and it's only used for
indexes. Let us also say that this disk has 1024 cylinders, a maximum
seek time of 100ms, and (just to give the raw device every advantage)
that it has a track buffer so that there is no rotational latency
delay. Finally, let's say that there are multiple users performing
queries and updates at the same time, and these work out so there are 2
processes reading, and one process (DBWRITER) writing to this datafile
at the same time.

Let us further consider the disk access pattern of these three
processes. (In case it isn't clear -- this is Oracle having to read a
disk block that isn't in the SGA in from the file system.) Let's say
that over a time interval, they access disk blocks in the following
pattern:
read cylinder 1; read cylinder 1024 ; write cylinder 2; read
cylinder 1023; read cylinder 2; read cylinder 1022 ; write
cylinder 3; read cylinder 1021; write cylinder 4.

How much time did this take to complete? Well, since raw device
accesses are scheduled in the order in which they occur, this disk had
to seek almost from one end to the other 8 times, for a total of 800
ms.

Now, let's consider the same access pattern on a filesystem file, using
the buffer cache. Assuming that these accesses came pretty quickly,
they could all be handled in 3 scans of the request queue (2 going up,
2 going down) for a total of 300 ms. (Note that since UNIX has a
write-back cache, the final write didn't necessarily force a write to
disk: yet another optimization made possible by using filesystem
files.)

Result? Filesystem wins by more than 100% over raw device.

How did this happen? It's a direct result of the access pattern of the
database. The UNIX filesystem is optimized for random access, while
raw devices work better for sequential access.

Note that if there weren't a track buffer on the hard disk, we could
have made the result even more lopsided for the filesystem, by having
the access pattern force rotational delays when accessing the raw
device.

This is why I say:

} >(4) Anyone contemplating going to raw devices should benchmark their
} > application on both raw and filesystem devices to see if there is
} > a significant performance increase in using raw devices.
}
} Not sure it's necessary.

But we've just seen that it is. Let me emphasize: whether raw or
filesystem devices will be faster depends DIRECTLY on the disk access
patterns of *your* *specific* *application*. The exact same schema
could be faster on one or the other, depending on what your users are
doing with it.

This is not to say that using raw devices will always be slower. But
they are far from the "magic bullet" that lots of DBAs seem to think
they are.

And I'll say it again:

} >(5) Anyone who does not have the time, expertise, or resources to
} > perform the raw-vs-filesystem benchmark *should* *not* *consider*
} > using raw devices.


-paul


References:
`Operating Systems Design and Implementation'
    Andrew S. Tanenbaum, Prentice-Hall, ISBN 0-13-637406-9
`The Design and Implementation of the 4.3BSD Unix Operating System',
Samuel Leffler, Kirk McKusick, Michael Karels, John Quarterman,
1989, Addison-Wesley, ISBN 0-201-06196-1
`The Design of the Unix Operating System', Maurice Bach, 1986,
Prentice Hall, ISBN 0-13-201757-1

Paul Zola

Jul 15, 1996

} Jo Manna wrote:
} > Thanks for the reply. This is quite interesting. I have previously
} > used Sybase and I remember that using Unix File Systems instead of
} > Raw Devices was not recommended. Just off the top of me head the
} > reasons for this was something like...
} >
} > ... the Sybase Server being in 'charge' of the actual I/O and if using a
} > Unix File System could not guarantee that a 'commit was a commit',
} > due to the OS buffering..... and so on.
} >
} > Obviously we are talking about Oracle here and not Sybase, but I am

} > just wondering how Oracle gets around this if at all, or have I missed
} > something?

You know, this is (at least) the second time that I've heard someone
say that Sybase told them that UNIX did not allow for transactional
behavior unless they used raw devices. I'm beginning to think that
someone at Sybase is actually saying this.

This is, of course, not true. While it is true that the default
behavior of the UNIX filesystem (write-back buffered cache) does not
allow for transactional consistency, all modern versions of UNIX
provide an ability to modify the behavior of the buffer cache.

BSD-derived systems provide the fsync() system call, which flushes all
the dirty buffers associated with a file descriptor. After the fsync()
call completes, the operating system guarantees that all the buffered
data associated with the file descriptor has been written to the disk.

SystemV-derived systems provide the O_SYNC flag. This can be used in
2 ways: it can be set in the flags argument to open(), or it can
be set with fcntl(), using the F_SETFL command. When the O_SYNC flag
is set on a file descriptor,
the operating system guarantees that when a write() system call
returns, the data from the write has been written to the disk.

ORACLE uses the fsync() or O_SYNC capabilities of UNIX to guarantee
that the redo log files are up-to-date. If the OS crashes, ORACLE will
use the (accurate) data in the redo logs to roll forward the (possibly
inaccurate) data in the data files.

Providing that the OS correctly implements the fsync() or O_SYNC
capabilities, there is no chance of data loss when using ORACLE with
filesystem files.

I have no direct experience with Sybase, so I can't say for sure
whether or not Sybase runs the risk of database corruption when using
filesystem files. If true, there's no inherent limitation in UNIX
that makes this so.

-paul

Joel Garry

Jul 15, 1996

In article <4scrou$2...@inet-nntp-gw-1.us.oracle.com> pz...@us.oracle.com (Paul Zola ) writes:
>

>Result? Filesytem wins by more than 100% over raw device.
>
>How did this happen? It's a direct result of the access pattern of the
>database. The UNIX filesystem is optimized for random access, while
>raw devices work better for sequential access.
>

An unintended side effect is, poorly optimized apps that force too
many full-table scans will do better with raw devs...

Jonathan Lewis

Jul 17, 1996

Joel Garry wrote:
>
> An unintended side effect is, poorly optimized apps that force too
> many full-table scans will do better with raw devs...
>

I would have said that there is an opposing argument to that.

If you do tablescans in Oracle, the data blocks are not kept in the
SGA, so a repeated tablescan does the same set of disk reads agsin.

With a file system, the disk reads may be satisfied from the
file system buffer. With raw devices they will be physical
reads. Result: a bad app collapses when moved to raw disk.

---
Jonathan Lewis
ora_...@jlcomp.demon.co.uk

Joel Garry

Jul 18, 1996

In article <31ED5A...@jlcomp.demon.co.uk> Jonathan Lewis <ora_...@jlcomp.demon.co.uk> writes:
>Joel Garry wrote:
>>
>> An unintended side effect is, poorly optimized apps that force too
>> many full-table scans will do better with raw devs...
>>
>
>I would have said that there is an opposing argument to that.
>
>If you do tablescans in Oracle, the data blocks are not kept in the
>SGA, so a repeated tablescan does the same set of disk reads again.

Oh, man, I was just assuming Oracle was smart enough to keep
a small table in the SGA for repeated tablescans. Once again,
burned by assuming Oracle did something right. Oh, I see, full
table scans put the buffer at the LRU end of the LRU list, because
it assumes if you are using a full table scan, you won't need the
buffer again - so Oracle is only smart enough when it's under
SMALL_TABLE_THRESHOLD. Not likely for most of the data, although
possibly a good argument to not put all lookup codes in one table.

I've definitely changed my mind from thinking raw devs are almost a
tossup to, don't use them unless you have to or can demonstrate their
advantage.

Thanks,

jg

>
>With a file system, the disk reads may be satisfied from the
>file system buffer. With raw devices they will be physical
>reads. Result: a bad app collapses when moved to raw disk.
>
>---
>Jonathan Lewis
>ora_...@jlcomp.demon.co.uk

--

Glenn "Daddy G" Fawcett

Jul 18, 1996

In Oracle 7.3 you can alter a specific table's properties to place buffers
obtained by a tablescan at the MRU end of the LRU chain.

ALTER TABLE abc123 CACHE;


In article <1996Jul18.1...@rossinc.com>,

Joel Garry <joe...@rossinc.com> wrote:
>
>buffer again - so Oracle is only smart enough when it's under
>SMALL_TABLE_THRESHOLD. Not likely for most of the data, although
>possibly a good argument to not put all lookup codes in one table.
>

--
Glenn Fawcett, Technical Marketing Engineer voice: (503)578-3712
Sequent Computer Systems FAX: (503)578-5453
15450 SW Koll Pkwy (MS CHA1-371) UUCP: ..uunet!sequent!glennf
Beaverton, Oregon 97006-6063 internet: gle...@sequent.com

Steve Dodsworth

Jul 18, 1996

lots of people said the following :

>>>
>>> An unintended side effect is, poorly optimized apps that force too
>>> many full-table scans will do better with raw devs...
>>>
>>
>>I would have said that there is an opposing argument to that.
>>
>>If you do tablescans in Oracle, the data blocks are not kept in the
>>SGA, so a repeated tablescan does the same set of disk reads again.
>
>Oh, man, I was just assuming Oracle was smart enough to keep
>a small table in the SGA for repeated tablescans. Once again,
>burned by assuming Oracle did something right. Oh, I see, full
>table scans put the buffer at the LRU end of the LRU list, because
>it assumes if you are using a full table scan, you won't need the

>buffer again - so Oracle is only smart enough when it's under
>SMALL_TABLE_THRESHOLD. Not likely for most of the data, although
>possibly a good argument to not put all lookup codes in one table.
>

>I've definitely changed my mind from thinking raw devs are almost a
>tossup to, don't use them unless you have to or can demonstrate their
>advantage.
>
>Thanks,
>
>jg
>
>>
>>With a file system, the disk reads may be satisfied from the
>>file system buffer. With raw devices they will be physical
>>reads. Result: a bad app collapses when moved to raw disk.
>>---
>>Jonathan Lewis

>--
>Joel Garry joe...@rossinc.com Compuserve 70661,1534
>These are my opinions, not necessarily those of Ross Systems, Inc. <> <>

Full-table-scan tables can be kept in the SGA via the CACHE keyword at table
create time, god bless them.

Bye,
Steve
____________________________________________
| any similarity 'tween my opinions and that |
| of my employers are purely hypothetical |
| and should give no cause for alarm |
--------------------------------------------

Steve Holdoway

Jul 19, 1996

gle...@crg8.sequent.com (Glenn "Daddy G" Fawcett) wrote:

>In Oracle 7.3 you can alter a specific table's properties to place buffers
>obtained by a tablescan at the MRU end of the LRU chain.

> ALTER TABLE abc123 CACHE;

What happens if you've not got enough space in the SGA to hold the
complete table? What I'm praying that you'll say is that the most
commonly used parts of the table are kept in memory. Or can you force
this to happen if it's not the default case??

TIA

Steve.

>In article <1996Jul18.1...@rossinc.com>,

>Joel Garry <joe...@rossinc.com> wrote:
>>
>>buffer again - so Oracle is only smart enough when it's under
>>SMALL_TABLE_THRESHOLD. Not likely for most of the data, although
>>possibly a good argument to not put all lookup codes in one table.
>>

Doug Smith

Jul 19, 1996

In article <4snl9t$g...@atlas.aethos.co.uk> st...@aethos.demon.co.uk (Steve Holdoway) writes:
>From: st...@aethos.demon.co.uk (Steve Holdoway)
>Subject: Re: Raw Devices: Increased Performance?
>Date: Fri, 19 Jul 1996 09:48:36 GMT

>gle...@crg8.sequent.com (Glenn "Daddy G" Fawcett) wrote:

>>In Oracle 7.3 you can alter a specific table's properties to place buffers
>>obtained by a tablescan at the MRU end of the LRU chain.

>> ALTER TABLE abc123 CACHE;

>What happens if you've not got enough space in the SGA to hold the
>complete table? What I'm praying that you'll say is that the most
>commonly used parts of the table are kept in memory. Or can you force
>this to happen if it's not the default case??

Oracle keeps a portion of the tables you access in memory (SGA) buffers.
The CACHE parameter forces Oracle to put the entire table (or whatever
will fit) into the SGA, which will bump data from other tables out.

Jef Kennedy

Jul 19, 1996

Paul,

FYI... A potential OPS customer (on Unix) has apparently seen this
message and is having second thoughts about buying Oracle Parallel Server
because of its requirement of raw disks. We are in the position of having
to quiet their fears to finish the sale.

In the case of Oracle on Digital Unix, using raw disk is not as bad as you
make it out to be if the customer uses the Digital LSM product, which
provides extensive management capabilities.

Although you've included a disclaimer, the outside world will see the
"oracle.com" in your email address and take your word as gospel. Please
be more sensitive in how you present your opinions.

Thanks.

Jef Kennedy
415/506-6144


In article <4rpsh3$s...@inet-nntp-gw-1.us.oracle.com>, pz...@us.oracle.com
(Paul Zola ) wrote:

> } >joe...@rossinc.com (Joel Garry) writes:
> } >
> } >>There was a thread on this a while ago, with some folks claiming minimal
> } >>increase, while others claimed up to 300%.
>
> Let me paraphrase from Cary Millsap's paper "The OFA Standard Oracle7
> for Open Systems", (part number A19308) which everyone in this thread
> should read before posting anything on this subject.
>
> (1) If disk I/O is not the bottleneck, then going to raw devices will
> have *no* *performance* *impact* *at* *all*.

> (2) If disk I/O is the bottleneck, then going to raw devices may
> sometimes gain up to a 10% performance improvement relative to
> the same database using filesystem files.

> (3) Under very common circumstances, going to raw devices can actually
> *decrease* database performance.

> (4) Anyone contemplating going to raw devices should benchmark their
> application on both raw and filesystem devices to see if there is
> a significant performance increase in using raw devices.

> (5) Anyone who does not have the time, expertise, or resources to
> perform the raw-vs-filesystem benchmark *should* *not* *consider*
> using raw devices.
>

> Finally: a word about those 300% speedups that people report. Often,
> when you look at the changes that they've made to go to raw devices
> you'll find that they have done one of:
> (1) Export/Import the database, and thereby remove fragmentation;
> (2) Move control or redo log files onto separate devices or
> controllers;
> (3) Move database files onto separate devices or controllers.
>

> If you adjust for all of the performance improvements that they've
> gotten from all the other optimizations, then you'll see that *just*
> going raw hasn't really bought them that much. Defragmenting, in
> particular, can buy you a *lot* of speed.
>
> Of course, they could always have made the same performance
> improvements without going raw, and gotten most, if not all, of the
> performance gains that they're attributing to raw devices.
>

> -p

Steve Dodsworth

Jul 19, 1996

In <4snl9t$g...@atlas.aethos.co.uk>, st...@aethos.demon.co.uk (Steve Holdoway) writes:
>gle...@crg8.sequent.com (Glenn "Daddy G" Fawcett) wrote:
>
>>In Oracle 7.3 you can alter a specific table's properties to place buffers
>>obtained by a tablescan at the MRU end of the LRU chain.
>
>> ALTER TABLE abc123 CACHE;
>
>What happens if you've not got enough space in the SGA to hold the
>complete table? What I'm praying that you'll say is that the most
>commonly used parts of the table are kept in memory. Or can you force
>this to happen if it's not the default case??

Only if the blocks are some of the most commonly accessed in the database

>
>TIA
>
>Steve.
>
>>In article <1996Jul18.1...@rossinc.com>,
>
>>Joel Garry <joe...@rossinc.com> wrote:
>>>
>>>buffer again - so Oracle is only smart enough when it's under
>>>SMALL_TABLE_THRESHOLD. Not likely for most of the data, although
>>>possibly a good argument to not put all lookup codes in one table.
>>>
>>--
>>Glenn Fawcett, Technical Marketing Engineer voice: (503)578-3712
>>Sequent Computer Systems FAX: (503)578-5453
>>15450 SW Koll Pkwy (MS CHA1-371) UUCP: ..uunet!sequent!glennf
>>Beaverton, Oregon 97006-6063 internet: gle...@sequent.com
>
>

Glenn "Daddy G" Fawcett

Jul 19, 1996

Once a block is put on the MRU end of the chain, it will follow the
normal algorithms, and rows that are accessed most often will tend to
stay in memory. The bigger issue is your first statement:

"What happens if you've not got enough space in the SGA to hold the
complete table?"

If you haven't got space to hold the complete table in memory and you
are performing full table scans, then there is not much reason to
alter the table to CACHE. If this is done, your entire SGA will be
flushed of usable blocks every time you perform a scan. Additionally, the
portion of the table that would remain in the SGA would not be used on
the next full table scan, and you would effectively read the entire table
from disk anyway.

In article <4snl9t$g...@atlas.aethos.co.uk>,


Steve Holdoway <st...@aethos.demon.co.uk> wrote:
>gle...@crg8.sequent.com (Glenn "Daddy G" Fawcett) wrote:
>
>>In Oracle 7.3 you can alter a specific table's properties to place buffers
>>obtained by a tablescan at the MRU end of the LRU chain.
>
>> ALTER TABLE abc123 CACHE;
>
>What happens if you've not got enough space in the SGA to hold the
>complete table? What I'm praying that you'll say is that the most
>commonly used parts of the table are kept in memory. Or can you force
>this to happen if it's not the default case??

Willy Klotz

Jul 19, 1996

Jonathan Lewis <ora_...@jlcomp.demon.co.uk> wrote:

>Joel Garry wrote:
>>
>> An unintended side effect is, poorly optimized apps that force too
>> many full-table scans will do better with raw devs...
>>

>I would have said that there is an opposing argument to that.

>If you do tablescans in Oracle, the data blocks are not kept in the
>SGA, so a repeated tablescan does the same set of disk reads again.

Simply not true. You can prove it easily yourself when doing a join
between two tables (which of course must result in full table scans)
and watching your system.

CPU utilization will be at 100%, without any physical I/O activity.

>With a file system, the disk reads may be satisfied from the
>file system buffer. With raw devices they will be physical
>reads. Result: a bad app collapses when moved to raw disk.

If you simply switch to raw devices, this might be true.

Consequently, when you switch to raw, you should allocate your (now
unused) UNIX cache to the Oracle SGA (using db_buffers). This way,
Oracle will find the data in its own cache....

>---
>Jonathan Lewis
>ora_...@jlcomp.demon.co.uk

Willy Klotz

======================================================================
Willys Mail FidoNet 2:2474/117 2:2474/118
Mailbox: analog 06297 910104
ISDN 06297 910105
Internet: 06297...@t-online.de
-> No Request from 06.00 to 08.00 <-
======================================================================

M1069B00

Jul 22, 1996, 3:00:00 AM

What about SVM???

Having read this thread with interest (we use raw devices with ptx/SVM),
the basic argument is that filesystems are better than raw devices if using
OFA because :-

* inodes are not a real overhead because they are in memory and commonly used
indirect blocks are cached as filesystem blocks.

* UNIX kernel uses the elevator algorithm for data access on filesystems as opposed
to the first-come, first-served algorithm inherent in raw device I/O.

* In OFA you are spreading the datafiles across the disk bank and so thinning
the load on any single controller (on average).

This is comparing UNIX disk / filesystem algorithms versus raw devices as-is.
(i.e. using basic character I/O control)
But in the case of Sequent systems using ptx/SVM is it not a different issue?

Raw device (/dev/rvol/,,,) read and writes go through the ptx/SVM layer
which I thought introduced the same kinds of I/O optimizers as the filesystem
(e.g. the elevator algorithm).
It implements striping, and also balances the I/O load between mirrored copies
(i.e. goes to the least busy side of the mirror for read requests).

Is it not the case that plain raw volumes and SVM-managed raw volumes are very
different in their performance capabilities?
What about a new comparison based on the SVM enhanced raw volume?

The post that spoke of a customer considering Oracle Parallel Server would
clearly be referring to a system with SVM, but the arguments being made are
not accounting for SVM, just dumb raw devices using basic stream I/O control.

Clarification from a guru requested please ...

Regards, Steve Woolston. (woolsts@norwich_union.co.uk)

David Williams

Jul 22, 1996, 3:00:00 AM

In article <83804169...@gate.norwich-union.com>, M1069B00@?.?
writes

>
>What about SVM???
>
>Having read this thread with interest (we use raw devices with ptx/SVM),
>the basic argument is that filesystems are better than raw devices if using
>OFA because :-
>
>* inodes are not a real overhead because they are in memory and commonly used
> indirect blocks are cached as filesystem blocks.
>

Yes - sorry, my brain rather overloaded with ideas against filesystems
in my last post and I got a bit carried away.

>* UNIX kernel uses elevator algorithm for data access on filesystems as opposed
> to the come first served algorithm inherent in raw device I/O.
>

Generally it is the disk drive controllers themselves which have the
elevator seek algorithm built into them, and the database engine has
this built in as well. Why waste time doing it an extra time?
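
For the curious, the elevator idea both posters refer to is easy to simulate. A minimal sketch comparing first-come, first-served service against a single elevator sweep, using invented cylinder numbers (not any vendor's actual scheduler):

```python
def fcfs_distance(start, requests):
    """Total head travel when requests are served in arrival order."""
    pos, total = start, 0
    for cyl in requests:
        total += abs(cyl - pos)
        pos = cyl
    return total

def elevator_distance(start, requests):
    """Total head travel for a simplified elevator (SCAN-style) pass:
    serve everything above the head going up, then everything below
    coming back down."""
    up   = sorted(c for c in requests if c >= start)
    down = sorted((c for c in requests if c < start), reverse=True)
    return fcfs_distance(start, up + down)

pending = [950, 30, 900, 80, 870, 10]    # hypothetical cylinder numbers
print(fcfs_distance(500, pending))       # 4710 cylinders of head travel
print(elevator_distance(500, pending))   # 1390 cylinders of head travel
```

The same six requests cost more than three times the head travel when served strictly in arrival order, which is the contrast being argued about.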


>* In OFA you are spreading the datafiles across the disk bank and so thinning
> the load on any single controller (on average).
>

Also, fragmentation/creation of tables within 'dbspaces' (collections
of chunks of raw disk) allows spreading of databases/tables across
disks. This gives a clearer indication of disk layout to the
database engine (since raw devices are contiguous across the disk).

>This is comparing UNIX disk / filesystem algorithms versus raw devices as-is.
>(i.e. using basic character I/O control)
>But in the case of Sequent systems using ptx/SVM is it not a different issue?
>

Running an Informix OnLine v7 database against a raw device provides
all of the benefits without a device manager layer.

Also, OnLine has better knowledge of where the table is stored when
doing read-ahead (if the table is not contiguous across the disk,
then an operating system doing sequential read-ahead may well read
areas of the disk which are then not used). Using a raw disk
overcomes this problem.

How can the UNIX disk/filesystem algorithms be better when all they
see is a large file rather than the layout of tables/indices?

Also, the database engine can schedule high-priority disk I/O (e.g.
transaction-log I/O) ahead of, say, a search on a table.
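
The kind of priority queue an engine might use for this can be sketched with Python's heapq. The priorities and request strings are invented; this is not Informix's or Oracle's actual I/O scheduler:

```python
import heapq

LOG, SCAN = 0, 1                    # lower number = higher priority
queue = []
seq = 0                             # tie-breaker keeps FIFO order within a class

def submit(priority, request):
    """Queue an I/O request; the heap orders by (priority, arrival)."""
    global seq
    heapq.heappush(queue, (priority, seq, request))
    seq += 1

submit(SCAN, "read table page 1")
submit(SCAN, "read table page 2")
submit(LOG,  "flush log buffer")    # arrives last, served first

served = [heapq.heappop(queue)[2] for _ in range(len(queue))]
print(served)  # ['flush log buffer', 'read table page 1', 'read table page 2']
```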


>Raw device (/dev/rvol/,,,) read and writes go through the ptx/SVM layer
>which I thought introduced the same kinds of I/O optimizers as the filesystem
>(e.g. the elevator algorithm).
>It implements striping, and also balances the I/O load between mirrored copies
>(i.e. goes to the least busy side of the mirror for read requests).
>

OnLine can also implement striping (via table fragmentation) and
balance I/O load between mirrored copies using Informix mirroring.

>Is it not the case that raw volumes and raw volumes under SVM are very different
>in their performance capabilities?
>What about a new comparison based on the SVM enhanced raw volume?
>
>The post that spoke of a customer considering Oracle Parallel Server would
>clearly be referring to a system with SVM, but the arguments being made are
>not accounting for SVM, just dumb raw devices using basic stream I/O control.
>

The database engine using raw devices should be faster than using
filesystems, given that the filesystem code becomes just an extra,
'less intelligent' layer sitting between the database engine and the
disk.

>Clarification from a guru requested please ...
>
>Regards, Steve Woolston. (woolsts@norwich_union.co.uk)

Sorry it's been so long since my last post but I've had flu and have
just recovered. Also sorry if this sounds like an advert for Informix
but it's the database engine I know best. What do you mean, Oracle
doesn't do all of the above? ;-> Answers on an e-mail please.

--
David Williams

Jef Kennedy

Jul 24, 1996, 3:00:00 AM

In article <EiAXoAAcT$8xE...@smooth1.demon.co.uk>, David Williams
<d...@smooth1.demon.co.uk> wrote:

> Generally it is the disk drive controllers themselves which have the
> elevator seek algorithm built into them, and the database engine has
> this built in as well. Why waste time doing it an extra time?

Good point. If I remember right, the Oracle kernel has an elevator
algorithm built in, and it was useful when disk controllers weren't doing
the job. Nowadays, it doesn't make a lot of sense for the kernel to sort
the blocks going to individual drives since the disk controller can do a
better job of it. The controller knows where the head really is and sees
all the traffic going to the drive, not just dbwr's writes. I believe
most Oracle ports now have that Oracle elevator algorithm turned off.

FWIW.

-Jef

Eran Shtiegman

Jul 26, 1996, 3:00:00 AM

To Whom it may concern:

We have been evaluating Oracle Objects for OLE to use instead of
ODBC in our Client/Server application. This is a realtime
system we are developing in C++/MFC. I have been trying to get
information about Oracle Objects for OLE for some time now
without success. There does not seem to be many people using it
with C++. If you could answer some questions for me I would
greatly appreciate it:

1. Supposedly Oracle in Israel has told us version 2 is out, but on
the server I can't find anything indicating this fact. Is it out,
and if not, when will it be out?

2. Are there any differences between 2.0 and the 2.0 beta that we
downloaded from www.oracle.com?

3. In the class library there is no information about date formats.
From browsing the code I see that there are some documented
bugs concerning dates. I believe that dates are stored in a
variant type internally, so why can't I get the date in a variant
data type, or OLEDateTime? There is no documentation stating in
what format I receive dates or how to enter them (with
GetFieldValue and SetFieldValue). What about time? How do I get
times without doing a ToChar in my select statement? In other
words, will 2.0 support some sort of date format, and if not, what
is the best way to overcome this annoying situation?


Thanking you,

Eran Shtiegman

Willy Klotz

Jul 26, 1996, 3:00:00 AM

Thank you for the very good explanation of how the filesystem works.

pz...@us.oracle.com (Paul Zola ) wrote:

>} In article <4rpsh3$s...@inet-nntp-gw-1.us.oracle.com>, Paul Zola
>} <pz...@us.oracle.com> writes
>} >
>[snip]
>} >(3) Under very common circumstances, going to raw devices can actually
>} > *decrease* database performance.
>}
>} ??? Explain - Not going with the UNIX buffer cache and also copying
>} (DMAing) directly into user space rather than via kernel space i.e.
>} one memory write rather than a memory write and a memory copy is
>} SLOWER??

>Flippant answer: Trying to reduce expected disk read times by
> reducing the number of memory copies is like trying to optimize a
> program by moving code from the initialization modules into the
> main loop.

This sentence makes the point. If one goes to raw devices without
reallocating memory from the (now unused) filesystem cache to the
database, performance will simply get much worse.

>How much time did this take to complete? Well, since raw device
>accesses are scheduled in the order in which they occur, this disk had
>to seek almost from one end to the other 8 times, for a total of 800
>ms.

You talked about the UNIX kernel, which is optimizing I/O. Is there
no optimization for raw devices, only for filesystems?

Oracle also schedules multiple reads/writes on one raw device - why
should these be served sequentially?

>Now, let's consider the same access pattern on a filesystem file, using
>the buffer cache. Assuming that these accesses came pretty quickly,
>they could all be handled in 3 scans of the request queue (2 going up,
>2 going down) for a total of 300 ms. (Note that since UNIX has a
>write-back cache, the final write didn't necessarily force a write to
>disk: yet another optimization made possible by using filesystem
>files.)

>Result? Filesytem wins by more than 100% over raw device.

This is only true when watching accesses for one single disk drive.
If a system constantly has this type of I/O, then one should
consider spreading I/Os across several disks.
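
Replaying the numbers from the quoted example makes both points concrete. The ~100 ms full-stroke seek and the 8-seek / 3-sweep counts come from the post above; the four-spindle split is a hypothetical best case for Willy's counter-argument, assuming the accesses divide evenly:

```python
# All figures are illustrative, taken from or extrapolating the
# quoted back-of-the-envelope example; none are measurements.

full_stroke_ms = 100

raw_fcfs    = 8 * full_stroke_ms          # 8 end-to-end seeks in arrival order
fs_elevator = 3 * full_stroke_ms          # 3 sweeps of the request queue
raw_striped = (8 // 4) * full_stroke_ms   # same work spread over 4 spindles

print(raw_fcfs, fs_elevator, raw_striped)   # 800 300 200
```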

>But we've just seen that it is. Let me emphasize: whether raw or
>filesystem devices will be faster depends DIRECTLY on the disk access
>patterns of *your* *specific* *application*. The exact same schema
>could be faster on one or the other, depending on what your users are
>doing with it.

As you say, it depends on the application. Your statements are
correct, especially if only the database is running on the machine,
serving clients.

If you also have application code running on the machine, then the
situation is totally different. You also have to consider the load
(and type of load) the programs and users put on the system - if
they use the filesystem heavily, the cache "competes" between
user files and database files.

>This is not to say that using raw devices will always be slower. But
>they are far from the "magic bullet" that lots of DBAs seem to think
>they are.

>And I'll say it again:

>} >(5) Anyone who does not have the time, expertise, or resources to
>} > perform the raw-vs-filesystem benchmark *should* *not* *consider*
>} > using raw devices.

>==============================================================================


>Paul Zola Technical Specialist World-Wide Technical Support
>------------------------------------------------------------------------------
>GCS H--- s:++ g++ au+ !a w+ v++ C+++ UAV++$ UUOC+++$ UHS++++$ P+>++ E-- N++ n+
> W+(++)$ M+ V- po- Y+ !5 !j R- G? !tv b++(+++) !D B-- e++ u** h f-->+ r*
>==============================================================================
>Disclaimer: Opinions and statements are mine, and do not necessarily
> reflect the opinions of Oracle Corporation.

Willy Klotz

======================================================================
Willys Mail FidoNet 2:2474/117 2:2474/118

Mailbox: analog 06297 95035
ISDN 06297 910105
Internet: wil...@t-online.de

Scott Householder

Jul 29, 1996, 3:00:00 AM

Jef Kennedy (jken...@oracle.com) wrote:
: In article <EiAXoAAcT$8xE...@smooth1.demon.co.uk>, David Williams
: <d...@smooth1.demon.co.uk> wrote:

[ Following based on this thread already being cross-posted: comp.sys.sequent ]

The thing I've not heard anywhere in this entire discussion is mention of
"Direct I/O", which has been available under Dynix/ptx since v. 1.x (even
though it was not in the on-line man pages til ptx 4.x). Direct I/O, or
DIO, allows one to use a file on a filesystem like you would a raw device;
that is, DIO supports "IO directly from a processes [sic] address space
to a file, bypassing any buffering by the DYNIX/ptx kernel" (quoting from
the DIO(2SEQ) man page on ptx 4.2.0).

We were told (by consultants from Oracle Corp., as I remember) as
early as 1993 that if the DB engine utilized DIO to its data files,
there was no more than a 5% performance gain in switching to raw
data files. This gives you the best of both worlds: the performance
gain of bypassing the *NIX buffer cache, but the ease of
administration of data files on mounted filesystems.

Has this changed at all in the last 3 years?
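
For those without access to a ptx box: Linux's O_DIRECT open flag is the closest widely available analogue of the DIO being described (a comparison by assumption; ptx's DIO(2SEQ) interface differs in detail). The sketch below shows the alignment bookkeeping that direct I/O forces on the caller; the 4096-byte alignment is an assumed value:

```python
import os

ALIGN = 4096   # assumed alignment; real code should query the device

def aligned_length(nbytes, align=ALIGN):
    """Round a transfer size up to the alignment direct I/O requires."""
    return (nbytes + align - 1) // align * align

# O_DIRECT is Linux-specific, so fall back to 0 where it is absent.
# Reads and writes through such a descriptor bypass the kernel buffer
# cache and go straight to the process's address space, which is the
# DIO behaviour quoted from the man page above.
flags = os.O_RDONLY | getattr(os, "O_DIRECT", 0)

print(aligned_length(100))    # 4096
print(aligned_length(8192))   # 8192
```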

M1069B00

Aug 1, 1996, 3:00:00 AM

In <4tacun$j...@news00.btx.dtag.de>, wil...@t-online.de (Willy Klotz) writes:
>You talked about the UNIX kernel, which is optimizing I/O. Is there no
>optimization for raw devices, only for filesystems?
> <snip>

>>Result? Filesytem wins by more than 100% over raw device.
>
>This is only true watching accesses for one single disk drive. If a
>system constantly has this type of I/O, then one should consider to
>spread I/Os across several disks.

There is a difference between a "raw disk partition" and a "striped
SVM-managed raw volume".
I tried to make this point a while ago but no one took it on board.
I'm not as deep a guru as some of the contributors, and I follow the
arguments about why f/s is better than raw, but no one seems to
have brought SVM and raw-volume striping into the picture.
It is my belief that the SVM layer does optimise raw device I/O, and
with true datafile-level striping, performance is certainly improved
over the "dumb" raw-volume scenario.

