
Raw vs. Cooked file system


Parris Geiser

Oct 24, 1995
In a database newsgroup there is a debate about whether a raw file
is better than a cooked file (normal UNIX file) from a performance
point of view. I thought the answer was obvious that raw files
are better (for large files, e.g., 1g) since UNIX regular file
access would need several levels of indirection to get to the data.
But the debate rages on. Does anyone in this group understand the
difference between the two well enough to enlighten me?

Thanks,
parris geiser

Alan Peery

Oct 24, 1995
You sound like you already understand the issues. What happens after
that is a matter of implementation...

Alan

Chris Howard

Oct 25, 1995
In article <46jarq$k...@news.iastate.edu>, par...@walleye.esp.bellcore.com
says...

I'll take a stab at it, knowing that I'll probably miss something,
and hoping that others will be prompted to straighten me out should
that happen.

From the application program's point of view, both raw and cooked
files are a stream of bytes. The performance issues involve what happens
in the operating system when you perform reads/writes on these files.
A cooked file is accessed through the mechanics of the filesystem, including
v-nodes, i-nodes, super-blocks and whatever. For large files this may
involve multiple indirection, forcing some number of logical disk reads
before you actually get to the data part. There also may be journaling
or other performance issues depending on the way the filesystem is built.
A raw file system uses a different mechanism to access the data blocks.
On most systems a raw file will be accessed in a more direct fashion,
avoiding the overhead of i-nodes and indirection issues, or at least
moving those issues into the realm of the application. Since the filesystem
is built to handle a relatively small average file size (maybe a few
kbytes) the application can probably do a better job of managing the
data in a large file.
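
To put a number on that indirection, here is a sketch in C with assumed
constants (a classic FFS-style inode with 12 direct pointers, 4 KB blocks,
and 4-byte block pointers; real filesystems vary) of how many extra
metadata reads are needed to reach a given file offset:

```c
/* Sketch: extra metadata reads an FFS-style inode layout needs to
 * locate the data block at a given offset. The constants (12 direct
 * pointers, 4 KB blocks, 4-byte block pointers) are assumptions for
 * illustration, not a description of any one filesystem. */
#include <assert.h>
#include <stdio.h>

#define BLOCK_SIZE   4096UL
#define NDIRECT      12UL
#define PTRS_PER_BLK (BLOCK_SIZE / 4)   /* 4-byte block pointers */

/* Levels of indirection needed to locate the block holding `offset`. */
int indirection_levels(unsigned long offset)
{
    unsigned long blk = offset / BLOCK_SIZE;
    if (blk < NDIRECT)
        return 0;                                /* direct pointer */
    blk -= NDIRECT;
    if (blk < PTRS_PER_BLK)
        return 1;                                /* single indirect */
    blk -= PTRS_PER_BLK;
    if (blk < PTRS_PER_BLK * PTRS_PER_BLK)
        return 2;                                /* double indirect */
    return 3;                                    /* triple indirect */
}
```

By this arithmetic the tail of a 1 GB file sits behind the double-indirect
block: two metadata reads before the data block itself, which is exactly
the overhead a raw partition avoids.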

On some operating systems (AIX 3.x, for instance), I understand that raw
file accesses are performed through a mechanism that maps the file into
virtual memory in a way that defeats the whole point of avoiding the
indirection overhead.

Another factor is data block caching. The OS does data caching for
filesystem operations. In some cases you don't go all the way
to the disk because what you need is in the cache in memory.
With raw files, the OS is out of the picture and the application
handles data caching. If the machine is being used mainly as
a database platform, the application will be doing most of the
I/O and disk accesses can be cached/optimized by the database code.
If the machine is being used for multiple tasks, it can be more
efficient to let the OS handle all disk I/O through the filesystem
mechanisms because disk accesses can be cached/optimized on a more
system-wide scale. For a large multi-disk database, this would
probably be a lesser factor in the overall performance picture.

In my experience, with Oracle and Sybase, on systems other than AIX,
raw file access is the way to go for best performance. (Of course,
it depends on the size of your data files. And always with the
caveat that the search is reasonably efficient. A bad query
is a bad query no matter what your disk performance looks like.)

--
Chris Howard
Systems Support Specialist
Automated Systems Division
Iowa State University Library


Richard L. Hamilton

Oct 25, 1995
In article <46jarq$k...@news.iastate.edu>,

Parris Geiser <par...@walleye.esp.bellcore.com> wrote:
>In a database newsgroup there is a debate about whether a raw file
>is better than a cooked file (normal UNIX file) from a performance
>point of view. I thought the answer was obvious that raw files
>are better (for large files, e.g., 1g) since UNIX regular file
>access would need several levels of indirection to get to the data.
>But the debate rages on. Does anyone in this group understand the
>difference between the two well enough to enlighten me?
>

A character special file equating to a raw disk partition is better
when cranking through large amounts of data that is only accessed
by a single program (whether that's BSD-style dump program, dd to
do image copies, or a database server, if it supports that), because
the data is not buffered at all. Indeed, where possible, the user-supplied
buffer (2nd arg to read() or write()) is locked down and a DMA transfer
directly into there is done, after which that memory is unlocked unless
the user had an explicit lock-down of that memory (via plock() or whatever).

In that case, either one is doing a sequential access, thereby not needing
any buffering (dd, for example) or the program should be doing smart
application-optimized buffering of its own to avoid repeatedly reading the
same disk sector as much as possible.

Accesses by a single program are also better done via a raw partition
because one isn't competing for the system disk buffer pool then,
reducing its availability for programs that could benefit from it.
And programs that try to minimize seeks (like a database server might)
also benefit, *provided* they are the only program accessing any of
the partitions on that disk. Sybase at least used to prefer that
raw partitions used by their databases were also on the only disk
on a particular controller, so as to reduce (for example) SCSI bus
contention. But in the real world that sometimes gets a little expensive.

But accesses to a raw partition are constrained to be in sector-size
multiples, and sector aligned in the file (and on some systems there
may be memory alignment constraints as well).

For more typical usage (say /etc/passwd; on systems w/o NIS or some similar
service, that is used heavily by even "ls -l") multiple programs may be reading
the same file, and therefore the system disk buffer pool helps avoid
subsequent disk reads of the same data (which is what a buffer or cache
is generally supposed to do).

So I'd say if it's a large amount of data only being read by one program,
and that program can live with the size and alignment constraints on
I/O to special files, *and* you can reasonably use a raw partition (not
fun if you have to back up and reload to free one up), then do so. But
that's likely to be the exception rather than the rule.

There is actually a way to use regular files almost as efficiently as
raw partitions (on systems that support it): mmap(). That provides for
sharing in-memory *and* direct DMA I/O that bypasses normal buffering
mechanisms. Since where supported, mmap() should also work on raw partitions
(and can be helpful in managing buffering, since you could have multiple
mappings of the same file at once), it might be advantageous whether one
was using a raw partition or a regular file. But mmap() has the additional
constraint that I/O has to be aligned in both memory and file to page
sized (usually at least 1, 2, or 8K rather than 512 byte sector size)
chunks. mmap() uses virtual memory paging mechanisms, which tend to
be fairly efficient, although real-time processes should be aware that
the I/O probably takes place when accessing the memory rather than when
establishing the mapping (demand paging). And mmap() no more addresses
the issue of disk or controller bus contention in and of itself than
using a raw partition on a disk with other partitions does, i.e. it
doesn't. Also, in general one would be wise not to mix regular
(read()/write()) and mmap()'d I/O to the same file; efficiency aside,
not all systems will make any effort to keep the buffer cache synchronized
with mmap()'d pages.
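
A minimal sketch of the mmap() approach (the helper name is invented, and
error handling is pared down to the essentials):

```c
/* Map the first `len` bytes of an open file read-only. Note that the
 * disk I/O happens lazily, as pages are first touched (demand paging),
 * not at mmap() time. */
#include <assert.h>
#include <stdlib.h>
#include <string.h>
#include <sys/mman.h>
#include <unistd.h>

/* Returns a pointer to the mapped bytes, or NULL on failure. */
char *map_file(int fd, size_t len)
{
    void *p = mmap(NULL, len, PROT_READ, MAP_SHARED, fd, 0);
    return p == MAP_FAILED ? NULL : (char *)p;
}
```

The mapping length is rounded up to a whole page inside the kernel, which
is where the page-size alignment constraint above comes from; munmap(p, len)
releases the mapping when you are done.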

If one is only porting to systems that decently and uniformly support
mmap(), all other things being equal I'd probably regard that as first
choice. Unfortunately, older (esp. many SysV <= SVR3 and other non-BSD)
systems don't support mmap() at all, and not all support some of the more
exotic mmap() flags or features of related functions (like madvise()).
So for maximum portability, investigate whether all desired features of
mmap() and related functions are available on all potential target
platforms *before* committing to that design.

One other advantage of a raw partition over even mmap() of a regular
file comes to mind: no indirect blocks, and everything is contiguous,
so in principle in the case of a partition on a dedicated disk and controller,
all I/O operations are equally without overhead and should complete within a
calculable interval. So if the constraints of I/O on raw devices are
acceptable and realistic, and performance is at a premium, use a raw
partition regardless (although if available mmap() there should enjoy those
advantages too).

Any of these approaches should benefit some from RAID, although
non-sequential I/O might get a somewhat greater benefit, depending somewhat
on how the disks were interleaved with each other.

The only end-of-file on a partition (block or character) is at
end-of-medium (when one just read the last sector of the partition). So a
database program would need at least a 32-bit length header on whatever it
kept in a partition, that strictly it wouldn't need (although as a safety
check it wouldn't hurt) when writing to a regular file.

Since all it takes is the 32 bit length header (64-bits on systems with
extended maximum possible filesizes, although even some of them won't
support file sizes greater than 32 bits *except* on a device (i.e. the
filesystem can't handle that for regular files)), if the alignment constraints
are acceptable, the program should work on either a raw partition or
a regular file, letting the user/administrator choose. If portability must
be maximized, yet the various other constraints of mmap() are acceptable,
and performance with even regular files is at an absolute premium, then
write your code to use both read()/write() and mmap()/munmap(), and #ifdef
it to compile either way. Expensive, but if you're that worried about
portability and performance you'll undoubtedly have a fair number of #ifdefs
anyway.
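
A sketch of that #ifdef arrangement (load_record() and the
uint32_t-length-header record format are hypothetical, following the
layout discussed above, not any vendor's format; build with -DUSE_MMAP to
get the mmap() path):

```c
/* Build-time switch between read() and mmap() behind one interface.
 * The record is assumed to be a uint32_t length header at offset 0
 * followed by that many bytes, per the discussion above. */
#include <assert.h>
#include <stdint.h>
#include <stdlib.h>
#include <string.h>
#include <unistd.h>
#ifdef USE_MMAP
#include <sys/mman.h>
#endif

/* Load the record at the start of fd into a malloc'd, NUL-terminated
 * buffer; NULL on error. The caller frees the result. */
char *load_record(int fd)
{
    uint32_t len;
    /* The length header is read conventionally in both builds. */
    if (lseek(fd, 0, SEEK_SET) < 0 ||
        read(fd, &len, sizeof len) != (ssize_t)sizeof len)
        return NULL;
    char *buf = malloc(len + 1);
    if (buf == NULL)
        return NULL;
#ifdef USE_MMAP
    char *map = mmap(NULL, sizeof len + len, PROT_READ, MAP_SHARED, fd, 0);
    if (map == MAP_FAILED) { free(buf); return NULL; }
    memcpy(buf, map + sizeof len, len);
    munmap(map, sizeof len + len);
#else
    if (read(fd, buf, len) != (ssize_t)len) { free(buf); return NULL; }
#endif
    buf[len] = '\0';
    return buf;
}
```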

Interbase doesn't seem to worry much about using raw partitions, but at
least Sybase prefers them. I haven't seen a recent version of Ingres,
so I can't say anything nice (or relevant) about them, and I haven't
glanced for more than 15 seconds at Oracle, so I have no idea what
approach(es) they use. I think Interbase does use mmap() where available.
So most of the issues I've mentioned (and possibly more) are probably
common considerations in DBMS design. Certainly at least some of
the vendors have implementations that are #ifdef'd to provide performance
features optimized for various platforms.

Strictly, the subject is a misnomer, since a file system used as such
by the system (i.e. mounted) *must* reside on a block special device.
Even remotely mounted NFS filesystems are often regarded as being on
a sort of fictional block device, for implementation reasons.

If I missed something, don't hesitate to jump in. I hope at least
I didn't miss anything too obvious, but you never know. :-) I think
I got most of the issues covered, but aside from the general statements
above, I couldn't say much about how to apply those issues to specific
situations. Profiling and performance measurement are always good to do,
even if they are a pain.
--
I compute, therefore I am.
My opinions are strictly my own, and should not be construed to represent
anyone else.

Eric E. Bardes

Oct 26, 1995
Parris Geiser (par...@walleye.esp.bellcore.com) wrote:
> In a database newsgroup there is a debate about whether a raw file is
> better than a cooked file (normal UNIX file) from a performance point
> of view.

The best answer, although useless, is "it depends."

The debate is fueled by advances in filesystem technology. Modern
versions of OSs can issue low-priority requests to the disks before the
process requests the data, by predicting what the process will ask for
next. At the core of these predictions is knowledge of the cooked
filesystem.

Let's go through a few permutations of possibilities.

1. If the DBMS is rather primitive and stores information for each table
   in separate files, then yes, cooked files are better because the OS
   will provide a good cache for your data. However, I suspect from the
   context of the question that this case doesn't apply to the debate
   you are referring to.

2. The DBMS likes to manage all database activity in one large file (or
in a few large files) and use a large chunk of shared memory for
caching data.

   In this case, the use of raw files is better because you avoid the
   problem of double caching. The problem is that the first cache,
   provided by the DBMS, will only fall through to the second cache,
   provided by the OS, when the first cache is exhausted. So the odds of
   finding the data the DBMS wants in the second cache are very small.

3. Some operating systems "pre-cook" their raw devices. AIX, HP-UX,
   and others have a Logical Volume Manager (if not by that name).
   These often have a certain amount of cache attached to them and may
   grow by adding more disks.
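
The double-caching problem in case 2 can be shown with a toy simulation
(all sizes and the access pattern are invented): a 32-entry application
LRU cache stacked above an 8-entry OS LRU cache, with uniform access over
64 pages. The OS cache sees only the application's misses, which are by
construction the least-recently-used pages, so it almost never hits:

```c
/* Toy simulation of double caching: an application-level LRU cache sits
 * above an OS-level LRU cache, and the OS cache only ever sees the
 * application's misses. Cache sizes and the access pattern are invented
 * for illustration. */
#include <assert.h>
#include <stdio.h>
#include <string.h>

#define NPAGES 64
#define NACC   100000

typedef struct { int slot[32]; int size; } lru_t;

void lru_init(lru_t *c, int size)
{
    memset(c->slot, 0xff, sizeof c->slot);   /* empty: all slots -1 */
    c->size = size;
}

/* LRU lookup: returns 1 on a hit. Either way the page becomes most
 * recently used; on a miss the least recently used entry is evicted. */
int lru_access(lru_t *c, int page)
{
    int i, hit = 0;
    for (i = 0; i < c->size; i++)
        if (c->slot[i] == page) { hit = 1; break; }
    if (!hit)
        i = c->size - 1;                     /* evict the LRU entry */
    memmove(&c->slot[1], &c->slot[0], i * sizeof(int));
    c->slot[0] = page;
    return hit;
}

/* Uniform random accesses through a 32-entry app cache stacked on an
 * 8-entry OS cache; reports the hit rate of each. */
void simulate(double *app_rate, double *os_rate)
{
    lru_t app, os;
    lru_init(&app, 32);
    lru_init(&os, 8);
    unsigned s = 12345;                      /* simple LCG, fixed seed */
    long app_hits = 0, os_hits = 0, os_seen = 0;
    for (long n = 0; n < NACC; n++) {
        s = s * 1103515245u + 12345u;
        int page = (int)((s >> 16) % NPAGES);
        if (lru_access(&app, page)) { app_hits++; continue; }
        os_seen++;                           /* OS sees only app misses */
        os_hits += lru_access(&os, page);
    }
    *app_rate = (double)app_hits / NACC;
    *os_rate = os_seen ? (double)os_hits / os_seen : 0.0;
}
```

With these numbers the application cache hits roughly half the time,
while the OS cache's hit rate on the traffic it actually sees collapses
toward zero; real DBMS and OS caches are far larger, but the effect is
the same.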

I hope this helps.

--
Take Care, Eric                     ,,,
http://www.carsinfo.com/~eric/      (o o)
------------------------------oOOo-(_)-oOOo--------------------------------
'I bought a box of animal crackers. It said, "Do not eat if seal is
broken." Sure enough...' -- Bryant Kiely

Neal P. Murphy

Oct 27, 1995
par...@walleye.esp.bellcore.com (Parris Geiser) writes:

>In a database newsgroup there is a debate about whether a raw file
>is better than a cooked file (normal UNIX file) from a performance

>point of view. I thought the answer was obvious that raw files
>are better (for large files, e.g., 1g) since UNIX regular file
>access would need several levels of indirection to get to the data.
>But the debate rages on. Does anyone in this group understand the
>difference between the two well enough to enlighten me?

I've investigated this in the past. The debate probably rages over using
a raw (character mode) disk device (slice/partition) or the block mode disk
device (slice/partition).

A raw device is not buffered. When you read from the device, you are
actually reading from the device, not a buffer cache. Similarly, when
you write to the device, you are writing to the device, not a cache.
An example of a raw device on SunOS 4.1.x is:
0 crw-r-----  1 root      17,   0 May  7 15:40 /dev/rsd0a
  ^
  |
  character (raw) mode

A block disk device is usually buffered by the OS. Thus your reads and
writes are made to and from cached blocks; the OS takes care of filling
the read cache and flushing the write cache for you. An example of a
block device on SunOS is:
0 brw-r-----  1 root       7,   0 Jun 20  1994 /dev/sd0a
  ^
  |
  block (cooked) mode

Note that in neither case is a filesystem involved. Using a filesystem to
hold the DB is almost guaranteed to slow down the disk I/O (due to the
extra layer of code executed to store and read the data).

The debate rages because each and every system is different. The hardware
differs and the OS differs. I've seen some OSs where I/O on character
devices is double that of block devices, and I've seen other systems where
the block I/O is far quicker than the raw I/O.

Personally, I let the debate rage and ignore it. When I am going to set up
a DB system, I find out what block transfer size the DB system uses (e.g.,
Informix 4.1 used 2KB blocks). I then run performance checks on the target
hardware to determine which mode, character or block, provides the higher
throughput using the specified block size, and use it.
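
The inner loop of that measurement might look like the sketch below (the
2 KB transfer size follows the Informix 4.1 example above; sweep() is an
invented name):

```c
/* Benchmark kernel: sequentially read everything on fd in chunks of the
 * DBMS's transfer size. Run against each candidate (regular file, block
 * device, raw device) and compare bytes per second. */
#include <assert.h>
#include <stdlib.h>
#include <string.h>
#include <unistd.h>

#define BLKSZ (2 * 1024)   /* assumed DBMS transfer size (Informix 4.1) */

/* Returns total bytes read, or -1 on error. */
long sweep(int fd)
{
    char buf[BLKSZ];
    long total = 0;
    ssize_t n;
    if (lseek(fd, 0, SEEK_SET) < 0)
        return -1;
    while ((n = read(fd, buf, BLKSZ)) > 0)
        total += n;
    return total;
}
```

Time sweep() with gettimeofday() or time(1) on each candidate, and take
care that a warm buffer cache doesn't flatter the block-device run (for
example, use a data set larger than memory), since the raw device gets no
such help.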

Of course, this has addressed performance only. If reliability is also a
concern, then one might not want to use a block device: actual writes to
disk are delayed, so if the system should crash, data could be lost.

I hope I've been able to shed some light on the subject.

Fest3er
