
Linux ext2fs vs. ufs vs. presto [was Re: Fast File System?]


Case Larsen

Jun 3, 1994, 3:49:42 PM

Thanks very much for taking the time to run those tests. It is also
interesting to see what happens when less data is written out, e.g. do
1000 or 4000 files instead of 8000. This shows any advantage gained by
delaying I/O. E.g. for 1000 files of 4k (4MB total), you may not see
any writes during the initial creation, whereas with 8000 4k files (32MB),
you may see quite a few writes. The 8000-file numbers would be
a measure of sustained performance (e.g. a NetNews hub, file serving),
while the 1000- or 500-file numbers would be a measure of interactive
performance: typing at the keyboard, untarring a bunch of files (< a few
MB worth of files), or doing a compile (creating < a few MB worth of files).
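
For concreteness, here is a minimal sketch of a file-creation benchmark of this
kind (a reconstruction for illustration only, not the program actually used;
NFILES, SIZE and the file names are made up):

        /* Create NFILES files of SIZE bytes each and report files/second. */
        #include <stdio.h>
        #include <stdlib.h>
        #include <string.h>
        #include <fcntl.h>
        #include <unistd.h>
        #include <sys/time.h>

        #define NFILES 1000
        #define SIZE   4096

        int main(void)
        {
                static char buf[SIZE];
                char name[64];
                struct timeval t0, t1;
                double secs;
                int i, fd;

                memset(buf, 'x', sizeof buf);
                gettimeofday(&t0, NULL);
                for (i = 0; i < NFILES; i++) {
                        sprintf(name, "bench.%d", i);
                        fd = open(name, O_WRONLY | O_CREAT | O_TRUNC, 0644);
                        if (fd < 0 || write(fd, buf, SIZE) != SIZE) {
                                perror(name);
                                exit(1);
                        }
                        close(fd);
                }
                gettimeofday(&t1, NULL);
                secs = (t1.tv_sec - t0.tv_sec) + (t1.tv_usec - t0.tv_usec) / 1e6;
                printf("%d byte/files:%d files:%f files/second\n",
                       SIZE, NFILES, NFILES / secs);
                return 0;
        }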

For AdvFS, numbers for 8k and above seem pretty good:
8k files can be written at about 960k/sec
1.3MB/sec for 16k
up to 1.75MB/sec for 64k

I'm interested in seeing numbers for a smaller set, i.e. 500 and 1000
files. Linux, for example, can create 8 KB files at 127 files/sec if it
only has to do 500 of them, vs. 25 files/sec on the SparcServer
1000. The addition of NVRAM on the SparcServer boosts that to 266
files/sec.

I tested two machines with this 'benchmark':

Machine #1:
OS: Linux
CPU: 486DX/66 (66MHz) vesa local bus
Filesystem: Linux ext2fs
Main memory: 16MB
Controller: Adaptec 1542
Disk: ST11200N, 3500KB/sec b/w
Stats: 10MB/sec main memory bandwidth, 1.7MB/sec (sequential writes)
to disk through controller.

Machine #2:
OS: Solaris 2.3
CPU: SuperSparc SuperCache 50MHz (SparcServer 1000)
Filesystem: UFS
Main memory: 128MB
Controller: Onboard SCSI2-fast
Disk: ST31200N, 4100KB/sec
Stats: 40MB/sec main memory bandwidth, 4.1MB/sec (sequential writes)
to disk through controller
Configuration #1: no write accelerator
Configuration #2: 4MB NVSIMM write accelerator (aka. Presto)

All in all, the tests say that Linux ext2fs can outperform stock UFS
on slower hardware by a significant margin (factor of 2 to 8 for 1k to 8k
files) for non trivial numbers (500-4000 files).

-Case Larsen
ctla...@lbl.gov
----long list of numbers to follow---

For example, on Linux (486DX/66 with a ST11200N disk):
0 byte/files:100 files:602.104959 files/second
0 byte/files:500 files:342.995282 files/second
0 byte/files:1500 files:141.095425 files/second
0 byte/files:2000 files:113.439649 files/second
0 byte/files:4000 files:57.212904 files/second
0 byte/files:8000 files:28.754965 files/second
1024 byte/files:100 files:722.830605 files/second
1024 byte/files:500 files:357.538704 files/second
1024 byte/files:1500 files:142.382291 files/second
1024 byte/files:2000 files:106.508574 files/second
1024 byte/files:4000 files:56.045037 files/second
1024 byte/files:8000 files:28.125228 files/second
2048 byte/files:100 files:465.224471 files/second
2048 byte/files:500 files:294.299767 files/second
2048 byte/files:1500 files:117.473520 files/second
2048 byte/files:2000 files:95.386166 files/second
2048 byte/files:4000 files:51.520444 files/second
2048 byte/files:8000 files:26.732482 files/second
4096 byte/files:100 files:420.636423 files/second
4096 byte/files:500 files:274.271220 files/second
4096 byte/files:1500 files:93.512741 files/second
4096 byte/files:2000 files:81.971355 files/second
4096 byte/files:4000 files:46.101395 files/second
4096 byte/files:8000 files:25.154955 files/second
8192 byte/files:100 files:369.665268 files/second
8192 byte/files:500 files:126.773498 files/second
8192 byte/files:1500 files:71.998705 files/second
8192 byte/files:2000 files:60.053802 files/second
8192 byte/files:4000 files:39.126569 files/second
8192 byte/files:8000 files:23.504405 files/second
16384 byte/files:100 files:224.498134 files/second
16384 byte/files:500 files:57.477098 files/second
16384 byte/files:1500 files:44.398989 files/second
16384 byte/files:2000 files:38.046475 files/second
16384 byte/files:4000 files:27.451643 files/second
65536 byte/files:100 files:19.297163 files/second
65536 byte/files:500 files:15.253890 files/second
65536 byte/files:1500 files:14.030679 files/second
131072 byte/files:100 files:8.389856 files/second
131072 byte/files:500 files:7.600187 files/second


On SparcServer 1000, 128MB memory, no write accelerator, normal UFS,
ST31200N disk,

0 byte/files:100 files:44.312276 files/second
0 byte/files:500 files:43.757529 files/second
0 byte/files:1500 files:40.641805 files/second
0 byte/files:2000 files:39.001449 files/second
0 byte/files:4000 files:32.233209 files/second
0 byte/files:8000 files:25.087129 files/second
1024 byte/files:100 files:44.269825 files/second
1024 byte/files:500 files:43.073898 files/second
1024 byte/files:1500 files:31.713670 files/second
1024 byte/files:2000 files:28.782177 files/second
1024 byte/files:4000 files:23.514462 files/second
1024 byte/files:8000 files:19.517980 files/second
2048 byte/files:100 files:40.950209 files/second
2048 byte/files:500 files:42.776448 files/second
2048 byte/files:1500 files:32.522234 files/second
2048 byte/files:2000 files:28.408258 files/second
2048 byte/files:4000 files:22.568856 files/second
2048 byte/files:8000 files:19.228756 files/second
4096 byte/files:100 files:44.239762 files/second
4096 byte/files:500 files:40.796338 files/second
4096 byte/files:1500 files:29.064245 files/second
4096 byte/files:2000 files:26.350862 files/second
4096 byte/files:4000 files:22.407641 files/second
4096 byte/files:8000 files:19.101444 files/second
8192 byte/files:100 files:44.513807 files/second
8192 byte/files:500 files:35.829555 files/second
8192 byte/files:1500 files:24.617763 files/second
8192 byte/files:2000 files:24.926471 files/second
8192 byte/files:4000 files:20.685662 files/second
8192 byte/files:8000 files:18.204164 files/second
16384 byte/files:100 files:44.724059 files/second
16384 byte/files:500 files:30.803938 files/second
16384 byte/files:1500 files:23.431537 files/second
16384 byte/files:2000 files:22.221013 files/second
16384 byte/files:4000 files:20.179176 files/second
65536 byte/files:100 files:18.743218 files/second
65536 byte/files:500 files:15.353219 files/second
65536 byte/files:1500 files:11.277214 files/second
131072 byte/files:100 files:8.095716 files/second
131072 byte/files:500 files:6.423027 files/second

And on SparcServer 1000, with 4MB NVSIMM write accelerator, I get:
0 byte/files:100 files:454.312103 files/second
0 byte/files:500 files:380.245197 files/second
0 byte/files:1500 files:248.237390 files/second
0 byte/files:2000 files:208.917017 files/second
0 byte/files:4000 files:128.662900 files/second
0 byte/files:8000 files:75.904740 files/second
1024 byte/files:100 files:401.761322 files/second
1024 byte/files:500 files:163.154973 files/second
1024 byte/files:1500 files:152.140973 files/second
1024 byte/files:2000 files:130.117759 files/second
1024 byte/files:4000 files:80.909055 files/second
1024 byte/files:8000 files:50.787231 files/second
2048 byte/files:100 files:403.679946 files/second
2048 byte/files:500 files:328.801787 files/second
2048 byte/files:1500 files:136.374050 files/second
2048 byte/files:2000 files:115.508387 files/second
2048 byte/files:4000 files:77.496927 files/second
2048 byte/files:8000 files:52.562421 files/second
4096 byte/files:100 files:391.308261 files/second
4096 byte/files:500 files:316.301549 files/second
4096 byte/files:1500 files:122.783713 files/second
4096 byte/files:2000 files:109.107908 files/second
4096 byte/files:4000 files:78.231191 files/second
4096 byte/files:8000 files:51.687208 files/second
8192 byte/files:100 files:371.351012 files/second
8192 byte/files:500 files:265.959710 files/second
8192 byte/files:1500 files:99.809603 files/second
8192 byte/files:2000 files:100.119588 files/second
8192 byte/files:4000 files:62.578623 files/second
8192 byte/files:8000 files:45.223094 files/second
16384 byte/files:100 files:220.967617 files/second
16384 byte/files:500 files:219.798190 files/second
16384 byte/files:1500 files:98.374015 files/second
16384 byte/files:2000 files:68.778383 files/second
16384 byte/files:4000 files:49.978416 files/second
65536 byte/files:100 files:99.310191 files/second
65536 byte/files:500 files:62.352552 files/second
65536 byte/files:1500 files:39.066414 files/second
131072 byte/files:100 files:20.424741 files/second
131072 byte/files:500 files:18.012809 files/second

Martin Cracauer

Jun 6, 1994, 3:34:03 AM

[this is from comp.benchmarks]

and...@knobel.knirsch.de (Andreas Klemm) writes:

>Case Larsen (cla...@intruder.lbl.gov) wrote:

>: All in all, the tests say that Linux ext2fs can outperform stock UFS


>: on slower hardware by a significant margin (factor of 2 to 8 for 1k to 8k
>: files) for non trivial numbers (500-4000 files).

>And how safe is Linux ext2fs ? Did you reset the machines under
>heavy work to see what's happening during and after filesystem check ?

SunOS-4.1.3 has a policy to have asynchronous writes for data, but
inodes and other superinformation are always written immediately. So, a
crash could only affect files that are written at that moment. I think
Solaris 2 does this, too. Does anybody know?

There was a patch for SunOS 4.1.3 to make the BSD filesystem write
inodes async, too. That speeds up writing a large number of little
files by a factor of 2 to 3. Of course, a crash could really hurt now
that superinformation could be damaged.

I would really like to know what the standard behaviour of Linux is. If
someone knows, please tell us.
--
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
Martin....@wavehh.hanse.de, Fax. +41 40 5228536, German language accepted
No guarantee for anything. Anyway, this posting is probably produced by one
of my cats stepping on the keys. No, I don't have an infinite number of cats.

Totally Lost

Jun 6, 1994, 12:18:21 PM

In article <1994Jun6.0...@wavehh.hanse.de>,
Martin Cracauer <crac...@wavehh.hanse.de> wrote:
>SunOS-4.1.3 has a policy to have asynchronous writes for data, but
>inodes and other superinformation are always written immediately. So, a
>crash could only affect files that are written at that moment. I think
>Solaris 2 does this, too. Does anybody know?
>
>There was a patch for SunOS 4.1.3 to make the BSD filesystem write
>inodes async, too. That speeds up writing a large number of little
>files by a factor of 2 to 3. Of course, a crash could really hurt now
>that superinformation could be damaged.

Both of these policies are stupidly wrong from a security standpoint,
which is a significant case requiring orderly filesystem updates.
Actually both are wrong from ANY standpoint, except to stupid sysadmins
and shithead systems programmers that don't give a damn about users' data.

Take the case where a sensitive file {passwd, payroll data, next week's
final exam} is copied/compressed/encrypted and the original deleted.
These same blocks are allocated to a new file, the inodes and directory
information are written before the new data for the same blocks, and the machine
crashes at just the right time {the hostile user is watching the jobs run with a hand
on the power switch}, leaving the sensitive data in the hostile user's file, and
fsck will be stupidly happy on the crash-recovery reboot.

I've raised this point a dozen times, even at the Usenix meeting where the
Berkeley guys first presented their concept of sync-written meta data ...
and they didn't have a clue, since they were one-tracked on making the
system clean from fsck's point of view ... not from the users' or database
managers' point of view. If the users'/production data is corrupt, the
filesystem is corrupt!! PERIOD!!

I'd much rather have a totally unsafe filesystem that trashed on crash
and required a known good backup recovery over one that is guaranteed
to do the wrong thing and have dumb sysadmins think that, just because fsck
ran fine, all users'/production data was fine too.

There are only two acceptable strategies:

 1) require that all data be written prior to the referencing
    meta data. File data blocks first, followed by Nth level
    indirect blocks, ...., 1st level indirect blocks, inode.

    If a file is open for writing, and written other than at
    EOF, the disk inode is updated with a dirty flag, which
    is cleared when the file is closed or specifically "sync'ed".

    All filesystem file data and meta data will be clean at any
    interruption of service, with the detectable exception that
    a file/database may be internally inconsistent.

 2) any other write policy requires a filesystem to be reloaded
    from backup media if service is interrupted while mounted.

Item one can be extended to require updates to existing file data be done with
new data blocks, and new meta data, which is only reflected to the
disk inode when the file is closed or a commit operation is performed.
When done properly with locking operations this makes a system database
safe as well.

This policy is neither difficult to implement, nor a significant performance
hit ... and in fact when done with better algorithms than UFS can be
3-10X faster.

John

Burkhard Neidecker-Lutz

Jun 7, 1994, 4:08:59 AM

In article <idletimeC...@netcom.com> idle...@netcom.com (Totally Lost) writes:
> <lots of flames about asynchronous write semantics of UNIX file systems
> deleted>

Get a clue. Whoever writes sensitive data through the normal file
system with normal semantics deserves to get corrupted data. If you
want to commit data into a UNIX file system, you can do that perfectly
well by either using fsync(2) or by opening the file with O_SYNC
in the first place.
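
(For reference, here is a minimal sketch of the two idioms mentioned above;
illustrative only, with made-up function names and no real error reporting:)

        #include <fcntl.h>
        #include <unistd.h>

        /* Commit after the fact: write, then force the data (and, on
         * typical implementations, the inode) to disk before returning. */
        int save_with_fsync(const char *path, const char *buf, size_t len)
        {
                int fd = open(path, O_WRONLY | O_CREAT | O_TRUNC, 0600);
                if (fd < 0)
                        return -1;
                if (write(fd, buf, len) != (ssize_t)len || fsync(fd) < 0) {
                        close(fd);
                        return -1;
                }
                return close(fd);
        }

        /* Commit as you go: every write() on this descriptor returns only
         * after the data has reached the disk. */
        int open_for_sync_writes(const char *path)
        {
                return open(path, O_WRONLY | O_CREAT | O_SYNC, 0600);
        }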

The asynchronous write batching of the normal UNIX file system is
a very reasonable default behaviour and you can get as much consistency
as you like by using the appropriate controls.

Doing all writes synchronously has surprisingly bad performance (you
can try that for yourself on systems that support mounting whole
filesystems synchronously, such as Ultrix or DEC OSF/1).

Burkhard Neidecker-Lutz

Distributed Multimedia Group, CEC Karlsruhe
Advanced Technology Group, Digital Equipment Corporation
nei...@nestvx.enet.dec.com

Totally Lost

Jun 7, 1994, 11:50:29 AM

In article <2t19ur...@usenet.pa.dec.com>,
Burkhard Neidecker-Lutz <nei...@nestvx.enet.dec.com> wrote:
>In article <idletimeC...@netcom.com> idle...@netcom.com (Totally Lost) writes:
>> <lots of flames about asynchronous write semantics of UNIX file systems
>> deleted>
>
>Get a clue. Whoever writes sensitive data through the normal file
>system with normal semantics deserves to get corrupted data. If you
>want to commit data into a UNIX file system, you can do that perfectly
>well by either using fsync(2) or by opening the file with O_SYNC
>in the first place.

I've had the clue since 1977 ... stop knee-jerking (or whatever else).
My primary point which you failed to address was facility-level reliability
problems caused by normal operational procedures recovering from
UNIX crashes .... due to poor filesystem designs, which include BSD/SUN/DEC.
After being primarily responsible for production UNIX systems from 1975
to date ... I have a long list of war stories for such (aka experiences).

Read first, then write .... one point of the posting was that ordered
writes correct 95% of the reason programmers are forced to use O_SYNC
to protect database updates. O_SYNC was a stupid solution to the problem
in the first place. Ordered writes with commit operations solve the
entire problem without the huge performance penalty.
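
(Not the in-kernel ordered-write scheme being argued for here, but the closest
user-level analogue on a stock UNIX of the time: build the new version in a
temporary file, force it to disk, then rename it over the old one. The names
are made up for illustration:)

        #include <stdio.h>
        #include <fcntl.h>
        #include <unistd.h>

        /* Intent: "path" ends up holding either the old contents or the
         * complete new contents, never a mixture - though, as this thread
         * shows, even that depends on the filesystem ordering the rename
         * after the data. */
        int commit_file(const char *path, const char *tmp,
                        const char *buf, size_t len)
        {
                int fd = open(tmp, O_WRONLY | O_CREAT | O_TRUNC, 0600);
                if (fd < 0)
                        return -1;
                if (write(fd, buf, len) != (ssize_t)len || fsync(fd) < 0) {
                        close(fd);
                        unlink(tmp);
                        return -1;
                }
                if (close(fd) < 0)
                        return -1;
                return rename(tmp, path);
        }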

>The asynchronous write batching of the normal UNIX file system is
>a very reasonable default behaviour and you can get as much consistency
>as you like by using the appropriate controls.

Asynchronous write batching WITH ordered writes is fine. With the
"normal UNIX file system" it is NOT "very reasonable default behaviour"
for any production environment, for the reason presented in the first
posting - undetectable file corruption at any abort of service.

To say "you can get as much consistency as you like by using the
appropriate controls" is a gross miss-representation since 95% of
the shipped systems do not have the option of mounting filesystems
synchronously and most of the remaining 5% that can, can not due
to the penalty. You clearly understand the penalty from your comment below:

>Doing all writes synchronously has surprisingly bad performance (you
>can try that for yourself on systems that support mounting whole
>filesystems synchronously, such as Ultrix or DEC OSF/1).

Why is it that you can not understand the gains possible with ordered
writes? I guess it's just because you don't take the time to think ...
or can not ...


Monte P McGuire

Jun 7, 1994, 7:12:53 PM

In article <idletimeC...@netcom.com> idle...@netcom.com (Totally Lost) writes:
>My primary point which you failed to address was facility-level reliability
>problems caused by normal operational procedures recovering from
>UNIX crashes .... due to poor filesystem designs, which include BSD/SUN/DEC.

Since when are crashes part of normal operational procedures?? The
last time I rebooted was to install a UPS...

A happy SunOS 4.1.3 user...

Monte McGuire
mcg...@world.std.com

Burkhard Neidecker-Lutz

Jun 8, 1994, 3:09:34 AM

In article <idletimeC...@netcom.com> idle...@netcom.com (Totally Lost) writes:
>
>Read first, then write .... one point of the posting was that ordered
>writes correct 95% of the reason programmers are forced to use O_SYNC
>to protect database updates.

My point was that most file system operations on a UNIX system don't
require that much consistency and that for these the normal file
system semantics are ok as a default.

> O_SYNC was a stupid solution to the problem
>in the first place. Ordered writes with commit operations solves the
>entire problem without the huge performance penalty.

I was advocating more for fsync(2) than O_SYNC. But you are correct that
this is different from ordered writes and commits (which is in a sense
what's used inside Advfs on DEC OSF/1, though it's a more complex
transaction mechanism with write-ahead logging).
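
(A minimal sketch of the write-ahead-logging idea referred to above - an
illustration, not AdvFS internals; the record format and function names are
invented, and there is no bounds checking. The rule being shown: a log record
describing the change, with enough information to redo it, reaches stable
storage before the change itself, and a commit record marks it done.)

        #include <stdio.h>
        #include <string.h>
        #include <unistd.h>

        static int append_and_sync(int logfd, const char *rec)
        {
                int len = strlen(rec);

                if (write(logfd, rec, len) != len)
                        return -1;
                return fsync(logfd);    /* record is on disk before we go on */
        }

        int logged_update(int logfd, int datafd, long off,
                          const char *newval, int len)
        {
                char rec[512];

                sprintf(rec, "UPDATE off=%ld len=%d val=%.*s\n",
                        off, len, len, newval);
                if (append_and_sync(logfd, rec) < 0)       /* 1. log the intent */
                        return -1;
                if (lseek(datafd, off, SEEK_SET) < 0 ||    /* 2. apply it */
                    write(datafd, newval, len) != len)
                        return -1;
                return append_and_sync(logfd, "COMMIT\n"); /* 3. log the commit */
        }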

>Why is it that you can not understand the gains possible with ordered
>writes? I guess it's just because you don't take the time to think ...
>or can not ...

Err, I do, and Advfs uses an even more clever mechanism without the need
of changing applications or the file system semantics (thank you, Digital
does understand production systems). The VMS operating system has all sorts
of facilities like that in the file system, but more often than not the
performance impact of having these semantics as the *default* rather than
available only where needed is very high.

If you really want transactional safety and semantics, use a database or
transaction monitor. If you're rather into *implementing* one of those,
DEC OSF/1 has all the facilities to do asynchronous I/O and inserting
appropriate synchronization as part of its AIO facility (both to
raw devices as well as to file systems).

Totally Lost

Jun 8, 1994, 8:18:27 AM

In article <2t3qre...@usenet.pa.dec.com>,
Burkhard Neidecker-Lutz <nei...@nestvx.enet.dec.com> wrote:
>In article <idletimeC...@netcom.com> idle...@netcom.com (Totally Lost) writes:
>>
>>Read first, then write .... one point of the posting was that ordered
>>writes correct 95% of the reason programmers are forced to use O_SYNC
>>to protect database updates.
>
>My point was that most file system operations on a UNIX system don't
>require that much consistency and that for these the normal file
>system semantics are ok as a default.


And my point is that any production environment (whether interactive software development
users, student instructional servers, or more traditional end-user applications)
cannot accept the security impacts or data corruption associated with
the current filesystem designs. Are you trying to say that having files
undetectably corrupted with uninitialized deleted file contents is an
acceptable result of a "normal" operational failure?

In the last 19 years, 3 cases have occurred where sensitive information/mail ended
up in another user's file for the systems I have mothered ... the first time was
at SRI in 1977 under V6 UNIX on a PDP-11. A much larger number of readable ASCII
files have ended up with a block or more of binary trash in them unexpectedly
(and no corresponding disk errors could be found on the hard-copy log).

I do not consider this an acceptable side-effect of any operational problem.
For a toy home system that only has a single user (or family), sure,
a little corruption can be tolerated, but is it even necessary - NO!

>> O_SYNC was a stupid solution to the problem
>>in the first place. Ordered writes with commit operations solves the
>>entire problem without the huge performance penalty.
>
>I was advocating more for fsync(2) than O_SYNC. But you are correct that
>this is different from ordered writes and commits (which is in a sense
>what's used inside Advfs on DEC OSF/1, though it's a more complex
>transaction mechanism with write-ahead logging).

Most significant commercial applications need reliable updates that span
multiple files/databases for multiple processes. fsync provides a crude
commit operator useful for single-writer situations. With multiple
writers, fsync also forces to disk the partially completed operations of
other writers, which can leave the indexes or data of a database corrupt
on a crash and make the transaction difficult or impossible to back out
or complete based upon the transaction log. Especially when trash ends
up at the end of the file (or in the middle, if the file is sparse/holey)
due to the meta data hitting the disk prior to the data.

Many managers and applications programmers are completely unaware of, or
unready to deal with, this type of corruption when they choose a UNIX system
to replace a more traditional mini or mainframe.

>>Why is it that you can not understand the gains possible with ordered
>>writes? I guess it's just because you don't take the time to think ...
>>or can not ...
>
>Err, I do, and Advfs uses an even more clever mechanism without the need
>of changing applications or the file system semantics (thank you, Digital
>does understand production systems). The VMS operating system has all sorts
>of facilities like that in the file system, but more often than not the
>performance impact of having these semantics as the *default* rather than
>available only where needed is very high.

AND if it still allows meta data to be written prior to file data then it
is operationally unsafe and a major security risk. Supporting per-file
ordered writes as Advfs does is fine, but why did they stop there?

>If you really want transactional safety and semantics, use a database or
>transaction monitor. If you're rather into *implementing* one of those,
>DEC OSF/1 has all the facilities to do asynchronous I/O and inserting
>appropriate synchronization as part of its AIO facility (both to
>raw devices as well as to file systems).

What I (and customers should) really demand is operational safety and security.
Transactional security and semantics are only a plus, and not addressed in my
original post. And while I believe they should be part of all UNIX
offerings, I do not require it.

What is the point you are trying to make? That the security and file
corruption aspects of current filesystem design during OS/Hardware
failure are acceptable?

My only point is that this is not acceptable or necessary.

John

Bill Flowers

Jun 8, 1994, 12:18:37 PM

In article <idletimeC...@netcom.com>,
Totally Lost <idle...@netcom.com> wrote:
>In article <1994Jun6.0...@wavehh.hanse.de>,
>Martin Cracauer <crac...@wavehh.hanse.de> wrote:
>>SunOS-4.1.3 has a policy to have asynchronous writes for data, but
>>inodes and other superinformation are always written immediately. So, a
>>crash could only affect files that are written at that moment. I think
>>Solaris 2 does this, too. Does anybody know?
>>
>>There was a patch for SunOS 4.1.3 to make the BSD filesystem write
>>inodes async, too. That speeds up writing a large number of little
>>files by a factor of 2 to 3. Of course, a crash could really hurt now
>>that superinformation could be damaged.
>
>Both of these policies are stupidly wrong from a security standpoint,
>which is a significant case requiring orderly filesystem updates.
>Actually both are wrong from ANY standpoint, except to stupid sysadmins
>and shithead systems programmers that don't give a damn about users' data.

The first policy does open a security hole, but U**x systems are not
normally known for security. It does however keep the filesystem
logically safe (but what good is a "clean" filesystem with bad data in
a file?).

The only advantage of the second policy is to run fast. It is neither
logically safe, nor does it make any guarantees about the data
integrity of users' files. Some people want this; speed at any cost.
However, it has been my experience (authoring the QNX4 file system)
that the overwhelming majority of users want the very best performance
they can get without compromising the file system integrity or the
contents of the files.

>[examples and personal preferences deleted]


>
>There are only two acceptable strategies:
>
> 1) require that all data be written prior to the referencing
> meta data. File data blocks first, followed by Nth level
> indirect blocks, ...., 1st level indirect blocks, inode.

Ordered updates. Done properly they can all be asynchronous (ordered
asynch). However drivers cannot be allowed to apply any strategy;
strategy must be applied at a higher level which "knows" about
transactions and the file system structure. If multiple transactions
are occurring simultaneously, useful benefits can be obtained by
applying global strategies across the multiple transactions (e.g.
elevator seeking) while still maintaining the correct order of updates
within each transaction. This is part of what is going into the QNX4
file system now.
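
(A toy illustration of the ordered-asynchronous idea - a sketch, not QNX4 or
BSD code, and all names are invented. Each dirty buffer records one buffer
that must reach the disk before it, so a flusher can schedule writes freely
across transactions while preserving the order within each one:)

        #include <stdio.h>

        struct buf {
                struct buf *must_write_first; /* dependency: flush this first */
                long        blkno;
                const char *data;
                int         dirty;
        };

        static void disk_write(long blkno, const char *data)
        {
                printf("write block %ld: %s\n", blkno, data); /* driver stand-in */
        }

        static void flush_ordered(struct buf *b)
        {
                if (b->must_write_first && b->must_write_first->dirty)
                        flush_ordered(b->must_write_first); /* data before metadata */
                if (b->dirty) {
                        disk_write(b->blkno, b->data);
                        b->dirty = 0;
                }
        }

        int main(void)
        {
                struct buf data  = { NULL,  100, "file data block", 1 };
                struct buf inode = { &data,   7, "inode pointing at block 100", 1 };

                flush_ordered(&inode);  /* writes block 100 first, then the inode */
                return 0;
        }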

> If a file is open for writing, and written other than at
> EOF, the disk inode is updated with a dirty flag, which
> is cleared when the file is closed or specifically "sync'ed".
>
> All filesystem file data and meta data will be clean at any
> interruption of service, with the detectable exception that
> a file/database may be internally inconsistent.
>
> 2) any other write policy requires a filesystem to be reloaded
> from backup media if service is interrupted while mounted.
>
>Item one can be extended to require updates to existing file data be done with
>new data blocks, and new meta data, which is only reflected to the
>disk inode when the file is closed or a commit operation is performed.
>When done properly with locking operations this makes a system database
>safe as well.

Depending on what you are attempting to accomplish, and how much
control you have over all the pieces (i.e. can you rewrite fsck? or
completely redesign the metadata structure?) shadowing may or may not
be of benefit.

>This policy is neither difficult to implement, nor a significant performance
>hit ... and in fact when done with better algorithms than UFS can be
>3-10X faster.

Yes, there are ways to implement safe, fast file systems. It is
unfortunate that there aren't more of them available.
--
---
W.A. (Bill) Flowers wafl...@qnx.com QNX Software Systems, Ltd.
phone: (613) 591-0931 (voice) 175 Terrence Matthews
(613) 591-3579 (fax) Kanata, Ontario, Canada K2M 1W8


Andreas Klemm

Jun 8, 1994, 3:37:06 PM

Martin Cracauer (crac...@wavehh.hanse.de) wrote:
: [this is from comp.benchmarks]

: and...@knobel.knirsch.de (Andreas Klemm) writes:
: >Case Larsen (cla...@intruder.lbl.gov) wrote:
: >: All in all, the tests say that Linux ext2fs can outperform stock UFS
: >: on slower hardware by a significant margin (factor of 2 to 8 for 1k to 8k
: >: files) for non trivial numbers (500-4000 files).
: >And how safe is Linux ext2fs ? Did you reset the machines under
: >heavy work to see what's happening during and after filesystem check ?

: SunOS-4.1.3 has a policy to have asynchronous writes for data, but
: inodes and other superinformation are always written immediately. So, a
: crash could only affect files that are written at that moment. I think
: Solaris 2 does this, too. Does anybody know?

: There was a patch for SunOS 4.1.3 to make the BSD filesystem write
: inodes async, too. That speeds up writing a large number of little
: files by a factor of 2 to 3. Of course, a crash could really hurt now
: that superinformation could be damaged.

: I would really like to know what the standard behaviour of Linux is. If
: someone knows, please tell us.

Linux is very fast because it doesn't write files and inodes synchronously,
as far as I know. But if the cache is full, the data have to be written to
disk immediately. This is the point where everything else on the system
seems to stop for that amount of time :) (my own experience).

I think since ext2 fs release 4 there have been many enhancements, so
you can force synchronous writing as a mount option ... But then the
fs is dog slow (according to the author of the ext2 fs).

So if you want a fast and reliable fs, I think ufs is the better choice.
Any other points of view ?

Best regards

Andreas ///

--
Andreas Klemm /\/\____ Wiechers & Partner Datentechnik GmbH
and...@knobel.knirsch.de ___/\/\/ and...@wupmon.wup.de (Unix Support)


David Holland

Jun 8, 1994, 9:45:11 AM

nei...@nestvx.enet.dec.com's message of 7 Jun 1994 08:08:59 GMT said:

> > <lots of flames about asynchronous write semantics of UNIX file systems
> > deleted>
>
> Get a clue. Whoever writes sensitive data through the normal file
> system with normal semantics deserves to get corrupted data. If you
> want to commit data into a UNIX file system, you can do that perfectly
> well by either using fsync(2) or by opening the file with O_SYNC
> in the first place.

You missed the point.

If metadata is written ahead of the actual data, a crash at the wrong
time can conceivably cause file X to contain the data from file Y.

If file Y is payroll records and file X is Joe User's email, this is
not very good; you can solve the problem by writing file X
synchronously, but most of the time Joe User isn't going to do that -
especially not if he's Joe Hacker trying to exploit the problem.

This is how it happens, as the original poster presented it:

    file Y is deleted          --->
                               <--- file X is written, using
                                    blocks formerly from file Y
    file Y's inode is written  --->
                               <--- file X's inode is written
    <CRASH>
                               <--- file X's data was never written

Now, after recovery and reboot, file X contains some blocks that used
to be in file Y... which still contain the data from file Y.

Security breach.


I don't know if this is actually possible with current filesystems;
I'd hope not, but...

--
- David A. Holland | "The right to be heard does not automatically
dhol...@husc.harvard.edu | include the right to be taken seriously."

Burkhard Neidecker-Lutz

Jun 9, 1994, 5:00:10 AM

In article <idletimeC...@netcom.com> idle...@netcom.com (Totally Lost) writes:
>Are you trying to say that having files
>undetectably corrupted with uninitialized deleted file contents is an
>acceptable result of a "normal" operational failure?

No modern BSD file system will show up with deleted file contents in
a file after any kind of disk crash.

>In the last 19 years, 3 cases have occurred where sensitive information/mail ended
>up in another user's file for the systems I have mothered ... the first time was
>at SRI in 1977 under V6 UNIX on a PDP-11.

V6 Unix. Oh boy. What were the other two cases running ?

>AND if it still allows meta data to be written prior to file data then it
>is operationally unsafe and a major security risk. Supporting per-file
>ordered writes as Advfs does is fine, but why did they stop there?

Advfs doesn't suffer from the meta data problems you mention. A complete
transactional mechanism is used inside that works regardless of the number
of writers, files or physical disk devices involved (after all, some things
have improved since UNIX V6 came out...).

>What is the point you are trying to make? That the security and file
>corruption aspects of current filesystem design during OS/Hardware
>failure is acceptable?

There shouldn't be any security breaches (and to the best of my knowledge
no recent UFS implementation has any). To solve file corruption
problems of the kind you mention (multiple writers) requires a lot more
than just writing some metadata in any particular order (Berkeley UFS
is very careful to order the writes it does to disk so that that can't
happen).

Also, you don't seem to be familiar with really *current* UNIX file systems
like Advfs or JFS which suffer from none of your problems.

Burkhard Neidecker-Lutz

Jun 9, 1994, 7:26:05 AM

In article <DHOLLAND.9...@husc7.harvard.edu> dhol...@husc7.harvard.edu (David Holland) writes:
>
>This is how it happens, as the original poster presented it:

NOT


>
> file Y is deleted --->

so blocks are being released, causing metadata to be updated *on disk*
before any other file gets a chance of reusing that block.

> <--- file X is written, using
> blocks formerly from file Y

Which are zero-filled at that time.

> file Y's inode is written --->

Can't happen now (after all, Y was deleted and hence its inode
is *gone*). Even if Y was truncated by ftruncate(2), that would have
been noted on disk *before* the blocks could be gotten at from
X.

> <--- file X's inode is written
> <CRASH>
> <--- file X's data was never written
>
>Now, after recovery and reboot, file X contains some blocks that used
>to be in file Y... which still contain the data from file Y.
>
>Security breach.

Maybe on UNIX V6 15 years ago.

>I don't know if this is actually possible with current filesystems;
>I'd hope not, but...

Can't speak for SUN :-), but can't happen on any modern UNIX I know.

Linus Torvalds

Jun 9, 1994, 11:17:19 AM

In article <1994Jun6.0...@wavehh.hanse.de>,
Martin Cracauer <crac...@wavehh.hanse.de> wrote:
>
>SunOS-4.1.3 has a policy to have asynchronous writes for data, but
>inodes and other superinformation are always written immediately. So, a
>crash could only affect files that are written at that moment. I think
>Solaris 2 does this, too. Does anybody know?
>
>There was a patch for SunOS 4.1.3 to make the BSD filesystem write
>inodes async, too. That speeds up writing a large number of little
>files by a factor of 2 to 3. Of course, a crash could really hurt now
>that superinformation could be damaged.
>
>I would really like to know what the standard behaviour of Linux is. If
>someone knows, please tell us.

The standard behaviour for linux is to write everything asynchronously,
as others have already pointed out. I'd just like to put in my two cents
for why this is done:

- synchronous writes are slow. You'll lose *lots* of performance. I
tend to think that you can trust the hardware, and just ignore the
minor problems you can get with asynchronous writes - the benefits
far outweigh the problems IMNSHO.

- doing synchronous writes on meta-data is broken: you'd really need to
do synchronous writes on data too to be safe. BSD does metadata
synchronously to give you a sense of security and ignores the actual
file data -- they too did a trade-off between efficiency and security.
They just did a better job at trying to fool people into thinking
it's a good idea..

Remember: fsck can clean up the filesystem metadata if you crashed (and
metadata is the only thing the FFS tries to do synchronously), so why
pay the overhead of doing the same thing at runtime? Yes, you can get
such corruption that fsck gives up in horror, but if you had that bad a
crash, you'd probably have been screwed anyway even with synchronous
writes. How many of you have gotten that kind of filesystem corruption
under linux without it being a device driver or hardware problem (which
a filesystem couldn't fix anyway)?

Final question: what's the use of "safe filesystems" if the hardware
itself isn't safe? Who do you think you're kidding? If the harddisk
crashes on you (or even gets just one bad sector) you won't be safe even
if you write *everything* synchronously. So why take the performance
hit? (yes, I know about RAID etc, and I don't care. I don't have that
kind of hardware, and I don't think most people here wanting a "safe"
filesystem do either).

Linus

PS. As if you didn't notice, you pushed a button. I think sync writes
are stupid, and find some of the FFS proponents' attitudes irritating.
Log-based filesystems at least make sense, even though I don't
particularly want to run them myself.

Frank Lofaro

Jun 9, 1994, 12:51:18 PM

What about Linux ext2fs?
Can it happen on it?


Eric Youngdale

Jun 10, 1994, 8:44:43 AM

In article <Cr5HE...@ucdavis.edu>,
Kevin Brown <ke...@frobozz.sccsi.com> wrote:
>The fact that I was running the cluster patches is important.
>Before, the entire filesystem would by sync()ed every 30 seconds
>or so (I actually had it set for something like 5 minutes to get
>better performance under the conditions my system runs). Between
>these periods, the filesystem on disk was reasonably likely to be
>in a consistent state. But the cluster patches changed things such
>that the system is *continuously* writing to the filesystem, if
>only because there are periodic processes which access (for read)
>files on the filesystem, causing the filesystem to update the access
>times of all accessed files. Indeed, I can see the hard drive
>being accessed every 5 seconds, the default update period for
>bdflush(). This significantly increases the probability that
>filesystem damage will occur if a crash happens.

No, it *decreases* the probability. That was the whole point of
that patch, to reduce the degree of corruption if the system goes down
for some reason.

>The SCSI device driver has been rock-solid reliable in writing data
>correctly to my fixed disks, and when it *does* write data to my
>MO drive it does so correctly. It's just that it likes to lock up
>if I beat on my MO drive too much, especially if there's other
>activity (such as serial port activity) going on.

This says a lot to me. If you pound on the drive too much
and tickle some hardware bug of some kind that causes a drive to go insane,
I do not see how you can expect much of anything in the way of filesystem
integrity.

-Eric


--
"The woods are lovely, dark and deep. But I have promises to keep,
And lines to code before I sleep, And lines to code before I sleep."

Casper H.S. Dik

Jun 10, 1994, 9:23:46 AM

crac...@wavehh.hanse.de (Martin Cracauer) writes:

>SunOS-4.1.3 has a policy to have asynchronous writes for data, but
>inodes and other superinformation are always written immediately. So, a
>crash could only affect files that are written at that moment. I think
>Solaris 2 does this, too. Does anybody know?

This is the normal mode of operation for the Berkeley filesystem.
It makes sure the metadata is almost always consistent after a crash.
The inodes and directory entries are created synchronously.
Each file creation requires two synchronous writes.
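
(A rough sanity check, assuming something like 10 ms per synchronous disk
write for a drive of that era: two synchronous writes per create caps
creation at about 1 / (2 x 0.01 s) = 50 files/sec, the same ballpark as the
~44 files/sec the stock-UFS SparcServer numbers earlier in the thread show
for small files.)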

>There was a patch for SunOS 4.1.3 to make the BSD filesystem write
>inodes async, too. That speeds up writing a large number of little
>files by a factor of 2 to 3. Of course, a crash could really hurt now
>that superinformation could be damaged.

It wasn't a patch, it's an ioctl included in SunOS 4.1.2 and later
(including Solaris 2.x). It is especially useful for full restores,
which run 3-10x faster with the async option on.
It's also useful for rm -rf's (10x faster) and unpacking of large
numbers of files. We use this option a lot.

>I would really like to know what the standard behaviour of Linux is. If
>someone knows, please tell us.

It's asynchronous, which isn't nice when you have a crash.
OTOH, we don't experience many crashes/power failures.

Casper

Stefan Esser

Jun 10, 1994, 12:29:11 PM

In article <2t6u8d...@usenet.pa.dec.com>, nei...@nestvx.enet.dec.com (Burkhard Neidecker-Lutz) writes:
|> NOT
|> >
|> > file Y is deleted --->
|>
|> so blocks are being released, causing metadata to be updated *on disk*
|> before any other file gets a chance of reusing that block.
|>
|> > <--- file X is written, using
|> > blocks formerly from file Y
|>
|> Which are zero-filled at that time.

They are zero filled on disk ????
In case of a crash it doesn't make much of a difference, whether
they had been zero filled in RAM ...

|> > <--- file X's inode is written
|> > <CRASH>
|> > <--- file X's data was never written
|> >
|> >Now, after recovery and reboot, file X contains some blocks that used
|> >to be in file Y... which still contain the data from file Y.
|> >
|> >Security breach.
|>
|> Maybe on UNIX V6 15 years ago.

Really ?

Is there an implied fsync on the file being closed, before writing
updated inode information to disk ?

(I remember that closing a file under Ultrix 4 could take quite some
time, so maybe this is really done in Ultrix ? It made our main server
often unusable for minutes, since one of the most important programs
wrote some 20 files of 64MB each on exit. We reduced the size of the
buffer cache, to shorten the time to flush the file to disk, since
other disk operations on that drive were blocked, from the moment fclose
was called until it returned ...)


How about indirect blocks on large files (say 2GB)? You can't keep
all of them in RAM at all times (2GB/8KB * 4 bytes = 1 MByte), can you ?
(We DO write files of that size on our system regularly, so it's not
only of academic interest to me).
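
(For scale, assuming the classic FFS layout with 8 KB blocks and 4-byte block
numbers: one indirect block holds 8192 / 4 = 2048 pointers and so maps
2048 * 8 KB = 16 MB, so a 2 GB file needs on the order of 128 single-indirect
blocks hanging off a double-indirect block - roughly 1 MB of indirect
metadata in total.)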

If you update the meta data of large files before the corresponding
data blocks are guaranteed to be written to disk, the above scenario
doesn't seem impossible to me, even under modern UNIXes.

You'd have to keep buffer cache blocks and the corresponding meta data
linked in some way, to be sure you always write data before meta data.

The main problem with asynch. inode updates was, if a directory had just
been created, the inode already written to disk, but the data block still
contained ordinary file data (or, worse a previously deleted directory),
then fsck often did silly things. Worst of all was the possibility of
an indirect block number being written into an inode (on disk), when this
block (on disk) still contained ordinary file data.

Always writing inode blocks synchronously when creating or removing a
directory or allocating a new indirect block, makes fsck work much more
reliably.

But it doesn't guarantee that data blocks from another data file don't
end up in your data file, since that doesn't confuse fsck, but it may
confuse the previous owner of that data :).

And that's what the initiator of this thread said ...

--
Stefan Esser Internet: <s...@MI.Uni-Koeln.DE>
Mathematisches Institut Tel: +49 221 4706010
Universitaet zu Koeln FAX: +49 221 4705160
Weyertal 80
50931 Koeln

Totally Lost

Jun 10, 1994, 5:42:26 PM

After a wild strike three Burkhard Neidecker-Lutz is OUT!

To recap this exciting drama ...

In article <1994Jun6.0...@wavehh.hanse.de> Martin Cracauer writes:

Martin: SunOS-4.1.3 has a policy to have asynchronous writes for data, but
Martin: inodes and other superinformation are always written immediately. So, a
Martin: crash could only affect files that are written at that moment. I think
Martin: Solaris 2 does this, too. Does anybody know?
Martin:
Martin: There was a patch for SunOS 4.1.3 to make the BSD filesystem write
Martin: inodes async, too. That speeds up writing a large number of little
Martin: files by a factor of 2 to 3. Of course, a crash could really hurt now
Martin: that superinformation could be damaged.

Attempting to liven up the boring game and settle a few past scores,
John steps to the plate with <idletimeC...@netcom.com>:

John: Both of these policies are stupidly wrong from a security standpoint,
John: which is a significant case requiring orderly filesystem updates.
John: Actually both are wrong from ANY standpoint, except to stupid sysadmins
John: and shithead systems programmers that don't give a damn about users' data.
...
John: I've raised this point a dozen times, even at the Usenix meeting where the
John: Berkeley guys first presented their concept of sync-written meta data ...
John: and they didn't have a clue, since they were one-tracked on making the
John: system clean from fsck's point of view ... not from the users' or database
John: managers' point of view. If the users'/production data is corrupt, the
John: filesystem is corrupt!! PERIOD!!
...
John: I'd much rather have a totally unsafe filesystem that trashed on crash
John: and required a known good backup recovery over one that is guaranteed
John: to do the wrong thing and have dumb sysadmins think that, just because fsck
John: ran fine, all users'/production data was fine too.

With <2t19ur...@usenet.pa.dec.com> Burkhard takes the bait,
with a wild first strike:

Burkhard: Get a clue. Whoever writes sensitive data through the normal file
Burkhard: system with normal semantics deserves to get corrupted data. If you
Burkhard: want to commit data into a UNIX file system, you can do that perfectly
Burkhard: well by either using fsync(2) or by opening the file with O_SYNC
Burkhard: in the first place.
Burkhard:
Burkhard: The asynchronous write batching of the normal UNIX file system is
Burkhard: a very reasonable default behaviour and you can get as much consistency
Burkhard: as you like by using the appropriate controls.
Burkhard:
Burkhard: Doing all writes synchronously has surprisingly bad performance (you
Burkhard: can try that for yourself on systems that support mounting whole
Burkhard: filesystems synchronously, such as Ultrix or DEC OSF/1).

[The excitement builds; he tells the UNIX crowd they deserve corrupted data,
setting the stage to introduce some vaporware he has stock in ... ]

John steps back up to the plate with <idletimeC...@netcom.com>:

John: My primary point which you failed to address was facility-level reliability
John: problems caused by normal operational procedures recovering from
John: UNIX crashes ... due to poor filesystem designs, which include BSD/SUN/DEC.
John: After being primarily responsible for production UNIX systems from 1975
John: to date ... I have a long list of war stories for such (aka experiences).
John:
John: Read first, then write .... one point of the posting was that ordered
John: writes correct 95% of the reason programmers are forced to use O_SYNC
John: to protect database updates. O_SYNC was a stupid solution to the problem
John: in the first place. Ordered writes with commit operations solve the
John: entire problem without the huge performance penalty.
John:
John: To say "you can get as much consistency as you like by using the
John: appropriate controls" is a gross misrepresentation, since 95% of
John: the shipped systems do not have the option of mounting filesystems
John: synchronously, and most of the remaining 5% that can, cannot due
John: to the penalty.
John:
John: Why is it that you can not understand the gains possible with ordered
John: writes? I guess it's just because you don't take the time to think ...
John: or can not ...


With <2t3qre...@usenet.pa.dec.com> Burkhard takes the bait again
with a second wild strike:

Burkhard: My point was that most file system operations on a UNIX system don't
Burkhard: require that much consistency and that for these the normal file
Burkhard: system semantics are ok as a default.
Burkhard:
Burkhard: Err, I do, and Advfs uses an even more clever mechanism without the
Burkhard: need of changing applications or the file system semantics (thank you,
Burkhard: Digital does understand production systems). The VMS operating system
Burkhard: has all sorts of facilities like that in the file system, but more
Burkhard: often than not the performance impact of having these semantics as
Burkhard: the *default* rather than available only where needed is very high.

[So he backs off from "UNIX users deserve corrupted data" to "they don't really
need correct data", and, trying to save the day, makes the standard side pitch that
for anybody wanting correct data VMS will solve all UNIX users' ills ...
with "thank you, Digital does understand production systems". I guess this
means Ultrix (DEC's unix offering) is not a production system ;-) ]

BTW: for those who have not experienced VMS, try porting a large unix
application to it for a few years; I didn't like it (most of 1988 & 1989),
and the best VMS could do was 1/4 the performance of the same application on
UNIX with only 75% of the usability. Coupled with the DEC memory- and
processor-hungry office automation product, the result was a total dog, but the State
of Calif. got what they ordered - with DEC helping specify it. The DOD/DOE/NSF
contract administrators came to the same conclusion 10 years earlier when
they awarded the contracts for Berkeley to make BSD 4.x, over the DEC 10 and
VAX camp that tried to put a UNIX emulator on top of VMS at SRI.


John steps back up to the plate with <idletimeC...@netcom.com>:

John: And my point is that any production environment (whether interactive software
John: development users, student instructional servers, or more traditional
John: end-user applications) cannot accept the security impacts or data
John: corruption associated with the current filesystem designs. Are you trying
John: to say that having files undetectably corrupted with uninitialized
John: deleted file contents is an acceptable result of a "normal" operational
John: failure?
...
John: What I (and customers should) really demand is operational safety and
John: security. Transactional security and semantics are only a plus, and not
John: addressed in my original post. And while I believe they should be part
John: of all UNIX offerings, I do not require it.
John:
John: What is the point you are trying to make? That the security and file
John: corruption aspects of current filesystem design during OS/Hardware
John: failure are acceptable?
John:
John: My only point is that this is not acceptable or necessary.


With <2t6lmq...@usenet.pa.dec.com> Burkhard takes the bait again
with a third wild strike:

Burkhard: No modern BSD file system will show up with deleted file contents in
Burkhard: a file after any kind of disk crash.
...
Burkhard: There shouldn't be any security breaches (and to the best of my
Burkhard: knowledge any recent UFS implementation doesn't have any). To
Burkhard: solve file corruption problems of the kind you mention (multiple
Burkhard: writers) requires a lot more than just writing some metadata in
Burkhard: any particular order (Berkeley UFS is very careful to order the
Burkhard: writes it does to disk so that that can't happen).
Burkhard:
Burkhard: Also, you don't seem to be familiar with really *current* UNIX file
Burkhard: systems like Advfs or JFS which suffer from none of your problems.

[The wimp goes from saying that UNIX users deserve it, to they don't really
need correct data, to those that have modern BSD filesystems have correct
data after all - NOT! And then goes on to push the DEC Ultrix solution again,
with a claim of correctness that cannot be quickly verified since the sources
are not public - if anyone does happen to have source access, could they verify
the claim?]

Mr. Neidecker-Lutz really puts his ignorance of the issue right to the point
here. I have followed the BSD filesystem since its beginnings and have never
seen a version that was data-safe over a crash. Maybe he can point us to
the Internet server that has the data-safe UFS implementations?

The primary place I have seen the corruption over the years is in log files and
people's mailboxes (while I was only made aware of 3 cases where a user got
sensitive data in his mailbox or other file, I have lost track of a fairly
large number of files that had binary data inserted into them).

From DEC's own archive on gatekeeper.dec.com via ftp you may examine various
"modern" versions of UFS:

/.0/BSD/NetBSD/NetBSD-current/src/sys/ufs
/.0/BSD/FreeBSD/FreeBSD-current/src/sys/ufs

Pay particular attention to the flow path of:

ufs_vnops.c:ufs_write(vp, uio, ioflag, cred)
        do {
                lbn = lblkno(fs, uio->uio_offset);
                on = blkoff(fs, uio->uio_offset);
                n = MIN((unsigned)(fs->fs_bsize - on), uio->uio_resid);
                if (n < fs->fs_bsize)
                        flags |= B_CLRBUF;
                else
                        flags &= ~B_CLRBUF;
                if (error = balloc(ip, lbn, (int)(on + n), &bp, flags))
                        break;


as it updates the block numbers in the indirect blocks via the path:

ufs_bmap.c:balloc(ip, bn, size, bpp, flags)

...

        /*
         * Fetch through the indirect blocks, allocating as necessary.
         */
        for (; ; j++) {
                error = bread(ip->i_devvp, fsbtodb(fs, nb),
                    (int)fs->fs_bsize, NOCRED, &bp);
                if (error) {
                        brelse(bp);
                        return (error);
                }
                bap = bp->b_un.b_daddr;
                sh /= NINDIR(fs);
                i = (bn / sh) % NINDIR(fs);
                nb = bap[i];
                if (j == NIADDR)
                        break;
                if (nb != 0) {
                        brelse(bp);
                        continue;
                }
                if (pref == 0)
                        pref = blkpref(ip, lbn, 0, (daddr_t *)0);
                if (error = alloc(ip, lbn, pref, (int)fs->fs_bsize, &newb)) {
                        brelse(bp);
                        return (error);
                }
                nb = newb;
                nbp = getblk(ip->i_devvp, fsbtodb(fs, nb), fs->fs_bsize);
                clrbuf(nbp);
                /*
                 * Write synchronously so that indirect blocks
                 * never point at garbage.
                 */
                if (error = bwrite(nbp)) {
                        blkfree(ip, nb, fs->fs_bsize);
                        brelse(bp);
                        return (error);
                }

The sync bwrite makes sure the indirect block is at least zeroed if a crash
occurred shortly after this point ... reading on ...

                bap[i] = nb;
                /*
                 * If required, write synchronously, otherwise use
                 * delayed write. If this is the first instance of
                 * the delayed write, reassociate the buffer with the
                 * file so it will be written if the file is sync'ed.
                 */
                if (flags & B_SYNC) {
                        bwrite(bp);
                } else if (bp->b_flags & B_DELWRI) {
                        bdwrite(bp);
                } else {
                        bdwrite(bp);
                        reassignbuf(bp, vp);
                }


note that flags & B_SYNC above will normally be false and the indirect block
pointing to our new data block will be written at the mercy of either the
disk sort routine or a combination of buffer cache flush followed by disk sort.

Back in ufs_write we return from balloc with the newly allocated bp:

        if (error = balloc(ip, lbn, (int)(on + n), &bp, flags))
                break;

and proceed with

        bn = bp->b_blkno;
        if (uio->uio_offset + n > ip->i_size) {
                ip->i_size = uio->uio_offset + n;
                vnode_pager_setsize(vp, ip->i_size);
        }
        size = blksize(fs, ip, lbn);
        n = MIN(n, size - bp->b_resid);
        error = uiomove(bp->b_un.b_addr + on, n, uio);
        if (ioflag & IO_SYNC)
                (void) bwrite(bp);
        else if (n + on == fs->fs_bsize) {
                bawrite(bp);
        } else
                bdwrite(bp);

copying the data into the newly allocated buffer; and since ioflag & IO_SYNC
is generally false, we asynchronously write it (possibly delayed in the cache),
allowing the disk sort to write it in some random order from this inode's
point of view.

Should a crash occur after the above indirect block is written to disk, and
before this data block is written, the file WILL undetectably contain
the previous data on the disk. FSCK will be happy, the system admin will
have no means to identify this file as corrupt, and the user can get someone
else's deleted data - a first-class security breach.

This chance order is the best case, and as the original author pointed out:

Martin: SunOS-4.1.3 has a policy to have asynchronous writes for data, but
Martin: inodes and other superinformation are always written immediately. So, a
Martin: crash could only affect files that are written at that moment. I think
Martin: Solaris 2 does this, too. Does anybody know?

which ensures the wrong order.

In addition, significant performance delays are implicitly created by
extra writing of indirect blocks that would not occur if all the
data blocks rapidly allocated to an indirect block were written prior
to allowing the indirect block to be written, with the same policy
repeated up the indirect chain.

Nearly 100% of all UNIX systems in the field have this problem.

Installing a UPS on your existing system will remove some small percentage
of unexpected crashes. On most busy server systems known OS bugs resulting
in a panic are by far the most common problem, followed by random hardware
failures. After each of these types of failures you have no way of knowing
what recently created data is corrupt. Fsck can only tell you the
filesystem meta data is correct ... it tells you nothing about the
contents of those same files.

Next batter step up to the plate please ...

John

Totally Lost

Jun 10, 1994, 7:14:28 PM

In article <2t6u8d...@usenet.pa.dec.com>,
Burkhard Neidecker-Lutz <nei...@nestvx.enet.dec.com> wrote:
>>Now, after recovery and reboot, file X contains some blocks that used
>>to be in file Y... which still contain the data from file Y.
>>
>>Security breach.
>
>Maybe on UNIX V6 15 years ago.
>
>>I don't know if this is actually possible with current filesystems;
>>I'd hope not, but...
>
>Can't speak for SUN :-), but can't happen on any modern UNIX I know.
>
> Burkhard Neidecker-Lutz

I think the "three strikes and Burkhard is out" posting made it clear how
the UFS filesystem can leave trash data in a user's file at a crash.

On machines that ran UNIX V6, most were lucky to have enough memory for
5-10 512-byte buffers ... systems with more than 25 buffers were
fairly rare. That is barely enough for the filesystem to run without
deadlock ... there was no code room or buffer space to implement a more
robust policy ... Ken did well. I'm amazed that I ran 3-4 users
on my first unix child ... a PDP-11/34 with 96KB memory, a 2.5MB disk,
two DECtapes, a 9-track tape, two plotters, two digitizing tablets,
two terminals, two high-performance Lundy graphics subsystems,
and a dream at Cal Poly San Luis Obispo.

Up until this point room for kernel code was limited by the 16-bit address
space and the fact that most UNIX machines only had 128K to 256K of DRAM
to run 16-64 users. Anything added to the kernel reduced the amount of
USER DRAM and increased the amount of swapping.

By 1980, when the BSD team started their work on the VAX, machines with
512KB and larger were starting to be common, and the VAX relieved
the kernel of the 17-bit address space limit (dual 16-bit spaces
on the 11/70, 45, and 44).

The network code, which ran in user space as processes (done by
the ARPA community at Urbana, UCLA and BBN), was rewritten at
UCB to drop into the VAX kernel, and BSD was born soon after
with the native port of V7/32V to the VAX. The UNIX kernel then
started its rapid transition to huge.

All systems with filesystems based upon V6/V7/SVR3/BSD have the problem.
This is nearly every system ever shipped.

Fixing the problem means major changes to the filesystem/BIO/driver
relationships, key data structures, and key kernel internal programming
interfaces including driver and FFS/VNODE interfaces.

Twice I have come close to implementing this. First at Fortune Systems,
where Don (1st VP of Engr) and I wrote it into the technical part of
the business plan / product specification in Feb/Mar 1981. In May (??)
Don and I were replaced when we told Homer Dunn (Founder) that unless the
software team and development computer funding that were promised
for March were in place in May (??), we would slip first customer ship from
early Dec 81 week by week until they were available. Steve Puthuff and Rick Kiesig
(who replaced Don and me) abandoned this requirement while slipping the
schedule from Dec 81 to Sept 82 week by week. The software staff budget
for 5-7 seasoned programmers turned into 25-plus kids in or just
out of school - none, including Rick, had the experience to do what they
blindly took on. Late in the spring of 82 Homer wanted to fire all of them
for slipping HIS schedule ... seeking my advice on replacing them, I
told him fat chance if he wanted to deliver the product, and that he
had just spent our company's reputation to build/train what would become
one of the better unix teams in the Valley. Rick chose to meet a
minimal filesystem hardening requirement by using part of the BSD code.

The second time was recently at SCO, where I attempted several times
to work out a contract to do some significant re-architecting of
the UNIX kernel and filesystem for performance and scaling reasons.
After a couple of false starts, ownership of the kernel technologies
was won by the London team and hopes vanished for doing anything
interesting in Santa Cruz. So what twice could have been a major
UNIX event is still a dream and code fragments in my lab.

The various LFS-style filesystems have the promise of reliability,
but the implementation tradeoffs are performance-crippling for most
desktop systems smaller than the Sprite-sized machines the work
was done for. Locality of data is severely compromised. At last
winter's USENIX WIP session I gave a wake-up call talk on part of these issues.

Every production machine I see is killed by UNIX filesystem I/O
running at 10-20% of what it should be ... by filesystem designers
that insist on using a horse and buggy as the prototype for a space
ship. The recent software bloat caused by X/Motif applications
continues the pressure on the I/O subsystem; combined with
incredibly faster processor technology, the pressure to
replace or rearchitect UNIX will continue into the 90's.

As with my comments about Raw I/O in comp.os.research, the critical
problem is people attempting to continue to use outdated decisions
without re-evaluation of the assumptions and tradeoffs involved.
The current UNIX filesystem architecture is critically flawed
on all major fronts - performance, reliability and security - and
lacks key features of the main frame market it replaces.
OS work today is done mostly by follow the herd, critical thinking
is a lost art.

Either Novell and the key players need to get the clue, or UNIX
will be replaced in the passing of time (the 90's).

John

Totally Lost

unread,
Jun 10, 1994, 8:16:41 PM6/10/94
to
In article <Cr3Fz...@ucdavis.edu>,
Kevin Brown <ke...@frobozz.sccsi.com> wrote:
>The suggestions that you have will help, but they're not quite good
>enough, particularly on a complex system. Ultimately, there is no
>substitute for a good system administrator.

99% of the unix facilities in the world run without a knowledgeable
system administrator ... in nearly every city I know of there
is a unix system hosting a point-of-sale system in a major store.
But what really scares me is the number of mission-critical
implementations that use unix ... including processing prescriptions
at many of the key hospitals and drug stores.

Unless unix can be reduced in complexity and increased in reliability,
the DOS/NT market will kill it ... Programmer workstations are
very nice ... the other 99.99% of the market is hosting production
applications on servers and desktops.

>Your point is well taken, but if the hostile user is at the power switch,
>then I'd suggest you have significantly worse security problems then the
>one you present above.

This clearly is not the main threat, but only a few percent of the workstations
are on UPS's ... and while the machine room door may be locked, nearly
any hacker can figure out how to cut part or all of the building power from
the utility closet or distribution system outside the building.

Every major OS has at least one well-known kernel panic bug that will
work just as well.

>
>>I've raised this point a dozen times, even at the usenix meeting where the
>>Berkeley guys first presented their concept of sync written meta data ...
>>and they didn't have a clue since they were one-tracked on making the
>>system clean from fsck point of view ... not from the users or data base
>>managers point of view. If the users/production data is corrupt, the
>>filesystem is corrupt!! PERIOD!!
>

>True. However, in that particular case the damage is likely much more
>localized. The system administrator may have to restore some files from
>backup, but much less than restoring the entire filesystem.

The files we are talking about are recently created, and have no backup.
Furthermore, in a production environment they are most likely cataloged
or have a strict relationship with other data in a production database.
Few production databases have the equivalent of FSCK to clean both the
database metadata AND the production data it contains. Most often
records are not checksummed and bad/corrupt records cannot be determined.
To proceed after a crash requires reloading from a known checkpoint
and processing the journal/transaction log prior to the crash, or
a prayer and blind faith that the error doesn't put you out of business.

If the error is in your financials you might just have your bank balance
high by 10-20% of monthly cash flow ... and be out of cash and out
of business when your payroll bounces at the end of the year and quarterly
tax deposits are due. WC Grant stores and warehouses closed in the early
70's when they ran out of cash during expansion and the switchover from the
1440s to the 360 financial systems; they were unable to balance the
books for several months due to some very stupid decisions about how
to manage the cutover, as I heard the story many years ago.

>You ultimately can't compensate for stupid system administrators. While
>more orderly filesystem updates will help, the *real* solution is to get
>a good system administrator.

The best systems admin is helpless when blinded to problems by the
tools he must trust.

>> 1) require that all data be written prior to the referencing
>> meta data. File data blocks first, followed by Nth level
>> indirect blocks, ...., 1st level indirect blocks, inode.
>

>This makes sense as long as you don't care too much about file data
>integrity. You mention a possible solution to this problem (write
>new data blocks), but that has its own set of problems (e.g., low-
>space situations).

Get a clue ... this entire thread is about CARING about the data!
geez. Stop and re-read it, you are in too big a hurry to look dumb.

>> If a file is open for writing, and written other than at
>> EOF, the disk inode is updated with a a dirty flag, which
>> is cleared when the file is closed or specifically "sync'ed".
>

>Suppose a file is marked as "dirty", the system crashes, and the
>operator has to bring the system back up. When it comes up, the
>file is still "dirty", but what is the operator to do about it?
>Restore it from backup? This is not at all a proper solution if
>the "dirty" file is part of a multi-file database. Restoring the
>file from backup while not restoring the other (presumably clean)
>files will yield an inconsistent database.

How to recover from a dirty file can vary widely, but that is not the
issue ... how to recover when you don't know which files might be
dirty is an even tougher problem that MUST be solved. The file might
not even be dirty, just have its dirty bit set, but at least the
sysadmin, production managers, users and programmers have a hint
where to look ... and an incentive to write tools to deal with the
problem rather than ignore it ... or not even know.

In my mind any application that doesn't keep a journal/transaction
log is in trouble to start with. If the journal is built with before/after
records, it can be used to verify the database in reverse back to
a recent checkpoint.
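
As a purely illustrative picture of such a before/after record -- the field
names are hypothetical, not taken from any particular database:

#include <stdint.h>

/*
 * One before/after journal record.  Scanning the log backwards from the
 * crash point and comparing each after-image with what is actually on
 * disk identifies which updates made it out intact, back to the last
 * checkpoint.
 */
struct jrecord {
    uint64_t lsn;      /* log sequence number, monotonically increasing */
    uint64_t txid;     /* transaction this update belongs to            */
    uint64_t file_id;  /* which database file was touched               */
    uint64_t offset;   /* byte offset of the update within that file    */
    uint32_t length;   /* bytes covered by each image below             */
    uint32_t crc;      /* checksum over the record; catches torn writes */
    /* followed by 'length' bytes of before-image,
       then 'length' bytes of after-image */
};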

>Situations like this simply underscore the fact that there is no
>substitute for a system administrator that knows what he's doing
>(and knows what his users are doing. The file above may be part
>of a user's *personal* database, rather than a systemwide one).

Are you pushing some sysadmin union to control the world? Machines should
be designed to seldom have irreconcilable problems. As I said before,
until unix machines require the same level of maintenance/support as a DOS
or Mac system ... we might as well just give up and go home rather
than prolong the DOS/NT/NetWare/Mac takeover.

>> All filesystem file data and meta data will be clean at any
>> interruption of service, with the detectable exception that
>> a file/database may be internally inconsistent.
>

>This cannot be guaranteed, particularly if the interruption occurs
>in the middle of a write operation to the middle of a file.
>Requiring that new blocks be written will help enormously here,
>though.

The only guaranteed outcome in life is our death, and that somebody
else will be doing our job afterward ... if it was worth doing.

To use algorithms that are fundamentally wrong is only useful for the
job security of the person insisting they shall be wrong. To everyone
else it creates one more fire and one more disaster waiting to threaten
their job/company/life.

>
>> 2) any other write policy requires a filesystem to be reload
>> from backup media if service is interrupted while mounted.
>

>This may well be required anyway, even *with* the ideas you suggest.

Only for a poorly constructed complex application. Normal desktop
services and most server environments can be made to be clean.

>
>>Item one can be extended to require updates to existing filedata be done with
>>new data blocks, and new meta data, which only is reflected to the
>>disk inode when the file is closed or a commit operation performed.
>

>Yup. But this can be a really significant problem in low-space
>situations, particularly if the files being updated are large.

Actually the only size that matters is whether the transaction exceeds free
space on the disk ... if you are that close to running out of space in an
environment that uses huge transactions, then the programmer needs the
standard "we are out of space; if you wish to continue we will be unable
to undo this action without reloading the disk" message.

>
>>When done properly with locking operations this makes a system database
>>safe as well.
>

>Certainly safer than with existing filesystem policies.

You seem not to expect much .... Managers and designers with this attitude
need to be replaced with ones that have strict standards AND the ability
to compromise with reason when reality gets in the way.

In this case, IT IS POSSIBLE TO PROVIDE SAFE SERVICES, which include
the tools to make incomplete operations consistent after a crash.
I expect a small number of very complex multi-host networked applications
might have reality intrude ... but for standard single-host, multiple-
process environments I see no problem providing the tools for
application programmers to build 100% operationally correct systems,
which includes handling abnormal events like disk failures.

>
>>This policy is neither difficult to implement, nor a significant performance
>>hit ... and in fact when done with better algorithms than UFS can be
>>3-10X faster.
>

>Do you have any references in the literature to performance analyses
>of such schemes? I don't imagine the performance hit would be that
>significant, because the only effect I can see off the top of my
>head is the queueing of data in the various passes of the elevator
>algorithm...
>
>You'd have to get rid of the idea of a simple buffer cache, though
>(would a cache with priority levels be sufficient?).
>

Geez ... back to follow-the-herd thinking. Do you know of a system that
has implemented this? If one major vendor had, they would be beating
their chests in the trade press about everyone else's problem. The first
kid on the block here has a major selling angle.

And no, a priority scheme doesn't address the data relationships properly
for reasonable performance. Simply assigning priority 5 to inodes ... and 1 to
data blocks will cause the cache to fill with non-data blocks and prevent
timely writing. As one other thinking reader deduced, it takes threading
requests and a disk sort routine that cooperates with the filesystem.
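
To make that last point concrete, here is a hedged sketch -- hypothetical
structures, not code from any shipped driver -- of a disk sort that cooperates
with the filesystem: requests stay in the usual block-number order, but a
request is skipped over, not reordered away, while it still has unwritten
dependencies, so an indirect block can never pass the data blocks it points at:

#include <stddef.h>

struct dreq {
    struct dreq *next;   /* queue kept sorted by block number           */
    long         blkno;
    int          ndeps;  /* unwritten buffers this request depends on;
                            decremented by the completion handler of
                            each dependency                             */
    /* ... buffer pointer, length, read/write flag, etc. ...            */
};

/* Pick the next request the drive may start. */
struct dreq *
next_issuable(struct dreq *head)
{
    struct dreq *rq;

    for (rq = head; rq != NULL; rq = rq->next)
        if (rq->ndeps == 0)
            return rq;   /* elevator order preserved, holds honored   */
    return NULL;         /* everything pending is currently held back */
}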

John

Message has been deleted
Message has been deleted

Totally Lost

unread,
Jun 10, 1994, 9:53:04 PM6/10/94
to
Ahhh, the vulgar keyboard emailer finally decided to spare the audience
insults directed at me while completely avoiding any technical
discussion.

Batter number two up, and his first swing is:

In article <2t6870$b...@news.cfi.waseda.ac.jp>,


Alexander During <639...@cfi.waseda.ac.jp> wrote:
>>>I've raised this point a dozen times, even at the usenix meeting where the
>>>Berkeley guys first presented their concept of sync written meta data ...
>>>and they didn't have a clue since they were one-tracked on making the
>>>system clean from fsck point of view ... not from the users or data base
>>>managers point of view. If the users/production data is corrupt, the
>>>filesystem is corrupt!! PERIOD!!
>

>I think it is about time to stop that silly thread. Apart from the rather
>disgusting style of 'idletime's postings, the discussion is rather far-
>fetched. The point is: There are circumstances under which a filesystem
>can be harmed by system crashes. The likelihood of such an event depends
>on a) the typical crash frequency, which is determined by the skill of
>the sysadmin and the OS, as well by the hardware, and b) the stability
>of the filesystem under such crashes, which depends on the kind of crash
>and the FS itself.

Reliability, operational integrity, and security are never silly.
(carefully avoiding the production word the writer has a hard time with)

The point really is that nearly every unix system shipped today
will needlessly expose the user community to corruption and security
problems ... these problems can be greatly reduced, if not removed,
by a correct filesystem design. This is not a silly discussion,
but rather one that is 12 years overdue from my viewpoint.

Since the guy wants to talk statistics, we can do that too ...

If the probability that metadata gets written before the data blocks
is one (sync writes) and the probability of corruption during a crash is
.03 to .1 (my guess), then across a community of 4 million machines with
one crash a year, approximately 120K to 400K potential security
or corruption events happen each year.

If you reduced the probability to approximately zero, then none of this
failure type should occur. That is kind of a noticeable change, guy.

Of course other events occur, like head crashes which destroy data on disk
and cannot be controlled. But that's a hardware problem :) ... or is it?
(I have a nice tirade on how UFS causes premature disk failure too...)

>First and foremost, no FS is safe against a head scrubbing over the HD,
>so there is no perfect way out. It may be useful to argue which risks
>are more worthy to protect against, but this thread is seemingly not
>directed at that kind of analysis.

And we are all going to die some day, so just give up now ???
Since you raised the point ... let's hear the analysis please;
most of us, off the cuff, have already reached what we think
is the right conclusion.

What point are you trying to make? This statement as it stands is
simply smoke to blur the issue, since you failed to provide enough data
to even validate your point, if there was one.

>
>The main problem involved is a tradeoff between time and security. A
>filesystem that writes synchronously (like DOS) forces the hardware
>to read a sector, change one byte and write that sector back each time
>a byte is written, which is slow. It is not perfectly safe, however,
>which is something everybody who has worked under DOS will readily
>agree with. So asynchronous writes have the indisputable advantage of
>being faster, while aggravating the severity of crashes, which may still
>harm even completely synchronous writes.

I have several times stated that this is not necessarily a trade-off
at all ... and that the correct algorithm will suppress the needless
metadata writes the current algorithms require to achieve fsck-level
reliability. I will also add that there is up to 30% additional overhead due
to other side effects of this policy, which when translated into time
due to mechanical delays for seek and rotation reduces certain classes
of applications' throughput by as much as 95% (read: 1/20th).

And asynchronous ordered writes are the best of both worlds: faster than
either fully synchronous writes or synchronous metadata alone.

>From one of the enlightening postings of Mr. idletime, we could gain the
>fact that he has experienced 3 losses of data in 19 years. This is not
>something that I would call a risk. Please consider the likelihood of
>getting run over by a car in 19 years which you spend by crossing a
>street 8 hours a day. I bet you'd get run over more often (nowadays
>probably only once, however), and still you do it. There are just
>chances that one has to take.

And you may have looked at the words, but certainly didn't think about
what was said. I know of 3 security breaches caused by this, and a
number of corruptions. I suspect at least the same number (or more)
of breaches occurred where the recipient didn't care to disclose the
newfound wealth. Since I am only one person, if you multiply that times
the number of other sysadmins who are not looking for such or reporting
such, we have a huge number per year again. And for what reason?
Carelessness, as far as I am concerned.

If you lie in the middle of the freeway or train tracks for 5 minutes
a day you have some probability of getting run over ... if that
is on a freeway at rush hour the probability is near one. If that
is at midnight on a siding that hasn't been used in ten years, it is near zero.
I'll take my chances with the siding. Choosing a high-probability
event is stupid.


>In my opinion there is nothing much to discuss. idletime has raised that
>point a dozen times in front of experts and they probably threw him out
>and so he tries to infest this newsgroup. Just let him have his fun, but
>don't waste the time of the people who read this for information by
>posting too many replies; it's getting annoying.

I believe the world is round, which is a pretty safe bet today.
Not that long ago such mutterings could cost you your life.
Even today some continue to declare the world flat; we call them
fools.

We each pick whom we trust as experts; with experience this
usually becomes a task of reasoned care.

I think this is officially strike one, do you care to accept the
bait and attempt to reclaim your honor defending your position?

>
>Just my 2 Yen.

and at the current exchange rate what fraction of 2 cents is that ...
hmm let's see ...


batter up ....
John

David Holland

unread,
Jun 10, 1994, 6:25:22 AM6/10/94
to

nei...@nestvx.enet.dec.com's message of 9 Jun 1994 11:26:05 GMT said:

> > <--- file X is written, using
> > blocks formerly from file Y
>
> Which are zero-filled at that time.

Are they? Is this *guaranteed*? Is there anything that insures that
these zeros are *written out* before the blocks can be reused? Suppose
the crash occurs after the old file's cleared inode has been written,
but before the zeroed blocks have been written out? Then a second
crash, at the wrong time, could have this same effect. Although I
suppose in this case fsck could take care of the problem.

Not having source handy, I don't know the answer to these questions.

Nonetheless, the behavior you describe is still what I'd consider
broken: after the crash, the new file is found to exist and have the
correct length - but contain nothing. That is, the system fails
silently.

Btw...

> > file Y's inode is written --->
>
> Can't happen now (after all, Y was deleted and hence it's inode
> is *gone*).

...the inode has to be cleared or otherwise marked unused. Otherwise,
at fsck time, file Y will rise from the grave...

Orc

unread,
Jun 11, 1994, 2:42:44 AM6/11/94
to
In article <idletimeC...@netcom.com>,
Totally Lost <idle...@netcom.com> wrote:
>After a wild strike three Burkhard Neidecker-Lutz is OUT!
>
>To recap this exciting drama ... [verbage elided]


Please take it to email.

Totally Lost

unread,
Jun 11, 1994, 10:43:14 AM6/11/94
to
In article <Cr7IE...@ucdavis.edu>,
Kevin Brown <ke...@frobozz.sccsi.com> wrote:
>In article <idletimeC...@netcom.com>,
>have the expertise and the ability.
>
>So why don't *you* write an efficient, secure filesystem for Linux
>(or one of the free versions of BSD, if that suits you better,
>though they may be a bit more entrenched in tradition than the
>Linux developers)? You have the source code, so you can make any
>changes you need to in the interfaces, drivers, etc. to make it
>happen. You might even be able to work with the people who wrote
>the device drivers.
>

I am self-employed, and run a small consulting company with
a house and kids to support (i.e. current fixed expenses >$5K/mo).
My short-term goal is to save enough to return to grad school soon.

This is not the only critical flaw of UFS or other V6/V7/SVRx-derived
filesystems. To do the job right is a ground-up redesign of the
filesystem, BIO, kernel interfaces, drivers, and supporting utilities.
It would take me and two or three junior helpers probably a year to
complete to production-release status.

I am not ready to starve for that long and sacrifice my house and
kids (and education) to put a work of this size into the public domain.
I'd put the core work into a RAID controller and attempt to make
a product out of it first. Any other path requires a system
vendor willing to make major changes to the base OS, port
all drivers and other filesystems (at SCO this is a BIG DEAL),
AND make life tough for a bunch of third parties that will have
to do the same. When I was at SCO they had just introduced SCO/UNIX
and had forced much of the XENIX customer base to make the big
change with them. Every five years is about as often as a
major vendor can afford a wrinkle of this size for customers.

After SCO I discussed this with both NCR and Compaq
to see if major functionality improvements would be
an incentive to customize the SCO product. SCO is
a STANDARD in their market ... and not to be fiddled
with. Given the NIH factor and no existing relationship
I didn't see any practical way to get DEC, Sun, or USL
to fund the project. I don't think there are any other
UNIX vendors left besides these that could afford to
pay for it.

The only other long shot is to do the work in FreeBSD,
or in the Lite release when it comes out, and make a product out of it. I
would have to think long and hard about that, and have been.

>Am I right that a priority-based buffer cache would be sufficient
>to get the characteristics you need for the reliability and security
>requirements?

As I discussed elsewhere, a simple priority approach doesn't
quite work ...

John

Burkhard Neidecker-Lutz

unread,
Jun 11, 1994, 1:57:30 PM6/11/94
to
In article <idletimeC...@netcom.com> idle...@netcom.com (Totally Lost) writes:
>After a wild strike three Burkhard Neidecker-Lutz is OUT!
>

Like the T-1000, he magically reforms and comes back to finish off
idletime... :-).

>[The excitement builds, he tells the UNIX crowd they deserve corrupted data,
>setting the stage to introduce some vaporware he has stock in ... ]
>

If by vaporware you mean Advfs, better duck and cover. If you mean
by vaporware "UFS is safe", rejoice.

>[So he backs off from UNIX users deserve corrupted data to they don't really
>need correct data, and trying to save the day makes the standard side pitch that
>for anybody wanting correct data VMS will solve all UNIX users' ills ...

I brought in VMS to show that doing all I/O uncached
can introduce ugly performance problems. I didn't suggest
VMS would cure any UNIX ills. I never said UNIX users deserve
corrupt data.


>with "thank you, Digital does understand production systems". I guess this
>means Ultrix (DEC's unix offering) is not a production system ;-) ]

Ultrix indeed is not a production system (too close to BSD).

Both DEC OSF/1 and VMS are.

> <warms up story of VMS not being UNIX around 1989>

a) if you don't like VMS, don't use it.
> b) today's VMS is a lot more like UNIX (it even has XPG/3
> branding, aside from almost all of POSIX).
c) I don't like VMS, too :-)

>John: What I (and customers should) really demand is operational safety and
>John: security.

Which I never disagreed to.

> <lots of detailed discussion of code paths that lead to data showing
> up deleted>

I looked up the relevant source modules of DEC OSF/1 and while the window
is made smaller, it seems to be there as well for UFS in the non-MLS
version (i.e. there can be reordering at the driver level). One point for you.

Advfs does not suffer from the same problem because it keeps an ordered
log of what metadata operations have happened. Regardless of the ordering
of writes, the replay of the log after a crash makes sure that the
right thing happens (including the missed block zeroing).

So while you are right for BSD UFS, a modern UNIX file system such as
Advfs does it right. And that's not vaporware, that's included without
additional fee with every DEC OSF/1 shipped since V1.3 (i.e. since mid-1993).
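
For readers who haven't seen the technique, a rough, generic sketch of
replaying such an ordered metadata log after a crash follows. The record
layout and names are made up for illustration; this is not the Advfs (or JFS)
on-disk format:

#include <stddef.h>
#include <stdint.h>

/*
 * Each metadata operation is appended to the log *before* the affected
 * metadata may reach its home location on disk.  After a crash, the
 * surviving records are re-applied in order, so the metadata ends up
 * consistent no matter how the in-place writes were sorted.
 */
enum lop { LOP_ALLOC_BLOCK, LOP_SET_POINTER, LOP_ZERO_BLOCK, LOP_COMMIT };

struct lrec {
    uint64_t lsn;     /* monotonically increasing sequence number     */
    uint32_t op;      /* one of enum lop                              */
    uint64_t target;  /* block or inode the operation applies to      */
    uint64_t value;   /* e.g. the block number being pointed at       */
    uint32_t crc;     /* checksum; a bad crc marks the torn log tail  */
};

void
replay(const struct lrec *log, size_t nrec,
       int (*crc_ok)(const struct lrec *),
       void (*apply)(const struct lrec *))
{
    uint64_t last = 0;

    for (size_t i = 0; i < nrec; i++) {
        if (!crc_ok(&log[i]) || log[i].lsn <= last)
            break;        /* hit the torn tail or stale records */
        apply(&log[i]);   /* operations must be idempotent      */
        last = log[i].lsn;
    }
}

As the follow-up below points out, such a metadata log by itself says nothing
about whether the file data those operations describe ever reached the disk.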

Burkhard Neidecker-Lutz

unread,
Jun 11, 1994, 2:06:11 PM6/11/94
to
In article <idletimeC...@netcom.com> idle...@netcom.com (Totally Lost) writes:
>I think the "three strikes and Burkhard is out" posting made it clear how
>the UFS filesystem can leave trash data in a user's file at a crash.

Yes...

>All systems with filesystems based upon V6/V7/SVR3/BSD have the problem.
>This is nearly every system ever shipped.

So count out AIX and DEC OSF/1.

>Fixing the problem means major changes to the filesystem/BIO/driver
>relationships, key data structures, and key kernel internal programming
>interfaces including driver and FFS/VNODE interfaces.

Yes.

>Twice I have come close to implementing this.

> So what twice could have been a major
>UNIX event, is still a dream and code fragments in my lab.

A reality in DEC OSF/1.

>The various LFS-style filesystems have the promise of reliability,
>but the implementation tradeoffs are performance-crippling for most
>desktop systems smaller than the Sprite-sized machines the work
>was done for. Locality of data is severely compromised. At last
>winter's USENIX WIP session I gave a wake-up call talk on part of these issues.

You don't need to do a log-structured file system to get the
reliability associated with keeping a log. Advfs (and as far as
I know, JFS in AIX) are not log-structured.

>Every production machine I see is killed by UNIX filesystem I/O
>running at 10-20% of what it should be ... by filesystem designers
>that insist on using a horse and buggy as the prototype for a space
>ship. The recent software bloat caused by X/Motif applications
>continues the pressure on the I/O subsystem; combined with
>incredibly faster processor technology, the pressure to
>replace or rearchitect UNIX will continue into the 90's.
>
>As with my comments about Raw I/O in comp.os.research, the critical
>problem is people attempting to continue to use outdated decisions
>without re-evaluation of the assumptions and tradeoffs involved.
>The current UNIX filesystem architecture is critically flawed
>on all major fronts - performance, reliability and security - and
>lacks key features of the main frame market it replaces.
>OS work today is done mostly by follow the herd, critical thinking
>is a lost art.

Maybe in the UNIX Sys V crowd, not here.

Totally Lost

unread,
Jun 11, 1994, 9:37:13 PM6/11/94
to
Ok ... I think this thread is dead. I set the followup to comp.unix.wizards
for lack of a better idea (a group I don't normally read, but will
subscribe to again for a short while) ... comp.benchmarks is not a good
place (which I read), and neither are the linux or sun groups (which I don't).

I apologize for the last couple of badly hacked-up postings; I was using
vi over cu through rlogin and was not aware of how bad the character drops were
when doing X cut and paste in the vi window. (My window was fine,
the end result sure wasn't ...) This post is being created the same
way ... I will try to be more careful.

>>After a wild strike three Burkhard Neidecker-Lutz is OUT!
>
>Like the T-1000, he magically reforms and comes back to finish off
>idletime... :-).

Good job Burkhard; without a skeptic this thread would have passed
right over people's heads. Maybe a few stopped to think about it
along the way.

>If by vaporware you mean Advfs, better duck and cover. If you mean
>by vaporware "UFS is safe", rejoice.

Vaporware is a safe modern UFS.

> I never said UNIX users deserve corrupt data.

You're right :) but what did you then mean by:

"Whoever writes sensitive data through the normal file system
with normal semantics deserves to get corrupted data."

Given that 99.99% of unix users only have a V6/V7/SVRx/BSD filesystem,
which most people would equate to "normal filesystems", the
paraphrase is factually correct, although slightly
imprecise since a very small set of users may have a filesystem
with correct operation in this case.

>Advfs does not suffer from the same problem because it keeps a ordered
>log of what metadata operations have happened. Regardless of the ordering
>of writes, the replay of the log after a crash makes sure that the
>right thing happens (including the missed block zeroing).

I suggest that if the user data is missing from the "ordered log", as
well as disk completion events, then it is impossible to reconstruct
what the correct data is ... and in my mind, while zeroing questionable
data may correct the security violation, it does not leave the
data correct. Unless the recovery tool tells the admin and user
that it cleared a section of the file, or that a file was open
and written but not closed, then the user still cannot trust
any data near the time of the crash.


>So while you are right for BSD UFS, a modern UNIX file system such as
>Advfs does it right. And that's not vaporware, that's included without
>additional fee with every DEC OSF/1 shipped since V1.3 (i.e. since mid-1993).

Please specifically address the above points before I can accept
your assessment of "right". I rather suspect that Advfs is not,
just judging by certain other track histories.

Can some other reader please address the same questions for the
AIX JFS?

We then can discuss "modern UNIX file system such as" at some
other point .... ;)

Make sure all you volume buyers out there call up your UNIX vendor
and ask this question ... Is my system data secure after a crash?
And be a bit questioning about the salesman's answer :)

> You don't need to do a log-structured file system to get the
> reliability associated with keeping a log. Advfs (and as far as
> I know, JFS in AIX) are not log-structured.

What I forgot to say, or should have said, is that I think you can get
LFS performance or better on small systems without implementing
a log at all, and still have reliability. And in fact I believe
designs based on the log-structured approach are fundamentally
flawed due to the absence of locality, as is UFS in its cylinder
group balancing.

> >The current UNIX filesystem architecture is critically flawed
> >on all major fronts - performance, reliability and security - and
> >lacks key features of the main frame market it replaces.
> >OS work today is done mostly by follow the herd, critical thinking
> >is a lost art.
>
> Maybe in the UNIX Sys V crowd, not here.
>

> Burkhard Neidecker-Lutz
>
> Distributed Multimedia Group, CEC Karlsruhe
> Advanced Technology Group, Digital Equipment Corporation
> nei...@nestvx.enet.dec.com

as heralded by the great state of the midwest ... Show Me!

and remember ... It's not what you think you know that counts,
It's what you don't know about what you think you know that does!

and you can quote me!

have fun ...
John

Matthew Dillon

unread,
Jun 11, 1994, 4:08:22 PM6/11/94
to
In article <33f...@qnx.com> wafl...@qnx.com (Bill Flowers) writes:
:In article <idletimeC...@netcom.com>,

:Totally Lost <idle...@netcom.com> wrote:
:>In article <1994Jun6.0...@wavehh.hanse.de>,
:>Martin Cracauer <crac...@wavehh.hanse.de> wrote:
:>>SunOS-4.1.3 has a policy to have asynchronous writes for data, but
:>>...
:>
:>Both of these policies are stupidly wrong from a security stand point,
:>...
:
:The first policy does open a security hole, but U**x systems are not
:...

This whole thread is silly. You people are talking about security
concerns related to a tiny window that may occur during a crash.

I submit that the argument is centered around the wrong thing. The
problem with most UNIX's today is really their monolithic nature.
Too much junk is being shoved into a common supervisor address space
the result of which is that a single bug in a single subsystem can
corrupt or crash the entire core.

Before you throw all your resources at this filesystem 'bug', you
should realize that the problem is much greater than the one little
side effect that rarely occurs in UFS-based filesystem designs. I,
personally, would rather throw my resources into solving the basic
problem -- the kernel design.

For the record, I believe that something like Mach is a good first
attempt, but personally I think the designers concentrated on all the
wrong things. I'm amazed at how much unnecessary locking is done in
Mach... people don't seem to grasp the fact that the critical path in
FIFOs and queues can *easily* be implemented without locking. For
example, a reentrant queuing routine for an interrupt subsystem can
generally be implemented like this:

addq.l #4,QueuePointer ; allocate position somewhere in Q
move.l QueuePointer,A0 ; top down address pointer
loop tst.l -(A0) ; find unused position
bne loop ; (critical path: no looping occurs)
move.l D0,(A0) ; write data to queue

The key to divorcing low-level device drivers from a monolithic kernel
is task switching overhead, but I have yet to see anybody seriously
address the issue. I see no reason why you wouldn't be able to get the
same or better efficiency out of a filesystem by having a process (i.e.
protected address space) for each mounted partition and a process
for the low-level device driver. The result of such a scheme is the
ability to eliminate most of the semaphores and other locking mechanisms
that would otherwise be required in a monolithic design. Most importantly,
the critical 'bug' path goes from the 600K monolithic kernel to the 60K
microkernel core, the 20K filesystem process, and the 5K device driver
process... even if your system 'crashes' to a user-unusable state, the
chances of filesystem corruption are drastically reduced.

Considering that the complexity of kernels goes up every year, this is
the only real solution. The fact that a microkernel-based environment
can better take advantage of a multi-processor situation is just icing
on the cake.

-Matt


--

Matthew Dillon dil...@apollo.west.oic.com
1005 Apollo Way ham: KC6LVW (no mail drop)
Incline Village, NV. 89451 Obvious Implementations Corporation
USA Sandel-Avery Engineering
[always include a portion of the original email in any response!]

Totally Lost

unread,
Jun 12, 1994, 9:35:42 AM6/12/94
to
In article <2td5jm$1...@apollo.west.oic.com>,

Matthew Dillon <dil...@apollo.west.oic.com> wrote:
>
> This whole thread is silly. You people are talking about security
> concerns related to a tiny window that may occur during a crash.

First, it is not a tiny window on most UFS implementations. Secondly,
I did not make the point that it almost always exists in one file or
another, due to both sync metadata updates and the fact that, since the
metadata is allocated before the data it references, it will mostly sort
ahead of the data in the disk queue. The fact that most data is not human-
readable, and that most people who find trash don't bother to figure out
what it is, obscures the problem. If your point of view is a workstation
with a single user, you are right ... since the I/O system is idle
98% of the day. If you are talking about any server or multiuser system that
is really used to update files/databases, then the window is open nearly
all the time.

As for the rest of your beef, please take it where it belongs and don't
attach it to other problems for argument's sake. I addressed a valid
major bug that needs to be fixed ... we can take the other design issues
to comp.os.research (a moderated forum) so they can be discussed without
needless flaming.

> ... I, personally, would rather throw my resources into solving the
> basic problem -- the kernel design. [then goes on about using mach
> and micro-kernels]

Jump in, then think (??) ... In bringing out this other, greater problem
you put the cart before the horse ... so to speak, and the resulting
control problems are near impossible ...

"Performance by Design" is my motto ... the basic problem is System Design,
which includes Applications, Kernel API, the Operating System, and the
host hardware platform. Software bloat, creeping featurism, poor partitioning
of the overall design (application, OS and hardware components) and more
contribute to the problem. Your pointing to a little hole in the dam and
saying "quick, put a rock here" (use a microkernel) ignores the structural
problems behind the hole.

> process... even if your system 'crashes' to an user-unusable state, the
> chances of filesystem corruption are drastically reduced.

And this is the shithead point of view I opened this talk with ... accepting
that filesystem corruption is a valid tradeoff is wrong and only leads to
less optimal designs, both from a performance and a security viewpoint.

Personally I don't want anyone with this view anywhere near a kernel.

John

John F. Haugh II

unread,
Jun 12, 1994, 6:53:07 PM6/12/94
to
In article <2t6u8d...@usenet.pa.dec.com> nei...@nestvx.enet.dec.com (Burkhard Neidecker-Lutz) writes:
>In article <DHOLLAND.9...@husc7.harvard.edu> dhol...@husc7.harvard.edu (David Holland) writes:
>>Now, after recovery and reboot, file X contains some blocks that used
>>to be in file Y... which still contain the data from file Y.
>>
>>Security breach.
>
>Maybe on UNIX V6 15 years ago.

Two nits -- V6 was more than 15 years ago, and the blocks were zero
filled on allocation the same way as they are now. The function in
traditional UNIX systems which allocates blocks is alloc(). Anyone
with a copy of Lions' book can verify my statement.
--
John F. Haugh II [ NRA-ILA ] [ Kill Barney ] !'s: ...!cs.utexas.edu!rpp386!jfh
Ma Bell: (512) 251-2151 [GOP][DoF #17][PADI][ENTJ] @'s: j...@rpp386.cactus.org
There are three documents that run my life: The King James Bible, the United
States Constitution, and the UNIX System V Release 4 Programmer's Reference.

John F. Haugh II

unread,
Jun 12, 1994, 6:59:07 PM6/12/94
to
In article <DHOLLAND.94...@husc7.harvard.edu> dhol...@husc7.harvard.edu (David Holland) writes:
>Are they? Is this *guaranteed*? Is there anything that insures that
>these zeros are *written out* before the blocks can be reused? Suppose
>the crash occurs after the old file's cleared inode has been written,
>but before the zeroed blocks have been written out? Then a second
>crash, at the wrong time, could have this same effect. Although I
>suppose in this case fsck could take care of the problem.

The 6th Edition alloc() function wouldn't return the block number
until the block had been zero'd. That is how strong the guarantee is.

-candee-+Kovach K.R.

unread,
Jun 13, 1994, 8:08:56 AM6/13/94
to
>The current UNIX filesystem architecture is critically flawed
>on all major fronts - performance, reliability and security - and
>lacks key features of the main frame market it replaces.
>OS work today is done mostly by follow the herd, critical thinking
>is a lost art.
>
>Either Novell and the key players need to get the clue, or UNIX
>will be replaced in the passing of time (the 90's).

Can we get a posting by one or more of the authors of prior postings
of references to work that solves some or all of the problems of
performance, reliability and security?

I would like to get a clue, but have been at least partially blinded
by the fact that UNIX file systems and UFS are generally what is taught
and held up as examples of how to do file systems. So in particular
I would like to see how other OS's solve the problem.


Kurt Kovach "My opinions are my own."
Novell, Summit

Andrew Bray

unread,
Jun 12, 1994, 12:42:04 PM6/12/94
to
In article <2td5jm$1...@apollo.west.oic.com> dil...@apollo.west.oic.com (Matthew Dillon) writes:
> addq.l #4,QueuePointer ; allocate position somewhere in Q
> move.l QueuePointer,A0 ; top down address pointer
>loop tst.l -(A0) ; find unused position
> bne loop ; (critical path: no looping occurs)
> move.l D0,(A0) ; write data to queue

I hope you aren't advocating this as useful code.

This has a glaring race condition which can cause significant data loss
in the presence of multiple processors or interrupts.

I hope you now post a code fragment that can achieve your aim without
introducing locking in some form.

If you hadn't quoted this code fragment I would have actually given
your argument some weight.

Andy
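
For what it's worth, the lock-free intent of the fragment can be salvaged on
hardware with an atomic fetch-and-add: hand each producer a unique slot index
up front instead of scanning for a free position afterwards. A minimal C11
sketch, with hypothetical names and no full-queue or wrap-around handling:

#include <stdatomic.h>
#include <stdint.h>

#define QSIZE 1024

static _Atomic uint32_t  q_head;            /* producer reservation counter */
static _Atomic uintptr_t q_slots[QSIZE];    /* 0 means "slot empty"         */

/*
 * atomic_fetch_add gives every concurrent producer a distinct index, so
 * two producers (or an interrupt and the code it interrupted) can never
 * race for the same slot the way the "bump, then scan for an unused
 * position" fragment above can.
 */
static void
queue_put(uintptr_t item)
{
    uint32_t idx = atomic_fetch_add_explicit(&q_head, 1,
                                             memory_order_relaxed) % QSIZE;
    atomic_store_explicit(&q_slots[idx], item, memory_order_release);
}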

Nigel Gamble

unread,
Jun 11, 1994, 12:26:07 PM6/11/94
to
In <Cr5HE...@ucdavis.edu> ke...@frobozz.sccsi.com (Kevin Brown) writes:
>It's a question of probabilities, and of minimizing damage. If
>the power suddenly fails, that will surely cause as definitive a
>crash as would bad hardware. But a power failure is much more
>likely to occur. When it does, you would *prefer* that the damage
>to the filesystem be minimized, as long as you don't take a
>significant performance hit to get it.

If you are concerned about power failure, the way to survive with
no corruption and no performance hit is to use a UPS. Why would you
want to put any performance hit in the filesystem when there is a
better way to address the power fail problem?

Cheers,
Nigel (happily attached to his Best Fortress UPS)
--
Nigel Gamble ni...@gate.net
Boca Raton, FL, USA.

David Holland

unread,
Jun 13, 1994, 10:45:52 AM6/13/94
to

> The 6th Edition alloc() function wouldn't return the block number
> until the block had been zero'd. That is how strong the guarantee is.

That would do it.

...as long as it's done synchronously and the cache doesn't trap it.


Btw, your news header is missing the domain name.

Ken Pizzini

unread,
Jun 13, 1994, 4:02:34 PM6/13/94
to
In article <1994Jun12.225907.12160@rpp386>,

John F. Haugh II <j...@rpp386.cactus.org> wrote:
>In article <DHOLLAND.94...@husc7.harvard.edu> dhol...@husc7.harvard.edu (David Holland) writes:
>>Are they? Is this *guaranteed*? Is there anything that insures that
>>these zeros are *written out* before the blocks can be reused? Suppose
>>the crash occurs after the old file's cleared inode has been written,
>>but before the zeroed blocks have been written out? Then a second
>>crash, at the wrong time, could have this same effect. Although I
>>suppose in this case fsck could take care of the problem.
>
>The 6th Edition alloc() function wouldn't return the block number
>until the block had been zero'd. That is how strong the guarantee is.

To *disk*?

--Ken Pizzini

Martin Cracauer

unread,
Jun 13, 1994, 5:18:30 AM6/13/94
to
torv...@cc.Helsinki.FI (Linus Torvalds) writes:

>In article <1994Jun6.0...@wavehh.hanse.de>,
>Martin Cracauer <crac...@wavehh.hanse.de> wrote:
>>
>>SunOS-4.1.3 has a policy to have asynchronous writes for data, but
>>inodes and other superinformation are always written immediately. So, a
>>crash could only affect files that are written at that moment. I think
>>Solaris 2 does this, too. Does anybody know?
>>
>>There was a patch for SunOS 4.1.3 to make the BSD filesystem write
>>inodes async, too. That speeds up writing a large number of little
>>files by a factor of 2 to 3. Of course, a crash could really hurt now
>>that superinformation could be damaged.
>>
>>I would really like to know what the standard behaviour of Linux is. If
>>someone knows, please tell us.

>The standard behaviour for linux is to write everything asynchronously,
>as others have already pointed out. I'd just like to put in my two cents
>for why this is done:

> - synchronous writes are slow. You'll lose *lots* of performance. I
> tend to think that you can trust the hardware, and just ignore the
> minor problems you can get with asynchronous writes - the benefits
> far outweigh the problems IMNSHO.

I agree and I drive my Sun's with everything asynchronous, too.

> - doing synchronous writes on meta-data is broken: you'd really need to
> do synchronous writes on data too to be safe. BSD does metadata
> synchronously to give you a sense of security and ignores the actual
> file data -- they too did a trade-off in efficiency and security.
> They just did a better job at trying to fool people into thinking
> it's a good idea..

I think the difference is that with sync inodes only files that are
actually being written when the crash happens can be affected. With
asynchronous metadata other files can be lost, too.

Of course, this only counts for crashes that don't happen while the
kernel is actually syncing the buffer cache, so I agree that it's not worth
the overhead to write anything synchronously.
--
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
Martin....@wavehh.hanse.de, Fax. +49 40 5228536, German language accepted
No guarantee for anything. Anyway, this posting is probably produced by one
of my cats stepping on the keys. No, I don't have an infinite number of cats.

Dan Swartzendruber

unread,
Jun 14, 1994, 7:33:00 AM6/14/94
to

No, why does that matter? When you alloc a block from the free list,
before using it for any file, you zero it.

--

#include <std_disclaimer.h>

Dan S.

David Holland

unread,
Jun 14, 1994, 8:12:55 AM6/14/94
to

dsw...@pugsley.osf.org's message of 14 Jun 1994 11:33:00 GMT said:

> >>The 6th Edition alloc() function wouldn't return the block number
> >>until the block had been zero'd. That is how strong the guarantee is.
> >
> >To *disk*?
>
> No, why does that matter? When you alloc a block from the free list,
> before using it for any file, you zero it.

Haven't you been following the thread? If you don't flush those zeros
*to the disk* before using the block, a badly-timed system crash can
cause the UNZEROED blocks to appear in the new file - containing who
knows what kind of private data.

John Hascall

unread,
Jun 14, 1994, 2:32:41 PM6/14/94
to
David Holland <dhol...@husc7.harvard.edu> wrote:
}dsw...@pugsley.osf.org's message of 14 Jun 1994 11:33:00 GMT said:
} > >>The 6th Edition alloc() function wouldn't return the block number
} > >>until the block had been zero'd. That is how strong the guarantee is.
} > >To *disk*?
} > No, why does that matter? When you alloc a block from the free list,
} > before using it for any file, you zero it.

}Haven't you been following the thread? If you don't flush those zeros
}*to the disk* before using the block, a badly-timed system crash can
}cause the UNZEROED blocks to appear in the new file - containing who
}knows what kind of private data.

I suppose one could have a system where a disk had 2 free
lists, 1 zeroed, 1 not. Blocks could only come from the
zeroed list, and the system could zero blocks in idle time
(moving them from one list to the other).

John
--
John Hascall ``An ill-chosen word is the fool's messenger.''
Systems Software Engineer
Project Vincent
Iowa State University Computation Center + Ames, IA 50011 + 515/294-9551
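
A rough sketch of that two-free-list idea, with hypothetical structures (real
code would also have to persist both lists and throttle the idle-time work):

#include <stddef.h>

struct fblk {
    struct fblk *next;
    long         blkno;
};

struct freelists {
    struct fblk *zeroed;   /* safe to hand out to new files           */
    struct fblk *dirty;    /* freed blocks still holding old contents */
};

/*
 * Run from an idle loop or a low-priority kernel thread.  The zeros must
 * reach the platter before a block moves to the zeroed list, otherwise
 * the stale-data window discussed above simply reopens.
 */
void
zero_some_blocks(struct freelists *fl, void (*write_zeros_sync)(long blkno))
{
    struct fblk *b;

    while ((b = fl->dirty) != NULL) {
        fl->dirty = b->next;
        write_zeros_sync(b->blkno);
        b->next = fl->zeroed;
        fl->zeroed = b;
    }
}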

Dan Swartzendruber

unread,
Jun 14, 1994, 5:18:38 PM6/14/94
to
In article <DHOLLAND.94...@husc7.harvard.edu> dhol...@husc7.harvard.edu (David Holland) writes:
>
>dsw...@pugsley.osf.org's message of 14 Jun 1994 11:33:00 GMT said:
>
>Haven't you been following the thread? If you don't flush those zeros
>*to the disk* before using the block, a badly-timed system crash can
>cause the UNZEROED blocks to appear in the new file - containing who
>knows what kind of private data.

True. I was thinking of another scenario and confused the two.

Totally Lost

unread,
Jun 14, 1994, 10:45:17 PM6/14/94
to
In article <2tkt49$5...@news.iastate.edu>,

John Hascall <jo...@iastate.edu> wrote:
> I suppose one could have a system where a disk had 2 free
> lists, 1 zeroed, 1 not. Blocks could only come from the
> zeroed list, and the system could zero blocks in idle time
> (moving them from one list to the other).
>

A file corrupted with zeros is still corrupted ... not a security
risk, unless the file is critical to the use of a system and causes
another form of fault that exposes you.

The point is that file corruption, in particular undetectable
file corruption, is not necessary.

John

Adam Sweeney

unread,
Jun 15, 1994, 11:29:49 AM6/15/94
to
>A file corrupted with zeros is still corrupted ... not a security
>risk, unless the file is critical to the use of a system and causes
>another form of fault that exposes you.
>
>The point is that file corruption, in particular undetectable
>file corruption, is not necessary.
>
>John

John,

So far your proposal seems to have been to order all writes
to file data before any of the meta-data writes which will make
that data permanent. While that will probably work for newly
written files, it doesn't do much for the consistency of
multi-block updates to an existing file unless you go to a
no-overwrite scheme and commit all of meta-data atomically.

It seems to me that what you want is transactional file update,
where both the meta-data and the data are protected by the
transaction. Is that accurate? I think this would be very
useful functionality for the examples you've given, i.e. mail
files, log files, password files. I wouldn't go so far as
to say that all file updates require such strict behavior.

For example, we had a customer here at SGI who was very upset
that EFS would eat their 300 MB log file when they crashed.
Well, it ate their file because EFS was using a scheme very
similar to what you've been proposing. The inode would not
go to disk until all of the data in the file had. Since they
never stopped writing the file long enough for this to happen,
the inode never went to disk. We've changed the behavior of
EFS now so that the parts of the file which have made it to
disk become permanent, but this is just an example of how
your all-or-nothing argument does not satisfy everyone.

In my opinion, what would be really nice would be multifile
transactional file update. Of course, this requires a
concurrency control mechanism and all that, but it would
be handy. What I see at work, however, is that customers
are not asking for this. They want huge files with huge
data rates to and from the files. They want huge file systems
that don't take days to bring back online. Nobody asks
for perfect data consistency semantics.

Oh well, maybe someday.

Adam Sweeney
MTS
Media Systems Division
Silicon Graphics, Inc.

Totally Lost

unread,
Jun 15, 1994, 8:07:07 PM6/15/94
to
In article <2tn6pd$5...@fido.asd.sgi.com>,

Adam Sweeney <a...@spareme.engr.sgi.com> wrote:
>>A file corrupted with zeros is still corrupted ... not a security
>>risk, unless the file is critical to the use of a system and causes
>>another form of fault that exposes you.
>>
>>The point is that file corruption, in particular undetectable
>>file corruption, is not necessary.
>
>So far your proposal seems to have been to order all writes
>to file data before any of the meta-data writes which will make
>that data permanent. While that will probably work for newly
>written files, it doesn't do much for the consistency of
>multi-block updates to an existing file unless you go to a
>no-overwrite scheme and commit all of meta-data atomically.

Actually there is a subtle difference between what I ask for/require
and how you interpret it. I require that all data be written prior
to the metadata that points to it ... period. This doesn't imply
that all data be written before any metadata for the file,
and it certainly allows the filesystem to checkpoint along the
way with minimal performance loss, and a huge gain over the current
sync write of all metadata.

>It seems to me that what you want is transactional file update,
>where both the meta-data and the data are protected by the
>transaction. Is that accurate? I think this would be very
>useful functionality for the examples you've given, i.e. mail
>files, log files, password files. I wouldn't go so far as
>to say that all file updates require such strict behavior.

I clearly stated that while I think multifile transactional
updates are a very good thing for any standard unix platform,
I was not mandating that this proposal be tied to it.
My point is that async ordered writes as proposed are faster than
async data plus sync metadata as currently practiced by nearly all
filesystem designs. This applies to all types of files.

I will go so far as to say that if a file contains any data other
than what was written at the crash point, it will greatly increase the
recovery application's work for that file type. I will go on
to say it should be required of any filesystem to indicate
which files may be corrupt after a crash, as a minimally
required function to aid the sysadmin and production staff
in getting the system consistent again short of rolling back
to a known checkpoint.

>
>For example, we had a customer here at SGI who was very upset
>that EFS would eat their 300 MB log file when they crashed.
>Well, it ate their file because EFS was using a scheme very
>similar to what you've been proposing. The inode would not
>go to disk until all of the data in the file had. Since they
>never stopped writing the file long enough for this to happen
>the inode never went to disk. We've changed the behavior of
>EFS now so that the parts of the file which have made it to
>disk become permanent, but this is just an example of how
>your all or nothing argument does not satisfy everyone.

This case is not necessary and represents a short-sighted view
of the requirements. See above ...

I never made an all-or-nothing argument ... please quote what
you think forms this requirement.

>
>In my opinion, what would be really nice would be multfile
>transactional file update. Of course, this requires a
>concurrency control mechanism and all that, but it would
>be handy. What I see at work, however, is that customers
>are not asking for this. They want huge files with huge
>data rates to and from the files. They want huge file systems
>that don't take days to bring back online. Nobody asks
>for perfect data consistency semantics.

Ditto on transactional file updates, as said here and elsewhere.
However, your market segment has biases in file activity
that are not mirrored in 99% of the remaining market - as does
one of my clients in the medical imaging business that uses
your equipment. That aside, 99% of that same client's in-house
work on SGI development stations follows normal file usage
distributions where the ordered write issue is important,
as do a number of other production usages.

Nobody asks ... probably true ... I think most assume (falsely) that
the reliability exists and seldom see a failing case that would worry
them. Being enlightened, you might say, I see many unix
applications that scare the shit out of me, knowing
what can (and in several cases has) happened to critical
data that people's lives depend on.

All the pharmacy systems running on SCO XENIX/UNIX and other PC UNIXes
are time bombs from my point of view ... going to kill somebody
or leave them with unnecessary damage.

>Oh well, maybe someday.

I'm hoping soon, like one release life cycle away at most.
In talking with several DEC guys over this, with AdvFS they hit
damn close and will probably get to market with a safe
filesystem before the rest of you guys.

>Adam Sweeney
>MTS
>Media Systems Division
>Silicon Graphics, Inc.


Since comp.os.research doesn't seem to want to handle this thread
-- my last two postings with comp.os.research included seemed to
get /dev/null'ed -- I'll go ahead and follow it here for lack of
a better place.

One of my side arguments is that a good filesystem design would
make RAID boxes a white elephant ... instead, filesystem people
at places like SGI seem to want to institutionalize them. Any OS
on a fast enough piece of iron should be able to run circles around
any RAID box, since an OS with a CORRECT cache management policy
should leave an uncacheable stream visible to the RAID box.
That would make any cache memory in the RAID box over and above
flow buffering a total waste.

Instead I feel stupid as hell having to recommend a RAID box
for my client's SGI systems because you guys don't have it
together yet.

For a look into my view of how a filesystem should look, try
the following, which I prepared as a first deliverable to SCO
on contracts that never materialized. While I held off giving
them the document to slow down the Not-Invented-Here We-can-do-better
game, each of the facts below was presented to one or more of
the SCO team at Santa Cruz. Kip, Brian, Dean, Jeff, Michael Taht
and a few others should remember the lectures, as will several of my
former staff that worked on the Altos port.

The core parts of this design happened 1981 through 1988, then were refined
at SCO during 1989 to 1992. I dusted it off in Jan this year while
trying to reopen this with the SCO London team, but they were not
serious enough to proceed past a friendly handshake. Changes of
this degree take some serious kernel hacking. More than just an
idle weekend's play.


----------------------------- Cut here -------------------------------


Exploring DFS (DMS Design File System)

and

DFS Functional Specification

John L. Bass
DMS Design

Aug 22, 1990
(Revised Jul 10, 1991)
(Revised Jan 14, 1994)

Overview

Current file system designs are crippled by poor locality, excessive CPU
requirements, outdated architecture, excessive rotational latency losses,
excessive seek time, and poor reliability. This paper brings together the
architecture and implementation problem areas that cripple current file
systems and disk subsystems, along with innovative new approaches to
those problems.

These concepts and goals include:

Replacing poor data management strategies for volumes, logical
volumes, partitions, mirroring, striping, and RAID parity with a
unified multiple-spindle file system that has built-in smart
replacements for these policies on a file-by-file basis.

Replacing block-at-a-time with file-at-a-time strategies, which
include semi-contiguous extent-based allocations of memory and disk
space to reduce allocation overheads. Strict ordering of meta data
and file data writes provides crash security and reliability.

Replacing early binding (allocating disk addresses when buffers are
allocated) with late binding (allocating disk addresses just before
writing the data) to improve locality and disk usage.

Replacing existing file system, BIO, and driver interfaces to
implement these new policies, rearrange policy/functionality, and
provide new layered standardized device interfaces with simplified
control paths between the file system and drivers.

Replacing existing FIFO block and inode caches with heuristic
statistical caches to greatly improve cache performance and memory
utilization. Usage profiles are gathered per file and directory.

Reducing typical disk I/Os per small-to-medium-sized file accessed
to an average of two or less, including directory operations, for a
busy system. A combination of caching, clustering, and revised
on-disk data structures will be used to achieve this goal.

Replacing current disk allocation and sorting policy to reduce the
total seek and rotational latency time for drives by 80% or more.


Policies for Disk, Logical Volume and Partition Management

Under DFS, multiple disks, each with a single active partition, will
contain independent filesystems, treated as a unified parallel file system.
Per-file replication and parity group selection replaces partition
mirroring and RAID parity requirements as needed for reliability.
Per-file striping allows key large files to span disks, load balance,
and achieve higher throughput with concurrent multiple disk accesses. An
integrated backup and data migration manager transparently handles
block-addressable tapes and removable disks to extend the filesystem space,
especially where an automated changer (jukebox) is available. Sequential
tapes can be used for traditional serial archives, which can also be
transparently mounted as filesystems, albeit very slowly. Automatic
compression is also a per-file/directory selectable attribute.

Current high-level data management policy is to subdivide disks into
multiple partitions. These partitions may then be combined with mirroring,
striping or simple concatenation to form logical partitions used for
file systems and swap or paging space. The reliability of striped
partitions can be improved by using mirroring or parity, available in
several different forms (aka RAID), to offset the reliability problems
associated with spanning file systems across multiple drives. The per
file/directory selectable attributes in DFS, combined with the automatic
heuristics used in DFS data migration, obsolete these older high-level
data management policies.

Use of multiple active partitions introduces severe performance losses,
caused by excessive seeking and reduced locality of data compared to a
single-partition design. Partitions also fragment and distribute disk free
space, causing increased free-space management problems, including the
inability to handle large files even though enough distributed free space
is available.

|-------------------------- single disk ---------------------------|
|----- Partition 1 -----| |---- Partition N -----|
inodes files freespace inodes files freespace

|---------------- wasted seeks --------------|
|------------wasted seeks ------------------|

With this typical disk layout, long, expensive, and unnecessary seeks
are generated by using partitions, which can be avoided with a single
partition spanning the drive. There is also no incentive to develop
file system strategies that enhance locality, since the volume/partition
management interferes with locality optimization.

Traditional enhanced UNIX filesystems like Berkeley's UFS and LFS
specifically destroy locality. UFS does this with load-balancing cylinder
groups. LFS does it by the very nature of using the entire disk as a
rolling log. Except for certain educational servers with small quotas,
this policy is both wrong and worse than a traditional UNIX filesystem
(especially where bit-mapped free lists are used).

Under DFS, active heuristics and data migration policies combine to
highly localize active disk regions, nearly eliminating wasted
rotational latency and seeks over unused data. A typical DFS
partition is organized as:

|-------------------------- single disk ---------------------------|
Very Active Files Stale Files Archived Files

where Very Active Files are files recently created or accessed. These
files are typically automatically replicated for reliability and
enhanced retrieval performance. Files that have a continuously
high access rate may be replicated multiple times, both on the
same spindle and across spindles, to enhance retrieval times. Files
that have a history of repeated one-day to one-week access intervals
will be held in the stale area. Files that have a history of repeated
longer-term accesses will be held in the archive area. Files in the Stale
and Archive areas are never replicated (mirrored); instead they are
clustered into groups with multiple parity vectors. The Archived clusters
differ from Stale clusters in that they are also compressed.

Both fixed and removable DFS volumes represent a complete independent
parallel filesystem rooted at /. Key paths are cached and chained
internally on mount. An unrestricted name search will always select
the latest version of a file. A naming syntax exists to select a
particular version of a file when multiple versions exist (i.e., a backup
volume is mounted or versioning is enabled). In addition, a volume may
be mounted with an implied prefix to its root, or only a portion of the
volume may be mounted by selecting a specific directory path to be
rooted at the (possibly prefixed) root.

If multiple copies of a selected file version exist, all are presented
to the I/O queue upon access, and the first device ready to service the
request will mark it BUSY, which causes other devices to ignore the request
but keep it in their queue until the request is posted DONE. If a device
servicing a request encounters an error, it removes BUSY and allows an
alternate device to attempt servicing it. Writes are handled very
similarly, with a couple of extensions to support writing the data to
multiple disks/tapes with a single queued request. A priority scheme is
used to schedule both I/O requests and cache usage, which follows a
process's dynamic priority and niceness. The filesystem charges each
process for its I/O usage and decays it over time, as with CPU usage. These
strategies provide superior process management, cache management,
load balancing, and error recovery compared to traditional designs.
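
A minimal sketch of the BUSY/DONE handshake described above, with the state
machine reduced to three states. The structure and function names are
hypothetical, and the "devices" here are just calls in main(), so this only
shows the claiming logic, not the real DFS queueing or error handling.

/*
 * One queued request lists every device holding a copy; the first
 * device to reach it marks it BUSY, and an error releases it so an
 * alternate copy can be tried.  Hypothetical names throughout.
 */
#include <stdio.h>

enum req_state { REQ_QUEUED, REQ_BUSY, REQ_DONE };

struct io_req {
    int            blkno;
    enum req_state state;
};

/* Called by each device's service loop when it reaches the request.  */
static int claim(struct io_req *r, int dev)
{
    if (r->state != REQ_QUEUED)
        return 0;                       /* someone else is on it      */
    r->state = REQ_BUSY;
    printf("dev %d servicing block %d\n", dev, r->blkno);
    return 1;
}

static void fail(struct io_req *r, int dev)
{
    printf("dev %d error on block %d, releasing\n", dev, r->blkno);
    r->state = REQ_QUEUED;              /* let an alternate copy try  */
}

static void done(struct io_req *r)
{
    r->state = REQ_DONE;
}

int main(void)
{
    struct io_req r = { 4711, REQ_QUEUED };

    if (claim(&r, 0))
        fail(&r, 0);                    /* first copy fails ...       */
    if (claim(&r, 1))
        done(&r);                       /* ... second copy finishes   */
    printf("block %d final state %d\n", r.blkno, (int)r.state);
    return 0;
}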

With a jukebox present, files will transparently migrate between
fixed disks and archive media upon reference or upon going stale. Without
a jukebox, an operator interface is provided to handle volume change
requests. Using this feature, a fast crash recovery is possible, allowing
even gigabyte/terabyte-sized filesystems to come back online while recovery
restores are in progress from specially constructed logical image backups
which have files sorted by reference history. Multiple archive devices
can be used in parallel to speed generation of, and recovery from, logical
image backups.

At boot time, all configured fixed disks, removable disks, and ready
tape drives which have an automount flag set in the super block will
be mounted. At any idle period, each file system volume will go to sleep
by flushing caches to the media. Should the system crash during a sleep
period, the disk will be known to be consistent at startup, with the
exception of files flagged as open for writing.


Policies for Directory, File, and Free Space Management

While DFS still uses metadata objects like a super block, inodes,
indirect blocks, and directories, how they are used and maintained
in memory and on disk varies greatly from traditional practice. First,
extents are used to describe the blocks used to store file and directory
data, a major departure from block lists. Secondly, there is not a separate
inode area; most inodes reside immediately before the data on disk,
or with the primary directory entry for the file. Most small files'
data is encapsulated in the primary directory's cluster. All small
file and directory clusters are internally checksummed to aid in
detection of corruption. Every cluster written includes a timestamp/seqno
in the meta data to determine which replicated content versions are
current. Secondary file references (links) contain a subset of the inode
information to speed access. Primary and secondary references are
cross-chained to support update of inode information. This slows some
heavy meta data operations and greatly speeds most others. All this
is offset by improved cache and I/O performance.
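
For illustration only, here is a rough C rendering of the on-disk objects
just described - extents instead of block lists, an inode carried with its
primary directory entry, small-file data held inline in the directory
cluster, and a per-cluster checksum plus timestamp/seqno. The field names
and sizes are guesses made for the sketch, not the actual DFS layout.

/* Hypothetical on-disk structures sketching the DFS description above. */
#include <stdint.h>
#include <stdio.h>

struct dfs_extent {              /* contiguous run of disk blocks       */
    uint32_t start_blk;
    uint32_t nblks;
};

struct dfs_inode {
    uint32_t          size;
    uint16_t          mode;
    uint16_t          nextents;
    struct dfs_extent ext[4];    /* small files rarely need more        */
};

struct dfs_dirent {              /* primary reference carries the inode */
    char             name[56];
    struct dfs_inode ino;
    uint16_t         inline_len;     /* >0: file data follows inline in */
    uint8_t          inline_data[];  /* the same directory cluster      */
};

struct dfs_cluster_hdr {         /* prefixed to every cluster written   */
    uint32_t checksum;           /* detects corruption                  */
    uint64_t stamp_seq;          /* picks the newest replicated copy    */
};

int main(void)
{
    printf("extent %zu, inode %zu, dirent %zu, cluster hdr %zu bytes\n",
           sizeof(struct dfs_extent), sizeof(struct dfs_inode),
           sizeof(struct dfs_dirent), sizeof(struct dfs_cluster_hdr));
    return 0;
}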

These policies allow small file operations to sustain an average of
less than one I/O per file accessed in busy directories. For compiler
header files (/usr/include/*), mail folders, and many net news directories,
this is a huge performance boost.

DFS contains a number of per-file attributes that are not normally
found on a typical POSIX system. These attributes are transparently
handled by defaulted inheritance from either the directory or the creating
application's template.

Extensive access profile statistics are kept on active clusters,
files, and directories, and are used to create additional replication
to enhance retrieval times. In addition, these metrics are used to
predict the effectiveness of caching and to set cache lifetimes. These
stats are used and updated over time to determine archiving requirements
and placement. In addition, highly referenced clusters maintain an
in-memory history of successor clusters accessed, which is used to drive
prefetching where certain series of small-to-medium file clusters are
accessed.

The logical disk allocation size is not fixed, and ranges between
128 bytes and a system maximum on a per-file basis. The physical cluster
sizes also vary. The upper end for both of these is tunable per system.

Internally, the old system-global buffer chains are gone. Buffer headers
are now chained off either directory cluster caches, the in-memory inode
structures, or the volume free list. Buffer headers are more complex, do
not have statically allocated buffer memory, and are not used to queue
I/O requests. The I/O request structure includes a pointer to the
traditional buffer header, the per-device addresses for the file to
support replication, and a doubly linked list that ties it all together.
Allocations for both disk and buffer cache space are done with a
combination of best-fit and N-way lazy buddy to reduce fragmentation.
Disk addresses are preserved where convenient, or are bound at I/O queuing
time to facilitate request clustering operations. This late binding
is important for the creation of new files, especially large ones.
To avoid running out of space, a reservation system makes sure that
space is available for all unbound data.
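
A small sketch of what such an I/O request structure might look like, with
one disk address per replicated copy and addresses left unbound until queue
time (late binding). The names, the toy binding policy, and the fixed copy
count are assumptions for the example, not the SCO or DFS code.

/* Hypothetical I/O request carrying per-copy addresses, bound late.   */
#include <stdint.h>
#include <stdio.h>

#define DFS_MAXCOPIES    4
#define DFS_ADDR_UNBOUND ((uint32_t)-1)

struct buf;                          /* traditional buffer header      */

struct dfs_ioreq {
    struct buf       *bp;
    uint32_t          addr[DFS_MAXCOPIES];  /* per-device block addrs  */
    int               ncopies;
    struct dfs_ioreq *next, *prev;   /* doubly linked request list     */
};

/* Late binding: pick real disk addresses only at queue time, so the
 * allocator can still cluster this request with its neighbours.       */
static void bind_addresses(struct dfs_ioreq *rq, uint32_t base)
{
    for (int i = 0; i < rq->ncopies; i++)
        if (rq->addr[i] == DFS_ADDR_UNBOUND)
            rq->addr[i] = base + (uint32_t)i;   /* toy placement rule  */
}

int main(void)
{
    struct dfs_ioreq rq = { .bp = NULL, .ncopies = 2,
                            .addr = { DFS_ADDR_UNBOUND, DFS_ADDR_UNBOUND } };
    bind_addresses(&rq, 1000);
    printf("copies bound to blocks %u and %u\n", rq.addr[0], rq.addr[1]);
    return 0;
}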

A daemon process runs in the background to handle defragmentation
operations, data migration, archiving, and volume change services.

Changes to the kernel API and device driver level interfaces

The open, read, write, lseek, fcntl, stat, and close interfaces will be
maintained for POSIX compatibility and application portability. These
will not be the preferred interfaces for most files, however.

The preferred way to read small-to-medium-sized files will be to
use open with a new flag, O_DIRECT_MAPPED, which will cause open to
return a pointer to the file data, mapped copy-on-write into the
process's virtual address space. A read request for the entire file
will be queued at open time. Calling close with the same address
will free the virtual address space.

The preferred way to create small-to-medium-sized files will be
using open with O_DIRECT_MAPPED|O_CREAT|O_TRUNC and passing it both
the file mode and the data buffer address. The call will return zero if
successful, and -1 with errno set if not.

Where possible, page flipping between the filesystem and the application
will be used, so writing in multiples of the system page size with
page-aligned buffers is a significant win.
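
Since O_DIRECT_MAPPED exists only in this specification, its calling pattern
cannot be shown against a real <fcntl.h>. The sketch below approximates the
intended read path with hypothetical dfs_open_mapped()/dfs_close_mapped()
wrappers built on plain POSIX open/fstat/mmap, which gives roughly the
copy-on-write whole-file mapping the proposal describes; it is an analogy,
not the proposed kernel interface.

/* Approximation of the proposed mapped-read path using POSIX calls.   */
#include <fcntl.h>
#include <stdio.h>
#include <sys/mman.h>
#include <sys/stat.h>
#include <unistd.h>

static void *dfs_open_mapped(const char *path, size_t *lenp)
{
    int fd = open(path, O_RDONLY);
    if (fd < 0)
        return NULL;

    struct stat st;
    if (fstat(fd, &st) < 0 || st.st_size <= 0) {
        close(fd);
        return NULL;
    }

    /* MAP_PRIVATE gives the copy-on-write view the spec describes.    */
    void *p = mmap(NULL, (size_t)st.st_size, PROT_READ | PROT_WRITE,
                   MAP_PRIVATE, fd, 0);
    close(fd);                      /* mapping outlives the descriptor */
    if (p == MAP_FAILED)
        return NULL;

    *lenp = (size_t)st.st_size;
    return p;
}

static void dfs_close_mapped(void *p, size_t len)
{
    munmap(p, len);
}

int main(void)
{
    size_t len;
    char *data = dfs_open_mapped("/etc/hosts", &len);
    if (data == NULL) {
        perror("dfs_open_mapped");
        return 1;
    }
    fwrite(data, 1, len, stdout);   /* whole file arrives as one map   */
    dfs_close_mapped(data, len);
    return 0;
}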

For the most part the driver interface is similar, except that the
strategy routine is prohibited from sorting the request list, and rather
than doing a wakeup on the request header, a callback through the I/O
request structure is done. The driver is expected to be fairly dumb,
implementing only a few extended device-specific operations to
provide geometry, error recovery, and bad block mapping services.


Changes to optimize seek, head switch, and rotational latency times.

Most existing systems attempt to optimize seek time by doing some
form of queue sorting. In most cases this queue sorting is done wrong,
allowing low-priority processes to hog the disk subsystem or otherwise
providing unfair service distribution. Particular attention will be
paid to providing uniform prioritized service with some "fair share"
attributes.

Traditional scheduling algorithms treat the disk as a sequential
array of tracks and cylinders with almost no rotational/seek-based
optimization of the queue processing. For instance:

|------------R4--------- Cyl 0 Trk 0 -------------------------R9-|
|--------R3------------- Cyl 0 Trk 1 ----------------------------|
|----R2----------------- Cyl 0 Trk 2 ----------------------------|
|R1--------------------- Cyl 0 Trk 3 ----------------------------|

|---------------------R6 Cyl 1 Trk 0 ----------------------------|
|-----------------R5---- Cyl 1 Trk 1 ----------------------------|
|----------------------- Cyl 1 Trk 2 ------------------R8 -------|
|----------------------- Cyl 1 Trk 3 -------------R7-------------|

would normally be serviced as 4, 9, 3, 2, 1, 6, 5, 8, and 7, using a
total of 8 revolutions of the disk. Given acceptable head switch and
short seek times, these might be completed in one revolution in sequential
order. Knowing short-seek performance guidelines, head switch times, and
track skew factors for each zone, plus the driver and controller command
setup times and completion posting latencies, can yield significant
performance gains. Selecting the queue order primarily on priority
and enhancing it with "free requests" can maintain both high throughput
and outstanding priority-driven response times. This is particularly
important since many new drives are reaching short seek times nearly
equivalent to head switch times.
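
A toy program to illustrate the ordering point, not the proposed scheduler:
the nine requests from the diagram are given approximate angular positions
read off the ASCII picture, and sorting by rotational offset rather than by
cylinder/track order yields the near-one-revolution sequence, assuming head
switches and the cylinder 0 to 1 seek fit inside the rotational gaps.

/* Compare track-major ordering with a simple rotational sweep.        */
#include <stdio.h>
#include <stdlib.h>

struct req { int id, cyl, trk, angle; };   /* angle: 0..99 around track */

static int by_track(const void *a, const void *b)
{
    const struct req *x = a, *y = b;
    if (x->cyl != y->cyl) return x->cyl - y->cyl;
    if (x->trk != y->trk) return x->trk - y->trk;
    return x->angle - y->angle;            /* classic cyl/track order   */
}

static int by_angle(const void *a, const void *b)
{
    const struct req *x = a, *y = b;
    return x->angle - y->angle;            /* sweep once around the disk */
}

static void show(const char *tag, const struct req *r, int n)
{
    printf("%s:", tag);
    for (int i = 0; i < n; i++)
        printf(" R%d", r[i].id);
    printf("\n");
}

int main(void)
{
    struct req r[] = {
        {1,0,3, 2}, {2,0,2,10}, {3,0,1,15}, {4,0,0,25}, {5,1,1,40},
        {6,1,0,45}, {7,1,3,55}, {8,1,2,60}, {9,0,0,95},
    };
    int n = (int)(sizeof r / sizeof r[0]);

    qsort(r, n, sizeof r[0], by_track);
    show("track-major order (about 8 revs)", r, n);

    qsort(r, n, sizeof r[0], by_angle);
    show("rotational sweep  (about 1 rev) ", r, n);
    return 0;
}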

In a default mode, the filesystem and driver will operate with
some default parameters based upon the device size and assume basic
timing numbers from a quickie startup test. A generic placement and
queue scheduling routine will be used for this filesystem.

For devices that advertise their geometry across one or more
zones, along with bad block mapping information, that data will be used
to optimize data placement and request scheduling. Customized
placement and scheduling routines will be available for qualified
vendors.

We will aid disk and system manufacturers with a white paper that
clearly outlines expected operation guidelines to achieve maximum
disk subsystem performance. This will include both a requested standard
and guidelines on how to enhance or modify the standard without
invalidating the placement and scheduling modules that exist. We
will also clarify the performance impacts of using software-based
scatter/gather controllers which produce data underruns and rotational
latency losses at page boundaries.

SCSI write-behind buffering and caching will be supported by both
the driver and the filesystem. Non-SCSI devices will be able to use the
basic model in their drivers.

Request concatenation/chaining will be used where possible to
improve read/write performance.

Careful design consideration will be applied to these cases in order
to achieve "average transfer rate" to "raw transfer rate" ratios as
near to one as possible.

Dirk-Jan Koopman

unread,
Jun 16, 1994, 6:14:10 AM6/16/94
to
Adam Sweeney (a...@spareme.engr.sgi.com) wrote:
: >a file corrupted with zeros is still corrupted ... not a security
: >risk unless the file is critical to use of a system and causes
: >another form of fault that exposes you.
: >
: >The point is that file corruption, in particular undetectable
: >file corrupt is not necessary.
: >
: >John

: John,

: So far your proposal seems to have been to order all writes
: to file data before any of the meta-data writes which will make
: that data permanent. While that will probably work for newly
: written files, it doesn't do much for the consistency of
: multi-block updates to an existing file unless you go to a
: no-overwrite scheme and commit all of meta-data atomically.

see the thread 'meta-data - the old fashioned way'

WLC

unread,
Jun 16, 1994, 8:20:25 PM6/16/94
to
Totally Lost (idle...@netcom.com) wrote:
: In article <2tn6pd$5...@fido.asd.sgi.com>,

: Nobody asks ... probably true ... I think most assume (falsely)
: the reliability exists and seldom see a failing case that would worry
: them. Being enlightened you might say, I see many unix
: applications that scare the shit out of me knowing
: what can (and has in several cases) happened to critical
: data that peoples lives depend on.

: All the pharmacies systems running on SCO XENIX/UNIX and other PC UNIX's
: are time bombs from my point of view ... going to kill somebody
: or leave them with unnecessary damage.


This is a very harsh statement. You can only do so much with the operating
system proper. Nothing is going to stop a defective disk controller board
or other insane hardware from possibly subtly damaging your data.

The chances of a user typing in the wrong patient ID or a technician
entering the wrong result are MUCH greater than the type of failure
you present; even so, chances are the patient is not going to get hurt.
There are other checks that go on within the pharmacy.

It sure would be nice to see your ideas implemented though!

Bill Corcoran
bi...@classix.com

Totally Lost

unread,
Jun 17, 1994, 12:37:06 AM6/17/94
to
[This was originally cross posted to comp.os.research, the moderator
removed the other groups when he accepted the thread ... Please
follow up to comp.os.research so we can move to a better forum for
continuing this. Enjoy "The RAID 7 Disk Subsystem - band-aid and
white elephant" thread ...JB]

In article <2thi8o...@slate.summit.novell.com>,

I'm attempting to move those topics in this thread to
comp.os.research and combine them with the thread on RAID as
a band-aid for filesystem architecture problems. This
should be even more interesting than the opening shots
in this thread, and will include topics I have presented
from the late 70's to the mid 80's that have been largely ignored
or discounted.

I have a significant beef with the educational system's treatment
of the "Computer Science" discipline and how students are taught.
This is the outcome not only of how my formal education progressed
during the early 70's but of how current graduates must be mentored
to re-examine the teachings they accept as the gospel
according to Thompson, Ritchie, and Joy - and the thousands of
converted missionaries.

This series was fun, and I think necessary to introduce the cold
reality that things are far from perfect. I don't think a two-line
bug report would have served the purpose or opened up the degree of
thinking and discussion necessary to have it sink in.

Some of the Major topics to follow in this comp.os.research series are:

* How we got here today (partly covered in this thread)
and why that was acceptable then and not now.

* the evils of partition filesystem management, specifically
the performance and resource partitioning impacts (old data
presented to thousands in the past conferences, but lost)

* the evils of block at a time filesystem architecture
as formalized and standardized by the kernel filesystem
interfaces (old data again from both presentations
and postings in other groups).

* Locality problems and how they strangle disk performance.
A well-structured argument for radical architecture
changes. The problem goes past just filesystem design
as defined by UNIX implementation, and extends into
core kernel services, drivers, disk subsystems.
(old data again ... widely known, but not acted upon).

* Ignorance of physics and physical properties which
cripple UNIX filesystem performance and reliability.
Sub topics include seek management, rotational latency
management, data scrubbing, integrated clustered
ECC with automatic recovery from multiple data block
failures. Data placement techniques for reliability.
(Old data presented in other conferences, training
seminars, and to clients).

* Statistical caching, data replication, data migration
and fully transparent backup and recovery techniques.
Why mirroring, striping and RAID in the current
designs are just flat wrong and stupid. Alternative
architecture presented which obsoletes these techniques
and greatly moves us forward. (old data not widely
presented elsewhere, primarily to my internal staff
and clients).

Why comp.os.research ... because the data above is enough to
fill a book or a three-day seminar, and would invite too many
cranks to divert the discussion if held in a non-moderated
forum. ... I have a limited amount of time to fight the issue.
I normally make my living consulting on these topics, not
training people for free.

My detractors at SCO called the above the "Big Bang Theory"
since I repeatedly said the base problem with UNIX filesystem
performance was the initial design ... and that no significant
improvement could be made without a near-complete redesign
which abandoned all the current interfaces and knowledge.
This is a scary commitment to make when your business is
to follow the technical lead of other providers. At first
I took "Big Bang" as an insult (as it was intended, I guess),
then realized it was kinda cute and really reflected the
scope of the problem. Sometimes things really are just wrong
from the beginning ... when viewed with hindsight and changing
requirements. I think the design and tradeoffs were brilliant
for 1974 ... and simply do not apply today.

Should the moderator of comp.os.research choose not to
host this debate ... we might need to find a third party
to moderate an alt.filesystem.research group.

I will probably force all mail to this account to bounce
as well, given the shit I received over this thread. If you
have an interest in the thread and would like a private
channel let me know soon.

The key policies of any design should be coordinated by
the responsible technology architects and the Systems
Architect (or Architecture Team), which is responsible
for coordinating features and overall consistency.
This group must have not only detailed knowledge
of the product, but the external experience and critical
thinking skills necessary to evaluate proposals with
conformity to a long-range plan in mind. This is not
a place for wannabes or one-member-one-vote.
Nor is it a function to be filled by seniority or popularity.

UNIX has suffered greatly because this control function has
not been applied to the massive disjointed growth
in the product forced by competitive interests. The
relationship between the various key vendors, standards
efforts, customers, universities, wannabes, and
USL/Novell is a crazy product circus one could never
dream up with a clear mind. Every vendor and every
wannabe implements new features that are irreconcilable
with the future.

John

Dirk-Jan Koopman

unread,
Jun 17, 1994, 12:10:32 PM6/17/94
to
Totally Lost (idle...@netcom.com) wrote:
: [This was originally cross posted to comp.os.research, the moderator

: removed the other groups when he accepted the thread ... Please
: follow up to comp.os.research so we can move to a better forum for
: continuing this. Enjoy "The RAID 7 Disk Subsystem - band-aid and
: white elephant" thread ...JB]

[stuff deleted]

: I have a significant beef with the educational systems treatment
: of the "Computer Science" dicipline and how students are taught.
: This is the outcome of not only how my formal education progressed
: during the early 70's but how current graduates must be mentored
: to re-examine their teachings which they accept as the gospel
: according to Thomposon, Richie, and Joy - and the thousands of
: converted missionaries.

Hear, hear.

: This series was fun, and I think necessary to introduce the cold
: reality that things are far from perfect. I don't think a two line
: bug report would have served the purpose or opened up the degree of
: thinking and discussion necessary to have it sink in.

: Some of the Major topics to follow in this comp.os.research series are:

: * How we got here today (partly covered in this thread)
: and why that was acceptable then and not now.

: * the evils of partition filesystem management, specifically
: the performance and resource partitioning impacts (old data
: presented to thousands in the past conferences, but lost)

: * the evils of block at a time filesystem architecture
: as formalized and standardized by the kernel filesystem
: interfaces (old data again from both presentations
: and postings in other groups).

: * Locality problems and how they strangle disk performance.
: A well structured arguement for radical architecture
: changes. The problem goes past just filesystem design
: as defined by UNIX implementation, and extends into
: core kernel services, drivers, disk subsystems.
: (old data again ... widely known, but not acted upon).

Interfaces such as SCSI which reinforce current thinking.

: * Ignorance of physics and physical properties which
: criple UNIX filesystem performance and reliability.
: Sub topics include seek management, rotational latency
: management, data scrubbing, integrated clustered
: ecc with automatic recovery from multiple data block
: failures. Data placement techniques for reliability.
: (Old data presented in other conferences, training
: seminars, and to clients).

SCSI again.

: * Statistical caching, data replication, data migration
: and fully transparent backup and recovery techniques.
: Why mirroring, stripping and RAID in the current
: designs are just flat wrong and stupid. Alternative
: architecture presented which obsoletes these techniques
: and greatly moves us forward. (old data not widely
: presented elsewhere, primarily to my internal staff
: and clients).

one of the few beefs I have with unix is that all these techniques were known
about and/or in use before unix made it out of Bell Labs - because of unix
they have all disappeared.

[stuff deleted]

: My detractors at SCO called the above the "Big Bang Theory"
: since I repeatedly said the base problem with UNIX filesystem
: performance was the initial design ... and that no significant
: improvement could be made without a near complete redesign
: which abandoned all the current interfaces and knowledge.

I would have said radical redesign, but one should be able to
do a lot without having to throw quite everything away; otherwise
a lot of programs would cease to run - and regardless of what you think
of the performance of the file system, an approach which significantly
altered the programming interface would probably not be acceptable.

However there are ways round that, of course; just because you don't have
an inode anymore doesn't mean that you can't LOOK as though you still
have...

[stuff deleted]

: requirements. I think the design and tradeoffs were brillant
: for 1974 ... and simply do not apply today.

I would have said 'suitable for the purpose'; there were better designs
available even before then, e.g. George 3 (yes, I know it's British).

[stuff deleted]

Dirk

John F. Haugh II

unread,
Jun 17, 1994, 10:46:10 PM6/17/94
to
In article <DHOLLAND.94...@husc7.harvard.edu> dhol...@husc7.harvard.edu (David Holland) writes:
>Haven't you been following the thread? If you don't flush those zeros
>*to the disk* before using the block, a badly-timed system crash can
>cause the UNZEROED blcosk to appear in the new file - containing who
>knows what kind of private data.

Correct, but the block number can't be in the inode of the new file
without the buffered block being zeroed out. If it isn't in the
inode, the new file can't have access to the data. The only remaining
discussion is whether or not the inode is flushed before the data
block, and that is where the issue of "badly-timed" comes into play.

The larger, more modern systems are much more likely to have this kind
of problem (bigger buffer caches ...) than are the systems which the
original poster blamed for introducing this problem.

John F. Haugh II

unread,
Jun 17, 1994, 11:46:42 PM6/17/94
to
In article <idletimeC...@netcom.com> idle...@netcom.com (Totally Lost) writes:
>I will go so far as to say that if a file contains any data other
>than written at a crash point it will greatly increase the
>recovery applications work for that file type. I will go on
>to say it should be required of any filesystem to indicate
>which files may be corrupt after a crash as a minimally
>required function to aid the sysadm and production staff
>to get the system consistant again short of rolling back
>to a known checkpoint..

I think that part of the reason for the rejection of your ideas may be that
they are projecting requirements that don't really seem to exist. I
administer a number of systems and I don't make any efforts, beyond the
automated FSCK, to verify that all active files at the time of the crash
are valid. Even in an EDP environment I've seldom seen this behavior.
The operations staff typically has a firm understanding as to which files
are safe and which aren't. As an aside, the validity of data involves
much more than the consistency of the data from a write() call
perspective. The filesystem will never assure more than that level of
validity.

Furthermore, I think your assertion that an async ordered write filesystem
is going to be faster a priori than the competitive mixed designs that
might be imagined seems a tad bold. One problem which comes to mind is
that you have now created an ordering in which all metadata blocks sort
after all data blocks. And this ordering can create thrashing as the
drive steps back out to where the data is stored (wherever that may be)
and back to where the metadata is stored (pick a cylinder, any cylinder).
My personal inclination is towards journalled filesystems, though the
state of that art seems absurd in that the I/O to the journal location
frequently appears to create its own problems.

At this point the suggestion that you prototype a filesystem based on
these concepts seems in order. While it is true that the fully functional
design might take 3 or 4 programmer-years to implement, prototypical
implementations should be well within reach for one or two warm bodies on a
part-time basis.

No doubt some of the ideas will yield promising results. But without some
evidence, your grand re-invention seems littered with potholes.

Totally Lost

unread,
Jun 18, 1994, 7:05:23 PM6/18/94
to
I really want this thread to continue in comp.os.research ... I responded
to this one here since it was a weak pot shot at the meta data issue.

In article <1994Jun18.034642.28129@rpp386>, John F. Haugh II <j...@rpp386.cactus.org> wrote:

>In article <idletimeC...@netcom.com> idle...@netcom.com (Totally Lost) writes:
>>I will go so far as to say that if a file contains any data other
>>than written at a crash point it will greatly increase the
>>recovery applications work for that file type. I will go on
>>to say it should be required of any filesystem to indicate
>>which files may be corrupt after a crash as a minimally
>>required function to aid the sysadm and production staff
>>to get the system consistant again short of rolling back
>>to a known checkpoint..
>
>I think that part of the reason for the rejection of your ideas may be that
>they are projecting requirements that don't really seem to exist.

I have already noted that most system admins like yourself really don't care
that users' files or production data may be corrupt after a crash; that
really is somebody else's problem, isn't it? ... or is it?

For people that really don't give a shit about the quality of the services
they provide, it's easy to dismiss data corruption possibilities at every crash.

I ask ... is this really a professional way to address data corruption?

Heck no ...

>I administer a number of systems and I don't make any efforts, beyond the
>automated FSCK, to verify that all active files at the time of the crash
>are valid. Even in an EDP environment I've seldom seen this behavior.

FSCK does not verify all files active at the time of a crash as valid.
It only verifies that filesystem metadata is not corrupted, which
would in turn cause even more corruption if not checked.

The scary term here is "seldom seen" from somebody that isn't even
looking. Sysadmins see it all the time, but, not being trained or required
to locate the cause, just dismiss it as one of those unanswered questions
in life. I think this is bull crap.

Of course, most sysadmins just love to point to the application programmer's
troubles trying to recover data and suggest they should have written the program
to handle it better ... not my job ... :)

>The operations staff typically has a firm understanding as to which files
>are safe and which aren't.

I have yet to see this be true for 95% of the production UNIX systems in the
field ... those that have no technical person watching over the system. Even the
operations staff at unix providers I have consulted for do not attempt to
isolate and examine files written within a sync period of a crash
and check them for corruption. This statement is largely false for the
vast majority of sites in the world. Show me otherwise ...

>As an aside, the validity of data involves
>much more than the consistency of the data from a write() call
>perspective. The filesystem will never assure more than that level of
>validity.

I disagree strongly!

Validity needs to be addressed at three levels.

1) the data which appears on disk is the data that WAS written.

2) the data written actually ends up on the disk.

3) the data which appears on the disk is consistent with other data.

The point of this thread was to assure item 1. That minimal level is not
currently achieved by UNIX filesystems based upon V6/V7/SVRx/BSD.
Because of both data corruption and security-related issues, at least this
level of validity must be required of all UNIX vendors.

A side point of the thread is that 2 is also required to address
recovery issues after a crash ... you must have some way to identify
those files that may not be complete. I believe that this should also
be required of all UNIX vendors.

Lastly, I also believe that 3, which requires implementation of multifile,
multiprocess commit-style updates, should exist. I said several times
I think this belongs in every UNIX filesystem, but I stop short of
requiring it of every vendor since it is clearly a niche market, though
a significant one in sales volume.

If I missed your point, please clarify.

>Furthermore, I think your assertion that an async ordered write filesystem
>is going to be faster a priori than the competitive mixed designs that
>might be imagined seems a tad bold.

GEEZ ... can't you even do the basics?

That's a pretty bold statement to make without presenting (or thinking about)
a simple model for it ... so let's do your work (thinking) for you ...

Current sync metadata filesystems require as much as two physical I/Os for
every block/cluster written - this yields MUCH less than 50% efficiency due to
rotational latency and seek losses ... 5-10% or less is typical. A good I/O
system will combine the block/cluster data writes, making the sync metadata I/O
the vast majority of the I/O stream - few do this, though. So in a 1K filesystem,
best case, we have 256 sync metadata writes of 1K and a 256K data cluster,
resulting in probably 260+ revs of the disk (about 4.3 secs on a typical
3600 RPM disk with 32K per track).

I require the data be written before the meta data. This can be implemented
several ways:

1) Sync data, Sync Metadata - worse than Async data, Sync Metadata;
under 5% efficiency. No sane person would do this if another
workable choice exists. 512 revs, 8.7 secs.

2) Async queue data within a lowest-level indirect block,
block on completion of all queued blocks; Sync Metadata.
Requires one sync write per block of indirect metadata,
resulting in BSIZE=1k filesystem clustered writes with
better than 75% efficiency for most drives. Results vary
depending on a number of variables but are always better than the
previous two solutions. 5 revs for .08 secs.

Other schemes exist, including the threaded one with fully async ordered
writes, which is better than the above three.

The point is that the Berkeley sync write solution is stupidly inefficient.

Nobody that I am aware of implements 2, which meets the minimal requirement
for data integrity at a reasonable cost - because it's difficult to implement
inside existing designs. The threaded design is simpler and can be easily
extended to handle all the ordering issues beyond just this case.
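
For anyone who wants to recheck the arithmetic, the short program below just
replays the model used in this post - 3600 RPM, 32K per track, a 256K cluster
of 1K blocks, and roughly one lost revolution per synchronous metadata write.
The figures are the post's own assumptions, not measurements of any shipping
filesystem, and the post rounds its seconds slightly differently.

/* Back-of-envelope model of the three cases discussed above.          */
#include <stdio.h>

int main(void)
{
    const double revs_per_sec = 3600.0 / 60.0;   /* 60 revs/sec        */
    const double track_kb     = 32.0;
    const double data_kb      = 256.0;
    const int    meta_writes  = 256;             /* one per 1K block   */

    double data_revs = data_kb / track_kb;       /* 8 revs of transfer */
    double baseline  = data_revs + meta_writes;  /* ~1 lost rev each   */
    double all_sync  = 2.0 * meta_writes;        /* sync data + meta   */
    double ordered   = 5.0;                      /* clustered, case 2  */

    printf("async data + sync meta (baseline): %6.0f revs  %5.2f sec\n",
           baseline, baseline / revs_per_sec);
    printf("sync data  + sync meta (case 1)  : %6.0f revs  %5.2f sec\n",
           all_sync, all_sync / revs_per_sec);
    printf("ordered async, clustered (case 2): %6.0f revs  %5.2f sec\n",
           ordered, ordered / revs_per_sec);
    return 0;
}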

>One problem which comes to mind is
>that you have now created an ordering in which all metadata blocks sort
>after all data blocks.

No - you do ... see the reply to Alan at SGI and the above.

>And this ordering can create thrashing as the
>drive steps back out to where the data is stored (wherever that may be)
>and back to where the metadata is stored (pick a cylinder, any cylinder).

This will happen at some frequency regardless ... with the Berkeley
Sync Metadata solution it is very frequent ... with 2 above or threaded
ordered async writes it seldom occurs.

You clearly do not know how things currently work. Given these dumb statements,
why don't you stop wasting our time and do your homework before posting
in the future?

>My personal inclination is towards journalled filesystems, though the
>state of that art seems absurd in that the I/O to the journal location
>frequently appears to create its own problems.

They have their own problems ... especially with relatively poor locality.
Given your knowledge of existing filesystem internals, it's not clear
you made that decision on any real informed basis.

>At this point the suggestion that you prototype a filesystem based on
>these concepts seems in order. While it is true that the fully functional
>design might take 3 or 4 programmer-years to implement, prototypical
>implmentations should well within reach for one or two warm bodies on a
>part time basis.

That *I* prototype it? How generous you are with other people's time. Who do
you expect to pay my mortgage and feed my kids for the year while I do this
by myself ... or to pay the salaries of a couple of juniors to help for a
shorter period ... heck, I don't even think I would train juniors to help with
the prototype project unless I thought I had funding to complete it.

Now if a couple of university CS seniors in San Francisco or San Luis Obispo,
or possibly Colorado, wanted to help me on speculation that we could
turn it into a product, I might think about it. In fact, doing an external
storage system from the VNODE/FSS interface was one idea on how to
turn this into a product without a systems vendor's help.

Most ideas do not need an implementation to determine worth ... just
a simple pencil and paper model is generally enough as long as the
number of variables can be contained to two or three. See above.

I quoted SCO a very cheap bid to do the remaining research and a prototype
because I *really* wanted to do the project ... the London team thought they
could do something better (at surely a much higher cost). It's almost
impossible to outbid an internal development group on a fun project. Something
nobody wants to do ... piece of cake. Quality of ideas is never an issue ...
nor is experience or anything else that really matters - the internal team
almost always wins - even when you have it done already.

>No doubt some of the ideas will yield promising results. But without some
>evidence, your grand re-invention seems littered with potholes.

Pretty bold statement to make without a better rebuttal to my ideas. Just what
are those potholes you see? spots on your glasses? :)

Try again?

>John F. Haugh II [ NRA-ILA ] [ Kill Barney ] !'s: ...!cs.utexas.edu!rpp386!jfh

John Bass, Sr. Engineer, DMS Design idle...@netcom.com
UNIX Consultant Development, Porting, Performance by Design

Totally Lost

unread,
Jun 18, 1994, 7:21:04 PM6/18/94
to
In article <1994Jun18.024610.27790@rpp386>, a confused John F. Haugh II <j...@rpp386.cactus.org> wrote:
>In article <DHOLLAND.94...@husc7.harvard.edu> dhol...@husc7.harvard.edu (David Holland) writes:
>>Haven't you been following the thread? If you don't flush those zeros
>>*to the disk* before using the block, a badly-timed system crash can
>>cause the UNZEROED blcosk to appear in the new file - containing who
>>knows what kind of private data.
>
>Correct, but the block number can't be in the inode of the new file
>without the buffered block being zeroed out. If it isn't in the
>inode, the new file can't have access to the data.

What happens in memory doesn't matter in this discussion, because the
whole point is what is the state of the disk at any point up to a crash.

With metadata sync writes the meta data always points to trash until the
async write data block is flushed. This is a long window.

>The only remaining
>discussion is whether or not the inode is flushed before the data
>block, and that is where the issue of "badly-timed" comes into play.

With any meta data sync write filesystem, the window is open nearly the
entire time the file is being written - this is not a matter of bad timing.

>The larger, more modern systems are much more likely to have this kind
>of problem (bigger buffer caches ...) than are the systems which the
>original poster blamed for introducing this problem.

Any system with over 5 buffers will exhibit the problem; that is, any UNIX
system ever shipped. True, it gets a lot worse over 5 buffers, but that
has been the case for nearly 20 years.

You clearly don't understand either UNIX filesystem internals or the
technical details behind this thread. Before your next post, a short
study course in UNIX internals and filesystems, and a rereading of this
thread, is highly recommended.

Dan Swartzendruber

unread,
Jun 18, 1994, 7:35:44 PM6/18/94
to
In article <idletimeC...@netcom.com> idle...@netcom.com (Totally Lost) writes:

[deleted]

If this is all so clear-cut and obvious, how about coming up
with an actual design, and putting it out for review? If you
really know what you're talking about, it shouldn't be that
difficult, no? And if you actually come up with an implementation
which is clearly superior FT-wise, and performs well, you could
end up saving all these poor users who are getting file corruption,
instead of keeping news spools filled.

John Miller -- sysadmin

unread,
Jun 19, 1994, 6:10:21 PM6/19/94
to
Dan Swartzendruber (dsw...@pugsley.osf.org) wrote:

: [deleted]

Yeah, what he said!

In the Valley of the Blind, the one-eyed man is . . . nuts.

This highly esoteric discussion has been amusing, but if Bro.
idletime's vision is so much further advanced than the rest
of us mortals, he would do us all a favor by disappearing into
the cellar and spending some time in practical, rather than
theoretical, development.
--
John Miller, N4VU Linux! Fayetteville
j...@n4vu.Atl.GA.US DoD #1942 (Atlanta)
{emory,gatech}!n4hgf!n4vu AMA #671301 GA, US

Callum Gibson

unread,
Jun 21, 1994, 12:21:21 AM6/21/94
to
Nigel Gamble (ni...@gate.net) wrote:
> If you are concerned about power failure, the way to survive with
> no corruption and no performance hit is to use a UPS. Why would you
> want to put any performance hit in the filesystem when there is a
> better way to address the power fail problem?

Then you could do as our sys admin did and trip over the power cord between
the UPS and the computer. (sorry Glenn). :-)

regards,
Callum
--
Callum Gibson cal...@bain.oz.au
Fixed Income Division, DB Bain & Co. 61 2 258 1620

Roger B.A. Klorese

unread,
Jun 22, 1994, 1:57:34 PM6/22/94
to
In article <idletimeC...@netcom.com>, Totally Lost <idle...@netcom.com> wrote:
>With metadata sync writes the meta data always points to trash until the
>async write data block is flushed. This is a long window.

...unless you use a file system like VxFS, which supports block-clearing
before metadata update.

>With any meta data sync write filesystem the window is open nearly the
>entire time the file is being written - this is not a matter of badly-timed.

...Unless clears are done synchronously as well.
--
ROGER B.A. KLORESE rog...@QueerNet.ORG
2215-R Market Street #576 San Francisco, CA 94114 +1 415 ALL-ARFF
"There is only one real blasphemy: the refusal of joy." -- Paul Rudnick

Alan Cox

unread,
Jun 23, 1994, 8:59:41 AM6/23/94
to
In article <2u9u2e$8...@unpc.queernet.org> rog...@unpc.queernet.org (Roger B.A. Klorese) writes:
>In article <idletimeC...@netcom.com>,
>Totally Lost <idle...@netcom.com> wrote:
>>With metadata sync writes the meta data always points to trash until the
>>async write data block is flushed. This is a long window.
>
>...unless you use a file system like VxFS, which supports block-clearing
>before metadata update.
>
>>With any meta data sync write filesystem the window is open nearly the
>>entire time the file is being written - this is not a matter of badly-timed.
>
>...Unless clears are done synchronously as well.

I beg to differ. If each block has a timestamp, you know whether it is pre- or
post-inode update. This avoids the need to clear synchronously.
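
A minimal sketch of how such per-block timestamps could be used at recovery
time: a block whose stamp is older than the inode update that now claims it
is treated as never written and returned as zeros. The record layouts are
invented for the example and do not correspond to any existing filesystem.

/* Per-block timestamps vs. the inode update stamp, checked at recovery. */
#include <stdint.h>
#include <stdio.h>

struct blk_hdr   { uint64_t stamp; };         /* stored with each block */
struct inode_rec { uint64_t update_stamp; };

/* A block older than the inode update that claims it is leftover data
 * from a previous owner, so recovery treats it as a hole of zeros.      */
static int block_is_valid(const struct inode_rec *ip,
                          const struct blk_hdr *bh)
{
    return bh->stamp >= ip->update_stamp;
}

int main(void)
{
    struct inode_rec ino   = { 1000 };
    struct blk_hdr   fresh = { 1003 };
    struct blk_hdr   stale = {  990 };

    printf("fresh block valid: %d\n", block_is_valid(&ino, &fresh));
    printf("stale block valid: %d\n", block_is_valid(&ino, &stale));
    return 0;
}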

Alan

Totally Lost

unread,
Jun 23, 1994, 3:05:46 PM6/23/94
to
In article <1994Jun23....@uk.ac.swan.pyr>,

If clearing is done sync, it takes 3 I/Os instead of one ... cleared data block,
sync metadata, and data block for each allocation. This is much greater than
a series of ordered async data blocks followed by the referencing metadata
... slightly over one I/O on average in the sequential case vs. three.

The sync clear and indirect block updates will be a huge performance loss.
While the clear addresses the security issue at a huge cost, it still leaves
the file at risk of being corrupted with zero data at a crash rather than
a short EOF.

Timestamp algorithms have similar problems by having to maintain two additional
sync writes instead of just a clear ... or the log updates have to be
strictly ordered in relation to each other and other file data ... if you
do strict ordering, then why not just strictly order metadata/filedata?

John


Vince Fleming

unread,
Jun 28, 1994, 5:01:11 PM6/28/94
to
Callum Gibson (cal...@frost.bain.oz.au) wrote:

: Nigel Gamble (ni...@gate.net) wrote:
: > If you are concerned about power failure, the way to survive with
: > no corruption and no performance hit is to use a UPS. Why would you
: > want to put any performance hit in the filesystem when there is a
: > better way to address the power fail problem?

: Then you could do as our sys admin did and trip over the power cord between
: the ups and the computer. (sorry Glenn). :-)

Then you could kick yourself and buy a system with an *internal* UPS. ;-}

[on a serious note, the larger AT&T/NCR systems have this as an option for that
very reason]



Jeff Jonas

unread,
Jul 11, 1994, 2:32:13 PM7/11/94
to
In article <2uq32n$p...@news1.digex.net> vi...@vtci.com (Vince Fleming) writes:
>Callum Gibson (cal...@frost.bain.oz.au) wrote:
>: Nigel Gamble (ni...@gate.net) wrote:
>: > If you are concerned about power failure, the way to survive with
>: > no corruption and no performance hit is to use a UPS. Why would you
>: > want to put any performance hit in the filesystem when there is a
>: > better way to address the power fail problem?

Since I'm joining this thread late, let me keep it brief in order
not to repeat what may have come before: (1)

a) I believe that the "ordered writes" feature of SVR3 meant that
writes to the superblock or inode list were sync, all others async.

b) there are MANY failure modes other than power failure
with which a robust, reliable system must contend.
They're less frequent, but data cables can come out, or another
peripheral may hog/corrupt the data bus (SCSI, etc).
With removable media, someone could pull out the disk while it's
active (unless the door physically locks when in use).
Floppy disks are a real sore point since the door cannot lock
on *most* models (2), so you have a tradeoff: do you perform
all writes sync to protect against removal, or perform some buffering
and achieve better performance?

>: Then you could do as our sys admin did and trip over the power cord between
>: the ups and the computer. (sorry Glenn). :-)
>
>Then you could kick youself and buy a system with an *internal* UPS. ;-}
>
>[on a serious note, the larger AT&T/NCR have this as an option for that
>very reason]

I recall seeing a DC power supply on display at an electronics
show with the following demo:
an electromagnet was holding a large hammer over a vase.
Pull the plug on the power supply and the hammer didn't drop,
since the DC power was uninterruptible!
One would think that it's more efficient to convert the battery power
directly to regulated 5/12 VDC than to invert it to 120 VAC only to have
it regulated back down to 5/12 VDC.

(1) the old articles have already expired at my site, so I
cannot go back and read the full thread

(2) yes, I know that the Sun Sparcstations and Macs "eat" the disk
and won't "spit" it out while in use, but most PC floppy drives
do not offer that protection.
--
Jeffrey Jonas
je...@panix.com

Toni Mueller

unread,
Jul 15, 1994, 12:25:05 PM7/15/94
to
Vince Fleming (vi...@vtci.com) wrote:

How many seconds do you run on such an internal power supply?

I guess the sensible minimum running time would be 1.5 times as long as it
normally takes to shut the machine down.

--
--------------------------------
Toni M"uller Internet: mue...@uran.informatik.uni-bonn.de
phone: +49-2261-79351
