
Some very thought-provoking ideas about OS architecture.


Eric S. Raymond

Jun 20, 1999, 3:00:00 AM
to linux-...@vger.rutgers.edu
(Please copy any replies to me explicitly, as I'm not presently subscribed
to the linux-kernel list -- it's not practical when I'm spending so
much time on the road.)

Gents and ladies, I believe I may have seen what comes after
Unix. Not a half-step like Plan 9, but an advance in OS architecture
as fundamental as Multics or Unix was in its day.

As an old Unix hand myself, I don't make this claim lightly; I've
been wrestling with it for a couple of weeks now. Nor am I suggesting
we ought to drop what we're doing and hare off in a new direction.
What I am suggesting is that Linus and the other kernel architects
should be taking a hard look at this stuff and thinking about it. It
may take a while for all the implications to sink in. They're huge.

What comes after Unix will, I now believe, probably resemble at least
in concept an experimental operating system called EROS. Full details
are available at <http://www.eros-os.org/>, but for the impatient I'll
review the high points here.

EROS is built around two fundamental and intertwined ideas. One is
that all data and code persistence is handled directly by the OS.
There is no file system. Yes, I said *no file system*. Instead,
everything is structures built in virtual memory and checkpointed out
to disk every so often (every five minutes in EROS). Want something?
Chase a pointer to it; EROS memory management does the rest.
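
To make this concrete, here's a toy of roughly what that programming
model feels like, faked up with mmap()/msync() on an ordinary POSIX box.
This is my sketch, not EROS code: the file, the names, and the
single-region store are stand-ins for what EROS does transparently and
system-wide.

/* Orthogonal-persistence toy: build data in memory, checkpoint it,
 * get it back after a "reboot". Error checks omitted for brevity. */
#include <fcntl.h>
#include <stdio.h>
#include <string.h>
#include <sys/mman.h>
#include <unistd.h>

#define STORE_SIZE (1 << 20)

struct record {
    long next_off;            /* offsets, not raw pointers, so the */
    char text[64];            /* structure survives re-mapping     */
};

int main(void)
{
    int fd = open("store.img", O_RDWR | O_CREAT, 0600);
    ftruncate(fd, STORE_SIZE);
    char *base = mmap(NULL, STORE_SIZE, PROT_READ | PROT_WRITE,
                      MAP_SHARED, fd, 0);

    /* "Want something? Chase a pointer": no write() calls, ever. */
    struct record *r = (struct record *)base;
    strcpy(r->text, "I persist without any explicit I/O");
    r->next_off = 0;

    /* The periodic checkpoint (every five minutes in EROS); here,
     * one synchronous flush. */
    msync(base, STORE_SIZE, MS_SYNC);

    printf("checkpointed: %s\n", r->text);
    munmap(base, STORE_SIZE);
    close(fd);
    return 0;
}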

The second fundamental idea is that of a pure capability architecture
with provably correct security. This is something like ACLs, except
that an OS with ACLs on a file system has a hole in it; programs can
communicate (in ways intended or unintended) through the file system
that everybody shares access to.
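
The distinction is easier to see in code than in prose. Here's a
throwaway illustration (mine, nothing like actual EROS internals) of
authority-as-handle rather than authority-as-identity: if you were never
handed a capability to an object, there is no name you can look up to
reach it.

#include <stdio.h>

enum cap_rights { CAP_READ = 1, CAP_WRITE = 2 };

struct object { int secret; };

struct capability {           /* in a real kernel this lives in    */
    struct object *obj;       /* protected memory; user code holds */
    unsigned rights;          /* only an unforgeable index to it   */
};

int cap_read(struct capability cap, int *out)
{
    if (!(cap.rights & CAP_READ))
        return -1;            /* no capability, no access */
    *out = cap.obj->secret;
    return 0;
}

int main(void)
{
    struct object o = { 42 };
    struct capability rw = { &o, CAP_READ | CAP_WRITE };
    struct capability wo = { &o, CAP_WRITE };
    int v;

    printf("read via rw cap: %s\n", cap_read(rw, &v) ? "denied" : "ok");
    printf("read via wo cap: %s\n", cap_read(wo, &v) ? "denied" : "ok");
    return 0;
}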

Capabilities plus checkpointing is a combination that turns out to
have huge synergies. Obviously programming is a lot simpler -- no
more hours and hours spent writing persistence/pickling/marshalling
code. The OS kernel is a lot simpler too; I can't find the figure to
be sure, but I believe EROS's is supposed to clock in at about 50K of code.

Here's another: All disk I/O is huge sequential BLTs done as part of
checkpoint operations. You can actually use close to 100% of your
controller's bandwidth, as opposed to the 30%-50% typical for
explicit-I/O operating systems that are doing seeks a lot of the time.
This means the maximum I/O throughput the OS can handle effectively
more than doubles. With simpler code. You could even afford the time
to verify each checkpoint write...

Here's a third: Had a crash or power-out? On reboot, the system
simply picks up pointers to the last checkpointed state. Your OS, and
all your applications, are back in thirty seconds. No fscks, ever
again!

And I haven't even talked about the advantages of capabilities over
userids yet. I would, but I just realized I'm running out of time --
gotta get ready to fly to Seattle tomorrow to upset some stomachs
at Microsoft.

www.eros-os.org. Eric sez check it out. Mind-blowing stuff once
you've had a few days to digest it.
--
<a href="http://www.tuxedo.org/~esr">Eric S. Raymond</a>

The Bible is not my book, and Christianity is not my religion. I could never
give assent to the long, complicated statements of Christian dogma.
-- Abraham Lincoln


Rik van Riel

Jun 20, 1999, 3:00:00 AM
to Eric S. Raymond
On Sun, 20 Jun 1999, Eric S. Raymond wrote:

> What comes after Unix will, I now believe, probably resemble at
> least in concept an experimental operating system called EROS.
> Full details are available at <http://www.eros-os.org/>, but for
> the impatient I'll review the high points here.

Unfortunately, EROS is still based on the PC hardware as
we've got it today and not modeled after a JINI-like
appliances model (the network is the computer).

With the death of the monolithic computer (if it happens)
will come the death of Unix, Windows _and_ EROS.

At the moment I can see only one Open Source system that
could become ready for a world like that. Alliance OS
(http://www.allos.org/).

cheers,

Rik -- Open Source: you deserve to be in control of your data.
+-------------------------------------------------------------------+
| Le Reseau netwerksystemen BV: http://www.reseau.nl/ |
| Linux Memory Management site: http://www.linux.eu.org/Linux-MM/ |
| Nederlandse Linux documentatie: http://www.nl.linux.org/ |
+-------------------------------------------------------------------+

Bill Huey

Jun 20, 1999, 3:00:00 AM
to Rik van Riel

> At the moment I can see only one Open Source system that
> could become ready for a world like that. Alliance OS
> (http://www.allos.org/).
>
> cheers,
>
> Rik -- Open Source: you deserve to be in control of your data.

"eros"

I was thinking about the problems of process migration over the network
and code relocation/relinking of running processes across machines.

I haven't read and thought about it very carefully just yet. I could
be speaking out of my ass.

bill

Eric S. Raymond

Jun 20, 1999, 3:00:00 AM
to al...@lxorguk.ukuu.org.uk
(Apologies for losing the thread ID. Alan's mail to me bounced.)

Alan Cox writes:
> > EROS is built around two fundamental and intertwined ideas. One is
> > that all data and code persistence is handled directly by the OS.
> > There is no file system. Yes, I said *no file system*. Instead,
> > everything is structures built in virtual memory and checkpointed out
> > to disk every so often (every five minutes in EROS). Want something?
> > Chase a pointer to it; EROS memory management does the rest.
>

> This is actually an old idea. The problem that has never been solved well is
> recovery from errors. You lose 1% of your object store. How do you tidy up?
> 20% of your object store dies in a disk crash; how do you run an fscobject
> tool? You can do it, but you end up back with file system complexity and
> all the other fs stuff.

Accepting your analysis, it still seems to me there's a difference,
though. In an EROS-like world, you would only pay the complexity cost of doing
fscobject-like things in the postmortem analyzer that's trying to
stitch together the remaining pieces. You wouldn't have to pay that same
cost in the kernel for each and every access to persistent stuff; no
namespace management to worry about.

So, yes. An EROS-like architecture has the same error-recovery
problem that fsck addresses. But it appears to me, at least as far as
I've taken the logic, that that problem would be better contained than
in a Unix-like system.

> Another peril is that external interfaces don't always like replay of events.

A much more serious objection, I agree.

> You still end up with a lot of your objects having checkpoint/restart aware
> methods.

Yes, I grant that's true. (The way I'd put it is that you still need
something like commit/rollback in database-land.) But this is a solvable
problem. Butler Lampson showed years ago how to do provably correct
serialization of access to shared critical regions with timestamps
even in the absence of reliable locks. So as long as your
hypothetical user can't futz with the system clock...
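
For the curious, the flavor of the trick is roughly this (my paraphrase
of the textbook timestamp-ordering rule, not Lampson's actual
construction, and nothing EROS-specific): every transaction carries a
timestamp, and an access is refused if it would arrive "in the past" of
something already done.

#include <stdbool.h>
#include <stdio.h>

struct tobject {
    int  value;
    long read_ts;    /* largest timestamp that has read this  */
    long write_ts;   /* largest timestamp that has written it */
};

/* Both return false when the transaction must abort and retry. */
bool ts_read(struct tobject *o, long ts, int *out)
{
    if (ts < o->write_ts)
        return false;         /* would read an overwritten past */
    if (ts > o->read_ts)
        o->read_ts = ts;
    *out = o->value;
    return true;
}

bool ts_write(struct tobject *o, long ts, int value)
{
    if (ts < o->read_ts || ts < o->write_ts)
        return false;         /* a later transaction got here first */
    o->value = value;
    o->write_ts = ts;
    return true;
}

int main(void)
{
    struct tobject o = { 0, 0, 0 };
    int v;
    printf("t=2 write ok? %d\n", ts_write(&o, 2, 10));  /* 1: allowed */
    printf("t=1 read ok?  %d\n", ts_read(&o, 1, &v));   /* 0: aborts  */
    return 0;
}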

> Moving just some objects between systems is fun too. You then get into
> cluster checkpointing, which is a field that requires you wear a pointy hat,
> have a beard and work for SGI or Digital.

Not something I have opinions about -- or am qualified to have. :-)

> Their numbers are for a microkernelish core. They are still very good, but
> that includes basically no drivers, no network stack, no graphics, and
> apparently no real checkpoint/restart in the face of corruption. I may be
> wrong on the last item.

You're probably right; I'm told all EROS actually does at this point
is run its own debugging and benchmarking tools. Still, the fact that
the test kernel can be that small is IMO an argument that the design
is sound.

> The nature of I/O is no different. If you always do large sequential
> block writes, tell me how it will outperform a conventional OS if only
> a small number of changes in a small number of objects occur.

No seeks to read inodes, because the map from EROS's virtual-space
blocks to disk blocks is almost trivial (essentially the disks get
treated like a honkin' big series of swap volumes). So the disk
access pattern would be quite different, I think.

> Object stores are great models for some applications; that's why libraries
> for doing persistent object stores in application space exist (e.g. Texas).
>
> Another way to look at this
>
>                File System           Object Store
>
>   Index        Inode Number          Object ID
>   Update       Look in directory     Look in an object
>                Find item             Find item location
>                Write (maybe COW)     Write (maybe COW)
>   Page In      Look in directory     Look in an object
>                Find item             Find item location
>                Write (maybe COW)     Write (maybe COW)
>   Granularity  User controlled       Enforced by OS
>
>
> So if I promise to call my inodes object ids, call the directory structure
> "objects" and I have a checkpointing scheme - what is the great new concept.

That, under most circumstances, you don't have to manage persistence
yourself (or to put it more concretely, no explicit disk I/O in most
applications). That's clearly a huge win, even if you end up having to do
more conventional-looking things in applications that require
commit/rollback.

And it's not clear to me that you do end up there; with one single
added atomic-flush primitive, I think you could use Lampson's
timestamp trickery to do reliable journalling without having to go all
the way to fs-like namespace management.

> o I don't think the object model is the good stuff

Even if you're right...

> o The security model is very very interesting indeed.

...this is still very true.

> o They are making it hard to help them however.

This is indeed true. However, I may have some leverage on a win-win
solution. But that's a topic for another day.

What I'm thinking is this: remember RT-Linux? Suppose the kernel were
a process running over an EROS-like layer...
--

<a href="http://www.tuxedo.org/~esr">Eric S. Raymond</a>

To make inexpensive guns impossible to get is to say that you're
putting a money test on getting a gun. It's racism in its worst form.
-- Roy Innis, president of the Congress of Racial Equality (CORE), 1988

Eric S. Raymond

Jun 20, 1999, 3:00:00 AM
to al...@lxorguk.ukuu.org.uk
Alan:
> That depends on if you want to consistency check your object store after a
> crash. Unless you journal the object store - which btw is hard. If you have
> two thousand inter-related objects you need to dump the set of them
> consistently and in a snapshotted state.

I've been thinking about this since your last post. Seems to me the
primitive one needs is the ability to say "This object and all its
dependents need to be written atomically". Not too hard to imagine
how to do that given you already have enough of a VM system to do
copy-on-write. OK, you end up having to allocate in two different
spaces, one with atomicity constraints and one without. But it's
solvable. (See below on why this doesn't mean you end up journaling
everything).
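
Concretely, I'm imagining something like a shadow commit. A toy version
(mine, nothing EROS-specific; the two-generation store is just the
classic atomic-pointer-flip trick):

#include <stdio.h>
#include <string.h>

struct node { int data; };

struct store {
    struct node slots[2][16]; /* two generations of the object set */
    int live;                 /* which generation is current       */
};

/* "Write this object set atomically": copy the working set into the
 * shadow generation, then commit by flipping one word. A crash before
 * the flip leaves the old generation intact; after it, the new one is
 * complete. On real disk you'd flush the shadow blocks before
 * flushing the flip. */
void commit(struct store *s, struct node *work, int n)
{
    int shadow = 1 - s->live;
    memcpy(s->slots[shadow], work, n * sizeof(struct node));
    s->live = shadow;         /* the single atomic step */
}

int main(void)
{
    struct store s = { .live = 0 };
    struct node work[16] = { { 1 }, { 2 } };
    commit(&s, work, 2);
    printf("live generation %d, first object %d\n",
           s.live, s.slots[s.live][0].data);
    return 0;
}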

> Why is having persistence managed by a library that is playing guessing games
> of your intent a good idea ? It has to know about object relationships,
> potentially it has to blindly snapshot the entire system. It has to do a lot
> of work to know in detail what has changed.

For the *exact same* reasons that automatic memory management with
garbage collection is preferable to slinging your own buffers. Perl
and Python and Tcl are on the rise because, outside the kernel, accepting
all that complexity and the potential for buffer overruns just doesn't
make any damn sense with clocks and memory as cheap as they are now.

Remember, the name of the game in OS design is really to optimize for
least complexity overhead for the *application programmer* and *user*.
If this means accepting a marginally more complex and less efficient
OS substructure (like the difference between a journaled object store
and a file system with explicit I/O) then that's fine. But in fact I
think Shapiro makes strong arguments that an object store, done
properly, is *more* efficient.

> So all you have to do is export every object that this object refers to. Like
> the windowing environment, whoops oh dear.

Now you know it's not that bad in practice. Not all object references are
pointers. Some are capabilities and cookies that are persistent without
prearrangement. That's especially likely to be true of OS services, and
especially if you design your API with that in mind.

> Suppose Eros was just a set of persistent object libraries that ran on
> top of numerous other platforms too, could be downloaded off the net and
> pretty well within the limits of the "programmer lazy, do more work than
> needed" paradigm.
>
> ftp://ftp.cs.utexas.edu/pub/garbage/texas/README
>
> And that is demonstrably the right way up. If you put a "lazy programmer"
> system at the bottom of an environment you prevent the smart programmer doing
> smart things. If your bottom layer is fundamentally ignorant of programmer
> provided clues you cripple the smart.

If that's true, why is Perl a success?

That's not intended to be a snarky question. Your argument here is
essentially the argument for malloc(3) as opposed to unlimited-extent
types and garbage collection. And the answer is the same: there comes
a point where the value of the optimization you can do with hints no
longer pays for the complexity overhead of having to do the storage
management yourself.

The EROS papers implicitly argue that we've reached that point not
just in memory management but with respect to the entire persistence
problem. I'm inclined to agree with them.

At the very least, it's something that I think we'd all be better off
doing a little forward thinking about. As I said at the beginning of
the thread, I'm not after changing the whole architecture of Linux
right away; that would be silly and futile. But this exchange will
have achieved my purposes if it only plants a few conceptual seeds.


--
<a href="http://www.tuxedo.org/~esr">Eric S. Raymond</a>

The right of self-defense is the first law of nature: in most
governments it has been the study of rulers to confine this right
within the narrowest limits possible. Wherever standing armies
are kept up, and when the right of the people to keep and bear
arms is, under any color or pretext whatsoever, prohibited,
liberty, if not already annihilated, is on the brink of
destruction.
-- Henry St. George Tucker (in Blackstone's Commentaries)

Linus Torvalds

Jun 20, 1999, 3:00:00 AM
to linux-...@vger.rutgers.edu
In article <Pine.LNX.4.03.99062...@mirkwood.nl.linux.org>,
Rik van Riel <ri...@nl.linux.org> wrote:
>
>Unfortunately, EROS is still based on the PC hardware as
>we've got it today and not modeled after a JINI-like
>appliances model (the network is the computer).
>
>With the death of the monolithic computer (if it happens)
>will come the death of Unix, Windows _and_ EROS.

That's a classic thing said by "OS Research People".

And it's complete crap and idiocy, and I'm finally going to stand up and
ask people to THINK instead of repeating the old and stinking dogma.

It's _much_ better to have stand-alone appliances that can work well in
a networked environment than to have a networked appliance.

I don't understand people who think that "distribution" implies
"collective". A distributed system should _not_ be composed of mindless
worker ants that only work together with other mindless worker ants.

A distributed system should be composed of individual stand-alone
systems that can work together. They should be real systems in their
own right, and have the power to make their own decisions. Too many
distributed OS projects are thinking "bees in a hive" - while what you
should aim for is "humans in society".

I'll take humans over bees any day. Real OS's, with real operating
systems. Monolithic, because they CAN stand alone, and in fact do most
of their stuff without needing hand-holding every single minute.
General-purpose instead of being able to do just one thing.

>At the moment I can see only one Open Source system that
>could become ready for a world like that. Alliance OS
>(http://www.allos.org/).

I will tell you anything based on message passing is stupid. It's very
simple:

- if you end up doing remote communication, the largest overhead is in
the communication, not in how you initiate it. This is only going to
be more true with mobile computing, not less.

Ergo: optimizing for message passing is stupid. You should _always_
optimize for the local case, because it's the only case where the
calling protocol really matters - once you go remote you have time to
massage the arguments any which way you like.

- Most operations are going to be local. Any operating system that
starts out from the notion that most operations are going to be
remote is going to die off as computers get more and more powerful.

Things may start off distributed, but in the end network bandwidth is
always going to be more expensive than CPU power.

- Truly mobile computing implies that a noticeable portion of the time
you do _not_ want to be in contact with any other computers. Your
computer had better be a very capable one even on its own. Anybody
who thinks anything else is just unbelievably misguided.

This implies that your computer had better have a local filesystem,
and had better be designed to work as well without any connectivity
as it does _with_ connectivity. It can't communicate, but that
shouldn't mean that it can't work.

So right now people are pointing at PDA's, and saying that they should
be running a "light" OS, all based on message passing, because obviously
all the real work would be done on a server. It makes sense, no?

NO. It does NOT make sense. People used to say the same thing about
workstations: workstations used to be expensive and not quite powerful
enough, and people wanted to have more than one. Where are those people
today? Face it, the hardware just got so much better that suddenly REAL
operating systems didn't have any of the alleged downsides, and while
you obviously want the ability to communicate, you should not think that
that is what you optimize for.

The same is going to happen in the PDA space. Right now we have PalmOS.
It's already doing internet connectivity; how much do you want to bet
that in the not-too-distant future they'll want to offer more and more?
There is no technical reason why a Palm in a few years won't have a few
hundred megs of RAM and a CPU that is quite equipped to handle a real
OS. (If they had selected the StrongARM instead of a cut-down 68k, it
already would.)

In short: message passing as the fundamental operation of the OS is just
an exercise in computer science masturbation. It may feel good, but
you don't actually get anything DONE. Nobody has ever shown that it
made sense in the real world. It's basically just much simpler and
saner to have a function call interface, and for operations that are
non-local it gets transparently _promoted_ to a message. There's no
reason why it should be considered to be a message when it starts out.

Linus

Linus Torvalds

Jun 20, 1999, 3:00:00 AM
to linux-...@vger.rutgers.edu
In article <1999062012...@snark.thyrsus.com>,
Eric S. Raymond <e...@snark.thyrsus.com> wrote:
>
>You're probably right; I'm told all EROS actually does at this point
>is run its own debugging and benchmarking tools. Still, the fact that
>the test kernel can be that small is IMO an argument that the design
>is sound.

Why?

What is the correlation between "small" and "good"?

There's seldom any very strong correlation. Often the correlation is
negative.

Linux started out as 10k lines of code. Was that good? It's now 1.5M
lines of code. Is that bad?

Assuming something does the same thing as another, and is more efficient
at doing it (smaller, faster, whatever), that's good. But microkernels
are based on the notion that small is good even if it is NOT capable of
doing the same things: a fundamentally flawed argument.

So mind explaining why you're using that argument?

Linus

yoda...@chelm.cs.nmt.edu

Jun 20, 1999, 3:00:00 AM
to Eric S. Raymond
On Sun, Jun 20, 1999 at 10:04:54AM -0400, Eric S. Raymond wrote:
> Alan:

> I've been thinking about this since your last post. Seems to me the
> primitive one needs is the ability to say "This object and all its
> dependents need to be written atomically". Not too hard to imagine

But then you have a journaling object store _and_ some method for the
OS to determine dependency chains in arbitrary objects. Even on a single
computer, this is hard. Object D is a "directory" object exporting
methods for accessing objects listed in the directory. One of those objects,
object R, is a relational database table that references object R', a
second table not listed within D, and also has some dynamic content
provided by object C, also not listed in D. We now "move" the listing of
R to D' and ask the OS to "write D and D' and all their dependents
atomically!" Simple as pie -- especially if D' is over the network.

State is hard. Much of academic CS over the last 20 years is an effort
to hide state and hope that what you don't see won't bite you. (EROS
has some good ideas, but I very much doubt Shapiro would argue that it
makes state complexity disappear.)

> > Why is having persistence managed by a library that is playing guessing games
> > of your intent a good idea ? It has to know about object relationships,
> > potentially it has to blindly snapshot the entire system. It has to do a lot
> > of work to know in detail what has changed.
>
> For the *exact same* reasons that automatic memory management with
> garbage collection is preferable to slinging your own buffers. Perl
> and Python and Tcl are on the rise because, outside the kernel, accepting
> all that complexity and the potential for buffer overruns just doesn't
> make any damn sense with clocks and memory as cheap as they are now.

Yes. In different domains, different paradigms make great sense.

> > So all you have to do is export every object that this object refers to. Like
> > the windowing environment, whoops oh dear.
>
> Now you know it's not that bad in practice. Not all object references are
> pointers. Some are capabilities and cookies that are persistent without
> prearrangement. That's especially likely to be true of OS services, and
> especially if you design your API with that in mind.

But the OS, in your scheme, must accommodate the most complex cases.

> > system at the bottom of an environment you prevent the smart programmer doing
> > smart things. If your bottom layer is fundamentally ignorant of programmer
> > provided clues you cripple the smart.
>
> If that's true, why is Perl a success?

Because it uses a simple bottom layer in a clever way.

Rik van Riel

Jun 20, 1999, 3:00:00 AM
to Linus Torvalds
On 20 Jun 1999, Linus Torvalds wrote:

> >Unfortunately, EROS is still based on the PC hardware as
> >we've got it today and not modeled after a JINI-like
> >appliances model (the network is the computer).
>

> That's a classic thing said by "OS Research People".

Which I'm not :)

> I don't understand people who think that "distribution" implies
> "collective".
>

> A distributed system should be composed of individual stand-alone
> systems that can work together.

OK. I can agree with this. Still, Bill Joy can't be altogether
wrong, can he? This is somewhat of a dilemma -- having to choose
between the paradigms of Bill Joy and Linus Torvalds... :)

> >(http://www.allos.org/).
>
> I will tell you anything based on message passing is stupid. It's
> very simple:

[communications overhead is either in the actual communication
channel or unneeded]

The Alliance OS uses something like a shared library so that
what looks like message passing from higher levels is only
message passing when it needs to be -- otherwise the higher
overhead is avoided.

> - Most operations are going to be local.

Optimizing for the local case doesn't mean that remote operations
can't be made transparent. Because of latency problems, you are
probably right though...

> - Truly mobile computing implies that a noticeable portion of the time
> you do _not_ want to be in contact with any other computers. Your
> computer had better be a very capable one even on its own.

It depends. If a computer is used as a way of getting at information,
then you will want it to be connected. Mobile phones simply aren't
very useful at the North Pole, however well they might function on
their own. Computing is more and more about communication and not
about number-crunching or playing games -- which, I agree, can be done
very well without network access.

Even for things like a calendar you will want access to outside
information. If you plan an appointment with someone else, you
need to be able to communicate with each other to agree on a date/time
when both of you are able to meet...

> In short: message passing as the fundamental operation of the OS
> is just an exercise in computer science masturbation. It may
> feel good, but you don't actually get anything DONE.

It can help achieve things we can't do with Linux:

Upgrade (parts of) the OS while running.
    Since message passing objects are self-contained, you can
    replace them more easily than is possible with 'classic' OSes.
    User process migration and other nice scalability and/or
    reliability tricks are also more easily done.

Transparent networking.
    While userland can do this in a library, this feature can be
    very useful for achieving high availability because it makes
    clustering at the filesystem level extremely easy (to name a
    thing).

Sandboxing parts of the OS.
    Finally it's possible to test new kernel parts without risking
    the rest of your system: debugging a new networking library
    (kernel-level, that is) without exposing the rest of the
    system to stray pointers. The sandboxing protection can always
    be removed later, when the new kernel addition has been found
    to work properly.


While I agree with your point that computers will be both powerful
enough to do anything and never powerful enough not to need the
maximum level of optimization (this seemed to be what you were
saying and, however paradoxical it may seem, will probably be true
forever), I think you should take a more positive attitude towards
new system concepts.

Even if you don't like them, some very nice and useful spinoffs
could come from the research in those areas. We need diversity
in order to select the best alternative for the job at hand.

I won't try to convince anyone to turn Linux into such a system.
Not only doesn't it make sense, I wouldn't even want Linux if it
became like that -- other systems can take on other roles that
are to be played in the huge playing field that's out there...

Rik -- Open Source: you deserve to be in control of your data.

+-------------------------------------------------------------------+
| Le Reseau netwerksystemen BV: http://www.reseau.nl/ |
| Linux Memory Management site: http://www.linux.eu.org/Linux-MM/ |
| Nederlandse Linux documentatie: http://www.nl.linux.org/ |
+-------------------------------------------------------------------+

Michael B. Trausch

Jun 20, 1999, 3:00:00 AM
to Linus Torvalds
-----BEGIN PGP SIGNED MESSAGE-----

On 20 Jun 1999, Linus Torvalds wrote:
>

> Linux started out as 10k lines of code. Was that good? It's now 1.5M
> lines of code. Is that bad?
>

I think that the fact that you're getting the OS into 1.5 million lines of
code is great; look at Windows. Microsoft stated at one point that they
had up to 11 million lines of code in Windows 95-A. Ouch... try that one
on for size. I'd hate to see how many gigabytes their source tree takes
up.

The thing about Linux is that it works, dammit. I have been using it for
two years now. I've had 2 crashes in that entire time: one fatal crash
with 2.0.36 when I used a buggy SCSI card, and one when I was using
2.2.something, caused by something in the networking code
(PPP or some other Internet code). But the thing is that that's *2*
crashes in *2* years. That's pretty damn good... considering I used to
get 2 crashes a *DAY* in Windows.

What matters is not the fact that Linux's source is large. A compiled,
compressed kernel is nearly always under 600k. In fact, I even built a
completely static kernel once and it was under 600k compressed. The fact
of the matter is, though, that it works, and despite its size, it's damn
fast. My 486, running 2.2.9, is a lot faster than my parents' Pentium II
200MHz running Windows 98. I can do work with it. And X is even faster
in some cases on this computer than on their PII. Granted -- I don't use X
much. I won't until I get a good, kick-ass computer that I'm comfortable
using X on. But I'll still use my favorite programs - pine, aumix,
lynx. What can I say? They're efficient :).

- ----------------------------------------------------------------------------
Michael B. Trausch
President of Linux Operations, ADK Computers http://adk.hypermart.net/
- ----------------------------------------------------------------------------
PGP Public Key is available at hkp://keys.pgp.com/mtra...@wcnet.org, or at:
http://wcnet.org/~mtrausch/pubkey.txt, or finger://mtra...@wcnet.org
- ----------------------------------------------------------------------------
ADK Computers, Walbridge Office E-Mail: mtra...@wcnet.org
- ----------------------------------------------------------------------------

Customer: I'm running Windows '98.
Tech: Yes.
Customer: My computer isn't working now.
Tech: Yes, you said that.


-----BEGIN PGP SIGNATURE-----
Version: PGPfreeware 5.0i for non-commercial use
Charset: noconv

iQCbAwUBN21fdnNd0YT7jYpVAQGRRwQqA2sZSTxKq9/RnzvVJ5EFaezAV6Ks9Nkt
0AR/TZFKxf3Vf9WOiakkblVfcS7ST1YhbFiIBVjek8QPF+aIG63bt4SCOYjrpEsC
Nl85ReiUt3TdKGjRpXq1IgiMdWgiK8Mb9nk63UoFe8robA+clR7TrWu3LuhP7PGt
1h4gxyq0KiRsHzO7jBQ=
=9uui
-----END PGP SIGNATURE-----

Larry McVoy

Jun 20, 1999, 3:00:00 AM
to Eric S. Raymond
: Your argument here is
: essentially the argument for malloc(3) as opposed to unlimited-extent
: types and garbage collection.

And it is a good argument. Garbage collection is a fine thing for those
who want to use it, but as a fundamental part of the system? Please don't.
It's just a bad, bad idea.

yoda...@chelm.cs.nmt.edu

Jun 21, 1999, 3:00:00 AM
to Rik van Riel
On Sun, Jun 20, 1999 at 09:06:59PM +0200, Rik van Riel wrote:
> > - Most operations are going to be local.
>
> Optimizing for the local case doesn't mean that remote operations
> can't be made transparent. Because of latency problems, you are
> probably right though...

Especially true for mobile systems.

> > - Truly mobile computing implies that a noticeable portion of the time
> > you do _not_ want to be in contact with any other computers. Your
> > computer had better be a very capable one even on its own.
>
> It depends. If a computer is used as a way of getting at information,
> then you will want it to be connected. Mobile phones simply aren't
> very useful at the North Pole, however well they might function on
> their own. Computing is more and more about communication and not
> about number-crunching or playing games -- which, I agree, can be done
> very well without network access.
>

But you had better be able to do some smart and compute/store-intensive
things to hide connection problems. If you are sitting in a plane, working
with a database reached over a series of erratic links, it's wonderful
to be able to: (1) store a useful chunk of data locally so work does
not stop every time a packet is dropped, (2) do some serious compression
and encryption, (3) run a very smart routing daemon that looks for
alternative paths and perhaps even predicts where you will be so it can
set up connections in advance...
So for serious work, you will want to compensate for interconnect problems
locally -- you want a "heavy" client. Fortunately, you will have Linux
on a DIMM with multiple processors ready to keep you going.

> > In short: message passing as the fundamental operation of the OS
> > is just an excercise in computer science masturbation. It may
> > feel good, but you don't actually get anything DONE.
>
> It can help achieve things we can't do with Linux:
>
> Upgrade (parts of) the OS while running.
>     Since message passing objects are self-contained, you can
>     replace them more easily than is possible with 'classic' OSes.
>     User process migration and other nice scalability and/or
>     reliability tricks are also more easily done.

Can I suggest that you take a look at a couple of years of SOSP from
the early 1980s? All these things were in vogue; all ran into
intractable problems. That doesn't mean that changes in technology or smart
implementations won't ever make these ideas useful, but ...

As for the importance of message passing: just think of a message header
as a destination address and a collection of arguments and ...
subroutine calls are examples of message passing. Everyone knows that
dividing complex systems into parts is good. The mechanics of how you
connect those parts is a lot less difficult than figuring out what the
parts are.

Bernd Paysan

Jun 21, 1999, 3:00:00 AM
to linux-...@vger.rutgers.edu
Linus Torvalds wrote:
> In short: message passing as the fundamental operation of the OS is just
> an exercise in computer science masturbation. It may feel good, but
> you don't actually get anything DONE. Nobody has ever shown that it
> made sense in the real world. It's basically just much simpler and
> saner to have a function call interface, and for operations that are
> non-local it gets transparently _promoted_ to a message. There's no
> reason why it should be considered to be a message when it starts out.

This is partly right, and partly wrong. Actually, some light-weight
message passing protocols can be cheaper than function calls. A function
call to a kernel function requires two state transitions (user->kernel
and back), and two context switches; since the kernel wants to massage
different data than the user process, it also has an effect on the
cache.

Message passing can make this cheaper, if (and only if) your messages
are handled asynchronously. Put a bunch of messages into your shared
memory message buffer (simple: *ptr++ = id; *ptr++ = arg; ...), and when
you are done, do the one state transition and context switch, and let
the kernel handle all the requests at once (also simple: next message is
goto *handlers[*ptr++]; message code does arg = *ptr++; ...). The
communication overhead in a well-designed active message system can be
below the state transition overhead in a classical OS.
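
Fleshing the fragments above out into a compilable toy (my sketch; no
real kernel ABI implied, and flush() stands in for the single state
transition):

#include <stdio.h>

#define QCAP 64

enum { MSG_OPEN, MSG_READ, MSG_CLOSE };

struct msgq {
    int buf[QCAP];
    int n;
};

void put(struct msgq *q, int id, int arg)
{
    /* the "*ptr++ = id; *ptr++ = arg;" from above */
    q->buf[q->n++] = id;
    q->buf[q->n++] = arg;
}

/* The kernel side would walk the same shared buffer, dispatching each
 * id to a handler; here we just print. */
void flush(struct msgq *q)
{
    for (int i = 0; i < q->n; i += 2)
        printf("kernel handles msg %d (arg=%d)\n",
               q->buf[i], q->buf[i + 1]);
    q->n = 0;
}

int main(void)
{
    struct msgq q = { .n = 0 };
    put(&q, MSG_OPEN, 7);     /* three requests...        */
    put(&q, MSG_READ, 7);
    put(&q, MSG_CLOSE, 7);
    flush(&q);                /* ...one boundary crossing */
    return 0;
}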

The flip side: this works only if your OS is designed to deliver
asynchronous results, and if your app is programmed to use that. In other
words: forget about blocking read/write calls. It works much better for
services like X Window than for Unix-like OS services. The point is that
you must restructure your app to be message-handling, too. I.e. a web
server's frame would look like:

switch (get_next_message(queue)) {
    case incoming_connection: accept(connection); break;
    case opened_connection:   read(request); break;
    case request_read:        parse(request); send_answer(request); break;
    case request_sent:        close(connection); free(request); break;
    ...
}

Note that all the OS-related calls are non-blocking and are served (as a
bunch) when the app has no more messages to handle -- or the outgoing
message buffer is full. Wrapping messages around an OS that is designed
to have services as function calls is a complete waste of time; so in
this respect, Linus is right. Masturbation is only effective if you have
a sperm bank.

--
Bernd Paysan
"If you want it done right, you have to do it yourself"
http://www.jwdt.com/~paysan/

Mikulas Patocka

Jun 21, 1999, 3:00:00 AM
to Eric S. Raymond
> Here's another: All disk I/O is huge sequential BLTs done as part of
> checkpoint operations. You can actually use close to 100% of your
> controller's bandwidth, as opposed to the 30%-50% typical for
> explicit-I/O operating systems that are doing seeks a lot of the time.
> This means the maximum I/O throughput the OS can handle effectively
> more than doubles. With simpler code. You could even afford the time
> to verify each checkpoint write...

I believe that you get 100% write throughput, but in many systems read
requests are much more frequent than writes. I'd be interested in how
good read performance is. Wouldn't your data get too fragmented? Since
you haven't ported many applications to EROS yet, I wonder whether you
get good read performance in a real environment.

> Here's a third: Had a crash or power-out? On reboot, the system
> simply picks up pointers to the last checkpointed state. Your OS, and
> all your applications, are back in thirty seconds. No fscks, ever
> again!

I don't think it is really so important. If the system crashes, something
is bad anyway, and you should cure the cause (fix bugs), not the
consequence (do quick recovery).
Power fault is another case, but I think it really doesn't matter if the
system is down for 3 hours because of a power fault or 3 hours and 10
minutes because of a power fault and fsck.

Mikulas Patocka


Steve Underwood

Jun 21, 1999, 3:00:00 AM
to Bernd Paysan
Bernd Paysan wrote:

> Linus Torvalds wrote:
> > In short: message passing as the fundamental operation of the OS is just
> > an exercise in computer science masturbation. It may feel good, but
> > you don't actually get anything DONE. Nobody has ever shown that it
> > made sense in the real world. It's basically just much simpler and
> > saner to have a function call interface, and for operations that are
> > non-local it gets transparently _promoted_ to a message. There's no
> > reason why it should be considered to be a message when it starts out.
>
> This is partly right, and partly wrong. Actually, some light-weight
> message passing protocols can be cheaper than function calls. A function
> call to a kernel function requires two state transitions (user->kernel
> and back), and two context switches; since the kernel wants to massage
> different data than the user process, it also has an effect on the
> cache.

> Message passing can make this cheaper, if (and only if) your messages
> are handled asynchronously. Put a bunch of messages into your shared
> memory message buffer (simple: *ptr++ = id; *ptr++ = arg; ...), and when
> you are done, do the one state transition and context switch, and let
> the kernel handle all the requests at once (also simple: next message is
> goto *handlers[*ptr++]; message code does arg = *ptr++; ...). The
> communication overhead in a well-designed active message system can be
> below the state transition overhead in a classical OS.

Batching messages between processes can be good, but what percentage of the
time is this practical? In most robust applications you tend to need the
result of one step before you are sure what the next step is, due to errors,
etc. There are obvious exceptions. I could initiate a bunch of writes to
different places, and look at all the results in one go.

Message passing can work well if it's embedded in the processor. The old
British GEC4000 minis did this. The ONLY way processes could communicate in
those was by message passing, and the microcoded message scheme was as fast
as a call. A message to or from an intelligent I/O card was just like a
message between CPU processes, too - no need for interrupts.

The real performance issue isn't one of how communication takes place, but of
the total quantity of communication needed. It's just like any human activity.
You have one guy on a project, and add a second. What happens? Well, if each
can work largely in isolation, with limited need for interaction, you might
get nearly twice the output. If they need to communicate a lot, you might get
less output than the first guy could produce on his own. The same is true in
the OS world, or with client/server. You need to minimise the chatting
(whether it's by message passing, system call, or meetings), and maximise the
real work. If you ask someone to do something, make sure a concise request
produces a substantial reward.

Steve

Bernd Paysan

Jun 21, 1999, 3:00:00 AM
to Steve Underwood
> Batching messages between processes can be good, but what percentage of the
> time is this practical? In most robust applications you tend to need the
> result of one step before you are sure what the next step is, due to errors,
> etc. There are obvious exceptions. I could initiate a bunch of writes to
> different places, and look at all the results in one go.

The other obvious exception is when the kernel wants to tell you
something. I.e. instead of using poll or select, you just get the
activities on fds reported via messages. The synchronous I/O in Unix is
fine for processes that do one thing at a time, but ugly for serving
multiple clients, like web servers or transaction monitors do.

> The real performance issue isn't one of how communication takes place, but of
> the total quantity of communication needed. It's just like any human activity.
> You have one guy on a project, and add a second. What happens? Well, if each
> can work largely in isolation, with limited need for interaction, you might
> get nearly twice the output. If they need to communicate a lot, you might get
> less output than the first guy could produce on his own. The same is true in
> the OS world, or with client/server. You need to minimise the chatting
> (whether it's by message passing, system call, or meetings), and maximise the
> real work. If you ask someone to do something, make sure a concise request
> produces a substantial reward.

Processing a bunch of messages at once can give that reward. You certainly
should make sure that each message itself isn't too fine-grained. If you
want to reduce communication to the max, the best results can be obtained by
sending programs around. For example, a web server could tell the OS:
whenever an incoming connection on port 80 is established, read the data
until two consecutive newlines, and give me that data. And the answer would
be a small "OSlet", too: take this filename and content-type, print a
formatted header including file size and last modification date, and push
the whole file over that TCP connection. Somewhat like Display PostScript,
where the application can define new rendering functions.

--
Bernd Paysan
"If you want it done right, you have to do it yourself"
http://www.jwdt.com/~paysan/


Pavel Machek

Jun 21, 1999, 3:00:00 AM
to Mikulas Patocka
Hi!

> I don't think it is really so important. If the system crashes, something
> is bad anyway, and you should cure the cause (fix bugs), not the
> consequence (do quick recovery).
> Power fault is another case, but I think it really doesn't matter if the
> system is down for 3 hours because of a power fault or 3 hours and 10
> minutes because of a power fault and fsck.

Or 3 hours of power fault and 3 hours of fsck... Which is exactly what
happens on ~100Gig disks.

Pavel
--
The best software in life is free (not shareware)! Pavel
GCM d? s-: !g p?:+ au- a--@ w+ v- C++@ UL+++ L++ N++ E++ W--- M- Y- R+

Mikulas Patocka

Jun 21, 1999, 3:00:00 AM
to Pavel Machek
> Hi!
>
> > I don't think it is really so important. If the system crashes, something
> > is bad anyway, and you should cure the cause (fix bugs), not the
> > consequence (do quick recovery).
> > Power fault is another case, but I think it really doesn't matter if the
> > system is down for 3 hours because of a power fault or 3 hours and 10
> > minutes because of a power fault and fsck.
>
> Or 3 hours of power fault and 3 hours of fsck... Which is exactly what
> happens on ~100Gig disks.

If you have enough money to buy a 100G disk, you certainly have enough money
to buy a UPS :-)

Mikulas Patocka

lei...@convergence.de

Jun 21, 1999, 3:00:00 AM
to linux-...@vger.rutgers.edu
In local.linux-kernel, you wrote:
> Sandboxing parts of the OS.
> Finally it's possible to test new kernel parts without risking
> the rest of your system. Debugging a new networking library
> (kernel-level, that is) without endangering the rest of the
> system to stray pointers. The sandboxing protection can always
> be removed later when the new kernel addition has been found
> to work properly.

The fastest code is the one that is not executed at all.
Thus, getting the benefits from a sandbox or message passing means that
you also have to do extensive error-checking even for helper routines
(should you choose to communicate with them via message passing, which
is a good thing if you want to experiment with new, optimized versions
without wreaking havoc).

So you would spend up to 90% of your time checking for errors in
routines you would otherwise simply trust. Currently, gcc does error
checking of the arguments and argument types when calling a subroutine
in the kernel. This checking is done at compile time. With message
passing, this has to be done at run time. While there are efficient
algorithms available to do this, it will eat your CPU cycles relentlessly.

In the end you have a conceptually sound system that even scales to 128
CPUs, but you actually need 64 CPUs to be as fast as Linux is on 1 CPU.

Felix

Pavel Machek

Jun 21, 1999, 3:00:00 AM
to Mikulas Patocka
Hi!

> > > I don't think it is really so important. If the system crashes, something
> > > is bad anyway, and you should cure the cause (fix bugs), not the
> > > consequence (do quick recovery).
> > > Power fault is another case, but I think it really doesn't matter if the
> > > system is down for 3 hours because of a power fault or 3 hours and 10
> > > minutes because of a power fault and fsck.
> >
> > Or 3 hours of power fault and 3 hours of fsck... Which is exactly what
> > happens on ~100Gig disks.
>
> If you have enough money to buy a 100G disk, you certainly have enough money
> to buy a UPS :-)

Yes, and when your kernel goes crazy you certainly cannot afford
to wait 3 hours.

People tend to do stupid things. How many times have you crashed your
machine? Imagine you had to wait _3 hours_ after each crash. No
no.

Certainly machines with 100G disks have UPSes. But shit happens: I've
seen a fuse fail _after_ the UPS.

And: 100G is not that much any more. When you buy a new computer today
(not low-end), you get a 10G disk. Take ten of them and you have 100G.
It is not much space any more.

Pavel
PS: If you want to reply, please do so in private mail. (Or, even
better, talk).


--
The best software in life is free (not shareware)! Pavel
GCM d? s-: !g p?:+ au- a--@ w+ v- C++@ UL+++ L++ N++ E++ W--- M- Y- R+


Greg Lindahl

Jun 21, 1999, 3:00:00 AM
to Mikulas Patocka
> If you have enough money to buy a 100G disk, you certainly have enough money
> to buy a UPS :-)

I have several terabytes of disk and a UPS. I still want fast fsck
times, for good reasons. Please don't make assumptions about how other
people use their systems.

-- g

Malcolm Beattie

Jun 21, 1999, 3:00:00 AM
to linux-...@vger.rutgers.edu
Pavel Machek writes:
> > I don't think it is really so important. If the system crashes, something
> > is bad anyway, and you should cure the cause (fix bugs), not the
> > consequence (do quick recovery).
> > Power fault is another case, but I think it really doesn't matter if the
> > system is down for 3 hours because of a power fault or 3 hours and 10
> > minutes because of a power fault and fsck.
>
> Or 3 hours of power fault and 3 hours of fsck... Which is exactly what
> happens on ~100Gig disks.

Please don't guess/exaggerate: I posted fsck benchmark
times to this mailing list a month ago. In particular, fsck for a
43 GB ext2 filesystem (4K blocks) with 30 GB in use took 13 minutes.

Disk usage was 25 directories each with a copy of the Linux 2.2.1
source tree (63MB, 4000 files) and each with 200 subdirectories
holding 5 files of 1MB.

That was with a single SCSI bus connected to a (software) RAID5 array
of 6 x 9GB (10K RPM) disks. Triple that to a three-way SCSI adapter
connected to one 43 GB array each, and fscking the whole damn lot will
run in parallel and take about the same time: 13 minutes. Not 3 hours.
The person you replied to, who said an additional 10 minutes for fsck,
was far closer than your guess.

If you want to treat 100 GB of disk like tape with slow access then
fsck will take longer but any decently configured system with a lot of
disk should also have a decent I/O subsystem.

--Malcolm

--
Malcolm Beattie <mbea...@sable.ox.ac.uk>
Unix Systems Programmer
Oxford University Computing Services

Remi Turk

Jun 21, 1999, 3:00:00 AM
to lkml
Hello,
I missed a few mails, so don't shoot me if this has already been said
(or is completely stupid), but I can imagine the following situation:
- An error happens (e.g. an mm bug in EROS).
- EROS doesn't discover it and writes VM to the hard disk.
- A few minutes later the system crashes due to the memory corruption.
- After booting, the memory corruption has already happened, so it will
crash again.

--
Advantages of Windows NT over Linux:
* It's easier to explain the crash was not your fault.
* You have to remember only one solution to all your problems: Reboot.
* You can use your mouse to type an email message.

Chris Adams

Jun 21, 1999, 3:00:00 AM
to linux-...@vger.rutgers.edu
Once upon a time, Malcolm Beattie <mbea...@sable.ox.ac.uk> said:
>Please don't guess/exaggerate: I've already posted fsck benchmark
>times to this mailing list a month ago. In particular, fsck for a
>43 GB ext2 filesystem (4K blocks) with 30 GB in use took 13 minutes.
>
>Disk usage was 25 directories each with a copy of the Linux 2.2.1
>source tree (63MB, 4000 files) and each with 200 subdirectories
>holding 5 files of 1MB.

That is a nice benchmark, but like all benchmarks, it is artificial and
not necessarily representative of the real world.

My news spool is a 30G RAID 0 striped across 7 ultra wide SCSI Seagate
Barracudas on a DPT SmartRAID IV hardware RAID controller (which has
performance problems with RAID 5 according to some, but not with RAID
0). It is one big ext2 filesystem with 4K blocks. Last time I had to
fsck it, it had about 22G of data in about 3,000,000 files (I had no
idea how many directories, probably around 55,000). It is on a Pentium
Pro 200 with 320MB RAM. The fsck took about 50 minutes and there were
no errors. There have been some times when the system crashed and we
had to give up on fsck and just mke2fs the news spool.

There are other ways of storing news instead of one article per file,
and they work fine for news feeding, but they still have some problems
with news reading, so I don't use them. I'm not claiming that news is
representative of all large filesystems, but I am claiming that your
benchmark is not representative either.
--
Chris Adams <cad...@ro.com> - System Administrator
Renaissance Internet Services - IBS Interactive, Inc.
Home: http://ro.com/~cadams - Public key: http://ro.com/~cadams/pubkey.txt
I don't speak for anybody but myself - that's enough trouble.

Olivier Galibert

Jun 21, 1999, 3:00:00 AM
to linux-...@vger.rutgers.edu
On Mon, Jun 21, 1999 at 11:45:27AM +0100, Alan Cox wrote:
> No. A message pass is a function call is a message pass.

I think the only real difference between a function call and a message
pass is that you can send a message asynchronously from the kernel to
a user application, while you need a different mechanism if you're
using system calls[1].

OG.

[1] Blocking syscalls waiting for the event, read() on fds, signals,
whatever.

sh...@us.ibm.com

Jun 21, 1999, 3:00:00 AM
to Mikulas Patocka
Reminder: please reply to cap-...@eros-os.org as well as
linux-...@vger.rutgers.edu

>I believe that you get 100% write throughput, but in many systems read
>requests are much more frequent than writes. I'd be interested in how
>good read performance is. Wouldn't your data get too fragmented? Since
>you haven't ported many applications to EROS yet, I wonder whether you
>get good read performance in a real environment.

That's a good question. The answer is that we don't have enough EROS
applications to know for sure, but that the extent-based allocation strategy for
space appears to get you all of the locality you need for file systems. For
applications that really care, there are mechanisms architected (but not yet
implemented) to guarantee contiguous allocation. The disk arm behavior of EROS
differs significantly from that of Linux, so it's not clear that the locality
issues play out in exactly the same way.

>> Here's a third: Had a crash or power-out? On reboot, the system
>> simply picks up pointers to the last checkpointed state.
>

>I don't think it is really so important. If the system crashes, something
>is bad anyway, and you should cure the cause (fix bugs), not the
>consequence (do quick recovery).

This is a good idea, but you don't want to do it while your customers are on
hold waiting for the computer to reboot.

Jonathan S. Shapiro, Ph. D.
IBM T.J. Watson Research Center
Email: sh...@us.ibm.com
Phone: +1 914 784 7085 (Tieline: 863)
Fax: +1 914 784 7595

sh...@us.ibm.com

Jun 21, 1999, 3:00:00 AM
to Remi Turk
Remi writes:

>I missed a few mails, so don't shoot me if this has already been said
>(or is completely stupid), but I can imagine the following situation:
> - An error happens (e.g. an mm bug in EROS).
> - EROS doesn't discover it and writes VM to the hard disk.
> - A few minutes later the system crashes due to the memory corruption.
> - After booting, the memory corruption has already happened, so it will
> crash again.

Remi is certainly correct that bad state can be checkpointed. The scenario he
suggests can't happen, however.

Only a very limited amount of kernel state (the running thread list) is included
in the checkpoint; thus, a *kernel* memory error is not reinstated after a
restart.

Second, prior to a checkpoint a consistency pass is made across the system to
verify that all of the likely pointers make sense and that all modified objects
have actually been marked writable.

Third, the kernel implements a very limited number of data structures. This
facilitates the consistency check.

Finally, the kernel is not impacted by errors in user memory state. At worst
(and with low likelihood), the error will cause corruption of other user state.
In practice, we have not observed this to occur.


In practice, outside of early development kernels or periods when we have been
messing with the checkpoint logic, we have never observed an unrecoverable state
to go to the disk.


Jonathan S. Shapiro, Ph. D.
IBM T.J. Watson Research Center
Email: sh...@us.ibm.com
Phone: +1 914 784 7085 (Tieline: 863)
Fax: +1 914 784 7595


sh...@us.ibm.com

Jun 21, 1999, 3:00:00 AM
to Steve Underwood
> Linus Torvalds wrote:
> > In short: message passing as the fundamental operation of the OS is just
> > an exercise in computer science masturbation. It may feel good, but
> > you don't actually get anything DONE. Nobody has ever shown that it
> > made sense in the real world. It's basically just much simpler and
> > saner to have a function call interface, and for operations that are
> > non-local it gets transparently _promoted_ to a message. There's no
> > reason why it should be considered to be a message when it starts out.

With due libations to the Gods here, Linus is mistaken on all counts.

Moving a message from hither to yon *does* accomplish something: it moves a unit
of work from one protection/encapsulation domain to another. This may not be
necessary in your application, but it is vitally important in some. The claim
that nobody has ever shown benefit is also inaccurate. A considerable amount of
open literature on fault-tolerant software exists to support the value of
message passing in certain applications. Consider in particular all of the
research reports out of Tandem. Also, note that all of the operating systems
whose software MTBF exceeds one year make heavy use of protection domains.

More important, from my perspective, is that the comment about procedure calls
confuses the API with the semantics. Let's do an example. Consider the UNIX
read call read(fd, buf, sz) [I may have gotten the arg order wrong. It
doesn't matter]. Assume for a moment that we are implementing a single machine
system.

From an implementation perspective, there is absolutely NO performance
difference between the implementation of

read(fd,buf,sz)
and
fd->CALL(OP_READ,buf,sz)

The order of demultiplexing changes -- the read() call does the operation first
and the descriptor second, while the CALL does the descriptor type first and the
operation second, but precisely the same information is passed across the
user/supervisor boundary, and several implementations exist to show that they
are equivalently efficient.

Given this, there are compelling arguments for the second API:

1. By changing the order of demultiplexing, it offers the option of remoting at
a later time.
2. It allows objects to implement non-identical system call interfaces. This is
easily abused, but sometimes extremely valuable.
3. It offers the option of depriving the program of the ability to perform I/O
calls by ensuring that it has no objects that support I/O.

So: even if you think that message passing is not the way you wish to implement
things, object based APIs offer greater flexibility of implementation, and this
is generally a good thing.
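
As a toy C sketch of the two demultiplexing orders (every name here is
invented for illustration; this is the shape of the argument, not any real
kernel's code):

    enum op { OP_READ, OP_WRITE };

    struct descriptor {
        /* per-type operation vector: the object decides what OP_READ
         * means for it (descriptor-first demultiplexing)              */
        long (*call)(struct descriptor *d, enum op op, void *buf, long sz);
    };

    /* invented helper: map a small integer to the kernel object       */
    extern struct descriptor *lookup_fd(int fd);

    /* Operation-first: the syscall table picks the op, then sys_read()
     * picks the descriptor type.                                      */
    long sys_read(int fd, void *buf, long sz)
    {
        struct descriptor *d = lookup_fd(fd);
        return d ? d->call(d, OP_READ, buf, sz) : -1;
    }

    /* Descriptor-first: one entry point; the descriptor's own vector
     * demultiplexes the operation. Same two lookups, opposite order.  */
    long sys_call(int fd, enum op op, void *buf, long sz)
    {
        struct descriptor *d = lookup_fd(fd);
        return d ? d->call(d, op, buf, sz) : -1;
    }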


Jonathan S. Shapiro, Ph. D.
IBM T.J. Watson Research Center
Email: sh...@us.ibm.com
Phone: +1 914 784 7085 (Tieline: 863)
Fax: +1 914 784 7595


Jeffrey B. Siegal

unread,
Jun 21, 1999, 3:00:00 AM6/21/99
to linux-...@vger.rutgers.edu
> Batching messages between processes can be good, but what percentage of the
> time is this practical? In most robust applications you tend to need the
> result of one step before you are sure what the next step is, due to errors,
> etc. There are obvious exceptions. I could initiate a bunch of writes to
> different places, and look at all the results in one go.

To work, this requires that the interface be designed to avoid the need for
round trips (returned results). For example, X11 protocol is designed
to need relatively few results. Rather than returning, say, a window
ID, the client passes a window ID to the server which uses that ID to
identify the window in subsequent requests. So rather than:

CreateWindow() ->

<- WindowID

OperateOnWindow(WID) ->

You have

CreateWindow(WID) ->
OperateOnWindow(WID) ->

In this case, CreateWindow and OperateOnWindow can clearly be batched
into a single message. Failures are reported back asynchronously (the
equivalent in a system call context would be a signal); on the rare
occasion that the client really needs to know whether an operation
succeeded, it can do so by forcing a round trip. Generally, it doesn't
matter and this overhead is avoided.
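
A toy C sketch of that batching style (the request codes and record layout
are invented, not the real X wire format): because the client chooses the
ID itself, nothing below ever has to wait for the server.

    #include <unistd.h>

    enum { REQ_CREATE_WINDOW = 1, REQ_OPERATE_ON_WINDOW = 2 };

    struct req { int code; unsigned int wid; };   /* toy wire record   */

    static struct req batch[64];
    static int nreq;

    static void queue(int code, unsigned int wid)
    {
        batch[nreq].code = code;   /* append only -- no reply needed,  */
        batch[nreq].wid  = wid;    /* so no round trip is forced       */
        nreq++;
    }

    void example(int server_fd, unsigned int client_chosen_id)
    {
        queue(REQ_CREATE_WINDOW,     client_chosen_id);
        queue(REQ_OPERATE_ON_WINDOW, client_chosen_id);
        write(server_fd, batch, nreq * sizeof batch[0]);  /* one flush */
        nreq = 0;
    }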

Without redesigning the Unix-derived system call interface to use
similar techniques, I don't see how this would yield any significant
benefit.

Linus Torvalds

unread,
Jun 21, 1999, 3:00:00 AM6/21/99
to linux-...@vger.rutgers.edu
In article <85256797.0...@D51MTA03.pok.ibm.com>,

<sh...@us.ibm.com> wrote:
>> Linus Torvalds wrote:
>> > In short: message passing as the fundamental operation of the OS is just
>> > an exercise in computer science masturbation. It may feel good, but
>> > you don't actually get anything DONE. Nobody has ever shown that it
>> > made sense in the real world. It's basically just much simpler and
>> > saner to have a function call interface, and for operations that are
>> > non-local it gets transparently _promoted_ to a message. There's no
>> > reason why it should be considered to be a message when it starts out.
>
>With due libations to the Gods here, Linus is mistaken on all counts.

It's happened before, it will happen again. However, you had better come
up with a better argument before I believe it happened this time.

>Moving a message from hither to yon *does* accomplish something: it moves a unit
>of work from one protection/encapsulation domain to another.

Ehh.. In real operating systems, we call that event a "system call".
No message necessary or implied, unless you want to call the notion of
switching privilege domains "messages" (and some people do: they call
them messages just to prove that messages are as fast as system calls.
In logic, that's equivalent to proving that liver tastes as good as ice
cream by calling ice cream liver, and is in real life called "lying").

The system call may be turned into a message later if that turns out to
be a good idea, but it's nothing inherent. AND IT SHOULD NOT BE.

>So: even if you think that message passing is not the way you wish to implement
>things, object based APIs offer greater flexibility of implementation, and this
>is generally a good thing.

Object-based API's are a completely different issue (I removed your
argument, because I think it is completely irrelevant to "messages").

I don't think object-based approaches are bad. A lot of libraries
("stdio" in C) are based on that notion, and it's often the right way to
encapsulate information in user space.

HOWEVER: that is not an OS boundary, and should not be considered to be
one. The _definition_ of an OS boundary is the boundary of protection
domains: the OS takes over where the library no longer has the
appropriate privileges to access the object. Because if the
library could do the operation, it should - instead of bothering the OS
with it.

So in effect, at the OS boundary the object has to be pretty much
completely opaque, or it shouldn't be considered an OS boundary in the
first place. QED.

That's why the OS boundary HAS to be equivalent to

read(handle, buffer, size)

and NOT be equivalent to

handle->op(READ, buffer, size);

because by definition, if you can do the "handle->op" lookup, then it's
not an OS boundary any more - or at least it is a very BAD one. See?

Linus

Tom Gall

unread,
Jun 21, 1999, 3:00:00 AM6/21/99
to e...@snark.thyrsus.com, linux-...@vger.rutgers.edu
Hi Eric, (and list members)

"Eric S. Raymond" wrote:
>
> (Please copy any replies to me explicitly, as I'm not presently subscribed
> to the linux-kernel list -- it's not practical when I'm spending so
> much time on the road.)

I'm sure you're getting plenty of mail on the topic you raised on the
kernel list. There's a commercial operating system that works on pretty
much the same concepts as EROS, and it's been out since 1988: the IBM
AS/400. I thought I'd just point out that this idea has been around for
years and is actually in a production commercial OS that businesses are
using today.

http://as400.rochester.ibm.com/

> EROS is built around two fundamental and intertwined ideas. One is
> that all data and code persistence is handled directly by the OS.
> There is no file system. Yes, I said *no file system*. Instead,
> everything is structures built in virtual memory and checkpointed out
> to disk every so often (every five minutes in EROS). Want something?
> Chase a pointer to it; EROS memory management does the rest.

It's a "single level store" environment so like EROS it's all memory
and
the OS pages things in as you use it.

Also the Apple Newton worked much the same way, while the Newton
wasn't a
multiuser really full OS like the AS/400 is... they definately did some
innovative things at Apple. But back to the 400 cause it's what I know.

There is the concept of a file system on the AS/400 (well, actually it
has several), which is built on top of a relational database. Our main
"filesystem" is a library / file / member setup. And our IFS file system
is actually a full standard file system just like you see on Linux.

http://www.as400.ibm.com/techstudio probably has a few more techie
details... but it's not quite to the depth or quality of information you
folks are talking about here...

If you're interested I'll see if I can find better "here's the 400 and
what it's all about" kind of information on the web, or maybe better yet
you come out and talk to us about Linux and we'll tell ya about the 400. 8-)

Anyway back to some "real" talk.

> Here's another: All disk I/O is huge sequential BLTs done as part of
> checkpoint operations. You can actually use close to 100% of your
> controller's bandwidth, as opposed to the 30%-50% typical for
> explicit-I/O operating systems that are doing seeks a lot of the time.
> This means the maximum I/O throughput the OS can handle effectively
> more than doubles. With simpler code. You could even afford the time
> to verify each checkpoint write...

I agree with you; from the 400's performance numbers I'd say that the
single level store environment gives you advantages as far as I/O is
concerned. The disadvantage, though, is if you lose a disk in a multidisk
system. If you're not in a RAID environment it can make life interesting.
Additionally, since everything is just memory, you have to be careful to
really release memory; if you don't, it could be paged out to disk
forever... on the 400 we have a reclaim storage command that effectively
acts like a garbage collector.

Pointers are key to a single level store environment... on the 400 we
have several different kinds of pointers, and we are very strict as to
what you can and cannot do with them. For one thing, the old trick of
casting a number to an address and dereferencing it is gone... can't do
any of that on the 400, as that'd be a big-time security hole. We have:

Open
Pointers that can hold any of the other pointer types.

Space
Generic pointers to data objects.

Function
System pointers to *PGM objects or procedure pointers to bound
ILE procedures.

System
Pointers to system objects such as queues, indexes, libraries, and
*PGM objects.

Label
Pointers to fixed locations within the executable code of a
procedure or function.

Invocation
Pointers to process objects for procedure (function) calls under
ILE, or program calls under EPM or OPM.

Suspend
Pointer to the location in a procedure where control has been
suspended.

Interestingly enough (and you alluded to this in your initial note), the
step of saving and restoring data really doesn't go away. After all, if
you are going to ship that data over the wire, that's exactly what you
have to do, and the application writers still seem to employ that
concept... though I suspect a certain amount of that is on account of
developers not having grokked what's possible on the 400.... However, for
OS hackers such as myself, the ability to just stuff things into
persistent memory is quite handy.

Well enough blathering on my part....


--
Hakuna Matata! |\ _,,,--,,_ ,)
Tom /,`.-'`' -, ;-;;'
|,4- ) )-,_ ) /\
import standard.disclaimer.*; |||| '---''(_/--' (_/-'
Tom Gall - Java Guy ____ "Where's the ka-boom? There was
IBM Rochester - Sapere aude \ /\ supposed to be an earth
(w) tom_...@vnet.ibm.com \__/_/ shattering ka-boom!"
(h) tg...@uswest.net ------ -- Marvin Martian

"Here's to the crazy ones. The misfits. The rebels. The troublemakers.
The round pegs in the square holes. The ones who see things differently.
The ones that change the world!"

Jim Gettys

unread,
Jun 21, 1999, 3:00:00 AM6/21/99
to Bernd Paysan

> Sender: owner-lin...@vger.rutgers.edu
> From: Bernd Paysan <bernd....@gmx.de>
> Date: Mon, 21 Jun 1999 11:36:54 +0000
> To: linux-...@vger.rutgers.edu

> Subject: Re: Some very thought-provoking ideas about OS architecture.
> -----

> Linus Torvalds wrote:
> > In short: message passing as the fundamental operation of the OS is just
> > an exercise in computer science masturbation. It may feel good, but
> > you don't actually get anything DONE. Nobody has ever shown that it
> > made sense in the real world. It's basically just much simpler and
> > saner to have a function call interface, and for operations that are
> > non-local it gets transparently _promoted_ to a message. There's no
> > reason why it should be considered to be a message when it starts out.
>
> This is partly right, and partly wrong. Actually, some light-weight
> message passing protocols can be cheaper than function calls. A function
> call to a kernel function requires two state transitions (user->kernel
> and back), and two context switches; since the kernel wants to massage
> different data than the user process, it also has an effect on the
> cache.
>
> Message passing can make this cheaper, if (and only if) your messages
> are handled asynchronously. Put a bunch of messages into your shared
> memory message buffer (simple *ptr++ = id; *ptr++ = arg; ...), and when
> you are done, do the one state transition and context switch, and let
> the kernel handle all the requests at once (also simple: next message is
> goto *handlers[*ptr++]; Message code does arg = *ptr++;...). The
> communication overhead in a well-designed active message system can be
> below the state transition overhead in a classical OS.
>
> Counterside: works only if your OS is designed to deliver asynchronous
> results, and if your app is programmed to use that. In other words:
> forget about blocking read/write calls. Works much better for services
> like X Window than for Unix-like OS services. The point is that you must
> restructure your app to be message-handling, too. I.e. a web server's
> frame would look like
>

Buffering and batching (or pipelining) in systems design is a well-known
technique (to some) for improving system performance, particularly when
latencies are high. X has done this for a long time (approaching 15 years),
it is part of HTTP/1.1, and appears elsewhere. We didn't consider it rocket
science even when we were doing the early X work over 14 years ago. Look
at VMS QIO and AST delivery for another (ugly) approach, but in my view
one that throws away almost all of the benefits as it still is doing
the system call transitions without the benefits of the buffering to
amortize the expense.

An X request has an instruction budget of 100 instructions or so total;
the only way that this is feasible is to avoid system calls like the
plague, and to amortize such expensive operations as read/write and select
over many X requests. I used to regularly characterize X as an exercise
in avoiding system calls.

I will note, however, that interface (protocol) design has a major
impact on how this technique can/will work, and it is hard to retrofit.
We worked pretty hard in X Version 11 design to avoid these problems,
but history has shown we didn't work hard enough.

An example is the X request "InternAtom", which is heavily used (much
more so than we originally thought it would be) and is the basis of a lot
of X's extensibility for client/client and client/window manager
communication. InternAtom gives you a short "atom" name for a string
(and is used as an extensible type system for communication). This is a
synchronous call, and has turned into a bottleneck (we built in a lot of
basic atoms). With 20-20 hindsight, we should have chosen a suitably sized
hash function, and just always sent a hash, which would have allowed atoms
to always be client generated.
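
In code, the hindsight fix is tiny: make the atom a hash the client can
compute locally, so interning never needs a round trip. The FNV-1a hash
below is just one plausible choice for illustration; X does not actually
do this.

    #include <stdint.h>

    /* Client-generated "atom": any well-mixed fixed-width hash works. */
    uint32_t atom_for(const char *name)
    {
        uint32_t h = 2166136261u;              /* FNV-1a offset basis  */
        while (*name) {
            h ^= (uint8_t)*name++;
            h *= 16777619u;                    /* FNV-1a prime         */
        }
        return h;   /* usable immediately; no InternAtom round trip    */
    }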

Here's the moral: buffering/batching can work REALLY well, but is BEST done
at design time, and hard/painful/impossible to retrofit later. It can often
cause VERY great performance increments (for HTTP/1.1, for example, where
it turned out to be possible to retrofit to some extent, it can allow
for a factor of 2-10 performance improvement from our measurements). Whether
it would make any sense to try to retrofit anything approximating UNIX
system call semantics onto such a base is far from clear to me at all...

So if you want to do this when designing a system, think about it first,
not later, and think about it hard!

- Jim

--
Jim Gettys
Compaq Computer Corporation
Visting Scientist, World Wide Web Consortium, M.I.T.
http://www.w3.org/People/Gettys/
j...@w3.org, j...@pa.dec.com

Bernd Paysan

unread,
Jun 21, 1999, 3:00:00 AM6/21/99
to Jim Gettys
On Mon, 21 Jun 1999, Jim Gettys wrote:

> Here's the moral: buffering/batching can work REALLY well, but is BEST done
> at design time, and hard/painful/impossible to retrofit later. It can often
> cause VERY great performance increments (for HTTP/1.1, for example, where
> it turned out to be possible to retrofit to some extent, it can allow
> for a factor of 2-10 performance improvement from our measurements). Whether
> it would make any sense to try to retrofit anything approximating UNIX
> system call semantics onto such a base is far from clear to me at all...

For IO-intensive applications that handle multiple requests from different
clients at once, queued/asynchronous IO (*without* requiring a system call
for every single operation) could work well. IMHO the basic design rule is
to never ever put a synchronous bottleneck into your interface at all -
all synchronous jobs must be done locally (no XInternAtoms and the like
;-). This means that one can completely forget about the Unix API as an OS
interface - it is far too local.
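
A toy sketch of that queued style in C (the ring layout, the opcodes, and
submit_ring() are all invented for illustration): requests are queued with
plain stores, and one kernel entry drains the whole batch.

    enum { AIO_READ, AIO_WRITE };

    struct aio_req  { int op; int fd; void *buf; long len; long tag; };
    struct aio_ring { unsigned int head; struct aio_req slot[256]; };

    /* invented: a single trap asking the kernel to drain the ring     */
    extern void submit_ring(struct aio_ring *r);

    static void queue_req(struct aio_ring *r, int op, int fd,
                          void *buf, long len, long tag)
    {
        struct aio_req *q = &r->slot[r->head++ % 256];  /* no syscall  */
        q->op = op; q->fd = fd; q->buf = buf; q->len = len; q->tag = tag;
    }

    void example(struct aio_ring *r, int fd, void *a, void *b)
    {
        queue_req(r, AIO_READ, fd, a, 4096, 1);
        queue_req(r, AIO_READ, fd, b, 4096, 2);
        submit_ring(r);     /* one state transition for the batch;     */
                            /* completions come back asynchronously    */
    }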

> So if you want to do this when designing a system, think about it first,
> not later, and think about it hard!

In my experience (and that of a good friend of mine who has managed a 100+
controller cluster for over 10 years), sending program snippets around
works best, performance- and bandwidth-reduction-wise. Security however
costs performance, so these things are perhaps better for controlled
environments than for a general purpose OS which has to face a lot of
nasty things.

Bernd Paysan
"If you want it done right, you have to do it yourself"
http://www.jwdt.com/~paysan/

Jim Gettys

unread,
Jun 21, 1999, 3:00:00 AM6/21/99
to Jeffrey B. Siegal

> Sender: owner-lin...@vger.rutgers.edu
> From: "Jeffrey B. Siegal" <j...@quiotix.com>
> Date: Mon, 21 Jun 1999 11:07:41 -0700

> To: linux-...@vger.rutgers.edu
> Subject: Re: Some very thought-provoking ideas about OS architecture.
> -----
> > Batching messages between processes can be good, but what percentage of the
> > time is this practical?

How often it is practical is misleading if you look only at existing
interfaces without redesign. I've never seen a good study on what
could be done given equivalent, but different, interfaces.

> > In most robust applications you tend to need the
> > result of one step before you are sure what the next step is, due to errors,
> > etc. There are obvious exceptions. I could initiate a bunch of writes to
> > different places, and look at all the results in one go.

Not always: and the question is, robust to what? How practical it is
often turns out to be a strong function of interface design and
application. You have significant control over the outcome, at least
when you make interface design decisions.

If my connection to the X server fails, having checked for a write error
on every library call isn't going to change the outcome: I've got a failure
on my hands that is going to be very hard to recover from, and finding
out somewhat sooner isn't a significant help to that application. The
application has a mess on its hands in either case, and even when you
are on the same machine as the X server, you can't afford the overhead
of a Round Trip (RTT) between the application and the server.

Another point I heard John Ousterhout make recently (though I disagreed
with much of his Usenix talk), and one I agree with, is that a lot of
operations that are "errors" really aren't; avoid such pseudo-errors in your
interface design. In the X context, we made windows not existing an
"error", when in many applications it is not an error (e.g. window managers).
I now believe this was a mistake.

C library STDIO buffering was introduced exactly because you can't
afford even one system call per operation, for that interface.

>
> To work, this requires that the interface be designed to avoid the need for
> round trips (returned results). For example, X11 protocol is designed
> to need relatively few results. Rather than returning, say, a window
> ID, the client passes a window ID to the server which uses that ID to
> identify the window in subsequent requests. So rather than:
>
> CreateWindow() ->
>
> <- WindowID
>
> OperateOnWindow(WID) ->
>
> You have
>
> CreateWindow(WID) ->
> OperateOnWindow(WID) ->
>
> In this case, CreateWindow and OperateOnWindow can clearly be batched
> into a single message. Failures are reported back asynchronously (the
> equivalent in a system call context would be a signal); on the rare
> occasion that the client really needs to know whether an operation
> succeeded, it can do so by forcing a round trip. Generally, it doesn't
> matter and this overhead is avoided.
>
> Without redesigning the Unix-derived system call interface to use
> similar techniques, I don't see how this would yield any significant
> benefit.

Yup. The CreateWindow optimization was explicitly done during the X11
redesign of X. But we found we didn't go far enough in avoiding round
trips (e.g. my InternAtom example in my last message). On the wire, however,
they are still two messages in the X protocol stream; it is just we bundle
them into one TCP segment (typically); the X server reads as much as possible
all at once and processes them.

The best time to enable efficient batching/pipelining, etc, is in the
initial systems design. For many operations, X ran faster over a network
than locally (after all, you are getting two systems working on the same
problem). I don't know if this is as true today, as the relative costs
of CPU vs. network have been changing. I should perform some experiments
on a 100 megabit network...

One place I disagree (in a sense) with Linus in this thread is where he says:


> - Most operations are going to be local. Any operating system that
> starts out from the notion that most operations are going to be
> remote is going to die off as computers get more and more powerful.
>
> Things may start off distributed, but in the end network bandwidth is
> always going to be more expensive than CPU power.
>

It isn't that these statements are not true: they are...

But if you don't avoid round trips in interface design, many operations
you might want to be able to run remotely may perform like dogs. Avoiding
round trips affects initial systems design, as it is often difficult or
impossible to retrofit in later (I know, we retrofitted it into HTTP recently,
and it is much harder to retrofit later than doing it up front). Getting
round trips back is usually much harder than almost anything else, and
we now live in a world wide network where speed of light isn't going to
change any time soon, even optimistically.

And I wish we'd designed a tighter wire protocol for X, which would go
exactly toward Linus's second point above. X is in most environments
limited today more by available network bandwidth or RTT latency
than by CPU speed; this can even be true on the same machine today, as
memory gets further and further away from the processor.

Then again, I'd characterize our design center in the 1980's as the campus
or metropolitan scale network and 1-3 MIPS machines; the Internet existed,
but shall we say that the Internet wasn't seriously usable for X traffic
(or much of anything else) in 1988 during the congestion collapse then,
so I don't feel bad about it.... All 20-20 hindsight, of course.


- Jim

P.S. Pet peeve alert!

My major complaint about people who think memory is free is that they
don't generally understand that touching less memory will make their
systems run a lot faster; and this is a technological trend I don't
see changing anytime soon... Here's a concrete set of examples:

o The X server got much faster when the data structures were shrunk by
60% 8 or so years ago (between X11R2 and X11R3 and X11R4), trading a few
instructions for a much more compact data structure representation.

o Keith Packard has seen similar effects in his recent rewrite of the frame
buffer code, this time shrinking the total # of instructions in it by an
order of magnitude. This is feasible since memory is now so much further
away from the processor that it pays to pull instructions back into the
inner loop, rather than unrolling them into independent loops! The code is
I/O bus bound, even though it is executing many more instructions than the
original frame buffer code did (that code was developed and optimized in
1988-1990), with a drastic reduction of code space and complexity. (Coming
soon to X servers near you: you'll get about .5 megabytes of memory back;
it is the code I was running on my Itsy at LinuxExpo.)

Jeffrey B. Siegal

unread,
Jun 21, 1999, 3:00:00 AM6/21/99
to Jim Gettys
Jim Gettys wrote:
> o The X server got much faster when the data structures were shrunk by
> 60% 8 or so years ago (between X11R2 and X11R3 and X11R4) (trading a few
> instructions for much more compact data structure representation).

I'm not saying I disagree with your conclusion about touching less
memory (certainly today this is very true), but there was so much other
work going on to optimize the X server at that time (some
of which I did, some of which had to do with touching memory less and
some of which didn't) that I'm not sure you can really say that the
speed improvements were the direct result of touching less memory as
opposed to simply executing less code, or touching the frame buffer
(often slower than memory) less. CPUs were pretty slow in those days
(compared to the relative speeds of today's CPUs and memory) and it
sometimes made sense even to touch memory *more*, using lookup tables and
such, in order to save instructions.

Pavel Machek

unread,
Jun 21, 1999, 3:00:00 AM6/21/99
to Linus Torvalds, linux-...@vger.rutgers.edu
Hi!

> In short: message passing as the fundamental operation of the OS is just
> an exercise in computer science masturbation. It may feel good, but
> you don't actually get anything DONE. Nobody has ever shown that it
> made sense in the real world. It's basically just much simpler and
> saner to have a function call interface, and for operations that are
> non-local it gets transparently _promoted_ to a message. There's no
> reason why it should be considered to be a message when it starts out.

Well - there is. Because function calling leads to things like
ioctl(). And ioctl() is _evil_. Yes, a linux-kernel interface without
ioctl-like things would be ok with me. Even an ioctl() which is _always_
given a structure which begins with its own length would be ok. But
ioctl() as it is today is evil, because you may pass horrible things
like a linked list of things to do. And it is hard to marshal _that_.
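
The length-prefixed convention Pavel would tolerate looks roughly like
this sketch (all names invented): with the size up front, a generic
forwarder can copy the whole argument without knowing the particular
ioctl.

    #include <stddef.h>

    /* Every argument starts with its own total size, so generic code
     * (e.g. a syscall-over-net forwarder) can marshal it blindly.     */
    struct ioctl_hdr {
        size_t len;               /* total size, header included       */
    };

    struct set_speed_arg {        /* hypothetical driver argument      */
        struct ioctl_hdr hdr;
        int speed;
    };

    void fill_example(struct set_speed_arg *a, int speed)
    {
        a->hdr.len = sizeof *a;   /* self-describing: safe to forward  */
        a->speed   = speed;
    }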

Linus, do you plan some kind of clustering support for linux? If
someone gave you a simple syscall-over-net forwarder for linux, would
you like it?

Pavel
PS: Well - there is such a forwarder in development around here. It does
not forward ioctl()s for obvious reasons :-). The major thing for
clustering seems to be 32-bit pids just now.

Bill Huey

unread,
Jun 21, 1999, 3:00:00 AM6/21/99
to Linus Torvalds
> because by definition, if you can do the "handle->op" lookup, then it's
> not an OS boundary any more - or at least it is a very BAD one. See?
>
> Linus

I'm going to wait before I comment on this one.

Some of what's going on is a misrepresentation of
current OS theory within academic circles, which changes
the context surrounding the message passing thread.

Syscall overhead is an issue, but things in the late 90's
are different now, so folks' arguments are rather dated.

I'm going to wait for the IBM guy to reply first.

bill

ty...@mit.edu

unread,
Jun 22, 1999, 3:00:00 AM6/22/99
to e...@thyrsus.com
Eric:

One other observation about EROS: Capabilities are also a very old
concept, dating back at least 20 years. The one challenge with using
them, though, is that it completely guts your hope of being POSIX.1
compatible. For example, the open() system call must now take a new
argument, which is the capability. So does unlink(), and rename(), and
bind(), and accept().....
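
Roughly, the signatures would change shape as in this sketch; the cap_t
type and the function names are invented here purely to show where the
extra argument lands.

    /* Invented capability-flavored signatures: each call names the
     * directory object it needs authority over, instead of relying
     * on ambient root/CWD state.                                      */
    typedef struct cap *cap_t;     /* an unforgeable object reference  */

    int cap_open(cap_t dir, const char *name, int flags);
    int cap_unlink(cap_t dir, const char *name);
    int cap_rename(cap_t from_dir, const char *from_name,
                   cap_t to_dir,   const char *to_name);

    /* A program that holds no cap_t for a directory cannot even
     * attempt these calls -- there is no superuser to fall back on.   */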

On the one hand, this is good --- very good --- it means that you don't
have the problem of accidentally running something with superuser privs
when you didn't expect to, since the privilege management is called
out explicitly as part of the API. It's a bit more annoying for the
programmers (especially programmers of networked services like ftpd and
httpd), who now have to manage capabilities explicitly, but that's the
price you pay for the improved security.

On the flip side, the lack of compatibility means that you lose all of the
Unix utilities (the GNU suite of utilities, the X window system, etc.).
This doesn't mean that a capability system which sees use in the
real world will never happen, but it does make the energy barrier to a
useful system appearing much, much higher. One of the reasons why
Linux took off so quickly was that we were able to reuse userspace
tools. Moving to a pure capability-based scheme will mean losing all of
that.

Now, you could solve the problem by having a compatibility layer;
however, within that compatibility layer, you won't have any of the
benefits of capabilities, either. So now the game is trying to convince
all of the application programmers to rewrite their programs in order to
be able to support the new capability-based OS. Again, not impossible, but
the activation energy can be quite high.

- Ted

Jonathan Walther

unread,
Jun 22, 1999, 3:00:00 AM6/22/99
to Eric S. Raymond

Eric, there is nothing new about this. What you are describing is a
stripped-down, clearly conceptualized MULTICS. Please stop spamming us.
I do, however, agree that MULTICS is the future. There were so many lessons
in it that have been clearly ignored.

Jonathan Walther


Matthew Wilcox

unread,
Jun 22, 1999, 3:00:00 AM6/22/99
to Pavel Machek
On Sun, Jun 20, 1999 at 10:09:55PM +0200, Pavel Machek wrote:
> Well - there is. Because function calling leads to things like
> ioctl(). And ioctl() is _evil_. Yes, a linux-kernel interface without
> ioctl-like things would be ok with me. Even an ioctl() which is _always_
> given a structure which begins with its own length would be ok. But
> ioctl() as it is today is evil, because you may pass horrible things
> like a linked list of things to do. And it is hard to marshal _that_.

Surely the sensible way of doing this is to define an ioctl2() system
call which is given a length. I imagine we would then add an ioctl2()
method to struct file_operations, and fall back to ioctl() (trimming
off the length word) for compatibility.
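
In kernel terms the fallback might look like this sketch; the structures
are simplified stand-ins, not the real file_operations.

    struct file;                       /* opaque, as in the kernel     */

    struct file_ops {                  /* simplified file_operations   */
        int (*ioctl)(struct file *f, unsigned int cmd, unsigned long arg);
        int (*ioctl2)(struct file *f, unsigned int cmd,
                      void *arg, unsigned long len);
    };

    extern struct file_ops *fops_of(struct file *f);   /* invented     */

    int do_ioctl2(struct file *f, unsigned int cmd,
                  void *arg, unsigned long len)
    {
        struct file_ops *ops = fops_of(f);
        if (ops->ioctl2)                       /* converted driver     */
            return ops->ioctl2(f, cmd, arg, len);
        /* unconverted driver: trim the length word and fall back      */
        return ops->ioctl(f, cmd, (unsigned long)arg);
    }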

I wonder if we can do this in a clever enough way to renumber all the
old definitions of ioctl numbers.

(from:

#define LOOP_SET_FD 0x4C00

to:

#define VIDIOCGCAP _IOR('v',1,struct video_capability)

)

The alternative would be to drop ioctl altogether and replace it with a
different interface. plan9 uses ctl files -- you write strings to them
to perform commands. But I'm not sure people are willing to make that
kind of radical change (certainly not within the 2.3 timeframe).

--
Matthew Wilcox <wi...@bofh.ai>
"Windows and MacOS are products, contrived by engineers in the service of
specific companies. Unix, by contrast, is not so much a product as it is a
painstakingly compiled oral history of the hacker subculture." - N Stephenson

Richard Gooch

unread,
Jun 22, 1999, 3:00:00 AM6/22/99
to Matthew Wilcox
Matthew Wilcox writes:
> The alternative would be to drop ioctl altogether and replace it
> with a different interface. plan9 uses ctl files -- you write
> strings to them to perform commands. But I'm not sure people are
> willing to make that kind of radical change (certainly not within
> the 2.3 timeframe).

But then there would be the added cost of parsing/comparing control
strings. A switch on an integer is much cheaper.

Regards,

Richard....

Linus Torvalds

unread,
Jun 22, 1999, 3:00:00 AM6/22/99
to Matthew Wilcox

On Tue, 22 Jun 1999, Matthew Wilcox wrote:
>
> Surely the sensible way of doing this is to define an ioctl2() system
> call which is given a length. I imagine we would then add an ioctl2()
> method to struct file_operations, and fall back to ioctl() (trimming
> off the length word) for compatibility.

Actually, for ioctl, you definitely do want to have both a command and a
reply, so something like this would work:

int control(int fd, unsigned int code, void *in, int in_size, void *out, int out_size)

and yes, I agree that "ioctl()" and "fcntl()" as they currently stand are
just horribly ugly, and they are probably one of the worst features of
UNIX as a design.

There are a few other things that could be handled more cleanly with just a
single "control" interface - things like socket options etc (which as they
stand now are yet another special case).

Something like the above is actually what a lot of UNIX systems try to
encode in the ioctl number - the number often has the size and the
direction encoded in it. Linux tries to do it for some things, but it's
not enforced due to historical baggage.
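
To make the shape concrete, here is how a query and a set might both ride
on the proposed entry point; the command codes are invented for the
example.

    /* the single entry point proposed above, restated                 */
    int control(int fd, unsigned int code,
                void *in, int in_size, void *out, int out_size);

    #define CTL_GET_SPEED 1            /* invented command codes       */
    #define CTL_SET_OPT   2

    void example(int tty_fd, int sock_fd)
    {
        int speed, opt = 1;
        /* a query: no input, typed reply buffer                       */
        control(tty_fd, CTL_GET_SPEED, 0, 0, &speed, sizeof speed);
        /* a set-only command: input only, no reply expected           */
        control(sock_fd, CTL_SET_OPT, &opt, sizeof opt, 0, 0);
    }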

And notice how it's not going to be really pretty whatever you do: even
if ioctl() and friends had a nicer interface, they'd still be just an ugly
sideband channel to whatever the fd is connected to.

Linus

Jim Gettys

unread,
Jun 22, 1999, 3:00:00 AM6/22/99
to Jeffrey B. Siegal

> Sender: owner-lin...@vger.rutgers.edu
> From: "Jeffrey B. Siegal" <j...@quiotix.com>
> Date: Mon, 21 Jun 1999 14:39:20 -0700
> To: Jim Gettys <j...@pa.dec.com>
> Cc: linux-...@vger.rutgers.edu, torv...@transmeta.com

> Subject: Re: Some very thought-provoking ideas about OS architecture.
> -----
> Jim Gettys wrote:
> > o The X server got much faster when the data structures were shrunk by
> > 60% 8 or so years ago (between X11R2 and X11R3 and X11R4) (trading a few
> > instructions for much more compact data structure representation).
>
> I'm not saying I disagree with your conclusion about touching less
> memory (certainly today this is very true), but there was so much other
> work going on to optimize the X server at that time (some
> of which I did, some of which had to do with touching memory less and
> some of which didn't) that I'm not sure you can really say that the
> speed improvements were the direct result of touching less memory as
> opposed to simply executing less code, or touching the frame buffer
> (often slower than memory) less. CPU's were pretty slow in those days
> (compared to the relative speed of today's CPU's and memory) and it
> sometimes made sense even to touch memory *more* using lookup tables and
> such in order to save instructions.
>

Very true: but the fact is that the server maintained compatibility while
using less than 40% of the dynamic memory for its internal data structures
while getting much faster. Certainly it did not seem to hurt performance
then to use a lot less memory. How much of this was due to more compact
representation, and how much to recoding, is an interesting question, but
one we'll never be able to answer. Machines of that day were about 10 MIPS
class systems.

More interesting may be Keith's recent results: while executing more
instructions, but in a much more compact code base, he's seeing >10% kinds
of speedups on the frame buffer code (excluding text, where he's
not worrying about speed; even so, dumb simple code is getting him
50K characters/second or more). Both the old code and the new code
should be I/O bus cycle bound. Part of the speedup may be that he's
able to use 64-bit constructs in the compiler, but part may be due to
better instruction cache behavior. Even more interesting may be real
applications, rather than small micro-benchmarks, as the footprint in
the I cache for the X server is very much smaller, so the context
switch overhead due to cache refill between client and server should
be much less. But right now, we only have micro-benchmark data, so
I don't know if this intuition is correct.

It is clearly a great example of why one must now think much more about overall
system behavior, rather than focusing on instructions executed. Better
performance tools are needed to help redirect programmers' intuition about
performance, often set on what are now antique systems. Trading memory
for fewer instructions will often be a loser on current systems.

- Jim

Jeffrey B. Siegal

unread,
Jun 22, 1999, 3:00:00 AM6/22/99
to Jim Gettys
Jim Gettys wrote:
> More interesting may be Keith's recent results: while executing more
> instructions, but in a much more compact code base, he's seeing >10% kinds
> of speedups on the frame buffer code (excluding text, where he's
> not worrying about speed; even so, dumb simple code is getting him
> 50K characters/second or more). Both the old code and the new code
> should be I/O bus cycle bound. Part of the speedup may be that he's
> able to use 64 bit constructs in the compiler, but part may be due to
> better instruction cache behavior.

Absolutely. In recent years, I have looked at the gross code expansion
done by the X server build (where some substantial code is compiled
multiple times, and in each case there is some pretty aggressive loop
unrolling) and thought that this might well be wrong given today's cache
issues. But I never did anything to test that theory. I'm glad to hear
that Keith has (as ever) been doing some good work in this area.

sh...@us.ibm.com

unread,
Jun 22, 1999, 3:00:00 AM6/22/99
to ty...@mit.edu
Ted writes:

>The one challenge with using
>them, though, is that it completely guts your hope of being POSIX.1
>compatible. For example, the open() system call must now take a new
>argument, which is the capability. So does unlink(), and rename(), and
>bind(), and accept().....

Actually, *these* system calls aren't the problem, as most of them take file
descriptors, which are capabilities.

The question comes down to: do you want to facilitate secure collaboration, or
do you want to run POSIX apps? Pick one, because you cannot do both.

>On the flip side, the lack of compatibility means that lose all of the
>Unix utilities (the GNU suite of utilities, the X window system, etc.).

It's surprising how well a compatibility box works. The truth is that most of
your day-to-day environment can stay in POSIX without much of a problem,
especially when your compatibility box is about the same speed as the real POSIX
system.


Jonathan S. Shapiro, Ph. D.
IBM T.J. Watson Research Center
Email: sh...@us.ibm.com
Phone: +1 914 784 7085 (Tieline: 863)
Fax: +1 914 784 7595


sh...@us.ibm.com

unread,
Jun 22, 1999, 3:00:00 AM6/22/99
to Bill Huey
Here's the response to Linus and others on the messaging/syscall discussion.

Linus wrote:

>That's why the OS boundary HAS to be equivalent to
>
> read(handle, buffer, size)
>
>and NOT be equivalent to
>
> handle->op(READ, buffer, size);
>

>because by definition, if you can do the "handle->op" lookup, then it's
>not a OS boundary any more - or at least it is a very BAD one. See?

Linus and several other people seem to share the same misunderstanding about
what I wrote, so I obviously said it badly.

The confusion, I think, is over whether the descriptor dereference (I did not
*say* handle, and I did not *mean* handle) is done by the application or the OS.
Several people assumed that I meant it should be done by the application.
Indeed, if the application is able to dereference the name (in which case it
would indeed be a handle), then there is no real protection boundary. This
would indeed be a poor design.

The issue I was trying to get at is both more simple and more complicated. The
real question is: "What is an operating system?"

Never mind whether you do it all in one executable (monolithic style) or as a
microkernel. There is a core nucleus to any operating system that demultiplexes
and dispatches events. In typical designs, these events include interrupts and
system calls. I am proposing that a better answer is for the nucleus to
demultiplex and dispatch interrupts and *descriptors*, and then let the
dispatched-to code figure out what call was made.

If you are building a monolithic system, the change in demultiplex order
(operation first vs. descriptor first) has no impact on performance.

That said, there are a number of reasons to prefer it:

1. You might want to distribute your system later. In a distributed system you
want to know the target (i.e. the object) of the syscall, ship the work off, and
let it be done by that object.
2. You might want to restrict what syscalls an application can do. In practice
this is invariably about limiting the objects that the application can
manipulate. Controlling the system calls is a rather bad and indirect way to do
this. It would be better to control the object access more directly.
3. It obviates the need for all objects to export the same system call
interfaces.

Before you jump on this, consider that all of the file I/O calls in UNIX
are actually capability invocations. So is the open() call: open() implicitly
references either the current root descriptor or the CWD descriptor, both of
which are part of the process state. If you want to have a discussion in
UNIX-centric terms, I'm arguing that *all* system calls should rely on some
descriptor in this way, and that it should be well-defined what happens when the
descriptor is void.

For example, the whole notion of controlling signal delivery on the basis of
parent/child trees is a bad idea. A better design would have explicit
per-process descriptors, and give authority to signal based on whether the
signaller held the appropriate descriptor. Such a design *can* support the
current semantics, but is not limited to the current semantics.
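
In sketch form (all names invented; this is the shape of the idea, not a
concrete Linux proposal): the right to signal is simply possession of the
process's descriptor.

    typedef struct proc_desc *proc_t;  /* invented process descriptor  */

    /* Authority is possession: only a holder of p can signal p.       */
    int proc_signal(proc_t p, int signo);

    /* The current parent/child semantics remain expressible: the      */
    /* spawning call just hands the parent a descriptor to keep.       */
    proc_t proc_spawn(const char *path, char *const argv[]);

    void example(void)
    {
        proc_t child = proc_spawn("/bin/worker", 0);
        proc_signal(child, 15 /* SIGTERM */);
    }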

Ultimately, I'd note that when you strip away the syntax transformations you
discover that above the vnode boundary UNIX is a capability system, with the
notable exception of the signalling and process control mechanisms, which are
rather crufty. With the introduction of /proc, processes have been given
descriptors as well. What we now need to do is find a way to eliminate the
legacy interfaces that render principled security in UNIX so difficult.

Jonathan S. Shapiro, Ph. D.
IBM T.J. Watson Research Center
Email: sh...@us.ibm.com
Phone: +1 914 784 7085 (Tieline: 863)
Fax: +1 914 784 7595



Stephen C. Tweedie

unread,
Jun 22, 1999, 3:00:00 AM6/22/99
to Pavel Machek
Hi,

On Sun, 20 Jun 1999 22:09:55 +0200, Pavel Machek <pa...@Elf.ucw.cz> said:

> Linus, do you plan some kind of clustering support into linux? If
> someone gave you simple syscall-over-net forwarder for linux, would
> you like it?

Mosix already has one.

--Stephen

sh...@us.ibm.com

unread,
Jun 22, 1999, 3:00:00 AM6/22/99
to Jim Gettys
Jim Gettys writes (in the context of a discussion about asynchronous messaging):

>Here's the moral: buffering/batching can work REALLY well, but is BEST done
>at design time, and hard/painful/impossible to retrofit later.

Jim is absolutely right. I remember having a long series of design sessions on
this with Phil Karlton (now dearly missed) when my group was designing the SGI
VIEW debugger product line.

The two hidden costs in buffered asynchronous messaging systems are that somebody
has to pay for the buffers, and that recovering consistency when one side or the
other drops the ball is more than a little tricky. At this point, I'm fairly well
convinced that asynchronous messaging is not something I want in a microkernel
nucleus.

None of this, you should understand, is meant to disagree with Jim; it's more an
observation about different priorities at different layers in the design.

Jonathan S. Shapiro, Ph. D.
IBM T.J. Watson Research Center
Email: sh...@us.ibm.com
Phone: +1 914 784 7085 (Tieline: 863)
Fax: +1 914 784 7595


Rik van Riel

unread,
Jun 22, 1999, 3:00:00 AM6/22/99
to Linus Torvalds
[I've thought about this long and hard and I've finally come
up with a proper response to Linus' argument]

On 20 Jun 1999, Linus Torvalds wrote:

> In short: message passing as the fundamental operation of the OS
> is just an exercise in computer science masturbation. It may
> feel good, but you don't actually get anything DONE. Nobody has
> ever shown that it made sense in the real world.

It's not about physical message passing in the actual implementation;
what's really happening can be 'hidden' by clever programming by the
people who built the OS.

The real issue here is paradigms. The classical "everything's
a file" broke down with the advent of networking, sockets and
non-blocking reads. At the moment the file paradigm is so much
out of touch with computational reality that web servers need
to fork for each client and people are crying out for asynchronous
sendfile and other weird interfaces.

A new "everything's a message" WILL fit the current use of computers
though. One simple concept that's good enough for all our
computational needs. And because it _is_ one simple concept, it can
be implemented in a simple, clean and fast way -- unlike the myriad
of different kludges Unix has to overcome the file paradigm...

Of course, I'll be using Unix for the foreseeable future -- it does
all that it needs to do and it's got all the luxuries I want :)

regards,

Rik -- Open Source: you deserve to be in control of your data.
+-------------------------------------------------------------------+
| Le Reseau netwerksystemen BV: http://www.reseau.nl/ |
| Linux Memory Management site: http://www.linux.eu.org/Linux-MM/ |
| Nederlandse Linux documentatie: http://www.nl.linux.org/ |
+-------------------------------------------------------------------+

Linus Torvalds

unread,
Jun 22, 1999, 3:00:00 AM6/22/99
to Rik van Riel

On Tue, 22 Jun 1999, Rik van Riel wrote:
>
> The real issue here is paradigms. The classical "everything's
> a file" broke down with the advent of networking, sockets and
> non-blocking reads. At the moment the file paradigm is so much
> out of touch with computational reality that web servers need
> to fork for each client and people are crying out for asynchronous
> sendfile and other weird interfaces.

Sure. But I think it's still a valid paradigm to consider "everything is a
stream of bytes". And that's _really_ what the UNIX paradigm has been from
the first: the whole notion of pipes etc is not all that different from
networking.

> A new "everything's a message" WILL fit the current use of computers
> though. One simple concept that's good enough for all our
> computational needs. And because it _is_ one simple concept, it can
> be implemented in a simple, clean and fast way -- unlike the myriad
> of different kludges Unix has to overcome the file paradigm...

I disagree.

The issue is not how you get the data from one place to the other:
"read()" is as good as way as "rcv()". Message passing is not the issue.

The real issue is _naming_, and that's not going away. The name space has
always been the difficult part. And that's where I agree that UNIX could
do better: I think we do want to move into a "web direction" where you can
just do an open("http://ssss.yyyyy.dd/~silly", O_RDONLY) and it does the
right thing.

Linus

Chip Salzenberg

unread,
Jun 22, 1999, 3:00:00 AM6/22/99
to Linus Torvalds
According to Linus Torvalds:

> int control(int fd, unsigned int code,
> void *in, int in_size, void *out, int out_size)

I like it! Sign me up for one of those, please.

> And notice how it's not going to be really pretty whatever you do: even
> if ioctl() and friends had a nicer interface, they'd still be just an ugly
> sideband channel to whatever the fd is connected to.

It's an inevitable consequence of presenting file data without
structure, IMO.
--
Chip Salzenberg - a.k.a. - <ch...@perlsupport.com>
"When do you work?" "Whenever I'm not busy."

Bill Huey

unread,
Jun 22, 1999, 3:00:00 AM6/22/99
to sh...@us.ibm.com
> Before you jump on this, consider that all of the file I/O calls in UNIX
> are actually capability invocations. So is the open() call: open() implicitly
> references either the current root descriptor or the CWD descriptor, both of
> which are part of the process state. If you want to have a discussion in
> UNIX-centric terms, I'm arguing that *all* system calls should rely on some
> descriptor in this way, and that it should be well-defined what happens when the
> descriptor is void.
>
> For example, the whole notion of controlling signal delivery on the basis of
> parent/child trees is a bad idea. A better design would have explicit
> per-process descriptors, and give authority to signal based on whether the
> signaller held the appropriate descriptor. Such a design *can* support the
> current semantics, but is not limited to the current semantics.

Basically, from what I've read, it translates to having a unified, single-delivery,
per-syscall capability system that's not the scatter-brained hack spread out in the
Linux kernel now, right?

You basically get all this simple and elegant encapsulation in one architectural
effort, right?

With processes and signals as the exceptions?

> Ultimately, I'ld note that when you strip away the syntax transformations you
> discover that above the vnode boundary UNIX is a capability system, with the
> notable exception of the signalling and process control mechanisms, which are
> rather crufty. With the introduction of /proc, processes have been given
> descriptors as well. What we now need to do is find a way to eliminate the
> legacy interfaces that render principled security in UNIX so difficult.

Uh, what you're saying is that the vnode/VFS interface is identical in structure
to EROS capabilities, but is not treated in a uniform manner?

bill

> Jonathan S. Shapiro, Ph. D.
> IBM T.J. Watson Research Center


david parsons

unread,
Jun 23, 1999, 3:00:00 AM6/23/99
to linux-...@vger.rutgers.edu
In article <linux.kernel.Pine.LNX.4.03...@mirkwood.nl.linux.org>,

Rik van Riel <ri...@nl.linux.org> wrote:
>[I've thought about this long and hard and I've finally come
>up with a proper response to Linus' argument]
>
>On 20 Jun 1999, Linus Torvalds wrote:
>
>> In short: message passing as the fundamental operation of the OS
>> is just an exercise in computer science masturbation. It may
>> feel good, but you don't actually get anything DONE. Nobody has
>> ever shown that it made sense in the real world.
>
>It's not about physical message passing in the actual implementation,
>what's really happening can be 'hidden' by clever programming by the
>people who built the OS.
>
>The real issue here is paradigms. The classical "everything's
>a file" broke down with the advent of networking, sockets and
>non-blocking reads.

I think you're going to have to enumerate some of the cases
where ``everything is a file'' broke down, and then you're
going to need to enumerate some of the reasons why ``everything
is a message'' is not exactly the same as ``everything is a
file'' (aside from ``message'' being spelled differently
than ``file'', which most people have already figured out.)


____
david parsons \bi/ Skeptical.
\/

sh...@us.ibm.com

unread,
Jun 23, 1999, 3:00:00 AM6/23/99
to Kai Henningsen
[Kai writes:]
>I've thought for a while I'd like to experiment with an OS where you had
>an OS boundary like this, but you could actually do IPC where you see an
>object in another process as such an OS handle. Of course, you'd need some
>kind of information on what the arguments look like so that the OS can
>actually move them to the other process, and it should still be efficient.

Kai: go grab a copy of EROS from the website at www.eros-os.org. This is
exactly how EROS is structured. In the EROS case, the kernel moves a sequential
byte range; the kernel itself knows nothing of argument structure.

>And I see no reason why one
>couldn't build a perfectly compatible POSIX environment on top of that.

One can -- a POSIX environment in fact existed on KeyKOS, the predecessor of
EROS.


Jonathan S. Shapiro, Ph. D.
IBM T.J. Watson Research Center

Email: sh...@us.ibm.com
Phone: +1 914 784 7085 (Tieline: 863)
Fax: +1 914 784 7595


Pavel Machek

unread,
Jun 23, 1999, 3:00:00 AM6/23/99
to Linus Torvalds
Hi!

> The real issue is _naming_, and that's not going away. The name space has
> always been the difficult part. And that's where I agree that UNIX could
> do better: I think we do want to move into a "web direction" where you can
> just do a open("http://ssss.yyyyy.dd/~silly", O_RDONLY) and it does the
> right thing.

You've just described what podfuk does. Actually, it is a bad idea to
use URL style (just about every Unix app assumes // == /), and it is a bad
idea to use / as the component separator in a URL, because "what should cd
http:// do?".

But yes, I currently have open("/#url:http:||ssss.yyyyy.dd/~silly",
O_RDONLY) working. I also have chdir("/#ftp:host.name") working. And
all this in 300 lines of C code (plus midnight commander turned into a
library). You'll want to see
http://atrey.karlin.mff.cuni.cz/~pavel/podfuk/podfuk.html. [I hope the
name does not offend you... :-)]
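
In other words, from a program's point of view (exactly the call quoted
above; whether a given podfuk build accepts this precise spelling is up to
podfuk):

#include <fcntl.h>

int open_remote(void)
{
        /* '|' stands in for '/' inside the URL component, so the
         * URL survives embedding in a Unix path */
        return open("/#url:http:||ssss.yyyyy.dd/~silly", O_RDONLY);
}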

Pavel
PS: Podfuk is not multithreaded, and it should be. But it has worked
reliably for over 2 years (I think).
--
The best software in life is free (not shareware)! Pavel
GCM d? s-: !g p?:+ au- a--@ w+ v- C++@ UL+++ L++ N++ E++ W--- M- Y- R+

DAVID BALAZIC

unread,
Jun 23, 1999, 3:00:00 AM6/23/99
to linux-...@vger.rutgers.edu
Somebody wrote:

> difference between the implementation of
>
> read(fd,buf,sz)
> and
> fd->CALL(OP_READ,buf,sz)

I might be off track here, but is the second line in any way related to
object-oriented programming?

Then the difference is mostly in the syntax, because at a lower level the
second line gets translated to:

read_nr_32(fd, buf, sz)

The correct method/function is chosen and called with an additional
"implicit" parameter, a pointer to the object.
The correct function can be resolved at compile time, or at run time
when using virtual methods.
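
In C terms, the dispatch is roughly this (an illustrative sketch, not any
particular kernel's code):

#include <stddef.h>
#include <sys/types.h>

/* fd->CALL(OP_READ, buf, sz) boils down to: look up the function in
 * the object's operations table and pass the object itself as an
 * implicit first argument. */
struct file;

struct file_ops {
        ssize_t (*read)(struct file *self, void *buf, size_t sz);
};

struct file {
        const struct file_ops *ops;
};

static ssize_t do_read(struct file *fd, void *buf, size_t sz)
{
        return fd->ops->read(fd, buf, sz);    /* run-time dispatch */
}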

Now, what I understand by sending messages in an OS is:

message_type *m;
m = <the contents of the message>
send_message(m [, destination]);

So execution will not necessarily be transferred to another piece of code;
the message would just be transferred to some mailbox.

This is (more or less) how AmigaOS works. For example, a task sends a message
to the filesystem saying "do this"; the message is sent and the task continues
its business. The filesystem, which is another task, reads the message, does
something, and sends a reply.
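
Schematically (a generic sketch of the mailbox idea, not the actual AmigaOS
exec calls):

#include <stddef.h>

struct message {
        struct message *next;
        int             op;            /* e.g. "do this" */
        void           *payload;
};

struct mailbox {
        struct message *head, *tail;   /* guarded by a lock in real life */
};

/* Queue the message and return at once: execution is not transferred;
 * the receiving task picks the message up when it gets around to it. */
void send_message(struct mailbox *mb, struct message *m)
{
        m->next = NULL;
        if (mb->tail)
                mb->tail->next = m;
        else
                mb->head = m;
        mb->tail = m;
}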

Well, just some thoughts of mine...

--
David Balazic , student
E-mail : 1st...@writeme.com | living in sLOVEnija
home page: http://surf.to/stein
Computer: Amiga 1200 + Quantum LPS-340AT
--

Mark H. Wood

unread,
Jun 23, 1999, 3:00:00 AM6/23/99
to
On Mon, 21 Jun 1999, Mikulas Patocka wrote:
> > Here's another: All disk I/O is huge sequential BLTs done as part of
> > checkpoint operations. You can actually use close to 100% of your
> > controller's bandwidth, as opposed to the 30%-50% typical for
> > explicit-I/O operating systems that are doing seeks a lot of the time.
> > This means the maximum I/O throughput the OS can handle effectively
> > more than doubles. With simpler code. You could even afford the time
> > to verify each checkpoint write...
>
> I believe that you get 100% write throughput, but in many systems read
> requests are much more frequent than writes. I'd be interested in how
> good read performance is. Wouldn't your data get too fragmented? Since
> you haven't ported many applications to EROS yet, I wonder if you get
> good read performance in a real environment.

I think one source of argument here is that (for example) OLTP, news
spooling, and software development are almost, but not quite, totally
unlike one another in the ways they use storage, so different folks have
different priorities. It's fairly easy to see how EROS is a win for OLTP,
but this may not be so clear for other activities.

--
Mark H. Wood, Lead System Programmer mw...@IUPUI.Edu
Don't Panic!

sh...@us.ibm.com

unread,
Jun 23, 1999, 3:00:00 AM6/23/99
to Mark H. Wood
>I think one source of argument here is that (for examples) OLTP, news
>spooling, and software development are almost, but not quite, totally
>unlike one another in the ways they use storage, so different folks have
>different priorities. It's fairly easy to see how EROS is a win for OLTP,
>but this may not be so clear for other activities.

Quite right about differences in usage leading to differences in priorities.

As I explained in an earlier round somewhere, the EROS storage allocator works
very hard to allocate things in an extent-oriented fashion. In practice this
works quite well.
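
Schematically, something like this (my sketch of the idea, not EROS source):

/* Extent-oriented allocation: keep logically adjacent pages
 * physically adjacent, so checkpoint writes -- and later reads --
 * stay largely sequential. */
struct extent {
        unsigned long start;           /* first disk block of the run */
        unsigned long len;             /* blocks in the run           */
};

/* Extend the current run when the requested block is adjacent;
 * otherwise the caller starts a new extent. */
int extend_or_fail(struct extent *cur, unsigned long want)
{
        if (want == cur->start + cur->len) {
                cur->len++;
                return 0;
        }
        return -1;
}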


Jonathan S. Shapiro, Ph. D.
IBM T.J. Watson Research Center
Email: sh...@us.ibm.com
Phone: +1 914 784 7085 (Tieline: 863)
Fax: +1 914 784 7595


Rik van Riel

unread,
Jun 23, 1999, 3:00:00 AM6/23/99
to david parsons
On 22 Jun 1999, david parsons wrote:

> >The real issue here is paradigms. The classical "everything's
> >a file" broke down with the advent of networking, sockets and
> >non-blocking reads.
>
> I think you're going to have to enumerate some of the cases
> where ``everything is a file'' broke down, and then you're
> going to need to enumerate some of the reasons why ``everything
> is a message'' is not exactly the same as ``everything is a
> file'' (aside from ``message'' being spelled differently
> than ``file'', which most people have already figured out.)

A file is a stream of bytes (usually of a known size).

A message can be part of a stream of bytes, a status
report, or something different.

A file is actively read() from; a message can be delivered
very much like a signal. This makes asynchronous I/O easier
than in the file paradigm, where async I/O has to be aided by
signals and other additional cruft.

If the OS only has to tack a status word onto the message
(or, in the case of large reads, send a message with a pointer
to a page which has just been mapped into the process's
address space), we again have a consistent interface for
all things in the system.
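
For instance, one delivery structure could cover all of them (a sketch --
the field names are invented):

#include <stddef.h>

/* One message format for file data, pipe data, network data,
 * signals, ... */
struct os_message {
        int     source;     /* descriptor the event belongs to      */
        int     status;     /* the status word tacked on by the OS  */
        void   *page;       /* for large reads: freshly mapped page */
        size_t  len;
};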

No longer a difference between files, pipes, network
connections, connectionless network communication and
signals.

One interface to rule them all
One interface to find them
One interface to bind them all
And out of the kernel unwind them :)

Rik -- Open Source: you deserve to be in control of your data.
+-------------------------------------------------------------------+
| Le Reseau netwerksystemen BV: http://www.reseau.nl/ |
| Linux Memory Management site: http://www.linux.eu.org/Linux-MM/ |
| Nederlandse Linux documentatie: http://www.nl.linux.org/ |
+-------------------------------------------------------------------+

Paul Barton-Davis

unread,
Jun 24, 1999, 3:00:00 AM6/24/99
to linux-...@vger.rutgers.edu
>A file is a stream of bytes (usually of a known size).
>
>A message can be part of a stream of bytes, a status
>report, or something different.

Rik -- messages are just streams of bytes too.

The big difference, which you almost get to, is that messages are
intended to be *potentially* deliverable asynchronously, whereas a
stream of bytes has no delivery semantics at all. In addition, a
message is intended to have some self-contained semantics, whereas a
stream of bytes has no semantics without a context in which to
interpret them.

In this sense, a stream of bytes is more fundamental than a
message. This observation points out that what we really need to unify
the now-kludgey collection of Unix I/O mechanisms is a common method
of doing async I/O delivery of a stream of bytes. Call it a message,
call it async file I/O, it doesn't matter.

Sometimes, it would be nice to say to the kernel:

"send me 50 bytes from this file every 10ms"
"whenever you've got 10K of this data stream ready to read,
send it to me"

which can be supported by the same general model that supports:

"tell me when 10ms has gone by"
"tell me when my heap size exceeds 100MB"
"tell me when another process has something to tell me"
"tell me when i tried to write to an illegal address"

As others have noted, it implies a very different programming model
than open/read/write/close, but it's hardly an unfamiliar one in this
era of event-driven main loops.
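
All of the above could go through one hypothetical registration call; the
names here are invented for illustration:

#include <stddef.h>

enum ev_kind { EV_READ_CHUNK, EV_TIMER, EV_HEAP_LIMIT, EV_IPC, EV_FAULT };

/* Hypothetical: ask the kernel to deliver events of the given kind. */
extern int request_event(int fd, enum ev_kind kind,
                         size_t amount, int period_ms);

void setup(int fd)
{
        request_event(fd, EV_READ_CHUNK, 50, 10);  /* 50 bytes / 10ms  */
        request_event(-1, EV_TIMER, 0, 10);        /* 10ms has gone by */
}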

--p

Sam Roberts

unread,
Jun 24, 1999, 3:00:00 AM6/24/99
to
Previously, Linus Torvalds wrote in list.linux.kernel:
>
> [snip]

> In short: message passing as the fundamental operation of the OS is just
> an exercise in computer science masturbation. It may feel good, but
> you don't actually get anything DONE. Nobody has ever shown that it
> made sense in the real world. It's basically just much simpler and
> saner to have a function call interface, and for operations that are
> non-local it gets transparently _promoted_ to a message. There's no
> reason why it should be considered to be a message when it starts out.
>
> Linus
>

and
> [snip]


> The issue is not how you get the data from one place to the other:
> "read()" is as good a way as "rcv()". Message passing is not the issue.
>

> The real issue is _naming_, and that's not going away. The name space has
> always been the difficult part. And that's where I agree that UNIX could
> do better: I think we do want to move into a "web direction" where you can
> just do a open("http://ssss.yyyyy.dd/~silly", O_RDONLY) and it does the
> right thing.

QNX4 is not an exercise of any kind; it is hugely popular (outside of
the server market (dominated by Unix) and the desktop (...)) and used by
people whose interest is in getting their work done well, now, not in
OS theory. BeOS and Plan9 are less commercially successful, but do
their jobs well. The days when micro-kernels were university toys
are past, as are the days when Unix was a university toy.

You and most of the posters are assuming that message passing means
*asynchronous* message passing. What about synchronous message passing?

QNX4 is based on synchronous send/receive/reply messaging. The synchronicity
means that message buffer management doesn't need to be done at all: the
source process is blocked until the message is delivered, processed, and
replied to. The source location of the message *is* the buffer, safely
(because that process can't run). Asynchronous messaging is easy to build
on top of this, but not all message passing has to pay its performance price.
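
The client side of such a call is just this (a sketch modeled on QNX4's
Send/Receive/Reply; the names and signatures here are invented):

#include <stddef.h>
#include <string.h>

/* Hypothetical primitive modeled on QNX4's Send(): blocks the caller
 * until the server has received, processed, and replied. */
extern int msg_send(int server, const void *smsg, size_t slen,
                    void *rmsg, size_t rlen);

struct fs_request { int op; char path[64]; };
struct fs_reply   { int status; };

int fs_stat(int fs_server, const char *path)
{
        struct fs_request req = { 0 /* OP_STAT, say */ };
        struct fs_reply   rep;

        strncpy(req.path, path, sizeof req.path - 1);
        /* req *is* the message buffer: since we stay blocked, the
         * kernel can copy straight out of our address space --
         * no queueing, no buffer management. */
        if (msg_send(fs_server, &req, sizeof req, &rep, sizeof rep) < 0)
                return -1;
        return rep.status;
}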

That these messages map simply onto Unix-style system calls is good: it makes
it easy to support the POSIX API, and to port software and programmers.

That these messages are naturally routed over a network is great, but
you can do this in-kernel by translating system calls to messages (ioctls
are always a problem, of course).

That they're not unidirectional like pipes, and make it easy to implement
systems as groups of co-operating processes, is great; but that can be done
by implementing a message-passing library (either in-kernel, as I have done
for Linux, or in user space).

The *real* advantage of message passing is that you no longer have to bind
*independent* subsystems into the kernel, the immediate target of all system
calls in a system-call-based OS. Why does serial device I/O have to thread
through the same code and locks as file I/O and networking?

The *real* issue is "naming", and implementing services such as an
http file system (which you mention), a CODA fs, drivers, etc.,
outside of the kernel.

IMO, micro-kernel message-passing systems make doing this *easy* and *natural*.
That doesn't mean you can't do it in a macro-kernel, but when the OS supports
what you want to do, your life becomes easier.

Is it slower? I've only started working with Linux (as opposed to running
file/mail servers on it) recently, so I haven't benchmarked it. I can say that
QNX is very fast, and has put in the work on optimizing message passing and
context switching that's necessary for a system that makes them the basis
of everything.

Linux is moving this way, but could do so more aggressively. Devfs is a start.

Notice how there are two classes of application in Linux? There are
user-space apps, and kernel-space apps (knfsd, sound drivers, etc.). If you
want something to be fast, or if you want to implement the open/write/read/...
API, you *must* be a kernel app, so more and more kernel code gets
written (and nobody can stop it; Linux is open source, and programmers do
what gets the job done).

People keep arguing about whether or not you can do it in user space. This is
the wrong debate. The question is "can I export my services so that they are
available through open/read/write/...?". If the answer is yes, then you
should, because that's one of the unifying principles of the Unix
API/architecture, and it works really well. And if you should, then today you
need to put your app/server/call-it-what-you-will in kernel space.

There is not now, and never will be, a one true OS architecture. Like good
languages, each architecture facilitates certain types of design models and
merely permits others. However, that there are more shipped macro-kernels
right now has more to do with marketing, and company inertia, than technical
merits.

Sam

--
Sam Roberts (s...@cogent.ca), Cogent Real-Time Systems (www.cogent.ca)
"News is very popular among its readers." - RFC 977 (NNTP)

Raul Miller

unread,
Jun 24, 1999, 3:00:00 AM6/24/99
to DAVID BALAZIC
DAVID BALAZIC <david....@uni-mb.si> wrote:
> Now, what I understand by sending messages in an OS is:
>
> message_type *m;
> m = <the contents of the message>
> send_message(m [, destination]);
>
> So execution will not necessarily be transferred to another piece of code;
> the message would just be transferred to some mailbox.
>
> This is (more or less) how AmigaOS works.

Er.. and AmigaOS had trouble moving to a virtual memory model, let
alone to a distributed network model.

Which just goes to show you: shuffling around arguments is a tool,
not a solution in and of itself.

--
Raul

Ralf Baechle

unread,
Jun 25, 1999, 3:00:00 AM6/25/99
to DAVID BALAZIC, linux-...@vger.rutgers.edu
On Thu, Jun 24, 1999 at 03:36:44PM -0400, Raul Miller wrote:

> Er.. and AmigaOS had trouble moving to a virtual memory model, let
> alone to a distributed network model.

But AmigaDOS was never designed to live in a virtual memory world. Just
two examples: it doesn't have a facility to prevent certain memory from
being swapped, and memory pools are maintained with a granularity smaller
than eight bytes, a number that is compiled into many apps.

Ralf

Mailing List Account

unread,
Jun 30, 1999, 3:00:00 AM6/30/99
to linux-...@vger.rutgers.edu
I am having problems getting crond to run correctly under a virtual server
(done with the virtual server HOW-TO as a guide) under 2.2.10ac4. It is
working fine under 2.0.36.

(Don't know if this was dropped as off-topic, or if it just never got
distributed -- it's been nearly 36 hours since it was sent)

The crond in the main server is running fine, but the virtuals are failing
to stat a file correctly. Following is a diff of the strace output from a
virtual (<) and the main server (>). It appears the value of st_mode is
getting set differently in the two environments. Why would that be
happening, and how can I correct it?

Note: crontrace is the virtual, maintrace is the main. Maintrace is the
working one, crontrace is the non-working one.


[root: /virtual/weather] # diff crontrace maintrace
5,6c5,6
< fstat(4, {st_mode=S_ISVTX|0327, st_size=0, ...}) = 0
< mmap(0, 8161, PROT_READ, MAP_PRIVATE, 4, 0) = 0x40014000
---
> fstat(4, {st_mode=031032, st_size=0, ...}) = 0
> mmap(0, 10786, PROT_READ, MAP_PRIVATE, 4, 0) = 0x40014000
11,14c11,14
< mmap(0, 974392, PROT_READ|PROT_EXEC, MAP_PRIVATE, 4, 0) = 0x40016000
< mprotect(0x400fc000, 32312, PROT_NONE) = 0
< mmap(0x400fc000, 20480, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_FIXED, 4, 0xe5000) = 0x400fc000
< mmap(0x40101000, 11832, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_FIXED|MAP_ANONYMOUS, -1, 0) = 0x40101000
---
> mmap(0, 974392, PROT_READ|PROT_EXEC, MAP_PRIVATE, 4, 0) = 0x40017000
> mprotect(0x400fd000, 32312, PROT_NONE) = 0
> mmap(0x400fd000, 20480, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_FIXED, 4, 0xe5000) = 0x400fd000
> mmap(0x40102000, 11832, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_FIXED|MAP_ANONYMOUS, -1, 0) = 0x40102000
16c16
< munmap(0x40014000, 8161) = 0
---
> munmap(0x40014000, 10786) = 0
18c18
< getpid() = 18663
---
> getpid() = 18672
26c26
< fstat(4, {st_mode=S_ISVTX|0325, st_size=0, ...}) = 0
---
> fstat(4, {st_mode=0, st_size=0, ...}) = 0
32,33c32,33
< getpid() = 18663
< write(4, "18663\n", 6) = 6
---
> getpid() = 18672
> write(4, "18672\n", 6) = 6
39,40c39,40
< stat("cron", {st_mode=S_ISVTX|0264, st_size=0, ...}) = 0
< fork() = 18664
---
> stat("cron", {st_mode=031366, st_size=0, ...}) = 0
> fork() = 18673


Best,
Scott Temaat, CEO
SOHOWeb Technologies, LLC (http://sohoweb.net)

Effective Web Presence Doesn't Cost -- IT PAYS
Toll-Free: 888-421-SOHO

Mailing List Account

unread,
Jul 3, 1999, 3:00:00 AM7/3/99
to linux-...@vger.rutgers.edu
I am receiving the following every 10 minutes:

Subject: Cron <root@pooh> /sbin/rmmod -as

rmmod: Function not implemented


This is a stock RedHat 6.0 install, with kernel 2.2.7ac4, compiled as a
monolithic kernel. How do I get the module cleanup process (I assume this
is what's going on, since rmmod -as is "remove all unused modules, output
to syslog") to stop running, since there are no modules in the kernel?
