Plan 9 most certainly has NOT been "designed from the ground up as a
commercial operating system." The fact that it was done in .com is
irrelevant. Plan 9 was designed from the ground up as a *research*
operating system--this does not mean that Plan 9 tries to push the
frontiers of operating system concepts (although it does in some
ways), but rather that Plan 9 was designed to be a platform on which
research, not necessarily OS research, is conducted. They wrote it
because they wanted a system they would enjoy using.
Considering that AT&T is supposedly in the process of selling Unix System
Labs, I'd doubt whether they want to get into the business of selling
operating systems again any time soon.
The amount of work it would take to turn Plan 9 into a system you
could sell to J. Random User would be prodigous, probably at least
as much work as writing an operating system from scratch, since:
* it does not conform to standards (ANSI C, Posix, etc)
* it doesn't use X11 (although they had it running at Murray Hill
a few years ago at least--however it will always be a second class
citizen to 8.5)
* it does not have an object-oriented user interface toolkit
with all sorts of Bells and Whistles and the "3d look"
* it does not have lots of glitzy, cute, marketable little features
* etc.
In general, all the complaints about Unix being a user-unfriendly
programmer-friendly operating system would probably also apply to
Plan 9. It has taken assorted companies a monumental amount of work
to turn Unix into something they could sell to J. Random User (in
the process completely destroying Unix...) and in order to sell Plan 9
they would have to do the same tremendous amount of work all over again.
I hope they don't. I like having a system that I actually enjoy using.
>Certainly AT&T's decision to sell Unix to Novell pretty much guarantees
>that they are not going to be getting into the OS business with Plan 9.
They are investing quite a bit in distribution of Plan-9 to universities,
a method that worked well in establishing another o.s. from Bell Labs.
Unless the marketing and development gurus who made USL such a success
take over Plan-9, I suspect that it will rapidly spread.
The NT
>beta software development kit is available on CDROM for PC's and MIPS
>machines for $69 from Microsoft. The CD-ROM contains all system
But no code. Now that Henry Massalin is working for MS, is there work underway
to put on-the-fly code generation into NT?
% Last time I spoke to Henry (about two months ago) he was at MicroUnity, not
% MS. --DL
--
Keep in mind that many of the things that one would
|> traditionally need source to do for Berkeley Unix (install device drivers,
|> new file systems, new network protocols, new OS emulations, etc.) can
|> be done without source in NT through installable drivers and file systems
|> using the device driver development kit. One advantage of this
|> approach for experimental OS development research is that new experimental
|> services can be built as installable file systems or drivers or emulations
|> and then distributed freely since they do not impinge at all on the base
|> OS. Such software could also be distributed as shareware or even sold
|> outright for commercial purposes should it move from research to practical
|> use by businesses. This has traditionally been difficult for research
|> organizations making changes to Berkeley Unix and some of its clones due
|> to USL licensing and legal practices.
Actually, IBM's AIX 3.x which has been available for quite a few years now,
also provides excellent support for dynamically installable device-drivers,
virtual-file systems and kernel extensions. This allows it to be a really good
basis for OS research. In fact we have been doing most of our research in
clustered workstation OSs by extending the AIX kernel. It is one of the few
commercially available (currently selling) that provides this kind of
support.
Arindam Banerji
(a...@cse.nd.edu)
I assume you have source, right? I know for a fact (i.e. someone from IBM
told me) that at least one of the documented ways to extend AIX (back-end
pagers) is broken. I already knew this because someone who has source could tell
me, so i didn't waste too much time on it. Of course I would have known
it sooner if IBM would release AIX source to research
outfits for less than $60K.
Granted you can extend AIX with drivers, etc. and I have done so here;
but much of the documentation is broken. When all is said
and done, you still need kernel source
for these things. I don't know if NT is any better. My guess is not,
pending hearing glowing reviews from people who have done it (note:
people from microsoft don't count in this case).
Also, for what it is worth, I can build dynamically loadable
drivers, virtual file systems, and other kernel extensions under SunOS.
I have found this to be ***MUCH*** easier under SunOS than AIX; Sun
has a better design. AIX is just plain painful, especially with that
cumbersome and baroque Object Data Manager database.
ron
--
rmin...@super.org
(301)-805-7451 or 7312
The people I `know' (i.e., via the net) who have moved to (commercial)
use of NT seem extremely happy with it.
Many of these people are also VMS hackers (and so might be expected to
sympathise with Cutler's ideas of the world...). Of course, VMS has
had loadable drivers for ages. In principle, one could write them
without access to the source; in practice, in the early days, source-
listing microfiches were all but essential.
No-one, apparently, would claim that VMS is a wonderful OS research
vehicle for having loadable drivers (indeed most of its kernel)...
|> [...]
--
Robin (Keep Radio 3 != Classic FM) Fairbairns r...@cl.cam.ac.uk
U of Cambridge Computer Lab, Pembroke St, Cambridge CB2 3QG, UK
With all this discussion of mounting special purpose file systems under
various operating systems, I thought I'd post some source showing how this
is done in QNX. Some code has been deleted to reduce the size of this
posting, but most of the structure is here. This code fragment is extracted
from a sample /proc resource manager.
This source code would be compiled like any user program and when run with
sufficient priviledge, would appear in the system as /proc. It can be
started and stopped like any user process, and source-level debugged. When
another process does I/O to the /proc namespace, the system shared libs
will convert the various POSIX I/O calls (open, close, read, write,
readdir, etc) into messages which are sent (via the microkernel) to this
/proc process. All resource managers in the OS are implemented in this
manner (the '/' file system, /dev, etc), meaning that "guest" file systems
are as "native" as any other file system present. This approach preserves
all the POSIX semantics and passes the NIST POSIX test suite, even in the
network distributed case.
#include <various header files>
union { // A union of various messages that might arrive in the buffer
msg_t type;
msg_t status;
struct _io_open open;
struct _io_open_reply open_reply;
struct _io_write write;
struct _io_write_reply write_reply;
struct _io_read read;
struct _io_read_reply read_reply;
// etc...
} msg;
char buff[512]; // The message buffer, could be any size
struct stat rootdir;
dev_t devno;
struct _mxfer_entry mx_entry[2];
main(int argc, char *argv[]) {
int nbytes, num, i, pindx, major, handle;
msg_t type;
pid_t pid;
if(qnx_osinfo(0, &osdata) == -1) exit( 1 );
if(qnx_psinfo(PROC_PID, PROC_PID, &psdata, 0, NULL) == -1) exit( 1 );
if((major = qnx_device_attach()) == -1) exit( 1 );
devno = (dev_t)((getnid() << 16L) + (major << 10L));
// Create a default root stat structure for this new file system
rootdir.st_ino = 0;
rootdir.st_dev = devno;
rootdir.st_size = osdata.microkernel_size;
rootdir.st_rdev = devno;
rootdir.st_ftime = psdata.un.proc.file_time;
rootdir.st_mtime = psdata.un.proc.start_time;
rootdir.st_atime = psdata.un.proc.start_time;
rootdir.st_ctime = psdata.un.proc.start_time;
rootdir.st_mode = S_IFDIR|S_IRUSR|S_IRGRP|S_IROTH|S_IXUSR|S_IXGRP|S_IXOTH;
rootdir.st_uid = psdata.ruid;
rootdir.st_gid = psdata.rgid;
rootdir.st_nlink = psdata.un.proc.links;
rootdir.st_status = _FILE_USED;
// Attach the resource manager into the pathname space
if ( qnx_prefix_attach( "/proc", NULL, 0) == -1 ) exit( 1 );
// Main event loop
for(;;) {
pid = Receive( 0, &msg, sizeof(msg) ); // Receive the message
msg.status = ENOSYS; // Prepare some defaults
nbytes = sizeof( msg.status );
switch( msg.type ) { // Main switch loop
case _IO_FSTAT:
msg.fstat_reply.status = EOK;
// Provide root directory info for stat request
memcpy( &msg.fstat_reply.stat, &rootdir, sizeof(struct stat) );
nbytes = sizeof( msg.fstat_reply );
break;
case _IO_OPEN:
if ( ( handle = open_resource(msg.open.path, &pindx) ) != -1) {
qnx_fd_attach(pid, msg.open.fd, 0, 0, 0, 0, handle);
msg.open_reply.status = EOK;
} else
msg.open_reply.status = ENOENT;
nbytes = sizeof( msg.open_reply );
break;
case _IO_CLOSE:
close_resource( __get_fd(pid, msg.close.fd, fd_ctrl) );
msg.close_reply.status = EOK;
nbytes = sizeof( msg.close_reply );
break;
case _IO_WRITE:
msg.write_reply.status = EROFS; // A read only file system
nbytes = sizeof(msg.write_reply);
break;
case _IO_READ:
if( (handle = __get_fd(pid, msg.read.fd, fd_ctrl)) != -1 ) {
num = read_resource(handle, msg.read.nbytes);
if( num > sizeof(buff) ) num = sizeof( buff );
msg.read_reply.nbytes = num;
msg.read_reply.status = EOK;
// The mx structure allows multipart messages without the
// redundant copying that would otherwise be needed to build
// contiguous messages for reply.
_setmx( &mx_entry[0], &msg,
sizeof(msg.read_reply) - sizeof(msg.read_reply.data) );
_setmx( &mx_entry[1], buff, num );
Replymx( pid, 2, &mx_entry );
nbytes = 0;
} else {
msg.read_reply.status = EBADF;
nbytes = sizeof( msg.read_reply );
}
break;
case _IO_READDIR:
if( (handle = __get_fd(pid, msg.read.fd, fd_ctrl)) != -1 ) {
num = readdir_resource(handle, msg.readdir.ndirs);
if(num > _DIRBUF) num = _DIRBUF;
msg.readdir_reply.ndirs = i;
msg.readdir_reply.status = EOK;
_setmx(&mx_entry[0], &msg,
sizeof(msg.readdir_reply) - sizeof(msg.readdir_reply.data));
_setmx(&mx_entry[1], procdir, sizeof(struct dirent) * i);
Replymx(pid, 2, &mx_entry);
nbytes = 0;
} else {
msg.read_reply.status = EBADF;
nbytes = sizeof( msg.readdir_reply );
}
break;
}
if ( nbytes ) Reply(pid, &msg, nbytes); // Reply to client message
}
}
--
Dan Hildebrand email: da...@qnx.com
QNX Software Systems, Ltd. QUICS: danh (613) 591-0934 (data)
(613) 591-0931 x204 (voice) mail: 175 Terrence Matthews
(613) 591-3579 (fax) Kanata, Ontario, Canada K2M 1W8
-Arindam Banerji
(a...@cse.nd.edu)
how about lightweight syncronmisation of process groups to handle
seperate multicast multimedia ? i doubt it...
most of the user specifiable scheduling,s too heavy weight and doesnt
include efficient deadline handling of audio and video devices in a
minimal clean way
and multicast stuff is too new to be properly integrated into the IPC
model...how do i name/address groups/manage them etc etc
and clean ways of provideing a distributed synch service (e.g. of video/audio
and file update a la ABCAST but relaxed to permit deviation in the
video and audio by a time rather than strict sequence...)
etc etc etc
i think research needs more open source operating systems STILL!
by the tone of some of the compacent messages about
mach/amoueba/plan9 anyone would think these problems have been
solved!
cheers
jon
Actually, this is something I have wondered about. We are thinking
about porting our new system, Horus, to NT but so far have had problems
figured out just what this would involve. With Mach, Chorus and QNX
we have a much clearer picture, but I picked up that book Rick mentioned
and ended up with absolutely no idea what real communications software
in NT is supposed to look like.
Isn't there any sort of NT programmers manual, with a section 1 on commands
and a section 2 on system calls and a section 3 on libraries, etc? Can
it be copied via FTP?
Anyhow, offhand, I don't know of any reason why a system like Horus couldn't
run efficiently over NT, but I do know that it can be a challenge to run
efficiently over systems like Mach and Chorus, and UNIX.
Multimedia video is another story. Here, the problem will require
several layers of technology. We need to start by figuring out how to
deal with ATM or whatever the video technology is from software -- not
trivial. Next, there is a question of whether an architecture like NT
(or UNIX, or QNX, etc) lets you do that in a way shared between more than
one program. Then there is the issue of whether you can connect this
to process groups without losing the whole show. My group does plan
to work on this problem; in fact, it is one of our main interests right
now, but so far, my sense is that none of the operating systems under
discussion actually has any particular technical merit that would make
it better for solving this problem then the others. Quite the contrary,
the real question is whether the OS has done something brain-dead that
makes it totally impossible to do anything reasonable at all. The answer
isn't clear to me on this, but I would say that all of the systems in my
list come "dangerously close" to being useless, mostly because they are
all so focused on RPC that they leave little flexibility for non-RPC
protocol stacks and architectures. Basically, modern operatings systems
still seem to view communication as something they own, and so all this
microkernel stuff you hear about still leaves you dealing with a macro-
kernel if you are in the protocol business, and doing anything other than
RPC.
So, long before we worry about abcast and process groups and so forth,
there seems to be a lot of basic technology that needs to be developed
and demonstrated. On the positive side, we're researchers, so I guess
this is an opportunity!
--
Kenneth P. Birman E-mail: k...@isis.com
Dept of Computer Science, Cornell Univ. TEL: 607-255-9199
Isis Distributed Systems Inc. TEL: 607-272-6327
Glad to hear i'm not lonely here :-) The question "is RPC (or
message-passing) is sufficient to base multimedia system on" is
somehow deeper than it appears from the beginning.
However, there IS a way to combine advantages of both microkernel
architecture and non-message-based interactions. The attached
article (it wasn't published before) desctibes my humble research
efforts here.
Caveat: since all my employers had absolutely no interest in the
research operating systems the project never reached the state
of a working system (although there was a lot of proof-of-concept
prototypes). The work is simply too large to be done by an
individual by weekends (i have to earn my bread somehow, too)
and the previous projects i participated in (the family of Soviet
Unix clones and the now-largest ex-USSR data network, RELCOM)
consumed weekends, too.
--vadim
The text should be formatted using troff -me.
------cut here-----
.fo ''- % -''
.\" .ad l
.(c
.sz 16
.pg
\fBIn\ \ Search\ \ Of\ \ The\ \ Holy\ \ Grail\fP
.br
.sp 2
.)c
.(c
.sz 10
Vadim Antonov
.br
.)c
.(c
(\fI...@uunet.uu.net\fP)
.sp 5
.)c
.(f
.sz 8
.in 0
.ti 0
Copyright 1992 by Vadim Antonov.
.br
All rights reserved.
.sp 1
The research and development on the project Grail was \fBnot\fP sponsored by and/or
conducted using resources of or during the hours paid for the author's
past and present employers; the results of the project Grail including this
text are the sole property of the author.
.br
Permissions are granted for reproduction and copying this text
without omissions and/or modifications.
.sz 10
.)f 0
.sh 1 "One particularly failed project"
.pp
It all began about ten years ago when the first \%Unix leaked
under the Iron Curtain and the Soviet hackers community discovered
that new idea of an unfettered OS (at the time everybody belonged
to a "clan" of IBM/370 or PDP-11 or whatever else zealots \- essentially
the isolated groups who didn't understand each other, had their own
gurus etc.
The idea of \%Unix was quite suspicious at the beginning \- there were
major doubts of possiblity of an efficient system which does not
depend heavily on details of underlying hardware.
However, the idea was appealling to system programmers in large
physics research institutes who already had major headache with
zoo of various incompatible hardware and software.
Incidentally those people were the same ones who contributed
heavily to the BESM-6 software (BESM-6 is a machine which was the
best in the world in its class and deserves a separate story \-
for this story it's enough to note that it has really extensive
entirely original software comparable in complexity only with
\%OS/370;
numerous math libraries and basically was the workhorse for
academical and military computations for two decades) \- and
therefore they weren't really apt to copy the Western
software blindly.
.pp
Fisrt of all, v6 didn't work on the Soviet hardware because of the lack
of any reasonable recovery from hardware errors; after that problem
was solved there was a problem with the lack of basics like a
WYSIWIG editor or 8-bit transparency.
And there was the eternal problem of efficiency.
I do not want to go into details here; but the result was that
Soviet Unix (aka \%\fIDEMOS\fP) was quite a lot different from its Western
origins.
.pp
Quite soon the shortcomings of \%Unix were discovered and the team
started to look for solutions \- initially the trend was to
add new functionality to the existing \%Unix framework \-
from graphics driver in the kernel to the
extensive IPC and things like execute-on-open files.
However, all forces of the DEMOS team were consumed by the growing
amount of work necessary to support the family of operating systems
and adapt new Western software; and (probably fortunately) the
development of a super-Unix never got over the stage of technical
project.
Anyway, it was discussed quite extensively and the realization
that the team is simply physically unable to create such a monster
was rather swift (and was somehow fueled by the growing irritaiton
caused by immense monolithic stuff like EMACS and BSD 4.x kernels and
zillions of small programs for plugging holes in \%Unix for particular
cases).
It forced the "radical" part of the team to try to do deeper analisys of
reasons of Unix's success; it resulted in some kind of a "theory of
evolution of operating systems" (it wasn't recorded in publications \-
at the time the publication of anything in the Soviet Union required a lot
of efforts and the "theory" became a matter of verbal tradition; I used to explain
some of that stuff to students on my lectures).
.pp
The project \fID3\fP (later renamed to \fIGrail\fP for a number of
reasons) was a direct successor of Soviet Unix development efforts, but
due to the number of reasons it remained my "own" (it couldn't
have a chance to become an "officail" project of the team which was
struggling for financial survival in the echonomical chaos of the
post-communist country).
However, I'd like to thank my DEMOS (now Relcom) colleagues for
support and understanding.
.sh 1 "Why dinosaurs die"
.pp
Basically, all operating systems are simple at the beginning.
The further development of a system finally leads to its death under
the load of its own complexity; eventually a system becomes a collection
of components so vast that reimplementing a function is easier
than finding (and for that matter, learning) an existing component
capable of performing the desired function.
Moreover, the system becomes "rigid" \- i.e. introducing new functionality
becomes more and more battered by concerns about possible breaking
some unknown part of the system (which leads to shamanism in programming
\- "do it this way because we did it this way for a long time and it worked"
and "don't do it because it may broke something somewhere dunno what, but
still...").
.pp
Simply speaking, complexity of a useable system is limited by ability of a single
person to understand it completely.
All extra functionality becomes a useless wight; and even worse \- it tends
to create unexpected problems consuming more and more efforts to
overcome.
.pp
The old school of application programming tells that if there is a problem
a program should be written to solve exactly that problem.
This approach completely misses the following facts of life:
.np
\fIProblems tend to evolve.\fP
Any user will want to expand the scope of the things she wants the
computer to do; and the application program will grow accordingly.
Eventually the application program will include its own screen editor,
database, backup system, you name what else.
The overgrown behemoth program will be proudly named "an integrated
system" and will ever fail mysteriously from bugs nobody's able to
track down.
.np
\fIProblems tend to merge.\fP
If there are two applications the user runs she'll eventually need to exchange
data between the applications.
There are more than good chances that initial design didn't include any provisions
for that case; that the programs have different formats of data files or in the
worst case never let the required data to be stored in external files.
A programmer facing this problem has three choices \- the first is to create
a kind of "gateway" program for providing an interface between existing
application programs (which is not always possible and practically always
requires some manual actions from user),
include functions of one package into another (hm, try to imagine adding
something to already bloated behemoth and (shrudder) debugging it)
or to rewrite everything from scratch to create another "integrated package".
Any way increases the number of components by one and pushes the entire
system closer to its imminent collapse.
.np
\fIThere will be problems nobody thought of before.\fP
As an example from the real life I'll mention multimedia \- all
existing operating systems simply do not provide adequate i/o model
for multimedia services.
.pp
\%Unix addresses the first two problems by introducing two fundamental things \-
tool-oriented environment and the universal text file format.
What is not usually told about \%Unix model of development
is that original \%Unix
components were designed as "complete" (in mathematical sense) functional
units but not as means for solving particular tasks which made them
rather immune to the "totally new" problem phenomenon.
Unfortunately later development of \%Unix was too infested with the old
"package" approach to writing the programs.
The fundamental drawback of the toolbox approach is the practical impossibility
of removing obsolete tools \- once some function dropped in it will likely
remain there forever.
The simple-minded attempt to remove something can easily produce, well,
interesting results \- how many people knew that v7 \fBtar\fP needed \fBawk\fP
for some of its functions?
.pp
However, while the "package" programming tends to create systems with exponential
growth of complexity (given linear growth of functionality), the \%Unix toolbox
model provides for the linear growth of complexity allowing for much longer
lifespan (Unix already has outlived many of its "competitors" like
\%VMS or \%RSX-11).
The end is imminent anyway (and the fact that its original authors abandoned
their creature long time ago seems to support that statement).
The Unix successor, \%Plan\ 9, is a complete reimplementation but
it still utilizes the same approach and there is no reason why it won't
find its dead end in exactly the same way as Unix did.
.pp
Another fatal disease is diversity.
I understand that it's a highly controversal statement but now I'm talking
about longievity of operating systems.
The uncontrolled evolution of a system tends to produce zillions of slightly
incompatible verions;
once released, this djinnie can't be put back into bottle; there is a strong
feeling that all \%Unix standartization attempts are doomed a priori.
(As a side track \- I've made up the empiric rule that the worse programming
language is the more dialects it has (think abouit BASIC, Pascal, C++); it
can be esaily explained because implementors tend to be "creative" if the
language specifications have holes or extremely unsuitable for practical
use "features").
.pp
\&From the user's point of view the diversity becomes really annoying \-
the lack of such in, say, \%\fIMS-DOS\fP is often a decisive factor for many
customers.
There is an internal controversy \- the non-diverse system is a dead one;
any living system changes and therefore produces different variants.
However, this problem is a particular case of the "functionality \->
complexity" phenomenon.
.pp
The only known way to defeat complexity is to rethink everything from
the beginning; throw old stuff away and start from scratch.
It is sometimes affordabe in acadeia
but in the real world it hardly works \-
more and more resources get invested into carrying the old baggage and
cases when it's easier to live with old technology than to switch
to newer become more and more typical.
It means only that operating systems are becoming stuff like nuts
and bolts \- have you seen any significant change in bolts in the
last years?
Combined with the trend to increasing complexity it leaves us only one
choice:
.pp
\fIThe ultimate goal of the operating system design is to create a
system which will not die of its own complexity.\fP
.sh 1 "How to solve the unsolvable problem"
.pp
That problem is only seems to be unsolvable \- the
more precise definition of the problem is
\&"how to provide a way to do major reimplementations of components
(or even creating completely new environments) without loosing
ability to run old programs".
It basically leads to the idea of a system which can contain
several generations of environments coexisting (and communicating)
in a single machine.
That's what all the object-oriented operating systems are about;
though there are older (and rather successful) things like
\%VM\ 370.
.pp
However, the conclusion above leaves one question unanswered \-
what the kernel should do (and what is the kernel anyway).
The selection of the set of kernel functions is a central
issue determining the shape of the entire system and it is
highly critical because in a properly designed object-oriented
operating system it is the only part whose interfaces cannot
be changed later.
.pp
Basically it means that the kernel should provide minimal
functionality sufficient to support \fIany possible\fP user environment
on a given class of machines.
It is easy to choose a sufficiently powreful set of system calls,
the problems are its minimality, efficiency and convinience.
(The minimality is the most important since everything added to
a kernel gets engraved in stone without any chances to retract
or change it later without massive efforts to track all usages
in all users programs).
.pp
The generally accepted model of an object-oriented system assumes that
there is a number of active entities (call them processes, actors,
whatever else) doing actual work inside them and exchanging information
outside.
(The original \fID3\fP terminology utilizes the deliberately obscure
term \%\fIP-object\fP (from \fIprocess\fP) to avoid misinterpretation.)
Some of the active objects should deal with the outside world in ways
which are not in the system's framework (device drivers or
user interfaces).
.pp
The first question is what kind of information those active objects
exchange and how that exchange is organized.
Here's the difference begins.
.pp
All existing object-oriented operating systems (Chorus, Mach, Amoeba, QNX)
employ message-passing as the basic communication method.
It works reasonably well for networking and stream i/o-s but
becomes a bottleneck when bulk data transfers are required.
The message-passing is not general enough to include such things
as virtual memory and copy-on-the-write semantics; it basically
forces developers to create a second interprocess communication
mechanism \- namely some kind of shared (virtual) memory (and
mutual exclusion semaphores as well) which surely does not contribute
to the simplicity of the kernel interface (and leaves opened the question
about non-traditional hardware architectures like data-flow machines or
transputers \- do we need to add special primitives to deal with them too?).
.pp
The trouble is in embedding the particular kind of interaction in the
basic set of kernel interfaces \- this road leads to eventual including
more and more primitives for handling other
\fIspecial\fP cases of interaction.
What we need is a single framework general enough to cover all
posible kinds of interaction within a specific class of machines.
.pp
Now, let's look at the process of exchange of information closer:
obviously we have (at least) two parties \- an originator and an
acceptor.
The originator should know about (potential) existance of the acceptor
and should have its attributes to direct the act of interaction
(we'll discuss addressing later).
The parties should also have an agreement about
the format and the physical nature (i.e. bits in RAM or packets in the
network) of the information \- otherwise the very act of exchange is
meaningless.
They also should have a common physical medium.
.pp
Let's call the presense of combination of those preconditions the \fIaccess\fP;
i.e. to initiate an exchange the originator should have an access to the
acceptor.
The fact of initiating the exchange should cause some action by the
acceptor; it forms the casual chain implicit in any communication.
Theoretically speaking, the primitive "signaling" is sufficient for
any kind of communication (in the same sense as Turing machine is sufficient
for execution of any program); in practice we need something more.
.pp
That "more" is the ability to carry some information from originator to
acceptor (and vice versa) as the side effect of the act of initiating the
\fItransaction\fP.
.pp
\fIThe fundamental principle of Grail architecture is that the tranactions
themselves do not carry any information except for accesses.\fP
.sh 1 "The world according to Plato"
.pp
It may seem to be really paradoxal \- how can the information exchange be
conducted \fIwithout\fP any actual information exchange.
The fact is the existing operating systems don't do any transfer of
information as well.
Hardware does the work.
.pp
When an operating system copies the message from user program to an internal
buffer and back the actual information never gets transmitted through or
stored in such entity as "operating system".
What really happens is that CPU pumps bits over wires to its internal
registers from memory and back; in some sense the information remains
\fIin the memory\fP (registers are memory too).
It is enough for one active object to pass access to that information
to another object to do the very same action.
Some people call it a memory pointer.
.pp
Let's look closer at interaction of active objects and memory.
All the activity of the active objects in a traditional computer is
issuing requests to memory to store some information or to retrieve
it back given the address and the number of bytes.
If we include CPU registers into memory (exactly like PDP-11's registers
in high memory I/O page or Weitek math coprocessors' registers) we'll
find that the memory exchange is self-sufficient; i.e. the address and
length of the data are stored in the memory as well.
.pp
The active objects in this model are non-programmable automatons with
the state information represented by the (ordered) set of accesses they
posess and with some invisible to the system connectivity to the
external world.
To fit the model of a "traditional" operating system those automatons
should be the simple virtual (and limited) replicas of
the CPU logic circuitry.
Their sets of accesses are represented by contents of CPU's page tables
and tables in the kernel.
.pp
It's time to introduce some "point of contact" (ports, sockets, pins,
synapses; \%\fIS-objects\fP (from \fIsocket\fP) in \fID3\fP
terminology) which active objects advertise to indicate their
willingness to accept transactions.
S-objects do not have any state and completely passive; every
S-object \fIbelong\fP to a P-object.
For simplicity let's assume that any S-object belong to no more
than one P-object.
.pp
When a P-object wants to initiate an interaction it must have an
access to an S-object belonging to the P-object with whom the first
P-object intends to communicate with.
For brevity, the initiating and the accepting P-objects (in relation to a
particualar transaction) will be called \fICP-object\fP (from
\fIclient\fP) and \fIMP-object\fP (from \fImaster\fP \- the more
appropriate \fIserver\fP abbreviates to \fIS\fP which may be
confusing).
.pp
An S-object can participate in initiating a number of various
kinds of interactions; let's call the particluar kinds of
interacations \fIoperations\fP.
The accesses passed with the transaction from CP-object to MP-object
and back are \fIarguments\fP of the operation.
Now we can introduce the formal definition of \fIprotocol\fP
accepted by S-object: it's the set of all acceptable combinations
of operations with their arguments and protocols of the arguments.
.pp
In a traditional operating system there are numerous
ways of communication between components; all of them
can be viewed as particluar kinds of protocols stacked over
the memory protocol.
Message-passing object-oriented operating systems basically
attempt to reduce all other protocols to sending/receiving
messages; however it is not adequate to the traditional hardware and
designers of message-passing systems find themselves forced to add
shared memory and other non-message things which effectively
ruins the initial goal of simplcity and coherency of design.
In \fIGrail\fP all memory objects (groups of consequtive bits)
belong to the \fBmemory\fP which is nothing more than a
P-object from the point of view of the rest of the system.
"Virtual processor" P-objects simply issue requests to the
memory P-object(s), though the implementation can "hide"
the actual transfers of individual words for the sake of efficiency.
A "processor" P-object can simulate memory to another "processor"
P-object \- on a traditional machine it can be done intercepting
page fault interrupts.
.pp
The most interesting, however, are non-memory based protocols.
The above definitions never mentioned the nature of information
exchanged as a side effect of initiating transactions.
In the case of memory protocol it is copying of the bits stored
in memory cells or registers; it can be, say, transmitting a
videosignal or telephone conversation or (as an extreme case)
exchange of ICBMs.
The duration of such interaction is the lifetime of transaction
(after all, moving bits \fIdoes\fP take some time so the memory
protocol isn't different).
The pattern of accesses inside the system "shapes" the communication
channels (it does not matter informational or material) in the
outside world (which includes the machine the system runs on).
The whole thing strongly resembles the Plato's idea of things
as shadows of ideas; accesses are ideas of potential
communications; the transactions are ideas of actual interactions
(but \fInot\fP the physical interactions).
.pp
For example we can take a computer system controlling a
phone exchange.
We can behold the non-memory protocol of telephone conversations;
although the computer system knows nothing about the
protocol (except that it does exist) and generally cannot speak the
"processor" P-objects inside the computer can posess and
transfer accesses to a telephone set; one of the operations
(requiring an access to another telephone set) is \fBcall\fP; and
transactions are telephone conversations.
This way transferring a call can be a "shadow" of sending some e-mail!
In this example all telephone S-objects belong to a
single P-object (the PBX driver) and are totally opaque to
the rest of a system.
.pp
Generally speaking, there is a number of clearly distinguishable
classes of machines; the classification is based on the kind of
information exchange between components (and for that matter the
kind of time they exist in):
.np
\fIStatic machines\fP are artifacts which do not change; their
components aren't engaged in information exchange.
Time does not exist for such machines.
The examples include mathematics, schematic diagrams, plans.
.np
\fIDiscrete machines\fP contains finite number of components
exchanging finite amounts of information in discrete time (i.e.
the interactions take zero time and delays are introduced
by components; however the \fIreal\fP duration of the delays
does not matter, only the order of events is visible).
This kind of machines are the "classical" computers and the
existing operating systems, algorithms and programming techniques
are designed to deal with this class of machines.
.np
\fIAnalogue machines\fP are composed from finite number of
components engaged in interactions of limited duration and
generally exchanging analogue information.
These machines live in the real physical time and all interactions
require some time.
All electronic circutis can be viewed as a particular case of this kind of
machines (all interactions between components of machine take
exactly the same time \- from turning the power on until turning the
power off (probably a bit longer)).
.np
\fINon-decomposable machines\fP are machines which cannot be
decomposed to independent entities; in a sense all parts of
such machines permanently interact with all other parts.
All (at least macroscopic) physical objects belong to this class.
.pp
I think it is obvious that those classes aren't equivalent and
that every machine of higher class can with some precision
behave like a machine from the lower class.
Most electronic computers are essentially designed as machines
of class 2 on the module level and as machines of class 3 on
gate level.
However, approaching the physical limits forces engineers to
attempt developing machines as class 4 devices which is incredibly
complex (think about all things like cross-talk, side capacity,
thermal noise and delays in leads).
.pp
\fIThe project Grail is an attempt to treat computers and computer-controlled
machines as devices of principally wider class \fP(class 3 on the
scale above) \fIthan traditional and message-passing operating systems do\fP.
.sh 1 "What's in a name"
.pp
In order to initiate an interaction P-object holding an access needs
some kind of addressing to select the desireable peer for the communication.
There are several possible addressing schemes, namely:
.ip "\fIGlobal addressing\fP"
assumes plain namespace in which every object has its unique name (address);
in some systems the unique name contains a cryptographically protected tag
allowing the use of name as an access (as in \%\fIAmoeba\fP).
Most object-oriented systems use some kind of global addressing; however
the global addressing do not scale well.
Moreover, in distributed environments the problem of generating of unique
names becomes rather non-trivial.
The practical global addresses are short enough to use them directly
in all system primitives.
.ip "\fIHierarchial addressing\fP"
is a variation of global addressing with names separated into subnames and
ordered as (and likely implemented as) tree-like structures.
This approach allows for better scalability than global addressing (in
fact, systems with the global addressing tend to finally adapt some limited form
of hierarchial addressing; an extreme example is the \%VMS file names
which can contain up to 8 fields with different syntax).
Most post-Unix operating systems have some kind of hierarchial addressing
embedded; \fIPlan\ 9\fP takes it to extreme giving every object a name in
a single universal namespace similar to the space of \%Unix file names.
The hierarchial addressing does not handle well temporary and replicating
objects and applications have to revert to generating names
like \%\fI/tmp/snd.GAAa22698\fP.
Creating customized environments in hierarchial namespaces
isn't easy and requires clumsy aliasing or mounting of directories.
The hierarchial addressing is easy to use for humans but is unsuitable
for use by programs because multi-component names have variable
(and ususally large) length; some kind of translation of hierarchial
addresses to local addresses is usually required before accessing the
named object.
.ip "\fIAttribute-based addressing\fP"
is a scheme in which objects are located not by their unique names but
selected using specified conditions.
The attribute-based addressing is rarely used in operating systems (may be
undeservedly) due to the complexity of its implementation.
.ip "\fILocal addressing\fP"
assumes that there is no universal system-wide namespace(s) and the
addressing is entirely relative to the local contexts of active objects.
The local addresses can be short and aren't supposed to used by humans;
usually it is the small integers.
Practically all operating systems employ local addressing as file
handles (channels, file descriptors).
.pp
A typical OS kernel usually contains some form of non-local addressing
built-in (Unix kernel implements \fIall\fP kinds of addressing \-
process ids are global, file names are hierarchial, signals can be
sent to groups of processes selected by attributes (process groups, user ids or
controlling terminals) and file descriptiors addressed locally).
In a properly designed (read object-oriented) operating system all system
services can be reached using uniformed addressing sheme.
A straghtforward implementation of global or hierarchial addressing
in a kernel does not worth further discussion for examples are numerous.
.pp
However, a local addressing alone is sufficient to emulate all other kinds
of addressing using user-level nameservers, i.e. programs receiving
long addresses and returning accesses to the addressed objects \fIif\fP
the system provides a way to pass the access to a client program.
"Mainstream" \%Unix has no provisions for that; this kind of
functionality first appeared in v8 in a rather ugly form of an \%\fBioctl\fP
call allowing to pass an opened file descriptor as out-of-band data in a pipe;
this principially new function came virtually unnoticed apparently because
if was not much of use in the existing \%Unix framework.
This function is fundamental for the \%\fIGrail\fP architecture.
The accesses posessed by some P-object simply addressed with positive
integer numbers very much like \%Unix opened file descriptors.
.pp
To convert some long address to a local address a P-object should simply pass
an access to memory area containing the long address with the request
to the \fIdirectory S-object\fP and receive an access (i.e. to get a unique
for our P-object local address selecting that newly acquired access).
The directory S-object belong to the nameserver P-object.
Doubtless, there should be a way for the client P-object to get the first
directory access; there are several alternatives.
The first is to hard-wire the \fIroot\fP directory in the kernel \-
this method was considered bad because it does not allow creating several
totally independent namespaces which clearly contradicts the goal of
the project.
Alternatively an access to the root directory can be inherited from
the "parent" P-object.
The \%Unix mechanism of inheritance is rather clumsy because it
does not allow reshuffling file descriptors (and won't work well in
case of an unexpected failure making additions like \%\fBFIONCLEX\fP
necessary);
\%\fIGrail\fP's approach is much simplier \- the kernel automagically
creates a \fIlink S-object\fP belonging to the newly created P-object;
the access to that object is returned to the parent.
The link S-object is used in "filling-up" the child's context (i.e.
opened files, accesses to system services, arguments, etc) from
the parent's.
Of course, this approach is less efficient than Unix's "one operation
does all" inheritance mechanism because it requires several interactions
between the child and the parent but its flexibility outweigh the
efficiency considerations.
.pp
This design allows simple and effective emulation of existing
operating systems as well as creation of arbitrary name spaces
with different semantics, syntax and resolution methods (centralized,
recursive or iterative \- take Internet DNS as an example).
One particularly appealing property is that it allows processes to have several
"current directories" and arrangement in which child process can change the
parent's current directory (with it's consent, of course).
.pp
Please note that P-objects are not addressable at all.
It means that their interfaces are limited to S-objects they create
and their internal structure is effectively invisible.
There is no way to distinguish between a single P-object and a conglomerate
of P-objects if their (externally advertised) S-objects respond similarly.
A \%\fIloader P-object\fP (responsible for creating new P-objects \-
effectively the driver of CPU) may elect to create an accompanying \fIcontrol
S-object\fP with each new P-object to provide means for non-consentual
control over that P-object (killing, tracing, reading status or changing
priority); however even the existance of the loader P-object is not
required (i.e. there may be systems witout any way to create new P-objects,
for example embedded applications).
.pp
\fIGrail does not enforce usage of any particular addressing scheme allowing
multiple name spaces with arbitrary structure to coexist within a single
environment; the local addressing combined with access passing is sufficient
to support user-level implementation of any kind of adressing\fP.
.sh 1 "Your id, please"
.pp
It is natural to associate bit masks of capabilities (i.e. allowed
operations) with local addresses of accesses to S-objects.
This approach is equivalent to keeping the complete tables of
permissions with the objects.
Further, the notion of access will be assumed to include the
corresponding mask of capabilities.
Duplicating and transferring accesses cannot add new capabilities
(capabilities can be created only toghether with creation
of the corresponding object), but there should be a primitive to
remove capabilities from a given set.
It also makes sense to have a primitive for merging accesses to
the same S-object with the same lifetime producing
united set of capabilities.
.pp
As you already noticed the transactions aren't merely the acts of
initiating and signalling about completion of a processing of request;
they last until the processing (or \fIinteraction\fP) completed.
This scheme requires four basic primitives:
.(b L
.ta .5i +\w'\fICP-object\fP\ \ \ \ 'u +\w'\fBwaitreply\fP\ 'u +\w'\-\-> 'u +\w'\fBaccept\fP\ \ \ \ 'u
\&\t\fICP-object\fP\t\fBrequest\fP\t\-\->\t\fBaccept\fP\t\fIMP-object\fP
\&\t\t\t\t\ \ v
\&\t\t\t\t\ \ v
\&\t\t\fBwaitreply\fP\t<\-\-\t\fBreply\fP
.)b
which is somehow more complex than the symmetric scheme often used in
message-passing operating systems:
.(b L
.ta .5i +\w'\fICP-object\fP\ \ \ \ 'u +\w'\fBreceive\fP\ 'u +\w'\-\-> 'u +\w'\fBreceive\fP\ \ \ \ 'u
\&\t\fIclient\fP\t\fBsend\fP\t\-\->\t\fBreceive\fP\t\fIserver\fP
\&\t\t\t\t\ \ v
\&\t\t\t\t\ \ v
\&\t\t\fBreceive\fP\t<\-\-\t\fBsend\fP
.)b
However, even leaving the philosophical aspects aside, the first scheme was
choosen because it allows \fItemporary accesses\fP.
.pp
A temporary access exists no longer than the transaction it was passed
with from CP-object to MP-object.
Temporary accesses are very much like arguments of procedures allowing
modules to hide their internal structure; it is also an important
addition to the capability table-based security of \%\fIGrail\fP.
Access carried by a transaction from CP-object to MP-object is
temporary by default (transferring a permanent access should be explicitly
requested); obviously a "returned" access (i.e. one carried from
MP-object back to CP-object) should be permanent (at least relatively to
that transaction).
While non-addressability of P-objects supports modularity by hiding internal
structure of P-objects, the temporary accesses provide much better security
for modular system effectively preventing obscure internal parts of servers
from keeping permissions longer than they actually need.
.pp
The last element of security provided by the \%\fIGrail\fP kernel is
authentication of clients with \fIkey S-objects\fP.
A key S-object is nothing more than an ordinary S-object belonging
to a trusted (by server) \fIkey server P-object\fP.
In order to obtain restricted services a client should provide server
with a copy of the key the client posess.
To verify the authenticity of the key the server passes the access to
the key server (using some trusted access to the key server acquired,
say, by inheritance) and the key server (after determining that it
indeed owns the object by comparing the client's access with the
corresponding
own access) returns the confirmation and attributes (like person's
real name, etc).
If a client has several keys it can sequentially try them or supply
all of them in a single request (if protocol allows it).
A system can have several key-servers or even several totally independent
environments with "firewalls" or even no connections between them.
.pp
\fIThe combination of temporary accesses, the lack of any means
to generate new capabilities to access already existing objects
and transfer them between two isolated subsystems and the
primitive for comparing accesses on equality is sufficent
and minimal for building any security scheme.\fP
.sh 1 "On types and mimicry"
.pp
A local address identifies the collection of accesses to the same
S-object (all with the same lifetime) for different kinds of operations.
When P-object initializes transaction it needs to specify (\fIto name\fP)
the operation it wants the S-object to perform.
.pp
The set of all possible operations on a S-object is called the \fItype\fP
of the object; the system knows nothing about the number or types of
arguments required by each operation (earlier versions of \%\fIGrail\fP
checked those as well, but it turned out to be rather clumsy and restrictive
and finally was left to be implemented in user-space servers).
Effectively, types are sets of operation names.
.pp
A naming scheme for types should conform the design goals of the system:
i.e. an introduction of new type should not affect parts of the system
which do not use that type (it rules out global naming for types);
any P-object should be able to create an S-object which can be
indistinguishable from any given S-object; there should be a way
to create subtypes and supertypes.
These requirements leave no choice but to use some kind of local
addressing; i.e. to treat types as ordinary S-objects and
identify operations by pairs (\fIaccess to the type\fP, \fIoperation
identifier within the type\fP).
.pp
Earlier \%\fIGrail\fP implementations followed this scheme literally;
operations in types were identified by positive integer numbers and
a CP-object needed both accesses to the type and to the S-object
in order to initiate a transaction.
(The type access was necessary because intger operation numbers could
easily collide in cases of multiple inheritance on creation of new types).
The current version does not require that CP-object has an access
to the type (it is rather inconvinient and complicates the code) and
the multiple inheritance problem is eased by using strings instead
of numbers as operation names.
Although it may seem that the change caused sufficient penalty in
efficiency it is not the case because the
\%\fIquasi-radix trees\fP are used to match the user-supplied strings
with the types' operation name lists.
QRTs deserve a separate publication, but the main idea is to build
"good" radix trees which test non-sequential bytes (i.e. each node
contains additional \fIoffset\fP field) in order to reach minimal
possible length of branches (this idea resembles Patricia trees*).
.(f
*\ see D.Knuth, The Art of Computer Programming, vol.\ 3.
.)f
Building QRT trees given a set of \fIN\fP keys of the length
not exceeding \fIM\fP takes relatively long time (the crude estimate
is \%O(\fIMN\fPlog\fIN\fP)), but the search is blazingly fast
on average (as fast as hashing on short strings and much faster
on long ones because QRT search does not need to check every byte)
and not slower than \%O(log\fIN\fP)+O(\fIM\fP) in the worst case
with small constants hidden in O's (the first component of the sum
is basically a dereferencing of three pointers, two indexings of byte arrays,
shift, a bitwise conjunction and a conditional jump; the second
component comes from the final string comparison on equality).
The extreme simplicity of QRT search algorithm contibutes to its
efficiency as well and since it has the logarithmic worst case time
(unlike linear for hashing) and is memory efficient (the existing
implementation utilizes 24 bytes per node and no additional memory
for leaves) it is a very good replacement for hashing, especially
for real-time applications (like the object-oriented OS kernel).
.pp
The type S-objects accept two operations \- \fBnew\fP returns the
access to the new object of this type belonging to the requesting
P-object and \fBlist\fP returns the list of operation names in this
type.
There is a special system call \%\fBgettype\fP returning the limited
access to the type of a given S-object, so any P-object can "simulate"
an object it has access to.
.pp
From the architectural point of view types are "abstractions" of protocols \-
i.e. type contains partial information about the protocol accepted by an object.
A protocol describes the logical system-level representation of an operation
(number of arguments it requires and returns, protocols of the arguments);
the physical parameters for the actual information interchage (i.e. if the
data reside in some particular RAM block or if the devices are connected
through network, etc) can be different for different accesses with the same
protocol (they will be called \fIphysical protocol(s)\fP later).
A typical user process is not aware of the physical protocol corresponding
the particular access but it is necessary to provide a way for the
system services (device drivers, etc) to distinguish
between different physical protocols.
The problem is solved by making \%\fBgettype\fP to return access to type
without the capability to perform \fBnew\fP; using that property
a conglomerate of P-objects dealing with a specific physical protocols can
tell "their" S-objects (implying physical information exchange)
from "fakes" by outside P-objects.
Usually operations on "fake" S-objects can be reduced to a set of
operations on "true" S-objects: for example file system serving
\fBread\fP request on a "fake" memory object (file) converts that
request to a sequence of "true" memory operations like DMA or RAM-to-RAM
block copy.
.pp
Since any P-object can create a type with an arbitrary set of operations
and client P-objects cannot distinguish such types from "original"
one if they do not have a sample they can trust (i.e. they aren't
\fIgiven\fP such a sample because they do not belong to the group of
P-objects sharing the control over some physical information medium)
the creation of super- and subtypes is nothing more than reading the
list(s) of operations of basic type(s); merging them in case of
multiple inheritance, removing or adding operation names to the list
and creating the resulting type S-object using the list we got.
.sh 1 "The matter of timing"
.pp
The last (but not least) aspect of the system architecture is
synchronization allowing parts of the system to act in concert.
Since \fIGrail\fP is supposed to work on machines of class 3
(analogue machines with finite-time transactions) and is not
concerned with the contents of the information exchange between
parts of the system the only events requiring attention of the
kernel are starting and termination of transactions and expiration
of accesses.
.pp
The creation of transaction by CP-object yields the reference
local address of this transaction viewed from client side, which
will be called below (probably somehow misleading) \fIthe local
address of C-transaction\fP.
When transaction is \fIactivated\fP by a CP-object an event is
generated within the MP-object responcible for serving that
transaction.
The MP-object should announce its willingness to be notified about
that event by adding the local address of the (own) S-object
the transaction is associated with to its things-to-wait list.
In order to serve that transaction the MP-object should \fIaccept\fP
it \- this system primitive yields \fIthe local address of
M-transaction\fP (note that C- and M-transactions here are in
fact the \fIsame\fP transaction viewed from client's and master's
sides accordingly).
The local address of M-transaction is used to access the transaction's
arguments and to \fIreply\fP to the CP-object when processing is done.
Replying releases the local address of the M-transaction and generates
an event within the CP-object.
The local address of C-transaction (still posessed by the CP-object)
is used in a list of events to wait and is released by the
\fBwaitreply\fP system call (which returns the completion code
of the transaction \- a signed integer number indicating the failure
(negative) or success (zero and positive) of the operation).
.pp
The scenario above depicts the "successful" act of interaction;
however sometimes there is a need to \fIcancel\fP the transaction.
If a transaction was activated but not yet accepted it is simply
removed from the queue and destroyed; if it is already completed
nothing have to be done; the case when transaction is canceled
while being processed by the MP-object is more difficult.
First of all, the MP-object should be notified about the cancelation \-
the address of the M-transaction can be added to the wait list.
The cancellation of the transaction does not mean the immediate
termination of MP-object's operations on this transaction;
it should clean up the action and indicate that the transaction
is completed.
Even if the transaction is cancelled, waiting on C-transaction will
not be finished before the transaction is completed.
.pp
A special kind of transaction is generated automatically by
the kernel when the last non-active access to an object was
destroyed.
.pp
\fIGrail\fP's two syncronization primitives take lists of local
addresses as its argument and returns the sequential number of
the list component's which is associated with the corresponding
event \- very much like BSD \%\fBselect\fP.
The difference is that one primitive (\fBselect\fP) waits if there is no
events pending, another (\fBcheck\fP) returns immediately.
Depending on the kind of the local address those primitives
will wait for:
.ip "-" \w'--'u
expiration of the access (for local addresses of alien S-objects),
.ip "-" \w'--'u
activating a transaction by a client or destruction the last
passive access (for local addresses of own S-objects),
.ip "-" \w'--'u
completion of the transaction (for local addresses of
C-transactions),
.ip "-" \w'--'u
cancellation of the transaction (for local addresses of
M-transactions).
.pp
Since the list of events to wait may grow fairly large there
is a kind of pseudo-objects called \fIgroups\fP which basically
are sets of local addresses.
A program can add local addresses to groups or remove them
providing that a local address can't belong to more than one
group.
Releasing a local address automatically removes it from the
group it belonged to.
Although groups have local addresses like any other objects
the accesses to groups cannot be passed from one P-object
to another.
Using group's local address in the wait list is equivalent to
mentioning all group's members explicitly.
.pp
The multiple-event waiting combined with groups allows to avoid
usage of signals and simplifies implementation of parallel
(multi-threaded) programs greatly.
The \fIexceptions\fP \- i.e. the error conditions generated
by the thread itself (like division by zero) cause immediate
termination of all threads in the P-object and signalling an
error in the way suitable for loader P-object.
Unix signals can be emulated by using a dormant "signal" thread
waiting with \fBselect\fP for all possible causes of signals.
.pp
\fIAn implementation of the Grail synchronization scheme does not
require any algorithms running in time larger than linear of the
number of references to an object/transaction no matter how large
the system is.\fP
.sh 1 "Transparent as nets are"
.pp
In the \fIGrail\fP architecture any P-object can create S-objects
"mimicing" any other S-objects and the internal structure of
P-objects is entirely opaque thanks to the temporary accesses and
impossiblity to address P-objects directly; probably the
most interesting application of that property is that a pair of
somehow communicating P-objects (let's call them \fInetwork
servers\fP) can be "invisibly" inserted
in the middle of any access \- if the first one gets an access
to an S-object it can tell the second to create the \fIshadow\fP
S-object with exactly the same type (the first P-object can easily
obtain and analyze its type).
After that all operations on the shadow S-object will generate
an event in the second P-object which in turn will request the
same (with the same name and arguments, that's it) operation
on the original S-object.
Since the only kind of arguments of operations allowed by
\fIGrail\fP is accesses the very same routine as described above
can be applied for passing arguments to/from the original
object's owner.
.pp
To keep the local P-objects checking for autheticity of
objects happy the network server should identify accesses
to the local S-objects passed back to the system from the network
and substitute the "real" ones instead of creating shadows.
and control their lifetime using aliases (see description of
the \%\fBalias\fP system call).
.pp
Granted, the exchange of accesses does not eliminate the need
to perform the actual information transfer.
To do it networking servers check is the object's protocol
is basic (like memory protocol) and perform the actual data
transfer.
It can be done using the method for identifying "true" S-objects
described above.
In some sense the special handling of basic protocols resembles
the concept of \fIwell-known services\fP widely used in Internet.
The difference is that unlike WKS the number of basic protocols
is very limited and all user protocols which can be reduced
to the basic protocols supported by network servers are
automatically supported.
This way the support for remote devices and network file
systems comes "for free" and does not require any special
network protocols and kernel code.
From the point of view of a user program, the system environment
(the collections of S-objects and threads of accesses appear
to merge trough "holes" formed by the network servers).
.pp
A P-object behaving like a pair of networking servers (kind
of a null net) can be used as a firewall between two isolated
parts of the same system to censor the flow of classified
information.
Another interesting configuration is a system
with several network servers acting as a "gateway" between
various networks; a network server can be totally unaware
about the existance of others.
.pp
The basic \fIGrail\fP networking functionality does not
include process migration, load balancing, fault avoidance
and data replication that can be easily implemented in user-level
"network supervisors".
The regularity of inter-object communications makes those tasks
much easier than their analogs for the traditional operating
systems.
.pp
\fIThus, the network becomes completely invisible for user programs
effectively making all Grail computers on a local net to
behave as a single homogenous computing system with common
file memory and common pool of peripherials.\fP
.sh 1 "Persistence of memory*"
.(f
*\ see the corresponding graphic design by S. Dali.
.)f
.pp
Although basic \fIGrail\fP functionality is general enough to
cover various kinds of hardware memory management (starting from
the lack of such) the case of paged memory is the most interesing
from the practical point of view.
.pp
As it was said before, the "virtual processor" P-objects issue
read and write requests to the memory they have access to;
however it is quite impractical to generate transaction for
every single read or write.
Instead, they map memory into their address spaces by asking
the memory manager to do it and supply it with the access to
issue read and write requests only on page faults.
The memory manager's objects are called
\fImemory segments\fP and follow the same memory protocol
plus an extra operation for mapping; there is an operation
to create a subsegment (which is a memory segment itself).
A memory \fIarea\fP can be created with the first segment
and will be destroyed after the last segment was destroyed.
If the memory manager asked to allocate mappable memory
and no access to read from/write to provided the memory manager
will allocate space in the default swap file.
.pp
The memory server can check that the accesses provided on creation
of the new memory areas refer to the same object and allow parties
to share the same area.
Copying large blocks within single mapped memory is handled
using copy-on-write.
The \fIGrail\fP virtual memory model supports weak consistency;
to share modifications parties should use the simply memory locks
resembling the BSD system call \%\fBflock\fP \- i.e. they
lock entire segment (but not entire area) and there are two kinds
of locks \- shared (for reading) and exclusive (for updating).
The same locking semantics is supposed to be supported by the
file system servers.
.pp
Since the file system servers in \fIGrail\fP architecture
do not need to copy data (they simply create subsegments
of user-supplied memory segments (i/o buffers) and request
disks to copy data to them the disk cache is implemented
simply as mapping the entire disks into memory using the mechanism
described above.
.pp
An interesting feature of this design is that since the actual
data transfer is performed by the original memory objects
servers (disk drivers, for example) and activity on upper levels
is limited to juggling accesses, data copying
within the same medium (or in the same computer, network, etc) will
"descend" to the lowest possible level, for example copying
a file from remote disk on the same disk will not result
in pumping data through network.
Another possible application of that principle is
copy-on-write semantics in file systems.
.sh 1 "The system calls..."
.pp
Handling of the system calls apper to be the botleneck
in the system performance because micro-kernel architectures
arguably use more system calls to perform the same functions
as more complex system calls of monolithic operating systems.
On sane CPUs system call shouldn't take much more time
than a procedure call; unfortunately on the most widespread
platform (i80x86) crossing the boudary between user and kernel
space takes awfully long time.
To avoid inherent inefficiency \fIGrail\fP uses batching of
system calls \- a user program simply stores syscall codes
and arguments in a buffer until it needs to get a reply from
the kernel (or to suspend execution waiting for some external
event).
This way, the entire procedure of creating the transaction,
filling the arguments, activating it and waiting for reply
usually requires single crossing the user/kernel boundary
although it may appear that multiple system calls were
issued.
The default values of the system call parameters are
speciallyu selected to support batching.
.pp
Another thing which should be noted \- the local addresses
of segments of C- or M-transaction are computed from
the local address of the transaction and sequential numbers
of segments using simple inline function \fBtseg\fP.
Local addresses are generally positive integers; negative
values are used as error codes.
The zero local address is commonly used as an indicator
of end of the list or as an indicator that default value
should be used.
.pp
The current implementation has the follwing system calls
(\fIsd\fP is an arbitrary local addrees,
\fImtr\fP and \fIctr\fP are local addresses of M- and C-transactions
accordingly, \fItype\fP is a local address of a type S-object,
\fIindex\fP is an arbitrary pointer value not inperpreted by
the kernel in any way, \fIres\fP and \fIn\fP are integers,
\fIsdvect\fP is a zero-terminated array of local addresses,
\fIopname\fP is a zero-terminated string, \fIoplist\fP is a
double zero-terminated list of zero-terminated strings,
\fIaddr\fP and \fIlength\fP are address and length of a
memory area):
.ip "\fIsd2\fP = \fBdup\fP(\fIsd2\fP, \fIsd1\fP)"
duplicates the access from local address \fIsd1\fP to
the local address \fIsd2\fP, if \fIsd2\fP is zero a new local
address will be allocated. It is important that \fBdup\fP
allows \fIsd1\fP and \fIsd2\fP to be addessed of M- or C-transaction's
segments.
Copying accesses to C-transaction segments will make them
temporary by default, the constant \%\fBPERMANENT\fP
should be added to \fIsd2\fP to copy permanent access.
.ip "\fBlimit\fP(\fIsd\fP, \fIoplist\fP)"
limit the capabilities associated with the access by the
list of operations;
.ip "\fBprohibit\fP(\fIsd\fP, \fIoplist\fP)"
remove listed operations from the capabilities associated
with the access;
.ip "\fIres\fP = \fBverify\fP(\fIsd\fP, \fIoplist\fP)"
returns bitmask of operations from the \fIoplist\fP which cannot
be performed on access \fIsd\fP;
.ip "\fBmerge\fP(\fIsd1\fP, \fIsd2\fP)"
adds capabilities from \fIsd2\fP to \fIsd1\fP; both accesses
should point to the same S-object and have the same lifetime;
.ip "\fItsd\fP = \fBalias\fP(\fIsd\fP)"
creates own S-object \fItsd\fP which will act as an alias
of \fIsd\fP \- i.e. all requests to \fItsd\fP will be
handled by \fIsd\fP and those objects will be equal, but the
destruction of \fItsd\fP will cause expiration of all
accesses to it and the notification about closing the
last access to \fItsd\fP will be sent to the owner of \fItsd\fP.
.ip "\fBclose\fP(\fIsd\fP)"
release the local address (side effects may include destruction
of an access, completing transaction with "canceled" error code,
etc);
.ip "\fItype\fP = \fBgettype\fP(\fIsd\fP)"
returns an access to the type object of \fIsd\fP (where
\fIsd\fP can be a local address of an object or M-transaction);
.ip "\fIres\fP = \fBeqobj\fP(\fIsd1\fP, \fIsd2\fP)"
returns true if \fIsd1\fP and \fIsd2\fP are accesses to the
same S-object;
.ip "\fIres\fP = \fBobjhash\fP(\fIsd\fP, \fIn\fP)"
returns hash value in range \%0\|.\|.\|\fIn\fP-1
for the object \fIsd\fP; hash values (with the same \fIn\fP)
for the different objects probably differ and are always the
same for different accesses to the same object.
.ip "\fIctr\fP = \fBrequest\fP(\fIsd\fP, \fIopname\fP, \fIn\fP)"
create C-transaction for the operation \fIopname\fP to the
S-object \fIsd\fP with \fIn\fP slots for arguments;
if \fIsd\fP is a C-transaction, a new transaction will be
created for the same S-object as previous and the
system will guarantee that no other transaction will
be queued in between the new one and the previous;
all the transactions in the \fItrain\fP will be activated
simultaneously;
.ip "\fIsd\fP = \fBreceive\fP(\fIsd\fP, \fIsd_tseg\fP)"
declare that the segment \%\fIsd_tseg\fP of a C-transaction
will be used for receiption of an access from
MP-object and that it will be stored in the local address
\fIsd\fP; if specified \fIsd\fP is zero, a new local
address will be allocated;
.ip "\fBactivate\fP(\fIctr\fP)"
activates the C-transaction(s);
.ip "\fBcancel\fP(\fIctr\fP)"
cancels the C-transaction(s);
.ip "\fIres\fP = \fBwaitreply\fP(\fIctr\fP)"
wait for the C-transaction (or the last transaction
in the train) to be completed and
returns the result code of the last executed
transaction in the train, if \fIctr\fP wasn't
activated \%\fBwaitreply\fP will activate it before
waiting;
.ip "\fIres\fP = \fBchecksd\fP(\fIsd\fP)"
returns the type of contents of the local address, i.e.
"free", "permanent access", "C-transaction", etc;
.ip "\fIres\fP = \fBselect\fP(\fIsdvect\fP)"
waits for an event associated with a local address from
the list; if there is an event pending returns immediately
with the sequential number of the first element of list
which has a pending event;
.ip "\fIres\fP = \fBcheck\fP(\fIsdvect\fP)"
sane as \fBselect\fP but returns immediately if there are
no pending events;
.ip "\fImtr\fP = \fBaccept\fP(\fIsd\fP, &\fIminfo\fP)"
.(b
struct \fBminfo\fP {
int \fIopnum\fP;
int \fInargs\fP;
void *\fIindex\fP;
}
.)b
accepts the request to own S-object (or one of own S-objects
belonging to a group) \fIsd\fP, fills \fIminfo\fP and returns M-transaction;
\fIopnum\fP is a sequential number of the operation in the list of
operations of the object's type, \fInargs\fP is the number of arguments
of the transaction and \fIindex\fP is a user-defined
pointer (the same as in "\fBnew\fP", see below),
.ip "\fBgetopname\fP(\fImtr\fP, \fIopname\fP)"
records operation name associated with the M-transaction to
the character array \fIopname\fP;
.ip "\fBreply\fP(\fImtr\fP, \fIres\fP)"
complete transaction with the reply code \fIres\fP (it will be returned
as a result of \%\fBwaitreply\fP), it also releases
the local address of M-transaction; negaive \fIres\fP
abandons all consequent transactions in the train;
.ip "\fIgsd\fP = \fBnewgroup\fP()"
create a new group;
.ip "\fBsetgroup\fP(\fIsd\fP, \fIgsd\fP)"
assign local address \fIsd\fP to the group \fIgsd\fP;
.ip "\fIgsd\fP = \fBgetgroup\fP(\fIsd\fP)"
get the group of the local address \fIsd\fP;
.pp
The following system calls are included for efficiency
and convinience and aren't necessary in the "strict" \fIGrail\fP
architecture.
.ip "\fIsd\fP = \fBgetcontext\fP()"
get the current context (P-object) control S-object;
.ip "\fIsd\fP = \fBmemr\fP(\fIsd\fP, \fIaddr\fP, \fIlength\fP)"
create a read-only non-expandable memory subsegment corresponding the
mapped memory of the current P-object at the given location;
if specified \fIsd\fP is zero, allocate new local address;
see also the operation "\fBmemr\fP" of the context control protocol;
.ip "\fIsd\fP = \fBmemw\fP(\fIsd\fP, \fIaddr\fP, \fIlength\fP)"
same as \fBmemr\fP but allocates write-only segment;
.ip "\fIsd\fP = \fBmemu\fP(\fIsd\fP, \fIaddr\fP, \fIlength\fP)"
same as \fBmemr\fP but allocates read and write segment;
.ip "\fIres\fP = \fBfork\fP()"
creates the new thread in the same context; returns 0
to the old thread, 1 to the new;
.ip "\fBquit\fP()"
terminates the thread of execution.
.sh 1 "Basic diplomacy (protocols)"
.pp
The several protocols are built-in:
.sh 2 "Type Protocol"
.pp
There is a \fItype-of-type\fP S-object which can be obtained as
.(b
\fItot\fP = \fBgettype\fP(\fBgettype\fP(\fIanyobject\fP)).
.)b
To create a new type apply "\fBnew\fP" to the type-of-type:
.(b
\fItot\fP."\fBnew\fP" (rcv type \fIty\fP, mem{"\fBread\fP"} \fIxoplist\fP).
.)b
\fIxoplist\fP contains the list of opnames each prepended with a byte
containing disjunction of '0' and of the following flags:
.ip "\fBOP_NORMAL\fP" \w'\fBOP_GRANTED\fP\ \ 'u
default pseudo-flag;
.ip "\fBOP_URGENT\fP" \w'\fBOP_GRANTED\fP\ \ 'u
out-of-band request;
.ip "\fBOP_GRANTED\fP" \w'\fBOP_GRANTED\fP\ \ 'u
availability of this request is granted
(i.e. operations limit and prohibit do not
affect availability of this operation).
.pp
The operations defined on types are:
.(b
\fItype\fP."\fBnew\fP" (rcv object \fIobj\fP, mem{"\fBread\fP"} \fIindex\fP)
.)b
creates a new object with of that type and assigns it
the address of a user-space structure (\fIindex\fP) which
is not interpreted by the system;
.(b
\fIlen\fP <- \fItype\fP."\fBdump\fP" (mem{"\fBwrite\fP"} \fIxoplist\fP)
.)b
writes the description of the type into the buffer \fIxoplist\fP
in the same format as was used when the type was created and
returns the actual length of the type description.
.sh 2 "Memory Protocol"
.pp
The virtual memory type object \fImem_type\fP can be obtained as
.(b
\fImem_type\fP = \fBgettype\fP(\fBmemr\fP(0, \fIany_valid_address\fP, 0)).
.)b
.pp
The memory protocol operations are:
.(b
\fImem\fP."\fBdup\fP" (rcv mem \fInewseg\fP)
.)b
\- effectively creates the new segment associated with
the same memory (i.e. it \fIdoes not\fP copy);
.(b
\fIlen\fP <- \fImem\fP."\fBread\fP" (mem{"\fBwrite\fP"} \fIdst\fP)
\fIlen\fP <- \fImem\fP."\fBwrite\fP" (mem{"\fBread\fP"} \fIsrc\fP)
.)b
copy as much data to/from \fImem\fP as possible and return
the number of bytes actually copied;
.(b
\fIres\fP <- \fImem\fP."\fBlock\fP" (mem{"\fBread\fP"} \fIop\fP)
.)b
(\fIop\fP is 1 byte value) performs locking function (one of
\%\fBL_UNLOCK\fP, \%\fBL_SHARED\fP, \%\fBL_EXCLUSIVE\fP,
\%\fBL_CHECK\fP) on the memory segment and returns the
status of the lock;
.(b
\fImem\fP."\fBsizeof\fP" (mem{"\fBwrite\fP"} \fIbounds\fP)
\fImem\fP."\fBcut\fP" (mem{"\fBread\fP"} \fIbounds\fP)
\fImem\fP."\fBexpand\fP" (mem{"\fBread\fP"} \fIbounds\fP)
struct \fBbounds\fP {
unsigned long \fIbase0\fP; /* low 32 bits */
unsigned long \fIbase1\fP; /* high 32 bits */
unsigned long \fIlength0\fP; /* low 32 bits */
unsigend long \fIlength1\fP; /* high 32 bits */
};
.)b
read and change memory segemen't boundaries. "\fBcut\fP"
cannot increase size of the memory segment.
.pp
New virtual memory objects can be created with
.(b
\fImem_type\fP."\fBnew\fP" (rcv mem \fInewobj\fP, [mem \fIoriginal\fP]).
.)b
This call creates a (zero-length) memory segment with all operations
allowed (if \fIoriginal\fP is not specified) or with the set of allowed
operations identical to the set of original's operations (a shadow memory,
see discussion before).
.sh 2 "Context Control Protocol"
.pp
The context type can be obtained with
.(b
\fIcontext_type\fP = \fBgettype\fP(\fBgetcontext\fP()).
.)b
.pp
The following operations are defined on contexts:
.(b
\fIcontext\fP."\fBmemr\fP" (rcv mem \fIm\fP, mem{"\fBread\fP"} \fIbounds\fP)
\fIcontext\fP."\fBmemw\fP" (rcv mem \fIm\fP, mem{"\fBread\fP"} \fIbounds\fP)
\fIcontext\fP."\fBmemu\fP" (rcv mem \fIm\fP, mem{"\fBread\fP"} \fIbounds\fP)
.)b
perform the same functions as \%\fBmemr\fP, \%\fBmemw\fP and
\%\fBmemu\fP system calls.
.(b
\fIcontext\fP."\fBattach-data\fP" (mem{*} \fIm\fP, mem{"\fBread\fP","\fBwrite\fP"} \fIbase\fP)
\fIcontext\fP."\fBattach-text\fP" (mem{"\fBread\fP"} \fIm\fP, mem{"\fBread\fP","\fBwrite\fP"} \fIbase\fP)
\fIcontext\fP."\fBdetach\fP" (mem{*} \fIm\fP)
.)b
Map and unmap memory text (executable) and data memory segments to
the context.
\fIBase\fP specifies the desired address the segment should be attached
at (-1 allows the system to choose any location); the actual base
address (which may differ from advised) will be written to \fIbase\fP.
.(b
\fIcontext\fP."\fBtrace\fP" (mem{"\fBread\fP"} \fIop\fP, mem{"\fIread\fP","\fIwrite\fP"} \fIdatum\fP)
.)b
is used for debugging and tracing \- this request is
hardware-dependent;
.(b
\fIcontext\fP."\fBkill\fP" (mem{"\fBread\fP"} \fIreason\fP)
.)b
terminate all threads in the context immediately;
.(b
\fIcontext\fP."\fBpriority\fP" (mem{"\fBread\fP"} \fIprio\fP)
.)b
set the priority level, subject to limitations;
.(b
\fIcontext\fP."\fBinfo\fP" (mem{"\fBwrite\fP"} \fIinfo\fP)
.)b
returns the status information about the context;
.pp
To create a new context use the operation
.(b
\fIcontext_type\fP."\fBnew\fP" (rcv context \fInewc\fP, object \fIlink\fP)
.)b
(access to the link object will be assinged the local address 1);
then allocate and attach text, data and stack segments and
allow execution of the first thread by raising the priority.
.pp
The following protocols aren't built-in but important
enough to be mentioned here:
.sh 2 "Directory Protocol"
.(b
\fIdir\fP."\fBcreate\fP" (rcv object \fInewobj\fP, [\fIparameter\fP...])
.)b
Creates an object with given parameters in the memory
(on the media, etc) corresponding to the directory;
.(b
\fIdir\fP."\fBbind\fP" (object \fIobj\fP, mem{"\fBread\fP"} \fIname\fP,
[[mem{"\fBread\fP"} \fIprotection\fP,] access_key \fIowner\fP])
.)b
Binds an object (apparently created with \%"\fBcreate\fP" though
some directories may allow arbitrary objects) to the \fIname\fP
in the directory;
.(b
\fIdir\fP."\fBopen\fP" (rcv object \fIobj\fP, mem{"\fBread\fP"} \fIname\fP,
[mem{"\fBread\fP"} \fIkind_of_access\fP, [access_key \fIkey1\fP...]])
.)b
Opens the object with the given \fIname\fP for the particulars
kids of access (providing that client provided its credentials
\fIkey1\fP...).
.(b
\fIdir\fP."\fBunbind\fP" (mem{"\fBread\fP"} \fIname\fP)
.)b
Removes \fIname\fP from the directory.
.(b
\fIlen\fP <- \fIdir\fP."\fBgetent\fP" (mem{"\fBwrite\fP"} \fIname\fP, mem{"\fBread\fP","\fBwrite\fP"} \fIpos\fP)
.)b
Returns the name of the element of directory at position \fIpos\fP
and updates the position to point to the next element and returns
the length of the name.
The zero value of \fIpos\fP points to the beginning of the
directory.
Position is a 4-byte integer.
.sh 2 "Access Keys Protocol"
.pp
.(b
\fIlen\fP <- \fIkey\fP."\fBid\fP" (mem{"\fBwrite\fP"} \fIbuf\fP)
.)b
writes the system-wide "nickname" associated with the \fIkey\fP
into \fIbuf\fP and returns its length.
.(b
\fIlen\fP <- \fIkey\fP."\fBname\fP" (mem{"\fBwrite\fP"} \fIbuf\fP)
.)b
writes the real user name associated with the \fIkey\fP into \fIbuf\fP and returns
its length.
.pp
To verify authenticity of the key use
.(b
\fIres\fP <- \fItrusted_key_server\fP."\fBverify\fP" (access_key \fIkey\fP)
.)b
which returns 0 if \fIkey\fP is not authentic.
.uh "References"
.lp
D. Knuth,
\fIThe art of computer programming\fP, vol.3.
.lp
M. Rozier, V. Abrossimov, F. Armand and others,
\fIOverview of the Chorus Distributed Operating System\fP,
tech. repoet CS/TR-88-7, 1988.
.lp
Sape J. Mullender, Guido van Rossum, Andrew S. Tanenbaum
and others,
\fIAmoeba: A Distributed Operating System for the 1990s\fP,
Computer, vol 23, May 1990.
.lp
David L. Presotto, Dennis M. Ritchie,
\fIInterprocess Communication in the Ninth Edition Unix System\fP,
Software \- Practice and Experience, vol 20, June 1990.
.lp
\fIPlan 9 \- The Reference Manual\fP.
.lp
V. Antonov,
\fIA Regular Architecture for an Operating System\fP,
ACM SIGOPS Operating Systems Review, vol 24, number 3,
Jule 1990
------cut here-----
I was just poking around the U.Washington TR archives looking
for something else and I came across this Abstract which looks
pertinent. No - this one isn't in the archive. If someone has
it, would you e-mail me a copy, please? ( Otherwise, I'll have
to wait for paper-mail! :-(
- Steve Majewski (804-982-0831) <sd...@Virginia.EDU>
- Univ. of Virginia - Department of Molecular Physiology and Biological Physics
TR # 92-03-11
``The Case for Application-Specific Communication Protocols''
Edward W. Felten
Message-passing programs are heavily dependent on the performance of
communication primitives. While communication hardware has
gotten much faster in the last few years, the communication performance
achieved by application programs has not improved so dramatically. We argue
that this `communication gap' is inherent in the design of conventional
message-passing systems, which are based on general-purpose communication
protocols implemented in the operating system. Closing the gap requires
fundamental changes in system design.
We review the arguments for user-level implementation of communication
protocols. We discuss how the architecture and operating system can be
structured to provide user-level communication without compromising system
protection.
We further argue that general-purpose communication protocols, whether
implemented at user level or in the kernel, are inherently slow.
Optimum performance requires a protocol designed to take advantage of
the behavior of a particular application program. We introduce the notion
of a `protocol compiler', a program that generates an application-specific
protocol, based on information about an application's communication
behavior. As a concrete example, we sketch the design of a protocol
compiler that can generate nearly optimal communication code for a
class of data-parallel applications.
: Isn't there any sort of NT programmers manual, with a section 1 on commands
: and a section 2 on system calls and a section 3 on libraries, etc? Can
: it be copied via FTP?
The only other published material that I am aware of comes as two books,
whose exact titles I forget, but are volumes 1 and 2 of -
Win32 Preliminary SDK Reference (??)
or something like that. These detail the API to NT as it stood about a
year ago.
Better yet, fork out the US$ 69 for the complete beta OS on CD, complete
with 7500 pages of doco in postscript. The latest release was about the
8th March.
Check out comp.os.ms-windows.win32.programmer for further info.
(I have absolutely no affiliation with Microsoft)
--
David Arnold
====================================================================
CRC for Distributed Systems Technology arn...@dstc.edu.au
University of Queensland voice +61 7 365 4367
Australia fax +61 7 365 4399
David Arnold replies
[...]
> Win32 Preliminary SDK Reference (??)
>
> Better yet, fork out the US$ 69 for the complete beta OS on CD, complete
> with 7500 pages of doco in postscript.
No, both these documents describe Microsoft's Win32 API. The Win32 API is to
NT as 4.3 BSD UNIX' API is to Mach. This documentation is not particularly
useful for OS research. (Any more so than say than the Posix standard given
that NT supports that as a protected subsystem too.)
--
Zalman Stern zal...@adobe.com (415) 962 3824
Adobe Systems, 1585 Charleston Rd., POB 7900, Mountain View, CA 94039-7900
If you need more than that, sources of NT will be licensed by MS to
Universities and Research Labs once the product is out.
Nice example. For comparison, I'll post an AmigaDOS filesystem. I'm
writing this off the top of my head, so it probably would need tweaking to
compile. AmigaDOS, like QNX, converts requests into messages sent to a user-
level process. I'll try to keep it short and leave out non-relevant parts,
including most of the error-checking.
Note: this is for a ram disk. Also, for historical reasons from
Tripos and BCPL, many DOS pointers are BPTR's (longword offsets; i.e. pointers
shifted right by 2. Ugh.)
#include <various header files>
struct DosPacket {
struct Message *dp_Link; /* EXEC message */
struct MsgPort *dp_Port; /* Reply port for the packet */
/* Must be filled in each send. */
LONG dp_Action; /* See ACTION_... below and
LONG dp_Res1; /* For file system calls this is the result
* that would have been returned by the
* function, e.g. Write returns actual
* length written */
LONG dp_Res2; /* For file system calls this is what would
* have been returned by IoErr() */
LONG dp_Arg1;
LONG dp_Arg2;
LONG dp_Arg3;
...
};
struct node *root;
struct lock *lock_list;
struct MsgPort *MyPort;
LONG res1,fileerr;
main (struct DosPacket *dp) {
struct Message *msg;
struct DeviceList *devnode,*newnode;
struct node *node;
ULONG arg1;
/* various shared library opens/inits */
...
MyPort = &(FindTask(NULL)->pr_MsgPort);
root = (struct node *) AllocVec(sizeof(struct node),MEMF_CLEAR);
root->type = ST_USERDIR;
strcpy(root->name,"Ram Disk");
lock_list = NULL;
spaceused = 1;
/* add our volume entry to the system */
newnode = MakeDosEntry("Ram Disk",DLT_VOLUME);
newnode->dl_Task = MyPort;
newnode->dl_DiskType = ID_DOS_DISK;
DateStamp(&(newnode->dl_VolumeDate));
AddDosEntry(newnode);
/* our Device node already exists - BPTR */
devnode = (struct DeviceList *) (dp->dp_Arg3 << 2);
devnode->dl_Task = MyPort;
/* init finished successfully, tell system we're ready */
ReplyPkt(dp,DOS_TRUE,0);
while (1) {
sigs = Wait(1L << MyPort->mp_SigBit);
/* handle dospackets */
while ((msg = GetMsg(MyPort)) != NULL)
{
dp = (struct DosPacket *) msg->mn_Node.ln_Name;
/* set result to failure, clear secondary result */
res1 = fileerr = 0;
/* used by most functions-usually BPTR to filehandle or lock */
arg1 = dp->dp_Arg1 << 2; /* save space, do it once */
switch (dp->dp_Action) {
case ACTION_WRITE: /* Write(fh,buffer,len) */
res1 = write((struct lock *) arg1,
(CPTR) dp->dp_Arg2, dp->dp_Arg3);
break;
case ACTION_READ: /* Read(fh,buffer,len */
res1 = read((struct lock *) arg1,
(CPTR) dp->dp_Arg2, dp->dp_Arg3);
break;
case MODE_NEWFILE: /* fh = Open(name,MODE_NEWFILE) */
flag = 0;
open:
res1 = openmakefile(
(struct FileHandle *) arg1,
(struct lock *) (dp->dp_Arg2 << 2),
BtoC(str1,dp->dp_Arg3),
flag);
break;
case ACTION_END: /* aka Close(fh) */
res1 = closefile((struct lock *) arg1);
break;
etc...
case ACTION_INHIBIT:
case ACTION_GET_BLOCK:
case ACTION_MORE_CACHE:
default:
fileerr = ERROR_ACTION_NOT_KNOWN;
} /* switch */
/* we've handled the packet, return it */
ReplyPkt(dp,res1,fileerr);
} /* while GetMsg */
} /* while 1 */
}