I've been reading about Plan 9... there are lots of goodies there under
/sys/doc. However, I have a couple of lingering questions that don't
seem to be answered anywhere:
Observation 1: There doesn't seem to be any provision for moving a
directory from one directory into another directory; that is, moving
it to a different directory on the same <type,device> file system.
Observation 2: There doesn't seem to be any support for hard links.
My questions:
Are these features, in fact, unsupported? Or did I overlook something?
If they're unsupported, why? Were they simply overlooked? Are there
compelling technical or theoretical reasons for not providing them?
Are there any proposals afoot to implement either of these features? If
not, are there any workarounds (besides cp&&rm and bind, respectively)?
I've checked the docs under /sys/doc, the man pages, the 9fans archives,
and the googleweb, but I can't seem to find any explanation for these
two properties. (The case for omitting symlinks, I think, is obvious:
they make most file-related utilities 3x more complicated than they
would be otherwise.)
--
+---------------------------------------------------------------+
|E-Mail: smi...@zenzebra.mv.com PGP key ID: BC549F8B|
|Fingerprint: 9329 DB4A 30F5 6EDA D2BA 3489 DAB7 555A BC54 9F8B|
+---------------------------------------------------------------+
dircp and bind(1).
As it happens, feature(X) is not really uniformly available in Linux
either. It sorta works on some types of things, and does not really
work on most others. Take a simple one: select on an open file.
Doesn't really work, might be useful, so we get inotify. Well, but,
inotify doesn't work on most things, but for those things it doesn't
work on, well, maybe select will work. You end up with the tangled
thicket of overlapping, but incompatible, feature sets that are
almost, but not quite, entirely unlike what Unix was supposed to be:
that's Linux today.
This made me think about the design rules for Linux and Plan 9. From
what I can see on Linux, the rule is: if it will work on something and
make it better, do it. It may not work on most things, but as long as
it works on one, go ahead and plug it in. The Linux rules also happen
to make those few things feature(X) works on fast for some cases. Put
together enough special cases and for those uses that conform to those
cases, you've got a very fast and capable system. Which Linux is. But
there is certainly a great deal of ugliness there.
If you look at what a hard link really is, you'll realize that your
two questions are actually the same question.
On Plan 9, the rule tends to be: if feature(X) can't be implemented in
a way that works for everything, don't do it. No special cases
allowed; it's why mmap is discouraged. This has led to a system that
is quite uniform and a pleasure to use, albeit lower performing than
its competitors in many places.
If you look at what a hard link is, you'll realize why they are not in Plan 9.
Just my take, anyway.
ron
nominated for informative post of the month.
- erik
i think the record is quite clear that ken, rob, presotto, et al. were
well-aware of these things. ron has made excellent points about why
these features might be left out.
i'd like to add though, the question is not, "why not include feature x".
the question should be "why do i need feature x". in general features
just make a system bigger and more complicated. so the bias should be
to leaving them out.
- erik
> If you look at what a hard link is, you'll realize why they are not in
> Plan 9.
It's not that obvious to me. A hard link is another name for a file,
uniquely identified by <type,device,qid>. The effect of a hard link can
be simulated with bind, but that requires managing a list of exceptions (one
bind for each "link"). If the binding were done server-side, there
would need to be some additional protocol element (perhaps a Tbind
request) to add another name to a file. The semantics of Tbind could
meaningfully be extended to all types of files, not just disk files.
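For what it's worth, the bind simulation looks roughly like this from C
(paths invented; the target name must already exist, since bind only
rearranges the per-process-group name space):

#include <u.h>
#include <libc.h>

void
main(void)
{
	/* make /tmp/alias answer for the same file as /tmp/data,
	   in this name space only; /tmp/alias must already exist
	   as a file to be bound over */
	if(bind("/tmp/data", "/tmp/alias", MREPL) < 0)
		sysfatal("bind: %r");
	exits(nil);
}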
I don't understand why 9P doesn't allow transporting bind operations
from machine to machine like this. /sys/doc/9.ps talks about ongoing
research on ways to export namespaces from one machine to another.
Allowing binds to traverse 9P seems to be an easy way to do that.
The alternative, making a full copy of the file/directory, wastes disk
space (unless it's de-duplicated by Venti) and bandwidth.
It's similar for moving directories. If you have a 10 GiB directory,
doing dircp&&rm pushes 20 GiB of traffic across the link to the file server.
At $local_big_networking_corp, I got chewed out for copying a 650MB ISO
across a single router. If the 9P Twstat message had a destdirfid
field, a fid could be relocated by altering file system metadata alone.
If the destdirfid does not represent a directory, or represents a
location on a different <type,device> file system, then just return
Rerror("move crosses bind") and make the client do a full dircp&&rm.
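For reference, rename within a directory is already expressible from user
code via wstat; a hypothetical destdirfid would extend exactly this
operation. A minimal sketch (paths and names invented):

#include <u.h>
#include <libc.h>

void
main(void)
{
	Dir d;

	nulldir(&d);		/* mark every field "don't change" */
	d.name = "newname";	/* only the last path element changes */
	if(dirwstat("/tmp/oldname", &d) < 0)
		sysfatal("dirwstat: %r");
	exits(nil);
}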
I suppose it's possible to interpose a file system, call it "linkfs",
between the file server and user processes. Something like:
term% linkfs -f /path/to/persistent/bind/cache
term% mount /srv/linkfs /
term% echo foo > /some/path
term% echo /some/path /another/path > /dev/hardlink
term% cat /another/path
foo
AFAIK, "linkfs" doesn't exist. I totally made it up; I'm just throwing
out some ideas, here.
how do you specify the device? you can't without giving up
on per-process-group namespaces. i don't think there's any
way to uniquely identify a device except through a namespace,
and there's no global namespace.
> I don't understand why 9P doesn't allow transporting bind operations
> from machine to machine like this.
this is done all the time. every time you cpu, you are exporting
your whole namespace to the target machine.
> It's similar for moving directories. If you have a 10 GiB directory,
please explain why a bind is not appropriate here?
> At $local_big_networking_corp, I got chewed out for copying a 650MB ISO
> across a single router.
did the router get tired?
- erik
> the tangled
> thicket of overlapping, but incompatible, feature sets that are
> almost, but not quite, entirely unlike what Unix was supposed to be:
> that's Linux today.
One for the fortunes file.
You are focusing on "how" to implement it, not "why".
Ask yourself *why* do you need it. Is it just convenience
(what you are used to) or is there something you do that
absolutely requires hard links? Next compare the benefit
of hardlinks to their cost. Is it worth it?
Hard links force certain design choices and make things more
complex. For instance, you have to store the "inode"
(metadata) of a file separate from its name(s) so if you lose
all of its names, you need a way to garbage collect the object
(or add a name) -- what fsck does. If there is just one name,
you can store the metadata along with the name -- a simpler
choice; you don't need to allocate a separate disk area for
inodes.
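If it helps to picture the two layouts, roughly (field names and sizes
invented, not any particular file system's):

/* with hard links: a directory entry is just a name and an i-number;
   the metadata lives in a shared, reference-counted i-node */
struct dirent_linked {
	char		name[28];
	unsigned int	inum;		/* index into the separate i-node area */
};
struct inode {
	unsigned int	mode, uid, length;
	unsigned int	nlink;		/* 0 with no name left? fsck's problem */
	unsigned int	addr[13];	/* block addresses */
};

/* without hard links: exactly one name, so the metadata can sit
   next to it and no separate i-node area is needed */
struct dirent_single {
	char		name[28];
	unsigned int	mode, uid, length;
	unsigned int	addr[13];	/* block addresses */
};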
Next, you can only hardlink on the `same device' so now the
underlying device is no longer transparent (i.e. something you
can ignore) and you can't be sure if
ln <path1>/a <path2>/b
will work. A more complex user model.
Next, Unix typically only allows hardlinking of files and not
directories (to avoid creating accidental loops -- detecting
loops is considered more complex for some reason). So more
restrictions.
Next, the concept of a hardlink is not particularly useful, or
doesn't even make sense, for synthetic filesystems (such as
devices, the environment, or basically anything that is more
easily accessed through a collection of names, where the names
often have special meanings). What would it even
mean if you allowed "ln /proc/123 /proc/324"? So even in unix,
where special filesystems are allowed, such operations are
banned. So more restrictions.
A more useful concept is that of a `view' on a collection of
things rather than hardlinking individual files. bind/mount
already give you that.
Coming from a Unix environment you have learned to live with
the complexity of and restrictions on hardlinks and very
likely you think of filesystem names as almost always
referring to files that store data or directories. "Unlearn"
unix rather than try to recreate it in Plan9!
Except for the lite part.
-rob
Not really true - there are one or two orders of magnitude more APIs
in Windows than in Linux, and they are all infinitely uglier.
I'm not disagreeing that Linux ain't pretty, but given a choice between
Windows and Linux, at the C level, I'd take Linux any day.
--
Aharon (Arnold) Robbins arnold AT skeeve DOT com
P.O. Box 354 Home Phone: +972 8 979-0381
Nof Ayalon Cell Phone: +972 50 729-7545
D.N. Shimshon 99785 ISRAEL
we've got to give linux full faith and credit for the great
design they started with and how far they've come in making
it windows-like. the latest proud addition, open_by_handle().
- erik
You're joking, right?!
++L
http://lwn.net/Articles/375888/
now scheduled for 2.6.39, http://lwn.net/Articles/435215/ in
the kernel release status.
- erik
It was a rhetorical question :-)
One of these days they'll need installation checkboxes (a) to choose
optional features and (b) to choose different alternative features
that may clash with each other. One will need a university degree in
Linuxology to make educated decisions.
When will natural selection start to apply to Linux distributions?
++L
In http://cm.bell-labs.com/cm/cs/who/dmr/hist.html Ritchie discusses the structure
of the "PDP-7 Unix file system" (see the section with that name). One unusual aspect
(for the time) was the separation between the i-list array of i-nodes, each representing
a file, and directory files that mapped names to i-list indices (i-numbers). A "link"
was just an entry in a directory file, mapping a name to an i-node, and there
could be many such links to a given i-node, just as in later versions of Unix,
although some of the conventions were rather different.
Much later, when Berkeley added symbolic links (following Multics), a distinction was drawn
between those links, now called "hard links" and "symbolic" ones. That obscured a
deeper distinction: the original form either exposes details of a file system implementation
or imposes specific requirements on one (depending on your point of view), but the
symbolic link does not. I'd argue that for what applications typically do,
it's probably an over-specification, and in an environment such as Plan 9
where many different types of resources are represented by file servers,
it's a big nuisance for a small benefit. For instance, structures built on-the-fly
as views of data would need to record all the new links added.
What anyway is the benefit of links? In Unix, directories were made (by a command)
by creating a completely empty directory file, then using sys link to
add "." and ".." links into that directory in separate, explicit steps.
Files were renamed by creating and removing links. This had a certain simplicity and charm,
but also implied constraints on file system implementation.
Even in Unix, links weren't extensively used for other forms of aliasing.
In Plan 9, mkdir is a system call, and "." and ".." aren't explicitly created;
and there's a specific way of renaming a file within its directory.
It's perhaps a little more abstract, and links aren't needed to express those
operations. Later versions of Unix also added mkdir and rename system calls
but, absolutely typically, retained links; "." and ".." are still
implemented as links, for which rename implementations give thanks, so why keep hard links?
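Concretely, in Plan 9 directory creation is a single create call with
the DMDIR permission bit set; a sketch (invented path):

#include <u.h>
#include <libc.h>

void
main(void)
{
	int fd;

	/* one call makes the directory; no "." or ".." links involved */
	fd = create("/tmp/newdir", OREAD, DMDIR|0775);
	if(fd < 0)
		sysfatal("create: %r");
	close(fd);
	exits(nil);
}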
If you really want an operation to move a file or directory from one place in
a hierarchy to another, without always having to copy it, it would be better to
express that explicitly.
Be that as it may, though GNU sucks, GNU/Linux is dominating the
market:
http://mybroadband.co.za/news/software/19762-The-Linux-Microsoft-war-over.html
--
Balwinder S "bdheeman" Dheeman Registered Linux User: #229709
Anu'z Linux@HOME (Unix Shoppe) Machines: #168573, 170593, 259192
Chandigarh, UT, 160062, India Plan9, T2, Arch/Debian/FreeBSD/XP
Home: http://werc.homelinux.net/ Visit: http://counter.li.org/
>> It's not that obvious to me. A hard link is another name for a file,
>> uniquely identified by <type,device,qid>.
>
> how do you specify the device? you can't without giving up
> on per-process-group namespaces. i don't think there's any
> way to uniquely identify a device except through a namespace,
> and there's no global namespace.
I got the impression, from what I read, that the kernel driver chooses
the device number.
>> I don't understand why 9P doesn't allow transporting bind operations
>> from machine to machine like this.
>
> this is done all the time. every time you cpu, you are exporting
> your whole namespace to the target machine.
Then what's all that about in paragraphs 2-3 on p. 21 of /sys/doc/9.ps?
>> It's similar for moving directories. If you have a 10 GiB directory,
>
> please explain why a bind is not appropriate here?
The old directory will still be visible in its old location, even if
it's bound to a new name. I had thought about sticking all my files
under $home/files/ and binding them where I wanted them. But then, I'd
just be reinventing all the i-node stuff of *nix. I might as well just
call the directory $home/inodes/. :)
>> At $local_big_networking_corp, I got chewed out for copying a 650MB ISO
>> across a single router.
>
> did the router get tired?
Yeah, I didn't understand their reaction, either.
> I got the impression, from what I read, that the kernel driver chooses
> the device number.
what's a device number and why would we need one?
ron
> Ask yourself *why* do you need it. Is it just convenience
> (what you are used to) or is there something you do that
> absolutely requires hard links? Next compare the benefit
> of hardlinks to their cost. Is it worth it?
I'm trying to create a data structure in the form of a directed acyclic
graph (DAG). A file system would be an ideal way to represent the data,
except that P9 exposes no transaction to give a node more than one name.
I could store the data in a P9 file system tree and maintain a set of
links in, say $home/lib/bindrc.d/myDAG. But, every time I
copy/relocate/distribute the tree, I would have to include the myDAG
bindings. It would be much nicer if the structure of the data were embodied
in the data itself.
ATM, I'm thinking about creating a DAGfs backed by pq. That way,
standard file utilities could still be used to manipulate the
data. However, that solution strikes me as being suspiciously similar
to creating a new disk file system. (How many do we have, already?)
> I'm trying to create a data structure in the form of a directed acyclic
> graph (DAG). A file system would be an ideal way to represent the data,
> except that P9 exposes no transaction to give a node more than one name.
warning: i'm going to try to talk about graphs. This usually ends in tears.
A file system is a DAG I believe.
There is a way to give something more than one name: mkdir
in the directory that contains it, it has the name you give it. cd
into it, and as far as you're concerned it has the name '.'.
Is this enough?
ron
A FS is not necessarily the ideal way.
> I could store the data in a P9 file system tree and maintain a set of
> links in, say $home/lib/bindrc.d/myDAG. But, every time I
> copy/relocate/distribute the tree, I would have to include the myDAG
> bindings. It would be much nicer if the structure of the data were embodied
> in the data itself.
>
> ATM, I'm thinking about creating a DAGfs backed by pq. That way,
> standard file utilities could still be used to manipulate the
> data. However, that solution strikes me as being suspiciously similar
> to creating a new disk file system. (How many do we have, already?)
Not a disk FS, just a naming FS. You can overlay your naming
FS on top of an existing disk based FS. In effect each named
file in this naming FS maps to a "canonical name" of a disk
based file. You can implement linking via a ctl file or
something.
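For instance (purely hypothetical server and ctl syntax; nothing like
this exists), adding a second name might look like:

#include <u.h>
#include <libc.h>

void
main(void)
{
	int fd;

	/* imagined naming overlay mounted at /n/names, with a ctl
	   file that records an extra name for an existing entry */
	fd = open("/n/names/ctl", OWRITE);
	if(fd < 0)
		sysfatal("open: %r");
	if(fprint(fd, "link /n/names/data/report /n/names/2011/report\n") < 0)
		sysfatal("ctl write: %r");
	close(fd);
	exits(nil);
}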
Is lnfs(4) a relevant example?
Seems so.
IIRC companies such as Panasas separate file names and other
metadata from file storage. It's one way to get a single FS
namespace that spans multiple disks or nodes, for increased
data redundancy, file sizes beyond the largest disk, and
throughput (and yes, complexity).
Along these lines, a BitTorrent swarm would make for
an interesting experiment in distributed file storage.....
that certainly does seem like the hard way to do things.
why should the structure of the data depend on where it's
located? certainly ken's fs doesn't change the format of
the worm if you concatenate several devices for the worm
or use just one.
- erik
This would be a long discussion :-)
best bet is to read Gibson's stuff.
ron
Indeed.
On Isilon's website we find:
* Up to 10.4 petabytes and up to 85 GBps of throughput and up
to 1.4 million IOPS in a single file system.
* Add capacity and/or performance within 60 seconds.
I don't know how Isilon does this but think of how one would
build a FS that can scale to multi-petabyte files requiring
multi gigabyte/sec bandwidth and ability to add storage as
needed. Think oil companies and movie studios!
could you please clarify? i'm not following along.
- erik
> could you please clarify? i'm not following along.
I'm at the end of a long day and not able to write a good explanation
of what they are thinking. :-)
ron
no problems.
- erik
?
It all boils down to having to cope with individual units'
limits and failures.
If a file needs to be larger than the capacity of the largest
disk, you stripe data across multiple disks. To handle disk
failures you use mirroring or parity across multiple disks.
To increase performance beyond what a single controller can
do, you add multiple disk controllers. When you want higher
capacity and throughput than is possible on a single node, you
use a set of nodes, and stripe data across them. To handle a
single node failure you mirror data across multiple nodes. To
support increased lookups & metadata operations, you separate
metadata storage & nodes from file storage & nodes, since lookups
and metadata operations have a different access pattern from file data
access. To handle more concurrent access you add more net
bandwidth and balance it across nodes.
From an administrative point of view, a single global namespace
is much easier to manage. One should be able to add or replace
individual units (disks, nodes, network capacity) quickly as
and when needed without taking the FS down (to reduce
administrative costs and to avoid any downtime). Then you have
to worry about backups (on and offsite). In such a complex
system, the concept of a single `volume' doesn't work well.
In any case, users don't care about what data layout is used
as long as the system can grow to fill their needs.
<![RANT[
except those are not 100% orthogonal. not in theory, and even less in
implementations. you risk ending up with big-ball-of-mud code, or abstracting
all your performance (and flexibility and metadata like S.M.A.R.T.) away.
also, (almost every) network hop and node lessens the compound reliability;
some even introduce entirely new failure modes.
]]>
so kudos to Isilon for actually having built great stuff :)
--
dexen deVries
[[[↓][→]]]
``In other news, STFU and hack.''
mahmud, in response to Erann Gat's ``How I lost my faith in Lisp''
http://news.ycombinator.com/item?id=2308816
striping, mirroring, etc. really don't need to affect the file system
layout. ken fs, of course, is one example: its mirroring or striping
of devices doesn't change the worm format. in the case of ken fs on aoe,
ken fs doesn't even know at any level that there are raid5s down there.
same as with a regular hba raid controller.
since inside a disk drive, there is also striping across platters and
weird remapping games (and then there's flash), and i don't see
any justification for calling this a "different fs layout". you wouldn't
say you changed data structures if you used 8x1gb dimms instead of
4x2gb, would you?
- erik
I am not getting through....
Check out some papers on http://www.cs.cmu.edu/~garth/
See http://www.pdl.cmu.edu/PDL-FTP/NASD/asplos98.pdf for instance.
University degree? Hopeless! What I knew about Linux kernel
configuration 5 or 6 years ago isn't much use today. You need to
maintain an ongoing and in-depth self-education to keep up with
kernel configuration. I for one would much rather invest my time into
learning all the curious details of Plan 9, most of which are far
more interesting and certainly much more educational and mind-
broadening than keeping up with the myriad latest hacks of Linux.
> When will natural selection start to apply to Linux distributions?
Natural selection in an environment as fiercely aggressive as what
humans demand of their computers (if that makes sense) unfortunately
favours short-term achievement only.
no, you're getting through. i just don't accept the zfs-inspired theory that
the filesystem must do these things, or the other theory that
one gigantic fs with supposedly global visibility through redundancy
layers, etc. is the only game in town. since two theories are
better than one, this often becomes the one big ball of goo grand
unification theory of storage. it could be this has a lot of advantages,
but i can't overcome the gut (sorry) reaction that big complicated things
are just too hard to get right in practice. maybe one can overcome this
with arbitrarily many programmers. but if i thought that were the
way to go, i wouldn't bother with 9fans.
it seems to me that there are layers we really have no visibility into
already. disk drives give one lbas these days. it's important to remember
that "l" stands for logical; that is, we cannot infer with certainty
where a block is stored based on its lba. thus an elevator algorithm
might do exactly the wrong thing. the difference can be astounding—
like 250ms. and if there's trouble reading a sector, this can slow things by
a second or so. further, we often deal with virtualized things like drive arrays,
drive caches, flash caches, ssds or lvm-like things that make it even
harder for a fs to out-guess the storage.
i suppose there are two ways to go with this situation: remove all the
layers between you and the storage controller and hope that you can
out-guess what's left, or give up, declare storage opaque and
leave that up to the storage guys. i'm taking the second position. it
allows for innovation in layers below the fs without changing the fs.
it's surprising to me that venti hasn't opened more eyes to what's
possible with block storage.
unfortunately venti is optimized for deduplication, not performance.
except for backup, this is the opposite of what one wants.
just as a simple counter example to your 10.5pb example, consider
a large vmware install. you may have 1000s of vmfs "files" layered
on your san but file i/o does not couple them; they're all independent.
i don't see how a large fs would help at all.
- erik