vfs: is there any way to distinguish read() for different file descriptors

Anatol Pomozov

Jan 30, 2012, 7:33:33 PM

to Kernel List

Hi,

I am working on a fuse filesystem kernel extension (http://fuse4x.org). Fuse originates from Linux and follows its VFS semantics.

One of the differences between the Linux and XNU VFS is that some operations on Linux (e.g. open, close, read, write) have access to the "file" structure. This structure contains information about the file descriptor, which makes it possible to tell whether these operations are run against different descriptors.

int fd1 = open("a", ...);  // fuse userspace generates FD_1
int fd2 = open("a", ...);  // fuse userspace generates FD_2
read(fd1, ...);            // fuse userspace receives FD_1
read(fd2, ...);            // fuse userspace receives FD_2

The fuse userspace filesystem generates a unique id for every open(), and this id is stored in the "struct file" private data. Later the id is passed along with every read/write operation.

Now I need to do something similar for XNU. In its code, VNOP_READ does not have access to anything like the file structure.

Is there any way to tell whether operations (read/write/...) are run against different descriptors? Currently fuse4x uses an ugly hack: it stores 3 descriptors per vnode, one each for read-only, write-only, and read-write opens. This means that if the same file is opened 2 times with the same mode (e.g. RDONLY), fuse4x uses only one file descriptor. To fuse userspace it looks as if the file was opened only once and all read operations are done on the same file descriptor.
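A minimal userspace sketch of the three-slots-per-vnode scheme described above. It is not fuse4x's actual code: `fake_vnode`, `slot_for_flags`, and `vnode_open` are hypothetical names, and the only assumption is that handles are cached per vnode in a slot chosen by the open access mode.

```c
#include <assert.h>
#include <fcntl.h>
#include <stdint.h>

/* Hypothetical stand-in for per-vnode private data: one cached
 * userspace file handle per access mode (0 means "no handle yet"). */
enum { SLOT_RDONLY, SLOT_WRONLY, SLOT_RDWR, SLOT_COUNT };

struct fake_vnode {
    uint64_t handle[SLOT_COUNT];
    int      open_count[SLOT_COUNT];
};

/* Pretend the userspace daemon hands out handle ids from this counter. */
static uint64_t next_handle = 1;

/* Map open(2) flags to one of the three slots. */
static int slot_for_flags(int flags)
{
    switch (flags & O_ACCMODE) {
    case O_RDONLY: return SLOT_RDONLY;
    case O_WRONLY: return SLOT_WRONLY;
    default:       return SLOT_RDWR;
    }
}

/* Simulated open: reuse the cached handle for this mode if one exists,
 * otherwise allocate a new one. */
static uint64_t vnode_open(struct fake_vnode *vp, int flags)
{
    int slot = slot_for_flags(flags);
    assert(slot >= 0 && slot < SLOT_COUNT);
    if (vp->handle[slot] == 0)
        vp->handle[slot] = next_handle++;
    vp->open_count[slot]++;
    return vp->handle[slot];
}
```

In this sketch two calls to `vnode_open(&vn, O_RDONLY)` return the same handle, while an `O_RDWR` open gets a different one, which is the sharing the paragraph above describes.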

I would like to make it better. Is there any way to distinguish vnop operations for different file descriptors? Should I add a (process_id, mode) -> fd map to the vnode? It does not solve the problem completely, but at least 2 different processes would get different fd numbers for the same file.
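The proposed (process_id, mode) -> fd map could look roughly like the userspace sketch below. Everything here is an assumption for illustration: `fh_table`, `fh_entry`, and `fh_lookup_or_create` are hypothetical names, the table is a fixed-size array rather than whatever a kext would really use, and no locking is shown.

```c
#include <assert.h>
#include <fcntl.h>
#include <stddef.h>
#include <stdint.h>
#include <sys/types.h>

#define MAX_ENTRIES 16

/* Hypothetical per-vnode table entry keyed by (pid, access mode). */
struct fh_entry {
    pid_t    pid;
    int      mode;    /* e.g. O_RDONLY, O_WRONLY, O_RDWR */
    uint64_t handle;  /* id handed to the userspace daemon; 0 = unused */
};

struct fh_table {
    struct fh_entry entries[MAX_ENTRIES];
    uint64_t        next_handle;
};

/* Return the handle for (pid, mode), creating one on first use.
 * Returns 0 if the fixed-size table is full. */
static uint64_t fh_lookup_or_create(struct fh_table *t, pid_t pid, int mode)
{
    struct fh_entry *free_slot = NULL;

    assert(t != NULL);
    for (int i = 0; i < MAX_ENTRIES; i++) {
        struct fh_entry *e = &t->entries[i];
        if (e->handle != 0 && e->pid == pid && e->mode == mode)
            return e->handle;          /* same process, same mode: shared */
        if (e->handle == 0 && free_slot == NULL)
            free_slot = e;             /* remember the first empty slot */
    }
    if (free_slot == NULL)
        return 0;
    free_slot->pid = pid;
    free_slot->mode = mode;
    free_slot->handle = ++t->next_handle;
    return free_slot->handle;
}
```

As the question itself notes, this only separates processes: two opens with the same mode inside one process still collapse onto a single handle.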

Shantonu Sen

Jan 31, 2012, 6:58:05 AM

to Anatol Pomozov, Kernel List

File descriptors are maintained in a per-process lookup table. Each open of a file increments a reference count on the vnode and populates a slot in the fd table. This also ignores that some I/Os are generated from inside the kernel without a corresponding file descriptor (the most notable of which is reading the Mach-O header of an executable in the execve(2) system call).

The short answer is "no", since fd->vnode translation happens before the filesystem is invoked.

You could violate all the optimizations of the OS X VFS system by having each lookup for a path return a unique vnode. You'd have to verify that you don't need any code that uses vnode pointer equality, of which there's a substantial amount when comparing parent directory vnodes, I think. So perhaps you only return unique vnodes for non-directories (and deal with the lack of symmetry with Linux when opening directories for enumeration). I think this has a low probability of success, but perhaps it is possible.

Shantonu

_______________________________________________
Do not post admin requests to the list. They will be ignored.
Darwin-kernel mailing list (Darwin...@lists.apple.com)
Help/Unsubscribe/Update your Subscription:
https://lists.apple.com/mailman/options/darwin-kernel/ssen%40apple.com

This email sent to ss...@apple.com

Ken Hornstein

Jan 31, 2012, 9:14:06 AM

to Kernel List

>Is there any way to distinguish if operations (read/write/..) are run
>against different descriptors? Currently fuse4x does an ugly hack - it
>stores 3 descriptors per vnode - for read-only/write-only/rdrw descriptors.
>It means if the same file is open 2 times with the same mode (e.g.
>RDONLY) then fuse4x uses only one file descriptor. For fuse userspace it
>looks like the file was opened only once and all read operations are done
>for the same file descriptor.

I feel your pain, because I ported a Linux filesystem to MacOS X
and I had to deal with a ton of issues like this. Shantonu has
already given you the authoritative answer (no), but it occurs to
me that you have some alternate options.

You get notified of an open with a VOP_OPEN, and you get the mode of
the open at that time. So you could use that to create new file
descriptors ... although you will probably have a hard time in practice
matching up a request with a descriptor. You could at least distinguish
between different modes of open.

But something occurs to me - let's say three different processes
open the same file with the same mode, so the fuse backend shares
a descriptor among the different open requests. So what? I mean,
how exactly is that a problem? You get a file offset and size with
every request, so it's not an issue of tracking the offset in the
descriptor. Seems to me this isn't a problem in practice ... unless
I'm missing something (which is always possible).

--Ken

Michael Smith

Jan 31, 2012, 4:20:34 PM

to Anatol Pomozov, Kernel List

On Jan 30, 2012, at 4:33 PM, Anatol Pomozov wrote:

> I would like to make it better. Is there any way to distinguish vnop operations for different file descriptors? Should I add a map (process_id,mode)->fd to vnode? It does not solve the problem completely, but at least 2 different processes will have different fd numbers for the same file.

At the most fundamental level, no, this is not possible.

Disregarding uncached file I/O, vnode data operations are cache fill/writeback operations for the buffer cache; they don't correspond to process-level read/write calls (which are copies from/to the cache).

As has been noted elsewhere, what you are fundamentally asking for is for each descriptor to 'see' one of a set of files. This is, again, a concept that doesn't work for Mac OS X - the filesystem namespace is intentionally coherent across all consumers, and a filesystem doesn't necessarily get to decide on an open-by-open basis which vnode is going to be vended (if the entire lookup path is in the name cache, the filesystem may never know that a file has been opened again, for example).

If you want to interpose in an application's view of the system, you need to do this in userland via library interpositioning *before* the application's request is submitted to the kernel. Once you're in the kernel's scope, both the filesystem namespace and data content are (again, intentionally) coherent to all clients.

= Mike

--

The lyf so short, the craft so long to lerne -- Chaucer

Anatol Pomozov

Feb 6, 2012, 2:20:45 PM

to Ken Hornstein, Kernel List

On Tue, Jan 31, 2012 at 6:14 AM, Ken Hornstein <ke...@cmf.nrl.navy.mil> wrote:

>> Is there any way to distinguish if operations (read/write/..) are run
>> against different descriptors? Currently fuse4x does an ugly hack - it
>> stores 3 descriptors per vnode - for read-only/write-only/rdrw descriptors.
>> It means if the same file is open 2 times with the same mode (e.g.
>> RDONLY) then fuse4x uses only one file descriptor. For fuse userspace it
>> looks like the file was opened only once and all read operations are done
>> for the same file descriptor.
>
> I feel your pain, because I ported a Linux filesystem to MacOS X
> and I had to deal with a ton of issues like this. Shantonu has
> already given you the authoritative answer (no), but it occurs to
> me that you have some alternate options.
>
> You get notified of an open with a VOP_OPEN, and you get the mode of
> the open at that time. So you could use that to create new file
> descriptors ... although you will probably have a hard time in practice
> matching up a request with a descriptor. You could at least distinguish
> between different modes of open.
>
> But something occurs to me - let's say three different processes
> open the same file with the same mode, so the fuse backend shares
> a descriptor among the different open requests. So what? I mean,
> how exactly is that a problem?

Fuse on Linux passes an id that is unique to each file structure to userspace, and it is difficult to guess how a fuse filesystem is going to use this id. For example, SSHFS keeps a readahead buffer for every file descriptor. If 2 descriptors read the same file through one shared id, SSHFS will be confused: the readahead buffer will be filled but then dropped, because the other descriptor reads the remote file at a different offset (if I understand the sshfs sources correctly).
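The readahead concern can be illustrated with a small simulation. This is a sketch under stated assumptions, not SSHFS's real logic: `struct readahead`, `handle_read`, and `READAHEAD_SIZE` are hypothetical, and the model is simply "one contiguous buffered window per handle, discarded whenever a read falls outside it".

```c
#include <assert.h>
#include <stdint.h>

#define READAHEAD_SIZE 4096

/* Hypothetical per-handle readahead state: one contiguous window. */
struct readahead {
    uint64_t start;   /* file offset of the first buffered byte */
    uint64_t len;     /* number of buffered bytes; 0 = empty */
    unsigned hits;    /* reads served from the window */
    unsigned misses;  /* reads that forced the window to be refilled */
};

/* Simulate a read of `size` bytes at offset `off` against one handle.
 * A read outside the current window drops the buffer and refills it
 * starting at `off`. */
static void handle_read(struct readahead *ra, uint64_t off, uint64_t size)
{
    assert(size <= READAHEAD_SIZE);
    if (ra->len != 0 && off >= ra->start &&
        off + size <= ra->start + ra->len) {
        ra->hits++;
        return;
    }
    ra->misses++;
    ra->start = off;
    ra->len = READAHEAD_SIZE;
}
```

Feeding two interleaved sequential readers (one starting at offset 0, one far into the file) through a single shared `struct readahead` makes every read a miss in this model, whereas giving each reader its own state lets most reads hit the window.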

Anyway it seems there is no easy way to get the file structure for VNOP and we have to stick with the solution from macfuse. It has been working fine for macfuse for several years....

> Or why another fuse clone?

Historically, fuse4x appeared several months before osxfuse. I was migrating a tool from Linux to OS X and discovered that MacFuse breaks compatibility with Linux fuse in several important areas (e.g. macfuse does not work in multi-threaded applications). I asked whether this should be fixed in macfuse; it requires quite a lot of changes in libfuse and the kext, and changes some macfuse-specific API. But the answer I got was that "macfuse is not fuse" and they do not want to change it: http://osdir.com/ml/macfuse/2011-05/msg00032.html. So I followed the principle "if you want it done, do it yourself" and started the macfuse fork. The goals of fuse4x are:

- Keep it compatible as much as possible. Fuse4x should be a reference implementation of fuse on macosx.

- Keep the sources clean and simple. Remove API that looks weird or does not follow the spirit of Fuse.

- Make fuse4x as fast as possible.

osxfuse appeared later with the idea of being fully compatible with macfuse (including binary compatibility of the distribution), but this also means it has all the problems that macfuse has.

See more info here as well: http://old.nabble.com/Re%3A-macfuse-p33048887.html

> You get a file offset and size with
> every request, so it's not an issue of tracking the offset in the
> descriptor. Seems to me this isn't a problem in practice ... unless
> I'm missing something (which is always possible).
>
> --Ken
