GSOC 2015: Adding more filesystems


mgac...@gmail.com

Mar 17, 2015, 9:18:30 AM
to native-cli...@googlegroups.com
I am interested in adding more filesystems to NaCl as a part of GSOC. As given in the ideas page, I'm currently focusing on the online filesystems (Drive, Dropbox, etc.).

However, I'd like to know how exactly to approach the problem. The currently supported filesystems aren't very similar to these.


Bradley Nelson

Mar 17, 2015, 2:46:44 PM
to Native Client Discuss, Sam Clegg, Ben Smith
Hello mcgachhul,

You have an excellent point.
Most of the existing filesystems leverage being in-memory or using synchronous APIs.

There are actually a number of design choices needed for a filesystem targeting these cloud APIs.

Key challenges include:
- Local caching will be key to performance
- Differential updates are required for maximum speed
- Can real-time features be incorporated meaningfully?
- A monolithic implementation (where caching is intermixed with I/O) will be harder to reason about, but might yield the best performance (maybe).

My own impulse would be something like this:
- Create a RawDriveMount that does synchronous URLLoader calls similar to the http mount, but uses PUT/POST/PATCH/UPDATE/DELETE where appropriate when mutating things (see the sketch after this list).
   * PATCH could initially be skipped as it's tricky
- Implement a caching mount that references two other directories:
    * a cache location, likely either a memfs or html5fs
    * a raw drive mount as above
   Requests are routed to check the cache, falling back to the raw drive mount when needed. The cache is kept up to date with last known state.
- Enhance both of the above:
   * Add an ioctl to the RawDriveMount to support patching, and enhance the caching mount to use it when it has old state from which to generate a patch.
   * Add expiration to the cache so it doesn't grow without bound (useful particularly if it's storing to something locally persistent). This might be tricky as our html5fs doesn't support access time. Likely a database of accesses will be needed to know what to expire.
   * Track which files are open in the caching mount and add an ioctl to the RawDriveMount to register files for realtime monitoring.
   * Write a journal to the cache to ensure that even if the drive mount fails or is currently unavailable, mutations can be retried later. (This only makes sense with an html5fs-backed cache.)
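
As a very rough sketch (not actual mount code; the method and URL are placeholders, and a real mount would map the HTTP status to an errno), the RawDriveMount could issue its blocking requests from a background thread with the PPAPI C++ URLLoader like this:

// Sketch only: a blocking HTTP request with an arbitrary method, the kind of
// call a RawDriveMount might make when mutating remote state. Must run off
// the main Pepper thread, since pp::BlockUntilComplete() is not allowed there.
#include <string>
#include "ppapi/c/pp_errors.h"
#include "ppapi/cpp/completion_callback.h"
#include "ppapi/cpp/instance.h"
#include "ppapi/cpp/url_loader.h"
#include "ppapi/cpp/url_request_info.h"
#include "ppapi/cpp/url_response_info.h"

int32_t BlockingRequest(pp::Instance* instance,
                        const std::string& method,   // e.g. "PUT" or "DELETE"
                        const std::string& url,
                        const std::string& body) {
  pp::URLRequestInfo request(instance);
  request.SetURL(url);
  request.SetMethod(method);
  if (!body.empty())
    request.AppendDataToBody(body.data(), body.size());

  pp::URLLoader loader(instance);
  int32_t result = loader.Open(request, pp::BlockUntilComplete());
  if (result != PP_OK)
    return result;
  // A real mount would also drain ReadResponseBody() here and translate the
  // HTTP status code into a POSIX errno.
  return loader.GetResponseInfo().GetStatusCode();
}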

Alternatively, caching could be integrated with the drive mount, likely only for an in-memory one.
This might give better results in terms of efficiently generating patches.
However, one advantage of the split caching mount is that it would be flexible about what storage to use for caching without special work.
Using html5fs storage for the cache (or potentially a second layer cache if the caching mount is designed in a flexible way), means offline access to recent data would happen automatically.
Also come to think of it, without using an html5fs cache, separate processes can't share the cache.
The process I listed above is probably a reasonable strategy.
Depending on your level of experience in this domain, a subset of the steps I listed above might be an acceptable project (i.e. you might focus just on exposing the raw API, or just on the caching, though the latter might be tricky to meaningfully vet in isolation).

Sam, Ben, what do you guys think?

-BradN

On Tue, Mar 17, 2015 at 6:18 AM, <mgac...@gmail.com> wrote:
I am interested in adding more filesystems to NaCl as a part of GSOC. As given in the ideas page, I'm currently focusing on the online filesystems (Drive, Dropbox, etc.).

However, I'd like to know how exactly to approach the problem. The currently supported filesystems aren't very similar to these.



mgac...@gmail.com

Mar 18, 2015, 8:54:44 AM
to native-cli...@googlegroups.com, s...@google.com, bi...@google.com
Thank you for your response. Would the following design be a good idea? It's a general one; I'll have to go through the details, of course.

1)  Keep a slightly monolithic design. One class, say OnlineFS, uses a CacheFS and a ProviderFS. OnlineFS is where the actual interleaving of I/O and cache calls will take place.
2) Different ProviderFS classes for different online providers. These classes will actually implement the standard FS calls, since the implementation for each provider will be different.
3) CacheFS implements the caching functions.

Initially, I can work on getting the providers up and running, along with OnlineFS. CacheFS can be made a dummy class until the ProviderFS works properly.
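
As a rough illustration of what I mean by the interleaving (the class and method names below are placeholders, not nacl_io types), something like:

// Hypothetical sketch of the OnlineFS composition described above.
// CacheFS and ProviderFS are placeholder interfaces, not real nacl_io classes.
#include <string>
#include <vector>

struct CacheFS {
  virtual ~CacheFS() {}
  virtual bool Read(const std::string& path, std::vector<char>* out) = 0;
  virtual void Write(const std::string& path, const std::vector<char>& data) = 0;
};

struct ProviderFS {  // e.g. a Drive- or Dropbox-specific implementation
  virtual ~ProviderFS() {}
  virtual bool Fetch(const std::string& path, std::vector<char>* out) = 0;
};

class OnlineFS {
 public:
  OnlineFS(CacheFS* cache, ProviderFS* provider)
      : cache_(cache), provider_(provider) {}

  // Interleaves cache and provider I/O: serve from the cache when possible,
  // otherwise fetch from the provider and populate the cache.
  bool Read(const std::string& path, std::vector<char>* out) {
    if (cache_->Read(path, out))
      return true;
    if (!provider_->Fetch(path, out))
      return false;
    cache_->Write(path, *out);
    return true;
  }

 private:
  CacheFS* cache_;
  ProviderFS* provider_;
};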



mgac...@gmail.com

Mar 21, 2015, 4:56:19 AM
to native-cli...@googlegroups.com, s...@google.com, bi...@google.com
I'd like your views on this, so I know where I'm going wrong with my approach. I've also modified the implementation, and have run into a few problems.

Implementation

My current idea is to have a cache layer, which mounts a cache dir using html5fs and the target remote dir, say OnlineFS. Devs using nacl_io will mount only the cache layer and define the required remote dir in the settings option of the mount call. There will be different OnlineFS implementations implementing the basic I/O functions. For example, Dropbox has an HTTP API (which will need URLLoader calls), while GDrive uses JS (using jsfs?).

Problems
  1. Authentication. All of these remote dirs use OAuth. My idea was that the authentication is left to the app dev, who only passes the OAuth token (using postMessage in the JS scripts) to the NaCl module after the filesystem has been mounted. The OnlineFS calls then use this token as required. Is this acceptable?
  2. I'm having trouble identifying how exactly nacl_io exposes the POSIX calls. In the examples, I've seen the use of the POSIX calls, but I can't find their implementation anywhere. My understanding was that I'll have to provide implementations of these calls in my classes using the PPAPI methods, so I feel like my approach is wrong somewhere.
Thanks,
Mainak

Bradley Nelson

Mar 22, 2015, 2:02:08 AM
to Native Client Discuss, Sam Clegg, Ben Smith
Sounds reasonable.

Bradley Nelson

Mar 22, 2015, 2:14:34 AM
to Native Client Discuss, Sam Clegg, Ben Smith
On Sat, Mar 21, 2015 at 1:56 AM, <mgac...@gmail.com> wrote:
I'd like your views on this, so I know where I'm going wrong with my approach. I've also modified the implementation, and have run into a few problems.

Implementation

My current idea is to have a cache layer, which mounts a cache dir using html5fs and the target remote dir, say OnlineFS. Devs using nacl_io will mount only the cache layer and define the required remote dir in the settings option of the mount call. There will be different OnlineFS implementations implementing the basic I/O functions. For example, Dropbox has an HTTP API (which will need URLLoader calls), while GDrive uses JS (using jsfs?).

Problems
  1. Authentication. All of these remote dirs use OAuth. My idea was that the authentication is left to the app dev, who only passes the OAuth token (using postMessage in the JS scripts) to the NaCl module after the filesystem has been mounted. The OnlineFS calls then use this token as required. Is this acceptable?

Currently the mounts have a single string parameter, so you could include the OAuth token as part of that string, though allowing it to be set later as you suggest might be more flexible.
While postMessage is likely how you'd get the token in, you'd need a way to get it to the filesystem itself, likely an ioctl.
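
For example (purely a sketch; the ioctl request number, mount path, and message format are all made up), the token could arrive in HandleMessage() and be pushed down to the mount with an ioctl:

// Sketch: receive an OAuth token via postMessage and hand it to the mounted
// filesystem through a custom ioctl. The request number and mount path are
// hypothetical.
#include <fcntl.h>
#include <string>
#include <sys/ioctl.h>
#include <unistd.h>
#include "ppapi/cpp/instance.h"
#include "ppapi/cpp/var.h"

#define CLOUDFS_IOCTL_SET_TOKEN 0xC10D  // made-up request number

class CloudInstance : public pp::Instance {
 public:
  explicit CloudInstance(PP_Instance instance) : pp::Instance(instance) {}

  // The JavaScript side would do something like: module.postMessage(oauthToken);
  virtual void HandleMessage(const pp::Var& message) {
    if (!message.is_string())
      return;
    std::string token = message.AsString();
    // Open the mount point and pass the token to the filesystem's ioctl hook.
    int fd = open("/cloud", O_RDONLY);
    if (fd >= 0) {
      ioctl(fd, CLOUDFS_IOCTL_SET_TOKEN, token.c_str());
      close(fd);
    }
  }
};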
  2. I'm having trouble identifying how exactly nacl_io exposes the POSIX calls. In the examples, I've seen the use of the POSIX calls, but I can't find their implementation anywhere. My understanding was that I'll have to provide implementations of these calls in my classes using the PPAPI methods, so I feel like my approach is wrong somewhere.
The nacl_io source is here:
 
POSIX calls are intercepted differently in each libc. Most of our libcs have been modified to provide an optional function to receive calls like open/read/write, etc. The interception is in kernel_wrap_*.cc.
Calls then propagate through kernel_intercept.cc, then kernel_proxy.cc, then to a particular mount.
The interface you need to implement is in filesystem.h and node.h.
As the base class implements a lot of behavior, you may not need to implement everything.
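
For orientation, the consumer side ends up looking like ordinary POSIX code once nacl_io is initialized; the mount options below follow the documented html5fs ones, and a new cloud filesystem type would slot in the same way (treat this as a sketch):

// Sketch of the consumer side: after nacl_io_init_ppapi(), the libc wrappers
// route standard POSIX calls through kernel_intercept/kernel_proxy to the
// filesystem mounted at the matching path.
#include <fcntl.h>
#include <sys/mount.h>
#include <unistd.h>
#include "nacl_io/nacl_io.h"
#include "ppapi/c/pp_instance.h"
#include "ppapi/c/ppb.h"

void InitAndTouchFile(PP_Instance instance, PPB_GetInterface get_interface) {
  nacl_io_init_ppapi(instance, get_interface);

  // Mount a persistent html5fs; a cloud-backed filesystem type would be
  // mounted the same way with its own type string and options.
  mount("", "/persistent", "html5fs", 0,
        "type=PERSISTENT,expected_size=1048576");

  // These are plain POSIX calls; nacl_io dispatches them to the html5fs node
  // implementation. They block, so call them off the main Pepper thread.
  int fd = open("/persistent/hello.txt", O_CREAT | O_WRONLY, 0644);
  if (fd >= 0) {
    write(fd, "hello", 5);
    close(fd);
  }
}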


Thanks,
Mainak

mgac...@gmail.com

Mar 22, 2015, 2:17:47 AM
to native-cli...@googlegroups.com, s...@google.com, bi...@google.com
Could Mr. Clegg share his views? He's the mentor for this project, so I'd like to know his thoughts on this.

mgac...@gmail.com

Mar 22, 2015, 2:55:21 AM
to native-cli...@googlegroups.com, s...@google.com, bi...@google.com
I have three more questions:

  1. Some remotes, like GDrive, Box and Amazon Cloud Drive, need the project to be registered for the API to work. How do I go about this?
  2. Scope of the project. At the moment, my plan is to implement around 5-6 remote drives: implement one remote drive and create a demo to test it, then work on the cache system. Once the whole implementation is tested properly, extending it to other remote drives should be easier. Is the scope of my work sufficient for a GSOC project, or should I look at more filesystems?
  3. Privacy. Once a file has been cached, it will be available to anyone unless deleted. If some sensitive files are cached, this might be a problem. Do I need to think about this?

Thanks,

Mainak

Bradley Nelson

Mar 23, 2015, 1:52:56 PM
to Native Client Discuss, Sam Clegg, Ben Smith
On Sat, Mar 21, 2015 at 11:55 PM, <mgac...@gmail.com> wrote:
I have three more questions:

  1. Some remotes, like GDrive, Box and Amazon Cloud Drive, need the project to be registered for the API to work. How do I go about this?

I think it's fair to assume that an app that uses these mounts will have to be individually registered with the appropriate service for real use.
Your demo probably should include the steps someone would need to go through to create a test key.

  2. Scope of the project. At the moment, my plan is to implement around 5-6 remote drives: implement one remote drive and create a demo to test it, then work on the cache system. Once the whole implementation is tested properly, extending it to other remote drives should be easier. Is the scope of my work sufficient for a GSOC project, or should I look at more filesystems?

I would think that even a single drive type (gdrive, box, or amazon) might be sufficient. If you're enthusiastic about supporting all the APIs, go for it. Getting to the caching/patching aspect of the problem should definitely be a goal.
 
  3. Privacy. Once a file has been cached, it will be available to anyone unless deleted. If some sensitive files are cached, this might be a problem. Do I need to think about this?
Definitely a good point: this means the entire origin using the filesystem has access to the cached files, although that's probably a given within one web app anyway.
Additionally, anyone outside the browser (in a native app) who can decode the html5 filesystem layout (somewhat opaque) can access the cache contents as well.

Depending on the application, encrypting the cache might be interesting; however, usually the user's browser profile is assumed to be accessible to anyone able to run stuff as that user.
I think for the purposes of this project, making the simplifying assumption that the entire app has the same set of permissions, and assuming that html5 storage is sufficiently secure, is OK.

Someone could separately port an encrypted pass-through FS. Though if your browser profile is compromised, your browser cache likely is too.
 

Thanks,

Mainak

mgac...@gmail.com

Mar 24, 2015, 11:23:33 AM
to native-cli...@googlegroups.com, s...@google.com, bi...@google.com
This is going to be my proposal. Please have a look: http://pastebin.com/hwZUrfir

Thanks,
Mainak

Bradley Nelson

Mar 24, 2015, 3:58:02 PM
to Native Client Discuss, Sam Clegg, Ben Smith
Generally looks good.

 An html5fs mount will be created at the same point (/mount/target), which will serve as the cache. All calls will first be routed to this cache system. If these calls fail due to absence in the cache, cloudfs will use the provider object to call the server. The returned file from the server is then written to the cache. This behavior will change depending on the function being implemented, but the general idea should remain the same.

I'd suggest making the cache a separate filesystem type.
/mnt/cloud ( raw access to a cloud provider, mounted with parameters related to the cloud api's credential mechanism)
/mnt/cloud_cache_data ( an html5fs mounted at start )
/mnt/cached_cloud ( mounted with a reference to /mnt/cloud as the data source, and /mnt/cloud_cache_data as the cache data location,
                    passes through if a file exists in /mnt/cloud_cache_data, queries /mnt/cloud if not,
                    eventually registers for notifications from /mnt/cloud of changes to know when to invalidate paths )
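
In terms of mount() calls this might look roughly like the following ("cloudfs" and "cachedfs" are placeholder type strings, and the option keys are invented for illustration):

// Hypothetical mount layout for the split design; the filesystem type strings
// and option keys are placeholders, not existing nacl_io types.
#include <sys/mount.h>

void MountCloudLayers() {
  // Raw access to the cloud provider; credentials supplied via mount options
  // (or later via ioctl/postMessage as discussed above).
  mount("", "/mnt/cloud", "cloudfs", 0, "provider=gdrive");

  // Local backing store for the cache.
  mount("", "/mnt/cloud_cache_data", "html5fs", 0, "type=PERSISTENT");

  // The caching layer referencing the other two mounts: serve from the cache
  // data when present, fall back to /mnt/cloud, and later invalidate paths
  // based on change notifications.
  mount("", "/mnt/cached_cloud", "cachedfs", 0,
        "source=/mnt/cloud,cache=/mnt/cloud_cache_data");
}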

-BradN


mgac...@gmail.com

Mar 24, 2015, 11:23:11 PM
to native-cli...@googlegroups.com, s...@google.com, bi...@google.com
I made the necessary changes. Please have a look. I'll submit it after you give the go-ahead: http://pastebin.com/HxV8YBcD

Bradley Nelson

Mar 25, 2015, 1:33:03 PM
to Native Client Discuss, Sam Clegg, Ben Smith
  3. cloudfs will keep an array of structures fileMap, where fileMap contains {fileName, access_time, modified_time}. Every file in the cache will have an entry in fileMap. The time fields can be populated using the metadata from the cloud server and the local time from the machine. Caching functions can use the time fields in various ways. This array can be written to the cache storage periodically or before unmounting, so that this data persists.

It might be preferable to write to the cache more aggressively, so processes can share the cache and so an abrupt shutdown doesn't leave the disk cache stale.
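
As a sketch of what that could look like (the names and the tab-separated on-disk format are purely illustrative), the fileMap entries would be written through to the html5fs cache after each update rather than only at unmount:

// Illustrative only: one cache-metadata entry per cached file, written through
// to the html5fs cache directory after every update rather than only at
// unmount, so other processes see it and an abrupt shutdown doesn't lose it.
#include <ctime>
#include <fstream>
#include <string>
#include <vector>

struct FileMapEntry {
  std::string file_name;
  time_t access_time;    // local time of last cache access
  time_t modified_time;  // taken from the cloud server's metadata
};

// Hypothetical write-through of the metadata table to the cache storage,
// e.g. path = "/mnt/cloud_cache_data/.filemap".
void PersistFileMap(const std::vector<FileMapEntry>& entries,
                    const std::string& path) {
  std::ofstream out(path.c_str(), std::ios::trunc);
  for (size_t i = 0; i < entries.size(); ++i) {
    out << entries[i].file_name << '\t'
        << entries[i].access_time << '\t'
        << entries[i].modified_time << '\n';
  }
}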

Otherwise looks good.

-BradN



On Tue, Mar 24, 2015 at 8:23 PM, <mgac...@gmail.com> wrote:
I made the necessary changes. Please have a look. I'll submit it after you give the go-ahead: http://pastebin.com/HxV8YBcD


mgac...@gmail.com

Mar 26, 2015, 1:06:19 PM
to native-cli...@googlegroups.com, s...@google.com, bi...@google.com
Could you explain what you mean by aggressive caching?

My understanding until now has been that the caching process will be on demand, only before and after file operations. This won't leave the cache stale compared to the view the client has of the cache at any point. A background thread may be kept which polls the server periodically for any changes to the files open in the cache. Files which are changed are retrieved automatically. But this will introduce more network traffic than the on-demand caching, and it might lead to clashes where the cache and the server versions are different. Is this behavior acceptable?

Bradley Nelson

Mar 27, 2015, 2:07:32 PM
to Native Client Discuss, Sam Clegg, Ben Smith
On Thu, Mar 26, 2015 at 10:06 AM, <mgac...@gmail.com> wrote:
Could you explain what you mean by aggressive caching?

My understanding until now has been that the caching process will be on demand, only before and after file operations. This won't leave the cache stale compared to the view the client has of the cache at any point. A background thread may be kept which polls the server periodically for any changes to the files open in the cache. Files which are changed are retrieved automatically. But this will introduce more network traffic than the on-demand caching, and it might lead to clashes where the cache and the server versions are different. Is this behavior acceptable?

Sorry, I wasn't terribly clear.

I'm imagining a use case like mounting a user's home directory.
At any given time, a small number of files may be open. But potentially the same files may be read over and over, unchanged.
When a directory's contents are listed, if not in the cache, you'd want to then consult the server, and then cache the listing metadata, writing this all the way out to html5fs (rather than keeping the listing only in one process's memory). This is useful for example in the case where a bash prompt spawns 'ls', which lists a directory (the fs caches the result), then exits. If 'ls -l' is then spawned, ideally the same listing entry can be used without consulting the server. Something similar applies to file contents.

There are two parts to keeping the cache valid (though there can obviously be races/staleness in general; that goes with the territory):

The Drive API provides request types to watch a subset of files for changes or for all changes in the user's drive. Using the latter, filtering to a subdirectory, the disk cache can be refreshed to keep in sync with changes as they come. Note this only works while the app is running; if changes are not being subscribed to, the cache will fall behind. As ideally each process is not independently doing a background check for changes, you might want to do this from JavaScript in naclprocess.js and then broadcast changes to all running modules (or a single one to update the cache). Though starting with this happening in each process is fine.

Second, you'll want to cope with a cache that's been offline for a while. The simplest thing might be to keep a stamp in the html5fs directory recording the last update. If this is too stale, you'll need to freshen things on demand. The simplest thing would be to empty the cache; however, you can do better. Most files will not have changed, and the Drive API lets you check the revision id of a file. If this has not changed, the old contents are still valid.

These two combined mean you might want to store the cached contents with filenames that have their revision id prepended. This will allow natural expiration without touching everything. (One other issue is cleaning out the cache if it gets too full. A background task to dump old/large things will be required eventually.)
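
A tiny sketch of those two mechanisms (all names and paths are made up):

// Illustrative sketch of the two freshness mechanisms described above: cache
// file names keyed by revision id so stale revisions simply miss, plus a
// last-update stamp kept in the cache directory.
#include <ctime>
#include <string>

// Cached content lives under a name that embeds the revision id, e.g.
// "/mnt/cloud_cache_data/files/<revision_id>-<file_name>".
std::string CachePathFor(const std::string& cache_root,
                         const std::string& revision_id,
                         const std::string& file_name) {
  return cache_root + "/files/" + revision_id + "-" + file_name;
}

// If the stamp recorded at the last change-feed update is too old, entries
// must be revalidated (or the revision id re-checked) before being trusted.
bool CacheIsStale(time_t last_update_stamp, time_t max_age_seconds) {
  return (time(NULL) - last_update_stamp) > max_age_seconds;
}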

In general, I would reduce polling as much as possible.

Hope that helps. There's definitely room for experimentation and tuning around the ideal caching style and policy.
I think if you implement the uncached drive mount with an eye to interoperating with a cache, we can iterate on improving cache behavior.