Hi Upspin,
I've been working on a similar system (github.com/slaclab/datacat), which provides a global namespace for distributed files and replicas. Its functionality and intent actually overlap a bit with both Upspin and Goods, though it doesn't directly handle the data access mechanisms. Something like datacat plus Globus/GridFTP (and a bit of glue) might be closer to Upspin.
There are other similar systems out there for science; the most notable is probably iRODS. ATLAS at CERN also has Rucio. DIRAC, a grid computing framework, has one built in, as does Pegasus WMS. Most of these solve similar problems - file replica management and file sharing, and to a lesser extent, data discovery. The confluence of distributed file/replica management and good metadata/provenance support (especially for data discovery) is increasingly important for medium to large science projects, and I think the same is starting to become true for any organization that houses data in multiple datacenters and file systems. I realize that's not really the problem you are attempting to solve, but I think there's a lot of common ground.
A couple of questions about Upspin:
* Was AFS an influence for this project?
* Are permissions checks effectively cached? If so, are they eventually consistent, or do you rely on watches for cache invalidation or something?
Thanks, and I'm very interested in watching how upspin evolves. It's great to see something new!
Brian Van Klaveren
P.S. Hi Rob, I work in Data Management on LSST.
A few additional notes on how datacat is potentially similar to or different from Upspin:
* ACLs on containers (e.g. folders) only. This effectively protects the metadata only, but it could be used by actual data servers.
* We use a similar set of permissions, derived from AFS: read, insert, delete, write, admin (admin allows a user to modify ACLs). Those are only for the metadata operations and have no bearing on underlying data.
* ACLs are valid until overridden
* Groups are possible (we rely on authentication systems to provide a list of groups for a given user). Users are always in a special "public" group; authenticated users are always in a "protected" group. A special group is added based on the IP of the originating request.
* Typed metadata, supporting floats, integers, strings, and timestamps.
* A VFS layer does permissions enforcement with a DirCache (permissions are valid for ~30 seconds). That calls into a data store API. The two together seem similar to DirServer. The data store API is currently implemented with a relational database underneath, but really that could be anything.
* Access to the VFS is performed through a REST API
* A slim Python client is provided to interact with the REST API.
* We've written file serving proxies which use the REST API and a common authentication mechanism (and/or tokens) to allow users to directly download files from the proxies. Proxies could be deployed at multiple datacenters.
* The server itself never serves data; it's up to clients to determine how best to access the data.
* A prototype PyFUSE module has been written to use the Python Client API and file serving proxies to present a logical file system. Metadata is optionally exposed as extended attributes.
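To make the DirCache point above more concrete, here is a minimal sketch of a time-bounded permission cache where entries are valid for ~30 seconds and simply expire rather than being invalidated (i.e. eventually consistent within the TTL). All names here are illustrative, not the actual datacat API:

```python
import time

CACHE_TTL = 30.0  # seconds; matches the ~30 second validity mentioned above

class PermissionCache:
    """Caches (user, path) -> permission-set lookups for a fixed TTL.

    `lookup` is any callable returning the set of permissions a user
    holds on a path (e.g. {"read", "insert", "delete", "write", "admin"}).
    """

    def __init__(self, lookup, ttl=CACHE_TTL, clock=time.monotonic):
        self._lookup = lookup
        self._ttl = ttl
        self._clock = clock
        self._entries = {}  # (user, path) -> (expires_at, perms)

    def permissions(self, user, path):
        key = (user, path)
        now = self._clock()
        entry = self._entries.get(key)
        if entry is None or entry[0] <= now:
            # Stale or missing: refetch from the backing store and re-stamp.
            perms = self._lookup(user, path)
            self._entries[key] = (now + self._ttl, perms)
            return perms
        return entry[1]

    def check(self, user, path, perm):
        return perm in self.permissions(user, path)
```

The trade-off this illustrates: within the TTL a revoked permission can still be honored, so revocation is only eventually consistent, but no watch/invalidation channel between the VFS layer and the data store is needed.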