sshfs and globus


Falk Herwig

Feb 21, 2021, 12:54:39 PM
to Discuss
I was wondering if there is something, or plans for something, that would combine the filesystem-like access of sshfs with the speed of Globus. In my use case I have several repositories of 100s of TBs in different locations, which I would like to combine into one data analysis platform. Typically each analysis needs only 100s of GB from several of these large repos, and often the favourite data sets are used repeatedly. Bringing all the data over is prohibitive. A large local cache that automatically holds the favourites, combined with the speed of Globus to fetch the ones that are missing, would be the solution. I guess this is essentially sshfs, but with Globus inserted as the transfer mechanism instead of scp. Does something like that exist?
Thanks - Falk

Falk Herwig

Feb 21, 2021, 3:23:45 PM
to Ames, Sasha, Discuss

Sasha, 
If there were such an effort we would serve as demonstration users, and we would deploy this capability and showcase it in our 3D stellar hydro server https://www.ppmstar.org, which has a public hub (see the Hubs tab) in which anybody can authenticate with a GitHub user name and explore data. We could stage some data on your institution's server, for example, and explore it in our virtual research environment.
Falk.

On Feb 21, 2021, at 12:05 PM, Ames, Sasha <am...@llnl.gov> wrote:

That would be a really cool student project to write a FUSE file system on top of the Globus API.

Cheers,

-Sasha


Stephen Rosen

Feb 22, 2021, 1:22:24 PM
to Falk Herwig, Ames, Sasha, Discuss
I'm not aware of any current project which tries to provide a filesystem-like interface via Globus Transfers.

However, circa 2017 a student at UChicago built a Globus Transfer-backed FUSE FS as a masters project.
I never saw the code, so I unfortunately can't provide a reference, but I recall speaking with him about it.
The main issue was crippling latency.

There are many more hops and steps in a Globus file transfer than a direct SSH tunnel -- the latency issue isn't something we can solve at our end, at least not under the current application architecture.

Aggressive caching, as you mentioned in your initial email, would be a viable approach to handling the latency issue, but only after initial load.
And caching from a read-only source is way easier than handling writes back to those files.


If you wanted to build something, the main advice I would offer based on this experience would be to think carefully about the types of workloads you do or do not want to support.
Using Globus Transfers as the underlying mechanism for low-latency, many-small-files cases will not work well.
To support both low- and high-latency workloads, you'd probably need to provide an alternate channel for data transfers and some kind of configurable smart-dispatch (e.g. based on file size).
And consider whether or not you'd want the resulting "filesystem" to be writable.
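The smart-dispatch idea above can be sketched in a few lines. Everything here is illustrative: the channel functions are placeholders for a real SFTP client and a real Globus transfer submission, and the 100 MB threshold is an assumed tuning parameter, not anything prescribed by Globus.

```python
# Size-based dispatch: small files go over a low-latency per-file
# channel (e.g. SFTP), large files go through a bulk transfer service
# such as Globus (high throughput, higher startup latency).
# Both fetch functions are stubs standing in for real clients.

SIZE_THRESHOLD = 100 * 1024 * 1024  # 100 MB; tune per deployment


def fetch_via_ssh(path, size):
    # Placeholder: a real version would pull the file over SFTP.
    return ("ssh", path)


def fetch_via_globus(path, size):
    # Placeholder: a real version would submit a Globus transfer task.
    return ("globus", path)


def dispatch(path, size, threshold=SIZE_THRESHOLD):
    """Choose the transfer channel for one file based on its size."""
    if size < threshold:
        return fetch_via_ssh(path, size)
    return fetch_via_globus(path, size)
```

The threshold would likely need to be configurable per site, since the crossover point depends on the round-trip latency of each channel.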


As always, we're here to help you in whatever capacity we can, but I wanted to warn you about the issues you may face.
Cheers,
-Stephen

ch...@uchicago.edu

Feb 22, 2021, 2:37:48 PM
to Stephen Rosen, Falk Herwig, Ames, Sasha, Discuss

The student’s FUSE implementation is here: https://github.com/austinbyers/GlobusFS


This was built as part of a class project and it worked OK in small tests. The code was developed about 5 years ago, so it would likely need some updates, but it would probably serve as a good example for future student projects. This student implemented a cache to hide some of the network latency and reduce the number of network calls, but latency was still a problem.


The Whole Tale project (wholetale.org) implements a read-only Globus file system abstraction in the Girder framework (https://github.com/whole-tale/globus_handler). This implementation also makes use of a cache to store files in the Whole Tale environment and make them accessible to running “Tales” (e.g., notebooks hosted in Docker containers).


Kyle

Falk Herwig

Feb 22, 2021, 3:56:41 PM
to Discuss, sir...@globus.org, Ames, Sasha, Discuss, Falk Herwig
If there was someone on the receiving end I would be happy to draft a requirements document. 

I don't see the latency as a problem. Also, as a starting point, and for our use case, read-only access would be absolutely fine. We would deploy this primarily in a JupyterHub server environment. Our current public-access server can be found under the Public/Outreach server (Hubs tab) on https://www.ppmstar.org. As the Readme.md there explains, under /data we have the (for the user) immutable data repositories mounted (right now using sshfs). However, sshfs access to larger remote volumes is simply too slow. Latency is not a problem, because even with latency a Globus FUSE FS would be a great improvement over the current workflow. That workflow essentially involves manually bringing over those data sets for which fast local access is needed, and then manually administering that local volume (which is essentially a cache). Very time consuming, and thus not done often.

Falk. 

Ian Foster

Feb 22, 2021, 4:53:54 PM
to Falk Herwig, Discuss, sir...@globus.org, Ames, Sasha
At the risk of adding complexity to this discussion, a few thoughts:

1) Am I right in thinking that:
a) you want to access entire files, not partial files, and
b) you want read-only access only?

2) If so, then an alternative approach to Fuse is to implement simple “prefetch” and “access” functions, something like the following:

a) Prefetch (non-blocking): Check a local cache and initiate a transfer to the cache via Globus if the file is not present (and no transfer for that file is in progress).

b) Access (blocking): Check a local cache and initiate a transfer to the cache via Globus if the file is not present (and no transfer for that file is in progress); block until the file is present locally.

Then:
— One can run a program that needs to access a remote file F without modification by running “access(F)” before running the program.
— Or, if a program needs to access multiple files while running, then it can make the access() call internally.
— As an optimization, one can make a prefetch() call prior to an access() call.

I suspect that this will ultimately perform better and be simpler than trying to use Fuse. I may well be wrong, however. 

Ian