Good solution for sharing largish data sets?

Aaron Watters

Mar 15, 2018, 2:06:51 PM

to Project Jupyter

Hi folks,

I'm interested in techniques for sharing data in scientific workflows.

Tools like git/github and docker/repo2docker are great for sharing computational
environments and moderate-sized data, but not good for sharing (say)
hundreds of gigabytes of data. What do people do?

I have in mind something like this: a scientist on a good network
spins up a jupyter server in a Docker container containing a workflow
using github and repo2docker. In the container s/he provides some
authorization credentials and data for the workflow appears in the
container if the credentials are valid, maybe with read/write access
of some sort if the credentials are really good.

If we are interested in providing publicly accessible data in read-only
mode we could just dump the data to a web server anywhere
and pull it down using HTTP.
I don't know the right way to do this if we want to have limited access
to the data and sometimes provide the ability to write the data.
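
For the public, read-only case, a minimal sketch of the "pull it down using HTTP" step might look like the following. It assumes plain unauthenticated HTTP and uses only the Python standard library; the function name `fetch_to_disk` and the URL in the usage comment are hypothetical. The point it illustrates is streaming in fixed-size chunks, so a very large file never has to fit in memory.

```python
import shutil
import urllib.request
from pathlib import Path


def fetch_to_disk(url, dest, chunk_size=1024 * 1024):
    """Stream the resource at `url` into the file `dest`.

    Data is copied in chunks of `chunk_size` bytes, so memory use stays
    roughly constant regardless of file size. Returns the number of
    bytes now stored in `dest`.
    """
    dest = Path(dest)
    with urllib.request.urlopen(url) as response, open(dest, "wb") as out:
        shutil.copyfileobj(response, out, chunk_size)
    return dest.stat().st_size


# Hypothetical usage (placeholder URL, not a real data set):
# fetch_to_disk("http://example.com/file", "file")
```

This does nothing about restricted access or write-back, which is the harder part of the question.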

I'm also interested in the case where the scientist is remote --
i.e., certain people are allowed to use our compute cluster, possibly
with data they have locally or with other data out there somewhere...

Any and all thoughts or pointers appreciated. Thanks!

Sorry if the question is silly or too vague.

-- Aaron Watters

Matthew Turk

Mar 15, 2018, 3:34:18 PM

to jup...@googlegroups.com

Hi!

Great question. My name's Matt Turk and along with some other folks
(lurking?) on this list I work on a project called Whole Tale. We
just had an overview paper published (gold OA) at
https://doi.org/10.1016/j.future.2017.12.029 that gives some
architectural information, but the gist is that we're trying to solve
that exact problem. Our website isn't the best, and we're not
confident of a stable, running instance until early summer (I bet if
you logged in you could find ways to break it or prickly bits in the
UI), but you can find a bit more at wholetale.org and
github.com/whole-tale . You could even launch your own instance,
should you want to.

The long and the short of it is that we run docker containers (not
only Jupyter, but it's currently used as one of the defaults) with
computational environments and "inject" data through a handcrafted
FUSE fs.

The ultimate location of the data is not important (it can be either local
or remote), as long as you provide a valid URI containing both
location and transfer protocol (e.g. 'http://example.com/file',
'globus:/endpoint/foo/bar'). There are a couple of additional attributes
you need to provide (size & name, although over HTTP sometimes we can
get these). We keep track of all of those using an external db
(MongoDB via Girder) which is subsequently used by FUSE to resolve
OS-level IO calls into appropriate requests for data. For example,
when you open() a file that's registered as a 'http://' url, it will
(invisibly) locally cache it and present it as though it were local.
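
To make the register-then-cache-on-open idea concrete, here is a rough Python sketch. It is not Whole Tale's code: the class `LazyDataRegistry` and its methods are hypothetical, an in-memory dict stands in for the external db, and an ordinary method call stands in for the FUSE layer that would intercept a real open(). Only schemes that urllib can fetch are handled; anything else (e.g. 'globus:') is left unimplemented.

```python
import hashlib
import shutil
import urllib.request
from pathlib import Path
from urllib.parse import urlparse


class LazyDataRegistry:
    """Toy registry: maps a file name to a URI and caches data on first open."""

    FETCHABLE = ("http", "https", "file")  # schemes urllib can retrieve

    def __init__(self, cache_dir):
        self.cache_dir = Path(cache_dir)
        self.cache_dir.mkdir(parents=True, exist_ok=True)
        self.entries = {}  # name -> {"uri", "scheme", "size"}

    def register(self, name, uri, size=None):
        """Record where `name` lives; no data is transferred here."""
        scheme = urlparse(uri).scheme
        if not scheme:
            raise ValueError(f"URI must include a transfer protocol: {uri!r}")
        self.entries[name] = {"uri": uri, "scheme": scheme, "size": size}

    def _cache_path(self, uri):
        # Derive a stable local file name from the URI.
        return self.cache_dir / hashlib.sha256(uri.encode("utf-8")).hexdigest()

    def open_file(self, name):
        """Return a read-only binary handle, fetching on the first call."""
        entry = self.entries[name]
        cached = self._cache_path(entry["uri"])
        if not cached.exists():
            if entry["scheme"] not in self.FETCHABLE:
                raise NotImplementedError(
                    f"no transfer backend for {entry['scheme']!r} in this sketch"
                )
            # Download to a temporary name, then rename, so a partial
            # transfer is never mistaken for a complete cached file.
            partial = cached.with_suffix(".part")
            with urllib.request.urlopen(entry["uri"]) as resp, \
                    open(partial, "wb") as out:
                shutil.copyfileobj(resp, out)
            partial.replace(cached)
        return open(cached, "rb")
```

After `register("file", "http://example.com/file")`, the first `open_file("file")` would download and cache the data, and later calls would read the cached copy without touching the network.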

Kacper Kowalik, our software architect, recently gave a presentation
on it that might be of interest; you can see it here:
http://use.yt/upload/c8236396 .

I'd be happy to share more here or offline, too. We still have a ways
to go, especially in smoothing things out from a UI/UX perspective and
stabilizing the platform, but we're working hard on it and really
want to engage much more deeply with folks throughout the community.

-Matt, on behalf of the Whole Tale team


Jason Grout

Mar 15, 2018, 3:45:28 PM

to jup...@googlegroups.com

There was a guest post on the Jupyter blog the other day about Quilt, which may be interesting for you to look at: https://blog.jupyter.org/reproducible-data-dependencies-for-python-guest-post-d0f68293a99

Jason
