RFC: Dataverse + Globus Integration


Gerrick Teague

Aug 5, 2020, 5:19:11 PM
to Dataverse Big Data
I kindly ask for comments on the following proposal. My experience with Dataverse and its development is pretty basic, so please correct any misapprehensions I have, even the basic ones. (No ego here!)

The Cognitive and Neurobiological Approaches to Plasticity Center at Kansas State University needs an open science gateway, and Dataverse largely fits the bill. Two big wrinkles remain:
1. Big data support
2. HPC integration with our local cluster.
Thanks to NIH grant P20 GM113109, we're dusting Globus off the shelf to see what we can glue together.
We've created a tool we've coined Synapse (original, I know; https://github.com/cnap-cobre/synapse-globus) to resolve these two wrinkles. Here we focus on big data support.

Globus is, roughly, Dropbox on steroids. With an expansive permissions/control structure as well as API support, it's designed to let many (and large) files be transferred between 'endpoints', either manually or automatically. We currently use it as described here:

[Attached diagram: SynapsePhase1Diagram.png]


Step 7 is a Python CLI job that checks whether any new data has appeared in the 'inbox'. If so, it grabs the ID from the inbox, then retrieves the manifest from the Synapse web server. If Globus says the transfer is complete, we use the standard Dataverse API to import the files as normal, then clean up. Since the large file is already "on" the Dataverse server (which itself is on the HPC cluster), the import into Dataverse isn't constrained by bandwidth/connection issues. But this is only a partial solution.
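
In rough terms, that job looks something like the sketch below (not our exact code; the Synapse manifest endpoint, the manifest fields, and the inbox handling are placeholders, and the authenticated Globus client setup is omitted):

    import requests

    SYNAPSE_API = "https://synapse.example.edu/api"   # placeholder manifest endpoint
    DATAVERSE = "https://dataverse.example.edu"       # placeholder Dataverse host
    API_TOKEN = "..."                                 # Dataverse API token

    def process_inbox(transfer_client, inbox_ids):
        """transfer_client is an authenticated globus_sdk.TransferClient."""
        for transfer_id in inbox_ids:
            manifest = requests.get(f"{SYNAPSE_API}/manifests/{transfer_id}").json()

            # Only proceed once Globus reports the transfer as finished
            task = transfer_client.get_task(manifest["globus_task_id"])
            if task["status"] != "SUCCEEDED":
                continue

            # Standard native-API import into the target dataset
            for f in manifest["files"]:
                with open(f["path"], "rb") as fh:
                    requests.post(
                        f"{DATAVERSE}/api/datasets/:persistentId/add",
                        params={"persistentId": manifest["dataset_pid"]},
                        headers={"X-Dataverse-key": API_TOKEN},
                        files={"file": fh},
                    ).raise_for_status()

            # ...then clean up the inbox entry for this transfer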


We are planning on checking whether a file is large and, if so, importing a 'dummy' file through the API with the correct metadata, etc., but not the actual file itself (or perhaps just a truncated version of the file). Once the small dummy is imported, we replace the file Dataverse wrote to disk with the large file, then go into PostgreSQL and update the file size in the corresponding record.
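
A minimal sketch of the swap we have in mind, assuming the dummy upload hands back the new datafile's database id and its on-disk path, and that the table/column names below ("datafile", "filesize") match the schema (please verify before trusting this):

    import os
    import shutil
    import psycopg2

    def swap_in_large_file(large_path, dummy_storage_path, datafile_id):
        # Overwrite the small dummy file that Dataverse wrote to disk
        shutil.copyfile(large_path, dummy_storage_path)

        # Fix up the recorded size so Dataverse reports it correctly
        conn = psycopg2.connect("dbname=dvndb user=dvnapp")   # placeholder DSN
        with conn, conn.cursor() as cur:
            cur.execute(
                "UPDATE datafile SET filesize = %s WHERE id = %s",
                (os.path.getsize(large_path), datafile_id),
            )
        conn.close()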


A Few Notes:
1. I am currently in the process of writing documentation and deploying the demo onto a production server.
2. This process has been successfully demonstrated on development machines; deployment to production is in progress.
3. Right now it is somewhat specific to our needs, but the goal is to generalize it so that anybody else interested in a Globus + Dataverse integration can use this project (fork / pull request away!).
4. A video demonstration can be found here: https://youtu.be/VYq8Fr_3dhU

Please share this info with any group or body that may be interested!

Thank you for your time.

me...@hkl.hms.harvard.edu

Aug 6, 2020, 6:38:55 PM
to Dataverse Big Data
Hi,

It's exciting to see this - having data repositories hooked up to HPC sites brings a lot of advantages, especially with big data (in my opinion, it would be great if all data repositories had attached compute resources, but that's a longer story).

Assuming 1) that you're using POSIX storage and 2) that you're concerned about space efficiency: could you say a little (or point to the sections of the code) about how you're handling the different assumptions Globus and Dataverse make about filesystem-level storage (e.g., Dataverse assumes files live in a single directory, named by storage identifiers, with filenames and paths kept in the database, whereas Globus assumes filenames and path hierarchies are expressed in the filesystem organization), and about how you're handling file versioning? Of course, my assumptions may be wrong (and I may be operating on out-of-date knowledge of the current Dataverse/Globus models of the world); if so, these might not be things you've had to address. My apologies if this is something already addressed in the video; I haven't watched it yet.
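
(A rough illustration of the mismatch I mean, with invented paths:)

    # Dataverse's model: one flat directory per dataset, files named by opaque
    # storage identifiers, with the "real" filenames/paths kept in the database:
    #     .../10.5072/FK2/ABC123/17a9c4e2d1b-5f2e0c3a9d41
    #
    # Globus's model: filenames and hierarchy live in the filesystem itself:
    #     /project/subject01/session02/scan.nii.gz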

I believe that Scholars Portal is also working on Dataverse/Globus integration (although, if I'm remembering correctly, without the requirement for in-place compute in its initial stages). If you weren't already aware of it, it may be worth looking into and seeing whether there are ways you could help each other.

Best,
Pete

Philip Durbin

Aug 7, 2020, 10:31:46 AM
to Dataverse Big Data
Hi Gerrick,

Your video is fantastic and this is a very exciting development! I know we've talked a bit on the main list, but it was a nice surprise. :)

I'm glad you've already found this big data mailing list. You're in the right place. I'd say you should also feel free to mention what you're up to on the main list (dataverse-community). Perhaps you could link to the thread here.

I have a couple of suggestions for you.

First, please consider joining the Remote Storage/Large Datasets Working Group. This and other working groups are still forming so you can get in on the ground floor. For details, please see the "GDCC working groups" thread at https://groups.google.com/g/dataverse-community/c/EY0dduRj3Ac/m/EDcEQHLoAwAJ

Second, I highly recommend watching the "Globus Transfer Integration" talk by Meghan Goodchild from Scholars Portal, where she summarizes their recent efforts to integrate Globus with Dataverse and gives a great demo at the end. (This is the work that Pete mentioned.) The talk can be found at https://youtu.be/LHyiA3JeiwE?t=725 and the code at https://github.com/scholarsportal/dataverse/tree/dataverse-v4.17-SP-globus-phase1

As for your implementation, it looks great to me. I hope you haven't had to fork Dataverse. If there are changes you need to the core code, please go ahead and open issues on GitHub.

Since OnDemand is part of your solution, you might want to comment on this issue I opened about integrating with data repositories: https://github.com/OSC/ondemand/issues/354

I was pretty excited to see the "send to cluster" button you put into Dataverse. As we may have talked about elsewhere, when Dataverse is configured for rsync, the dataset landing page will indicate where the files are on the cluster (which directory they reside in). That is to say, we definitely understand the need for this functionality, mostly thanks to Pete.

You mentioned dummy files. We sometimes use that exact trick to get large files into Dataverse ourselves. Hopefully the various large data efforts will obviate the need for it. :)

The last thing I'll mention for now is that we maintain a list of integrations as part of the Dataverse Admin Guide. When Synapse is ready to be added to the list, please go ahead and open an issue. Here's the list as of the latest release, Dataverse 4.20: http://guides.dataverse.org/en/4.20/admin/integrations.html

Thanks for putting all this together. Again, it looks great.

Phil

Gerrick Teague

Aug 14, 2020, 5:54:47 PM
to Dataverse Big Data
Pete, great questions! 
Regarding the different storage strategies between Dataverse and Globus: 
We leave that as an exercise for the user! We keep the file system hierarchy intact as it flows through Globus, but since we still import the data via the Dataverse API (even if it's a 'dummy' file), we still follow Dataverse's dataset organization. E.g., a user would have to zip the files or use Dataverse's own hierarchy support. I just found the "directoryLabel" key when uploading a file via the API. I'm sure there's a way to read that value back out, so I think it might be possible to use it to reconstruct the file system hierarchy when 'exporting' via Globus. Not ideal, but perhaps doable.
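
For reference, here is roughly how directoryLabel is supplied on upload via the native API (a sketch; the host, token, dataset PID, and file name are invented):

    import json
    import requests

    DATAVERSE = "https://dataverse.example.edu"   # invented host
    API_TOKEN = "..."
    PID = "doi:10.5072/FK2/EXAMPLE"               # invented dataset PID

    # directoryLabel records the file's path *within* the dataset, so the
    # hierarchy is preserved as metadata even though the bytes land in
    # Dataverse's flat, storage-identifier-named store.
    metadata = {"directoryLabel": "subject01/session02",
                "description": "Raw scan data"}

    with open("scan.nii.gz", "rb") as fh:
        r = requests.post(
            f"{DATAVERSE}/api/datasets/:persistentId/add",
            params={"persistentId": PID},
            headers={"X-Dataverse-key": API_TOKEN},
            files={"file": fh},
            data={"jsonData": json.dumps(metadata)},
        )
    r.raise_for_status()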

Regarding versioning:
I'm not exactly sure. I assume (cringe) versioning will follow normal Dataverse protocols. Since we are still using the Dataverse API and merely overwriting the file that Dataverse created in the file system: if the file is being replaced, Dataverse will give us the path to the existing file; if the file is to be a new version of an existing file, Dataverse will give us the path to a new file, which we will then overwrite. Phil, does this sound right?

Gerrick

me...@hkl.hms.harvard.edu

Aug 17, 2020, 10:18:59 AM
to Dataverse Big Data
Hi Gerrick,

From your answers, it sounds like the problem was my assumptions about what you were doing (the frequent curse of assumptions being wrong...). I'd been thinking that you were trying to keep a single copy of the data files for both "in Dataverse" use and computing, but having more than one copy makes things much simpler. And if you're doing it that way, Dataverse versioning won't be an issue (as far as I know; I'll defer to Phil's knowledge on that).

Best,
Pete

Victoria Lubitch

Aug 19, 2020, 10:44:06 AM
to Dataverse Big Data

Hello, I would like to know the setup (installation) steps for Synapse. We would like to try it out.

Victoria
Scholars Portal

Gerrick

Aug 19, 2020, 4:38:13 PM
to Victoria Lubitch, Dataverse Big Data
Victoria, I will write up an installation guide soon, and will link to it here.

Gerrick

Philip Durbin

Aug 25, 2020, 2:10:36 PM
to Dataverse Big Data
Regarding this...

"Since we are still utilizing the Dataverse API, and merely overwriting the file that Dataverse created in the file system -  if the file will be replaced, Dataverse will give us the path to the existing file. If the file is to be a new version of an existing file, Dataverse will give us a file path to a new file, which we will replace. Phil, does this sound right?"

... overall, this makes sense. There are probably some edge cases to consider. Yes, Dataverse will tell you the path (directoryLabel) for each file. Internally, the only way Dataverse knows that a file in one version (let's say 2.0) is absolutely intended to replace a file in a previous version (let's say 1.0) is if the "replace file" feature is used, such that the "previousdatafileid" field of the datafile table is populated. Before this feature was introduced (and still today), people can also indicate that a file is the same simply by using the same filename across versions. The "replace file" feature makes the relationship explicit.
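
If it's useful, that explicit relationship can be inspected straight from the database, something along these lines (connection details are placeholders; table/column names are as described above, so check them against your schema):

    import psycopg2

    conn = psycopg2.connect("dbname=dvndb user=dvnapp")   # placeholder DSN
    with conn.cursor() as cur:
        # Files whose replacement was made explicit via the "replace file" feature
        cur.execute("SELECT id, previousdatafileid FROM datafile "
                    "WHERE previousdatafileid IS NOT NULL")
        for file_id, previous_id in cur.fetchall():
            print(f"datafile {file_id} replaces datafile {previous_id}")
    conn.close()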

I hope this helps,

Phil