Linking big data

Gerrick Teague

Nov 7, 2019, 2:55:38 PM
to Dataverse Users Community

Our institution is thinking about leveraging Dataverse. We won't host our own instance yet, so rsync is out. But we do have an HPC cluster with scads of disk space. What options do we have for large data files? Is there currently a way to 'link' (i.e., store a pointer) to off-site storage of large files?

I think I could pretty easily develop a poor man's version of off-site linking:

Utilizing the DV API, a script would gather attributes of any large file (name, hash, DOI, Globus endpoint link, accessible URL, other metadata, etc.) and store that data as a 'link' file in DV. Utilizing the API and scripts, it would sync any changes (e.g., when a file is updated on our servers, we update its 'link' file in DV with the new hash, spawning a new revision, etc.). This way, collaborators utilizing the Harvard Dataverse could search the data as normal; the only difference is that a user wanting to 'download' the large file would open the 'link' file and click on the URL (or Globus link) to actually get to the data.
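To make this concrete, here's a rough sketch in Python of the registration step. It assumes Dataverse's native add-file API (POST /api/datasets/:persistentId/add); the installation URL, API token, dataset DOI, and file paths below are all placeholders:

import hashlib
import json
import requests

DV_BASE = "https://dataverse.harvard.edu"  # target installation (placeholder)
API_TOKEN = "xxxx-xxxx-xxxx"               # placeholder API token
DATASET_DOI = "doi:10.5072/FK2/EXAMPLE"    # hypothetical dataset persistent ID

def make_link_record(path, access_url, globus_link):
    # Gather the attributes of the large file that the 'link' file will carry.
    sha256 = hashlib.sha256()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(1 << 20), b""):
            sha256.update(chunk)
    return {"name": path, "sha256": sha256.hexdigest(),
            "url": access_url, "globus": globus_link}

def upload_link_file(record):
    # Store the record as a small JSON 'link' file in the dataset via the native API.
    resp = requests.post(
        f"{DV_BASE}/api/datasets/:persistentId/add",
        params={"persistentId": DATASET_DOI},
        headers={"X-Dataverse-key": API_TOKEN},
        files={"file": ("bigfile.link.json", json.dumps(record), "application/json")},
    )
    resp.raise_for_status()
    return resp.json()

On an update, the script would recompute the hash and replace the 'link' file (Dataverse has a file replace API as well), which is what would spawn the new revision.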

Since I don't really know much about Dataverse (minus the week or so spent stalking the docs, GitHub, and this Google Group):

1. Is this a problem you've already solved, and I don't know it?
2. If not, would the idea above be feasible?


Thanks!
Gerrick

Philip Durbin

Nov 7, 2019, 3:24:13 PM
to dataverse...@googlegroups.com
Hi Gerrick,

Since you're talking about hosting data with Harvard Dataverse (as opposed to the other 49 installations or a new installation you host), I think your question is a policy question that you should ask the Harvard Dataverse curators by emailing sup...@dataverse.harvard.edu

I hope this helps,

Phil

P.S. A reference website for Harvard Dataverse is being created that might answer questions like this in the future: https://github.com/IQSS/dataverse.harvard.edu/issues/26


Gerrick Teague

Nov 7, 2019, 4:01:11 PM
to Dataverse Users Community
Thanks, Philip!

I shall certainly ask them about this, but my post here was more of a technical feasibility question:
1) Does the Dataverse project support 'big data' outside of rsync?
2) If not, would the idea of parking data outside of a Dataverse installation, but linking to that data from within a Dataverse instance, make sense given the limitations described? (For example, would it likely mess up the data provenance system? Or have I not thought about how citations would work with this setup, etc.?)

Sorry for the confusion!

Philip Durbin

Nov 7, 2019, 4:24:41 PM
to dataverse...@googlegroups.com
rsync is the only supported "big data" feature at this time, and I don't believe anyone is using it in production yet, but at least one installation is close. In our list of features we call it experimental: https://dataverse.org/software-features

There is a ton of interest in big data and a dedicated mailing list for discussion. It hasn't been very active (discussions have been happening elsewhere) but a new message was posted this week about TDL's direct S3 upload idea: https://groups.google.com/d/msg/dataverse-big-data/nWf57CXrRyc/WoKqpwa_AQAJ

Direct S3 uploads would nicely complement direct S3 downloads, which are a lifesaver because the data is streamed directly from S3 to the client rather than passing through Apache and Glassfish. Please see "dataverse.files.s3-download-redirect" at http://guides.dataverse.org/en/4.17/installation/config.html#s3-storage-options
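For illustration (this sketch isn't from the guides), here's what the redirect looks like from a client's point of view; the installation and file ID are hypothetical. With the redirect enabled, the file access API answers with a redirect to a presigned S3 URL instead of streaming the bytes through the app server:

import requests

# Hypothetical installation and file database ID.
url = "https://demo.dataverse.org/api/access/datafile/12345"

# Don't follow redirects, so we can see where Dataverse sends us.
resp = requests.get(url, allow_redirects=False)

if resp.status_code in (301, 302, 303, 307):
    # With dataverse.files.s3-download-redirect=true, this should be a
    # presigned S3 URL; the client then fetches the bytes directly from S3.
    print("Redirected to:", resp.headers["Location"])
else:
    # Without the redirect, Dataverse streams the file itself.
    print("Streamed through Dataverse, status:", resp.status_code)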

On Tuesday's community call Scholars Portal told us all about their ideas for Globus: https://groups.google.com/d/msg/dataverse-community/0hu9xXrwOPI/TaMZwOhFAwAJ

Odum has a project going called TRSA (Trusted Remote Storage Agent), which is for sensitive data but could also be used for big data. In general, just linking to data on someone else's server is a bit frowned upon. The idea behind "trusted" is that the installation of Dataverse trusts that the storage will never go away. Please don't link to Geocities or other sites that will go dark some day. :) TRSA is a lot to unpack, but I recommend watching https://www.youtube.com/watch?v=MKsrsV6KWsQ and maybe checking out some of the links at https://github.com/IQSS/dataverse/issues/5213, especially http://cyberimpact.us/dataverse-trusted-remote-storage-agent-update/

I'm probably forgetting other big data stuff but I hope this gets you started. Please keep the questions coming!

Thanks,

Phil


Gerrick Teague

Nov 7, 2019, 5:38:53 PM
to Dataverse Users Community
Thanks for the links! It seems my simple idea is conceptually similar to both TDL's S3 upload/download and TRSA's system, in that they all house the data on 'other' (trusted) servers, not burdening the DV installation's resources while still leveraging the metadata awesomeness. I really like TRSA's idea of having the Download button act, in effect, as an interface to various storage connectors (local file system, TRSA-specific, Globus endpoints, or just an offsite URL or DOI). This, I think, could be relatively simple (well, it always seems simple in the beginning...) with great bang for the buck!
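To sketch what I mean (purely hypothetical connector names here, nothing Dataverse does today), the Download button would just dispatch on the storage type recorded for the file:

from typing import Callable, Dict

# Hypothetical record describing where a file's bytes actually live.
RESOLVERS: Dict[str, Callable[[dict], str]] = {
    "local":  lambda rec: "file://" + rec["path"],
    "url":    lambda rec: rec["url"],
    "globus": lambda rec: "https://app.globus.org/file-manager?origin_id=" + rec["endpoint"],
    "doi":    lambda rec: "https://doi.org/" + rec["doi"],
}

def resolve_download(record: dict) -> str:
    # Return the link the Download button would hand the user for this record.
    return RESOLVERS[record["type"]](record)

# Example: a file parked at an offsite URL.
print(resolve_download({"type": "url", "url": "https://hpc.example.edu/data/bigfile.tar"}))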

As a developer first, just getting acquainted with the research scene: does your phrase "In general, just linking to data on someone else's server is a bit frowned upon" still apply if the data is housed on your own servers? I would think the biggest deal, as you alluded to, is to preserve the link indefinitely and, if necessary, ensure access control / security.

I'm getting excited for the future of Dataverse!

Thanks again,
Gerrick

James Myers

Nov 7, 2019, 5:52:13 PM
to dataverse...@googlegroups.com

(just read your latest as I’m sending this…)

 

Gerrick,

 

Your thinking is spot on. I think there are basically two areas to think about when Dataverse references files held in external storage.

 

The first is access control and ownership. Dataverse limits file access prior to publication, allows users to restrict files and/or require acceptance of access terms post-publication, and documents use (Make Data Count). Dataverse is also used by organizations intending to keep data accessible long-term once it's published. For these reasons, storing data in a separate system with its own access control presents issues. Some are ameliorated if the same org owns the storage and can guarantee it stays around, but something (tech and/or policy) still has to be done to, for example, stop users from deleting external files after publication. The TRSA and Globus work Phil mentions is addressing these. The S3 work I'm doing for TDL, which extends the existing ability for Dataverse to manage files in a (potentially remote) S3 store, tries to avoid the problem by letting Dataverse keep/maintain control while still allowing users, e.g. those at a computing center, to stream their data directly from computing-center storage into an S3 bucket at the center.
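As a rough illustration of that last point (and not the actual TDL implementation), the user-side step could be as simple as streaming from center storage into the bucket with boto3; the bucket and key names here are made up, and this omits whatever handshake Dataverse would require to register the file:

import boto3

# Credentials come from the usual boto3 sources (env vars, ~/.aws/credentials, IAM role).
s3 = boto3.client("s3")

# upload_file does multipart upload automatically for large files, so the
# data streams from computing-center storage to S3 without passing through Dataverse.
s3.upload_file(
    Filename="/scratch/project/simulation_output.tar",  # hypothetical path
    Bucket="dataverse-files-example",                   # hypothetical bucket
    Key="10.5072/FK2/EXAMPLE/simulation_output.tar",    # hypothetical key layout
)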

 

The other area of concern is that Dataverse has functionality that requires touching the file bits – it can inspect a file to determine its mimetype, extract metadata from some file types, create derived files from tabular formats, unzip zip files to manage the content files individually, full-text index files, preview files, check fixity with hashes, create thumbnails, track downloads, gather info in guestbooks, etc. It's not clear that all of these make sense for large files, but a general solution either has to provide a way to implement this functionality (e.g. have the remote store able to handle it) or to turn it off. In the latter case, since this type of functionality is very useful, turning it off probably needs to be optional (with a size cut-off, or some way for an admin/users to decide which datasets/files are treated which way). While I'm less sure how this is being handled across the current dev efforts, I think you'll see people trying different trade-offs between keeping the remote stores simple (by turning things off) and trying to implement things within/near the store. For the TDL work, I think the plan will be to allow Dataverse to run with more than one store at a time: smaller files would go in a store where Dataverse does all of its current functionality, and larger files in a store (or stores) that still controls access/access tracking but avoids operations that require Dataverse to read all of the data. (Details TBD, as we just finished a proof-of-concept, as noted in https://groups.google.com/forum/#!forum/dataverse-big-data – feedback/ideas on how to make a general solution are welcome there.)
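To illustrate the size cut-off idea only (none of these names or numbers are from the actual TDL work), the routing might look like:

# Hypothetical multi-store routing: small files get full Dataverse processing,
# large files go to a store that skips operations needing to read all the bytes.
SIZE_CUTOFF = 2 * 1024**3  # assumed 2 GiB threshold, purely illustrative

STORES = {
    "full":  {"ingest": True,  "thumbnails": True,  "full_text_index": True},
    "large": {"ingest": False, "thumbnails": False, "full_text_index": False},
}

def pick_store(file_size_bytes: int) -> str:
    # Choose a store label based on file size alone.
    return "large" if file_size_bytes >= SIZE_CUTOFF else "full"

print(pick_store(10 * 1024**3))  # a 10 GiB file lands in the "large" store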

 

-- Jim


Gerrick Teague

Nov 7, 2019, 6:40:01 PM
to dataverse...@googlegroups.com
Jim, thanks for the distillation of the current landscape. Your approach, fully realized, seems to be really close to our use case. Unfortunately, I'm under some significant limitations at the moment: a short development period and a lack of human resources to stand up and maintain a Dataverse instance that I can hack away on to enable multiple stores, etc. This could well change in the future, since DV can offer much to our university. I'll need to refine what our requirements and options are with regard to access control / metadata extraction.

Thanks to you and Phil, I'm now following the dataverse-big-data group. Hopefully things will line up at some point where I can be of use!

-Gerrick

