notes for 2019-02-26 Dataverse Community Call

Philip Durbin

Feb 28, 2019, 1:35:23 PM
to dataverse...@googlegroups.com

Apologies if I got some of the details wrong.

Phil

2019-02-26 Dataverse Community Call

Agenda

* Large Data and Sensitive Data Efforts in the Community
* Community Questions

Attendees

* Danny Brooke (IQSS)
* Jim Myers (QDR)
* Pete Meyer (HMS)
* Jonathan Crabtree (Odum)
* Gustavo Durand (IQSS)
* Courtney Mumma (TDL)
* Phil Durbin (IQSS)
* Slava (DataverseEU/DANS)
* Sherry Lake (UVA)
* Amber Leahey (SP)
* Anna Dabrowski (Texas Advanced Computing Center)
* Julian Gautier (IQSS)
* Meghan Goodchild (SP)

Notes

* Large Data and Sensitive Data Efforts in the Community
   * (Danny) IQSS has partnered with HMS to implement rsync upload and download of large structural biology datasets: http://guides.dataverse.org/en/4.11/developers/big-data-support.html
   * (Danny) Sensitive Data - too sensitive to move.
   * (Jon) We can't move some data into Dataverse. We need to store it elsewhere. Each TRSA is tied to a home dataverse. A GUI tool to choose datasets. Notary service. Protected Research Data Network. VMs spun up under a made-up ID for temporary work. No persistence of the environment. Showing an image in ImpactTRSAConcept.pdf. Everything is a URL for flexibility. Could be Globus, iRODS, etc. Input is welcome. We'd like a standard way to do this.
      * (Courtney) Computation occurring in environment. Download only or computation? Where does the user start? In Dataverse? In TRSA?
         * (Jon) Computation using Singularity. You are handed an image, approved Singularity containers for Stata, R, GraphML, etc. With a fresh VM we don't have to worry about it being infected. In this use case, the TRSAs are going to be loaded onto datastores that already exist. Sometimes we are teaching groups about adding metadata for their existing data. Depositors start in Dataverse to create a landing page for their data. File level metadata is generated and pushed over.
      * (Danny) What's the best way to keep up with development?
         * (Jon) http://cyberimpact.us and I will put the image in a new blog post.
      * (Amber) We would benefit from talking about this as well. Users with data hosted externally may set up Globus endpoints using the Globus file transfer standard. We want Dataverse to support Globus endpoints. Users should be able to provide URLs to Globus endpoints when adding files in Dataverse. We're wondering how it would look for end users who want to download data. In Dataverse, supporting multiple external storage environments, including Globus, would be great.
         * (Jon) There needs to be enough information in Dataverse to know how to download the file no matter how it's stored.
      * (Pete) Amber, Globus endpoint per file or per dataset? We saw a pain point in how files are handled in Dataverse vs. Globus.
         * (Amber) Maybe per file. Maybe both. It might depend on the size of the file. ComputeCanada is already using Globus and automatically generates Globus endpoints for users. I think both file and dataset level URLs are available. The integration we imagine is much simpler than TRSA.
   * (Amber) In addition to linking to large files that are externally hosted, we're also thinking about trying to scale the system to support what we're calling "medium" sized files, moving them to object storage. We're evaluating S3 and Swift. We want to be able to bypass uploading files to a temporary folder. We think this might be a bottleneck. We're looking at files in the 10 GB range. We also want to look into the rsync Data Capture Module (DCM) but also talk to the community about something browser based.
      * (Pete) For direct uploads to the object store, you might want to look at the work Matthew Dunlap did on integrating DCM with S3. https://github.com/IQSS/dataverse/issues/4949
   * (Courtney) We are definitely not as advanced as Odum or Scholars Portal in this work. We are mostly gathering use cases. We have users with large data. We are interested in an external storage location. The compute environment Odum is talking about could be useful to us as well. We have HIPAA and FERPA data use cases.
   * (Sherry) I'd love to hear more about large data and sensitive data during the community meeting.
   * (Jim) FWIW: The Uploader would benefit from not having to use the temp file as well.
* Community Questions
   * (Jim) New metrics: CSS classes like btn-download - do they do anything with Google yet?[a][b]
      * (Danny) Will be answered here soon
      * (Phil) Is there some opportunity to share the configurations that we’re using?
         * (Jim) Yes, that’s the goal
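
As an illustration of Jon's point that "everything is a URL" and that Dataverse needs enough information to know how to download a file no matter how it's stored, here is a minimal sketch in Python. It is not how Dataverse or TRSA actually works. The scheme names, the `ACCESS_METHODS` table, and the `access_method` function are all hypothetical, made up for this example.

```python
from urllib.parse import urlparse

# Hypothetical mapping from URL scheme to how a file would be fetched.
# These scheme names are illustrative assumptions, not Dataverse or TRSA settings.
ACCESS_METHODS = {
    "http": "direct HTTP download",
    "https": "direct HTTP download",
    "s3": "object storage (S3) download",
    "swift": "object storage (Swift) download",
    "globus": "Globus transfer from an endpoint",
    "irods": "iRODS transfer",
    "rsync": "rsync download",
}


def access_method(storage_url: str) -> str:
    """Return a description of how to fetch a file, based only on its URL scheme."""
    # urlparse lowercases the scheme; it is an empty string when no scheme is present.
    scheme = urlparse(storage_url).scheme
    if scheme not in ACCESS_METHODS:
        raise ValueError(f"no known access method for scheme {scheme!r}")
    return ACCESS_METHODS[scheme]
```

With a table like this, a per-file or per-dataset storage URL (the question Pete raised) is all the download side would need to pick a transfer method. Supporting another storage environment would mean adding one entry.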

[a] The new style classes added to buttons in 4.11 as part of #4660 "do nothing" on their own. They were added to the UI front-end code so that sysadmins can write their own custom analytics event tracking code and add it to their installation using the new `WebAnalyticsCode` setting.
[b] https://github.com/IQSS/dataverse/issues/4660