Big Data and Dataverse


danny...@g.harvard.edu

Oct 1, 2018, 11:55:07 PM
to Dataverse Users Community
Hi everyone,

There are a few different efforts underway to support big(ger) data in Dataverse. Notably:

- The Dataverse Team and Harvard Medical School SBGrid Team are working on functionality to enable large data transfer using rsync
- Trusted Remote Storage Agents, developed by the Odum Institute at UNC, will allow Dataverse to be used as a metadata store, while storing the data itself in a different place (due to the data being too large or too sensitive)
- The Texas Digital Library has made changes that allow larger files to be uploaded through the UI
- Scholars Portal has some interest in allowing large data transfer
- Something else I'm sure I'm missing...

On a recent community call, it was suggested that there may be some opportunity to share information and coordinate efforts on a call. If you'd like to discuss this in a bit more detail, please fill out the Doodle Poll with your availability:


Thanks,

Danny

meghan.good...@gmail.com

Oct 12, 2018, 9:23:19 AM
to Dataverse Users Community

Hi Danny,

We are really looking forward to this discussion. Any word about when this will be scheduled?

Many thanks,
Meghan
Scholars Portal

danny...@g.harvard.edu

Oct 15, 2018, 12:26:49 PM
to Dataverse Users Community
Hi Meghan,

Thanks for checking in. I'm just back in the office this week but I'll review and send an invite to the individuals who filled out the Doodle Poll ASAP. Sorry for the delay.

- Danny

danny...@g.harvard.edu

Oct 15, 2018, 12:47:28 PM
to Dataverse Users Community
Hi everyone,

I just scheduled a meeting for this Thursday (10/18) at Noon ET. 

The agenda, connection details, and a collaborative notes document can be found here:


- Danny



Philip Durbin

Oct 18, 2018, 3:13:41 PM
to dataverse...@googlegroups.com
Great meeting! Here are the notes:

Dataverse Big Data - October 18th, 2018

* Participants  - https://global.gotomeeting.com/join/264615709
   * Danny Brooke (IQSS)
   * Pete Meyer (HMS)
   * Jim Myers (QDR, TDL)
   * Julian Gautier (IQSS)
   * Phil Durbin (IQSS)
   * Jonathan Crabtree (Odum)
   * Courtney Mumma (TDL)
   * Tania Schlatter (IQSS)
   * Gustavo Durand (IQSS)
   * Mike Heppler (IQSS)
   * Amber (Scholars Portal)
   * Bikram (Scholars Portal)
   * Meghan (Scholars Portal)
   * Len Wisniewski (IQSS)
   * Slava (DANS)
   * Sherry Lake (UVa)
   * Kevin Condon (IQSS)
* Agenda
   * Review efforts underway by the Harvard IQSS Team
      * rsync
         * Large number of files, large files, direct access to compute
         * POSIX storage, not object
         * Implemented using the DCM to get data in ( http://guides.dataverse.org/en/4.9.4/developers/big-data-support.html#data-capture-module-dcm ) and RSAL ( http://guides.dataverse.org/en/4.9.4/developers/big-data-support.html#repository-storage-abstraction-layer-rsal ) to get data out
         * Drawbacks: the files are represented as a single package file (all the files bundled together) that cannot be restricted, and the workflow is designed for UNIX systems, not Windows
            * ":PublicInstall" is documented at http://guides.dataverse.org/en/4.9.4/installation/config.html#publicinstall
            * Package files are described at http://guides.dataverse.org/en/4.9.4/user/find-use-data.html#downloading-a-dataverse-package-via-rsync and http://guides.dataverse.org/en/4.9.4/user/dataset-management.html#rsync-upload
            * See also "no ingest" as another drawback below.
         * https://dataverse.sbgrid.org/dataset.xhtml?persistentId=doi:10.15785/SBGRID/1
         * No demo site is available yet, but the "docker-dcm" configuration can be used to spin one up today: https://github.com/IQSS/dataverse/tree/develop/conf/docker-dcm . In the future we may set up a demo site.
         * Screenshots in "Scalable Options - Dual Mode" at  https://github.com/IQSS/dataverse/issues/4610
         * (Pete) forgot to mention that DCM uploads side-step “ingest” of data files
         * Next steps: move to S3. Data can currently get in using the DCM but cannot yet get out; it will eventually be retrievable through a URL using a download manager. (A scripted sketch of the current DCM upload flow follows below.)
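         * A minimal scripted sketch of the upload side, assuming the dataCaptureModule rsync endpoint described in the big-data-support guide linked above; the server URL, API token, and dataset ID are placeholders:

              import requests

              # Placeholder values for illustration only.
              BASE_URL = "https://dataverse.example.edu"
              API_TOKEN = "xxxxxxxx-xxxx-xxxx-xxxx-xxxxxxxxxxxx"
              DATASET_ID = 42

              # Ask Dataverse (backed by the DCM) for the rsync upload script
              # for this dataset, per the big-data-support guide linked above.
              resp = requests.get(
                  f"{BASE_URL}/api/datasets/{DATASET_ID}/dataCaptureModule/rsync",
                  headers={"X-Dataverse-key": API_TOKEN},
              )
              resp.raise_for_status()

              # Save the response (the upload script); the depositor runs it
              # locally to transfer the files to the DCM over rsync.
              with open("upload.bash", "w") as f:
                  f.write(resp.text)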
      * Download package files from S3
         * https://github.com/IQSS/dataverse/issues/4949
         * Design doc w/ preliminary mockup: https://docs.google.com/document/d/1zcOt4Xwz3kxbJITM1HuLDK2tUaMpV7QAnDUWjmDV3io/edit?usp=sharing
         * Jim - How is this different from just using an S3 URL?
            * Pete - With the RSAL, there is no way to get a package file out of Dataverse; the plan is to switch to an archive file from S3 (vs. a directory from rsyncd). (An illustrative pre-signed URL sketch follows below.)
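            * As an illustration of the general idea only (not the actual design, which is tracked in the issue and design doc above): a service could hand a large package file to a download manager through a time-limited, pre-signed S3 URL; the bucket and key below are hypothetical:

                 import boto3

                 # Hypothetical bucket and key, for illustration only.
                 s3 = boto3.client("s3")
                 url = s3.generate_presigned_url(
                     "get_object",
                     Params={"Bucket": "dataverse-packages",
                             "Key": "10.5072/FK2/EXAMPLE/package.tar"},
                     ExpiresIn=3600,  # link stays valid for one hour
                 )
                 # A download manager can fetch this URL directly and resume as needed.
                 print(url)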
      * File hierarchy
         * https://github.com/IQSS/dataverse/issues/2249
         * Design document with issue summary and discussion: https://docs.google.com/document/d/1LfZIBnJQBdTseZoryHfTCSv9WewAi7iJoG9bW2Rm-kI/edit?usp=sharing
   * Review efforts underway and efforts planned by the Dataverse Community, Resources available
      * TDL - Added some additional error checking for larger uploads and zips being unpacked. This feature was added in Dataverse 4.9: https://github.com/IQSS/dataverse/issues/4511
         * Apache/Glassfish timeouts
      * TDL - Command-line Java Uploader
         * A separate app delivered as a jar file that uses the Dataverse API
         * For example, `java -cp .\DVUploadClient.jar;. org.sead.acr.client.DVUploader -key=8599b802-659e-49ef-823c-20abd8efc05c -did=doi:10.5072/FK2/TUNNVE -server=https://dataverse-dev.tdl.org -verify test` uploads all of the files in the 'test' directory (relative to the current directory where the java command is run) to the 'Bulk upload testing' dataset (DOI as shown) I created in the dataverse-dev instance at TDL, verifying the transfers[a] by comparing the hash value generated by Dataverse with one generated from the local file. (A rough Python equivalent is sketched below.)
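         * A rough Python equivalent of what the uploader does for a single file, assuming the native "add file to dataset" API and an MD5-configured installation; the server, token, DOI, and filename are placeholders:

              import hashlib
              import requests

              # Placeholder values for illustration only.
              SERVER_URL = "https://dataverse.example.edu"
              API_TOKEN = "xxxxxxxx-xxxx-xxxx-xxxx-xxxxxxxxxxxx"
              PERSISTENT_ID = "doi:10.5072/FK2/EXAMPLE"
              LOCAL_FILE = "data.tar.gz"

              # Upload one file to the dataset via the native API.
              with open(LOCAL_FILE, "rb") as f:
                  resp = requests.post(
                      f"{SERVER_URL}/api/datasets/:persistentId/add",
                      params={"persistentId": PERSISTENT_ID},
                      headers={"X-Dataverse-key": API_TOKEN},
                      files={"file": f},
                  )
              resp.raise_for_status()
              remote_md5 = resp.json()["data"]["files"][0]["dataFile"]["checksum"]["value"]

              # Recompute the checksum locally, in chunks since files may be large,
              # and compare, as the -verify option does.
              md5 = hashlib.md5()
              with open(LOCAL_FILE, "rb") as f:
                  for chunk in iter(lambda: f.read(1 << 20), b""):
                      md5.update(chunk)
              print("verified" if md5.hexdigest() == remote_md5 else "checksum mismatch")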
      * Scholars Portal
         * Swift object storage
         * Globus endpoints
            * Dataverse installation pointing to storage elsewhere
            * seems like relatively close overlap with “data locality”
         * Large file and volume upload / download (DCM)
      * TRSA – trusted remote storage agent
         * https://github.com/IQSS/dataverse/issues/5213
         * http://cyberimpact.us/architecture-overview/
         * Data can be too big or too sensitive, or the storage manager may feel they can control and preserve the data best and so may not release it
         * design (and policy) to avoid “bad links”
         * Pushing variable[b][c] metadata to Dataverse from an application on the local machine
         * “trusted” is an important element
   * Common problems across community
      * HTTP upload middle ground 
      * Download solution (CGI server? S3 solution?)
   * Big data and sensitive data share some of the same requirements, such as remoteness of data. How can we coordinate efforts?
      * This "Dataverse File Access Configurations" doc is potentially relevant https://docs.google.com/document/d/1f2NxOr0WLJSbXDDSehTMyUqMWe1QGbq4N1z-RQ0Jg_4/edit
   * Questions and Next Steps
      * Pros and Cons of S3 move - preservation with Chronopolis
      * Danny - Work with Pete and IQSS team to set up a big data test server
      * Jim - PR for documentation changes for timeouts
      * Danny - How to implement the “middle ground” (e.g. around 10 GB)? Needs testing to see what the current issues are
         * maybe a generalized handoff download CGI would be helpful for non-S3 storage
      * Danny - Google Group, follow-up email with Doodle poll
   * Scalability issues (notes, not questions):
      * Size
         * Glassfish has limitations for large transfers
         * Network and service speed can cause timeouts
         * HTTP versus reliable and/or parallel transfer
         * Beyond that, data can be just too big to move
      * Number of files
         * Dataverse shows performance degradation with thousands of files in a dataset, e.g. for restrict and publish operations
         * Using the upload GUI, it can be hard to see which files have been sent (e.g. relative to a whole directory)
         * Flat views of files versus managing hierarchy
[a]Curious about how this deals with different versions of files?
[b]Sounds like file level
[c]Well, variables are children of files. See http://phoenix.dataverse.org/schemaspy/latest/tables/datatable.html



