Dataverse Big Data - October 18th, 2018
* Participants - https://global.gotomeeting.com/join/264615709
* Danny Brooke (IQSS)
* Pete Meyer (HMS)
* Jim Myers (QDR, TDL)
* Julian Gautier (IQSS)
* Phil Durbin (IQSS)
* Jonathan Crabtree (Odum)
* Courtney Mumma (TDL)
* Tania Schlatter (IQSS)
* Gustavo Durand (IQSS)
* Mike Heppler (IQSS)
* Amber (Scholars Portal)
* Bikram (Scholars Portal)
* Meghan (Scholars Portal)
* Len Wisniewski (IQSS)
* Slava (DANS)
* Sherry Lake (UVa)
* Kevin Condon (IQSS)
* Agenda
* Review efforts underway by the Harvard IQSS Team
* rsync
* Large number of files, large files, direct access to compute
* POSIX storage, not object
* Implemented using the DCM (http://guides.dataverse.org/en/4.9.4/developers/big-data-support.html#data-capture-module-dcm) to get data in and the RSAL (http://guides.dataverse.org/en/4.9.4/developers/big-data-support.html#repository-storage-abstraction-layer-rsal) to get data out (settings sketch below)
* Drawbacks: the files are represented as a single package file (all the files bundled together) that cannot be restricted, and the workflow is designed for UNIX systems, not Windows
* ":PublicInstall" is documented at
http://guides.dataverse.org/en/4.9.4/installation/config.html#publicinstall * Package files are described at
http://guides.dataverse.org/en/4.9.4/user/find-use-data.html#downloading-a-dataverse-package-via-rsync and
http://guides.dataverse.org/en/4.9.4/user/dataset-management.html#rsync-upload * See also "no ingest" as another drawback below.
* https://dataverse.sbgrid.org/dataset.xhtml?persistentId=doi:10.15785/SBGRID/1
* No demo site is available, but the "docker-dcm" code can be used to spin one up today: https://github.com/IQSS/dataverse/tree/develop/conf/docker-dcm. In the future we may set up a demo site.
* Screenshots in "Scalable Options - Dual Mode" at
https://github.com/IQSS/dataverse/issues/4610 * (Pete) forgot to mention that DCM uploads side-step “ingest” of data files
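* Settings sketch (names taken from the 4.9.4 guides linked above, so double-check there): rsync upload is turned on with `curl -X PUT -d 'dcm/rsync+ssh' http://localhost:8080/api/admin/settings/:UploadMethods` plus `curl -X PUT -d 'http://dcm.example.edu' http://localhost:8080/api/admin/settings/:DataCaptureModuleUrl` (DCM URL here is made up), and rsync download via the RSAL with `curl -X PUT -d 'rsal/rsync' http://localhost:8080/api/admin/settings/:DownloadMethods`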
* Next steps -- move to S3. Currently data can get in using the DCM but cannot yet get out; the plan is to make it available through a URL that works with a download manager (see the pre-signed URL sketch below)
* Download package files from S3
* https://github.com/IQSS/dataverse/issues/4949
* Design doc w/ preliminary mockup: https://docs.google.com/document/d/1zcOt4Xwz3kxbJITM1HuLDK2tUaMpV7QAnDUWjmDV3io/edit?usp=sharing
* Jim - How is this different than just using an S3 URL?
* Pete - With the RSAL, there is not a way to get a package file out of Dataverse; will switch to archive file from S3 (vs directory from rsyncd)
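* As a rough sketch of the "URL plus download manager" idea (not an agreed design): once the package file is in S3, a time-limited link could be handed out, e.g. `aws s3 presign s3://my-dataverse-bucket/10.5072/FK2/XXXXXX/package.zip --expires-in 3600` (bucket and key here are placeholders), and the resulting URL fed to any download manager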
* File hierarchy
* https://github.com/IQSS/dataverse/issues/2249
* Design document with issue summary and discussion: https://docs.google.com/document/d/1LfZIBnJQBdTseZoryHfTCSv9WewAi7iJoG9bW2Rm-kI/edit?usp=sharing
* Review efforts underway and efforts planned by the Dataverse Community, Resources available
* TDL - Added some additional error checking for larger uploads and zips being unpacked. This feature was added in Dataverse 4.9: https://github.com/IQSS/dataverse/issues/4511
* apache:glassfish timeouts
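* For example (a hedged sketch, not necessarily TDL's exact change): when Apache proxies to Glassfish, raising `ProxyTimeout 600` (and the matching `Timeout` directive) in the Apache vhost keeps long uploads and zip unpacking from being cut off; the Glassfish-side request timeout may need a similar bump depending on the setup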
* TDL - Command-line Java Uploader
* A separate app delivered as a jar file that uses the Dataverse API
* For example, `java -cp .\DVUploadClient.jar;. org.sead.acr.client.DVUploader -key=8599b802-659e-49ef-823c-20abd8efc05c -did=doi:10.5072/FK2/TUNNVE -server=https://dataverse-dev.tdl.org -verify test` would upload all of the files in the 'test' directory (relative to the current directory where the java command is run) to the 'Bulk upload testing' dataset (DOI as shown) I created in the dataverse-dev instance of TDL, verifying the transfers[a] by comparing the hash value generated by Dataverse with one generated from the local file.
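* Note that the `;` classpath separator in the example above is Windows syntax; on Linux/macOS the equivalent would be `java -cp DVUploadClient.jar:. org.sead.acr.client.DVUploader -key=<api-token> -did=<dataset-doi> -server=https://dataverse-dev.tdl.org -verify test`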
* Scholars Portal
* SWIFT object storage
* Globus endpoints
* Dataverse installation pointing to storage elsewhere
* seems like relatively close overlap w/ “data locality”
* Large file and volume upload / download (DCM)
* TRSA – trusted remote storage agent
* https://github.com/IQSS/dataverse/issues/5213
* http://cyberimpact.us/architecture-overview/
* Data can be too big or too sensitive, or the storage manager may feel they can control and preserve the data best and may not release the data
* design (and policy) to avoid “bad links”
* Pushing variable[b][c] metadata to Dataverse from an application on the local machine
* “trusted” is an important element
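* As an illustration of the metadata-push idea above (a sketch using the existing native API, not a TRSA design decision): an agent on the local machine could create the dataset record without moving the data, e.g. `curl -H "X-Dataverse-key:$API_TOKEN" -X POST "$SERVER/api/dataverses/$ALIAS/datasets" --upload-file dataset-metadata.json`; this only covers dataset-level metadata, and variable-level metadata plus how the remote files get referenced (rather than uploaded) remain the open questions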
* Common problems across community
* HTTP upload middle ground
* Download solution (cgi server? S3 solution)
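* For context on the HTTP path that exists today (a baseline sketch, not the proposed middle ground): files can already be added over the native API with `curl -H "X-Dataverse-key:$API_TOKEN" -X POST -F "file=@data.zip" "$SERVER/api/datasets/:persistentId/add?persistentId=doi:10.5072/FK2/XXXXXX"` (placeholder DOI); the middle-ground question is how far this scales (e.g. ~10 GB files) before the apache/glassfish timeout issues noted above bite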
* Big data and sensitive data share some of the same requirements, like remoteness of data. How can we coordinate efforts?
* This "Dataverse File Access Configurations" doc is potentially relevant
https://docs.google.com/document/d/1f2NxOr0WLJSbXDDSehTMyUqMWe1QGbq4N1z-RQ0Jg_4/edit * Questions and Next Steps
* Pros and Cons of S3 move - preservation with Chronopolis
* Danny - Work with Pete and IQSS team to set up a big data test server
* Jim - PR for documentation changes for timeouts
* Danny - How to implement the “middle ground” like 10 GB - needs testing to see what the current issues are
* maybe a generalized handoff download CGI would be helpful for non-S3
* Danny - Google Group, Follow up email with doodle
* Scalability issues (notes, not questions):
* Size
* Glassfish has limitations for large transfers
* Network and service speed can cause timeouts
* HTTP versus reliable and/or parallel transfer
* Beyond that, data can be just too big to move
* Number of files
* Dataverse has performance degradation for 1000’s of files in a Dataset, e.g. for restrict, publish operations
* Using the upload GUI, it can be hard to see which files have been sent (e.g. relative to a whole directory)
* Flat views of files versus managing hierarchy
[a]Curious about how this deals with different versions of files?
[b]file level; sounds like
[c]Well, variables are children of files. See http://phoenix.dataverse.org/schemaspy/latest/tables/datatable.html