First Meeting Notes

danny...@g.harvard.edu

Oct 19, 2018, 3:14:13 PM
to Dataverse Big Data


------


Dataverse Big Data - October 18th, 2018

* Participants  - https://global.gotomeeting.com/join/264615709 
   * Danny Brooke (IQSS)
   * Pete Meyer (HMS)
   * Jim Myers (QDR, TDL)
   * Julian Gautier (IQSS)
   * Phil Durbin (IQSS)
   * Jonathan Crabtree (Odum)
   * Courtney Mumma (TDL)
   * Tania Schlatter (IQSS)
   * Gustavo Durand (IQSS)
   * Mike Heppler (IQSS)
   * Amber (Scholars Portal)
   * Bikram (Scholars Portal)
   * Meghan (Scholars Portal)
   * Len Wisniewski (IQSS)
   * Slava (DANS)
   * Sherry Lake (UVa)
   * Kevin Condon (IQSS)
* Agenda
   * Review efforts underway by the Harvard IQSS Team 
      * rsync
         * Large number of files, large files, direct access to compute
         * POSIX storage, not object
         * Implemented using the DCM to get data in ( http://guides.dataverse.org/en/4.9.4/developers/big-data-support.html#data-capture-module-dcm ) and RSAL ( http://guides.dataverse.org/en/4.9.4/developers/big-data-support.html#repository-storage-abstraction-layer-rsal ) to get data out 
         * Drawbacks: the files are represented as a single package file (all the files together) that cannot be restricted, and the approach is designed for UNIX systems, not Windows
            * ":PublicInstall" is documented at http://guides.dataverse.org/en/4.9.4/installation/config.html#publicinstall
            * Package files are described at http://guides.dataverse.org/en/4.9.4/user/find-use-data.html#downloading-a-dataverse-package-via-rsync and http://guides.dataverse.org/en/4.9.4/user/dataset-management.html#rsync-upload 
            * See also "no ingest" as another drawback below.
         * https://dataverse.sbgrid.org/dataset.xhtml?persistentId=doi:10.15785/SBGRID/1 
         * No demo site is available yet, but the "docker-dcm" code can be used to spin one up today: https://github.com/IQSS/dataverse/tree/develop/conf/docker-dcm . In the future we may set up a demo site.
         * Screenshots in "Scalable Options - Dual Mode" at  https://github.com/IQSS/dataverse/issues/4610
         * (Pete) forgot to mention that DCM uploads side-step “ingest” of data files
         * Next steps: move to S3. Currently data can be brought in using the DCM but cannot yet be brought out; downloads will go through a URL using a download manager. (A rough sketch of the API call that starts a DCM upload follows below.)
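         * For orientation, a minimal sketch of how a client might ask a DCM-enabled installation for the rsync upload script via the native API. The server URL, dataset ID, and API token are placeholders, and the exact path and HTTP method should be checked against the big data support guide linked above (a GET on .../dataCaptureModule/rsync is assumed here):
            import java.net.URI;
            import java.net.http.HttpClient;
            import java.net.http.HttpRequest;
            import java.net.http.HttpResponse;

            public class RequestRsyncScript {
                public static void main(String[] args) throws Exception {
                    String server = "https://dataverse.example.edu";          // placeholder installation
                    String datasetId = "42";                                  // placeholder database ID
                    String apiToken = "xxxxxxxx-xxxx-xxxx-xxxx-xxxxxxxxxxxx"; // placeholder API token

                    // Ask the installation (which talks to its DCM) for the rsync upload script.
                    HttpRequest request = HttpRequest.newBuilder()
                            .uri(URI.create(server + "/api/datasets/" + datasetId + "/dataCaptureModule/rsync"))
                            .header("X-Dataverse-key", apiToken)
                            .GET()
                            .build();

                    HttpResponse<String> response = HttpClient.newHttpClient()
                            .send(request, HttpResponse.BodyHandlers.ofString());

                    // The body is the script a depositor runs locally to rsync files to the DCM.
                    System.out.println(response.body());
                }
            }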
      * Download package files from S3
         * https://github.com/IQSS/dataverse/issues/4949
         * Design doc w/ preliminary mockup: https://docs.google.com/document/d/1zcOt4Xwz3kxbJITM1HuLDK2tUaMpV7QAnDUWjmDV3io/edit?usp=sharing (a hedged pre-signed URL sketch follows after this sub-list)
         * Jim - How is this different from just using an S3 URL?
            * Pete - With the RSAL, there is no way to get a package file out of Dataverse; this will switch to an archive file from S3 (vs. a directory from rsyncd)
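         * For illustration only, a hedged sketch (AWS SDK for Java v1) of handing out a time-limited pre-signed URL for a package file stored in S3. The bucket and object key are made up, and this is not necessarily how issue #4949 will be implemented:
            import com.amazonaws.HttpMethod;
            import com.amazonaws.services.s3.AmazonS3;
            import com.amazonaws.services.s3.AmazonS3ClientBuilder;
            import com.amazonaws.services.s3.model.GeneratePresignedUrlRequest;
            import java.net.URL;
            import java.util.Date;

            public class PackagePresignedUrl {
                public static void main(String[] args) {
                    String bucket = "example-dataverse-files";       // hypothetical bucket
                    String key = "10.5072/FK2/EXAMPLE/package.tar";  // hypothetical package object

                    AmazonS3 s3 = AmazonS3ClientBuilder.defaultClient();

                    // Valid for one hour; a browser or download manager can fetch it straight from S3.
                    Date expiration = new Date(System.currentTimeMillis() + 60L * 60L * 1000L);
                    GeneratePresignedUrlRequest request = new GeneratePresignedUrlRequest(bucket, key)
                            .withMethod(HttpMethod.GET)
                            .withExpiration(expiration);

                    URL url = s3.generatePresignedUrl(request);
                    System.out.println(url);
                }
            }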
      * File hierarchy
         * https://github.com/IQSS/dataverse/issues/2249
         * Design document with issue summary and discussion: https://docs.google.com/document/d/1LfZIBnJQBdTseZoryHfTCSv9WewAi7iJoG9bW2Rm-kI/edit?usp=sharing
   * Review efforts underway and efforts planned by the Dataverse Community, Resources available
      * TDL - Added some additional error checking for larger uploads and zips being unpacked. This feature was added in Dataverse 4.9: https://github.com/IQSS/dataverse/issues/4511 
         * Apache/Glassfish timeouts
      * TDL - Command-line Java Uploader
         * A separate app delivered as a jar file that uses the Dataverse API
         * For example, `java -cp .\DVUploadClient.jar;. org.sead.acr.client.DVUploader -key=8599b802-659e-49ef-823c-20abd8efc05c -did=doi:10.5072/FK2/TUNNVE -server=https://dataverse-dev.tdl.org -verify test` would upload all of the files in the 'test' directory (relative to the current directory where the java command is run) to the 'Bulk upload testing' dataset (DOI as shown) created in TDL's dataverse-dev instance, verifying the transfers[a] by comparing the hash value generated by Dataverse with one generated from the local file (see the checksum sketch below).
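         * As a side note, a minimal sketch of the local half of that verification, i.e. computing an MD5 checksum for a file before comparing it with the value Dataverse reports. The file path is a placeholder; MD5 is Dataverse's default fixity algorithm, though installations can configure others:
            import java.io.InputStream;
            import java.nio.file.Files;
            import java.nio.file.Path;
            import java.security.DigestInputStream;
            import java.security.MessageDigest;

            public class LocalChecksum {
                public static void main(String[] args) throws Exception {
                    Path file = Path.of("test/data.csv");  // hypothetical local file

                    // Stream the file through the digest so large files are not read into memory at once.
                    MessageDigest md5 = MessageDigest.getInstance("MD5");
                    try (InputStream in = new DigestInputStream(Files.newInputStream(file), md5)) {
                        byte[] buffer = new byte[8192];
                        while (in.read(buffer) != -1) {
                            // reading drives the digest; nothing to do per chunk
                        }
                    }

                    StringBuilder hex = new StringBuilder();
                    for (byte b : md5.digest()) {
                        hex.append(String.format("%02x", b));
                    }

                    // Compare this value with the checksum shown by Dataverse for the uploaded file.
                    System.out.println(hex);
                }
            }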
      * Scholars Portal
         * SWIFT object storage 
         * Globus endpoints
            * Dataverse installation pointing to storage elsewhere
            * Seems like relatively close overlap w/ "data locality"
         * Large file and volume upload / download (DCM)
      * TRSA – trusted remote storage agent
         * https://github.com/IQSS/dataverse/issues/5213
         * http://cyberimpact.us/architecture-overview/ 
         * Data can be too big or too sensitive to move, or the storage manager may feel they can control and preserve the data best and may not release it
         * design (and policy) to avoid “bad links”
         * Pushing variable[b][c] metadata to Dataverse from an application on the local machine (a rough API sketch follows below)
         * "Trusted" is an important element
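         * Purely as a sketch of the general direction, one way an application on the local machine can push metadata today is the existing native API's create-dataset call; whether TRSA would use this path, and what the metadata body would contain, are open design questions. The server, collection alias, token, and dataset-metadata.json file are placeholders:
            import java.net.URI;
            import java.net.http.HttpClient;
            import java.net.http.HttpRequest;
            import java.net.http.HttpResponse;
            import java.nio.file.Path;

            public class PushMetadata {
                public static void main(String[] args) throws Exception {
                    String server = "https://dataverse.example.edu";          // placeholder installation
                    String dataverseAlias = "trsa-demo";                      // placeholder collection alias
                    String apiToken = "xxxxxxxx-xxxx-xxxx-xxxx-xxxxxxxxxxxx"; // placeholder API token

                    // dataset-metadata.json: a native-API dataset JSON describing the remotely stored data.
                    HttpRequest request = HttpRequest.newBuilder()
                            .uri(URI.create(server + "/api/dataverses/" + dataverseAlias + "/datasets"))
                            .header("X-Dataverse-key", apiToken)
                            .header("Content-Type", "application/json")
                            .POST(HttpRequest.BodyPublishers.ofFile(Path.of("dataset-metadata.json")))
                            .build();

                    HttpResponse<String> response = HttpClient.newHttpClient()
                            .send(request, HttpResponse.BodyHandlers.ofString());

                    System.out.println(response.statusCode());
                    System.out.println(response.body());
                }
            }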
   * Common problems across community
      * HTTP upload middle ground  
      * Download solution (CGI server? S3 solution)
   * Big data and sensitive data share some of the same requirements, like remoteness of data. How can we coordinate efforts?
      * This "Dataverse File Access Configurations" doc is potentially relevant https://docs.google.com/document/d/1f2NxOr0WLJSbXDDSehTMyUqMWe1QGbq4N1z-RQ0Jg_4/edit 
   * Questions and Next Steps
      * Pros and Cons of S3 move - preservation with Chronopolis
      * Danny - Work with Pete and IQSS team to set up a big data test server
      * Jim - PR for documentation changes for timeouts
      * Danny - How to implement the “middle ground” like 10 GB - needs testing to see what the current issues are
         * Maybe a generalized handoff download CGI would be helpful for non-S3 storage (see the redirect sketch below)
      * Danny - Google Group, follow-up email with Doodle
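      * On the "generalized handoff download" idea, a toy sketch of what a handoff endpoint could do: answer a download request with a redirect to wherever the bytes actually live (an S3 pre-signed URL, an rsyncd mirror, etc.). The port, path, and lookup are all hypothetical:
         import com.sun.net.httpserver.HttpServer;
         import java.net.InetSocketAddress;

         public class HandoffDownload {
             public static void main(String[] args) throws Exception {
                 HttpServer server = HttpServer.create(new InetSocketAddress(8085), 0);

                 // /handoff/<fileId> answers with a redirect to the file's current storage location.
                 server.createContext("/handoff/", exchange -> {
                     String fileId = exchange.getRequestURI().getPath().substring("/handoff/".length());
                     String target = lookupStorageUrl(fileId);  // hypothetical lookup
                     exchange.getResponseHeaders().add("Location", target);
                     exchange.sendResponseHeaders(302, -1);
                     exchange.close();
                 });
                 server.start();
             }

             // Hypothetical: map a file identifier to wherever its bytes are stored.
             static String lookupStorageUrl(String fileId) {
                 return "https://storage.example.edu/files/" + fileId;
             }
         }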
   * Scalability issues (notes, not questions):
      * Size
         * Glassfish has limitations for large transfers
         * Network and service speed can cause timeouts
         * HTTP versus reliable and/or parallel transfer
         * Beyond that, data can be just too big to move
      * Number of files
         * Dataverse has performance degradation with 1000s of files in a dataset, e.g. for restrict and publish operations
         * Using the upload GUI, it can be hard to see which files have been sent (e.g. relative to a whole directory); see the file-listing sketch below
         * Flat views of files versus managing hierarchy
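         * Related to the "hard to see which files have been sent" point, a small sketch of listing a dataset's files through the native API so a local directory can be diffed against what the server already has. The server, DOI, and token are placeholders:
            import java.net.URI;
            import java.net.http.HttpClient;
            import java.net.http.HttpRequest;
            import java.net.http.HttpResponse;

            public class ListDatasetFiles {
                public static void main(String[] args) throws Exception {
                    String server = "https://dataverse.example.edu";          // placeholder installation
                    String doi = "doi:10.5072/FK2/EXAMPLE";                   // placeholder dataset DOI
                    String apiToken = "xxxxxxxx-xxxx-xxxx-xxxx-xxxxxxxxxxxx"; // placeholder API token

                    // Files in the latest version of the dataset, as JSON; diff against the local listing.
                    HttpRequest request = HttpRequest.newBuilder()
                            .uri(URI.create(server + "/api/datasets/:persistentId/versions/:latest/files?persistentId=" + doi))
                            .header("X-Dataverse-key", apiToken)
                            .GET()
                            .build();

                    HttpResponse<String> response = HttpClient.newHttpClient()
                            .send(request, HttpResponse.BodyHandlers.ofString());

                    System.out.println(response.body());
                }
            }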
[a] Curious about how this deals with different versions of files?
[b] File level, it sounds like
[c] Well, variables are children of files. See http://phoenix.dataverse.org/schemaspy/latest/tables/datatable.html