--
You received this message because you are subscribed to the Google Groups "Dataverse Users Community" group.
To unsubscribe from this group and stop receiving emails from it, send an email to dataverse-commu...@googlegroups.com.
To view this discussion on the web visit https://groups.google.com/d/msgid/dataverse-community/c680d0d1-1960-4428-bc3a-0a795dd32433%40googlegroups.com.
(just read your latest as I’m sending this…)
Gerrick,
Your thinking is spot on. I think there are basically two areas to consider when files live in external storage and Dataverse only references them.
The first is access control and ownership. Dataverse limits file access prior to publication and allows users to restrict files and/or require acceptance of access terms post-publication and to document use (Make Data Count). Dataverse is also used by organizations intending to keep data accessible long-term once it’s published. For these reasons, storing data in a separate system with its own access control presents issues. Some are ameliorated if the same org owns the storage and can guarantee it stays around, but something (tech and/or policy) still has to be done to, for example, stop users from deleting external files after publication. The TRSA and Globus work Phil mentions is addressing these issues. The S3 work I’m doing for TDL, which extends the existing ability of Dataverse to manage files in a (potentially remote) S3 store, tries to avoid them by letting Dataverse keep control while still allowing users (e.g. those at a computing center) to stream their data directly from the center’s storage into an S3 bucket at the center.
The other area of concern is that Dataverse has functionality that requires touching the file bits: it can inspect the file to determine the mimetype, extract metadata from some file types, create derived files from tabular formats, unzip zip files to manage the content files individually, full-text index files, preview files, check fixity with hashes, create thumbnails, track downloads, gather info in guestbooks, etc. It’s not clear that all of these make sense for large files, but a general solution has to either find a way to implement this functionality (e.g. by having the remote store handle it) or turn it off. In the latter case, since this type of functionality is very useful, turning it off probably needs to be optional (with a size cut-off or some way for admins/users to decide which datasets/files are treated which way). While I’m less sure how this is being handled across the current dev efforts, I think you’ll see people trying different trade-offs between making the remote stores simple (by turning things off) and trying to implement things within/near the store.
For the TDL work, I think the plan will be to allow Dataverse to run with more than one store at a time: smaller files go in a store where Dataverse provides all of its current functionality, and larger files go in one or more stores that still control access and track access but avoid operations that require Dataverse to read all of the data. (Details are TBD, as we just finished a proof-of-concept as noted in https://groups.google.com/forum/#!forum/dataverse-big-data; feedback/ideas on how to make a general solution(s) are welcome there.)
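To make the multi-store idea concrete, here is a rough sketch of size-based routing. It is illustrative only and is not how Dataverse or the TDL proof-of-concept is implemented: the `Store` class, the store names, the 2 GiB cut-off, and the `force_large` override are all hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Store:
    """Hypothetical description of a storage location."""
    name: str
    # True if Dataverse reads the file bytes for this store (mimetype
    # detection, indexing, thumbnails, etc.); False if it only controls
    # access and tracks access.
    full_processing: bool


# Hypothetical stores: one for smaller files, one for large files.
SMALL_STORE = Store(name="default", full_processing=True)
LARGE_STORE = Store(name="large-s3", full_processing=False)

# Assumed cut-off; a real deployment would make this configurable.
SIZE_CUTOFF_BYTES = 2 * 1024**3


def choose_store(size_bytes: int,
                 cutoff: int = SIZE_CUTOFF_BYTES,
                 force_large: bool = False) -> Store:
    """Pick a store by file size, with an optional admin/user override."""
    if force_large or size_bytes > cutoff:
        return LARGE_STORE
    return SMALL_STORE
```

With these assumptions, `choose_store(10 * 1024**2)` picks the full-featured store, while anything over the cut-off (or anything flagged with `force_large=True`) lands in the store where byte-level processing is skipped.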
-- Jim