Round tripping the contents of DVN


Joerg Messer

May 14, 2013, 7:30:57 PM
to dataverse...@googlegroups.com
Greetings,

I was wondering how one would go about importing, exporting and re-importing data into DVN in such a way that no data is lost.  I explored the 'export' facility and it didn't seem to export all the info although I'm not quite sure why.  For one thing, it seems to be missing the text for the file labels which could easily be retrieved from the database.  It would really be nice to archive the entire contents of a DVN in such a way that it could be restored with ease.  This would do wonders for general peace of mind.  What I would really like to see is something along the lines of the DSpace AIP or itemimport/itemexport capability.  Does your road map contain anything along these lines?

//Joerg Messer (UBC)

Philip Durbin

May 15, 2013, 8:14:04 AM
to dataverse...@googlegroups.com
Hi Joerg,

When I think of exporting the contents of a DVN installation I think
of backing up the database, study files, local customizations, etc. as
described at http://guides.thedata.org/book/i-backup-and-restore

Then, if you zip it all up into a single .tar.gz or .zip you would
have the entire contents, ready for re-importing.

It sounds like you have something else in mind, however... You
mentioned ease of restore...

Do you have a link that describes the DSpace equivalent you're familiar with?

Phil
> --
> You received this message because you are subscribed to the Google Groups
> "Dataverse Users Community" group.
> To unsubscribe from this group and stop receiving emails from it, send an
> email to dataverse-commu...@googlegroups.com.
> For more options, visit https://groups.google.com/groups/opt_out.
>
>



--
Philip Durbin
Software Developer for http://thedata.org
http://www.iq.harvard.edu/people/philip-durbin

Joerg Messer

May 15, 2013, 2:18:50 PM
to dataverse...@googlegroups.com, philip...@harvard.edu
Philip,

I was thinking something more along the lines of the DSpace AIP facility.  It provides the ability to secure the entire repository (data + metadata + structure) in a portable format.  AIP would be ideal but even the older DSpace Simple Archive Format would be useful.  Both are much more portable than a database table dump and allow the repository contents to be easily re-imported.  This is useful for backup, archiving for preservation and migration to another DVN or another repository system entirely.

The DSpace tools are also available from the command line which facilitates system administration through scripting.  Are you planning on exposing your admin interface in this way?  Maybe through a REST interface? 

https://wiki.duraspace.org/display/DSDOC18/AIP+Backup+and+Restore
https://wiki.duraspace.org/display/DSDOC3x/Importing+and+Exporting+Items+via+Simple+Archive+Format
http://www.slideshare.net/tdonohue/improving-dspace-backups-restores-migrations

Philip Durbin

May 15, 2013, 3:13:22 PM
to Joerg Messer, dataverse...@googlegroups.com
Thanks for the links. Very interesting.

http://thedata.org/book/upcoming-releases says "Initial implementation
of Data Deposit API, with SWORD 2 compliance" and from a look at
http://en.wikipedia.org/wiki/SWORD_%28protocol%29 it seems like this
might help with interoperability in general...

I gather that you're a fan of AIP ("Archival Information Packages"),
which is new to me. Are you really asking for AIP support? Is it a
common format?

Joerg Messer

May 15, 2013, 7:23:16 PM
to dataverse...@googlegroups.com, Joerg Messer, philip...@harvard.edu
Philip,

I believe AIP is part of the OAIS digital preservation standard.  We have an ongoing project to allow our DSpace AIP packages to feed directly into the Artefactual Archivematica preservation tool.  This is a core part of our UBC Library digital preservation strategy.  We consider long term preservation essential for our digital assets and would consider this a very important DVN feature if you folks decide to pursue it.

https://en.wikipedia.org/wiki/Open_Archival_Information_System
https://www.archivematica.org/wiki/Main_Page

Joerg Messer

May 15, 2013, 7:39:38 PM
to dataverse...@googlegroups.com, Joerg Messer, philip...@harvard.edu

BTW, SWORD is very welcome.  Nice to see.

Stephen Marks

May 15, 2013, 10:07:25 PM
to dataverse...@googlegroups.com, Joerg Messer, philip...@harvard.edu
For what it's worth, portable archival packages are a very desirable feature for us too. We love Dataverse, but as memory institutions, we have to guard against the day when the project is no longer maintained and the content has to be migrated elsewhere.

s




Merce Crosas

May 15, 2013, 10:36:23 PM
to dataverse...@googlegroups.com, Joerg Messer, Philip Durbin
I agree. As you know, we already use OAI and LOCKSS to export multiple copies of the data and metadata. To keep the entire application data to reproduce the structure in another Dataverse Network, we need to do additional work, and decide the best approach. It's a good idea to design it through a REST API.
--
Mercè Crosas, Ph.D.
Director of Data Science, IQSS
Harvard University



Philip Durbin

May 16, 2013, 6:56:32 AM
to dataverse...@googlegroups.com, Joerg Messer
Is anyone aware of a REST API that does this already? Is there an API
we could look at as an example, one that's geared toward preserving
the contents of an entire application? We're talking about the
equivalent of a database dump and zipping up files on the file system,
right?

This conversation reminds me of the developers of ThinkUp asking, "Can
an app's architecture prevent it from becoming a silo?" at
https://plus.google.com/+GinaTrapani/posts/USwCuJj9LFq and
http://branch.com/b/can-an-app-s-architecture-prevent-it-from-becoming-a-silo

Stephen Marks

May 16, 2013, 8:19:47 AM
to dataverse...@googlegroups.com
Sort of. The idea is that the package would be self-contained (that is, would contain both data files and metadata) for everything at a given level of aggregation. Probably the easiest and most useful level would be at the study level, but I could also see a scenario where you might want to aggregate at the dataverse or collection level. A dump of the full application would be, as you say, a backup. It could be useful, but would likely require more processing afterwards if the user wanted to isolate specific studies for migration.

To be clear, the AIP is just a concept that means that everything needed to interpret the data is kept logically together with the data, and that it can be verified as being whole and integral. Most often, this means something as simple as the fact that they are stored in a directory together. An example of a very simple AIP for a study with a single file might look like this:

<directory>
|
+- manifest.txt - a list of all the files in the package with checksums
|
+- studymetadata.xml - likely the DDI metadata for the study
|
+- datafile.sav - the actual data file itself.
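
As a rough illustration of the manifest idea above, here is a minimal Python sketch that writes and later re-checks a `manifest.txt` for such a directory. It is not a DVN or Archivematica feature: the function names, the choice of SHA-256, and the "checksum, two spaces, relative path" line format are all assumptions made for the example.

```python
import hashlib
from pathlib import Path


def sha256_of(path: Path) -> str:
    """Return the hex SHA-256 digest of a file, reading it in chunks."""
    digest = hashlib.sha256()
    with path.open("rb") as handle:
        for chunk in iter(lambda: handle.read(65536), b""):
            digest.update(chunk)
    return digest.hexdigest()


def write_manifest(package_dir: Path, manifest_name: str = "manifest.txt") -> Path:
    """List every file in the package (except the manifest itself) with its checksum."""
    manifest = package_dir / manifest_name
    lines = []
    for path in sorted(package_dir.rglob("*")):
        if path.is_file() and path != manifest:
            relative = path.relative_to(package_dir).as_posix()
            # Assumed line format: "<sha256>  <relative path>"
            lines.append(f"{sha256_of(path)}  {relative}")
    manifest.write_text("\n".join(lines) + "\n", encoding="utf-8")
    return manifest


def verify_manifest(package_dir: Path, manifest_name: str = "manifest.txt") -> bool:
    """Return True when every listed file exists and still matches its checksum."""
    manifest = package_dir / manifest_name
    for line in manifest.read_text(encoding="utf-8").splitlines():
        if not line.strip():
            continue  # ignore blank lines
        expected, relative = line.split("  ", 1)
        target = package_dir / relative
        if not target.is_file() or sha256_of(target) != expected:
            return False
    return True
```

Calling `write_manifest(Path("study-package"))` at export time and `verify_manifest(Path("study-package"))` after a transfer is one way to check that the package is "whole and integral" in the sense described above; the directory name here is only a placeholder.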

A good starting point for understanding how archival software handles this might be to look at the Archivematica project:
https://www.archivematica.org/wiki/Main_Page
It's a piece of archival processing software designed to conform to the OAIS, and it's well documented and produces AIPs that are nice and well-formed.

I'd be happy to jump on a call sometime next week to talk about this if it would be helpful to get a library perspective on it. I'm fairly familiar with OAIS so it should be fairly painless. Joerg (and Alex?) probably have a lot to say too.

Sorry, I'd talk about this in IRC, but I am at a conference all this week.

s


Merce Crosas

May 16, 2013, 9:41:20 AM
to dataverse...@googlegroups.com
To be sure - is the goal here to have a self-contained preservation copy of the data and metadata (which we almost have, except for the manifest file), or to have an export/import tool to take the contents from one Dataverse Network to another?

Merce

Stephen Marks

May 16, 2013, 10:07:17 AM
to dataverse...@googlegroups.com
For me, both. Although I would amend the second statement to say "take the content to another DVN or another system". But I would think of these two as being the same thing.

Looking back, I realize this has diverged a bit from Joerg's original question. I think the idea is the same: an exportable package that is fully self-contained and self-describing, but I am hesitant to press further as I may be misrepresenting his question. It also sounds like I should take a look at the LOCKSS interface. Though we do not use LOCKSS, it may be that I can get what I am looking for out of the packages being harvested by the LOCKSS network.

Philip Durbin

May 16, 2013, 10:27:14 AM
to dataverse...@googlegroups.com
I'm a fan of Google's Data Liberation Front, which reminds me of what
you're talking about, Stephen. They want you to be able to get your
data out of their products: http://www.dataliberation.org

What Google doesn't guarantee is that the zip file you download, the
"takeout" of your data (e.g. Google+ posts), can be easily imported
into some other system.

When the liberation strategy involves a standard protocol (e.g. IMAP)
and the data is relatively simple (e.g. email messages), it's easier
to move from one system to another. DVN feels quite a bit more
specialized than, say, email.

Also, Stephen, ruebot and I agree with you:
http://irclog.iq.harvard.edu/dvn/2013-05-16 :)

Philip Durbin

May 16, 2013, 10:54:47 AM
to dataverse...@googlegroups.com
I'm interested in hearing more from Joerg as well.

Check this out, via #code4lib:

"The Repository Exchange Package (RXP) is a hierarchical packaging
format designed to facilitate the exchange of Archival Information
Packages (AIPs) between digital repositories." --
http://wiki.fcla.edu/TIPR/21


Stephen Marks

May 16, 2013, 12:10:42 PM
to dataverse...@googlegroups.com
I'll just say one last thing and hope I'm not being pedantic. =)

The Google Data Lib Front is a very good comparison. The only difference is where the GDLF pitches it as a "customer service" feature, having a clear migration path is an essential feature to libraries/archives. It's important to us not to have content get tied up in platforms, as it's incumbent on us to plan for the long term.

You're right that being able to export is no guarantee of being able to import elsewhere, but that's why we try to push for standards, and I think that's at the heart of the request that kicked off this thread originally. (Again, not speaking for others.) In the absence of standards, we will settle for well-structured, predictable data packages because at least we know we can parse through them reliably.



Stephen Marks

May 16, 2013, 12:13:24 PM
to dataverse...@googlegroups.com
Also want to acknowledge that the architecture of Dataverse means that generating these packages ourselves through database dumping and transferring files is a possibility already, so that is a good thing. But official support is a good thing too. =)

s


Merce Crosas

May 16, 2013, 1:04:04 PM
to dataverse...@googlegroups.com
Agreed. Preservation and platform independence are important. For us, part of what makes the Dataverse a good (and widely used) data sharing and archival platform is the support of standards, so definitely we want to continue pushing in the standards direction as much as possible.

Philip Durbin

May 17, 2019, 11:57:45 AM
to dataverse...@googlegroups.com
I know this thread is older than some of our kids but I'm bringing it back from the dead because of a very interesting post by Stefan Kasberger, pyDataverse author/hacker from AUSSDA at https://github.com/AUSSDA/pyDataverse/issues/5#issuecomment-493431015 which I'll include below. For context, I'll include my question from that issue as well. Feedback and thoughts are very welcome!

# docs for "DVTree" (Dataverse Tree) format

## Phil's question

As I mentioned at https://github.com/IQSS/dataverse/issues/5235#issuecomment-492875277 I'm curious if the "DVTree" (Dataverse Tree) format could be used to upload sample data to a brand new Dataverse installation for use in demos and usability testing.

I would love to see some docs. Or a pointer to the code for now. Thanks! :)

## Stefan's answer

The structure is not defined yet; it's just a rough idea I had, inspired by @petermr's CTree structure. I definitely want to talk with some of the Dataverse devs about whether the idea would work, both right now and in the long run. The general idea is that the filenames and the folder structure alone tell you what should or can be inside and how to treat each kind of file. For example, every dataverse folder must have a metadata file, and the same goes for datasets. The content of the metadata file does not have to be strictly defined, but it will most likely have mandatory attributes. The result is a local export that is independent of operating system and programming language, and that humans can also read.

Here is my first draft of the structure:

Naming Conventions:

- Dataverse: dv_IDENTIFIER, prefix dv_, id = alias
- Dataset: ds_IDENTIFIER, prefix ds_, id = id
- Datafile: FILENAME

├── dv_harvard/
│   ├── metadata.json
│   └── dv_iqss/
│       ├── metadata.json
│       └── ds_microcensus-2018/
│           ├── metadata.json
│           └── datafiles/
│               ├── documentation.pdf
│               └── data.csv
└── dv_aussda/
    └── ds_survey-labour-2016/
        ├── metadata.json
        └── datafiles/
            ├── docs.pdf
            └── data.tsv
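
To make the "names and structure carry the meaning" idea concrete, here is a small Python sketch that classifies entries in such a tree purely from the draft naming conventions and reports dataverse or dataset folders that lack a `metadata.json`. DVTree is only a draft idea in this thread, so nothing here is a pyDataverse API; `classify` and `missing_metadata` are hypothetical helper names.

```python
from pathlib import Path
from typing import List


def classify(path: Path) -> str:
    """Classify a DVTree entry from its name and location alone."""
    if path.is_dir() and path.name.startswith("dv_"):
        return "dataverse"
    if path.is_dir() and path.name.startswith("ds_"):
        return "dataset"
    if path.is_file() and path.parent.name == "datafiles":
        return "datafile"
    if path.is_file() and path.name == "metadata.json":
        return "metadata"
    return "unknown"


def missing_metadata(root: Path) -> List[str]:
    """Return dataverse/dataset folders (relative to root) without a metadata.json."""
    problems = []
    for path in sorted(root.rglob("*")):
        if classify(path) in ("dataverse", "dataset"):
            if not (path / "metadata.json").is_file():
                problems.append(path.relative_to(root).as_posix())
    return problems
```

A check like `missing_metadata(Path("export"))` (with `export` as a placeholder directory) could run before any upload step, so that a tree violating the "every dataverse and dataset folder has a metadata file" rule is rejected early.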

Some open questions:

- are the file naming conventions compatible? e.g. is it always okay/possible to convert the dataverse alias to a filename string and store it on every operating system?
- is the filename the best identifier for the datafiles? or is its hash better?
- how to handle versioning? does a DVTree hold only one version, or should there be another level of folders, like v1/?
- do we need to separate metadata into 1) general metadata and 2) metadata for API upload (add api.json or so)?



James Myers

May 17, 2019, 1:01:51 PM
to dataverse...@googlegroups.com

FWIW: The OAI_ORE metadata export and/or the BagIt archiving capabilities might play in this as well. The OAI_ORE file is ‘complete’ as far as I know with one exception. ‘Complete’ means it has all of the metadata entered via the GUI or API (but not metadata generated during ingest which, at least prior to the Curation tool from Scholars Portal, would be recreated during a round trip upload into another Dataverse). The one exception is provenance – I’ve recently added the free-text provenance to the OAI_ORE (PR tbd) but getting access to any auxiliary provenance file was hard enough given the current design that I skipped it for v1. The BagIt zip contains all files and, after the recent DV updates to show the directory hierarchy, I’ve made an update to make the /data directory in the Bag use the directory hierarchy (another PR tbd).

 

W.r.t. the round-trip part - the code in the DVUploader already uploads a directory tree of files, i.e. could re-upload the data files from an unzipped bag. The original code from SEAD for that also read an OAI-ORE map to also upload metadata – I haven’t updated that code to upload metadata to Dataverse, or to read directly from a zipped Bag, but that could be added (I just haven’t had time/$ so far…).

 

I think the biggest difference with the DV tree concept I see is that the Bags are per dataset and don’t cover the Dataverse hierarchy or metadata. Combining the two, it might be possible to either use the OAI_ORE metadata file (available via the export api) as the metadata file at the dataset level, or just drop Bags in the tree at the dataset level. Other than avoiding format proliferation, I think the only advantage of the OAI_ORE approach is that it is already json-ld, mapping internal DV metadata to external vocabularies – part of why the RDA Research Data Repository Interoperability WG picked it and BagIt as a way to get closer to round-tripping between repositories. Conversely, I don’t think that the OAI_ORE file is any harder to parse, e.g.  for implementing a metadata upload in python, since it is ‘just json’.
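
A hedged Python sketch of the "OAI_ORE file as the dataset-level metadata file" combination might look like the following. The endpoint path and the `OAI_ORE` exporter name reflect my understanding of the Dataverse native export API and should be checked against the API guide for your installation's version; the server URL, DOI, and helper names are hypothetical.

```python
import json
from pathlib import Path
from urllib.parse import urlencode
from urllib.request import urlopen


def export_url(base_url: str, persistent_id: str, exporter: str = "OAI_ORE") -> str:
    """Build the metadata export URL for one dataset (assumed endpoint layout)."""
    query = urlencode({"exporter": exporter, "persistentId": persistent_id})
    return f"{base_url.rstrip('/')}/api/datasets/export?{query}"


def fetch_export(base_url: str, persistent_id: str, exporter: str = "OAI_ORE") -> dict:
    """Download the export and parse it as JSON (requires network access)."""
    with urlopen(export_url(base_url, persistent_id, exporter)) as response:
        return json.load(response)


def save_dataset_metadata(dataset_dir: Path, document: dict) -> Path:
    """Store the exported document as metadata.json inside a ds_* folder."""
    dataset_dir.mkdir(parents=True, exist_ok=True)
    target = dataset_dir / "metadata.json"
    target.write_text(json.dumps(document, indent=2), encoding="utf-8")
    return target
```

For example, `save_dataset_metadata(Path("dv_aussda/ds_survey-labour-2016"), fetch_export("https://dataverse.example.org", "doi:10.5072/FK2/EXAMPLE"))` would place the export at the dataset level of the tree. Because the file is "just json", no JSON-LD library is needed merely to store or inspect it.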

 

Other thoughts on round-trip:

Should it be latest version only or all versions?

How should differences in installed metadata blocks be handled (incoming data with metadata that isn’t represented in the new Dataverse, a new Dataverse with required fields that are not in the uploaded datasets)?

 

-- Jim
