notes from 2019-04-09 Dataverse Community Call

Philip Durbin

Apr 9, 2019, 12:23:11 PM
to dataverse...@googlegroups.com

2019-04-09 Dataverse Community Call

Agenda

* Community Questions

Attendees

* Gustavo Durand (IQSS)
* Phil Durbin (IQSS)
* Jim Myers (QDR)
* Paul Boon (DANS)
* Jamie Jamison (UCLA)
* Sherry Lake (UVa - Home of the 2019 NCAA Men’s Basketball National Champions) ;-)
* Courtney Mumma (TDL, sad for Texas Tech)

Notes

* Community Questions
   * (Jim) FYI: File previewers available at https://github.com/QualitativeDataRepository/dataverse-previewers . As with Data Explorer, you don't have to install these external tools on your own server if you don't want to. You can use the version hosted on GitHub Pages. These tools will be added to the list of external tools in https://github.com/IQSS/dataverse/issues/5738
   * (Courtney) Is anyone implementing the distributed digital preservation backend? We’re likely to add Chronopolis replication this fall (TDL). How frequently would you use it? For us there is a manual process and it passes through DuraCloud.
      * (Jim) Once DPN went away, we looked at the cost of Chronopolis and decided to use Google Cloud "cold storage" and we're testing that now. I have some code that does this. We would push on publish. We'll push the earlier versions as well.
   * (Sherry) How is Archivematica related? Scholars Portal is using it. UVa is part of APTrust.
      * (Phil) I believe Scholars Portal has both Dataverse and Archivematica in the same data center. Both are open source.
      * (Gustavo) It might be interesting to see if the Archivematica solution could be moved into the workflow framework Jim added for preservation.
   * (Phil) I'm interested in your favorite features of Dataverse because I'd like to update the "Features" page on the project website. Please reply at https://groups.google.com/d/msg/dataverse-community/cy6Jc0oZ-wM/1fkwgfaaAgAJ or https://github.com/IQSS/dataverse.org/issues/65

Janet McDougall - Australian Data Archive

Apr 11, 2019, 11:59:55 PM
to Dataverse Users Community
hi Gustavo & Jim
Where can I find Jim's workflow framework?

* (Gustavo) It might be interesting to see if the Archivematica solution could be moved into the workflow framework Jim added for preservation.

Thanks
Janet

Philip Durbin

Apr 12, 2019, 7:25:46 AM
to dataverse...@googlegroups.com
Hi Janet,

Sorry, those notes I took weren't especially clear.

The workflow framework was created initially for the "big data" use case of large amounts of data needing to be moved from one filesystem to another when the "publish" button is clicked: https://github.com/IQSS/dataverse/issues/3561

"Dataverse has a flexible workflow mechanism that can be used to trigger actions before and after Dataset publication." http://guides.dataverse.org/en/4.12/developers/workflows.html

Jim made good use of this framework when he implemented the DuraCloud/Chronopolis integration: http://guides.dataverse.org/en/4.12/admin/integrations.html#duracloud-chronopolis

From that "integrations" link, you can find the details of how to set it up in the installation guide: http://guides.dataverse.org/en/4.12/installation/config.html#duracloud-chronopolis-integration
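
For a rough idea of what that setup involves (a sketch only; the authoritative setting names are in the installation guide linked above), you tell Dataverse which archiver class to invoke and which settings that class may read, and put the credentials in JVM options:

    # Point Dataverse at the archiver implementation:
    curl -X PUT -d "edu.harvard.iq.dataverse.engine.command.impl.DuraCloudSubmitToArchiveCommand" \
      http://localhost:8080/api/admin/settings/:ArchiverClassName
    # List the settings the archiver is allowed to access:
    curl -X PUT -d ":DuraCloudHost, :DuraCloudPort, :DuraCloudContext" \
      http://localhost:8080/api/admin/settings/:ArchiverSettings
    curl -X PUT -d "example.duracloud.org" \
      http://localhost:8080/api/admin/settings/:DuraCloudHost
    # Credentials live in JVM options rather than database settings:
    ./asadmin create-jvm-options '-Dduracloud.username=YOUR_USERNAME'
    ./asadmin create-jvm-options '-Dduracloud.password=YOUR_PASSWORD'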

Of course, one can also use this workflow as a model for developing (and contributing!) one's own workflows. :)

It's one of the ways to extend the functionality of Dataverse.

I hope this helps,

Phil

p.s. Ok, back to vacation. :) 


James Myers

Apr 12, 2019, 10:52:38 AM
to dataverse...@googlegroups.com

Janet,

A few more notes…

-- Jim

As Phil says, the core workflow mechanism itself is quite general. It can include multiple steps and even make calls out to other web services and then pause/wait for a response before continuing. I didn’t do much to change it aside from allowing customization of which settings a workflow can access and fixing a problem with workflows being unable to make some database updates.
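
To illustrate that pause/wait behavior (sketched from the workflows guide; step names may differ by version), an ":internal" step of type "pause" halts the workflow, and an external service later resumes it by POSTing to the invocation endpoint, using the invocation ID Dataverse supplied when it called out:

    {
      "provider": ":internal",
      "stepType": "pause",
      "parameters": {}
    }

    # Resume the paused workflow from the outside:
    curl -X POST -d "ok" http://localhost:8080/api/workflows/$INVOCATION_ID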

 

I used that mechanism to create an ‘archiver framework’ – to write some classes to package the data and metadata from a dataset into a single ~RDA-conformant zipped BagIt file with an OAI-ORE metadata map and then send that to an archive. That was originally somewhat monolithic, and through the discussion with Gustavo/IQSS I ended up making changes to make it easier to target other repositories and to allow the same archiving code to be called through an API, e.g. to allow an administrator to archive older versions that have already been published. (I also added the ORE map as one of the available metadata export formats…) It’s this stuff that Gustavo was calling Jim’s framework.
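
As an example, the API call for archiving an already-published version looks roughly like the following (the endpoint name here is my assumption; verify it against the admin guide for your version):

    # Hypothetical sketch: submit version 1.0 of dataset 123 to whatever
    # archiver is configured via :ArchiverClassName:
    curl -X POST -H "X-Dataverse-key: $API_TOKEN" \
      http://localhost:8080/api/admin/submitDatasetVersionToArchive/123/1.0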

 

The core of that framework is the edu.harvard.iq.dataverse.engine.command.impl.AbstractSubmitToArchiveCommand and some related Dataverse settings that let you specify a specific class to invoke and which repository-specific settings it should have access to. The first example was the DuraCloudSubmitToArchiveCommand, which is what’s in the documentation – it can be sent a host, port, and context, and gets a username/password from JVM settings.

 

I’ve just recently created a GoogleCloudSubmitToArchiveCommand that uses the same abstract command class and just directs content to Google Cloud Storage instead (will share this at some point too). This was very easy to create since the abstract class already exposes the operation as both a workflow and an API call; I just had to rewrite one method, calling the same code to create the zipped Bag and replacing the DuraCloud API calls with Google’s. It should take minimal programming to send things to Amazon or Microsoft, or, hopefully, to any other archive that can read RDA-conformant Bags.
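
For illustration, here is a minimal sketch of such a subclass. The class name, the ":MyArchiveUrl" setting, and the upload logic are hypothetical, and the constructor and method signature follow my reading of the DuraCloud example, so verify them against the real abstract class before building on this:

    package edu.harvard.iq.dataverse.engine.command.impl;

    import java.util.Map;
    import edu.harvard.iq.dataverse.DatasetVersion;
    import edu.harvard.iq.dataverse.authorization.users.ApiToken;
    import edu.harvard.iq.dataverse.engine.command.DataverseRequest;
    import edu.harvard.iq.dataverse.workflow.step.Failure;
    import edu.harvard.iq.dataverse.workflow.step.WorkflowStepResult;

    // Hypothetical archiver: swap the upload logic for your archive's API.
    public class MyArchiveSubmitToArchiveCommand extends AbstractSubmitToArchiveCommand {

        public MyArchiveSubmitToArchiveCommand(DataverseRequest request, DatasetVersion version) {
            super(request, version);
        }

        @Override
        public WorkflowStepResult performArchiveSubmission(DatasetVersion dv, ApiToken token,
                Map<String, String> requestedSettings) {
            // ":MyArchiveUrl" would be listed in the :ArchiverSettings setting
            String archiveUrl = requestedSettings.get(":MyArchiveUrl");
            try {
                // Reuse the shared code that builds the zipped ~RDA-conformant
                // Bag (with its OAI-ORE map) and stream it to archiveUrl here.
                // ... upload calls for the target archive go here ...
                return WorkflowStepResult.OK;
            } catch (Exception e) {
                return new Failure("Archiving failed: " + e.getMessage());
            }
        }
    }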

 

Conceptually, the archiver mechanism focuses on packaging a dataset for external storage (versus coordinating with a service that’s going to make further changes and potentially interact with Dataverse over time). For Archivematica, depending on whether you’re thinking of the integration as a one-time transfer of data/metadata or an ongoing interaction between Archivematica and Dataverse, you might want to consider designs based on a basic workflow, the abstract archiver class I’ve made, or even an external tool at the dataset level (open issue #5028).

The benefit of the archiver class is that it handles creating both a workflow and an API call automatically. Further, the ability to create an ORE map file and/or a zipped Bag would save you from having to write calls to retrieve all the files and metadata (at the cost of having to read the Bag and ORE file).

The archiver mechanism would probably not be as useful if you want any sort of interaction over time – for Archivematica to pull files/metadata over time, for Archivematica to push new metadata/provenance back to Dataverse, etc. For those, either a workflow that allows you to call an external service and wait for a callback, or an external tool design where Archivematica could be called and given an apiToken to call back to the Dataverse API as needed, would probably be better. That said, you might still be able to leverage the ORE map file and/or zipped Bag in a basic workflow or external tool design, without using the archiver class itself, to simplify data/metadata transfer.

Janet McDougall - Australian Data Archive

Apr 14, 2019, 9:13:50 PM
to Dataverse Users Community
hi Jim & Phil
Thanks for the detailed information - plenty to think about. I have been looking at your BagIt feature, and we now have an Archivematica installation I've just started looking at. We are still overhauling all our archiving procedures after the full migration from Nesstar to Dataverse, including Marina working towards integrating ADA requirements for the Guestbook to be available for 'request access' calls as the tool to make grant/deny decisions on restricted data files.

I would join in the community calls except our times are now even further apart as we head into winter here.

Janet

