Dear Dataverse Community!
I am the lead of the Renku platform (https://renkulab.io) project at the Swiss Data Science Center. Recently we implemented a feature that I believe many members of this community might find interesting: a simple way to use Dataverse (and other) datasets in collaborative data science projects. Datasets are simply "linked" to a Renku project and made available in compute sessions immediately; data is not copied up-front but is instead streamed on-demand and cached. In addition, Renku makes it easy to find other projects that use the same data, so identifying additional uses of the same data is quite simple.
The data connectors to these published datasets can be combined in the same project with code repositories and pre-configured compute environments, all available at the click of a button. This makes it very easy to create actionable projects that demonstrate how data can be (re)used or that serve as replication examples.
For a simple demo of this functionality, you can try it out in this example project.
More information about the published data feature can be found here.
This is a brand-new feature! If you find bugs or have ideas on how to make it better or what is missing, please let us know! Note that we have tested it against several different Dataverse installations, but we cannot guarantee that it works with _all_ installations - if you find one that fails, please drop me a line.
Technical details:
We implemented this by adding a "doi" backend to the popular cloud-storage tool rclone. This means that you can use rclone to access Dataverse datasets very easily from the command line. The work has not been merged upstream yet, but if you are interested you can download a release of our fork here:
https://github.com/SwissDataScienceCenter/rclone/
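For anyone curious what turning a DOI into a file listing involves, here is a minimal Python sketch. It is not the code of the rclone backend and makes no claim about how that backend works internally; it only illustrates the general idea, assuming the Dataverse native API endpoints `/api/datasets/:persistentId/` and `/api/access/datafile/{id}` and the usual `data.latestVersion.files` layout of the response. The function names, the host and the DOI are hypothetical placeholders.

```python
import json
from urllib.parse import urlencode
from urllib.request import urlopen


def dataset_api_url(base_url, doi):
    """Build the URL that describes a dataset identified by its DOI.

    Assumes the Dataverse native API layout; base_url is the root of
    a Dataverse installation (hypothetical in the example below).
    """
    query = urlencode({"persistentId": f"doi:{doi}"})
    return f"{base_url.rstrip('/')}/api/datasets/:persistentId/?{query}"


def list_files(dataset_json, base_url):
    """Return (name, size, download_url) tuples for the latest version.

    Assumes the response nests files under data.latestVersion.files,
    each with a "dataFile" object carrying id, filename and filesize.
    """
    base = base_url.rstrip("/")
    result = []
    for entry in dataset_json["data"]["latestVersion"]["files"]:
        data_file = entry["dataFile"]
        name = data_file.get("filename", entry.get("label"))
        url = f"{base}/api/access/datafile/{data_file['id']}"
        result.append((name, data_file.get("filesize"), url))
    return result


def fetch_dataset(base_url, doi):
    """Fetch and decode the dataset description (needs network access)."""
    with urlopen(dataset_api_url(base_url, doi)) as response:
        return json.load(response)
```

With a real installation and DOI, `list_files(fetch_dataset(base, doi), base)` would give the names, sizes and download URLs that a tool could then stream on demand instead of copying everything up-front.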
Finally, Phil has kindly invited me to speak about this and the new iteration of the Renku platform at the July community call, so if you are interested in chatting about this and other aspects of what we are building, please do join the call!
Rok