New Blog Post: Aiding Reproducible Research By Adding Provenance in Data Citations

Eleni Castro

Oct 29, 2014, 4:20:35 PM

In partnership with the Harvard School of Engineering and Applied Sciences (SEAS) and the Dataverse Project at the Institute for Quantitative Social Science (IQSS) at Harvard, we are pleased to announce the launch of a new project to capture and incorporate meaningful provenance into data citations in order to facilitate research reproducibility and reuse.

Funded by an EAGER grant from the National Science Foundation, this project, titled “Citation++: Data citation, provenance, and documentation”, will design and prototype mechanisms to add provenance metadata to data citations. The PIs for this project (Margo Seltzer, Gary King, and Mercè Crosas) also plan to work with the research data community, especially groups working on data citation solutions, including DataCite, to incorporate provenance more broadly.

In the digital world, provenance, or lineage, is the history of how an artifact came to be in its current state. It typically includes precise references to both the inputs and the transformations that led to an object’s existence. There are myriad uses of provenance, but one of the most frequently mentioned [1, 2, 3, 4] is in data citation. Nonetheless, no widely used data citation standard or service currently includes provenance.
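To make the definition concrete, here is a minimal Python sketch of a provenance record as a list of inputs plus an ordered list of transformations. It is an illustration only, not the format the project will adopt: the class names, the `lineage` method, the placeholder identifier, and the rendered string are all hypothetical.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class Transformation:
    """One step applied to the data (hypothetical structure)."""
    tool: str          # e.g. "R" or "SQL"
    description: str   # human-readable summary of the step


@dataclass
class ProvenanceRecord:
    """Inputs and transformations that led to a dataset (hypothetical)."""
    inputs: List[str] = field(default_factory=list)
    transformations: List[Transformation] = field(default_factory=list)

    def lineage(self) -> str:
        # Render the history as a single readable string, in step order.
        sources = ", ".join(self.inputs)
        steps = " -> ".join(
            f"{t.tool}: {t.description}" for t in self.transformations
        )
        return f"derived from [{sources}] via [{steps}]"


# Placeholder identifier; not a real DOI.
record = ProvenanceRecord(
    inputs=["doi:10.0000/EXAMPLE-RAW"],
    transformations=[
        Transformation("SQL", "select rows for 2012"),
        Transformation("R", "aggregate by region"),
    ],
)
print(record.lineage())
```

A citation service could attach a string or structured record like this to an otherwise ordinary data citation, so that a reader can see which inputs and steps produced the cited dataset.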

This project leverages research in data citation and provenance to prototype and evaluate a provenance-enabled citation service that will allow researchers to access the history of a dataset. To do this, the project will deliver at least two instances of data citation services, one for R-based transformations of a dataset and one for SQL-based transformations of a dataset, using Dataverse in conjunction with the USENIX open access repository. PI Margo Seltzer from SEAS has this to say about the collaboration:

“Having worked on provenance systems for the past several years, I’m excited to be working closely with colleagues from IQSS and Dataverse to put our research into practice and help keep Dataverse on the cutting edge of data sharing.”

The success of this project will be evaluated using at least the following metrics: the fraction of citations added after deployment of our service that incorporate the provenance keyword, the absolute number of provenance queries issued, and the ratio of non-provenance metadata queries to provenance queries issued.
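The three metrics reduce to simple arithmetic over four counts. The sketch below shows one way they might be computed; the function name, parameter names, and sample numbers are assumptions made for illustration and do not come from the project.

```python
from typing import Dict, Optional


def provenance_metrics(
    citations_with_provenance: int,
    total_new_citations: int,
    provenance_queries: int,
    other_metadata_queries: int,
) -> Dict[str, Optional[float]]:
    """Compute the three evaluation metrics from raw counts (illustrative)."""
    # Fraction of post-deployment citations that use the provenance keyword.
    fraction = (
        citations_with_provenance / total_new_citations
        if total_new_citations
        else None  # undefined when no citations have been added yet
    )
    # Ratio of non-provenance metadata queries to provenance queries.
    ratio = (
        other_metadata_queries / provenance_queries
        if provenance_queries
        else None  # undefined when no provenance queries were issued
    )
    return {
        "provenance_citation_fraction": fraction,
        "provenance_query_count": provenance_queries,
        "non_provenance_to_provenance_ratio": ratio,
    }


# Made-up counts, for illustration only.
print(provenance_metrics(25, 100, 40, 120))
```

A higher fraction and a lower ratio would both suggest that provenance is being used relative to ordinary citation metadata; the raw query count shows absolute demand.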



[1]  P. Buneman. The providence of provenance. In Proceedings of the 29th British National Conference on Big Data, BNCOD’13, pages 7–12, Berlin, Heidelberg, 2013. Springer-Verlag.

[2]  P. Buneman, A. Chapman, and J. Cheney. Provenance management in curated databases. In Proceedings of the 2006 ACM SIGMOD International Conference on Management of Data, SIGMOD ’06, pages 539–550, New York, NY, USA, 2006. ACM.

[3]  P. Buneman, S. Khanna, and W. C. Tan. Data provenance: Some basic issues. In Proceedings of the 20th Conference on Foundations of Software Technology and Theoretical Computer Science, FST TCS 2000, pages 87–93, London, UK, 2000. Springer-Verlag.

[4]  P. E. Uhlir (Rapporteur), Board on Research Data and Information, Policy and Global Affairs, National Research Council. For Attribution – Developing Data Attribution and Citation Practices and Standards: Summary of an International Workshop. The National Academies Press, 2012.
