FW: Data Citation Best Practice Discussion Document (continuing "Nature paper with Dryad data")

Pollard, Tom

May 26, 2011, 12:17:26 PM

to datacit...@googlegroups.com, Farquhar, Adam, Brase, Jan

Dear all,

Following recent discussion about best practice for citing datasets within research articles, David Shotton has created a Data Citation Best Practice Discussion Document on Google Docs:

https://docs.google.com/document/d/1kF8-faB72l4dKTLEyx6Z5cIabk68GrJ9GraCtWnK0qQ/edit?hl=en_GB&authkey=CPPW46wL#

If you haven’t already seen it then I recommend taking a look (particularly at the section on ‘Best Practice Recommendation for Citing Data in Data Repositories’). For further information, please see David’s message below.

Best wishes,
Tom

---
Tom Pollard
Datasets Outreach Officer

The British Library
Digital Library Technology
Room 8 Floor 6
96 Euston Road London NW1 2DB
W: www.bl.uk/datasets

T: +44 (0) 207 412 7767
M: +44 (0) 750 012 6200
E: tom.p...@bl.uk

www.twitter.com/DatasetsBL

---

From: wg-digitald...@nescent.org [mailto:wg-digitald...@nescent.org] On Behalf Of David Shotton
Sent: 12 May 2011 10:00
To: Heather Piwowar
Cc: Lyubomir Penev; Gerry Lawson; Dryad Group
Subject: [Wg-digitaldata] Data Citation Best Practice Discussion Document (continuing "Nature paper with Dryad data")

Dear Heather, Dryad folk and other colleagues,

In previous e-mails under the subject "Nature paper with Dryad data", and on Monday at the Beyond the Impact Factor workshop, we have been in a conversation about the best practice for citing datasets, for example those in Dryad, from within research papers.

As an approach towards developing that best practice, I have written a Data Citation Best Practice Discussion Document that is available on Google Docs at https://docs.google.com/document/d/1kF8-faB72l4dKTLEyx6Z5cIabk68GrJ9GraCtWnK0qQ/edit?hl=en_GB&authkey=CPPW46wL#.

In this document I first compare what is recommended by DataCite and by Altman and King with what is currently practised by Dryad and what presently occurs 'in the wild' in a handful of journal articles that reference Dryad datasets. I then propose recommendations for Dryad to adopt, and conclude with draft Data Citation Best Practice Recommendations that I hope we as a community can develop into a document that we can publish for the world to adopt. As I say in the preface to the document:

"Since Dryad is pioneering data management in terms of data resources that are linked to journal articles, it is to be hoped that by first developing citation best practice in the Dryad context we can thereby catalyse its wider spread. If we can thus agree what such best practice should be among the Dryad community and implement such best practice proposals, we can then update this document accordingly and publish it to promote such practices within the wider scholarly community."

I now see that much of the confusion and disagreement concerning the best method of citing data resources within the previous e-mails in the "Nature paper with Dryad data" e-mail exchanges results from a conflation of ideas about two entities which in the conventional citation of journal articles are quite distinct:

(1) the in-text citation containing an in-text reference pointer, e.g. "this paper builds upon the work of Jones et al. [15]."
and
(2) the actual reference to Jones et al. within the article's reference list, e.g. "[15] Jones A, Bloggs B and Smith C (2008) Title. JournalName 14:132-134. doi:*****."
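
To make the distinction between the two entities concrete, here is a minimal sketch that pairs each in-text reference pointer with its entry in the reference list. The sample sentence and reference are the ones quoted above; the function name `resolve_pointers`, the dictionary layout, and the assumption that pointers are bracketed numbers are purely illustrative.

```python
import re

# The example sentence and reference from the message above.
body = 'This paper builds upon the work of Jones et al. [15].'
reference_list = {
    15: "Jones A, Bloggs B and Smith C (2008) Title. JournalName 14:132-134. doi:*****.",
}


def resolve_pointers(text, references):
    """Pair each numeric in-text reference pointer with its reference-list entry.

    Assumes pointers look like "[15]"; other citation styles would need
    a different pattern. Unknown numbers map to None.
    """
    pairs = []
    for match in re.finditer(r"\[(\d+)\]", text):
        number = int(match.group(1))
        pairs.append((match.group(0), references.get(number)))
    return pairs


for pointer, reference in resolve_pointers(body, reference_list):
    print(pointer, "->", reference)
```

The pointer lives in the body text and the reference lives in the list; a data citation needs both halves to be present and linked.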

Thus in my e-mail of 27 April, where I said

"Excellent, but what we really want is for the data citations to be included in the reference list along with the bibliographic citations, following the DataCite model: Creator (PublicationYear): Title. Version. Publisher. ResourceType. Identifier "

. . ., I should also have stressed the need for explicit in-text citations that denote such references.

All that is explained within the Google Docs paper. In that paper I also propose that we work towards having a separate Data Resources section at the end of the body text of a journal article, in which data resource citations can be gathered. That does not preclude these resources also being cited, where appropriate, within the Methods and Materials or Results sections of the paper, but is designed to put data resource citations "on the map", so to speak, as important new publication performative acts.

It is not appropriate, in my mind, for data citations to be included in the Acknowledgements section of a paper, which is designed for acknowledging contributions to the work from people and funding agencies, even if Thomson Reuters has developed methods to parse such entries, since they also have well-established mechanisms for harvesting proper (data) references from the reference list.

At present the Google Docs paper is made accessible to you as a read-only document, but I am very happy to extend edit privileges to anyone who wishes to work actively with me as an author to develop it into a publishable article, perhaps for submission to the Data Standardization, Sharing and Publication series in BMC Research Notes, for which Heather and I are both editors.

And, of course, it goes without saying that all the tag terms required to mark up in-text reference pointers and their textual contexts, references, reference lists, etc., to permit automated detection and harvesting of data citations and references, are available as RDF within the SPAR (Semantic Publishing and Referencing) Ontologies (http://purl.org/spar/), which were designed precisely to facilitate such work.

I look forward to hearing from you all.

David

On 07/05/2011 02:02, Hilmar Lapp wrote:

The perspective I'd like to add to this is that extracting the link to the dataset should be made as easy as possible for machines. This applies to an XML representation of the article (such as for full-text indexing or mining), or ePub, or PDF formats. If the fully rendered (PDF or not) version of the article has the dataset as one of the items in the list of references, then this is at best rather difficult, and likely not possible.

Ideally common XML representations and ePub would adopt an additional tag or element that gives the link to the dataset, and rendering these in the Acknowledgements section using a template sentence should be straightforward. And a convention along those lines should also make parsing the text for the expected link feasible, but identifying in an automated fashion which of the references is the dataset could be fraught with issues.

L. Penev from PenSoft already mentioned the desire to add an element to the XML that identifies the Dryad-deposited dataset. And tools like Mendeley can provide a lot more value to their users if the link to the dataset is another piece of metadata that they can parse out of the text with reasonable success.
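
As a rough illustration of how simple extraction becomes once the XML carries a dedicated element, here is a sketch using Python's standard library. The element name `dataset-link`, its `repository` attribute, and the DOI shown are all hypothetical placeholders; no existing article schema is being described.

```python
import xml.etree.ElementTree as ET

# Hypothetical article XML. The <dataset-link> element and its attribute
# are assumptions made for this sketch, and the DOI is a placeholder.
ARTICLE_XML = """<article>
  <front><title>Example article</title></front>
  <dataset-link repository="Dryad">doi:10.5061/dryad.example</dataset-link>
  <back><ack>We thank our funders.</ack></back>
</article>"""


def dataset_links(xml_text):
    """Return (repository, identifier) pairs for every dataset-link element."""
    root = ET.fromstring(xml_text)
    return [
        (element.get("repository"), (element.text or "").strip())
        for element in root.iter("dataset-link")
    ]


print(dataset_links(ARTICLE_XML))
```

With a dedicated element there is no need to guess which entry in the reference list is the dataset, which is the difficulty raised above.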

-hilmar

Sent with a tap.

On May 6, 2011, at 5:45 PM, Angus Whyte <a.w...@ed.ac.uk> wrote:

Hi Heather,

I'm not sure if this is still an open issue, but in case it's relevant... At a DCC workshop today we had a talk by Gerry Lawson, the Business and Research Information Coordinator for NERC, the UK environmental research funder. He mentioned they are recommending that authors place DOIs for datasets in the Acknowledgements section. The reason is that there is an agreement between NERC and other funders and Thomson and Elsevier to harvest data on the funders and grant numbers for WoS/Scopus from Acknowledgement sections, using agreed ASCII characters to delimit these. Unfortunately I didn't get a chance to discuss details afterwards, but he did say he would be happy to follow up later. So I don't know how far, if at all, this agreement extends to harvesting DOIs (or for that matter why it doesn't apply to reference lists), but I could enquire if you like. I dare say the Beyond Impact workshop may be another opportunity to find out more... or have you heard of all this already?

See you Monday!
cheers
Angus

On 04/05/2011 16:39, Heather Piwowar wrote:

This is an important conversation so I thought I'd keep it alive.

After talking with Todd and others, fwiw I think Dryad should recommend and model bibliographic citations to Dryad datasets even in the data sharing article. (agreeing with David here).

Unlike Genbank etc, Dryad clearly believes that *reuse* of datasets should be attributed via a formal reference in the bibliography section rather than simply mentioned in the methods or Acknowledgment full text. As a result, unlike Genbank etc, we have both something that looks/feels/acts citable, and a strong motivation to get people used to citing datasets. We are in a different place.

Here are advantages to the cultural norm of citing Dryad datasets in the bibliography of the *data collection* article:

gets investigators (and therefore funders, policy makers, etc) more used to seeing citations to datasets, so they don't think it is weird

educates and gives models to readers for how to properly cite datasets they are going to reuse, so they are more likely to do it and do it properly when the time comes

trains Dryad-depositing investigators on how to do it in their instance with hands-on cut and paste text... they are probably more likely to do it in the future themselves upon data reuse

makes our "here's how to refer to your dataset" instructions a lot simpler

every Dryad dataset gets at least one citation ;)

and most importantly, it creates more explicit, unambiguous, best practice links between datasets and papers. The links are in the bibliography which is often in front of paywalls, is certainly indexed more than full text, and is where convention has it that links go. Better and consistent linked data and papers = win for everybody in many dimensions.

Here are some disadvantages:

Different than how people reference Genbank etc data, so blazing new ground

No journals have standardized on this approach so far

Adding the Dryad reference comes late in the lifecycle of paper publication. At this point is it more difficult to add another reference to the draft than a sentence in full text, especially for papers that have a maximum-number-of-citations rule?

It makes citation context more ambiguous, since a reference could be for sharing or reuse. This really complicates my life, but oh well! Bring on CiTO :)

Thoughts?

Heather

On Thu, Apr 28, 2011 at 9:36 AM, Hilmar Lapp <hl...@nescent.org> wrote:

Where do genbank accession numbers, GEO/ArrayExpress accessions, and for those journals that require TreeBASE deposition, TreeBASE identifiers currently go? My inclination would be that Dryad data identifiers for the data for the same article should probably go there as well.

-hilmar

On Apr 28, 2011, at 12:04 PM, Elena Feinstein wrote:

David says:

what we really want is for the data citations to be included in the reference list along with the bibliographic citations

This has come up in a couple different contexts recently, so I thought I would comment. Is this what we really want? I thought differently, since these are the original articles referring to their own underlying data. Later cases of data reuse or otherwise referring to data that was first associated with a different publication should use that citation style and should probably place that in the reference list, but I don't think that's what is being discussed here.

It's an interesting question, since different journals and authors seem to be putting data identifiers in all kinds of places, which makes it more difficult for them to be found, but do they belong in the references?

Elena

On Wed, Apr 27, 2011 at 12:13 PM, David Shotton <david....@zoo.ox.ac.uk> wrote:

Excellent, but what we really want is for the data citations to be included in the reference list along with the bibliographic citations, following the DataCite model:

Creator (PublicationYear): Title. Version. Publisher. ResourceType. Identifier
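
As a minimal sketch of how a reference-list entry could be assembled from the seven elements named in that pattern, consider the following. The class name `DataCitation`, the choice to skip empty fields, and every sample value (including the placeholder DOI) are assumptions for illustration, not DataCite's own tooling.

```python
from dataclasses import dataclass


@dataclass
class DataCitation:
    """Fields follow the pattern quoted above; the names are illustrative."""
    creator: str
    publication_year: int
    title: str
    version: str
    publisher: str
    resource_type: str
    identifier: str

    def render(self):
        # Join the trailing elements with ". ", skipping any left empty
        # (skipping empty fields is a choice made for this sketch).
        parts = [self.title, self.version, self.publisher,
                 self.resource_type, self.identifier]
        tail = ". ".join(part for part in parts if part)
        return f"{self.creator} ({self.publication_year}): {tail}"


# Entirely hypothetical sample values.
citation = DataCitation(
    creator="Jones A, Bloggs B",
    publication_year=2008,
    title="Data from: Title",
    version="Version 1",
    publisher="Dryad Digital Repository",
    resource_type="Dataset",
    identifier="doi:10.5061/dryad.example",
)
print(citation.render())
```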

Kind regards,

David

Dr David Shotton david....@zoo.ox.ac.uk

Reader in Image Bioinformatics

Image Bioinformatics Research Group http://ibrg.zoo.ox.ac.uk
Department of Zoology, University of Oxford tel: +44-(0)1865-271193
South Parks Road, Oxford OX1 3PS, UK fax: +44-(0)1865-310447

Alex Ball

May 27, 2011, 5:32:59 AM

to datacit...@googlegroups.com, Pollard, Tom, Farquhar, Adam, Brase, Jan

On Thursday 26 May 2011 17:17:26 Pollard, Tom wrote:
> Following recent discussion about best practice for citing datasets
> within research articles, David Shotton has created a Data Citation Best
> Practice Discussion Document on Google Docs:

I have an interest in this as I am writing a separate guide on the topic for
the Digital Curation Centre. I have added the following comments to David's
document.

1. The UNF is actually format-independent. The Altman & King article explicitly
says: 'A UNF works by first translating the data into a canonical form with
fixed degrees of numerical precision and then applies a cryptographic hash
function to produce the short string. The advantage of canonicalization is
that UNFs (but not raw hash functions) are format-independent: they keep the
same value even if the data set is moved between software programs, file
storage systems, compression schemes, operating systems, or hardware
platforms.' The support that UNF hashing tools actually have for different
formats is another matter.

2. Where you say the title should be 'Data from: <title of paper> (<paper's
bibliographic reference>)', the recommendation should be that the
bibliographic reference is required where the paper does not itself appear as
a reference. To require it regardless, and make a special case of Dryad
landing pages because they do refer to the paper, seems unfair.

While on that subject, it strikes me that if the dataset and its accompanying
paper are both in the reference list, and the dataset does not have a distinct
title, it would be more efficient not to pretend it has but to use an in-text
citation instead, e.g. 'Data from Jones & Bloggs (2008)'.

3. I am uneasy about the recommendation to reproduce the DOI in both URN
(doi:) and URL (http:) form for the benefit of print, since some DOIs can be
quite long (NERC uses GUIDs, for example) and in print materials space is at a
premium. It would make more sense to recommend using just the URL form for
print, and suggest that publishers abbreviate this to the hyperlinked URN for
HTML surrogates (but not HTML-only papers, if we are concerned about people
printing them out). After all, everyone knows how to handle the URLs, and
anyone for whom the URN is meaningful will know how to generate it from the
URL. I note this is what the IQSS Dataverse Network recommends for its
Handles.
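
Two of the points above can be sketched in a few lines of Python. The first function is a toy illustration of the canonicalise-then-hash idea from point 1; it is emphatically not the real UNF algorithm, and its rounding rule and choice of SHA-256 are assumptions made only to show why canonicalisation gives format independence. The other two functions illustrate point 3, converting between the URL and URN forms of a DOI; the resolver prefix `http://dx.doi.org/` is assumed here and the sample DOI is a placeholder.

```python
import hashlib

# Assumed resolver prefix for the URL form of a DOI.
DOI_RESOLVER = "http://dx.doi.org/"


def toy_fingerprint(values, digits=7):
    """Toy canonicalise-then-hash sketch (NOT the actual UNF algorithm).

    Each value is rendered in scientific notation with a fixed number of
    significant digits, so "1.0", 1 and 1.00000001 share one canonical form.
    """
    canonical = "\n".join(format(float(v), f".{digits - 1}e") for v in values)
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()


def doi_url_to_urn(url):
    """Turn the URL form of a DOI into the shorter doi: form."""
    if not url.startswith(DOI_RESOLVER):
        raise ValueError("not a DOI resolver URL: " + url)
    return "doi:" + url[len(DOI_RESOLVER):]


def doi_urn_to_url(urn):
    """Turn the doi: form back into a resolvable URL."""
    if not urn.startswith("doi:"):
        raise ValueError("not a doi: identifier: " + urn)
    return DOI_RESOLVER + urn[len("doi:"):]


# The same numbers stored as text or as floats give the same fingerprint.
print(toy_fingerprint(["1.0", "2.50"]) == toy_fingerprint([1, 2.5]))
print(doi_url_to_urn("http://dx.doi.org/10.5061/dryad.example"))
```

Because the conversion between the two DOI forms is a fixed prefix swap, printing only the URL form loses nothing for readers who prefer the URN.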

Cheers,
Alex.
--
Alex Ball
Research Officer
UKOLN, University of Bath, UK. BA2 7AY
T: +44 1225 383668 F: +44 1225 386256
http://www.ukoln.ac.uk/
