yep, not much innovation here. The setup is pretty much standard except:
* only one round of revision, to ensure fast publication
* this also means rejecting a lot of the work, since it does not seem
worth a second revision round (as in some conferences)
* not much re-editing of the paper or additional work => fast, but
likely to reject good work that is not yet ready
No title yet, just quite broadly: a new, top-tier, open access journal
for biomedical and life sciences research.
I guess the innovative step is: how to get top-quality publications
without going through several revision cycles. I am not sure how this
would work.
I also wonder how much time the funders want to give the journal to
reach the top tier. This could take some time, since open access is
already established through other journals.
For the data journal:
* sounds like a good and interesting idea
* it makes sense to work together with the DataCite people: http://www.datacite.org/
* giving the different data collection types a definition /
specification seems mandatory
* it would be important to consider that the data also requires
revision => could be tedious
* working together with different journals to collect / organise the
supplementary data could be an interesting service.
This is what comes to my mind.
-drs-
> for data – an open access place for data that are sound, but not
> necessarily expected to have high impact on science. If that sounds
> less than promising, recall the beginnings of PLoS ONE, which was
> initially considered below par, but proved otherwise and became the
> world's largest journal with lots of value.
>
> Scientists need an open access place for data since there are many
> datasets that do not fit well within existing repositories or are not
> suitable for inclusion in a publication (the so-called long-tail
> problem). With a lightweight documented calling interface (API), apps
> can easily be developed by the community to operate on subsets of the
> data (including reviewing the data) in the same way that apps are
> built to maximize the utility of a piece of hardware or a corpus of
> literature, as is the case with something like Elsevier’s SciVerse.
> The more data are published this way, the more useful the data journal
> becomes, which in turn encourages more data publication. The sum of
> the parts is greater than the whole – not true of a typical journal.
>
> A number of pieces will be critical to a successful data journal.
> Understanding the needs of a scientific community and getting its
> feedback will keep the journal relevant; this is an ongoing activity,
> effectively achieved by establishing an appropriate social
> infrastructure. Providing a simple, bullet-proof rights framework will
> ensure that needless barriers to sharing and preservation are
> minimized. Usage and impact metrics should be enabled by citation
> conventions backed up by unique persistent identifiers for datasets
> and researchers themselves. Some concept of “quality”, if not peer
> review, also needs to be defined and piloted.
>
> The impact on scientific data analysis could be profound. The devil is
> in the details – how can this be made versatile enough to embrace all
> forms of data? Consider how data are handled today:
>
> 1. Community repositories for specific data types, e.g., the Protein
> Data Bank (PDB), GenBank, and PharmGKB – these are functioning well, but
> are expensive to operate and there is no business model beyond
> continued federal funding. As a result some of these resources have
> closed (e.g., the Arabidopsis database) or are under threat.
>
> 2. Repositories that contain data specifically related to a
> publication, but for which no recognized domain specific repository
> exists. Dryad is an example of such a repository. Publishers have
> associated with these resources since it relieves them of the burden
> of managing the data as a large supplement to the paper, which many
> publishers are ill-equipped to do. These resources are just emerging
> and seem destined to also continue to rely on federal funding or
> buy-in from the publishers; in other words, there is no established
> business model to provide sustainability at this time.
>
> 3. Institutional repositories that manage the data produced by
> researchers at a given institution. At this time their success is far
> from assured – what is the incentive for a data producer to put data
> there? What is the business model for their sustainability? How do
> institutional repositories relate to each other? There are exceptions
> (e.g., the California Digital Library) but even there I would say
> there is a failure to achieve broad buy-in by faculty. There need to
> be faculty who have seen a significant gain and who champion such
> developments.
>
> 4. Grass roots efforts, e.g., Sage Bionetworks, which are pushing an
> open data model and are usually domain specific, e.g., disease
> modeling.
>
> In reality, much of the data generated conforms to the long tail –
> the very large number of small datasets not handled well by any of
> the above.
>
> Notwithstanding the provision of a reward, the barrier to “publishing”
> with the Data Journal must be kept very low, at least initially. This
> implies collecting only a minimal amount of metadata to provide
> provenance and to identify the dataset uniquely. Over time, based on
> demand from scientists to publish in the Data Journal, the
> requirements can be ramped up, but also offset by increased
> automation. Here are three suggested phases of development:
>
> • Phase I (one year) – deploy the backend curation repository to
> appropriately store, retrieve, search and audit datasets. Develop
> simple upload mechanisms for small and large datasets. Define a simple
> review and data integrity system to validate the depositions. See
> whether the journal gets any “data papers” submitted. This could be
> based on an existing infrastructure at the California Digital Library
> (CDL).
>
> • Phase II (one year) – If phase I is successful, start to accumulate
> more metadata on entries through automated extraction based on data
> format and data type. An example of a popular data format would be MS
> Excel spreadsheets (coincidentally, by June 2012 CDL will have released
> an open-source Excel “add-in” designed to make spreadsheet data easier
> to publish, share, and preserve); an example of a data type would be
> biological sequence data which has 1-2 well-structured forms and which
> does not belong in any existing repository.
>
> • Phase III (ongoing) – expand the capabilities established in phase II
> to visually respond to common and established data formats and data
> types. Data visualized in a traditional journal is static. Here the
> data can be brought to life, often using tools developed by the
> communities themselves and collated by the Data Journal. If successful,
> this will become an accepted form of publishing, the depositors/authors
> will treat it like any other journal, and the merger between database
> and journal, predicted six years ago, will be realized.
>
> Where Will This Be Done and By Whom?
> This remains open at this time, but one option is a partnership with
> PLoS or F1000 as a publisher, and the University of California as a
> backend provider. The California Digital Library could provide data
> repository, curation, citation, and publishing services (with DataONE
> and DataCite) as well as long experience with aggregators and
> publishers.
>
> As a faculty advocate and database person I am willing to put
> significant effort into this, but it must be a broad team effort.
>
> How and Why Will It Be Funded?
> I have thought long and hard about this since the very successful
> “Beyond the PDF Workshop” (BtPDF), which we hosted at UCSD in January
> 2011. I have become convinced that to precipitate change in the way we
> manage scholarship in the digital era we need an exemplar that can
> touch your average scientist, who sees a need for change, but will not
> be part of it without a reward. A Data Journal is such a reward –
> citation and preservation of an important part of their scholarship.
> With great concern about preservation and open accessibility of
> digital data from funders, publishers and scientists, the time would
> seem right to take this step. The intent is to approach funders who
> are vested in the ideals of BtPDF to seed such a Data Journal
> initiative.
>
> To start we will initiate a joint call with supporters of BtPDF – the
> CDL, the National Cancer Institute (NCI), The Gordon and Betty Moore
> Foundation (GBMF), The Doris Duke Foundation, The Alfred P. Sloan
> Foundation, Sage Bionetworks, Science Commons and Microsoft with the
> goal of obtaining short-term seed funding for Phase I. For Phase II,
> the emerging team of like-minded folks would form and seek funding
> from the NSF etc.; Phase III calls for a sustained open access
> business model. That is, for Phase III researchers would put a line
> item in their research budgets to cover the cost of preservation.
> For-profits will be encouraged to pay small amounts to upload data no
> longer of proprietary value and will be compensated with limited
> advertising.
>
> Acknowledgements
> Thus far thanks to Lee Dirks (Microsoft), Josh Greenberg (Sloan
> Foundation), Tony Hey (Microsoft), John Kunze (CDL), David Lipman
> (NLM), Chris Mentzel (GBMF), Mark Patterson (PLoS), Brian
> Schottlaender (UCSD Library) for discussions in coming to the
> conclusions presented here.
>
>
--
Dietrich Rebholz-Schuhmann, MD, PhD - Research Group Leader
EBI, Wellcome Trust Genome Campus, Hinxton CB10 1SD (UK)
- - - - - - - - - - - - - - - - - - - - - - - - - - - - -
TM support:www.ebi.ac.uk/Rebholz-svr | tm-su...@ebi.ac.uk
ISMB/ECCB 2011: http://www.iscb.org/ismbeccb2011
Twitter: jbiomedsem
I think this would be fantastic. I also like the bootstrap approach you
outlined. One should require the minimal amount of information to make
it easy to add data.
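To make that concrete: a minimal deposit record might carry just enough metadata for provenance and a unique, citable identifier. A sketch only (every field name here is hypothetical, not a proposed standard):

```python
import uuid
import datetime

def minimal_deposit_record(title, creator, contact_email):
    """Build the smallest metadata record that still gives provenance
    and a unique identifier for a deposited dataset."""
    return {
        # identity: in production a persistent identifier (e.g. a DOI
        # minted via a service such as DataCite) would go here; a UUID
        # stands in for this sketch
        "identifier": str(uuid.uuid4()),
        # provenance: who, what, when
        "title": title,
        "creator": creator,
        "contact": contact_email,
        "deposited": datetime.datetime.now(datetime.timezone.utc).isoformat(),
    }

record = minimal_deposit_record(
    "Example assay measurements", "J. Doe", "j.doe@example.org")
```

Anything beyond these few fields could be demanded later, once demand from depositors justifies ramping the requirements up.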
This would also be a good start toward growing a community of journals
for publishing both data and experiments (e.g. Open Research
Computation [1]).
regards,
Paul
[1] http://www.openresearchcomputation.com/
--
Dr. Paul Groth (p.t....@vu.nl)
http://www.few.vu.nl/~pgroth/
Knowledge Representation & Reasoning Group
Artificial Intelligence Section
Department of Computer Science
VU University Amsterdam
I have a few basic thoughts relating to your idea for a Data Journal.
1 - Different domains of study have differing requirements and approaches regarding data, e.g. biology, social sciences, astronomy, medical (HIPAA), psychology...
2 - Technical keys to the problem may include common metadata specifications, stable identifiers, robust provenance, and reliability of the core data storage facilities.
3 - Social keys to the problem certainly include recruitment / motivation / reward for participation, and more broadly, incorporating a data journal into the workflow / communications ecosystem of the ordinary scientist.
4 - It would probably be mistaken to try to develop a one-size-fits-all complete metadata model for all domains at the outset. Any metadata model should be very basic but "regionally extensible".
5 - There are multiple formats that have been proposed for stable data identifiers, and these differences relate among other things to:
-- different existing repositories for the data
-- different opinions and approaches to standard identifiers
-- different data citation models
6 - Fundamentally, "stable identifiers" are also a form of metadata - so-called "surrogate keys" in database parlance.
7 - A data repository is really only useful if its datasets are connected via data citation to the methods used to generate them and the interpretations given them by their originators.
I would particularly support creating a repository *pre-aligned with a journal*, for example PLOS itself, with a really useful citation mechanism for existing journals to opt into, requiring it of their contributors. This ensures that your prototype application answers point 7 above.
At the same time, one cannot expect that such a repository would be the only one or would have the "one ring" citation model. So it would also be important to provide for interoperability with other models and other repositories.
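Point 4's "basic but regionally extensible" model can be sketched as a tiny required core plus namespaced extension blocks that each domain community defines for itself (all names below are invented for illustration):

```python
# core fields every record must carry, regardless of domain
CORE_FIELDS = {"identifier", "title", "creator", "deposited"}

def validate_record(record, domain_fields=None):
    """Check the core; anything under 'ext' is a namespaced,
    domain-specific block that the core validator does not interpret."""
    missing = CORE_FIELDS - set(record)
    if missing:
        raise ValueError(f"missing core fields: {sorted(missing)}")
    # domain communities may register required fields for their namespace
    for namespace, required in (domain_fields or {}).items():
        block = record.get("ext", {}).get(namespace, {})
        if not required <= set(block):
            raise ValueError(
                f"{namespace}: missing {sorted(required - set(block))}")
    return True

record = {
    "identifier": "doi:10.9999/example.1",  # hypothetical identifier
    "title": "Sequence dataset",
    "creator": "J. Doe",
    "deposited": "2011-07-01",
    "ext": {"bio.sequence": {"organism": "H. sapiens", "format": "FASTA"}},
}
```

This keeps the core one-size-fits-all while letting, say, astronomy and biology diverge inside their own namespaces without colliding.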
I would enjoy discussing this idea further one on one or in a small group.
The Dataverse Network (http://thedata.org) people have a very interesting open source software framework which might well be adaptable in an endeavor such as you propose. Merce Crosas, the Dataverse PM, and I have had a number of discussions on this topic and I am certain she also would be happy to chat further. Anita De Waard and I have discussed similar ideas.
I'd also recommend looking at some of the outputs of a workshop organized by Merce and her colleagues in May this year: http://projects.iq.harvard.edu/datacitation_workshop/
Best
Tim
> I found the announcement yesterday of a new high profile open access
> journal somewhat discouraging (see
> http://www.hhmi.org/news/20110627.html).
Me too, but then it's not in my domain anyway.
> After much discussion with a number of people I have concluded that a
> logical next step is to consider starting a high visibility data
> journal for the reasons I outline below in a brief rationale. I am
> considering starting to chase funds for such an endeavor and before I
> do so I would very much appreciate the thoughts of this group on
> whether this is a good idea, whether it conflicts with other efforts
This is actually pretty similar to what I recently suggested in reply
to a proposal by Anita to start an "executable journal". The idea
looks good to me, and you have come up with good ideas concerning many
of the organizational aspects.
One topic you do not mention is refereeing. What would your data
journal do to ensure at the very least that the submitted data is what
it claims to be, and is usable by someone other than the submitting
authors? Without some minimal reviewing procedure, some people are
likely to deposit junk data in order to get a citable publication.
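One cheap first tier, before any human referee is involved, would be automated integrity checks: does the deposit match its declared checksum, is it non-empty, does it parse as the format the submitter claims? A sketch (the checks shown are illustrative only; passing them says nothing about scientific value):

```python
import hashlib

def integrity_checks(data_bytes, declared_sha256, declared_format):
    """Return a list of problems found; an empty list means the deposit
    is at least self-consistent and can go on to human review."""
    problems = []
    if len(data_bytes) == 0:
        problems.append("empty deposit")
    if hashlib.sha256(data_bytes).hexdigest() != declared_sha256:
        problems.append("checksum mismatch")
    # a format-specific sniff test; real validators would be pluggable
    if declared_format == "fasta" and not data_bytes.startswith(b">"):
        problems.append("does not look like FASTA")
    return problems

payload = b">seq1\nACGT\n"
report = integrity_checks(
    payload, hashlib.sha256(payload).hexdigest(), "fasta")
```

Checks like these would not stop deliberate junk, but they raise the cost of depositing it and catch honest mistakes before a referee's time is spent.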
Konrad.
--
---------------------------------------------------------------------
Konrad Hinsen
Centre de Biophysique Moléculaire, CNRS Orléans
Synchrotron Soleil - Division Expériences
Saint Aubin - BP 48
91192 Gif sur Yvette Cedex, France
Tel. +33-1 69 35 97 15
E-Mail: research at khinsen dot fastmail dot net
---------------------------------------------------------------------
Technically we should compare a Data Journal with three existing
approaches:
1. Document Repositories like arXiv.
2. Data registries like CKAN
3. Source code repositories like github
I'd favour the third, but scientists will more likely be familiar with
the first. In my opinion it is crucial to be familiar with these
different approaches, to know the strengths and weaknesses of each
variant. If you add peer review to the journal, you will soon limit
its scope to some fields of research, but I think you cannot avoid
some topical focus anyway. Only experts in a domain can judge data
and metadata.
Jakob
--
Verbundzentrale des GBV (VZG)
Digitale Bibliothek - Jakob Voß
Platz der Goettinger Sieben 1
37073 Goettingen - Germany
+49 (0)551 39-10242
http://www.gbv.de
jakob...@gbv.de
> After much discussion with a number of people I have concluded that a
> logical next step is to consider starting a high visibility data
> journal for the reasons I outline below in a brief rationale. I am
> considering starting to chase funds for such an endeavor and before I
> do so I would very much appreciate the thoughts of this group on
> whether this is a good idea, whether it conflicts with other efforts
> etc. I am less concerned about the technology to be used at this stage
> and more concerned about whether this would move scholarship in the
> right direction. Your counsel would be much appreciated.
I share your sentiment regarding the HHMI initiative and enjoyed
reading your alternative proposal.
While you have sketched out the technical hurdles and approaches in
quite some detail, I am missing a complement on the cultural side: as
PMR just pointed out, the key to cultural change is community
engagement, and this works better where the community already is than
in new venues. There are reasons why several scope-limited PLoS
journals were started before PLoS ONE, and many of them were cultural.
I think many of these would also apply to the launch of PLoS DATA,
with technical and legal issues further complicating the matter.
One way to go about this would be to try to get existing journals to
adopt a policy to add a statement on data sharing to each article they
publish, just as it is now common practice to have a statement on
potential conflicts of interest. Such data sharing statements are
already in effect at some journals [1] while under consideration at
others [2], and the little cultural change they require could be
achievable in a fair number of communities that do not yet have
mandatory data sharing [3]. If the statement is suitably implemented
(e.g. automatically generated on the basis of responses collected by a
standardized form), some basic metadata generation (and its flagging
[4]) could also be automated, so there would be little effort to be
expended on the side of authors or publishers.
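Such a statement generator really can be trivial: map each form response to boilerplate text and fill in the details. A sketch (the response codes and wording below are invented, not any journal's actual form):

```python
# hypothetical response codes from a standardized submission form
TEMPLATES = {
    "repository": ("The data underlying this article are available in "
                   "{repo} at {url}."),
    "on_request": ("The data underlying this article will be shared on "
                   "reasonable request to the corresponding author."),
    "none": ("No new data were generated or analysed in support of "
             "this article."),
}

def data_sharing_statement(form):
    """Render a machine-generated data sharing statement from the
    submitter's form responses."""
    return TEMPLATES[form["availability"]].format(**form.get("details", {}))

stmt = data_sharing_statement({
    "availability": "repository",
    "details": {"repo": "Dryad", "url": "https://datadryad.org/example"},
})
```

Because the input is structured, the same responses can be emitted as machine-readable metadata alongside the human-readable sentence.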
First results of such an approach (i.e. a standardized statement
generator, and journals using it) could realistically be expected on
the scale of months, thereby raising community awareness of the issue
already during Phase I of your technical development. Other
community-based options - acting on different time scales - would be
getting involved with the standardization of technical aspects of data
sharing for scope-limited data journals [5], or even a SCOAP3
approach, which would also address the business model issue [6]. Given
a wider community involvement of such kind, significant community
participation in the technical development should be more easily
achievable than via a purely technical approach.
Note that a reward system is not required for anything thus far, but
these data statements (if machine readable) could serve as a basis for
a reward system that could later be expanded (e.g. to include reuse
metrics) as the technical development proceeds.
Thanks again for this stimulating proposal - I wish you could have
submitted it to a "Journal of Research Proposals" [7] for appropriate
reward.
Cheers,
Daniel
[1] http://dx.doi.org/10.1136/bmj.c564
[2] http://blogs.openaccesscentral.com/blogs/bmcblog/entry/dear_scientist_help_us_to
[3] http://www.datadryad.org/jdap
[4] http://onsclaims.wikispaces.com/
[5] http://www.gbif.org/communications/news-and-events/showsingle/article/new-incentive-for-biodiversity-data-publishing/
[6] http://scoap3.org/
[7] http://evomri.net/?page_id=36
More recently I have been watching the Journal of Network Biology from
the inside, and although there are technical challenges, the social
challenges of what it means to "peer review" a data set make the
technology look simple. And the odds are good that the first pass on
what review means will be wrong and will need to be incrementally
improved for a while, or perhaps even thrown out for something that is
less wrong and upon which incremental innovation can occur.
jtw
--
John Wilbanks
VP for Science
Creative Commons
web: http://creativecommons.org/science
blog: http://scienceblogs.com/commonknowledge
twitter: @wilbanks
First a confession – what I am really suggesting is not a Data Journal but a database, but my faculty (my metric for success) neither respect nor understand databases. So let's call it a journal (which they revere) even though it is a database underneath. It's only a white lie :) Now to your points:
I agree that there needs to be a way to cite the data, and working with DataCite or others makes perfect sense, but it is not enough to entice your average scientist in my view. You need to be able to cite the dataset also with a journal-like citation (journal name, year, volume, etc.) and include that in the references of a paper. Even then it will not be given much credence in the beginning, but it could grow in perceived value in the same way the perception of the value of PLoS ONE articles has changed. What drove the change in the perception of ONE, in my opinion, was the impact factor and article-level metrics showing the value of the papers therein. Data-level metrics are a start to show that. You can't ignore the relative value of a paper cited only by the original authors vs. a dataset that is downloaded by 100 people and cited by 10 in the papers resulting from its use. In time (I dream) data citations and paper citations will become indistinguishable.
The issues of data types, metadata extraction, validation, etc. are huge of course, but often solved in part by the respective labs who generate the data, though for their own needs (unless managed by a community database). Therein lies the beauty (I dream again): these tools now become part of the solution in a shared space and are themselves open.
None of this precludes relationships with journals and with articles that say more about the data than is in the data journal itself. In fact, publishers should be relieved, since it frees them of the burden of trying to handle data. What makes what I am thinking of different from something like Dryad, which is doing this with published articles, is that there does not have to be a publication, now or ever.
Cheers../Phil B
>> for data – an open access place for data that are sound, but not
>> necessarily expected to have high impact on science. If that sounds
>> less than promising, recall the beginnings of PLoS ONE, which was
>> initially considered below par, but proved otherwise and became the
>> worlds largest journal with lots of value.
>>
>> Scientists need an open access place for data since there are many
>> datasets that do not fit well within existing repositories or are not
>> suitable for inclusion in a publication (the so-called long-tail
>> problem). With a lightweight documented calling interface (API), apps
>> can easily be developed by the community to operate on subsets of the
>> data (including reviewing the data) in the same way that apps are
>> built to maximize the utility of a piece of hardware or a corpus of
>> literature, as is the case with something like Elsevier’s SciVerse.
>> The more data are published this way, the more useful the data journal
>> becomes, which in turn encourages more data publication. The sum of
>> the parts are greater than the whole – not true of a typical journal.
>>
>> A number of pieces will be critical to a successful data journal.
>> Understanding the needs of a scientific community and getting its
>> feedback will keep the journal relevant; this is an ongoing activity
>> effectively is achieved by establishing an appropriate social
>> infrastructure. Providing a simple, bullet-proof rights framework will
>> ensure that needless barriers to sharing and preservation are
>> minimized. Usage and impact metrics should be enabled by citation
>> conventions backed up by unique persistent identifiers for datasets
>> and researchers themselves. Some concept of “quality”, if not peer
>> review, also needs to be defined and piloted.
>>
>> The impact on scientific data analysis could be profound. The devil is
>> in the details – how can this be made versatile enough to embrace
>> Bank (PDB), Genbank, and PharmGKB – These are functioning well, but
>> are expensive to operate and there is no business model beyond
>> continued federal funding. As a result some of these resources have
>> closed (e.g., the Arabidopsis database) or are under threat.
>>
>> 2. Repositories that contain data specifically related to a
>> publication, but for which no recognized domain specific repository
>> exists. Dryad is an example of such a repository. Publishers have
>> associated with these resources since it relieves them of the burden
>> of managing the data as a large supplement to the paper, which many
>> publishers are ill-equipped to do. These resources are just emerging
>> and seem destined to also continue to rely on federal funding or buy-
>> in from the publishers, in other words, there is no established
>> business model to provide sustainability at this time.
>>
>> 3. Institutional repositories that manage the data produced by
>> researchers at a given institution. At this time their success is far
>> from assured – what is the incentive for a data producer to put data
>> there? What is the business model for their sustainability? How do
>> institutional repositories relate to each other? There are exceptions
>> (e.g., the California Digital Library) but even there I would say
>> there is a failure to achieve broad buy-in by faculty. There need to
>> be faculty who have seen a significant gain and who champion such
>> developments.
>>
>> 4. Grass roots efforts, e.g., Sage Bionetworks, which are pushing an
>> open data model and are usually domain specific, e.g., disease
>> modeling.
>>
>> In reality, much of the data generated, most of which conforms to the
>> long tail – the very large number of small datasets not handled {well}
>> Notwithstanding the provision of a reward, the barrier to “publishing”
>> with the Data Journal must be kept very low, at least initially. This
>> implies collecting only a minimal amount of metadata to provide
>> provenance and to identify the dataset uniquely. Over time, based on
>> demand from scientists to publish in the Data Journal, the
>> requirements can be ramped up, but also offset by increased
>> automation. Here are three suggested phases of development:
>>
>> • Phase I (one year) – deploy the backend curation repository to
>> appropriately store, retrieve, search and audit datasets. Develop
>> simple upload mechanisms for small and large datasets. Define a simple
>> review and data integrity system to validate the depositions. See
>> whether the journal gets any “data papers” submitted. This could be
>> based on an existing infrastructure at the California Digital Library
>> (CDL).
>>
>> • Phase II (one year) – If phase I is successful start to accumulate
>> more metadata on entries through automated extraction based on data
>> format and data type. An example of a popular data format will be MS
>> Excel spreadsheets (coincidentally by June 2012 CDL will have released
>> an open-source Excel “add-in” designed to make spreadsheet data easier
>> to publish, share, and preserve); an example of a data type would be
>> biological sequence data which has 1-2 well-structured forms and which
>> does not belong in any existing repository.
>>
>> • Phase III (on-going) expand the capabilities established in phase II
>> to visually respond to common and established data formats and data
>> types. Data visualized in a traditional journal is static. Here the
>> data can be bought to life, often using tools developed by communities
>> themselves and collated by the Data Journal. If successful this will
>> become an accepted form of publishing and the depositors/authors will
>> treat it like any other journal and the merger between database and
>> journal, predicted 6 years ago will be realized.
>>
>> Where Will This Be Done and By Whom?
>> This remains open at this time, but one option is in partnership with
>> PLoS or F1000, as a publisher and the University of California as a
>> backend provider. The California Digital Library could provide data
>> repository, curation, citation, and publishing services (with DataONE
>> and DataCite) as well as long experience with aggregators and
>> publishers.
>>
>> As a faculty advocate and database person I am willing to put
>> significant effort into this, but it must be a broad team effort.
>>
>> How and Why Will It Be Funded?
>> I have thought long and hard about this since the very successful
>> “Beyond the PDF Workshop” (BtPDF), which we hosted at UCSD in January
>> 2011. I have become convinced that to precipitate change in the way we
>> manage scholarship in the digital era we need an exemplar that can
>> touch your average scientist, who sees a need for change, but will not
>> be part of it without a reward. A Data Journal is such a reward –
>> citation and preservation of an important part of their scholarship.
>> With great concern about preservation and open accessibility of
>> digital data from funders, publishers and scientists, the time would
>> seem right to take this step. The intent is to approach funders who
>> are vested in the ideals of BtPDF to seed such a Data Journal
>> initiative.
>>
>> To start we will initiate a joint call with supporters of BtPDF – the
>> for data – an open access place for data that are sound, but not
>> necessarily expected to have high impact on science. If that sounds
>> less than promising, recall the beginnings of PLoS ONE, which was
>> initially considered below par, but proved otherwise and became the
>> worlds largest journal with lots of value.
>>
>> Scientists need an open access place for data since there are many
>> datasets that do not fit well within existing repositories or are not
>> suitable for inclusion in a publication (the so-called long-tail
>> problem). With a lightweight documented calling interface (API), apps
>> can easily be developed by the community to operate on subsets of the
>> data (including reviewing the data) in the same way that apps are
>> built to maximize the utility of a piece of hardware or a corpus of
>> literature, as is the case with something like Elsevier’s SciVerse.
>> The more data are published this way, the more useful the data journal
>> becomes, which in turn encourages more data publication. The sum of
>> the parts are greater than the whole – not true of a typical journal.
>>
>> A number of pieces will be critical to a successful data journal.
>> Understanding the needs of a scientific community and getting its
>> feedback will keep the journal relevant; this is an ongoing activity
>> effectively is achieved by establishing an appropriate social
>> infrastructure. Providing a simple, bullet-proof rights framework will
>> ensure that needless barriers to sharing and preservation are
>> minimized. Usage and impact metrics should be enabled by citation
>> conventions backed up by unique persistent identifiers for datasets
>> and researchers themselves. Some concept of “quality”, if not peer
>> review, also needs to be defined and piloted.
>>
>> The impact on scientific data analysis could be profound. The devil is
>> in the details – how can this be made versatile enough to embrace
>> Bank (PDB), Genbank, and PharmGKB – These are functioning well, but
>> are expensive to operate and there is no business model beyond
>> continued federal funding. As a result some of these resources have
>> closed (e.g., the Arabidopsis database) or are under threat.
>>
>> 2. Repositories that contain data specifically related to a
>> publication, but for which no recognized domain specific repository
>> exists. Dryad is an example of such a repository. Publishers have
>> associated with these resources since it relieves them of the burden
>> of managing the data as a large supplement to the paper, which many
>> publishers are ill-equipped to do. These resources are just emerging
>> and seem destined to also continue to rely on federal funding or buy-
>> in from the publishers, in other words, there is no established
>> business model to provide sustainability at this time.
>>
>> 3. Institutional repositories that manage the data produced by
>> researchers at a given institution. At this time their success is far
>> from assured – what is the incentive for a data producer to put data
>> there? What is the business model for their sustainability? How do
>> institutional repositories relate to each other? There are exceptions
>> (e.g., the California Digital Library) but even there I would say
>> there is a failure to achieve broad buy-in by faculty. There need to
>> be faculty who have seen a significant gain and who champion such
>> developments.
>>
>> 4. Grass roots efforts, e.g., Sage Bionetworks, which are pushing an
>> open data model and are usually domain specific, e.g., disease
>> modeling.
>>
>> In reality, much of the data generated, most of which conforms to the
>> long tail – the very large number of small datasets not handled well
>> by these repositories – are lost. [...]
>>
>> How Will This Work?
>> Notwithstanding the provision of a reward, the barrier to “publishing”
>> with the Data Journal must be kept very low, at least initially. This
>> implies collecting only a minimal amount of metadata to provide
>> provenance and to identify the dataset uniquely. Over time, based on
>> demand from scientists to publish in the Data Journal, the
>> requirements can be ramped up, but also offset by increased
>> automation. Here are three suggested phases of development:
>>
>> • Phase I (one year) – deploy the backend curation repository to
>> appropriately store, retrieve, search and audit datasets. Develop
>> simple upload mechanisms for small and large datasets. Define a simple
>> review and data integrity system to validate the depositions. See
>> whether the journal gets any “data papers” submitted. This could be
>> based on an existing infrastructure at the California Digital Library
>> (CDL).
>>
>> • Phase II (one year) – If phase I is successful start to accumulate
>> more metadata on entries through automated extraction based on data
>> format and data type. An example of a popular data format would be MS
>> Excel spreadsheets (coincidentally by June 2012 CDL will have released
>> an open-source Excel “add-in” designed to make spreadsheet data easier
>> to publish, share, and preserve); an example of a data type would be
>> biological sequence data which has 1-2 well-structured forms and which
>> does not belong in any existing repository.
>>
>> • Phase III (on-going) expand the capabilities established in phase II
>> to visually respond to common and established data formats and data
>> types. Data visualized in a traditional journal is static. Here the
>> data can be brought to life, often using tools developed by communities
>> themselves and collated by the Data Journal. If successful this will
>> become an accepted form of publishing and the depositors/authors will
>> treat it like any other journal and the merger between database and
>> journal, predicted 6 years ago, will be realized.
>>
>> Where Will This Be Done and By Whom?
>> This remains open at this time, but one option is in partnership with
>> PLoS or F1000, as a publisher and the University of California as a
>> backend provider. The California Digital Library could provide data
>> repository, curation, citation, and publishing services (with DataONE
>> and DataCite) as well as long experience with aggregators and
>> publishers.
>>
>> As a faculty advocate and database person I am willing to put
>> significant effort into this, but it must be a broad team effort.
>>
>> How and Why Will It Be Funded?
>> I have thought long and hard about this since the very successful
>> “Beyond the PDF Workshop” (BtPDF), which we hosted at UCSD in January
>> 2011. I have become convinced that to precipitate change in the way we
>> manage scholarship in the digital era we need an exemplar that can
>> touch your average scientist, who sees a need for change, but will not
>> be part of it without a reward. A Data Journal is such a reward –
>> citation and preservation of an important part of their scholarship.
>> With great concern about preservation and open accessibility of
>> digital data from funders, publishers and scientists, the time would
>> seem right to take this step. The intent is to approach funders who
>> are vested in the ideals of BtPDF to seed such a Data Journal
>> initiative.
>>
>> To start we will initiate a joint call with supporters of BtPDF – the
re 1 - true, domains are very different; capturing a few might be enough to start change. Science and Nature cross disciplines, so might a Data Journal.
re 2 - agreed - but many of these problems are well addressed by this wonderful community and only need to be applied.
re 3 - yes, this is the big hurdle in my view - but wait till you have to prove your data sharing to get your next grant - it will be a cakewalk at that point.
re 4 - agreed
re 5,6 - agreed, so who wins? Maybe it will be the one that delivers useful data to a customer - okay, a circular argument since you have to identify the data, but my point is a focus on data sharing first. If the PDB (with its own identifiers) and GenBank (with its own) said "we have a new system that covers both datatypes", chances are it would be adopted (i.e., cited in other places including papers).
re 7 - true of immature datatypes; is it true of, say, SNPs (a more mature datatype)? Yes in theory, but in practice not so sure... this requires discussion for sure.
I don't claim anything re the idea - others, as you name, have thought about this deeply - that the idea recurs tells us something positive; the trick is to go from idea to resource to my fellow faculty clamoring for data citations.
Lots to chat about.. cheers../Phil
-John
Penev L, Mietchen D, Chavan V, Hagedorn G, Remsen D, Smith V, Shotton D (2011). Pensoft Data Publishing Policies and Guidelines for Biodiversity Data. Pensoft Publishers, http://www.pensoft.net/J_FILES/Pensoft_Data_Publishing_Policies_and_Guidelines.pdf.
-David
[...] Do not get me wrong, support for open access is of course welcome, but that this is the best a group of top scientists could come up with to improve scholarship leaves me wondering. It is my opinion we need to coax these folks to greater thoughts through initiatives that they can see make sense, even if not initially, then soon. In my opinion the SMA application that came from the workshop in Jan. is a great example. With that under way I started to contemplate what else could be done. After much discussion with a number of people I have concluded that a logical next step is to consider starting a high-visibility data journal for the reasons I outline below in a brief rationale. I am considering starting to chase funds for such an endeavor, and before I do so I would very much appreciate the thoughts of this group on whether this is a good idea, whether it conflicts with other efforts, etc. I am less concerned about the technology to be used at this stage and more concerned about whether this would move improved scholarship in the right direction. Your counsel would be much appreciated. Best../Phil B.

A Proposal for a Data Journal
Philip E Bourne PhD
June 24, 2011

Executive Summary
After thinking long and hard post the Beyond the PDF workshop, and talking at length to scientists, publishers, librarians, funders and others vested in bettering scholarship, I have concluded we need a bold yet doable step which will engage your typical faculty member in the process of change. Without broad active participation of scientists at large we will continue to invoke change at a glacial pace and in just a small number of disciplines. Participation requires reward and other perceived gains in a competitive workplace. A Data Journal can provide that reward and gain. The scientist receives a true citation that can be used in traditional journals for their dataset.
These data are preserved and the scientist is in compliance with emerging data sharing policies being put in place by the funding agencies. Think of it as PLoS ONE for data – an open access place for data that are sound, but not necessarily expected to have high impact on science. If that sounds less than promising, recall the beginnings of PLoS ONE, which was initially considered below par, but proved otherwise and became the world's largest journal with lots of value.

Scientists need an open access place for data since there are many datasets that do not fit well within existing repositories or are not suitable for inclusion in a publication (the so-called long-tail problem). With a lightweight documented calling interface (API), apps can easily be developed by the community to operate on subsets of the data (including reviewing the data) in the same way that apps are built to maximize the utility of a piece of hardware or a corpus of literature, as is the case with something like Elsevier's SciVerse. The more data are published this way, the more useful the data journal becomes, which in turn encourages more data publication. The sum of the parts is greater than the whole – not true of a typical journal.

A number of pieces will be critical to a successful data journal. Understanding the needs of a scientific community and getting its feedback will keep the journal relevant; this is an ongoing activity that is effectively achieved by establishing an appropriate social infrastructure. Providing a simple, bullet-proof rights framework will ensure that needless barriers to sharing and preservation are minimized. Usage and impact metrics should be enabled by citation conventions backed up by unique persistent identifiers for datasets and researchers themselves. Some concept of "quality", if not peer review, also needs to be defined and piloted. The impact on scientific data analysis could be profound.
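To make the "lightweight documented calling interface (API)" idea concrete, here is a minimal sketch of the kind of surface community apps could be written against. The route shape, record fields, and DOI are invented for the example and do not describe any existing service.

```python
import json

# Hypothetical published-dataset records; fields are illustrative only.
RECORDS = {
    "d000001": {"id": "d000001", "title": "Example long-tail dataset",
                "depositor": "A. Scientist", "doi": "10.9999/pd.000001"},
}

def handle_api_call(method, path):
    """Dispatch a (method, path) pair the way a minimal REST backend might.
    Community-built apps would operate on subsets of the data through
    exactly this kind of small, documented surface."""
    if method == "GET" and path.startswith("/v1/datasets/"):
        record = RECORDS.get(path.rsplit("/", 1)[-1])
        if record is None:
            return 404, json.dumps({"error": "not found"})
        return 200, json.dumps(record)
    return 405, json.dumps({"error": "unsupported"})

status, body = handle_api_call("GET", "/v1/datasets/d000001")
```

The point of such a sketch is only that the interface stays small enough for third parties to build review and analysis tools on top of it.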
The devil is in the details – how can this be made versatile enough to embrace multiple types of data retrieved in a way that is comfortable to the intended discipline of users? This is a problem that, outside of the database community, does not seem to be well solved. Thus database developers need to be part of the solution. At the same time, it will be important that the effort not try to run before it can walk; solid, forward-thinking infrastructure first (definitely not "build it and they will come", but rather build it with them so that they might come), with enriched application development later.

The objective proposed here is to prototype such a database (but call it a journal) in one year. The prototype development requires willing data contributors, database developers, publishers, software engineers and others engaged in digital scholarship. The process involves requirements gathering from scientists, a workshop to define the technical requirements, and the subsequent prototyping (possibly using an existing system), with a goal to launch a Data Journal in the second year. You are receiving this document because in some way you could contribute to this effort. I hope you will. Comments welcome at any time.

Problem Statement
Traditional scholarly communication is struggling to adapt to the Internet era. One notable shortcoming (drawing from the biosciences, but more generally applicable) is that large amounts of digital data are being generated, often from high-throughput experiments, which may or may not be associated with a publication, and subsequently lost. True, some of that data finds its way into existing repositories that I would characterize as follows:

1. Established repositories with stable funding, where the incentive to deposit comes from a prerequisite defined by the journal accepting a publication associated with the data.
Examples are the Protein Data Bank (PDB), Genbank, and PharmGKB. These are functioning well, but are expensive to operate and there is no business model beyond continued federal funding. As a result some of these resources have closed (e.g., the Arabidopsis database) or are under threat.

2. Repositories that contain data specifically related to a publication, but for which no recognized domain-specific repository exists. Dryad is an example of such a repository. Publishers have associated with these resources since it relieves them of the burden of managing the data as a large supplement to the paper, which many publishers are ill-equipped to do. These resources are just emerging and seem destined to also continue to rely on federal funding or buy-in from the publishers; in other words, there is no established business model to provide sustainability at this time.

3. Institutional repositories that manage the data produced by researchers at a given institution. At this time their success is far from assured – what is the incentive for a data producer to put data there? What is the business model for their sustainability? How do institutional repositories relate to each other? There are exceptions (e.g., the California Digital Library) but even there I would say there is a failure to achieve broad buy-in by faculty. There need to be faculty who have seen a significant gain and who champion such developments.

4. Grass-roots efforts, e.g., Sage Bionetworks, which are pushing an open data model and are usually domain specific, e.g., disease modeling.

In reality, much of the data generated, most of which conforms to the long tail – the very large number of small datasets not handled well by these repositories – are lost. While much of these data could easily be reproduced and many will never be in demand, there is also valuable data being lost and no business model for sustaining such data.
Efforts like DataCite are a step in the right direction, but a DOI alone is not enough to incentivize the scientific community at large.

What is the Solution and Why Will it Work?
The solution is to establish a data journal. What does this mean and why will it work? The major impediment for many researchers in preserving their data is incentive. Funding agencies, e.g., NIH, NSF, are providing incentive through mandatory data sharing policies, but at this time there is confusion surrounding these policies and no indication of if and how they will be enforced, or indeed where to store the data to conform to the policies! Perhaps more important, there is no reward for depositing data, except in case 1 above, but that covers only a small amount of the data generated. A data journal provides that reward through a citation (e.g., hypothetically PLoS Data 2011 6:3 d000001) that can be included on resumes, promotion files, etc. Associated with the citation will be a DOI (from DataCite or elsewhere) that provides a definitive resolvable reference to that dataset. Initially the value of that citation will be limited, but over time we anticipate it will be accepted by aggregation services, e.g., Thomson Reuters ISI and PubMed.

Another reason a Data Journal will be successful is the corresponding demise of the research article. Already many research articles do little else than report on a set of data; a Data Journal simply acknowledges the fact. The goal of a Data Journal is to drive the notion that one day (it should have been yesterday!) a research article with zero citations (beyond self-citing) will be less valuable than a citable dataset that has been downloaded by many researchers worldwide (bibliometrics on access to the data and commentary by users of the data would be maintained from day 1).

How Will This Work?
Notwithstanding the provision of a reward, the barrier to "publishing" with the Data Journal must be kept very low, at least initially.
This implies collecting only a minimal amount of metadata to provide provenance and to identify the dataset uniquely. Over time, based on demand from scientists to publish in the Data Journal, the requirements can be ramped up, but also offset by increased automation. Here are three suggested phases of development:

• Phase I (one year) – deploy the backend curation repository to appropriately store, retrieve, search and audit datasets. Develop simple upload mechanisms for small and large datasets. Define a simple review and data integrity system to validate the depositions. See whether the journal gets any "data papers" submitted. This could be based on an existing infrastructure at the California Digital Library (CDL).

• Phase II (one year) – if Phase I is successful, start to accumulate more metadata on entries through automated extraction based on data format and data type. An example of a popular data format would be MS Excel spreadsheets (coincidentally, by June 2012 CDL will have released an open-source Excel "add-in" designed to make spreadsheet data easier to publish, share, and preserve); an example of a data type would be biological sequence data, which has 1-2 well-structured forms and which does not belong in any existing repository.

• Phase III (ongoing) – expand the capabilities established in Phase II to visually respond to common and established data formats and data types. Data visualized in a traditional journal is static. Here the data can be brought to life, often using tools developed by the communities themselves and collated by the Data Journal. If successful this will become an accepted form of publishing, the depositors/authors will treat it like any other journal, and the merger between database and journal, predicted 6 years ago, will be realized.

Where Will This Be Done and By Whom?
This remains open at this time, but one option is in partnership with PLoS or F1000 as a publisher and the University of California as a backend provider.
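The Phase I backend above (store, retrieve, search, audit, plus a simple integrity check on minimal provenance metadata) could be sketched roughly as follows. This is an in-memory toy with invented field names, not a description of any CDL system; a real deployment would sit on durable, curated infrastructure.

```python
import hashlib
from datetime import datetime, timezone

class DataJournalRepository:
    """Minimal sketch of a Phase I curation backend: store, retrieve,
    search and audit datasets, keeping only enough metadata for
    provenance and unique identification. All names are illustrative."""

    def __init__(self):
        self._records = {}
        self._audit_log = []  # (action, dataset_id, actor) tuples

    def deposit(self, dataset_id, depositor, title, payload: bytes):
        """Store a dataset with minimal metadata; return its checksum."""
        record = {
            "id": dataset_id,
            "depositor": depositor,
            "title": title,
            "sha256": hashlib.sha256(payload).hexdigest(),
            "deposited": datetime.now(timezone.utc).isoformat(),
            "payload": payload,
        }
        self._records[dataset_id] = record
        self._audit_log.append(("deposit", dataset_id, depositor))
        return record["sha256"]

    def retrieve(self, dataset_id):
        self._audit_log.append(("retrieve", dataset_id, None))
        return self._records[dataset_id]

    def search(self, term):
        """Naive title search over the minimal metadata."""
        return [r["id"] for r in self._records.values()
                if term.lower() in r["title"].lower()]

    def verify(self, dataset_id):
        """Simple data-integrity check to validate a deposition."""
        r = self._records[dataset_id]
        return hashlib.sha256(r["payload"]).hexdigest() == r["sha256"]
```

The design point is only that "ramping up requirements later" is easy if deposit, audit and integrity checking are separated from the start.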
The California Digital Library could provide data repository, curation, citation, and publishing services (with DataONE and DataCite), as well as long experience with aggregators and publishers. As a faculty advocate and database person I am willing to put significant effort into this, but it must be a broad team effort.

How and Why Will It Be Funded?
I have thought long and hard about this since the very successful "Beyond the PDF Workshop" (BtPDF), which we hosted at UCSD in January 2011. I have become convinced that to precipitate change in the way we manage scholarship in the digital era we need an exemplar that can touch your average scientist, who sees a need for change, but will not be part of it without a reward. A Data Journal is such a reward – citation and preservation of an important part of their scholarship. With great concern about preservation and open accessibility of digital data from funders, publishers and scientists, the time would seem right to take this step. The intent is to approach funders who are vested in the ideals of BtPDF to seed such a Data Journal initiative.

To start we will initiate a joint call with supporters of BtPDF – the CDL, the National Cancer Institute (NCI), The Gordon and Betty Moore Foundation (GBMF), The Doris Duke Foundation, The Alfred P. Sloan Foundation, Sage Bionetworks, Science Commons and Microsoft – with the goal of obtaining short-term seed funding for Phase I. Later, the emerging team of like-minded folks would form and seek funding from NSF etc. for Phase II, and a sustained open access business model for Phase III. That is, for Phase III researchers would put a line item in their research budgets to cover the cost of preservation. For-profits will be encouraged to pay small amounts to upload data no longer of proprietary value and will be compensated with limited advertising.
Acknowledgements
Thus far, thanks to Lee Dirks (Microsoft), Josh Greenberg (Sloan Foundation), Tony Hey (Microsoft), John Kunze (CDL), David Lipman (NLM), Chris Mentzel (GBMF), Mark Patterson (PLoS), Brian Schottlaender (UCSD Library) for discussions in coming to the conclusions presented here.
--
Dr Lyubomir Penev
Managing Director, Pensoft Publishers
13a Geo Milev Street, 1111 Sofia, Bulgaria
Tel +359-2-8704281, Fax +359-2-8704282
www.pensoft.net
www.pensoft.net/journals/zookeys
www.pensoft.net/journals/phytokeys
in...@pensoft.net
> I would be delighted to be part of exploring data journals. I think they might come from a bottom-up revolution aiming to democratise science, because publishing data is about changing values and will face enormous opposition from established beneficiaries of the concentration on "the publication" (whether PDF or anything else). I would prefer us to publish things that work, that do things, that can be used: software, data, protocols, etc. For these, I'm afraid we have to build reward systems. We shouldn't have to, but it seems essential.
I am less pessimistic about this. I believe that the scientific reward system (citations) will slowly evolve to include data citations and software citations, as soon as they become as easy to handle as today's paper citations. A data journal whose contents can be referenced exactly like today's journals could clearly help there. Once it is referenced by the big citation databases, data is part of "the system".
Another point worth investigating is the search for natural allies in this process. One group of candidates is big instruments such as synchrotrons. Their output is raw data, so they have an interest in data being highly valued. Since they are big, they are very visible and well-known to science policy makers. Suppose that big instruments would start to make all data collected on their site available after a one-year period, except if the experimenters pay a fee for exclusive access. We'd have tons of published data very quickly. But this does require an integration into today's reward system, i.e. citations.
Konrad.
--
---------------------------------------------------------------------
Konrad Hinsen
Centre de Biophysique Moléculaire, CNRS Orléans
Synchrotron Soleil - Division Expériences
Saint Aubin - BP 48
91192 Gif sur Yvette Cedex, France
Tel. +33-1 69 35 97 15
E-Mail: research AT khinsen DOT fastmail DOT net
---------------------------------------------------------------------
No time to write anything substantive but just wanted to make two points.
Every organization and their dog is setting up something that broadly fits into the category of data journal or repository at the moment. BMC are looking at it, F1000 are building something, Dryad is operational, DataCite are doing great work, other publishers are working on journals, and others in the thread have mentioned other things. Rather than add to this, I think trying to pull these efforts together to ensure that they are complementary, and not just going to compete so that nothing gets achieved, would be more valuable. The one thing all of these efforts have in common is precisely the 'giving data a journal-like citation' game.
Secondly, the question of evaluation and reward is key in my view. The citation is not enough, nor is a DOI, but both are tremendously useful steps. (As an aside, I'm with Phil on wishing we weren't being driven into using DOIs, but they seem to solve the key social problem so I will live with the potential downstream technical considerations.) The key is making these citations matter. I have had lots of recent productive discussions with major funders that indicate that they desperately want to make the publication of data matter to the people they fund. The direction is there, the technical means are largely in hand; the challenge lies in making those meet in the middle.
So I'd like to offer an alternative approach, at least as a straw man. We cheat the system. We know funders care. We know we can make the citation game work technically. So we don't tell other people, just let them observe as we work the system to our advantage. Present the slam-dunk data management plan. Show the citation stats for your data on your CV. These are dog whistles to members of our club when sitting on panels, and will be seen and noted by key people at funding agencies. It won't initially have much of an effect at institutional levels, but I think we can start to have it make a difference at funder level over the next 12-24 months. I'm not saying to hide what is going on, but actually create a visible 'old boys club' that others will then want to join. I think we can expect to see that data citation and publication are explicitly included in CV or biosketch requirements for many key funders over the next year or so. If we are ready to exploit that ourselves then that may send a much stronger message out to the wider community than 'please use and cite my data' and 'here is yet another journal-like thing to think about'.
As I say, bit of a straw man but interested to see people's response. I will try to write up a more extensive response to the Wellcome/Hughes/MPG announcement. I was actually at the meeting and you can bet I was arguing for more radical steps but I think they may have actually got the balance about right for the aims they have. Reading between the lines I think some of the more radical technical suggestions might make it through although this will depend a lot on the conservatism of the EiC. I've written one post on it (most recent one at http://cameronneylon.net) and will try to write another to capture some of the discussion when I have a moment.
Cheers
Cameron
--
Scanned by iCritical.
On 29 Jun 2011, at 08:45, "Konrad Hinsen" <rese...@khinsen.fastmail.net> wrote:
Another point worth investigating is the search for natural allies in this process. One group of candidates is big instruments such as synchrotrons. Their output is raw data, so they have an interest in data being highly valued. Since they are big, they are very visible and well-known to science policy makers. Suppose that big instruments would start to make all data collected on their site available after a one-year period, except if the experimenters pay a fee for exclusive access. We'd have tons of published data very quickly. But this does require an integration into today's reward system, i.e. citations.
We are actually doing a series of things along these lines at ISIS (the UK's national neutron source). The technical infrastructure is going in to make the raw data published in some sense, although this is not terribly useful in its own right, and there is a move towards making things available on some timeframe. At the same time I am trying to work on capturing more of the processed data, at least for some types of experiments, in a way that can be co-published with, or at least accessible from, the same places as the raw data will be. There is a lot of concern in our organization about how to reward data sharing but also give credit for code sharing and development. The thinking isn't yet as sophisticated as this group's, but the direction of travel is right.
Cheers
Cameron
Different disciplines have very different requirements for databases and
data repositories, both in terms of the data and the metadata. The one
thing in common is that they do need metadata.
As you know, in biology the development of minimal information standards
has become a bit of a cottage industry. At its broadest, a minimal
information standard is basically a document describing a dataset. In
short, it's a publication.
So, to me, this is what a data journal would be: a collection of
reports about the existence of a dataset, using structure and
organisation where appropriate data standards exist, and using free text
where they do not.
If you want an equivalent example, I have released various bits of code
that I have written over the years. When I want to refer to something that I
have written, I generally send the URI to the *documentation* and not a
tarball or version control system, which is where the actual digital
object is. Obviously, the documentation will contain a link to the
digital object.
The key difference here from the current system is that the paper
would not require an experiment, would not require a thesis, or analysis
of the results; it would not need a significant introduction, nor would
the paper have to start with the phrase "In recent years...".
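Such a "report about the existence of a dataset" could be as small as the following sketch: structured fields where an appropriate standard exists, free text where it does not, and a pointer to the documentation rather than the raw object. Every field name and URL here is hypothetical.

```python
# A minimal "data paper" record. All field names, URLs and identifiers are
# invented for illustration; MIAME is cited only as an example of a real
# minimal information standard that the structured part could follow.
data_paper = {
    "identifier": "d000002",                      # unique, citable handle
    "documentation_url": "https://example.org/d000002/README",
    "dataset_url": "https://example.org/d000002/data.tar.gz",
    "standard": "MIAME",                          # governing standard, if any
    "structured": {"organism": "E. coli", "assay": "microarray"},
    "free_text": "Expression measurements collected during a pilot run.",
}

def is_citable(paper):
    """A report is citable once it uniquely identifies the dataset and
    points at its documentation, mirroring the practice of citing the
    documentation rather than the tarball."""
    return bool(paper.get("identifier")) and bool(paper.get("documentation_url"))
```

Nothing here requires an experiment, a thesis, or an analysis of results; the record simply asserts that the dataset exists and where it is described.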
Phil Lord
Also, here's a copy-paste from the Beyond the PDF website list of links on Data Citation:
* The Australian National Data Service has a nice page on data citation awareness: http://ands.org.au/guides/data-citation-awareness.html
* The challenges with tracking dataset reuse today, based on DOIs and paper-oriented tools: http://researchremix.wordpress.com/2010/11/09/tracking-dataset-citations-using-common-citation-tracking-tools-doesnt-work/
* Gary King on data sharing http://gking.harvard.edu/projects/repl.shtml
* UNF:
o To ensure that the data set can be used for replication, one recommendation is that in addition to a handle, DOI, or other identifier, a universal numerical fingerprint (UNF) is used. UNF was proposed in Micah Altman, Gary King, 2007. "A Proposed Standard for the Scholarly Citation of Quantitative Data", D-Lib 13(3/4) http://www.dlib.org/dlib/march07/altman/03altman.html
o Example: http://thedata.org/citation/standard
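To illustrate the principle behind a universal numerical fingerprint, here is a toy sketch: canonicalize each value to a fixed precision, serialize deterministically, and hash, so that two semantically equal datasets yield the same fingerprint regardless of formatting. This is NOT the published UNF algorithm from Altman & King (2007), only the idea it rests on.

```python
import hashlib

def simple_numeric_fingerprint(values, digits=7):
    """Toy stand-in for a UNF-style fingerprint (not the real algorithm):
    round each value to a fixed number of significant digits, serialize
    in a canonical form, and hash the result. Equal data in different
    textual representations then hash identically."""
    canonical = ",".join(format(float(v), f".{digits}e") for v in values)
    return hashlib.sha256(canonical.encode("ascii")).hexdigest()
```

For example, `[0.1, 0.25]` and `["0.10", "0.2500"]` produce the same fingerprint, which is what makes such a digest useful for verifying that a cited dataset is the one actually used in a replication.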
Best,
- Anita.
Anita de Waard
Disruptive Technologies Director, Elsevier Labs
http://elsatglabs.com/labs/anita/
a.de...@elsevier.com
-----Original Message-----
From: beyond-...@googlegroups.com on behalf of cameron...@stfc.ac.uk
Sent: Wed 6/29/2011 4:21
To: beyond-...@googlegroups.com
Subject: Re: Does a Data Journal Make Sense?
Cheers
Elsevier B.V. Registered Office: Radarweg 29, 1043 NX Amsterdam, The Netherlands, Registration No. 33156677 (The Netherlands)
The answer is: 'for this data set, articles with data links acquired 20% more citations (compared to articles without these links).'
This might be a useful data point in trying to convince the community at large that data citations are important.
- Anita.
Anita de Waard
Disruptive Technologies Director, Elsevier Labs
http://elsatglabs.com/labs/anita/
a.de...@elsevier.com
-----Original Message-----
From: beyond-...@googlegroups.com on behalf of cameron...@stfc.ac.uk
Sent: Wed 6/29/2011 4:10
To: beyond-...@googlegroups.com
Subject: Re: Does a Data Journal Make Sense?
Hi All
Cheers
Cameron
The list Anita sent is nice; more impressive perhaps is that it is now
already 8 years old. Most of the links do, indeed, still work. New
efforts are at www.usvao.org and www.ivoa.net. The new ADS stuff is
at adsabs.org/ui
Cheers,
Michael
On Thu, Jun 30, 2011 at 10:47 AM, Waard, Anita de A (ELS-AMS)
<A.de...@elsevier.com> wrote:
> Very much agree with Cameron and others about connecting to existing efforts; just wanted to add Astronomy as a 'natural ally' - I am sure Mike Kurtz has a much better insight re. data citations in astronomy, but this list is quite impressive already: http://www.astro.caltech.edu/~pls/astronomy/archives.html
--
Dr. Michael J. Kurtz
Harvard-Smithsonian Center for Astrophysics
60 Garden Street
Cambridge, MA 02138
USA
ku...@cfa.harvard.edu
+1 617 495 7434
www.cfa.harvard.edu/~kurtz
Cheers,
Michael
--