Does a Data Journal Make Sense?

Phil B

unread,

Jun 28, 2011, 12:10:37 PM6/28/11

to Beyond the PDF

Hi:

I found the announcement yesterday of a new high profile open access
journal somewhat discouraging (see http://www.hhmi.org/news/20110627.html).
Do not get me wrong, support for open access is of course welcome, but
that this is the best a group of top scientists could come up with to
improve scholarship leaves me wondering. It is my opinion we need to
coax these folks to greater thoughts through initiatives that they can
see make sense, even if not initially, but soon. In my opinion the SMA
application that came from the workshop in Jan. is a great example.
With that under way I started to contemplate what else could be done.

After much discussion with a number of people I have concluded that a
logical next step is to consider starting a high visibility data
journal for the reasons I outline below in a brief rationale. I am
considering starting to chase funds for such an endeavor and before I
do so I would very much appreciate the thoughts of this group on
whether this is a good idea, whether it conflicts with other efforts
etc. I am less concerned about the technology to be used at this stage
and more concerned about whether this would move improved scholarship
in the right direction? Your council would be much appreciated.

Best../Phil B.

A Proposal for a Data Journal
Philip E Bourne PhD June 24, 2011

Executive Summary

After thinking long and hard post the Beyond the PDF workshop and
talking at length to scientists, publishers, librarians, funders and
others vested in bettering scholarship I have concluded we need a bold
yet doable step which will engage your typical faculty member in the
process of change. Without broad active participation of scientists at
large we will continue to invoke change at a glacial pace and in just
a small number of disciplines.

Participation requires reward and other perceived gains in a
competitive workplace. A Data Journal can provide that reward and
gain. The scientist receives a true citation that can be used in
traditional journals for their dataset. These data are preserved and
the scientist is in compliance with emerging data sharing policies
being put in place by the funding agencies. Think of it as PLoS ONE
for data – an open access place for data that are sound, but not
necessarily expected to have high impact on science. If that sounds
less than promising, recall the beginnings of PLoS ONE, which was
initially considered below par, but proved otherwise and became the
worlds largest journal with lots of value.

Scientists need an open access place for data since there are many
datasets that do not fit well within existing repositories or are not
suitable for inclusion in a publication (the so-called long-tail
problem). With a lightweight documented calling interface (API), apps
can easily be developed by the community to operate on subsets of the
data (including reviewing the data) in the same way that apps are
built to maximize the utility of a piece of hardware or a corpus of
literature, as is the case with something like Elsevier’s SciVerse.
The more data are published this way, the more useful the data journal
becomes, which in turn encourages more data publication. The sum of
the parts are greater than the whole – not true of a typical journal.

A number of pieces will be critical to a successful data journal.
Understanding the needs of a scientific community and getting its
feedback will keep the journal relevant; this is an ongoing activity
effectively is achieved by establishing an appropriate social
infrastructure. Providing a simple, bullet-proof rights framework will
ensure that needless barriers to sharing and preservation are
minimized. Usage and impact metrics should be enabled by citation
conventions backed up by unique persistent identifiers for datasets
and researchers themselves. Some concept of “quality”, if not peer
review, also needs to be defined and piloted.

The impact on scientific data analysis could be profound. The devil is
in the details – how can this be made versatile enough to embrace
multiple types of data retrieved in a way that are comfortable to the
intended discipline of users? This is a problem that outside of the
database community does not seem to be well solved. Thus database
developers need to be part of the solution. At the same time, it will
be important that the effort not try to run before it can walk; solid,
forward-thinking infrastructure first, (but definitely not a build it
and they will come, but rather built it with them so that they might
come) with enriched application development later.

The objective proposed here is to prototype such a database (but call
it a journal) in one year. The prototype development requires willing
data contributors, database developers, publishers, software engineers
and others engaged in digital scholarship. The process involves
requirements gathering from scientists, a workshop to define the
technical requirements and the subsequent prototyping (possibly using
an existing system) with a goal to launch a Data Journal in the second
year.

You are receiving this document because in some way you could
contribute to this effort. I hope you will. Comments welcome at any
time.

Problem Statement
Traditional scholarly communication is struggling to adapt to the
Internet era. One notable shortcoming (drawing from the biosciences,
but more generally applicable) is that large amounts of digital data
are being generated, often from high-throughput experiments, which may
or may not be associated with a publication, and subsequently lost .
True, some of that data finds its way into existing repositories that
I would characterize as follows:

1. Established repositories with stable funding where the incentive to
deposit comes from a prerequisite defined by the journal accepting a
publication associated with the data. Examples are the Protein Data
Bank (PDB), Genbank, and PharmGKB – These are functioning well, but
are expensive to operate and there is no business model beyond
continued federal funding. As a result some of these resources have
closed (e.g., the Arabidopsis database) or are under threat.

2. Repositories that contain data specifically related to a
publication, but for which no recognized domain specific repository
exists. Dryad is an example of such a repository. Publishers have
associated with these resources since it relieves them of the burden
of managing the data as a large supplement to the paper, which many
publishers are ill-equipped to do. These resources are just emerging
and seem destined to also continue to rely on federal funding or buy-
in from the publishers, in other words, there is no established
business model to provide sustainability at this time.

3. Institutional repositories that manage the data produced by
researchers at a given institution. At this time their success is far
from assured – what is the incentive for a data producer to put data
there? What is the business model for their sustainability? How do
institutional repositories relate to each other? There are exceptions
(e.g., the California Digital Library) but even there I would say
there is a failure to achieve broad buy-in by faculty. There need to
be faculty who have seen a significant gain and who champion such
developments.

4. Grass roots efforts, e.g., Sage Bionetworks, which are pushing an
open data model and are usually domain specific, e.g., disease
modeling.

In reality, much of the data generated, most of which conforms to the
long tail – the very large number of small datasets not handled {well}
by these repositories - are lost. While much of these data could
easily be reproduced and many will never be in demand, there is also
valuable data being lost and no business model for sustaining such
data. Efforts like DataCite are a step in the right direction but a
DOI alone is not enough to incentivize the scientific community at
large.

What is the Solution and Why Will it Work?
The solution is to establish a data journal. What does this mean and
why will it work?

The major impediment for many researchers in preserving their data is
incentive. Funding agencies, e.g., NIH, NSF are providing incentive
through mandatory data sharing policies, but at this time there is
confusion surrounding these policies and no indication of, if and how
they will be enforced, or indeed where to store the data to conform to
the policies!

Perhaps more important, there is no reward for depositing data, except
in the case of 1 above, but that covers only a small amount of the
data generated. A data journal provides that reward through a citation
(e.g., hypothetically PLoS Data 2011 6:3 d000001) that can be
included on resumes, promotion files etc. Associated with the citation
will be a DOI (from DataCite or elsewhere) that provides a definitive
resolvable reference to that dataset. Initially the value of that
citation will be limited, but over time we anticipate it will be
accepted by aggregation services, e.g., Thomson Reuters ISI and
PubMed.

Another reason a Data Journal will be successful is the corresponding
demise of the research article. Already many research articles do
little else than report on a set of data, a Data Journal simply
acknowledges the fact. The goal of a Data Journal is to drive the
notion that one day (it should have been yesterday!) a research
article with zero citations (beyond self citing) will be less valuable
that a citable dataset that has been downloaded by many researchers
worldwide (bibliometrics on access to the data and commentary by users
of the data would be maintained from day 1).

How Will This Work?
Notwithstanding the provision of a reward, the barrier to “publishing”
with the Data Journal must be kept very low, at least initially. This
implies collecting only a minimal amount of metadata to provide
provenance and to identify the dataset uniquely. Over time, based on
demand from scientists to publish in the Data Journal, the
requirements can be ramped up, but also offset by increased
automation. Here are three suggested phases of development:

• Phase I (one year) – deploy the backend curation repository to
appropriately store, retrieve, search and audit datasets. Develop
simple upload mechanisms for small and large datasets. Define a simple
review and data integrity system to validate the depositions. See
whether the journal gets any “data papers” submitted. This could be
based on an existing infrastructure at the California Digital Library
(CDL).

• Phase II (one year) – If phase I is successful start to accumulate
more metadata on entries through automated extraction based on data
format and data type. An example of a popular data format will be MS
Excel spreadsheets (coincidentally by June 2012 CDL will have released
an open-source Excel “add-in” designed to make spreadsheet data easier
to publish, share, and preserve); an example of a data type would be
biological sequence data which has 1-2 well-structured forms and which
does not belong in any existing repository.

• Phase III (on-going) expand the capabilities established in phase II
to visually respond to common and established data formats and data
types. Data visualized in a traditional journal is static. Here the
data can be bought to life, often using tools developed by communities
themselves and collated by the Data Journal. If successful this will
become an accepted form of publishing and the depositors/authors will
treat it like any other journal and the merger between database and
journal, predicted 6 years ago will be realized.

Where Will This Be Done and By Whom?
This remains open at this time, but one option is in partnership with
PLoS or F1000, as a publisher and the University of California as a
backend provider. The California Digital Library could provide data
repository, curation, citation, and publishing services (with DataONE
and DataCite) as well as long experience with aggregators and
publishers.

As a faculty advocate and database person I am willing to put
significant effort into this, but it must be a broad team effort.

How and Why Will It Be Funded?
I have thought long and hard about this since the very successful
“Beyond the PDF Workshop” (BtPDF), which we hosted at UCSD in January
2011. I have become convinced that to precipitate change in the way we
manage scholarship in the digital era we need an exemplar that can
touch your average scientist, who sees a need for change, but will not
be part of it without a reward. A Data Journal is such a reward –
citation and preservation of an important part of their scholarship.
With great concern about preservation and open accessibility of
digital data from funders, publishers and scientists, the time would
seem right to take this step. The intent is to approach funders who
are vested in the ideals of BtPDF to seed such a Data Journal
initiative.

To start we will initiate a joint call with supporters of BtPDF – the
CDL, the National Cancer Institute (NCI), The Gordon and Betty Moore
Foundation (GBMF), The Doris Duke Foundation, The Alfred P. Sloan
Foundation, Sage Bionetworks, Science Commons and Microsoft with the
goal of obtaining short term seed funding for Phase I. Later for Phase
II the emerging team of like-minded folks who form and seek funding
from NSF etc. for Phase II and a sustained open access business model
for Phase III. That is, for Phase III researchers would have put a
line item in their research budgets to cover the cost of preservation.
For-profits will be encouraged to pay small amounts to upload data no
longer of proprietary value and will be compensated with limited
advertising.

Acknowledgements
Thus far thanks to Lee Dirks (Microsoft), Josh Greenberg (Sloan
Foundation), Tony Hey (Microsoft), John Kunze (CDL), David Lipman
(NLM), Chris Mentzel (GBMF), Mark Patterson (PLoS), Brian
Schottlaender (UCSD Library) for discussions in coming to the
conclusions presented here.

Rebholz

unread,

Jun 28, 2011, 12:29:23 PM6/28/11

to beyond-...@googlegroups.com, Phil B

Hi,

yep, not much innovation here. The setup is pretty much standard except:
* only one round of revision to ensure fast publication
* this means also to reject lot of the work, since it seems not worth a
second revision round (like in some conferences)
* not much re-editing of the paper and additional work => fast, but
likely to reject good work, since it is not yet ready

No title yet, just quite broadly: new, top-tier, open access journal
for biomedical and life sciences research.

I guess, the innovative step is: how to get top quality publications
without having to go through different revision cycles. Not sure how
this would work.

I wonder, how much time the funders want to leave to the journal to end
up in the top-tier. This could take some time, since open access is now
established through other journals already.

For the data data journal:
* sounds like a good and interesting idea
* it makes sense to work together with those guys http://www.datacite.org/
* giving the different data collection types a definition /
specification seems to be mandatory
* it would be important to think about the fact that the data also
requires revision => could be tedious
* working together with different journals to collect / organise the
supplementary data could be an interesting service.

This is what comes to my mind.

-drs-

> for data ï¿½ an open access place for data that are sound, but not

> necessarily expected to have high impact on science. If that sounds
> less than promising, recall the beginnings of PLoS ONE, which was
> initially considered below par, but proved otherwise and became the
> worlds largest journal with lots of value.
>
> Scientists need an open access place for data since there are many
> datasets that do not fit well within existing repositories or are not
> suitable for inclusion in a publication (the so-called long-tail
> problem). With a lightweight documented calling interface (API), apps
> can easily be developed by the community to operate on subsets of the
> data (including reviewing the data) in the same way that apps are
> built to maximize the utility of a piece of hardware or a corpus of

> literature, as is the case with something like Elsevierï¿½s SciVerse.

> The more data are published this way, the more useful the data journal
> becomes, which in turn encourages more data publication. The sum of

> the parts are greater than the whole ï¿½ not true of a typical journal.

>
> A number of pieces will be critical to a successful data journal.
> Understanding the needs of a scientific community and getting its
> feedback will keep the journal relevant; this is an ongoing activity
> effectively is achieved by establishing an appropriate social
> infrastructure. Providing a simple, bullet-proof rights framework will
> ensure that needless barriers to sharing and preservation are
> minimized. Usage and impact metrics should be enabled by citation
> conventions backed up by unique persistent identifiers for datasets

> and researchers themselves. Some concept of ï¿½qualityï¿½, if not peer

> review, also needs to be defined and piloted.
>
> The impact on scientific data analysis could be profound. The devil is

> in the details ï¿½ how can this be made versatile enough to embrace

> Bank (PDB), Genbank, and PharmGKB ï¿½ These are functioning well, but

> are expensive to operate and there is no business model beyond
> continued federal funding. As a result some of these resources have
> closed (e.g., the Arabidopsis database) or are under threat.
>
> 2. Repositories that contain data specifically related to a
> publication, but for which no recognized domain specific repository
> exists. Dryad is an example of such a repository. Publishers have
> associated with these resources since it relieves them of the burden
> of managing the data as a large supplement to the paper, which many
> publishers are ill-equipped to do. These resources are just emerging
> and seem destined to also continue to rely on federal funding or buy-
> in from the publishers, in other words, there is no established
> business model to provide sustainability at this time.
>
> 3. Institutional repositories that manage the data produced by
> researchers at a given institution. At this time their success is far

> from assured ï¿½ what is the incentive for a data producer to put data

> there? What is the business model for their sustainability? How do
> institutional repositories relate to each other? There are exceptions
> (e.g., the California Digital Library) but even there I would say
> there is a failure to achieve broad buy-in by faculty. There need to
> be faculty who have seen a significant gain and who champion such
> developments.
>
> 4. Grass roots efforts, e.g., Sage Bionetworks, which are pushing an
> open data model and are usually domain specific, e.g., disease
> modeling.
>
> In reality, much of the data generated, most of which conforms to the

> long tail ï¿½ the very large number of small datasets not handled {well}

> Notwithstanding the provision of a reward, the barrier to ï¿½publishingï¿½

> with the Data Journal must be kept very low, at least initially. This
> implies collecting only a minimal amount of metadata to provide
> provenance and to identify the dataset uniquely. Over time, based on
> demand from scientists to publish in the Data Journal, the
> requirements can be ramped up, but also offset by increased
> automation. Here are three suggested phases of development:
>

> ï¿½ Phase I (one year) ï¿½ deploy the backend curation repository to

> appropriately store, retrieve, search and audit datasets. Develop
> simple upload mechanisms for small and large datasets. Define a simple
> review and data integrity system to validate the depositions. See

> whether the journal gets any ï¿½data papersï¿½ submitted. This could be

> based on an existing infrastructure at the California Digital Library
> (CDL).
>

> ï¿½ Phase II (one year) ï¿½ If phase I is successful start to accumulate

> more metadata on entries through automated extraction based on data
> format and data type. An example of a popular data format will be MS
> Excel spreadsheets (coincidentally by June 2012 CDL will have released

> an open-source Excel ï¿½add-inï¿½ designed to make spreadsheet data easier

> to publish, share, and preserve); an example of a data type would be
> biological sequence data which has 1-2 well-structured forms and which
> does not belong in any existing repository.
>

> ï¿½ Phase III (on-going) expand the capabilities established in phase II

> to visually respond to common and established data formats and data
> types. Data visualized in a traditional journal is static. Here the
> data can be bought to life, often using tools developed by communities
> themselves and collated by the Data Journal. If successful this will
> become an accepted form of publishing and the depositors/authors will
> treat it like any other journal and the merger between database and
> journal, predicted 6 years ago will be realized.
>
> Where Will This Be Done and By Whom?
> This remains open at this time, but one option is in partnership with
> PLoS or F1000, as a publisher and the University of California as a
> backend provider. The California Digital Library could provide data
> repository, curation, citation, and publishing services (with DataONE
> and DataCite) as well as long experience with aggregators and
> publishers.
>
> As a faculty advocate and database person I am willing to put
> significant effort into this, but it must be a broad team effort.
>
> How and Why Will It Be Funded?
> I have thought long and hard about this since the very successful

> ï¿½Beyond the PDF Workshopï¿½ (BtPDF), which we hosted at UCSD in January

> 2011. I have become convinced that to precipitate change in the way we
> manage scholarship in the digital era we need an exemplar that can
> touch your average scientist, who sees a need for change, but will not

> be part of it without a reward. A Data Journal is such a reward ï¿½

> citation and preservation of an important part of their scholarship.
> With great concern about preservation and open accessibility of
> digital data from funders, publishers and scientists, the time would
> seem right to take this step. The intent is to approach funders who
> are vested in the ideals of BtPDF to seed such a Data Journal
> initiative.
>

> To start we will initiate a joint call with supporters of BtPDF ï¿½ the

> CDL, the National Cancer Institute (NCI), The Gordon and Betty Moore
> Foundation (GBMF), The Doris Duke Foundation, The Alfred P. Sloan
> Foundation, Sage Bionetworks, Science Commons and Microsoft with the
> goal of obtaining short term seed funding for Phase I. Later for Phase
> II the emerging team of like-minded folks who form and seek funding
> from NSF etc. for Phase II and a sustained open access business model
> for Phase III. That is, for Phase III researchers would have put a
> line item in their research budgets to cover the cost of preservation.
> For-profits will be encouraged to pay small amounts to upload data no
> longer of proprietary value and will be compensated with limited
> advertising.
>
> Acknowledgements
> Thus far thanks to Lee Dirks (Microsoft), Josh Greenberg (Sloan
> Foundation), Tony Hey (Microsoft), John Kunze (CDL), David Lipman
> (NLM), Chris Mentzel (GBMF), Mark Patterson (PLoS), Brian
> Schottlaender (UCSD Library) for discussions in coming to the
> conclusions presented here.
>
>

--
Dietrich Rebholz-Schuhmann, MD, PhD - Research Group Leader
EBI, Wellcome Trust Genome Campus, Hinxton CB10 1SD (UK)
- - - - - - - - - - - - - - - - - - - - - - - - - - - - -
TM support:www.ebi.ac.uk/Rebholz-svr | tm-su...@ebi.ac.uk
ISMB/ECCB 2011: http://www.iscb.org/ismbeccb2011
Twitter: jbiomedsem

Paul Groth

unread,

Jun 28, 2011, 12:34:50 PM6/28/11

to beyond-...@googlegroups.com

Phil,

I think this would be fantastic. I also like the bootstrap approach you
outlined. One should require the minimal amount of information to make
it easy to add data.

This would be also be a good start to helping grow a community of
journals for publishing both data and experiments (e.g. Open Research
Computation [1])

regards,
Paul

[1] http://www.openresearchcomputation.com/

--
Dr. Paul Groth (p.t....@vu.nl)
http://www.few.vu.nl/~pgroth/
Knowledge Representation & Reasoning Group
Artificial Intelligence Section
Department of Computer Science
VU University Amsterdam

Tim Clark

unread,

Jun 28, 2011, 1:24:52 PM6/28/11

to beyond-...@googlegroups.com

Hi Phil,

I have a few basic thoughts relating to your idea for a Data Journal.

1 - Different domains of study have differing requirements and approaches regarding data, e.g. biology, social sciences, astronomy, medical (HIPPA), psychology...

2 - Technical keys to the problem may include common metadata specifications, stable identifiers, robust provenance, and reliability of the core data storage facilities.

3 - Social keys to the problem certainly include recruitment / motivation / reward for participation, and more broadly, incorporating a data journal into the workflow / communications ecosystem of the ordinary scientist.

4 - It would probably be mistaken to try to develop a one-size-fits-all complete metadata model for all domains at the outset. Any metadata model should be very basic but "regionally extensible".

5 - There are multiple formats that have been proposed for stable data identifiers, and these differences relate among other things to:
-- different existing repositories for the data -
-- different opinions and approaches to standard identifiers
-- different data citation models

6 - Fundamentally, "stable identifiers" are also a form of metadata - so-called "surrogate keys" in database parlance.

7 - A data repository is really only useful if its datasets are connected via data citation to the methods used to generate it and the interpretations given it by its originators.

I would particularly support creating a repository *pre-aligned with a journal*, for example PLOS itself, with a really useful citation mechanism for existing journals to opt into, requiring it of their contributors. This ensures that your prototype application answers point 7 above.

At the same time, one cannot expect that such a repository would be the only one or would have the "one ring" citation model. So it would also be important to provide for interoperability with other models and other repositories.

I would enjoy discussing this idea further one on one or in a small group.

The Dataverse Network (http://thedata.org) people have a very interesting open source software framework which might well be adaptable in an endeavor such as you propose. Merce Crosas, the Dataverse PM, and I have had a number of discussions on this topic and I am certain she also would be happy to chat further. Anita De Waard and I have discussed similar ideas.

Also recommend looking at some of the outputs of a workshop organized by Merce and her colleagues during May this year, http://projects.iq.harvard.edu/datacitation_workshop/

Best

Tim

The information in this e-mail is intended only for the person to whom it is
addressed. If you believe this e-mail was sent to you in error and the e-mail
contains patient information, please contact the Partners Compliance HelpLine at
http://www.partners.org/complianceline . If the e-mail was sent to you in error
but does not contain patient information, please contact the sender and properly
dispose of the e-mail.

Tim Clark

unread,

Jun 28, 2011, 1:37:40 PM6/28/11

to beyond-...@googlegroups.com

PS Phil, in email below where I said "...PLOS itself..." I meant to say "PLOS One itself".

Begin forwarded message:

Konrad Hinsen

unread,

Jun 28, 2011, 3:58:57 PM6/28/11

to beyond-...@googlegroups.com

On 28 Jun 2011, at 18:10, Phil B wrote:

> I found the announcement yesterday of a new high profile open access
> journal somewhat discouraging (see http://www.hhmi.org/news/20110627.html)
> .

Me too, but then it's not in my domain anyway.

> After much discussion with a number of people I have concluded that a
> logical next step is to consider starting a high visibility data
> journal for the reasons I outline below in a brief rationale. I am
> considering starting to chase funds for such an endeavor and before I
> do so I would very much appreciate the thoughts of this group on
> whether this is a good idea, whether it conflicts with other efforts

This is actually pretty similar to what I recently suggested in reply
to a proposal by Anita to start an "executable journal". The ideal
looks good to me and you have come up with good ideas concerning many
of the organizational aspects.

One topic you do not mention is refereeing. What would your data
journal do to ensure at the very least that the submitted data is what
it claims to be, and is usable by someone other than the submitting
authors? Without some minimal reviewing procedure, some people are
likely to deposit junk data in order to get a citable publication.

Konrad.
--
---------------------------------------------------------------------
Konrad Hinsen
Centre de Biophysique Moléculaire, CNRS Orléans
Synchrotron Soleil - Division Expériences
Saint Aubin - BP 48
91192 Gif sur Yvette Cedex, France
Tel. +33-1 69 35 97 15
E-Mail: research at khinsen dot fastmail dot net
---------------------------------------------------------------------

Jakob Voss

unread,

Jun 28, 2011, 5:02:35 PM6/28/11

to beyond-...@googlegroups.com

Hi Phil,

Technically we should compare a Data Journal with three existing
approaches:

1. Document Repositories like arXiv.
2. Data registries like CKAN
3. Source code repositories like github

I'd favour the third, but scientists will more likely be familiar with
the first. In my opinon it is crucial to be familiar with these
different approaches to know the strength and weaknesses of each
variant. If you add a peer-review to the journal, you will soon limit
its scope to some fields of research, but I think you cannot avoid some
topical focus anyway. Only experts in
a domain can judge data and metadata.

Jakob

--
Verbundzentrale des GBV (VZG)
Digitale Bibliothek - Jakob Vo�
Platz der Goettinger Sieben 1
37073 Goettingen - Germany
+49 (0)551 39-10242
http://www.gbv.de
jakob...@gbv.de

Peter Murray-Rust

unread,

Jun 28, 2011, 6:22:34 PM6/28/11

to beyond-...@googlegroups.com

On Tue, Jun 28, 2011 at 5:10 PM, Phil B <pebo...@gmail.com> wrote:

Hi:

After much discussion with a number of people I have concluded that a
logical next step is to consider starting a high visibility data
journal for the reasons I outline below in a brief rationale. I am
considering starting to chase funds for such an endeavor and before I
do so I would very much appreciate the thoughts of this group on
whether this is a good idea, whether it conflicts with other efforts
etc. I am less concerned about the technology to be used at this stage
and more concerned about whether this would move improved scholarship
in the right direction? Your council would be much appreciated.

In conjunction with IUCr We have started to explore a "data journal" based on the data from Acta Cryst E and using Wordpress as the rendering engine and aiming at HTML5 as the carrier. We have also started to explore this in computational chemistry. The main problem, as always, is people not technology.

The main virtue - and its a small one but important - about the Wellcome/HHMI/MP journal (and I have only read the press release) is that it tries to regain control of the scholarly publishing process from the organised publishing sector which tends to stifle and ossify progress. Whether it will succeed depends on whether academics will give up self-centered publication for real communication. Which I doubt.

I would be delighted to be part of exploring data journals. I think they will might from bottom-up revolution aiming to democratise science. Because publishing data is about changing values and will face enormous opposition from established beneficiaries of the concentration on "the publication" (whether PDF or anything else). I would prefer us to publish things that work, that do things, that can be used. Software, data, protocols, etc. For these, I'm afraid we have to build reward systems. We shouldn't have to, but it seems essential.

P.

--
Peter Murray-Rust
Reader in Molecular Informatics
Unilever Centre, Dep. Of Chemistry
University of Cambridge
CB2 1EW, UK
+44-1223-763069

Daniel Mietchen

unread,

Jun 28, 2011, 7:45:40 PM6/28/11

to beyond-...@googlegroups.com

Hi Phil,

I share your sentiment regarding the HHMI initiative and enjoyed
reading your alternative proposal.

While you have sketched out the technical hurdles and approaches in
quite some detail, I am missing a complement on the cultural side: as
PMR just pointed out, the key to cultural change is community
engagement, and this works better where the community already is than
in new venues. There are reasons why several scope-limited PLoS
journals were started before PLoS ONE, and many of them were cultural.
I think many of these would also apply to the launch of PLoS DATA,
with technical and legal issues further complicating the matter.

One way to go about this would be to try to get existing journals to
adopt a policy to add a statement on data sharing to each article they
publish, just as it is now common practice to have a statement on
potential conflicts of interest. Such data sharing statements are
already in effect at some journals [1] while under consideration at
others [2], and the little cultural change they require could be
achievable in a fair number of communities that do not yet have
mandatory data sharing [3]. If the statement is suitably implemented
(e.g. automatically generated on the basis of responses collected by a
standardized form), some basic metadata generation (and its flagging
[4]) could also be automated, so there would be little effort to be
expanded on the side of authors or publishers.

First results of such an approach (i.e. a standardized statement
generator, and journals using it) could realistically be expected on
the scale of months, thereby raising community awareness of the issue
already during your stage I of technical development. Other
community-based options - acting on different time scales - would be
getting involved with the standardization of technical aspects of data
sharing for scope-limited data journals [5], or even a SCOAP3
approach, which would also address the business model issue [6]. Given
a wider community involvement of such kind, significant community
participation in the technical development should be more easily
achievable than via a purely technical approach.

Note that a reward system is not required for anything thus far, but
these data statements (if machine readable) could serve as a basis for
a reward system that could later be expanded (e.g. to include reuse
metrics) as the technical development proceeds.

Thanks again for this stimulating proposal - I wish you could have
submitted it to a "Journal of Research Proposals" [7] for appropriate
reward.

Cheers,

Daniel

[1] http://dx.doi.org/10.1136/bmj.c564
[2] http://blogs.openaccesscentral.com/blogs/bmcblog/entry/dear_scientist_help_us_to
[3] http://www.datadryad.org/jdap
[4] http://onsclaims.wikispaces.com/
[5] http://www.gbif.org/communications/news-and-events/showsingle/article/new-incentive-for-biodiversity-data-publishing/
[6] http://scoap3.org/
[7] http://evomri.net/?page_id=36

john wilbanks

unread,

Jun 28, 2011, 8:26:29 PM6/28/11

to beyond-...@googlegroups.com

Just to make sure everyone on the list knows, there are already several
"data" journals - we have collected some requirements and some links to
existing efforts in Jonathan Rees's "Recommendations for independent
scholarly publication of data sets" (available at
http://sciencecommons.org/wp-content/uploads/datapaperpaper.pdf ) -
indeed, ecological archives has been publishing data for 11 years.
There's a lot to learn there.

More recently I have been watching the Journal of Network Biology from
the inside, and although there are technical challenges, the social
challenges of what it means to "peer review" a data set make the
technology look simple. And the odds are good that the first pass on
what review means will be wrong, and need to be incrementally improved
for a while, or perhaps even thrown out for something that is less wrong
and upon which incremental innovation can occur.

jtw

--
John Wilbanks
VP for Science
Creative Commons
web: http://creativecommons.org/science
blog: http://scienceblogs.com/commonknowledge
twitter: @wilbanks

Philip Bourne

unread,

Jun 28, 2011, 8:32:40 PM6/28/11

to beyond-...@googlegroups.com, Phil B

Hi:

First a confession - What I am really suggesting is not a Data Journal but a database, but my faculty (my metric for success) do not respect nor understand databases. So lets call it a journal (which they revere) even though it is a database underneath. Its only a white lie :) Now to your points:

I agree that there needs to be a way to cite the data and working with DataCite or others makes perfect sense, but it is not enough to entice your average scientist in my view. You need to be able to cite the dataset also with a journal like citation (journal name, year, volume etc.) and include that in the references to a paper. Even then it will not be given much credence in the beginning, but it could grow in perceived value in same way perception of the value of PLoS ONE articles have changed. What drove the change in my opinion in the perception of ONE was the impact factor and article level metrics showing the value of the papers therein. Data level metrics are a start to show that. You cant ignore the relative value of a paper only cited by the original authors vs a dataset that is downloaded by 100 people and cited by 10 in the papers resulting from its use. In time (I dream) data citations and paper citations will become indistinguishable.

The issues of data types, metadata extraction, validation etc are huge of course, but often solved in part by the respective labs who generate the data, but for their own needs (unless managed by a community database). There lies the beauty (I dream again) these tools now become part of the solution in a shared space and are themselves open.

None of this precludes relationships with journals and the articles that say more about the data that in the data journal itself. In fact publishers should be relieved since it relieves them of the burden of trying to handle data. What makes what I am thinking different than something like Dryad which is doing this with published articles is that there does not have to be a publication now or ever.

Cheers../Phil B

>> for data – an open access place for data that are sound, but not

>> necessarily expected to have high impact on science. If that sounds
>> less than promising, recall the beginnings of PLoS ONE, which was
>> initially considered below par, but proved otherwise and became the
>> worlds largest journal with lots of value.
>>
>> Scientists need an open access place for data since there are many
>> datasets that do not fit well within existing repositories or are not
>> suitable for inclusion in a publication (the so-called long-tail
>> problem). With a lightweight documented calling interface (API), apps
>> can easily be developed by the community to operate on subsets of the
>> data (including reviewing the data) in the same way that apps are
>> built to maximize the utility of a piece of hardware or a corpus of

>> literature, as is the case with something like Elsevier’s SciVerse.

>> The more data are published this way, the more useful the data journal
>> becomes, which in turn encourages more data publication. The sum of

>> the parts are greater than the whole – not true of a typical journal.

>>
>> A number of pieces will be critical to a successful data journal.
>> Understanding the needs of a scientific community and getting its
>> feedback will keep the journal relevant; this is an ongoing activity
>> effectively is achieved by establishing an appropriate social
>> infrastructure. Providing a simple, bullet-proof rights framework will
>> ensure that needless barriers to sharing and preservation are
>> minimized. Usage and impact metrics should be enabled by citation
>> conventions backed up by unique persistent identifiers for datasets

>> and researchers themselves. Some concept of “quality”, if not peer

>> review, also needs to be defined and piloted.
>>
>> The impact on scientific data analysis could be profound. The devil is

>> in the details – how can this be made versatile enough to embrace

>> Bank (PDB), Genbank, and PharmGKB – These are functioning well, but

>> are expensive to operate and there is no business model beyond
>> continued federal funding. As a result some of these resources have
>> closed (e.g., the Arabidopsis database) or are under threat.
>>
>> 2. Repositories that contain data specifically related to a
>> publication, but for which no recognized domain specific repository
>> exists. Dryad is an example of such a repository. Publishers have
>> associated with these resources since it relieves them of the burden
>> of managing the data as a large supplement to the paper, which many
>> publishers are ill-equipped to do. These resources are just emerging
>> and seem destined to also continue to rely on federal funding or buy-
>> in from the publishers, in other words, there is no established
>> business model to provide sustainability at this time.
>>
>> 3. Institutional repositories that manage the data produced by
>> researchers at a given institution. At this time their success is far

>> from assured – what is the incentive for a data producer to put data

>> there? What is the business model for their sustainability? How do
>> institutional repositories relate to each other? There are exceptions
>> (e.g., the California Digital Library) but even there I would say
>> there is a failure to achieve broad buy-in by faculty. There need to
>> be faculty who have seen a significant gain and who champion such
>> developments.
>>
>> 4. Grass roots efforts, e.g., Sage Bionetworks, which are pushing an
>> open data model and are usually domain specific, e.g., disease
>> modeling.
>>
>> In reality, much of the data generated, most of which conforms to the

>> long tail – the very large number of small datasets not handled {well}

>> Notwithstanding the provision of a reward, the barrier to “publishing”

>> with the Data Journal must be kept very low, at least initially. This
>> implies collecting only a minimal amount of metadata to provide
>> provenance and to identify the dataset uniquely. Over time, based on
>> demand from scientists to publish in the Data Journal, the
>> requirements can be ramped up, but also offset by increased
>> automation. Here are three suggested phases of development:
>>

>> • Phase I (one year) – deploy the backend curation repository to

>> appropriately store, retrieve, search and audit datasets. Develop
>> simple upload mechanisms for small and large datasets. Define a simple
>> review and data integrity system to validate the depositions. See

>> whether the journal gets any “data papers” submitted. This could be

>> based on an existing infrastructure at the California Digital Library
>> (CDL).
>>

>> • Phase II (one year) – If phase I is successful start to accumulate

>> more metadata on entries through automated extraction based on data
>> format and data type. An example of a popular data format will be MS
>> Excel spreadsheets (coincidentally by June 2012 CDL will have released

>> an open-source Excel “add-in” designed to make spreadsheet data easier

>> to publish, share, and preserve); an example of a data type would be
>> biological sequence data which has 1-2 well-structured forms and which
>> does not belong in any existing repository.
>>

>> • Phase III (on-going) expand the capabilities established in phase II

>> to visually respond to common and established data formats and data
>> types. Data visualized in a traditional journal is static. Here the
>> data can be bought to life, often using tools developed by communities
>> themselves and collated by the Data Journal. If successful this will
>> become an accepted form of publishing and the depositors/authors will
>> treat it like any other journal and the merger between database and
>> journal, predicted 6 years ago will be realized.
>>
>> Where Will This Be Done and By Whom?
>> This remains open at this time, but one option is in partnership with
>> PLoS or F1000, as a publisher and the University of California as a
>> backend provider. The California Digital Library could provide data
>> repository, curation, citation, and publishing services (with DataONE
>> and DataCite) as well as long experience with aggregators and
>> publishers.
>>
>> As a faculty advocate and database person I am willing to put
>> significant effort into this, but it must be a broad team effort.
>>
>> How and Why Will It Be Funded?
>> I have thought long and hard about this since the very successful

>> “Beyond the PDF Workshop” (BtPDF), which we hosted at UCSD in January

>> 2011. I have become convinced that to precipitate change in the way we
>> manage scholarship in the digital era we need an exemplar that can
>> touch your average scientist, who sees a need for change, but will not

>> be part of it without a reward. A Data Journal is such a reward –

>> citation and preservation of an important part of their scholarship.
>> With great concern about preservation and open accessibility of
>> digital data from funders, publishers and scientists, the time would
>> seem right to take this step. The intent is to approach funders who
>> are vested in the ideals of BtPDF to seed such a Data Journal
>> initiative.
>>

>> To start we will initiate a joint call with supporters of BtPDF – the

Philip Bourne

unread,

Jun 28, 2011, 8:33:59 PM6/28/11

to beyond-...@googlegroups.com

Agreed.. cheers../Phil B