Sorry for the cross post, hope you find this relevant nonetheless.
The organizers and guests of the BioHackathon ('08/'09/'10) in Tokyo
have been invited to propose contributions to a special issue of the
Journal of Biomedical Semantics on topics discussed at these meetings,
or stemming from them.
Christian Zmasek and I have been discussing whether this might be a
good forum to propose a controlled vocabulary of biological events
that can be mapped on trees.
A number of projects that we are well aware of recognize terms for
such events. For example, the Tree of Life emits XML that includes
annotations to indicate that a node in the ToL represents an extinct
taxon. Likewise, phyloXML recognizes speciations and gene
duplications; the eNewick flavours for networks have special keywords
to indicate that a reticulation is the result of lateral gene transfer
(for example).
We feel that there is a good opportunity to present a controlled
vocabulary of these terms, living under a PURL namespace. Visualizers
of any of the commonly-used tree formats (e.g. NHX, phyloXML, eNewick,
NeXML, NEXUS-with-hot-comments) can then decide to plot these
standardized events on trees (e.g. I'm sure you can imagine some sad
icon that can be used to represent extinction), and PhyloWS services
can allow these as search predicates. An ontologized version of this
vocabulary could be developed by extending CDAO.
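As a rough sketch of how such a vocabulary could be consumed, the snippet below maps format-specific annotations to one shared term. Everything named in it is an assumption for illustration: the namespace is a placeholder rather than a registered PURL, the term names are invented, and the native spellings are simplified stand-ins for what each format actually writes.

```python
# Hypothetical namespace: a placeholder, not a registered PURL.
BASE = "http://purl.example.org/tree-events#"

# Illustrative mapping from (format, native annotation) to a shared term.
# The native spellings are simplified stand-ins, not exact format syntax.
NATIVE_TO_TERM = {
    ("nhx", "D=Y"): "gene_duplication",
    ("nhx", "D=N"): "speciation",
    ("phyloxml", "duplications"): "gene_duplication",
    ("phyloxml", "speciations"): "speciation",
    ("tol", "extinct"): "extinction",
    ("enewick", "LGT"): "lateral_gene_transfer",
}


def term_uri(fmt, annotation):
    """Return the shared term URI for a native annotation, or None if unmapped."""
    term = NATIVE_TO_TERM.get((fmt.lower(), annotation))
    return BASE + term if term else None


# Two formats, one event, one identifier.
print(term_uri("NHX", "D=Y"))
print(term_uri("phyloXML", "duplications"))
```

A visualizer or a PhyloWS service would then only need to know the shared identifiers, not each format's spelling.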
Would you be interested in participating in the preparation of such a
manuscript?
Rutger
--
Dr. Rutger A. Vos
School of Biological Sciences
Philip Lyle Building, Level 4
University of Reading
Reading
RG6 6BX
United Kingdom
Tel: +44 (0) 118 378 7535
http://www.nexml.org
http://rutgervos.blogspot.com
Rationale
Millions of trees are generated each year in association with published research. But of these millions of trees, only a tiny fraction is published, typically in the form of a graphical image: a picture to look at. To a computer, such image files are informational dead ends. Of the thousands (tens of thousands?) of trees published each year in association with journal articles, a tiny fraction is archived in a computable electronic form, nearly always as a string with nested parentheses representing clades (the "Newick" format). This exposes the topology and branch lengths of the tree.
However, such trees typically are not adequate for data integration, re-use, and re-purposing. To understand why, consider the following example:
((my_arbitrary_name1:0.34, idiosyncratic_name:0.19):0.11, my_other_name:0.44)
What this tree means depends entirely on what the labels refer to, but the labels are arbitrary. To interpret this tree, to validate it, or to integrate it with other information, we would need to link it with other information, but we can't, because it does not refer to any identifiable entity. In general, if the nodes in the tree are not associated with identifiable information, the structure of the tree has no recoverable biological meaning. Indeed, most trees that are archived lack clearly identifiable information allowing them to be linked with other data, except under the guidance of an expert communicating with the authors of the paper.
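To make this concrete, here is a minimal sketch of what a program can recover from the example string. The regular expression is an assumption that covers only this simple case (unquoted leaf labels followed by a branch length), not the full Newick grammar.

```python
import re

newick = "((my_arbitrary_name1:0.34, idiosyncratic_name:0.19):0.11, my_other_name:0.44)"

# A leaf here is an unquoted label, a colon, and a branch length.
# Internal nodes in this example carry a length but no label, so they do not match.
LEAF = re.compile(r"([A-Za-z_][A-Za-z0-9_]*):([0-9.]+)")

leaves = {name: float(length) for name, length in LEAF.findall(newick)}

for name, length in leaves.items():
    # Nothing in the string says what organism, gene, or sample a label denotes.
    print(f"{name}\tbranch length {length}\tidentity: unknown")
```

The labels and branch lengths come out cleanly, but there is nothing to join them against.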
The ultimate goal of our effort here is to make trees more interoperable. We believe that if the forest of trees produced by researchers each year were computationally accessible, the scientific community would have a much greater capacity to validate and extend phylogeny-based research. The benefits of linked data have been discussed elsewhere (http://www.taxonconcept.org/taxonconcept-blog/2010/8/5/why-linked-open-data-makes-sense-for-biodiversity-informatic.html).
As a step toward this goal, we aim to assess current approaches to publishing trees electronically, in order to educate phylogenetic users, and to identify strengths and weaknesses. This effort is timely for several reasons:
Thus, at the same time that funding agencies, publishers, and the scientific culture are shifting in ways that create incentives for sharing data, new technologies are emerging to make this easier.
I don't know if I agree that the hurdles that you mention (e.g.
identifiability) need to be cleared before we can have this
conversation. What we were thinking about is an essentially
technology-neutral and platform-neutral collection of terms that can
be applied to any data format that accepts "terms", much in the same
way that Dublin Core (and Darwin Core) can be used inside web page
headers, RSS feeds, etc. For example, NHX already recognizes a number
of these terms for biological events, without needing GUIDs, taxonomic
normalization, etc. All we want to do is catalogue them and describe
them.
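
A small sketch of the NHX point: event terms can be read straight out of the tags without resolving any label to an identifier. The example tree is invented, and the reading of D=Y as duplication and D=N as speciation is my understanding of the NHX convention, so treat it as an assumption.

```python
import re

# Matches the body of each [&&NHX:...] comment.
NHX_BLOCK = re.compile(r"\[&&NHX:([^\]]*)\]")


def nhx_tags(newick):
    """Return one {tag: value} dict per [&&NHX:...] comment, in order of appearance."""
    return [
        dict(field.split("=", 1) for field in block.split(":") if "=" in field)
        for block in NHX_BLOCK.findall(newick)
    ]


# Invented example tree; the labels a and b are arbitrary on purpose.
tree = "(a:0.1[&&NHX:S=human],b:0.2[&&NHX:S=human])[&&NHX:D=Y];"

tags = nhx_tags(tree)
events = [t["D"] for t in tags if "D" in t]
print(tags)
print(events)
```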
Rutger