biological events on trees

Rutger Vos

Oct 14, 2010, 4:51:25 AM

to PhyloWS Google Group, EvoIO Google Group, ToLWeb2.0 Google Group, CDAO list, NeXML-discuss (list)

Hi all,

Sorry for the cross post, hope you find this relevant nonetheless.

The organizers and guests of the BioHackathon ('08/'09/'10) in Tokyo
have been invited to propose contributions to a special issue of the
Journal of Biomedical Semantics on topics discussed at these meetings,
or stemming from them.

Christian Zmasek and I have been discussing whether this might be a
good forum to propose a controlled vocabulary of biological events
that can be mapped on trees.

A number of projects that we are well aware of recognize terms for
such events. For example, the Tree of Life emits XML that includes
annotations to indicate that a node in the ToL represents an extinct
taxon. Likewise, phyloXML recognizes speciations and gene
duplications; the eNewick flavours for networks have special keywords
to indicate that a reticulation is the result of lateral gene transfer
(for example).

We feel that there is a good opportunity to present a controlled
vocabulary of these terms, living under a PURL namespace. Visualizers
of any of the commonly-used tree formats (e.g. NHX, phyloXML, eNewick,
NeXML, NEXUS-with-hot-comments) can then decide to plot these
standardized events on trees (e.g. I'm sure you can imagine some sad
icon that can be used to represent extinction), and PhyloWS services
can allow these as search predicates. An ontologized version of this
vocabulary could be developed by extending CDAO.
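
To make the idea concrete, here is a minimal Python sketch of what such a vocabulary could look like to a tool author. The four terms are the events mentioned above; the namespace URL, the `TreeEvent` class and the `annotate` helper are hypothetical names invented for illustration, not an existing PURL or API.

```python
from enum import Enum

# Hypothetical namespace: a placeholder, not a registered PURL.
NAMESPACE = "http://purl.example.org/tree-events#"


class TreeEvent(Enum):
    """Illustrative event terms taken from the examples in this message."""
    EXTINCTION = "extinction"
    SPECIATION = "speciation"
    GENE_DUPLICATION = "gene_duplication"
    LATERAL_GENE_TRANSFER = "lateral_gene_transfer"

    @property
    def uri(self):
        # A term is identified by the namespace plus its local name.
        return NAMESPACE + self.value


def annotate(node_events, node_id, event):
    """Record that `event` applies to the node called `node_id`."""
    node_events.setdefault(node_id, []).append(event.uri)
    return node_events


annotations = annotate({}, "node42", TreeEvent.EXTINCTION)
```

Because each term resolves to a single URI, a visualizer or a PhyloWS query could key on that URI regardless of which tree format the annotation came from.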

Would you be interested in participating in the preparation of such a
manuscript?

Rutger

--
Dr. Rutger A. Vos
School of Biological Sciences
Philip Lyle Building, Level 4
University of Reading
Reading
RG6 6BX
United Kingdom
Tel: +44 (0) 118 378 7535
http://www.nexml.org
http://rutgervos.blogspot.com

Arlin Stoltzfus

Oct 14, 2010, 3:37:15 PM

to PhyloWS Google Group, EvoIO Google Group, ToLWeb2.0 Google Group, CDAO list, NeXML-discuss (list)

Rutger, I think it's a good idea, but I'm not sure that the world is ready for this.

At the TDWG (Biodiversity Information Standards) meeting last month, a few of us started to work on a simpler goal of assessing current best practices for publishing trees electronically, so that they can be integrated or re-purposed. To achieve this, at minimum, the trees need to refer to something identifiable, ideally via a GUID (globally unique identifier). Most file formats in common use can't handle that minimum requirement, even if we could get the experts to agree on a precise syntax.

I'll announce this project later in the month. In case anyone is interested, the rationale is given below this message, and our preliminary report is being assembled here:

http://wiki.tdwg.org/twiki/bin/view/Phylogenetics/LinkingTrees2010

Arlin

-------

Arlin Stoltzfus (ar...@umd.edu)

Fellow, IBBR; Adj. Assoc. Prof., UMCP; Research Biologist, NIST

IBBR, 9600 Gudelsky Drive, Rockville, MD

tel: 240 314 6208; web: www.molevol.org

----

Rationale

Millions of trees are generated each year in association with published research. But, of these millions of trees, only a tiny fraction is published, typically as a graphical image: a picture to look at. To a computer, such image files are informational dead-ends. Of the thousands (tens of thousands?) of trees published each year in association with journal articles, a tiny fraction is archived in a computable electronic form, nearly always as a string with nested parentheses representing clades (the "Newick" format). This exposes the topology and branch lengths of the tree.

However, such trees typically are not adequate for data integration, re-use, and re-purposing. To understand why, consider the following example:

((my_arbitrary_name1:0.34, idiosyncratic_name:0.19):0.11, my_other_name:0.44)

What this tree means depends entirely on what the labels refer to, but the labels are arbitrary. To interpret this tree, to validate it, or to integrate it with other information, we would need to link it with other data, but we can't, because it does not refer to any identifiable entity. In general, if the nodes in the tree are not associated with identifiable information, the structure of the tree has no recoverable biological meaning. Indeed, most archived trees lack clearly identifiable information allowing them to be linked with other data, except under the guidance of an expert communicating with the authors of the paper.
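
The problem can be illustrated with a short Python sketch that pulls the leaf labels out of the example tree and reports which ones cannot be resolved to an identifier. The `leaf_labels` and `unresolved_labels` helpers and the identifier lookup table are hypothetical; the regular expression handles only simple unquoted labels such as those in the example, not the full Newick grammar.

```python
import re

NEWICK = ("((my_arbitrary_name1:0.34, idiosyncratic_name:0.19):0.11, "
          "my_other_name:0.44)")

# A leaf label follows "(" or "," and runs until the next structural
# character. Quoted labels and internal-node labels are not handled.
LEAF_LABEL = re.compile(r"[(,]\s*([^(),:;\s]+)")


def leaf_labels(newick):
    """Return the leaf labels of a simple Newick string, in order."""
    return LEAF_LABEL.findall(newick)


def unresolved_labels(newick, identifiers):
    """Return the labels that have no entry in the `identifiers` lookup."""
    return [label for label in leaf_labels(newick)
            if label not in identifiers]


# Hypothetical lookup from label to a globally unique identifier.
# It is empty here because the example labels refer to nothing identifiable.
known_identifiers = {}
missing = unresolved_labels(NEWICK, known_identifiers)
```

With an empty lookup table every label comes back as unresolved, which is the situation described above: the topology and branch lengths are readable, but nothing in the tree can be linked to other data.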

The ultimate goal of our effort here is to make trees more interoperable. We believe that if the forest of trees produced by researchers each year were computationally accessible, the scientific community would have a much greater capacity to validate and extend phylogeny-based research. The benefits of linked data have been discussed elsewhere (http://www.taxonconcept.org/taxonconcept-blog/2010/8/5/why-linked-open-data-makes-sense-for-biodiversity-informatic.html).

As a step toward this goal, we aim to assess current approaches to publishing trees electronically, in order to educate phylogenetic users, and to identify strengths and weaknesses. This effort is timely for several reasons:

* While in the past many scientists felt no incentive to share data, recent research has shown that making data available in public archives increases citations (ref: Piwowar, research remix), widely understood as an indicator of professional success;
* In early 2010, some key journals in evolution and systematics announced plans to implement a data-archiving policy: to publish in these journals, researchers will need to start archiving their trees;
* The only major electronic repository of trees, TreeBase, has recently completed a major upgrade of features, including its submission process; at the same time, a more loosely structured archive called Dryad was launched and will accept various kinds of electronic files, including those with trees;
* NSF has recently increased its requirements for data-sharing plans in grant proposals. Thus, scientists will be motivated by funding agencies to share data electronically;
* In recent years, phyloinformatics researchers have been developing supporting technologies to enable interoperability, including XML file formats (NeXML, phyloXML), an ontology (CDAO) and a web-services standard (PhyloWS).

Thus, at the same time that funding agencies, publishers, and the scientific culture are shifting in ways that create incentives for sharing data, new technologies are emerging to make this easier.

Rutger Vos

Oct 20, 2010, 8:50:59 AM

to Arlin Stoltzfus, PhyloWS Google Group, EvoIO Google Group, ToLWeb2.0 Google Group, CDAO list, NeXML-discuss (list)

Hi Arlin,

I'm not sure I agree that the hurdles you mention (e.g.
identifiability) need to be cleared before we can have this
conversation. What we were thinking about is an essentially
technology-neutral and platform-neutral collection of terms that can
be applied to any data format that accepts "terms", much in the same
way that Dublin Core (and Darwin Core) can be used inside web page
headers, RSS feeds, etc. For example, NHX already recognizes a number
of these terms for biological events, without needing GUIDs, taxonomic
normalization, etc. All we want to do is catalogue them and describe
them.
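
As a rough illustration of the NHX case, the Python sketch below reads event annotations from NHX comments without any GUIDs or taxonomic normalization. It rests on assumptions that should be checked against the NHX documentation: that the `D` tag distinguishes duplication from speciation, and that tools write its value as Y/N or T/F. The term names and the `events_in` helper are hypothetical.

```python
import re

# NHX annotations are embedded in Newick as comments: [&&NHX:key=value:...]
NHX_COMMENT = re.compile(r"\[&&NHX:([^\]]*)\]")


def nhx_tags(comment_body):
    """Split 'D=Y:S=human' into {'D': 'Y', 'S': 'human'}."""
    tags = {}
    for field in comment_body.split(":"):
        if "=" in field:
            key, value = field.split("=", 1)
            tags[key] = value
    return tags


def events_in(newick_nhx):
    """List the event term for each NHX comment that carries a D tag."""
    events = []
    for match in NHX_COMMENT.finditer(newick_nhx):
        flag = nhx_tags(match.group(1)).get("D")
        # Assumption: Y/T marks a duplication, N/F marks a speciation.
        if flag in ("Y", "T"):
            events.append("gene_duplication")
        elif flag in ("N", "F"):
            events.append("speciation")
    return events


tree = ("((a:0.1[&&NHX:S=human],b:0.2[&&NHX:S=human]):0.05[&&NHX:D=Y],"
        "c:0.3)[&&NHX:D=N];")
found = events_in(tree)
```

The point of the sketch is that the terms are already in the file; cataloguing them only requires agreeing on what each one is called and what it means.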

Rutger
