remedying mismatched OTU names

Arlin Stoltzfus

Aug 12, 2011, 1:43:53 PM
to MIAPA, TreeBASE devel
Dear all--

A common problem with data sharing in phylogenetics is that OTU names
do not match between files, e.g., between the alignment and the tree
from the same study. I think I heard it from Bill that this is a
common problem in TreeBASE submissions. I have encountered it many
times and have thought about how to design software to deal with the
problem.

After discussing this with Vivek, I decided to make a more formal
description of the problem which is available here (sorry about the
pptx format):

http://dl.dropbox.com/u/7727158/name_matching.pptx

This includes real examples of mismatched names collected in the wild,
an explanation of why the problem occurs, mock-ups of interactive
user sessions, and implementation notes. Vivek already started
playing with some of the concepts and put an app on appspot (the link
is in the presentation).

Comments are welcome. If implemented as described, how well would
this tool serve the community need for name-matching? What would make
it better?

Arlin
-------
Arlin Stoltzfus (ar...@umd.edu)
Fellow, IBBR; Adj. Assoc. Prof., UMCP; Research Biologist, NIST
IBBR, 9600 Gudelsky Drive, Rockville, MD
tel: 240 314 6208; web: www.molevol.org

Brian O'Meara

Aug 13, 2011, 2:19:00 AM
to Arlin Stoltzfus, MIAPA, TreeBASE devel
I agree that name matching is a problem. There is some recent work that might be of interest:

iPlant has done something similar in their Discovery Environment, just to match up the names between two files. Select a data file and a tree file, and it will find the names that match and then present the remainder for manual matching (there was talk of using fuzzy matching to get good preliminary guesses, but I don't know if that's implemented yet). Its interface is very similar to the one outlined in the slides.
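
As a rough illustration of that workflow (exact matches first, then fuzzy suggestions for the leftovers), here is a minimal Python sketch. It is not iPlant's code: the function name `match_names`, the 0.6 cutoff, and the sample names are assumptions made for the example, and the standard-library `difflib` stands in for whatever fuzzy matcher a real tool would use.

```python
import difflib


def match_names(matrix_names, tree_names, cutoff=0.6):
    """Pair exact matches, then suggest fuzzy candidates for the remainder."""
    tree_set = set(tree_names)
    exact = [name for name in matrix_names if name in tree_set]
    exact_set = set(exact)
    unmatched_matrix = [n for n in matrix_names if n not in exact_set]
    unmatched_tree = [n for n in tree_names if n not in exact_set]

    # For each leftover matrix name, list up to three similar tree names,
    # best first, for a curator to accept or reject.
    suggestions = {
        name: difflib.get_close_matches(name, unmatched_tree, n=3, cutoff=cutoff)
        for name in unmatched_matrix
    }
    return exact, suggestions


# Hypothetical sample data: one exact match, one underscore/space
# difference, and one typo.
matrix = ["Homo_sapiens", "Pan_troglodytes", "Gorilla_gorila"]
tree = ["Homo_sapiens", "Pan troglodytes", "Gorilla_gorilla"]
exact, suggestions = match_names(matrix, tree)
for name, candidates in suggestions.items():
    print(name, "->", candidates)
```

The suggestions are only guesses ranked by string similarity, so the manual confirmation step still matters.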

However, the long-term solution might be automatic name matching. For 30 taxa, fuzzy matching with user curation can work, but there are now trees with tens of thousands of taxa. Having the names in two different files matched to a standard taxonomy [sadly, one has to say "a standard taxonomy" rather than "the standard taxonomy"] will allow them to be paired together as well as connected to existing information. There's a fairly new tool at http://tnrs.iplantcollaborative.org/ that does much of this now. It takes a list of plant names and matches it to a set of names from the Tropicos database. It can correct typos in names, deal with changes in taxonomy [something being moved to a different genus], etc. Due to its current database, it's limited to plants, but it's supposed to be written so that someone else can substitute a different names database. You can set it to automatically select the best match or return a set of possible matches. It also has an API that is pretty easy to use: I wrote a function to call it from within R to convert names on a phylogeny to standardized names (see code here) and it worked on a tree of 50K species.

Brian


_______________________________________
Brian O'Meara
Assistant Professor
Dept. of Ecology & Evolutionary Biology
U. of Tennessee, Knoxville
http://www.brianomeara.info

Students wanted: Applications due Dec. 15, annually
Postdoc collaborators wanted: Check NIMBioS' website
Funding wanted: Want to collaborate on a grant?



--
You received this message because you are subscribed to the Google
Groups "MIAPA" group.
For more options, visit this group at
http://groups.google.com/group/miapa-discuss?hl=en

William Piel

Aug 13, 2011, 11:26:28 AM
to Arlin Stoltzfus, MIAPA, TreeBASE devel
Thanks Arlin.  Indeed, this is a big issue. 

I'd say that there are two major sub-issues:

1. Taxon label consistency among objects within a submission/study. I gather that this is mostly what Arlin et al.'s PPT was addressing: if the set of taxon labels in the alignment doesn't match that of the tree(s), users can't do much with the data until this is fixed. Some minor comments to add:

A- One source of the error comes from different programs having different levels of compliance with NEXUS format. For example, open your tree file in Dendroscope and then save it and you'll find that the rules regarding illegal punctuation and underscore usage will have changed, creating a mismatch with the original NEXUS alignment. Likewise, MacClade automatically converts *all* underscores to spaces even if they are single quoted, whereas Mesquite "hard codes" underscores if the token has single quotes around it. Like Dendroscope, Archaeopteryx saves Newick and NEXUS trees so that the labels change (Christian wasn't aware of the arcane tokenization rules -- we just recently discussed this with him, so this may be fixed soon). This does call for smart algorithms that can read improperly tokenized files (e.g. the "relaxed" setting in PAUP) -- which is tough, seeing as the program has to guess at the meaning of "," or "(" in a Newick string -- is it a new node or a token that was not quoted? And it calls for the ability to synonymize as needed, e.g. automatically recognizing that 'Homo_sapiens_x-2' in one file = 'Homo sapiens x-2' in another file.

B- Mismatches sometimes arise when users try to indicate the Genbank accession numbers for separate locus alignments, but the tree is the result of simultaneous analysis. i.e., one alignment will use "Homo_sapiens_AJ23423", another uses "Homo_sapiens_AJ564667", and the tree uses "Homo_sapiens". It's laudable that they want to include this valuable metadata, but it would be better to code it as metadata in a NeXML file. And this calls for easy-to-use NeXML editors. e.g. add the ability to enter Genbank accession numbers in Mesquite, and then save as NeXML, thus preserving "Homo_sapiens" consistently in all alignments and resulting trees, while still communicating the respective accession numbers for each locus. Summer-of-Code project here. 

C- The basic data model of matrix-rows-matching-with-tree-OTUs works for 99% of datasets, but a growing number of studies use BEAST species inference (and other similar methods) where the tree ends in species OTUs, but the alignment has many more haplotype OTUs. -- i.e. there is, on purpose, a complete mismatch between alignment row labels and tree OTUs. Mesquite can handle this using a taxon association table, though I don't know that this is formal NEXUS or just a Mesquite invention. I don't think that NeXML or PhyloML can handle this. This calls for expanding the capabilities of NeXML and PhyloML.

2. Taxon labels not mapped or not mappable to external authorities or standards. This issue is not really the focus of Arlin et al's PPT, but is what Brian was addressing below. Yet it's equally important for data sharing, if not more so. Some comments:

A- Until taxon concepts are truly identifiable/citable, the mapping of taxon labels to "taxa" will always be imprecise (with precise taxonomic circumscription, usage, and meaning epistemologically impossible to communicate), but at least gross homonyms need to be addressed. This is a challenge for automated services -- the iPlant TNRS has some advantage given that it does not (yet) include animal or bacterial names, but even within a code there are inter-rank homonyms (e.g. "Drosophila" the genus or subgenus?). A "smart" service would resolve the gross homonym based on the topology of the submitted tree -- i.e. ((Aotus,Homo),Lemur) should cause the service to pick Aotus the monkey instead of Aotus the Eudicot. 

B- Abbreviations in the taxon labels make it very difficult to do a smart TNRS lookup. Some of the examples of "resolved" labels in the PPT are nonetheless unacceptable with respect to TNRS resolution. Even something as ubiquitous as "E. coli" could refer to (or be confused with) Entamoeba coli (Grassi, 1879) instead of Escherichia coli (Migula 1895). 

C- Another source of homonyms is virus names. This is a big problem for TreeBASE because TreeBASE's semi-automated name service starts by ignoring trailing strings that start with capital letters or that contain numbers -- e.g. the assumption is that the third part of "Homo_sapiens_AJ23423" is not part of the name, whereas the third part of "Homo_sapiens_sapiens" is part of the name. Yet, while "Neodiprion abietis" is a sawfly, "Neodiprion abietis NPV" is a gammabaculovirus that happens to infect the sawfly -- naturally, TreeBASE first tries to match the beginning part of the virus name to the host name, and the submitter needs to be sharp enough to notice and correct the problem. I'm going to guess that iPlant's TNRS will map "Ammi majus latent virus" to bishop's-weed, A. majus instead of to a Potyvirus.
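
Two of the points above lend themselves to a small Python sketch: the synonymizing of labels that differ only in quoting and underscores (1A), and the trailing-string heuristic (2C). This is only an illustration under stated assumptions. The function names `comparison_key` and `split_trailing_code` are hypothetical, the key is deliberately looser than strict NEXUS tokenization (it ignores the rule that quoted underscores are literal), and the second function follows the heuristic as described in this message, not TreeBASE's actual code.

```python
import re


def comparison_key(label):
    """Reduce a taxon label to a loose key for cross-file comparison."""
    label = label.strip()
    # Strip NEXUS-style single quotes and unescape doubled quotes inside them.
    if len(label) >= 2 and label[0] == "'" and label[-1] == "'":
        label = label[1:-1].replace("''", "'")
    # Treat underscores and runs of whitespace as one space; ignore case.
    return re.sub(r"[_\s]+", " ", label).strip().lower()


def split_trailing_code(label):
    """Split off trailing tokens that start with a capital or contain a digit."""
    tokens = re.split(r"[_\s]+", label.strip().strip("'"))
    cut = len(tokens)
    # Walk back from the end while tokens look like codes rather than
    # epithets. The first token (the genus) is never dropped.
    while cut > 1 and (
        tokens[cut - 1][:1].isupper()
        or any(ch.isdigit() for ch in tokens[cut - 1])
    ):
        cut -= 1
    return " ".join(tokens[:cut]), " ".join(tokens[cut:])


print(comparison_key("'Homo_sapiens_x-2'") == comparison_key("Homo sapiens x-2"))
print(split_trailing_code("Homo_sapiens_AJ23423"))
print(split_trailing_code("Homo_sapiens_sapiens"))
print(split_trailing_code("Neodiprion abietis NPV"))
```

The last call shows the pitfall: "NPV" is discarded as if it were an accession code, so the virus label collapses onto the name of its host.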

bp

Rutger Vos

Aug 15, 2011, 12:13:38 AM
to William Piel, Arlin Stoltzfus, MIAPA, TreeBASE devel
> this calls for easy-to-use NeXML editors. e.g. add the ability to enter
> Genbank accession numbers in Mesquite, and then save as NeXML, thus
> preserving "Homo_sapiens" consistently in all alignments and resulting
> trees, while still communicating the respective accession numbers for each
> locus. Summer-of-Code project here.

Indeed.

> C- The basic data model of matrix-rows-matching-with-tree-OTUs works for 99%
> of datasets, but a growing number of studies use BEAST species inference
> (and other similar methods) where the tree ends in species OTUs, but the
> alignment has many more haplotype OTUs. -- i.e. there is, on purpose, a
> complete mismatch between alignment row labels and tree OTUs. Mesquite can
> handle this using a taxon association table, though I don't know that this
> is formal NEXUS or just a Mesquite invention. I don't think that NeXML or
> PhyloML can handle this. This calls for expanding the capabilities of NeXML
> and PhyloML.

Yes and no. Multiple matrix rows can reference the same otu, but
that's not quite what we want. Multiple, separately annotatable matrix
row segments would be a good feature to have, also for TreeBASE's
needs.

--
Dr. Rutger A. Vos
School of Biological Sciences
Philip Lyle Building, Level 4
University of Reading
Reading, RG6 6BX, United Kingdom
Tel: +44 (0) 118 378 7535
http://rutgervos.blogspot.com

Roderic Page

Aug 15, 2011, 4:09:09 AM
to Rutger Vos, William Piel, MIAPA, TreeBASE devel, Arlin Stoltzfus
There are some additional tricks that could be used.

Mapping tree names to matrix names could be formulated as a bipartite matching problem, where we have two lists of names and want to find the best matching. See http://iphylo.blogspot.com/2007/09/matching-names-in-phylogeny-data-files.html for more details.

This approach could be extended to, say, matching names in a NEXUS file to those in a publication, or a GenBank POPSET from a publication. For example, if we have a NEXUS file and a POPSET we could compute the best matching between the two sets of names. Or taxon names and/or accession numbers could be retrieved from the publication.

This would also help provide the context to help avoid homonyms, such as matching animal names to plant names. 
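
To make the bipartite formulation concrete, here is a minimal Python sketch. It scores every pair of names by the length of their longest common substring and then picks the one-to-one pairing with the highest total score, using a textbook Hungarian (Kuhn-Munkres) routine. The function names and sample labels are assumptions for illustration, and this is not the code from the blog post linked above.

```python
import difflib


def lcs_length(a, b):
    """Length of the longest common substring of a and b."""
    matcher = difflib.SequenceMatcher(None, a, b, autojunk=False)
    return matcher.find_longest_match(0, len(a), 0, len(b)).size


def hungarian(cost):
    """Minimum-cost assignment of rows to columns; needs rows <= columns."""
    n, m = len(cost), len(cost[0])
    inf = float("inf")
    u, v = [0] * (n + 1), [0] * (m + 1)   # row and column potentials
    p, way = [0] * (m + 1), [0] * (m + 1)  # p[j]: row matched to column j
    for i in range(1, n + 1):
        p[0], j0 = i, 0
        minv, used = [inf] * (m + 1), [False] * (m + 1)
        while True:
            used[j0] = True
            i0, delta, j1 = p[j0], inf, 0
            for j in range(1, m + 1):
                if not used[j]:
                    cur = cost[i0 - 1][j - 1] - u[i0] - v[j]
                    if cur < minv[j]:
                        minv[j], way[j] = cur, j0
                    if minv[j] < delta:
                        delta, j1 = minv[j], j
            for j in range(m + 1):
                if used[j]:
                    u[p[j]] += delta
                    v[j] -= delta
                else:
                    minv[j] -= delta
            j0 = j1
            if p[j0] == 0:
                break
        while j0 != 0:  # flip the augmenting path back to the start
            j1 = way[j0]
            p[j0] = p[j1]
            j0 = j1
    return {p[j] - 1: j - 1 for j in range(1, m + 1) if p[j] != 0}


def best_matching(matrix_names, tree_names):
    """Map matrix names to tree names, maximizing total substring overlap."""
    rows, cols = matrix_names, tree_names
    swapped = len(rows) > len(cols)
    if swapped:
        rows, cols = cols, rows
    if not rows:
        return {}
    # Negate the scores so that minimizing cost maximizes overlap.
    cost = [[-lcs_length(r, c) for c in cols] for r in rows]
    pairs = {rows[i]: cols[j] for i, j in hungarian(cost).items()}
    return {v: k for k, v in pairs.items()} if swapped else pairs


# Hypothetical labels: the matrix carries accession-like suffixes.
matrix = ["Homo_sapiens_AJ23423", "Pan_troglodytes_X1", "Gorilla_gorilla_Z9"]
tree = ["Gorilla gorilla", "Homo sapiens", "Pan troglodytes"]
for matrix_name, tree_name in best_matching(matrix, tree).items():
    print(matrix_name, "->", tree_name)
```

When the lists differ in length, the surplus names on the longer side are simply left unmatched.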

Regards

Rod


_______________________________________________
Treebase-devel mailing list
Treebas...@lists.sourceforge.net
https://lists.sourceforge.net/lists/listinfo/treebase-devel



---------------------------------------------------------
Roderic Page
Professor of Taxonomy
Institute of Biodiversity, Animal Health and Comparative Medicine
College of Medical, Veterinary and Life Sciences
Graham Kerr Building
University of Glasgow
Glasgow G12 8QQ, UK

Email: r.p...@bio.gla.ac.uk
Tel: +44 141 330 4778
Fax: +44 141 330 2792
AIM: rodpa...@aim.com
Facebook: http://www.facebook.com/profile.php?id=1112517192
Twitter: http://twitter.com/rdmpage
Blog: http://iphylo.blogspot.com
Home page: http://taxonomy.zoology.gla.ac.uk/rod/rod.html

Arlin Stoltzfus

Aug 19, 2011, 11:06:41 AM
to Roderic Page, MIAPA, TreeBASE devel
On Aug 15, 2011, at 4:09 AM, Roderic Page wrote:

> Mapping tree names to matrix names could be formulated as a bipartite matching problem, where we have two lists of names and want to find the best matching. See http://iphylo.blogspot.com/2007/09/matching-names-in-phylogeny-data-files.html for more details.

In computer science, this is called the "marriage problem" when the two lists are the same size.  We have a set { X } and a set { Y } of elements with some properties.  We have a function f( X_i, Y_j ) that computes a match score for each pair, using the properties.  In our case, the only property is the name-string.  The marriage problem is to find a pairwise mapping that is optimal in some way.   If optimality means minimizing the cost of the worst match, then this is (apparently, to me) the same as the linear bottleneck assignment problem.  
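
The choice of optimality criterion can change the answer. The Python sketch below compares two objectives on a made-up 2 x 2 cost matrix: minimizing the total cost (the classical linear assignment problem) and minimizing the worst single cost (the linear bottleneck assignment problem). The matrix values and the function name `best_assignment` are assumptions for illustration, and the exhaustive search is only practical for very small lists.

```python
from itertools import permutations


def best_assignment(cost, objective):
    """Try every one-to-one assignment of rows to columns (square matrix)."""
    n = len(cost)
    best_perm, best_value = None, None
    for perm in permutations(range(n)):
        # perm[i] is the column assigned to row i.
        value = objective(cost[i][perm[i]] for i in range(n))
        if best_value is None or value < best_value:
            best_perm, best_value = perm, value
    return best_perm, best_value


# Hypothetical match costs: rows are names in one file, columns are
# names in the other, and lower means a better match.
cost = [
    [0, 3],
    [3, 5],
]

print(best_assignment(cost, sum))  # minimize the total cost
print(best_assignment(cost, max))  # minimize the worst single match
```

Here the total-cost objective pairs row 0 with column 0 and accepts one poor match of cost 5, while the bottleneck objective swaps the pairing so that no match costs more than 3.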

An obvious function to use (not necessarily the best for our case) is the edit distance, i.e., the number of character-wise edit operations to convert X_i into Y_j.   This is called the Levenshtein distance (http://en.wikipedia.org/wiki/Levenshtein_distance). 
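
For reference, a minimal Python sketch of the Levenshtein distance, using the standard dynamic-programming recurrence with unit costs and two rows of memory. The function name and sample strings are chosen for illustration.

```python
def levenshtein(a, b):
    """Number of single-character edits needed to turn a into b."""
    # previous[j] holds the distance between the prefix of a processed
    # so far and the first j characters of b.
    previous = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        current = [i]
        for j, cb in enumerate(b, start=1):
            current.append(min(
                previous[j] + 1,               # delete ca
                current[j - 1] + 1,            # insert cb
                previous[j - 1] + (ca != cb),  # substitute, or free match
            ))
        previous = current
    return previous[-1]


print(levenshtein("Homo_sapiens", "Homo sapiens"))
print(levenshtein("Homo_sapiens", "Homo_sapiens_AJ23423"))
```

With unit costs, an underscore-for-space swap counts the same as any other typo, and an appended accession-like suffix costs one edit per character.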

But there is nothing to stop us from creating a distance function that is optimized to work well in phyloinformatics.  We could test different functions using real cases such as the ones in my slideshow.  

One special condition is that, for us, the cost of a s/<underscore>/<space> / edit is very low.   Another special condition is reflected in Rod's longest-common-substring method of matching-- we often have pairs of matching names that have long matching substrings and differ by interruptions.   Maybe we need a gap-open and gap-extend penalty like in sequence alignment algorithms. 
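
One way such a tailored distance could look is sketched below in Python: a global alignment cost with a cheap underscore/space substitution and affine gap penalties, following the usual three-state recurrence from sequence alignment. The specific weights (0.1 for a separator swap, 1.0 to open a gap, 0.1 to extend it) and the function names are assumptions chosen for illustration, not values tested on real data.

```python
INF = float("inf")
SEPARATORS = {"_", " "}


def sub_cost(x, y):
    """Substitution cost; underscore/space swaps are nearly free (assumed)."""
    if x == y:
        return 0.0
    if x in SEPARATORS and y in SEPARATORS:
        return 0.1
    return 1.0


def affine_name_distance(a, b, gap_open=1.0, gap_extend=0.1):
    """Alignment cost where a gap of length k costs gap_open + (k-1)*gap_extend."""
    n, m = len(a), len(b)
    # match[i][j]: best cost with a[i-1] aligned to b[j-1]
    # gap_b[i][j]: best cost with a[i-1] aligned to a gap (deleted from a)
    # gap_a[i][j]: best cost with b[j-1] aligned to a gap (inserted from b)
    match = [[INF] * (m + 1) for _ in range(n + 1)]
    gap_b = [[INF] * (m + 1) for _ in range(n + 1)]
    gap_a = [[INF] * (m + 1) for _ in range(n + 1)]
    match[0][0] = 0.0
    for i in range(1, n + 1):
        gap_b[i][0] = gap_open + (i - 1) * gap_extend
    for j in range(1, m + 1):
        gap_a[0][j] = gap_open + (j - 1) * gap_extend
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            match[i][j] = sub_cost(a[i - 1], b[j - 1]) + min(
                match[i - 1][j - 1], gap_b[i - 1][j - 1], gap_a[i - 1][j - 1])
            gap_b[i][j] = min(match[i - 1][j] + gap_open,
                              gap_a[i - 1][j] + gap_open,
                              gap_b[i - 1][j] + gap_extend)
            gap_a[i][j] = min(match[i][j - 1] + gap_open,
                              gap_b[i][j - 1] + gap_open,
                              gap_a[i][j - 1] + gap_extend)
    return min(match[n][m], gap_b[n][m], gap_a[n][m])


print(affine_name_distance("Homo_sapiens", "Homo sapiens"))
print(affine_name_distance("Homo_sapiens", "Homo_sapiens_AJ23423"))
```

Under these weights a single long interruption, such as an appended accession number, is penalized far less than the same number of scattered edits.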

Arlin

Hilmar Lapp

Aug 22, 2011, 8:39:32 PM
to Mark Holder, MIAPA, TreeBASE devel
The developer of that is here at the BioHackathon, so let me know if I should pull him aside and ask questions.

-hilmar

Sent with a tap.

On Aug 22, 2011, at 6:39 PM, Mark Holder <mtho...@gmail.com> wrote:

Hi all,
I just noticed that Hilmar tweeted a link to Linnaeus:  http://linnaeus.sourceforge.net/ which seems relevant to this thread.

all the best,
Mark


Hilmar Lapp

Aug 27, 2011, 5:28:43 AM
to Mark Holder, MIAPA, TreeBASE devel
I spoke with the developer, Martin Gerner. He thought it might well be applicable to this task, even though the tool does a lot more than we possibly need here. For example, it tokenizes the input, and is also capable of applying some special "inference" rules (for instance, "HeLa cells" will be tagged with "Homo sapiens") that are quite useful if the purpose is linking text to knowledge terms, but that go beyond simple synonym matching (which it also does). The dictionaries are pluggable, and apparently it is quite fast in principle.

-hilmar

On Aug 22, 2011, at 6:39 PM, Mark Holder wrote:

Hi all,
I just noticed that Hilmar tweeted a link to Linnaeus:  http://linnaeus.sourceforge.net/ which seems relevant to this thread.

all the best,
Mark

On Aug 19, 2011, at 11:06 AM, Arlin Stoltzfus wrote:


-- 
===========================================================
: Hilmar Lapp  -:- Durham, NC -:- informatics.nescent.org :
===========================================================



Arlin Stoltzfus

Nov 4, 2011, 1:53:49 PM
to Hilmar Lapp, Mark Holder, MIAPA, TreeBASE devel
I added a link to Linnaeus (see last slide) in my PowerPoint presentation of this problem:


Let me know if there are other resources that I should note.  That way we won't lose the knowledge that we accumulated in this discussion thread.  

Arlin
