Duplicated OTT entries

Yan Wong

Oct 8, 2015, 12:26:29 PM
to Open Tree of Life
Out of interest, what is the process by which OTT taxonomy entries with overlapping source IDs are produced? Here's one example:

4795683 |       987171  |       Lentisphaerae bacterium 8KG-4   |       no rank - terminal      |       ncbi:760260,silva:AB558581
....
5291394 |       4795682 |       Oligosphaera ethanolica |       species |       ncbi:760260

I assume that in this case ncbi maps 2 species names as synonyms, with the same ncbi number, but OToL has reason to separate them - perhaps silva has them separate?

Jonathan A Rees

Oct 8, 2015, 1:47:03 PM
to opentre...@googlegroups.com
Let's see... in OTT 2.8, the silva cluster is mapped to a genbank id (the one for its reference sequence), and that gets mapped to NCBI taxonomy via the genbank record. That's the NCBI taxon id that shows up in the silva taxon record. Separately, at a different time, NCBI Taxonomy was loaded and all the nonconflicting taxa were added to OTT. So my guess would be version skew; that NCBI taxon id might have been labeled as Lentisphaerae at one time, and relabeled as Oligosphaera at a later time, perhaps after a 'nomenclatural act' of some kind, or a correction to the genbank record (maybe a misidentification being corrected?).

Putting the NCBI taxon id in the silva taxon record is useful for the way smasher works internally but has caused a lot of confusion, so I would like to remove the NCBI ids and come up with a different way to do the silva/ncbi alignment. When there is a match, there should be one OTT record showing both silva and ncbi, but when there is a mismatch (as here) there would be two different OTT records.

I'll file an issue for this.

Yan Wong

Oct 8, 2015, 4:25:26 PM
to opentre...@googlegroups.com
On 8 Oct 2015, at 18:47, Jonathan A Rees <re...@mumble.net> wrote:

> Putting the NCBI taxon id in the silva taxon record is useful for the way smasher works internally but has caused a lot of confusion, so I would like to remove the NCBI ids and come up with a different way to do the silva/ncbi alignment.

Perhaps you could call it ncbi_silva:XXXX in the cases where the id has been sourced indirectly like this. It would be a little less confusing, but still retain the data.

Yan

Yan Wong

Oct 9, 2015, 4:38:53 AM
to Open Tree of Life
From a very quick perusal, it seems that when an ncbi ID is immediately followed by a silva ID in taxonomy.tsv, it is likely to have come via this roundabout route. In other cases, the ncbi ID is placed after the silva ID. Not a robust way of detecting the difference, but perhaps something to go on until a solution is found.
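
The ordering heuristic described above could be sketched roughly as follows. This is only an illustration, not part of smasher or any OTT tooling: the function name is made up, and it assumes the source-ID column is a comma-separated list of prefix:id entries, as in the rows quoted earlier in the thread.

```python
def ncbi_immediately_before_silva(sourceinfo):
    """Return True if an ncbi ID is directly followed by a silva ID.

    `sourceinfo` is assumed to look like "ncbi:760260,silva:AB558581".
    As noted in the thread, this is a rough hint, not a robust test.
    """
    # Keep only the prefix ("ncbi", "silva", ...) of each source entry.
    prefixes = [
        entry.split(":", 1)[0].strip()
        for entry in sourceinfo.split(",")
        if entry.strip()
    ]
    # Look at each adjacent pair of prefixes in order.
    return any(a == "ncbi" and b == "silva" for a, b in zip(prefixes, prefixes[1:]))


# The two records from the first message in this thread:
via_silva = ncbi_immediately_before_silva("ncbi:760260,silva:AB558581")
direct = ncbi_immediately_before_silva("ncbi:760260")
```

With the two records from the first message, the Lentisphaerae entry is flagged and the Oligosphaera entry is not.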

Yan Wong

Oct 9, 2015, 6:05:52 AM
to Open Tree of Life
I've just looked through the entire 2.9 taxonomy.tsv file, and apart from 3 entries, the ncbi IDs duplicated via silva are the only cases where taxa share ncbi, worms, gbif, or if numbers. I thought that you might want to know the 3 exceptions, at least 2 of which are misspellings.

5316647 |       324856  |       Togniniella acerosa     |       species |       ncbi:312314,if:519582,gbif:2565673      |               |               |       
4927483 |       324856  |       Togniniella microspore  |       species |       if:519582       |               |       edited  |       

5302390 |       770426  |       Hyphodiscus hyaloscyphoides     |       species |       ncbi:1071709,if:518017  |               |               |       
4926725 |       770426  |       Hyphodiscus hyaloscyphoide      |       species |       if:518017       |               |       edited  |       

358854  |       959450  |       Marasmius kuthubutheenii        |       species |       if:512635,ncbi:650431   |               |               |       
4935920 |       959450  |       Marasmius kuthubutheeni |       species |       if:512635       |               |       edited  |       
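
For anyone wanting to repeat this kind of check, a minimal sketch is given below. It is not the script used for the search above; it assumes the first five "|"-separated columns are uid, parent uid, name, rank, and source IDs (as in the rows quoted in this thread), and the function names are invented for the example.

```python
from collections import defaultdict


def parse_row(line):
    """Split one taxonomy.tsv-style row into (uid, name, [source ids])."""
    fields = [f.strip() for f in line.rstrip("\n").split("|")]
    # Skip the header row, blank lines, and anything too short to parse.
    if len(fields) < 5 or not fields[0].isdigit():
        return None
    uid, _parent, name, _rank, sourceinfo = fields[:5]
    sources = [s.strip() for s in sourceinfo.split(",") if s.strip()]
    return uid, name, sources


def shared_source_ids(lines, prefixes=("ncbi", "worms", "gbif", "if")):
    """Return {source id: [(uid, name), ...]} for IDs used by 2+ taxa."""
    seen = defaultdict(list)
    for line in lines:
        row = parse_row(line)
        if row is None:
            continue
        uid, name, sources = row
        for src in sources:
            if src.split(":", 1)[0] in prefixes:
                seen[src].append((uid, name))
    return {src: taxa for src, taxa in seen.items() if len(taxa) > 1}


# Two of the rows listed above, with tab-and-bar separators assumed.
sample = [
    "5316647\t|\t324856\t|\tTogniniella acerosa\t|\tspecies\t|\tncbi:312314,if:519582,gbif:2565673\t|\t\t|\t\t|\t",
    "4927483\t|\t324856\t|\tTogniniella microspore\t|\tspecies\t|\tif:519582\t|\t\t|\tedited\t|\t",
]
duplicates = shared_source_ids(sample)
```

To run it over a whole file, pass an open file handle instead of the sample list, e.g. `shared_source_ids(open("taxonomy.tsv", encoding="utf-8"))`. Note that this reports the silva-derived ncbi duplicates as well, so those would still need to be filtered out separately.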

Jonathan A Rees

Oct 9, 2015, 4:07:48 PM
to opentre...@googlegroups.com
These seem to be errors in one of the edits files. Issue: https://github.com/OpenTreeOfLife/reference-taxonomy/issues/168

Yan Wong

Oct 9, 2015, 4:11:02 PM
to opentre...@googlegroups.com
On 9 Oct 2015, at 21:07, Jonathan A Rees <re...@mumble.net> wrote:

> These seem to be errors in one of the edits files. Issue: https://github.com/OpenTreeOfLife/reference-taxonomy/issues/168

No idea what that means, but I assume it’s in hand then :)

I have 1.3 million wikilinks for you…

Yan

Jonathan A Rees

Oct 9, 2015, 4:20:17 PM
to opentre...@googlegroups.com
We had a version 1 patch system consisting of a bunch of files full of edits. Those files are what I just called "edit files". They are in the feed/ott/edits/ directory in the reference-taxonomy repository. You can make the repair yourself, if you like, and submit a pull request...

Regarding the links, that's great. I don't know how soon we'll be able to make use of them. For now the best solution is probably to put them on our static file server (files.opentreeoflife.org) and document the fact that they exist.  This is a two-column csv file, or something similar, right?

Did you run against OTT 2.8 or 2.9?  2.9 isn't finished yet and the identifiers that the drafts use for new taxa (those not in 2.8) are not yet stable.

Yan Wong

Oct 9, 2015, 4:28:03 PM
to opentre...@googlegroups.com
On 9 Oct 2015, at 21:20, Jonathan A Rees <re...@mumble.net> wrote:

> You can make the repair yourself, if you like, and submit a pull request…

Probably easier if you do, since it’s only 3 files, and you know what you are doing (I certainly don’t)

> For now the best solution is probably to put them on our static file server (files.opentreeoflife.org) and document the fact that they exist.

Probably sensible for me to send you the python script that produces them, then you can update against 2.9 or whatever. It only takes about 1/2 a gig of memory and about 10 mins to run on my laptop, once I’ve downloaded a wikidata dump.

> This is a two-column csv file, or something similar, right?

Well, tab-separated OTT_id Wikidata_id, and I also stuck an extra EOL_ID column in there, since it was easy to get and OToL doesn’t have that yet. But you might want to get those direct from EoL once you have something sorted.
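
A file laid out as described (tab-separated OTT_id, Wikidata_id, and an optional EOL_ID column) could be read along these lines. This is a hedged sketch only: the actual file may or may not have a header row, the function name is hypothetical, and the sample values are invented placeholders rather than real identifiers.

```python
import csv
import io


def load_links(handle):
    """Map OTT id -> (Wikidata id, EOL id or None) from tab-separated rows."""
    links = {}
    for row in csv.reader(handle, delimiter="\t"):
        # Skip blank lines and any header row (first field not numeric).
        if len(row) < 2 or not row[0].strip().isdigit():
            continue
        ott_id = int(row[0])
        wikidata_id = row[1].strip()
        # The EOL column is treated as optional and may be empty.
        eol_id = row[2].strip() if len(row) > 2 and row[2].strip() else None
        links[ott_id] = (wikidata_id, eol_id)
    return links


# Placeholder rows only; none of these IDs are real.
sample = io.StringIO("OTT_id\tWikidata_id\tEOL_ID\n111\tQ111\t222\n333\tQ333\t\n")
links = load_links(sample)
```

Keeping the Wikidata and EOL values as plain strings avoids guessing at their exact form in the real file.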

> Did you run against OTT 2.8 or 2.9? 2.9 isn't finished yet and the identifiers that the drafts use for new taxa (those not in 2.8) are not yet stable.

2.9, but can easily re-run. Presumably no major hurry. I’ll check the links work first

Yan