Taxonomy database in UniProt format for classification with USEARCH


sbue...@asu.edu

Jan 5, 2018, 1:29:15 PM
to metacoder
Hi Zachary and all metacoder users,

First of all, thanks for developing this. metacoder is exactly what was urgently needed, given the diversity of formats that taxonomic data is written in.
Now, I have a specific problem that I was not able to figure out from the template code. I am not a frequent R user (I use Matlab more often), so I may be misunderstanding some basics here. Hopefully you can help.

I would like to change the headers of a taxonomic database file that was retrieved from UniProt. It is a FASTA file containing protein sequences. I intend to use this file as the database for a taxonomic classification with USEARCH, which requires these taxonomic annotations: https://www.drive5.com/usearch/manual/tax_annot.html
An excerpt of the database file is attached. Could you point me to the right metacoder code for replacing the current headers with the lineage annotation described on the USEARCH website? That would be incredibly helpful!

Cheers,
steffen
Archaea_uniprot_proteindb1.fasta

Zachary Foster

Jan 5, 2018, 7:51:14 PM
to metacoder
Hello Steffen,

Sure! See the attached files and let me know if you have questions. The code might be hard to understand if you are not comfortable with regular expressions and the apply functions in R. It's a bit tricky, since UniProt only supplies species names and has a header format that is difficult to parse. I just updated metacoder, so things will look different from the documentation you read.

Best,

Zach
example.pdf
2018_01_05--steffen--headers.zip
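
For readers without the attachments, the general idea can be sketched roughly as follows. This is not the attached R code and does not use metacoder; it is a minimal plain-Python sketch under several assumptions: headers follow the usual UniProt pattern (`>tr|ACCESSION|ENTRY_NAME description OS=Species name ...`), the full lineage for each species comes from a lookup table you supply yourself (UniProt headers only give the species), and spaces in names are replaced with underscores. The function name `usearch_header`, the `EXAMPLE_LINEAGES` table, and the accession in the example header are hypothetical and purely illustrative.

```python
import re

# Captures the accession and the organism name (the OS= field) from a
# UniProt-style FASTA header. The organism name ends at the next two-letter
# field such as " GN=" or " PE=", or at the end of the line.
HEADER_RE = re.compile(r"^>(?:sp|tr)\|([^|]+)\|\S+ .*?OS=(.+?)(?= [A-Z]{2}=|$)")

# Hypothetical lookup table: species name -> list of (rank letter, taxon name).
# In practice this would have to be built from a taxonomy resource.
EXAMPLE_LINEAGES = {
    "Methanosarcina mazei": [
        ("d", "Archaea"),
        ("p", "Euryarchaeota"),
        ("c", "Methanomicrobia"),
        ("o", "Methanosarcinales"),
        ("f", "Methanosarcinaceae"),
        ("g", "Methanosarcina"),
        ("s", "Methanosarcina mazei"),
    ],
}


def usearch_header(header, lineages):
    """Return a USEARCH-style header (">ACCESSION;tax=d:...,s:...;").

    Headers that do not match the expected pattern, or whose species is
    missing from the lookup table (for example names carrying a strain
    suffix), are returned unchanged.
    """
    header = header.strip()
    match = HEADER_RE.match(header)
    if match is None:
        return header
    accession, species = match.groups()
    ranks = lineages.get(species)
    if ranks is None:
        return header
    # Assumption: underscores stand in for spaces inside taxon names.
    tax = ",".join(f"{rank}:{name.replace(' ', '_')}" for rank, name in ranks)
    return f">{accession};tax={tax};"


def rewrite_fasta(lines, lineages):
    """Yield FASTA lines with header lines rewritten and sequences untouched."""
    for line in lines:
        if line.startswith(">"):
            yield usearch_header(line, lineages)
        else:
            yield line.rstrip("\n")


# Made-up example header (the accession is not a real entry).
example = ">tr|Q00001|Q00001_METMZ Uncharacterized protein OS=Methanosarcina mazei GN=abc PE=4 SV=1"
print(usearch_header(example, EXAMPLE_LINEAGES))
```

The hard part in practice is building the lineage table, not rewriting the header, which is why the sketch takes the table as an input rather than trying to derive it.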

sbue...@asu.edu

Jan 9, 2018, 3:22:21 PM
to metacoder
Zach, thank you so much for your help. The example file is great for really understanding what you did, thank you!

cheers,
steffen