Combine taxonomy and FASTA files to append taxonomy info to the FASTA description line

sergio....@nioz.nl

Jan 27, 2017, 5:04:32 AM
to Qiime 1 Forum
Hello everyone

From a sequence database in a QIIME-friendly format (one FASTA file with seq_ids and one taxonomy.txt file with the taxonomy associated with each seq_id), I extracted a list of sequences of interest (so again two files, one with the taxonomy and another with the sequences), which look like this:

FASTA file
>FJ896224.1.1675_U
GGCTCATTAAATCAGTTATAGTTTATTTGATAGTCCTTTACTACTTGGATAACCGTAGTAATTCTAGAGCTAATACATGCATCAACTCCCAACTGCTTGTCGGACGGGATGTATTTATTAGATAGAAACCAATGCGGGGCAACCCGGTATTGTGGCGAATCATGATAACTTTGCGGATCGCCGGCTTTTGCCAGCGACGAATCATTCAAGTTTCTGCCCTATCAGCTTTGGATGGTAGGGTATTGGCCTACCATGGCTCTAACGGGTAACGGAGAATTGGGGTTCGATTCCGGAGAGGGAGCCTGAGAGACGGCTACCACATCCAAGGAAGGCAGCAGGCGCGTAAATTACCCAATCCTGACACAGGGAGGTAGTGACAATAAATAACAATGCCGGGGTTTAACTCTGGCAATTGGAATGAGAACAATTTAAATCCCTTATCGAGGATCAATTGGAGGGCAAGTCTGGTGCCAGCAGCCGCGGTAATTCCAGCTCCAATAGCGTATACTAAAGTTGTTGCAGTTAAAAAGCTCGTAGTTGGATTTCTGGCAGGGACGGCTGGTCGGTTCCGATAAGGGGCCGTACTATTGTTGGTTCCTGTCATCCTTGGGGAGAGCGATTCTGGCATTAAGTTGTTGGGGTCGGGATCCCTATCTTTTACTGTGAAAAAATTAGAGTGTTCAAAGCAGGCTTAGGCCCTGAATACATTAGCATGGAATAATAAGATACGACCTTGGTGGTCTATTTTGTTGGTTTGCACGCCAAGGTAATGATTAATAGGGATAGTTGGGGGTATTCGTATTCAATTGTCAGAGGTGAAATTCTTGGATTTATGGAAGACGAACTACTGCGAAAGCATTTACCAAGGATGTTTTCATTAATCAAGAACGAAAGTTAGGGGATCGAAGATGATTAGATACCATCGTAGTCTTAACCATAAACTATGCCGACTAGGGATCGGTGGGTGCATTGTAAGGCCCCATCGGCACCTTATGAGAAATCAAAGTCTTTGGGTTCCGGGGGGAGTATGGTCGCAAGGCTGAAACTTAAAGAAATTGACGGAAGGGCACCACCAGGAGTGGAGCCTGCGGCTTAATTTGACTCAACACGGGGAAACTTACCAGGTCCAGACATAGTAAGGATTGACAGATTGAGAGCTCTTTCTTGATTCTATGGGTGGTGGTGCATGGCCGTTCTTAGTTGGTGGAGTGATTTGTCTGGTTAATTCCGTTAACGAACGAGACCCCCGCCTGCTAAATAGTACTGGGAATGCTTAGCATTGCCAGAGACTTCTTAGAGGGACTTTCGGCGCTAGGCCGAAGGAAGTTGGGGGCAATAACAGGTCTGTGATGCCCTTAGATGTCCTGGGCCGCACGCGCGCTACACTGATGCGTTCAACGAGTTTATAACCTTGTCCGGAAGGACCGGGTAATCTTGAAATGCGCATCGTGATAGGGATAGATTATTGCAACTATTAATCTTGAACGAGGAATTCCTAGTAAACGCGAGTCATCAGCTCGCATTGATTACGTCCCTGCCCTTTGTACACACCGCCCGTCGCACCTACCGATTGAATGATTCGGTGAAGCTTTCGGATTGCGCCACTGGCCTCGGTCGGCAGCGTGAGAAGTTATCTAAACCTCATCATTTAGAGGAAGGAGAAATCGTAACAAGGGTGCCAGCAGCCGCGGTAATTCCAGCTCCAATAGCGTATATTAAAGTTGTTGCAGTTAAAAAGCTCGTAGTTGAACTTCAGACCTGGCCGGGTGGTCTGCCTCACGGTATGTACTGTCTGGCTGGGTCTTACCTCTTGGTGAGCCGGCATGCCCTTTACTGGGTGTGTCGGGGAACCAGGACTTTTACCTTGAGAAAATTAGAGTGTTCAAAGCAGGCCTATGCCTGAATACATTAGCATGGAATAATAAAATAGGACGTGCGGTTCTATTTTGTTGGTTTCTAGAGTCGCCGTAATGATTAATAGGGATAGTTGGGGGCATTAG
TATTCAGTTGCTAGAGGTGAAATTCTTGGATTTACTGAAGACTAACTACTGCGAAAGCATTTGCCAAGGATGTTTTCATTAATCAAGAACGAAGGTTAGGGGATCGAAAACGATCAGATACCGTTGTAGTCTTAACAGTAAACTATGCCGACTAGGGATCGGGCGACCTCAATCTTATGTGTCGCTCGGCACCTTACGAGAAATCAAAGTCTTTGGGTTCTGGGGGGAGTATGGTCGCAAGGCTGAAACTTAAAGGAATTGACGGAAGGGCACCACCAGGAGTGGAGCCTGCGGCTTAATTTGACTCAACACGGGGAAACTCACCAGGTCCAGACATGACTAGGATTGACAGATGATAGCTCTTTCATGATTTTATGGGTGGTGGTGCATGGCCGTTCTTAGTGGTGAGTGATTTGTCTGGTAATTCCGATACGAACGAGACCTAACCTGCTATAGCCAGCGCTTTGCTGTCGCGGCTCTAGAGGACTGTCTGCGTCTAGCAGACGGAGGTTGAGCATAACAGCTTGATGCCCTTAAGATGTTCTGGCCGCACGCGCCCTAC
>HQ866156.1.905_U
GTGCCAGCAGCCGCGGTAATTCCAGCTCCAATAGCGTATATTAAAGTTGTTGCAGTTAAAAAGCTCGTAGTTGGATTTCTGGTGGAGGTGATCGGTCGGCCTCGCAAGGGGTCTGCACCTGTATCGTCCTTTGCCATCCTTCAGGAAGGCGCTTCTTGTATTAACTTACGGGTTGCGAACTCCTGATCTTTTACTGTGAAAAAATTAGAGTGTTCAAAGCAGGCTTAGGCCGTTGAATACATTAGCATGGAATAATGAGATAGGGCCTTGGTGGTTTTCTATTTTGTTGGTTTGCACGCCAAGGTAATGATTAATAGGGATAGTTGGGGGTATTCGTATTCAATTGTCAGAGGTGAAATTCTTGGATTTATGGAAGACGAACTACTGCGAAAGCATTTACCAAGGATGTTTTCATTAATCAAGAACGAAAGTTAGGGGATCGAAGATGATTAGATACCATCGTAGTCTTAACCATAAACTATGCCGACTAGGGATTGGCGGTCGTTACCTAGACTCCGTCAGCACCTTCCGAGAAATCAAAGTCTTTGGGTTCCGGGGGGAGTATGGTCGCAAGGCTGAAACTTAAAGAAATTGACGGAAGGGCACCACCAGGAGTGGAGCCTGCGGCTTAATTTGACTCAACACGGGGAAACTTACCAGGTCCAGACATAGTGAGGATTGACAGATTGATAGCTCTTTCTTGATTCTATGGGTGGTGGTGCATGGCCGTTCTTAGTTGGTGGAGTGATTTGTCTGGTTAATTCCGTTAACGAACGAGACCCCCGCCTGCTAAATAGTGTCGGTAATGCTTCTGCATTGCCGTTTCTACTTCTTAGAGGGACTTTCGGTGACTAACCGAAGGAAGTTGGGGGCAATAACAGGTCTGTGATGCCCAAAGGCGAATT
>HQ866557.1.896_U
GTGCCAAGCAGCCGCGGTAATTCCAGCTCCAATAGCGTATATTAAAGTTGTTGCAGTTAAAAAGCTCGTAGTTGGATTTCTGGTGGAGGTGATCGGTCGGCCTCGCAAGGGGTCTGCACCTGTATCGTCCTTTGCCATCCTTCAGGAAGGCGCTTCTTGTATTAACTTACGGGTTGCGAACTCCTGATCTTTTACTGTGAAAAAATTAGAGTGTTCAAAGCAGGCTTAGGCCGTTGAATACATTAGCATGGAATAATGAGATAGGGCCTTGGTGGTTTTCTATTTTGTTGGTTTGCACGCCAAGGTAATGATTAATAGGGATAGTTGGGGGTATTCGTATTCAATTGTCAGAGGTGAAATTCTTGGATTTATGGAAGACGAACTACTGCGAAAGCATTTACCAAGGATGTTTTCATTAATCAAGAACGAAAGTTAGGGGATCGAAGATGATTAGATACCATCGTAGTCTTAACCATAAACTATGCCGACTAGGGATTGGCGGTCGTTACCTAGACTCCGTCAGCACCTTCCGAGAAGTCAAAGTCTTTGGGTTCCGGGGGGAGTATGGTCGCAAGGCTGAAACTTAAAGAAATTGACGGAAGGGCACCACCAGGAGTGGAGCCTGAGGCTTAATTTGACTCAACACGGGGAAACTTACCAGGTCCAGACATAGTGAGGATTGACAGATTGATAGCTCTTTCTTGATTCTATGGGTGGTGGTGCATGGCCGTTCTTAGTTGGTGGAGTGATTTGTCTGGTTAATTCCGTTAACGAACGAGACCCCCGCCTGCTAAATAGTGTCGGTAATGCTTCTGCATTGCCGTTTCTACTTCTTAGAGGGACTTTCGGTGACTAACCGAAGGAAGTTGGGGGCAATAACAGGTCTGTGATGCCCA
>EF695169.1.817_U
TCCGGAGAGGGAGCCTGAGAGACGGCTACCACATCCAAGGAAGGCAGCAGGCGCGTAAATTACCCAATCCTGACACAGGGAGGTAGTGACAAAAAATAACAATGCCGGGCTTTTTCAAGTCTGGCAATTGGAATGAGAACAATTTAAATCCCGTATCGAGGATCAATTGGAGGGCAAGTCTGGTGCCAGCAGCCGCGGTAATTCCAGCTCCAATAGCGTATATTAAAGTTGTTGCAGTTAAAAAGCTCGTAGTTGGATTTCTGGTGGGGCTGTCGGCCTGGCTCCGAAAGGGGTCGGCGCTTGTACACGCCTGGCCATCCTCGGGGGAAGCTTTGCTGGCATTAAGTTGTCGGCGGAGTGACGCTCGTCGTTTACTGTGAACAAATTAGAGTGTTCAAAGCAGGCTTAGGCCGTTGAATACATTAGCATGGAATAATGAGATAGGACTTTGGTGGTCTATTTTGTTGGTTTGCACGCCGAAGTAATGATTAATAGGGGCGGTTGGGGGTATTCGTATTCAATTGTCAGAGGTGAAATTCTTGGATTTATGGAAGACGAACTACTGCGAAAGCATTTACCAAGGATGTTTTCATTAATCAAGAACGAAAGTTAGGGGATCGAAGAAGATTAGATACCTTCGTAGTCTTAACCATAAACTATGCCGACTAGGGATTGGCGGTCGCTTGTTCAGGCTCCGTCAGCACCTTATGAGAAATCAAAGTCTTTGGGTTCCGGGGGGAGTATGGTCGCAAGGCTGAAACTTAAAGAAATTGACGGAAGGGCACCACCAGGAGTGGAGCCTGCGGCTTAATTTGAC
>HQ866312.1.898_U
TGCCAGCAGCCGCGGTAATTCCAGCTCCAATAGCGTATATTAAAGTTGTTGCAGTTAAAAAGCTCGTAGTTGGATTTCTGGTGGAGGTGATCGGTCGGCCTCGCAAGGGGTCTGCACCTGTATCGTCCTTTGCCATCCTTCAGGAAGGCGCTTCTTGTATTAACTTACGGGTTGCGAACTCCTGATCTTTTACTGTGAAAAAATTAGAGTGTTCAAAGCAGGCTTAGGCCGTTGAATACATTAGCATGGAATAATGAGATAGGGCCTTGGTGGTTTTCTATTTTGTTGGTTTGCACGCCAAGGTAATGATTAATAGGGATAGTTGGGGGTATTCGTATTCAATTGTCAGAGGTGAAATTCTTGGATTTATGGAAGACGAACTACTGCGAAAGCATTTACCAAGGATGTTTTCATTAATCAAGAACGAAAGTTAGGGGATCGAAGATGATTAGATACCATCGTAGTCTTAACCATAAACTATGCCGACTAGGGATTGGCGGTCGTTACCTAGACTCCGTCAGCACCTTCCGAGAAATCAAAGTCTTTGGGTTCCGGGGGGAGTATGGTCGCAAGGCTGAAACTTAAAGAAATTGACGGAAGGGCACCACCAGGAGTGGAGCCTGCGGCTTAATTTGACTCAACACGGGGAAACTTACCAGGTCCAGACATAGTGAGGATTGACAGATTGATAGCTCTTTCTTGATTCTATGGGTGGTGGTGCATGGCCGTTCTTAGTTGGTGGAGTGATTTGTCTGGTTAATTCCGTTAACGAACGAGACCCCCGCCTGCTAAATAGTGTCGGTAATGCTTCTGCATTGCCGTTTCTACTTCTTAGAGGGACTTTCGGTGACTAACCGAAGGAAGTTGGGGGCAATAACAGGTCTGTGATGCCCAAGGG

taxonomy file

FJ896224.1.1675_U    Stramenopiles;Ochrophyta;Eustigmatophyceae;Eustigmatophyceae_X;Eustigmatophyceae_XX;Nannochloropsis;Nannochloropsis_sp.;
JX185299.1.1748_U    Stramenopiles;Ochrophyta;Eustigmatophyceae;Eustigmatophyceae_X;Eustigmatophyceae_XX;Nannochloropsis;Nannochloropsis_salina;
U41052.1.1798_U    Stramenopiles;Ochrophyta;Eustigmatophyceae;Eustigmatophyceae_X;Eustigmatophyceae_XX;Pseudocharaciopsis;Pseudocharaciopsis_minuta;
JF489992.1.1790_U    Stramenopiles;Ochrophyta;Eustigmatophyceae;Eustigmatophyceae_X;Eustigmatophyceae_XX;Nannochloropsis;Nannochloropsis_salina;

Now I'd like to merge the two files so that the taxonomic description is appended to each sequence identifier, making each record look like:

>FJ896224.1.1675_U Stramenopiles;Ochrophyta;Eustigmatophyceae;Eustigmatophyceae_X;Eustigmatophyceae_XX;Nannochloropsis;Nannochloropsis_sp.;
GGCTCATTAAATCAGTTATAGTTTATTTGATAGTCCTTTACTACTTGGATAACCGTAGTAATTCTAGAGCTAATACATGCATCAACTCCCAACTGCTTGTCGGACGGGATGTATTTATTAGATAGAAACCAATGCGGGGCAACCCGGTATTGTGGCGAATCATGATAACTTTGCGGATCGCCGGCTTTTGCCAGCGACGAATCATTCAAGTTTCTGCCCTATCAGCTTTGGATGGTAGGGTATTGGCCTACCATGGCTCTAACGGGTAACGGAGAATTGGGGTTCGATTCCGGAGAGGGAGCCTGAGAGACGGCTACCACATCCAAGGAAGGCAGCAGGCGCGTAAATTACCCAATCCTGACACAGGGAGGTAGTGACAATAAATAACAATGCCGGGGTTTAACTCTGGCAATTGGAATGAGAACAATTTAAATCCCTTATCGAGGATCAATTGGAGGGCAAGTCTGGTGCCAGCAGCCGCGGTAATTCCAGCTCCAATAGCGTATACTAAAGTTGTTGCAGTTAAAAAGCTCGTAGTTGGATTTCTGGCAGGGACGGCTGGTCGGTTCCGATAAGGGGCCGTACTATTGTTGGTTCCTGTCATCCTTGGGGAGAGCGATTCTGGCATTAAGTTGTTGGGGTCGGGATCCCTATCTTTTACTGTGAAAAAATTAGAGTGTTCAAAGCAGGCTTAGGCCCTGAATACATTAGCATGGAATAATAAGATACGACCTTGGTGGTCTATTTTGTTGGTTTGCACGCCAAGGTAATGATTAATAGGGATAGTTGGGGGTATTCGTATTCAATTGTCAGAGGTGAAATTCTTGGATTTATGGAAGACGAACTACTGCGAAAGCATTTACCAAGGATGTTTTCATTAATCAAGAACGAAAGTTAGGGGATCGAAGATGATTAGATACCATCGTAGTCTTAACCATAAACTATGCCGACTAGGGATCGGTGGGTGCATTGTAAGGCCCCATCGGCACCTTATGAGAAATCAAAGTCTTTGGGTTCCGGGGGGAGTATGGTCGCAAGGCTGAAACTTAAAGAAATTGACGGAAGGGCACCACCAGGAGTGGAGCCTGCGGCTTAATTTGACTCAACACGGGGAAACTTACCAGGTCCAGACATAGTAAGGATTGACAGATTGAGAGCTCTTTCTTGATTCTATGGGTGGTGGTGCATGGCCGTTCTTAGTTGGTGGAGTGATTTGTCTGGTTAATTCCGTTAACGAACGAGACCCCCGCCTGCTAAATAGTACTGGGAATGCTTAGCATTGCCAGAGACTTCTTAGAGGGACTTTCGGCGCTAGGCCGAAGGAAGTTGGGGGCAATAACAGGTCTGTGATGCCCTTAGATGTCCTGGGCCGCACGCGCGCTACACTGATGCGTTCAACGAGTTTATAACCTTGTCCGGAAGGACCGGGTAATCTTGAAATGCGCATCGTGATAGGGATAGATTATTGCAACTATTAATCTTGAACGAGGAATTCCTAGTAAACGCGAGTCATCAGCTCGCATTGATTACGTCCCTGCCCTTTGTACACACCGCCCGTCGCACCTACCGATTGAATGATTCGGTGAAGCTTTCGGATTGCGCCACTGGCCTCGGTCGGCAGCGTGAGAAGTTATCTAAACCTCATCATTTAGAGGAAGGAGAAATCGTAACAAGGGTGCCAGCAGCCGCGGTAATTCCAGCTCCAATAGCGTATATTAAAGTTGTTGCAGTTAAAAAGCTCGTAGTTGAACTTCAGACCTGGCCGGGTGGTCTGCCTCACGGTATGTACTGTCTGGCTGGGTCTTACCTCTTGGTGAGCCGGCATGCCCTTTACTGGGTGTGTCGGGGAACCAGGACTTTTACCTTGAGAAAATTAGAGTGTTCAAAGCAGGCCTATGCCTGAATACATTAGCATGGAATAATAAAATAGGACGTGCGGTTCTATTTTGTTGGTTTCTAGAGTCGCCGTAATGATTAATAGGGATAGTTGGGGGCATTAG
TATTCAGTTGCTAGAGGTGAAATTCTTGGATTTACTGAAGACTAACTACTGCGAAAGCATTTGCCAAGGATGTTTTCATTAATCAAGAACGAAGGTTAGGGGATCGAAAACGATCAGATACCGTTGTAGTCTTAACAGTAAACTATGCCGACTAGGGATCGGGCGACCTCAATCTTATGTGTCGCTCGGCACCTTACGAGAAATCAAAGTCTTTGGGTTCTGGGGGGAGTATGGTCGCAAGGCTGAAACTTAAAGGAATTGACGGAAGGGCACCACCAGGAGTGGAGCCTGCGGCTTAATTTGACTCAACACGGGGAAACTCACCAGGTCCAGACATGACTAGGATTGACAGATGATAGCTCTTTCATGATTTTATGGGTGGTGGTGCATGGCCGTTCTTAGTGGTGAGTGATTTGTCTGGTAATTCCGATACGAACGAGACCTAACCTGCTATAGCCAGCGCTTTGCTGTCGCGGCTCTAGAGGACTGTCTGCGTCTAGCAGACGGAGGTTGAGCATAACAGCTTGATGCCCTTAAGATGTTCTGGCCGCACGCGCCCTAC



My question is whether there is a QIIME script that can do this. I looked through the list of available scripts but could not find one. Alternatively, do you know of any Unix/Python/Biopython command that would let me do it?

Briefly, I need to do the opposite of what prep_silva_data.py and prep_silva_taxonomy.py (https://github.com/mikerobeson/Misc_Code/tree/master/SILVA_to_RDP) do when they format a sequence database in a QIIME-friendly way by splitting the FASTA file in two.

Let me know if you have any suggestions.

Thanks for your help,

Sergio

 

Colin Brislawn

Jan 27, 2017, 1:48:29 PM
to Qiime 1 Forum
Hello Sergio,

This is a great question. I would also like to know of an elegant way to do this.

Colin

TonyWalters

Jan 27, 2017, 2:25:24 PM
to Qiime 1 Forum
Hello Sergio,

I modified one of the scripts used for building the SILVA database; it should be able to do what you're looking for: https://gist.github.com/walterst/9147f9405cadf67a88471cc87b508333

I haven't tested it though, so you'd want to check the output to make sure it's reasonable.

-Tony
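
For readers who want to see the idea without opening the gist, the merge discussed in this thread can be sketched in plain Python. This is only an illustration, not the code in the linked gist. It assumes the taxonomy file has one record per line with the seq_id first and the taxonomy after whitespace (usually a tab), and that the first word of each FASTA header is the seq_id. The function names `load_taxonomy` and `annotate_fasta` and the file names in the comment are hypothetical.

```python
import io


def load_taxonomy(handle):
    """Return a dict mapping seq_id -> taxonomy string.

    Assumes one record per line: seq_id, whitespace (usually a tab), taxonomy.
    """
    taxonomy = {}
    for line in handle:
        line = line.strip()
        if not line:
            continue  # skip blank lines
        parts = line.split(None, 1)  # split on the first run of whitespace only
        taxonomy[parts[0]] = parts[1] if len(parts) > 1 else ""
    return taxonomy


def annotate_fasta(fasta_handle, taxonomy, out_handle):
    """Copy FASTA records to out_handle, appending taxonomy to each header.

    Sequence lines are copied unchanged, so wrapped sequences are preserved.
    Returns the list of seq_ids that had no entry in the taxonomy mapping;
    their headers are written unchanged.
    """
    missing = []
    for line in fasta_handle:
        line = line.rstrip("\r\n")
        if line.startswith(">"):
            fields = line[1:].split(None, 1)
            seq_id = fields[0] if fields else ""
            if seq_id in taxonomy:
                line = ">%s %s" % (seq_id, taxonomy[seq_id])
            else:
                missing.append(seq_id)
        out_handle.write(line + "\n")
    return missing


if __name__ == "__main__":
    # Tiny in-memory demo; the sequences are shortened placeholders.
    tax = load_taxonomy(io.StringIO(
        "FJ896224.1.1675_U\tStramenopiles;Ochrophyta;Eustigmatophyceae;\n"
    ))
    fasta = io.StringIO(">FJ896224.1.1675_U\nGGCTCATT\n>HQ866156.1.905_U\nGTGCCAGC\n")
    out = io.StringIO()
    not_found = annotate_fasta(fasta, tax, out)
    print(out.getvalue(), end="")
    print("No taxonomy found for:", not_found)

    # With real files (hypothetical names):
    # with open("taxonomy.txt") as t, open("seqs.fasta") as f, \
    #         open("seqs_with_taxonomy.fasta", "w") as o:
    #     annotate_fasta(f, load_taxonomy(t), o)
```

Returning the unmatched IDs makes it easy to check the output, since a FASTA subset and a taxonomy subset extracted separately may not contain exactly the same seq_ids.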

Colin Brislawn

Jan 27, 2017, 2:44:59 PM
to Qiime 1 Forum
That script works great Tony!

Sergio, let me know whether that works for you. If it doesn't, I can help troubleshoot it.

Colin
