Hello everyone
I have a sequence database in a QIIME-friendly format: one FASTA file with the seq_ids and one taxonomy.txt file with the taxonomy associated with each seq_id. From it I extracted a list of sequences of interest, so I again have two files, one with the taxonomy and one with the sequences, which look like this:
FASTA file
>FJ896224.1.1675_U
GGCTCATTAAATCAGTTATAGTTTATTTGATAGTCCTTTACTACTTGGATAACCGTAGTAATTCTAGAGCTAATACATGCATCAACTCCCAACTGCTTGTCGGACGGGATGTATTTATTAGATAGAAACCAATGCGGGGCAACCCGGTATTGTGGCGAATCATGATAACTTTGCGGATCGCCGGCTTTTGCCAGCGACGAATCATTCAAGTTTCTGCCCTATCAGCTTTGGATGGTAGGGTATTGGCCTACCATGGCTCTAACGGGTAACGGAGAATTGGGGTTCGATTCCGGAGAGGGAGCCTGAGAGACGGCTACCACATCCAAGGAAGGCAGCAGGCGCGTAAATTACCCAATCCTGACACAGGGAGGTAGTGACAATAAATAACAATGCCGGGGTTTAACTCTGGCAATTGGAATGAGAACAATTTAAATCCCTTATCGAGGATCAATTGGAGGGCAAGTCTGGTGCCAGCAGCCGCGGTAATTCCAGCTCCAATAGCGTATACTAAAGTTGTTGCAGTTAAAAAGCTCGTAGTTGGATTTCTGGCAGGGACGGCTGGTCGGTTCCGATAAGGGGCCGTACTATTGTTGGTTCCTGTCATCCTTGGGGAGAGCGATTCTGGCATTAAGTTGTTGGGGTCGGGATCCCTATCTTTTACTGTGAAAAAATTAGAGTGTTCAAAGCAGGCTTAGGCCCTGAATACATTAGCATGGAATAATAAGATACGACCTTGGTGGTCTATTTTGTTGGTTTGCACGCCAAGGTAATGATTAATAGGGATAGTTGGGGGTATTCGTATTCAATTGTCAGAGGTGAAATTCTTGGATTTATGGAAGACGAACTACTGCGAAAGCATTTACCAAGGATGTTTTCATTAATCAAGAACGAAAGTTAGGGGATCGAAGATGATTAGATACCATCGTAGTCTTAACCATAAACTATGCCGACTAGGGATCGGTGGGTGCATTGTAAGGCCCCATCGGCACCTTATGAGAAATCAAAGTCTTTGGGTTCCGGGGGGAGTATGGTCGCAAGGCTGAAACTTAAAGAAATTGACGGAAGGGCACCACCAGGAGTGGAGCCTGCGGCTTAATTTGACTCAACACGGGGAAACTTACCAGGTCCAGACATAGTAAGGATTGACAGATTGAGAGCTCTTTCTTGATTCTATGGGTGGTGGTGCATGGCCGTTCTTAGTTGGTGGAGTGATTTGTCTGGTTAATTCCGTTAACGAACGAGACCCCCGCCTGCTAAATAGTACTGGGAATGCTTAGCATTGCCAGAGACTTCTTAGAGGGACTTTCGGCGCTAGGCCGAAGGAAGTTGGGGGCAATAACAGGTCTGTGATGCCCTTAGATGTCCTGGGCCGCACGCGCGCTACACTGATGCGTTCAACGAGTTTATAACCTTGTCCGGAAGGACCGGGTAATCTTGAAATGCGCATCGTGATAGGGATAGATTATTGCAACTATTAATCTTGAACGAGGAATTCCTAGTAAACGCGAGTCATCAGCTCGCATTGATTACGTCCCTGCCCTTTGTACACACCGCCCGTCGCACCTACCGATTGAATGATTCGGTGAAGCTTTCGGATTGCGCCACTGGCCTCGGTCGGCAGCGTGAGAAGTTATCTAAACCTCATCATTTAGAGGAAGGAGAAATCGTAACAAGGGTGCCAGCAGCCGCGGTAATTCCAGCTCCAATAGCGTATATTAAAGTTGTTGCAGTTAAAAAGCTCGTAGTTGAACTTCAGACCTGGCCGGGTGGTCTGCCTCACGGTATGTACTGTCTGGCTGGGTCTTACCTCTTGGTGAGCCGGCATGCCCTTTACTGGGTGTGTCGGGGAACCAGGACTTTTACCTTGAGAAAATTAGAGTGTTCAAAGCAGGCCTATGCCTGAATACATTAGCATGGAATAATAAAATAGGACGTGCGGTTCTATTTTGTTGGTTTCTAGAGTCGCCGTAATGATTAATAGGGATAGTTGGGGGCATTAG
TATTCAGTTGCTAGAGGTGAAATTCTTGGATTTACTGAAGACTAACTACTGCGAAAGCATTTGCCAAGGATGTTTTCATTAATCAAGAACGAAGGTTAGGGGATCGAAAACGATCAGATACCGTTGTAGTCTTAACAGTAAACTATGCCGACTAGGGATCGGGCGACCTCAATCTTATGTGTCGCTCGGCACCTTACGAGAAATCAAAGTCTTTGGGTTCTGGGGGGAGTATGGTCGCAAGGCTGAAACTTAAAGGAATTGACGGAAGGGCACCACCAGGAGTGGAGCCTGCGGCTTAATTTGACTCAACACGGGGAAACTCACCAGGTCCAGACATGACTAGGATTGACAGATGATAGCTCTTTCATGATTTTATGGGTGGTGGTGCATGGCCGTTCTTAGTGGTGAGTGATTTGTCTGGTAATTCCGATACGAACGAGACCTAACCTGCTATAGCCAGCGCTTTGCTGTCGCGGCTCTAGAGGACTGTCTGCGTCTAGCAGACGGAGGTTGAGCATAACAGCTTGATGCCCTTAAGATGTTCTGGCCGCACGCGCCCTAC
>HQ866156.1.905_U
GTGCCAGCAGCCGCGGTAATTCCAGCTCCAATAGCGTATATTAAAGTTGTTGCAGTTAAAAAGCTCGTAGTTGGATTTCTGGTGGAGGTGATCGGTCGGCCTCGCAAGGGGTCTGCACCTGTATCGTCCTTTGCCATCCTTCAGGAAGGCGCTTCTTGTATTAACTTACGGGTTGCGAACTCCTGATCTTTTACTGTGAAAAAATTAGAGTGTTCAAAGCAGGCTTAGGCCGTTGAATACATTAGCATGGAATAATGAGATAGGGCCTTGGTGGTTTTCTATTTTGTTGGTTTGCACGCCAAGGTAATGATTAATAGGGATAGTTGGGGGTATTCGTATTCAATTGTCAGAGGTGAAATTCTTGGATTTATGGAAGACGAACTACTGCGAAAGCATTTACCAAGGATGTTTTCATTAATCAAGAACGAAAGTTAGGGGATCGAAGATGATTAGATACCATCGTAGTCTTAACCATAAACTATGCCGACTAGGGATTGGCGGTCGTTACCTAGACTCCGTCAGCACCTTCCGAGAAATCAAAGTCTTTGGGTTCCGGGGGGAGTATGGTCGCAAGGCTGAAACTTAAAGAAATTGACGGAAGGGCACCACCAGGAGTGGAGCCTGCGGCTTAATTTGACTCAACACGGGGAAACTTACCAGGTCCAGACATAGTGAGGATTGACAGATTGATAGCTCTTTCTTGATTCTATGGGTGGTGGTGCATGGCCGTTCTTAGTTGGTGGAGTGATTTGTCTGGTTAATTCCGTTAACGAACGAGACCCCCGCCTGCTAAATAGTGTCGGTAATGCTTCTGCATTGCCGTTTCTACTTCTTAGAGGGACTTTCGGTGACTAACCGAAGGAAGTTGGGGGCAATAACAGGTCTGTGATGCCCAAAGGCGAATT
>HQ866557.1.896_U
GTGCCAAGCAGCCGCGGTAATTCCAGCTCCAATAGCGTATATTAAAGTTGTTGCAGTTAAAAAGCTCGTAGTTGGATTTCTGGTGGAGGTGATCGGTCGGCCTCGCAAGGGGTCTGCACCTGTATCGTCCTTTGCCATCCTTCAGGAAGGCGCTTCTTGTATTAACTTACGGGTTGCGAACTCCTGATCTTTTACTGTGAAAAAATTAGAGTGTTCAAAGCAGGCTTAGGCCGTTGAATACATTAGCATGGAATAATGAGATAGGGCCTTGGTGGTTTTCTATTTTGTTGGTTTGCACGCCAAGGTAATGATTAATAGGGATAGTTGGGGGTATTCGTATTCAATTGTCAGAGGTGAAATTCTTGGATTTATGGAAGACGAACTACTGCGAAAGCATTTACCAAGGATGTTTTCATTAATCAAGAACGAAAGTTAGGGGATCGAAGATGATTAGATACCATCGTAGTCTTAACCATAAACTATGCCGACTAGGGATTGGCGGTCGTTACCTAGACTCCGTCAGCACCTTCCGAGAAGTCAAAGTCTTTGGGTTCCGGGGGGAGTATGGTCGCAAGGCTGAAACTTAAAGAAATTGACGGAAGGGCACCACCAGGAGTGGAGCCTGAGGCTTAATTTGACTCAACACGGGGAAACTTACCAGGTCCAGACATAGTGAGGATTGACAGATTGATAGCTCTTTCTTGATTCTATGGGTGGTGGTGCATGGCCGTTCTTAGTTGGTGGAGTGATTTGTCTGGTTAATTCCGTTAACGAACGAGACCCCCGCCTGCTAAATAGTGTCGGTAATGCTTCTGCATTGCCGTTTCTACTTCTTAGAGGGACTTTCGGTGACTAACCGAAGGAAGTTGGGGGCAATAACAGGTCTGTGATGCCCA
>EF695169.1.817_U
TCCGGAGAGGGAGCCTGAGAGACGGCTACCACATCCAAGGAAGGCAGCAGGCGCGTAAATTACCCAATCCTGACACAGGGAGGTAGTGACAAAAAATAACAATGCCGGGCTTTTTCAAGTCTGGCAATTGGAATGAGAACAATTTAAATCCCGTATCGAGGATCAATTGGAGGGCAAGTCTGGTGCCAGCAGCCGCGGTAATTCCAGCTCCAATAGCGTATATTAAAGTTGTTGCAGTTAAAAAGCTCGTAGTTGGATTTCTGGTGGGGCTGTCGGCCTGGCTCCGAAAGGGGTCGGCGCTTGTACACGCCTGGCCATCCTCGGGGGAAGCTTTGCTGGCATTAAGTTGTCGGCGGAGTGACGCTCGTCGTTTACTGTGAACAAATTAGAGTGTTCAAAGCAGGCTTAGGCCGTTGAATACATTAGCATGGAATAATGAGATAGGACTTTGGTGGTCTATTTTGTTGGTTTGCACGCCGAAGTAATGATTAATAGGGGCGGTTGGGGGTATTCGTATTCAATTGTCAGAGGTGAAATTCTTGGATTTATGGAAGACGAACTACTGCGAAAGCATTTACCAAGGATGTTTTCATTAATCAAGAACGAAAGTTAGGGGATCGAAGAAGATTAGATACCTTCGTAGTCTTAACCATAAACTATGCCGACTAGGGATTGGCGGTCGCTTGTTCAGGCTCCGTCAGCACCTTATGAGAAATCAAAGTCTTTGGGTTCCGGGGGGAGTATGGTCGCAAGGCTGAAACTTAAAGAAATTGACGGAAGGGCACCACCAGGAGTGGAGCCTGCGGCTTAATTTGAC
>HQ866312.1.898_U
TGCCAGCAGCCGCGGTAATTCCAGCTCCAATAGCGTATATTAAAGTTGTTGCAGTTAAAAAGCTCGTAGTTGGATTTCTGGTGGAGGTGATCGGTCGGCCTCGCAAGGGGTCTGCACCTGTATCGTCCTTTGCCATCCTTCAGGAAGGCGCTTCTTGTATTAACTTACGGGTTGCGAACTCCTGATCTTTTACTGTGAAAAAATTAGAGTGTTCAAAGCAGGCTTAGGCCGTTGAATACATTAGCATGGAATAATGAGATAGGGCCTTGGTGGTTTTCTATTTTGTTGGTTTGCACGCCAAGGTAATGATTAATAGGGATAGTTGGGGGTATTCGTATTCAATTGTCAGAGGTGAAATTCTTGGATTTATGGAAGACGAACTACTGCGAAAGCATTTACCAAGGATGTTTTCATTAATCAAGAACGAAAGTTAGGGGATCGAAGATGATTAGATACCATCGTAGTCTTAACCATAAACTATGCCGACTAGGGATTGGCGGTCGTTACCTAGACTCCGTCAGCACCTTCCGAGAAATCAAAGTCTTTGGGTTCCGGGGGGAGTATGGTCGCAAGGCTGAAACTTAAAGAAATTGACGGAAGGGCACCACCAGGAGTGGAGCCTGCGGCTTAATTTGACTCAACACGGGGAAACTTACCAGGTCCAGACATAGTGAGGATTGACAGATTGATAGCTCTTTCTTGATTCTATGGGTGGTGGTGCATGGCCGTTCTTAGTTGGTGGAGTGATTTGTCTGGTTAATTCCGTTAACGAACGAGACCCCCGCCTGCTAAATAGTGTCGGTAATGCTTCTGCATTGCCGTTTCTACTTCTTAGAGGGACTTTCGGTGACTAACCGAAGGAAGTTGGGGGCAATAACAGGTCTGTGATGCCCAAGGG
Taxonomy file
FJ896224.1.1675_U Stramenopiles;Ochrophyta;Eustigmatophyceae;Eustigmatophyceae_X;Eustigmatophyceae_XX;Nannochloropsis;Nannochloropsis_sp.;
JX185299.1.1748_U Stramenopiles;Ochrophyta;Eustigmatophyceae;Eustigmatophyceae_X;Eustigmatophyceae_XX;Nannochloropsis;Nannochloropsis_salina;
U41052.1.1798_U Stramenopiles;Ochrophyta;Eustigmatophyceae;Eustigmatophyceae_X;Eustigmatophyceae_XX;Pseudocharaciopsis;Pseudocharaciopsis_minuta;
JF489992.1.1790_U Stramenopiles;Ochrophyta;Eustigmatophyceae;Eustigmatophyceae_X;Eustigmatophyceae_XX;Nannochloropsis;Nannochloropsis_salina;
Now I'd like to merge the two files so that the taxonomic description is appended to each sequence identifier, making each sequence look like this:
>FJ896224.1.1675_U Stramenopiles;Ochrophyta;Eustigmatophyceae;Eustigmatophyceae_X;Eustigmatophyceae_XX;Nannochloropsis;Nannochloropsis_sp.;
GGCTCATTAAATCAGTTATAGTTTATTTGATAGTCCTTTACTACTTGGATAACCGTAGTAATTCTAGAGCTAATACATGCATCAACTCCCAACTGCTTGTCGGACGGGATGTATTTATTAGATAGAAACCAATGCGGGGCAACCCGGTATTGTGGCGAATCATGATAACTTTGCGGATCGCCGGCTTTTGCCAGCGACGAATCATTCAAGTTTCTGCCCTATCAGCTTTGGATGGTAGGGTATTGGCCTACCATGGCTCTAACGGGTAACGGAGAATTGGGGTTCGATTCCGGAGAGGGAGCCTGAGAGACGGCTACCACATCCAAGGAAGGCAGCAGGCGCGTAAATTACCCAATCCTGACACAGGGAGGTAGTGACAATAAATAACAATGCCGGGGTTTAACTCTGGCAATTGGAATGAGAACAATTTAAATCCCTTATCGAGGATCAATTGGAGGGCAAGTCTGGTGCCAGCAGCCGCGGTAATTCCAGCTCCAATAGCGTATACTAAAGTTGTTGCAGTTAAAAAGCTCGTAGTTGGATTTCTGGCAGGGACGGCTGGTCGGTTCCGATAAGGGGCCGTACTATTGTTGGTTCCTGTCATCCTTGGGGAGAGCGATTCTGGCATTAAGTTGTTGGGGTCGGGATCCCTATCTTTTACTGTGAAAAAATTAGAGTGTTCAAAGCAGGCTTAGGCCCTGAATACATTAGCATGGAATAATAAGATACGACCTTGGTGGTCTATTTTGTTGGTTTGCACGCCAAGGTAATGATTAATAGGGATAGTTGGGGGTATTCGTATTCAATTGTCAGAGGTGAAATTCTTGGATTTATGGAAGACGAACTACTGCGAAAGCATTTACCAAGGATGTTTTCATTAATCAAGAACGAAAGTTAGGGGATCGAAGATGATTAGATACCATCGTAGTCTTAACCATAAACTATGCCGACTAGGGATCGGTGGGTGCATTGTAAGGCCCCATCGGCACCTTATGAGAAATCAAAGTCTTTGGGTTCCGGGGGGAGTATGGTCGCAAGGCTGAAACTTAAAGAAATTGACGGAAGGGCACCACCAGGAGTGGAGCCTGCGGCTTAATTTGACTCAACACGGGGAAACTTACCAGGTCCAGACATAGTAAGGATTGACAGATTGAGAGCTCTTTCTTGATTCTATGGGTGGTGGTGCATGGCCGTTCTTAGTTGGTGGAGTGATTTGTCTGGTTAATTCCGTTAACGAACGAGACCCCCGCCTGCTAAATAGTACTGGGAATGCTTAGCATTGCCAGAGACTTCTTAGAGGGACTTTCGGCGCTAGGCCGAAGGAAGTTGGGGGCAATAACAGGTCTGTGATGCCCTTAGATGTCCTGGGCCGCACGCGCGCTACACTGATGCGTTCAACGAGTTTATAACCTTGTCCGGAAGGACCGGGTAATCTTGAAATGCGCATCGTGATAGGGATAGATTATTGCAACTATTAATCTTGAACGAGGAATTCCTAGTAAACGCGAGTCATCAGCTCGCATTGATTACGTCCCTGCCCTTTGTACACACCGCCCGTCGCACCTACCGATTGAATGATTCGGTGAAGCTTTCGGATTGCGCCACTGGCCTCGGTCGGCAGCGTGAGAAGTTATCTAAACCTCATCATTTAGAGGAAGGAGAAATCGTAACAAGGGTGCCAGCAGCCGCGGTAATTCCAGCTCCAATAGCGTATATTAAAGTTGTTGCAGTTAAAAAGCTCGTAGTTGAACTTCAGACCTGGCCGGGTGGTCTGCCTCACGGTATGTACTGTCTGGCTGGGTCTTACCTCTTGGTGAGCCGGCATGCCCTTTACTGGGTGTGTCGGGGAACCAGGACTTTTACCTTGAGAAAATTAGAGTGTTCAAAGCAGGCCTATGCCTGAATACATTAGCATGGAATAATAAAATAGGACGTGCGGTTCTATTTTGTTGGTTTCTAGAGTCGCCGTAATGATTAATAGGGATAGTTGGGGGCATTAG
TATTCAGTTGCTAGAGGTGAAATTCTTGGATTTACTGAAGACTAACTACTGCGAAAGCATTTGCCAAGGATGTTTTCATTAATCAAGAACGAAGGTTAGGGGATCGAAAACGATCAGATACCGTTGTAGTCTTAACAGTAAACTATGCCGACTAGGGATCGGGCGACCTCAATCTTATGTGTCGCTCGGCACCTTACGAGAAATCAAAGTCTTTGGGTTCTGGGGGGAGTATGGTCGCAAGGCTGAAACTTAAAGGAATTGACGGAAGGGCACCACCAGGAGTGGAGCCTGCGGCTTAATTTGACTCAACACGGGGAAACTCACCAGGTCCAGACATGACTAGGATTGACAGATGATAGCTCTTTCATGATTTTATGGGTGGTGGTGCATGGCCGTTCTTAGTGGTGAGTGATTTGTCTGGTAATTCCGATACGAACGAGACCTAACCTGCTATAGCCAGCGCTTTGCTGTCGCGGCTCTAGAGGACTGTCTGCGTCTAGCAGACGGAGGTTGAGCATAACAGCTTGATGCCCTTAAGATGTTCTGGCCGCACGCGCCCTAC
My question is whether there is a QIIME script that can do this. I looked through the list of available scripts but could not find one. Alternatively, do you know of any Unix/Python/Biopython command that would let me do it?
Briefly, I need to do the opposite of what prep_silva_data.py and prep_silva_taxonomy.py (https://github.com/mikerobeson/Misc_Code/tree/master/SILVA_to_RDP) do when we want to format a sequence database in a QIIME-friendly way by splitting the FASTA file in two.
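To make the request concrete, here is a rough sketch in plain Python of the kind of merge I have in mind. It is not an existing QIIME script; the function names load_taxonomy and annotate_fasta are made up for illustration. It assumes the taxonomy file has the seq_id as its first whitespace-delimited field with the taxonomy string after it, and that the seq_id is the first word of each FASTA header. Sequences with no entry in the taxonomy file keep their header unchanged.

```python
import io


def load_taxonomy(handle):
    """Build a dict mapping seq_id -> taxonomy string.

    Assumes each non-empty line is '<seq_id><whitespace><taxonomy>'.
    """
    taxonomy = {}
    for line in handle:
        line = line.strip()
        if not line:
            continue
        # Split only once so any spaces inside the taxonomy are kept.
        parts = line.split(None, 1)
        taxonomy[parts[0]] = parts[1] if len(parts) > 1 else ""
    return taxonomy


def annotate_fasta(fasta_handle, taxonomy, out_handle):
    """Copy a FASTA file, appending the taxonomy to matching headers."""
    for line in fasta_handle:
        if line.startswith(">"):
            fields = line[1:].split()
            seq_id = fields[0] if fields else ""
            tax = taxonomy.get(seq_id)
            if tax:
                out_handle.write(">%s %s\n" % (seq_id, tax))
            else:
                # No taxonomy known for this ID: keep the header as it was.
                out_handle.write(line.rstrip("\n") + "\n")
        else:
            # Sequence lines (including wrapped ones) are copied verbatim.
            out_handle.write(line)


# Small in-memory demo; the sequence is shortened from the example above.
demo_tax = io.StringIO(
    "FJ896224.1.1675_U Stramenopiles;Ochrophyta;Eustigmatophyceae;\n"
)
demo_fasta = io.StringIO(">FJ896224.1.1675_U\nGGCTCATTAAATCAGTTATAG\n")
demo_out = io.StringIO()
annotate_fasta(demo_fasta, load_taxonomy(demo_tax), demo_out)
print(demo_out.getvalue(), end="")
```

For real files, the same two functions can be called with handles from open(), for example open("seqs.fasta"), open("taxonomy.txt") and open("merged.fasta", "w"), where those file names are placeholders.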
Let me know if you have any suggestions.
Thanks for your help,
Sergio