assigning taxonomy to OTUs with a custom made database

Sachia

unread,

Dec 15, 2015, 8:08:36 AM12/15/15

to Qiime 1 Forum

Dear QIIME community,

I am trying to assign taxonomy using a database we made ourselves (picking out organisms, and sequencing each of them).

for this custom DB I have :
1) a .fas file with all the sequences and a unique ID to each
2) a .txt file containing the unique IDs and the corresponding taxonomic information. the .txt file is separated by ";".

However, when I try:

parallel_assign_taxonomy_uclust.py -i ~/../repset_OTUs.fasta -r ~/../JELLY_no_align.fas -t ~/../FASTA_TAX.txt -o assign_custom -O 4 --uclust_max_accepts 10 --similarity 0.99

it never finishes, and when i check what is running with "top", I see nothing related to the parallel_assign command appearing. the output folder is being generated with a couple of temporary files, but no log file.

I am suspecting the DB format.. but I can't find more points on how the construct it other than making the .txt file ";" separated and have the same IDs in both files.

I have attached the two DB files to this message.

Thank you for your time,

Sachia

FASTA_TAX.txt

JELLY_no_align.fas

Colin Brislawn

unread,

Dec 15, 2015, 12:34:17 PM12/15/15

to Qiime 1 Forum

Hello Sachia,

I don't know all that much database construction (a qiime dev can comment more), but I have two suggestions about the command you ran:

--similarity 0.99 is way too high. This means that only hits over 99% will be reported! Are you expecting 10 hits over 99%? Try leaving it at the default of 90%. This will still report the best hits, but is less stringent.

--retain_temp_files could be useful to help you see how far the script got.

--poll_directly may prevent the script from hanging. Sometimes I find this works better for me, when I can't see the qiime script running in top.

I'll be building a database like this soon and am also interested in any advice the qiime devs have.

Colin

Sachia

unread,

Dec 15, 2015, 1:31:46 PM12/15/15

to Qiime 1 Forum

Hi Colin,

how are you? :)

thanks for the suggestions I will try them first think tomorrow.

as for the 0.99 similarity, I did it because the sequences are 18S and when we went over the results using 0.97 (both during clustering and later taxonomy) alot of things wasn't adding up with the taxonomy. Also, the sequences in the custom DB, many are more than 0.98 similar.

Based on that would you still take the OTUs (with 0.99 clustering) and assign taxonomy with 0.9?

Thanks for replying (again) to my stream of questions!

Sachia

Colin Brislawn

unread,

Dec 15, 2015, 2:00:54 PM12/15/15

to Qiime 1 Forum

Based on that would you still take the OTUs (with 0.99 clustering) and assign taxonomy with 0.9?

Absolutely! This is because OTU picking and taxonomy assignment use this threshold differently.

During OTU picking, each read only goes into one OTU, so you want it to be a very high match. 97% or up is good choice. During taxonomy assignment, each read is mapped to multiple taxa, so having several good hits is good. Actually, having several bad hits (say at 93%, 91%, and 91%) is still a good thing, because this accurately reflects the lack of a perfect match in your taxonomy database.

Colin

Sachia

unread,

Dec 15, 2015, 3:34:50 PM12/15/15

to Qiime 1 Forum

So during taxonomy assignment the threshold of 0.9 is only to evaluate which assignments to list in the result file?

ah, sorry for the stupid question, just want to be sure I get what you mean.

Sachia

Colin Brislawn

unread,

Dec 15, 2015, 3:41:10 PM12/15/15

to Qiime 1 Forum

Yes. 0.9 is the minimum to be listed as one of the three hits (or 10 hits in your example). If you take a look at your log file, you will see that many of your hits are closer to 99% or 98%.

Colin

Sachia

unread,

Dec 16, 2015, 3:09:30 AM12/16/15

to Qiime 1 Forum

Update: it still won´t work.

I ran:

parallel_assign_taxonomy_uclust.py -i ~/../repset_OTUs.fasta -r ~/../JELLY_no_align.fas -t ~/../FASTA_TAX.txt -o assign_custom -O 4 --poll_directly --retain_temp_files

and it seems like its the same. never see commands related to assign taxonomy, to job seems to hang and I just have a bunch of temp files in the output directory.

Mauro Tutino

unread,

Dec 16, 2015, 9:50:42 AM12/16/15

to Qiime 1 Forum

Hi Sachia,

I have been having the same problem but with parallel_assign_taxonomy_rdp.py.

When I use "assign_taxonomy.py -m rdp" I do not have such a problem.

Have you tried the not parallelized one?

In my case, since RDP perfectly works with assign_taxonomy.py, I suppose it is a problem related to installation and how it runs parallel jobs.

Unfortunately I work on a server and I cannot look directly at the problem.

Hope that helps.

Mauro.

Sachia

unread,

Dec 16, 2015, 10:01:46 AM12/16/15

to Qiime 1 Forum

Hi Mauro,

I did try assign_taxonomy.py but with standard settings - and I get an error about specific lines in the script ending on "valueError need more than 1 value to unpack".

However, I did not try RDP, and I will give that a try. Could I ask how your ref_db and tax files look? like the layout, should I avoid specific things like "," or empty space etc.

thank you for the input, I'll keep you updated :)

Sachia

Mauro Tutino

unread,

Dec 16, 2015, 11:22:47 AM12/16/15

to Qiime 1 Forum

Here are the first two fasta from Greengenes ref:

>1111883

GCTGGCGGCGTGCCTAACACATGTAAGTCGAACGGGACTGGGGGCAACTCCAGTTCAGTGGCAGACGGGTGCGTAACACGTGAGCAACTTGTCCGACGGCGGGGGATAGCCGGCCCAACGGCCGGGTAATACCGCGTACGCTCGTTTAGGGACATCCCTGAATGAGGAAAGCCGTAAGGCACCGACGGAGAGGCTCGCGGCCTATCAGCTAGTTGGCGGGGTAACGGCCCACCAAGGCGACGACGGGTAGCTGGTCTGAGAGGATGGCCAGCCACATTGGGACTGAGACACGGCCCAGACTCCTACGGGAGGCAGCAGTGGGGAATCTTGCGCAATGGCCGCAAGGCTGACGCAGCGACGCCGCGTGTGGGATGACGGCCTTCGGGTTGTAAACCACTGTCGGGAGGAACGAATACTCGGCTAGTCCGAGGGTGACGGTACCTCCAAAGGAAGCACCGGCTAACTCCGTGCCAGCAGCCGCGGTAATACGGAGGGTGCGAGCGTTGTCCGGAATCACTGGGCGTAAAGGGCGCGTAGGTGGCCCGTTAAGTGGCTGGTGAAATCCCGGGGCTCAACTCCGGGGCTGCCGGTCAGACTGGCGAGCTAGAGCACGGTAGGGGCAGATGGAATTCCCGGTGTAGCGGTGGAATGCGTAGATATCGGGAAGAATACCAGTGGCGAAGGCGTTCTGCTGGGCCGTTGCTGACACTGAGGCGCGACAGCGTGGGGAGCAAACAGGATTAGATACCCTGGTAGTCCACGCCGTAAACGATGGACACTAGACGTCGGGGGGAGCGACCCTCCCGGTGTCGTCGCTAACGCAGTAAGTGTCCCGCCTGGGGAGTACGGCCGCAAGGCTGAAACTCAAAGGAATTGACGGGGGCCCGCACAAGCGGTGGAGCATGTGGTTTAATTCGAAGCAACGCGAAGAACCTTACCTGGGCTTGACATGCTGGTGCAAGCCGGTGGAAACATCGGCCCCTCTTCGGAGCGCCAGCACAGGTGCTGCATGGCTGTCGTCAGCTCGTGTCGTGAGATGTTGGGTTAAGTCCCGCAACGAGCGCAACTCTCGCTCCCAGTTGCCAGCGGTTCGGCCGGGGACTCTGGGGGGACTGCCGGCGTTAAGCCGGAGGAAGGTGGGGACGACGTCAAGTCATCATGGCCCTTACGTCCAGGGCGACACACGTGCTACAATGCCTGGTACAGCGCGTCGCGAACTCGCAAGAGGGAGCCAATCGCCAAAAGCCGGGCTAAGTTCGGATTGTCGTCTGCAACTCGACGGCATGAAGCCGGAATCGCTAGTAATCGCGGATCAGCCACGCCGCGGTGAATACGTTCCCGGGCCTTGTACACACCGCCCGTCACGCCATGGAAGCCGGAGGGACCCGAAACCGGTGGGCCAACCGCAAGGGGGCAGCCGTCTAAGGT

>1111882

AGAGTTTGATCATGGCTCAGGATGAACGCTAGCGGCAGGCCTAACACATGCAAGTCGAGGGGTAGAGGCTTTCGGGCCTTGAGACCGGCGCACGGGTGCGTAACGCGTATGCAATCTGCCTTGTACTAAGGGATAGCCCAGAGAAATTTGGATTAATACCTTATAGTATATAGATGTGGCATCACATTTCTATTAAAGATTTATCGGTACAAGATGAGCATGCGTCCCATTAGCTAGTTGGTATGGTAACGGCATACCAAGGCAATGATGGGTAGGGGTCCTGAGAGGGAGATCCCCCACACTGGTACTGAGACACGGACCAGACTCCTACGGGAGGCAGCAGTGAGGAATATTGGTCAATGGGCGCAAGCCTGAACCAGCCATGCCGCGTGCAGGATGACGGTCCTATGGATTGTAAACTGCTTTTGTACGGGAAGAAACACTCCTACGTGTAGGGGCTTGACGGTACCGTAAGAATAAGGATCGGCTAACTCCGTGCCAGCAGCCGCGGTAATACGGAGGATCCAAGCGTTATCCGGAATCATTGGGTTTAAAGGGTCCGTAGGCGGTTTTATAAGTCAGTGGTGAAATCCGGCAGCTCAACTGTCGAACTGCCATTGATACTGTAGAACTTGAATTACTGTGAAGTAACTAGAATATGTAGTGTAGCGGTGAAATGCTTAGATATTACATGGAATACCAATTGCGAAGGCAGGTTACTAACAGTATATTGACGCTGATGGACGAAAGCGTGGGGAGCGAACAGGATTAGATACCCTGGTAGTCCACGCCGTAAACGATGGATACTAGCTGTTTGGCAGCAATGCTGAGTGGCTAAGCGAAAGTGTTAAGTATCCCACCTGGGGAGTACGAACGCAAGTTTGAAACTCAAAGGAATTGACGGGGGCCCGCACAAGCGGTGGAGCATGTGGTTTAATTCGATGATACGCGAGGAACCTTACCAGGGCTTAAATGTAGAGTGACAGGACTGGAAACAGTTTTTTCTTCGGACACTTTACAAGGTGCTGCATGGTTGTCGTCAGCTCGTGCCGTGAGGTGTCAGGTTAAGTCCTATAACGAGCGCAACCCCTGTTGTTAGTTGCCAGCGAGTAATGTCGGGAACTCTAACAAGACTGCCGGTGCAAACCGTGAGGAAGGTGGGGATGACGTCAAATCATCACGGCCCTTACGTCCTGGGCTACACACGTGCTACAATGGCCGGTACAGAGAGCAGCCACCTCGCGAGGGGGAGCGAATCTATAAAGCCGGTCACAGTTCGGATTGGAGTCTGCAACCCGACTCCATGAAGCTGGAATCGCTAGTAATCGGATATCAGCCATGATCCGGTGAATACGTTCCCGGGCCTTGTACACACCGCCCGTCAAGCCATGGAAGCTGGGGGTACCTGAAGTCGGTGACCGCAAGGAGCTGCCTAGGGTAAAACTGGTAACTGGGGCTAAGTCGTACAAGGTAGCCGTA

And here are the corresponding lines in the taxa file for those OTUs. There is a TAB between the OTU_ID and the taxonomy. Between taxa levels there are only spaces, no TABs.

1111883 k__Bacteria; p__Gemmatimonadetes; c__Gemm-1; o__; f__; g__; s__

1111882 k__Bacteria; p__Bacteroidetes; c__Flavobacteriia; o__Flavobacteriales; f__Flavobacteriaceae; g__Flavobacterium; s__

I gave a quick look at your files and you only used semicolons in the taxa file instead of TABs and there is the Windows character ^M in your fasta file. I am not sure that is a problem for uclust but it is always better using a file without those characters.

Let me know if that makes any difference.

Mauro

Sachia

unread,

Dec 18, 2015, 2:26:33 AM12/18/15

to Qiime 1 Forum

Hi guys,

sorry for the delay, I'm back with some results.

making sure the ID string and taxonomy string was tab separated and making sure there was a space between each taxonomic level, in the tax string seems to have worked (this is all in the taxonomy.txt file). Then i ran:

assign_taxonomy.py -i repset_OTUs.fasta -o assign_RDP -m rdp -t JELLY_TAX.txt -r JELLY_no_align.fas

I have attached the log file and it seems to have processed the data the correct way?

However, I am not sure I am using the right settings, so I a few more questions if you don't mind.

1) The length of the sequences in the reference DB, should i trim them to fit the specific region of the amplicons?

2) Does it matter with the reference DB, should I preferably use the aligned sequences? or is this solely for later using UniFrac?

Good luck with doing this Colin I hope it will run smoothly and thanks for the suggestions Mauro :)

Cheers,

Sachia

repset99_OTUs_tax_assignments.log

Mauro Tutino

unread,

Dec 18, 2015, 5:30:22 AM12/18/15

to Qiime 1 Forum

Hi Sachia,

glad to hear it worked.

I usually use the default settings with RDP.

1) I do not think trimming the ref makes any difference for the assignment taxonomy step. It can be done when aligning the sequences. I did not try it yet but it is something I want to do.

2) RDP does not accept aligned sequences.

Just one question, is your log file called repset99_OTUs_tax_assignment.log because you clustered the OTUs at 99% of similarity?

Mauro

Sachia

unread,

Dec 21, 2015, 7:41:24 AM12/21/15

to Qiime 1 Forum

Hi Mauro,

sorry for dropping out for a while.
Yes we clustered at 99% because its 18S and less than 250 bp length.
I know there is a discrepancy between the commands I write in here and then in the log, I just forgot to write the real name here.

Again, thank you so much for helping out.

Cheers,
Sachia

Reply all

Reply to author

Forward