parallel_assign_taxonomy_rdp.py taking a LONG time, help?


Jessica Hardwicke

May 12, 2016, 2:09:49 PM
to Qiime 1 Forum
I use qiime with a PBS server, which gives me a limit of 336 hours wall time to run a job. My workflow is currently stuck at trying to use RDP to assign taxonomy to my representative set of sequences. The rep set file is about 15 MB, and I'm running 32 parallel jobs - but after 200+ hours it is still running, and I'm wondering if there's anything I'm doing wrong. Is this normal? Is there something I could do to speed it up?

If it helps, my output folder for assign taxonomy currently contains two folders - RDP_VWge1_  and RDP_yH7FF_

Colin Brislawn

May 12, 2016, 6:31:12 PM
to Qiime 1 Forum
Hello Jessica,

Taxonomy assignment can be one of the longest steps. However, with an input file of 15 MB, it should take a few minutes, not a few hours! Something must be wrong here. 

Perhaps your server is reaching a memory limit. Have you considered lowering the max memory with the --rdp_max_memory flag or reducing the total number of running jobs?
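For instance (a sketch only - the RAM total, headroom, and job count below are made-up placeholders to show the arithmetic, not measured values from your server), you could budget the per-job memory cap like this:

```shell
# Divide the node's RAM across fewer parallel RDP jobs, keeping some
# headroom for the OS and QIIME itself. All numbers are placeholders.
total_ram_mb=24000
headroom_mb=4000
n_jobs=8                     # down from 32
per_job_mb=$(( (total_ram_mb - headroom_mb) / n_jobs ))
echo "--rdp_max_memory ${per_job_mb}"    # prints "--rdp_max_memory 2500"
# then rerun with something like:
#   parallel_assign_taxonomy_rdp.py ... -O ${n_jobs} --rdp_max_memory ${per_job_mb}
```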

You could also try using the uclust LCA taxonomy assigner. I have had a very good experience using this script, and I find that 8 threads is reasonably fast on large files without using too much RAM. 
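A hypothetical invocation (script name and flags as in QIIME 1's parallel uclust assigner; the input and output paths are placeholders, and this snippet only prints the command rather than running it):

```shell
# Dry-run sketch: print a uclust-based parallel assignment command
# with 8 jobs. rep_set.fasta and uclust_taxonomy/ are placeholder paths.
n_jobs=8
echo "parallel_assign_taxonomy_uclust.py -i rep_set.fasta -o uclust_taxonomy/ -O ${n_jobs}"
```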

I hope that helps,
Colin 

Jessica Hardwicke

May 16, 2016, 2:47:28 PM
to Qiime 1 Forum
Thanks for the fast response Colin. I tried tweaking the memory settings with RDP as you suggested without any luck (so far). LCA worked quickly though, so I may stick with that. I think my issue may be with the server, and I'm going to keep playing around with RDP to find what I'm doing wrong. 

Colin Brislawn

May 16, 2016, 5:21:01 PM
to Qiime 1 Forum
Good to hear! The uclust LCA assigner is now the default in qiime, and I personally find the underlying method pretty intuitive. I think that's a reasonable choice if RDP is still having issues. 

Colin 

Kevin Zhang

Jul 20, 2016, 3:00:21 PM
to Qiime 1 Forum
I am having a similar problem. My rep set file is 127 MB, and it has been running for 24+ hours; I see some of the output files but not all of them. This was the command that I used:
parallel_assign_taxonomy_rdp.py -i /home/qiime/Desktop/Kevin/work/uclust_rep_set/uclust_rep_seqs.fasta -o RDP_classifier_test/ -t /home/qiime/Desktop/Greengenes/gg_13_8_otus/taxonomy/97_otu_taxonomy.txt -r /home/qiime/Desktop/Greengenes/gg_13_8_otus/rep_set/97_otus.fasta -O 16 --rdp_max_memory 14000 -R &&\

is there a method for selecting how many cores and how much memory to assign it?

Colin Brislawn

Jul 20, 2016, 5:46:17 PM
to Qiime 1 Forum
Hello Kevin,

Thanks for getting in touch. Let's see...
> is there a method for selecting how many cores and how much memory to assign it?

Yes. -O controls the number of cores (16 in your case) and --rdp_max_memory controls memory. I'm not sure whether it's 14,000 MB for the whole program or 14,000 MB per core. If it's MB per core, then you have assigned a huge amount of memory!
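As a quick sanity check (just arithmetic on the -O and --rdp_max_memory values from your command, assuming the per-core reading), the worst case would be:

```shell
# Worst case if --rdp_max_memory is applied per worker rather than
# to the whole run:
n_jobs=16         # the -O value from the command above
per_job_mb=14000  # the --rdp_max_memory value
echo "$(( n_jobs * per_job_mb )) MB"   # prints "224000 MB" -- about 224 GB
```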

When I use large scripts like this, I often have activity monitor up (or some other program to monitor RAM and CPU) to make sure that I don't 'run out' of RAM. Once RAM is fully used, the computer starts using swap, which is much slower and grinds the program to a halt. 
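On a Linux box, a sketch of what I keep running in a second terminal (this assumes the usual procps tools and a /proc/meminfo, which is standard on Linux):

```shell
# Watch overall memory every 5 seconds while the job runs:
#   watch -n 5 free -m
# Or pull the two numbers that matter straight from /proc/meminfo
# (values there are in kB, so convert to MB):
avail_mb=$(awk '/^MemAvailable:/ {print int($2/1024)}' /proc/meminfo)
swapfree_mb=$(awk '/^SwapFree:/ {print int($2/1024)}' /proc/meminfo)
echo "available RAM: ${avail_mb} MB, free swap: ${swapfree_mb} MB"
```

If available RAM sits near zero and free swap starts dropping, the classifier has spilled into swap and will crawl.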

I hope that helps! Let me know what you try next,
Colin

Kevin Zhang

Jul 20, 2016, 8:31:28 PM
to Qiime 1 Forum
Thank you for getting back so quickly. In VirtualBox I have a total of 20 cores and 22 GB of RAM. I figured 14 would be okay; however, when I do use a system monitor, it reaches the max and starts using swap. Is that why it is taking so long? Should I decrease the amount of RAM to 8 GB? I am also not sure if the 14 GB is for the entire process or per core - the manual does not mention it. It would be awesome to know, so that I can continue to work on this issue.

Colin Brislawn

Jul 20, 2016, 9:08:52 PM
to Qiime 1 Forum
Hello Kevin,

Ah, it sounds like you have a pretty large server, so I would definitely use a system monitor to watch RAM usage as this script runs. Once you see what's being maxed out, you can decrease settings so that it runs better. I'm also not sure exactly what --rdp_max_memory does, so watching the RAM yourself and trying different settings is really important.

This process of trial and error lets you see what's working, and focus on just the parts that are not.

Let me know how it works!
Colin

Kevin Zhang

Aug 4, 2016, 2:54:10 PM
to Qiime 1 Forum
After much messing around, I lowered the CPU count to 8 and the RDP memory to 12000, and it seems to have worked. Thank you Colin!

Colin Brislawn

Aug 4, 2016, 4:41:34 PM
to Qiime 1 Forum
Good to hear.
Thanks for troubleshooting this problem with me.
Colin
