open reference OTU picking ground to a halt?


Michael Baron

Feb 9, 2016, 12:08:35 PM
to qiime...@googlegroups.com
Hello everybody!

I've got a MiSeq 16S v4 dataset on soil samples (Greg's protocol used by the EMP). My run was a little overloaded, so it isn't the best data, but after demultiplexing I still got about 13-14 million sequences.

It appears to me that picking the OTUs has ground to a halt, though. I'm using a relatively recent Core i5 (3.6 GHz, 4 threads) with 8 GB of RAM (alas, on Win7) and the virtual machine, but currently the CPU is not breaking a sweat (1-5% load) and my memory isn't even a quarter full.

!time pick_open_reference_otus.py -o otus/ -i slout/seqs.fna -p uc_fast_params.txt -a



So far (several hours of runtime) it has created the step1_otus folder containing two POTU_* folders.

Has anyone got an idea what's going on?

Is the bash time command causing problems? The sequencing was part of an undergraduate teaching practical, so it would be nice to report some processing times to them.

Is it a problem with the lack of absolute paths? I followed the Illumina overview tutorial, which doesn't use absolute paths, and only later discovered the stern reminder when I dug through the documentation.
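
If absolute paths do turn out to matter, I suppose the fix is simply to prefix everything with $PWD, roughly like this (untested on my side so far):

time pick_open_reference_otus.py -o $PWD/otus/ -i $PWD/slout/seqs.fna -p $PWD/uc_fast_params.txt -a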

Thanks a bunch!
Michael

PS. On a related matter, I'm searching for a good way to sub-sample my multiplexed data set, so I have a more manageable data set to work on with my students. I would be grateful for suggestions.


Michael Baron

Feb 9, 2016, 12:19:17 PM
to qiime...@googlegroups.com
Hmmm, seems like the Virtual Box version is failing the full install test:


qiime@qiime-190-virtual-box:~$ print_qiime_config.py -tf

System information
==================
         Platform: linux2
   Python version: 2.7.3 (default, Dec 18 2014, 19:10:20)  [GCC 4.6.3]
Python executable: /usr/bin/python

QIIME default reference information
===================================
For details on what files are used as QIIME's default references, see here:
 https://github.com/biocore/qiime-default-reference/releases/tag/0.1.2

Dependency versions
===================
                QIIME library version: 1.9.1
                 QIIME script version: 1.9.1
      qiime-default-reference version: 0.1.2
                        NumPy version: 1.9.2
                        SciPy version: 0.15.1
                       pandas version: 0.16.1
                   matplotlib version: 1.4.3
                  biom-format version: 2.1.4
                         h5py version: 2.4.0 (HDF5 version: 1.8.4)
                         qcli version: 0.1.1
                         pyqi version: 0.3.2
                   scikit-bio version: 0.2.3
                       PyNAST version: 1.2.2
                      Emperor version: 0.9.51
                      burrito version: 0.9.1
             burrito-fillings version: 0.1.1
                    sortmerna version: SortMeRNA version 2.0, 29/11/2014
                    sumaclust version: SUMACLUST Version 1.0.00
                        swarm version: Swarm 1.2.19 [May 26 2015 13:50:14]
                                gdata: Installed.
RDP Classifier version (if installed): rdp_classifier-2.2.jar
          Java version (if installed): 1.6.0_35

QIIME config values
===================
For definitions of these settings and to learn how to configure QIIME, see here:
 http://qiime.org/install/qiime_config.html
 http://qiime.org/tutorials/parallel_qiime.html

                     blastmat_dir: /qiime_software/blast-2.2.22-release/data
      pick_otus_reference_seqs_fp: /usr/local/lib/python2.7/dist-packages/qiime_default_reference/gg_13_8_otus/rep_set/97_otus.fasta
                         sc_queue: all.q
      topiaryexplorer_project_dir: None
     pynast_template_alignment_fp: /usr/local/lib/python2.7/dist-packages/qiime_default_reference/gg_13_8_otus/rep_set_aligned/85_otus.pynast.fasta
                  cluster_jobs_fp: start_parallel_jobs.py
pynast_template_alignment_blastdb: None
assign_taxonomy_reference_seqs_fp: /usr/local/lib/python2.7/dist-packages/qiime_default_reference/gg_13_8_otus/rep_set/97_otus.fasta
                     torque_queue: friendlyq
                    jobs_to_start: 1
                       slurm_time: None
            denoiser_min_per_core: 50
assign_taxonomy_id_to_taxonomy_fp: /usr/local/lib/python2.7/dist-packages/qiime_default_reference/gg_13_8_otus/taxonomy/97_otu_taxonomy.txt
                         temp_dir: /tmp/
                     slurm_memory: None
                      slurm_queue: None
                      blastall_fp: /qiime_software/blast-2.2.22-release/bin/blastall
                 seconds_to_sleep: 1

QIIME full install test results
===============================
..........................F
======================================================================
FAIL: test_usearch_supported_version (__main__.QIIMEDependencyFull)
usearch is in path and version is supported
----------------------------------------------------------------------
Traceback (most recent call last):
  File "/usr/local/bin/print_qiime_config.py", line 650, in test_usearch_supported_version
    "which components of QIIME you plan to use.")
AssertionError: usearch not found. This may or may not be a problem depending on which components of QIIME you plan to use.

----------------------------------------------------------------------
Ran 27 tests in 0.072s

FAILED (failures=1)

Michael Baron

Feb 9, 2016, 12:34:32 PM
to qiime...@googlegroups.com
I've just tested again with the first 10000 demultiplexed sequences. uclust seems to stress the CPU properly (checked via 'top'), and then the load falls off again.

pick_open_reference_otus.py -o otus2/ -i slout/seqs10k.fna -p uc_fast_params.txt -aO 4

Colin Brislawn

Feb 9, 2016, 1:06:17 PM
to Qiime 1 Forum
Hello Michael,

Thanks for getting in touch with us!

At the start, you mentioned that memory and CPU usage were minimal, but that was when usearch/uclust was messed up. Now that uclust is working, can you check memory again? Memory often 'fills up' and overflows into swap, bringing everything to a crawl. This would explain why CPU usage was up at first and then dropped off after some time.
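
If you would rather not stare at top the whole time, something like the line below refreshes the RAM and swap numbers every couple of seconds (watch and free should both be on the Ubuntu-based VM, though I haven't checked that image specifically):

watch free -m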

You could also try using absolute file paths, like you mentioned. Several pages of the qiime documentation demand this, but I've never had a problem with relative paths. 

As for subsampling, vsearch --fastx_subsample seems like a perfect fit. Check it out:
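
Off the top of my head, subsampling down to, say, 100k reads would look roughly like this; double-check the option names against the vsearch docs, and the file names here are just placeholders:

vsearch --fastx_subsample reads.fastq --sample_size 100000 --fastqout reads_sub.fastq

There is also a --sample_pct option if you would rather subsample by percentage.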

Colin Brislawn

Michael Baron

Feb 9, 2016, 1:17:39 PM
to Qiime 1 Forum
Hi Colin,

Thanks for helping out!

On Tuesday, February 9, 2016 at 6:06:17 PM UTC, Colin Brislawn wrote:
Hello Michael,

Thanks for getting in touch with us!

At the start, you mentioned that memory and CPU usage were minimal, but that was when usearch/uclust was messed up. Now that uclust is working, can you check memory again? Memory often 'fills up' and overflows into swap, bringing everything to a crawl. This would explain why CPU usage was up at first and then dropped off after some time.

I'm not sure I understand what you mean. Or maybe my explanation wasn't really clear. So I'll try again:

My CPU is only stressed at the beginning of the pick_open_reference_otus script (which I'm expecting - I do want it to crunch through the data, after all). Then the load falls off rapidly and the program seems to halt. I got a little confused in 'top', as qiime is also the username in the virtual machine, but I cannot see any QIIME programs stressing the CPU at that point.

But you were right about uclust filling up the memory; it repeatedly crashes my browser when I run pick_open_reference_otus. Could that grind the program to a halt?

Do I make sense, or were you trying to make a different point?
 

You could also try using absolute file paths, like you mentioned. Several pages of the qiime documentation demand this, but I've never had a problem with relative paths. 

As for subsampling, vsearch --fastx_subsample seems like a perfect fit. Check it out:

Great, I'll look into this. Thanks!
 
Colin Brislawn

Colin Brislawn

Feb 9, 2016, 1:24:02 PM
to Qiime 1 Forum
Hello Michael,

But you were right about uclust filling up the memory; it repeatedly crashes my browser when I run pick_open_reference_otus. Could that grind the program to a halt?
Yep, that's what I was talking about. When running top, the important thing to check is that your RAM is not totally full and that very little swap is being used. The lines to watch are Mem and Swap in this output of top from my local cluster:

top - 10:19:32 up 10 days, 22:40, 96 users,  load average: 5.56, 5.13, 4.03
Tasks: 2120 total,   5 running, 2114 sleeping,   1 stopped,   0 zombie
Cpu(s):  2.3%us,  5.1%sy,  0.0%ni, 89.4%id,  3.0%wa,  0.0%hi,  0.3%si,  0.0%st
Mem:  65859452k total, 65580916k used,   278536k free,    53732k buffers
Swap: 65535996k total,      120k used, 65535876k free, 50734196k cached

  PID USER      PR  NI  VIRT  RES  SHR S %CPU %MEM    TIME+  COMMAND           
44756 d3x345    20   0 85604  54m 1364 R 100.0  0.1  10146:08 Xvnc              


If OTU picking runs out of RAM, it will be so slow it might as well have crashed. A good heuristic is that you should have at least as much RAM as the size of your input seqs.fna file. How large is the seqs.fna file you are trying to run on your 8 GB VM?
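
If you are not sure, a quick check from inside the VM is something like this (adjust the path to wherever your demultiplexed file actually lives):

ls -lh slout/seqs.fna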

Colin

Michael Baron

Feb 9, 2016, 1:30:22 PM
to Qiime 1 Forum
Hi Colin,
In this case the match is borderline: my sequence file is about 5 GB, which is the same amount of RAM I allocated to the VM. Though, when I ran the program on only the first 10000 lines of the sequences (1.8 MB), I ran into the same issues: a burst of activity, then suddenly nothing.

I will run my small trial again and keep an eye on the memory.
 

Thanks!
Michael 

Colin Brislawn

Feb 9, 2016, 1:36:30 PM
to Qiime 1 Forum
Sounds like a plan!

I should mention that when you run OTU picking in parallel, a full copy of the reference database is loaded during some steps. Switching to fewer parallel pieces, say two instead of four, could help minimize RAM usage. RAM usage from the multiple copies of the reference database stays high even with a small input file like your 1.8 MB one.
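
For example, keeping the flags from your earlier runs but dropping to two jobs would look something like this (file names assumed to match your test run):

pick_open_reference_otus.py -o otus2/ -i slout/seqs10k.fna -p uc_fast_params.txt -aO 2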

Colin Brislawn

Michael Baron

Feb 9, 2016, 1:48:49 PM
to Qiime 1 Forum
Aha - success! It worked on my small file (single threaded).

Thanks for all your help. I think I will attempt the big file with two threads, and possibly add another GB of RAM to the VM.

Do you have a rough idea about the run time of the program on such large data sets?

Michael

Colin Brislawn

Feb 9, 2016, 2:21:04 PM
to Qiime 1 Forum
I'm not sure about timing. It depends on many different factors, including the size and complexity of the data set and the algorithms you use. You will have to try it and find out.

I run qiime off a single node of a cluster, and I can run a 12GB data set in about half a day. If I had larger data sets or less compute power, I would start playing with the algorithms and settings to find a faster method.
Previously, I have rented supercomputers from Amazon to process data, which worked really well. The c3.8xlarge instances are great.

Colin

Michael Baron

Feb 9, 2016, 3:11:49 PM
to qiime...@googlegroups.com
I believe you were correct about my memory issues. For most of the run it was hovering around 80% usage, eventually rising to 97% (at least from what I could tell in top) with a full swap file. Then it presumably stopped. This time I had an actual error:

~/Desktop/miseq$ pick_open_reference_otus.py -o $PWD/otus/ -i $PWD/slout/seqs.fna -p $PWD/uc_fast_params.txt -f
Traceback (most recent call last):
  File "/usr/local/bin/pick_open_reference_otus.py", line 453, in <module>
    main()
  File "/usr/local/bin/pick_open_reference_otus.py", line 432, in main
    minimum_failure_threshold=minimum_failure_threshold)
  File "/usr/local/lib/python2.7/dist-packages/qiime/workflow/pick_open_reference_otus.py", line 713, in pick_subsampled_open_reference_otus
    close_logger_on_success=False)
  File "/usr/local/lib/python2.7/dist-packages/qiime/workflow/util.py", line 122, in call_commands_serially
    raise WorkflowError(msg)
qiime.workflow.util.WorkflowError:

*** ERROR RAISED DURING STEP: Pick Reference OTUs
Command run was:
 pick_otus.py -i /home/qiime/Desktop/miseq/slout/seqs.fna -o /home/qiime/Desktop/miseq/otus//step1_otus -r /usr/local/lib/python2.7/dist-packages/qiime_default_reference/gg_13_8_otus/rep_set/97_otus.fasta -m uclust_ref --enable_rev_strand_match --suppress_new_clusters
Command returned exit status: 137
Stdout:

Stderr:
Killed

Darn.

Would it make sense to chop the file in half (no idea how) and run the iterative OTU picking program? Failing that, I'll probably have to use closed ref picking for now.

edit: though, musing about closed-reference OTU picking, it uses the same pick_otus.py program, so that probably won't change anything about my memory issues.

Thanks,
Michael

Colin Brislawn

Feb 9, 2016, 4:39:01 PM
to Qiime 1 Forum
Hello Michael,

First, try removing --enable_rev_strand_match, because that will double memory usage.
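
If your uc_fast_params.txt follows the Illumina overview tutorial, it probably contains a line roughly like the one below (I'm guessing at the exact contents of your file); deleting that line should be enough:

pick_otus:enable_rev_strand_match True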

Closed ref can scale better depending on how it's used, so that's another good option to try. This is the only method where you can split your input file, pick OTUs separately, then combine the outputs. (OTU tables produced by open-ref or de novo methods cannot be combined later, because different OTUs will end up with the same names, causing conflicts.)

I'm not a fan of iterative OTU picking methods, but the qiime devs like them. You have probably found this already: 

Colin

Colin Brislawn

Feb 9, 2016, 4:42:57 PM
to Qiime 1 Forum
I want to mention renting supercomputers from Amazon again. Whoever paid to sequence 5GB of data, and is paying you to analyze it, should be willing to spend $1.60 an hour on a machine that can handle it. Or if you are at an institution or lab with some supercomputing power, you could also run qiime from a Linux cluster. Once you complete OTU picking and chimera checking, you should be able to do the downstream analysis on the Virtual Box.

Colin

Michael Baron

Feb 9, 2016, 4:55:31 PM
to Qiime 1 Forum
I'm giving the run another go without the reverse-strand matching. Closed-reference picking on the full file failed with the same 137 error, as predicted.

I will give splitting the file a go, and the iterative process might be useful.

I agree, they should pay (and I would pursue this avenue if I weren't pressed for time - the joy of teaching on a tight schedule - and completely inexperienced in setting up EC2). It is my fault for not foreseeing the memory issues. It might be a good idea to include that in the Illumina overview tutorial somewhere.

Thanks for patiently helping me, your input is very useful.
Michael

Michael Baron

Feb 10, 2016, 4:14:40 AM
to qiime...@googlegroups.com
Just an update in case someone struggles with the same issues. I've managed to finally get the run finished. Here's what I did:

* Instead of using my whole multiplexed dataset, I only used one of the reads to demultiplex (that halved the data)

* After demultiplexing I had
count_seqs.py -i slout/seqs.fna
6703034 : Total

* Using this number, it becomes relatively easy to split the file in half with two Bash commands. Each sequence uses one line for the header and one line for the nucleotides, so the full file is 13406068 lines and half of it is exactly 6703034 lines:
head -6703034 slout/seqs.fna > slout/seq1of2.fna
tail -6703034 slout/seqs.fna > slout/seq2of2.fna

* And finally the iterative approach:
pick_open_reference_otus.py -i $PWD/slout/seq1of2.fna,$PWD/slout/seq2of2.fna -o $PWD/otus_iter/
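
If you want to double-check the split before kicking off the long run, count_seqs.py takes a comma-separated list, I believe:

count_seqs.py -i slout/seq1of2.fna,slout/seq2of2.fna

Each half should report 3351517 sequences.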

Another word of caution: make sure you have enough hard-drive space available, or the VM will pause. My virtual machine has now reached about 50 GB. The problem with dynamically expanding VM hard drives is that they don't dynamically shrink again, so I'll have to zero out the free space and compact it at some point.